The concept of a matched case control study is discussed in Chapter 16 with matched tabulation for 1:1 and 1:n matching. In chapter 23 the focus is on analyzing datasets involving attitudes, such as those encountered in the social sciences.

C hapter 1: Starting to use R

## Installation

English so that the responses on your computer will be the same as those in this book. To use this book efficiently, a specialised text editor such as Crimson Editor or Tinn-R must be installed on your computer.

Text Editors Crimson Editor

Starting R Program

Any previous commands that have been typed at the R console will be saved into a file called '.Rhistory' while the current workspace will be saved into a file called ".Rdata". In the next session of computing, when R is started in this folder, the image of the working environment of the last saved R session will be retrieved automatically, together with the command history.

## R libraries & packages

After typing the above command, you may then install the new version of the package as mentioned in the previous section. By including the command library(epicalc) in the RProfile.site file, every time R is run, the Epicalc package will be automatically loaded and ready for use.

On-line help

Having the Epicalc package in the search path means we can use all commands or functions in that package. This is to make sure that the active dataset will be in the second search position.

Using R

For Epicalc users, it is recommended that any additional library should be called early in the session of R, i.e. In the above simple calculation, the results are immediately shown on the screen and are not stored.

Exercises

C hapter 2: Vectors

## Concatenation

The function can be executed with at least two parameters, 'from' and 'to', since the 'by' parameter has a default value of 1 (or -1 if 'to' is less than 'from'). The order of the arguments 'from', 'to' and 'by' is assumed if the words are omitted.

Subsetting a vector with an index vector

Note that the minimum and maximum of the arguments in cut are the outer most boundaries. The unclassed value of a factor is used when the numeric (or integer) values of the factor are required.

Missing values

For example, if we are have a dataset containing a 'sex' variable, classed as a factor, and we want to draw a scatter plot in which the colours of the dots are to be classified by the different levels of 'sex', the colour argument to the plot function would be 'col = unclass(sex)'. Identify the persons who have the lowest and highest BMI and calculate the standard deviation of the BMI.

C hapter 3: Arrays, Matrices and Tables

## Arrays

Individual elements of an array may be referenced by giving the name of the array followed by two subscripts separated by commas inside the square brackets. Note that when a character vector is binded with a numeric vector, the numeric vector is coerced into a character vector, since all elements of an array must be of the same type.

Matrices

Similarly, if we characterize the ages of the patients as being either young or old and the first three patients are young, the next two are old and the last one is young, and the codes for this age classification are 1 for young and 2 for old, then we can create this in R by typing. Note that table1 gives counts of each combination of the vectors sex and age while 'table2' (below) gives the sum of the number of visits based on the four different combinations of sex and age.

Lists

Removing objects from the computer memory also requires a list as the argument to the function rm. A list may also be returned from the results of an analysis, but appears under a special class.

## C hapter 4: Data Frames

Note that the characters are enclosed in quotes and the delimiters (variable separators) are commas. Note firstly that the object 'a' has class data frame and secondly that the names of the variables within the data frame 'a' must be referenced using the dollar sign notation.

Data entry and analysis

The width of each variable and the column names must be specified by the user.

Datasets included in Epicalc

## Reading in data

Note that the commands to select a subset do not have any permanent effect on the data frame. Assigning a NULL value to a variable in the data frame is equivalent to removing that variable.

Attaching the data frame to the search path

The data frame Familydata in the search path occupies the same amount of memory as the one in the current workspace. Thirdly, do not define a new object (say vector or matrix) that may have the same name as the data frame in the search path.

The 'use' command in Epicalc

A number of other commands from the Epicalc package work based on this strategy of making .data the default data frame and exclusively attached to the search path (all other data frames will be detached, unless the argument 'clear=FALSE' is specified in the use function). Then detach from the old .data and re-attach to the most updated one in the search path.

Exercises________________________________________________

C hapter 5: Simple Data Exploration

## Data exploration using Epicalc

The output can be used to write a table of baseline data of the manuscript coming out from the data frame. A dot chart has one axis (in this case the X-axis) representing the range of the variable.

Exercise_________________________________________________

C hapter 6: Date and Time

Computation functions related to date

R has the 'locale' or working location set by the operating system, which varies from country to country. Epicalc displays the results of the summ function in ISO format to avoid country biases.

## Reading in a date variable

When the year value is omitted, R automatically adds the current year of the system in the computer.

Dealing with time variables

If the participant went to bed between 12pm (midday) and 12am (midnight), then the day should be December 13th, otherwise the day should be the 14th, the day of the workshop. The object 'woke.up.time' looks normal, although one or two participants woke up quite early in the morning.

C hapter 7: An Outbreak Investigation

Describing Time

## Case definition

The day of exposure was 25th of August for all records (ignoring the 39 missing values). For simplicitiy we will make sure that the 'onset' variable is exclusively used for cases only.

Paired plot

The point characters and colours of the legend are specified in accordance with those inside the graph. The colours of the points and the lines are corresponding to that in the graph.

C hapter 8: An Outbreak Investigation

Risk Assessment

## Recoding missing values

We will use the distribution of these proportions to guide our grouping of eclair consumption. The attack rate or percentage of diseased in each category of exposure, as shown in the bracket of the column TRUE, increases from 5.1% among those who did not eat any eclairs to 70.1% among those heavy eaters of eclair.

Exploration of age and sex

Finally, both the table and age group can be saved as R objects for future use. We have spent some time learning these features of Epicalc for data exploration (a topic of the next chapter).

Comparison of risk: Risk ratio and attributable risk

The risk difference, on the other hand, measures direct health burden and the need for health services. From the protective efficacy value, the exposure to the prevention program would have reduced the risk of the eclair eater (unexposed under this hypothetical condition) by 90.9%.

## Dose-response relationship

It is equal to 1 - RR-1, and is therefore just another way to express the risk ratio. We now explore the relationship between the risk of getting the disease and the number of eclairs consumed.

C hapter 9: Odds Ratios, Confounding and Interaction

Odds and odds ratio

For a cohort study we may compute the ratios of the odds of being a case among the exposed vs the odds among the non-exposed. This is the same value as the ratio of the odds of being exposed among cases and among non-cases.

## Confounding and its mechanism

The value of the odds ratio is not as high but is of statistical significance. The centre of the left-hand side therefore tends to lie closer to the lower box.

Interaction and effect modification

The effect of beef curry among those not eating eclairs tends to be protective but without statistical significance. The stratification factor eclair has modified the effect of beef curry from a non-significant protective factor to a significant risk factor.

C hapter 10: Basic Data Management

Data cleaning

## Identifying duplication ID

In this dataset, the value '9' represents a missing code for religion (3rd variable), patient education (4th variable), income group (5th variable) and reason for family planning (7th variable). There are three essential arguments to the replace function; the target vector, the index vector and the value.

Recoding values using Epicalc

All the maxima have been corrected but the minima of 0 are also missing values for the last four variables plus 'ped'. We can use recode to turn all the zeros into missing values in one step.

Labelling variables with 'label.var'

It is advised to keep each label short since it will be frequently used in the process of automatic graphical display and tabulation. Each label needs to be enclosed in double quotes if it contains a space, otherwise it is optional.

## Adding a variable to a data frame

A sorted table and bar chart are easier to read and viewed when there is no order of category.

Collapsing categories

Finally, the best way to update the data frame with new or modified variable(s) is to use label.var. This command not only labels the variable for further use but also updates and incorporates the data frame with the variable outside.

C hapter 11: Scatter Plots & Linear Regression

The records have been sorted in ascending order of 'worm' (number of worms) ranging from 32 in the first subject to 1,929 in the last one. The 13th record has the highest blood loss of 86 ml per day, which is very high.

## Scatter plots

For a small sample size, putting the identification of each dot can improve the information conveyed in the graph. In order to draw a regression line, a linear model using the above two variables should be fit to the data.

Components of a linear model

The function pf is used to compute a P value from a given F value together with the two values of the degrees of freedom. The last argument 'lower.tail' is set to FALSE to obtain the right margin of the area under the curve of the F distribution.

Regression line, fitted values and residuals

The sum of residuals is close to zero whereas the sum of their squares is the value previously displayed in the summary of the model.

## Checking normality of residuals

Finally, the residuals are plotted against the fitted values to see if there is a pattern. With this and the above findings from the 'qqnorm' command we may conclude that the residuals are randomly and normally distributed.

C hapter 12: Stratified linear regression

For the intercept of the salt users, the second term and the fourth are all zero (since age is zero) but the third should be kept as such. For the slope of the non-salt users, the second coefficient alone is enough since the first and the third are not involved with each unit of increment of age and the fourth term has 'saltadd' being 0.

C hapter 13: Curvilinear Relationship

This is confirmed by the poor fit of the regression line in the previous graph. However, the values of the high residuals are in the middle of the range of the fitted values, indicating that perhaps we need to include a quadratic term of age in the model.

## Stratified curvilinear model

To illustrate the change of log(money) by age, a series of box plots are drawn with the statistical parameters stored in a new object 'a'. Then lines are drawn to join the median of log(money) of the age groups, which are in the third row of 'a'.

Modelling with a categorical independent variable

Note that he coefficient of 'Child' is the negative of that of 'Adult' from model 'lm5'.

C hapter 14: Generalized Linear Models

Model attributes

In this setting, the 'deviance' from the glm is equal to the sum of squares of the residuals. Similarly, the 'null.deviance' is equal to the total sum of squares of the difference of individual amount of blood loss from the mean blood loss.

## Attributes of model summary

The two sets of attributes are similar with more sub-elements for the 'bloodloss.glm. Some of the attributes in of the 'glm' are rarely used but some, such as 'aic', are very helpful.

Covariance matrix

The interpretation of the error is the same as from the linear model; a larger deviance indicates a poorer fit. The value of the maximum likelihood is small because it is the product of probabilities.

C hapter 15: Logistic Regression

## Distribution of binary outcome

There is an odds of 0.077 or a probability of 7.2% of having at least one decayed tooth if the number of CFU of the mutan strep is at 1 CFU. Another vector of the same name 'log10.strep' is created in the form of a data frame for plotting a fitted line on the same graph.

Logistic regression with a binary independent variable

The above part of the display is actually a matrix from the object 'coef(summary(glm0))'. The Mantel- Haenszel method only gives the odds ratio of the variable of main interest.

Interaction

This variable is not a confounder to either of the preceding variables because it has not substantially changed the odds ratios of any of them (from 'glm4'). The odds ratios and their confidence intervals must still be calculated regardless of the statistical significance.

## Interpreting the odds ratio

The odds ratio of 'eclair.eat' depends on the value of 'beefcurry' and vice versa. Similar to the preceding section, logistic.display can be used in order to obtain the odds ratio and 95% confidence interval of the exposure to induced abortion.

C hapter 16: Matched Case Control Study

In the remaining five pairs (right upper corner), the controls smoked while the cases did not. The level of contrast of history of smoking between the two based on matched pairs is called a conditional odds ratio.

1:n matching

The command gives six tables based on the matched sets of the same size (cases per controls). However, the effect of smoking on the outcome is still not statistically significant as the 95% confidence interval of the odds ratio contains the value 1.

## Logistic regression for 1:1 matching

In the above glm model, the difference of the outcome (which is always 1 for the above reason) is predicted by the difference in smoking habit. In conditional logistic regression, there is no such intercept because the difference of the outcome is fixed to 1, the logit of which is 0.

Conditional logistic regression

The conditional logistic regression model gives neither the log likelihood nor AIC value but it does give the conditional log likelihood, which also indicates the level of fit. The first sub-element, which is the conditional likelihood of the null model, is the same for all the conditional logistic regression models.

C hapter 17: Polytomous Logistic Regression

Two-way tabulation reveals the highest proportion (74.7%) of ever IA in the EP group compared to 54.4% and 34.4% in the IA and Deli groups, respectively. The distribution of gravidity among the EP and IA groups are more or less the same, i.e.

Polytomous logistic regression using R