James P Howard

Cardiology trainee in London

I'm an NIHR Academic Clinical Fellow in cardiology at Imperial College London.

Some of my key publications are listed below; others are visible on my Google Scholar profile and CV.

R - a quick guide


This is a quick guide to performing common statistical tests and calculations in R.

Basic stats

A series e.g. series <- c(0,1,2,3)
SD of a series - sd(series)
Mean - mean(series)
Converting an odds ratio to a probability - OR / (1 + OR)
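E.g. for a made-up odds ratio of 3:
OR <- 3
OR / (1 + OR)  # probability = 0.75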

Probability

The choose function counts combinations - "In how many ways can you choose 2 cards from a 52 card deck?"
choose(52,2)
The chance of drawing one specific pair is therefore 1/choose(52,2)

Binomials - coin flip experiments

Variance of a binomial
n*p*(1-p) ; number of trials * probability * (1 - probability)
SD of a binomial
sqrt(n*p*(1-p))
"What are the chances of getting exactly 1 head from 5 coin tosses?"
dbinom(1,5,0.5)
"What are the chances in a couple having at least 1 boy if they have 5 children?"
1-pbinom(0,5,0.5)
Or
pbinom(5,5,0.5)-pbinom(0,5,0.5)
"What is the standard deviation for 500 coin flips?"
sqrt(500*0.5*(1-0.5))
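As a quick check, a simulation sketch (using the same made-up experiment, n = 500 and p = 0.5) should reproduce these formulas:
flips <- rbinom(10000, size = 500, prob = 0.5)  # 10,000 experiments of 500 flips each
mean(flips)  # close to n*p = 250
sd(flips)    # close to sqrt(500*0.5*0.5), i.e. ~11.2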

Normal distributions/Z scores

Proportion of values with a Z score below -1 (i.e. more than 1 SD below the mean)
pnorm(-1)
Approximating binomials to normals, e.g. N=100 and p=.2; calculate the probability of getting 10 successes or fewer
1. Calculate mean (n * p) = 20
2. Calculate SD = sqrt(100 * 0.2 * (1 - 0.2)) = 4
3. Calculate Z score = (result - mean) / SD = (10 - 20) / 4 = -2.5
4. pnorm(Z-score)
...Or you could just do pbinom(10,100,0.2) of course and not approximate the binomial to a normal...
E.g. What is the p value for a volunteer flipping 100 coins and getting 65 (or more) heads?
(1-pbinom(64,100,0.5))
And a two-sided test:
(1-pbinom(64,100,0.5))*2
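A minimal sketch of the approximation steps above, using the N=100, p=0.2 example:
n <- 100; p <- 0.2
m <- n * p                  # mean = 20
s <- sqrt(n * p * (1 - p))  # SD = 4
pnorm((10 - m) / s)         # normal approximation, ~0.006
pbinom(10, n, p)            # exact binomial, for comparison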



Statistical tests - continuous data

The tests below are grouped by whether the data are parametric* or non-parametric, and whether the observations are independent or correlated (paired/repeated).

T-test**

(For 2 groups)

Two sided:
t.test(series1, series2)

One sided:
t.test(series1,series2, alternative="<greater or less>")

One sample:
t.test(series1,mu=<expected mean>)

Unequal variances (Welch's test - the R default):
t.test(series1, series2, var.equal=FALSE)

Or, if given the difference, SD and n, calculate the T score:
Tscore = Diff / sqrt( (pooledSD^2 / n1) + (pooledSD^2 / n2) )
or
Tscore = Diff / standarderror
then calculate the p value (multiply by 2 for a two-sided test):
pt(-abs(Tscore),df=<df>)
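A worked sketch with made-up summary statistics (difference 2, pooled SD 5, two groups of 20):
diff <- 2; pooled.sd <- 5; n1 <- 20; n2 <- 20
tscore <- diff / sqrt(pooled.sd^2/n1 + pooled.sd^2/n2)
2 * pt(-abs(tscore), df = n1 + n2 - 2)  # two-sided p value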

ANOVA**

(2+ groups; "Is the variance between groups greater than the variance within them?")

y = c(ser1, ser2, ser3)
n = rep(<n in ser>, <n sers e.g. 3>)
group = rep(1:<e.g. 3>, n)
data = data.frame (y = y, group = factor(group))
fit = lm(y ~ group, data)
anova(fit)

For a 95% CI for the within-group (residual) variance:
anova(fit)["Residuals", "Sum Sq"]/qchisq(c(0.025, 0.975), <df>,lower.tail = FALSE)
(where <df> is the Residuals degrees of freedom from the ANOVA table above)
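A self-contained sketch with simulated data (three made-up groups of 10):
set.seed(1)
y <- c(rnorm(10, 5), rnorm(10, 6), rnorm(10, 7))  # three groups with different means
data <- data.frame(y = y, group = factor(rep(1:3, each = 10)))
fit <- lm(y ~ group, data = data)
anova(fit)  # F test: between- vs within-group variance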

Pearson's correlation coefficient

(1 continuous linear predictor)
cor.test(series1,series2,method="pearson")
(N.B. cor.test() drops incomplete pairs itself; use="complete.obs" is an argument to cor(), not cor.test())

Linear regression

Linear regression with one or more predictors:
Create a data frame:
dataframe = data.frame(outcome = series1, predictor1 = series2, predictor2 = series3, categoricalpredictor = series4)
df.mod1 = lm(outcome ~ predictor1 + predictor2 + factor(categoricalpredictor), data = dataframe)
summary(df.mod1)

Paired T-test**

(2 groups or time points)

Two sided:
t.test(series1, series2, paired=TRUE)

Repeated-measures ANOVA**

2+ groups or time points
e.g. 2 groups are followed up over multiple time points, not just simple before and after (T-test)
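A minimal sketch using base R's aov() with an Error() term, assuming long-format data with made-up columns subject, time and y:
set.seed(1)
longdata <- data.frame(subject = factor(rep(1:10, each = 3)),  # 10 subjects
                       time    = factor(rep(1:3, times = 10)), # 3 time points each
                       y       = rnorm(30))
summary(aov(y ~ time + Error(subject/time), data = longdata))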

Mixed models/GEE modelling

Multivariable (MV) regression

Wilcoxon rank-sum AKA Mann-Whitney U

(Alternative to T-test; small samples)
wilcox.test(series1,series2)

Wilcoxon signed-rank

(Alternative to paired T-test; small samples)
wilcox.test(series1,series2,paired=TRUE)

Kruskal-Wallis

(Alternative to ANOVA)
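A minimal sketch, reusing the data frame from the ANOVA example above:
kruskal.test(y ~ group, data = data)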

Spearman rank correlation coefficient

(Alternative to Pearson's)
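The call is the same as for Pearson's, with a different method argument:
cor.test(series1,series2,method="spearman")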

Statistical tests - categorical data

The tests below are grouped by whether the observations are independent or correlated, with exact alternatives for sparse tables.

Risk difference/relative risks

For a 2x2 table (valid where expected cell counts n*p > 5; otherwise the normal approximation fails - use the exact tests below)

mymatrix <- matrix(c(25,20,20,25),ncol=2,byrow=TRUE)
mytable <- as.table(mymatrix)
summary(mytable)
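summary() gives a chi-square test; the risk difference and relative risk themselves can be computed directly from the cell counts (a sketch assuming rows = exposure groups, first column = events):
r1 <- mymatrix[1,1] / sum(mymatrix[1,])  # risk in group 1
r2 <- mymatrix[2,1] / sum(mymatrix[2,])  # risk in group 2
r1 - r2  # risk difference
r1 / r2  # relative risk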

Chi-square

RxC table
Same as the 2x2 table above

Logistic regression

See instructions for assembling the data frame for 'Linear regression'
Then:
df.logit = glm(outcome ~ predictor1 + factor(predictor2), data = dataframe, family = "binomial")
summary(df.logit)

N.B. Estimates for predictors are given as logits (natural log of the OR; ln(p/(1-p))). These can be converted to ORs using exp(df.logit$coefficients). If e.g. 0.96, this means for every 1-unit increase in your predictor variable, the odds decrease by 4%. For confidence intervals of the coefficients:
confint(df.logit)
And for confidence intervals of the ORs:
exp(cbind(OR = coef(df.logit), confint(df.logit)))
If a categorical variable (factor; e.g. predictor2) is used, each level will be split as a 'term' seen in the summary(). An overall effect of the categorical predictor can be assessed using the wald.test() function of the aod library:
wald.test(b = coef(df.logit), Sigma = vcov(df.logit), Terms = 3:5)
...Where 3:5 refers to terms 3 to 5 (counting from 1; in this example the first level of predictor2 would be term 3)
If you want to look for an interaction (i.e. does predictor1 affect the outcome significantly differently in males vs females):
df.logit = glm(outcome ~ predictor1 * factor(male), data = dataframe, family = "binomial")
(You can test all combinations of interactions between a, b, and c with a * b * c, or only a subset using a + b + c + a:b, which would not test a:c or a:b:c)
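A self-contained sketch with simulated data (made-up names and effect sizes throughout):
set.seed(1)
dataframe <- data.frame(predictor1 = rnorm(200), male = rbinom(200, 1, 0.5))
dataframe$outcome <- rbinom(200, 1, plogis(-0.5 + 0.8 * dataframe$predictor1))
df.logit <- glm(outcome ~ predictor1 + factor(male), data = dataframe, family = "binomial")
exp(cbind(OR = coef(df.logit), confint(df.logit)))  # ORs with 95% CIs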

McNemar's Chi-Square test

For a 2x2 table with correlated data
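Base R provides mcnemar.test(); a minimal sketch with made-up paired counts:
paired.table <- matrix(c(25, 10, 5, 25), ncol = 2, byrow = TRUE)
mcnemar.test(paired.table)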

Conditional logistic regression

Like logistic regression, but for correlated data
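One option is clogit() from the survival package; a sketch assuming a made-up matched-set identifier 'set' in the data frame:
library(survival)
clogit(outcome ~ predictor1 + strata(set), data = dataframe)  # each matched pair/set shares a 'set' value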

Fisher's exact test

Alternative to Chi-square for sparse, non-correlated data (expected count in any cell < 5)
Perform as per 2x2 table, but replace summary() with:
fisher.test(mytable)

McNemar's exact test

Alternative to McNemar's test for sparse correlated data
Perform as per 2x2 table, but replace summary() with:
library(exact2x2)
mcnemar.exact(mytable)

Statistical tests - Time-to-event

The methods below are grouped by whether the observations are independent or correlated, with alternatives for when the proportional hazards assumption fails.

Rate ratio

(Two groups)
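The rate ratio is simple arithmetic - events divided by person-time in each group, then the ratio (made-up numbers):
rate1 <- 30 / 1000  # 30 events over 1000 person-years in group 1
rate2 <- 15 / 1200  # 15 events over 1200 person-years in group 2
rate1 / rate2       # rate ratio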

Kaplan-Meier & Log rank

(Two+ groups)
library(survival)
Load data e.g.:
hmohiv<-read.table("http://www.ats.ucla.edu/stat/r/examples/asa/hmohiv.csv", sep=",", header = TRUE)
Assuming right censored data:
surv <- survfit(Surv(time, censor) ~ drug, data = hmohiv, conf.type="none")
plot(surv, xlab="Time", ylab="Survival Probability", col=c("Red","Black"))
legend(40, 1.0, c("Drug - No", "Drug - Yes"), col=c("Red","Black"), lty=1)
Calculate log rank:
survdiff(Surv(time, censor) ~ drug, data = hmohiv)

Cox regression

(Multivariate regression technique producing hazard ratios; two+ groups)
library(survival)
Load data e.g. and preview the top rows:
data(lung)
head(lung)
Add a survival object (status == 2 is death)
lung$SurvObj <- with(lung, Surv(time, status == 2))
head(lung)
Calculate KM estimates for whole set and then by sex:
km.as.one <- survfit(SurvObj ~ 1, data = lung, conf.type = "log-log")
km.by.sex <- survfit(SurvObj ~ sex, data = lung, conf.type = "log-log")
Fit Cox regression, adjusting for age, sex, Karnofsky performance score, wt loss and print data:
res.cox1 <- coxph(SurvObj ~ age + sex + ph.karno + wt.loss, data = lung)
summary(res.cox1)

Check for violation of proportional hazards (i.e. a constant HR over time) and plot the scaled Schoenfeld residuals, with a smooth curve, in a 2x2 grid:
res.zph1 <- cox.zph(res.cox1)
res.zph1
- Look for significant p values
par(mfrow=c(2,2))
plot(res.zph1)
- Inspect the graphs for obvious trends over time - the formal test above sometimes fails to detect them

If the above imply non-proportionality, one can either stratify on a categorical variable (e.g. ph.karno in this example):
res.cox1.strata <- coxph(SurvObj ~ age + sex + strata(ph.karno) + wt.loss, data = lung)
or one can use a time-varying effects model - see below

Frailty model

(Multivariable regression for correlated survival data)
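A minimal sketch using the frailty() term in coxph(), with the lung data's institution code (inst) as the cluster:
library(survival)
data(lung)
coxph(Surv(time, status == 2) ~ age + sex + frailty(inst), data = lung)  # random effect per institution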

Time-varying effects

(Akin to Cox regression, but for violation of the proportional hazards assumption)
library(survival)
data(lung)
head(lung)
lung$event <- (lung$status == 2)
- sets a column to TRUE or FALSE for dead (status==2) or not

Create a survival data set with cuts at 200 and 400, based on the appearance of the Schoenfeld residuals (see Cox regression):
lung.split <- survSplit(data = lung,
cut = c(200,400), # vector of timepoints to cut at
end = "time", # character string with name of event time variable
event = "event", # character string with name of censoring indicator
start = "start", # character string with name of start time variable (created)
id = "id", # character string with name of new id variable to create
zero = 0, # If start doesn't already exist, used as start
episode = NULL # character string with name of new episode variable (optional)
)
lung.split$id <- as.numeric(lung.split$id) # make ID numeric
library(doBy)
lung.split <- orderBy( ~ id + start, lung.split)


Create a survival object column:
lung.split$SurvObj <- with(lung.split, Surv(time = (start), time2 = time, event = event))

Preview your data:
head(lung.split, 10)

Perform the analysis whilst testing for interaction between the violating variable and the cut (start) times, and print the results:
res.cox1.strata <- coxph(SurvObj ~ age + sex + ph.karno + ph.karno:factor(start) + wt.loss + cluster(id), data = lung.split)
summary(res.cox1.strata)



* Normally distributed outcomes - important for small samples. Large samples are robust to non-normality because of the central limit theorem (means are normally distributed, even if the underlying trait is not).

** Homogeneity of variances - though with the T-test you can unpool the variances. Test this using Fisher's F-test: var.test(a,b), where a and b are vectors of values. If p > 0.05, assume the variances are homogeneous. Alternatively, compare the value of F for alpha = 0.05 with qf(0.95, <degrees of freedom numerator>, <degrees of freedom denominator>), where e.g. both DOF = 9 when comparing 2 groups of 10 each.