# James P Howard

## Cardiology trainee in London

I'm an NIHR Academic clinical fellow in cardiology at Imperial College London.

Some of my key publications are listed below; others are visible on my Google Scholar profile and CV.

## R - a quick guide

This is a quick guide to performing common statistical tests and calculations in R.

## Basic stats - R

### Basic stats

A series e.g. `series <- c(0,1,2,3)`
SD of a series - `sd(series)`
Mean - `mean(series)`
Converting an odds ratio to a probability - `OR / (1 + OR)`
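As a minimal sketch of the conversion above (the helper name `or_to_prob` is mine, not a standard R function):

```r
# Convert odds (or an odds ratio, as above) to a probability: p = OR / (1 + OR)
or_to_prob <- function(or) or / (1 + or)
or_to_prob(1)  # even odds -> 0.5
or_to_prob(3)  # odds of 3:1 -> 0.75
```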

### Probability

The choose function - "In how many ways can 2 cards be drawn from a 52-card deck?" (The chance of one particular pair is 1 divided by this.)
`choose(52,2)`

### Binomials - coin flip experiments

Variance of a binomial:
`n * p * (1 - p)` (number of trials * probability * (1 - probability))
SD of a binomial:
`sqrt(n * p * (1 - p))`
"What are the chances of getting exactly 1 head from 5 coin tosses?"
`dbinom(1,5,0.5)`
"What are the chances of a couple having at least 1 boy if they have 5 children?"
`1-pbinom(0,5,0.5)`
Or
`pbinom(5,5,0.5)-pbinom(0,5,0.5)`
"What is the standard deviation for 500 coin flips?"
`sqrt(500*0.5*(1-0.5))`

### Normal distributions/Z scores

Proportion of values with a Z score of -1 (or 1 SD below mean)
`pnorm(-1)`
Approximating a binomial with a normal, e.g. n = 100 and p = 0.2; calculate the probability of getting 10 successes or fewer:
1. Calculate the mean: `n * p` = 20
2. Calculate the SD: `sqrt(100 * 0.2 * (1 - 0.2))` = 4
3. Calculate the Z score: `(result - mean) / SD` = -10/4 = -2.5
4. `pnorm(-2.5)`
...Or you could just do `pbinom(10,100,0.2)` of course and not approximate the binomial to a normal...
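The four steps above, worked in R, with the exact binomial alongside for comparison:

```r
n <- 100; p <- 0.2
mu <- n * p                     # 1. mean = 20
sigma <- sqrt(n * p * (1 - p))  # 2. SD = 4
z <- (10 - mu) / sigma          # 3. Z score = -2.5
pnorm(z)                        # 4. approximate P(X <= 10), ~0.0062
pbinom(10, n, p)                # exact answer for comparison
```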
E.g. what is the p value for a volunteer flipping 100 coins and getting 65 (or more) heads?
`(1-pbinom(64,100,0.5))`
And a two-sided test:
`(1-pbinom(64,100,0.5))*2`

## Statistical tests - continuous data

### T-test**

(For 2 groups)

Two sided:
`t.test(series1, series2)`

One sided:
`t.test(series1, series2, alternative = "greater")` (or `alternative = "less"`)

One sample:
`t.test(series1,mu=<expected mean>)`

Unequal variances (Welch's test; this is R's default):
`t.test(series1, series2, var.equal = FALSE)`

Or, if given the difference, SD and n, calculate the T score:
`Tscore = Diff / sqrt((pooledSD^2 / n1) + (pooledSD^2 / n2))`
or
`Tscore = Diff / standarderror`
then calculate the p value:
`pt(-abs(Tscore), df=<df>)` (multiply by 2 for a two-sided test)
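A worked sketch with made-up summary statistics (difference 2.0, pooled SD 2.5, two groups of 10; the numbers are invented for illustration):

```r
# Hypothetical summary statistics, not from real data
diff <- 2.0; pooled_sd <- 2.5; n1 <- 10; n2 <- 10
se <- sqrt(pooled_sd^2 / n1 + pooled_sd^2 / n2)  # standard error of the difference
t_score <- diff / se
df <- n1 + n2 - 2
2 * pt(-abs(t_score), df = df)  # two-sided p value
```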

### ANOVA**

(2+ groups; "Is the variance between groups greater than the variance within them?")

```
y = c(ser1, ser2, ser3)
n = rep(<n in ser>, <n sers e.g. 3>)
group = rep(1:<e.g. 3>, n)
data = data.frame(y = y, group = factor(group))
fit = lm(y ~ group, data)
anova(fit)
```
For 95% CIs of the residual variance:
```
anova(fit)["Residuals", "Sum Sq"] / qchisq(c(0.025, 0.975), <df>, lower.tail = FALSE)
```
(Where <df> is the Residuals degrees of freedom from the first test above)

### Pearson's correlation coefficient

(1 continuous linear predictor)
`cor.test(series1,series2,method="pearson",use="complete.obs")`

### Linear regression

Simple linear regression:
Create a data frame, fit the model and print the results:
```
dataframe = data.frame(outcome = series1, predictor1 = series2,
                       predictor2 = series3, categoricalpredictor = series4)
df.mod1 = lm(outcome ~ predictor1 + predictor2 + factor(categoricalpredictor),
             data = dataframe)
summary(df.mod1)
```

### Paired T-test**

(2 groups or time points)
Two sided:
`t.test(series1, series2, paired=TRUE)`

### Repeated-measures ANOVA**

(2+ groups or time points)
E.g. 2 groups are followed up over multiple time points, not just a simple before-and-after (paired T-test).
Can also be approached with multivariable regression.
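A minimal one-way repeated-measures sketch using base R's `aov()` with an `Error()` term (the data here are invented):

```r
# 10 subjects each measured at 3 time points (hypothetical data)
set.seed(1)
d <- data.frame(subject = factor(rep(1:10, each = 3)),
                time    = factor(rep(1:3, times = 10)),
                y       = rnorm(30))
fit <- aov(y ~ time + Error(subject/time), data = d)
summary(fit)  # F test for the within-subject effect of time
```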

### Wilcoxon rank-sum AKA Mann-Whitney U

(Alternative to T-test; small samples)
`wilcox.test(series1,series2)`

### Wilcoxon sign-rank

(Alternative to paired T-test; small samples)
`wilcox.test(series1,series2,paired=TRUE)`

### Kruskal-Wallis

(Alternative to ANOVA)
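A minimal sketch, with made-up series:

```r
# kruskal.test() accepts a list of groups (or a formula)
ser1 <- c(1, 3, 5, 7); ser2 <- c(2, 4, 6, 8); ser3 <- c(10, 12, 14, 16)
kruskal.test(list(ser1, ser2, ser3))  # rank-based test across the 3 groups
```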

### Spearman rank correlation coefficient

(Alternative to Pearson's)
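Same call as for Pearson's, but with `method = "spearman"` (the series here are made up):

```r
series1 <- c(1, 2, 3, 4, 5)
series2 <- c(2, 1, 4, 3, 5)
cor.test(series1, series2, method = "spearman")  # rho = 0.8 for these series
```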

## Statistical tests - categorical data

### Risk difference/relative risks

For a 2x2 table (where n*p > 5 in each cell; otherwise the normal approximation doesn't hold, so use a non-parametric/exact test)

```
mymatrix <- matrix(c(25,20,20,25), ncol=2, byrow=TRUE)
mytable <- as.table(mymatrix)
summary(mytable)
```

### Chi-square

RxC table
Same as the 2x2 table above
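`chisq.test()` can also be called directly on the table (shown here on the 2x2 matrix from above; the same call works for any RxC table):

```r
mymatrix <- matrix(c(25, 20, 20, 25), ncol = 2, byrow = TRUE)
chisq.test(as.table(mymatrix))  # applies Yates' continuity correction for 2x2 tables
```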

### Logistic regression

See instructions for assembling the data frame for 'Linear regression'
Then:
```
df.logit = glm(outcome ~ predictor1 + factor(predictor2),
               data = dataframe, family = "binomial")
summary(df.logit)
```
N.B. Estimates for predictors will be given as logits (the natural log of the OR). These can be converted to an OR using `exp(df.logit$coefficients)`. If e.g. 0.96, this means for every 1-unit increase in your predictor variable, the odds decrease by 4%. For confidence intervals of the coefficients:
`confint(df.logit)`
And for confidence intervals of the ORs:
`exp(cbind(OR = coef(df.logit), confint(df.logit)))`
If a categorical variable (factor; e.g. predictor2) is used, each level will be split as a 'term' seen in the summary(). An overall effect of the categorical predictor can be assessed using the `wald.test()` function of the `aod` library:
`wald.test(b = coef(df.logit), Sigma = vcov(df.logit), Terms = 3:5)`
...Where 3:5 refers to terms 3 to 5 of the model (counting from 1; in this example the first level of predictor2 would be term 3)
If you want to look for an interaction (i.e. does predictor1 affect the outcome significantly differently in males vs females?):
`df.logit = glm(outcome ~ predictor1 * factor(male), data = dataframe, family = "binomial")`
(You can test all combinations of interactions between `a`, `b`, and `c` with `a * b * c`, or only a subset of the interactions with `a + b + c + a:b`, which would not test for a:c or a:b:c.)

### McNemar's Chi-Square test

For a 2x2 table with correlated data
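A minimal sketch with an invented table of paired before/after outcomes:

```r
# Rows: before (yes/no); columns: after (yes/no); cells are counts of pairs
paired_tab <- matrix(c(20, 5, 12, 13), ncol = 2, byrow = TRUE)
mcnemar.test(as.table(paired_tab))  # tests the discordant pairs (5 vs 12)
```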

### Conditional logistic regression

Like logistic regression, but for correlated data
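One option is `clogit()` from the `survival` package, with `strata()` marking the matched sets; this sketch uses R's built-in `infert` matched case-control data:

```r
library(survival)
data(infert)  # matched case-control data set shipped with R
fit <- clogit(case ~ spontaneous + induced + strata(stratum), data = infert)
summary(fit)  # exp(coef) gives the conditional odds ratios within matched sets
```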

### Fisher's exact test

Alternative to Chi-square for sparse non-correlated data (expected count in any cell < 5)
Perform as per 2x2 table, but replace `summary()` with:
`fisher.test(mytable)`

### McNemar's exact test

Alternative to McNemar's test for sparse correlated data
Perform as per 2x2 table, but replace `summary()` with:
```
library(exact2x2)
mcnemar.exact(mytable)
```

## Statistical tests - Time-to-event


### Kaplan-Meier & Log rank

(Two+ groups)
`library(survival)`
`hmohiv<-read.table("http://www.ats.ucla.edu/stat/r/examples/asa/hmohiv.csv", sep=",", header = TRUE)`
Assuming right-censored data, fit and plot the KM curves:
```
surv <- survfit(Surv(time, censor) ~ strata(drug), data = hmohiv, conf.type = "none")
plot(surv, xlab = "Time", ylab = "Survival Probability", col = c("Red", "Black"))
legend(40, 1.0, c("Drug - No", "Drug - Yes"), lty = c(1, 3))
```
Calculate the log rank:
`survdiff(Surv(time, censor) ~ drug, data = hmohiv)`

### Cox regression

(Multivariate regression technique producing hazard ratios; two+ groups)
`library(survival)`
Load data, e.g., and preview the top rows:
```
data(lung)
head(lung)
```
Add a survival object column (status == 2 is death):
```
lung$SurvObj <- with(lung, Surv(time, status == 2))
head(lung)
```
Calculate KM estimates for the whole set and then by sex:
```
km.as.one <- survfit(SurvObj ~ 1, data = lung, conf.type = "log-log")
km.by.sex <- survfit(SurvObj ~ sex, data = lung, conf.type = "log-log")
```
Fit the Cox regression, adjusting for age, sex, Karnofsky performance score and weight loss, and print the results:
```
res.cox1 <- coxph(SurvObj ~ age + sex + ph.karno + wt.loss, data = lung)
summary(res.cox1)
```
Check for violation of proportional hazards (i.e. a constant HR over time) and plot the scaled Schoenfeld residuals, along with a smooth curve, in a 2x2 plot:
```
res.zph1 <- cox.zph(res.cox1)
res.zph1  # look for significant p values
par(mfrow = c(2, 2))
plot(res.zph1)  # observe the graphs for obvious non-proportionality; the test above sometimes fails to detect it
```

If the above imply non-proportionality, one can either stratify on a categorical variable, e.g. `ph.karno` in this example:
`res.cox1.strata <- coxph(SurvObj ~ age + sex + strata(ph.karno) + wt.loss, data = lung)`
or use a time-varying effects model (see below).

### Time-varying effects

(Akin to Cox regression, but for violation of the proportional hazards assumption)
```
library(survival)
data(lung)
head(lung)
lung$event <- (lung$status == 2)  # TRUE if dead (status == 2), FALSE otherwise
```

Create a split survival data set with cuts at 200 and 400 (based on the appearance of the Schoenfeld residuals; see Cox regression):
```
lung.split <- survSplit(data = lung,
                        cut = c(200, 400),  # vector of timepoints to cut at
                        end = "time",       # name of the event time variable
                        event = "event",    # name of the censoring indicator
                        start = "start",    # name of the start time variable (created)
                        id = "id",          # name of the new id variable to create
                        zero = 0,           # if start doesn't already exist, used as start
                        episode = NULL)     # name of a new episode variable (optional)
lung.split$id <- as.numeric(lung.split$id)  # make ID numeric
library(doBy)
lung.split <- orderBy(~ id + start, lung.split)
```

Create a survival object column:
`lung.split$SurvObj <- with(lung.split, Surv(time = start, time2 = time, event = event))`

`head(lung.split, 10)`
```
res.cox1.strata <- coxph(SurvObj ~ age + sex + ph.karno + ph.karno:factor(start) +
                         wt.loss + cluster(id), data = lung.split)
summary(res.cox1.strata)
```

** Homogeneity of variances - with the T-test you can unpool the variances. Test homogeneity using Fisher's F-test: `var.test(a, b)`, where `a` and `b` are vectors of values. If p > 0.05, assume the variances are homogeneous. Alternatively, compare the value of F at alpha = 0.05 with `qf(0.95, <degrees of freedom numerator>, <degrees of freedom denominator>)`, where e.g. both DoF = 9 when comparing 2 groups of 10 each.