### PART1: Analyzing provided genotype and phenotype
data.
Prepare the data.
Read in the genotype and
phenotype matrices.
genos = as.matrix(read.table("./genos.txt"))
phenos = as.matrix(read.table("./phenos.txt"))
Make a histogram of the phenotypes. Do they look normally distributed?
hist(phenos)
Yes. This approximates a normal distribution.
How are the
genotypes encoded?
table(genos)
## genos
## 0 1 2
## 4773842 5447131 4779027
Genotypes are numerical variables encoded as 0, 1, or 2. These
represent homozygous major allele, heterozygous, or homozygous minor
allele. It is possible that homozygote minor allele is either 0 or 2,
and this could depend on the SNP (need to check).
How many individuals are there in the dataset and how many SNPs?
(Save them in N
and M
, respectively.)
dim(genos)
## [1] 1500 10000
dim(phenos)
## [1] 1500 1
N = nrow(genos)
M = ncol(genos)
There are 1,500 individuals and 10,000 SNPs in the data. We know
this because we expect one phenotype for each individual.
Compute the minor allele frequency for every SNP. Check MAFs are <0.5.
MAFs = array(0,M)
for(i in 1:M) {
gi = genos[,i]
MAFs[i] = min(sum(gi)/(2*N),sum(2-gi)/(2*N))
}
MAFs[1:10]
## [1] 0.1516667 0.3226667 0.4126667 0.2156667 0.4626667 0.4826667 0.4863333
## [8] 0.1516667 0.2990000 0.3310000
max(MAFs)
## [1] 0.5
There are multiple ways to do the MAF calculation. The key ideas
are to identify if 0 or 2 is the minor allele (using minimum or maximum
functions, for example), to count homozygous minor allele twice, to
count heterozygous once, and to divide by twice the number of
individuals as they represent \(2*N\)
chromosomes.
Run a GWAS under an additive model and save
the p-values, z-scores, and effect sizes.
pvalues = array(0,M)
zscores = array(0,M)
betas = array(0,M)
for(i in 1:M) {
g = genos[,i]
res = summary(lm(phenos~g))
zscores[i] = res$coefficients[2,3]
pvalues[i] = res$coefficients[2,4]
betas[i] = res$coefficients[2,1]
}
Summarize the effect sizes.
summary(betas)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.400728 -0.105822 0.002475 0.004204 0.113524 2.245525
hist(betas)
Are there any significantly associated SNPs? If so, which SNPs are they?
assoc = which(pvalues<0.05/M)
assoc
## [1] 1 2 3 4 5 6 7 8 9 10
SNPs 1-10 are significantly associated with the phenotype.
How big are their effect sizes? How significant are they?
betas[assoc]
## [1] -3.4007282 2.1340404 1.5238919 0.7930685 -1.4787008 1.9852459
## [7] -1.2225227 2.2455246 1.8605128 0.7607997
zscores[assoc]
## [1] -21.021900 16.013141 11.615036 5.003065 -11.436804 16.128209
## [7] -9.115941 12.602187 13.341784 5.383964
pvalues[assoc]
## [1] 3.488001e-86 2.148840e-53 6.341996e-30 6.308608e-07 4.244077e-29
## [6] 4.411319e-54 2.434694e-19 1.103317e-34 1.884171e-38 8.447687e-08
The effect sizes and p-values vary but all represent fairly
strong associations. The first SNP has a particularly significant
phenotype association.
Draw a QQ plot for log10(p) values.
obsLogPvs = sort(-log10(pvalues))
expLogPvs = sort(-log10(seq(1/M,1,1/M)))
plot(expLogPvs,obsLogPvs,main='QQ plot')
abline( a=0, b=1 )
#label the significant SNPs red
points(expLogPvs[(M-length(assoc)):M],obsLogPvs[(M-length(assoc)):M],col="red")
Is there inflation? Use the chi-square statistics to check.
chis = zscores^2
lambdaGC = median(chis)/0.454 # why .454?
median(rnorm(10000000)^2)
## [1] 0.4552651
lambdaGC
## [1] 1.007873
Given that the calculated lambdaGC
value (1.008) is
close to 1, there is not a significant amount of inflation. This is
corroborated by the QQ plot showing that SNP p-values do not deviate
from the null expectation except for a few highly significant SNPs.
The value 0.454 is used because it is the expected median chi-square
statistic (df=1) under the null hypothesis that no SNPs are
significantly associated. We can check this empirically by generating a
large number of N(0,1) variables, squaring them, and taking the median.
It can also be shown analytically.
Plot the phenotype predictions for the most significant SNP.
topSNP = genos[,order(pvalues)[1]]
plot(topSNP,phenos)
abline(lm(phenos~topSNP)$coeff,col="red")
The predictions go through the middle of the phenotype values for
each genotype group, but are not precisely equal to the mean for each
group since we fit an additive model that constrains the difference
between the prediction for the 0 vs. 1 genotype to be the same as that
of the 1 vs. 2 genotype. If we encoded the genotype as a factor, we
could have estimated a separate prediction for each of the three
genotypes (technically, an intercept representing the 0 genotype and two
offsets from the intercept representing the 1 and 2 genotypes). In this
case, the prediction for each genotype group is the mean value of the
phenotypes for individuals with that genotype.
Build a linear predictor of the phenotype using the associated SNPs.
ypred = array(0,N)
for(i in 1:N) {
ypred[i] = genos[i,assoc] %*% betas[assoc]
}
plot(ypred,phenos)
There is some variability, but most phenotype predictions are close
to the observed phenotypes.
What is the correlation between the predicted phenotype and the true phenotype?
cor(ypred,phenos)
This correlation is high and indicates good prediction by the
model.
BONUS: Test each of the associated SNPs for non-linearity.
hp = array(0,length(assoc))
for (i in 1:length(assoc)) {
g = genos[,assoc[i]]
h = g
h[h==2]=0
#Hint: can use anova(lm(?),lm(?)) or summary(lm(?))
hp[i] <- anova( lm(phenos~g+h), lm(phenos~g) )$Pr[2]
#skip multiple test correction for now
}
hp
## [1] 0.0008859446 0.9243902273 0.4974970958 0.1514667494 0.4854254373
## [6] 0.1959153850 0.1201895538 0.1931039794 0.8697644427 0.4132013052
These are p-values from a series of ANOVA tests, one for each of
the significant SNPs. SNP 1 is the only SNP with significant evidence of
a non-linear relationship with phenos
, from this test. The
ANOVA shows that adding h
to the model significantly
changes the fit. Including h
allows the change in mean for
0 vs. 1 genotype to be different than the change in mean for 1 vs. 2
genotype.
BONUS: Visualize a linear SNP and a non-linear SNP.
par( mfrow=c(1,2) )
plot( genos[,assoc[1]], phenos, xlab="Genotype", ylab="Phenotype", main="SNP 1 (non-line
ar)" )
points( c(0,1,2), tapply( phenos, genos[,assoc[1]], mean ), col=2, pch=16, cex=3 )
lines( c(0,1,2), tapply( phenos, genos[,assoc[1]], mean ), col=2, lwd=2 )
plot( genos[,assoc[2]], phenos, xlab="Genotype", ylab="Phenotype", main="SNP 2 (linear)"
)
points( c(0,1,2), tapply( phenos, genos[,assoc[2]], mean ), col=2, pch=16, cex=3 )
lines( c(0,1,2), tapply( phenos, genos[,assoc[2]], mean ), col=2, lwd=2 )
The first SNP is non-linear, and hence there is not a straight line
through its predicted values. The linear SNP does have a straight line
through the predictions.
Repeat the GWAS to test for recessive rather than additive genetic effects.
genos2 = genos
genos2[genos<2]=1
pvalues2 = array(0,M)
zscores2 = array(0,M)
betas2 = array(0,M)
for(i in 1:M) {
g = genos2[,i]
res = summary(lm(phenos~g))
zscores2[i] = res$coefficients[2,3]
pvalues2[i] = res$coefficients[2,4]
betas2[i] = res$coefficients[2,1]
}
Are the same SNPs significant or not?
assoc2 = which(pvalues2<0.05/M)
assoc2
## [1] 1 2 3 4 5 6 7 8 9
Nine of the SNPs are the same, but the tenth one is no longer
significant. This may be because adding the phenotypes for homozygous 0
individuals to the genotype 1 group increased the variability in this
group.
How did the effect sizes change?
plot(betas,betas2)
points(betas[1:10],betas2[1:10],col="red")
For the unassociated SNPs, the effect sizes change randomly. For
the associated SNPs, the effect sizes are all bigger in absolute value.
We know that for these SNPs, the original data showed differences in
mean phenotype across all three values of genotype. In the recessive
model, we have set genotype 0 to be the same as genotype 1. This brought
the mean of genotype 1 down, and hence increased the expected difference
between genotypes 1 and 2 (i.e., the effect sizes).
Establish some important simulation parameters
N = 1000 #number of individuals
M = 30 #number of non-causal SNPs
gs = matrix(0,nrow=N,ncol=M)
Simulate a GWAS data set. First, simulate the causal variant.
set.seed = (42) #set random seed so we all get the same numbers
MAF = 0.5
gC = rbinom(N,1,MAF) #causal variant
Then, simulate the phenotypes given the causal variant.
beta = 0.3 #association of causal variant
pheno = gC*beta + rnorm(N)
Generate 10 SNPS in tight LD with the causal SNP.
rho = 0.9
for(i in 1:10) {
idx = rbinom(N,1,rho)
gs[,i]=gC*idx+rbinom(N,1,MAF)*(1-idx)
# test they have the right LD empirically
cat( 'Observed LD = ', cor( gs[,i], gC ), '\n' )
# Bonus: prove they have the right LD theoretically
}
## Observed LD = 0.8998779
## Observed LD = 0.9098648
## Observed LD = 0.8858282
## Observed LD = 0.9098657
## Observed LD = 0.8878622
## Observed LD = 0.9219837
## Observed LD = 0.8858772
## Observed LD = 0.9100792
## Observed LD = 0.8738291
## Observed LD = 0.8938566
Do the same for 10 moderate LD partners (rho=0.6).
rho = 0.6
for(i in 1:10+10) {
idx = rbinom(N,1,rho)
gs[,i]=gC*idx+rbinom(N,1,MAF)*(1-idx)
cat( 'Observed LD = ', cor( gs[,i], gC ), '\n' )
}
## Observed LD = 0.6123675
## Observed LD = 0.5993654
## Observed LD = 0.6093578
## Observed LD = 0.6192982
## Observed LD = 0.5976917
## Observed LD = 0.6541894
## Observed LD = 0.6056998
## Observed LD = 0.6137079
## Observed LD = 0.5315755
## Observed LD = 0.5592475
Do the same for 10 independent SNPs (rho=0).
rho = 0
for(i in 21:30) {
idx = rbinom(N,1,rho)
gs[,i]=gC*idx+rbinom(N,1,MAF)*(1-idx)
cat( 'Observed LD = ', cor( gs[,i], gC ), '\n' )
}
## Observed LD = 0.006769094
## Observed LD = 0.006308758
## Observed LD = -0.04263168
## Observed LD = -0.01370634
## Observed LD = -0.004231132
## Observed LD = -0.004597956
## Observed LD = 0.01401012
## Observed LD = 0.06222271
## Observed LD = 0.01416234
## Observed LD = -0.028511
Run GWAS on the causal variant. Then run GWAS on the other variants. Keep track of the zscores only.
zsC = summary(lm(pheno~gC))$coef[2,3]
zs = sapply( 1:M, function(i) summary(lm(pheno~gs[,i]))$coef[2,3] )
Visualize the relationship between the mean z-scores at the tag SNPs and the z-score at the causal SNP.
par( mfrow=c(2,2) )
breaks = hist(c(0,zsC,zs),plot=F)$breaks
hist(zs[1:10],breaks=breaks, col=1, main='LD partners')
abline(v=zsC)
hist(zs[11:20],breaks=breaks, col=2, main='Low-LD partner SNPs')
abline(v=zsC)
hist(zs[21:30],breaks=breaks, col=3, main='Independent SNPs')
abline(v=zsC)
As LD increases, the z-scores of the tag SNPs get closer to the
z-score of the causal SNP. In real data where the causal SNP is unknown,
this makes it hard to identify the causal SNP from a group of associated
SNPs with high LD.
BONUS: Perform LD score regression. First, calculate the LD scores. There should be M+1 of them.
ldscores = sapply( 1:(M+1), function(m) sum(cor(cbind(gs,gC))[m,]^2))
#another way to do it
lds = cor(cbind(gs,gC))
ldscores = diag( lds %*% lds )
ldscores
##
## 10.629917 10.833282 10.348543 10.873696 10.358878 11.069537 10.345425 10.932940
##
## 10.203679 10.428446 5.698840 5.509622 5.694847 5.734401 5.372572 6.194922
##
## 5.639377 5.760472 4.604696 4.856151 1.023082 1.018806 1.037817 1.030807
## gC
## 1.021018 1.021816 1.022184 1.070219 1.024166 1.025303 12.684445
BONUS: Visualize LD score regression.
chis = c( zsC, zs )^2
plot(ldscores, chis, ylab=expression(chi^2))
#test for inflation
lambdaGC = median(chis)/0.454
lambdaGC
## [1] 27.65272
BONUS: Estimate heritability.
summary( lm( chis - 1 ~ ldscores ) )$coef[2,1] * M/N
## [1] 0.05226158
BONUS: What is the true heritability?
var(gC*beta) / var(pheno)
## [1] 0.02171375