### PART1: Analyzing provided genotype and phenotype data.

Prepare the data.
Read in the genotype and phenotype matrices.

genos = as.matrix(read.table("./genos.txt"))
phenos = as.matrix(read.table("./phenos.txt"))


Make a histogram of the phenotypes. Do they look normally distributed?

hist(phenos)


Yes. This approximates a normal distribution.
How are the genotypes encoded?

table(genos)
## genos
##       0       1       2 
## 4773842 5447131 4779027


Genotypes are numerical variables encoded as 0, 1, or 2. These represent homozygous major allele, heterozygous, or homozygous minor allele. It is possible that homozygote minor allele is either 0 or 2, and this could depend on the SNP (need to check).

How many individuals are there in the dataset and how many SNPs? (Save them in N and M, respectively.)

dim(genos)
## [1]  1500 10000
dim(phenos)
## [1] 1500    1
N = nrow(genos)
M = ncol(genos)


There are 1,500 individuals and 10,000 SNPs in the data. We know this because we expect one phenotype for each individual.

Compute the minor allele frequency for every SNP. Check MAFs are <0.5.

MAFs = array(0,M)
for(i in 1:M) {
      gi = genos[,i]
      MAFs[i] = min(sum(gi)/(2*N),sum(2-gi)/(2*N))
}
MAFs[1:10]
##  [1] 0.1516667 0.3226667 0.4126667 0.2156667 0.4626667 0.4826667 0.4863333
##  [8] 0.1516667 0.2990000 0.3310000
max(MAFs)
## [1] 0.5


There are multiple ways to do the MAF calculation. The key ideas are to identify if 0 or 2 is the minor allele (using minimum or maximum functions, for example), to count homozygous minor allele twice, to count heterozygous once, and to divide by twice the number of individuals as they represent \(2*N\) chromosomes.
Run a GWAS under an additive model and save the p-values, z-scores, and effect sizes.

pvalues = array(0,M)
zscores = array(0,M)
betas = array(0,M)
for(i in 1:M) {
    g = genos[,i]
    res = summary(lm(phenos~g))
    zscores[i] = res$coefficients[2,3]
    pvalues[i] = res$coefficients[2,4]
    betas[i] = res$coefficients[2,1]
}


Summarize the effect sizes.

summary(betas)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -3.400728 -0.105822  0.002475  0.004204  0.113524  2.245525
hist(betas)


Are there any significantly associated SNPs? If so, which SNPs are they?

assoc = which(pvalues<0.05/M)
assoc
##  [1]  1  2  3  4  5  6  7  8  9 10


SNPs 1-10 are significantly associated with the phenotype.
How big are their effect sizes? How significant are they?

betas[assoc]
##  [1] -3.4007282  2.1340404  1.5238919  0.7930685 -1.4787008  1.9852459
##  [7] -1.2225227  2.2455246  1.8605128  0.7607997
zscores[assoc]
##  [1] -21.021900  16.013141  11.615036   5.003065 -11.436804  16.128209
##  [7]  -9.115941  12.602187  13.341784   5.383964
pvalues[assoc]
##  [1] 3.488001e-86 2.148840e-53 6.341996e-30 6.308608e-07 4.244077e-29
##  [6] 4.411319e-54 2.434694e-19 1.103317e-34 1.884171e-38 8.447687e-08


The effect sizes and p-values vary but all represent fairly strong associations. The first SNP has a particularly significant phenotype association.

Draw a QQ plot for log10(p) values.

obsLogPvs = sort(-log10(pvalues))
expLogPvs = sort(-log10(seq(1/M,1,1/M)))
plot(expLogPvs,obsLogPvs,main='QQ plot')
abline( a=0, b=1 )
#label the significant SNPs red 
points(expLogPvs[(M-length(assoc)):M],obsLogPvs[(M-length(assoc)):M],col="red")


Is there inflation? Use the chi-square statistics to check.

chis = zscores^2
lambdaGC = median(chis)/0.454 # why .454?
median(rnorm(10000000)^2)
## [1] 0.4552651
lambdaGC
## [1] 1.007873


Given that the calculated lambdaGC value (1.008) is close to 1, there is not a significant amount of inflation. This is corroborated by the QQ plot showing that SNP p-values do not deviate from the null expectation except for a few highly significant SNPs.
The value 0.454 is used because it is the expected median chi-square statistic (df=1) under the null hypothesis that no SNPs are significantly associated. We can check this empirically by generating a large number of N(0,1) variables, squaring them, and taking the median. It can also be shown analytically.

Plot the phenotype predictions for the most significant SNP.

topSNP = genos[,order(pvalues)[1]]
plot(topSNP,phenos)
abline(lm(phenos~topSNP)$coeff,col="red")


The predictions go through the middle of the phenotype values for each genotype group, but are not precisely equal to the mean for each group since we fit an additive model that constrains the difference between the prediction for the 0 vs. 1 genotype to be the same as that of the 1 vs. 2 genotype. If we encoded the genotype as a factor, we could have estimated a separate prediction for each of the three genotypes (technically, an intercept representing the 0 genotype and two offsets from the intercept representing the 1 and 2 genotypes). In this case, the prediction for each genotype group is the mean value of the phenotypes for individuals with that genotype.

Build a linear predictor of the phenotype using the associated SNPs.

ypred = array(0,N)
for(i in 1:N) {
      ypred[i] = genos[i,assoc] %*% betas[assoc]
}
plot(ypred,phenos)


There is some variability, but most phenotype predictions are close to the observed phenotypes.

What is the correlation between the predicted phenotype and the true phenotype?

cor(ypred,phenos)


This correlation is high and indicates good prediction by the model.

BONUS: Test each of the associated SNPs for non-linearity.

hp = array(0,length(assoc))
for (i in 1:length(assoc)) {
  g = genos[,assoc[i]]
  h = g
  h[h==2]=0
  #Hint: can use anova(lm(?),lm(?)) or summary(lm(?))
  hp[i] <- anova( lm(phenos~g+h), lm(phenos~g) )$Pr[2] 
  #skip multiple test correction for now
}
hp 
##  [1] 0.0008859446 0.9243902273 0.4974970958 0.1514667494 0.4854254373
##  [6] 0.1959153850 0.1201895538 0.1931039794 0.8697644427 0.4132013052


These are p-values from a series of ANOVA tests, one for each of the significant SNPs. SNP 1 is the only SNP with significant evidence of a non-linear relationship with phenos, from this test. The ANOVA shows that adding h to the model significantly changes the fit. Including h allows the change in mean for 0 vs. 1 genotype to be different than the change in mean for 1 vs. 2 genotype.

BONUS: Visualize a linear SNP and a non-linear SNP.

par( mfrow=c(1,2) )
plot( genos[,assoc[1]], phenos, xlab="Genotype", ylab="Phenotype", main="SNP 1 (non-line
ar)" )
points( c(0,1,2), tapply( phenos, genos[,assoc[1]], mean ), col=2, pch=16, cex=3 )
lines( c(0,1,2), tapply( phenos, genos[,assoc[1]], mean ), col=2, lwd=2 )
plot( genos[,assoc[2]], phenos, xlab="Genotype", ylab="Phenotype", main="SNP 2 (linear)"
)
points( c(0,1,2), tapply( phenos, genos[,assoc[2]], mean ), col=2, pch=16, cex=3 )
lines( c(0,1,2), tapply( phenos, genos[,assoc[2]], mean ), col=2, lwd=2 )


The first SNP is non-linear, and hence there is not a straight line through its predicted values. The linear SNP does have a straight line through the predictions.

Repeat the GWAS to test for recessive rather than additive genetic effects.

genos2 = genos
genos2[genos<2]=1
pvalues2 = array(0,M)
zscores2 = array(0,M)
betas2 = array(0,M)
for(i in 1:M) {
  g = genos2[,i]
  res = summary(lm(phenos~g))
  zscores2[i] = res$coefficients[2,3]
  pvalues2[i] = res$coefficients[2,4]
  betas2[i] = res$coefficients[2,1]
}


Are the same SNPs significant or not?

assoc2 = which(pvalues2<0.05/M)
assoc2
## [1] 1 2 3 4 5 6 7 8 9


Nine of the SNPs are the same, but the tenth one is no longer significant. This may be because adding the phenotypes for homozygous 0 individuals to the genotype 1 group increased the variability in this group.

How did the effect sizes change?

plot(betas,betas2)
points(betas[1:10],betas2[1:10],col="red")


For the unassociated SNPs, the effect sizes change randomly. For the associated SNPs, the effect sizes are all bigger in absolute value. We know that for these SNPs, the original data showed differences in mean phenotype across all three values of genotype. In the recessive model, we have set genotype 0 to be the same as genotype 1. This brought the mean of genotype 1 down, and hence increased the expected difference between genotypes 1 and 2 (i.e., the effect sizes).

PART2: Simulating genotypes with LD

Establish some important simulation parameters

N = 1000 #number of individuals
M = 30   #number of non-causal SNPs
gs = matrix(0,nrow=N,ncol=M)


Simulate a GWAS data set. First, simulate the causal variant.

set.seed = (42) #set random seed so we all get the same numbers
MAF = 0.5
gC = rbinom(N,1,MAF) #causal variant


Then, simulate the phenotypes given the causal variant.

beta = 0.3 #association of causal variant
pheno = gC*beta + rnorm(N) 


Generate 10 SNPS in tight LD with the causal SNP.

rho = 0.9
for(i in 1:10) {
  idx = rbinom(N,1,rho)
  gs[,i]=gC*idx+rbinom(N,1,MAF)*(1-idx)
  # test they have the right LD empirically
  cat( 'Observed LD = ', cor( gs[,i], gC ), '\n' )
  # Bonus: prove they have the right LD theoretically
}
## Observed LD =  0.8998779 
## Observed LD =  0.9098648 
## Observed LD =  0.8858282 
## Observed LD =  0.9098657 
## Observed LD =  0.8878622 
## Observed LD =  0.9219837 
## Observed LD =  0.8858772 
## Observed LD =  0.9100792 
## Observed LD =  0.8738291 
## Observed LD =  0.8938566


Do the same for 10 moderate LD partners (rho=0.6).

rho = 0.6
for(i in 1:10+10) {
  idx = rbinom(N,1,rho)
  gs[,i]=gC*idx+rbinom(N,1,MAF)*(1-idx)
  cat( 'Observed LD = ', cor( gs[,i], gC ), '\n' )
}
## Observed LD =  0.6123675 
## Observed LD =  0.5993654 
## Observed LD =  0.6093578 
## Observed LD =  0.6192982 
## Observed LD =  0.5976917 
## Observed LD =  0.6541894 
## Observed LD =  0.6056998 
## Observed LD =  0.6137079 
## Observed LD =  0.5315755 
## Observed LD =  0.5592475


Do the same for 10 independent SNPs (rho=0).

rho = 0
for(i in 21:30) {
  idx = rbinom(N,1,rho)
  gs[,i]=gC*idx+rbinom(N,1,MAF)*(1-idx)
  cat( 'Observed LD = ', cor( gs[,i], gC ), '\n' )
}
## Observed LD =  0.006769094 
## Observed LD =  0.006308758 
## Observed LD =  -0.04263168 
## Observed LD =  -0.01370634 
## Observed LD =  -0.004231132 
## Observed LD =  -0.004597956 
## Observed LD =  0.01401012 
## Observed LD =  0.06222271 
## Observed LD =  0.01416234 
## Observed LD =  -0.028511

Run GWAS on the causal variant. Then run GWAS on the other variants. Keep track of the zscores only.

zsC = summary(lm(pheno~gC))$coef[2,3]
zs = sapply( 1:M, function(i) summary(lm(pheno~gs[,i]))$coef[2,3] )


Visualize the relationship between the mean z-scores at the tag SNPs and the z-score at the causal SNP.

par( mfrow=c(2,2) )
breaks = hist(c(0,zsC,zs),plot=F)$breaks
hist(zs[1:10],breaks=breaks, col=1, main='LD partners')
abline(v=zsC)
hist(zs[11:20],breaks=breaks, col=2, main='Low-LD partner SNPs')
abline(v=zsC)
hist(zs[21:30],breaks=breaks, col=3, main='Independent SNPs')
abline(v=zsC)


As LD increases, the z-scores of the tag SNPs get closer to the z-score of the causal SNP. In real data where the causal SNP is unknown, this makes it hard to identify the causal SNP from a group of associated SNPs with high LD.

BONUS: Perform LD score regression. First, calculate the LD scores. There should be M+1 of them.

ldscores = sapply( 1:(M+1), function(m) sum(cor(cbind(gs,gC))[m,]^2))
#another way to do it
lds = cor(cbind(gs,gC))
ldscores = diag( lds %*% lds )
ldscores
##                                                                                 
## 10.629917 10.833282 10.348543 10.873696 10.358878 11.069537 10.345425 10.932940 
##                                                                                 
## 10.203679 10.428446  5.698840  5.509622  5.694847  5.734401  5.372572  6.194922 
##                                                                                 
##  5.639377  5.760472  4.604696  4.856151  1.023082  1.018806  1.037817  1.030807 
##                                                                    gC 
##  1.021018  1.021816  1.022184  1.070219  1.024166  1.025303 12.684445


BONUS: Visualize LD score regression.

chis = c( zsC, zs )^2
plot(ldscores, chis, ylab=expression(chi^2))

#test for inflation
lambdaGC = median(chis)/0.454
lambdaGC
## [1] 27.65272


BONUS: Estimate heritability.

summary( lm( chis - 1 ~ ldscores ) )$coef[2,1] * M/N
## [1] 0.05226158


BONUS: What is the true heritability?

var(gC*beta) / var(pheno)
## [1] 0.02171375