### PART 1: Analyzing provided genotype and phenotype data
Prepare the data. Read in the genotype and phenotype matrices.
genos = as.matrix(read.table("./genos.txt"))
phenos = as.matrix(read.table("./phenos.txt"))
Make a histogram of the phenotypes. Do they look normally distributed?
hist(phenos)
How are the genotypes encoded?
table(genos)
## genos
## 0 1 2
## 4773842 5447131 4779027
How many individuals are there in the dataset and how many SNPs?
(Save them in N and M, respectively.)
dim(genos)
dim(phenos)
N = ?
M = ?
Compute the minor allele frequency (MAF) for every SNP. Check that no MAF exceeds 0.5.
MAFs = array(0,M)
for(i in 1:M) {
MAFs[i] = ?
}
MAFs[1:10]
max(MAFs)
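As a hedged illustration of the MAF step on made-up data (the matrix `toy_genos` and the names `freqs` and `toy_mafs` are hypothetical and not part of the lab data): with 0/1/2 coding, the mean of a column divided by 2 gives the frequency of the counted allele, and folding at 0.5 gives the minor allele frequency.

```r
# Toy genotype matrix: 4 individuals (rows) x 3 SNPs (columns), coded 0/1/2.
# matrix() fills column by column, so each line below is one SNP.
toy_genos <- matrix(c(0, 1, 2, 1,
                      2, 2, 1, 2,
                      0, 0, 0, 1), nrow = 4)

# Frequency of the counted allele: mean allele count divided by 2.
freqs <- colMeans(toy_genos) / 2

# The minor allele is whichever allele is rarer, so fold at 0.5.
toy_mafs <- pmin(freqs, 1 - freqs)
toy_mafs
```

The second toy SNP has a counted-allele frequency above 0.5, which is why the fold with `pmin` matters.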
Run a GWAS under an additive model and save the p-values, z-scores, and effect sizes.
pvalues = array(0,M)
zscores = array(0,M)
betas = array(0,M)
for(i in 1:M) {
g = genos[,i]
res = summary(lm(?))
zscores[i] = ?
pvalues[i] = ?
betas[i] = ?
}
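A minimal sketch of where the three quantities live in the output of `summary(lm(...))`, using simulated data rather than the lab files. The names `g_toy`, `y_toy`, `beta_hat`, `z_hat`, and `p_hat` are hypothetical. Note that `lm` reports a t statistic, which is close to a z-score when the sample is large.

```r
set.seed(1)
n <- 200
g_toy <- rbinom(n, 2, 0.3)          # simulated 0/1/2 genotypes
y_toy <- 0.5 * g_toy + rnorm(n)     # simulated phenotype with a true effect

fit <- summary(lm(y_toy ~ g_toy))

# Coefficient table: rows are (Intercept) and g_toy;
# columns are Estimate, Std. Error, t value, Pr(>|t|).
coefs <- fit$coefficients
beta_hat <- coefs["g_toy", "Estimate"]
z_hat    <- coefs["g_toy", "t value"]
p_hat    <- coefs["g_toy", "Pr(>|t|)"]
c(beta = beta_hat, z = z_hat, p = p_hat)
```

Indexing by row and column name is equivalent to positional indexing such as `coefs[2, 3]` for the t value, but it is harder to get wrong.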
Summarize the effect sizes.
summary(?)
hist(?)
Are there any significantly associated SNPs? If so, which SNPs are they?
assoc = which(pvalues<?)
assoc
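One common way to choose the cutoff is a Bonferroni correction, which divides the desired family-wise error rate by the number of tests. The sketch below assumes that choice; the values of `alpha`, `n_tests`, and `toy_p` are hypothetical and are not taken from the lab data.

```r
alpha   <- 0.05      # desired family-wise error rate (assumption)
n_tests <- 10000     # hypothetical number of SNPs tested
bonf    <- alpha / n_tests

# Made-up p-values; only those below the corrected threshold are kept.
toy_p <- c(0.2, 3e-8, 0.04, 1e-6)
which(toy_p < bonf)
```

In the worksheet the number of tests would be `M`, so the analogous threshold is `0.05/M`.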
How big are their effect sizes? How significant are they?
betas[assoc]
zscores[assoc]
pvalues[assoc]
Draw a QQ plot of the -log10(p) values.
obsLogPvs = sort(-log10(pvalues))
expLogPvs = sort(-log10(seq(1/M,1,1/M)))
plot(expLogPvs,obsLogPvs,main='QQ plot')
abline( a=0, b=1 )
# color the significant SNPs red; they are the length(assoc) largest -log10(p) values
points(tail(expLogPvs,length(assoc)),tail(obsLogPvs,length(assoc)),col="red")
Is there inflation? Use the chi-square statistics to check.
chis = zscores^2
lambdaGC = median(chis)/0.454 # why .454?
lambdaGC
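A short check of the constant in the denominator, offered as a sketch: under the null hypothesis a squared z-score follows a chi-square distribution with 1 degree of freedom, and the median of that distribution can be obtained from `qchisq`. The names `null_median`, `z_null`, and `lambda_null` are hypothetical.

```r
# Median of a chi-square distribution with 1 degree of freedom.
null_median <- qchisq(0.5, df = 1)
null_median

# Sanity check on simulated null z-scores: lambda should be close to 1.
set.seed(2)
z_null <- rnorm(100000)
lambda_null <- median(z_null^2) / null_median
lambda_null
```

Values of lambda well above 1 suggest inflated test statistics; values near 1 suggest the statistics behave as expected under the null.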
Plot the phenotype predictions for the most significant SNP.
topSNP = genos[,order(pvalues)[?]]
plot(topSNP,phenos)
abline(lm(phenos~topSNP)$coeff,col="red")
Build a linear predictor of the phenotype using the associated SNPs.
ypred = array(0,N)
for(i in 1:N) {
ypred[i] = genos[i,assoc] %*% ?
}
plot(ypred,phenos)
What is the correlation between the predicted phenotype and the true phenotype?
cor(ypred,phenos)
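The per-individual loop is a sum of genotype times effect size over SNPs, which can also be written as a single matrix product. The sketch below shows the equivalence on made-up numbers; `toy_g`, `toy_b`, `pred_loop`, and `pred_mat` are hypothetical names.

```r
# 3 individuals (rows) x 2 SNPs (columns); matrix() fills column by column.
toy_g <- matrix(c(0, 1, 2,
                  2, 0, 1), nrow = 3)
toy_b <- c(0.5, -1)                  # made-up effect sizes, one per SNP

# Loop version: one dot product per individual.
pred_loop <- numeric(3)
for (i in 1:3) pred_loop[i] <- toy_g[i, ] %*% toy_b

# Vectorized version: one matrix-vector product for everyone.
pred_mat <- as.vector(toy_g %*% toy_b)
pred_mat
```

When applying this to a subset of columns, `genos[, assoc, drop = FALSE]` keeps the result a matrix even if only one SNP is selected.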
BONUS: Repeat the GWAS to test for recessive rather than additive genetic effects.
genos2 = genos
genos2[genos<?]=1
pvalues2 = array(0,M)
zscores2 = array(0,M)
betas2 = array(0,M)
for(i in 1:M) {
g = genos2[,i]
res = summary(lm(?))
zscores2[i] = ?
pvalues2[i] = ?
betas2[i] = ?
}
BONUS: Are the same SNPs significant or not?
assoc2 = which(pvalues2<?)
assoc2
BONUS: How did the effect sizes change?
plot(?,?)
Establish some important simulation parameters.
N = 1000 #number of individuals
M = 20 #number of non-causal SNPs
gs = matrix(0,nrow=N,ncol=M)
Simulate a GWAS data set. First, simulate the causal variant.
set.seed(42) #set random seed so we all get the same numbers
MAF = 0.5
gC = rbinom(N,1,MAF) #causal variant
Then, simulate the phenotypes given the causal variant.
beta = 0.3 #association of causal variant
pheno = gC*beta + rnorm(N)
Generate 10 tag SNPs in tight LD with the causal SNP.
rho = 0.9
for(i in 1:10) {
idx = rbinom(N,1,rho)
gs[,i]=gC*idx+rbinom(N,1,MAF)*(1-idx)
# test they have the right LD empirically
cat( 'Observed LD = ', cor( gs[,i], gC ), '\n' )
# Bonus: prove they have the right LD theoretically
}
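A hedged sketch of why this construction gives a correlation of rho. Each tag SNP copies the causal variant with probability rho and is otherwise an independent draw with the same frequency p, so it is itself Bernoulli(p) with variance p(1-p), and its covariance with the causal variant works out to rho*p*(1-p). Dividing the covariance by p(1-p) gives rho. The simulation below checks this numerically with a much larger sample than the lab uses; `n_big`, `p`, `rho_true`, `causal`, `copy_flag`, and `tag` are hypothetical names.

```r
set.seed(3)
n_big    <- 200000
p        <- 0.5
rho_true <- 0.9

causal    <- rbinom(n_big, 1, p)         # causal variant
copy_flag <- rbinom(n_big, 1, rho_true)  # 1 = copy the causal allele
# Copy the causal variant where copy_flag is 1, otherwise draw independently.
tag <- causal * copy_flag + rbinom(n_big, 1, p) * (1 - copy_flag)

cor(tag, causal)
```

With N = 1000 the observed correlations scatter more widely around 0.9, which is expected sampling noise.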
Do the same for 10 independent SNPs (rho=0).
rho = 0
for(i in 11:20) {
?
}
Run a GWAS on the causal variant. Then run a GWAS on the other variants. Keep track of the z-scores only.
zsC = summary(lm(pheno~gC))$coef[2,3]
zs = sapply( 1:M, function(i) summary(lm(pheno~gs[,i]))$coef[2,3] )
Visualize the relationship between the distribution of z-scores for tag SNPs and uncorrelated SNPs, showing the z-score at the causal SNP as a vertical line.
par( mfrow=c(2,1) )
breaks = hist(c(0,zsC,zs),plot=FALSE)$breaks
hist(?,breaks=breaks, col=2, main='Tag SNPs')
abline(v=?)
hist(?,breaks=breaks, col=3, main='Other SNPs')
abline(v=?)