### PART1: Analyzing provided genotype and phenotype data.

Prepare the data. Read in the genotype and phenotype matrices.

genos = as.matrix(read.table("./genos.txt"))
phenos = as.matrix(read.table("./phenos.txt"))


Make a histogram of the phenotypes. Do they look normally distributed?

hist(phenos)


How are the genotypes encoded?

table(genos)
## genos
##       0       1       2 
## 4773842 5447131 4779027


How many individuals are there in the dataset and how many SNPs? (Save them in N and M, respectively.)

dim(genos)
dim(phenos)
N = ?
M = ?


Compute the minor allele frequency (MAF) for every SNP. Check that all MAFs are at most 0.5.

MAFs = array(0,M)
for(i in 1:M) {
      MAFs[i] = ?
}
MAFs[1:10]
max(MAFs)
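
One possible way to fill in the loop, sketched on a small made-up genotype matrix. The names `toyGenos`, `toyM`, and `toyMAFs` are illustrative and are not part of the course data. The sketch assumes the 0/1/2 coding counts copies of one allele, so the frequency of that allele is the column mean divided by 2, and the MAF is the smaller of that frequency and its complement.

```r
# Toy 0/1/2 genotype matrix: 4 individuals x 3 SNPs (made-up values)
toyGenos = matrix(c(0,1,2,2,
                    2,2,1,2,
                    0,0,1,0), nrow=4)
toyM = ncol(toyGenos)
toyMAFs = array(0,toyM)
for(i in 1:toyM) {
  p = mean(toyGenos[,i])/2     # frequency of the counted allele
  toyMAFs[i] = min(p, 1-p)     # fold so that the minor allele is reported
}
toyMAFs
max(toyMAFs) <= 0.5
```

The same result can be obtained without a loop as `pmin(colMeans(toyGenos)/2, 1 - colMeans(toyGenos)/2)`.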


Run a GWAS under an additive model and save the p-values, z-scores, and effect sizes.

pvalues = array(0,M)
zscores = array(0,M)
betas = array(0,M)
for(i in 1:M) {
    g = genos[,i]
    res = summary(lm(?))
    zscores[i] = ?
    pvalues[i] = ?
    betas[i] = ?
}
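
A minimal sketch of what one iteration of the loop could look like, run on simulated data rather than the course files. The variables `n`, `y`, `gToy`, and the effect size 0.5 are assumptions made for illustration. The coefficient table returned by `summary(lm(...))` has one row per term (intercept first, then the genotype) and the columns Estimate, Std. Error, t value, and Pr(>|t|); the t value is used here as the z-score.

```r
set.seed(1)
n = 200
gToy = rbinom(n, 2, 0.3)           # additive 0/1/2 genotype coding
y = 0.5*gToy + rnorm(n)            # simulated phenotype, assumed true effect 0.5
res = summary(lm(y ~ gToy))
res$coefficients                   # rows: intercept, gToy
betaToy   = res$coefficients[2,1]  # effect size estimate
zscoreToy = res$coefficients[2,3]  # t statistic, used as the z-score
pvalueToy = res$coefficients[2,4]  # two-sided p-value
```

With large N the t statistic and a normal z-score are practically interchangeable, which is why the third column is a reasonable stand-in.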


Summarize the effect sizes.

summary(?)
hist(?)


Are there any significantly associated SNPs? If so, which SNPs are they?

assoc = which(pvalues<?)
assoc
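
The threshold is left open above. One common choice, shown here only as an assumption, is a Bonferroni correction that divides the significance level by the number of tests; human GWAS often use the fixed value 5e-8 instead. The p-values below are made up for illustration.

```r
toyPvalues = c(0.2, 3e-9, 0.04, 1e-6, 0.7)   # made-up p-values for 5 SNPs
toyM = length(toyPvalues)
alpha = 0.05
threshold = alpha/toyM                       # Bonferroni threshold, 0.01 here
toyAssoc = which(toyPvalues < threshold)     # indices of significant SNPs
toyAssoc
```

Note that the third SNP (p = 0.04) passes the nominal 0.05 level but not the corrected threshold.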


How big are their effect sizes? How significant are they?

betas[assoc]
zscores[assoc]
pvalues[assoc]


Draw a QQ plot of the -log10(p) values.

obsLogPvs = sort(-log10(pvalues))
expLogPvs = sort(-log10(seq(1/M,1,1/M)))
plot(expLogPvs,obsLogPvs,main='QQ plot')
abline( a=0, b=1 )
# color the significant SNPs red: they are the length(assoc) largest sorted values
sig = tail(seq_len(M), length(assoc))
points(expLogPvs[sig],obsLogPvs[sig],col="red")


Is there inflation? Use the chi-square statistics to check.

chis = zscores^2
lambdaGC = median(chis)/0.454 # why .454?
lambdaGC
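
As a hint for the question in the comment, the sketch below compares the constant with the median of a chi-square distribution with 1 degree of freedom, and checks that z-scores simulated under the null give a ratio close to 1. The variables `nullMedian`, `nullZ`, and `lambdaNull` are illustrative names.

```r
nullMedian = qchisq(0.5, df=1)        # median of a chi-square with 1 df, about 0.455
nullMedian
set.seed(1)
nullZ = rnorm(100000)                 # simulated z-scores with no true associations
lambdaNull = median(nullZ^2)/nullMedian
lambdaNull                            # expected to be close to 1 without inflation
```

Values of the ratio well above 1 on real data would suggest inflated test statistics, for example from population structure.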


Plot the phenotype predictions for the most significant SNP.

topSNP = genos[,order(pvalues)[?]]
plot(topSNP,phenos)
abline(lm(phenos~topSNP)$coeff,col="red")


Build a linear predictor of the phenotype using the associated SNPs.

ypred = array(0,N)
for(i in 1:N) {
      ypred[i] = genos[i,assoc] %*% ?
}
plot(ypred,phenos)
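
A sketch of the predictor on made-up numbers, assuming the weights are the estimated effect sizes of the associated SNPs (`toyGenos`, `toyBetas`, and `toyAssoc` are illustrative). It shows the per-individual loop next to the equivalent single matrix product.

```r
toyGenos = matrix(c(0,1,2,1,
                    2,0,1,1,
                    1,1,0,2), nrow=4)   # 4 individuals x 3 SNPs
toyBetas = c(0.5, -0.2, 0.1)            # made-up effect sizes
toyAssoc = c(1,3)                       # pretend SNPs 1 and 3 are significant
toyN = nrow(toyGenos)

# Loop form, one dot product per individual
ypredLoop = array(0,toyN)
for(i in 1:toyN) {
  ypredLoop[i] = toyGenos[i,toyAssoc] %*% toyBetas[toyAssoc]
}

# Equivalent vectorized form; drop=FALSE keeps a matrix even for one SNP
ypredVec = toyGenos[,toyAssoc,drop=FALSE] %*% toyBetas[toyAssoc]
```

The intercept is omitted in this sketch; that shifts every prediction by the same constant and leaves the correlation with the phenotype unchanged.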


What is the correlation between the predicted phenotype and the true phenotype?

cor(ypred,phenos)


BONUS: Repeat the GWAS to test for recessive rather than additive genetic effects.

genos2 = genos
genos2[genos<?]=1
pvalues2 = array(0,M)
zscores2 = array(0,M)
betas2 = array(0,M)
for(i in 1:M) {
  g = genos2[,i]
  res = summary(lm(?))
  zscores2[i] = ?
  pvalues2[i] = ?
  betas2[i] = ?
}
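
A sketch of recessive recoding on simulated data, under the assumption that "recessive" refers to the allele being counted, so that only genotype 2 differs from the other two. It compares a 0/0/1 coding with a 1/1/2 coding (one reading of the recoding template above); the two differ only by a constant shift, so the slope, z-score, and p-value agree. All variable names here are illustrative.

```r
set.seed(2)
n = 300
gToy = rbinom(n, 2, 0.4)
y = 0.8*(gToy == 2) + rnorm(n)      # simulated phenotype with a recessive effect

gRec01 = as.numeric(gToy == 2)      # coding 0/0/1
gRec12 = gToy
gRec12[gToy < 1] = 1                # coding 1/1/2

fit01 = summary(lm(y ~ gRec01))$coefficients
fit12 = summary(lm(y ~ gRec12))$coefficients
fit01[2,]                           # genotype row under 0/0/1 coding
fit12[2,]                           # same estimates under 1/1/2 coding
```

Only the intercept changes between the two codings.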


BONUS: Are the same SNPs significant or not?

assoc2 = which(pvalues2<?)
assoc2


BONUS: How did the effect sizes change?

plot(?,?)


### PART2: Simulating genotypes with LD.

Establish some important simulation parameters.

N = 1000 #number of individuals
M = 20   #number of non-causal SNPs
gs = matrix(0,nrow=N,ncol=M)


Simulate a GWAS data set. First, simulate the causal variant.

set.seed(42) #set random seed so we all get the same numbers
MAF = 0.5
gC = rbinom(N,1,MAF) #causal variant


Then, simulate the phenotypes given the causal variant.

beta = 0.3 #association of causal variant
pheno = gC*beta + rnorm(N) 


Generate 10 tag SNPs in tight LD with the causal SNP.

rho = 0.9
for(i in 1:10) {
  idx = rbinom(N,1,rho)
  gs[,i]=gC*idx+rbinom(N,1,MAF)*(1-idx)
  # test they have the right LD empirically
  cat( 'Observed LD = ', cor( gs[,i], gC ), '\n' )
  # Bonus: prove they have the right LD theoretically
}
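
For the bonus, a sketch of why this construction gives a correlation of $\rho$. The symbols are shorthand introduced here: $p$ for `MAF`, $I$ for `idx`, $H$ for the fresh `rbinom(N,1,MAF)` draw, and $g$ for the resulting tag SNP. The argument assumes $I$, $H$, and $g_C$ are mutually independent, as they are in the simulation.

$$
\begin{aligned}
g &= g_C I + H(1-I), \qquad I \sim \mathrm{Bernoulli}(\rho), \quad H,\, g_C \sim \mathrm{Bernoulli}(p) \\
E[g] &= \rho p + (1-\rho)p = p, \qquad \mathrm{Var}(g) = p(1-p) \ \text{ because } g \in \{0,1\} \\
E[g\, g_C] &= \rho\, E[g_C^2] + (1-\rho)\, E[H]\, E[g_C] = \rho p + (1-\rho)p^2 \\
\mathrm{Cov}(g, g_C) &= \rho p + (1-\rho)p^2 - p^2 = \rho\, p(1-p) \\
\mathrm{cor}(g, g_C) &= \frac{\rho\, p(1-p)}{\sqrt{p(1-p)}\,\sqrt{p(1-p)}} = \rho
\end{aligned}
$$

The empirical values printed by the loop should therefore scatter around 0.9, with sampling noise that shrinks as N grows.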


Do the same for 10 independent SNPs (rho=0).

rho = 0
for(i in 11:20) {
  ?
}


Run GWAS on the causal variant. Then run GWAS on the other variants. Keep track of the zscores only.

zsC = summary(lm(pheno~gC))$coef[2,3]
zs = sapply( 1:M, function(i) summary(lm(pheno~gs[,i]))$coef[2,3] )


Visualize the relationship between the distribution of z-scores for tag SNPs and uncorrelated SNPs, showing the z-score at the causal SNP as a vertical line.

par( mfrow=c(2,1) )
breaks = hist(c(0,zsC,zs),plot=FALSE)$breaks
hist(?,breaks=breaks, col=2, main='Tag SNPs')
abline(v=?)
hist(?,breaks=breaks, col=3, main='Other SNPs')
abline(v=?)