Lab 3: Exploratory Data Analysis

Prepare the data.

The data used in this exercise was generated by performing whole genome sequencing analysis of metastatic prostate tumors. This assay allows us to identify structural variations in tumor genomes such as duplications, inversions, and large deletions.

For more details on the analysis and results, see Quigley et al. Cell 2018.

Load data we’ll need

We counted the number of structural variations in each tumor sample. For our purposes, variants come in five types: inversions, deletions, duplications, insertions, and translocations. We also assessed each biopsy to identify biallelic inactivating mutations in each gene. Read in summaries of the number of structural variants present in each tumor and the presence/absence of mutations for a few selected genes.

working_folder = '.'
fn_sum = paste0( working_folder, '/SV_summary_table.txt' )
fn_mut = paste0( working_folder, '/WCDT_mutations.txt')
sv = read.table(fn_sum, sep='\t', header=TRUE, stringsAsFactors=FALSE)
mut = read.table(fn_mut, sep='\t', header=TRUE, stringsAsFactors=FALSE)

# if you don't have ggplot2 or reshape2 installed, un-comment and run the next two lines:
#install.packages("ggplot2")
#install.packages("reshape2")
library(ggplot2)
library(reshape2)

## Warning: package 'reshape2' was built under R version 4.0.2

Question 1: The variant showcase showdown

The sv matrix object reports counts of five types of structural variants (SV) in 101 patient tumor biopsies.

QUESTION 1.1: Plot a summary of the distributions for each type of SV individually. Do any of the distributions have outliers, defined as “values that exceed the whiskers of a boxplot”?

# ANSWER 1.1
# all of the distributions have outliers according to this definition.
layout( matrix( 1:6, 2, 3 ) )
par(mar=c(3,3,2,1))
boxplot( sv$deletions, ylim=c(0,400), main="Deletions" )
boxplot( sv$translocations, ylim=c(0,400), main="Translocations" )
boxplot( sv$inversions, ylim=c(0,400), main="Inversions" )
boxplot( sv$duplications, ylim=c(0,400), main="Duplications" )
boxplot( sv$insertions, ylim=c(0,400), main="Insertions" )

QUESTION 1.2 Which SV types are most frequent? Least frequent?

# ANSWER 1.2
# Deletions are the most frequent on average (either mean or median).
# Insertions are the least frequent.
summary( sv )

##    deletions     translocations     inversions      duplications   
##  Min.   : 25.0   Min.   : 19.00   Min.   : 12.00   Min.   : 10.00  
##  1st Qu.: 59.0   1st Qu.: 55.00   1st Qu.: 48.00   1st Qu.: 23.00  
##  Median : 80.0   Median : 83.00   Median : 69.00   Median : 31.00  
##  Mean   :100.9   Mean   : 97.53   Mean   : 79.62   Mean   : 50.49  
##  3rd Qu.:116.0   3rd Qu.:120.00   3rd Qu.: 91.00   3rd Qu.: 50.00  
##  Max.   :455.0   Max.   :410.00   Max.   :387.00   Max.   :393.00  
##    insertions    
##  Min.   : 0.000  
##  1st Qu.: 3.000  
##  Median : 6.000  
##  Mean   : 8.693  
##  3rd Qu.:12.000  
##  Max.   :54.000

Question 2: Is this normal?

Some analyses are contingent on distributional assumptions. For example, they may assume values are normally distributed. We can test the assumption that a sample is normally distributed with a QQ plot:

set.seed(124)
x=rnorm(100)
qqnorm( x )
qqline( x )

QUESTION 2.1 Generate a QQ plot to evaluate the assumption that deletions are normally distributed. Are deletions normally distributed?

# ANSWER 2.1
# Deletions are not normally distributed
qqnorm( sv$deletions)
qqline( sv$deletions )

Question 3: It’s not normal

Data that are not normally distributed can be coerced towards a normal distribution by transforming the data. How could you transform the distribution of deletions so that it’s closer to normal?

QUESTION 3.1 Replot the QQ plot after performing a log transformation to see what effect your transformation had. Did this transformation make the sample more similar to a normal distribution?

Note: leave the data as their original counts for the rest of the questions. Just transform for this question.

# ANSWER 3.1
# log transformation makes Deletions closer to normally distributed
# any log transformation (log2, log, etc ) will give the same result
qqnorm( log2( sv$deletions ) )
qqline( log2( sv$deletions ) )

Question 4: We belong together

Structural variations arise from DNA damage that is not repaired. By analyzing tumor genomes, we can figure out the kind of DNA damage that occurred by studying the patterns of SVs.

The null model is that there is no relationship between the number of any type of SV. Alternatively, there might be an association between some of the SV types, suggesting somthing in common about their etiology.

QUESTION 4.1 Calculate pairwise correlation between all five types of SV and plot the resulting correlation coefficients as a heat map. Try both Pearson correlation (the parametric default method in R) and non-parametric Spearman rank correlation, to see if it matters.

Hint for plotting a simple correlation heatmap, where the matrix X contains the values to plot in the heatmap:

XTidy = melt( X, value.name="val", varnames = c("x", "y") )
ggplot( XTidy, aes( x, y ) ) + 
     geom_tile( aes( fill = val ) ) + 
     geom_text( aes( label = val ) )

# ANSWER 4.1
# You should see correlation between deletions and inversions at rho = 0.58
cor_matrix = round( cor( sv, method="spearman" ), 2)
cor_matrix[ lower.tri( cor_matrix, diag = TRUE ) ] = 0
cor_matrix_tidy = melt( cor_matrix, 
                        value.name="rho", 
                        varnames = c("sv_1", "sv_2") )
ggplot( cor_matrix_tidy, aes(sv_1, sv_2) ) + 
        geom_tile( aes( fill = rho) ) + 
        geom_text( aes( label = rho ) ) +
        scale_fill_distiller(palette = "PuBu")

QUESTION 4.2 Which types of SV are most strongly correlated with each other? Without formal testing, do the correlation data support the null model, or is there reason to investigate an alternative model? Which pairs of SV are most likely to occur in similar counts?

# ANSWER 4.2
# inversions & duplications are most strongly correlated (rho = 0.66)
# These results suggest a non-random association between some of the SV pairs, 
# particularly inversions/deletions, inversions/duplications. 
# You could also credit deletions/duplications and translocations/insertions, 
# though we won't investigate these.

QUESTION 4.3 (BONUS): Why are the correlation values so different when comparing Spearman rank correlation to Pearson correlation? What might be driving these differences? Does it matter?

# ANSWER 4.3
# The fact that these data violate the assumptions of the Pearson test matters 
# a lot; there are outliers that obscure the trends. It does turn out to matter; 
# you don't need an exact answer to this, but the answer should include the 
# observation that the correlation estimates are very deflated in the Pearson 
# correlation analysis.

cor_matrix = round( cor( sv ), 2)
cor_matrix[ lower.tri( cor_matrix, diag = TRUE ) ] = 0
cor_matrix_tidy = melt( cor_matrix, 
                        value.name="rho", 
                        varnames = c("sv_1", "sv_2") )
ggplot( cor_matrix_tidy, aes(sv_1, sv_2) ) + 
        geom_tile( aes( fill = rho) ) + 
        geom_text( aes( label = rho ) ) +
        scale_fill_distiller(palette = "PuBu")

Question 5

Let’s drill down on two contrasting pairs of SVs:

duplications and inversions
deletions and inversions

QUESTION 5.1 Create two scatter plots: duplications vs inversions, and deletions vs. inversions. Based on question 4, we’d expect the counts of these SVs to be somewhat correlated with each other. Does this hold up? Are there samples that are outliers from the linear trend in these comparisons?

Samples that deviate from a relationship like this might be of particular interest.

# ANSWER 5.1
# The correlations do indeed hold up; looking at the data suggests the values are
# indeed correlated and not a weird feature of the data. The are however outliers 
# in both conditions that don't adhere to the linear relationship. In these cases,
# something special is going on.

layout(matrix(1:2,1,2))
plot( sv$duplications, sv$inversions)
plot( sv$deletions, sv$inversions)

sv[sv$duplications>300,]

##            deletions translocations inversions duplications insertions
## DTB-063-BL        66            272         87          393         12
## DTB-183-BL        68            113        102          391         20
## DTB-214-BL        65             80         48          360          8

sv[sv$deletions>200 & sv$inversions<200,]

##             deletions translocations inversions duplications insertions
## DTB-003-BL        320            146         84           22         10
## DTB-037-BL        251            101         69           31          9
## DTB-061-BL        262             95         60           44         18
## DTB-097-PRO       455             72         47           24          1
## DTB-129-BL        244             83         43           30          1
## DTB-173-BL        281            108         39           26         15
## DTB-176-BL        238            150        105           53         22

Question 6: Why so many duplicates?

Let’s drill down on the duplications. Look at your plot comparing the number of duplications to the number of inversions. Note that there are three samples that have far more duplications than any other sample, and four samples that have both a large number of inversions and a large (though not the highest) number of duplications. Looking at this plot, we might wonder if there is something special about the three samples that have a lot of duplications but not a lot of inversions.

The mut matrix that you loaded contains one row for each sample and one column for each of 15 genes, with a TRUE value if there is a biallelic inactivation of that gene in that sample.

QUESTION 6.1 We’ll test the hypothesis that tumors with a particular gene inactivation acquire a lot more duplications than tumors lacking that inactivation. Test this hypothesis by performing a Wilcoxon test comparing the number of duplicates in tumors with vs. without mutation in each gene. Create a barplot of -log10( p ) and nominate the strongest hit as worthy of further investigation.

# ANSWER 6.1
# The gene with the strongest association is CDK12. Bialleleic inactivation of 
# CDK12 produces tandem duplications, though no one yet knows exactly why.
pvals = rep(NA, dim(mut)[2])
for( i in 1:dim(mut)[2]){
    pvals[i] = wilcox.test( sv$duplications ~ mut[,i] )$p.value
}
barplot( -1*log10( pvals), names.arg = dimnames(mut)[[2]], las=2 )

Question 7: Why so many deletions?

Now let’s take a close look at deletions. Look at the scatter plot you made comparing inversions vs deletions. There tends to be a linear relationship, even for samples with large numbers of inversions, but there is a set of samples that have lots of deletions but not a lot of inversions.

QUESTION 7.1 You might hypothesize that tumors that have inactivated a DNA repair gene harbor a lot more deletions than normal tumors. Test this hypothesis by performing a Wilcox test comparing the number of deletions in tumors with vs. without mutation in each gene. This time, instead of a barplot, create a volcano plot to compare the difference in means (as the effect) vs. the -log10(p) as the statistical strength. Does this analysis nominate any gene as associated with large numbers of deletions?

# ANSWER 7.1
# Biallelic inactivation of BRCA2 prevents cells from performing homologous
# recombination double strand break repair. This produces a large number of 
# deletions because the cells are forced to rely on alternative non-homologous 
# end joining, which is non-conservative.

pvals = rep(NA, dim(mut)[2])
effects = rep(NA, dim(mut)[2])
for( i in 1:dim(mut)[2]){
    wt = wilcox.test( sv$deletions ~ mut[,i] )
    tt = t.test( sv$deletions ~ mut[,i] )
    pvals[i] = wt$p.value
    effects[i] = tt$estimate[2]-tt$estimate[1]  
}
plot( effects, -1*log10( pvals ), las=1, xlim=c(-250,250), ylim=c(0,7),
      xlab="change in deletions associated with mutation")
text( effects, -1*log10( pvals )+0.25, dimnames(mut)[[2]] )

QUESTION 7.2 (BONUS) Re-run the test you performed in problem 7.1, using a t test rather than a Wilcoxon test. Explain why these tests have different performance and produce different results. Discuss the difference between ranking candidates based on P value and based on a combination of P value and effect size.

# ANSWER 7.2
# This method still nominates BRCA2 as the strongest postive association, but it 
# also nominates PRKDC and CDK12 inactivation as being associated with 
# _fewer_ mutations. Again, this is likely a result of the relatively small 
# sample sizes with non-normal distributions. Ranking candidates based only on the
# P value would not take account of the direction of association (reversed for 
# PRKDC and CDK12) or the estimated effect size (much smaller for those genes 
# than BRCA2).
pvals = rep(NA, dim(mut)[2])
effects = rep(NA, dim(mut)[2])
for( i in 1:dim(mut)[2]){
    tt = t.test( sv$deletions ~ mut[,i] )
    pvals[i] = tt$p.value
    effects[i] = tt$estimate[2]-tt$estimate[1]  
}
plot( effects, -1*log10( pvals ), las=1, xlim=c(-250,250), ylim=c(0,7),
      xlab="change in deletions associated with mutation")
text( effects, -1*log10( pvals )+0.25, dimnames(mut)[[2]] )