Import needed packages and set a random number seed
#load packages
library(regioneR)
library(ggplot2) # for plotting
library(dplyr)
set.seed(10)
Read in BED-formatted region files: all tested regions and two sets of positives. Note: these are in hg19 human genome assembly coordinates
all <- toGRanges(read.table("all.bed",sep="\t"))
hits1 <- toGRanges(read.table("hits1.bed",sep="\t"))
hits2 <- toGRanges(read.table("hits2.bed",sep="\t"))
Q1. How many regions are in hits1? How many in hits2?
# count regions in each set (length() on the GRanges objects gives the same result)
n_regions1 <- nrow(toDataframe(hits1))
n_regions2 <- nrow(toDataframe(hits2))
There are 140 regions in hits1 and 322 regions in hits2.
Q2. Are hits1 and hits2 proper subsets of all the tested regions? Check how many of each set overlaps a region in all.
numOverlaps(hits1, all, count.once=TRUE)
## [1] 140
numOverlaps(hits2, all, count.once = TRUE)
## [1] 322
numOverlaps(hits1, all)
## [1] 141
numOverlaps(hits2, all)
## [1] 337
Because every region in hits1 and hits2 overlaps with at least one region in all, hits1 and hits2 can be considered proper subsets of all. Because the number of unique overlaps (count.once=TRUE) is less than the total number of overlaps for both hits1 and hits2, some hits regions must overlap more than one region in all, which suggests that some regions appear more than once (or overlap each other) in these datasets.
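To see where the extra overlaps come from, a minimal sketch (using the countOverlaps generic from GenomicRanges, which is attached with regioneR) tabulates how many regions in all each hit touches:
# per-region overlap counts; any value > 1 explains the gap between
# total and unique overlaps (e.g., 141 vs 140 for hits1)
table(countOverlaps(hits1, all))
table(countOverlaps(hits2, all))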
The next few questions explore the overlap of genomic regions in hits1 and hits2.
Q3. How many regions overlap? How many regions are exactly identical?
# How many regions overlap?
numOverlaps(hits1, hits2)
# how many regions are exactly identical?
nrow(toDataframe(commonRegions(hits1, hits2)))
#alternative way to check exactly identical regions
#hits1[hits1%in%hits2,]
#hits2[hits2%in%hits1,]
There are 6 overlapping regions between hits1 and hits2. There are also 6 identical regions between hits1 and hits2, meaning that all of the overlapping regions are identical.
Q4. Generate a set of random genomic regions of the same size
as hits1. Match these to the mean and sd of the genomic length of the
hits1 regions.
# install masked version of hg19
# BiocManager::install("BSgenome.Hsapiens.UCSC.hg19.masked")
# generate random regions same size as hits1
hits1_widths <- width(hits1)
hits1_randomized <- createRandomRegions(nregions=length(hits1),
length.mean=mean(hits1_widths),
length.sd=sd(hits1_widths), # equivalent to sqrt(var(hits1_widths))
mask=NULL) # genome defaults to "hg19"
numOverlaps(hits1_randomized, hits2)
## [1] 0
There are 6 overlapping regions between hits1 and hits2, and 0 overlaps between hits2 and the randomized hits1. So the random regions overlap hits2 less.
numOverlaps(hits1_randomized, all)
## [1] 0
The random regions overlap all 0 times.
perm_test <- permTest(A=hits1, ntimes = 100,
randomize.function = randomizeRegions,
evaluate.function = numOverlaps, count.once = TRUE,
B = hits2)
summary(perm_test)
## Permutation tests: 1
## Significant permutation tests: 1
## Iterations: 100
## Randomization Function: randomizeRegions
## Tests Results:
## pvalue zscore test
## numOverlaps 0.00990099 17.0677 greater
Note: This can be done a number of different ways, for example, using overlapPermTest or your own for-loop instead of permTest. It is also equivalent to use createRandomRegions instead of randomizeRegions.
Another note: The names of the functions permTest and overlapPermTest are a bit misleading, because they can be used to generate random draws (i.e., bootstrapping) in addition to permuting. The argument randomize.function controls the type of resampling.
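For example, a minimal sketch of the overlapPermTest shortcut, assuming extra arguments such as count.once are passed through to numOverlaps:
# equivalent test via the wrapper that bundles randomizeRegions + numOverlaps
pt_alt <- overlapPermTest(A=hits1, B=hits2, ntimes=100, count.once=TRUE)
summary(pt_alt)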
Use the set of overlaps with random regions to test the
null hypothesis that hits2 overlaps hits1 more than expected compared to
totally random regions.
The p-value from the permutation test is 0.01, which is lower than the significance threshold of 0.05. Therefore, we can reject the null hypothesis that hits1 and hits2 overlap the same amount as what would be expected between hits2 and random regions.
What is the smallest p-value you could have gotten?
This will depend on how many permutations were performed. The lowest possible p-value is 1 divided by the number of permutations performed plus 1: 1/(#permutations + 1). In this case, we see that our p-value is slightly less than 0.01, which is the lowest possible p-value attainable with 100 permutations.
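As a quick check of that arithmetic:
# smallest attainable p-value for a given number of permutations
min_pval <- function(ntimes) 1 / (ntimes + 1)
min_pval(100)
## [1] 0.00990099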
How do the results change with the number of resamples? Random seed?
# try a number of permutation tests between 50 and 1000 permutations
p_vals <- c()
zs <- c()
N <- seq(50, 1000, 50)
for (i in N){
perm_test <- permTest(A=hits1, ntimes = i,
randomize.function = randomizeRegions,
evaluate.function = numOverlaps, count.once = TRUE,
B = hits2)
p_vals <- c(p_vals, perm_test$numOverlaps["pval"][[1]])
zs <- c(zs, perm_test$numOverlaps["zscore"][[1]])
}
## [1] "Note: The minimum p-value with only 50 permutations is 0.0196078431372549. You should consider increasing the number of permutations."
# save results in a dataframe for easy access
test_results <- data.frame(N, p_vals, zs)
# plot the p-values by number of permutations
ggplot(test_results) +
geom_line(aes(N, log(p_vals)), color="red") +
xlab("Number of Permuations") +
ylab("Log p-value of test") +
theme_classic()
# plot the z-scores by number of permutations
ggplot(test_results) +
geom_line(aes(N, zs), color="blue") +
xlab("Number of Permuations") +
ylab("Z-score of test") +
theme_classic()
We see that the p-values continue decreasing as the number of permutations goes up. This is because we are effectively reducing the minimum possible p-value, which we are reaching with most of the tests we perform here. The z-score does not vary significantly or with any trend except for the one from the 50-permutation test, though even this deviation is likely just due to the low number of permutations and random chance.
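To make the floor explicit, a variant of the plot above overlays the theoretical minimum 1/(N + 1) (a sketch, reusing the test_results dataframe):
# dashed line marks the smallest attainable p-value at each N
ggplot(test_results) +
geom_line(aes(N, log(p_vals)), color="red") +
geom_line(aes(N, log(1/(N + 1))), color="black", linetype="dashed") +
xlab("Number of Permutations") +
ylab("Log p-value of test") +
theme_classic()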
# try a number of random seeds
p_vals <- c()
zs <- c()
seeds <- seq(1, 10)
for (i in seeds){
set.seed(i)
# use the same N value as our original test
perm_test <- permTest(A=hits1, ntimes = 100,
randomize.function = randomizeRegions,
evaluate.function = numOverlaps, count.once = TRUE,
B = hits2)
p_vals <- c(p_vals, perm_test$numOverlaps["pval"][[1]])
zs <- c(zs, perm_test$numOverlaps["zscore"][[1]])
}
# save results in a dataframe for easy access
test_results <- data.frame(seeds, p_vals, zs)
# plot the p-values by random seed
ggplot(test_results) +
geom_line(aes(seeds, log(p_vals)), color="red") +
xlab("Random seed") +
ylab("Log p-value of test") +
theme_classic()
# plot the z-scores by random seed
ggplot(test_results) +
geom_line(aes(seeds, zs), color="blue") +
xlab("Random seed") +
ylab("Z-score of test") +
theme_classic()
The p-values always stay at the most significant value possible. The z-scores vary slightly with different numbers of resamples or different seeds, but not in a way that seems systematic. That is, all of the z-scores are quite high and indicate significant deviation from the null.
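A quick numeric summary of the seed experiment supports this:
# spread of z-scores across seeds; a small spread relative to the mean
# indicates no systematic seed effect
summary(test_results$zs)
sd(test_results$zs)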
Q5. Repeat Q4 switching the roles of hits1 and hits2. Are conclusions similar?
set.seed(10)
perm_test <- permTest(A=hits2, ntimes = 100,
randomize.function = randomizeRegions,
evaluate.function = numOverlaps, count.once = TRUE,
B = hits1)
perm_test$numOverlaps["pval"]
## $pval
## [1] 0.00990099
perm_test$numOverlaps["zscore"]
## $zscore
## [1] 19.2531
Yes, the conclusion should be the same given the low p-value and
high z-score.
Q6. Create a random bootstrap sample of regions from all
tested regions.
random_bootstrap <- resampleRegions(hits1, all)
numOverlaps(random_bootstrap, hits2)
## [1] 24
Do these random regions overlap hits2 more or less than
hits1 does?
More. We see that hits1 and hits2 overlap 6 times, and the bootstrap sample and hits2 overlap 24 times.
How does this test differ from the one in Q4? Look at the
z-score and p-value.
# run a full bootstrap test: resample region sets from the universe of all tested regions
boot_test <- permTest(A=hits1, B=hits2,
ntimes=100,
randomize.function=resampleRegions,
universe=all,
evaluate.function=numOverlaps)
summary(boot_test)
## Permutation tests: 1
## Significant permutation tests: 1
## Iterations: 100
## Randomization Function: resampleRegions
## Tests Results:
## pvalue zscore test
## numOverlaps 0.00990099 -4.0319 less
Now we are sampling only from our known regions all, not the entirety of the hg19 human genome assembly. We are also not matching by mean region length. Our “universe” is much smaller and well-defined. This is reflected in the dramatically reduced z-score compared to the permutation test using hg19 above, and the test direction flips from “greater” to “less”. This suggests all is more similar to the hits than the whole genome is. In other words, the elements in all are not random regions of hg19.
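One way to see the two nulls side by side, assuming permTest results expose their observed and permuted statistics (as regioneR's permTestResults objects do):
# the observed statistic is the same 6 overlaps in both tests
boot_test$numOverlaps$observed
# but resampling from all yields far more overlaps by chance...
mean(boot_test$numOverlaps$permuted)
# ...than genome-wide random regions do (perm_test here is the Q5 object)
mean(perm_test$numOverlaps$permuted)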
Q7. Repeat Q6 switching the role of hits1 and hits2. Are conclusions similar?
# repeat the bootstrap test with the roles of hits1 and hits2 swapped
boot_test2 <- permTest(A=hits2, B=hits1,
ntimes=100,
randomize.function=resampleRegions,
universe=all,
evaluate.function=numOverlaps)
## Warning in .Seqinfo.mergexy(x, y): Each of the 2 combined objects has sequence levels not in the other:
## - in 'x': chr21, chrX
## - in 'y': chr19
## Make sure to always combine/compare objects based on the same reference
## genome (use suppressWarnings() to suppress this warning).
summary(boot_test2)
## Permutation tests: 1
## Significant permutation tests: 1
## Iterations: 100
## Randomization Function: resampleRegions
## Tests Results:
## pvalue zscore test
## numOverlaps 0.00990099 -4.3152 less
Yes, similar but not exact. We get a slightly lower z-score, perhaps because hits2 is a larger set.
Q8. Which null distribution would you use in your own
research and why?
It would depend on the question. If you were interested in how often these two separate genome-wide assays returned the same regions, you would use the permutation test with hg19. You would probably want to spend time matching it better as a null (masking repetitive regions, etc.); a sketch of that follows below. If you were interested in whether hits1 and hits2 are different from other regions in all, you would use bootstrap sampling from all. Or you could repeat the permutation test but only generate random regions from the possible bounds of all (if you knew how all was defined, which you do not know here).
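As a sketch of the better-matched genome-wide null, assuming the masked BSgenome installed in Q4 is available, regioneR's getGenomeAndMask can supply a repeat mask for the randomization:
# permutation test with repetitive regions masked out of the null
gm <- getGenomeAndMask("hg19")
perm_test_masked <- permTest(A=hits1, B=hits2, ntimes=100,
randomize.function=randomizeRegions,
genome=gm$genome, mask=gm$mask,
evaluate.function=numOverlaps, count.once=TRUE)
summary(perm_test_masked)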
The next few questions involve downloading genomics data. You can choose sets of regions, e.g., gene annotation, ChIPseq, RNAseq, ATACseq, GWAS SNPs. The code below shows examples of how to look for associations, but there are many ways you could do this. Most of the output of the code is suppressed because it takes a long time to run.
Q9. Using data you download, can you infer what function was tested in the assay that discovered hits1 and hits2? Choose data sets that will be informative about candidate functions. Compute overlaps or mean values of the downloaded data for the union of hits1 and hits2.
First, look at the sizes of the regions.
a <- data.frame("length" = c(width(all), width(hits1), width(hits2)),
"group" = c(rep("ALL", length(all)), rep("HITS1", length(hits1)), rep("HITS2", length(hits2)))
)
#table(a$group)
ggplot(a, aes(x=length, color=group)) +
geom_histogram(aes(y=..density..), fill="white", alpha=0.5, position="identity", bins = 40) +
theme_bw() +
geom_vline(xintercept = mean(a[a$group == "ALL", "length"]), color="red") +
geom_vline(xintercept = mean(a[a$group == "HITS1", "length"]), color="darkgreen") +
geom_vline(xintercept = mean(a[a$group == "HITS2", "length"]), color="blue") +
xlab("Region Length") + ylab("Frequency") +
ggtitle("Mean region length is ~2kb for ALL and HITS2, ~3kb for HITS1")
Some of the genomic features you may want to explore are RNA,
narrow peak ChIPseq (e.g., transcription factor binding), broad peak
ChIPseq (e.g., repressive histone marks), and annotations of nearby
genes.
library(ChIPpeakAnno) # for peak annotation and GO enrichment
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
hits <- c(hits1, hits2) # union of hits1 and hits2
aCR<-ChIPpeakAnno::assignChromosomeRegion(all, nucleotideLevel=FALSE,
precedence=c("Promoters", "immediateDownstream",
"fiveUTRs", "threeUTRs",
"Exons", "Introns"),
TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene)
aCR_hits<-ChIPpeakAnno::assignChromosomeRegion(hits, nucleotideLevel=FALSE,
precedence=c("Promoters", "immediateDownstream",
"fiveUTRs", "threeUTRs",
"Exons", "Introns"),
TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene)
par(mfrow=c(2,1))
barplot(aCR$percentage, main="All: majority of regions are intergenic/intronic == non-coding")
barplot(aCR_hits$percentage, main="Same pattern for HITS")
Our hits are mostly in intergenic regions. Next, examine how close
they are to gene transcription start sites (TSS).
annoData <- suppressMessages(toGRanges(TxDb.Hsapiens.UCSC.hg19.knownGene, feature="gene"))
annotatedPeak_all <- annotatePeakInBatch(all, AnnotationData = annoData)
annotatedPeak_all <- addGeneIDs(annotatedPeak_all, orgAnn="org.Hs.eg.db", IDs2Add=c("symbol"), feature_id_type="entrez_id")
annotatedPeak_hits1 <- annotatePeakInBatch(hits1, AnnotationData = annoData)
annotatedPeak_hits1 <- addGeneIDs(annotatedPeak_hits1, orgAnn="org.Hs.eg.db", IDs2Add=c("symbol"), feature_id_type="entrez_id")
annotatedPeak_hits2 <- annotatePeakInBatch(hits2, AnnotationData = annoData)
annotatedPeak_hits2 <- addGeneIDs(annotatedPeak_hits2, orgAnn="org.Hs.eg.db", IDs2Add=c("symbol"), feature_id_type="entrez_id")
mean(annotatedPeak_all$distancetoFeature)
table(annotatedPeak_all$insideFeature)
Only 13 overlap TSS. Most are further upstream/downstream from
the annotated elements. At this point you might guess that they are some
kind of regulatory element. Maybe a gene ontology search will shed some
light.
#Define some functions
quick_GO <- function(annotatedPeak){
over <- getEnrichedGO(annotatedPeak, orgAnn="org.Hs.eg.db",
maxP=0.01, minGOterm=10,
multiAdjMethod="BH",
condense=FALSE, feature_id_type = "entrez_id")
final <- rbind(unique(over[["bp"]][, c(1,2,4,5)]),
unique(over[["mf"]][, c(1,2,4,5)]),
unique(over[["cc"]][, c(1,2,4,5)]))
#note: this doesn't take into account background, gene length, etc.;
#in theory, several things should be corrected for here
final$log10pval <- -log10(final$pvalue)
final <- final %>% arrange(pvalue)
return(final)
}
GOplot <- function(df, my.title, my.n=20, my.col="#08589E"){
df <- head(df, n=my.n)
df <- df %>% arrange(log10pval)
df$go.term <- factor(df$go.term, levels = df$go.term) # so it holds same order as df
g2 <- ggplot(data=df, aes(x=go.term, y=log10pval, fill="Gene")) +
geom_bar(stat="identity") +
ylab("-log10(padj)") +
xlab("") +
ggtitle(my.title) +
coord_flip() +
scale_fill_manual(values=c(my.col)) +
geom_text(aes(label = Ontology), hjust = 1.5, color = "white", size = 3) +
theme(axis.text.x = element_text(size=10), legend.position="none") +
labs(fill='Gene Ontology Grouping')
return(g2)
}
#Gene ontology on all regions given
go_all <- quick_GO(annotatedPeak_all)
go_hits1 <- quick_GO(annotatedPeak_hits1)
go_hits2 <- quick_GO(annotatedPeak_hits2)
GOplot(go_all, "ALL: go enrichment")
GOplot(go_hits1, "HITS1: go enrichment", my.col = "salmon")
GOplot(go_hits2, "HITS2: go enrichment", my.col = "darkgreen")
This indicates some different terms for hits1 versus hits2. Maybe they function in different tissues.
Guess what type of genomic element these hits are (i.e., what assay was performed).
A reasonable guess: This is a ChIP-seq pull-down of a transcription factor in heart and some brain-type cell.
Answer: This is a good guess. The true answer is that these regions were tested for enhancer activity using transient transgenic mouse assays, downloaded from the VISTA Enhancer Browser. The file all includes all tested regions at the time of download, which encompasses negatives (no enhancer activity) and regions positive for activity in different tissues. The elements in all were prioritized for the assay based on evolutionary conservation and epigenetic marks. The regions in hits1 are limb enhancers (limb does include muscle, and hence the heart GO terms), whereas the regions in hits2 are brain enhancers.
BONUS Q10. Do you think hits1 and hits2 function in the same cell type?
- Build on your analysis in Q9 by separately testing overlaps with hits1 and hits2. Choose datasets that are from several different cell types.
The analyses above used the cell-type gene lists provided by GSEA and calculated the significance that genes (closest to each region) in hits1 or hits2 overlap these lists more than expected by random chance. Given these results and Q9, we would conclude that hits1 is from an assay on heart cells (e.g., cardiomyocytes) or some other muscle cells, and hits2 is likely from an assay on brain/neuronal cells. So, these do not appear to be the same cell type.
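A minimal sketch of one such test; the gene-set file name is a hypothetical placeholder, and the enrichment is a simple hypergeometric test (base R's phyper) on nearest-gene symbols:
# hypothetical cell-type gene set (placeholder file name)
heart_genes <- readLines("GSEA_heart_gene_set.txt")
hits1_genes <- unique(annotatedPeak_hits1$symbol)
universe_genes <- unique(annotatedPeak_all$symbol) # background: all tested regions
overlap <- sum(hits1_genes %in% heart_genes)
# upper-tail hypergeometric test: is the gene set enriched near hits1?
phyper(overlap - 1, sum(universe_genes %in% heart_genes),
sum(!universe_genes %in% heart_genes), length(hits1_genes),
lower.tail=FALSE)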
BONUS Q11: Try matching the random regions more closely to regions in hits1
- On what variables will you match them? e.g., spacing, chromosome, GC-content, distance to nearest gene, masking
- How does matching affect the z-score and p-value?
You could leverage another R package (gkmSVM, a gapped k-mer SVM classifier) that was built to generate a matched set of regions by GC content (tolerates 0.02 deviation from the input), repeat content (0.02 deviation), and region length (again, 0.02 deviation), and that outputs a list of the same length (xfold=1). To save time during rendering, you can generate a matched list larger than hits1 (xfold=5, so ~700 regions) and then randomly sample from that as the “universe”, as sketched below. Otherwise it would be necessary to run the SVM on every iteration, which is a lot of computation.
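A sketch of that workflow, assuming the gkmSVM package's genNullSeqs function (parameter names per its documentation; file names are placeholders):
library(gkmSVM)
# generate GC-, repeat-, and length-matched regions for hits1 (~5x as many)
genNullSeqs("hits1.bed", genomeVersion="hg19", xfold=5,
outputBedFN="matched_null.bed")
matched <- toGRanges(read.table("matched_null.bed", sep="\t"))
# resample from the matched set as the universe, as in Q6
matched_test <- permTest(A=hits1, B=hits2, ntimes=100,
randomize.function=resampleRegions, universe=matched,
evaluate.function=numOverlaps, count.once=TRUE)
summary(matched_test)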
The matched hits actually have higher z-scores, which may seem surprising at first. This reflects the fact that regions tested by VISTA for enhancer activity are not random genomic regions; they were selected based on evolutionary conservation and epigenetic marks, and hence they have a different GC content than the genome as a whole. Nonetheless, this analysis reinforces that having 6 regions overlap when considering the whole genome is very unlikely to occur by chance, even when you control for other variables.