Import needed packages and set a random number seed
#load packages
library(regioneR)
library(ggplot2) # for plotting
library(dplyr)
set.seed(10)
Read in BED-formatted region files: all tested regions and two sets of positives. Note: these are in hg19 human genome assembly coordinates
all <- toGRanges(read.table("all.bed",sep="\t"))
hits1 <- toGRanges(read.table("hits1.bed",sep="\t"))
hits2 <- toGRanges(read.table("hits2.bed",sep="\t"))
Q1. How many regions are in hits1? How many in hits2?
# count regions in each set (length() on the GRanges objects gives the same result)
n_regions1 <- nrow(toDataframe(hits1))
n_regions2 <- nrow(toDataframe(hits2))
There are 140 regions in hits1 and 322 regions in hits2.
Q2. Are hits1 and hits2 proper subsets of all the tested regions? Check how many of each set overlaps a region in all.
numOverlaps(hits1, all, count.once=TRUE)
## [1] 140
numOverlaps(hits2, all, count.once = TRUE)
## [1] 322
numOverlaps(hits1, all)
## [1] 141
numOverlaps(hits2, all)
## [1] 337
Because every region in hits1 and hits2 overlaps with at least one region in all, hits1 and hits2 can be considered proper subsets of all. Because the number of unique overlaps (count.once=TRUE) is less than the total number of overlaps for both hits1 and hits2, some hits regions must overlap more than one region in all, which suggests that some regions appear more than once (or overlap each other) in these datasets.
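To see where the extra overlaps come from, a minimal sketch (using the countOverlaps generic from GenomicRanges, which is attached with regioneR) tabulates how many regions in all each hit touches:
# per-region overlap counts; any value > 1 explains the gap between
# total and unique overlaps (e.g., 141 vs 140 for hits1)
table(countOverlaps(hits1, all))
table(countOverlaps(hits2, all))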
The next few questions explore the overlap of genomic regions in hits1 and hits2.
Q3. How many regions overlap? How many regions are exactly identical?
# How many regions overlap?
numOverlaps(hits1, hits2)
# how many regions are exactly identical?
nrow(toDataframe(commonRegions(hits1, hits2)))
#alternative way to check exactly identical regions
#hits1[hits1%in%hits2,]
#hits2[hits2%in%hits1,]
There are 6 overlapping regions between hits1 and hits2. There are also 6 identical regions between hits1 and hits2, meaning that all of the overlapping regions are identical.
Q4. Generate a set of random genomic regions of the same size
as hits1. Match these to the mean and sd of the genomic length of the
hits1 regions.
# install masked version of hg19
# BiocManager::install("BSgenome.Hsapiens.UCSC.hg19.masked")
# generate random regions same size as hits1
hits1_widths <- width(hits1)
hits1_randomized <- createRandomRegions(nregions=length(hits1),
length.mean=mean(hits1_widths),
length.sd=sd(hits1_widths), # equivalent to sqrt(var(hits1_widths))
mask=NULL) # genome defaults to "hg19"
numOverlaps(hits1_randomized, hits2)
## [1] 0
There are 6 overlapping regions between hits1 and hits2, and 0 overlaps between hits2 and the randomized hits1. So the random regions overlap hits2 less.
numOverlaps(hits1_randomized, all)
## [1] 0
The random regions overlap all 0 times.
perm_test <- permTest(A=hits1, ntimes = 100,
randomize.function = randomizeRegions,
evaluate.function = numOverlaps, count.once = TRUE,
B = hits2)
summary(perm_test)
## Permutation tests: 1
## Significant permutation tests: 1
## Iterations: 100
## Randomization Function: randomizeRegions
## Tests Results:
## pvalue zscore test
## numOverlaps 0.00990099 17.0677 greater
Note: This can be done a number of different ways, for example, using overlapPermTest or your own for-loop instead of permTest. It is also equivalent to use createRandomRegions instead of randomizeRegions.
Another note: The names of the functions permTest and overlapPermTest are a bit misleading, because they can be used to generate random draws (i.e., bootstrapping) in addition to permuting. The argument randomize.function controls the type of resampling.
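For example, a minimal sketch of the overlapPermTest shortcut, assuming extra arguments such as count.once are passed through to numOverlaps:
# equivalent test via the wrapper that bundles randomizeRegions + numOverlaps
pt_alt <- overlapPermTest(A=hits1, B=hits2, ntimes=100, count.once=TRUE)
summary(pt_alt)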
Use the set of overlaps with random regions to test the
null hypothesis that hits2 overlaps hits1 more than expected compared to
totally random regions.
The p-value from the permutation test is 0.01, which is lower than the significance threshold of 0.05. Therefore, we can reject the null hypothesis that hits1 and hits2 overlap the same amount as what would be expected between hits2 and random regions.
What is the smallest p-value you could have gotten?
This will depend on how many permutations were performed. The lowest possible p-value is 1 divided by the number of permutations performed plus 1: 1/(#permutations + 1). In this case, we see that our p-value is slightly less than 0.01, which is the lowest possible p-value attainable with 100 permutations.
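As a quick check of that arithmetic:
# smallest attainable p-value for a given number of permutations
min_pval <- function(ntimes) 1 / (ntimes + 1)
min_pval(100)
## [1] 0.00990099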
How do the results change with the number of resamples? Random seed?
# try a number of permutation tests between 50 and 1000 permutations
p_vals <- c()
zs <- c()
N <- seq(50, 1000, 50)
for (i in N){
perm_test <- permTest(A=hits1, ntimes = i,
randomize.function = randomizeRegions,
evaluate.function = numOverlaps, count.once = TRUE,
B = hits2)
p_vals <- c(p_vals, perm_test$numOverlaps["pval"][[1]])
zs <- c(zs, perm_test$numOverlaps["zscore"][[1]])
}
## [1] "Note: The minimum p-value with only 50 permutations is 0.0196078431372549. You should consider increasing the number of permutations."
# save results in a dataframe for easy access
test_results <- data.frame(N, p_vals, zs)
# plot the p-values by number of permutations
ggplot(test_results) +
geom_line(aes(N, log(p_vals)), color="red") +
xlab("Number of Permuations") +
ylab("Log p-value of test") +
theme_classic()
# plot the z-scores by number of permutations
ggplot(test_results) +
geom_line(aes(N, zs), color="blue") +
xlab("Number of Permuations") +
ylab("Z-score of test") +
theme_classic()
We see that the p-values continue decreasing as the number of permutations goes up. This is because we are effectively reducing the minimum possible p-value, which we are reaching with most of the tests we perform here. The z-score does not vary significantly or with any trend except for the one from the 50-permutation test, though even this deviation is likely just due to the low number of permutations and random chance.
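To make the floor explicit, a variant of the plot above overlays the theoretical minimum 1/(N + 1) (a sketch, reusing the test_results dataframe):
# dashed line marks the smallest attainable p-value at each N
ggplot(test_results) +
geom_line(aes(N, log(p_vals)), color="red") +
geom_line(aes(N, log(1/(N + 1))), color="black", linetype="dashed") +
xlab("Number of Permutations") +
ylab("Log p-value of test") +
theme_classic()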
# try a number of random seeds
p_vals <- c()
zs <- c()
seeds <- seq(1, 10)
for (i in seeds){
set.seed(i)
# use the same N value as our original test
perm_test <- permTest(A=hits1, ntimes = 100,
randomize.function = randomizeRegions,
evaluate.function = numOverlaps, count.once = TRUE,
B = hits2)
p_vals <- c(p_vals, perm_test$numOverlaps["pval"][[1]])
zs <- c(zs, perm_test$numOverlaps["zscore"][[1]])
}
# save results in a dataframe for easy access
test_results <- data.frame(seeds, p_vals, zs)
# plot the p-values by random seed
ggplot(test_results) +
geom_line(aes(seeds, log(p_vals)), color="red") +
xlab("Random seed") +
ylab("Log p-value of test") +
theme_classic()
# plot the z-scores by random seed
ggplot(test_results) +
geom_line(aes(seeds, zs), color="blue") +
xlab("Random seed") +
ylab("Z-score of test") +
theme_classic()
The p-values always stay at the most significant value possible. The z-scores vary slightly with different numbers of resamples or different seeds, but not in a way that seems systematic. That is, all of the z-scores are quite high and indicate significant deviation from the null.
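A quick numeric summary of the seed experiment supports this:
# spread of z-scores across seeds; a small spread relative to the mean
# indicates no systematic seed effect
summary(test_results$zs)
sd(test_results$zs)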
Q5. Repeat Q4 switching the roles of hits1 and hits2. Are conclusions similar?
set.seed(10)
perm_test <- permTest(A=hits2, ntimes = 100,
randomize.function = randomizeRegions,
evaluate.function = numOverlaps, count.once = TRUE,
B = hits1)
perm_test$numOverlaps["pval"]
## $pval
## [1] 0.00990099
perm_test$numOverlaps["zscore"]
## $zscore
## [1] 19.2531
Yes, the conclusion should be the same given the low p-value and
high z-score.
Q6. Create a random bootstrap sample of regions from all
tested regions.
random_bootstrap <- resampleRegions(hits1, all)
numOverlaps(random_bootstrap, hits2)
## [1] 24
Do these random regions overlap hits2 more or less than
hits1 does?
More. We see that hits1 and hits2 overlap 6 times, and the bootstrap sample and hits2 overlap 24 times.
How does this test differ from the one in Q4? Look at the
z-score and p-value.
# run a full bootstrap test: resample region sets from the universe of all tested regions
boot_test <- permTest(A=hits1, B=hits2,
ntimes=100,
randomize.function=resampleRegions,
universe=all,
evaluate.function=numOverlaps)
summary(boot_test)
## Permutation tests: 1
## Significant permutation tests: 1
## Iterations: 100
## Randomization Function: resampleRegions
## Tests Results:
## pvalue zscore test
## numOverlaps 0.00990099 -4.0319 less
Now we are sampling only from our known regions all, not the entirety of the hg19 human genome assembly. We are also not matching by mean region length. Our “universe” is much smaller and well-defined. This is reflected in the dramatically reduced z-score compared to the permutation test using hg19 above, and the test direction flips from “greater” to “less”. This suggests all is more similar to the hits than the whole genome is. In other words, the elements in all are not random regions of hg19.
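One way to see the two nulls side by side, assuming permTest results expose their observed and permuted statistics (as regioneR's permTestResults objects do):
# the observed statistic is the same 6 overlaps in both tests
boot_test$numOverlaps$observed
# but resampling from all yields far more overlaps by chance...
mean(boot_test$numOverlaps$permuted)
# ...than genome-wide random regions do (perm_test here is the Q5 object)
mean(perm_test$numOverlaps$permuted)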
Q7. Repeat Q6 switching the role of hits1 and hits2. Are conclusions similar?
# repeat the bootstrap test with the roles of hits1 and hits2 swapped
boot_test2 <- permTest(A=hits2, B=hits1,
ntimes=100,
randomize.function=resampleRegions,
universe=all,
evaluate.function=numOverlaps)
## Warning in .Seqinfo.mergexy(x, y): Each of the 2 combined objects has sequence levels not in the other:
## - in 'x': chr21, chrX
## - in 'y': chr19
## Make sure to always combine/compare objects based on the same reference
## genome (use suppressWarnings() to suppress this warning).
summary(boot_test2)
## Permutation tests: 1
## Significant permutation tests: 1
## Iterations: 100
## Randomization Function: resampleRegions
## Tests Results:
## pvalue zscore test
## numOverlaps 0.00990099 -4.3152 less
Yes, similar but not exact. We get a slightly lower z-score, perhaps because hits2 is a larger set.
Q8. Which null distribution would you use in your own
research and why?
It would depend on the question. If you were interested in how often these two separate genome-wide assays returned the same regions, you would use the permutation test with hg19. You would probably want to spend time matching it better as a null (masking repetitive regions, etc.); a sketch of that follows below. If you were interested in whether hits1 and hits2 are different from other regions in all, you would use bootstrap sampling from all. Or you could repeat the permutation test but only generate random regions from the possible bounds of all (if you knew how all was defined, which you do not know here).
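As a sketch of the better-matched genome-wide null, assuming the masked BSgenome installed in Q4 is available, regioneR's getGenomeAndMask can supply a repeat mask for the randomization:
# permutation test with repetitive regions masked out of the null
gm <- getGenomeAndMask("hg19")
perm_test_masked <- permTest(A=hits1, B=hits2, ntimes=100,
randomize.function=randomizeRegions,
genome=gm$genome, mask=gm$mask,
evaluate.function=numOverlaps, count.once=TRUE)
summary(perm_test_masked)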
The next few questions involve downloading genomics data. You can choose sets of regions, e.g., gene annotation, ChIPseq, RNAseq, ATACseq, GWAS SNPs. The code below shows examples of how to look for associations, but there are many ways you could do this. Most of the output of the code is suppressed because it takes a long time to run.
Q9. Using data you download, can you infer what function was tested in the assay that discovered hits1 and hits2? Choose data sets that will be informative about candidate functions. Compute overlaps or mean values of the downloaded data for the union of hits1 and hits2.
First, look at the sizes of the regions.
a <- data.frame("length" = c(width(all), width(hits1), width(hits2)),
"group" = c(rep("ALL", length(all)), rep("HITS1", length(hits1)), rep("HITS2", length(hits2)))
)
#table(a$group)
ggplot(a, aes(x=length, color=group)) +
geom_histogram(aes(y=..density..), fill="white", alpha=0.5, position="identity", bins = 40) +
theme_bw() +
geom_vline(xintercept = mean(a[a$group == "ALL", "length"]), color="red") +
geom_vline(xintercept = mean(a[a$group == "HITS1", "length"]), color="darkgreen") +
geom_vline(xintercept = mean(a[a$group == "HITS2", "length"]), color="blue") +
xlab("Region Length") + ylab("Frequency") +
ggtitle("Mean region length is ~2kb for ALL and HITS2, ~3kb for HITS1")
Some of the genomic features you may want to explore are RNA,
narrow peak ChIPseq (e.g., transcription factor binding), broad peak
ChIPseq (e.g., repressive histone marks), and annotations of nearby
genes.
library(ChIPpeakAnno) # for peak annotation and GO enrichment
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
hits <- c(hits1, hits2) # union of hits1 and hits2
aCR<-ChIPpeakAnno::assignChromosomeRegion(all, nucleotideLevel=FALSE,
precedence=c("Promoters", "immediateDownstream",
"fiveUTRs", "threeUTRs",
"Exons", "Introns"),
TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene)
aCR_hits<-ChIPpeakAnno::assignChromosomeRegion(hits, nucleotideLevel=FALSE,
precedence=c("Promoters", "immediateDownstream",
"fiveUTRs", "threeUTRs",
"Exons", "Introns"),
TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene)
par(mfrow=c(2,1))
barplot(aCR$percentage, main="All: majority of regions are intergenic/intronic == non-coding")
barplot(aCR_hits$percentage, main="Same pattern for HITS")
Our hits are mostly in intergenic regions. Next, examine how close
they are to gene transcription start sites (TSS).
annoData <- suppressMessages(toGRanges(TxDb.Hsapiens.UCSC.hg19.knownGene, feature="gene"))
annotatedPeak_all <- annotatePeakInBatch(all, AnnotationData = annoData)
annotatedPeak_all <- addGeneIDs(annotatedPeak_all, orgAnn="org.Hs.eg.db", IDs2Add=c("symbol"), feature_id_type="entrez_id")
annotatedPeak_hits1 <- annotatePeakInBatch(hits1, AnnotationData = annoData)
annotatedPeak_hits1 <- addGeneIDs(annotatedPeak_hits1, orgAnn="org.Hs.eg.db", IDs2Add=c("symbol"), feature_id_type="entrez_id")
annotatedPeak_hits2 <- annotatePeakInBatch(hits2, AnnotationData = annoData)
annotatedPeak_hits2 <- addGeneIDs(annotatedPeak_hits2, orgAnn="org.Hs.eg.db", IDs2Add=c("symbol"), feature_id_type="entrez_id")
mean(annotatedPeak_all$distancetoFeature)
table(annotatedPeak_all$insideFeature)
Only 13 overlap TSS. Most are further upstream/downstream from
the annotated elements. At this point you might guess that they are some
kind of regulatory element. Maybe a gene ontology search will shed some
light.
#Define some functions
quick_GO <- function(annotatedPeak){
over <- getEnrichedGO(annotatedPeak, orgAnn="org.Hs.eg.db",
maxP=0.01, minGOterm=10,
multiAdjMethod="BH",
condense=FALSE, feature_id_type = "entrez_id")
final <- rbind(unique(over[["bp"]][, c(1,2,4,5)]),
unique(over[["mf"]][, c(1,2,4,5)]),
unique(over[["cc"]][, c(1,2,4,5)]))
#note: this doesn't take into account background, gene length, etc.;
#in theory, several things should be corrected for here
final$log10pval <- -log10(final$pvalue)
final <- final %>% arrange(pvalue)
return(final)
}
GOplot <- function(df, my.title, my.n=20, my.col="#08589E"){
df <- head(df, n=my.n)
df <- df %>% arrange(log10pval)
df$go.term <- factor(df$go.term, levels = df$go.term) # so it holds same order as df
g2 <- ggplot(data=df, aes(x=go.term, y=log10pval, fill="Gene")) +
geom_bar(stat="identity") +
ylab("-log10(padj)") +
xlab("") +
ggtitle(my.title) +
coord_flip() +
scale_fill_manual(values=c(my.col)) +
geom_text(aes(label = Ontology), hjust = 1.5, color = "white", size = 3) +
theme(axis.text.x = element_text(size=10), legend.position="none") +
labs(fill='Gene Ontology Grouping')
return(g2)
}
#Gene ontology on all regions given
go_all <- quick_GO(annotatedPeak_all)
go_hits1 <- quick_GO(annotatedPeak_hits1)
go_hits2 <- quick_GO(annotatedPeak_hits2)
GOplot(go_all, "ALL: go enrichment")
GOplot(go_hits1, "HITS1: go enrichment", my.col = "salmon")
GOplot(go_hits2, "HITS2: go enrichment", my.col = "darkgreen")
This indicates some different terms for hits1 versus hits2. Maybe they function in different tissues.
Guess what type of genomic element these hits are (i.e., what assay was performed).
A reasonable guess: This is a ChIP-seq pull-down of a transcription factor in heart and some brain-type cell.
Answer: This is a good guess. The true answer is that these regions were tested for enhancer activity using transient transgenic mouse assays, downloaded from the VISTA Enhancer Browser. The file all includes all tested regions at the time of download, which encompasses negatives (no enhancer activity) and regions positive for activity in different tissues. The elements in all were prioritized for the assay based on evolutionary conservation and epigenetic marks. The regions in hits1 are limb enhancers (limb does include muscle, and hence the heart GO terms), whereas the regions in hits2 are brain enhancers.
BONUS Q10. Do you think hits1 and hits2 function in the same cell type?
- Build on your analysis in Q9 by separately testing overlaps with hits1 and hits2. Choose datasets that are from several different cell types.
The analyses above used the cell-type gene lists provided by GSEA and calculated the significance that genes (closest to each region) in hits1 or hits2 overlap these lists more than expected by random chance. Given these results and Q9, we would conclude that hits1 is from an assay on heart cells (e.g., cardiomyocytes) or some other muscle cells, and hits2 is likely from an assay on brain/neuronal cells. So, these do not appear to be the same cell type.
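A minimal sketch of one such test; the gene-set file name is a hypothetical placeholder, and the enrichment is a simple hypergeometric test (base R's phyper) on nearest-gene symbols:
# hypothetical cell-type gene set (placeholder file name)
heart_genes <- readLines("GSEA_heart_gene_set.txt")
hits1_genes <- unique(annotatedPeak_hits1$symbol)
universe_genes <- unique(annotatedPeak_all$symbol) # background: all tested regions
overlap <- sum(hits1_genes %in% heart_genes)
# upper-tail hypergeometric test: is the gene set enriched near hits1?
phyper(overlap - 1, sum(universe_genes %in% heart_genes),
sum(!universe_genes %in% heart_genes), length(hits1_genes),
lower.tail=FALSE)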
BONUS Q11: Try matching the random regions more closely to regions in hits1
- On what variables will you match them? e.g., spacing, chromosome, GC-content, distance to nearest gene, masking
- How does matching affect the z-score and p-value?
You could leverage another R package (gkmSVM, a gapped k-mer SVM classifier) that was built to generate a matched set of regions by GC content (tolerates 0.02 deviation from the input), repeat content (0.02 deviation), and region length (again, 0.02 deviation), and that outputs a list of the same length (xfold=1). To save time during rendering, you can generate a matched list larger than hits1 (xfold=5, so ~700 regions) and then randomly sample from that as the “universe”, as sketched below. Otherwise it would be necessary to run the SVM on every iteration, which is a lot of computation.
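A sketch of that workflow, assuming the gkmSVM package's genNullSeqs function (parameter names per its documentation; file names are placeholders):
library(gkmSVM)
# generate GC-, repeat-, and length-matched regions for hits1 (~5x as many)
genNullSeqs("hits1.bed", genomeVersion="hg19", xfold=5,
outputBedFN="matched_null.bed")
matched <- toGRanges(read.table("matched_null.bed", sep="\t"))
# resample from the matched set as the universe, as in Q6
matched_test <- permTest(A=hits1, B=hits2, ntimes=100,
randomize.function=resampleRegions, universe=matched,
evaluate.function=numOverlaps, count.once=TRUE)
summary(matched_test)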
The matched hits actually have higher z-scores, which may seem surprising at first. This reflects the fact that regions tested by VISTA for enhancer activity are not random genomic regions; they were selected based on evolutionary conservation and epigenetic marks, and hence they have a different GC content than the genome as a whole. Nonetheless, this analysis reinforces that having 6 regions overlap when considering the whole genome is very unlikely to occur by chance, even when you control for other variables.