source('https://bioconductor.org/biocLite.R'); biocLite('edgeR')
install.packages('pheatmap')
Specifically, we'll use a matrix generated by "Subread featureCounts." Other tools will generate files in slightly different format. (Other common tools that generate similar matrices are HTSeq-counts (not recommended) and Cufflinks (outputs normalized values, not raw counts).)
Excel Warning: Excel will save your files with different (invisible) line ending characters. R actually handles this just fine, but UNIX tools will have weird issues. Beware! Search online for "CR LF line endings" or "use the 'tr' command to translate carriage returns to line feeds" to find the solution to this issue.
COUNTS_FILE <- "/path/to/file/on/your/filesystem/counts.txt" allDat <- read.delim(file=COUNTS_FILE , comment.char='#' # <-- occasionally dangerous , header=TRUE, row.names=1 , sep="\t" # <-- \t = tab , strip.white=TRUE , stringsAsFactors=FALSE # <-- this is SUPER IMPORTANT for most R things , fill=FALSE) head(allDat) # Check out the top of the file to make sure it looks OK
You can open the "counts" file in any plain text editor, or in Excel / LibreOffice.
If you're on the Mac, I recommend the free program TextWrangler (also free on the App Store) for your plain-text-wrangling needs.
is.data.frame(allDat) # "read.delim" returns a data frame, so this should be TRUE countMat <- data.matrix(allDat) # Convert the input "data.frame" to a numeric matrix. # This is the matrix we'll use below. # ("as.matrix(...)" would have also worked, but # it's less assertive about requiring only numbers.) typeof(countMat) # Should be 'numeric' or 'integer'Optional note: You'll see a lot of input datasets with extraneous columns that you may need to remove, so be prepared to subset your "allDat" matrix if you have extra columns / rows. (e.g.
only_the_numbers_please <- allDat[ , 4:6]
). Subread featureCounts normally provides a bunch of extra data in the columns 2 through 5 by default, but we've cut it out for this example.
summary(countMat) # Get some basic info. Conveniently summarized by column (default behavior)! # Result of summary(countMat): # mtBC02 mtBC06 mtBC07 wtBC04 wtBC05 wtBC08 # Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 # 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 # Median : 15.0 Median : 18.0 Median : 19.5 Median : 18.0 Median : 14.0 Median : 22.0 # Mean : 328.0 Mean : 430.3 Mean : 621.3 Mean : 420.7 Mean : 402.6 Mean : 627.3 # 3rd Qu.: 252.5 3rd Qu.: 386.8 3rd Qu.: 572.2 3rd Qu.: 357.2 3rd Qu.: 323.2 3rd Qu.: 543.8 # Max. :4266.0 Max. :7481.0 Max. :10062.0 Max. :6177.0 Max. :7896.0 Max. :13621.0
Question 1: Does this look like reasonable input data, based on your expectation of sequence data? Why is the maximum so much higher than the mean?
Question 2: Why is the "1st Qu." 0.0? What's wrong with our data? This is totally normal for sequence data. A high fraction of genes (sometimes 50% or more) will have zero detected counts.
'pheatmap' and 'heatmap.2' make reasonable heatmaps with scale bars (unlike R's default 'heatmap' function).
install.packages("pheatmap") require("pheatmap") pheatmap(log2(1+countMat), border_color="#00000000", show_rownames=FALSE)
pheatmap(log2(1+countMat), border_color="#00000000", show_rownames=FALSE)
. The only oddity here is that 'border_color' thing, to set borders to transparent. Otherwise the heatmaps look awful.
?pheatmap
). As you can see, it's incredibly obvious how to properly configure - heatmaps.
Optional Section about R's default heatmaps, which you should not useHere, we will use R's default 'heatmap' package. But you should never use it (use 'heatmap.2' or 'pheatmap').heatmap(countMat , scale="none") # <-- Looks terrible. (It's data with a high dynamic range, in linear space.) heatmap(log2(1+countMat), scale="none") # <-- Log-transformed data will be a lot clearer. heatmap(countMat) # <-- Default 'heatmap' scales each row (not what we want!). Note columns 'mtBC07' and 'wtBC08' ![]() The default R heatmap row-normalizes your values if you don't tell it not to, and it doesn't include scale bars. Question: why are the first two columns here so different from the others? Answer: those are the two columns (samples) with higher overall expression (look at the mean value in the 'summary(countMat)' results). Since this heatmap shows raw counts with no normalization, we get misleading results. |
hist(countMat[,1], xlim=c(0,500), breaks=1000, col="red", main="Just sample 1") # Q: Why is "xlim=c(0, 500)" there? # Optional: plot the same thing, but in log space: hist(log2(1+countMat[,1]), breaks=20, col="blue", main="log2(Sample 1)")
pairs(countMat) # Hmm... this looks really bad pairs(countMat, pch=19, cex=0.25, col="red") # Why "pch = 19"? What about "pch='.'"?
Question: do these samples look similar, or is there some obviously-messed-up replicate? (I can't really tell from these graphs, because they are totally dominated by a few high-expression outliers.)
Let's log-transform the data to make it a little more clear what's going on with the samples.
logMat <- log2(1 + countMat) # Why did we add one? pairs(logMat, pch=19, cex=0.25, col="blue") # X and Y axis scales are not the same by default
So does anything seem "off" about this 'pairs' plot? Any samples that really should be excluded? Answer: These all look pretty OK (from this metric, at least).
Secret knowledge: if you want to understand EdgeR better, the documentation is amazingly good (this is very rare, so you should savor the experience). You can access it online, or within R (it will open as a PDF) by typing: edgeRUsersGuide()
We're going to tell edgeR which sample belongs to which group annotation. Let's see if we know the answer to that.
colnames(countMat) # Hopefully, sample groups are obvious from the column names # We're going to MANUALLY specify the column names here. metadata <- data.frame( "sample" = colnames(countMat) ,"disease" = c("AD","AD","AD", "CTL","CTL","CTL") )Here's a fancier way to do it:
group_names_automatically_computed <- ifelse(grepl("^mt", colnames(countMat)), yes="ALZ", no="CONTROL") # Above: checks to see if the column name starts with 'mt' or not. The 'mt' samples are the "Alzheimer's" samples. # Result: [1] "ALZ" "ALZ" "ALZ" "CTL" "CTL" "CTL" <-- automatically figures out groups from the sample names
Now we need to set up the model matrix, A.K.A. the "hey edgeR, this is which sample belongs to which group" matrix. You can also use it to set up more complicated experimental scenarios (like "drug X vs control, and also day 10 vs day 20 samples").
Here, we just have two groups: samples from patients with Alzheimer's disease and "control" samples without the disease.
We need to tell edgeR which samples belong to which groups, and also which groups we want to compare. To do that, we'll use an R "model formula." It's just called a "formula" in R, but don't mistake it for an algebraic formula!
Perils of R fomulas: R 'model formulas' have an extremely strange method of specification that is unlike anything in other programming langauges. You can get some info with by typing: help('formula'). The ONE most crucial thing about model formulas in R is that they are NOT ALGEBRAIC EQUATIONS, even though they look like they should be. Here's a web page that describes them in better detail than R's documentation: http://ww2.coastal.edu/kingw/statistics/R-tutorials/formulae.html.
Code to generate the model matrix from our list of groups:
require("edgeR") str(metadata) # View the underlying STRucture with "str" print(metadata) # sample disease # 1 mtBC02 AD # 2 mtBC06 AD # 3 mtBC07 AD # 4 wtBC04 CTL # 5 wtBC05 CTL # 6 wtBC08 CTL group <- metadata[, "disease"] design <- stats::model.matrix( ~ group ) # <- "~ THING" is an R formula
Optional details about R model formulasModel formula info: if you had multiple factors to consider, you could specify them using the "~" notation here.For example, if you had cell growth as a factor of both gene-deletion-mutation state ("mutant_status") and also whether or not a plate of cells was given Drug X or no drug ("drug_status"), you would have:
design <- model.matrix( ~ mutant_status + drug_status ) # mutant_status and drug_status are both vectors of TRUE / FALSE # (They are the same length, also—one vector element per experimental sample.) But actually, maybe the effect of the drug changes based on the gene deletion! Like, say drugX only affects mutant cells. Then that means the drug effect interacts with the gene-deletion effect. So then, we'd really need: design <- model.matrix( ~ mutant_status + drug_status + mutant_status:drug_status) # That last "thing1:thing2" term is the "interaction term" # Theoretically you wouldn't need it if the effect of the drug # and of the mutation were totally unrelated and independent of each other.Wikipedia examples of interaction terms:
|
Let's check out this model matrix we just made.
print(design) # What does 'design' look like now? Note the column names! > design (Intercept) groupCTL 1 1 0 2 1 0 3 1 0 4 1 1 5 1 1 6 1 1Note that the "CTL" column may be slightly confusing, here, the samples with "CTL" have a 1 and the "AD" samples have a zero. So you might think "0 means AD," but that's not quite accurate. (We will discuss this more later.)
Optional Section about R column namingSo what was the point of thatgroup variable above anyway?
Can't we just omit 'group' entirely and set that 'design' model matrix in one line? Yes, but check out what happens to the column names in the design matrix. This is just an R quirk—functions can actually make use of the exact names of the input variables, not just their values. This is extremely unusual; you would probably have a hard time finding another programming language that operates this way! design <- stats::model.matrix( ~ metadata[, "disease"] ) print(design) # (Intercept) metadata[, "disease"]CTL # <-- Awful column names now # 1 1 0 # 2 1 0 # 3 1 0 # 4 1 1 # 5 1 1 # 6 1 1 colnames(design) <- c("Intercept", "CTL") # <-- Fix the bizarre column names from earlier print(design) # <-- Ok, now the column names are reasonable. |
Hey, what happened to the "AD" category? We only have CTL now and this "Intercept" thing. Well, the second column is actually "AD or CTL," despite its misleading name. Probably it should be named "Is_this_sample_control" or something, but that's really verbose, so "CTL" will have to do.
We aren't going to use the Intercept term for anything.
edgeRUsersGuide()
, or read the original edgeR paper.y <- edgeR::DGEList(counts=countMat, group=group) # or group=metadata[, "disease"] y <- edgeR::calcNormFactors(y) y <- edgeR::estimateGLMTrendedDisp(y, design) y <- edgeR::estimateGLMTagwiseDisp(y, design) fit <- edgeR::glmFit(y, design) # "Genewise Negative Binomial Generalized Linear Models" lrt <- edgeR::glmLRT(fit, coef=2)So what we are actually doing with edgeR here?
y <- edgeR::DGEList(counts=countMat, group=group)
Note that this is a purely programming-related line, it has no statistical purpose whatsoever!
Question: How do we know what kind of R object the variable 'y' is? Answer: str(y)
(see the part where it says "Formal class ___") or class(y)
. Or if you want to be super confused and learn nothing, try typeof(y)
Question: What's up with the edgeR::
in each of these commands (and can we remove it)? Answer: it's optional. I just like it so I can remember which of the many R packages actually has a specific function when I read code like 3 years later.
y <- edgeR::calcNormFactors(y)
Inspect the actual normalization factors with:
y$samples
.
There are different methods for normalizing your input data. The default is TMM.
method="TMM" # TMM is the default methodBasically: 1) Ignore the highest- and lowest-expressed genes. 2) Find a scaling coefficients that minimize the fold change between (remaining) genes across samples. (Just one coefficient per sample, so if you have N = 6 samples, you'll have six normalization factors.)
Description of TMM in detail: Robinson MD, Oshlack A (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology 11, R25.
Question: We presumably expect the normalization factor to be the inverse of the total number of reads, right? In fact, no!.
Check it out with: plot(y$samples$lib.size, y$samples$norm.factors)
y <- edgeR::estimateGLMTrendedDisp(y, design)
Note that the 'design' gets used here.
y <- edgeR::estimateGLMTagwiseDisp(y, design)
Here, we estimate variance / dispersion for each gene. You might expect that this would be easy—calculate the variance for each gene. (This would work if you had enough replicates—say, 25—for each gene.)
Let's see how the dispersion estimates changes as a function of average expression.
plot(lrt$table$logCPM, y$tagwise.dispersion, pch=19, main="Note: logCPM across all replicates"); lines(supsmu(lrt$table$logCPM, y$tagwise.dispersion), col="red", lwd=3) # Let's add a smoothed line
Question: why is the variance going down as the expression value goes up? That can't be right... shouldn't the actual variance for each gene INCREASE as the expression level goes up? Answer: The x-axis isn't variance, it's the 'dispersion' parameter, which is basically "how much more variable is this data than the Poisson estimate would suggest?" So 0.0 means "it's basically Poisson."
Another questions about this graph: what's up with that one gene at the top of the figure (under the "e" in "replicates")? Answer: Evidently it has high expression as well as super high dispersion, which means high variance as well.
This is one place where R doesn't shine too well for exploratory data; ideally you'd be able to click on that gene to see which one it is, but instead:
reord <- order(y$tagwise.dispersion) # Got to keep all the rows in the same order... zzz <- cbind( y$counts[reord, ] , "tagwise.dispersion"=y$tagwise.dispersion[reord]) print(zzz) mtBC02 mtBC06 mtBC07 wtBC04 wtBC05 wtBC08 tagwise.dispersion ENSG00000157554 4 0 0 27 0 15 1.188 ENSG00000244025 6 0 0 0 2 0 1.263 ENSG00000160202 2 0 1 0 1 4602 1.801 <-- this is the weird pointAha! So it was gene "ENSG00000160202." Note that these values in the graph are raw values, not the (averaged over replicates) CPM values that are actually used in the graph.
By the way, here are the genes with the LOWEST "tagwise.dispersion."
mtBC02 mtBC06 mtBC07 wtBC04 wtBC05 wtBC08 tagwise.dispersion ENSG00000159140 3880 6154 6628 5200 5164 8190 0.0294 ENSG00000142192 3773 4545 7058 5081 4498 9148 0.0312 ENSG00000241837 3076 3092 4983 3692 3667 4094 0.0317Should be pretty clear which points these are on the scatterplot above. Answer: Yes, it is pretty clear.
Other algorithms / methods for variance estimation:
fit <- edgeR::glmFit(y, design)
?glmFit
: "glmFit fits genewise negative binomial glms, all with the same design matrix but possibly different dispersions, offsets and weights."
This is the first part where it's immediately clear why the 'design' is being used—we need to specify which samples are replicates.
Let's see what these adjustments did to the raw counts.
multiplot_nrow <- 2 multiplot_ncol <- ncol(countMat) / multiplot_nrow par(mfrow=c(multiplot_nrow, multiplot_ncol)) for(i in seq(from=1, to=ncol(countMat)) ) { # Plot all the columns! plot( x=countMat[,i] , y=fit$fitted.values[,i] , main=colnames(countMat)[i] , xlab="raw counts (not CPM!)" , ylab="fitted counts", pch=19, col=i) abline(a=0, b=1, lty="dotted") # 45 degree line at X=Y } ## Let's plot that same information, but in LOG SPACE so we can actually see what's going on. multiplot_nrow <- 2 multiplot_ncol <- ncol(countMat) / multiplot_nrow par(mfrow=c(multiplot_nrow, multiplot_ncol)) for(i in seq(from=1, to=ncol(countMat)) ) { # Plot all the columns... log2-style plot( x=log2(1+countMat[,i]) , y=log2(1+fit$fitted.values[,i]) , main=colnames(countMat)[i] , xlab="log2(1 + raw count)" , ylab="log2(1 + fitted count)", pch=19, col=i) abline(a=0, b=1, lty="dotted") # 45 degree line at X=Y } par(mfrow=c(1,1)) # return to normal one-at-a-time plotting!
lrt <- edgeR::glmLRT(fit, coef=2)
?glmLRT
: "glmLRT conducts likelihood ratio tests for one or more coefficients in the linear model."
This actually runs the differential expression analysis, on the (complex) 'fit' object that contains our counts, normalization factors, and design matrix.
Q: What if you took out 'coef=2'? What is that doing?
Let us investigate what we can do for more complex design matrices
g3 <- c("A","A","A","B","B","B","C","C","C") # Three groups instead of just two! aaa <- stats::model.matrix( ~ g3 ) # <-- this is what we did earlier bbb <- stats::model.matrix( ~ 1 + g3 ) # Same as ~ g3 ccc <- stats::model.matrix( ~ 0 + g3 ) # Slightly more obvious what's going on
Note the differences between 'bbb' and 'ccc'. Can you use a term besides '0' or '1' as a constant?
If you wanted to compare group "C" to group "B," you can no longer just set 'coef=#'. How do we handle this?
Now let's look at the output:
edgeR::topTags(lrt, n=10) # Show the top genes by p-value head(lrt$table , n=10)
> topTags(lrt) Coefficient: groupCTL logFC logCPM LR PValue FDR ENSG00000160202 9.9074697 12.548053 17.813626 2.436322e-05 0.00521373 ENSG00000183486 5.9215134 5.799361 12.962664 3.177647e-04 0.03400082 ENSG00000159200 -1.2325853 12.214501 11.865118 5.719515e-04 0.04079921 ENSG00000159212 4.6759523 5.132394 7.789338 5.255545e-03 0.27147019 ENSG00000159261 4.4485518 5.056227 7.146401 7.511457e-03 0.27147019Q: Why does 'topTags' have one more column than lrt$table? (We will talk about generating that extra column below.)
"Generalized" linear model ≠ "General" linear model.
Time to plot some basic summary information from the edgeR run:
fc <- lrt$table$logFC # log2 p <- lrt$table$PValue plot(x=p, y=fc, pch=19) # Plot fold change (log2 scale) vs P-value abline(h=0, col="red") # What is each point? # Well, that was unsurprising; the genes with higher fold-change usually also have lower P-values # When would that NOT be the case? # Same idea as a "volcano plot"
Optional Section about translucent pointsFor scatterplots with a bunch of points (like the plot above), it's sometimes useful for the points to be partially transparent.plot(p, fc, pch=19, cex=4, col="#000000") # <-- Unreadable! plot(p, fc, pch=19, cex=4, col="#00000033") # <-- Partially readable!If #00000033 makes no sense to you, search online for "RGBA" (http://www.w3schools.com/cssref/css_colors_legal.asp)
|
Let's see how things work out if we use the T-test or Wilcoxon test to calculate P-values:
# We happen to remember that columns 1 through 3 are one group, and 4 through 6 are the other. results.ttest <- apply(countMat, MARGIN=1, function(row) { t.test(row[1:3], row[4:6]) } ) results.wilcoxon <- apply(countMat, MARGIN=1, function(row) { wilcox.test(row[1:3], row[4:6]) } ) p.edge <- lrt$table$PValue names(p.edge) <- rownames(lrt$table$PValue) # Doesn't get the row names by default... p.t <- sapply(results.ttest, "[[", "p.value") p.w <- sapply(results.wilcoxon, "[[", "p.value") stopifnot(all(names(p.t) == names(p.w))) stopifnot(all(names(p.t) == names(p.edge))) plot(p.edge, p.t, pch=19, col="red", main="T-test vs EdgeR") plot(p.edge, p.w, pch=19, col="blue", main="Wilcoxon vs EdgeR") plot(p.t , p.w, pch=19, col="purple", main="Wilcoxon vs T-test")If only there was some way to plot all three of those comparisons in one plot... perhaps some way that we used earlier today, even.
pairs(cbind(p.edge, p.t, p.w), pch=19, col="#FF000066")
Optional: Variable Name Issues in RHere's something you may encounter while implementing a formula in R. What is horribly wrong with this reasonable looking R code?t <- apply(countMat, MARGIN=1, function(row) { t.test(row[1:3], row[4:6]) })
or this code?
or this code?
T, t, c (and some other common short names) are existing R functions/values, but R will let you redefine them! So "pi = 3" is totally valid (but not recommended) R code. Interestingly, although "T = 10" is allowed, "TRUE = 10" gives an error. |
P < 0.05
threshold, even if there were no real differences in your experimental conditions? Answer: 10000 * 0.05 = 500. That's a lot of "significant" genes!
Let's look at a histogram of our P-values:
p <- lrt$table$PValue sum(p < 0.05) # 12 genes with P < 0.05 hist(p, breaks=25, col="gray")Now to get some adjusted statistics.
ben <- stats::p.adjust(p, method="BH") # Benjamini-Hochberg method bon <- stats::p.adjust(p, method="bonferroni") # "There seems no reason to use the unmodified Bonferroni correction # because it is dominated by Holm's method, which is also valid under # arbitrary assumptions.""The "BH" (aka "fdr") and "BY" method of Benjamini, Hochberg, and Yekutieli control the false discovery rate, the expected proportion of false discoveries amongst the rejected hypotheses. The false discovery rate is a less stringent condition than the family-wise error rate, so these methods are more powerful than the others." # Bonferroni plots hist(bon, breaks=25, col="orange", main="Bonferroni") plot(p, bon, col="orange", pch=19, main="Bonferroni") # BH plots hist(ben, breaks=25, col="green", main="Benjamini-Hochberg FDR (not a P-value)") plot(p, ben, col="green" , pch=19, main="BH") plot(ben, bon, col="black", pch=19)
Question for you: what are the values returned by "p.adjust" with the "FDR" (Benjamini-Hochberg) method—are they P-values? No! Those values are the estimated false discovery rate at a particular P-value cutoff.
Question: Could it ever happen that you get ALL values with an adjusted Benjamini-Hochberg FDR value of 1.0? What would that even mean? Yes—this is surprisingly common. It means you have about as many differentially expressed genes in your condition-A-versus-condition-B experiment as you'd expect to get if you just ran "control condition vs some more control condition" with no actual biological effect.
It is generally recommended to remove genes with "too low" counts and genes with "too high" counts.
One single mitochondrial gene can account for up to 10% of total reads in a sample. Why might you want to remove those?
Let's also try removing the genes with basically no expression. We could use CPM as a cutoff, or raw reads. Probably it's better to set up some formula involving the total number of samples, but for our example, we'll just pick a semi-arbitrary cutoff.
REQUIRED_TOTAL_RAW_COUNTS_ACROSS_ALL_SAMPLES = 10 filtMat <- countMat[ rowSums(countMat) >= REQUIRED_TOTAL_RAW_COUNTS_ACROSS_ALL_SAMPLES, ] # a subset of genes with "enough" counts. Note that we are not filtering "too high count" genes here.
Also, we keep running the same edgeR code over and over. Maybe we should stop copying and pasting our code, and make a simple function to do it...
myEdgeR <- function(mat, g, d) { # ARGUMENTS: mat = input counts matrix, g = groups, d = design matrix z <- edgeR::DGEList(counts=mat, group=g) z <- edgeR::calcNormFactors(z ) z <- edgeR::estimateGLMTrendedDisp(z, d) z <- edgeR::estimateGLMTagwiseDisp(z, d) z.fit <- edgeR::glmFit(z, d) z.lrt <- edgeR::glmLRT(z.fit, coef=2) return(z.lrt) } # Run edgeR on the filtered data. lrt.f = lrt, filtered lrt.f <- myEdgeR(filtMat, g=group, d=design) genes_in_both.vec <- intersect(rownames(lrt$table), rownames(lrt.f$table)) tab.all <- lrt$table[genes_in_both.vec, ] # <-- 'lrt' from way earlier tab.filt <- lrt.f$table[genes_in_both.vec, ] stopifnot(all(rownames(tab.all) == rownames(tab.filt)))
Will the results change as a result of our filtering?
plot(tab.all$PValue, tab.filt$PValue, pch=19, col="#FF000044", cex=2) abline(a=0,b=1) # add a line at x = y
write.table(countMat, file="OUTPUT_Count_Matrix.txt", col.names=NA, row.names=T, quote=F, sep="\t") write.table(lrt$table, file="OUTPUT_LRT_table.txt", col.names=NA, row.names=T, quote=F, sep="\t")Actually that's not a very good way to do it. You should probably put all your output data in one table, so it isn't a total nightmare to look at it later. For example:
same_order <- rownames(countMat) fdr_in_same_order <- p.adjust(lrt$table[same_order, "PValue"], method="BH") write_me.df <- data.frame("COUNTS"=countMat[same_order, ] , lrt$table[same_order, ] , "FDR"=fdr_in_same_order) write.table(write_me.df, file="OUTPUT_ALL_table.txt", col.names=NA, row.names=T, quote=F, sep="\t")
Note: be sure all the rows are in the same order before smooshing them together! Incorrectly mashing columns together is the NUMBER ONE cause of messed up data in R. (Similar to the popular "I accidentally only sorted one column, not the entire table" issue in Excel).
Q: What is the worst way of technically making your data available, but in a format that no one can reasonably use? Text -> Print as PDF. Many journals will actually accept data in this format, for some reason.
Q: When someone inevitably opens your output table in Excel, what is going to happen to the genes 'Sept7' and 'Dec2' in your "common name" gene annotation? They will get converted to something like "September 7, 1904" and "December 2, 1904".
P.S. Never use gene common names as the main identifier for a gene. (Of course, you should include them as additional human-readable annotation on top of the less immediately obvious "ENSG123456789".)
Question for you: You have data from two other labs. One lab used Cufflinks to obtain per-gene counts from 20 samples, but the other lab ran Subread FeatureCounts on their 15 samples. Your PI has asked you to combine samples from both labs into a single table. How will you deal with this? Answer (highlight the black box -->): the best thing to do is to get the raw FASTQ sequence files (or aligned BAM files) and re-run the counting process yourself. Although you might be able to use the 'RUVSeq' package to normalize the counts and just deal with the confounding effect of the different methods. But that's a really bad solution as compared to just re-running the samples with the same method!
P.S. Here is something you might expect to be a big deal that (surprisingly) usually isn't:
How would things change if we had slightly different input data?
topTags(lrt, n=1) # logFC logCPM LR PValue FDR # ENSG00000160202 9.90747 12.54805 17.81363 2.436322e-05 0.00521373
What happens if we multiply ALL the counts (that ones that made it this far through the filtering process) by 10 and re-run our edgeR code?
Will that change the results? How?
Specifically, what is the p-value for this same gene ID ("ENSG00000160202") if you multiply all the counts by 10 and re-run edgeR?
Your answer: when we multiply all counts by 10, the p-value for ENSG00000160202 becomes _________________ (your answer here)
What will happen if we multiply the counts for just sample "mtBC06" (the second column) by 100? Is that going to completely change the genes that show up as differentially-expressed?
Your answer: when we just multiply all counts in "mtBC06" by 100, the raw p-value for ENSG00000160202 becomes _________________, and the new most-significant P-value is for the gene ____________.
Unusually, this is actually an error that is detected automatically---"Error in estimateGLMTagwiseDisp.DGEList(y3b, design): No common.dispersion found in data object. Run estimateGLMCommonDisp first."
(The per-gene dispersion estimate makes use of the more general estimate, so you need to run common dispersion OR tagwise dispersion first.)
For this part, you'll need to install the 'multtest' package from Bioconductor.
source("http://bioconductor.org/biocLite.R"); biocLite("multtest") # "Author: K. Pollard" in the descriptionNow that that's installed, let's look at using PERMUTATION testing on our data, with N = 3 diseased samples and N = 3 control samples.
require("multtest") # We're going to use the same 'countMat' and 'group' variables from above countMatSuperDiff <- countMat countMatSuperDiff[1:10, 1:3] <- countMatSuperDiff[1:10, 1:3] * 25 # Multiply the counts for 10 genes # by 25 in the "AD" condition. binary_group <- ifelse(group == "AD", 0, 1) # The "outcome" must be all 0s or 1s for t-test # Effectively the same as "group <- c(0,0,0,1,1,1)" perm <- multtest::MTP(X=countMatSuperDiff, Y=binary_group, B=10, method="sd.maxT", null="perm") # X = A matrix, data.frame or ExpressionSet containing the raw data. # Y = [...] the outcome of interest. This may be class labels [...] or a continuous [...] dependent variable... # B = number of permutationsNow we've permuted the column (sample) labels. Let's see how the raw values stack up to the permuted values.
# Raw p-value: perm@rawp # Permuted p-value: perm@adjp plot(jitter(perm@rawp), jitter(perm@adjp), xlim=c(0,1), ylim=c(0,1), pch=19, col="#FF000033", cex=2)
Your answer: the maximum number of (valid) permutations for this data (the "B" parameter in MTP) is ___________. This makes sense, since it is also the result of the value of ______ choose _____ (e.g. "N choose M"). The reason this that permutation works poorly on this data is that _______________.
What will happen if we multiply the counts for just sample "mtBC06" (the second column) by 100? Is that going to completely change the genes that show up as differentially-expressed?
Does 'pheatmap' sound more like it should be pronounced more like "P, heatmap" to you, or "feetmap?" Your answer: ____________.