Differential expression analysis in R.

Writeup by Alex Williams, October 2016.

Note about formatting:

Finding differentially-expressed genes in RNA-seq data, using the EdgeR (an R package)

  1. Install "edgeR"

    Install the module "edgeR" and "pheatmap" (nicer heatmaps) from within R (via Bioconductor) with:

    source('https://bioconductor.org/biocLite.R');   biocLite('edgeR')
    install.packages('pheatmap')
  2. Reading a matrix of gene-count data into R

    We are going to read a matrix of per-gene counts data into R.

    Specifically, we'll use a matrix generated by "Subread featureCounts." Other tools will generate files in slightly different format. (Other common tools that generate similar matrices are HTSeq-counts (not recommended) and Cufflinks (outputs normalized values, not raw counts).)

    Excel Warning: Excel will save your files with different (invisible) line ending characters. R actually handles this just fine, but UNIX tools will have weird issues. Beware! Search online for "CR LF line endings" or "use the 'tr' command to translate carriage returns to line feeds" to find the solution to this issue.

    Read the "COUNTS_FILE" matrix into R

    COUNTS_FILE <- "/path/to/file/on/your/filesystem/counts.txt"
    
    allDat <- read.delim(file=COUNTS_FILE
       , comment.char='#'           # <-- occasionally dangerous
       , header=TRUE, row.names=1
       , sep="\t"                   # <-- \t = tab
       , strip.white=TRUE
       , stringsAsFactors=FALSE     # <-- this is SUPER IMPORTANT for most R things
       , fill=FALSE)
       
    head(allDat)  # Check out the top of the file to make sure it looks OK
    

    Your input data should look something like this. Note the columns being mis-aligned in a plain text editor due to tab widths.

    You can open the "counts" file in any plain text editor, or in Excel / LibreOffice.

    If you're on the Mac, I recommend the free program TextWrangler (also free on the App Store) for your plain-text-wrangling needs.

  3. Formatting the input data frame as an actual numeric matrix

    R has a lot of weird properties when reading in data. Let's convert this "data.frame" to a numeric matrix so that:
    1. The "heatmap" function won't complain later
    2. We can make sure all our values are in fact numeric! This is a good sanity check
    is.data.frame(allDat)   # "read.delim" returns a data frame, so this should be TRUE
    
    countMat  <- data.matrix(allDat)  # Convert the input "data.frame" to a numeric matrix.
                                      # This is the matrix we'll use below.
                                      # ("as.matrix(...)" would have also worked, but
                                      #  it's less assertive about requiring only numbers.)
    
    typeof(countMat)   # Should be 'numeric' or 'integer'
    
    Optional note: You'll see a lot of input datasets with extraneous columns that you may need to remove, so be prepared to subset your "allDat" matrix if you have extra columns / rows. (e.g. only_the_numbers_please <- allDat[ , 4:6]). Subread featureCounts normally provides a bunch of extra data in the columns 2 through 5 by default, but we've cut it out for this example.
  4. Examining our matrix of counts: Numerical edition

    We're going to be processing this 'countMat' matrix in edgeR later, but let's check its properties first.
    summary(countMat)  # Get some basic info. Conveniently summarized by column (default behavior)!
    
    # Result of summary(countMat):
    #     mtBC02           mtBC06           mtBC07            wtBC04           wtBC05           wtBC08       
    # Min.   :   0.0   Min.   :   0.0   Min.   :    0.0   Min.   :   0.0   Min.   :   0.0   Min.   :    0.0  
    # 1st Qu.:   0.0   1st Qu.:   0.0   1st Qu.:    0.0   1st Qu.:   0.0   1st Qu.:   0.0   1st Qu.:    0.0  
    # Median :  15.0   Median :  18.0   Median :   19.5   Median :  18.0   Median :  14.0   Median :   22.0  
    # Mean   : 328.0   Mean   : 430.3   Mean   :  621.3   Mean   : 420.7   Mean   : 402.6   Mean   :  627.3  
    # 3rd Qu.: 252.5   3rd Qu.: 386.8   3rd Qu.:  572.2   3rd Qu.: 357.2   3rd Qu.: 323.2   3rd Qu.:  543.8  
    # Max.   :4266.0   Max.   :7481.0   Max.   :10062.0   Max.   :6177.0   Max.   :7896.0   Max.   :13621.0  
    

    Question 1: Does this look like reasonable input data, based on your expectation of sequence data? Why is the maximum so much higher than the mean?

    Question 2: Why is the "1st Qu." 0.0? What's wrong with our data? This is totally normal for sequence data. A high fraction of genes (sometimes 50% or more) will have zero detected counts.

  5. Examining our matrix of counts: Visual overview of the raw data

    Heatmaps:

    'pheatmap' and 'heatmap.2' make reasonable heatmaps with scale bars (unlike R's default 'heatmap' function).

    install.packages("pheatmap")
    require("pheatmap")
    pheatmap(log2(1+countMat), border_color="#00000000", show_rownames=FALSE)
    

    'pheatmap' also gives you a scale. This is the output of: pheatmap(log2(1+countMat), border_color="#00000000", show_rownames=FALSE) . The only oddity here is that 'border_color' thing, to set borders to transparent. Otherwise the heatmaps look awful.


    'pheatmap' documentation (output of ?pheatmap). As you can see, it's incredibly obvious how to properly configure - heatmaps.

    Optional Section about R's default heatmaps, which you should not use

    Here, we will use R's default 'heatmap' package. But you should never use it (use 'heatmap.2' or 'pheatmap').
    heatmap(countMat        , scale="none")  # <-- Looks terrible. (It's data with a high dynamic range, in linear space.)
    heatmap(log2(1+countMat), scale="none")  # <-- Log-transformed data will be a lot clearer.
    heatmap(countMat)                        # <-- Default 'heatmap' scales each row (not what we want!). Note columns 'mtBC07' and 'wtBC08'
    

    The default R heatmap row-normalizes your values if you don't tell it not to, and it doesn't include scale bars. Question: why are the first two columns here so different from the others? Answer: those are the two columns (samples) with higher overall expression (look at the mean value in the 'summary(countMat)' results). Since this heatmap shows raw counts with no normalization, we get misleading results.

    Histograms of read counts:

    hist(countMat[,1], xlim=c(0,500), breaks=1000, col="red", main="Just sample 1")
    # Q: Why is "xlim=c(0, 500)" there?
    
    # Optional: plot the same thing, but in log space:
    hist(log2(1+countMat[,1]), breaks=20, col="blue", main="log2(Sample 1)")
    

    Pairs plots (for looking at sample-sample correlation in detail):

    pairs(countMat)  # Hmm... this looks really bad
    pairs(countMat, pch=19, cex=0.25, col="red")   # Why "pch = 19"? What about "pch='.'"?
    

    Question: do these samples look similar, or is there some obviously-messed-up replicate? (I can't really tell from these graphs, because they are totally dominated by a few high-expression outliers.)

    Let's log-transform the data to make it a little more clear what's going on with the samples.

    logMat <- log2(1 + countMat)                  # Why did we add one?
    pairs(logMat, pch=19, cex=0.25, col="blue")   # X and Y axis scales are not the same by default
    

    A 'pairs(...)' plot in R, which can be customized with much more information than in this example. Question for you: is this plot perfectly symmetrical (what's the difference between the cell just BELOW the top-left one, and the cell just RIGHT of the top-left one)?

    So does anything seem "off" about this 'pairs' plot? Any samples that really should be excluded? Answer: These all look pretty OK (from this metric, at least).

  6. EdgeR: Differential Expression

    Our input to EdgeR will be the 'countMat' matrix. Note that for these steps, we aren't going to filter out any reads from the matrix. (But, for a full experiment, we would probably want to remove certain genes with too high / too low expression.)

    Secret knowledge: if you want to understand EdgeR better, the documentation is amazingly good (this is very rare, so you should savor the experience). You can access it online, or within R (it will open as a PDF) by typing: edgeRUsersGuide()

    We're going to tell edgeR which sample belongs to which group annotation. Let's see if we know the answer to that.

    colnames(countMat) # Hopefully, sample groups are obvious from the column names
    
    # We're going to MANUALLY specify the column names here.
    metadata <- data.frame(    "sample" = colnames(countMat)
                             ,"disease" = c("AD","AD","AD", "CTL","CTL","CTL") )
    
    Here's a fancier way to do it:
    group_names_automatically_computed <- ifelse(grepl("^mt", colnames(countMat)), yes="ALZ", no="CONTROL")
    # Above: checks to see if the column name starts with 'mt' or not. The 'mt' samples are the "Alzheimer's" samples.
    # Result:  [1] "ALZ" "ALZ" "ALZ" "CTL" "CTL" "CTL"    <-- automatically figures out groups from the sample names
    

    Model Matrix

    Now we need to set up the model matrix, A.K.A. the "hey edgeR, this is which sample belongs to which group" matrix. You can also use it to set up more complicated experimental scenarios (like "drug X vs control, and also day 10 vs day 20 samples").

    Here, we just have two groups: samples from patients with Alzheimer's disease and "control" samples without the disease.

    We need to tell edgeR which samples belong to which groups, and also which groups we want to compare. To do that, we'll use an R "model formula." It's just called a "formula" in R, but don't mistake it for an algebraic formula!

    Perils of R fomulas: R 'model formulas' have an extremely strange method of specification that is unlike anything in other programming langauges. You can get some info with by typing: help('formula'). The ONE most crucial thing about model formulas in R is that they are NOT ALGEBRAIC EQUATIONS, even though they look like they should be. Here's a web page that describes them in better detail than R's documentation: http://ww2.coastal.edu/kingw/statistics/R-tutorials/formulae.html.

    Code to generate the model matrix from our list of groups:

    require("edgeR")
    
    str(metadata)      # View the underlying STRucture with "str"
    
    print(metadata)
    
    #   sample disease
    # 1 mtBC02      AD
    # 2 mtBC06      AD
    # 3 mtBC07      AD
    # 4 wtBC04     CTL
    # 5 wtBC05     CTL
    # 6 wtBC08     CTL
    
    group   <- metadata[, "disease"]
    design  <- stats::model.matrix( ~ group ) # <- "~ THING" is an R formula
    

    Optional details about R model formulas

    Model formula info: if you had multiple factors to consider, you could specify them using the "~" notation here.

    For example, if you had cell growth as a factor of both gene-deletion-mutation state ("mutant_status") and also whether or not a plate of cells was given Drug X or no drug ("drug_status"), you would have:

    design <- model.matrix( ~  mutant_status + drug_status )
           # mutant_status and drug_status are both vectors of TRUE / FALSE
           # (They are the same length, also—one vector element per experimental sample.)
    

    But actually, maybe the effect of the drug changes based on the gene deletion! Like, say drugX only affects mutant cells. Then that means the drug effect interacts with the gene-deletion effect. So then, we'd really need:

    design <- model.matrix( ~  mutant_status + drug_status + mutant_status:drug_status)
                 # That last "thing1:thing2" term is the "interaction term"
                 # Theoretically you wouldn't need it if the effect of the drug
                 # and of the mutation were totally unrelated and independent of each other.
    
    Wikipedia examples of interaction terms:
    • "Interaction between adding sugar to coffee and stirring the coffee. Neither of the two individual variables has much effect on sweetness but a combination of the two does."
    • "Interaction between adding carbon to steel and quenching. Neither of the two individually has much effect on strength but a combination of the two has a dramatic effect."

    Let's check out this model matrix we just made.

    print(design)  # What does 'design' look like now? Note the column names!
    
    > design
      (Intercept) groupCTL
    1           1        0
    2           1        0
    3           1        0
    4           1        1
    5           1        1
    6           1        1
    
    Note that the "CTL" column may be slightly confusing, here, the samples with "CTL" have a 1 and the "AD" samples have a zero. So you might think "0 means AD," but that's not quite accurate. (We will discuss this more later.)

    Optional Section about R column naming

    So what was the point of that group variable above anyway?

    Can't we just omit 'group' entirely and set that 'design' model matrix in one line? Yes, but check out what happens to the column names in the design matrix. This is just an R quirk—functions can actually make use of the exact names of the input variables, not just their values. This is extremely unusual; you would probably have a hard time finding another programming language that operates this way!

    design  <- stats::model.matrix( ~ metadata[, "disease"] )
    print(design)
    
    #    (Intercept)   metadata[, "disease"]CTL  # <-- Awful column names now
    # 1       1                  0
    # 2       1                  0
    # 3       1                  0
    # 4       1                  1
    # 5       1                  1
    # 6       1                  1
    
    colnames(design) <- c("Intercept", "CTL")  # <-- Fix the bizarre column names from earlier
    print(design)                              # <-- Ok, now the column names are reasonable.
    

    Hey, what happened to the "AD" category? We only have CTL now and this "Intercept" thing. Well, the second column is actually "AD or CTL," despite its misleading name. Probably it should be named "Is_this_sample_control" or something, but that's really verbose, so "CTL" will have to do.

    We aren't going to use the Intercept term for anything.


    EdgeR has unusually good documentation and examples for most types of analysis that you would want to do. You can get it in R as a PDF by typing edgeRUsersGuide(), or read the original edgeR paper.
  7. EdgeR

    y   <- edgeR::DGEList(counts=countMat, group=group)    # or group=metadata[, "disease"]
    y   <- edgeR::calcNormFactors(y)
    y   <- edgeR::estimateGLMTrendedDisp(y, design)
    y   <- edgeR::estimateGLMTagwiseDisp(y, design)
    fit <- edgeR::glmFit(y, design)  # "Genewise Negative Binomial Generalized Linear Models"
    lrt <- edgeR::glmLRT(fit, coef=2)
    
    So what we are actually doing with edgeR here?

    EdgeR differential expression output

    Now let's look at the output:

    edgeR::topTags(lrt, n=10)     # Show the top genes by p-value
    head(lrt$table    , n=10)
    
    > topTags(lrt)
    Coefficient:  groupCTL 
                          logFC     logCPM         LR        PValue         FDR
    ENSG00000160202   9.9074697  12.548053  17.813626  2.436322e-05  0.00521373
    ENSG00000183486   5.9215134   5.799361  12.962664  3.177647e-04  0.03400082
    ENSG00000159200  -1.2325853  12.214501  11.865118  5.719515e-04  0.04079921
    ENSG00000159212   4.6759523   5.132394   7.789338  5.255545e-03  0.27147019
    ENSG00000159261   4.4485518   5.056227   7.146401  7.511457e-03  0.27147019
    
    Q: Why does 'topTags' have one more column than lrt$table? (We will talk about generating that extra column below.)

    "Generalized" linear model ≠ "General" linear model.

    Time to plot some basic summary information from the edgeR run:

    fc <- lrt$table$logFC    # log2
    p  <- lrt$table$PValue
    
    plot(x=p, y=fc, pch=19) # Plot fold change (log2 scale) vs P-value
    abline(h=0, col="red")
    # What is each point?
    # Well, that was unsurprising; the genes with higher fold-change usually also have lower P-values
    # When would that NOT be the case?
    
    # Same idea as a "volcano plot"
    

    Optional Section about translucent points

    For scatterplots with a bunch of points (like the plot above), it's sometimes useful for the points to be partially transparent.
    plot(p, fc, pch=19, cex=4, col="#000000")   # <-- Unreadable!
    plot(p, fc, pch=19, cex=4, col="#00000033") # <-- Partially readable!
    
    If #00000033 makes no sense to you, search online for "RGBA" (http://www.w3schools.com/cssref/css_colors_legal.asp)
  8. Using a more naive method for calculating differential expression

    If you wanted to calculate differential expression in Excel, you might be tempted to use a T-test on it. (And it will actually kind of work, despite not being the right test.) You'd eventually find yourself adding some ad-hoc rules to penalize genes with low expression or particularly low variance. Eventually, you might end up re-inventing EdgeR!

    Let's see how things work out if we use the T-test or Wilcoxon test to calculate P-values:

    # We happen to remember that columns 1 through 3 are one group, and 4 through 6 are the other.
    results.ttest    <- apply(countMat, MARGIN=1, function(row) {      t.test(row[1:3], row[4:6]) } )
    results.wilcoxon <- apply(countMat, MARGIN=1, function(row) { wilcox.test(row[1:3], row[4:6]) } )
    
    p.edge        <- lrt$table$PValue
    names(p.edge) <- rownames(lrt$table$PValue)  # Doesn't get the row names by default...
    
    p.t     <- sapply(results.ttest,    "[[", "p.value")
    p.w     <- sapply(results.wilcoxon, "[[", "p.value")
    
    stopifnot(all(names(p.t) == names(p.w)))
    stopifnot(all(names(p.t) == names(p.edge)))
    
    plot(p.edge, p.t, pch=19, col="red", main="T-test vs EdgeR")
    plot(p.edge, p.w, pch=19, col="blue", main="Wilcoxon vs EdgeR")
    plot(p.t   , p.w, pch=19, col="purple", main="Wilcoxon vs T-test")
    
    If only there was some way to plot all three of those comparisons in one plot... perhaps some way that we used earlier today, even.
    pairs(cbind(p.edge, p.t, p.w), pch=19, col="#FF000066")
    

    Various and sundry options.

    Optional: Variable Name Issues in R

    Here's something you may encounter while implementing a formula in R. What is horribly wrong with this reasonable looking R code?
    t <- apply(countMat, MARGIN=1, function(row) { t.test(row[1:3], row[4:6]) })

    or this code?
    T <- apply(countMat, MARGIN=1, function(row) { t.test(row[1:3], row[4:6]) })

    or this code?
    c = sqrt(a**2 + b**2)

    T, t, c (and some other common short names) are existing R functions/values, but R will let you redefine them! So "pi = 3" is totally valid (but not recommended) R code. Interestingly, although "T = 10" is allowed, "TRUE = 10" gives an error.

  9. Adjusting for multiple hypothesis testing

    Q: If tested all 10,000 genes of a particular species, how many "significantly differentially expressed" genes would you expect to meet the P < 0.05 threshold, even if there were no real differences in your experimental conditions? Answer: 10000 * 0.05 = 500. That's a lot of "significant" genes!

    Let's look at a histogram of our P-values:

    p  <- lrt$table$PValue
    sum(p < 0.05)   # 12 genes with P < 0.05
    hist(p, breaks=25, col="gray")
    
    Now to get some adjusted statistics.
    ben <- stats::p.adjust(p, method="BH")  # Benjamini-Hochberg method
    
    bon <- stats::p.adjust(p, method="bonferroni")
                 # "There seems no reason to use the unmodified Bonferroni correction
                 # because it is dominated by Holm's method, which is also valid under
                 # arbitrary assumptions."
    
    "The "BH" (aka "fdr") and "BY" method of Benjamini, Hochberg, and Yekutieli control the false discovery rate, the expected proportion of false discoveries amongst the rejected hypotheses. The false discovery rate is a less stringent condition than the family-wise error rate, so these methods are more powerful than the others." # Bonferroni plots hist(bon, breaks=25, col="orange", main="Bonferroni") plot(p, bon, col="orange", pch=19, main="Bonferroni") # BH plots hist(ben, breaks=25, col="green", main="Benjamini-Hochberg FDR (not a P-value)") plot(p, ben, col="green" , pch=19, main="BH") plot(ben, bon, col="black", pch=19)

    Top: raw P-values. Bottom: Benjamini-Hochberg-adjusted false discovery rates. Not actually P-values anymore! So if you have 1000 genes with FDR < 0.1, that's actually pretty good—the method is suggesting that maybe 900 of your genes are actually biologically relevant.

    Question for you: what are the values returned by "p.adjust" with the "FDR" (Benjamini-Hochberg) method—are they P-values? No! Those values are the estimated false discovery rate at a particular P-value cutoff.

    Question: Could it ever happen that you get ALL values with an adjusted Benjamini-Hochberg FDR value of 1.0? What would that even mean? Yes—this is surprisingly common. It means you have about as many differentially expressed genes in your condition-A-versus-condition-B experiment as you'd expect to get if you just ran "control condition vs some more control condition" with no actual biological effect.

  10. Removing low-count / high-count genes

    It is generally recommended to remove genes with "too low" counts and genes with "too high" counts.

    One single mitochondrial gene can account for up to 10% of total reads in a sample. Why might you want to remove those?

    Let's also try removing the genes with basically no expression. We could use CPM as a cutoff, or raw reads. Probably it's better to set up some formula involving the total number of samples, but for our example, we'll just pick a semi-arbitrary cutoff.

    REQUIRED_TOTAL_RAW_COUNTS_ACROSS_ALL_SAMPLES = 10
    filtMat <- countMat[ rowSums(countMat) >= REQUIRED_TOTAL_RAW_COUNTS_ACROSS_ALL_SAMPLES, ]
      # a subset of genes with "enough" counts. Note that we are not filtering "too high count" genes here.
    

    Also, we keep running the same edgeR code over and over. Maybe we should stop copying and pasting our code, and make a simple function to do it...

    myEdgeR <- function(mat, g, d) {
      # ARGUMENTS: mat = input counts matrix, g = groups, d = design matrix
      z     <- edgeR::DGEList(counts=mat, group=g)
      z     <-        edgeR::calcNormFactors(z   )
      z     <- edgeR::estimateGLMTrendedDisp(z, d)
      z     <- edgeR::estimateGLMTagwiseDisp(z, d)
      z.fit <-                 edgeR::glmFit(z, d)
      z.lrt <-                 edgeR::glmLRT(z.fit, coef=2)
      return(z.lrt)
    }
    
    # Run edgeR on the filtered data. lrt.f = lrt, filtered
    lrt.f <- myEdgeR(filtMat, g=group, d=design)
    
    genes_in_both.vec <- intersect(rownames(lrt$table), rownames(lrt.f$table))
    tab.all           <-   lrt$table[genes_in_both.vec, ] # <-- 'lrt' from way earlier
    tab.filt          <- lrt.f$table[genes_in_both.vec, ]
    
    stopifnot(all(rownames(tab.all) == rownames(tab.filt)))
    

    Will the results change as a result of our filtering?

    plot(tab.all$PValue, tab.filt$PValue, pch=19, col="#FF000044", cex=2)
    abline(a=0,b=1) # add a line at x = y
    

    Similar results in this case.
  11. Finally, output data in tabular form

    Don't forget to actually save your data in a reasonable numeric form!
    write.table(countMat, file="OUTPUT_Count_Matrix.txt", col.names=NA, row.names=T, quote=F, sep="\t")
    write.table(lrt$table, file="OUTPUT_LRT_table.txt", col.names=NA, row.names=T, quote=F, sep="\t")
    
    Actually that's not a very good way to do it. You should probably put all your output data in one table, so it isn't a total nightmare to look at it later. For example:
    same_order        <- rownames(countMat)
    
    fdr_in_same_order <- p.adjust(lrt$table[same_order, "PValue"], method="BH")
    
    write_me.df       <- data.frame("COUNTS"=countMat[same_order, ]
                                           , lrt$table[same_order, ]
                                     , "FDR"=fdr_in_same_order)
    
    write.table(write_me.df, file="OUTPUT_ALL_table.txt", col.names=NA, row.names=T, quote=F, sep="\t")
    

    Note: be sure all the rows are in the same order before smooshing them together! Incorrectly mashing columns together is the NUMBER ONE cause of messed up data in R. (Similar to the popular "I accidentally only sorted one column, not the entire table" issue in Excel).

    Q: What is the worst way of technically making your data available, but in a format that no one can reasonably use? Text -> Print as PDF. Many journals will actually accept data in this format, for some reason.

    Q: When someone inevitably opens your output table in Excel, what is going to happen to the genes 'Sept7' and 'Dec2' in your "common name" gene annotation? They will get converted to something like "September 7, 1904" and "December 2, 1904".

    P.S. Never use gene common names as the main identifier for a gene. (Of course, you should include them as additional human-readable annotation on top of the less immediately obvious "ENSG123456789".)

  12. Different methods will result in surprisingly different results

    But be aware that the following things will (possibly dramatically) change your results:

    Question for you: You have data from two other labs. One lab used Cufflinks to obtain per-gene counts from 20 samples, but the other lab ran Subread FeatureCounts on their 15 samples. Your PI has asked you to combine samples from both labs into a single table. How will you deal with this? Answer (highlight the black box -->): the best thing to do is to get the raw FASTQ sequence files (or aligned BAM files) and re-run the counting process yourself. Although you might be able to use the 'RUVSeq' package to normalize the counts and just deal with the confounding effect of the different methods. But that's a really bad solution as compared to just re-running the samples with the same method!

    P.S. Here is something you might expect to be a big deal that (surprisingly) usually isn't:

  13. Homework!

    Note: most of these questions are rhetorical, except the ones in BOLD RED.

    How would things change if we had slightly different input data?

    Homework question #1 of 3 ("multiply all counts by 10")

    Recall that this was the "top" gene when we ran topTags(lrt, n=1):
    topTags(lrt, n=1)
    #                   logFC   logCPM       LR       PValue        FDR
    # ENSG00000160202 9.90747 12.54805 17.81363 2.436322e-05 0.00521373
    

    What happens if we multiply ALL the counts (that ones that made it this far through the filtering process) by 10 and re-run our edgeR code?

    Will that change the results? How?

    Specifically, what is the p-value for this same gene ID ("ENSG00000160202") if you multiply all the counts by 10 and re-run edgeR?

    Your answer: when we multiply all counts by 10, the p-value for ENSG00000160202 becomes _________________ (your answer here)

    Homework question #2 of 3 ("Multiplying counts for just sample "mtBC06" by 100x")

    What will happen if we multiply the counts for just sample "mtBC06" (the second column) by 100? Is that going to completely change the genes that show up as differentially-expressed?

    Your answer: when we just multiply all counts in "mtBC06" by 100, the raw p-value for ENSG00000160202 becomes _________________, and the new most-significant P-value is for the gene ____________.

    Unusually, this is actually an error that is detected automatically---"Error in estimateGLMTagwiseDisp.DGEList(y3b, design): No common.dispersion found in data object. Run estimateGLMCommonDisp first." (The per-gene dispersion estimate makes use of the more general estimate, so you need to run common dispersion OR tagwise dispersion first.)

    Homework question #3 of 3 ("multtest")

    For this part, you'll need to install the 'multtest' package from Bioconductor.

    source("http://bioconductor.org/biocLite.R"); biocLite("multtest")  # "Author: K. Pollard" in the description
    
    Now that that's installed, let's look at using PERMUTATION testing on our data, with N = 3 diseased samples and N = 3 control samples.
    require("multtest")
    
    # We're going to use the same 'countMat' and 'group' variables from above
    
    countMatSuperDiff            <- countMat
    countMatSuperDiff[1:10, 1:3] <- countMatSuperDiff[1:10, 1:3] * 25   # Multiply the counts for 10 genes
                                                                        # by 25 in the "AD" condition.
    
    binary_group <- ifelse(group == "AD", 0, 1)         # The "outcome" must be all 0s or 1s for t-test
                                                        # Effectively the same as "group <- c(0,0,0,1,1,1)"
    perm <- multtest::MTP(X=countMatSuperDiff, Y=binary_group, B=10, method="sd.maxT", null="perm")
              # X = A matrix, data.frame or ExpressionSet containing the raw data.
              # Y = [...] the outcome of interest. This may be class labels [...] or a continuous [...] dependent variable...
              # B = number of permutations
    
    Now we've permuted the column (sample) labels. Let's see how the raw values stack up to the permuted values.
    #      Raw p-value: perm@rawp
    # Permuted p-value: perm@adjp
    plot(jitter(perm@rawp), jitter(perm@adjp), xlim=c(0,1), ylim=c(0,1), pch=19, col="#FF000033", cex=2)
    
    

    Your answer: the maximum number of (valid) permutations for this data (the "B" parameter in MTP) is ___________. This makes sense, since it is also the result of the value of ______ choose _____ (e.g. "N choose M"). The reason this that permutation works poorly on this data is that _______________.


    "Multtest" raw (X-axis) vs permuted P-value (Y-axis). 3 replicates per experimental condition. (Points jittered to prevent overlap.)

    Homework supplement for no-extra-credit ("pheatmap")

    What will happen if we multiply the counts for just sample "mtBC06" (the second column) by 100? Is that going to completely change the genes that show up as differentially-expressed?

    Does 'pheatmap' sound more like it should be pronounced more like "P, heatmap" to you, or "feetmap?" Your answer: ____________.