Read in the genes-by-cells read count matrix sc.mat

library(Matrix)
## Warning: package 'Matrix' was built under R version 4.0.5
library(Rtsne)
## Warning: package 'Rtsne' was built under R version 4.0.2
library(ggplot2)

load("visualization.RData")


Take a look at a few rows and columns of the matrix

sc.mat[1:3,1:3]
## 3 x 3 sparse Matrix of class "dgCMatrix"
##            AAACATACCACTCC-1 AAACATACCGCTAA-1 AAACATACTAACGC-1
## MIR1302-10                .                .                .
## FAM138A                   .                .                .
## OR4F5                     .                .                .


How many cells?

ncol(sc.mat) 
## [1] 3478


How many genes?

nrow(sc.mat) 
## [1] 32738


Produce a summary of counts for the first cell

summary(sc.mat[,1])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.02627  0.00000 57.00000


Produce a summary of counts for the APOL2 gene

summary(sc.mat[grep("APOL2",rownames(sc.mat)),])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.00345 0.00000 1.00000


Get just the genes that are expressed in at least 10 cells

sc.mat.expr=sc.mat[which(rowSums(sc.mat>0) >= 10),]


How many genes are retained after the filter of >=10 cells?

nrow(sc.mat.expr)
## [1] 9170


Normalize to median depth for each cell and take the log * hint: add a pseudocount of 1 to avoid log(0)

n.umis=median(colSums(sc.mat.expr))
sc.mat.norm=log2(t(t(sc.mat.expr)*n.umis/colSums(sc.mat.expr)) + 1)

#or in two steps:
#nf=colSums(sc.mat.expr)/n.umis
#sc.mat.norm=log2(t(t(sc.mat.expr)/nf) + 1)


Visualize distribution of normalized expression values for first cell

hist(sc.mat.norm[,1])


Visualize distribution of normalized expression values for first gene

hist(sc.mat.norm[1,])



*NOTE: prcomp and Rtsne are slow on a matrix this big, so you can use the saved data frame sc.df that was loaded with the input count matrix two dimensions for PCA and two for tSNE were saved as columns “pc1”, “pc2”, “tsne1”, “tsne2”


Run PCA on the normalized expression matrix

pcs=prcomp(t(sc.mat.norm), rank.=30)


Visualize PCA results

sc.df$pc1=pcs$x[,1]
sc.df$pc2=pcs$x[,2]
ggplot(sc.df, aes(x=pc1, y=pc2, colour=cell.type)) + geom_point()


Run tsne on the normalized expression matrix using PCA as starting values

tsne=Rtsne(pcs$x)


Visualize TSNE results

sc.df$tsne1=tsne$Y[,1]
sc.df$tsne2=tsne$Y[,2]
ggplot(sc.df, aes(x=tsne1, y=tsne2, colour=cell.type)) + geom_point()