Read in the genes-by-cells read count matrix sc.mat
library(Matrix)
## Warning: package 'Matrix' was built under R version 4.0.5
library(Rtsne)
## Warning: package 'Rtsne' was built under R version 4.0.2
library(ggplot2)
load("visualization.RData")
Take a look at a few rows and columns of the matrix
sc.mat[1:3,1:3]
## 3 x 3 sparse Matrix of class "dgCMatrix"
## AAACATACCACTCC-1 AAACATACCGCTAA-1 AAACATACTAACGC-1
## MIR1302-10 . . .
## FAM138A . . .
## OR4F5 . . .
How many cells?
ncol(sc.mat)
## [1] 3478
How many genes?
nrow(sc.mat)
## [1] 32738
Produce a summary of counts for the first cell
summary(sc.mat[,1])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.02627 0.00000 57.00000
Produce a summary of counts for the APOL2 gene
summary(sc.mat[grep("APOL2",rownames(sc.mat)),])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.00345 0.00000 1.00000
Get just the genes that are expressed in at least 10 cells
sc.mat.expr=sc.mat[which(rowSums(sc.mat>0) >= 10),]
How many genes are retained after the filter of >=10 cells?
nrow(sc.mat.expr)
## [1] 9170
Normalize to median depth for each cell and take the log * hint: add a pseudocount of 1 to avoid log(0)
n.umis=median(colSums(sc.mat.expr))
sc.mat.norm=log2(t(t(sc.mat.expr)*n.umis/colSums(sc.mat.expr)) + 1)
#or in two steps:
#nf=colSums(sc.mat.expr)/n.umis
#sc.mat.norm=log2(t(t(sc.mat.expr)/nf) + 1)
Visualize distribution of normalized expression values for first cell
hist(sc.mat.norm[,1])
Visualize distribution of normalized expression values for first gene
hist(sc.mat.norm[1,])
*NOTE: prcomp and Rtsne are slow on a matrix this big, so you can use the saved data frame sc.df that was loaded with the input count matrix two dimensions for PCA and two for tSNE were saved as columns “pc1”, “pc2”, “tsne1”, “tsne2”
Run PCA on the normalized expression matrix
pcs=prcomp(t(sc.mat.norm), rank.=30)
Visualize PCA results
sc.df$pc1=pcs$x[,1]
sc.df$pc2=pcs$x[,2]
ggplot(sc.df, aes(x=pc1, y=pc2, colour=cell.type)) + geom_point()
Run tsne on the normalized expression matrix using PCA as starting values
tsne=Rtsne(pcs$x)
Visualize TSNE results
sc.df$tsne1=tsne$Y[,1]
sc.df$tsne2=tsne$Y[,2]
ggplot(sc.df, aes(x=tsne1, y=tsne2, colour=cell.type)) + geom_point()