Last updated: 2017-10-19
Code version: 861184a54d0981caa0791fd3a31ddee6995fa00b
We evaluated the methods that have been described in peer-reviewed papers. This list does not include Monocle and Seurat (which are included in Soneson and Robinson 2017).
| Bulk RNA-seq | Model Input | Default normalization | Pseudo-count |
|---|---|---|---|
| edgeR | Count | Weighted trimmed mean of M-values (TMM) | None |
| DESeq2 | Count | Median ratio (MR) | None |
| limmaVoom | log2 counts | Counts per million (CPM) | 1 |
| Single-cell | Model Input | Default normalization | Pseudo-count |
|---|---|---|---|
| BPSC | Count | CPM or FPKM recommended | None |
| MAST | log2 counts | CPM recommended | Adaptive thresholding |
| ROTS | Count | Normalization recommended | None |
| SCDE | Count | RPM recommended | None |
D3E may need to be included later. It is Python-based software.
limmaVoom is the only method that explicitly applies a pseudo-count. MAST is another log-count based method, which models expression as a two-part process generating “non-drop-outs” and “drop-outs” - the cutoff between the two is decided arbitrarily.
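To make the two-part idea concrete, here is a minimal Python sketch that summarizes one gene as a detection rate plus a mean log-expression among detected cells. The function name `two_part_summary` and the fixed `cutoff` are hypothetical and for illustration only; this is not MAST's adaptive thresholding or its model fit.

```python
def two_part_summary(log_expr, cutoff=0.0):
    """Summarize one gene as (detection rate, mean log-expression among detected cells).

    log_expr: log-scale expression of one gene across cells.
    cutoff:   values at or below this are treated as drop-outs
              (hypothetical fixed cutoff, not MAST's adaptive thresholding).
    """
    detected = [x for x in log_expr if x > cutoff]
    rate = len(detected) / len(log_expr)
    # The continuous part is only defined when at least one cell is detected.
    mean_detected = sum(detected) / len(detected) if detected else float("nan")
    return rate, mean_detected

# One gene measured in six cells; half of the cells are drop-outs.
gene = [0.0, 0.0, 3.0, 5.0, 0.0, 4.0]
rate, mean_detected = two_part_summary(gene)
```

Moving the cutoff changes both parts of the summary at once, which is why the choice of cutoff matters for this class of methods.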
Extracted from: https://github.com/hms-dbmi/scw/blob/d57755ca045260e9368540850854dd11ef2fa834/scw2016/tutorials/batcheffects/Hicks.Rmd
Potential limiting factors in experimental design
A. Bulk RNA-seq literature in the past decade has established the need for
Within-sample normalization to adjust for GC content and transcript length, and
Between-sample normalization to adjust for differences in sampling depth.
B. The bulk RNA-seq normalization methods may not work well for single-cell RNA-seq data (?), especially for datasets with many zeros or with highly variable genes (there is a paper reporting that bulk normalization methods perform poorly when many genes are DE; which one?).
C. The single-cell protocols can be described in these general steps:
1. Cell lysis
2. Reverse transcription
3. PCR amplification
4. Dilution
D. Does the single-cell protocol itself introduce additional sources of variation? Or is the extra variation perhaps inherent cell-to-cell variation in total mRNA content?
E. Description of existing methods
Between-sample normalization
| Evaluated methods | Typical use for | Spike-in control |
|---|---|---|
| Counts-per-million | bulk RNA-seq | NA |
| TMM | bulk RNA-seq | NA |
| RLE (DESeq) | bulk RNA-seq | NA |
| SCnorm | single-cell RNA-seq | Not required, but included an option. |
| scran | single-cell RNA-seq | Not required, but included an option. |
| BASiCS | single-cell RNA-seq | Required |
| Census | single-cell RNA-seq | Not required, but included an option. |
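As a rough illustration of the bulk between-sample methods in the table above, the Python sketch below computes CPM and a simplified TMM-style scaling factor. The function names are hypothetical, and `simple_tmm_factor` is a deliberately reduced version: it takes an unweighted trimmed mean of log-ratios only, whereas edgeR's TMM also trims on average expression and applies precision weights.

```python
import math

def cpm(counts):
    """Scale one sample's counts to counts-per-million."""
    lib_size = sum(counts)
    return [c / lib_size * 1e6 for c in counts]

def simple_tmm_factor(sample, reference, trim=0.3):
    """Simplified TMM-style factor of `sample` relative to `reference`.

    Assumption: unweighted trimmed mean of log2 ratios (M-values) of
    library-size-scaled counts, with trim < 0.5 on each side.
    """
    ls, lr = sum(sample), sum(reference)
    m_values = sorted(
        math.log2((s / ls) / (r / lr))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0  # log-ratios are undefined when either count is zero
    )
    if not m_values:
        raise ValueError("no gene is non-zero in both samples")
    k = int(len(m_values) * trim)  # number of genes trimmed from each tail
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))

# Four genes are doubled by sequencing depth; the fifth is strongly up-regulated.
reference = [10, 20, 30, 40, 100]
sample = [20, 40, 60, 80, 1000]
factor = simple_tmm_factor(sample, reference)
effective_lib = sum(sample) * factor
```

In this toy example the single highly expressed gene inflates the raw library size of `sample`, while the trimmed factor brings the effective library size back to twice that of `reference`, matching the genes that did not change.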
Within-sample normalization
| Evaluated methods | |
|---|---|
| TPM | |
| FPKM | |
| RPKM | |
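For reference, a minimal Python sketch of how the length-normalized units in the table above are usually defined. The function names and the toy counts and lengths are illustrative assumptions; in practice these values should come from quantification software rather than from raw read counts.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    return [c / (l / 1e3) / (total / 1e6) for c, l in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]

# Toy example: counts are exactly proportional to transcript length,
# so all three genes end up with the same length-normalized value.
counts = [100, 200, 700]
lengths = [1000, 2000, 7000]
fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
```

TPM values always sum to one million within a sample, whereas the sum of FPKM values varies from sample to sample.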
Notes.
CPM: the first step in the analysis; adjusts for between-sample differences in library size.
TMM, MR: adjust for variation in library size due to differences in gene expression distribution.
SCnorm: estimates the count-depth relationship within each biological condition, then normalizes again across the two conditions using a different procedure (see function scaleNormMultCont).
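RLE: this method requires adding a pseudo-count for sparse data, because the per-gene geometric mean it relies on is zero whenever any sample has a zero count.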
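The dependence of RLE on a pseudo-count can be seen in a small Python sketch of median-ratio size factors. This is a simplified illustration under stated assumptions, not the DESeq or DESeq2 implementation; the function name and the `pseudo` argument are hypothetical.

```python
import math
from statistics import median

def median_ratio_size_factors(matrix, pseudo=0.0):
    """Median-ratio (RLE-style) size factors for a genes x samples count matrix.

    pseudo: constant added to every count (hypothetical argument); with
            pseudo=0, genes with any zero count cannot be used.
    """
    n_samples = len(matrix[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in matrix:
        vals = [c + pseudo for c in row]
        if min(vals) <= 0:
            continue  # geometric mean would be zero, so the gene is skipped
        log_gm = sum(math.log(v) for v in vals) / n_samples
        for j, v in enumerate(vals):
            log_ratios[j].append(math.log(v) - log_gm)
    if not log_ratios[0]:
        raise ValueError("no gene is non-zero in every sample; add a pseudo-count")
    return [math.exp(median(r)) for r in log_ratios]

# Genes x samples; two of the four genes contain a zero.
counts = [[10, 20], [0, 40], [30, 60], [5, 0]]
sf_no_pseudo = median_ratio_size_factors(counts)            # uses 2 genes only
sf_pseudo = median_ratio_size_factors(counts, pseudo=1.0)   # uses all 4 genes
```

With many zeros, as in single-cell data, the set of genes that are non-zero in every cell can become empty, at which point the sketch raises an error unless a pseudo-count is supplied.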
scran: this method has an option for large numbers of cells. In this case, cells with similar gene expression profiles are clustered together, forming a pseudo-cell that is used to compute the library size normalization factor.
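The pooling idea described for scran can be sketched in Python as follows. This only shows how summing counts within a cluster yields a less sparse pseudo-cell; it is not scran's deconvolution procedure, and the function name and cluster labels are hypothetical.

```python
def pool_by_cluster(cell_counts, labels):
    """Sum per-gene counts over the cells in each cluster to form pseudo-cells.

    cell_counts: list of per-cell count vectors (one entry per gene).
    labels:      cluster label for each cell (assumed to be given).
    """
    pooled = {}
    for counts, label in zip(cell_counts, labels):
        if label not in pooled:
            pooled[label] = [0] * len(counts)
        pooled[label] = [a + b for a, b in zip(pooled[label], counts)]
    return pooled

# Four cells, three genes, two clusters.
cells = [[0, 2, 1], [1, 0, 3], [5, 0, 0], [4, 1, 0]]
labels = ["a", "a", "b", "b"]
pseudo_cells = pool_by_cluster(cells, labels)
```

Each pseudo-cell has at most as many zero entries as any of its member cells, which is what makes size-factor estimation on pooled counts more stable.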
About FPKM, RPKM and TPM: “These methods are not applicable to our dataset since the end of the transcript which contains the UMI was preferentially sequenced. Furthermore in general these should only be calculated using appropriate quantification software from aligned BAM files not from read counts since often only a portion of the entire gene/transcript is sequenced, not the entire length. If in doubt check for a relationship between gene/transcript length and expression level.” (extracted from hemberg-lab.github.io)
F. Implementation details
I like Po’s model-based approach to filtering. This has not been done yet in the literature…
Many things can be done… but for now, I’ll use the simple rule of including only samples/features detected as expressed, at least to keep the filtering criteria consistent across datasets and evaluated methods.
A. Feature-level
Genes expressed in at least X percent of cells
For UMI data, should we correct for collision?
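The feature-level rule of keeping genes expressed in at least X percent of cells can be sketched in Python as below. The function name, the `min_fraction` default and the `min_count` threshold for calling a gene "expressed" are illustrative assumptions.

```python
def filter_genes(matrix, gene_names, min_fraction=0.1, min_count=1):
    """Return names of genes detected in at least `min_fraction` of cells.

    matrix:     genes x cells count matrix (list of rows).
    min_count:  a gene counts as expressed in a cell when its count is
                at least this value (hypothetical threshold).
    """
    keep = []
    for name, row in zip(gene_names, matrix):
        fraction = sum(1 for c in row if c >= min_count) / len(row)
        if fraction >= min_fraction:
            keep.append(name)
    return keep

# Three genes measured in four cells.
matrix = [[0, 0, 0, 0], [1, 0, 0, 0], [3, 2, 0, 5]]
genes = ["g1", "g2", "g3"]
kept = filter_genes(matrix, genes, min_fraction=0.5)
```

Using the same `min_fraction` for every dataset is one way to keep the filtering criteria consistent across datasets and evaluated methods.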
B. Sample-level
Gene expression range as expected from the range of ERCC - within-sample variation is expected to be the same between endogenous genes and ERCC genes
Can we use spike-in variation to predict endogenous gene variation?
Percent features expressed in spike-in is greater than percent features expressed in endogenous
What if there is no spike-in? Between-feature variation?
Cell variation and coverage?
Proportion of genes expressed
total mRNA recovery - library size
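Several of the sample-level criteria above reduce to simple per-cell summaries. The Python sketch below computes library size and the fraction of detected features separately for endogenous genes and spike-ins; the function name, the output keys and the `is_spike_in` flags are hypothetical.

```python
def cell_qc(counts, is_spike_in):
    """Per-cell QC summaries for one cell.

    counts:      per-feature counts for the cell.
    is_spike_in: parallel list of booleans marking spike-in features.
    """
    endo = [c for c, s in zip(counts, is_spike_in) if not s]
    spike = [c for c, s in zip(counts, is_spike_in) if s]

    def detected_fraction(values):
        # Fraction of features with a non-zero count; NaN when there are none.
        return sum(1 for c in values if c > 0) / len(values) if values else float("nan")

    return {
        "library_size": sum(endo),
        "endogenous_detected": detected_fraction(endo),
        "spike_in_detected": detected_fraction(spike),
    }

# One cell with four endogenous genes and two spike-in features.
counts = [5, 0, 0, 3, 10, 7]
flags = [False, False, False, False, True, True]
qc = cell_qc(counts, flags)
passes = qc["spike_in_detected"] > qc["endogenous_detected"]
```

The final comparison mirrors the criterion that the percent of features expressed in spike-ins should exceed that in endogenous genes; without spike-ins, only the library size and the endogenous detection fraction are available.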
C. Existing QC pipelines
scater
edgeR uses a moderated prior count (moderated because it depends on library size)
In voom/edgeR/limma, the pseudo-count is added after normalizing for sample depth
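A prior count that depends on library size can be sketched in Python as follows. The scaling used here (prior proportional to library size, with twice the scaled prior added to the library size) reflects my understanding of how edgeR computes log-CPM and should be treated as an assumption; the function name and default are hypothetical.

```python
import math

def moderated_log2_cpm(samples, prior_count=2.0):
    """log2-CPM with a prior count scaled by each sample's library size.

    samples: list of per-sample count vectors (one entry per gene).
    Assumption: prior = prior_count * lib_size / mean(lib_size), and the
    library size is increased by 2 * prior before computing CPM.
    """
    lib_sizes = [sum(s) for s in samples]
    mean_lib = sum(lib_sizes) / len(lib_sizes)
    out = []
    for counts, lib in zip(samples, lib_sizes):
        prior = prior_count * lib / mean_lib
        adj_lib = lib + 2 * prior
        out.append([math.log2((c + prior) / adj_lib * 1e6) for c in counts])
    return out

# Two samples with a tenfold difference in sequencing depth.
samples = [[0, 50, 50], [0, 500, 500]]
logcpm = moderated_log2_cpm(samples)
```

Because the prior grows with library size, a zero count maps to the same log-CPM in both samples, which is equivalent to adding the pseudo-count after normalizing for sample depth.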