Experimental datasets

Last updated: 2017-10-19

Code version: 861184a54d0981caa0791fd3a31ddee6995fa00b

Datasets

UMI count data

Study	Organism	Protocol	UMI	Spike-in	Cell type	No. cells
Jaitin et al. (2014)	Mouse	MARS-Seq	Y	ERCC	Spleen	~4,500
Zeisel et al. (2015)	Mouse	STRT-Seq, C1	Y	ERCC	Brain	3,005
Tung et al. (2016)	Human	SMART-Seq, C1	Y	ERCC	iPSC
Klein et al. (2015)	Mouse	inDrop	Y	Y	ESC	2,717

Read count data

Study	Organism	Protocol	UMI	Spike-in	Cell type	No. cells
Guo et al. (2015)	Human	Tang et al. (2009)	N	ERCC	PGC	328
Engel et al. (2016)	Mouse	SMART2-Seq+	N	N	T cell	203
Deng et al. (2014)	Mouse	SMART-Seq	N	ArrayControl	ESC
Kumar et al. (2014)	Mouse	SMART-Seq	N	N	ESC	268

Notes:
1. All read count data were downloaded from Conquer.
2. SMART2-Seq+ was adapted from the original SMART2-Seq protocol.

Descriptions

UMI count data

Jaitin et al. (2014). Mouse spleen cells were collected using MARS-Seq. Data were downloaded from NCBI Gene Expression Omnibus accession GSE54006, including processed count data (GSE54006_umitab.txt) and the corresponding single-cell sample information (GSE54006_experimental_design.txt).

In the paper, X cell types were identified, etc. A total of XX cells are included, with XX endogenous genes and XX ERCC genes. Single-cell samples were obtained after FACS sorting and filtering of spleen cells. The authors designed the study to learn the cell types in spleen cells. Gene-specific information: frequency of cells detected. Cell-specific information: proportion of genes detected.
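The two summaries named above (frequency of cells in which a gene is detected, and proportion of genes detected in a cell) can be sketched as follows. This is a minimal illustration, assuming a genes-by-cells count matrix stored as a list of rows and assuming that "detected" means a count greater than zero; the function name `detection_rates` and the toy matrix are hypothetical and not part of the original analysis.

```python
def detection_rates(counts):
    """Return (gene_rate, cell_rate) for a genes-by-cells count matrix.

    gene_rate[i]: fraction of cells in which gene i has a count > 0.
    cell_rate[j]: fraction of genes with a count > 0 in cell j.
    Assumption: "detected" means count > 0.
    """
    n_genes = len(counts)
    n_cells = len(counts[0]) if n_genes else 0
    if n_genes == 0 or n_cells == 0:
        return [], []
    # Gene-specific: scan each row (one gene across all cells).
    gene_rate = [sum(1 for c in row if c > 0) / n_cells for row in counts]
    # Cell-specific: scan each column (one cell across all genes).
    cell_rate = [
        sum(1 for row in counts if row[j] > 0) / n_genes
        for j in range(n_cells)
    ]
    return gene_rate, cell_rate


# Toy matrix: 2 genes (rows) by 3 cells (columns).
toy_counts = [
    [0, 3, 1],
    [0, 0, 2],
]
gene_rate, cell_rate = detection_rates(toy_counts)
```

For the toy matrix, the first gene is detected in two of three cells and the third cell detects both genes.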

Zeisel et al. (2015). Mouse brain cells were collected using STRT-Seq. In the experiment, a total of 76 96-well C1 Fluidigm runs were performed, yielding 3,005 single cells that passed quality filtering criteria. Data were downloaded from NCBI Gene Expression Omnibus accession GSE60361, including processed count data (expression_mRNA_17-Aug-2014.txt) and the corresponding single-cell sample information (GSE60361_C1-3005-Expression).

In the paper, X cell types were identified, etc.

Klein et al. (2015). Mouse embryonic stem cells collected by inDrop. We include four of the eight inDrop runs on embryonic stem cells. Data were downloaded from NCBI Gene Expression Omnibus accession [GSE65525](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE65525) (GSM1599494_ES_d0_main.csv, GSM1599497_ES_d2_LIFminus.csv, GSM1599498_ES_d4_LIFminus.csv, GSM1599499_ES_d7_LIFminus.csv). These data include unfiltered single-cell samples and genes filtered for UMI count.

Tung et al. (2016). Human induced pluripotent stem cells. X single cells were collected from three iPSC cell lines by SMART-Seq. The count data for single cells that passed quality control were downloaded from the GitHub repository (molecules-filter.txt, https://github.com/jdblischak/singleCellSeq), with the original GEO accession GSE77288.

Read count data

Guo et al. (2015). Human primordial germ cells and somatic cells. In the experiment, 242 primordial germ cells and 81 somatic cells were collected by the protocol used in Tang et al. (2009). The count data were downloaded from the Conquer portal (http://imlspenticton.uzh.ch:3838/conquer/) (GSE63818-GPL16791.rds), with the original GEO accession GSE63818.

Engel et al. (2016). Mouse natural killer T cells by SMART2-Seq+. In the experiment, 203 single cells were collected from purified populations of thymic natural killer T cell subsets (NKT0, NKT1, NKT2, NKT17). The count data were downloaded from the Conquer portal (http://imlspenticton.uzh.ch:3838/conquer/) (GSE74596.rds), with the original GEO accession GSE74596.

Deng et al. (2014). Mouse pre-implantation embryonic stem cells by SMART-Seq. In the experiment, 286 single cells were collected across pre-implantation stages, from zygote to blastocyst. The count data were downloaded from the Conquer portal (http://imlspenticton.uzh.ch:3838/conquer/) (GSE45719.rds), with the original GEO accession GSE45719.

Kumar et al. (2014). Mouse embryonic stem cells by SMART-Seq. In the experiment, 417 single cells cultured in different serum conditions were collected. The count data were downloaded from the Conquer portal (http://imlspenticton.uzh.ch:3838/conquer/) (GSE60749-GPL13112.rds, GSE60749-GPL17021.rds), with the original GEO accession GSE60749.

Remarks

The studies are prioritized by the availability of UMI, spike-in, and then processed count data.
Concern: In single-cell RNA-seq studies that use experimental data from a complex tissue, the collected single-cell samples usually consist of multiple cell populations. Should we first cluster the single-cell samples and then extract some cell populations for simulated data? Or should we just ignore this fact and take subsamples of cells? This should not matter for simulation experiments where we permute sample labels for each gene, but it may matter when we use the same permuted labels across genes.
For now, we use conquer-curated data for single-cell RNA-seq studies that did not use UMI. These datasets include all of the single cell samples before quality control filtering.
For studies that use UMI, we use the processed count table generated in the original study. Hence, some of these datasets may include only higher-quality single-cell samples. However, all of the studies include all of the genes that passed the mapped-reads threshold.
In terms of data filtering, the main concern is whether the threshold for low expression differs between studies, since we are mainly interested in differential expression analysis. Filtering criteria for single-cell samples are less of a concern (unless we care about finding cell populations), although it is important to note that different filtering criteria may result in a different threshold for expression variable…
Measurement unit: some of the datasets have both transcript-level and gene-level counts, while some have only gene-level counts. For single-cell RNA-seq datasets in particular, most are processed count data downloaded from GEO and have only gene-level counts available. However, in order to perform within-sample normalization for transcript length and to apply Census, we need the transcript-level information.
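The distinction raised in the concern above, permuting sample labels separately for each gene versus reusing one permutation across all genes, can be made concrete with a short sketch. This is only an illustration under stated assumptions: the function names, the genes-by-cells list layout, and the two-group labels are hypothetical and are not taken from the project's simulation code.

```python
import random


def permute_labels_per_gene(n_genes, labels, seed=0):
    """Draw an independent permutation of the sample labels for every gene.

    Any cell-population structure is broken differently for each gene.
    """
    rng = random.Random(seed)
    permuted = []
    for _ in range(n_genes):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        permuted.append(shuffled)
    return permuted


def permute_labels_shared(n_genes, labels, seed=0):
    """Draw one permutation of the sample labels and reuse it for all genes.

    Cells keep their full expression profiles, so cell-population structure
    can end up unevenly split between the two permuted groups.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return [list(shuffled) for _ in range(n_genes)]


# Hypothetical example: 6 cells in two groups, 4 genes.
labels = ["A", "A", "A", "B", "B", "B"]
per_gene = permute_labels_per_gene(4, labels, seed=1)
shared = permute_labels_shared(4, labels, seed=1)
```

In the shared scheme every gene sees the same grouping of cells, which is why the composition of cell populations in each permuted group may matter there.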

TPM versus RPKM and FPKM

To compute TPM:

1. Divide the read counts by the length of each gene in kilobases. This gives you reads per kilobase (RPK).
2. Sum all the RPK values in a sample and divide this number by 1,000,000. This is your “per million” scaling factor.
3. Divide the RPK values by the “per million” scaling factor. This gives you TPM.
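The TPM steps above can be sketched in a few lines. This is a minimal illustration for a single sample, assuming gene lengths are given in base pairs; the function names and the toy counts are hypothetical. For contrast with the section title, an RPKM function is included using the usual definition (scale by sequencing depth first, then by gene length), which the text above does not spell out.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert one sample's read counts to TPM.

    counts: read counts per gene.
    lengths_bp: gene lengths in base pairs (assumed unit).
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    # Step 1: reads per kilobase (RPK).
    rpk = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    # Step 2: "per million" scaling factor from the summed RPK values.
    scaling = sum(rpk) / 1e6
    if scaling == 0:
        return [0.0] * len(rpk)
    # Step 3: divide each RPK value by the scaling factor.
    return [r / scaling for r in rpk]


def counts_to_rpkm(counts, lengths_bp):
    """Convert one sample's read counts to RPKM (usual definition)."""
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    # Depth scaling comes first here: total reads in millions.
    per_million = sum(counts) / 1e6
    if per_million == 0:
        return [0.0] * len(counts)
    # Then divide reads per million by gene length in kilobases.
    return [
        (c / per_million) / (length / 1000.0)
        for c, length in zip(counts, lengths_bp)
    ]


# Toy sample: three genes with counts proportional to their lengths.
toy_counts = [10, 20, 30]
toy_lengths = [1000, 2000, 3000]
tpm = counts_to_tpm(toy_counts, toy_lengths)
rpkm = counts_to_rpkm(toy_counts, toy_lengths)
```

Because the length correction is applied before the depth scaling, TPM values sum to one million in every sample, whereas the sum of RPKM values depends on the sample.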

This R Markdown site was created with workflowr