Last updated: 2017-10-19
Code version: 861184a54d0981caa0791fd3a31ddee6995fa00b
We will compare and evaluate methods based on simulated datasets generated from real single cell RNA-seq and bulk RNA-seq datasets. These simulated datasets permit us to evaluate the power of the methods while controlling for important parameters: the number of sequencing samples, the number of genes, the depth of coverage, and the fraction of truly DE genes. Most importantly, our simulation approach preserves the sample correlation structure in each sequencing dataset and thereby provides “realistic” evaluation datasets.
We will generate three major classes of simulated datasets:
1. single cell RNA-sequencing datasets, including both UMI and read-based count data
2. bulk RNA-sequencing data, using GTEx liver samples
3. thinned bulk RNA-sequencing data
The simulation parameters are:
1. Number of sequencing samples per group: n = 40, 80, 200, 500 (the upper limit is fixed for each dataset; expect a higher number for Drop-seq and at most ~90 for Fluidigm C1)
2. Number of genes: G = 1,000
3. Number of simulated datasets: M = 50
4. Fraction of null genes: \(\pi_0 \in \{ .5, .9, 1\}\)
5. Thinning parameter: \(\theta \in \{ .001, .01 \}\)
6. Effect size distribution: \(\beta_g \sim N(0, 1)\)
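For concreteness, the parameter settings above can be laid out as a grid. This is a sketch in R; the object names are illustrative, not the project's code:

```r
# Full factorial grid of the simulation parameters listed above.
params <- expand.grid(
  n     = c(40, 80, 200, 500),   # samples per group (upper limit varies by dataset)
  pi0   = c(0.5, 0.9, 1),        # fraction of null genes
  theta = c(0.001, 0.01)         # thinning parameter
)
params$G <- 1000                 # genes per simulated dataset
params$M <- 50                   # replicate datasets per setting
nrow(params)                     # 4 * 3 * 2 = 24 parameter settings
```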
Here we briefly summarize the steps in simulated data generation:

1. Take a subset of \(G\) genes,
2. Take a subset of \(2n\) samples and assign the samples to two arbitrary biological conditions: either the same labels at each gene (preserving the sample correlation structure) or different labels at each gene (no sample correlation),
3. Modify the expression count data to add signal.
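Steps 1 and 2 amount to subsampling a real count matrix and attaching arbitrary group labels. A minimal sketch in R, assuming `counts` is a genes-by-samples count matrix; the helper name is hypothetical:

```r
# Draw a G-gene, 2n-sample null dataset from a real count matrix and
# assign the samples to two arbitrary groups. Using the same labels for
# every gene preserves the sample correlation structure.
make_null_dataset <- function(counts, G, n) {
  genes   <- sample(nrow(counts), G)
  samples <- sample(ncol(counts), 2 * n)
  list(Y = counts[genes, samples, drop = FALSE],
       x = rep(c(0, 1), each = n))   # group indicator per sample
}
```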
The method of adding signals has been previously described: ash (?), vash (?), and Gerard and Stephens (RUV and mouthwash). For the motivation behind this approach, see the main text of Gerard and Stephens (RUV); the overview here closely follows that text.
We add signals to the null datasets in a randomly selected fraction of genes, \(\pi_1 = 1 - \pi_0\) (\(\pi_1 = .1\) or \(.5\)). The signals \(\beta_g\) are randomly sampled from a \(N(0, s^2)\) distribution, where the standard deviation \(s\) was chosen so that the classification would be neither too easy nor too hard (?). Let
\(\beta_{g_1}, \dots, \beta_{g_{\pi_1 G}} \sim N(0, s^2)\)
be the effect sizes, where \(g_l \in \Omega\), the set of non-null genes. Then we drew a matrix \(\mathbf{Z}\) of the same dimension as the expression count matrix \(\mathbf{Y}\) by
\[ z_{i g_l} \mid y_{i g_l} \sim \begin{cases} \text{Binomial}(y_{i g_l}, 2^{\beta_{g_l} x_{i}}) & \text{if $\beta_{g_l} < 0$ and $g_l \in \Omega$},\\ \text{Binomial}(y_{i g_l}, 2^{-\beta_{g_l} (1-x_{i})}) & \text{if $\beta_{g_l} > 0$ and $g_l \in \Omega$},\\ \delta(y_{i g_l}) & \text{if $g_l \notin \Omega$}, \end{cases} \]
where \(\delta(y_{i g_l})\) indicates a point mass at \(y_{i g_l}\). We then use \(\mathbf{Z}\) as our non-null expression count matrix.
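The binomial thinning above can be sketched in R as follows, assuming `Y` is a genes-by-samples count matrix, `x` the 0/1 group indicator, and `beta` a per-gene effect vector that is zero for null genes; the function name is illustrative:

```r
# Binomial thinning: down-sample counts in one group by a factor of
# 2^{-|beta_g|}, leaving null genes (beta_g = 0) untouched.
thin_counts <- function(Y, x, beta) {
  Z <- Y
  for (g in seq_len(nrow(Y))) {
    if (beta[g] < 0) {         # thin samples with x_i = 1
      idx <- which(x == 1)
      Z[g, idx] <- rbinom(length(idx), size = Y[g, idx], prob = 2^beta[g])
    } else if (beta[g] > 0) {  # thin samples with x_i = 0
      idx <- which(x == 0)
      Z[g, idx] <- rbinom(length(idx), size = Y[g, idx], prob = 2^(-beta[g]))
    }                          # beta[g] == 0: point mass at y (no change)
  }
  Z
}
```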
As a rule of thumb, a large \(s^2\) generates large effect sizes, and thus DE classification is easier. To help choose the value of \(s^2\), I compute the fraction of true positive genes (the fraction of true signals classified as DE) given a 5% and a 1% false positive rate (the proportion of nulls classified as DE) under the parameter settings:
limma-voom was used to perform the DE analysis. Overall, a more stringent FDR threshold makes classification more difficult (lower TPR at 1% FDR than at 5% FDR) across the different sample sizes, fractions of null genes, and values of \(s^2\). Among the parameter settings, all but the fraction of null genes may affect TPR at a fixed FDR: a larger sample size is associated with a higher TPR, and, more importantly, a larger \(s^2\) is associated with a higher TPR, as expected since it produces more large effect sizes.
When choosing \(s^2\), the rule of thumb is to set a value that makes the classification neither too easy nor too difficult. We chose a value of \(s^2\) such that the TPR is about 50% (we may report results under other \(s^2\) values in the supplemental materials). For now, we use a 1% FDR and \(s = 1\).
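The TPR-at-fixed-FDR summary can be computed from any DE method's adjusted p-values. A minimal sketch, assuming `qvalue` holds per-gene adjusted p-values (e.g. from limma-voom's `topTable`) and `is_null` the simulation truth; the names are illustrative:

```r
# Fraction of true signals called DE at a given adjusted-p-value cutoff.
tpr_at_fdr <- function(qvalue, is_null, fdr = 0.01) {
  called <- qvalue < fdr
  sum(called & !is_null) / sum(!is_null)
}
```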
make_tpr_plot(tpr01_long, title = "TPR given 1% FDR")

make_tpr_plot(tpr05_long, title = "TPR given 5% FDR")

Question: what parameter values are required for shape A and shape B to have the same TPR at a fixed FDR, sample size, and \(\pi_0\)?
In Soneson and Robinson (2017), the simulated datasets include a subset of single cell samples but all of the genes in the original datasets.
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2 dplyr_0.7.4 ggplot2_2.2.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.13 bindr_0.1 knitr_1.17 magrittr_1.5
[5] munsell_0.4.3 colorspace_1.3-2 R6_2.2.2 rlang_0.1.2
[9] stringr_1.2.0 plyr_1.8.4 tools_3.4.1 grid_3.4.1
[13] gtable_0.2.0 htmltools_0.3.6 yaml_2.1.14 lazyeval_0.2.0
[17] rprojroot_1.2 digest_0.6.12 assertthat_0.2.0 tibble_1.3.4
[21] glue_1.1.1 evaluate_0.10.1 rmarkdown_1.6 labeling_0.3
[25] stringi_1.1.5 compiler_3.4.1 scales_0.5.0 backports_1.1.1
[29] pkgconfig_2.0.1
This R Markdown site was created with workflowr