Predicting cell cycle phase using peco

peco is a supervised approach for predicting cell cycle phase in a continuum using single-cell RNA sequencing data. The R package provides functions to build a training data set and predict cell cycle on a continuum.

Our work shows that peco is able to predict continuous cell cylce phase using a small set of cyclic genes, CDK1, UBE2C, TOP2A, HISTH1E, and HISTH1C (These were dentified as cell cycle marker genes in studies of yeast (Spellman et al, 1998) and HeLa cells (Whitfield et al, 2002).

Below, we present two use cases.

In the first use case, we show how to use the built-in training dataset to predict continuous cell cycle.

In the second use case, we show how to create a training data set, then build a predictor using the training data.

peco uses SingleCellExperiment class objects:

library(SingleCellExperiment)
library(doParallel)
library(foreach)
library(peco)

About the training data

training_human provides training data for 101 significant cyclic genes. The data include:

predict.yy, a gene-by-sample matrix (101 x 888) that storing predicted cyclic expression values.
cellcycle_peco_ordered, cell cycle phase in a unit circle (angle), ordered from 0 to 2$\pi$.
cellcycle_function, functions used for prediction corresponding to the top 101 cyclic genes identified.
sigma, standard error associated with cyclic trends of gene expression.
pve, proportion of variance explained by the cyclic trend.

data(training_human)

Predict cell cycle phase using gene expression data

peco is integrated with SingleCellExperiment objects in Bioconductor. Here we illustrate using a SingleCellExperiment object to perform cell cycle phase prediction.

sce_top101genes is a SingleCellExpression object for 101 genes and 888 single-cell samples.

data(sce_top101genes)
assays(sce_top101genes)
#> List of length 1
#> names(1): counts

Transform expression to quantile-normalizesd counts-per-million (CPM) values; peco uses the cpm_quantNormed slot as the input data for predictions.

sce_top101genes <- data_transform_quantile(sce_top101genes)
assays(sce_top101genes)
#> List of length 3
#> names(3): counts cpm cpm_quantNormed

Generate predictions using cycle_npreg_outsample.

pred_top101genes <-
  cycle_npreg_outsample(Y_test = sce_top101genes,
    sigma_est = training_human$sigma[rownames(sce_top101genes),],
    funs_est = training_human$cellcycle_function[rownames(sce_top101genes)],
    method.trend = "trendfilter",get_trend_estimates = FALSE)

pred_top101genes$Y contains a SingleCellExperiment object with the predicted cell cycle phase in the colData slot:

head(colData(pred_top101genes$Y)$cellcycle_peco)
#> 20170905-A01 20170905-A02 20170905-A03 20170905-A06 20170905-A07 20170905-A08 
#>     1.099557     4.680973     2.607522     4.303982     4.052655     1.413717

View the predictions for the CDK1 gene (Ensembl id ENSG00000170312). Because CDK1 is a known cell cycle gene, this visualization serves as a “sanity check” for the predictions.

x <- seq(0,2*pi,length.out = 100)
plot(x = colData(pred_top101genes$Y)$theta_shifted,
     y = assay(pred_top101genes$Y,"cpm_quantNormed")["ENSG00000170312",],
     pch = 21,col = "white",bg = "darkblue",
     main = "CDK1",xlab = "FUCCI phase",
     ylab = "quantile-normalized expression")
lines(x = x,y = training_human$cellcycle_function[["ENSG00000170312"]](x),
      col = "tomato",lwd = 2)

Visualize cyclic expression trend based on predicted phase

Next, we view the predictions for the top six genes. Here we use fit_cyclical_many to estimate cell cycle.

theta_predict <- colData(pred_top101genes$Y)$cellcycle_peco
names(theta_predict) <- rownames(colData(pred_top101genes$Y))
yy_input <- assay(pred_top101genes$Y,"cpm_quantNormed")[1:6,]
fit_cyclic <- fit_cyclical_many(Y = yy_input,theta = theta_predict)
gene_symbols <- rowData(pred_top101genes$Y)[rownames(yy_input),"hgnc"]
x <- seq(0,2*pi,length.out = 100)
par(mfrow = c(3,2))
for (i in 1:6) {
  plot(x = fit_cyclic$cellcycle_peco_ordered,y = yy_input[i,],
       pch = 1,col = "darkblue",main = gene_symbols[i],
       xlab = "FUCCI phase",ylab = "normalized expression")
  lines(x = x,y = fit_cyclic$cellcycle_function[[i]](x),
        col = "tomato",lwd = 2)
}

Session information

sessionInfo()
#> R version 3.6.2 (2019-12-12)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Catalina 10.15.7
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] peco_1.1.21                 doParallel_1.0.15          
#>  [3] iterators_1.0.12            foreach_1.4.7              
#>  [5] SingleCellExperiment_1.6.0  SummarizedExperiment_1.16.1
#>  [7] DelayedArray_0.12.3         BiocParallel_1.18.1        
#>  [9] matrixStats_0.56.0          Biobase_2.46.0             
#> [11] GenomicRanges_1.38.0        GenomeInfoDb_1.22.1        
#> [13] IRanges_2.20.2              S4Vectors_0.24.4           
#> [15] BiocGenerics_0.32.0        
#> 
#> loaded via a namespace (and not attached):
#>  [1] viridis_0.5.1            BiocSingular_1.0.0       genlasso_1.4            
#>  [4] viridisLite_0.3.0        DelayedMatrixStats_1.6.1 assertthat_0.2.1        
#>  [7] GenomeInfoDbData_1.2.2   vipor_0.4.5              yaml_2.2.0              
#> [10] pillar_1.4.3             backports_1.1.5          lattice_0.20-38         
#> [13] glue_1.3.1               digest_0.6.23            XVector_0.26.0          
#> [16] colorspace_1.4-1         htmltools_0.4.0          Matrix_1.2-18           
#> [19] pkgconfig_2.0.3          zlibbioc_1.32.0          purrr_0.3.3             
#> [22] mvtnorm_1.0-11           scales_1.1.0             pracma_2.2.9            
#> [25] tibble_2.1.3             ggplot2_3.3.0            conicfit_1.0.4          
#> [28] magrittr_1.5             crayon_1.3.4             memoise_1.1.0           
#> [31] evaluate_0.14            fs_1.3.1                 MASS_7.3-51.4           
#> [34] beeswarm_0.2.3           geigen_2.3               tools_3.6.2             
#> [37] scater_1.12.2            lifecycle_0.1.0          stringr_1.4.0           
#> [40] munsell_0.5.0            irlba_2.3.3              compiler_3.6.2          
#> [43] pkgdown_1.5.1            rsvd_1.0.2               rlang_0.4.5             
#> [46] grid_3.6.2               RCurl_1.98-1.2           BiocNeighbors_1.2.0     
#> [49] circular_0.4-93          igraph_1.2.5             bitops_1.0-6            
#> [52] rmarkdown_2.3            boot_1.3-23              gtable_0.3.0            
#> [55] codetools_0.2-16         R6_2.4.1                 gridExtra_2.3           
#> [58] knitr_1.26               dplyr_0.8.3              rprojroot_1.3-2         
#> [61] desc_1.2.0               stringi_1.4.3            ggbeeswarm_0.6.0        
#> [64] Rcpp_1.0.5               tidyselect_0.2.5         xfun_0.11

Joyce Hsiao

2020-11-13

About the training data

Predict cell cycle phase using gene expression data

Visualize cyclic expression trend based on predicted phase

Session information