The wrapper for the whole scDesign3 pipeline

scdesign3 takes the input data, fits the model, and generates new data.

Usage

scdesign3(
  sce,
  assay_use = "counts",
  celltype,
  pseudotime = NULL,
  spatial = NULL,
  other_covariates,
  ncell = dim(sce)[2],
  mu_formula,
  sigma_formula = "1",
  family_use = "nb",
  n_cores = 2,
  usebam = FALSE,
  edf_flexible = FALSE,
  corr_formula,
  empirical_quantile = FALSE,
  copula = "gaussian",
  if_sparse = FALSE,
  fastmvn = FALSE,
  DT = TRUE,
  pseudo_obs = FALSE,
  family_set = c("gauss", "indep"),
  important_feature = "all",
  nonnegative = TRUE,
  nonzerovar = FALSE,
  return_model = FALSE,
  simplify = FALSE,
  parallelization = "mcmapply",
  BPPARAM = NULL,
  trace = FALSE
)

Arguments

sce: A SingleCellExperiment object.
assay_use: A string which indicates the assay you will use in the sce. Default is 'counts'.
celltype: A string giving the name of the cell type variable in the colData of the sce, for example 'cell_type'.
pseudotime: A string or a string vector giving the names of the pseudotime variable and (if present) multiple lineages. Default is NULL.
spatial: A length two string vector of the names of spatial coordinates. Default is NULL.
other_covariates: A string or a string vector of the other covariates you want to include in the data.
ncell: The number of cells you want to simulate. Default is dim(sce)[2] (the same number as in the input data).
mu_formula: A string of the mu parameter formula.
sigma_formula: A string of the sigma parameter formula.
family_use: A string of the marginal distribution. Must be one of 'poisson', 'nb', 'zip', 'zinb' or 'gaussian'.
n_cores: An integer. The number of cores to use.
usebam: A logical variable. If TRUE, use bam for acceleration in marginal fitting.
edf_flexible: A logical variable. It is used to accelerate fitting of the spatial model when k is large in 'mu_formula'. Default is FALSE.
corr_formula: A string of the correlation structure.
empirical_quantile: Use this only if you clearly understand the consequences. A logical variable. If TRUE, the copula is NOT fitted and the EMPIRICAL CDF values of the original data are used instead; this makes the simulated data fixed (no randomness). Default is FALSE. Only works if ncell is the same as the number of cells in your original data.
copula: A string of the copula choice. Must be one of 'gaussian' or 'vine'. Default is 'gaussian'. Note that the vine copula may model high-dimensional data better, but can be very slow when there are more than 1000 features.
if_sparse: A logical variable. Only works for the Gaussian copula (copula = "gaussian"). If TRUE, a thresholding strategy will make the correlation matrix sparse.
fastmvn: A logical variable. If TRUE, the sampling of the multivariate Gaussian is done by mvnfast, otherwise by mvtnorm. Default is FALSE. It only matters for the Gaussian copula.
DT: A logical variable. If TRUE, perform the distributional transformation to make the discrete data 'continuous'. This is useful for discrete distributions (e.g., Poisson, NB). Default is TRUE. Note that for continuous data (e.g., Gaussian), DT does not make sense and should be set to FALSE.
pseudo_obs: A logical variable. If TRUE, use the empirical quantiles instead of the theoretical quantiles for fitting the copula. Default is FALSE.
family_set: A string or a string vector of the bivariate copula families. Default is c("gauss", "indep"). For more information, see the package rvinecopulib.
important_feature: A numeric value or a logical vector which indicates whether a gene will be used in correlation estimation. If this is a numeric value, genes with a zero proportion greater than this value will be excluded from gene-gene correlation estimation. If this is a vector, it should be a logical vector with length equal to the number of genes in sce; TRUE means the corresponding gene will be included in gene-gene correlation estimation, and FALSE means it will be excluded. The default value is "all" (a special string which means no filtering).
nonnegative: A logical variable. If TRUE, values < 0 in the synthetic data will be converted to 0. Default is TRUE (since the expression matrix is nonnegative).
nonzerovar: A logical variable. If TRUE, for any gene with zero variance, the value in one cell will be replaced with 1. This is intended to avoid potential errors in downstream steps, for example, PCA. Default is FALSE.
return_model: A logical variable. If TRUE, the marginal models and copula models will be returned. Default is FALSE.
simplify: A logical variable. If TRUE, the fitted regression models will keep only the components essential for prediction; otherwise the fitted models can be VERY large. Default is FALSE.
parallelization: A string indicating the specific parallelization function to use. Must be one of 'mcmapply', 'bpmapply', or 'pbmcmapply', which correspond to the parallelization functions in the packages parallel, BiocParallel, and pbmcapply, respectively. The default value is 'mcmapply'.
BPPARAM: A MulticoreParam object or NULL. When parallelization = 'mcmapply' or 'pbmcmapply', this parameter must be NULL. When parallelization = 'bpmapply', this parameter must be a MulticoreParam object from the package 'BiocParallel'. The default value is NULL.
trace: A logical variable. If TRUE, the warning/error log and runtime for gam/gamlss will be returned; otherwise they will not. Default is FALSE.
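
The logical-vector form of important_feature described above can be built from per-gene zero proportions. The following is only a sketch on a toy gene-by-cell matrix; the matrix, the cutoff of 0.5, and the names counts, zero_prop, and keep_gene are all hypothetical and are not part of the package.

```r
# Toy gene-by-cell count matrix (hypothetical data: 4 genes x 5 cells).
counts <- matrix(
  c(0, 0, 0, 0, 3,
    1, 2, 0, 4, 5,
    0, 0, 0, 0, 0,
    2, 0, 1, 0, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5))
)

# Proportion of zero counts for each gene (row).
zero_prop <- rowMeans(counts == 0)

# Hypothetical cutoff: genes with a zero proportion greater than 0.5
# are flagged FALSE (excluded); all others are flagged TRUE (included).
cutoff <- 0.5
keep_gene <- zero_prop <= cutoff

# One logical value per gene, in the same order as the rows of counts.
print(keep_gene)
```

Going by the argument description, a vector like keep_gene (with one entry per gene in sce) could be passed as important_feature, and the numeric form important_feature = 0.5 should express the same filter.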

Value

A list with the components:

new_count: A matrix of the new simulated count (expression) matrix.
new_covariate: A data.frame of the new covariate matrix.
model_aic: The model AIC.
marginal_list: A list of marginal regression models if return_model = TRUE.
corr_list: A list of correlation models (conditional copulas) if return_model = TRUE.
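
As a rough illustration of how the returned list might be inspected, the sketch below uses a small hand-made stand-in with the same component names. The values are made up, and the layout (genes in rows and cells in columns of new_count, one row of new_covariate per cell) is an assumption based on ncell defaulting to dim(sce)[2].

```r
# Hand-made stand-in for the returned list (hypothetical values, not
# real scdesign3 output): 3 genes x 4 simulated cells.
sim <- list(
  new_count = matrix(
    c(1, 0, 2, 5,
      0, 0, 3, 1,
      4, 2, 0, 0),
    nrow = 3, byrow = TRUE,
    dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4))
  ),
  new_covariate = data.frame(
    pseudotime = c(0.1, 0.4, 0.7, 0.9),
    row.names = paste0("cell", 1:4)
  )
)

# Assumed layout: one column of new_count per row of new_covariate.
stopifnot(ncol(sim$new_count) == nrow(sim$new_covariate))

# Total simulated count per cell, attached to the covariates.
sim$new_covariate$lib_size <- colSums(sim$new_count)
print(sim$new_covariate)
```

With a real result, the same accessors (my_simu$new_count, my_simu$new_covariate) would apply; marginal_list and corr_list are only populated when return_model = TRUE.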

Examples

data(example_sce)
my_simu <- scdesign3(
  sce = example_sce,
  assay_use = "counts",
  celltype = "cell_type",
  pseudotime = "pseudotime",
  spatial = NULL,
  other_covariates = NULL,
  mu_formula = "s(pseudotime, bs = 'cr', k = 10)",
  sigma_formula = "1",
  family_use = "nb",
  n_cores = 2,
  usebam = FALSE,
  edf_flexible = FALSE,
  corr_formula = "pseudotime",
  copula = "gaussian",
  if_sparse = TRUE,
  DT = TRUE,
  pseudo_obs = FALSE,
  ncell = 1000,
  return_model = FALSE
)
#> Input Data Construction Start
#> fitting ...
#> Input Data Construction End
#> Start Marginal Fitting
#> Marginal Fitting End
#> Start Copula Fitting
#> Convert Residuals to Multivariate Gaussian
#> Converting End
#> Copula group 1 starts
#> Copula group 2 starts
#> Copula group 3 starts
#> Copula group 4 starts
#> Copula Fitting End
#> Start Parameter Extraction
#> Parameter
#> Extraction End
#> Start Generate New Data
#> Use Copula to sample a multivariate quantile matrix
#> Sample Copula group 1 starts
#> Sample Copula group 2 starts
#> Sample Copula group 3 starts
#> Sample Copula group 4 starts
#> New Data Generating End