YARN: Robust Multi-Tissue RNA-Seq Preprocessing and Normalization
Joseph N. Paulson & John Quackenbush
2024-07-09
Source:vignettes/yarn.Rmd
yarn.Rmd
YARN - Yet Another RNa-seq package
The goal of yarn is to expedite large RNA-seq analyses using a combination of previously developed tools. Yarn is meant to make it easier for the user to perform accurate comparison of conditions by leveraging many Bioconductor tools and various statistical and normalization techniques while accounting for the large heterogeneity and sparsity found in very large RNA-seq experiments.
Installation
You can install yarn from github through netZooR with:
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("netZooR")
Quick Introduction
If you’re here to grab the GTEx version 6.0 data then look no further than this gist that uses yarn to download all the data and preprocess it for you.
Preprocessing
Below are a few of the functions we can use to preprocess a large RNA-seq experiment. We follow a particular procedure where we:
- Filter poor quality samples
- Merge samples of similar conditions for increased power
- Filter genes while preserving tissue or group specificity
- Normalize while accounting for global differences in tissue distribution
We will make use of the skin
dataset for examples. The
skin
dataset is a small sample of the full GTEx data that
can be downloaded using the downloadGTEx
function. The
skin
dataset looks like this:
skin
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 40824 features, 20 samples
## element names: exprs
## protocolData: none
## phenoData
## rowNames: GTEX-OHPL-0008-SM-4E3I9 GTEX-145MN-1526-SM-5SI9T ...
## GTEX-144FL-0626-SM-5LU43 (20 total)
## varLabels: SAMPID SMATSSCR ... DTHHRDY (65 total)
## varMetadata: labelDescription
## featureData
## featureNames: 48350 48365 ... 7565 (40824 total)
## fvarLabels: ensembl_gene_id hgnc_symbol ... gene_biotype (6 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
This is a basic workflow. Details will be fleshed out:
- First always remember to have the library loaded.
- Download the GTEx gene count data as an ExpressionSet object or load the sample skin dataset.
For computational reasons we load the sample skin data instead of having the user download the
- Check mis-annotation of gender or other phenotypes using group-specific genes
netZooR::checkMisAnnotation(skin,"GENDER",controlGenes="Y",legendPosition="topleft")
- Decide what sub-groups should be merged
netZooR::checkTissuesToMerge(skin,"SMTS","SMTSD")
- Filter lowly expressed genes
skin_filtered = netZooR::filterLowGenes(skin,"SMTSD")
dim(skin)
## Features Samples
## 40824 20
dim(skin_filtered)
## Features Samples
## 19933 20
Or group specific genes
tmp = netZooR::filterGenes(skin,labels=c("X","Y","MT"),featureName = "chromosome_name")
# Keep only the sex names
tmp = netZooR::filterGenes(skin,labels=c("X","Y","MT"),featureName = "chromosome_name",keepOnly=TRUE)
- Normalize in a tissue or group-aware manner
netZooR::plotDensity(skin_filtered,"SMTSD",main=expression('log'[2]*' raw expression'))
skin_filtered = netZooR::normalizeTissueAware(skin_filtered,"SMTSD")
netZooR::plotDensity(skin_filtered,"SMTSD",normalized=TRUE,main="Normalized")
## normalizedMatrix is assumed to already be log-transformed
Helper functions
Other than checkMisAnnotation
and
checkTissuesToMerge
we provide a few plotting function. We
include, plotCMDS
, plotDensity
,
plotHeatmap
.
plotCMDS
- PCoA / Classical Multi-Dimensional Scaling of
the most variable genes.
data(skin)
## Warning in data(skin): data set 'skin' not found
plotDensity
- Density plots colored by phenotype of
choosing. Allows for inspection of global trend differences.
filtData = netZooR::filterLowGenes(skin,"SMTSD")
netZooR::plotDensity(filtData,groups="SMTSD",legendPos="topleft")
plotHeatmap
- Heatmap of the most variable genes.
library(RColorBrewer)
tissues = pData(skin)$SMTSD
heatmapColColors=brewer.pal(12,"Set3")[as.integer(factor(tissues))]
heatmapCols = colorRampPalette(brewer.pal(9, "RdBu"))(50)
netZooR::plotHeatmap(skin,normalized=FALSE,log=TRUE,trace="none",n=10,
col = heatmapCols,ColSideColors = heatmapColColors,cexRow = 0.25,cexCol = 0.25)
Information
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.1.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-3 netZooR_1.5.17 matrixcalc_1.0-6
## [4] pandaR_1.34.0 Biobase_2.62.0 BiocGenerics_0.48.1
## [7] reticulate_1.35.0 igraph_2.0.3
##
## loaded via a namespace (and not attached):
## [1] STRINGdb_2.14.3 fs_1.6.3
## [3] matrixStats_1.2.0 bitops_1.0-7
## [5] httr_1.4.7 doParallel_1.0.17
## [7] Rgraphviz_2.46.0 repr_1.1.7
## [9] tools_4.3.2 doRNG_1.8.6
## [11] backports_1.4.1 utf8_1.2.4
## [13] R6_2.5.1 vegan_2.6-4
## [15] HDF5Array_1.30.1 mgcv_1.9-1
## [17] rhdf5filters_1.14.1 permute_0.9-7
## [19] prettyunits_1.2.0 base64_2.0.1
## [21] preprocessCore_1.64.0 cli_3.6.2
## [23] textshaping_0.3.7 penalized_0.9-52
## [25] sass_0.4.9 readr_2.1.5
## [27] genefilter_1.84.0 askpass_1.2.0
## [29] pkgdown_2.0.7 Rsamtools_2.18.0
## [31] systemfonts_1.0.6 pbdZMQ_0.3-11
## [33] siggenes_1.76.0 illuminaio_0.44.0
## [35] AnnotationForge_1.44.0 scrime_1.3.5
## [37] plotrix_3.8-4 limma_3.58.1
## [39] rstudioapi_0.16.0 RSQLite_2.3.5
## [41] generics_0.1.3 GOstats_2.68.0
## [43] quantro_1.36.0 BiocIO_1.12.0
## [45] gtools_3.9.5 distributional_0.4.0
## [47] dplyr_1.1.4 GO.db_3.18.0
## [49] Matrix_1.6-5 fansi_1.0.6
## [51] S4Vectors_0.40.2 abind_1.4-5
## [53] lifecycle_1.0.4 yaml_2.3.8
## [55] edgeR_4.0.16 SummarizedExperiment_1.32.0
## [57] gplots_3.1.3.1 rhdf5_2.46.1
## [59] SparseArray_1.2.4 BiocFileCache_2.10.1
## [61] grid_4.3.2 blob_1.2.4
## [63] crayon_1.5.2 lattice_0.22-6
## [65] GenomicFeatures_1.54.4 annotate_1.80.0
## [67] KEGGREST_1.42.0 pillar_1.9.0
## [69] knitr_1.45 beanplot_1.3.1
## [71] GenomicRanges_1.54.1 rjson_0.2.21
## [73] codetools_0.2-19 glue_1.7.0
## [75] downloader_0.4 data.table_1.15.2
## [77] vctrs_0.6.5 png_0.1-8
## [79] gtable_0.3.4 assertthat_0.2.1
## [81] gsubfn_0.7 cachem_1.0.8
## [83] xfun_0.42 S4Arrays_1.2.1
## [85] survival_3.5-8 iterators_1.0.14
## [87] statmod_1.5.0 nlme_3.1-164
## [89] Category_2.68.0 bit64_4.0.5
## [91] progress_1.2.3 filelock_1.0.3
## [93] GenomeInfoDb_1.38.8 tensorA_0.36.2.1
## [95] bslib_0.6.2 nor1mix_1.3-2
## [97] KernSmooth_2.23-22 colorspace_2.1-0
## [99] DBI_1.2.2 nnet_7.3-19
## [101] processx_3.8.4 tidyselect_1.2.1
## [103] bit_4.0.5 compiler_4.3.2
## [105] curl_5.2.1 chron_2.3-61
## [107] graph_1.80.0 xml2_1.3.6
## [109] desc_1.4.3 ggdendro_0.2.0
## [111] DelayedArray_0.28.0 posterior_1.5.0
## [113] rtracklayer_1.62.0 checkmate_2.3.1
## [115] scales_1.3.0 caTools_1.18.2
## [117] hexbin_1.28.3 quadprog_1.5-8
## [119] RBGL_1.78.0 rappdirs_0.3.3
## [121] stringr_1.5.1 digest_0.6.35
## [123] rmarkdown_2.26 GEOquery_2.70.0
## [125] XVector_0.42.0 htmltools_0.5.7
## [127] pkgconfig_2.0.3 base64enc_0.1-3
## [129] sparseMatrixStats_1.14.0 MatrixGenerics_1.14.0
## [131] highr_0.10 dbplyr_2.5.0
## [133] fastmap_1.1.1 rlang_1.1.3
## [135] DelayedMatrixStats_1.24.0 jquerylib_0.1.4
## [137] jsonlite_1.8.8 BiocParallel_1.36.0
## [139] mclust_6.1 RCurl_1.98-1.14
## [141] magrittr_2.0.3 GenomeInfoDbData_1.2.11
## [143] Rhdf5lib_1.24.2 IRkernel_1.3.2
## [145] munsell_0.5.0 Rcpp_1.0.12
## [147] proto_1.0.0 sqldf_0.4-11
## [149] stringi_1.8.3 RJSONIO_1.3-1.9
## [151] zlibbioc_1.48.2 MASS_7.3-60.0.1
## [153] plyr_1.8.9 org.Hs.eg.db_3.18.0
## [155] bumphunter_1.44.0 minfi_1.48.0
## [157] parallel_4.3.2 Biostrings_2.70.3
## [159] IRdisplay_1.1 splines_4.3.2
## [161] multtest_2.58.0 hash_2.2.6.3
## [163] hms_1.1.3 locfit_1.5-9.9
## [165] ps_1.7.6 uuid_1.2-0
## [167] RUnit_0.4.33 base64url_1.4
## [169] rngtools_1.5.2 reshape2_1.4.4
## [171] biomaRt_2.58.2 stats4_4.3.2
## [173] XML_3.99-0.16.1 evaluate_0.23
## [175] tzdb_0.4.0 foreach_1.5.2
## [177] tidyr_1.3.1 openssl_2.1.1
## [179] purrr_1.0.2 reshape_0.8.9
## [181] ggplot2_3.5.1 xtable_1.8-4
## [183] restfulr_0.0.15 viridisLite_0.4.2
## [185] ragg_1.3.0 RCy3_2.23.2
## [187] tibble_3.2.1 memoise_2.0.1
## [189] AnnotationDbi_1.64.1 GenomicAlignments_1.38.2
## [191] IRanges_2.36.0 cluster_2.1.6
## [193] cmdstanr_0.6.1.9000 GSEABase_1.64.0