Scalable genomics with R and Bioconductor. (English) Zbl 1332.62009

Summary: This paper reviews strategies for solving problems encountered when analyzing large genomic data sets and describes the implementation of those strategies in R by packages from the Bioconductor project. We treat the scalable processing, summarization and visualization of big genomic data. The general ideas are well established and include restrictive queries, compression, iteration and parallel computing. We demonstrate the strategies by applying Bioconductor packages to the detection and analysis of genetic variants from a whole genome sequencing experiment.
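
As a rough illustration of three of the strategies named in the summary (compression, iteration and restrictive queries), the following is a minimal sketch in Python rather than R. It does not use Bioconductor and is not the authors' implementation; the helper names `iter_chunks` and `count_variants`, the chunk size and the tiny VCF-like records are all hypothetical, made up for the example. Parallel computing is not shown.

```python
import gzip
import io
from collections import Counter
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size items, so only one chunk is held at a time."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def count_variants(handle, chrom, start, end, chunk_size=2):
    """Count ALT alleles for records on `chrom` with start <= POS <= end.

    The region filter plays the role of a restrictive query: records
    outside the region are skipped and never accumulated.
    """
    counts = Counter()
    for chunk in iter_chunks(handle, chunk_size):  # iteration over chunks
        for line in chunk:
            if line.startswith("#"):  # skip header lines
                continue
            fields = line.rstrip("\n").split("\t")
            pos = int(fields[1])
            if fields[0] == chrom and start <= pos <= end:
                counts[fields[4]] += 1  # column 5 holds the ALT allele
    return counts


# Made-up, VCF-like records (hypothetical data, for illustration only).
records = (
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "chr1\t100\t.\tA\tG\n"
    "chr1\t250\t.\tC\tT\n"
    "chr1\t900\t.\tG\tT\n"
    "chr2\t150\t.\tT\tG\n"
)

# Compression: the data stay gzip-compressed and are decompressed as a stream.
compressed = gzip.compress(records.encode("utf-8"))

with gzip.open(io.BytesIO(compressed), "rt") as handle:
    result = count_variants(handle, "chr1", 50, 500)

print(dict(result))  # {'G': 1, 'T': 1}
```

Only the two chr1 records with positions inside 50 to 500 are counted; the record at position 900 and the chr2 record are filtered out before any summarization takes place.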

MSC:

62-04 Software, source code, etc. for problems pertaining to statistics
62-07 Data analysis (statistics) (MSC2010)
62A09 Graphical methods in statistics
92-04 Software, source code, etc. for problems pertaining to biology

References:

[1] Bischl, B., Lang, M., Mersmann, O., Rahnenfuehrer, J. and Weihs, C. (2011). Computing on high performance clusters with R: Packages BatchJobs and BatchExperiments. Technical Report 1, TU Dortmund.
[2] Chambers, J. M. (2008). Software for Data Analysis: Programming with R. Springer, New York. · Zbl 1180.62002
[3] Cormen, T. H., Leiserson, C. E., Rivest, R. L. and Stein, C. (2001). Introduction to Algorithms, 2nd ed. McGraw-Hill, Boston, MA. · Zbl 1047.68161
[4] Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., Handsaker, R. E., Lunter, G., Marth, G. T., Sherry, S. T., McVean, G., Durbin, R. and 1000 Genomes Project Analysis Group (2011). The variant call format and VCFtools. Bioinformatics 27 2156-2158.
[5] Gentleman, R. C., Carey, V. J., Bates, D. M. and others (2004). Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol. 5 R80.
[6] Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M. and Haussler, D. (2002). The human genome browser at UCSC. Genome Res. 12 996-1006.
[7] Kent, W. J., Zweig, A. S., Barber, G., Hinrichs, A. S. and Karolchik, D. (2010). BigWig and BigBed: Enabling browsing of large distributed datasets. Bioinformatics 26 2204-2207.
[8] Lawrence, M., Huber, W., Pagès, H., Aboyoun, P., Carlson, M., Gentleman, R., Morgan, M. and Carey, V. (2013). Software for computing and annotating genomic ranges. PLoS Computational Biology 9 e1003118.
[9] Lawrence, M. and Wickham, H. (2012). plumbr: Mutable and dynamic data models. R package version 0.6.6.
[10] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R. and 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25 2078-2079.
[11] Ostrouchov, G., Chen, W.-C., Schmidt, D. and Patel, P. (2012). Programming with big data in R. Available at .
[12] Pagès, H., Aboyoun, P., Gentleman, R. and DebRoy, S. (2013). Biostrings: String objects representing biological sequences, and matching algorithms. R package version 2.25.6.
[13] R Development Core Team (2010). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
[14] Revolution Analytics and Weston, S. (2013). foreach: Foreach looping construct for R. R package version 1.4.1.
[15] Wickham, H. (2011). The split-apply-combine strategy for data analysis. Journal of Statistical Software 40 1-29.
[16] Wickham, H., Lawrence, M., Cook, D., Buja, A., Hofmann, H. and Swayne, D. F. (2009). The plumbing of interactive graphics. Comput. Statist. 24 207-215. · Zbl 1232.62014 · doi:10.1007/s00180-008-0116-x
[17] Yin, T., Lawrence, M. and Cook, D. (2013). biovizBase: Basic graphic utilities for visualization of genomic data. R package version 1.9.1.