Hadoop and PySpark for reproducibility and scalability of genomic sequencing studies.

Title	Hadoop and PySpark for reproducibility and scalability of genomic sequencing studies.
Publication Type	Journal Article
Year of Publication	2020
Authors	Wheeler NR, Benchek P, Kunkle BW, Hamilton-Nelson KL, Warfe M, Fondran JR, Haines JL, Bush WS
Journal	Pac Symp Biocomput
Volume	25
Pagination	523-534
Date Published	2020
ISSN	2335-6936
Keywords	Base Sequence; Chromosome Mapping; Computational Biology; Diagnostic Tests, Routine; Genomics; High-Throughput Nucleotide Sequencing; Humans; Reproducibility of Results; Sequence Analysis, DNA; Software; Workflow
Abstract	Modern genomic studies are rapidly growing in scale, and the analytical approaches used to analyze genomic data are increasing in complexity. Genomic data management poses logistic and computational challenges, and analyses are increasingly reliant on genomic annotation resources that create their own data management and versioning issues. As a result, genomic datasets are increasingly handled in ways that limit the rigor and reproducibility of many analyses. In this work, we examine the use of the Spark infrastructure for the management, access, and analysis of genomic data in comparison to traditional genomic workflows on typical cluster environments. We validate the framework by reproducing previously published results from the Alzheimer's Disease Sequencing Project. Using the framework and analyses designed using Jupyter notebooks, Spark provides improved workflows, reduces user-driven data partitioning, and enhances the portability and reproducibility of distributed analyses required for large-scale genomic studies.
PubMed Link	https://www.ncbi.nlm.nih.gov/pubmed/31797624?dopt=Abstract
Alternate Journal	Pac Symp Biocomput
PubMed ID	31797624
PubMed Central ID	PMC6956992
Grant List	RF1 AG054074 / AG / NIA NIH HHS / United States; U01 AG052410 / AG / NIA NIH HHS / United States; U01 AG058654 / AG / NIA NIH HHS / United States; U54 AG052427 / AG / NIA NIH HHS / United States
