Tutorial - Massive scale of DNA sequencing data presents challenges in processing and analysis

Thursday, November 17, 2011 - 9:00am - 10:00am
Keller 3-180
Fuli Yu (Baylor College of Medicine)
Recent advances in DNA sequencing technologies have led to wide dissemination of instrumentation, an abundance of resulting data, and considerable excitement. As a result of declining costs and increasing throughput, the amount of sequence data produced is on a rapid growth trajectory. It is predicted that DNA sequence data will soon become one of the largest data types, requiring powerful infrastructure development and deployment in both software and hardware to enable routine and robust handling and analysis.

This tutorial will guide participants through multiple topics regarding next-generation sequencing (NGS) data production and processing. Emphasis will be placed on both didactic presentation and group discussion in the following areas: (1) What is happening; (2) The excitement; (3) Best practices and lessons from the 1000 Genomes Project; (4) Remaining bottlenecks in data handling; and (5) A view toward the future.

The HGSC has pioneered the deployment of multiple NGS platforms (Roche 454, Illumina, SOLiD, PacBio, Ion Torrent), and spearheaded personal genomics (the Watson Genome, Lupski Genome, and Beery Family), population genomics (1000 Genomes), cohort disease mapping (the ARIC studies), and cancer studies (TCGA, familial cancer). A great deal of experience in processing and handling NGS data and in variant calling has been accumulated, forming a solid foundation for meeting future challenges.

My group has been a major contributor to the 1000 Genomes Project for variant calling, imputation, and integration on both low-coverage (~4X/genome) and exome data. We developed integrative variant analysis pipelines, Atlas2 and SNPTools, which produced high-quality SNP and INDEL datasets in the 1000 Genomes Phase I project. I will share this experience as one example.