Bioinformatics tools are a key component of the next-generation sequencing (NGS) workflow and can have a significant impact on the results. Alignment and variant calling, in particular, involve complex algorithms, each with unique strengths and weaknesses. The Broad Institute’s BWA+GATK application is among the most popular, but over the last few years more alignment and variant calling methods have been released by companies including Illumina, Edico Genome, and Sentieon. With multiple methods available, users need a way to compare the results these methods produce so that they can select the best one for their purpose.
The new Hap.py app available on BaseSpace Sequence Hub enables users to compare diploid genotypes at the haplotype level by generating and matching alternate sequences in a small region of the genome that contains one or more variants. Hap.py makes it easy to compare any variant call set against a range of packaged gold-standard truth sets [1,2] to perform routine benchmarking.
Benchmarking variant calls has an important role in a variety of applications, such as validating a sequencing pipeline, testing new software, and routine quality assurance. In these contexts, the benchmarking workflow involves sequencing a known reference sample with a corresponding gold-standard truth set available, processing this sequence data with an alignment and variant calling pipeline, then comparing the resulting call set against the gold standard.
The Global Alliance for Genomics and Health (GA4GH) is an international collaborative effort which aims to establish standards that in turn encourage the development of interoperable bioinformatics tools. This standardization effort is vital as the field of genomics moves from academic research into clinical research applications. Among the many GA4GH initiatives, a benchmarking group was formed with the specific aim of establishing best practices for how variant benchmarking should be done to ensure accurate and reproducible results.
This GA4GH benchmarking initiative has developed best practices for small variant benchmarking which are now implemented for BaseSpace Sequence Hub users by the BaseSpace Hap.py Benchmarking application. Specifically, these best practices recommend:
- RTG-tools vcfeval as the variant comparison engine, with its ability to match even partial haplotypes between truth set and query alleles
- Quantification by the hap.py command line tool of true positive, false positive, false negative and unassessed calls
- Optional stratification of benchmarking metrics into difficult regions such as regions of low complexity or biased sequence composition
Documentation and links to the outputs from the GA4GH Benchmarking team can be found in their benchmarking tools GitHub repository. Further details of the benchmarking best practices developed by the GA4GH team will be described in an upcoming manuscript.
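The true positive, false positive, and false negative counts listed above map onto the usual benchmarking metrics. The following is a minimal Python sketch, assuming the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The function name and the example counts are illustrative only and are not hap.py output.

```python
def benchmark_metrics(tp, fp, fn):
    """Return (precision, recall, f1) from true positive, false positive,
    and false negative counts. Unassessed calls are not part of any metric."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts only (not real benchmarking results).
precision, recall, f1 = benchmark_metrics(tp=990, fp=10, fn=110)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Guarding the denominators keeps the function usable on empty strata, where a region may contain no truth or query variants at all.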
Hap.py application features
- Latest truth sets: Hap.py includes the latest 2017 version of the Illumina Platinum Genomes truth sets (for samples NA12878 and NA12877) as well as the most recent truth sets from Genome in a Bottle (v3.3.2), which cover 4 alternative reference samples (the Ashkenazim trio and the Chinese son) in addition to NA12878. You can also use a custom truth set if your chosen reference material is not covered by these truth sets.
- Precision-recall curves: variant calling pipelines often apply filtering at a level chosen by the user. To compare pipelines fairly, we must compare the full range of precision and recall values that can be achieved by varying a threshold from its most sensitive (high recall but lower precision) to its most conservative (lower recall with high precision). Precision-recall curves offer a broad view of this tradeoff and are useful, for example, when choosing thresholding levels for further analysis.
- Stratification regions: by selecting a set of regions by which to stratify benchmarking metrics, you can identify exactly where pipelines differ. For example, the effect of performing polymerase chain reaction (PCR) prior to sequencing, relative to a PCR-free protocol, could be investigated by comparing precision and recall against the truth set in repetitive regions, where PCR cycles may introduce additional false positive insertion-deletion (indel) calls in long homopolymers.
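The threshold sweep behind a precision-recall curve can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the app's implementation: each query call is assumed to carry a numeric quality score and a flag saying whether it matched the truth set, and `n_truth` is assumed to be the total number of truth variants. All names and values are hypothetical.

```python
def precision_recall_curve(calls, n_truth):
    """calls: list of (score, is_true_positive) for query variants.
    n_truth: total number of variants in the truth set (must be > 0).
    Returns [(threshold, precision, recall)], where each point keeps
    every call whose score is >= threshold."""
    if n_truth <= 0:
        raise ValueError("n_truth must be positive")
    ordered = sorted(calls, key=lambda c: c[0], reverse=True)
    points = []
    tp = fp = 0
    i = 0
    while i < len(ordered):
        threshold = ordered[i][0]
        # Consume every call tied at this score before emitting a point.
        while i < len(ordered) and ordered[i][0] == threshold:
            if ordered[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((threshold, tp / (tp + fp), tp / n_truth))
    return points


# Hypothetical scored calls: (quality score, matched the truth set?)
example_calls = [(50, True), (40, True), (30, False), (20, True), (10, False)]
for threshold, precision, recall in precision_recall_curve(example_calls, n_truth=4):
    print(f"score>={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold admits more calls, so recall can only rise while precision tends to fall, which is the tradeoff the curve makes visible.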
Relation to VCAT
BaseSpace Sequence Hub already offers the Variant Calling Assessment Tool (VCAT), which includes Hap.py. The two applications are complementary, and the choice between them depends on your purpose. Table 1 shows which app supports each use case.
Table 1: A comparison of the Hap.py and VCAT apps
| I want to: | Hap.py | VCAT |
|---|---|---|
| Run a single variant call file (VCF) against the latest Platinum Genomes truth set using GA4GH benchmarking best practices | Supported | Not supported |
| Run 2-10 VCFs at once and compare them against each other too | Not supported | Supported |
| Run 100s to 1000s of samples using the BaseSpace Sequence Hub Command Line Interface | Supported | Not supported |
| Choose an optimal threshold for filtering my VCF by seeing how precision and recall vary over a range of cutoffs | Supported | Not supported |
| Restrict the analysis to targeted regions on my data that have been sequenced with an Illumina targeted panel | Supported but requires you to upload a browser extensible data (BED) file | Supported |
Why is it called Hap.py?
Hap.py Benchmarking wraps the open-source hap.py command-line tool. Rather than simply comparing VCF records position by position, hap.py compares local haplotypes to identify matching truth and query records even if their representations differ. The tool is partly written in Python, whose source files use the .py extension, so the haplotype comparison tool became hap.py.
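To see why haplotype comparison matters, consider a toy Python sketch (not hap.py's actual algorithm, which handles diploid genotypes and much more). It applies two differently represented variant records to the same short reference and checks whether they yield the same sequence. The reference string, coordinates, and function names are made up for illustration.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants to a reference string.
    Positions are 0-based offsets into `reference`."""
    pieces = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at {pos}")
        pieces.append(reference[cursor:pos])  # unchanged sequence before the variant
        pieces.append(alt)                    # substitute the alternate allele
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


def same_haplotype(reference, truth_variants, query_variants):
    """True if both variant sets produce the same altered sequence."""
    return apply_variants(reference, truth_variants) == apply_variants(reference, query_variants)


reference = "TCACACAG"
truth = [(0, "TCA", "T")]  # deletion of "CA" written at the start of the repeat
query = [(4, "ACA", "A")]  # the same deletion written at the end of the repeat

print(truth == query)                            # the records differ position by position
print(same_haplotype(reference, truth, query))   # but the resulting haplotypes agree
```

A position-by-position comparison would count these two records as one false negative and one false positive, even though both describe the same underlying sequence.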
1. Eberle MA et al. (2017) A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Research, 27: 157-164. http://dx.doi.org/10.1101/gr.210500.116
2. Zook JM et al. (2014) Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nature Biotechnology, 32: 246-251. http://dx.doi.org/10.1038/nbt.2835
Next-generation sequencing (NGS) systems now produce more data than ever before. Additionally, a typical NGS workflow involves manual, time-consuming touchpoints for quality control, analysis setup, and results review. As a result, labs that perform NGS or other complex, high-volume processing of samples can be overwhelmed by managing the workflows and the data generated. To address these issues and simplify NGS research, we are happy to announce the new version of BaseSpace Sequence Hub. It is designed to enhance your laboratory’s efficiency and support the needs of high-throughput labs.
This update introduces a biosample-centric data model that tracks all biosample activity from lab preparation through analysis delivery. We’re also introducing the following features:
- New automated quality control features
- Automated app launches and workflows
- An updated Application Programming Interface (API) to help you streamline your NGS workflows
- An improved user interface that helps you access your data and perform functions more quickly
Biosample-centric Data Model
Our new data model centers on biosamples, the data containers that represent the original DNA source material. Biosamples let you trace all sequencing activity, from lab preparation, with optional laboratory information management system (LIMS) integration, through sequencing runs and data analysis to delivery of analysis results. A biosample can be used as an input to multiple sequencing runs, and it can contain multiple datasets, which can live within separate projects.
Important Note: Biosamples with the same name (the Sample ID in the sample sheet) are automatically aggregated: all FASTQ datasets that share a Sample ID are combined into a single biosample. Give each sample in your sample sheet a unique name unless you intend its data to be aggregated. Learn more about automatic data aggregation here.
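The aggregation rule can be pictured as a group-by on Sample ID. The Python sketch below only illustrates that idea. The dataset names are invented, and the code does not reflect how BaseSpace Sequence Hub implements aggregation internally.

```python
from collections import defaultdict


def aggregate_by_sample_id(fastq_datasets):
    """Group (sample_id, dataset_name) pairs into one entry per Sample ID,
    mirroring how datasets with the same Sample ID end up in one biosample."""
    biosamples = defaultdict(list)
    for sample_id, dataset_name in fastq_datasets:
        biosamples[sample_id].append(dataset_name)
    return dict(biosamples)


# Hypothetical datasets from two runs; "S1" appears in both.
datasets = [
    ("S1", "run1_S1_fastq"),
    ("S2", "run1_S2_fastq"),
    ("S1", "run2_S1_fastq"),
]
print(aggregate_by_sample_id(datasets))
```

Here the two "S1" datasets land in the same biosample, which is the desired result for a resequenced sample but a problem if two unrelated samples were accidentally given the same name.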
Automated Lane QC, App Launch, and Analysis QC
After sequencing, much of the work required to process biosamples can be automated in bulk. If you set up automation ahead of time using the command line interface (CLI), sequencing runs can be automatically passed or failed based on their sequencing quality, converted to FASTQ datasets, and used as inputs to an app, and the results can then be passed or failed based on their app metrics. Automation removes much of the time-consuming and error-prone manual work of processing sequencing data into downstream results.
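The sequence of automated decisions described above (lane QC, app launch, analysis QC) can be sketched as plain Python. This is not BaseSpace CLI syntax: the metric names, the thresholds, and the `run_app` callback are all hypothetical stand-ins chosen for illustration.

```python
def qc_status(metrics, minimums):
    """Return 'pass' if every required metric meets its minimum, else 'fail'.
    A metric that is missing from `metrics` counts as a failure."""
    for name, minimum in minimums.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            return "fail"
    return "pass"


# Hypothetical metric names and thresholds, for illustration only.
LANE_MINIMUMS = {"percent_q30": 80.0, "yield_gb": 50.0}
ANALYSIS_MINIMUMS = {"percent_aligned": 95.0}


def process_run(lane_metrics, run_app):
    """Mimic the automated sequence: lane QC, app launch, then analysis QC.
    `run_app` is a caller-supplied function that returns app metrics."""
    if qc_status(lane_metrics, LANE_MINIMUMS) == "fail":
        return "lane QC failed"
    app_metrics = run_app()  # only launched when the lane passes QC
    if qc_status(app_metrics, ANALYSIS_MINIMUMS) == "fail":
        return "analysis QC failed"
    return "analysis QC passed"


print(process_run({"percent_q30": 85.0, "yield_gb": 60.0},
                  lambda: {"percent_aligned": 97.0}))
```

Stopping at the first failed gate means a poor-quality lane never consumes compute on a downstream app, which is the main saving that this kind of automation offers.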
Improved User Interface
The updated interface provides quick access to all of your data from the My Data menu, while the new Action Toolbar contains new and improved app functions such as requeues, QC status changes, workflows, and collaboration tools.
The Analyses page provides a listing of all analyses in your account. The filters on this page help you quickly narrow your search for specific analyses by their current status.
The Projects and Runs pages function the same as before, providing quick access to all of your sequencing projects and instrument runs.
Advanced Automation and Integration Toolset
Alongside our updated data model, we’ve introduced version 2 of the API, which enables you to interact directly with your data and integrate other systems with your BaseSpace Sequence Hub account.
The new automation tools in version 2 of the API:
- Correspond to the new biosample-centric data model
- Improve performance and robustness of the solution
- Include new documentation
Note: The version 1 API is still fully supported and maintained, although new development is focused primarily on the version 2 API. The version 1 API documentation remains available (see the links at the end of this post).
Version 2 of BaseSpace CLI has been built on the version 2 API. You can use BaseSpace CLI to read data from your BaseSpace Sequence Hub account and to create new data by uploading files and launching apps. In addition, the new BaseSpace CLI can be used to create automated analysis workflows and import biosamples.
BaseMount is a command-line tool that lets you browse runs, projects, biosamples, and datasets and interact directly with the associated files exactly as you would with any other file system.
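Because BaseMount presents data as an ordinary file system, standard file APIs should work on it unchanged. The Python sketch below uses only the standard library to list files with a given suffix. The mount point path is a hypothetical placeholder, and the code makes no assumption about the directory layout that BaseMount exposes.

```python
import os


def find_files(root, suffix):
    """Return the sorted paths of all files under `root` whose names end with `suffix`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    # Hypothetical mount point; replace it with wherever BaseMount is mounted.
    mount_point = os.path.expanduser("~/basespace")
    for path in find_files(mount_point, ".fastq.gz"):
        print(path)
```

Nothing in the function is specific to BaseSpace, which is the point: existing scripts that read local files can be pointed at the mounted directory instead.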
We hope the new functionality of BaseSpace Sequence Hub enables your lab to boost productivity and discovery. Visit our updated Support Site to learn more about how to use all the new features and tools. Please contact us at email@example.com if you have any questions or comments.
The BaseSpace Sequence Hub Team
- CLI documentation https://developer.basespace.illumina.com/docs/content/documentation/cli/cli-overview
- CLI automated workflow creation docs https://developer.basespace.illumina.com/docs/content/documentation/cli/cli-examples
- Link to v1 API docs https://developer.basespace.illumina.com/docs/content/documentation/rest-api/v1-api-reference
- Link to v2 API docs https://developer.basespace.illumina.com/docs/content/documentation/rest-api/api-reference
BaseSpace® Variant Interpreter competes in Next-generation Sequencing (NGS) Bioinformatics Challenge at European Congress of Pathology 2017
Differences in bioinformatics pipelines may contribute to substantial variability across labs, in terms of variant annotation, interpretation, and reporting. The lack of standardization is an emerging concern, especially given the growing availability of commercial bioinformatics software options that reduce the barrier for new labs to adopt next-generation sequencing (NGS). To demonstrate how differences in commercial software can influence analysis, organizers of the Two-Day Symposium for Molecular Biologists in Pathology at the European Congress of Pathology (ECP) 2017 set up an NGS Bioinformatics Challenge where both Illumina and QIAGEN were invited to participate.
The concept of the challenge was simple: 3 institutions in Germany (Universitätsklinikum Köln, Erlangen, and Charité in Berlin) contributed FASTQ files for a total of 12 tumor samples that were known to harbor pathogenic variants. These data were then sent to Illumina and QIAGEN 2 months before the event, and subjected to variant calling and interpretation using their commercially available offerings. Both Illumina and QIAGEN were blinded to the identity of the known variants, and reported on their findings at a round-table session at ECP where the organizers also revealed what the expected variants were, and how they had been interpreted by each institution. To add an interesting twist, each contributing institution had used a different library prep (and sequencing platform) for their samples: the Berlin samples used the AmpliSeq Colon and Lung v2 hotspot panel and were sequenced on the Ion Torrent PGM; both Erlangen and Köln samples were sequenced on an Illumina MiSeq™ System, but the Erlangen samples used the Illumina TruSight® Tumor 15 prep and the Köln samples used a custom QIAGEN amplicon panel.
Removing the NGS Analytics Data Bottleneck with Field-Programmable Gate Arrays (FPGAs)
The following is a guest blog, written by our partners at Edico Genome.
Demand for next-generation sequencing (NGS) analysis is growing at an exponential rate, creating a shortage of computing power to analyze the rapidly growing body of data. Current projections [1] have genomic data continuing to double every seven months, a stark acceleration compared with Moore’s Law, which states that CPU capabilities will double every two years (Figure 1, below). The gap between the two creates a bottleneck for genomics labs.
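A quick calculation shows how fast the two doubling rates diverge. The Python sketch below simply compounds the quoted doubling times. The seven-year horizon is an arbitrary choice for illustration and does not come from the cited projections.

```python
def growth_factor(months, doubling_months):
    """Multiplicative growth after `months` for a quantity that doubles
    every `doubling_months` months."""
    return 2 ** (months / doubling_months)


horizon_months = 7 * 12  # illustrative seven-year horizon

data_growth = growth_factor(horizon_months, 7)   # data doubling every 7 months
cpu_growth = growth_factor(horizon_months, 24)   # CPU doubling every 2 years

print(f"data grows {data_growth:.0f}x, CPU capability grows {cpu_growth:.1f}x")
print(f"gap: {data_growth / cpu_growth:.0f}x")
```

Over this horizon the data undergo 12 doublings while CPU capability undergoes only 3.5, so the shortfall itself grows exponentially rather than staying constant.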
Providing an alternative to traditional CPU-based systems, Edico Genome’s DRAGEN™ (Dynamic Read Analysis for Genomics) Platform uses field-programmable gate array (FPGA) technology to provide customers with hardware-accelerated implementations of genome pipeline algorithms. With FPGAs, DRAGEN allows customers to analyze NGS data at unprecedented speeds with extremely high accuracy [2] onsite, in the cloud, or in a hybrid cloud.
BaseSpace Sequence Hub, hosted on Amazon Web Services, enables the cloud-based deployment of the Edico Genome DRAGEN pipeline. Edico Genome’s DRAGEN Genome Pipeline is now readily available, enabling rapid analysis of whole genome sequencing and targeted resequencing panels.
Also co-authored by Eric Allen.
Recent advancements in the Illumina TruSeq Amplicon technology enable higher multiplexing of amplicons in a single assay. Combined with next-generation sequencing (NGS) from Illumina, NGS users can perform high throughput, high sensitivity genotyping experiments on Illumina Sequencers. The new TruSeq Amplicon 3.0 BaseSpace® Sequence Hub App introduces major improvements to support a variety of amplicon sequencing applications, including the recently launched TruSeq Genotype Ne product. TruSeq Genotype Ne is a fully customizable targeted genotyping by sequencing (GBS) solution. Key GBS features of TruSeq Amplicon 3.0 include:
- Support for custom reference genomes, allowing a user to analyze amplicon data against their choice of FASTA file (previously uploaded to Sequence Hub).
- Genotypes of Interest reporting, allowing a user to generate a tabular report of genotypes for each sample, which is analogous to genotyping array outputs.
Example usage of the Genotypes of Interest feature can be found in the example Project below. The Input VCF (variant call file) in this Project (found in the test_NA12878_GOI output files) can be used as a template and customized for use with other datasets.
Deep sequencing and high-throughput microarray technologies have enabled scientists to routinely generate hundreds of thousands, if not millions, of new data points in a single experiment. The extraordinary rate of data generation, finite resources, and focused research interests limit most investigations to following up on only a small fraction of the data generated from next-generation sequencing (NGS) instruments.
Ten years ago, there were no services available to curate data. Researchers relied on homegrown tools to perform the cumbersome task of matching information to publications, but they didn’t have the expertise to do a re-analysis. A group of entrepreneurial scientists and bioinformaticists foresaw a need for solutions to handle the deluge of data that was coming as more and more genomes, from different kinds of species, were being sequenced. That vision became NextBio® Research, a genomics software platform that could match variant to variant sets and gene expression to DNA methylation to protein-DNA binding across a spectrum of organisms, saving researchers time and resources. By leveraging biomedical ontologies coupled with proto-machine learning algorithms, dynamic data-driven applications were added to aid in the discovery of novel relationships among diseases, compounds, gene perturbations, and pathways.
Next-Generation Sequencing (NGS) users often adopt BaseSpace Sequence Hub at times of change within their organizations. They may be new to NGS, or in the process of scaling up their operations, such as having purchased a new sequencing instrument – perhaps a NovaSeq™ Series instrument. In these situations, it can be challenging to estimate data storage and compute costs, creating uncertainty in the budgeting process.
To address this concern, we are excited to announce an unlimited data storage and compute plan for BaseSpace Sequence Hub that takes the uncertainty away. New Sequence Hub customers can choose between the traditional pay-for-use plan and a fixed-price, unlimited plan that covers all data storage and compute costs in the first year.
With the plan, new customers get unlimited data storage and have access to all of the apps in BaseSpace Sequence Hub without any additional cost. The plan includes Illumina-developed apps as well as third-party apps, such as the recently announced whole genome sequencing Apps from Edico Genome (coming soon). The unlimited plan eliminates any ambiguity associated with the cost of using BaseSpace Sequence Hub and allows customers to understand their usage patterns so they can comfortably estimate their expenses in subsequent years.