Bioinformatics tools are a key component of the next-generation sequencing (NGS) workflow and can have a significant impact on the results. Alignment and variant calling, in particular, involve complex algorithms, each with unique strengths and weaknesses. The Broad Institute’s BWA+GATK application is among the most popular, but over the last few years more alignment and variant calling methods have been released by companies including Illumina, Edico Genome, and Sentieon. With multiple methods available, users need a way to compare the results these methods produce so that they can select the best one for their purpose.
The new Hap.py app available on BaseSpace Sequence Hub enables users to compare diploid genotypes at the haplotype level by generating and matching alternate sequences in a small region of the genome that contains one or more variants. Hap.py makes it easy to compare any variant call set against a range of packaged gold-standard truth sets [1, 2] to perform routine benchmarking.
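To illustrate what comparing at the haplotype level means, here is a minimal Python sketch. It is not Hap.py's implementation: the function names and the tuple layout are made up for illustration, and it assumes phased, non-overlapping calls in a single small region. The idea is to apply each call set to the reference, build the two alternate sequences, and compare the sequences rather than the individual variant records.

```python
def apply_variants(reference, variants):
    """Apply (pos, ref, alt) variants to a reference string.

    Positions are 0-based and variants must not overlap.
    """
    pieces = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError(f"overlapping variant at position {pos}")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at {pos}")
        pieces.append(reference[cursor:pos])  # unchanged sequence before the variant
        pieces.append(alt)                    # the alternate allele
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


def haplotype_sequences(reference, calls):
    """Build both haplotype sequences from (pos, ref, alt, genotype) calls.

    genotype is a phased pair such as (0, 1): 1 means the haplotype carries ALT.
    """
    sequences = []
    for hap in (0, 1):
        carried = [(pos, ref, alt) for pos, ref, alt, gt in calls if gt[hap] == 1]
        sequences.append(apply_variants(reference, carried))
    # Sort so the comparison does not depend on which haplotype is listed first.
    return sorted(sequences)


def same_diploid_genotype(reference, truth_calls, query_calls):
    """True if both call sets produce the same pair of haplotype sequences."""
    return (haplotype_sequences(reference, truth_calls)
            == haplotype_sequences(reference, query_calls))


reference = "GATTACA"
# The truth set reports one two-base substitution ...
truth = [(1, "AT", "CG", (0, 1))]
# ... while the query reports the same change as two adjacent SNVs.
query = [(1, "A", "C", (0, 1)), (2, "T", "G", (0, 1))]
print(same_diploid_genotype(reference, truth, query))
```

A record-by-record comparison would treat these two call sets as different, even though they describe the same sequence. Matching the generated haplotypes avoids that kind of representation mismatch.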
Next-generation sequencing (NGS) systems now produce more data than ever before. Additionally, a typical NGS workflow involves manual, time-consuming touchpoints for quality control, analysis setup, and results review. As a result, labs that perform NGS or other complex, high-volume processing of samples can be overwhelmed by managing the workflows and data generated. To address these issues and simplify NGS research, we are happy to announce the new version of BaseSpace Sequence Hub. It is designed to enhance your laboratory’s efficiency and support the needs of high-throughput labs.
This update includes a biosample-centric data model that tracks all biosample activity from lab preparation through analysis delivery. We’re also introducing the following features:
- New automated quality control features
- Automated app launches and workflows
- An updated Application Programming Interface (API) to help you streamline your NGS workflows
- An improved user interface that helps you access your data and perform functions more quickly
Biosample-centric Data Model
Our new biosample-centric data model enables easy tracking of all biosample activity from lab preparation through analysis delivery. Biosamples are the data containers that represent the original DNA source material. They are used to trace all sequencing activities, including lab preparation (with LIMS integration), sequencing runs, data analysis, and delivery of data.
The new data model centers on biosamples, the original source of DNA, so you can easily track all biosample activity from lab preparation, with optional laboratory information management system (LIMS) integration, to delivery of analysis results. Biosamples can be used as inputs to multiple sequencing runs, and they can contain multiple datasets, which can live within separate projects.
Important Note: Biosamples with the same name (Sample ID in the sample sheet) are automatically aggregated: all FASTQ datasets with the same Sample ID are combined into a single biosample. Give each sample in your sample sheet a unique name; otherwise, its data will be aggregated with that of any other sample sharing the name. Learn more about automatic data aggregation here.
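The aggregation rule can be pictured with a short Python sketch. This is only an illustration of grouping by Sample ID. The dictionary keys (`sample_id`, `name`) and the dataset names are hypothetical and do not reflect the BaseSpace Sequence Hub data format.

```python
from collections import defaultdict


def aggregate_by_sample_id(fastq_datasets):
    """Group FASTQ dataset names under their Sample ID (one group per biosample)."""
    biosamples = defaultdict(list)
    for dataset in fastq_datasets:
        biosamples[dataset["sample_id"]].append(dataset["name"])
    return dict(biosamples)


def reused_sample_ids(biosamples):
    """Return the Sample IDs that ended up holding more than one dataset."""
    return sorted(sid for sid, names in biosamples.items() if len(names) > 1)


# Hypothetical datasets from two runs; "NA12878" is used in both runs.
datasets = [
    {"sample_id": "NA12878", "name": "run1_NA12878_fastq"},
    {"sample_id": "NA12891", "name": "run1_NA12891_fastq"},
    {"sample_id": "NA12878", "name": "run2_NA12878_fastq"},
]

biosamples = aggregate_by_sample_id(datasets)
print(biosamples)
print(reused_sample_ids(biosamples))
```

If the two NA12878 datasets come from the same source material, the grouping is what you want. If they are different samples that happen to share a name, they would still be merged, which is why unique names matter.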
Automated Lane QC, App Launch, and Analysis QC
After sequencing, much of the work required to process biosamples can be automated in bulk. When you set up automation ahead of time using the command line interface (CLI), sequencing runs can be automatically passed or failed based on their sequencing quality, converted to FASTQ datasets, and used as inputs to an app. The resulting analyses can then be passed or failed based on their app metrics. Automation removes much of the time-consuming and error-prone manual work of processing sequencing data into downstream results.
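The chain of automated steps described above can be sketched in Python as follows. This models only the decision logic and does not use the BaseSpace CLI or API. The metric names (`percent_q30`, `mean_coverage`), the thresholds, and the `launch_app` callable are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Lane:
    number: int
    percent_q30: float  # share of bases with a quality score of 30 or higher


def run_automation(lanes, launch_app, min_percent_q30=80.0, min_mean_coverage=30.0):
    """Apply lane QC, launch an app on the passing lanes, then apply analysis QC.

    launch_app is a stand-in for FASTQ generation plus the app launch. It takes
    a list of lane numbers and returns a dict of app metrics.
    """
    # Step 1: lane QC based on sequencing quality.
    passed = [lane.number for lane in lanes if lane.percent_q30 >= min_percent_q30]
    failed = [lane.number for lane in lanes if lane.percent_q30 < min_percent_q30]
    if not passed:
        return {"failed_lanes": failed, "analysis_qc": "not run"}

    # Step 2: only lanes that passed QC are used as app input.
    metrics = launch_app(passed)

    # Step 3: analysis QC based on the app's own metrics.
    status = "pass" if metrics["mean_coverage"] >= min_mean_coverage else "fail"
    return {"failed_lanes": failed, "analysis_qc": status}


def fake_app(lane_numbers):
    """Toy app for the example: each lane contributes 10x coverage."""
    return {"mean_coverage": 10.0 * len(lane_numbers)}


lanes = [Lane(1, 92.0), Lane(2, 71.0), Lane(3, 88.0), Lane(4, 90.0)]
print(run_automation(lanes, fake_app))
```

Each threshold is defined once, up front, and then applied to every run in the same way. That is the benefit of configuring automation ahead of time rather than reviewing each run by hand.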
Improved User Interface
The updated interface provides quick access to all of your data from the My Data menu, while the new Action Toolbar contains new and improved app functions such as requeues, QC status changes, workflows, and collaboration tools.
The Analyses page provides a listing of all analyses in your account. The filters on this page help you quickly narrow your search for specific analyses by their current status.
The Projects and Runs pages function the same as before, providing quick access to all of your sequencing projects and instrument runs.
Advanced Automation and Integration Toolset
Alongside our updated data model, we’ve introduced version 2 of the API, which enables you to interact directly with your data and integrate other systems with your BaseSpace Sequence Hub account.
The new automation tools in version 2 of the API:
- Correspond to the new biosample-centric data model
- Improve performance and robustness of the solution
- Include new documentation
Note: The version 1 API is still fully supported and maintained, although our development is now focused primarily on the version 2 API. The version 1 API documentation is maintained here.
Version 2 of BaseSpace CLI has been built using the version 2 API. BaseSpace CLI can be used to read data from your BaseSpace Sequence Hub account and to create new data by uploading data and launching apps. In addition, the new BaseSpace CLI can be used to create automated analysis workflows and import biosamples.
BaseMount is a command-line tool that lets you browse runs, projects, biosamples, and datasets, and interact directly with the associated files exactly as you would with any other file system.
We hope the new functionality of BaseSpace Sequence Hub enables your lab to boost productivity and discovery. View a video or visit our updated Support Site to learn more about how to use all the new features and tools. Please contact us at email@example.com if you have any questions or comments.
The BaseSpace Sequence Hub Team
- CLI documentation https://developer.basespace.illumina.com/docs/content/documentation/cli/cli-overview
- CLI automated workflow creation docs https://developer.basespace.illumina.com/docs/content/documentation/cli/cli-examples
- Link to v1 API docs https://developer.basespace.illumina.com/docs/content/documentation/rest-api/v1-api-reference
- Link to v2 API docs https://developer.basespace.illumina.com/docs/content/documentation/rest-api/api-reference
BaseSpace® Variant Interpreter competes in Next-generation Sequencing (NGS) Bioinformatics Challenge at European Congress of Pathology 2017
Differences in bioinformatics pipelines may contribute to substantial variability across labs, in terms of variant annotation, interpretation, and reporting. The lack of standardization is an emerging concern, especially given the growing availability of commercial bioinformatics software options that reduce the barrier for new labs to adopt next-generation sequencing (NGS). To demonstrate how differences in commercial software can influence analysis, organizers of the Two-Day Symposium for Molecular Biologists in Pathology at the European Congress of Pathology (ECP) 2017 set up an NGS Bioinformatics Challenge where both Illumina and QIAGEN were invited to participate.
The concept of the challenge was simple: three institutions in Germany (Universitätsklinikum Köln, Erlangen, and Charité in Berlin) contributed FASTQ files for a total of 12 tumor samples that were known to harbor pathogenic variants. These data were then sent to Illumina and QIAGEN two months before the event, and subjected to variant calling and interpretation using their commercially available offerings. Both Illumina and QIAGEN were blinded to the identity of the known variants, and reported on their findings at a round-table session at ECP where the organizers also revealed what the expected variants were, and how they had been interpreted by each institution. To add an interesting twist, each contributing institution had used a different library prep (and sequencing platform) for their samples: the Berlin samples used the AmpliSeq Colon and Lung v2 hotspot panel and were sequenced on the Ion Torrent PGM; both Erlangen and Köln samples were sequenced on an Illumina MiSeq™ System, but the Erlangen samples used the Illumina TruSight® Tumor 15 prep and the Köln samples used a custom QIAGEN amplicon panel.
Removing the NGS Analytics Data Bottleneck with Field-Programmable Gate Arrays (FPGAs)
The following is a guest blog, written by our partners at Edico Genome.
Demand for next-generation sequencing (NGS) analysis is growing at an exponential rate, creating a shortage of computing power to analyze the rapidly growing body of data. Current projections [1] indicate that genomic data will continue doubling every seven months, a stark acceleration in comparison to Moore’s Law, which states CPU capabilities will double every two years (Figure 1, below). The resulting gap creates a bottleneck for genomics labs.
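To get a feel for the size of that gap, the two doubling rates can be compared with a few lines of Python. This is plain arithmetic on the figures quoted above (seven months for data, two years for CPU capability). The function names are our own, and the numbers are an extrapolation, not measured data.

```python
def growth_factor(months, doubling_period_months):
    """How many times larger a quantity is after `months` of steady doubling."""
    return 2 ** (months / doubling_period_months)


def data_to_compute_gap(months, data_doubling=7, compute_doubling=24):
    """Ratio of data growth to CPU capability growth over the same period."""
    return (growth_factor(months, data_doubling)
            / growth_factor(months, compute_doubling))


for years in (1, 2, 5):
    months = 12 * years
    print(
        f"{years} yr: data x{growth_factor(months, 7):.1f}, "
        f"compute x{growth_factor(months, 24):.1f}, "
        f"gap x{data_to_compute_gap(months):.1f}"
    )
```

Under these assumptions, data volume outgrows CPU capability by well over an order of magnitude within five years. That difference is the shortfall hardware acceleration aims to close.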
Providing an alternative to traditional CPU-based systems, Edico Genome’s DRAGEN™ (Dynamic Read Analysis for Genomics) Platform uses FPGA (Field-Programmable Gate Array) technology to provide customers with hardware-accelerated implementations of genome pipeline algorithms. With FPGAs, DRAGEN allows customers to analyze NGS data at unprecedented speeds with extremely high accuracy [2] onsite, in the cloud, or through a blended hybrid cloud.
BaseSpace Sequence Hub, hosted on Amazon Web Services, enables the cloud-based deployment of the Edico Genome DRAGEN pipeline. Edico Genome’s DRAGEN Genome Pipeline is now readily available, enabling rapid analysis of whole genome sequencing and targeted resequencing panels.
Also co-authored by Eric Allen.
Recent advancements in the Illumina TruSeq Amplicon technology enable higher multiplexing of amplicons in a single assay. Combined with next-generation sequencing (NGS) from Illumina, this lets NGS users perform high-throughput, high-sensitivity genotyping experiments on Illumina sequencers. The new TruSeq Amplicon 3.0 BaseSpace® Sequence Hub App introduces major improvements to support a variety of amplicon sequencing applications, including the recently launched TruSeq Genotype Ne product. TruSeq Genotype Ne is a fully customizable targeted genotyping by sequencing (GBS) solution. Key GBS features of TruSeq Amplicon 3.0 include:
- Support for custom reference genomes, allowing a user to analyze amplicon data against their choice of FASTA file (previously uploaded to Sequence Hub).
- Genotypes of Interest reporting, allowing a user to generate a tabular report of genotypes for each sample, which is analogous to genotyping array outputs.
Example usage of the Genotypes of Interest feature can be found in the example Project below. The Input VCF (variant call file) in this Project (found in the test_NA12878_GOI output files) can be used as a template and customized for use with other datasets.
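As a rough picture of what a tabular genotype report looks like, the Python sketch below turns a small multi-sample VCF into one row per site and one column per sample. It is not the app's code and does not reproduce its report layout. It relies only on the standard VCF column order and the `GT` format field, and the sample names and sites are invented.

```python
def genotype_table(vcf_text):
    """Convert VCF text into rows of [site, genotype per sample]."""
    samples = []
    rows = []
    for line in vcf_text.splitlines():
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        alleles = [ref] + alt.split(",")  # allele index 0 is REF, 1+ are ALTs
        gt_index = fields[8].split(":").index("GT")
        calls = []
        for sample_field in fields[9:]:
            gt = sample_field.split(":")[gt_index]
            indices = gt.replace("|", "/").split("/")
            # Translate allele indices to bases; "." marks a missing call.
            calls.append("/".join("." if i == "." else alleles[int(i)] for i in indices))
        rows.append([f"{chrom}:{pos}"] + calls)
    return [["site"] + samples] + rows


# Invented two-sample example; columns are tab-separated as in a real VCF.
example_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t1/1",
    "chr2\t200\t.\tC\tT,G\t.\tPASS\t.\tGT:DP\t1/2:30\t./.:0",
])

for row in genotype_table(example_vcf):
    print("\t".join(row))
```

The result reads much like a genotyping array export, with each site as a row and the called bases for every sample side by side.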
Deep sequencing and high-throughput microarray technologies have enabled scientists to routinely generate hundreds of thousands if not millions of new data points in a single experiment. The extraordinary rate of data generation, finite resources, and focused research interests limit most investigations to following up on only a small fraction of the data generated from next-generation sequencing (NGS) instruments.
Ten years ago, there were no services available to curate data. Researchers relied on homegrown tools to perform the cumbersome task of matching information to publications, but they didn’t have the expertise to do a re-analysis. A group of entrepreneurial scientists and bioinformaticians envisioned a need for solutions to handle the deluge of data that was coming as more and more genomes, from different kinds of species, were being sequenced. That vision grew into NextBio® Research, a genomics software platform that could match variants to variant sets and gene expression to DNA methylation to protein-DNA binding across a spectrum of organisms, saving researchers time and resources. By leveraging biomedical ontologies coupled with proto-machine learning algorithms, dynamic data-driven applications were added to aid in the discovery of novel relationships among diseases, compounds, gene perturbations, and pathways.
Next-Generation Sequencing (NGS) users often adopt BaseSpace Sequence Hub at times of change within their organizations. They may be new to NGS, or in the process of scaling up their operations, such as having purchased a new sequencing instrument – perhaps a NovaSeq™ Series instrument. In these situations, it can be challenging to estimate data storage and compute costs, creating uncertainty in the budgeting process.
To address this concern, we are excited to announce an unlimited data storage and compute plan for BaseSpace Sequence Hub that takes the uncertainty away. New Sequence Hub customers can choose either the traditional pay-for-use plan or a fixed-price, unlimited plan covering all data storage and compute costs in the first year.
With the plan, new customers get unlimited data storage and have access to all of the apps in BaseSpace Sequence Hub without any additional cost. The plan includes Illumina-developed apps as well as third-party apps, such as the recently announced whole genome sequencing apps from Edico Genome (coming soon). The unlimited plan eliminates any ambiguity associated with the cost of using BaseSpace Sequence Hub and allows customers to understand their usage patterns so they can comfortably estimate their expenses in subsequent years.