by Severine Catreux – Associate Director, Bioinformatics FPGA Development
Significant accuracy gains and speed improvements with DRAGEN v3.3, released April 2019
The DRAGEN engineering and bioinformatics team is excited to announce a new DRAGEN release, v3.3. The second of several releases scheduled for 2019, DRAGEN v3.3 contains improvements across the many pipeline offerings now supported by the DRAGEN platform. This includes accuracy improvements in the germline and somatic pipelines, new features (e.g. CNV DeNovo calling and RNA quantification) and speed gains (Somatic T/N, BCL conversion).
Please see the DRAGEN v3.3 Release Notes for more details. This blog highlights the significant updates to the DRAGEN Somatic Pipeline for small variants that are part of the v3.3 release.
As one of DRAGEN’s core pipelines, the DRAGEN Somatic Pipeline for small variants is utilized by cancer research institutes around the globe. Expanding on the existing functionality, accuracy and speed of the DRAGEN Somatic Pipeline, the v3.3 release placed a high focus on the somatic tumor/normal WGS mode, producing step-function improvements in both accuracy and speed.
During the development cycle for v3.3, the DRAGEN engineering and bioinformatics teams took a deep dive into the DRAGEN Somatic Pipeline tumor/normal mode, strengthening the existing algorithm for accuracy improvements. Specific improvements were made in the genotyping module, to replace point estimation of the variant allele frequency with continuous integration over a range of possible frequencies. This led to significant gains in both sensitivity and precision. Additionally, downstream filtering rules were improved to optimize both sensitivity and precision (less stringency on clustered variants, filter variants positioned at the edge of reads, filter variants with low median base quality and MAPQ). Finally, the indel PCR error model autocalibration module was made independent between the tumor and normal control, to allow for differences in library preparation between the tumor sample and the control sample.
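To illustrate the difference between a point estimate of the variant allele frequency and integration over a range of frequencies, here is a minimal sketch. It is not the DRAGEN genotyping model: the binomial read model, the uniform prior over frequencies, the fixed per-read error rate, and all function names are assumptions made purely for illustration.

```python
import math


def read_likelihood(alt_reads, total_reads, freq, error_rate=0.01):
    """Binomial likelihood of the observed alt-read count at allele frequency `freq`.

    Assumed toy model: a read shows the alt allele either because it carries it
    and is read correctly, or because it carries the reference and is misread.
    """
    p_alt = freq * (1.0 - error_rate) + (1.0 - freq) * error_rate
    return (math.comb(total_reads, alt_reads)
            * p_alt ** alt_reads
            * (1.0 - p_alt) ** (total_reads - alt_reads))


def point_estimate_likelihood(alt_reads, total_reads, error_rate=0.01):
    """Likelihood evaluated only at the single observed frequency alt/total."""
    freq_hat = alt_reads / total_reads
    return read_likelihood(alt_reads, total_reads, freq_hat, error_rate)


def integrated_likelihood(alt_reads, total_reads, error_rate=0.01, steps=1000):
    """Likelihood averaged over all frequencies in [0, 1] (uniform prior, midpoint rule)."""
    width = 1.0 / steps
    return width * sum(
        read_likelihood(alt_reads, total_reads, (i + 0.5) * width, error_rate)
        for i in range(steps)
    )


# Hypothetical site: 3 alt-supporting reads out of 100.
point = point_estimate_likelihood(3, 100)
integrated = integrated_likelihood(3, 100)
```

The point estimate scores the data at one frequency only, while the integrated form accounts for every frequency that could plausibly have produced the same reads, which matters most when the alt-read count is small and the frequency is poorly determined.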
These changes are precursors to further accuracy improvements planned for the DRAGEN v3.4 release, specifically in the area of liquid tumor support, where tumor-in-normal contamination will be taken into account.
Accuracy gains of DRAGEN 3.3 over previous DRAGEN versions (3.2) as well as other pipelines (GATK4 MuTect2 and Strelka2) are shown in the plot below. Gains are measured for both SNVs and indels on most datasets.
DRAGEN v3.3 delivers unprecedentedly fast run times for somatic T/N WGS processing. Users of previous DRAGEN versions will notice substantial speed gains in DRAGEN 3.3 (see graph below). For datasets that were previously HMM-limited, v3.3 delivers up to 6-fold speed improvements, with a typical 100x (tumor) and 40x (normal) run finishing within 1 hour and 40 minutes on an on-premise DRAGEN server. In the cloud, run times average 2 hours and 30 minutes.
The run time gains were obtained from optimizations in the upstream stages of the pipeline: a more efficient way of defining regions of interest, and a higher MAPQ threshold for reads passed downstream (i.e., fewer reads are passed downstream, without loss of sensitivity). Additionally, the accelerated HMM engines were optimized to consume less of the FPGA footprint, so that more engines can run in parallel.
Run-time comparison for T/N WGS Somatic Calling
About the DRAGEN Somatic Pipeline
The DRAGEN Somatic Pipeline provides highly accurate, ultra-rapid secondary analysis for tumor-only and tumor/normal experiments to identify cancer-associated mutations.
The DRAGEN Somatic Pipeline offers flexible data analysis to suit the specific needs of users. DRAGEN accepts FASTQ, BAM/CRAM, and BCL files and supports NGS input from whole genome, whole exome, and targeted cancer panels. In the tumor/normal pipeline, both samples go through identical processing steps of mapping, aligning, sorting, and duplicate marking. Then, both sets of tumor and normal reads are passed through the somatic variant caller which looks for sites exhibiting a mutation in the tumor reads while showing little to no evidence of the mutation in the normal reads, thus producing a VCF file containing tumor-specific mutations. The Somatic Pipeline also reports allele frequency, allowing users to assess the prevalence of a specific mutation.
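The idea of a site that shows a mutation in the tumor reads but little to no evidence in the normal reads can be sketched with a naive threshold rule on allele frequency. This is only an illustration of the concept, not the DRAGEN somatic caller: the thresholds, function names, and read counts below are all hypothetical.

```python
def allele_frequency(alt_reads, depth):
    """Fraction of reads at a site that support the alternate allele."""
    return alt_reads / depth if depth > 0 else 0.0


def is_tumor_specific(tumor_alt, tumor_depth, normal_alt, normal_depth,
                      min_tumor_af=0.05, max_normal_af=0.02):
    """Naive rule: enough alt evidence in the tumor, almost none in the normal.

    Both thresholds are hypothetical values chosen for this example.
    """
    tumor_af = allele_frequency(tumor_alt, tumor_depth)
    normal_af = allele_frequency(normal_alt, normal_depth)
    return tumor_af >= min_tumor_af and normal_af <= max_normal_af


# Hypothetical (alt_reads, depth) pairs at 100x tumor and 40x normal coverage.
sites = [
    {"site": "site_A", "tumor": (20, 100), "normal": (0, 40)},   # tumor only
    {"site": "site_B", "tumor": (48, 100), "normal": (19, 40)},  # present in both
    {"site": "site_C", "tumor": (1, 100), "normal": (0, 40)},    # too little evidence
]

tumor_specific_sites = [
    s["site"] for s in sites if is_tumor_specific(*s["tumor"], *s["normal"])
]
```

A variant seen at similar frequency in both samples (site_B) is more consistent with a germline variant, which is why the matched normal is used to exclude it from the tumor-specific set.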
In the tumor-only pipeline, users input NGS data from a tumor sample and run it through the same pipeline as for tumor/normal analysis, but without a matching normal sample. The somatic variant caller contains algorithms that distinguish low-frequency alleles from background noise. Although the resulting VCF file does not distinguish germline from somatic variants, it allows researchers and clinicians to determine whether a mutation is present in a tumor sample and at what allele frequency.
Have any feedback, suggestions or data that you’d like to share with the DRAGEN team? Our new community forum is an active, collaborative hub for connecting and sharing feedback.
For Research Use Only. Not for use in diagnostic procedures.
Perspectives on training and on-boarding users of the Genomics England Cancer Program
By Jawahar Swaminathan, Ph.D., Program Manager – Population Genomics (aided by Keira Cheetham, Ph.D., Staff Bioinformatics Scientist)
Illumina and Genomics England announced the Bioinformatics and Clinical Interpretation partnership (BCIP) in February 2016 with the aim to “develop a platform and knowledge base that can be used to improve and automate genome interpretation.” As part of this collaboration, Illumina developed a customized version of BaseSpace™ Variant Interpreter (BSVI) for cancer and rare disease, including various backend services to allow integration between the Genomics England case dispatch pipeline and Illumina systems. What followed was a rigorous schedule of meetings between Genomics England and Illumina (read as long hours, late nights, lots of coffee and many meetings at Genomics England HQ in London!) leading to development of essential features for cancer interpretation.
In June 2017, following multiple rounds of user acceptance testing and concordance checks, BSVI was adopted by Genomics England as the default interpretation solution. Illumina then began on-boarding users at the 13 Genomic Medicine Centres (GMC), the recruiting hubs for various regions of England, by organizing training sessions on the use of the software, with particular focus on the unique way data entered and left the system. This article looks back on these activities and how they are helping in the development of genome interpretation software that meets the diverse needs of the Genomics England end users.
The GMC training sessions
Over the course of 2018, we carried out training and outreach activities across most of the GMCs. The GMCs are the recruitment hubs for the Genomics England 100,000 Genomes Project and comprise multiple hospitals centered around a geographical area that has the necessary expertise. All training activities were organized by the Genomics England Cancer Interpretation team and were also attended by a representative from Genomics England.
Some humorous takeaways:
- Long hours on an early morning packed train from Cambridge (where we are situated) to our destination city, including a hurriedly eaten lunch at a busy Costa Coffee (yes almost every hospital in the UK has one of these) at the hospital before the training! Throw in the occasional aborted visit due to an alarmingly growing windscreen crack on a rental car or boarding the wrong train and you have the makings of a long and interesting day.
- Every NHS hospital looks the same. The usual 1960s concrete exterior, the same typeface on the signs and the same warren of corridors to the Clinical Genetics department.
- Working out how to use the different display equipment in different hospitals before attempting to figure out internet connectivity on the slow and ageing hospital computer systems.
- Hot chocolate or a burrito on the return leg at the local train station as a treat for a job well done.
- Never work with children, animals, or live demos. Although we always got the live demo to work!
All training activities were conducted by my colleague Keira Cheetham and me and involved a mix of presentations and live demos using cases specific to the GMC, followed by hands-on instructions on how to use the software and send results back to Genomics England for reporting. The training was also an opportunity for us to talk about the science around interpreting cancer genomes and how Illumina is facilitating greater insights into cancers with whole genome sequencing (WGS).
This was also a great opportunity to see how the BCIP tools were used by GMC users, and any feedback (both good and bad) was gratefully received. We also spoke about upcoming features in these sessions. Attendance at these events varied from 2-10 users per GMC and the venues ranged from really tight spaces (sometimes with windows!) to large meeting rooms and everything in between. However, what was consistent throughout was the motivation and dedication of the NHS staff in delivering the best possible care to their patients recruited into the Genomics England 100,000 Genomes Project cancer program.
Illumina continues to work with Genomics England to extend its BCIP tools for Rare Disease interpretation. This offering will soon be available for user acceptance testing and, following that, could be used in Genomics England’s suite of clinical interpretation systems. In the meantime, the UK NHS has announced the commissioning of WGS for rare disease and cancer, to be offered throughout the health system. The outreach activities of 2018 carried out by Keira and me for cancer will stand us in good stead for the next round of training for rare disease.
The Genomics England Cancer Outreach Program by numbers
- ~76 GMC users across 11 GMCs trained
- ~34 hours of training imparted
- ~4000 miles travelled (all by British Rail barring Belfast, Northern Ireland)
 The version of BSVI co-developed with Genomics England as part of the BCIP contains extensive customizations for their use cases and is not openly accessible to the public. Please contact your Illumina sales representative for guidance on how to use the publicly available version of BSVI.
Advancing Workflows through Relentless Innovation
We’ve been busy over the last few months! Back in May, Illumina announced the acquisition of Edico Genome and the DRAGEN™ (Dynamic Read Analysis for GENomics) technology. Since then, we have been hard at work expanding DRAGEN’s capabilities to provide more advanced, robust and performant pipelines for our customers. With the inclusion of DRAGEN into the Illumina ecosystem, we are now able to take advantage of the expertise of both teams to build out an expanded chest of tools that offer added functionality, benefits and ease-of-use.
The team has come a long way since we last published about DRAGEN on the BaseSpace™ Blog, and we are excited to share some insight into what we have been working on. Over the coming months, we will continue to post about our latest updates and activities to keep you updated.
Earlier this month, we released DRAGEN v3.2.8, which introduces a variety of new capabilities designed to deliver more insights from your data.
The Run Monitoring features in BaseSpace™ Sequence Hub (BSSH) enable users to remotely monitor the quality of their sequencing runs and troubleshoot sequencing errors. As part of our efforts to extend real time Run Monitoring capabilities, we recently released new data quality metrics in BSSH.
% Occupancy for iSeq™ and MiniSeq™ instruments
In a previous release, we added the % Occupied measure in the Charts section of Run Monitoring for the NovaSeq™ systems. As part of this release, this metric is now also visible for iSeq and MiniSeq systems in BaseSpace Sequence Hub. This measure can be used to understand loading concentrations on the flow cell.
For patterned and non-patterned flow cells, % Occupancy is the percentage of clusters on the flow cell that have DNA that can ultimately be sequenced. With patterned flow cells (such as iSeq), the number of nanowells on the patterned grid determines the total number of possible clusters. For non-patterned flow cells (such as MiniSeq), the total number of possible clusters is the number of non-duplicated spots identified by Real Time Analysis (RTA) during template generation.
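As a small worked example of the definition above, % Occupancy can be written as the occupied count divided by the total number of possible clusters. The function name and the counts used here are hypothetical and are not taken from BaseSpace Sequence Hub or RTA.

```python
def percent_occupied(occupied_count, total_possible_clusters):
    """Percentage of possible clusters that contain DNA.

    total_possible_clusters is the number of nanowells (patterned flow cell)
    or the number of non-duplicated spots found during template generation
    (non-patterned flow cell).
    """
    if total_possible_clusters <= 0:
        raise ValueError("total_possible_clusters must be positive")
    return 100.0 * occupied_count / total_possible_clusters


# Hypothetical tile: 750 occupied wells out of 1000 possible.
example = percent_occupied(750, 1000)
```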
% Pass Filter (%PF) settings for all instruments
The Flow Cell chart in BaseSpace Sequence Hub has also been updated to include the % Pass Filter (%PF) for all instruments. This additional information allows users to determine whether particular tiles of a flow cell have unusual levels of %PF.
With these enhancements, we have added capabilities that are currently not available in Sequence Analysis Viewer (SAV). SAV will be updated in the future so our users have a consistent experience across SAV and BSSH.
BaseSpace™ Sequence Hub is used by investigators around the world to facilitate and scale their sequencing and genomic data analysis operations. At Illumina, we understand that security, privacy, and confidentiality are complex issues, and we are committed to protecting our software-as-a-service (SaaS) customers’ data.
To ensure that our customers remain compliant with upcoming changes to the EU General Data Protection Regulation (GDPR), we’ve made a number of updates to privacy practices, policies and agreements that are effective May 25, 2018 for all users globally. These changes include explaining in more detail how we use your information, including your choices, rights, and controls.
Privacy and compliance are a shared responsibility between Illumina and our customers. We are responsible for the security of the BaseSpace Sequence Hub platform. Our cloud provider, Amazon Web Services (AWS), is responsible for providing the tools, services and functionality that enable both the data controller (our customers) and the data processor (Illumina) to be successful.
Figure 1: Shared responsibility Model
A short summary of our changes:
- Improved clarity and transparency: As a key part of GDPR compliance, we’ve described our data processing practices in clear language. For instruments sending Instrument Performance Data (IPD) to BaseSpace Sequence Hub, or connected in the Run Monitoring or Storage and Analysis mode, our updated Illumina® Proactive Technical Note (Link) clearly explains what data is sent to BaseSpace in each of the connectivity modes.
- Data Protection Addendum: BaseSpace Sequence Hub leverages AWS to deliver its services. The updated AWS Service Terms (Link) incorporate the GDPR Data Processing Addendum (DPA) and will automatically apply to all customers. Illumina is willing to sign a DPA for customers who ask for it.
- Opt-in & Opt-out: Sharing data with BaseSpace Sequence Hub, irrespective of connectivity mode, is entirely controlled by our customers. If you would like to opt out of sharing Instrument Performance Data (IPD), Run Monitoring, or Storage and Analysis mode, you can do so at any time.
In addition, we are continually reviewing and updating our security best practices to safeguard your data and the services we provide. We are ISO 27001 certified, which has a direct emphasis on international compliance and governance. Please review our security and data privacy whitepaper (Link) to learn more about our security practices.
We hope this makes your use of our SaaS products much easier. As always, please contact us at email@example.com if you have any questions.
The ability to monitor sequencing runs in real time helps users identify issues early and prevent costly sequencing errors. Many users rely on the Sequencing Analysis Viewer (SAV) to access detailed quality metrics generated by the real-time analysis software on Illumina instruments.
BaseSpace Sequence Hub has enabled users to remotely monitor their sequencing runs with the Run Charts function, which has an interface very similar to that of SAV. We have recently released a synchronized update with SAV to offer an expanded set of metrics for monitoring run quality. At the same time, we have added a few capabilities previously only present in SAV. These enhancements provide a consistent experience and enable users to make informed decisions on the quality of their sequencing runs, whether they are standing in front of their instrument accessing SAV or monitoring the run remotely using BaseSpace Sequence Hub.
Expanded menu of metrics that maintains consistency with SAV
BaseSpace Sequence Hub now includes per cycle Phasing and Pre-phasing metrics, % No Call, and Median QScore measures in the Charts section of Run Monitoring. These measures were also released as part of SAV 2.4.5. % No Call & Median QScores are available for all sequencing platforms. The new Phasing/Pre-phasing metrics are available for all platforms except MiSeq and HiSeq 2000/2500.
Traditional Phasing (and pre-phasing) metrics, which were calculated once at cycle 25, are now listed as “Legacy Phasing Rate.” The new per-cycle weights are listed as “Phasing Weight” in the Run Charts.
The Charts section of Run Monitoring now includes the same menu structure as SAV 2.4.5. Now, metrics in the drop down menus only appear if they are available for the cycle, significantly improving the usability of the charts.
Extracted, Called, and Scored cycles have a minimum-maximum range
Run Monitoring now provides Extracted, Called, and Scored cycles as a minimum-maximum range during an instrument run. Previously, Run Monitoring showed only the maximum cycles. A wide spread between the leading and lagging tile might be an indication of a run problem. Now users can easily spot a problem with their run on both SAV and BaseSpace Sequence Hub.
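The minimum-maximum range can be pictured as the spread between the lagging and leading tiles. The sketch below is a hypothetical illustration only: the tile identifiers, the per-tile cycle counts, and the spread threshold are assumptions, not values or logic taken from SAV or BaseSpace Sequence Hub.

```python
def cycle_range(cycles_by_tile):
    """Return (minimum, maximum) completed cycle across all tiles."""
    values = list(cycles_by_tile.values())
    return min(values), max(values)


def has_wide_spread(cycles_by_tile, max_spread=5):
    """Flag a run when the leading tile is far ahead of the lagging tile.

    max_spread is a hypothetical threshold chosen for this example.
    """
    lowest, highest = cycle_range(cycles_by_tile)
    return highest - lowest > max_spread


# Hypothetical "Extracted" cycle per tile during a run.
extracted = {"1101": 52, "1102": 52, "1103": 51, "1104": 38}

extracted_range = cycle_range(extracted)
needs_attention = has_wide_spread(extracted)
```

Reporting only the maximum (52 here) would hide the lagging tile at cycle 38, which is the case the range display is meant to expose.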
New Metrics in Both SAV and BaseSpace Sequence Hub
In addition to the changes enumerated above, both SAV and BaseSpace Sequence Hub now include Occupied Count (K) and % Occupied measures in the Charts section of Run Monitoring for NovaSeq systems. The Occupied Count is a measure of the number of wells on the flow cell with DNA. Adding these new metrics will help users understand their loading concentrations and identify issues with their sequencing run.
For Research Use Only. Not for use in diagnostic procedures.
Integration and interoperability between laboratory systems, or the lack thereof, remains a challenge for those performing next-generation sequencing (NGS) or other genomics studies.[i] To address this challenge, we developed version 2.2 of the integration between BaseSpace Clarity LIMS and the NovaSeq 6000 instrument. This integration now supports the NovaSeq S1 flow cell.
The NovaSeq S1 flow cell delivers up to 0.5TB of output in two days and is ideally suited for high-intensity sequencing applications. Users can now sequence up to 8 human genomes or 80 exomes per run in approximately 24 hours.[ii] And now, users of both BaseSpace Clarity LIMS and the NovaSeq 6000 instrument can access this out-of-the-box integration to quickly get up and running with their system.
The NovaSeq 6000 version 2.0 Workflow in BaseSpace Clarity LIMS that supports the integration version 2.2.1
The integration helps users track samples throughout the workflow. Specifically, it:
- Supports S1, S2, and S4 flow cells per sample
- Supports different applications on the same flow cell
- Calculates sample and reagent volumes based on the flow cell type
- Creates an output file for use with liquid handling robots
- Validates every step in the workflow
The integration also tracks sequencing run information in BaseSpace Clarity LIMS to help with troubleshooting or trending:
- Run recipe files (JSON) are automatically generated to set up and initiate the run
- Sample sheets, which are compatible with BaseSpace Sequence Hub and bcl2fastq v2.19, are automatically generated and placed directly on the NovaSeq 6000 instrument
- Sequencing runs are tracked and run metrics are parsed per lane and per flow cell
If you have questions about this integration, please contact Technical Support.
For Research Use Only. Not for use in diagnostic procedures.
[i] Next-Generation Sequencing Informatics: Challenges and … http://www.archivesofpathology.org/doi/10.5858/arpa.2015-0507-RA. Accessed November 14, 2017.
[ii] Illumina.com. (2017). Illumina Releases NovaSeq S4 Flow Cell and NovaSeq Xp Workflow. [online] Available at: https://www.illumina.com/company/news-center/press-releases/2017/2308795.html [Accessed 16 Nov. 2017].