Interpreting structural variation in cancer genomes

A user story from the Genomics England 100,000 Genomes Project cancer programme [1]

By Jawahar Swaminathan, Ph.D., Program Manager – Population Genomics

Illumina and Genomics England announced a Bioinformatics and Clinical Interpretation partnership (BCIP) in 2016 to “develop a platform and knowledge base that can be used to improve and automate genome interpretation.” In 2017, following months of directed development and rigorous testing, BaseSpace™ Variant Interpreter (BSVI) was adopted by Genomics England as the default interpretation partner for cancer cases in the 100,000 Genomes Project. A previous blog article presented a light-hearted take on the on-boarding and outreach activities by the Illumina team across the 13 constituent Genomic Medicine Centres (GMCs) of the 100,000 Genomes Project. In this article, we showcase how Dr. Patrick Tarpey and Jamie Trotman from the Cancer Genetics Group at Addenbrooke’s Hospital, Cambridge used the BCIP tools to interpret biologically relevant structural variants in cancer cases from the East of England GMC.

Case 1

An 11-year-old boy presented with a complex brain tumor, initially described as a biphasic neuroepithelial tumour (low-grade with features similar to those of a desmoplastic infantile ganglioglioma, and high-grade astrocytic tumour). This case was recruited into the 100,000 Genomes Project, where whole-genome sequencing of the tumor and matched normal samples was performed. No variants of interest were initially identified in the static report presented back to the GMC, using analysis from the Genomics England standard pipeline. However, upon deeper analysis with the BCIP tools, Patrick and Jamie were able to identify a novel ZNF394-BRAF gene fusion by visualizing the structural variants and confirming the call via filtering and review of the variant call metrics. This fusion is highly reminiscent of the activating KIAA1549-BRAF fusion (Faulkner et al, 2015, PMID: 26222501) that is a leading cause of pilocytic astrocytoma. The fusion product was predicted to result in the formation of an unregulated kinase domain of BRAF (Exons 10-18). BRAF helps transmit chemical signals from outside the cell to the cell’s nucleus and forms part of a signaling pathway known as the RAS/MAPK pathway, which regulates cell differentiation, migration and apoptosis.

Figure 1: Data visualization showing ZNF394-BRAF fusion
Figure 2: ZNF394-BRAF gene fusion as shown in variant grid

The identified ZNF394-BRAF fusion was subsequently confirmed via an orthogonal test, which led the clinician to treat the patient with MEK inhibitors. MEK is downstream from BRAF in the growth factor activation pathway (Figure 3) and the inhibitor is expected to target BRAF-activated cells.

Figure 3: Growth factor activation pathway (Picture courtesy: J. Trotman)

Case 2

This case had an ambiguous diagnosis: it presented as a glioblastoma but, on histological review two years after treatment, was diagnosed as a pilocytic astrocytoma. Whilst the static reports from Genomics England did not find any variants of interest, a review of the case in BSVI led to the discovery of a large 26.1 Mb deletion in chromosome 2, suggesting a CCDC88A-ALK fusion. ALK is a neuronal receptor tyrosine kinase that plays a critical role in the development of the nervous system and is selectively expressed in the peripheral and central nervous systems. The domain architecture of ALK shows that it is primarily composed of two MAM domains (MAM1 and MAM2) and a tyrosine kinase domain. The MAM (meprin, A-5 protein, and receptor protein-tyrosine phosphatase mu) domains are predicted to play a role in homodimerization of the receptor kinase and to regulate the function of the enzyme (Marchand et al, 1996, PMID: 8798668).

Figure 4: Domain organization of the ALK gene

The fusion product seen in the case suggests that the breakpoints are intronic and lead to the production of a chimeric protein that eliminates the MAM domains of ALK, thereby leading to an activated kinase. Structural variants with intronic breakpoints support the use of whole-genome sequencing in cancer, since these events are unlikely to be identified via targeted, hybrid-capture methods such as whole-exome sequencing.

Figure 5: Visualization and variant grid showing the CCDC88A-ALK gene fusion (Courtesy: J. Trotman)

Prior studies have identified CCDC88A-ALK as a recurrent fusion in ependymoma-like gliomas characterized by both ependymal and astrocytic features (Olsen et al, 2015, PMID: 25795305). This is a critical finding as there are selective ALK inhibitors that could be administered in this case. The fusion was subsequently confirmed in the tumor via orthogonal methods, i.e. PCR and Sanger sequencing.

Summary

The two examples show how visualization of cases, accompanied by appropriate use of filters (MGE10KB) and gene lists and coupled with a manual check of the variant calls, has resulted in the identification of biologically relevant variants and insight into disease mechanism. As a power user of the BCIP tools deployed to support the 100,000 Genomes Project, Dr. Tarpey says “BSVI analysis of cancer genomes is invaluable to access all variants (regardless of vcf filter status), and to visualise variants (particularly SVs) to inform validity. The numerous opportunities for triage facilitate appropriate analysis strategies across the diverse array of cancer types.”

Biographies

Jamie Trotman is a pre-registered Clinical Scientist. His role is in the analysis and interpretation of 100,000 Genomes Project cancer programme data and report writing for the East of England GMC and the East Midlands and East of England Genomics Laboratory Hub.

Patrick Tarpey is a group leader in the Department of Clinical Genetics at the Cambridge University Hospitals NHS Trust. After a brief period in clinical diagnostics, Patrick moved to Mike Stratton’s team at the Sanger Institute to pursue a project on hereditary X-linked disease via sequencing of the entire genic X-chromosome in a cohort of 100 probands with X-linked disease. This endeavor identified multiple new disease genes which have since been incorporated into routine diagnostics.

He later migrated onto the cancer genome project and pursued multiple projects aimed at unravelling the landscape of somatically acquired variation in breast, bone and other cancer types. This led to the discovery of multiple novel cancer genes, including those of clinical potential. Patrick has a lead role in developing and expanding cancer genome services (familial and acquired) in the recently formed East Anglia and East Midlands Genomic Laboratory Hub (GLH).

For Research Use Only. Not for use in diagnostic procedures.  

[1] This version of BaseSpace Variant Interpreter co-developed with Genomics England as part of the BCIP contains extensive customizations for their use cases and is not openly accessible to the public. Please contact your Illumina sales representative for guidance on how to use the publicly available version of Variant Interpreter.

Somatic Pipeline Improvements with DRAGEN v3.3

by Severine Catreux – Associate Director, Bioinformatics FPGA Development

Significant accuracy gains and speed improvements with DRAGEN v3.3, released April 2019

The DRAGEN engineering and bioinformatics team is excited to announce a new DRAGEN release, v3.3. The second of several releases scheduled for 2019, DRAGEN v3.3 contains improvements across the many pipeline offerings now supported by the DRAGEN platform. This includes accuracy improvements in the germline and somatic pipelines, new features (e.g. CNV DeNovo calling and RNA quantification) and speed gains (Somatic T/N, BCL conversion).

Please see the DRAGEN v3.3 Release Notes for more details. This blog highlights the significant updates to the DRAGEN Somatic Pipeline for small variants that are part of the v3.3 release.

As one of DRAGEN’s core pipelines, the DRAGEN Somatic Pipeline for small variants is utilized by cancer research institutes around the globe. Expanding on the existing functionality, accuracy and speed of the DRAGEN Somatic Pipeline, the v3.3 release placed a high focus on the somatic tumor/normal WGS mode, producing step-function improvements in both accuracy and speed.

Accuracy Improvements:

During the development cycle for v3.3, the DRAGEN engineering and bioinformatics teams took a deep dive into the DRAGEN Somatic Pipeline tumor/normal mode, strengthening the existing algorithm for accuracy improvements. Specific improvements were made in the genotyping module, to replace point estimation of the variant allele frequency with continuous integration over a range of possible frequencies. This led to significant gains in both sensitivity and precision. Additionally, downstream filtering rules were improved to optimize both sensitivity and precision (less stringency on clustered variants, filter variants positioned at the edge of reads, filter variants with low median base quality and MAPQ). Finally, the indel PCR error model autocalibration module was made independent between the tumor and normal control, to allow for differences in library preparation between the tumor sample and the control sample.
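To give a feel for why replacing a point estimate with integration matters, here is a minimal Python sketch. It is an illustration only and not the DRAGEN implementation: the binomial read model, the symmetric per-read error rate `err`, the uniform prior over allele frequencies, and all function names are assumptions made for this example.

```python
import math


def binom_pmf(k, n, p):
    """Probability of seeing k alt reads out of n when each read is alt with probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def read_alt_prob(f, err):
    """Chance that a single read shows the alt allele, given a true allele
    frequency f and a symmetric per-read error rate err (assumed model)."""
    return f * (1 - err) + (1 - f) * err


def point_estimate_likelihood(k, n, err):
    """Likelihood evaluated at one frequency only: the observed fraction k / n.
    Assumes n > 0."""
    f_hat = k / n
    return binom_pmf(k, n, read_alt_prob(f_hat, err))


def integrated_likelihood(k, n, err, steps=1000):
    """Likelihood averaged over all frequencies in [0, 1] under a uniform
    prior, approximated with the midpoint rule."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        f = (i + 0.5) * width
        total += binom_pmf(k, n, read_alt_prob(f, err)) * width
    return total


if __name__ == "__main__":
    # Hypothetical site: 3 alt-supporting reads out of 50 in the tumor sample.
    print("point estimate:", point_estimate_likelihood(3, 50, 0.01))
    print("integrated:    ", integrated_likelihood(3, 50, 0.01))
```

The point estimate commits to a single frequency, while the integrated form weighs every frequency that could have produced the observed reads, which is the general idea behind the change described above.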

These changes are precursors to further accuracy improvements planned for the DRAGEN v3.4 release, specifically in the area of liquid tumor support, where tumor-in-normal contamination will be taken into account.

Accuracy gains of DRAGEN 3.3 over previous DRAGEN versions (3.2) as well as other pipelines (GATK4 MuTect2 and Strelka2) are shown in the plot below. Gains are measured for both SNVs and indels on most datasets.

Figure 1: Comparison of False-Positives (FP) and False-Negatives (FN) between GATK4, Strelka2, DRAGEN 3.2 and DRAGEN 3.3. Lower values are better.

Figure 2: The above chart showcases sensitivity improvements in DRAGEN v3.3 in comparison to DRAGEN v3.2 for indels and SNPs.

Speed Gains

DRAGEN v3.3 delivers unprecedentedly fast run times for somatic T/N WGS processing. Users of previous DRAGEN versions will notice substantial speed gains in DRAGEN 3.3 (see graph below). For datasets that were previously HMM-limited, v3.3 delivers up to 6-fold speed improvements, with a typical 100x (tumor) and 40x (normal) run finishing within 1 hour and 40 minutes on an on-premises DRAGEN server. In the cloud, run times average 2 hours and 30 minutes.

The run time gains were obtained from optimizations in the upstream stages of the pipeline (a more efficient way of defining regions of interest, and a higher MAPQ threshold for reads passed downstream, i.e., fewer reads get passed downstream, without loss of sensitivity). Additionally, the accelerated HMM engines were optimized to consume less of the FPGA footprint, such that more engines could be run in parallel.

Run-time comparison for T/N WGS Somatic Calling

Figure 3: The above chart compares DRAGEN v3.2 (Jan. 2019) and v3.3 for tumor-normal whole genome sequencing somatic calling. DRAGEN v3.3 introduces significant speed improvements.

About the DRAGEN Somatic Pipeline

The DRAGEN Somatic Pipeline provides highly accurate, ultra-rapid secondary analysis for tumor-only and tumor/normal experiments to identify cancer-associated mutations.

Tumor/Normal Mode

The DRAGEN Somatic Pipeline offers flexible data analysis to suit the specific needs of users. DRAGEN accepts FASTQ, BAM/CRAM, and BCL files and supports NGS input from whole genome, whole exome, and targeted cancer panels. In the tumor/normal pipeline, both samples go through identical processing steps of mapping, aligning, sorting, and duplicate marking. Then, both sets of tumor and normal reads are passed through the somatic variant caller which looks for sites exhibiting a mutation in the tumor reads while showing little to no evidence of the mutation in the normal reads, thus producing a VCF file containing tumor-specific mutations. The Somatic Pipeline also reports allele frequency, allowing users to assess the prevalence of a specific mutation.
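The core idea of tumor/normal calling, a mutation seen in the tumor reads with little to no evidence in the normal reads, can be sketched in a few lines of Python. This is a simplified illustration and not how the DRAGEN somatic caller works internally: the function names and both thresholds (`min_tumor_af`, `max_normal_af`) are hypothetical values chosen for the example.

```python
def allele_fraction(alt_reads, total_reads):
    """Fraction of reads at a site that support the alternate allele."""
    return alt_reads / total_reads if total_reads else 0.0


def is_somatic_candidate(tumor_alt, tumor_depth, normal_alt, normal_depth,
                         min_tumor_af=0.05, max_normal_af=0.01):
    """Return True when the alt allele is present in the tumor but
    essentially absent from the matched normal (hypothetical thresholds)."""
    tumor_af = allele_fraction(tumor_alt, tumor_depth)
    normal_af = allele_fraction(normal_alt, normal_depth)
    return tumor_af >= min_tumor_af and normal_af <= max_normal_af


if __name__ == "__main__":
    # Tumor-specific: 12% in the tumor, absent from the normal.
    print(is_somatic_candidate(12, 100, 0, 40))
    # Germline-like: present in both samples, so not tumor-specific.
    print(is_somatic_candidate(20, 100, 18, 40))
```

A real caller weighs base qualities, mapping qualities and error models rather than applying hard cut-offs, but the comparison between the two samples is the same in spirit.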


Figure 4: Tumor-Normal pipeline diagram

Tumor-only Mode

In the tumor-only pipeline, users input NGS data from a tumor sample and run it through the same pipeline as for tumor/normal analysis, but without the matching normal sample. The somatic variant caller contains algorithms that distinguish low-frequency alleles from background noise. Although the resulting VCF file does not distinguish germline from somatic variants, it allows researchers and clinicians to determine if a mutation is present in a tumor sample and its allele frequency.

Figure 5: Tumor-only pipeline diagram

Have any feedback, suggestions or data that you’d like to share with the DRAGEN team? Our new community forum is an active, collaborative hub for connecting and sharing feedback.


For Research Use Only. Not for use in diagnostic procedures.


Enabling Cancer Interpretation At Scale For The Genomics England 100K Genomes Project

Perspectives on training and on-boarding users of the Genomics England Cancer Program

By Jawahar Swaminathan, Ph.D., Program Manager – Population Genomics (aided by Keira Cheetham, Ph.D., Staff Bioinformatics Scientist)

Illumina and Genomics England announced the Bioinformatics and Clinical Interpretation partnership (BCIP) in February 2016 with the aim to “develop a platform and knowledge base that can be used to improve and automate genome interpretation.” As part of this collaboration, Illumina developed a customized version of BaseSpace Variant Interpreter (BSVI)[1] for cancer and rare disease, including various backend services to allow integration between the Genomics England case dispatch pipeline and Illumina systems. What followed was a rigorous schedule of meetings between Genomics England and Illumina (read as long hours, late nights, lots of coffee and many meetings at Genomics England HQ in London!) leading to development of essential features for cancer interpretation.

In June 2017, following multiple rounds of user acceptance testing and concordance checks, BSVI was adopted by Genomics England as the default interpretation solution. Illumina then began the process of on-boarding users at the 13 Genomic Medicine Centres (GMCs), the recruiting hubs for various regions of England, by organizing training sessions on the use of the software, with particular focus on the unique way data entered and left the system. This article is a look back at these activities and how they are helping in the development of genome interpretation software that meets the diverse needs of the Genomics England end users.

Figure 1: The Genomics England Genomic Medicine Centres (Image Courtesy: Genomics England Ltd.)

The GMC training sessions

Over the course of 2018, we carried out training and outreach activities across most of the GMCs. The GMCs are the recruitment hubs for the Genomics England 100,000 Genomes Project and comprise multiple hospitals centered around a geographical area that has the necessary expertise. All training activities were organized by the Genomics England Cancer Interpretation team and were also attended by a representative from Genomics England.

Some humorous takeaways:

  1. Long hours on a packed early-morning train from Cambridge (where we are situated) to our destination city, including a hurriedly eaten lunch at a busy Costa Coffee (yes, almost every hospital in the UK has one of these) at the hospital before the training! Throw in the occasional aborted visit due to an alarmingly growing windscreen crack on a rental car or boarding the wrong train and you have the makings of a long and interesting day.
  2. Every NHS hospital looks the same. The usual 1960s concrete exterior, the same typeface on the signs and the same warren of corridors to the Clinical Genetics department.
  3. Working out how to use the different display equipment in different hospitals before attempting to figure out internet connectivity on the slow and ageing hospital computer systems.
  4. Hot chocolate or a burrito on the return leg at the local train station as a treat for a job well done.
  5. Never work with children, animals, or live demos. Although we always got the live demo to work!

All training activities were conducted by my colleague Keira Cheetham and me and involved a mix of presentations and live demos using cases specific to the GMC, followed by hands-on instructions on how to use the software and send results back to Genomics England for reporting. The training was also an opportunity for us to talk about the science around interpreting cancer genomes and how Illumina is facilitating greater insights into cancers with whole genome sequencing (WGS).

This was also a great opportunity to see how the BCIP tools were used by GMC users, and any feedback (both good and bad) was gratefully received. We also spoke about upcoming features in these sessions. Attendance at these events varied from 2-10 users per GMC and the venues ranged from really tight spaces (sometimes with windows!) to large meeting rooms and everything in between. However, what was consistent throughout was the motivation and dedication of the NHS staff in delivering the best possible care to their patients recruited into the Genomics England 100,000 Genomes Project cancer program.

Illumina continues to work with Genomics England to extend its BCIP tools for Rare Disease interpretation. This offering will soon be available for user acceptance testing and, following that, could be used in Genomics England’s suite of clinical interpretation systems. In the meantime, the UK NHS has announced the commissioning of WGS for rare disease and cancer, to be offered throughout the health system. The outreach activities of 2018 carried out by Keira and me for cancer will stand us in good stead for the next round of training for rare disease.

The Genomics England Cancer Outreach Program by numbers

  • ~76 GMC users across 11 GMCs trained
  • ~34 hours of training imparted
  • ~4000 miles travelled (all by British Rail barring Belfast, Northern Ireland)

[1] The version of BSVI co-developed with Genomics England as part of the BCIP contains extensive customizations for their use cases and is not openly accessible to the public. Please contact your Illumina sales representative for guidance on how to use the publicly available version of BSVI.

Doing more with DRAGEN™ v3.2.8

Advancing Workflows through Relentless Innovation

We’ve been busy over the last few months! Back in May, Illumina announced the acquisition of Edico Genome and the DRAGEN™ (Dynamic Read Analysis for GENomics) technology. Since then, we have been hard at work expanding DRAGEN’s capabilities to provide more advanced, robust and performant pipelines for our customers. With the inclusion of DRAGEN into the Illumina ecosystem, we are now able to take advantage of the expertise of both teams to build out an expanded chest of tools that offer added functionality, benefits and ease-of-use.

The team has come a long way since we last published about DRAGEN on the BaseSpace™ Blog, and we are excited to share some insight into what we have been working on. Over the coming months, we will continue to post about our latest updates and activities to keep you updated.

Earlier this month, we released DRAGEN v3.2.8, which introduces a variety of new capabilities designed to deliver more insights from your data.


New Sequence Quality Metrics in BaseSpace™ Sequence Hub

The Run Monitoring features in BaseSpace™ Sequence Hub (BSSH) enable users to remotely monitor the quality of their sequencing runs and troubleshoot sequencing errors. As part of our efforts to extend real time Run Monitoring capabilities, we recently released new data quality metrics in BSSH.

 

% Occupancy for iSeq™ and MiniSeq™ instruments

In a previous release, we added the % Occupied measure in the Charts section of Run Monitoring for the NovaSeq™ systems. As part of this release, this metric will now be visible for iSeq and MiniSeq systems in BaseSpace Sequence Hub. This measure can be used to understand loading concentrations on the flow cell.

For patterned and non-patterned flow cells, % Occupancy is the percentage of clusters on the flow cell that have DNA that can ultimately be sequenced. With patterned flow cells (such as iSeq), the number of nanowells on the patterned grid determines the total number of possible clusters. For non-patterned flow cells (such as MiniSeq), the total number of possible clusters is the number of non-duplicated spots identified by Real Time Analysis (RTA) during template generation.
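As a quick worked example of the arithmetic behind the metric, here is a small Python sketch. The function name and the counts are hypothetical; only the ratio itself follows from the definition above, where the denominator is the number of nanowells for patterned flow cells or the number of non-duplicated spots for non-patterned flow cells.

```python
def percent_occupied(occupied_count, total_possible_clusters):
    """Percentage of possible cluster positions that contain DNA."""
    if total_possible_clusters <= 0:
        raise ValueError("total_possible_clusters must be positive")
    return 100.0 * occupied_count / total_possible_clusters


if __name__ == "__main__":
    # Hypothetical tile: 850 occupied positions out of 1000 possible.
    print(percent_occupied(850, 1000))
```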

 


 

% Pass Filter (%PF) settings for all instruments

The Flow Cell chart in BaseSpace Sequence Hub has also been updated to include the % Pass Filter (%PF) for all instruments. This additional information will allow users to determine whether particular tiles of a flow cell have unusual levels of %PF.
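For readers who export per-tile values and want to apply the same kind of check offline, one simple approach is to compare each tile against the median. The Python sketch below is only an illustration: the tile identifiers, the input dictionary and the `max_deviation` cut-off are assumptions, and BaseSpace Sequence Hub does not necessarily use this rule.

```python
import statistics


def flag_unusual_tiles(pf_by_tile, max_deviation=10.0):
    """Return the tiles whose %PF differs from the median %PF by more
    than max_deviation percentage points (hypothetical threshold)."""
    median_pf = statistics.median(pf_by_tile.values())
    return sorted(
        tile for tile, pf in pf_by_tile.items()
        if abs(pf - median_pf) > max_deviation
    )


if __name__ == "__main__":
    # Hypothetical %PF values keyed by tile identifier.
    tiles = {"1101": 82.0, "1102": 81.0, "1103": 55.0, "1104": 83.0}
    print(flag_unusual_tiles(tiles))
```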


With these enhancements, we have added capabilities that are currently not available in Sequencing Analysis Viewer (SAV). SAV will be updated in the future so our users have a consistent experience across SAV and BSSH.

 

#QB6200

Putting Your Privacy First

BaseSpace™ Sequence Hub is used by investigators around the world to facilitate and scale their sequencing and genomic data analysis operations. At Illumina, we understand that security, privacy, and confidentiality are complex issues, and we are committed to protecting our software-as-a-service (SaaS) customers’ data.

To ensure that our customers remain compliant with upcoming changes to the EU General Data Protection Regulation (GDPR), we’ve made a number of updates to privacy practices, policies and agreements that are effective May 25, 2018 for all users globally. These changes include explaining in more detail how we use your information, including your choices, rights, and controls.

Privacy and compliance are a shared responsibility between Illumina and our customers. We are responsible for the security of the BaseSpace Sequence Hub platform. Our cloud provider, Amazon Web Services (AWS), is responsible for providing the tools, services and functionality that enable both the data controller (our customers) and the data processor (Illumina) to be successful.


Figure 1: Shared responsibility Model

 

A short summary of our changes:

  • GDPR and Terms & Conditions (T&Cs). GDPR places new obligations on organizations that process EU personal data. As a result, we have updated our business operational practices. Our Privacy Policy and Terms & Conditions better explain our customers’ and users’ rights, and their relationship with Illumina. In addition, all our NGS product support pages have been updated with a Privacy & Security section.
  • Improved clarity and transparency. As a key part of GDPR compliance, we’ve described our data processing practices in clear language. For instruments sending Instrument Performance Data (IPD) to BaseSpace Sequence Hub, or connected in the Run Monitoring or Storage and Analysis mode, our updated Illumina® Proactive Technical Note clearly explains what data is sent to BaseSpace in each of the connectivity modes.
  • Data Protection Addendum. BaseSpace Sequence Hub leverages AWS to deliver its services. The updated AWS Service Terms incorporate the GDPR Data Processing Addendum (DPA) and will automatically apply to all customers. Illumina is willing to sign a DPA for customers who ask for it.
  • Opt-in & Opt-out. Sharing data with BaseSpace Sequence Hub, irrespective of connectivity mode, is entirely controlled by our customers. If you would like to opt out of sharing Instrument Performance Data (IPD), Run Monitoring, or Storage and Analysis mode, you can do so at any time.

In addition, we are continually reviewing and updating our security best practices to safeguard your data and the services we provide. We are ISO 27001 certified, which has a direct emphasis on international compliance and governance. Please review our security and data privacy whitepaper to learn more about our security practices.

We hope this makes your use of our SaaS products much easier. As always, please contact us at informatics@illumina.com if you have any questions.

QB#6005

Enhanced Run Monitoring in BaseSpace™ Sequence Hub

The ability to monitor sequencing runs in real time helps users identify issues early and prevent costly sequencing errors. Many users rely on the Sequencing Analysis Viewer (SAV) to access detailed quality metrics generated by the real-time analysis software on Illumina instruments.

BaseSpace Sequence Hub has enabled users to remotely monitor their sequencing runs through the Run Charts function, which has an interface very similar to that of SAV. We have recently released a synchronized update with SAV to offer an expanded set of metrics for monitoring run quality. At the same time, we have added a few capabilities previously only present in SAV. These enhancements provide a consistent experience and enable users to make informed decisions on the quality of their sequencing runs, whether they are standing in front of their instrument accessing SAV or monitoring the run remotely using BaseSpace Sequence Hub.

Expanded menu of metrics that maintains consistency with SAV

BaseSpace Sequence Hub now includes per cycle Phasing and Pre-phasing metrics, % No Call, and Median QScore measures in the Charts section of Run Monitoring. These measures were also released as part of SAV 2.4.5. % No Call & Median QScores are available for all sequencing platforms. The new Phasing/Pre-phasing metrics are available for all platforms except MiSeq and HiSeq 2000/2500.


Traditional Phasing (and pre-phasing) metrics, which were calculated once at cycle 25, are now listed as “Legacy Phasing Rate.” The new per-cycle weights are listed as “Phasing Weight” in the Run Charts.


Improved usability

The Charts section of Run Monitoring now includes the same menu structure as SAV 2.4.5. Metrics in the drop-down menus now appear only if they are available for the cycle, significantly improving the usability of the charts.

Extracted, Called, and Scored cycles have a minimum-maximum range

Run Monitoring now provides Extracted, Called, and Scored cycles as a minimum-maximum range during an instrument run. Previously, Run Monitoring showed only the maximum cycles. A wide spread between the leading and lagging tile might be an indication of a run problem. Now users can easily spot a problem with their run on both SAV and BaseSpace Sequence Hub.
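To make the leading and lagging tile idea concrete, here is a short Python sketch that computes the minimum-maximum range and its spread from per-tile cycle counts. The tile identifiers, function names and the `max_spread` threshold are hypothetical; the source does not state what size of spread should be treated as a problem.

```python
def cycle_range(cycles_by_tile):
    """Return (minimum, maximum) completed cycle across all tiles."""
    if not cycles_by_tile:
        raise ValueError("no tiles supplied")
    values = cycles_by_tile.values()
    return min(values), max(values)


def has_wide_spread(cycles_by_tile, max_spread=5):
    """True when the gap between the leading and lagging tile exceeds
    max_spread cycles (hypothetical threshold)."""
    lowest, highest = cycle_range(cycles_by_tile)
    return (highest - lowest) > max_spread


if __name__ == "__main__":
    # Hypothetical 'Extracted' cycle per tile; tile 1103 is lagging.
    extracted = {"1101": 150, "1102": 150, "1103": 120}
    print(cycle_range(extracted), has_wide_spread(extracted))
```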

New Metrics in Both SAV and BaseSpace Sequence Hub

In addition to the changes enumerated above, both SAV and BaseSpace Sequence Hub now include Occupied Count (K) and % Occupied measures in the Charts section of Run Monitoring for NovaSeq systems. The Occupied Count is a measure of the number of wells on the flow cell with DNA. Adding these new metrics will help users understand their loading concentrations and identify issues with their sequencing run.


 

For Research Use Only. Not for use in diagnostic procedures.