VCAT v4.0 – Support for DRAGEN and GA4GH best practices for small variant benchmarking

Author: Eric Allen, Associate Director of Bioinformatics at Illumina

The Variant Calling Assessment Tool (VCAT) v4.0 BaseSpace™ App has an updated hap.py and several new gold references and panel BED files.

This update includes a recent version of hap.py which (in combination with vcfeval) has been selected by the GA4GH as the recommended tool for small variant call benchmarking. For more details, see the publication at: https://www.ncbi.nlm.nih.gov/pubmed/30858580
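As a rough illustration of what small variant benchmarking reports, the sketch below computes recall and precision from match counts. The class and function names are hypothetical and the counts are made up; hap.py's own accounting is more detailed (for example, it reconciles differing variant representations), so treat this as a simplified model rather than the tool's implementation.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkCounts:
    """Hypothetical container for the counts a truth-vs-query comparison yields."""
    true_positives: int   # query calls that match the gold reference
    false_positives: int  # query calls not found in the gold reference
    false_negatives: int  # gold reference variants the query missed


def recall(c: BenchmarkCounts) -> float:
    """Fraction of gold reference variants that were called."""
    total = c.true_positives + c.false_negatives
    return c.true_positives / total if total else 0.0


def precision(c: BenchmarkCounts) -> float:
    """Fraction of query calls that are confirmed by the gold reference."""
    total = c.true_positives + c.false_positives
    return c.true_positives / total if total else 0.0


# Illustrative counts only (not taken from any real run).
example = BenchmarkCounts(true_positives=950, false_positives=5, false_negatives=50)
print(f"recall={recall(example):.2%} precision={precision(example):.2%}")
```

Both functions return 0.0 when there is nothing to count, which avoids a division by zero on empty inputs.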

Updated and improved tools

Gold Reference Additions and Updates

  • Updated Platinum Genomes reference data to version 2017-1.0
  • Added all the NIST GiaB v3.3.2 gold references:
    • Caucasian female: NA12878 (mother)
    • Ashkenazi Jewish trio: NA24143 (mother), NA24149 (father), NA24385 (son)
    • Han Chinese trio: NA24695 (mother), NA24694 (father), NA24631 (son)

More built-in panel BED files

  • Illumina, IDT, and Twist exome panels
  • Several AmpliSeq panels

The diagram below shows how VCAT, hap.py, and vcfeval work together.

See the screenshots below displaying some of the new features.

We hope you find VCAT 4.0 useful. Questions can be directed to basespacelabs@illumina.com.

For Research Use Only.  Not for use in diagnostic procedures.

QB#9047

MiXCR Immune Repertoire Analyzer version 2.1.11 from MiLaboratories is now available at Illumina BaseSpace™

  • Youting Sun, Senior Bioinformatics Scientist at Illumina
  • Dmitriy Chudakov, CSO at MiLaboratories https://milaboratory.com

The upgraded version of the MiLaboratories LLC flagship software product, MiXCR, is now officially available as an Illumina BaseSpace™ application.

MiXCR is a “gold standard” analytical package for T-cell receptor (TCR) and immunoglobulin (IG) repertoire profiling. Analysts use MiXCR to extract immune repertoires from any type of sequencing data with any level of TCR/IG coverage, ranging from perfectly enriched libraries such as multiplex PCR or targeted 5’RACE to “rare event” datasets containing several target sequences among hundreds of millions of reads, such as RNA-Seq and even Exome-Seq data.

In the new version, simple selection of species (Human or Mouse), template material (RNA or DNA), library type (targeted or random), and basic information on library preparation enables appropriate analysis settings for a variety of immune repertoire experiment scenarios:

MiXCR Immune Repertoire Analyzer v2.1.11 Input Form Parameters.

Either full-length VDJ or CDR3-only repertoires can be extracted for the TCR or IG chains of interest, with or without out-of-frame and stop codon-containing clonotype variants:

MiXCR Immune Repertoire Analyzer v2.1.11 Analysis Settings.

The application also provides post-analysis metrics in the form of interactive reports, including:

  • Basic statistics
  • Spectratype with major clonotypes
  • Quantile statistics on clonotype frequencies
  • Clonotypes with colorized V, D, and J segments

High extraction efficiency for any type of sequencing data and superb accuracy should make the new MiXCR BaseSpace version a highly useful resource for many researchers. The ability to review basic parameters and immediately download the resulting figures for reports and publications makes it convenient for everyday work on immune repertoires.

Analysis of amplicon data

Recommended settings for panels (a) Immune repertoire panel and (b) TCR-beta SR panel:

  • Starting material: RNA for panel (a), RNA or Genomic DNA for panel (b)
  • Library type: Targeted TCR/IG library amplification (5’RACE, Amplicon, Multiplex, etc)
  • 5’-end of the library: V gene single primer/multiplex
  • 3’-end of the library: J gene single primer/multiplex
  • Presence of PCR primers and/or adapter sequence: Absent / nearly absent / trimmed
  • Target region: CDR3

For Research Use Only.  Not for use in diagnostic procedures.

QB#9017

DRAGEN™ Enrichment App – Accurate, rapid analysis for germline and somatic exome experiments

Author: Eric Allen, Associate Director of Bioinformatics at Illumina

As part of the new DRAGEN v3.4 launch, the Illumina software development team has released a new BaseSpace-exclusive DRAGEN app: DRAGEN Enrichment v3.4. Combining the best of DRAGEN with Illumina’s legacy Enrichment 3 App, the DRAGEN Enrichment app provides ultra-rapid analysis and improved accuracy, all at a lower cost per sample.

The DRAGEN Enrichment app is the preferred method for analyzing enrichment data with DRAGEN, delivering a full suite of enrichment-specific metrics and reporting.

Here is what to know:

  • The DRAGEN Enrichment App is faster and more accurate than the Enrichment (Isaac/Starling) and BWA Enrichment (BWA/GATK) apps, as shown in the comparison below
  • Variant Calling:
    • Small variant calling – The app includes germline and somatic (low-frequency, tumor-only) small variant calling; outputs VCF and gVCF in the same analysis
      • Note: Tumor-normal analysis can be conducted by first running the DRAGEN Enrichment app on all normal and tumor samples, and then running the DRAGEN Somatic app on the resulting BAM files for the tumor/normal pairs.
    • Copy number variant (CNV) calling – uses CNV baseline files based on a panel of normals
    • Structural variant calling
  • Enrichment metrics generated:
    • Read/base enrichment padded/unpadded
    • Uniformity
    • % bases covered at 1x, 10x, 20x, 50x
    • Picard HsMetrics enabled by checkbox
  • Variety of reference options supported, including hg19, GRCh38 and custom references
  • Includes built-in targeted region BED files for common enrichment panels, and accepts custom targeted region BEDs
  • Extensive reporting:
    • In-browser, PDFs, and CSVs
    • Single sample and aggregate reports
  • Integrated variant annotation (Nirvana) and variant browser
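To make the "% bases covered at 1x, 10x, 20x, 50x" metric in the list above concrete, here is a minimal sketch of how such a figure can be derived from per-base depths over the targeted regions. The function name and the depth values are hypothetical, and this is not a description of how the DRAGEN Enrichment app computes its metrics internally.

```python
def percent_bases_covered(depths, thresholds=(1, 10, 20, 50)):
    """Return {threshold: percent of target bases with depth >= threshold}."""
    n = len(depths)
    if n == 0:
        # No target bases: report 0% at every threshold rather than dividing by zero.
        return {t: 0.0 for t in thresholds}
    return {t: 100.0 * sum(d >= t for d in depths) / n for t in thresholds}


# Made-up per-base depths for ten target positions (illustration only).
depths = [0, 5, 12, 25, 60, 75, 30, 18, 9, 55]
coverage = percent_bases_covered(depths)
for threshold, pct in coverage.items():
    print(f">= {threshold}x: {pct:.1f}% of target bases")
```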

The improved small variant calling over other available BaseSpace app solutions is shown below for one replicate of Coriell sample NA12878 with 106x depth:

Analysis App                        App Execution Time   DRAGEN-only Execution Time   SNV Recall   SNV Precision   Indel Recall   Indel Precision
DRAGEN Enrichment v3.4.5            16m 4s               6m 50s                       95.04%       99.49%          86.90%         92.18%
(Isaac/Starling) Enrichment v3.1.0  53m 20s              NA                           93.26%       99.38%          78.29%         86.90%
BWA Enrichment v2.1.2               1h 23m 2s            NA                           90.66%       99.78%          72.85%         89.44%


• Example sample (s01-NFE-CEX-NA12878-demo.vcf) was prepared using Nextera Flex for Enrichment Library Preparation kit with dual indices and sequenced on a NovaSeq™ S2 flow cell: https://basespace.illumina.com/s/FaxWSm2X1gwO
• Variant accuracy comparison was performed using the Variant Calling Assessment Tool v3.2.0 app.
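The comparison above can be condensed into two derived numbers per app. The sketch below is an added illustration: it uses the execution times and recall/precision values as read from the table, computes the relative run time against DRAGEN Enrichment, and combines recall and precision into an F1 score. F1 is not a metric reported in this post; it is introduced here only as a convenient single-number summary, and the dictionary layout and function names are hypothetical.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an h/m/s execution time into seconds."""
    return hours * 3600 + minutes * 60 + seconds


def f1(recall_pct, precision_pct):
    """Harmonic mean of recall and precision, both given and returned in percent."""
    r, p = recall_pct / 100, precision_pct / 100
    return 100 * 2 * r * p / (r + p)


# Values as read from the comparison table: (recall %, precision %).
runs = {
    "DRAGEN Enrichment": {
        "time": to_seconds(minutes=16, seconds=4),
        "snv": (95.04, 99.49),
        "indel": (86.90, 92.18),
    },
    "(Isaac/Starling) Enrichment": {
        "time": to_seconds(minutes=53, seconds=20),
        "snv": (93.26, 99.38),
        "indel": (78.29, 86.90),
    },
    "BWA Enrichment": {
        "time": to_seconds(hours=1, minutes=23, seconds=2),
        "snv": (90.66, 99.78),
        "indel": (72.85, 89.44),
    },
}

baseline = runs["DRAGEN Enrichment"]["time"]
for name, run in runs.items():
    print(
        f"{name}: {run['time'] / baseline:.1f}x DRAGEN run time, "
        f"SNV F1 {f1(*run['snv']):.2f}%, indel F1 {f1(*run['indel']):.2f}%"
    )
```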

CNV calling is also enabled in the DRAGEN Enrichment app. The screenshot below from IGV shows a 937,697 bp CNV loss found in a melanoma cancer sample (Me01/ERR174231) around the chromosomal region chr9:125239269-126176965. The sample data was obtained from NCBI’s Sequence Read Archive (accession ERR174231) using the SRA Import BaseSpace App.
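The reported CNV size follows directly from the region coordinates. A minimal sketch, assuming the region string uses 1-based, inclusive start and end positions (the helper name is hypothetical):

```python
def region_length(region: str) -> int:
    """Length in bp of a 'chrom:start-end' region with inclusive coordinates."""
    _chrom, span = region.split(":")
    start, end = (int(value.replace(",", "")) for value in span.split("-"))
    return end - start + 1  # +1 because both endpoints are counted


cnv_length = region_length("chr9:125239269-126176965")
print(f"{cnv_length:,} bp")
```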

Project: SRA: ERP001844 (Agilent SureSelect – Exome CNV Detection – Melanoma). Publication: Magi et al.

Somatic/low-frequency variant calling is also enabled. The table below demonstrates the usefulness of this somatic calling tool:

Variant Type     Chr      Pos         Gene     Variant            HD753 – Expected VF (%)   HD753 – Measured VF (DRAGEN Enrichment) (%)
SNV Low GC       chr.3    178936091   PIK3CA   E545K              5.6                       3.8
SNV High GC      chr.19   3118942     GNA11    Q209L              5.6                       6
Long Deletion    chr.7    55242464    EGFR     ΔE746–A750         5.3                       3.3
Long Insertion   chr.7    55248998    EGFR     V769_D770insASV    5.6                       3.7
SNV High GC      chr.14   105246551   AKT1     E17K               5                         5.7

Project: NovaSeq S4: Nextera Flex for Enrichment (HCC1187, HCC1395, HCC1954, HD753, Coriell Mixture). 1% VF cutoff
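One way to read the table above is to compare each measured variant frequency (VF) with its expected value and with the 1% VF cutoff. The sketch below does that with the values as read from the table; the function and variable names are hypothetical, and the rounding to one decimal place is only for display.

```python
VF_CUTOFF = 1.0  # percent, the cutoff stated for this project

# (variant, expected VF %, measured VF %) as read from the table.
variants = [
    ("PIK3CA E545K", 5.6, 3.8),
    ("GNA11 Q209L", 5.6, 6.0),
    ("EGFR ΔE746–A750", 5.3, 3.3),
    ("EGFR V769_D770insASV", 5.6, 3.7),
    ("AKT1 E17K", 5.0, 5.7),
]


def summarize(rows, cutoff):
    """Return (variant, measured - expected, measured >= cutoff) for each row."""
    return [
        (name, round(measured - expected, 1), measured >= cutoff)
        for name, expected, measured in rows
    ]


summary = summarize(variants, VF_CUTOFF)
for name, delta, detected in summary:
    status = "above cutoff" if detected else "below cutoff"
    print(f"{name}: {delta:+.1f} points vs expected, {status}")
```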

We’ve also incorporated many of the comprehensive metrics and reporting features built into the legacy Enrichment 3.1.0 app, including read-, base-, and target-level enrichment metrics, as well as the variant table for simple variant call browsing and filtering.

We hope this update enables you to discover new insights. Stay tuned for more app announcements, and let us know if you have any questions.

FOR RESEARCH USE ONLY. NOT FOR USE IN DIAGNOSTIC PROCEDURES.
QB#8389

VCAT v3.0 and FASTQ Toolkit v2.2 – Usability Improvements and Gold Reference Updates

The Variant Calling Assessment Tool (VCAT) v3.0 BaseSpace® App offers improved usability, several new gold references, BED file stratification, and an enhanced in-browser report.

Improved VCF file selection

  • File chooser now allows multi-selection of VCFs
  • Easier usage of alternative labels for VCF files
  • File chooser now defaults to the current Project
  • Improved VCF file tooltips now show analysis name

Gold Reference Additions and Updates

  • Added Platinum Genomes reference data for NA12877 (male spouse of NA12878)
  • Added NIST GiaB gold reference data for NA24385 (Ashkenazi male)
  • Added gold reference data and novelty rate calculations for hg38
  • Updated NIST Genome in a Bottle reference data to version 3.2.2
  • Updated Platinum Genomes reference data to version 2016-01

VCAT 2.3 with Haplotype Compare and FASTQ Toolkit 2.0

In January 2015, Illumina introduced the VCAT 2.0 BaseSpace app and the associated Platinum Genomes v7 truth data set. Now we are happy to announce an upgraded Variant Calling Assessment Tool (VCAT 2.3, https://basespace.illumina.com/apps/1800799/Variant-Calling-Assessment-Tool), which has integrated access to the open source Haplotype Compare tool, in addition to the legacy VCF-Tools based assessment engine. We also updated the truth data sets with Platinum Genomes v8.

Differential Methylation Analysis with the MethylKit BaseSpace Labs App

In May 2015, Illumina introduced the MethylSeq 1.0 BaseSpace app for performing analysis on bisulfite sequencing data. Now we are happy to announce the release of the MethylKit BaseSpace Labs app (https://basespace.illumina.com/apps/1550550/MethylKit), which focuses on differential methylation analysis between two groups of bisulfite sequencing samples. This BaseSpace Labs app is based on the MethylKit R package, published in 2012 in Genome Biology (http://www.genomebiology.com/2012/13/10/r87). The MethylKit app includes these features:

  • Coverage Stats Plot for each sample
  • Methylation Stats Plot for each sample
  • Methylation Correlation Plot
  • Differential Methylation Summary Table (Per Chromosome)
  • Differential Methylation Regions (in csv file and bigwig file)
  • Methylation Stats Summary
  • Methylation Stats Percentile Information

Variant calling assessment using Platinum Genomes, NIST Genome in a Bottle, and VCAT 2.0

With the rapid improvements in sequencing throughput, cost, and ease of use, it’s becoming routine to generate lots of variant calls in the form of VCF files. But how do you know if your new variant calls are accurate? How can a non-bioinformatician compare variant calls from different sequencing platforms, reagent kits, biological samples, or software pipelines? Illumina is now offering a carefully designed and highly curated data set and a corresponding BaseSpace Labs App to address these types of comparison questions.

The Platinum Genomes project was started in 2011 with the goal of creating a high confidence, “platinum” quality reference variant call set. This was accomplished by sequencing a large family to high depth using a PCR-Free sample prep to maximize variant calling sensitivity. A large set of candidate variants was obtained from multiple methods and technologies. Candidates that were pedigree consistent were included in the reference call set. Based on this approach, Illumina has derived a set of high-confidence, pedigree-validated reference variant calls for Coriell samples NA12877 and NA12878.

The full set of Platinum Genomes public data and documentation are freely available at http://www.illumina.com/platinumgenomes/ . The BaseSpace Platinum Genomes Project also has copies of the platinum VCF files.

Please cite the Platinum Genomes website and Illumina, Inc. in publications and other public usage of the Platinum Genomes data.

In addition, Illumina has upgraded the Variant Calling Assessment Tool (VCAT 2.0) BaseSpace app. The app calculates SNV and indel statistics and optionally determines the overlap between the input variant call sets. Additionally, the quality of SNV and Indel calls can be assessed based on Platinum Genomes and/or NIST Genome in a Bottle (GIAB) reference variant calls. No existing tool currently offers a simple user interface for using both of these resources. The accuracy and comparison logic in VCAT is primarily based on vcftools, a commonly used open source toolkit for analyzing variant calls. More insight into how VCAT works is available by browsing the VCAT log file.

The Platinum Genomes project is led by Epameinondas Fritzilas and the VCAT project is led by Robert Schmieder, while many other team members have contributed. Please note that while both Platinum Genomes and VCAT are freely available, Illumina does not offer technical support for either of these resources.

There are many interesting ways to use these powerful new tools together. Here’s an example:

Case study on exome sequencing: How much depth is enough?

Using the “Combine Samples” feature in BaseSpace, Nextera Rapid Capture Exome samples of approximately 50x, 100x, 200x, and 400x were created from replicates of Coriell sample NA12878. The source data is here. A BaseSpace Project containing the resulting VCF files and the VCAT 2.0 results is here. The Platinum Genomes v7 recall numbers below suggest that 50x exome depth may only find 80% of the SNVs and 70% of the indels, while exome depths greater than 200x enable finding over 95% of SNVs and over 88% of indels.

VCAT 2.0 also enables the analysis of samples other than NA12878 via pairwise intersect comparisons. The Venn diagrams and corresponding tables shown below are from a VCAT report from the same example BaseSpace Project. When using this feature, VCAT also creates new VCF files which represent the unique SNV and indel calls, as well as VCF files for the common calls.
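The pairwise intersect described above can be pictured as set operations on variant keys. The sketch below is a simplified illustration and not VCAT's implementation (which the post says is primarily based on vcftools): it assumes each call is identified by a (chromosome, position, ref, alt) tuple, ignores genotype and representation differences, and uses made-up records and hypothetical names.

```python
def pairwise_intersect(calls_a, calls_b):
    """Split two call sets into calls unique to A, common to both, and unique to B."""
    a_keys, b_keys = set(calls_a), set(calls_b)
    return {
        "unique_a": a_keys - b_keys,
        "common": a_keys & b_keys,
        "unique_b": b_keys - a_keys,
    }


# Made-up (chrom, pos, ref, alt) records for two hypothetical VCFs.
deep_exome = [("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"), ("chr2", 500, "G", "GA")]
shallow_exome = [("chr1", 1000, "A", "G"), ("chr3", 42, "T", "C")]

result = pairwise_intersect(deep_exome, shallow_exome)
for label, keys in result.items():
    print(label, sorted(keys))
```

Each of the three resulting sets corresponds to one region of a two-set Venn diagram.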

The Unique VCF files are also indexed for browsing within the BaseSpace IGV App.  Below is a screenshot which shows two SNVs that are found in the 105x exome, but are missed in the 53x exome due to low coverage depth.

That’s it for now. In an upcoming blog post, we’ll look at Platinum Genomes and NIST GIAB in more detail including some comparisons.