VCAT 2.3 with Haplotype Compare and FASTQ Toolkit 2.0
In January 2015, Illumina introduced the VCAT 2.0 BaseSpace app and the associated Platinum Genomes v7 truth data set. Now we are happy to announce an upgraded Variant Calling Assessment Tool (VCAT 2.3, https://basespace.illumina.com/apps/1800799/Variant-Calling-Assessment-Tool), which has integrated access to the open-source Haplotype Compare tool, in addition to the legacy VCF-Tools-based assessment engine. We also updated the truth data sets with Platinum Genomes v8.
Haplotype Compare, also known as hap.py, uses haplotype and genotype information to assess variant calling accuracy more reliably. The GitHub site for this command-line tool (https://github.com/Illumina/hap.py) provides the source code and other files needed for use outside of BaseSpace, along with documentation on how the tool works and how it can be used. Earlier versions of VCAT used a VCF-Tools position-based assessment engine by default. Since version 2.2, including version 2.3, Haplotype Compare is the default engine for gold-reference comparison; choosing VCF-Tools instead runs both a genotype-based and a position-based analysis using the VCF-Tools-based assessment engine.
To illustrate the advantage of haplotype-corrected comparisons, we look at the different variant representations for equivalent variants and their evaluation by Haplotype Compare.
Consider the following input VCF record:
chr1 1524531 CGAGACTGTTTTTA CGAGACTGTT,TGAGACTGTT 1|2
The corresponding variant calls in the Platinum Genomes gold-reference VCF file are these:
chr1 1524531 C T 0|1
chr1 1524540 TTTTA T 1|1
These variants produce equivalent alternate haplotype sequences even though their VCF representations differ:
Alternate Haplotype 1: CGAGACTGTT
Alternate Haplotype 2: TGAGACTGTT
Using the VCFtools-based comparison, these variants will not be matched and will produce spurious false positives and false negatives. Using haplotype comparison, we can correctly match up records in this type of location, resulting in more accurate indel evaluations.
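The equivalence in this example can be checked by hand with a few lines of Python. The sketch below is only an illustration of the idea, not how hap.py itself is implemented: it assumes fully phased genotypes, non-overlapping records on each haplotype, and a reference window equal to the REF allele of the input record. The function name `build_haplotypes` and the tuple layout are made up for this example.

```python
from typing import List, Tuple

# One variant: (1-based position, REF allele, list of ALT alleles, phased genotype like "1|2").
# This tuple layout is an assumption for the example, not a hap.py data structure.
Variant = Tuple[int, str, List[str], str]


def build_haplotypes(ref_seq: str, ref_start: int, variants: List[Variant]) -> Tuple[str, str]:
    """Apply phased variants to a reference window and return both haplotype sequences."""
    haplotypes = []
    for hap_index in (0, 1):
        pieces = []
        cursor = ref_start  # next reference position that has not been copied yet
        for pos, ref, alts, gt in sorted(variants, key=lambda v: v[0]):
            allele = int(gt.split("|")[hap_index])
            if allele == 0:
                continue  # this haplotype carries the reference allele here
            if pos < cursor:
                raise ValueError("overlapping variants on one haplotype are not handled")
            offset = pos - ref_start
            if ref_seq[offset:offset + len(ref)] != ref:
                raise ValueError(f"REF allele does not match the reference at {pos}")
            pieces.append(ref_seq[cursor - ref_start:offset])  # unchanged reference bases
            pieces.append(alts[allele - 1])                    # selected ALT allele
            cursor = pos + len(ref)
        pieces.append(ref_seq[cursor - ref_start:])            # remaining reference bases
        haplotypes.append("".join(pieces))
    return haplotypes[0], haplotypes[1]


# Reference window taken from the REF allele of the input record above.
REF_WINDOW = "CGAGACTGTTTTTA"
REF_START = 1524531

query = [(1524531, "CGAGACTGTTTTTA", ["CGAGACTGTT", "TGAGACTGTT"], "1|2")]
truth = [
    (1524531, "C", ["T"], "0|1"),
    (1524540, "TTTTA", ["T"], "1|1"),
]

query_haps = build_haplotypes(REF_WINDOW, REF_START, query)
truth_haps = build_haplotypes(REF_WINDOW, REF_START, truth)
print(query_haps == truth_haps)  # the two representations spell out the same haplotypes
```

A position-based comparison sees one record at 1524531 in the query and two records with different alleles in the truth set, so nothing lines up. Comparing the reconstructed sequences sidesteps the representation problem entirely.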
Two other recently added features are Precision vs Recall plots and the percentage of non-assessed SNVs and indels. A screenshot of a Precision vs Recall plot is shown below. The non-assessed metric shows what portion of the input variant calls is not being assessed for accuracy. Comparing results obtained with NIST GiaB 2.18 & 0.22 against those obtained with Platinum Genomes 7 and 8 shows clear differences between these sets of reference calls: using PG 7 or 8 leaves far fewer variant calls in the non-assessed categories.
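For readers who want to reproduce these metrics from their own counts, here is a minimal sketch using the standard definitions of precision and recall. The function names and the counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not VCAT output, and VCAT's exact counting rules may differ.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple:
    """Standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    called = true_pos + false_pos
    expected = true_pos + false_neg
    precision = true_pos / called if called else 0.0
    recall = true_pos / expected if expected else 0.0
    return precision, recall


def percent_non_assessed(non_assessed: int, total_calls: int) -> float:
    """Share of input variant calls that were not assessed for accuracy, in percent."""
    return 100.0 * non_assessed / total_calls if total_calls else 0.0


# Hypothetical counts, for illustration only.
precision, recall = precision_recall(true_pos=90, false_pos=10, false_neg=30)
non_assessed_pct = percent_non_assessed(non_assessed=50, total_calls=200)
print(precision, recall, non_assessed_pct)
```

A high non-assessed percentage means the precision and recall figures describe only a small part of the call set, which is why a truth set that covers more of the input calls gives a more representative assessment.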
We are also pleased to announce that FASTQ Toolkit v2.0 has been released. This app allows manipulation of FASTQ files, including adapter trimming, quality trimming, length filtering, format conversions and down-sampling. The updated version contains several improvements that include the following:
- Updated input form to provide more information on settings and improve usability by highlighting the default read length filtering
- New options to fix header lines, quality encoding and filenames in uploaded or imported FASTQ files
- New option to split paired-end samples into two single-end samples
- Support for Casava-generated FASTQ files with multi-digit control bits in the header
- Improved behavior for trimming multiple adapters
- Improved runtime by removing the default check for read names
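To make the header-related items more concrete, the sketch below parses a read header in the commonly documented Casava 1.8 layout, where the comment part after the space has the form `read:filtered:control:index`. This is not FASTQ Toolkit's own code; the regular expression, the function name `parse_casava_header` and the sample header are assumptions for illustration. The point is simply that the control field is matched as one or more digits rather than a single digit.

```python
import re
from typing import Dict, Optional

# Assumed Casava 1.8 layout:
# @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
CASAVA_HEADER = re.compile(
    r"^@(?P<instrument>[^:\s]+):(?P<run>\d+):(?P<flowcell>[^:\s]+):"
    r"(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"\s+(?P<read>\d+):(?P<filtered>[YN]):(?P<control>\d+):(?P<index>\S*)$"
)


def parse_casava_header(line: str) -> Optional[Dict[str, str]]:
    """Return the header fields as strings, or None if the line does not match."""
    match = CASAVA_HEADER.match(line.strip())
    return match.groupdict() if match else None


# Sample header with a two-digit control field (illustrative, not from a real run).
fields = parse_casava_header("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(fields["control"] if fields else "no match")
```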
We hope you find VCAT 2.3 and FASTQ Toolkit 2.0 useful. Questions about VCAT and FASTQ Toolkit can be directed to firstname.lastname@example.org, while questions on Haplotype Compare can be directed to Peter Krusche at email@example.com.