Variant calling assessment using Platinum Genomes, NIST Genome in a Bottle, and VCAT 2.0
With the rapid improvements in sequencing throughput, cost, and ease of use, it’s becoming routine to generate lots of variant calls in the form of VCF files. But how do you know if your new variant calls are accurate? How can a non-bioinformatician compare variant calls from different sequencing platforms, reagent kits, biological samples, or software pipelines? Illumina is now offering a carefully designed and highly curated data set and a corresponding BaseSpace Labs App to address these types of comparison questions.
The Platinum Genomes project was started in 2011 with the goal of creating a high-confidence, “platinum” quality reference variant call set. This was accomplished by sequencing a large family to high depth using a PCR-Free sample prep to maximize variant calling sensitivity. A large set of candidate variants was obtained from multiple methods and technologies. Candidates that were pedigree-consistent were included in the reference call set. Based on this approach, Illumina has derived a set of high-confidence, pedigree-validated reference variant calls for Coriell samples NA12877 and NA12878.
The full set of Platinum Genomes public data and documentation is freely available at http://www.illumina.com/platinumgenomes/. The BaseSpace Platinum Genomes Project also has copies of the platinum VCF files.
Please cite the Platinum Genomes website and Illumina, Inc. in publications and other public usage of the Platinum Genomes data.
In addition, Illumina has upgraded the Variant Calling Assessment Tool (VCAT 2.0) BaseSpace app. The app calculates SNV and indel statistics and optionally determines the overlap between the input variant call sets. The quality of SNV and indel calls can also be assessed against Platinum Genomes and/or NIST Genome in a Bottle (GIAB) reference variant calls. No other tool currently offers a simple user interface for using both of these resources. The accuracy and comparison logic in VCAT is primarily based on vcftools, a commonly used open source toolkit for analyzing variant calls. More insight into how VCAT works is available by browsing the VCAT log file.
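To make "SNV and indel statistics" concrete, here is a minimal Python sketch of how the records in a VCF file could be tallied by variant type. It is an illustration only and not VCAT's or vcftools' actual logic; the function names and the toy VCF lines are hypothetical, and the classification rule (single-base substitution is an SNV, differing allele lengths is an indel) is a simplifying assumption.

```python
from collections import Counter


def classify_variant(ref: str, alt: str) -> str:
    """Label one REF/ALT pair as 'snv', 'indel', or 'other' (simplified rule)."""
    # Skip missing, overlapping-deletion, and symbolic alleles such as <DEL>.
    if alt in (".", "*") or alt.startswith("<"):
        return "other"
    if len(ref) == 1 and len(alt) == 1 and ref != alt:
        return "snv"
    if len(ref) != len(alt):
        return "indel"
    return "other"  # e.g. multi-nucleotide substitutions


def count_variant_types(vcf_lines):
    """Count SNV and indel alleles in an iterable of VCF text lines."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # header or blank line
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]  # REF and ALT columns
        for alt in alts.split(","):  # multi-allelic sites list several ALTs
            counts[classify_variant(ref, alt)] += 1
    return counts


# Toy records, made up for illustration only.
example_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tAT\tA\t50\tPASS\t.",
    "chr1\t300\t.\tC\tCTT,T\t50\tPASS\t.",
]
print(count_variant_types(example_vcf))
```

Each alternate allele is counted separately, so the multi-allelic toy record at position 300 contributes one indel and one SNV.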
The Platinum Genomes project is led by Epameinondas Fritzilas and the VCAT project is led by Robert Schmieder, while many other team members have contributed. Please note that while both Platinum Genomes and VCAT are freely available, Illumina does not offer technical support for either of these resources.
There are many interesting ways to use these powerful new tools together. Here’s an example:
Case study on exome sequencing: How much depth is enough?
Using the “Combine Samples” feature in BaseSpace, Nextera Rapid Capture Exome samples of approximately 50x, 100x, 200x, and 400x depth were created from replicates of Coriell sample NA12878. The source data is here. A BaseSpace Project containing the resulting VCF files and the VCAT 2.0 results is here. The Platinum Genomes v7 recall numbers below suggest that 50x exome depth may find only 80% of the SNVs and 70% of the indels, while exome depths greater than 200x enable finding over 95% of SNVs and over 88% of indels.
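Recall here means the fraction of reference (truth) variants that also appear in the sample's call set. The short Python sketch below shows that arithmetic on sets of variant keys. It is a simplified illustration, not VCAT's implementation: the `recall` function and the toy variants are hypothetical, and a real comparison would also have to handle details such as variant normalization and restricting to the targeted regions, which this sketch ignores.

```python
def recall(reference_calls: set, query_calls: set) -> float:
    """Fraction of reference variants that are also present in the query calls."""
    if not reference_calls:
        raise ValueError("reference call set is empty")
    true_positives = reference_calls & query_calls  # found in both sets
    return len(true_positives) / len(reference_calls)


# Toy variant keys (chrom, pos, ref, alt), made up for illustration only.
reference = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 400, "G", "A"),
    ("chr2", 900, "T", "C"),
    ("chr3", 120, "AT", "A"),
}
low_depth_calls = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 400, "G", "A"),
    ("chr2", 900, "T", "C"),
    ("chr5", 777, "G", "C"),  # extra call not in the reference; does not affect recall
}

print(f"recall = {recall(reference, low_depth_calls):.0%}")  # 4 of 5 reference variants found
```

In this toy example four of the five reference variants are recovered, giving a recall of 80%. Calls that are absent from the reference lower precision, not recall.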
VCAT 2.0 also enables the analysis of samples other than NA12878 via pairwise intersect comparisons. The Venn diagrams and corresponding tables shown below are from a VCAT report from the same example BaseSpace Project. When using this feature, VCAT also creates new VCF files which represent the unique SNV and indel calls, as well as VCF files for the common calls.
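The pairwise intersect idea can be sketched with plain set operations: calls unique to the first sample, calls unique to the second, and calls common to both. This Python example is a hedged illustration rather than VCAT's actual code; the `pairwise_intersect` function, the dictionary keys, and the toy variants are assumptions made for the example.

```python
def pairwise_intersect(calls_a: set, calls_b: set) -> dict:
    """Split two call sets into the three regions of a two-way Venn diagram."""
    return {
        "unique_a": calls_a - calls_b,  # only in sample A
        "common": calls_a & calls_b,    # in both samples
        "unique_b": calls_b - calls_a,  # only in sample B
    }


# Toy variant keys (chrom, pos, ref, alt), made up for illustration only.
higher_depth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 400, "G", "A"),
}
lower_depth = {
    ("chr1", 100, "A", "G"),
    ("chr7", 555, "T", "TA"),
}

regions = pairwise_intersect(higher_depth, lower_depth)
for name, variants in regions.items():
    print(name, len(variants), sorted(variants))
```

Each of the three sets corresponds to one output file in the description above: one set of unique calls per sample plus the common calls.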
The Unique VCF files are also indexed for browsing within the BaseSpace IGV App. Below is a screenshot that shows two SNVs that are found in the 105x exome but missed in the 53x exome due to low coverage depth.
That’s it for now. In an upcoming blog post, we’ll look at Platinum Genomes and NIST GIAB in more detail including some comparisons.