The Variant Calling Assessment Tool (VCAT) v3.0 BaseSpace® App improves usability, adds several new gold references and BED file stratification, and delivers an enhanced in-browser report.
Improved VCF file selection
- File chooser now allows multi-selection of VCFs
- Easier usage of alternative labels for VCF files
- File chooser now defaults to the current Project
- Improved VCF file tooltips now show analysis name
Gold Reference Additions and Updates
- Added Platinum Genomes reference data for NA12877 (male spouse of NA12878)
- Added NIST GIAB gold reference data for NA24385 (Ashkenazi male)
- Added gold reference data and novelty rate calculations for hg38
- Updated NIST Genome in a Bottle reference data to version 3.2.2
- Updated Platinum Genomes reference data to version 2016-01
In January 2015, Illumina introduced the VCAT 2.0 BaseSpace app and the associated Platinum Genomes v7 truth data set. Now we are happy to announce an upgraded Variant Calling Assessment Tool (VCAT 2.3, https://basespace.illumina.com/apps/1800799/Variant-Calling-Assessment-Tool), which has integrated access to the open source Haplotype Compare tool, in addition to the legacy VCF-Tools based assessment engine. We also updated the truth data sets with Platinum Genomes v8.
In May 2015, Illumina introduced the MethylSeq 1.0 BaseSpace app for analyzing bisulfite sequencing data. Now we are happy to announce the release of the MethylKit BaseSpace Labs app (https://basespace.illumina.com/apps/1550550/MethylKit), which focuses on differential methylation analysis between two groups of bisulfite sequencing samples. This BaseSpace Labs app is based on the MethylKit R package, published in 2012 in Genome Biology (http://www.genomebiology.com/2012/13/10/r87). The MethylKit app includes these features:
- Coverage Stats Plot for each sample
- Methylation Stats Plot for each sample
- Methylation Correlation Plot
- Differential Methylation Summary Table (Per Chromosome)
- Differential Methylation Regions (in csv file and bigwig file)
- Methylation Stats Summary
- Methylation Stats Percentile Information
With the rapid improvements in sequencing throughput, cost, and ease of use, it’s becoming routine to generate lots of variant calls in the form of VCF files. But how do you know if your new variant calls are accurate? How can a non-bioinformatician compare variant calls from different sequencing platforms, reagent kits, biological samples, or software pipelines? Illumina is now offering a carefully designed and highly curated data set and a corresponding BaseSpace Labs App to address these types of comparison questions.
The Platinum Genomes project was started in 2011 with the goal of creating a high confidence, “platinum” quality reference variant call set. This was accomplished by sequencing a large family to high depth using a PCR-Free sample prep to maximize variant calling sensitivity. A large set of candidate variants was obtained from multiple methods and technologies. Candidates that were pedigree consistent were included in the reference call set. Based on this approach, Illumina has derived a set of high-confidence, pedigree-validated reference variant calls for Coriell samples NA12877 and NA12878.
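To make the idea of "pedigree consistent" concrete, here is a minimal sketch of a Mendelian consistency check for a single site in a mother/father/child trio. It is a simplification for illustration only: it assumes diploid, autosomal genotypes written as pairs of allele indices (0 for the reference allele), and the function name `is_mendelian_consistent` is hypothetical. It is not the procedure the Platinum Genomes project used, which worked with a larger family.

```python
from itertools import product
from typing import Tuple

# A diploid genotype as two allele indices; 0 is the reference allele.
Genotype = Tuple[int, int]


def is_mendelian_consistent(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """Return True if the child could have inherited one allele from each parent.

    Illustrative assumption: autosomal, diploid site with no de novo mutation.
    """
    child_alleles = sorted(child)
    # Try every combination of one maternal and one paternal allele.
    for maternal, paternal in product(mother, father):
        if sorted((maternal, paternal)) == child_alleles:
            return True
    return False


if __name__ == "__main__":
    # Heterozygous child of a homozygous-reference mother and heterozygous father.
    print(is_mendelian_consistent((0, 1), (0, 0), (0, 1)))
    # Homozygous-alternate child cannot come from a homozygous-reference mother.
    print(is_mendelian_consistent((1, 1), (0, 0), (1, 1)))
```

A candidate call that fails a check like this in a family is more likely to be a genotyping error than a real variant, which is why inheritance patterns are a useful filter when building a reference call set.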
The full set of Platinum Genomes public data and documentation is freely available at http://www.illumina.com/platinumgenomes/. The BaseSpace Platinum Genomes Project also has copies of the platinum VCF files.
Please cite the Platinum Genomes website and Illumina, Inc. in publications and other public usage of the Platinum Genomes data.
In addition, Illumina has upgraded the Variant Calling Assessment Tool (VCAT 2.0) BaseSpace app. The app calculates SNV and indel statistics and optionally determines the overlap between the input variant call sets. Additionally, the quality of SNV and indel calls can be assessed against Platinum Genomes and/or NIST Genome in a Bottle (GIAB) reference variant calls. No other tool currently offers a simple user interface for using both of these resources. The accuracy and comparison logic in VCAT is primarily based on vcftools, a commonly used open source toolkit for analyzing variant calls. More insight into how VCAT works is available by browsing the VCAT log file.
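As a rough illustration of what assessing calls against a reference call set means, the sketch below computes recall and precision separately for SNVs and indels. It is not VCAT's implementation: the `Variant` type and the `accuracy` and `accuracy_by_type` functions are hypothetical names, variants are matched only by exact chromosome, position, and alleles, and the toy data is made up. A real assessment would also restrict the comparison to the reference set's confident regions and reconcile different representations of the same indel.

```python
from typing import Dict, NamedTuple, Set


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def is_snv(v: Variant) -> bool:
    """Single-base substitution; everything else is treated as an indel here."""
    return len(v.ref) == 1 and len(v.alt) == 1


def accuracy(calls: Set[Variant], truth: Set[Variant]) -> Dict[str, float]:
    """Recall and precision of `calls` against a reference (`truth`) call set."""
    tp = len(calls & truth)   # called and in the reference set
    fp = len(calls - truth)   # called but not in the reference set
    fn = len(truth - calls)   # in the reference set but missed
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


def accuracy_by_type(calls: Set[Variant], truth: Set[Variant]) -> Dict[str, Dict[str, float]]:
    """Report SNV and indel accuracy separately."""
    return {
        "snv": accuracy({v for v in calls if is_snv(v)}, {v for v in truth if is_snv(v)}),
        "indel": accuracy({v for v in calls if not is_snv(v)}, {v for v in truth if not is_snv(v)}),
    }


# Toy data for illustration only.
truth = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 200, "C", "T"),
    Variant("chr1", 300, "AT", "A"),
    Variant("chr2", 50, "G", "GA"),
}
calls = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 300, "AT", "A"),
    Variant("chr1", 400, "T", "C"),
}
result = accuracy_by_type(calls, truth)

if __name__ == "__main__":
    for kind, stats in result.items():
        print(kind, stats)
```

Recall answers "how many of the reference variants did my pipeline find?", while precision answers "how many of my calls are in the reference set?". Reporting both per variant type matters because indels are usually harder to call than SNVs.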
The Platinum Genomes project is led by Epameinondas Fritzilas and the VCAT project is led by Robert Schmieder, while many other team members have contributed. Please note that while both Platinum Genomes and VCAT are freely available, Illumina does not offer technical support for either of these resources.
There are many interesting ways to use these powerful new tools together. Here’s an example:
Case study on exome sequencing: How much depth is enough?
Using the “Combine Samples” feature in BaseSpace, Nextera Rapid Capture Exome samples of approximately 50x, 100x, 200x, and 400x were created from replicates of Coriell sample NA12878. The source data is here. A BaseSpace Project containing the resulting VCF files and the VCAT 2.0 results is here. The Platinum Genomes v7 recall numbers below suggest that 50x exome depth may only find 80% of the SNVs and 70% of the indels, while exome depths greater than 200x enable finding over 95% of SNVs and over 88% of indels.
VCAT 2.0 also enables the analysis of samples other than NA12878 via pairwise intersect comparisons. The Venn diagrams and corresponding tables shown below are from a VCAT report from the same example BaseSpace Project. When using this feature, VCAT also creates new VCF files which represent the unique SNV and indel calls, as well as VCF files for the common calls.
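The pairwise intersect described above amounts to splitting two call sets into three groups: calls unique to the first file, calls common to both, and calls unique to the second. The following is a minimal sketch of that idea, not VCAT's actual code. The helper names `variant_keys` and `pairwise_intersect` are hypothetical, the VCF lines are toy data, and variants are compared by exact chromosome, position, and alleles without any normalization.

```python
from typing import Dict, Iterable, Set, Tuple

# (chromosome, position, reference allele, alternate allele)
VariantKey = Tuple[str, int, str, str]


def variant_keys(vcf_lines: Iterable[str]) -> Set[VariantKey]:
    """Collect one key per alternate allele from the data lines of a VCF."""
    keys: Set[VariantKey] = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # multi-allelic sites list ALTs comma-separated
            keys.add((chrom, int(pos), ref, allele))
    return keys


def pairwise_intersect(a: Set[VariantKey], b: Set[VariantKey]) -> Dict[str, Set[VariantKey]]:
    """Split two call sets into the three regions of a two-set Venn diagram."""
    return {"unique_a": a - b, "common": a & b, "unique_b": b - a}


# Toy VCF content standing in for a lower-depth and a higher-depth sample.
low_depth = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t300\t.\tAT\tA\t50\tPASS\t.",
]
high_depth = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t60\tPASS\t.",
    "chr1\t200\t.\tC\tT,G\t60\tPASS\t.",
    "chr1\t300\t.\tAT\tA\t60\tPASS\t.",
]
regions = pairwise_intersect(variant_keys(low_depth), variant_keys(high_depth))

if __name__ == "__main__":
    for name, keys in regions.items():
        print(name, sorted(keys))
```

Because this comparison needs no reference call set, it works for any pair of samples, which is what makes it useful for samples other than NA12878.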
The Unique VCF files are also indexed for browsing within the BaseSpace IGV App. Below is a screenshot which shows two SNVs that are found in the 105x exome, but are missed in the 53x exome due to low coverage depth.
That’s it for now. In an upcoming blog post, we’ll look at Platinum Genomes and NIST GIAB in more detail including some comparisons.
We are happy to introduce two new Nextera Rapid Capture Exome data sets in BaseSpace:
- 12 exome samples sequenced (on 1 flow cell) on HiSeq 2500®
- 1 exome sample sequenced on MiSeq
These exome data sets demonstrate the accuracy of the HiSeq 2500 & MiSeq sequencing platforms, the improved enrichment metrics from using the new targeted region manifest v1.2, and the power and ease of use of the BaseSpace BWA Enrichment App.
Be sure to take a look and compare the difference between the data sets analyzed with manifest v1.1 and v1.2. Both manifest versions are available for use in BaseSpace now. The v1.2 manifest files will be available for download from the Illumina support web site in the near future and a URL will be provided in an update to this blog post.
Click on the links below to see the project and run folders. You will be asked to “Accept” the Run/Project into your BaseSpace account: this is the same mechanism you would use to share BaseSpace projects or runs with your colleagues/collaborators via a dedicated URL.
- HiSeq 2500: Nextera Rapid Capture Exome (12plex, CEPH Trio replicates): Project (Sample data & analysis results), Run (QC plots & run summaries).
- MiSeq v3: Nextera Rapid Capture Exome (NA12878): Project (Sample data & analysis results), Run (QC plots & run summaries).
Materials and Methods: Human Coriell CEPH trio samples NA12878, NA12891, and NA12892; Nextera Rapid Capture Exome kit; analysis with BaseSpace BWA Enrichment App.
A variety of Illumina technologies can be used to help understand cancer markers and progression. To illustrate this, we are publishing a series of Tumor-Normal datasets in BaseSpace analyzed using several approaches. Read this tech note for additional details on how to visualize this particular tumor/normal data set using the integrated set of tools in BaseSpace.
Materials and Methods: DNA was extracted from the HCC1187 breast ductal carcinoma cell line and a matching lymphoblastoid cell line. 500 ng of DNA was prepped using an early access version of Illumina’s TruSeq DNA PCR-Free kit and sequenced on a HiSeq 2000. The data was analyzed using a pre-release version of the Cancer Sequencing Workflow. This data is being shared in accordance with the terms of a licensing agreement with UT Southwestern, the owner of the cell lines.
To access the data, click on the link below to see the project folder. You will be asked to “Accept” the Project into your BaseSpace account: this is the same mechanism you will use to share specific real-life projects or runs with your colleagues/collaborators via a dedicated URL.
- TumorNormal_WGS_HiSeq2000_CSW_0.23: Project (Cancer Sequencing Workflow (pre-release version) output files).
Below is a preview of what can be found in the data set and the related technical note:
Summary table from the Somatic Summary Report, automatically generated using Illumina’s Cancer Sequencing Workflow (pre-release):
Circos Plot showing HCC1187 Somatic Mutations, automatically generated using Illumina’s Cancer Sequencing Workflow (pre-release):
Customized version of the Broad IGV, fully integrated into BaseSpace, and showing some VCF and BAM tracks for this Tumor-Normal dataset:
Learn more about:
- HiSeq 2000 Sequencer
- TruSeq DNA PCR-Free Sample Prep Kit
- Illumina Cancer Genomics Solutions
- Tumor-Normal Data in BaseSpace Tech Note
Note: HCC cell lines were invented by Drs. Adi F. Gazdar and John D. Minna at the University of Texas Southwestern Medical Center. Rights in and to the HCC cell lines, progeny, and unmodified derivatives thereof belong to the Board of Regents of The University of Texas System. Illumina, Inc. has obtained permission from the Board of Regents of The University of Texas System through the University of Texas Southwestern Medical Center to use the HCC cell lines and publish the data and results herein displayed.
We are happy to announce the BaseSpace availability of three Nextera Rapid Capture Exome data sets:
- A standard 37Mb exome sequenced on two lanes of the HiSeq 2500®
- An expanded 62Mb exome sequenced on two lanes of the HiSeq 2500
- A standard 37Mb exome sequenced on MiSeq
These data sets demonstrate the high uniformity and accuracy of Illumina’s new Nextera Rapid Capture Exome sample prep and Illumina sequencing. The data sets also demonstrate Illumina’s updated Enrichment analysis, which is available either through a new version of MiSeq Reporter or through the new HiSeq Analysis Software product.
Click on the links below to see the project and run folders. You will be asked to “Accept” the Run/Project into your BaseSpace account: this is the same mechanism you will use to share specific real-life projects or runs with your colleagues/collaborators via a dedicated URL.
- NexteraRapidCaptureExome_HiSeq2500: Run (QC plots & summary metrics), Project (Enrichment workflow output files).
- NexteraRapidCaptureExpandedExome_HiSeq2500: Run (QC plots & summary metrics), Project (Enrichment workflow output files).
- NexteraRapidCaptureExome_MiSeq_NA18507: Run (QC plots & summary metrics), Project (Enrichment workflow output files).
Summary of the HiSeq 2500 standard exome run:
Summary of exome analysis metrics from this run:
Materials and Methods: Human Coriell samples NA18507, NA10859, and NA12144; Nextera Rapid Capture Exome and Expanded Exome; analysis with HiSeq Analysis Software and MiSeq Reporter.