Native App Report Engine Improvements

We recently released improvements to our reporting engine for native apps that make some common tasks much easier. For those of you who are new to the “Native App Engine”, it comprises three main developer components that let a developer package any analytic workflow and publish it in the BaseSpace App store.

Those three components are:

  1. Form Builder – Custom input form designer
  2. Docker – Linux container technology that your app runs in
  3. Report Builder – HTML5 template engine

It’s the same toolset that we use internally to publish apps, and we are constantly looking to make improvements based on internal and external feedback.

Report Generation


The reporting engine, which is essentially an HTML template engine, is the last step in building a native app. When your application finishes processing, we build a meta model around all the data that was produced and allow you to bind that to an HTML template. We use an open source technology called Liquid, a markup template engine used by many other companies with similar needs. Along with the basic filters that are defined in Liquid, we have extended the syntax to cover BaseSpace-specific needs.

New Features

Find Filter

XPath operator on XML files

  • Takes an XPath expression and returns the resulting XML text

{{ statsFile | find: "/Statistics/Overall/Stats[SampleID='10002 - R1']/NumberOfClustersPF" }}

 

Stringify Improvements

Stringify is a custom filter that serializes the contents of a drop, CSV, or XML segment to JSON.

var globals = {}; 
globals.sample = {{ sample | stringify }};
globals.sample.chromosomes = {{ sample.chromosomes | stringify }};
globals.sample.statsByChromosome = {{ statsFile.parse.StatisticsResequencing.Samples.SampleStatistics | stringify }}; 

 

Custom Dictionary Filters

  • where.starts_with – returns the values where the dictionary key starts with the provided string
  • where.ends_with – returns the values where the dictionary key ends with the provided string
  • where.contains – returns the values where the dictionary key contains the provided string
  • first – returns the first value, errors if nothing there
  • first_or_default – returns the first value or default (usually null)

{% assign_object datafile1 = result.files.where.starts_with["datafile1"].first %}
{% assign_object anyXml = result.files.where.ends_with[".xml"].first_or_default %}
{% assign_object datafile3 = result.files.where.contains["file3"].first %}
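
To make the semantics of these filters concrete, here is a minimal JavaScript sketch that emulates them on a plain object keyed by file name. The helper names (where, startsWith, endsWith, contains, first, firstOrDefault) and the sample data are illustrative assumptions; they are not part of the BaseSpace report engine, whose implementation is not shown in this post.

```javascript
// Hypothetical emulation of the dictionary filters on a plain object.
// `dict` maps a key (here, a file name) to a value (here, a file record).
function where(dict, predicate) {
  // Keep the values whose key satisfies the predicate, in insertion order.
  return Object.keys(dict).filter(predicate).map((key) => dict[key]);
}

const startsWith = (dict, s) => where(dict, (k) => k.startsWith(s));
const endsWith = (dict, s) => where(dict, (k) => k.endsWith(s));
const contains = (dict, s) => where(dict, (k) => k.includes(s));

function first(values) {
  // Mirrors `first`: an empty result is an error.
  if (values.length === 0) throw new Error("first: no matching values");
  return values[0];
}

function firstOrDefault(values, fallback = null) {
  // Mirrors `first_or_default`: an empty result yields the default.
  return values.length === 0 ? fallback : values[0];
}

// Made-up example data.
const files = {
  "datafile1.csv": { name: "datafile1.csv" },
  "stats.xml": { name: "stats.xml" },
  "datafile3.txt": { name: "datafile3.txt" },
};

console.log(first(startsWith(files, "datafile1")).name); // datafile1.csv
console.log(firstOrDefault(endsWith(files, ".json")));   // null
```

The difference between the last two helpers is the one that matters in a template: first is appropriate when a missing file should stop the report, while first_or_default lets the template test for null and carry on.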

 

Break and Continue Tags

Allows breaking and continuing in Liquid loops.

{% for file in files %}
	{% if file.href == null %}
		{% continue %}
	{% endif %}
	{% if file.href == 'http://special.com' %}
		{% assign specialHref = file.href %}
		{% break %}
	{% endif %}
{% endfor %}

 

Select Columns

Allows the selection of columns in a CSV file by column name or index.

{% assign grid2 = result.files[key].select['0,1,2'].take[1].parse %}
{% assign grid3 = result.files[key].select['LastName','City','Phone'].take[5].parse %}

 

Take

Ability to take a subset of rows from a CSV file.

{% assign grid2 = result.files[key].select['0,1,2'].take[1].parse %}
{% assign grid3 = result.files[key].select['LastName','City','Phone'].take[5].parse %}

 

ToArray

Ability to output CSV data rows to a 2-dimensional data array.

{{ result.files[key].parse.to_array | stringify }}
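
As a rough illustration of how select, take, and to_array chain together, the following JavaScript sketch applies the same three steps to CSV text. It is an emulation under stated assumptions, not the engine's code: the first line is treated as a header, fields contain no quoted commas, take[n] is assumed to keep the first n data rows, and the function names and sample data are made up.

```javascript
// Hypothetical emulation of select / take / to_array on CSV text.
function parseCsv(text) {
  // Assumption: first line is the header, no quoted or escaped commas.
  const [header, ...rows] = text
    .trim()
    .split("\n")
    .map((line) => line.split(","));
  return { header, rows };
}

function select(table, columns) {
  // Columns may be given by name or by zero-based index.
  const indexes = columns.map((c) =>
    typeof c === "number" ? c : table.header.indexOf(c)
  );
  const pick = (row) => indexes.map((i) => row[i]);
  return { header: pick(table.header), rows: table.rows.map(pick) };
}

function take(table, n) {
  // Assumption: "take" keeps the first n data rows.
  return { header: table.header, rows: table.rows.slice(0, n) };
}

function toArray(table) {
  // Two-dimensional array of data rows, ready for JSON.stringify.
  return table.rows;
}

const csv = "LastName,City,Phone\nSmith,Austin,555-0100\nJones,Boston,555-0101\n";
const grid = toArray(take(select(parseCsv(csv), ["LastName", 1]), 1));
console.log(JSON.stringify(grid)); // [["Smith","Austin"]]
```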

 

Assign Improvements

Assign now allows assignment of any Liquid object, not just primitives.

{% assign myFiles = result.files %}
{% assign myCsv = myFiles.where.ends_with[".csv"].first.select["0"].take[2].parse.to_array | stringify %}
{{ myCsv }}

 

Summary

We hope developers will leverage these new features to build great interactive reports.  If you want to learn more about native apps, then read our intro post, check out our developer portal, or follow our native app tutorial.

Prokka small genome annotation is now in BaseSpace Apps.

We are pleased to announce the release of our latest BaseSpace Labs App, Prokka Genome Annotation.

 

image

Prokka wraps the tool of the same name developed by Dr. Torsten Seemann of the Victoria Bioinformatics Consortium. Prokka automates the process of building an annotation of a prokaryotic genome, first running a comprehensive set of feature prediction tools, then combining their output into standards-compliant files suitable for further analysis, visualization in genome browsers, or submission to archives.

As input, the Prokka App requires a FASTA file, which is assumed by default to contain assembled contigs from a bacterial or other prokaryotic genome, such as those produced by the SPAdes, Velvet de novo Assembly, or DNAStar Assemble bacteria Apps. Shotgun metagenomic data can also be annotated by making the appropriate selection on the input form. An example of the App’s output can be found here.

Citation: Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014 Jul 15;30(14):2068-9. PMID:24642063

Send data from BaseSpace to NextBio Research

With the addition of NextBio products to our informatics offerings, Illumina adds one of the richest compendia of curated genomic data in existence today. As a starting point towards integrating our BaseSpace and NextBio platforms, we are proud to announce the release of the NextBio Transporter App as our latest BaseSpace Labs release. The Transporter sends analysis results from BaseSpace into NextBio Research and requires that users have an existing account with NextBio.

NextBio Transporter

Similar to the NextBio Annotates RNA-Seq App, the NextBio Transporter uses an AppResult as input to the app, and currently supports outputs from the Cufflinks or RNA Express Core Apps from Illumina. You must also specify your account and domain information in NextBio Research, and the app takes care of everything else.

As output, users are provided a link to the transported data in NextBio Research, and a QuickView is also generated which displays the information NextBio has found relating to the input data.

Transporter Output

Within NextBio Research, users can then explore connections with curated content. For example, by clicking on “Curated Studies”, we can pull up published studies that have produced results that are highly correlated with our transported dataset. NextBio Research offers an incredibly rich platform for biological information, and we are excited to now provide the ability for BaseSpace users to connect their sequencing data to the biological insights offered by NextBio.

Variant calling assessment using Platinum Genomes, NIST Genome in a Bottle, and VCAT 2.0

With the rapid improvements in sequencing throughput, cost, and ease of use, it’s becoming routine to generate lots of variant calls in the form of VCF files. But how do you know if your new variant calls are accurate? How can a non-bioinformatician compare variant calls from different sequencing platforms, reagent kits, biological samples, or software pipelines? Illumina is now offering a carefully designed and highly curated data set and a corresponding BaseSpace Labs App to address these types of comparison questions.

The Platinum Genomes project was started in 2011 with the goal of creating a high confidence, “platinum” quality reference variant call set. This was accomplished by sequencing a large family to high depth using a PCR-Free sample prep to maximize variant calling sensitivity. A large set of candidate variants was obtained from multiple methods and technologies. Candidates that were pedigree consistent were included in the reference call set. Based on this approach, Illumina has derived a set of high-confidence, pedigree-validated reference variant calls for Coriell samples NA12877 and NA12878.

The full set of Platinum Genomes public data and documentation is freely available at http://www.illumina.com/platinumgenomes/. The BaseSpace Platinum Genomes Project also has copies of the platinum VCF files.

Please cite the Platinum Genomes website and Illumina, Inc. in publications and other public usage of the Platinum Genomes data.

In addition, Illumina has upgraded the Variant Calling Assessment Tool (VCAT 2.0) BaseSpace app. The app calculates SNV and indel statistics and optionally determines the overlap between the input variant call sets. Additionally, the quality of SNV and indel calls can be assessed based on Platinum Genomes and/or NIST Genome in a Bottle (GIAB) reference variant calls. No existing tool currently offers a simple user interface for using both of these resources. The accuracy and comparison logic in VCAT is primarily based on vcftools, a commonly used open source toolkit for analyzing variant calls. More insight into how VCAT works is available by browsing the VCAT log file.

The Platinum Genomes project is led by Epameinondas Fritzilas and the VCAT project is led by Robert Schmieder, while many other team members have contributed. Please note that while both Platinum Genomes and VCAT are freely available, Illumina does not offer technical support for either of these resources.

There are many interesting ways to use these powerful new tools together. Here’s an example:

Case study on exome sequencing: How much depth is enough?

Using the “Combine Samples” feature in BaseSpace, Nextera Rapid Capture Exome samples of approximately 50x, 100x, 200x, and 400x were created from replicates of Coriell sample NA12878. The source data is here. A BaseSpace Project containing the resulting VCF files and the VCAT 2.0 results is here. The Platinum Genomes v7 recall numbers below suggest that 50x exome depth may only find 80% of the SNVs and 70% of the indels, while exome depths greater than 200x enable finding over 95% of SNVs and over 88% of indels.

image

image

image
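
For readers unfamiliar with the metric, recall here is the fraction of reference variants that also appear in the call set being assessed. The JavaScript sketch below shows the arithmetic on made-up toy data (not data from this study). Keying variants by an exact "chrom:pos:ref:alt" string is a simplifying assumption and not necessarily how VCAT or vcftools match calls.

```javascript
// Recall = reference variants recovered / all reference variants.
// Exact-match keying by "chrom:pos:ref:alt" is a simplifying assumption.
function recall(referenceKeys, calledKeys) {
  const called = new Set(calledKeys);
  const truePositives = referenceKeys.filter((k) => called.has(k)).length;
  return referenceKeys.length === 0 ? NaN : truePositives / referenceKeys.length;
}

// Made-up toy data for illustration only.
const reference = ["chr1:100:A:G", "chr1:250:C:T", "chr2:75:G:A", "chr2:900:T:C"];
const calls = ["chr1:100:A:G", "chr2:75:G:A", "chr2:900:T:C", "chr3:10:A:T"];
console.log(recall(reference, calls)); // 0.75
```

Note that the extra call on chr3 does not affect recall; it would count against precision instead.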

VCAT 2.0 also enables the analysis of samples other than NA12878 via pairwise intersect comparisons. The Venn diagrams and corresponding tables shown below are from a VCAT report from the same example BaseSpace Project. When using this feature, VCAT also creates new VCF files which represent the unique SNV and indel calls, as well as VCF files for the common calls.

image

 

image
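
Conceptually, a pairwise intersect splits two call sets into three groups: calls unique to the first set, calls unique to the second, and calls common to both. The JavaScript sketch below shows that split on made-up variant keys. As before, the exact-match "chrom:pos:ref:alt" keying and the function name are assumptions for illustration, not a description of VCAT's internals.

```javascript
// Split two variant call sets into unique and common calls.
// Variants are compared by exact "chrom:pos:ref:alt" key (an assumption).
function intersectCalls(callsA, callsB) {
  const setA = new Set(callsA);
  const setB = new Set(callsB);
  return {
    uniqueToA: callsA.filter((k) => !setB.has(k)),
    uniqueToB: callsB.filter((k) => !setA.has(k)),
    common: callsA.filter((k) => setB.has(k)),
  };
}

// Made-up toy data for illustration only.
const deepExome = ["chr1:100:A:G", "chr1:250:C:T", "chr2:75:G:A"];
const shallowExome = ["chr1:100:A:G", "chr3:10:A:T"];
const venn = intersectCalls(deepExome, shallowExome);
console.log(venn.uniqueToA.length, venn.uniqueToB.length, venn.common.length); // 2 1 1
```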

The Unique VCF files are also indexed for browsing within the BaseSpace IGV App.  Below is a screenshot which shows two SNVs that are found in the 105x exome, but are missed in the 53x exome due to low coverage depth.

image

That’s it for now. In an upcoming blog post, we’ll look at Platinum Genomes and NIST GIAB in more detail including some comparisons.

Deleting Samples and Projects

Our initial release of the delete feature provided users with the ability to delete Runs and Analyses, but we didn’t want to stop there.  Our users now have the ability to delete Samples and Projects from their accounts! Here’s how it works:

  • You can delete a Project from the list of all Projects in your account


  • You can also remove a Project from its specific page


  • Samples can be deleted from within a Project and multiple Samples can be selected at once


  • Deleted items are first moved to the Trash; from there you may Restore items or Empty Trash


  • If you delete a Project or Sample as a collaborator instead of an owner, the data will be un-shared with your account

For any questions, concerns, or feedback, please do not hesitate to contact us.  We are happy to help in any way that we can. Thanks!

Rounding out 2014 with new apps for the BaseSpace platform

We are looking forward to 2015 as we will continue to launch new Apps and support additional applications, but we are excited to close out 2014 with the release of three new Illumina Core Apps in BaseSpace:

image

The Amplicon-DS App enables analysis of the Illumina TruSight Tumor library prep kit. This solution is specifically designed for analysis of all tumor samples, including FFPE. Using targeted TruSeq Amplicon chemistry and a unique, mirrored dual strand (“DS”) assay, researchers can easily detect low frequency somatic mutations. Amplicon-DS also leverages the mirrored dual strand design to reconcile variant calls and capture deamination events due to FFPE, providing confident measurements even in degraded samples.

The Isaac and BWA Enrichment v2.0 Apps add significant functionality over the Enrichment v1.0 Apps. Both Isaac and BWA can now analyze Nextera Rapid Capture Custom panels built in Illumina’s DesignStudio. Isaac Enrichment v2.0 includes Illumina’s own Isaac pipeline for alignment and variant calling. BWA Enrichment v2.0 incorporates the latest aligner, BWA-MEM, which provides improved accuracy (especially when calling structural variants) and increased speed. Both the Isaac and BWA Enrichment v1.0 Apps are available concurrently with v2.0 Apps in BaseSpace Cloud.

In addition to the above Illumina Core Apps, we are also launching a BaseSpace Labs App called FASTQ Toolkit v1.0.

image

This App gives users enhanced control over their data, allowing manipulation of FASTQ files including adapter trimming, quality trimming, length filtering, and down-sampling. Users can now down-sample or quality-trim their data and determine what effect that has on their variants, gene expression results, or bacterial classifications. Users could also assess their sample data with the FastQC App and then use that information to optimize their samples with the FASTQ Toolkit v1.0.

Specs for the FASTQ Toolkit v1.0 are as follows:

Input- BaseSpace samples (max=200GB per analysis) and user specified parameters that define how the input sample(s) should be processed.

Output- Samples that can be accessed on the “Samples” page of the selected output project. In addition, the App generates a statistics summary file in JSON format that is used to generate the BaseSpace report.

Adapter Trimming-  performed using the approximate matching approach described in TagCleaner. The adapter sequence can be specified separately for the 5′- and 3′-end. Poly-A/T tails are considered repeats of As or Ts at the sequence ends. Trimming them can reduce the number of false positives during database searches, as long tails tend to align well to sequences with low complexity or sequences with tails (e.g. viral sequences) in the database.

Bases can be trimmed from either the 5′- or 3′-end. Alternatively, reads can be trimmed to a maximum read length. Quality trimming on the 3′-end is also available. Note: Aligners such as BWA and Isaac perform trimming internally during alignment. The trimming logic was adapted from BWA.
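
Since the trimming logic is described only as adapted from BWA, here is a JavaScript sketch of the running-sum approach that BWA uses for 3′-end quality trimming (its -q option in bwa aln). Whether the App matches it in every detail is an assumption; in particular, this sketch omits any minimum retained read length, and the function names and the example read are made up.

```javascript
// BWA-style 3'-end quality trimming (running-sum approach).
// Walking from the 3' end, accumulate (threshold - quality). The read is cut
// where this sum peaks, and the scan stops once the sum turns negative.
function trimmedLength(qualities, threshold) {
  let sum = 0;
  let best = 0;
  let keep = qualities.length; // default: keep the whole read
  for (let i = qualities.length - 1; i >= 0; i--) {
    sum += threshold - qualities[i];
    if (sum < 0) break;
    if (sum > best) {
      best = sum;
      keep = i;
    }
  }
  return keep;
}

function qualityTrim(read, threshold) {
  // `read.quality` is assumed to be a Phred+33 encoded string.
  const quals = Array.from(read.quality, (ch) => ch.charCodeAt(0) - 33);
  const keep = trimmedLength(quals, threshold);
  return {
    sequence: read.sequence.slice(0, keep),
    quality: read.quality.slice(0, keep),
  };
}

// Made-up read: high-quality start ('I' = Q40), low-quality tail ('#' = Q2).
const read = { sequence: "ACGTACGT", quality: "IIIII###" };
console.log(qualityTrim(read, 30).sequence); // ACGTA
```

Because the cut point is chosen by the peak of a running sum rather than by the first base below the threshold, a single good base inside a poor tail does not rescue the tail.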

Down-sampling is performed when only a subset of the sample is needed for an application, such as de novo assembly with memory constraints, or when it is not necessary to process a full sample, like validating an approach at varying levels of genomic coverage.
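
The post does not say how the App chooses which reads to keep. One common technique for drawing a fixed-size random subset from a large file is reservoir sampling, sketched below in JavaScript. Treat it as an illustration of the idea rather than the App's method; the function name, the injectable random parameter, and the toy input are assumptions.

```javascript
// Reservoir sampling (Algorithm R): keep k items chosen uniformly at random
// from a stream without holding the whole stream in memory.
function downSample(readPairs, k, random = Math.random) {
  const reservoir = [];
  let seen = 0;
  for (const pair of readPairs) {
    seen += 1;
    if (reservoir.length < k) {
      reservoir.push(pair); // fill the reservoir with the first k pairs
    } else {
      // Replace a kept pair with probability k / seen.
      const j = Math.floor(random() * seen);
      if (j < k) reservoir[j] = pair;
    }
  }
  return reservoir;
}

// Toy input: ten read-pair identifiers, down-sampled to three.
const pairIds = Array.from({ length: 10 }, (_, i) => `pair-${i + 1}`);
const subset = downSample(pairIds, 3);
console.log(subset.length); // 3
```

Sampling whole pairs, rather than individual reads, keeps mates together. The reservoir does not preserve input order, so a real tool would need an extra step if the original ordering matters.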

Filtering- Paired-end reads are only filtered (and removed from the sample) if both reads are filtered out. Otherwise, the filtered mate is replaced by a sequence of Ns (number of Ns will be the minimum read length) to keep the order of pairs in the FASTQ files, which is necessary for many secondary analysis tools.
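
The pairing rule above can be sketched in a few lines of JavaScript. The data layout (pairs as two-element arrays of records with sequence and quality fields), the function names, and the choice of the lowest Phred+33 quality character for the placeholder bases are assumptions made for the example.

```javascript
// Paired-end filtering rule: drop a pair only when both mates fail;
// otherwise replace the failing mate with Ns so pair order is preserved.
function filterPairs(pairs, passes, minReadLength) {
  const placeholder = {
    sequence: "N".repeat(minReadLength),
    // Lowest Phred+33 quality for the placeholder bases (an assumption).
    quality: "!".repeat(minReadLength),
  };
  const kept = [];
  for (const [r1, r2] of pairs) {
    const ok1 = passes(r1);
    const ok2 = passes(r2);
    if (!ok1 && !ok2) continue; // both mates filtered: remove the pair
    kept.push([ok1 ? r1 : { ...placeholder }, ok2 ? r2 : { ...placeholder }]);
  }
  return kept;
}

// Toy example: a length filter with a minimum read length of 4.
const minLen = 4;
const longEnough = (r) => r.sequence.length >= minLen;
const pairs = [
  [{ sequence: "ACGTAC", quality: "IIIIII" }, { sequence: "AC", quality: "II" }],
  [{ sequence: "A", quality: "I" }, { sequence: "C", quality: "I" }],
];
const result = filterPairs(pairs, longEnough, minLen);
console.log(result.length, result[0][1].sequence); // 1 NNNN
```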

Nextera Mate-pair conversion- The App supports conversion of Nextera Mate-Pair oriented reads to paired-end oriented reads.

The output of the App contains a set of before and after metrics so you can quickly see the properties of your new data. The table below is an example of the results of down-sampling 2,957,468 read pairs to 500,000 read pairs while also performing quality trimming (< Q30) from the 3′ end of the reads.

image

A read length distribution is also provided, as shown below for Read 1. It shows the distribution of read lengths in your data before and after trimming and allows the user to quickly assess what effect the trimming had on their data.

image

Finally, a read filtering summary is provided as shown below. The summary will only contain numbers if an option that turns on read filtering, such as quality trimming (which filters out reads shorter than 32 bp), is selected.

image

We are very proud of the hard work our team has put into providing these Apps for the NGS community and look forward to an even more exciting 2015.

Introducing BSFS, the BaseSpace File System

Today, together with the current release of BaseSpace, we would like to announce a product that has kept me and other developers on the BaseSpace team really excited and really busy over the past months: the BaseSpace File System (abbreviated as BSFS or BaseSpace FS). BSFS, a feature that many developers on the BaseSpace platform have been asking for, is a way to directly mount your Samples’ and AppResults’ data residing in BaseSpace into your Docker containers and access it on a strictly as-needed basis.

A range of improvements

This addition to the BaseSpace platform will bring in a great number of benefits, which I will go over now:

  • No pre-download of your Samples and AppResults

When running your apps on the BaseSpace Native App Engine with BSFS turned on, you will notice your applications executing right away upon launch. The usual pre-download step, which could take a good few hours on those very large NextSeq or HiSeq samples, is now eliminated.

  • Fewer network data transfers

When an app executes on an input sample or app result, there is no guarantee that it will use the entire input dataset. Until today, the entire input dataset had to be downloaded before any processing could happen; this is no longer the case. BSFS presents a virtual view of your data in the file system, and downloads only the data that is actually read from the files.

  • Overlap computation and network transfers

A typical data processing workflow is for an app to read data and then process it in an iterative fashion. To make this process more efficient, BSFS features a data pre-fetch mechanism: while the app is processing data at a certain location in a file, the data directly adjacent to and following this location is downloaded automatically. This mitigates slow downloads caused by network latency.
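
The idea can be illustrated with a toy read-ahead cache in JavaScript. This is not BSFS code: the class name, the fixed-size chunk model, and the fetchChunk callback are assumptions, and the fetch is synchronous here purely to keep the bookkeeping easy to follow, whereas a real pre-fetch would run in the background while the app computes.

```javascript
// Toy read-ahead cache: reading chunk i also fetches chunk i + 1.
class ReadAheadFile {
  constructor(fetchChunk, chunkCount) {
    this.fetchChunk = fetchChunk; // hypothetical function: index -> data
    this.chunkCount = chunkCount;
    this.cache = new Map();
  }

  load(index) {
    if (index < 0 || index >= this.chunkCount) return;
    if (!this.cache.has(index)) this.cache.set(index, this.fetchChunk(index));
  }

  read(index) {
    this.load(index);     // the chunk the app asked for
    this.load(index + 1); // the adjacent, following chunk
    return this.cache.get(index);
  }
}

// Record which chunks are actually "downloaded".
const fetched = [];
const file = new ReadAheadFile((i) => {
  fetched.push(i);
  return `chunk-${i}`;
}, 4);
file.read(0);
file.read(1);
console.log(fetched); // [ 0, 1, 2 ]
```

By the time the app asks for chunk 1 it is already cached, and chunk 3, which is never read or adjacent to a read, is never fetched at all.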

  • An improved workflow for developers

One of the major areas of focus of the BaseSpace platform team has been to provide developers with an awesome experience, and adding BSFS to the platform will make things even more awesome!
Soon after this release we will be providing a public Amazon Machine Image (AMI) which is the same one we are using in production today. This image contains all that’s required to get started coding in BaseSpace together with BSFS. This is a huge improvement of the developer workflow as it will provide an environment that is readily usable and in which you can simply drop apps in a docker container and see them interact with your BaseSpace data within minutes of getting started!
Finally, with the download step eliminated, there is nothing left to get in the way of a highly iterative development process, where developers can work directly with their BaseSpace data.

  • New and existing apps

All new apps created in the BaseSpace developer portal will now have BSFS turned on by default. We have also made sure that existing apps can benefit fully from this addition: if you have been following developer guidelines and conventions (i.e., the /data/input drive should not be written to), enabling BSFS in your existing app should be as easy as flicking a switch.

Upon creation of a new application in the developer portal, you will notice a slightly modified launch spec callback, with a new Options array that is used to turn on BSFS:

function launchSpec(dataProvider)
{
    var ret = {
        commandLine: [ "cat", "/illumina.txt" ],
        containerImageId: "basespace/demo",
        Options: [ "bsfs.enabled=true" ]
    };
    return ret;
}

To enable BSFS in an existing app, update its launch spec callback to include this new Options array.

Also, as of today the 16S Metagenomics v1.0 app is running with BaseSpace FS switched on to reap all the performance benefits. In the coming weeks, we will turn on BSFS for the rest of the BaseSpace core apps.

  • Real world performance improvements

The kinds of speed-ups we are seeing on these apps only scratch the surface of the potential speed-ups we can get. On large samples processed on a single node, the performance benefits are less pronounced since the download time is dwarfed by the compute time. However, multi-node applications that access part of an input sample or app result will benefit greatly, as the download portion is always a major contributor to the overall execution time.

With that, I hope you will share my excitement about this announcement, and that BSFS will make your development process even more awesome in BaseSpace.

Links to more resources

BaseSpace Developer portal
BaseSpace FileSystem Developer Documentation
Using BSFS