Introducing BSFS, the BaseSpace File System

Today, together with the current release of BaseSpace, we would like to announce a product that has had me and the other developers on the BaseSpace team really excited, and really busy, over the past months: the BaseSpace File System (abbreviated as BSFS or BaseSpace FS). Many of our developers on the BaseSpace platform have been asking for this feature. BSFS lets you mount the data of your Samples and AppResults residing in BaseSpace directly into your Docker containers and access it on a strictly as-needed basis.

A range of improvements

This addition to the BaseSpace platform brings a number of benefits, which I will go over now:

  • No pre-download of your Samples and AppResults

When running your apps on the BaseSpace Native App Engine with BSFS turned on, you will notice your applications executing right away upon launch. The usual pre-download step, which could take a good few hours on those very large NextSeq or HiSeq samples, is now eliminated.

  • Fewer network data transfers

When an app executes on an input sample or app result, there is no guarantee that it will use the entire input dataset. Until today, the entire input dataset had to be downloaded before any processing could happen. This is no longer the case: BSFS presents a virtual view of your data in the file system and downloads only the data that is actually read from the files.

  • Overlap computation and network transfers

A typical data processing workflow is for an app to read data and then process it in an iterative fashion. To make this process more efficient, BSFS features a data pre-fetch mechanism: while the app is processing data at a certain location in a file, the data directly following this location is downloaded automatically. This mitigates slow downloads caused by network latency.
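To make the pre-fetch idea concrete, here is a minimal in-memory sketch. It is not how BSFS is implemented: the class name, the chunk model, and the `readAhead` parameter are all assumptions made for illustration, and a real file system would pre-fetch asynchronously while the app computes.

```javascript
// Illustrative model only: none of these names come from BSFS itself.
class ReadAheadFile {
  constructor(remoteChunks, readAhead = 1) {
    this.remoteChunks = remoteChunks; // stands in for data stored remotely
    this.readAhead = readAhead;       // number of following chunks to pre-fetch
    this.cache = new Map();           // chunk index -> data already downloaded
    this.downloads = 0;               // how many chunks were actually fetched
  }

  // Simulate downloading one chunk, skipping chunks that are already cached.
  fetch(index) {
    if (index < 0 || index >= this.remoteChunks.length) return;
    if (this.cache.has(index)) return;
    this.cache.set(index, this.remoteChunks[index]);
    this.downloads += 1;
  }

  // Read one chunk, then pre-fetch the chunks that directly follow it.
  read(index) {
    this.fetch(index);
    for (let i = 1; i <= this.readAhead; i++) {
      this.fetch(index + i);
    }
    return this.cache.get(index);
  }
}

const file = new ReadAheadFile(["A", "B", "C", "D", "E", "F"], 1);
const first = file.read(0);  // downloads chunks 0 and 1
const second = file.read(1); // chunk 1 is already cached; downloads chunk 2
console.log(first, second, file.downloads); // prints: A B 3
```

Chunks 3 to 5 are never downloaded in this run, which is the "only what is read" behaviour described earlier, and the second read is served from the cache because it was fetched ahead of time.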

  • An improved workflow for developers

One of the major areas of focus of the BaseSpace platform team has been to provide developers with an awesome experience, and adding BSFS to the platform will make things even more awesome!
Soon after this release, we will provide a public Amazon Machine Image (AMI), the same one we use in production today. This image contains everything required to get started coding in BaseSpace together with BSFS. This is a huge improvement to the developer workflow: it provides a ready-to-use environment in which you can simply drop apps in a Docker container and see them interact with your BaseSpace data within minutes of getting started!
Finally, with the download step eliminated, there is nothing left to get in the way of a highly iterative development process, where developers can work directly with their BaseSpace data.

  • New and existing apps

All new apps created in the BaseSpace developer portal will now have BSFS turned on by default. We have also made sure that existing apps can benefit fully from this addition: if you have been following the developer guidelines and conventions (i.e., the /data/input drive should not be written to), enabling BSFS in your existing app should be as easy as flicking a switch.

Upon creation of a new application in the developer portal, you will notice a slightly modified launch spec callback, with a new Options array that is used to turn on BSFS:

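function launchSpec(dataProvider)
{
    var ret = {
        // Command to run inside the container
        commandLine: [ "cat", "/illumina.txt" ],
        // Docker image the app runs in
        containerImageId: "basespace/demo",
        // New: turns on BSFS for this app
        Options: [ "bsfs.enabled=true" ]
    };
    return ret;
}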

To enable BSFS in an existing app, update its launch spec callback so that it returns this new Options array.
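For an existing app, one way to retrofit the callback is to pass the spec you already build through a small helper. This is only a sketch: `enableBsfs` is a hypothetical name, and it assumes `Object.assign` is available where the callback is evaluated (it is in Node.js, where this snippet runs as-is).

```javascript
// Hypothetical helper: returns a copy of a launch spec with BSFS enabled.
function enableBsfs(spec) {
  var options = (spec.Options || []).filter(function (opt) {
    // Drop any earlier bsfs.enabled setting so the option is not duplicated.
    return opt.indexOf("bsfs.enabled=") !== 0;
  });
  options.push("bsfs.enabled=true");
  return Object.assign({}, spec, { Options: options });
}

// An existing callback, written before BSFS was available.
function launchSpec(dataProvider) {
  var ret = {
    commandLine: ["cat", "/illumina.txt"],
    containerImageId: "basespace/demo"
  };
  return enableBsfs(ret);
}

console.log(launchSpec(null).Options); // [ 'bsfs.enabled=true' ]
```

Adding the option directly to the object literal works just as well; the helper is only convenient if you maintain several callbacks.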

Also, as of today the 16S Metagenomics v1.0 app is running with BaseSpace FS switched on to reap all the performance benefits. In the coming weeks, we will turn on BSFS for the rest of the BaseSpace core apps.

  • Real world performance improvements

The speed-ups we are seeing on these apps only scratch the surface of what is possible. On large samples processed on a single node, the performance benefits are less pronounced, since the download time is dwarfed by the compute time. Multi-node applications in which each node accesses only part of an input sample or app result, however, will benefit greatly, because for them the download portion is a major contributor to the overall execution time.

With that, I hope you will share my excitement about this announcement, and that BSFS will make your development process in BaseSpace even more awesome.

Links to more resources

BaseSpace Developer portal
BaseSpace FileSystem Developer Documentation
Using BSFS

Import data from SRA into BaseSpace

When a user generates new data, a common workflow is to compare new results with previously published ones. So how would a user of BaseSpace do this? Until now, a user would have two choices:

  1. Download their BaseSpace data and previously published data to their local machine, and then build the bioinformatics workflow to process both datasets.
  2. Re-process previously published data to send into BaseSpace using the BaseSpace FASTQ uploader.

Today, we’re excited to announce that bringing data into BaseSpace from NCBI’s Sequence Read Archive (SRA) becomes push-button easy with the SRA Import App, our next BaseSpace Labs release.

Inputs

The SRA Import App lets users easily import data from any of the big three public data repositories – SRA, the European Nucleotide Archive (ENA), and the DNA Data Bank of Japan (DDBJ). The only information needed is a valid accession number.

SRA Import form

Accession numbers

From the accession number, the app will capture all associated SRA Run data and import the FASTQ files into BaseSpace so that you may use them with other BaseSpace apps. This means that you may enter accessions for studies (SRP*/ERP*/DRP*), experiments (SRX*/ERX*/DRX*), samples (SRS*/ERS*/DRS*), runs (SRR*/ERR*/DRR*), or submissions (SRA*/ERA*/DRA*), and the app will import any associated FASTQ files. Some samples and studies have many associated FASTQ files. We currently limit the import to 25GB of data per request, so it may help to check the size of a study first and plan the import around that limit.
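The prefix scheme above is regular enough to check with a short function. The sketch below is not the app's own validation code: the function and table names are hypothetical, and it assumes that everything after the three-letter prefix is digits.

```javascript
// Third letter of the accession prefix -> record type, as listed above.
const RECORD_TYPES = { P: "study", X: "experiment", S: "sample", R: "run", A: "submission" };
// First letter of the accession prefix -> repository that issued it.
const REPOSITORIES = { S: "SRA", E: "ENA", D: "DDBJ" };

// Returns { repository, type }, or null if the text does not look like an accession.
// Assumption: the part after the three-letter prefix is all digits.
function classifyAccession(text) {
  const match = /^([SED])R([PXSRA])\d+$/.exec(text.trim().toUpperCase());
  if (match === null) return null;
  return { repository: REPOSITORIES[match[1]], type: RECORD_TYPES[match[2]] };
}

console.log(classifyAccession("SRP024239")); // a study accession issued by SRA
console.log(classifyAccession("ERR000001")); // a run accession issued by ENA (made-up number)
console.log(classifyAccession("hello"));     // null
```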

SRA Runs == BaseSpace Samples

An initial challenge was reconciling SRA’s data model with the BaseSpace data model. There is variability in how data is captured in SRA, but by and large, for most submissions, an SRA Run matches what BaseSpace would call a BaseSpace Sample. This means that if one imports an SRA Study with multiple associated Runs, the Import App will create a different BaseSpace Sample for each Run.

SraBaseSpace_DataModel

There are some cases where this is incorrect. Particularly for older runs (back in the GA and GAII days), sometimes multiple runs were required to generate enough reads to analyze as one sample. In these cases, you can use the Combine tool in the BaseSpace UI to combine multiple samples into one logical sample.

combineGif_v2_optimized

Illumina data only

The app is currently limited to importing only Illumina data. We want to ensure that data imported with this app is compatible with our Core BaseSpace apps, and that currently means that we don’t support data from other platforms.

Under the Hood

Designing one tool to handle the variability within three huge public repositories is not trivial! Data formats and standards have changed over the years, even just from Illumina, and so the app will automatically check for older Illumina formats, and make modifications to convert data in an older format into a modern format. For example, data with quality scores encoded in the Phred+64 format will be automatically converted to Phred+33, and FASTQ headers, whose format from Illumina has changed over time, are rewritten for compatibility with BaseSpace Core Apps.
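As an illustration of the quality-score conversion, here is a minimal sketch of Phred+64 to Phred+33 re-encoding. It is not the app's actual code, and the function name is made up. It also assumes true Phred+64 scores (zero or higher); older Solexa-scaled files can contain lower characters and would need separate handling.

```javascript
// Convert a FASTQ quality string from Phred+64 to Phred+33 encoding.
function phred64ToPhred33(quality) {
  let out = "";
  for (const ch of quality) {
    const score = ch.charCodeAt(0) - 64; // decode the Phred+64 character
    if (score < 0) {
      throw new RangeError("Not a valid Phred+64 character: " + ch);
    }
    out += String.fromCharCode(score + 33); // re-encode as Phred+33
  }
  return out;
}

// 'h' is Q40 in Phred+64; 'I' is Q40 in Phred+33.
console.log(phred64ToPhred33("hhhh")); // prints: IIII
```

The conversion amounts to shifting every character down by 31 code points, which is why the two encodings can be told apart by the range of characters a file uses.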

We hope you find this app useful for enabling you to do more with your data on BaseSpace! There will likely be some accession numbers that the app does not handle correctly; for those, we invite users to send feedback so that the app can be improved.


For an example import and analysis of human gut microbiome data, click here: SRP024239 – Human Gut Microbiome Dataset

Annotate your RNA-Seq data with NextBio Research

We’ve developed a new BaseSpace app named NextBio Annotates RNA-Seq that lets you ‘test-drive’ NextBio Research on your RNA-Seq data. With a few clicks, you can find how the most differentially expressed genes in your experiment are correlated with diseases, tissue types, public studies, and more.

NextBio Annotates RNA-Seq

When you run the Cufflinks Assembly and DE app on your RNA-Seq data, you receive a list of differentially expressed genes. What do these genes do? In what cell lines are they expressed? What drugs are they associated with? These are the questions that NextBio Research helps to answer.

When you run the NextBio Annotates RNA-Seq app, the most differentially expressed genes from your experiment are annotated via the NextBio Research API. The output report shows you how those genes are correlated with many public experiments. This is accomplished by the extensive curation of public data that has been performed by the NextBio curation team.

Example output from the NextBio Annotates RNA-Seq app

 

Of course this app gives just a taste of what NextBio Research can do. In the full version of the product, you can import data of many different types — not just RNA-Seq data. You can use the NextBio Correlation Engine to find how ranked lists of genes from your experiments compare to those in thousands of public experiments. If you’re interested, you can try the full version for free.

We hope this app shows you how NextBio Research can enrich your experiments, by easily annotating and correlating your data with curated public experiments.

Shotgun metagenomics can now be analyzed in the BaseSpace platform

We are happy to announce the release of the Kraken Metagenomics App as a part of BaseSpace Apps.

image

With this BaseSpace Labs App, researchers will be able to detect and classify viruses and bacteria in their next-generation sequencing (NGS) samples. Kraken was developed by Derrick Wood in Steven Salzberg’s Lab at Johns Hopkins University. Unlike alignment-based classification methods, Kraken uses exact k-mer matching and a novel classification algorithm to perform taxonomic classification of NGS reads. The methods and performance have been described in detail in Genome Biology. The Kraken Metagenomics App limits the taxonomic classification to bacteria and viruses available in the MiniKraken 20140330 database.

In addition to using Kraken for classification, the app also provides a host removal feature which uses the SNAP aligner to remove human reads prior to classification. SNAP is an aligner that was developed by a team from the UC Berkeley AMP Lab, Microsoft, and UCSF. The output of the SNAP host removal step is an anonymized BAM file containing the host filtered reads. If host removal is selected then only the host filtered reads will be used by Kraken for taxonomic classification.

To demonstrate the performance of the App, we tested it on data described in a recent publication by Wilson et al., 2014. The authors were able to identify the presence of Leptospira in a CSF sample using Illumina’s MiSeq desktop sequencer. This was in contrast to other methods, such as qPCR, which failed to identify the Leptospira. The data was downloaded from the SRA (Accession Number SRR1145846) and analyzed using the App. The SRA data contained the 52,621 paired-end reads that remained after host filtering the original 3,063,784 paired-end reads. Reads were trimmed to remove sequencing adapters prior to analysis. The analysis completed in 30 minutes and generated results that are consistent with Wilson et al. This analysis demonstrates that the App is able to produce publication-quality results with equivalent sensitivity.

The tables below show some of the basic metrics obtained from the App. Because the host removal option was selected for this analysis, 1,232 of the 52,621 reads were additionally identified as host. Of the remaining 51,389 reads, only 792 were classified as virus or bacteria. The remaining 50,597 reads, which were not assigned a bacterial or viral taxonomy, may have come from contamination by other organisms not found in the MiniKraken database or from host reads not found in the human reference.
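The read accounting in this paragraph can be re-derived in a few lines. The variable names are ours; the three input counts are the ones quoted above.

```javascript
const readsInSraData = 52621; // paired-end reads in the downloaded SRA data
const flaggedAsHost = 1232;   // additionally removed by the App's host filter
const classified = 792;       // assigned a bacterial or viral taxonomy

const remaining = readsInSraData - flaggedAsHost; // 51389
const unclassified = remaining - classified;      // 50597
const classifiedPct = (100 * classified / remaining).toFixed(2);

console.log(remaining, unclassified, classifiedPct + "%");
```

Only about 1.5% of the reads left after host removal receive a classification, so the Leptospira signal in this sample is carried by a few hundred reads.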

image

The Krona plot obtained from the Kraken Metagenomics App result is shown below. Krona plots allow hierarchical data to be visualized with zoomable pie charts. In the case of metagenomics data, Krona plots display the taxonomic hierarchy of a sample. The taxonomic levels are represented in the radial direction and the organisms within each taxonomic level in the angular direction. More than 20% of the 792 reads that were assigned a taxonomy were identified as Leptospira, which is consistent with the results of Wilson et al.

image

 

With this App, researchers now have access to a high performing, sensitive, and interactive tool for analyzing their metagenomics data in BaseSpace. Researchers can use this App to perform hypothesis-free studies of the structure of bacterial and viral communities present in environmental, industrial, and biological samples. We are excited to provide this App to the BaseSpace community and look forward to feedback and suggestions for improving later versions.

 

Upcoming BaseSpace Developer Conference in San Francisco!

We want to invite all of you to the BaseSpace Developer Conference in San Francisco! We’ve held many BaseSpace Developer Conferences throughout the world this year, including in Heidelberg, Singapore, Bangalore, and, most recently, at the University of Tokyo in Japan!

First of all, we would like to thank all of our developers and speakers: you all made this possible. We hope it was a great learning experience, and we look forward to the apps we can bring to BaseSpace. Also, a big shout-out to the University of Tokyo for hosting the event and to our Illumina team in Japan.

developer pic

The events showcase the new Native App Engine within BaseSpace with which developers can easily adapt their command-line pipelines into the BaseSpace cloud infrastructure or an infrastructure of their choice.

During the event, developers are taken through a step-by-step walkthrough in which they build two separate BaseSpace applications! For anyone who is interested in learning more about BaseSpace App development, there is a lot of documentation available on the BaseSpace Developer Portal for both Native and Web applications.

We also spend time interacting with developers and users directly to brainstorm ideas and answer any questions they may have.

helping individual dev

We are hosting another BaseSpace Developer Conference in San Francisco on December 8th. If you are interested in attending, you can sign up here.

To get an idea of what’s in store for you when you attend one of our developer conferences, check us out on Twitter at #basedev2014.

For any further questions about BaseSpace App development, please view or post on the developer forum or contact us through BaseSpace support.

Proteomics? There are Apps for that!

Hello. Aaron from AB SCIEX here, and we are adding Proteomics to BaseSpace. I bet you weren’t expecting that! But about a year ago we started working with the very nice people at Illumina and together we began to map out a grand vision of how we could better enable systems biology/translational medicine/functional genomics (did I miss one?). Both teams recognized that to really help our customers make revolutionary discoveries in biological research we needed to expand beyond our individual core competencies. For too long the omics technologies had been compartmentalized, and that really isn’t how it works in living cells.

image

In parallel, mass spectrometry-based proteomics was growing up. With the new AB SCIEX next-gen proteomics technology (a.k.a. SWATH Proteomics), we can quantify thousands of proteins in many, many samples reproducibly for the first time, so integration with genomics and transcriptomics would be more meaningful.

I am proud to announce the launch of OneOmics – an exclusive partnership to bring together AB SCIEX next-gen proteomics (NGP) and Illumina next-generation sequencing (NGS) tools in the BaseSpace cloud computing environment. There are four BaseSpace Apps in the AB SCIEX next-gen proteomics toolkit:

  • Protein Expression Extractor – for processing raw mass spectrometry data
  • Protein Expression Assembler – for protein fold-change analysis
  • Protein Expression Browser – to visualize results in biological context
  • Protein Expression Analytics – for data quality review

The Protein Expression Extractor and Assembler have some really fancy algorithms generating the results, and the high-powered distributed computing in BaseSpace delivers results up to 50x faster than on the usual high-end desktop computer (from 3 days down to a couple of hours!). Also the ‘don’t try this at home’ paradigm normally associated with mass spec proteomics is a thing of the past, and you don’t have to be a bioinformatics expert to process the data. It’s virtually parameter free. I told you the algorithms are fancy.

But the Protein Expression Browser is the really cool App. There’s not a mass spectrum in sight, and you don’t have to worry about any of the usual impenetrable jargon associated with mass spec proteomics. Just great visuals showing your results in biological context.

Having proteomics and genomics data in the same place/cloud is a huge step forward in itself, but the AB SCIEX and Illumina teams wanted to take this a step further and integrate NGP and NGS. One of the main benefits of using BaseSpace is that there is already a community of bioinformatics developers publishing new apps, and we are really excited about the applications that our collaborators at the Institute for Systems Biology (ISB), Yale, NextBio and Advaita have been developing. Rob Moritz and his team at ISB have developed the SWATHAtlas Ion Library Generator App that can generate standard and modified SWATH Proteomics Libraries as part of their SWATHAtlas project. This means that researchers have easy access to human, yeast and MtB libraries currently, but more are on the way.

 

There are obviously many ways to combine genomics information with proteomics information. You could simply do your gene expression with Illumina’s TopHat, Cufflinks or RNA Express Apps, then your protein expression with AB SCIEX’s next-gen proteomics toolkit, and then integrate the results (parallel analysis). But Chris Colangelo and Rob Kitchen at Yale University are developing the RNA-Seq Translator App, available soon, which takes the output from Cufflinks and converts it to protein FASTA files. These files can then be converted into an NGP library and used as the basis for proteomics analysis (serial analysis). The transcriptomics and proteomics results can be loaded into Advaita’s iPathwayGuide, or mapped onto the genome using NextBio, and differential transcript or gene expression can be compared with differential protein expression.

OneOmics

I hope you are as excited about the possibilities as we are here at AB SCIEX and Illumina. This is only the beginning…

Related Content:
6 Ways SWATH Cloud Toolkit will help you
Illumina Press release
Genome Web Proteomonitor article
FierceBiotechIT article
GeneticEngineering & Biotechnology News article

 

Taking Out the Trash

If you’ve been using BaseSpace for a while, you may have noticed that there wasn’t a way to permanently remove data from your account. I say that in the past tense because it is no longer true. The wait is over! “Move to Trash” is now available on Runs and Analyses.

MoveToTrash02

 

Trash Overview

This has been one of the most important features for us to get right because it has to do with removing your data, and we take that very seriously. That is why we are introducing a two-step delete process that will help prevent accidental deletions and give you the confidence you need to safely manage your data.

First, you will notice a new action available on run and analysis list and detail pages, called “Move to Trash”. On the list pages, you must first highlight the row that you want before it’s available.

 MoveToTrash

This action is very similar to moving files on your desktop to the trash or recycle bin. Just as on your desktop, the data can be recovered, but while it is in the trash it can no longer be viewed or acted upon.

Trashed Items Side-Effects:

  • If the items were shared, all share recipients will lose access to that data
  • All API access is immediately removed and will return the HTTP status code of 410 (“Gone”)
  • Any attempt to view this data on the website will take the user to a 410 error page stating the content is “Gone”
  • Data, while in the trash, can only be “Restored” or “Emptied” by the owner.
  • Emptying data from the trash removes it permanently and cannot be undone.
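If you have code that calls the API, it is worth treating 410 differently from other errors, since it means the item was moved to the trash rather than that something went wrong. A minimal sketch follows; the helper name and return values are hypothetical, and it is not part of any BaseSpace SDK.

```javascript
// Hypothetical helper: classify an HTTP status code returned by an API call.
function describeApiStatus(status) {
  if (status === 410) {
    // Trashed items answer with 410 ("Gone"), as described in the list above.
    return "gone";
  }
  if (status >= 200 && status < 300) {
    return "ok";
  }
  return "error";
}

console.log(describeApiStatus(200)); // prints: ok
console.log(describeApiStatus(410)); // prints: gone
console.log(describeApiStatus(500)); // prints: error
```

An app that receives "gone" for one of its inputs could, for example, ask the user to restore the item from the trash instead of retrying the request.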

 

Moving Runs to the Trash

  • Runs can be put in the trash from the list or the detail pages.
  • Runs cannot be removed if they are in a non-terminal state.  The most common non-terminal states would be: running, uploading, analyzing.
  • The dialog may also present you with the option to remove all associated analyses that used the run as input.
    • All sequencing runs will have at least one associated analysis unless they failed or were used just for remote monitoring.
  • If you are not the owner of the run, moving this item to the trash will simply remove your access and cannot be undone.
    • To restore access, just contact the owner or click on the previously sent share link if it’s still active.

 

movetotrash-animation

 

Moving Analyses to the Trash

  • Analyses can be put in the trash from the list and detail pages.
  • Analyses cannot be removed if in a non-terminal state. The most common non-terminal states would be: pending execution and running.
  • If a project is being transferred, some of the analyses may not be removed until after the transfer has been completed.
  • Apps that are leveraging data as input may fail if items are moved to the trash. 
  • If you have items in the trash, we prevent project transfers until all items in that project are restored or emptied.

 

Emptying and Restoring Items in the Trash

The trash page can be accessed from most of the project and run list pages. The icon is always on the right side of the grid and is labeled “View Trash”.

TrashIcon

There are only two actions currently on the Trash page: Empty and Restore.

Empty will permanently delete all items, and Restore returns the items to active status.

Restored items will keep all of their original attributes except for the share recipients.

restore

User Agreement Updates

Because of all of these changes, we have also updated our User Agreement to reflect the behavior of these new features. In particular, item 7 states that even though data can be removed, it may have been previously shared with other users or apps and subsequently downloaded or copied. You will be prompted to accept these new terms upon your next login. If you have any questions, don’t hesitate to ask!

Thank you,

-Greg