When a user generates new data, a common workflow is to compare new results with previously published ones. So how would a user of BaseSpace do this? Until now, a user would have two choices:
- Download their BaseSpace data and previously published data to their local machine, and then build the bioinformatics workflow to process both datasets.
- Re-process previously published data to send into BaseSpace using the BaseSpace FASTQ uploader.
Today, we’re excited to announce that bringing data into BaseSpace from NCBI’s Sequence Read Archive (SRA) becomes push-button easy with the SRA Import App, our next BaseSpace Labs release.
The SRA Import App lets users easily import data from any of the big three public data repositories – SRA, the European Nucleotide Archive (ENA), and the DNA Data Bank of Japan (DDBJ). The only information needed is a valid accession number.
From the accession number, the app will capture all associated SRA Run data and import the FASTQ files into BaseSpace so that you may use them with other BaseSpace apps. This means that you may enter accessions for studies (SRP*/ERP*/DRP*), experiments (SRX*/ERX*/DRX*), samples (SRS*/ERS*/DRS*), runs (SRR*/ERR*/DRR*), or submissions (SRA*/ERA*/DRA*), and the app will import any associated FASTQ files. Some samples and studies have many associated FASTQ files. We currently limit the import to 25 GB of data per request, so it may help to check the size of a study first and plan the import around that limit.
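To make the accession patterns above concrete, here is a minimal Python sketch that reads the prefix of an accession and reports which archive and record type it points to. It is an illustration only, not the app's actual code: the function name `classify_accession` and the lookup tables are hypothetical, and the sketch assumes every accession is a three-letter prefix followed by digits.

```python
import re

# Hypothetical lookup tables built from the prefix patterns listed above.
ARCHIVES = {"S": "SRA", "E": "ENA", "D": "DDBJ"}
RECORD_TYPES = {
    "P": "study",
    "X": "experiment",
    "S": "sample",
    "R": "run",
    "A": "submission",
}

# Assumption: an accession is [S|E|D] + "R" + type letter + digits.
ACCESSION_PATTERN = re.compile(r"([SED])R([PXSRA])(\d+)")


def classify_accession(accession):
    """Return (archive, record_type) for an accession such as 'SRP024239'."""
    match = ACCESSION_PATTERN.fullmatch(accession.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognized accession: {accession!r}")
    archive_code, type_code, _digits = match.groups()
    return ARCHIVES[archive_code], RECORD_TYPES[type_code]


if __name__ == "__main__":
    for acc in ["SRP024239", "ERX000001", "DRR000001"]:
        print(acc, classify_accession(acc))
```

A check like this only validates the shape of the accession. Whether the record exists, and how many runs it contains, still has to be looked up in the archive itself.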
SRA Runs == BaseSpace Samples
An initial challenge was reconciling SRA’s data model with the BaseSpace data model. There is variability in how data is captured in SRA, but by and large, for most submissions, an SRA Run matches what BaseSpace would call a BaseSpace Sample. This means that if one imports an SRA Study with multiple associated Runs, the Import App will create a different BaseSpace Sample for each Run.
There are some cases where this one-to-one mapping does not hold. Particularly for older runs (back in the GA and GAII days), sometimes multiple runs were required to generate enough reads to analyze as one sample. In these cases, you can use the Combine tool in the BaseSpace UI to combine multiple samples into one logical sample.
Illumina data only
The app is currently limited to importing only Illumina data. We want to ensure that data imported with this app is compatible with our Core BaseSpace apps, and that currently means that we don’t support data from other platforms.
Under the Hood
Designing one tool to handle the variability within three huge public repositories is not trivial! Data formats and standards have changed over the years, even just from Illumina, and so the app will automatically check for older Illumina formats, and make modifications to convert data in an older format into a modern format. For example, data with quality scores encoded in the Phred+64 format will be automatically converted to Phred+33, and FASTQ headers, whose format from Illumina has changed over time, are rewritten for compatibility with BaseSpace Core Apps.
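As a rough illustration of the quality-score conversion mentioned above, the sketch below shifts a Phred+64 quality string to Phred+33. It is not the app's implementation: the function name `phred64_to_phred33` is hypothetical, and the sketch assumes the input is already known to be Phred+64 with non-negative scores (how the app detects older formats is not covered here).

```python
PHRED64_OFFSET = 64
PHRED33_OFFSET = 33


def phred64_to_phred33(quality):
    """Re-encode a FASTQ quality string from Phred+64 to Phred+33."""
    converted = []
    for char in quality:
        # Recover the numeric quality score from the old encoding.
        score = ord(char) - PHRED64_OFFSET
        if score < 0:
            # Assumption: characters below '@' mean this is not Phred+64.
            raise ValueError(f"Character {char!r} is not valid Phred+64")
        # Write the same score using the modern offset.
        converted.append(chr(score + PHRED33_OFFSET))
    return "".join(converted)


if __name__ == "__main__":
    # 'h' encodes a score of 40 in Phred+64; 'I' encodes 40 in Phred+33.
    print(phred64_to_phred33("hhhB@"))
```

The scores themselves are unchanged. Only the ASCII offset used to store them moves, so each character shifts down by 31 positions.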
We hope you find this app useful for doing more with your data on BaseSpace! There will likely be some accession numbers that the app does not handle correctly, and for those, we invite users to send feedback so that the app can be improved.
For an example import and analysis of human gut microbiome data, click here: SRP024239 – Human Gut Microbiome Dataset