Import data from SRA into BaseSpace

When a user generates new data, a common workflow is to compare new results with previously published ones. So how would a user of BaseSpace do this? Until now, a user would have two choices:

  1. Download their BaseSpace data and previously published data to their local machine, and then build the bioinformatics workflow to process both datasets.
  2. Re-process previously published data to send into BaseSpace using the BaseSpace FASTQ uploader.

Today, we’re excited to announce that bringing data into BaseSpace from NCBI’s Sequence Read Archive (SRA) becomes push-button easy with the SRA Import App, our next BaseSpace Labs release.

Inputs

The SRA Import App lets users easily import data from any of the big three public data repositories – SRA, the European Nucleotide Archive (ENA), and the DNA Data Bank of Japan (DDBJ). The only information needed is a valid accession number.

[Figure: SRA Import form]

Accession numbers

From the accession number, the app captures all associated SRA Run data and imports the FASTQ files into BaseSpace so that you can use them with other BaseSpace apps. This means that you can enter accessions for studies (SRP*/ERP*/DRP*), experiments (SRX*/ERX*/DRX*), samples (SRS*/ERS*/DRS*), runs (SRR*/ERR*/DRR*), or submissions (SRA*/ERA*/DRA*), and the app will import any associated FASTQ files. Some samples and studies have many associated FASTQ files. We currently limit each import request to 25 GB of data, so it helps to check the size of a study first and, if necessary, import it in smaller pieces that fit within the limit.
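
To make the accession patterns above concrete, here is a minimal Python sketch that reads the archive and record type off an accession's prefix. It is not the app's own code: the function name `classify_accession` and the lookup tables are made up for illustration, and the pattern only checks the general shape (three-letter prefix followed by digits), not whether the accession actually exists.

```python
import re

# Prefix letters as listed in the post: the first two letters name the
# archive, the third names the record type. (Hypothetical helper tables.)
ARCHIVES = {"SR": "SRA", "ER": "ENA", "DR": "DDBJ"}
RECORD_TYPES = {
    "P": "study",
    "X": "experiment",
    "S": "sample",
    "R": "run",
    "A": "submission",
}

# Assumption: an accession is a three-letter prefix followed only by digits.
ACCESSION_RE = re.compile(r"^(SR|ER|DR)([PXSRA])(\d+)$")


def classify_accession(accession):
    """Return (archive, record_type), or None if the shape is not recognized."""
    match = ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        return None
    archive_code, type_code, _digits = match.groups()
    return ARCHIVES[archive_code], RECORD_TYPES[type_code]
```

For example, `classify_accession("SRP024239")` returns `("SRA", "study")`, which tells you the import will pull in every run attached to that study.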

SRA Runs == BaseSpace Samples

An initial challenge was reconciling SRA’s data model with the BaseSpace data model. There is variability in how data is captured in SRA, but by and large, for most submissions, an SRA Run matches what BaseSpace would call a BaseSpace Sample. This means that if one imports an SRA Study with multiple associated Runs, the Import App will create a different BaseSpace Sample for each Run.

[Figure: SRA data model mapped to the BaseSpace data model]

There are some cases where this mapping does not hold. Particularly for older data (back in the GA and GAII days), multiple runs were sometimes required to generate enough reads to analyze as one sample. In these cases, you can use the Combine tool in the BaseSpace UI to combine multiple samples into one logical sample.

[Animation: combining samples with the Combine tool]

Illumina data only

The app is currently limited to importing only Illumina data. We want to ensure that data imported with this app is compatible with our Core BaseSpace apps, and that currently means that we don’t support data from other platforms.

Under the Hood

Designing one tool to handle the variability within three huge public repositories is not trivial! Data formats and standards have changed over the years, even just from Illumina, and so the app will automatically check for older Illumina formats, and make modifications to convert data in an older format into a modern format. For example, data with quality scores encoded in the Phred+64 format will be automatically converted to Phred+33, and FASTQ headers, whose format from Illumina has changed over time, are rewritten for compatibility with BaseSpace Core Apps.
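
As an illustration of the quality-score conversion mentioned above, here is a minimal Python sketch of re-encoding a Phred+64 quality string as Phred+33. It is not the app's implementation: the function name is hypothetical, and the sketch assumes non-negative scores, so older Solexa-style strings with negative values are rejected rather than converted.

```python
def phred64_to_phred33(quality):
    """Re-encode a Phred+64 quality string as Phred+33.

    Each character stores a score as an ASCII offset; switching the offset
    from 64 to 33 shifts every character down by 31.
    """
    converted = []
    for ch in quality:
        score = ord(ch) - 64
        if score < 0:
            # Assumption: only non-negative Phred scores are handled here.
            raise ValueError(f"{ch!r} is not a valid Phred+64 character")
        converted.append(chr(score + 33))
    return "".join(converted)
```

A score of 40 is written as `h` in Phred+64 and as `I` in Phred+33, so `phred64_to_phred33("hhh")` returns `"III"`.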

We hope you find this app useful for doing more with your data on BaseSpace! There will likely be some accession numbers that the app does not handle correctly; for those, we invite users to send feedback so that the app can be improved.


For an example import and analysis of human gut microbiome data, see: SRP024239 – Human Gut Microbiome Dataset


 

FASTQ upload is now available in BaseSpace

We are excited to announce the availability of a data upload feature for FASTQ files that were previously generated on Illumina sequencing instruments. This simple-to-use feature is accessible from any project to which the user has write access by first clicking on the project and then selecting the Import tab shown below.

[Figure: Import tab on the project page]

The user will then be prompted to select their import type. The user can upload a single sample by clicking on “Sample” as shown below.

[Figure: selecting “Sample” as the import type]

The user can then either “Drag and drop” one or more files onto the web page or click “select files” and choose the files to upload from a file browser. Note that the FASTQ files need to adhere to Illumina standards, as specified below. Data for a single sample can span multiple files. Each sample is limited to 16 files with a combined size of 25 GB. Uploading a 25 GB sample takes 1-2 hours over a relatively fast internet connection.

[Figure: drag-and-drop upload area]

The user will then see a progress bar as the files are uploaded. Once the progress bar completes, the user can add additional files. The user can also set the sample name and associate a genome with the sample in the upper-left corner of the screen.

[Figure: upload screen with the “Complete Import” button]

Once all of the files have finished uploading, the user must click the “Complete Import” button (shown above) to complete the session.

FASTQ file standards

  • The uploader will only support gzipped FASTQ files generated on Illumina instruments
  • The names of the FASTQ files must conform to the following convention:
    • SampleName_SampleNumber_Lane_Read_FlowCellIndex.fastq.gz (e.g. SampleName_S1_L001_R1_001.fastq.gz / SampleName_S1_L001_R2_001.fastq.gz)
  • The read descriptor in the FASTQ files must conform to the following convention:
    • @Instrument:RunID:FlowCellID:Lane:Tile:X:Y ReadNum:FilterFlag:0:SampleNumber:
      • Read 1 descriptor would look like this:
        @M00900:62:000000000-A2CYG:1:1101:18016:2491 1:N:0:13
      • Read 2 would have a 2 in the ReadNum field, like this:
        @M00900:62:000000000-A2CYG:1:1101:18016:2491 2:N:0:13
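
If you want to check your files before uploading, the two conventions above can be sketched as regular expressions. This is a rough Python sketch, not the uploader's actual validation: the helper names are hypothetical, and the field widths (three-digit lane, three-digit trailing index) are assumptions inferred from the examples in this post.

```python
import re

# Assumption: field widths are inferred from the example file names above.
FILE_NAME_RE = re.compile(
    r"^(?P<sample_name>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d{3})"
    r"_R(?P<read>[12])_(?P<index>\d{3})\.fastq\.gz$"
)

# @Instrument:RunID:FlowCellID:Lane:Tile:X:Y ReadNum:FilterFlag:0:SampleNumber
DESCRIPTOR_RE = re.compile(
    r"^@(?P<instrument>[^:\s]+):(?P<run_id>[^:\s]+):(?P<flow_cell_id>[^:\s]+)"
    r":(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r" (?P<read_num>[12]):(?P<filter_flag>[YN]):0:(?P<sample_number>\d+)$"
)


def parse_file_name(name):
    """Return the file name fields as a dict, or None if the name does not fit."""
    match = FILE_NAME_RE.match(name)
    return match.groupdict() if match else None


def parse_descriptor(line):
    """Return the read descriptor fields as a dict, or None if the line does not fit."""
    match = DESCRIPTOR_RE.match(line.strip())
    return match.groupdict() if match else None
```

Calling `parse_file_name("SampleName_S1_L001_R2_001.fastq.gz")` gives a dict whose `read` entry is `"2"`, while a name such as `reads.fq` returns `None`.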

Quality considerations

  • The number of base calls for each read must equal the number of quality scores
  • The number of entries for Read 1 must equal the number of entries for Read 2
  • The uploader treats files as paired-end when their file names match except for the read number
  • For paired-end reads, the descriptor must match for every entry for both reads 1 and 2
  • Every read must have passed filter
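
The checks in this list can be approximated locally before an upload. Below is a minimal Python sketch under a few assumptions: the function names are hypothetical, each FASTQ record is exactly four lines, the descriptors follow the convention shown earlier, and a FilterFlag of `N` means the read was not filtered out (that is, it passed filter). The uploader's real checks may differ.

```python
from itertools import zip_longest


def read_records(lines):
    """Yield (descriptor, bases, qualities) from FASTQ lines, four lines per record.

    An incomplete trailing record is ignored in this sketch.
    """
    stripped = [line.rstrip("\n") for line in lines]
    for i in range(0, len(stripped) - 3, 4):
        yield stripped[i], stripped[i + 1], stripped[i + 3]


def check_pair(read1_lines, read2_lines):
    """Return a list of problems found in a Read 1 / Read 2 pair (empty if none)."""
    problems = []
    pairs = zip_longest(read_records(read1_lines), read_records(read2_lines))
    for number, (rec1, rec2) in enumerate(pairs, start=1):
        if rec1 is None or rec2 is None:
            problems.append("Read 1 and Read 2 have different numbers of entries")
            break
        for label, (descriptor, bases, qualities) in (("R1", rec1), ("R2", rec2)):
            if len(bases) != len(qualities):
                problems.append(
                    f"entry {number} {label}: {len(bases)} base calls "
                    f"but {len(qualities)} quality scores"
                )
            # The part after the space is ReadNum:FilterFlag:0:SampleNumber.
            fields = descriptor.split(" ")[-1].split(":")
            if len(fields) < 2 or fields[1] != "N":
                problems.append(f"entry {number} {label}: read did not pass filter")
        # Everything before the space must be identical for both reads.
        if rec1[0].split(" ")[0] != rec2[0].split(" ")[0]:
            problems.append(f"entry {number}: descriptors differ between reads")
    return problems
```

For gzipped files, the lines could come from `gzip.open(path, "rt")`. Note that this sketch loads each file fully into memory, so it is only suitable for spot-checking small files.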

Upload parameters

  • Only one sample can be uploaded at a time
  • A maximum of 16 files can be uploaded in a session
  • The size of the uploaded files cannot exceed 25 GB
  • A detailed description of how to use the uploader can be found in the BaseSpace user guide
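
As a quick pre-flight check against the session limits above, something like the following Python sketch could be run on a set of files. The function name and the interpretation of "25 GB" as 25 × 10^9 bytes are assumptions made for illustration; the uploader may count sizes differently.

```python
MAX_FILES = 16
# Assumption: "25 GB" is read as decimal gigabytes.
MAX_TOTAL_BYTES = 25 * 10**9


def check_upload_limits(file_sizes):
    """Check a {file name: size in bytes} mapping against the session limits.

    Returns a list of problems; an empty list means the session fits.
    """
    problems = []
    if len(file_sizes) > MAX_FILES:
        problems.append(f"{len(file_sizes)} files exceeds the limit of {MAX_FILES}")
    total = sum(file_sizes.values())
    if total > MAX_TOTAL_BYTES:
        problems.append(f"{total} bytes exceeds the limit of {MAX_TOTAL_BYTES}")
    return problems
```

The sizes could be collected with `os.path.getsize` for each file you plan to upload for the sample.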