Introducing BSFS, the BaseSpace File System

Today, alongside the current release of BaseSpace, we are announcing a product that has kept me and the other developers on the BaseSpace team really excited and really busy over the past months: the BaseSpace File System (abbreviated as BSFS or BaseSpace FS). Many of our developers on the BaseSpace platform have been asking for this feature. It lets you mount your Samples' and AppResults' data residing in BaseSpace directly into your Docker containers and access it on a strictly as-needed basis.

A range of improvements

This addition to the BaseSpace platform brings a number of benefits, which I will go over now:

  • No pre-download of your Samples and AppResults

When running your apps on the BaseSpace Native App Engine with BSFS turned on, you will notice your applications executing right away upon launch. The usual pre-download step, which could take a good few hours on very large NextSeq or HiSeq samples, is now eliminated.

  • Fewer network data transfers

When an app executes on an input sample or app result, there is no guarantee that it will use the entire input dataset. Until today, however, the entire input dataset had to be downloaded before any processing could happen. This is no longer the case: BSFS presents a virtual view of your data in the file system and downloads only the data that is actually read from the files.
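To make the access pattern concrete, here is a minimal sketch of an app reading only a small slice of a large file. The helper name readRange is my own for illustration and is not part of any BaseSpace API; it uses only Node's standard fs module. How much data BSFS actually fetches around the requested range is not covered in this post.

```javascript
const fs = require("fs");

// Read at most `length` bytes starting at byte `offset`.
// The rest of the file is never read by this function.
function readRange(filePath, offset, length) {
    const fd = fs.openSync(filePath, "r");
    try {
        const buffer = Buffer.alloc(length);
        const bytesRead = fs.readSync(fd, buffer, 0, length, offset);
        // Trim the buffer when the file ends before `length` bytes.
        return buffer.subarray(0, bytesRead);
    } finally {
        fs.closeSync(fd);
    }
}
```

An app that only needs to inspect the beginning of an input file could call readRange(somePath, 0, 4096), where somePath is a file under /data/input, and leave the remainder of the file unread.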

  • Overlap computation and network transfers

A typical data processing workflow is for an app to read data and then process it, in an iterative fashion. To make this process more efficient, BSFS features a data pre-fetch mechanism: while the app is processing data at a certain location in a file, the data immediately following this location is downloaded automatically. This mitigates the slowdowns that network latency would otherwise cause in downloads.
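To get a feel for why overlapping helps, here is a toy timing model. It is a sketch under simplifying assumptions (equal-sized chunks, constant fetch and processing times, one chunk of read-ahead) and says nothing about how BSFS is actually implemented; the function names are made up for this illustration.

```javascript
// Toy model: a file is split into `chunks` pieces; each piece takes
// `download` seconds to fetch and `compute` seconds to process.

// No overlap: every chunk is fetched, then processed, strictly in turn.
function sequentialTime(chunks, download, compute) {
    return chunks * (download + compute);
}

// One-chunk read-ahead: while chunk i is being processed, chunk i+1 is fetched.
// Only the first fetch and the last processing step are not overlapped.
function prefetchTime(chunks, download, compute) {
    if (chunks === 0) {
        return 0;
    }
    return download + (chunks - 1) * Math.max(download, compute) + compute;
}
```

With 10 chunks that each take 2 seconds to fetch and 3 seconds to process, sequentialTime(10, 2, 3) gives 50 seconds while prefetchTime(10, 2, 3) gives 32 seconds: after the first chunk, the fetches are hidden behind the processing.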

  • An improved workflow for developers

One of the major areas of focus of the BaseSpace platform team has been to provide developers with an awesome experience, and adding BSFS to the platform will make things even more awesome!
Soon after this release we will be providing a public Amazon Machine Image (AMI), the same one we are using in production today. This image contains everything required to get started coding in BaseSpace with BSFS. This is a huge improvement to the developer workflow: it provides a ready-to-use environment in which you can simply drop apps in a Docker container and see them interact with your BaseSpace data within minutes of getting started!
Finally, with the download step eliminated, there is nothing left to get in the way of a highly iterative development process in which developers work directly with their BaseSpace data.

  • New and existing apps

All new apps created in the BaseSpace developer portal will now have BSFS turned on by default. We have also made sure that existing apps can benefit fully from this addition: if you have been following the developer guidelines and conventions (i.e., the /data/input drive should not be written to), enabling BSFS in your existing app should be as easy as flicking a switch.
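If you are not sure whether an existing app respects the convention above, a small guard can help you find out during development. The sketch below is purely illustrative: isInsideInput and assertWritable are hypothetical helpers of my own, not part of BaseSpace, and they assume absolute POSIX paths.

```javascript
const path = require("path");

// Input mount named in the convention above; apps should not write to it.
const INPUT_ROOT = "/data/input";

// Returns true when an absolute POSIX path points at or below INPUT_ROOT.
function isInsideInput(targetPath) {
    // normalize() collapses "." and ".." segments before the prefix check.
    const normalized = path.posix.normalize(targetPath);
    return normalized === INPUT_ROOT || normalized.startsWith(INPUT_ROOT + "/");
}

// Hypothetical guard: call this before opening a file for writing.
function assertWritable(targetPath) {
    if (isInsideInput(targetPath)) {
        throw new Error("Refusing to write under " + INPUT_ROOT + ": " + targetPath);
    }
    return targetPath;
}
```

Wrapping each write destination in assertWritable(...) makes an accidental write to the input drive fail loudly.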

When you create a new application in the developer portal, you will notice a slightly modified launch spec callback, with a new Options array that is used to turn on BSFS:

function launchSpec(dataProvider)
{
    var ret = {
        // Command to run inside the container
        commandLine: [ "cat", "/illumina.txt" ],
        // Docker image to launch
        containerImageId: "basespace/demo",
        // New: this option turns on BSFS
        Options: [ "bsfs.enabled=true" ]
    };
    return ret;
}

To enable BSFS in an existing app, update its launch spec callback so that it includes this new Options array.
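For an existing callback, one way to do this without touching the rest of the spec is a small helper like the one below. This is only a sketch: withBsfs is a hypothetical name of my own, and the only setting taken from the platform is the "bsfs.enabled=true" entry in Options shown above. The commandLine and containerImageId values are the demo ones from this post.

```javascript
// Hypothetical helper: return a copy of a launch spec with BSFS switched on.
function withBsfs(spec) {
    var copy = {};
    for (var key in spec) {
        if (Object.prototype.hasOwnProperty.call(spec, key)) {
            copy[key] = spec[key];
        }
    }
    // Keep any options already present, dropping an earlier bsfs.enabled entry.
    var options = (spec.Options || []).filter(function (opt) {
        return opt.indexOf("bsfs.enabled=") !== 0;
    });
    options.push("bsfs.enabled=true");
    copy.Options = options;
    return copy;
}

// An existing callback that did not set Options before.
function launchSpec(dataProvider)
{
    var ret = {
        commandLine: [ "cat", "/illumina.txt" ],
        containerImageId: "basespace/demo"
    };
    return withBsfs(ret);
}
```

Adding the Options line to your spec by hand works just as well; the helper simply keeps the change in one place.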

Also, as of today the 16S Metagenomics v1.0 app is running with BaseSpace FS switched on to reap all the performance benefits. In the coming weeks, we will turn on BSFS for the rest of the BaseSpace core apps.

  • Real world performance improvements

The speed-ups we are seeing on these apps only scratch the surface of what is possible. On large samples processed on a single node, the performance benefits are less pronounced, since the download time is dwarfed by the compute time. However, multi-node applications that access only part of an input sample or app result will benefit greatly, as the download portion is always a major contributor to their overall execution time.

With that, I hope you share my excitement about this announcement, and that BSFS will make your development process in BaseSpace even more awesome.

Links to more resources

BaseSpace Developer portal
BaseSpace FileSystem Developer Documentation
Using BSFS

About Gery Vessere

I am a developer on the team behind BaseSpace, Illumina's genomic data analysis platform.
