You CAN Have it Both Ways

Perhaps you plugged in your MiSeq for the first time only a couple of weeks ago.  You’ve completed the requisite validation runs and constructed some handsome libraries for your next set of experiments.  Congratulations, you’re off to the races!  For some, this exciting period is all about testing new experimental designs. For others, it’s about validating previous results.  But I would wager that not enough users begin to think about implementing a long-term system for data management and backup.

You’ve no doubt heard that BaseSpace is a simple and virtually free way to archive and share data.  But there are often good reasons to keep a local copy of your results on hand in addition to using BaseSpace – to run the data through your existing HiSeq pipelines, to visualize it in your favorite browser, etc.  We like to call this “dual mode,” and it means that copies of your MiSeq Reporter reports will exist both on your MiSeq and in BaseSpace when each run completes.  If you are at all nervous about streaming data and adding another layer of complexity to your MiSeq repertoire, rest assured: what you’re really adding is a free layer of redundancy for your precious data.
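If you keep both copies, it can be reassuring to confirm that they really are identical. Here is a minimal sketch of one way to do that, assuming you have downloaded a file from BaseSpace next to its local counterpart. The function names and the file paths in the usage note are hypothetical and are not part of any MiSeq or BaseSpace tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large run files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(local_copy: Path, downloaded_copy: Path) -> bool:
    """True when both files have the same SHA-256 digest."""
    return sha256_of(local_copy) == sha256_of(downloaded_copy)
```

You would call it with two paths of your own, for example copies_match(Path("local/report.xml"), Path("downloads/report.xml")), where both paths are placeholders.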

Getting the data into BaseSpace is a nearly invisible process and, regardless of network conditions, won’t affect the progress of your local run and analysis.  Here’s how it’s done.  On MiSeq’s “Setup Options” screen, there is a checkbox that allows you to “replicate analysis locally on MiSeq.”  And for many new users this will be the best way to begin using BaseSpace.

It’s important to remember that the diversity of analysis applications made available through BaseSpace will grow dramatically over the coming year, so keeping your data in a place that will allow for easy reanalysis is key.  There is a similar opportunity for MiSeq runs that utilize a reference genome not currently found in BaseSpace.  My next post will cover this topic in more detail.

So you can have it both ways – local and cloud-based storage, local and cloud-based analysis. What’s stopping you?

-Jordan

BaseSpace Security

Obviously security is a key concern in making the decision to move to cloud-based genomic storage and analysis. It’s difficult (if not impossible) to quantify security in an absolute sense. But since most researchers currently use their institutional IT infrastructure for storage and analysis, it’s possible to assess current institutional IT security relative to that provided by BaseSpace.

BaseSpace has been built by Illumina on Amazon’s AWS cloud infrastructure. AWS hosts cloud-based services such as Netflix, Quora, Reddit and Foursquare as well as providing customer-facing services for government departments including Treasury, DOE, and State. Amazon’s security webpage can be found here. I’ve found the most useful security overview to be this white paper: “Amazon Web Services: Overview of Security Processes”. Another useful resource is the AWS blog. Here are some key points to note about AWS:

  • Standards and accreditation: SOC 1/SSAE 16/ISAE 3402 (auditing), FISMA moderate (US Federal Government), PCI DSS Level 1 (electronic payments), ISO 27001 (international security standard), and FIPS 140-2 (encryption). (For reference, the NIH’s own data centers are rated FISMA moderate.)
  • Data centers are protected by security staff and controlled access procedures. Staff with system access undergo background checks.
  • All hardware is located behind firewalls which are configured by default to block all traffic.
  • Operating system security patches are automatically applied.

The BaseSpace team is of course writing large amounts of code that runs on the AWS infrastructure. To verify end-to-end security we have had a third-party computer security firm assess our architecture for security risks, and they are also tasked with running “penetration tests” to identify potential vulnerabilities. In addition we encrypt all uploaded data using the AES-256 standard, so that even if all other security precautions were circumvented, the stolen data could not be read.

We believe the combination of Amazon’s comprehensive and well-tested approach to platform security, overlaid with our own security precautions, ensures that BaseSpace meets or exceeds the security provided by many institutional IT infrastructures.

Let’s also examine a few of the more general questions that are sometimes raised about the security implications of the cloud:

Isn’t a big public cloud provider a huge target, and so inevitably vulnerable to attack?

  • It’s safe to assume that the size of the prize means that AWS is under constant attack. One advantage of this is that security researchers are always (a) working to identify vulnerabilities as any discovery will be high profile and (b) informing the operator of the problem so as to be seen as one of the good guys. A recent example of this was a security issue identified in October by researchers at Germany’s Ruhr University. The vulnerability, which has not been tied to any actual attacks, was immediately addressed by AWS. And it got a lot of press for Ruhr!
  • Obviously a criminal attacker that finds a vulnerability isn’t going to tell AWS about it. But in the words of the famous cartoon “I don’t have to outrun that bear, I only have to outrun you”: if someone breaks into Amazon their target will almost certainly be easily monetized data such as credit card numbers, not genomic data.

If my data is in the cloud, then it’s “on” the internet, and that must be risky, right?

In reality virtually all the world’s computers are connected to the internet. A computer in isolation is rare, and not terribly useful. So it’s highly likely that any existing computer that you use for storing genomic data is already connected to the internet, or at the very least on an intranet that is in turn connected to the internet. Secure isolation from the internet is typically provided by a firewall device configured to protect the internal network from outside attack. AWS computers are protected in the same way by firewalls – and AWS actively monitors its firewalls to check for vulnerabilities (a service beyond the resources of most institutions). And we also encrypt your data, something else that’s rarely done in the institutional IT setting.

My data has to travel to and from the cloud over the internet – isn’t that a big risk?

E-commerce has been with us since web retailers such as Amazon began to emerge. SSL (Secure Sockets Layer) is an internet standard that has been developed to encrypt sensitive communications as they pass over the internet. SSL is regularly updated to allow for new technologies and new threats. Every day millions of people and institutions rely on SSL to protect financial transactions. We use SSL to protect BaseSpace data uploads and downloads.  Think of it this way: most of us now access bank accounts over the internet, so the fact that something is accessible over the internet doesn’t mean it’s inherently insecure – it’s all about the quality of the security being implemented.
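For readers who script their own transfers, here is a small sketch of what “the quality of the security being implemented” can look like on the client side. It uses Python’s standard ssl module, whose default context already verifies the server certificate and hostname; current versions of the protocol go by the name TLS, the successor to SSL. The function name make_upload_context and the choice of TLS 1.2 as a floor are assumptions for illustration, not a description of how BaseSpace or the MiSeq software is configured.

```python
import ssl


def make_upload_context() -> ssl.SSLContext:
    """Build a client-side context that verifies the server it talks to."""
    # create_default_context() enables certificate verification and
    # hostname checking, using the system's trusted certificate store.
    ctx = ssl.create_default_context()
    # Assumption for this sketch: refuse protocol versions older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

A context like this can be handed to standard-library clients such as http.client.HTTPSConnection through their context argument.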

The entire subject of genomic data storage and analysis in the cloud is undergoing constant change, and we’d really like to get your input in the comments below. Let us know your experiences and concerns – we want to learn!

– Alex.


Welcome to the Dashboard: Let’s start sharing!

Once you log into BaseSpace you will be presented with a Dashboard. From here, you can access all of the BaseSpace features that we provide.

Latest Runs shows you the most recent runs uploaded to BaseSpace. These may be runs uploaded from your MiSeq or runs shared with you by another BaseSpace user. The Latest Runs panel is designed to give you a quick glance at the state of your runs, so you can tell right away whether everything is going well.

From the dashboard you may also access your most used runs, which are prioritized based on how frequently you access them. You may also see Shared Data, which, as the name implies, is data that has been either “shared by” you with another user or “shared with” you.

Sharing is one of the most exciting features we are providing with BaseSpace. In this release, we give you “share by link”. If you are the owner of some data, and would like to share it with other BaseSpace users, all you need to do is navigate to your data, click the share button, and click “share by link”. You may notice that there is a “share by name” button that is currently disabled; I’ll get into more on that later. When you “share by link” you are creating a unique link to that data which you may pass on to your collaborators. As the owner of this data, you have control of creating and removing this link. Any user that clicks on the link you have generated will be able to see your data; however, as they are not owners of the data, they will not be able to create a unique link of their own. They may pass your link on to others, but they cannot generate a new one.

As the owner of the data, you are able to see just who has clicked on your share link and seen your data. If you navigate to your data, click on sharing, and click share info, you will see a list of all the users that have clicked on your share link.

This mechanism is excellent for sharing interesting results with your collaborators while still letting you keep track of who has looked at them.

As an owner, you may decide to “unshare” your data. You can do this with any data that is already shared. Navigate to the shared data, click sharing and then click unshare. When you unshare, the link to your shared data becomes invalid. Anybody that you have shared the data with as well as anybody that clicks on that shared link will no longer have access to this data.
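To make the rules above concrete, here is a toy model of share by link written in Python. It is only a sketch of the behavior described in this post, not BaseSpace’s actual implementation: the class SharedItem, its method names, and the decision to clear the viewer list on unshare are all assumptions made for illustration.

```python
import secrets
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SharedItem:
    """Hypothetical model of one piece of data with a single share link."""
    owner: str
    token: Optional[str] = None              # active link token, if any
    viewers: List[str] = field(default_factory=list)

    def share_by_link(self, user: str) -> str:
        # Only the owner can create the link; viewers can merely pass it on.
        if user != self.owner:
            raise PermissionError("only the owner can create a share link")
        if self.token is None:
            self.token = secrets.token_urlsafe(16)   # hard-to-guess token
        return self.token

    def open_link(self, user: str, token: str) -> bool:
        # The link works only while it is active and the token matches.
        if self.token is None or not secrets.compare_digest(
            token.encode(), self.token.encode()
        ):
            return False
        if user not in self.viewers:
            self.viewers.append(user)    # the owner can later see who clicked
        return True

    def can_view(self, user: str) -> bool:
        # Viewers keep access only for as long as the link stays active.
        return user == self.owner or (
            self.token is not None and user in self.viewers
        )

    def unshare(self, user: str) -> None:
        if user != self.owner:
            raise PermissionError("only the owner can unshare")
        # Assumption: unsharing invalidates the link and forgets past viewers.
        self.token = None
        self.viewers.clear()
```

In this model, passing the link along simply means another user calls open_link with the same token, while share_by_link and unshare stay reserved for the owner.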

Upon exploring our sharing features you may have come across “Share by Name”. This is a more configurable mode of sharing we plan on adding soon. With Share by Name, you will have complete control over who has access to view your data. Similar to sharing on popular collaboration sites like Dropbox and Flickr, you will be able to specify by email who has access to view your data, and you will be able to manage the lists of users/emails that have that access.

So explore sharing and let us know how we can make this feature work for you. We would like to hear your thoughts about sharing, either via this blog or UserVoice. Do you have any feature requests? Which type of sharing is more important to you, Share by Name or Share by Link? How might you use sharing to help you with your work?

We hope that you enjoy using BaseSpace and look forward to hearing from you.

Welcome to the BaseSpace Weblog!

Welcome to the BaseSpace Weblog, a place for BaseSpace updates directly from members of the BaseSpace team. Return here for news about new tools and functionality, useful examples of research methods and data, and general musings on all things bioinformatic!

There’s a lot of focus right now on the need for more powerful informatics for sequencing (“DNA Sequencing Caught in Deluge of Data” as the NY Times headlined this week). At Illumina we’re confident that these problems are going to be solved by smarter algorithms running on faster computers, and of course we’re working hard on that. But even more importantly we believe that sequencing users need to be able to focus 100% on discovery, not managing the complexities of informatics tools. Hence BaseSpace.

The BaseSpace team is committed to simplifying the informatics experience. We’re going to be deploying new features and updates at intervals of weeks, not the traditional bioinformatics cycle of months or years.  We’re constantly talking to users about what works and what needs improvement.  And we want to hear from you as soon as possible if we’ve screwed up… and by all means tell us if we’ve done something you really like!

Our development philosophy has been to start with the critical basic functions, execute them well, and then build on that strong foundation. Already, any MiSeq user can check a box on the instrument screen, and all run data will be securely streamed to BaseSpace. So, why would you want to do this?

  • Effortless offsite backup of all data generated during a run
  • Fast processing using the latest versions of Illumina’s informatics workflows
  • Access to your data from anywhere there’s a browser – including mobile
  • Instant sharing of any size data set with anyone that has an email address
  • Data secured by e-commerce grade encryption: SSL while in flight, AES-256 while at rest
  • Oh yeah, and it’s free!

What you see today is just the start. Right now we’re toiling away on an appstore so you can easily access best-in-class tools from any vendor. We’re working to get all HiSeq 2000s connected. And we’re developing some great tools of our own to put up in the appstore.

Again, don’t hesitate to let us know if there’s anything we can do to improve your BaseSpace experience!

-Alex