When we launched BaseSpace nine months ago, we were pretty optimistic that users would find the service appealing – but it really was a “build it and they will come” strategy. So we’re thrilled that the adoption numbers prove that our optimism was justified. Two weeks ago we announced that the majority of MiSeq systems are now connected, with half of those already actively uploading data into BaseSpace. Today, I want to dive into those adoption metrics and provide more detail on the impressive growth rate as BaseSpace nears critical mass.
- BaseSpace now has over 2,400 users.
- 70% of all installed MiSeqs have connected to BaseSpace: this tells us that a lot of HiSeqs are going to connect in Q4 – and their runs are 300X bigger.
- Data from 10,340 MiSeq runs has been uploaded: this is huge for us in terms of being able to improve instrument and consumable performance.
- 56,688 samples have been created: we just love it when people are multiplexing!
- 938 runs have been shared: these are positive signs that BaseSpace is being used as a tool for collaborative research and data distribution, just as we hoped.
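As a rough back-of-the-envelope check on the multiplexing and sharing figures above, the ratios can be worked out in a few lines of Python. This is only a sketch: it assumes every created sample belongs to one of the uploaded MiSeq runs, which the figures themselves do not confirm, and the variable names are purely illustrative.

```python
# Figures quoted in the list above.
miseq_runs_uploaded = 10_340
samples_created = 56_688
runs_shared = 938

# Assumption: every sample belongs to one of the uploaded runs.
samples_per_run = samples_created / miseq_runs_uploaded
shared_fraction = runs_shared / miseq_runs_uploaded

print(f"Average samples per run: {samples_per_run:.1f}")
print(f"Share of runs shared: {shared_fraction:.1%}")
```

Under that assumption, this works out to roughly 5.5 samples per run, with about 9% of uploaded runs shared.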
Pleased as we are with this, we’re well aware we’ve barely scratched the surface of BaseSpace’s potential. So over the next weeks and months of 2012, we’re introducing a new full-featured UI, seamless upload for all HiSeq instruments, and BaseSpace Apps/E-commerce.
So right now, BaseSpace is an interesting service for MiSeq informatics.
But by the end of the year it will be the platform for genomic research.
[Figure legend: yellow = uploaded runs (performance data); red = uploaded runs (sequence data)]
We’ve built BaseSpace on Amazon’s AWS cloud computing service simply because we found it to be the cloud platform that provides the best security, scalability, and availability. So it’s pleasing to see that the CDC has come to the same conclusion: AWS is the ideal platform for hosting critical healthcare data for collaborative analysis.
The CDC’s BioSense 2.0 program is built on AWS and implements a nationwide biosurveillance system for early detection and prompt assessment of potential bioterrorism-related illness. It enables state and local healthcare authorities to upload patient Electronic Health Record (EHR) data and analyze that data using statistical tools such as R. BioSense 2.0 is compliant with all the applicable federal regulatory requirements including FISMA and HIPAA (more information can be found here).
This is a reminder that a server doesn’t have to be physically located in your facility in order to be secure. In fact, the experience and resources Amazon applies to computer security far exceed those of most (if not all) research institution IT departments. Add to that, as we’ve done with BaseSpace, independent third-party system analysis and active penetration testing, and you can be confident that your data is extremely safe.
Obviously security is a key concern in making the decision to move to cloud-based genomic storage and analysis. It’s difficult (if not impossible) to quantify security in an absolute sense. But since most researchers currently use their institutional IT infrastructure for storage and analysis, it’s possible to assess current institutional IT security relative to that provided by BaseSpace.
BaseSpace has been built by Illumina on Amazon’s AWS cloud infrastructure. AWS hosts cloud-based services such as Netflix, Quora, Reddit and Foursquare as well as providing customer-facing services for government departments including Treasury, DOE, and State. Amazon’s security webpage can be found here. I’ve found the most useful security overview to be this white paper: “Amazon Web Services: Overview of Security Processes”. Another useful resource is the AWS blog. Here are some key points to note about AWS:
- Standards and accreditation: SOC 1/SSAE 16/ISAE 3402 (auditing), FISMA moderate (US Federal Government), PCI DSS Level 1 (electronic payments), ISO 27001 (international security standard), and FIPS 140-2 (encryption). (For reference, the NIH’s own data centers are rated FISMA moderate.)
- Data centers are protected by security staff and controlled access procedures. Staff with system access undergo background checks.
- All hardware is located behind firewalls which are configured by default to block all traffic.
- Operating system security patches are automatically applied.
Let’s also examine a few of the more general questions that are sometimes raised about the security implications of the cloud:
Isn’t a big public cloud provider a huge target, and so inevitably vulnerable to attack?
- It’s safe to assume that the size of the prize means that AWS is under constant attack. One advantage of this is that security researchers are always (a) working to identify vulnerabilities as any discovery will be high profile and (b) informing the operator of the problem so as to be seen as one of the good guys. A recent example of this was a security issue identified in October by researchers at Germany’s Ruhr University. The vulnerability, which has not been tied to any actual attacks, was immediately addressed by AWS. And it got a lot of press for Ruhr!
- Obviously a criminal attacker that finds a vulnerability isn’t going to tell AWS about it. But in the words of the famous cartoon “I don’t have to outrun that bear, I only have to outrun you”: if someone breaks into Amazon their target will almost certainly be easily monetized data such as credit card numbers, not genomic data.
If my data is in the cloud, then it’s “on” the internet, and that must be risky, right?
In reality virtually all the world’s computers are connected to the internet. A computer in isolation is rare, and not terribly useful. So it’s highly likely that any existing computer that you use for storing genomic data is already connected to the internet, or at the very least on an intranet that is in turn connected to the internet. Secure isolation from the internet is typically provided by a firewall device configured to protect the internal network from outside attack. AWS computers are protected in the same way by firewalls – and AWS actively monitors its firewalls to check for vulnerabilities (a service beyond the resources of most institutions). And we also encrypt your data, something else that’s rarely done in the institutional IT setting.
My data has to travel to and from the cloud over the internet – isn’t that a big risk?
E-commerce has been with us since web retailers such as Amazon began to emerge. SSL (Secure Sockets Layer) is an internet standard that has been developed to encrypt sensitive communications as they pass over the internet. SSL is regularly updated to allow for new technologies and new threats. Every day millions of people and institutions rely on SSL to protect financial transactions. We use SSL to protect BaseSpace data uploads and downloads. Think of it this way: most of us now access bank accounts over the internet: so just because something is accessible over the internet, doesn’t mean it’s inherently insecure – it’s all about the quality of the security being implemented.
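To make the idea of protecting data in flight concrete, here is a minimal sketch of how a client program can require an encrypted, certificate-verified connection using Python’s standard `ssl` module. It is not a description of how BaseSpace itself is implemented: the function name `make_upload_context` is hypothetical, and the sketch uses TLS, the successor to SSL that current libraries provide.

```python
import ssl


def make_upload_context() -> ssl.SSLContext:
    """Build a client-side context that encrypts traffic and verifies the server.

    Hypothetical helper for illustration only; not part of any BaseSpace API.
    """
    # The default context checks the server certificate against the system's
    # trusted certificate authorities and checks that the hostname matches.
    ctx = ssl.create_default_context()
    # Refuse to negotiate older protocol versions.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


context = make_upload_context()
```

A context like this can be handed to a standard client such as `http.client.HTTPSConnection(host, context=context)`, so that an upload or download fails outright if the server’s identity cannot be verified, rather than quietly falling back to an unprotected connection.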
The entire subject of genomic data storage and analysis in the cloud is undergoing constant change, and we’d really like to get your input in the comments below. Let us know your experiences and concerns – we want to learn!
Up until now we’ve been limiting the number of MiSeqs that are able to connect to BaseSpace. Today we’ve removed those limits, and all MiSeqs will show a BaseSpace login option when a run is being set up.
Please try BaseSpace, and let us know how it goes!
Welcome to the BaseSpace Weblog, a place for BaseSpace updates directly from members of the BaseSpace team. Return here for news about new tools and functionality, useful examples of research methods and data, and general musings on all things bioinformatic!
There’s a lot of focus right now on the need for more powerful informatics for sequencing (“DNA Sequencing Caught in Deluge of Data” as the NY Times headlined this week). At Illumina we’re confident that these problems are going to be solved by smarter algorithms running on faster computers, and of course we’re working hard on that. But even more importantly we believe that sequencing users need to be able to focus 100% on discovery, not managing the complexities of informatics tools. Hence BaseSpace.
The BaseSpace team is committed to simplifying the informatics experience. We’re going to be deploying new features and updates at intervals of weeks, not the traditional bioinformatics cycle of months or years. We’re constantly talking to users about what works and what needs improvement. And we want to hear from you as soon as possible if we’ve screwed up… and by all means tell us if we’ve done something you really like!
Our development philosophy has been to start with the critical basic functions, execute them well, and then build on that strong foundation. Already, any MiSeq user can check a box on the instrument screen, and all run data will be securely streamed to BaseSpace. So, why would you want to do this?
- Effortless offsite backup of all data generated during a run
- Fast processing using the latest versions of Illumina’s informatics workflows
- Access to your data from anywhere there’s a browser – including mobile
- Instant sharing of any size data set with anyone that has an email address
- Data secured by e-commerce grade encryption: SSL while in flight, AES-256 while at rest
- Oh yeah, and it’s free!
What you see today is just the start. Right now we’re toiling away on an appstore so you can easily access best in class tools from any vendor. We’re working to get all HiSeq 2000s connected. And we’re developing some great tools of our own to put up in the appstore.
Again, don’t hesitate to let us know if there’s anything we can do to improve your BaseSpace experience!