Last year, we announced Squiggle, an algorithm for intuitive DNA sequence visualization and analysis, along with an accompanying software package for interactive exploration of DNA sequences. The package works well for small sequences but runs into performance issues as the length and number of DNA sequences increase. Furthermore, users must install the software on their own computers, which may not be optimal for analyzing sequences in parallel.
To get around these limitations and make the tool more usable, we created DNAvisualization.org, which requires only a web browser to operate. The website currently allows users to visualize up to 135 million total letters of DNA at once (for reference, the entire genome of E. coli is about 4.6 million letters), with all of the processing happening in parallel via a cloud computing architecture we optimized specifically for DNA visualization.
The key aspect of our architecture is the use of function-as-a-service (also known as serverless computing), a paradigm in which server management is handed off entirely to a cloud services provider such as Amazon Web Services or Google (among many others). In this system, you supply the code you want to run (the function) and then invoke the function with whatever inputs it requires. If you invoke the function multiple times concurrently, the provider can use its vast resources to execute those invocations in parallel. This frees developers to focus on the code itself and, because billing is based on how long the function takes to run, it has a very nice side effect: when no code is running, you don’t pay.
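To make the invoke-many-times idea concrete, here is a minimal local sketch. It is not our deployed code: `handler` is a hypothetical function body, and a thread pool stands in for the cloud provider, which in a real deployment would run each invocation on its own infrastructure.

```python
from concurrent.futures import ThreadPoolExecutor


def handler(event):
    """Hypothetical function body: report the length of one DNA sequence."""
    sequence = event["sequence"]
    return {"name": event["name"], "length": len(sequence)}


def invoke_concurrently(events, max_workers=8):
    """Invoke the handler once per event; results come back in input order.

    The thread pool is only a local stand-in for the provider's parallelism.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(handler, events))


events = [
    {"name": "seq1", "sequence": "ATGC"},
    {"name": "seq2", "sequence": "GGCCTTAA"},
]
results = invoke_concurrently(events)
```

The caller only supplies inputs and collects outputs; where and how the invocations run is left to whatever sits behind `invoke_concurrently`.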
In our case, we deployed the Squiggle software package as the serverless code and built an HTML+JS website to invoke it. When a user uploads DNA sequences, each sequence is processed in parallel in the cloud. After the processing is done, the visualizations are efficiently cached to reduce redundant computation and a view of the data is returned to the user. When the user zooms in, the cache is queried and returns a more detailed view of the region. This is an ideal application of serverless computing because computational demand is “bursty”: most of the time, the website won’t be getting much (or any) traffic, but when it does, it may require instant access to significant computational resources.
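The cache-then-zoom flow can be sketched roughly as follows. Everything here is an assumption for illustration: `transform` is a simple stand-in walk and not the Squiggle algorithm, the cache is an in-memory dictionary rather than a cloud store, and `VisualizationCache`, `downsample`, and `max_points` are hypothetical names.

```python
def transform(sequence):
    """Stand-in transform (NOT the Squiggle algorithm): a walk that steps
    up on G/C and down on anything else, producing one point per letter."""
    points, y = [], 0
    for x, base in enumerate(sequence.upper()):
        y += 1 if base in "GC" else -1
        points.append((x, y))
    return points


def downsample(points, max_points):
    """Return at most max_points points by keeping every k-th point."""
    if len(points) <= max_points:
        return list(points)
    step = -(-len(points) // max_points)  # ceiling division
    return points[::step]


class VisualizationCache:
    """In-memory stand-in for the cache of processed sequences."""

    def __init__(self):
        self._store = {}
        self.transform_calls = 0  # counts how often we actually recompute

    def view(self, name, sequence, x_min=None, x_max=None, max_points=1000):
        # Compute the full-resolution result only on the first request.
        if name not in self._store:
            self._store[name] = transform(sequence)
            self.transform_calls += 1
        points = self._store[name]
        # x equals the list index, so a region is just a slice.
        lo = 0 if x_min is None else max(0, x_min)
        hi = len(points) if x_max is None else min(len(points), x_max + 1)
        return downsample(points[lo:hi], max_points)


cache = VisualizationCache()
seq = "ATGC" * 250  # 1,000 letters
overview = cache.view("demo", seq, max_points=100)
zoomed = cache.view("demo", seq, x_min=100, x_max=149, max_points=100)
```

The overview is thinned to fit the point budget, while the zoomed request covers a small enough region to come back at full resolution, and neither zoom nor repeat requests trigger a second transform.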
Because Amazon gives 400,000 GB-seconds of compute time each month as part of its permanent free tier (a GB-second is one second of execution with one gigabyte of memory allocated), it’s likely that the site can be run forever without costing a penny. In the event that the free tier is exceeded, the pricing for Amazon’s serverless computing platform is incredibly inexpensive: $0.0000166667 per GB-second of computing. This architecture, more fully described in our paper, is ideal for low-traffic applications requiring access to short-term high-performance computing. We anticipate it being useful for other data analytics tasks, both for bioinformatics (essentially data science for biology) and visualization.
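As a rough back-of-the-envelope check on the pricing, the sketch below estimates a monthly compute bill. It assumes that both the free allowance and the price are metered in GB-seconds (seconds of execution multiplied by the memory allocated in gigabytes) and it ignores any per-request charges. The workloads are made up for illustration and are not measurements from our site.

```python
FREE_GB_SECONDS = 400_000           # monthly free-tier compute allowance
PRICE_PER_GB_SECOND = 0.0000166667  # price beyond the free tier, in USD


def monthly_compute_cost(invocations, seconds_each, memory_gb):
    """Return (GB-seconds used, USD owed) for one month of invocations."""
    used = invocations * seconds_each * memory_gb
    billable = max(0.0, used - FREE_GB_SECONDS)
    return used, billable * PRICE_PER_GB_SECOND


# Hypothetical light month: 10,000 invocations of 2 s each at 1 GB.
light_used, light_cost = monthly_compute_cost(10_000, 2.0, 1.0)

# Hypothetical heavy month: 1,000,000 invocations of 2 s each at 1 GB.
heavy_used, heavy_cost = monthly_compute_cost(1_000_000, 2.0, 1.0)
```

Under these assumptions the light month uses 20,000 GB-seconds and costs nothing, while the heavy month uses 2,000,000 GB-seconds, of which 1,600,000 are billable, for roughly $26.67.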
Benjamin D. Lee is a senior at Harvard studying computer science and is interested in research software engineering for the biological sciences. He currently works at In-Q-Tel Lab41.