Recently, I’ve been examining different data streaming options as part of my work for GA4GH. The mission of this group is to provide a standard API for accessing genomic data. One of the challenges of bioinformatics is that it is currently file based. There are various file formats (which somewhat adhere to standard specifications), and to do any type of analysis you need to compose multiple files into a useful data structure that you can analyze. Usually, you’ll need to use a collection of command line tools to do the processing (ETL – extract, transform, load).
Obviously, this mode of work is time consuming and frustrating, especially when the data that you need is in a file whose format is just far enough outside the spec that some tool in your pipeline breaks. Moreover, you might only be interested in a small subset of the information contained in a file that is multiple gigabytes in size.
So the ideal situation is to expose data via APIs which conform to a specification and can be queried to provide only the data you need on demand.
Great, you say, let’s build a REST API and be done with it. And this is exactly what we did at first. However, it quickly becomes obvious that the data of interest is usually on the order of megabytes and spans thousands of objects. Thus, paged REST endpoints start to really suffer in terms of performance when a query takes 5-10 minutes to complete and performs 3,000 round trips.
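To make the round-trip cost concrete, here is a back-of-the-envelope sketch in Python. The function name, the page size of 100, and the 100 ms round-trip time are illustrative assumptions, not figures from the GA4GH server. The model only counts time spent waiting on the network and ignores transfer and processing time.

```python
def paged_wait_seconds(n_items, page_size, round_trip_s):
    """Return (round_trips, seconds spent waiting) for a paged query.

    Assumes one request/response round trip per page and nothing else.
    """
    round_trips = -(-n_items // page_size)  # ceiling division on integers
    return round_trips, round_trips * round_trip_s


if __name__ == "__main__":
    # Hypothetical workload: 300,000 objects, 100 per page, 0.1 s per round trip.
    trips, wait = paged_wait_seconds(300_000, 100, 0.1)
    print("paged:    %d round trips, %.1f s waiting" % (trips, wait))

    # A single streamed response pays the round-trip latency only once.
    trips, wait = paged_wait_seconds(300_000, 300_000, 0.1)
    print("streamed: %d round trip,  %.1f s waiting" % (trips, wait))
```

With these made-up numbers the paged query spends about 300 seconds doing nothing but waiting, while a single streamed response waits a tenth of a second.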
So to overcome these deficiencies, we’ve been investigating different streaming implementations. In an effort to understand the state of the streaming universe, I’ve been prototyping small streaming endpoints using various frameworks and methodologies.
A little bit of history…
HTTP/1.1 was released in 1999, before the era of big data was upon us. Remember, this is when laptops were shipping with 5 GB hard drives! Many of the files that we now routinely deal with are a multiple of this size. In fact, 5 GB isn’t really “Big Data”. It’s big enough to be annoying though.
Usually on the internet, the server responds to a client request with a response containing a Content-Length header. The benefit of this is that the client knows when the response is complete. However, the client cannot start processing until all the data is received.
XMLHttpRequest across the board really slowed uptake.
While HTTP/1.1 technically addresses the streaming problem, it does so in an inefficient manner. To shore up some of these issues, HTTP/2 adds features like header compression, binary transfer, and multiplexed streams. Additionally, it allows servers to send assets that the client didn’t request via server push. If you’re interested in what the future holds, by far the best resource is http2-explained.
To benchmark the performance, I transferred the variants as specified by the GA4GH spec in JSON from the first 10,000,000 bases of chromosome 1 in the 1000 Genomes Project. In this genomic region there are 320,977 variants.
The data was dynamically read from the 1kg VCF file via pysam, parsed into GA4GH objects specified by the Protocol Buffer generated classes, and serialized to JSON. Responses were not compressed unless the framework provided compression by default.
The transfers were simulated locally on the loopback interface. First, the data was transferred without restriction, then with 100 ms of latency imposed by netem, and finally with both 100 ms of latency and the connection throttled to 100 Mbit/s. Since the processes are CPU bound when reading the VCF, the bandwidth cap had no effect.
To get an idea of what is being generated and transferred, I first iterated over the VCF, generated the variant objects, and serialized these out to a file. This process took 190 seconds and generated a 224 MB file.
Current REST Server
Using the current REST API (implemented with Flask), the variants were transferred in 912 seconds; with latency, however, this time grew to 2513 seconds. About 205 MB were transferred during this time. The transfer time differential really shows the overhead of paged requests.
- Standard API
- Acceptable performance on small datasets
- Matches the frontend model of requesting data that will appear on screen
- Easy to recover (re-request) corrupted data
- Poor performance on large datasets
- Connection latency severely impacts performance
- Backend implementation of paging can add technical overhead
HTTP/1.1 Streaming Using Chunked Encoding
CherryPy was used for this implementation. The data was streamed in only 198 seconds, despite the data size growing to 270 MB due to the lack of compression. Even more impressive is that with 100 ms of latency, the data was still streamed in just 200 seconds.
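For readers who haven’t looked at chunked transfer encoding on the wire, here is a small Python sketch of the framing itself: each chunk is its size in hexadecimal, a CRLF, the bytes, and another CRLF, with a zero-length chunk marking the end of the body. This is not how the CherryPy endpoint was written (the web server does the framing for you); the function names are my own and the decoder ignores optional extras like chunk extensions and trailers.

```python
def encode_chunked(parts):
    """Frame an iterable of byte strings as an HTTP/1.1 chunked body."""
    for part in parts:
        if not part:
            continue  # an empty chunk would be read as the end-of-body marker
        yield b"%x\r\n" % len(part) + part + b"\r\n"
    yield b"0\r\n\r\n"  # zero-length chunk terminates the body


def decode_chunked(body):
    """Yield the payload of each chunk in a complete chunked body.

    Simplified: no chunk extensions, no trailers, whole body already in memory.
    """
    pos = 0
    while True:
        eol = body.index(b"\r\n", pos)
        size = int(body[pos:eol], 16)
        if size == 0:
            return
        start = eol + 2
        yield body[start:start + size]
        pos = start + size + 2  # skip the payload and its trailing CRLF


if __name__ == "__main__":
    records = [b'{"id": 1}\n', b'{"id": 2}\n']
    wire = b"".join(encode_chunked(records))
    print(wire)
    print(list(decode_chunked(wire)) == records)
```

Because every chunk announces its own length, the client can hand each record to the application as soon as it arrives, without knowing the total size up front.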
However, streaming on CherryPy isn’t for the faint of heart. CherryPy essentially hands the data generator to the web server and steps out of the way. What this means is that if your data generator suddenly throws an exception 100,000 objects into the stream, the server immediately 500’s without giving you a chance to handle the error. Thus, if your application relies on exception handling as a means of flow control, you’ll need to refactor. Moreover, you’ll need to provide a mechanism to allow clients to restart the stream from arbitrary points. This could be accomplished through the use of server sent memento objects or offset handling. Finally, some of CherryPy’s convenient decorators like @json_out and automatic compression do not work with streaming endpoints. However, given that CherryPy is a microframework, you could build these conveniences into your application easily.
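One possible shape for the offset handling mentioned above, sketched in plain Python with no web framework involved. Everything here is hypothetical: the `resumable_stream` and `flaky_source` names, the JSON-lines layout, and the `resume_from` field are assumptions for illustration, not part of the GA4GH API or CherryPy. The idea is to catch exceptions inside the generator and report them in-band as a final record, since the response headers have already been sent by the time the stream fails.

```python
import json


def resumable_stream(records, start=0):
    """Yield JSON lines for `records`, skipping the first `start` items.

    If producing a record raises, emit one last line carrying the error and
    the offset the client should resume from, instead of letting the
    exception escape to the server.
    """
    next_offset = 0
    try:
        for record in records:
            if next_offset >= start:
                yield json.dumps({"offset": next_offset, "data": record}) + "\n"
            next_offset += 1
    except Exception as exc:
        resume_from = max(next_offset, start)
        yield json.dumps({"error": str(exc), "resume_from": resume_from}) + "\n"


def flaky_source(fail_at=None):
    """Stand-in data source that fails partway through (for demonstration)."""
    for i in range(5):
        if i == fail_at:
            raise ValueError("bad record")
        yield {"id": i}


if __name__ == "__main__":
    first = list(resumable_stream(flaky_source(fail_at=3)))
    print("".join(first), end="")
    resume_from = json.loads(first[-1])["resume_from"]
    # The client re-requests the stream from the reported offset.
    print("".join(resumable_stream(flaky_source(), start=resume_from)), end="")
```

Skipping by iterating from the beginning is only acceptable for a sketch. A real endpoint would want to seek directly to the requested position in the underlying data.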
- Lightning fast
- Built on widely supported and stable protocol
- Easily implemented using Python generator functions
- No error handling
- Must implement checkpointing
Websockets
I used the Autobahn framework with Twisted for this demo. It’s important to note that using websockets is a much lower-level description than the other tests. It is analogous to just exposing a Unix socket and communicating over it. It says nothing in terms of how the communication should take place. Thus, I simply sent all the variants to the client when it connected.
First, I attempted to use ws4py, but it turns out that the framework has a known bug which prevents the sending of data immediately after the connection is opened. Autobahn had no such bug and seems to be the most widely used server side websocket framework for Python.
Twisted provides the event driven functions that make code “asynchronous”. A lot of these functions are now available natively in Python 3.5 via asyncio. But, to keep the playing field level, I used Twisted with Python 2.7. The trick in Twisted to get a long running looping process to run asynchronously within Twisted’s main event loop is to turn your code into a generator function and call it with the cooperate function provided by Twisted. Using this call allows control to be passed back to Twisted every few iterations, which allows the payloads you are sending to actually get sent out over the wire. If you don’t use cooperate (even if you yield within your loop), Twisted never gets a chance to send out the messages you generate. If you only have a couple hundred quick iterations, this is barely noticeable, but with 300K+ iterations my code would block for 3 minutes. Moreover, the client wouldn’t receive all the messages, making me wonder if there is some message queue limit that I was hitting. In any case, once you realize how Twisted provides asynchronous functionality, it’s not too bad.
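The same idea can be illustrated with the asyncio module mentioned above. To be clear, this is not the Twisted code from the benchmark and it does not use cooperate; it is a Python 3 analogue in which a long loop periodically hands control back to the event loop so that another task (here a stand-in for the transport) gets a chance to run. The function names and the batch size are assumptions made for the example.

```python
import asyncio


async def produce(queue, n_items, yield_every):
    """Queue n_items messages, pausing every yield_every items."""
    for i in range(n_items):
        queue.put_nowait(i)
        if (i + 1) % yield_every == 0:
            # Hand control back to the event loop so other tasks can run.
            await asyncio.sleep(0)
    queue.put_nowait(None)  # sentinel: no more messages


async def drain(queue, backlog):
    """Stand-in for the transport: consume messages as they become available."""
    received = 0
    while True:
        item = await queue.get()
        if item is None:
            return received
        received += 1
        backlog.append(queue.qsize())  # messages still waiting to go out


async def run(n_items, yield_every):
    """Run producer and consumer together; return (received, worst backlog)."""
    queue = asyncio.Queue()
    backlog = []
    _, received = await asyncio.gather(
        produce(queue, n_items, yield_every), drain(queue, backlog)
    )
    return received, max(backlog)


if __name__ == "__main__":
    print(asyncio.run(run(1000, 100)))    # producer yields every 100 items
    print(asyncio.run(run(1000, 10**9)))  # producer never yields
```

In the first call the backlog stays below the batch size, because the consumer gets a turn after every batch. In the second, the whole data set piles up before anything is consumed, which mirrors the blocking behaviour described above.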
- Mature protocol
- Connection remains open, reducing overhead
- Not widely used outside of real-time web applications
- Needs a well designed sub-protocol to be practical
- Was not designed to optimize one-way data transfer speed
GRPC
GRPC is an RPC framework that works over HTTP/2. Objects and services are defined using protocol buffers. It’s a really well integrated system. With a well defined schema, the endpoints almost write themselves. And on the client side, retrieving data is nearly as simple as a local function call. GRPC is a very efficient protocol, cutting the transfer size by 25% compared with the current server. It is also quite fast, taking just one third of the time of the current REST endpoint. And while it is slower than the HTTP/1.1 chunked transfer implementation used earlier, it doesn’t suffer from the same error handling shortcomings.
The biggest drawback is that there is currently no browser based client for GRPC. This shortcoming means you cannot yet use it to replace the REST APIs used by web clients. It looks like they are working on an HTML5 client though, so we’ll have to stay tuned. However, both Android and iOS frameworks are available.
- Very easy to use
- Efficient and relatively fast
- Wide variety of languages supported on both the server and client
- Only data needs to be transferred (not the schema)
- Ensures schema integrity across clients
- No browser client
- Newer HTTP/2 protocol is not in widespread use yet
If you need to stream to the browser it’s probably best to use Chunked Transfer Encoding over HTTP/1.1. However, if you are doing systems programming and need to communicate between machines, I wouldn’t hesitate to use GRPC.