Until recently, doing time series analysis at scale was expensive and almost exclusively the domain of large enterprises. What made time series a hard and expensive problem to tackle? Until the advent of the NoSQL database, scaling up to meet increasing velocity and volumes of data generally meant scaling hardware vertically by adding CPUs, memory, or additional hard drives. When combined with database licensing models that charged per processor core, the cost of scaling was simply out of reach for most.
Fortunately, the open source community is rapidly democratizing large-scale data analysis, and I am lucky enough to work at a company making contributions in this space. In my talk at All Things Open this year, I'll introduce Riak TS, a key-value database optimized to store and retrieve time series data for massive data sets, and demonstrate how to use it in conjunction with three other open source tools (Python, Pandas, and Jupyter) to build a completely open source time series analysis platform. And it doesn't take all that long.
The basics you need to know to get started with Riak TS:
- Installation: where to get Riak TS, how to install it, and how to scale it up as the size of your data problem grows
- How to start interacting with Riak TS, both from the built-in riak-shell and from Python via the Riak Python Client
- How to create a new table in Riak TS and verify that it was created
- How to query Riak TS using both the riak-shell and Python
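To make the table-creation and query steps above a little more concrete, here is a minimal sketch in Python. It only builds the SQL strings, so it runs without a Riak TS node. The table name `bike_status`, its columns, and the helpers `to_epoch_ms` and `build_range_query` are hypothetical, chosen for illustration and not the schema used in the talk. The sketch also assumes that time bounds are passed as integer milliseconds since the Unix epoch and that the partition key uses a `QUANTUM` on the timestamp column.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical table name, used only for illustration.
TABLE = "bike_status"

# A CREATE TABLE statement in the style Riak TS expects: the first part of
# the PRIMARY KEY is the partition key (here quantized into 1-day buckets),
# the second part is the local sort key. Column names are assumptions.
CREATE_TABLE = """
CREATE TABLE {table} (
    station_id  SINT64    NOT NULL,
    time        TIMESTAMP NOT NULL,
    bikes       SINT64,
    docks       SINT64,
    PRIMARY KEY (
        (station_id, QUANTUM(time, 1, 'd')),
        station_id, time
    )
)
""".format(table=TABLE).strip()

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(dt):
    """Convert a timezone-aware datetime to whole milliseconds since the Unix epoch."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def build_range_query(table, station_id, start, end):
    """Build a SELECT covering one station over the half-open range [start, end)."""
    return (
        "SELECT * FROM {t} WHERE station_id = {s} "
        "AND time >= {a} AND time < {b}"
    ).format(t=table, s=int(station_id), a=to_epoch_ms(start), b=to_epoch_ms(end))


start = datetime(2014, 3, 1, tzinfo=timezone.utc)
end = datetime(2014, 3, 2, tzinfo=timezone.utc)
query = build_range_query(TABLE, 2, start, end)

print(CREATE_TABLE)
print(query)
```

With a running node, both statements could be pasted into riak-shell as they are. From Python, the Riak Python Client documents a `ts_query(table, query)` method for sending a statement like this; treat that call as something to verify against the client version you have installed.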
During my talk, I'll load over 350,000 records from the Bay Area Bike Share open data set to demonstrate how fast Riak TS is at both reading and writing data. I'll use the Python Data Analysis Library (Pandas) and Jupyter, two open source tools that every Python programmer should know, to:
- Query Riak TS
- Convert a Riak TS result set into a Pandas DataFrame
- Demonstrate some of the built-in data analysis features of Pandas
- Use the matplotlib library to demonstrate how to create data visualizations
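As a rough sketch of the conversion and analysis steps above, the snippet below turns a list of rows into a time-indexed DataFrame and computes a daily average. The rows, the column names, and the helper `rows_to_frame` are made up for illustration; they assume a result set shaped as one list per record, with the timestamp in epoch milliseconds, and are not real Bay Area Bike Share data.

```python
import pandas as pd

# Hypothetical column names and rows, shaped like a query result:
# one list per record, timestamp as epoch milliseconds.
columns = ["station_id", "time", "bikes", "docks"]
rows = [
    [2, 1393632000000, 5, 22],
    [2, 1393635600000, 9, 18],
    [2, 1393639200000, 13, 14],
    [2, 1393718400000, 7, 20],
]


def rows_to_frame(rows, columns, time_col="time"):
    """Build a DataFrame indexed by timestamp from a list of row lists."""
    df = pd.DataFrame(rows, columns=columns)
    # Interpret the integer column as milliseconds since the Unix epoch.
    df[time_col] = pd.to_datetime(df[time_col], unit="ms")
    return df.set_index(time_col).sort_index()


df = rows_to_frame(rows, columns)

# Average number of available bikes per calendar day.
daily_mean = df["bikes"].resample("D").mean()
print(daily_mean)
```

Indexing by timestamp is what makes `resample` available, and the resulting series can be handed to `daily_mean.plot()`, which draws through matplotlib, when working in a Jupyter notebook.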
Riak TS is a particularly exciting addition to the open source world of databases for a couple of reasons. One, you'd be hard-pressed to find a time series database that can scale from one to over 100 nodes on commodity hardware with so little effort from the ops department. And two, Riak TS automatically handles the distribution of data around your cluster of nodes, replicates your data three times to ensure high availability, and has a host of automated features that are designed specifically to maximize uptime.
If you are developing applications on top of Riak TS using Java, Python, Ruby, Go, Node.js, PHP, .NET, or Erlang, one of the coolest features is Riak TS’s use of ANSI-compliant SQL. Using SQL makes Riak TS accessible to a wide range of developers and, importantly, business data analysts.
If you are feeling particularly motivated to start analyzing time series data, you can grab all of my example code from GitHub.