NoSQL and the next generation of big data

Javascript code close-up with neon graphic overlay

Image by:

Photo by Jen Wike Huger

This year, OSCON attendees will have the opportunity to hear Henrik Ingo speak on Selling Opensource 101.

Ingo is a senior solutions architect at MongoDB. He is active in many open source projects, and is the author of Open Life: The Philosophy of Open Source, a book on open source community ethics and business models.

In this interview, he provides insight into MongoDB and explains why it's the platform of choice for big data analytics and for building microservices.

Q&A

What do you mean by NoSQL databases? What is the difference between NoSQL and RDBMS? When and when not to use NoSQL databases? What are the advantages of NoSQL databases?

The so-called "NoSQL movement" started around 2007. Perhaps an embryonic form of NoSQL was the widespread usage of memcached, which then evolved into a number of key-value databases like Redis, Riak, Voldemort and so on. There were also academic papers published on the topic, for example by Google and Amazon describing systems used internally at those companies.

The main driver for this development was the need to shift from the old RDBMS architectures to a more scalable scale-out approach. The RDBMS products have served us well, but they were engineered in a different time for a different need. Historically, the database was there to support internal processes of a company. And even the largest companies will only have a few hundred thousand employees—and even then they won't all be using the same database system all at once. In contrast, modern web companies will have tens, maybe even hundreds of millions of users somewhat concurrently on their systems. This is a thousand-fold increase in scale and required a different architectural approach. Also the amount of data stored per user is exploding, so the world is experiencing exponential growth. The amount of data in the world is doubling every 14 months—give or take.

The name "NoSQL" simply refers to the fact that this new breed of databases rid themselves of the API and query language that legacy databases have relied on since 1974. Indeed, one of my primary motivations for joining MongoDB two years ago was the realization that, for the first time in my lifetime(!), something new is happening in the database world, and given the opportunity I realized it would have been crazy of me not to be a part of it.

"NoSQL" is not the most apt name as it only describes what the category of products does not do, and completely misses the fact that there are varieties of NoSQL databases that suit very different needs and use cases. The only thing this new generation of databases have in common is that we are not relational databases.

To further confuse matters, many NoSQL databases nowadays actually provide a SQL-like interface for querying, since there's a huge user base already familiar with SQL. Within our own ecosystem, MongoDB too has several third party solutions to provide that capability. So now we have ended up in a world where NoSQL databases support SQL. Many would say it's time for the "NoSQL" name to be retired. Instead of "NoSQL databases" I will refer to "non-relational" for the rest of our discussion to avoid controversy.

Still, there are commonalities across this non-category of non-relational databases. Having a scale-out architecture (based on some form of sharding) is one common characteristic. Same for replication or high availability, which still today is often an add on product in the world of relational databases.

Finally, it is also a feature that these databases do not have a relational data model (meaning, data is modeled as 2-dimensional tables). For instance in the case of MongoDB our JSON based data model and query language are something developers really love, and typically one of the first things mentioned as an advantage of MongoDB. It's not just that developers are familiar with JSON, but also the dynamic schema allows for more rapid and iterative development style, without heavy planning up front. About 80% of new data generated today is unstructured, so this is another reason why we need the appropriate database solutions to match.

What features make MongoDB enterprise edition the platform of choice over MongoDB community edition, especially for enterprise customers?

As a rough rule of thumb, the core differentiation is such that a developer is likely to find everything he wants in MongoDB, whereas the operations team will benefit from the management, backup and monitoring features of MongoDB Ops Manager that is included with MongoDB Enterprise Advanced. Much of this functionality is also available in the cloud-hosted solution called MongoDB Management Service. In addition MongoDB Enterprise Advanced includes a host of security features that also may become important once you start going into production, such as single sign-on integration with LDAP and Active Directory, PKI certificates and maintaining audit logs of all operations taken against the database for compliance reporting.

MongoDB Enterprise Advanced also provides 24x7 global support with 1 hour SLAs, as well as enterprise certifications and on-demand training for development and DBA teams.

MongoDB 3.0 integrates the WiredTiger engine and Ops Manager tool. What does this mean for enterprises?

This is simple: easier scaling! But they each help with scaling for different reasons.

Ops Manager helps the ops team to manage an increasing number of MongoDB processes as you scale out. MongoDB is pretty simple to install and manage as it is, but of course if you have a 100 node cluster, or even just 12 nodes, installing or upgrading each server manually doesn't make sense. Ops Manager can configure, deploy and manage entire clusters from a single console or REST API call. You can launch complex operations like a zero-downtime rolling upgrades from a click of a button, then go have coffee. And of course it does your basics like monitoring and continuous backups and such.

If you're not familiar with the value of automation, consider that users of Ops Manager commonly report 10x-20x efficiency gains in DBA time.

The WiredTiger storage engine was introduced in MongoDB 3.0, and is a revolutionary step forward for the performance of the database engine on certain classes of workloads, especially those dominated by writes to the database. By the way, it's still optional in 3.0. You have to choose it at startup.

WiredTiger was created by the original developers of BerkeleyDB, the most widely used embedded database in the world. MongoDB acquired WiredTiger in December of last year, so they are fully integrated with the rest of the team developing MongoDB.

Our database engine until now we call just the MMAPv1 engine. It has served MongoDB well and goes back to the days when MongoDB was a really simple and young non-relational database. Today however MongoDB is used by very demanding customers and by an increasingly diverse set of applications. So WiredTiger helps us address a much broader range of applications well. And it definitely brings our database engine story to the forefront of the the performance and benchmarking debates. It's a modern MVCC engine, using contemporary locking algorithms like hazard pointers to support highly concurrent workloads.

WiredTiger has much more in it than just better performance for write-intensive workloads. For example most MongoDB users store a lot of data, so they will benefit from built-in compression. The storage engine also has several sweet features that will be integrated and exposed to MongoDB in the future—the one I already know will be happening is LSM based indexes. (LSM based indexes are great for write-heavy, disk-bound workloads.)

Do you believe NoSQL databases are more suited for cloud computing than relational databases are?

Yes.

It's actually surprising that in 2015, nearly ten years after the launch of Amazon Web Services (AWS), not one of the mainstream RDBMS solutions has any kind of useful support for sharding.

To pre-empt some protests from my friends in the RDBMS world, I should clarify: I'm aware of some good solutions that are not mainstream, and some others that are not good or even useful. Basically, this is still an unsolved problem for RDBMSs.

Another difference is ease of use and configuration. Historically, in an enterprise data center, it would take you at least three months to get new servers—sometimes up to 6 or 12 months! In that context, it wasn't a big deal that installing and configuring the database software could take a day or even a week. But that is not acceptable for 2015. You can spin up a cloud instance in five minutes, and it shouldn't take much more to install MongoDB onto it, even when you do it manually. (And, like I mentioned before, our MMS service can automate the installation for you, to make it even faster. It even spins up the AWS instances for you, if you provide the keys.)

Are NoSQL implementations mostly based on the CAP theorem? Can you please explain how CAP theoram helps in choosing the right NoSQL technology?

Oh, dear. The CAP theorem is a controversial topic. Confusing at best, not useful at worst.

Still, the problem discussed by the CAP theoram is real. For a distributed database, it is challenging to manage the trade-offs between availability, scalability and consistency. There are many interesting solutions to these, and the range of solutions is much broader than just "CA" or "AP". It's a topic I personally spend a lot of time learning and discussing, and could be a topic of its own interview!

MongoDB's solution to this is quite simple and well understood. We provide a default configuration based on the "least surprises/highest productivity" philosophy: Clients write and read to a single primary server (per shard), and replication to secondaries is only to provide high availability. This provides consistent reads and non-conflicting writes, which we believe is what any developer finds natural and easy to understand. With a configuration option developers can choose to send reads to secondary nodes, but then they have to understand and manage the consequences of eventual consistency.

By the way, instead of trying to make sense of the CAP theorem, I've found that trying to understand the meaning of different consistency levels is useful. Doug Terry has written an excellent paper, Replicated Data Consistency Explained Through Baseball, that I find very useful!

What is MongoDB's perspective on big data? What value does it add to the process of building big data ecosystems?

Big Data is often characterized by the "3V" rule: Volume, Variety and Velocity. While it's kind of "consultant speak", I actually find it a useful description.

The first V is obvious: You have lots of data. In the MongoDB space it is unlikely for anyone to have petabytes of data. We do have customers with that much data, but usually it makes sense for them to store it in multiple separate MongoDB clusters - because it originates in different applications and there's no need for the data to be together. But having lots of terabytes even for a single application happens quite a lot. In my days of selling MySQL, about seven years ago, it was unusual for anyone to even reach a single terabyte!

Variety is in my opinion a strong point of MongoDB and our JSON based data model with dynamic schema. People often get stuck on the "Big" part of big data, but actually the diversity and unstructured nature of modern data is often much more challenging. In my opinion, MongoDB handles the variety part really well.

Finally, Velocity refers to the change in expectation for how fresh the data is, and how quickly it is hitting the database. Historically you could load data into your data warehouse at night time or even weekends, and then produce a report at month end. This is of course no longer acceptable - data has to be loaded into your Big Data database as it arrives. And in many cases you also need to do real-time analytics: that means queries within seconds or even milliseconds from when the data was inserted to the database.

What role does MongoDB play among the technologies one might choose for big data solutions?

We satisfy the above 3V requirements well on all fronts.

It's of course not surprising to see the two most popular technologies often used together: Hadoop and MongoDB. They complement each other rather well to power big data applications. Hadoop was historically focused on really heavy-duty processing of lots of historical data. Such Hadoop analytics can range from seconds to hours. MongoDB on the other hand is your database with indexes and does relatively fast queries in the range of minutes to milliseconds.

A typical data flow from MongoDB to Hadoop and back to MongoDB looks like the following:

MongoDB powers the online, real time operational application, serving business processes and end-users.
Hadoop consumes data from MongoDB, blending it with data from other operational systems to fuel sophisticated analytics and machine learning.
Results are loaded back to MongoDB to serve smarter operational processes such as delivering more relevant offers, faster identification of fraud or an improved customer experience.
There can also be a real-time personalization component, where the session or profile information is updated directly in MongoDB as the user uses the website or application. Hadoop was historically not used for this, but now the streaming data frameworks like Storm, Spark and Flink are doing exactly these kinds of tasks. So rather than feeding data to MongoDB at the end of an analytics job, Hadoop can now continuously update data in MongoDB.

We have lots of customers that use exactly such a big data architecture: eBay, Pearson, Orbitz, SFR, Foursquare and Square Enix are just a few. (You can read about them on our website.)

Our aggregation framework part of the query language really shines in analytics. I believe it is one reason why MongoDB is a popular choice for Big Data.

When and why choose MongoDB as a platform for building microservices?

Here I see two different approaches to architecture:

The strict interpretation of microservices is that each service is independent, and can at least theoretically each use different programming languages and databases. In such an architecture you end up with polyglot persistence, meaning a "right tool for the right job" philosophy. So if we believe in various market share studies—just for the sake of this example—let's say maybe half of your services choose to build on MongoDB, and 10–20 percent use a key value store or wide column store and some use a graph engine or text search engine.

On the other hand, you might rather go for a Database-as-a-Service model. In this approach the database—let's say it is MongoDB—is centralized to a specialized DBA team, who is also then responsible for taking backups, high availability and tuning. Each dev team building a micro service will then just use this DBaaS and won't need to worry about the database at all, making their life much easier. Of course, some services may still need something different, like a text search engine or a graph database.

Recent paradigm shifts mean many relational databases have incorporated the capability to store unstructured data. On one hand this indicates that the world has finally accepted the importance of looking at data with a different perspective, but on the other it also bridges the difference between the relational and noSQL world. How do you look at this?

Well, this is the same thing as NoSQL databases adding support for SQL, just from the opposite direction. It shows that there is demand for unstructured data, and it is only growing.

But honestly, I still don't see these additions as really competing against non-relational products. For example, PostgreSQL support for JSON looks really good, but I don't recall if I ever heard of anyone choosing between MongoDB or PostgreSQL. It seems to me these features make life easier to the existing users of the RDBMS, but fundamentally they're still an RDBMS. In particular, the lack of support for sharding to scale out your database and built-in replication for high availability or cross-region data distribution is still a big differentiator between the two worlds.

OSCON
Speaker Interview

This article is part of the Speaker Interview Series for OSCON 2015. The OSCON OSCON is everything open source—the full stack, with all of the languages, tools, frameworks, and best practices that you use in your work every day. OSCON 2015 will be held July 20-24 in Portland, Oregon..

1 Comment

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.