A great advantage of having a large network of technical friends is that they ask you for advice on things, which I love giving. One great disadvantage of people is that they rarely take my advice without justification, even though I think everybody should know better by now. A discussion I frequently have with friends (and which they don’t just blindly take my advice on), is their choice of datastore, which invariably goes something like this:
- Trust me, don’t use MongoDB.
- Why, what’s wrong with it?
- Look, how many times have I given you some advice, you didn’t listen, and later on it turned out I was right?
- Ah, so you’re saying I should use Cassandra.
So, since I keep having to justify my opinion (can you believe that? Just ridiculous.), I figured I’d do it once, in this post, and then I can just point people here when they’re about to do something dumb. If I linked you to this article, this means you.
EDIT: Apparently the self-deprecating sarcasm above wasn’t really very obvious, and it comes off as arrogant, but my intention was for it to be satire (cleary opinions should be justified, even mine). Also, the Cassandra joke was a reference to this lady. Like an ancient Greek proverb says, “the best joke is one you have to explain on your blog”.
Datastores are important
The datastore is often the most important part of an application. Code can be changed easily, and new code can be deployed without much fuss if you discover that some of your original choices were wrong, but the data model and the way it is handled is much harder to change. This means that you need to give the data model as much thought as you can when starting out, and the choice of datastore greatly influences that decision.
This post is meant to guide you through some common pitfalls, and hopefully explain why a relational database is a much saner default than the schemaless databases I see most people instinctively reach for nowadays.
To clarify, I’m not really a dogmatic proponent of any particular data store (although I still haven’t forgiven MongoDB for eating my dissertation). I recommend everyone use the right tool for the job, and all advice in this post is about the median case. There are cases where you need to use a different type of datastore because of your data requirements, but I still haven’t met anybody who needed those when starting out, so while there are exceptions to what I write here, you’re probably not them.
If you don’t have much time to read and have to go now, this is the takeaway from this post: Think before you pick a database. If you insist on not thinking, pick PostgreSQL. Trust me.
Schemas are awesome
Schemas are awesome. More importantly, they’re inevitable. There’s no application that doesn’t use a schema, as there’s no application where you only write data without the reader needing to know what kind of data it’s going to read. The schema is just implicit, in the application code, instead of explicit, in the datastore, where it should be. This means that schemaless databases aren’t really schemaless, they just kick the can down the road and let you get away with not defining a schema early on, which invariably comes back to bite you in the ass later.
When you use a schemaless database, you’re essentially saying “I don’t want to deal with the schema now, just let me write the data, my future self can deal with it when reading”. However, when would you rather get an error? When you write the data in the database and can retry, or a year later, when you read it and it turns out that that one entry had escaped notice and is still using the old format?
In almost all cases, the choice is clear, you want your application to produce an error while writing, when you can still do something about it, and ensure that your datastore will always contain “clean” data. That’s impossible to do without a datastore that enforces a schema.
Schemas as a guarantee
Schemas (and, generally, all constraints) serve as a guarantee on what your data is shaped like. For example, in a database of movies and actors in said movies, a document datastore has to either be keyed by movie and provide all actors inline, or on a separate “actors” document. In the former case, you duplicate all actors (and god help you if some actors change their names from movie to movie, e.g. if they get married), and have a very hard time producing a page of movies each actor appeared in. In the latter case, you need to create an implicit relation by embedding actor IDs in each movie, but, since deletes don’t cascade, an actor record that gets deleted (when you discover that two differently-named actors in two movies were actually the same actor that changed their name, for example) would leave “holes” in the data that could crash your application.
An RDBMS can very efficiently handle both cases above, and efficiently answer complicated questions about your data.
Schemas as a point of contact
Another important aspect of schemas is that they can serve as a point of contact between disparate services. If you have ten services accessing the same database and sharing data between themselves, the schema can guard against applications changing their assumptions without notifying all other applications. That way, an application deciding unilaterally that integers will now be represented as floats for more accuracy can’t break all other applications, as the schema will need to explicitly be changed to allow that, notifying other applications of the actual new data type in that field.
For the cases where you really need to store some free-form/prototype data, Postgres provides a very efficient JSON field, which can also index and query data inside the JSON blob. Did that blow your mind or what?
A point I see many people make is that schema migrations are hard, which is really not the case when you use competent tooling. Django, for example, makes migrations trivial, as you just change your application-level classes and the database gets migrated automatically. This is doubly important when you consider that migrations will have to be done with any datastore, relational or not, so competent tooling is a must. Schemas make migrations easier because they guarantee that no piece of data will be left not conforming, a guarantee that schemaless databases can’t make.
Most data is relational
In the overwhelming majority of business cases, there are relations between the data. Even if you just have users and each user writes a comment, that’s a relation. Relational databases are, despite the surprising misnomer, actually built to handle relations.
It will save you lots of time and headaches if the database guarantees you that, when you delete a user, the delete will cascade to all their comments as well, and that’s just one benefit. No joins means that you’ll have to collate potentially thousands of records yourself, inefficiently, because your time is better spent building your product than reinventing the database. No transactions means that you’ll spend sleepless nights trying to track down that race condition that leads to inventory disappearing rather than getting sold. No migrations means that you’ll need to come up with a whole process around ensuring your data is always what your application expects. These are all things that you get for free with a relational database.
Relational databases excel at easily providing answers to questions that weren’t predicted at the time when the data model was designed. For example, if you later need to compile a list of all the brands of all the products on your store, an RDBMS can easily do that by reading the “brands” table, whereas in a document store it is an expensive scan of all the thousands of products just to compile the list of ten brands. If you ever have to ask questions like “what are the top ten brands that sold products that cost over $10 on the weekends in June this year?”, document databases break down into expensive scans and custom in-memory aggregation code, whereas an RDBMS can easily do this in just one SQL query of a few lines.
The biggest danger, though, is that you will go too far down a rabbit hole of document-structured data that will be extremely hard to climb out from when you inevitably encounter difficult problems (such as compiling a list of all your products’ brands). Having to rewrite all your code to use a different data store is a hard task in itself, having to rewrite your code while simultaneously changing your entire data model? Good luck with that.
Premature optimization is the root of all clichés
I see many companies say “we need to be scalable, so we’ll need to use a database that has easy scalability from the get-go”, usually without realizing that many scalable datastores get it wrong. Trading efficient joins for scalability isn’t only a bad tradeoff, it’s an irrelevant tradeoff. Early on, there is no way to know which datastore performance problems (if any) you will hit first. Will your load be read-heavy? Write-heavy? Requiring lots of aggregates? Random reads? Efficient joins? Sharding?
You can’t know these things at a stage where you don’t even exactly know what your product will be, so don’t try to optimize for imagined scalability before you even have your first customer. You’ll have plenty of time figuring out what to use when you know your exact usage patterns, if the business manages to not die until then.
For all these reasons, Postgres is just a really good sane default. Until you definitely know you need more, just use Postgres, and you won’t regret it.
ACID is great
For all its unparallelizability, ACID is pretty damn nice. Being able to say “I want all these statements to execute, or none of them” is a fundamental requirement of every application, and many datastores don’t give you that. Always being sure that your database will be in a consistent state, no matter when it crashes, is paramount. You don’t want to come back to a completely corrupt database just because there was a power outage.
Similarly, you don’t want one process that’s changing data in the database to be able to see the half-completed transactions of another process. As for the D, having your data stay in the database after you put it there is a pretty nice feature to have, and one which not all databases get right (I’m looking at you, 2010 MongoDB).
Most NoSQL datastores sacrifice one or all of these in order to achieve scalability (or just because you can be really fast if you don’t actually have to store data), and you need to know which one you’re sacrificing and why.
You’re in denial
If you chose a NoSQL datastore as your main store, you probably read this post thinking “it’s not that bad, all these things are easy to guard against if only you can do X, plus you can’t beat the speed”. Think about the last bug you had to fix. Was it not also easily avoidable by just doing some thing you ended up not doing? That’s why you need to be using tools that avoid these pitfalls in the first place.
For every business that just couldn’t get the speed or scalability they needed out of Postgres, a hundred others had performance problems because they had to do their joins in the application.
These are the reasons why I recommend that people start with the safe, boring choice, which is relational databases. Did you know that document databases actually came first, back in the sixties, but people kept having problems with queries, performance, and complexity, so they invented relational databases?
Everything old is, indeed, new again.
I hope I have helped you choose PostgreSQL for your next business (you are encouraged to use whatever you like for your side-projects, those are for learning and fun). If not, at least I hope I have given you some insight on why I recommend relational datastores as the initial, default choice. Please send your flames directly to my Twitter.
P.S. I know I’ve been ragging on Mongo a lot, but everything here applies to e.g. RethinkDB as well, which I think got things right as far as NoSQL datastores go.