Saturday, January 2, 2016

Cassandra Distribution - Write & Read Path


In this article we will talk about how Cassandra writes and reads data. We are going to start with the write path.

The Write Path

  •  Writes go to any node in the cluster (the coordinator)
  •  Writes are written to the commit log, then to the memtable
  •  Every write includes a timestamp
  •  The memtable flushes to disk periodically (as an SSTable)
  •  A new memtable is created in memory
  •  Deletes are a special write case, called a "tombstone"

We start at a high level, the cluster level, and then zoom in to see what goes on at the individual node level. When you send a write to a Cassandra cluster, as we discussed in an earlier article, all nodes in the cluster are equal, so any given node can service the write request you send in. Whatever node you happen to be talking to for that request is called the coordinator node, because it will coordinate with the rest of the nodes in the cluster on behalf of your query. This isn't a special node type (Cassandra doesn't have any special node type like a leader or master); it's just the node that does the coordination for that given request.

Now, once a write request hits an individual node, only two things really happen. First, the write mutation is written to something called the commit log, which is an append-only data structure; that means sequential IO, so this is going to be very fast. Once the write is in the commit log and we have that durability, the next thing Cassandra does is merge the mutation into an in-memory representation of your table, called the memtable. Having written to the commit log and merged into the memtable, the node can respond back through the coordinator to the client and say: yup, we wrote your data. Cassandra is very fast at writing data (it is often called a write-optimized database), and part of the reason is the simplicity of this write path.
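The two steps above can be sketched in a few lines of Python. This is a toy model, not Cassandra internals: the names `commit_log`, `memtable`, and `write` are all invented for illustration.

```python
import time

# Toy per-node write path: append to a commit log first (durability),
# then merge the mutation into an in-memory table (the "memtable").
commit_log = []   # append-only: sequential, fast writes
memtable = {}     # in-memory representation of the table

def write(partition_key, column, value, ts=None):
    ts = ts if ts is not None else time.time()
    commit_log.append((partition_key, column, value, ts))         # 1. durability
    memtable.setdefault(partition_key, {})[column] = (value, ts)  # 2. merge in memory
    return "ack"  # the node can now acknowledge the write

write("user:1", "name", "Alice")
```

Note that the commit log is never read on the hot path; it only exists so the memtable can be rebuilt after a crash.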

So it is a pretty simple write path. The one extra thing you need to know is that every time you write to Cassandra, every column value you write gets a timestamp. This will become important in a few minutes when we talk about how SSTables work in compaction.

You can imagine data getting written to the memtable; eventually memory is going to run out (memory being a finite resource), so Cassandra has to take those memtables and flush them to disk. It does this behind the scenes, asynchronously. Again this is a whole bunch of sequential IO, which is fast: it takes the in-memory representation in the memtable and serializes it to disk as something called an SSTable.
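A toy flush might look like the sketch below: serialize the memtable to an immutable, key-sorted file on disk (standing in for an SSTable) and hand back a fresh, empty memtable. The file naming and JSON-lines format are invented for illustration; real SSTables use a binary format.

```python
import json, os, tempfile

def flush(memtable, directory):
    # Illustrative flush: one JSON line per partition, sorted by key,
    # because SSTables are sorted ("S" is for "sorted") string tables.
    path = os.path.join(directory, "sstable-%d.json" % len(os.listdir(directory)))
    with open(path, "w") as f:
        for key in sorted(memtable):
            f.write(json.dumps({key: memtable[key]}) + "\n")
    return path, {}   # the file is now immutable; start a new memtable

d = tempfile.mkdtemp()
path, memtable = flush({"user:2": {"name": ["Bob", 2]},
                        "user:1": {"name": ["Alice", 1]}}, d)
```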

One more thing to know, besides every write having a timestamp, is that Cassandra doesn't do any update-in-place or delete-in-place: the SSTables and commit logs are immutable. So when you delete data, what Cassandra actually does is write a special kind of record called a "tombstone", which is basically a marker that says there is no data here any more as of this timestamp. Because a tombstone gets a timestamp just like regular data does, reads know there is no data for that column any more.
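A tombstone can be modeled as just another timestamped write, as in this sketch (the `TOMBSTONE` marker and helper functions are invented names, not Cassandra API):

```python
# Deletes as tombstones: instead of removing data, write a marker with a
# timestamp; a read treats the column as absent if the newest entry is
# the tombstone.
TOMBSTONE = object()

def delete(memtable, key, column, ts):
    memtable.setdefault(key, {})[column] = (TOMBSTONE, ts)

def read(memtable, key, column):
    value, ts = memtable.get(key, {}).get(column, (None, None))
    return None if value is TOMBSTONE else value

m = {"user:1": {"name": ("Alice", 1)}}
delete(m, "user:1", "name", 2)   # newer timestamp wins over the old value
```

Notice the old value is still physically present until compaction cleans it up; only the merge-by-timestamp logic makes it invisible.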

What is an SSTable

  •  Immutable data file for row storage
  •  Every write includes a timestamp of when it was written
  •  A partition is spread across multiple SSTables
  •  The same column can be in multiple SSTables
  •  Merged through compaction; only the latest timestamp is kept
  •  Deletes are written as tombstones
  •  Easy backup


So once a memtable gets full it is written to disk as an SSTable, and these are immutable too. So what happens when we get too many SSTables written to disk? How does that work?

SSTables are immutable files on disk. As we write SSTables to disk we end up with a bunch of really small ones, so as an optimization there is a process called compaction. Compaction takes small SSTables and merges them into a bigger one. If we have a row written at time 1 in one SSTable and written again at time 2 in another, when we merge those tables together we keep the version written at time 2, based on the timestamp included with the data, and discard the old one. This is part of what keeps Cassandra fast: it avoids having to check many SSTables to find the same piece of data.
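The merge rule is simple enough to sketch directly. Here each "SSTable" is a dict of partition key to columns, where every column value carries its timestamp; the `compact` function is an illustrative stand-in, not Cassandra's actual compaction code.

```python
def compact(*sstables):
    # Merge many small SSTables into one, keeping only the value with
    # the newest timestamp for each column ("last write wins").
    merged = {}
    for table in sstables:
        for key, columns in table.items():
            row = merged.setdefault(key, {})
            for col, (value, ts) in columns.items():
                if col not in row or ts > row[col][1]:
                    row[col] = (value, ts)
    return merged

old = {"user:1": {"email": ("a@old.com", 1)}}
new = {"user:1": {"email": ("a@new.com", 2)}}
result = compact(old, new)   # the timestamp-2 value survives
```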

One great thing about this is that it makes backup extremely trivial. Because SSTables are immutable once written to disk, you can just copy the files off to another server and you are good to go.

The Read Path

  •  Any server may be queried; it acts as the coordinator
  •  Contact nodes with the requested key
  •  On each node, data is pulled from SSTables and merged
  •  Consistency < ALL performs read repair in background (read_repair_chance)

Reads work the same way: first at the cluster level, then at the individual node level. Reads in Cassandra are very similar to writes. Whichever node you happen to be talking to for a given read request is going to be the coordinator node. Again, it's not a special node; just for that request, that one node is the one that coordinates with the rest of the nodes in the cluster.
Now let's talk about what actually goes on at the individual node level. Cassandra is going to go to disk and look for whatever data you asked for, and it might have to look in multiple SSTables. As we discussed, compaction runs as a background process to merge SSTables together, but you might have some rows spread across multiple SSTables where compaction hasn't had a chance to run and combine them yet. So the node pulls the data out of what could be multiple SSTables, brings it into memory, and merges it together using the timestamps we talked about, where the last write wins: the latest timestamp always wins. If there is any unflushed data in the memtable, like we talked about, that gets merged in as well, and then the node can send a response back through the coordinator to the client: hey, here is the data you asked for.
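The read-side merge can be sketched the same way as compaction: pull the partition from every SSTable that holds a piece of it, then overlay the memtable last. All names here are illustrative.

```python
def read_partition(key, sstables, memtable):
    # Merge a partition across SSTables and the memtable; for each
    # column, the value with the newest timestamp wins.
    row = {}
    for source in list(sstables) + [memtable]:
        for col, (value, ts) in source.get(key, {}).items():
            if col not in row or ts > row[col][1]:
                row[col] = (value, ts)
    return {col: v for col, (v, ts) in row.items()}

sstables = [{"user:1": {"name": ("Alice", 1)}},   # old, not yet compacted
            {"user:1": {"city": ("Pune", 2)}}]
memtable = {"user:1": {"name": ("Alicia", 3)}}    # unflushed update
```

Two SSTables plus the memtable each hold part of `user:1`; the merged result contains the newest value for every column.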

Now you can imagine that, because we might be reading from multiple SSTables on disk, if you have a read-heavy workload the choice of disk (whether you choose an SSD or an old spinning-rust type drive) will have a serious impact on the kind of performance you get out of Cassandra.

Another thing to keep in mind is that compaction also has an impact. You can imagine that if compaction has been running and there are fewer files Cassandra has to search for a given key on disk, reads will be much faster, since there is much less disk IO to do, which is always a good thing when we are trying to read data.

When you do a read with a consistency level of something less than ALL, Cassandra has a chance to do something in the background called a read repair. We talked about Cassandra being an eventually consistent system, so from time to time nodes may actually disagree about the value of a given piece of data; one node might not have the latest update. There is a setting you can configure when you create a table in Cassandra called read_repair_chance, and this is basically the chance that, whenever you do a read against Cassandra, it will also go and talk to the other replicas in the cluster and make sure everybody has the most up-to-date, in-sync information.
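As a toy model of that probability, the sketch below treats each replica as a dict and repairs stale replicas with the given chance. This is purely illustrative; the function name and replica representation are invented, and real read repair happens inside the server, not in client code.

```python
import random

def read_with_repair(replicas, key, chance, rng=random.random):
    # Return the newest value for key; with probability `chance`, also
    # sync every replica to that newest value (the "read repair").
    latest = max((r.get(key) for r in replicas if key in r),
                 key=lambda pair: pair[1])
    if rng() < chance:
        for r in replicas:
            r[key] = latest   # bring stale replicas up to date
    return latest[0]

replicas = [{"k": ("old", 1)}, {"k": ("new", 2)}, {"k": ("new", 2)}]
value = read_with_repair(replicas, "k", chance=1.0)  # chance=1.0 forces the repair
```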
