Richard Jonas's


Sibling resolution in Riak

Leave a Comment
Lately I needed to brainstorm how to solve some interesting problems in a distributed key-value store called Riak. In this blog post I want to show you how to store simple data structures in a concurrent environment safely.


Riak is a key-value store implemented in Erlang (and C++). Riak is an eventually consistent database with tunable consistency level. There are some concepts we need to introduce in order to understand what the problem is.
  • n value: how many replica will be made from the data we insert. By default it is 3, so every data item will have 2 more copies. It means that the data will be stored in 3 physical nodes to handle node crashes.
  • r value: when we are reading, from how many node we need to read the same data value (if they are not the same, the read will fail). Default it quorum, more than half of the N value (2)
  • w value: when writing, how many node we need to write the data items to (default is quorum)
  • pr, pw: there are primary nodes for a data item in Riak. They will come from the consistent hashing algorithm. When Riak tries to write data, at first it will try to write in the primary nodes. If one of them is down, Riak will write the data into the next living node. When the primary node is up again, that temporary node will copy the data to its proper place (hinted hand-off). If we specify pw to 2, two primary nodes will need to give back success during writing.
So if we want to be sure that all of the data are written to their proper place we need to specify pw=3. It makes the write very slow, sometimes it is impossible but we need to pay the price for the consistency. For more detailed definitions please check the Riak documentation.


How Riak handles collision during writing is another problem. Let us assume that two process is reading a balance of a person (let it be $100) and they try to withdraw $10 and $20 from the account. So the temporary results are $90 and $80. And they write the data back into Riak. What will happen?

If we don't handle collisions the last write will win. Obviously now we don't want to make that happen. Otherwise we lose one of the transactions. Riak can handle this situation to write all the two version of the data for the same key. Then we have siblings and at the next get we have two values, not just one. So we need to resolve the problem.

The balance problem

The problem is how to store, update balance when other processes can read and write the value at the same time. In the case when we have two values ($80 and $90) we don't know how to merge the two values. If we know the last operation we can apply that operation to the other value. So if we have {$80, -$20} and {$90, -$10}, either we can apply -$20 operation to $90, or -$10 operation to $80. So we have {$70, -$10} after merging the sibling. Problem solved.

But it is not quite true. Let us suppose that a long running process added $5 to $100, the original balance. So we have {$70, -$20} and {$100, +$5}. How to resolve that problem? We need to choose a sibling on which we apply the other operation. But in that case it is not clear which sibling is the good one. In other words: most recent.

Does it ring the bell? We need to version our structure with an incrementing number, telling us how recent the data item is. So our data tuple is {version_number, balance, last_operation}. Let us see how we can resolve siblings that way.

It seems that we can handle every type of problems that way. The only precondition is that when we make the balance record we need to do that by one process. Otherwise we need to merge siblings in the very beginning. On the other hand it doesn't solve another problem: the balance should not be negative. It can happed that two processes decrease the balance at the same time with an amount that after the sibling resolution it will result in a negative balance.

Consistent sets

We can handle sets in that way, too. In the next figure we are adding 'a' to the set (represented as a list). Then we add 'b' and 'c' to the set at the same time. It is easy to resolve sibling, since we need to merge the data. The last interesting point is when we put 'e' into the set with version 4. We have two siblings: the version 4 and the version 2 sets. The merge is also easy at that case.

So if we have operations that can be applied anytime, we can easily merge the different versions of the same data structures. We cannot store list - for example - that way. Because if we remove 'a' from the list there can be another occurence of 'a', so during merging we don't know whether we need to keep 'a' or don't.

In the next version of Riak - which is Riak 2 - there will be data structures like sets and maps with strong consistency. See here. Check that because the future is the distributed computing ;)
Next PostNewer Post Previous PostOlder Post Home


Post a Comment

Note: Only a member of this blog may post a comment.