We wrap up our replication discussion of Designing Data-Intensive Applications, this time discussing leaderless replication strategies and issues, while Allen missed his calling, Joe doesn’t read the gray boxes, and Michael lives in a future where we use apps.
If you’re reading this via your podcast player, you can find this episode’s full show notes at https://www.codingblocks.net/episode162. As Joe would say, check it out and join in on the conversation.
Sponsors
Educative.io – Learn in-demand tech skills with hands-on courses using live developer environments. Visit educative.io/codingblocks to get an additional 10% off an Educative Unlimited annual subscription.
Survey Says
Do you have TikTok installed?
Take the survey at: https://www.codingblocks.net/episode162.
News
Thank you for the latest review!
  iTunes: tuns3r
Check out the book!
Single Leader to Multi-Leader to Leaderless
When you have leaders and followers, the leader is responsible for making sure the followers get operations in the correct order.
Dynamo brought the trend to the modern era, but there are also these (all are Dynamo inspired):
  Riak
  Cassandra
  Voldemort
We talked about NoSQL Databases before:
  Episode 123 Data Models: Relational vs Document
What exactly is NewSQL? https://en.wikipedia.org/wiki/NewSQL
What if we just let every replica take writes?
Couple ways to do this…
  You can write to several replicas
  You can use a coordinator node to pass on the writes
But how do you keep these operations in order? You don’t!
Thought exercise: how can you make sure operation order doesn’t matter?
Couple ideas: no partial updates, increments, version numbers
Multiple Writes, Multiple Reads
What do you do if your client (or coordinator) tries to write to multiple nodes…and some are down?
  Well, it’s an implementation detail: you can choose to enforce a “quorum”, where some number of nodes have to acknowledge the write.
  This ratio can be configurable, so that some % of the nodes is required for a write to be accepted.
What about nodes that are out of date?
  The trick to mitigating stale data…the replicas keep a version number, and you only use the latest data, potentially by querying multiple nodes at the same time for the requested data.
  We’ve talked about logical clocks before: it’s a way of tracking time via observed changes…like the total number of changes to a collection/table…no timezone or nanosecond differences.
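The quorum write and “use the latest version” read described above can be sketched roughly as follows. This is a minimal in-memory illustration, not how any particular database does it: the Replica class, quorum_write, and quorum_read are hypothetical names, and a single integer stands in for the version number.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Versioned = Tuple[int, str]  # (version number, value)


@dataclass
class Replica:
    """Hypothetical in-memory replica, used only for illustration."""
    up: bool = True
    store: Dict[str, Versioned] = field(default_factory=dict)

    def write(self, key: str, version: int, value: str) -> bool:
        if not self.up:
            return False  # an unavailable node never acknowledges
        current = self.store.get(key)
        # Keep the data only if it is newer than what we already hold.
        if current is None or version > current[0]:
            self.store[key] = (version, value)
        return True

    def read(self, key: str) -> Optional[Versioned]:
        if not self.up:
            raise ConnectionError("replica unavailable")
        return self.store.get(key)


def quorum_write(replicas: List[Replica], key: str, version: int,
                 value: str, w: int) -> bool:
    """Send the write to every replica; succeed if at least w acknowledge."""
    acks = sum(replica.write(key, version, value) for replica in replicas)
    return acks >= w


def quorum_read(replicas: List[Replica], key: str, r: int) -> Optional[Versioned]:
    """Ask every replica; need r responses, return the highest version seen."""
    responses = []
    for replica in replicas:
        try:
            responses.append(replica.read(key))
        except ConnectionError:
            continue
    if len(responses) < r:
        raise RuntimeError("read quorum not reached")
    found = [resp for resp in responses if resp is not None]
    return max(found, key=lambda pair: pair[0]) if found else None
```

With three replicas and w = r = 2, a replica that missed a write can still answer a read, but its stale version loses to the newer one returned by a replica that took the write.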
How do you keep data in sync?
About those unavailable nodes…2 ways to fix them up:
  Read Repair: When the client realizes it got stale data from one of the replicas, it can send the updated data (with the version number) back to that replica. Pretty cool! This works well for data that is read frequently.
  Anti-Entropy: The nodes can also run similar background tasks, querying other replicas to see which are out of date. Ordering is not guaranteed!
Voldemort: ONLY uses read repair. This could lead to loss of data if multiple replicas went down and the “new” data was never read after being written.
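Read repair can be sketched in a few lines. This is only an illustration under simplifying assumptions: each replica is modeled as a plain dict, the version is a single integer, and read_with_repair is a hypothetical helper name rather than any database’s API.

```python
from typing import Dict, List, Optional, Tuple

Versioned = Tuple[int, str]   # (version number, value)
Store = Dict[str, Versioned]  # one replica's data, modeled as a dict


def read_with_repair(replicas: List[Store], key: str) -> Optional[Versioned]:
    """Read a key from every replica and push the newest copy to stale ones."""
    responses = [store.get(key) for store in replicas]
    found = [resp for resp in responses if resp is not None]
    if not found:
        return None
    newest = max(found, key=lambda pair: pair[0])
    for store, resp in zip(replicas, responses):
        # A replica is stale if it has no copy or holds an older version.
        if resp is None or resp[0] < newest[0]:
            store[key] = newest
    return newest
```

Note that the repair only happens as a side effect of a read, which is why a value that is never read again may never be fixed up on the replicas that missed it.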
Quorums for reading and writing
Quick Reminder: We are still talking about 100% of the data on each replica.
3 major numbers at play:
  Number of nodes
  Number of confirmed writes
  Number of reads required
If you want to be safe, the nodes you write to and the ones you read from should include some overlap.
  A common way to ensure that: keep the number of writes + the number of reads greater than the number of nodes.
  Example: You have 10 nodes. If you use 5 for writing and 5 for reading…you may not have an overlap, resulting in potentially stale data!
Common approach: take the number of nodes (an odd number) + 1, then divide that number by 2, and that’s the number of readers and writers you should have.
  9 nodes: 5 writes and 5 reads ensures non-stale data.
  When using this approach, you can handle Nodes / 2 (rounded down) failed nodes.
How would you tweak the numbers for a write heavy workload?
Typically, you write and read to ALL replicas, but you only need a successful response from these numbers.
What if you have a LOT of nodes?!?
Note: there’s still room for problems here. The author explicitly lists 5 types of edge cases, and one category of miscellaneous timing edge cases. All are variations of readers and writers getting out of sync or things happening at the same time.
If you really want to be safe, you need consensus (r = w = n) or transactions (that’s a whole other chapter).
Note that if the number of required readers or writers doesn’t return an OK, then an error is returned from the operation.
Also worth considering: you don’t have to have overlap. Having readers + writers <= nodes means you could have stale data, but at possibly lower latencies and lower probabilities of error responses.
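The arithmetic above is easy to check in code. The helper names below are made up for this sketch; n is the number of nodes, w the number of confirmed writes, and r the number of reads required.

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """True when every read set must share at least one node with every write set."""
    return w + r > n


def majority(n: int) -> int:
    """The common choice for both w and r; equals (n + 1) / 2 when n is odd."""
    return n // 2 + 1


def tolerated_write_failures(n: int, w: int) -> int:
    """How many nodes can be down while writes still reach w acknowledgements."""
    return n - w


def tolerated_read_failures(n: int, r: int) -> int:
    """How many nodes can be down while reads still get r responses."""
    return n - r


# The examples from the notes:
print(quorum_overlaps(10, 5, 5))       # 5 + 5 is not greater than 10
print(majority(9))                     # w = r = 5 for 9 nodes
print(tolerated_write_failures(9, 5))  # 9 / 2 rounded down
```

One way to think about the write heavy question: lowering w makes writes cheaper and more tolerant of failures, but r has to grow to keep w + r > n. With 9 nodes, w = 2 and r = 8 still overlap, and reads then tolerate only one failed node.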
Monitoring staleness
Single/Multi Leader lag is generally easy to monitor: you just query the leader and the replicas to see which operation they are on.
Leaderless databases don’t have guaranteed ordering, so you can’t do it this way.
If the system only uses read repair (where the data is fixed up by clients only as it is read) then you can have data that is ancient.
It’s hard to give a good algorithm description here because so much relies on the implementation details.
  Paper discussing Probabilistic Bounded Staleness (PBS): http://www.bailis.org/papers/pbs-cacm2014.pdf
And when things don’t work?
Multi-writes and multi-reads are great when a small % of nodes are down, or slow.
What if that % is higher?
  Return an error when we can’t get quorum?
  Accept writes and catch the unavailable nodes back up later?
If you choose to continue operating, we call it “sloppy quorum”: you allow reads or writes from replicas that aren’t the “home” nodes. The author likened it to getting locked out of your house and asking your neighbor if you can stay at their place for the night.
This increases (write) availability, at the cost of consistency.
Technically it’s not a quorum at all, but it’s the best we can do in that situation if you really care about availability. The data is stored somewhere, just not where it’d normally be stored.
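A rough sketch of the “stay at the neighbor’s place” idea, with the catch-up step the book calls hinted handoff. Everything here is hypothetical and simplified for illustration: the Node class, sloppy_write, and hand_off are invented names, and values carry no version numbers, so a real system would need more care when delivering parked writes.

```python
from typing import Dict, List, Tuple


class Node:
    """Hypothetical replica; `hints` holds writes parked for other nodes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.up = True
        self.data: Dict[str, str] = {}
        self.hints: List[Tuple[str, str, str]] = []  # (home node, key, value)


def sloppy_write(home: List[Node], neighbors: List[Node],
                 key: str, value: str, w: int) -> bool:
    """Write to the home nodes; park the write on a neighbor for each one that is down."""
    acks = 0
    missed = []
    for node in home:
        if node.up:
            node.data[key] = value
            acks += 1
        else:
            missed.append(node)
    available = [n for n in neighbors if n.up]
    # Each unreachable home node gets at most one stand-in neighbor.
    for home_node, neighbor in zip(missed, available):
        neighbor.hints.append((home_node.name, key, value))
        acks += 1
    return acks >= w


def hand_off(neighbor: Node, nodes_by_name: Dict[str, Node]) -> None:
    """Deliver parked writes to home nodes that are back; keep the rest."""
    remaining = []
    for home_name, key, value in neighbor.hints:
        home = nodes_by_name[home_name]
        if home.up:
            home.data[key] = value
        else:
            remaining.append((home_name, key, value))
    neighbor.hints = remaining
```

Until hand_off runs, a read against the home nodes can miss the write even when w + r > n, which is the consistency cost mentioned above.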
Detecting Concurrent Writes
What do you get when you write the same key at the same time with different values?
  Remember, we’re talking about logical clocks here, so imagine that 2 clients both write version #17 to two different nodes.
  This may sound unlikely, but when you realize we’re talking logical clocks, and systems that can operate at reduced capacity…it happens.
What can we do about it?
  Last write wins (LWW): But which one is considered last? Remember how we catch up? (Readers fix or replicas communicate) …either way, the data will eventually become consistent, but we can’t say which one will win…just that one will eventually take over.
    Note: We can take something else into account here, like clock time…but there’s no perfect answer.
    LWW is good when your data is immutable, like logs. Cassandra recommends using a UUID as a key for each write operation.
  Happens-Before Relationship (Riak has CRDTs that bundle a version vector to help with this)
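A small sketch of last write wins, assuming each write carries a timestamp (clock time is one option mentioned above, with all its caveats). The function names are invented for illustration. The second helper shows why unique keys sidestep the problem: if every write gets its own key, no two writes ever compete for the same one.

```python
import uuid
from typing import Dict, Tuple

Stamped = Tuple[float, str]  # (timestamp, value)


def lww_merge(a: Stamped, b: Stamped) -> Stamped:
    """Keep the write with the larger timestamp.

    Ties fall back to comparing the values, so every replica picks the
    same winner no matter which order it sees the two writes in.
    The losing write is silently discarded.
    """
    return max(a, b)


def append_event(log: Dict[str, str], payload: str) -> str:
    """Store an immutable event under a fresh random UUID key."""
    key = str(uuid.uuid4())
    log[key] = payload
    return key
```

Because lww_merge gives the same answer for (a, b) and (b, a), replicas converge on one value, but the other client’s write is lost even though it was reported as successful.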
This “happens-before” relationship and concurrency
How do we know whether the operations are concurrent or not?
Basically, if neither operation knows about the other, then they are concurrent…
Three possible states if you have writes A and B:
  A happened before B
  B happened before A
  A and B happened concurrently
When there is a happens-before, you take the later value.
When they are concurrent, you have to figure out how to resolve the conflicts.
Merging concurrently written values
  Last write wins?
  Union the data?
  No good answer
Version vectors
The collection of version numbers from all replicas is called a version vector.
Riak uses dotted version vectors. The version vectors are sent back to the clients when values are read, and need to be sent back to the db when the value is written back.
  Doing this allows the db to understand if the write was an overwrite or concurrent.
  This also allows applications to merge siblings by reading from one replica and writing to another without losing data, if the siblings are merged correctly.
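The three states (A before B, B before A, concurrent) can be told apart by comparing version vectors entry by entry. This sketch uses plain version vectors modeled as dicts from replica name to counter; it is not Riak’s dotted version vector format, and the function names are made up for illustration.

```python
from typing import Dict

VersionVector = Dict[str, int]  # replica name -> number of writes seen from it


def descends(a: VersionVector, b: VersionVector) -> bool:
    """True if a has seen everything b has seen (missing entries count as 0)."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: VersionVector, b: VersionVector) -> str:
    """Classify the relationship between two versions of a value."""
    a_sees_b = descends(a, b)
    b_sees_a = descends(b, a)
    if a_sees_b and b_sees_a:
        return "equal"
    if b_sees_a:
        return "before"      # a happened before b: b can overwrite a
    if a_sees_b:
        return "after"       # b happened before a: a can overwrite b
    return "concurrent"      # neither knows about the other: keep both as siblings


def merge(a: VersionVector, b: VersionVector) -> VersionVector:
    """Vector for a value that merges both siblings: the per-replica maximum."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}
```

For example, {"n1": 2} and {"n1": 1, "n2": 1} are concurrent, because each has seen a write the other has not, and the merged vector {"n1": 2, "n2": 1} descends from both.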
Resources We Like
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann (Amazon)
Past episode discussions on Designing Data-Intensive Applications (Coding Blocks)
Designing Data-Intensive Applications – Data Models: Relational vs Document (episode 123)
NewSQL (Wikipedia)
Do not allow Jeff Bezos to return to Earth (Change.org)
Man Invests $20 in Obscure Cryptocurrency, Becomes Trillionaire Overnight, at Least Temporarily (Newsweek)
Quantifying Eventual Consistency with PBS (Bailis.org)
Riak Distributed Data Types (Riak.com)
Tip of the Week
A GitHub repo with a list of “falsehoods”: common things that people believe but that aren’t true, targeted at the kinds of assumptions programmers might make when they are working in domains they are less familiar with. (GitHub)
The Linux at command lets you easily schedule commands to run in the future. It’s really user friendly, so you can be lazy with how you specify the command, for example echo "command_to_be_run" | at 09:00 or at 09:00 -f /path/to/some/executable (linuxize.com)
You can try Kotlin online at play.kotlinlang.org. It’s an online editor with links to lots of examples. (play.kotlinlang.org)
Docker re-runs a COPY command whenever the files being copied have changed. You can use a .dockerignore to skip files that you don’t care about, trimming unnecessary work and build times. (docs.docker.com)