In CockroachDB, why doesn't an uncertainty restart update the upper bound of the uncertainty window?

I recently read this great article from the CockroachDB blog, which talks about how they maintain consistency in a similar way to Spanner but without atomic clocks. Here is the relevant part of the post:
When CockroachDB starts a transaction, it chooses a provisional commit
timestamp based on the current node's wall time. It also establishes
an upper bound on the selected wall time by adding the maximum clock
offset for the cluster [commit timestamp, commit timestamp + maximum
clock offset]. This time interval represents the window of
uncertainty.
...
It's only when a value is observed to be within the
uncertainty interval that CockroachDB-specific machinery kicks in. The
central issue here is that given the clock offsets, we can't say for
certain whether the encountered value was committed before our
transaction started. In such cases, we simply make it so by performing
an uncertainty restart, bumping the provisional commit timestamp just
above the timestamp encountered. Crucially, the upper bound of the
uncertainty interval does not change on restart, so the window of
uncertainty shrinks. Transactions reading constantly updated data from
many nodes may be forced to restart multiple times, though never for
longer than the uncertainty interval, nor more than once per node.
Specifically, I don't understand why the upper bound of the window of uncertainty doesn't also have to be bumped during an uncertainty restart. Here is an example to illustrate my confusion:
Imagine we have two writes, A and B (on different nodes). Write A has a timestamp of 3 and B a timestamp of 5. Assuming a maximum clock offset of 3 units of time, if we start a transaction on a node whose clock currently reads 1, we will construct an uncertainty window of [1, 4]. When we come across write A, we will perform an uncertainty restart to include it and reduce the uncertainty window to (3, 4]. When we come across write B, we will ignore it as it lies above the upper bound of the uncertainty window. But, as our maximum clock offset is 3 units and A and B's timestamps are less than 3 units apart, B could have happened before A. We have included A but not B in our transaction, so we don't have consistency.
Thanks in advance for pointing out what I am missing!

Great question. This is pretty subtle; I'll try to explain what's going on.
First let's look at the transactions involved. We have two writing transactions, A (ts=3) and B (ts=5). If there were any overlap in the keys touched by these two transactions, they would not be able to commit "out of order" and we could be certain that transaction A happened before transaction B. But if there were no overlapping keys, then there is no point of coordination that would guarantee this order except the clock, and since the timestamps are close enough together, it's ambiguous which one "happened first".
What we'd like to do in the presence of this ambiguity is to assign some order: to deem the transactions to have happened in the order implied by their timestamps. This is safe to do as long as we know that no one observed a state in which B was written but not A. This is enforced by the mechanisms inside CockroachDB that govern read/write interactions (mainly the poorly-named TimestampCache). If transaction B were followed by a transaction C at timestamp 6 which read the keys written by both A and B, transaction A would no longer be allowed to commit at timestamp 3 (it would be pushed to timestamp 7).
This works as long as the database can see the relationships between reads and writes. There are sometimes situations in which we can't see this, if write B has some out-of-band causal dependency on A but there was never a transaction that touched both keys. This is the anomaly that we call "causal reverse" and is a difference between CockroachDB's serializable isolation and Spanner's linearizable (or strict serializable) isolation level.
Updating the upper bound of the uncertainty interval would be one way to avoid the causal reverse anomaly, although it would have severe practical drawbacks - you may have to keep restarting over and over until you find a gap of at least your max clock offset between writes on all the keys your transaction touches. This may never succeed.
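To make the quoted rule concrete, here is a tiny illustrative sketch in Python (not CockroachDB's actual code, just the restart rule from the blog post applied to the numbers in your example; the function and variable names are made up):

```python
# Illustrative sketch of the uncertainty-restart rule described above.
# Not CockroachDB code; just the quoted logic applied to the question's numbers.

def classify(value_ts, read_ts, upper_bound):
    """Decide what to do with a value encountered at timestamp value_ts."""
    if value_ts <= read_ts:
        return "visible"               # definitely committed before we started
    if value_ts <= upper_bound:
        return ("restart", value_ts)   # ambiguous: bump read_ts, keep upper_bound
    return "ignored"                   # definitely after our transaction started

# The question's example: clock reads 1, max offset 3 -> interval [1, 4].
read_ts, upper_bound = 1, 4
print(classify(3, read_ts, upper_bound))   # write A (ts=3) -> ('restart', 3)
read_ts = 3                                # bumped by the restart; upper bound stays 4
print(classify(5, read_ts, upper_bound))   # write B (ts=5) -> 'ignored'
```

As the sketch shows, B really is ignored; the point above is that ignoring it is still serializable, just not strictly serializable (the causal-reverse anomaly).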

Related

How to share data between microservices without sync RPC (use topics as changelogs) and deal with consistency?

I learned about using Kafka topics as a changelog to avoid doing synchronous RPC, but I don't understand how we deal with consistency, since topics are not retained forever (retention policy).
i.e., I run an application with 2 microservices:
The User Service is used to update users' data in the system (address, first name, phone number, ...).
The Shipping Service uses users' data to create a shipping order and send it to the shipping company's system.
Each service has its own database to persist the users' data. To communicate any changes made to a user's data, the Confluent course instructor proposed creating a topic and using it as a changelog: the User Service publishes the changes, and other microservices can consume them.
But what if:
User X changed his address 1 year ago
the retention policy of the changelog is 6 months
today we add BillingService to the system.
The BillingService won't know User X's address, so its view is inconsistent. Should I run a one-time "call UserService to copy its full DB" step when a new service enters the system? That seems like an ugly solution.
More tricky and challenging:
We have a changelog with a retention policy of time T
A consumer service has been down for longer than time T
Therefore, it will potentially miss some changelog entries. How do we deal with that? We can never be confident that the service knows everything it needs to know about the users.
I did some research but found nothing. I really think I don't have enough vocabulary yet to do good research, as the problem sounds pretty common. Sorry if a source dedicated to this problem exists and I did not find it!
If the changelog topic is covering entities that are of unbounded lifetime (like your users, hopefully), that strongly suggests that the retention period for that topic should be infinite. Chances are that topic is sufficiently low volume that infinite retention is viable (consider that it can probably be partitioned).
If for some reason that's not viable, you can arrange for producers to periodically (at some interval shorter than the retention period) publish a "this is the current state of this entity" record for every entity they own. For entities which don't change very much, this is pretty wasteful and duplicative (but for those, a very long or infinite retention period is more viable); for entities which do change a lot, it's a rounding error in terms of volume.
That neatly solves the first case and eventually allows the second to be solved. For the second, there is basically no complete solution: you have to choose the retention period for a topic such that you can guarantee that no consumer of the topic will ever be down (or not yet deployed) for longer than the retention period. In practice this means that a retention period shorter than, say, 7 days should be heavily scrutinized. Note that if you have a 1-week retention period and a consumer has been down for more than a few days, you can temporarily bump up the retention period to buy time for the consumer to get fixed. And if there's a consumer which can be down for more than a week without anybody noticing, how important is that consumer, really?
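As a concrete illustration of why long or infinite retention makes bootstrapping painless, here is a rough sketch of a new service (your hypothetical BillingService) rebuilding its local view by replaying the changelog from the beginning. The topic name, serialization, and the kafka-python client are assumptions, not part of your setup:

```python
# Rough sketch: a new service rebuilds its view of users by replaying the
# changelog topic from offset 0. Assumes an infinitely retained, keyed topic
# named "user-changelog" and the kafka-python client; adapt to your stack.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-changelog",
    bootstrap_servers="localhost:9092",
    group_id="billing-service",
    auto_offset_reset="earliest",      # start from the very beginning of the log
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    key_deserializer=lambda b: b.decode("utf-8") if b else None,
)

users = {}  # local materialized view, keyed by user id
for msg in consumer:
    # Later events for the same key overwrite earlier ones, so replaying the
    # whole log converges on the current state of every user.
    users[msg.key] = msg.value
```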
This is quite a common issue in replication: a node goes offline for a significant amount of time. For example, a node's hardware completely fails and it takes weeks to order and receive a new one.
In that case, in distributed systems, we don't do failure recovery; instead we provision a new node as a replacement. That new node is completely empty, so it needs some initial state.
If your queue has all events since the beginning of time, you could apply those events one by one to the node - that would do the job - but in a very inefficient way (imagine processing years of data).
There is a better process - first restore data for the new node from the most recent backup, and then reapply newer items.
Backing up data is important. Every microservice should be responsible for saving and restoring its own data. As a result, the original Kafka topic won't have to keep data forever.
As a quick summary: in distributed replication these are two different problems - catching up a lagging node and provisioning a new node. And if a node has been lagging for a long time, it becomes a provisioning problem.

How to run a Kafka Canary Consumer

We have a Kafka queue with two consumers, both reading from the same partition (fan-out scenario). One of those consumers should be the canary and process 1% of the messages, while the other processes the remaining 99%.
The idea is to make the decision based on a property of the message, e.g. the message ID or timestamp (say, taken mod 100), and accept or drop based on that, just with reversed logic for canary and non-canary.
Now we are facing the issue of how to do this robustly, e.g. reconfigure the percentages while running and avoid losing messages or processing them twice. It appears this escalates into a distributed consensus problem to keep the decision logic in sync, which we would very much like to avoid, even though we could just use ZooKeeper for that.
Is this a viable strategy, or are there better ways to do this? Possibly one that avoids consensus?
Update: Unfortunately, the Kafka cluster is not under our control, and we cannot make any changes.
Update 2: Message latency is not a huge issue; a few hundred milliseconds of added latency are okay and won't be noticed.
I don't see any way to change the "sampling strategy" across two machines without "ignoring" or double-processing records. Since different Kafka consumers can be at different positions in the partition, and can also pick up the new config at different times, you'd inevitably run into one of two scenarios:
Double processing of the same record by both machines
"Skipping" a record because neither machine thinks it should "own" it when it sees it.
I'd suggest a small change to your architecture instead:
Have the 99% machine (the non-canary) pick up all records, then decide for every record if it wants to handle it, or if it belongs to the canary
If it belongs to the canary, send the record to a 2nd topic (from the 99% machine)
Canary machine only listens on the 2nd topic, and processes every arriving record
And now you have a pipeline setup where decisions are only ever made in one point and no records are missed or double processed.
The obvious downside is somewhat higher latency on the canary machine. If you absolutely cannot tolerate the latency, you could push the decision of which topic to produce to upstream to the producers (I don't know how feasible that is for you).
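A rough sketch of the consumer side of that pipeline (the topic names, the mod-100 rule, and the kafka-python client are all assumptions of mine, not something your setup dictates):

```python
# Sketch of the non-canary (99%) consumer: it reads everything from the main
# topic, processes most records itself, and forwards ~1% to a canary topic.
import zlib
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092",
                         group_id="main-processor")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

CANARY_PERCENT = 1  # can be changed at runtime; only this process decides

def is_canary(msg):
    # Deterministic decision based on a property of the message (here a CRC
    # of the key, falling back to the value).
    return zlib.crc32(msg.key or msg.value) % 100 < CANARY_PERCENT

def process(msg):
    ...  # your normal 99% processing path

for msg in consumer:
    if is_canary(msg):
        producer.send("orders-canary", key=msg.key, value=msg.value)
    else:
        process(msg)
```

Because the split decision is made in exactly one place, changing CANARY_PERCENT on the fly can never cause a record to be skipped or processed twice.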
Variant in case a 2nd topic isn't allowed
If (as you've stated above) you can't have a 2nd topic, you could still make the decision only on the 99% machine, then, for records that need to go to the canary, re-produce them into the origin partition with some sort of "marker" (either in the payload or as a Kafka header, up to you).
The 99% machine will ignore any incoming records with a marker, and the canary machine will only process records with a marker.
Again, the major downside is added latency.

Postgres Sequence jumping by 32

We use sequences to maintain the number of orders at a particular business unit.
Over the last few days, we have noticed strange jumps ranging from 1 to 32 in the sequence numbers, multiple times a day.
The sequence which we have configured has a cache_value of 1.
Why is this happening and how can we resolve this?
I couldn't find much documentation about this.
PostgreSQL sequences can jump by 32 when you promote a secondary server to primary, and sometimes when PostgreSQL crashes hard and has to recover.
I have only seen it skip 32, and I haven't been able to change it by changing the sequence's cache value.
https://www.postgresql.org/message-id/1296642753.8673.29.camel%40gibralter, and the response https://www.postgresql.org/message-id/20357.1296659633%40sss.pgh.pa.us, indicate that this is expected behavior in a crash condition. I'm unsure if this only occurs if you're using Streaming Replication or not.
I thought it was a safety feature, in case the primary had committed records that had not been replicated to the secondary at the time of promotion. Then a human would be able to manually dump and copy the records from the old-primary to the new-primary before wiping the old-primary.
This could indicate that your primary is crashing and recovering several times per day, which is not a good state to be in.

What is CRDT in Distributed Systems?

I am a newbie in distributed systems and I am trying to get some insight into the concept of CRDTs.
I understand that the acronym has three expansions:
Conflict-free Replicated Data Type
Convergent Replicated Data Type
Commutative Replicated Data Type
Can anyone give an example where we use CRDT in distributed systems?
Thanks a lot in advance.
CRDTs are inspired by the work of Marc Shapiro. In distributed computing, a conflict-free replicated data type (abbreviated CRDT) is a type of specially-designed data structure used to achieve strong eventual consistency (SEC) and monotonicity (absence of rollbacks). There are two alternative routes to ensuring SEC: operation-based CRDTs and state-based CRDTs.
CRDTs on different replicas can diverge from one another, but in the end they can always be safely merged, providing an eventually consistent value. In other words, CRDTs have a merge method that is idempotent, commutative and associative.
The two alternatives are equivalent, as one can emulate the other, but operation-based CRDTs require additional guarantees from the communication middleware. CRDTs are used to replicate data across multiple computers in a network, executing updates without the need for remote synchronization. This would lead to merge conflicts in systems using conventional eventual consistency technology, but CRDTs are designed such that conflicts are mathematically impossible. Under the constraints of the CAP theorem they provide the strongest consistency guarantees for available/partition-tolerant (AP) settings.
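To make those merge properties concrete, here is a minimal sketch of the classic state-based CRDT, a grow-only counter (G-Counter). It is a textbook example and not tied to any particular library:

```python
# Minimal state-based CRDT: a grow-only counter (G-Counter).
# Each replica increments only its own slot; merge takes the element-wise
# maximum, which is idempotent, commutative and associative.

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max: safe to apply any number of times, in any order.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)   # replicas exchange state in either order
assert a.value() == b.value() == 5
```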
Some examples where they are used
Riak is the most popular open-source CRDT implementation and is used by Bet365 and League of Legends. Below are some useful links about these uses of Riak.
1- Bet365 (Uses Erlang and Riak)
http://www.erlang-factory.com/static/upload/media/1434558446558020erlanguserconference2015bet365michaelowen.pdf
2- League of Legends uses the Riak CRDT implementation for its in-game chat system (which handles 7.5 million concurrent users and 11,000 messages per second)
3- Roshi, implemented by SoundCloud, supports an LWW time-stamped set:
-Blog post: https://developers.soundcloud.com/blog/roshi-a-crdt-system-for-timestamped-events
CRDTs use math to enforce consistency across a distributed cluster, without having to worry about consensus and the associated latency/unavailability.
The set of values that a CRDT can take at any time forms a semi-lattice (specifically a join semi-lattice), which is a poset (partially ordered set) with a least upper bound (LUB) function.
In simple terms, a poset is a collection of items in which not all pairs are comparable. E.g. in a set of pairs {(2,4), (4,5), (2,1), (6,3)}, (2,4) < (4,5), but (2,4) can't be compared with (6,3) (since one component is larger and the other smaller). A join semi-lattice is a poset in which, given two elements, even if you can't compare them, you can always find an element greater than or equal to both (the LUB).
Another condition is that updates to this data type need to be increasing: CRDTs have monotonically increasing state, and clients never observe state rolling back.
This excellent article uses the set of pairs above as an example. For a CRDT maintaining those values, if two replicas are trying to reconcile (4,5) and (6,3), they can pick the LUB (6,5) and assign it to both replicas. Since the values only ever increase, this is a safe value to settle on.
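In code, that LUB is just the component-wise maximum of two pairs, and it has exactly the properties a CRDT merge needs (a tiny sketch using the same pairs):

```python
# Join (least upper bound) of two pairs under component-wise ordering.
def lub(p, q):
    return (max(p[0], q[0]), max(p[1], q[1]))

print(lub((4, 5), (6, 3)))                         # (6, 5): >= both inputs
assert lub((4, 5), (4, 5)) == (4, 5)               # idempotent
assert lub((4, 5), (6, 3)) == lub((6, 3), (4, 5))  # commutative
```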
There are two ways for CRDT replicas to keep in sync with each other: they can transfer state periodically (convergent replicated data type), or they can transfer updates (deltas) as they get them (commutative replicated data type). The former takes more bandwidth.
SoundCloud's Roshi is a good example (though it seems to no longer be in development). It stores data associated with a timestamp, where the timestamp is always increasing. Any update arriving with a timestamp less than or equal to the one stored is discarded, which ensures idempotency (repeated writes are OK) and commutativity (out-of-order writes are OK: applying update1 followed by update2 gives the same result as update2 followed by update1).
Writes are sent to all clusters, and if certain nodes fail to respond due to an issue like slowness or a partition, they're expected to catch up later via read repair, which ensures that the values converge. The convergence can be achieved via the two approaches I mentioned above: propagating state or propagating updates to other replicas. I believe Roshi does the former. As part of the read repair, replicas exchange state, and because the data adheres to the semi-lattice property, they converge.
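Here is a rough sketch of that last-write-wins rule (Roshi itself implements a timestamped set, so this per-key register map is a simplification of mine, not its actual design):

```python
# Sketch of a last-write-wins (LWW) map: an update is applied only if its
# timestamp is newer than the stored one, so replay is idempotent and
# out-of-order delivery is harmless.

class LWWMap:
    def __init__(self):
        self.entries = {}  # key -> (timestamp, value)

    def apply(self, key, timestamp, value):
        current = self.entries.get(key)
        if current is None or timestamp > current[0]:
            self.entries[key] = (timestamp, value)
        # else: stale or duplicate update; discard it

    def merge(self, other):
        # Merging replica state is just replaying the other replica's entries.
        for key, (ts, val) in other.entries.items():
            self.apply(key, ts, val)

r1, r2 = LWWMap(), LWWMap()
r1.apply("song:1", 10, "liked")
r2.apply("song:1", 12, "unliked")
r1.merge(r2)
r2.merge(r1)   # merge order doesn't matter; both replicas converge
assert r1.entries == r2.entries == {"song:1": (12, "unliked")}
```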
PS: Systems using CRDTs are eventually consistent, i.e. they choose AP (highly available and partition-tolerant) in terms of the CAP theorem.
Another excellent read on the subject.
Those three expansions of the acronym all mean basically the same thing.
A CRDT is convergent if the same operations applied in a different sequence produce (converge to) the same result. That is, the operations can be commuted - it's a commutative RDT. The reason the operations can be applied in a different sequence and still give the same result is that the operations are conflict-free.
So CRDT means the same thing, whichever of the three expansions you use - though personally I prefer "Convergent".

Is it possible to use a Cassandra table as a basic queue

Is it possible to use a table in Cassandra as a queue? I don't think the strategy I use in MySQL works, i.e. given this table:
create table message_queue(id integer, message varchar(4000), retries int, sending boolean);
We have a transaction that marks the row as "sending", tries to send, and then either deletes the row, or increments the retries count. The transaction ensures that only one server will be attempting to process an item from the message_queue at any one time.
There is an article on DataStax that describes the pitfalls and how to get around them; however, I'm not sure what the impact of having lots of tombstones lying around is. How long do they stay around for?
Don't do this. Cassandra is a terrible choice as a queue backend unless you are very, very careful. You can read more of the reasons in Jonathan Ellis's blog post "Cassandra anti-patterns: Queues and queue-like datasets" (which might be the post you're alluding to). MySQL is also not a great choice for backing a queue; use a real queue product like RabbitMQ, it's great and very easy to use.
The problem with using Cassandra as the storage for a queue is this: every time you delete a message you write a tombstone for that message. Every time you query for the next message Cassandra will have to trawl through those tombstones and deleted messages and try to determine the few that have not been deleted. With any kind of throughput the number of read values versus the number of actual live messages will be hundreds of thousands to one.
Tuning GC grace and other parameters will not help, because that only applies to how long tombstones will hang around after a compaction, and even if you dedicated the CPUs to only run compactions you would still have dead-to-live ratios of tens of thousands or more. And even with a GC grace of zero, tombstones will hang around after compactions in some cases.
There are ways to mitigate these effects, and they are outlined in Jonathan's post, but here's a summary (and I don't write this to encourage you to use Cassandra as a queue backend, but because it explains a bit more about how Cassandra works, and should help you understand why it's a bad fit for the problem):
To avoid the tombstone problem you cannot keep using the same queue, because it will fill up with tombstones quicker than compactions can get rid of them, and your performance will run straight into a brick wall. If you add a column to the primary key that is deterministic and depends on time, you can avoid some of the performance problems, since fewer tombstones have time to build up and Cassandra will be able to completely remove old rows and all their tombstones.
Using a single row per queue also creates a hotspot. A single node will have to handle that queue, and the rest of the nodes will be idle. You might have lots of queues, but chances are that one of them will see much more traffic than the others, and that means you get a hotspot. Shard the queues over multiple nodes by adding a second column to the primary key. It can be a hash of the message (for example crc32(message) % 60 would create 60 shards; don't use too small a number). When you want to find the next message you read from all of the shards and pick one of the results, ignoring the others. Ideally you find a way to combine this with something that depends on time, so that you fix that problem too while you're at it.
If you sort your messages by time of arrival (for example with a TIMEUUID clustering key) and can somehow keep track of the newest message that has been delivered, you can do a query to find all messages after that one. That would mean less trawling through tombstones for Cassandra, but it is no panacea.
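To make the sharding and time-bucketing mitigations concrete, here is a rough sketch using the Python cassandra-driver. The schema, the 60 shards, the hourly buckets, and the keyspace name are all illustrative assumptions of mine, and again this is not an endorsement of the pattern:

```python
# Illustrative sketch of the "shard + time bucket" mitigation, assuming a
# table like:
#   CREATE TABLE message_queue (
#       shard int, bucket text, id timeuuid, message text,
#       PRIMARY KEY ((shard, bucket), id)
#   );
from datetime import datetime, timezone
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")
NUM_SHARDS = 60  # matches the crc32(message) % 60 sharding mentioned above

def next_candidates():
    """Read one candidate message from every shard of the current time bucket."""
    bucket = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H")
    candidates = []
    for shard in range(NUM_SHARDS):
        rows = session.execute(
            "SELECT id, message FROM message_queue "
            "WHERE shard = %s AND bucket = %s LIMIT 1",
            (shard, bucket),
        )
        candidates.extend(rows)
    return candidates  # pick one, process it, then delete it by primary key
```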
Then there's the issue of acknowledgements. I'm not sure if they matter to you, but it looks like you have some kind of locking mechanism in your schema (I'm thinking of the retries and sending columns). This will not work. Until Cassandra 2.0 and its compare-and-swap feature there is no way to make that work correctly. To implement a lock you need to read the value of the column, check that it's not locked, then write that it is now locked. Even with consistency level ALL, another application node can do the same operations at the same time, and both end up thinking that they locked the message. With CAS in Cassandra 2.0 it will be possible to do this atomically, but at the cost of performance.
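With Cassandra 2.0's lightweight transactions, the claim step would look roughly like this (the keyspace name is an assumption, and the schema is the question's table with id as the primary key):

```python
# Rough sketch of claiming a message with a Cassandra lightweight transaction
# (compare-and-swap), available from Cassandra 2.0 onwards.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

def try_claim(message_id):
    """Atomically flip `sending` from false to true; only one worker wins."""
    result = session.execute(
        "UPDATE message_queue SET sending = true "
        "WHERE id = %s IF sending = false",
        (message_id,),
    )
    return result.one().applied  # True only for the worker that won the CAS

# Usage: only the worker for which try_claim(...) returns True sends the
# message; everyone else skips it. Each such claim costs a Paxos round trip.
```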
There are a couple more answers here on Stack Overflow about Cassandra and queues; read them (start with this one: "Table with heavy writes and some reads in Cassandra. Primary key searches taking 30 seconds").
The grace period can be configured. By default it is 10 days:
gc_grace_seconds
(Default: 864000 [10 days]) Specifies the time to wait before garbage
collecting tombstones (deletion markers). The default value allows a
great deal of time for consistency to be achieved prior to deletion.
In many deployments this interval can be reduced, and in a single-node
cluster it can be safely set to zero. When using CLI, use gc_grace
instead of gc_grace_seconds.
Taken from the documentation.
On a different note, I do not think that implementing a queue pattern in Cassandra is very useful. To prevent your workers from processing one entry twice, you need to enforce "ALL" read consistency, which defeats the purpose of a distributed database system.
I highly recommend looking at specialized messaging systems which support the queue pattern natively. Take a look at RabbitMQ, for instance. You will be up and running in no time.
Theo's answer about not using Cassandra for queues is spot on.
Just wanted to add that we have been using Redis sorted sets for our queues and it has been working pretty well. Some of our queues have tens of millions of elements and are accessed hundreds of times per second.
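For reference, a minimal sketch of that pattern with redis-py (the key name and the use of the enqueue time as the score are assumptions):

```python
# Minimal sketch of a queue backed by a Redis sorted set, using redis-py.
import time
import redis

r = redis.Redis(host="localhost", port=6379)

def enqueue(payload: bytes):
    # Score = enqueue time, so the oldest element is always popped first.
    r.zadd("work:queue", {payload: time.time()})

def dequeue():
    # ZPOPMIN is atomic, so two workers never pop the same element.
    popped = r.zpopmin("work:queue", count=1)
    return popped[0][0] if popped else None  # entries are (member, score) tuples

enqueue(b"send-email:42")
print(dequeue())  # b'send-email:42'
```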