How to share data between microservices without sync RPC (use topics as changelogs) and deal with consistency? - apache-kafka

I learned about using Kafka topics as a changelog to avoid doing synchronous RPC, but I don't understand how we deal with consistency, since topics do not keep data forever (retention policy).
For example, I run an application with two microservices:
The User Service is used to update users' data in the system (address, first name, phone number...).
The Shipping Service uses users' data to create a shipping order and send it to the shipping company's system.
Each service has its own DB to persist the users' data. To communicate any changes made to a user's data, the Confluent instructor proposed creating a topic and using it as a changelog. The User Service publishes the changes, and other microservices can consume them.
But what if:
User X changed his address 1 year ago
the retention policy of the changelog is 6 months
today we add BillingService to the system.
The BillingService won't know User X's address, so its view is inconsistent. Should I run a one-time "call UserService to copy its full DB" when a new service enters the system? That seems like an ugly solution.
More tricky and challenging:
We have a changelog with a retention policy of time T
A consumer service is down for longer than T
Therefore, it will potentially miss some changelog entries. How do we deal with that? We can never be confident that the service knows everything it has to know about the users.
I did some research, but found nothing. I really think I don't have enough vocabulary yet to do good research, as the problem sounds pretty common. Sorry if a source dedicated to this problem exists and I did not find it!

If the changelog topic is covering entities that are of unbounded lifetime (like your users, hopefully), that strongly suggests that the retention period for that topic should be infinite. Chances are that topic is sufficiently low volume that infinite retention is viable (consider that it can probably be partitioned).
If for some reason that's not viable, you can arrange for producers to publish "this is the state of this entity" to the topic for every entity they own, at an interval shorter than the retention period. For entities which don't change very much, this is pretty wasteful and duplicative (but for those a very long to infinite retention period is more viable); for entities which do change a lot, this is a rounding error in terms of volume.
That neatly solves the first case and eventually allows for the second to be solved. For the second, there is basically no solution, which means that you have to choose the retention period for a topic such that you can guarantee that no consumer of this topic will ever be down (or not deployed) for longer than the retention period: this typically means that a retention period shorter than, say, 7 days, should be really heavily scrutinized. Note that if you have a 1 week retention period and a consumer has been down for more than a few days, you can temporarily bump up the retention period to buy you time for the consumer to get fixed, and if there's a consumer which can be down for more than a week without anybody noticing, how important is that consumer, really?
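To make the periodic "this is the state of this entity" idea concrete, here is a minimal Python sketch. It is a pure in-memory simulation, not Kafka client code: Record, apply_retention, republish_state and build_view are hypothetical names, and timestamps are plain integers (think "days").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    timestamp: int  # logical time, e.g. days since some epoch (assumption)
    key: str        # entity id
    value: dict     # full state of the entity at that time


def apply_retention(log, now, retention):
    """Drop records older than the retention period, as the broker would."""
    return [r for r in log if now - r.timestamp <= retention]


def republish_state(log, owner_state, now):
    """Producer side: append the current state of every owned entity."""
    return log + [Record(now, key, dict(value)) for key, value in owner_state.items()]


def build_view(log):
    """Consumer side: read what is retained; the latest record per key wins."""
    view = {}
    for record in sorted(log, key=lambda r: r.timestamp):
        view[record.key] = record.value
    return view
```

With a retention of 180 and a republish every 90, a consumer that joins at day 365 still finds a record for a user whose last real change was at day 0. Without the republish, the retained log is empty for that user.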

This is quite a common issue in replication: a node goes offline for a significant amount of time. For example, a node's hardware has completely failed or been lost, and it takes weeks to order and receive a new one.
In that case, in distributed systems, we don't do failure recovery; we provision a new node as a replacement. That new node is completely empty, hence it needs some initial state.
If your queue has all events since the beginning of time, you could apply those events one by one to the node - that would do the job - but in a very inefficient way (imagine processing years of data).
There is a better process - first restore data for the new node from the most recent backup, and then reapply newer items.
Backing up data is important. Every microservice should be responsible for saving and restoring its own data. As a result, the original Kafka system won't have to keep data forever.
As a quick summary: in distributed replication these are two different problems, catching up a lagging node and provisioning a new node. And if a node has been lagging for a long time, it becomes a provisioning problem.
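A minimal sketch of "restore from the most recent backup, then reapply newer items", in Python. The event shape (user_id, field, value), the function names, and the idea that the backup records the offset of the last event it contains are all assumptions made for illustration.

```python
def apply_event(state, event):
    """Apply one changelog event (user_id, field, value) to the state."""
    user_id, field, value = event
    state.setdefault(user_id, {})[field] = value


def provision_node(snapshot, snapshot_offset, log):
    """Restore from a backup, then replay only events newer than the backup.

    `log` is a list of (offset, event) pairs still retained in the queue,
    in offset order. `snapshot_offset` is the last offset in the backup.
    """
    oldest_retained = log[0][0] if log else snapshot_offset + 1
    if oldest_retained > snapshot_offset + 1:
        # The queue no longer holds the events right after the backup.
        raise ValueError("backup is older than the retained log; events were lost")
    state = {user: dict(fields) for user, fields in snapshot.items()}
    last_offset = snapshot_offset
    for offset, event in log:
        if offset > snapshot_offset:
            apply_event(state, event)
            last_offset = offset
    return state, last_offset
```

The ValueError branch is the condition that matters for choosing a retention period: backups only help if they are taken more often than the log is truncated.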

Related

Stateful Kafka Stream - How to restore state?

If I am running my stream application - appA on Machine A and then I moved it to Machine B; will it remember the earlier state?
When I write a simple consumer, it remembers the last offset, which gets stored in __consumer_offsets on the broker. So no matter where I start the consumer, it will pick up from that place.
Is there such a construct for stateful stream processing applications? If I am calculating the continuous Profit and Loss of my portfolio, I need to start from where it was at the last run and then aggregate new transactions onto that earlier P&L number. I cannot afford to process all messages again from the start of time. I have been having a hard time finding an article that explains how to solve this problem.
No, it won't remember state unless you move the state store as well (state.dir configuration).
Otherwise, the changelog topic will need to be read from the earliest offset to rebuild the state.
There are presentations about running Kafka Streams in Kubernetes that cover some aspects of this, since Kubernetes can stop and relocate its pods... But Kubernetes also has volume management features that may not be available in your scenario.
It might therefore be best to run your job on both machines to start, then you have fault tolerance, high availability with a warm standby replica / partitioned state.

What is the better way to have a statistical information among the events in Kafka?

I have a project where I need to provide statistical information via an API to external services. In this service I use only Kafka as "storage". When the application starts, it reads the last week of events from the cluster and computes some values. It then actively listens for new events to update the information. An example of such information is "how many times item x was sold".
Startup of the application takes a lot of time and brings some other problems with it. It is a Kubernetes service, and the readiness probe fails from time to time when reading the last week of events takes too long.
Two alternatives came to my mind to replace the entire logic:
Kafka Streams or KSQL (I'm not sure if I will need same amount of memory and computation unit here)
Cache Database
I'm wondering which idea would be better here? Or is there any idea better than them?
First, I hope this is a compacted topic that you are reading, otherwise, your "x times" will be misleading as data is deleted from the topic.
Any option you choose will require reading from the beginning of the topic, so the solution will come down to starting a persistent consumer that either:
Stores data on disk in RocksDB (such as a Kafka Streams or KSQL KTable)
Stores data in some other database of your choice. Redis would be a good option, but so would Couchbase if you want to use Memcached
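As a rough illustration of what "persistent consumer" buys you at startup, here is a Python sketch that keeps the counts and the next offset to read in the same on-disk store. SQLite is used only so the example is self-contained (it is a stand-in, not one of the options named above), the list `events` stands in for the topic, and all function names are hypothetical.

```python
import sqlite3


def open_store(path):
    """Open (or create) the on-disk store for counts and consumer position."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts (item TEXT PRIMARY KEY, n INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS progress "
        "(id INTEGER PRIMARY KEY CHECK (id = 0), next_offset INTEGER NOT NULL)"
    )
    conn.execute("INSERT OR IGNORE INTO progress (id, next_offset) VALUES (0, 0)")
    conn.commit()
    return conn


def next_offset(conn):
    """Offset of the first event that has not been counted yet."""
    return conn.execute("SELECT next_offset FROM progress WHERE id = 0").fetchone()[0]


def process(conn, offset, item):
    """Count one 'item sold' event and advance the offset in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT OR IGNORE INTO counts (item, n) VALUES (?, 0)", (item,))
        conn.execute("UPDATE counts SET n = n + 1 WHERE item = ?", (item,))
        conn.execute("UPDATE progress SET next_offset = ? WHERE id = 0", (offset + 1,))


def consume(conn, events):
    """Process only events at or after the stored offset."""
    for offset in range(next_offset(conn), len(events)):
        process(conn, offset, events[offset])


def count(conn, item):
    """How many times `item` was sold, according to the store."""
    row = conn.execute("SELECT n FROM counts WHERE item = ?", (item,)).fetchone()
    return row[0] if row else 0
```

After a restart the service only reads what arrived since the stored offset, so startup time no longer depends on how much history the topic holds. Updating the count and the offset in one transaction keeps a crash from double counting an event.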

How to run a Kafka Canary Consumer

We have a Kafka queue with two consumers, both read from the same partition (fan-out scenario). One of those consumers should be the canary and process 1% of the messages, while the other processes the 99% remaining ones.
The idea is to make the decision based on a property of the message, e.g. the message ID or timestamp (e.g. mod 100), and accept or drop based on that, just with reversed logic for canary and non-canary.
Now we are facing the issue of how to do so robustly, e.g. reconfigure percentages while running and avoid losing messages or processing them twice. It appears this escalates to a distributed consensus problem to keep the decision logic in sync, which we would very much like to avoid, even though we could just use ZooKeeper for that.
Is this a viable strategy, or are there better ways to do this? Possibly one that avoids consensus?
Update: Unfortunately the Kafka Cluster is not under our control, and we cannot make any changes.
Update 2: Latency of messages is not a huge issue; a few hundred ms added are okay and won't be noticed.
I don't see any way to change the "sampling strategy" across 2 machines without "ignoring" or double-processing records. Since different Kafka consumers could be at different positions in the partition, and could also get the new config at different times, you'd inevitably run into one of 2 scenarios:
Double processing of the same record by both machines
"Skipping" a record because neither machine thinks it should "own" it when it sees it.
I'd suggest a small change to your architecture instead:
Have the 99% machine (the non-canary) pick up all records, then decide for every record if it wants to handle it, or if it belongs to the canary
If it belongs to the canary, send the record to a 2nd topic (from the 99% machine)
Canary machine only listens on the 2nd topic, and processes every arriving record
And now you have a pipeline setup where decisions are only ever made in one point and no records are missed or double processed.
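A small Python sketch of that single decision point. The hash-mod-100 rule follows the question's idea; zlib.crc32 is just one stable hash to use for it, and `handle` and `forward_to_canary` are hypothetical callbacks standing in for "process locally" and "produce to the 2nd topic".

```python
import zlib


def belongs_to_canary(message_id: str, canary_percent: int) -> bool:
    """Stable decision from a message property: hash of the ID modulo 100."""
    return zlib.crc32(message_id.encode("utf-8")) % 100 < canary_percent


def route(records, canary_percent, handle, forward_to_canary):
    """Runs on the non-canary consumer only, so each record is decided once."""
    for message_id, payload in records:
        if belongs_to_canary(message_id, canary_percent):
            forward_to_canary(message_id, payload)  # e.g. produce to the 2nd topic
        else:
            handle(message_id, payload)
```

Because only this consumer reads `canary_percent`, the percentage can change between two records without any record being skipped or handled twice; no coordination with the canary is needed.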
The obvious downside is somewhat higher latency on the canary machine. If you absolutely cannot tolerate the latency, could you push the decision of which topic to produce to upstream, to the producers? (I don't know how feasible that is for you.)
Variant in case a 2nd topic isn't allowed
If (as you've stated above) you can't have a 2nd topic, you could still make the decision only on the 99% machine, then for records that need to go to the canary, re-produce them into the origin partition with some sort of "marker" (either in the payload or as a Kafka header, up to you).
The 99% machine will ignore any incoming records with a marker, and the canary machine will only process records with a marker.
Again, the major downside is added latency.
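The marker variant can be simulated in a few lines of Python. The partition is modeled as a plain list of dicts, and the header name "canary" is a made-up marker; a real setup would use a Kafka producer and consumer in place of the list operations.

```python
def run_non_canary(partition, is_canary_record, handle):
    """99% machine: decides for every unmarked record, skips marked ones."""
    offset = 0
    while offset < len(partition):  # the partition grows as records are re-produced
        record = partition[offset]
        offset += 1
        if record["headers"].get("canary"):
            continue  # already routed to the canary, not ours
        if is_canary_record(record):
            marked = {**record, "headers": {**record["headers"], "canary": True}}
            partition.append(marked)  # re-produce into the origin partition
        else:
            handle(record)


def run_canary(partition, handle):
    """Canary machine: processes only records carrying the marker."""
    for record in partition:
        if record["headers"].get("canary"):
            handle(record)
```

Each original record is either handled by the 99% machine or re-produced once with the marker, so the canary never has to know the current percentage.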

Is keep logging messages in group communication service or paxos practical?

In the case of a network partition or node crash, most distributed atomic broadcast protocols (like Extended Virtual Synchrony or Paxos) require the running nodes to keep logging messages until the crashed or partitioned node rejoins the cluster. When a node rejoins the cluster, replaying the logged messages is enough to regain the current state.
My question is: if the partitioned or crashed node takes a really long time to rejoin the cluster, then eventually the logs will overflow. This seems to be a very practical issue, but still no one talks about it in their papers. Is there a very obvious solution to this which I am missing? Or is my understanding incorrect?
You don't really need to remember the whole log. Imagine for example that the state you were synchronizing between the nodes was something like an SQL table with a row of the form (id: int, name: string) and the commands that would be written into the logs were in a form "insert row with id=x and name=y", "delete row where id=z", "set name=a where id=1000",...
Once such commands were committed, all you really care about is the final table. Then once a node which was offline for a long time goes online, it would only need to download the table + few entries from the log that were committed while the table was being downloaded.
This is called "log compaction"; check out chapter 7 of the Raft paper for more info.
There are a few potential solutions to the infinite log problem, but one of the more popular ones for replicated state machines is to periodically snapshot the full replicated state machine and delete all history prior to that point. A node that has been offline too long would then just discard all of its information, download the snapshot, and start replaying the replicated log from that point.
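Here is a minimal Python sketch of snapshot-plus-truncation, using the (id, name) table and the three command kinds from the example above. The tuple encoding of commands and the class and method names are assumptions for illustration; real protocols also track terms, membership, and so on.

```python
def apply_command(table, command):
    """Apply one committed command to the (id -> name) table."""
    op = command[0]
    if op == "insert":
        _, row_id, name = command
        table[row_id] = name
    elif op == "delete":
        _, row_id = command
        table.pop(row_id, None)
    elif op == "set_name":
        _, row_id, name = command
        if row_id in table:
            table[row_id] = name
    else:
        raise ValueError(f"unknown command: {op}")


class ReplicatedLog:
    def __init__(self):
        self.snapshot = {}       # table state as of snapshot_index
        self.snapshot_index = 0  # number of commands folded into the snapshot
        self.entries = []        # committed commands after the snapshot

    def append(self, command):
        self.entries.append(command)

    def compact(self):
        """Fold all logged commands into the snapshot and drop them."""
        for command in self.entries:
            apply_command(self.snapshot, command)
        self.snapshot_index += len(self.entries)
        self.entries = []

    def catch_up(self):
        """What a long-offline node does: copy the snapshot, replay the tail."""
        table = dict(self.snapshot)
        for command in self.entries:
            apply_command(table, command)
        return table
```

The log's size is then bounded by how often compact() runs, not by how long a node stays away.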

Is it possible to use a cassandra table as a basic queue

Is it possible to use a table in Cassandra as a queue? I don't think the strategy I use in MySQL works, i.e. given this table:
create table message_queue(id integer, message varchar(4000), retries int, sending boolean);
We have a transaction that marks the row as "sending", tries to send, and then either deletes the row, or increments the retries count. The transaction ensures that only one server will be attempting to process an item from the message_queue at any one time.
There is an article on DataStax that describes the pitfalls and how to get around them; however, I'm not sure what the impact of having lots of tombstones lying around is. How long do they stay around for?
Don't do this. Cassandra is a terrible choice as a queue backend unless you are very, very careful. You can read more of the reasons in Jonathan Ellis's blog post "Cassandra anti-patterns: Queues and queue-like datasets" (which might be the post you're alluding to). MySQL is also not a great choice for backing a queue; use a real queue product like RabbitMQ, it's great and very easy to use.
The problem with using Cassandra as the storage for a queue is this: every time you delete a message you write a tombstone for that message. Every time you query for the next message Cassandra will have to trawl through those tombstones and deleted messages and try to determine the few that have not been deleted. With any kind of throughput the number of read values versus the number of actual live messages will be hundreds of thousands to one.
Tuning GC grace and other parameters will not help, because that only applies to how long tombstones will hang around after a compaction, and even if you dedicated the CPUs to only run compactions you would still have dead-to-live ratios of tens of thousands or more. And even with a GC grace of zero, tombstones will hang around after compactions in some cases.
There are ways to mitigate these effects, and they are outlined in Jonathan's post, but here's a summary (and I don't write this to encourage you to use Cassandra as a queue backend, but because it explains a bit more about how Cassandra works, and should help you understand why it's a bad fit for the problem):
To avoid the tombstone problem you cannot keep using the same queue, because it will fill up with tombstones quicker than compactions can get rid of them and your performance will run straight into a brick wall. If you add a column to the primary key that is deterministic and depends on time you can avoid some of the performance problems, since fewer tombstones have time to build up and Cassandra will be able to completely remove old rows and all their tombstones.
Using a single row per queue also creates a hotspot. A single node will have to handle that queue, and the rest of the nodes will be idle. You might have lots of queues, but chances are that one of them will see much more traffic than the others and that means you get a hotspot. Shard the queues over multiple nodes by adding a second column to the primary key. It can be a hash of the message (for example crc32(message) % 60 would create 60 shards, don't use a too small number). When you want to find the next message you read from all of the shards and pick one of the results, ignoring the others. Ideally you find a way to combine this with something that depends on time, so that you fix that problem too while you're at it.
If you sort your messages by time of arrival (for example with a TIMEUUID clustering key) and can somehow keep track of the newest message that has been delivered, you can do a query to find all messages after that message. That would mean less trawling through tombstones for Cassandra, but it is no panacea.
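To show how the time-dependent column and the shard column combine, here is a Python sketch that only computes the partition key a writer would use and the keys a reader would have to query. It makes no Cassandra calls. The 60 shards come from the crc32(message) % 60 example above; the one-hour bucket size and the function names are assumptions.

```python
import zlib
from datetime import datetime, timezone

SHARD_COUNT = 60       # from the crc32(message) % 60 example
BUCKET_SECONDS = 3600  # assumption: one time bucket per hour


def partition_key(queue_name, message, arrived_at):
    """Key under which a writer stores the message: (queue, time bucket, shard)."""
    bucket = int(arrived_at.timestamp()) // BUCKET_SECONDS
    shard = zlib.crc32(message.encode("utf-8")) % SHARD_COUNT
    return (queue_name, bucket, shard)


def partitions_to_read(queue_name, now):
    """Keys a reader queries to find the next message in the current bucket."""
    bucket = int(now.timestamp()) // BUCKET_SECONDS
    return [(queue_name, bucket, shard) for shard in range(SHARD_COUNT)]
```

A reader would also need to look at the previous bucket until it is drained. Old buckets are never written to again, which is what lets Cassandra eventually drop them along with their tombstones.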
Then there's the issue of acknowledgements. I'm not sure if they matter to you, but it looks like you have some kind of locking mechanism in your schema (I'm thinking of the retries and sending columns). This will not work. Until Cassandra 2.0 and its compare-and-swap features there is no way to make that work correctly. To implement a lock you need to read the value of the column, check if it's not locked, then write that it should now be locked. Even with consistency level ALL another application node can do the same operations at the same time, and both end up thinking that they locked the message. With CAS in Cassandra 2.0 it will be possible to do this atomically, but at the cost of performance.
There are a couple more answers here on StackOverflow about Cassandra and queues; read them (start with this: Table with heavy writes and some reads in Cassandra. Primary key searches taking 30 seconds).
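The race is easy to see in a deterministic Python simulation. Cell is a hypothetical stand-in for the sending column of one row, and the two "workers" are just interleaved steps in one thread; this models the logic only, not Cassandra itself.

```python
class Cell:
    """Stand-in for the `sending` column of one queue row."""

    def __init__(self):
        self.sending = False

    def read(self):
        return self.sending

    def write(self, value):
        self.sending = value

    def compare_and_set(self, expected, new):
        """Models an atomic check-and-write; succeeds for one caller only."""
        if self.sending == expected:
            self.sending = new
            return True
        return False


def naive_lock_interleaved(cell):
    """Workers A and B both read before either writes."""
    seen_by_a = cell.read()
    seen_by_b = cell.read()
    a_locked = b_locked = False
    if not seen_by_a:
        cell.write(True)
        a_locked = True
    if not seen_by_b:
        cell.write(True)
        b_locked = True
    return a_locked, b_locked


def cas_lock(cell):
    """Same two workers, but each uses a single compare-and-set."""
    a_locked = cell.compare_and_set(False, True)
    b_locked = cell.compare_and_set(False, True)
    return a_locked, b_locked
```

naive_lock_interleaved(Cell()) returns (True, True): both workers believe they own the message. cas_lock(Cell()) returns (True, False). In Cassandra 2.0 and later the CAS step corresponds to a conditional write (lightweight transaction) with an IF clause.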
The grace period can be defined. Per default it is 10 days:
gc_grace_seconds
(Default: 864000 [10 days]) Specifies the time to wait before garbage
collecting tombstones (deletion markers). The default value allows a
great deal of time for consistency to be achieved prior to deletion.
In many deployments this interval can be reduced, and in a single-node
cluster it can be safely set to zero. When using CLI, use gc_grace
instead of gc_grace_seconds.
Taken from the documentation.
On a different note, I do not think that implementing a queue pattern in Cassandra is very useful. To prevent your worker from processing one entry twice, you need to enforce "ALL" read consistency, which defeats the purpose of distributed database systems.
I highly recommend looking at specialized systems like messaging systems which support the queue pattern natively. Take a look at RabbitMQ for instance. You will be up and running in no time.
Theo's answer about not using Cassandra for queues is spot on.
Just wanted to add that we have been using Redis sorted sets for our queues and it has been working pretty well. Some of our queues have tens of millions of elements and are accessed hundreds of times per second.