Prediction/Estimation of missing intervals inside Apache Kafka process - apache-kafka

Goal is to process raw readings (15min and 1h interval) from external remote meters (assets) in real time.
Process is defined using simple Apache Kafka producer/consumer and multiple Spring Boot microservices to deduplicate messages, transform (map) readings to our system (instead external codes insert internal IDS and similar stuff) and insert in TimescaleDB (extension of PostgreSql).
Everything seems fine, but there is requirement to perform real time prediction/estimation of missing intervals.
Simple example for one meter and 15 minute readings:
On day 1 we got all readings. We process them and have them ingested in our DB.
On day 2 we are missing all readings - so process is not even
started for this meter.
On day 3 we again got all readings - but only for day 3. Now we need
to predict that whole day 2 is missing and create empty readings and
then estimate them by some algorithm (that is not that important
now).
My question here, is there any way or idea how to do this without querying existing database in one of the microservices and checking if something is missing?
Is it possible to check previous messages in Kafka topics and based on that do the prediction/estimation (kafka streams? - I don't get them at all) and is that even smart to do, or there is any other way/idea to do it?

Personal opinion disclaimer
It is not reasonably possible to check previous messages in Kafka Streams. If you are hellbent on doing it, you could probably try to seek messages and re-consume them but Kafka will fight you every step on the way. The mental model is, that you are transforming or aggregating data that comes in in real time. If you need to query something about previous data, you ought to have collected that information when that data was coming through.
What could work (rather well even) is to separate the prediction of missing data from the transformation.
Create two consumers for the stream.
Have one topology (or whatever it is that does your transformations already) transform the data and load it back into Kafka and from there to timescaledb.
Have one topology (or another microservice) that does what is needed to predict missing data. Your usecase of backfilling a missing day could be handled by something like a count based on daily windows
Make that trigger your backfilling either as part of that topology or as a subsequent microservice and load that data to timescaledb as well.
Are you already using Kafka Streams for the transformations? This would be a classical usecase.
The recognition of missing data not so much
As far as I understand it does not require high throughput. More the opposite. You want to know if there is no data.
As far as I understand it latency is not a (main) concern.
Kafka Streams could be useful if you need to take automated action within seconds after data stops coming in. But even then, you could just write throughput metrics and trigger alerts in this case.
Pther than that, it is a very stateful problem and stream processing is at its best if you can treat every message separately reduce them in a "standard" manner like sums or counts.
I got the impression, that a delay of a few hours / a day is not that tragic and currently the backfilling might be done manually. In this case the cot of Kafka Streams would outweigh the benefits.

Related

What is the better way to have a statistical information among the events in Kafka?

I've a project where I need to provide statistical information via API to the external services. In the mentioned service I use only Kafka as a "storage". When the application starts it reads events from cluster for 1 week and counts some values. And actively listens to new events to update the information. For example information is "how many times x item was sold" etc.
Startup of the application takes a lot of time and brings some other problems with it. It is a Kubernetes service and readiness probe fails time to time, when reading last 1 weeks events takes much time.
Two alternatives came to my mind to replace the entire logic:
Kafka Streams or KSQL (I'm not sure if I will need same amount of memory and computation unit here)
Cache Database
I'm wondering which idea would be better here? Or is there any idea better than them?
First, I hope this is a compacted topic that you are reading, otherwise, your "x times" will be misleading as data is deleted from the topic.
Any option you chose will require reading from the beginning of the topic, so the solution will come down to starting a persistent consumer that:
Stores data on disk (such as Kafka Streams or KSQL KTable) in RocksDB
Some other database of your choice. Redis would be a good option, but so would Couchbase if you want to use Memcached

Kafka vs. MongoDB for time series data

I'm contemplating on whether to use MongoDB or Kafka for a time series dataset.
At first sight obviously it makes sense to use Kafka since that's what it's built for. But I would also like some flexibility in querying, etc.
Which brought me to question: "Why not just use MongoDB to store the timestamped data and index them by timestamp?"
Naively thinking, this feels like it has the similar benefit of Kafka (in that it's indexed by time offset) but has more flexibility. But then again, I'm sure there are plenty of reasons why people use Kafka instead of MongoDB for this type of use case.
Could someone explain some of the reasons why one may want to use Kafka instead of MongoDB in this case?
I'll try to take this question as that you're trying to collect metrics over time
Yes, Kafka topics have configurable time retentions, and I doubt you're using topic compaction because your messages would likely be in the form of (time, value), so the time could not be repeated anyway.
Kafka also provides stream processing libraries so that you can find out averages, min/max, outliers&anamolies, top K, etc. values over windows of time.
However, while processing all that data is great and useful, your consumers would be stuck doing linear scans of this data, not easily able to query slices of it for any given time range. And that's where time indexes (not just a start index, but also an end) would help.
So, sure you can use Kafka to create a backlog of queued metrics and process/filter them over time, but I would suggest consuming that data into a proper database because I assume you'll want to be able to query it easier and potentially create some visualizations over that data.
With that architecture, you could have your highly available Kafka cluster holding onto data for some amount of time, while your downstream systems don't necessarily have to be online all the time in order to receive events. But once they are, they'd consume from the last available offset and pickup where they were before
Like the answers in the comments above - neither Kafka nor MongoDB are well suited as a time-series DB with flexible query capabilities, for the reasons that #Alex Blex explained well.
Depending on the requirements for processing speed vs. query flexibility vs. data size, I would do the following choices:
Cassandra [best processing speed, best/good data size limits, worst query flexibility]
TimescaleDB on top of PostgresDB [good processing speed, good/OK data size limits, good query flexibility]
ElasticSearch [good processing speed, worst data size limits, best query flexibility + visualization]
P.S. by "processing" here I mean both ingestion, partitioning and roll-ups where needed
P.P.S. I picked those options that are most widely used now, in my opinion, but there are dozens and dozens of other options and combinations, and many more selection criteria to use - would be interested to hear about other engineers' experiences!

Should Apache Flink be used for historical querying

I'm designing an event-sourced architecture based around Kafka, and using Flink for stream processing.
One use case will be the querying (filtering and sorting of results) of historical trade data that has passed through the Kafka topic over time. e.g. "Give me all trades in the last 5 years with these attributes, sorted by xx". Total trade history will be around 10m, increasing by say 1m/year.
Is Flink itself the right tool for such historical queries, and able to do so with reasonable performance (a few seconds)? Or am I better feeding the events from Kafka into an indexable/queryable data store like MongoDB/RDBMS, and using that for historical queries?
Doing the former feels like it'll adhere more closely to a Kappa Architecture, whereas resorting to a historical db feels like I'm moving away from that back towards a Lambda architecture.
Flink is well suited to process historic data from a Kafka topic (or any other data source) due to its support for event-time processing, i.e., time-based processing based on timestamps in the records not based on the clock of the processing machine (aka processing-time).
If you only want to perform analytics, you might want to have a look at Flink's SQL support.

Is it possible to use a cassandra table as a basic queue

Is it possible to use a table in cassandra as a queue, I don't think the strategy I use in mysql works, ie given this table:
create table message_queue(id integer, message varchar(4000), retries int, sending boolean);
We have a transaction that marks the row as "sending", tries to send, and then either deletes the row, or increments the retries count. The transaction ensures that only one server will be attempting to process an item from the message_queue at any one time.
There is an article on datastax that describes the pitfalls and how to get around it, however Im not sure what the impact of having lots of tombstones lying around is, how long do they stay around for?
Don't do this. Cassandra is a terrible choice as a queue backend unless you are very, very careful. You can read more of the reasons in Jonathan Ellis blog post "Cassandra anti-patterns: Queues and queue-like datasets" (which might be the post you're alluding to). MySQL is also not a great choice for backing a queue, us a real queue product like RabbitMQ, it's great and very easy to use.
The problem with using Cassandra as the storage for a queue is this: every time you delete a message you write a tombstone for that message. Every time you query for the next message Cassandra will have to trawl through those tombstones and deleted messages and try to determine the few that have not been deleted. With any kind of throughput the number of read values versus the number of actual live messages will be hundreds of thousands to one.
Tuning GC grace and other parameters will not help, because that only applies to how long tombstones will hang around after a compaction, and even if you dedicated the CPUs to only run compactions you would still have dead to live rations of tens of thousands or more. And even with a GC grace of zero tombstones will hang around after compactions in some cases.
There are ways to mitigate these effects, and they are outlined in Jonathan's post, but here's a summary (and I don't write this to encourage you to use Cassandra as a queue backend, but because it explains a bit more about Cassandra works, and should help you understand why it's a bad fit for the problem):
To avoid the tombstone problem you cannot keep using the same queue, because it will fill upp with tombstones quicker than compactions can get rid of them and your performance will run straight into a brick wall. If you add a column to the primary key that is deterministic and depends on time you can avoid some of the performance problems, since fewer tombstones have time to build up and Cassandra will be able to completely remove old rows and all their tombstones.
Using a single row per queue also creates a hotspot. A single node will have to handle that queue, and the rest of the nodes will be idle. You might have lots of queues, but chances are that one of them will see much more traffic than the others and that means you get a hotspot. Shard the queues over multiple nodes by adding a second column to the primary key. It can be a hash of the message (for example crc32(message) % 60 would create 60 shards, don't use a too small number). When you want to find the next message you read from all of the shards and pick one of the results, ignoring the others. Ideally you find a way to combine this with something that depends on time, so that you fix that problem too while you're at it.
If you sort your messages after time of arrival (for example with TIMEUUID clustering key) and can somehow keep track of the newest messages that has been delivered, you can do a query to find all messages after that message. That would mean less thrawling through tombstones for Cassandra, but it is no panacea.
Then there's the issue of acknowledgements. I'm not sure if they matter to you, but it looks like you have some kind of locking mechanism in your schema (I'm thinking of the retries and sending columns). This will not work. Until Cassandra 2.0 and it's compare-and-swap features there is no way to make that work correctly. To implement a lock you need to read the value of the column, check if it's not locked, then write that it should now be locked. Even with consistency level ALL another application node can do the same operations at the same time, and both end up thinking that they locked the message. With CAS in Cassandra 2.0 it will be possible to do atomically, but at the cost of performance.
There are a couple of more answers here on StackOverflow about Cassandra and queues, read them (start with this: Table with heavy writes and some reads in Cassandra. Primary key searches taking 30 seconds.
The grace period can be defined. Per default it is 10 days:
gc_grace_secondsĀ¶
(Default: 864000 [10 days]) Specifies the time to wait before garbage
collecting tombstones (deletion markers). The default value allows a
great deal of time for consistency to be achieved prior to deletion.
In many deployments this interval can be reduced, and in a single-node
cluster it can be safely set to zero. When using CLI, use gc_grace
instead of gc_grace_seconds.
Taken from the
documentation
On a different note, I do not think that implementing a queue pattern in Cassandra is very useful. To prevent your worker to process one entry twice, you need to enforce "ALL" read consistency, which defeats the purpose of distributed database systems.
I highly recommend looking at specialized systems like messaging systems which support the queue pattern natively. Take a look at RabbitMQ for instance. You will be up and running in no time.
Theo's answer about not using Cassandra for queues is spot on.
Just wanted to add that we have been using Redis sorted sets for our queues and it has been working pretty well. Some of our queues have tens of millions of elements and are accessed hundreds of times per second.

How to build a fault-tolerant app in Storm?

The short version of the question: how to build a fail-safe word count program (topology) in Twitter Storm that produces accurate results even when failure occurs? Is that even possible?
Long version: I am studying Twitter Storm and trying to understand how it should be used. I have followed the tutorial and find it a very simple concept. But the word count example outlined in the tutorial is not fault tolerant (because bolts save some data in memory). Saving the same data in back-end DB however leads to double counting if an event is re-submitted to the start of chain (which happens when some of the bolts fail).
Should I see Twitter Storm as real-time platform for producing partially accurate results and still depend on MapReduce to get the accurate ones?
It really depends on what kind of failure your trying to hege against. There are a few things that you can do:
Storm bolts are supposed to ack a tuple only after they have processed it. If you write your spouts and bolts and topology to use this, you can implement an "exactly one time" system which will guarantee accuracy.
Kafka can be a good way to put data into Storm because it uses disk persistance to keep messages around for a long time even after they are consumed. This means you can retrieve them if there's a failure by a consumer down the line.
In general though, it's difficult to guarantee that things are processed exactly once in any streaming system. This is a known problem, and it is a very difficult problem to solve efficiently.
Storm has the concept of transactional topologies. In practice, this means you will want to process items in batches, then commit to your database at the end of the batch, storing the transaction ID in the database alongside a count. This also has the practical benefit of reducing the load on your database with fewer inserts.
Batches are processed in parallel and may be replayed on failure, but are guaranteed to be committed in order. This is important because it makes it safe to write code that fetches the current count row, checks the transaction ID against the one in memory, and if the two differ (meaning it is an uncommitted batch), adding the new count to the existing one and committing that updated count.
See the following link for much more information and code examples:
https://github.com/nathanmarz/storm/wiki/Transactional-topologies