How to use Kafka Streams to Split Messages into Slow and Fast Tracks

I have a stream of messages to be processed by an app written in Kafka Streams; a small subset of those messages require external DB lookups to be processed.
I believe this DB is too big to be streamed and too much to cache.
Is there a way to split the stream into Fast and Slow streams so the slow one doesn't interfere with the fast one?
I have thought of the following 3 options, but I was hoping there might be something simpler or more efficient:
1) Let the messages be distributed evenly; since the volume of the ones that require reading from the DB is low, they wouldn't affect the overall throughput badly (latency is not a problem).
2) Use a special key for the slow ones so they get assigned to one partition (I own the producer), but then it's hard to scale the slow ones, there is no guarantee that they won't interfere with the fast ones, and it requires messing with the producer.
3) Write the slow ones to a separate topic altogether (a sketch of this is below).
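For option 3, the split could be done inside the Streams app itself with the branching API (KStream#split, available in newer Kafka Streams versions; older ones use the deprecated branch() variant), with each branch written to its own topic. A minimal sketch, assuming a hypothetical needsDbLookup() predicate and illustrative topic names:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Branched;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Named;
import java.util.Map;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> input = builder.stream("input-topic");

// Branch on a hypothetical predicate that marks messages needing the DB lookup.
Map<String, KStream<String, String>> branches = input
        .split(Named.as("track-"))
        .branch((key, value) -> needsDbLookup(value), Branched.as("slow"))
        .defaultBranch(Branched.as("fast"));

// The slow branch goes to its own topic so a separate app (or consumer group)
// can do the DB lookups without holding up the fast path.
branches.get("track-slow").to("slow-topic");
branches.get("track-fast").to("fast-output-topic");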

Related

Best Architecture for offloading high usage Key in Kafka

We have been using Kafka for various use cases and have solved various problems. But one problem we frequently face is that messages for a single key are suddenly produced at a much higher rate, or take a few seconds each to process, so that messages for the other keys in the queue are delayed.
We have implemented various ways to detect those keys and offload them to a separate queue backed by a pool of topics (a simplified sketch of that routing is below). But the topics in the pool keep increasing, and we find that we are not using the topic resources efficiently.
If we have 100 such keys, then we need to create 100 such topics, which does not seem like an optimised solution.
In these types of cases, should we store the data in the DB where the particular key's data resides and implement our own queue on top of that table, or is there some other mechanism that can solve this problem?
This problem only affects keys with a high data rate and a high processing time (3 to 5 s). Can anyone suggest a better architecture for these types of cases?
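A simplified sketch of that producer-side routing, collapsing the topic pool into a single overflow topic; the topic names, the hot-key set, and the HotKeyRouter class are illustrative assumptions, not part of the original setup:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Set;

// Hypothetical router: instead of one topic per hot key, send all known hot
// keys to a single overflow topic and let a dedicated consumer group drain it.
public class HotKeyRouter {
    private final KafkaProducer<String, String> producer;
    private final Set<String> hotKeys; // assumed to be maintained elsewhere, e.g. from rate metrics

    public HotKeyRouter(KafkaProducer<String, String> producer, Set<String> hotKeys) {
        this.producer = producer;
        this.hotKeys = hotKeys;
    }

    public void send(String key, String value) {
        String topic = hotKeys.contains(key) ? "events-overflow" : "events-main";
        producer.send(new ProducerRecord<>(topic, key, value));
    }
}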

Redis Streams for implementing a Messaging System (chat) app versus traditional approaches

I'm implementing a chat app, which will support both one-on-one conversation and Group conversations.
So far the direction was to use Redis Pub/Sub with PostgreSQL as the cold storage, and WebSocket being the transport.
Every user will fetch the history from PostgreSQL upon launch (up until the timestamp of the WebSocket + Redis connection), and then subscribe to channels named after their own user_id.
However, having a round trip to a DBMS with each new message sounds a bit strange, while definitely doable and legit.
So I decided to examine other approaches. One possible approach was to use Kafka and eliminate the need for a DBMS altogether.
It sounds viable and comes with its own set of advantages.
But turns out there's a new kid on the block - Redis Streams.
From what I gather, it is actually quite similar to Kafka in this specific scenario (chat).
It has many nice features that sound very convenient for implementing a chat system.
And now I am trying to understand whether Streams + disk persistence is the wise way to go, versus Kafka, versus PostgreSQL + Redis pub/sub.
The main aspects in consideration are:
Performance. Postgres and Kafka both operate on disk, meaning slower than the in-memory operations in the case of Redis. On the other hand, the messages obviously must be persisted and available at all times and in any event, so Redis would be persisted to disk as well. Wouldn't that negate the whole in-memory performance gain?
And even if not - would the performance gain under peak load and a big database be noticeable?
Memory / Costs. With redis these two are closely tied together. As a small startup, the efforts are focused on being ready to cope with sudden scale peaks (up to a million users), but at the same time - the costs should be minimized.
Is storing millions of messages in Streams going to be too memory-costly which in turn will translate to financially-costly?
Recovery, Reliability & Availability, Persistency. With Postgres, even a single instance can handle a big traffic load, and it can also offer master-slave setups and consistency. Can Redis match that? Also, with a DBMS I can be assured that the data is there to stay. Can I know that with Redis?
Scaling.

Kafka vs. MongoDB for time series data

I'm contemplating whether to use MongoDB or Kafka for a time series dataset.
At first sight it obviously makes sense to use Kafka, since that's what it's built for. But I would also like some flexibility in querying, etc.
Which brought me to the question: "Why not just use MongoDB to store the timestamped data and index it by timestamp?"
Naively, this feels like it has a similar benefit to Kafka (in that it's indexed by time offset) but with more flexibility. But then again, I'm sure there are plenty of reasons why people use Kafka instead of MongoDB for this type of use case.
Could someone explain some of the reasons why one may want to use Kafka instead of MongoDB in this case?
I'll take this question as meaning that you're trying to collect metrics over time.
Yes, Kafka topics have configurable time retention, and I doubt you're using topic compaction, because your messages would likely be in the form of (time, value), so the time could not be repeated anyway.
Kafka also provides stream processing libraries so that you can find averages, min/max, outliers & anomalies, top-K, etc. over windows of time.
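For example, a per-key count over one-minute tumbling windows is only a few lines in the Streams DSL (a sketch assuming a newer Kafka Streams version and made-up topic names; averages, min/max, etc. follow the same shape with aggregate()):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;
import java.time.Duration;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> metrics = builder.stream("metrics"); // (metricName, value) records

metrics
    .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
    .count()
    .toStream()
    // re-key by "metricName@windowStart" so the window boundary is visible downstream
    .map((windowedKey, count) -> KeyValue.pair(
            windowedKey.key() + "@" + windowedKey.window().startTime(), count))
    .to("metrics-counts-per-minute", Produced.with(Serdes.String(), Serdes.Long()));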
However, while processing all that data is great and useful, your consumers would be stuck doing linear scans of this data, not easily able to query slices of it for any given time range. And that's where time indexes (not just a start index, but also an end) would help.
So, sure, you can use Kafka to create a backlog of queued metrics and process/filter them over time, but I would suggest consuming that data into a proper database, because I assume you'll want to be able to query it more easily and potentially create some visualizations over that data.
With that architecture, you could have your highly available Kafka cluster holding onto data for some amount of time, while your downstream systems don't necessarily have to be online all the time in order to receive events. But once they are back online, they'd consume from the last available offset and pick up where they left off.
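A plain consumer doing that hand-off could look like the sketch below; the topic name, group id, and the writeToDatabase step are placeholders, not anything from the question:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "metrics-to-db");  // committed offsets let the sink resume later
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("metrics"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
            // writeToDatabase(record); -- placeholder for the actual sink logic
        }
        consumer.commitSync(); // only advance once the batch is safely stored
    }
}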
Like the answers in the comments above - neither Kafka nor MongoDB is well suited as a time-series DB with flexible query capabilities, for the reasons that @Alex Blex explained well.
Depending on the requirements for processing speed vs. query flexibility vs. data size, I would make the following choices:
Cassandra [best processing speed, best/good data size limits, worst query flexibility]
TimescaleDB on top of PostgresDB [good processing speed, good/OK data size limits, good query flexibility]
ElasticSearch [good processing speed, worst data size limits, best query flexibility + visualization]
P.S. By "processing" here I mean ingestion, partitioning and roll-ups where needed.
P.P.S. I picked the options that are most widely used now, in my opinion, but there are dozens and dozens of other options and combinations, and many more selection criteria to use - I would be interested to hear about other engineers' experiences!

Kafka streams app stuck in IO Wait

I have a stream topology that reads from 2 topics, repartitions them, and then joins them using a join window of 2 days.
My problem is that the Kafka Streams state store seems to be causing reads and writes of very, very small files, which leaves my stream threads stuck in IO wait, and because of this my performance is really lagging.
Any performance suggestions?
Try using an SSD disk?
A longer comment: since Kafka is designed to be robust, it is very IO intensive, as every message is stored on disk.
BTW, what does "is lagging" mean for you, in numbers?
What throughput are you talking about?
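If the small-file churn comes from the RocksDB state stores, one knob that is sometimes tuned is a custom RocksDB config setter with larger write buffers, so flushes produce fewer, bigger files. A sketch only; the sizes are illustrative assumptions:

import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.Options;
import java.util.Map;

// Larger memtables mean fewer, bigger SST flushes, which can reduce the
// "lots of tiny files" IO pattern. The sizes below are illustrative only.
public class LargerWriteBufferConfig implements RocksDBConfigSetter {
    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        options.setWriteBufferSize(64 * 1024 * 1024L); // 64 MB memtable per store
        options.setMaxWriteBufferNumber(3);
    }

    @Override
    public void close(final String storeName, final Options options) {
        // nothing allocated above that needs to be released
    }
}

// Registered via:
// props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, LargerWriteBufferConfig.class);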

Kafka Streams - reducing the memory footprint for large state stores

I have a topology (see below) that reads off a very large topic (over a billion messages per day). The memory usage of this Kafka Streams app is pretty high, and I was looking for some suggestions on how I might reduce the footprint of the state stores (more details below). Note: I am not trying to scapegoat the state stores; I just think there may be a way for me to improve my topology - see below.
// stream receives 1 billion+ messages per day
stream
.flatMap((key, msg) -> rekeyMessages(msg))
.groupBy((key, value) -> key)
.reduce(new MyReducer(), MY_REDUCED_STORE)
.toStream()
.to(OUTPUT_TOPIC);
// stream the compacted topic as a KTable
KTable<String, String> rekeyedTable = builder.table(OUTPUT_TOPIC, REKEYED_STORE);
// aggregation 1
rekeyedTable.groupBy(...).aggregate(...)
// aggregation 2
rekeyedTable.groupBy(...).aggregate(...)
// etc
More specifically, I'm wondering if streaming the OUTPUT_TOPIC as a KTable is causing the state store (REKEYED_STORE) to be larger than it needs to be locally. For changelog topics with a large number of unique keys, would it be better to stream these as a KStream and do windowed aggregations? Or would that not reduce the footprint like I think it would (i.e. only a subset of the records - those in the window - would exist in the local state store)?
Anyways, I can always spin up more instances of this app, but I'd like to make each instance as efficient as possible. Here's my question:
Are there any config options, general strategies, etc that should be considered for Kafka Streams app with this level of throughput?
Are there any guidelines for how much memory a single instance should have? Even if you have a somewhat arbitrary guideline, it may be helpful to share it with others. One of my instances is currently utilizing 15GB of memory - I have no idea if that's good/bad/doesn't matter.
Any help would be greatly appreciated!
With your current pattern
stream.....reduce().toStream().to(OUTPUT_TOPIC);
builder.table(OUTPUT_TOPIC, REKEYED_STORE)
you get two stores with the same content. One for the reduce() operator and one for reading the table() -- this can be reduced to one store though:
KTable rekeyedTable = stream.....reduce(.);
rekeyedTable.toStream().to(OUTPUT_TOPIC); // in case you need this output topic; otherwise you can also omit it completely
This should reduce your memory usage notably.
About windowing vs non-windowing:
it's a matter of your required semantics, so simply switching from a non-windowed to a windowed reduce seems questionable.
Even if you can also go with windowed semantics, you would not necessarily reduce memory. Note that in the aggregation case, Streams does not store the raw records but only the current aggregate result (i.e., key + current aggregate). Thus, for a single key, the storage requirement is the same in both cases (a single window has the same storage requirement). At the same time, if you go with windows, you might actually need more memory, as you get one aggregate per key per window (while you get just a single aggregate per key in the non-windowed case).
The only scenario in which you might save memory is the case where your key space is spread out over a long period of time. For example, you might not get any input records for some keys for a long time. In the non-windowed case, the aggregates for those keys are stored the whole time, while in the windowed case the key/aggregate record will be dropped and a new entry re-created if records with that key occur again later (but keep in mind that you lose the previous aggregate in this case -- cf. (1)).
Last but not least, you might want to have a look into the guidelines for sizing an application: http://docs.confluent.io/current/streams/sizing.html
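For reference, the memory-related knobs that usually come up alongside those sizing guidelines look roughly like this (the values are illustrative assumptions, not recommendations):

import org.apache.kafka.streams.StreamsConfig;
import java.util.Properties;

Properties props = new Properties();
// Cap the record caches shared by all threads of this instance.
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 100L * 1024 * 1024); // ~100 MB
// Fewer stream threads means fewer RocksDB instances open at once on this host.
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);
// Bound RocksDB's own memory via a custom config setter (see the Streams docs).
// props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, MyRocksDBConfig.class);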