Kafka Streams merging messages - apache-kafka

I have a data payload which is too big for one message. Consider this Avro schema:
record Likes {...}
record Comments {...}
record Post {
Likes likes;
Comments comments;
string body;
}
Assume likes and comments are large collections; if I send them together with the post, the message will exceed the max message size, which I don't think it is correct to raise to 10-20 MB.
I want to split the one message into three: post body, comments, and likes. However, I want the database insert to be atomic, so I want to group and merge these messages in consumer memory.
Can I do it with kafka-streams?
Can I have a stream without an output topic (since the output message would again exceed the max size)?
If you have any ideas assuming the same inputs (one large message exceeding the configured max message size), please share.

Yes, you can do it with Kafka Streams, merging the messages in the datastore, and you can have a stream without an output topic. You need to make sure the three parts go to the same partition (so they go to the same instance of the application), so they should share the same key.
You may also use three topics, one for each object, and then join them (again with the same key).
But generally Kafka is designed to handle a lot of small messages and it does not work well with large messages. Maybe you should consider not sending the whole object in one message, but only incremental changes, i.e. only the information that was updated.
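For illustration, here is a minimal sketch of the "group and merge in consumer memory" idea with a plain consumer, assuming a hypothetical topic "post-parts" where all three parts are produced with the same key (the post id) and a "part" header set to "body", "likes" or "comments". insertPostAtomically() is a placeholder for your own transactional database write; a Kafka Streams topology could do the same by replacing the in-memory map with groupByKey().aggregate() over a state store and a foreach() instead of an output topic.

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.*;

public class PostPartsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "post-assembler");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        // postId -> (part name -> payload); parts are buffered until all three have arrived
        Map<String, Map<String, String>> pending = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("post-parts"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // assumes every record carries a "part" header naming which piece it is
                    String part = new String(record.headers().lastHeader("part").value(), StandardCharsets.UTF_8);
                    Map<String, String> parts = pending.computeIfAbsent(record.key(), k -> new HashMap<>());
                    parts.put(part, record.value());
                    if (parts.keySet().containsAll(Set.of("body", "likes", "comments"))) {
                        insertPostAtomically(record.key(), parts); // one database transaction
                        pending.remove(record.key());
                    }
                }
                // Simplification: committing here assumes all parts of a post arrive close together;
                // in production you would defer commits (or persist the buffer) so a restart cannot
                // lose a half-assembled post.
                consumer.commitSync();
            }
        }
    }

    static void insertPostAtomically(String postId, Map<String, String> parts) {
        // placeholder: wrap the body/likes/comments inserts in a single database transaction
    }
}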

Related

Recommended message length in a Kafka topic

I have a list of roughly 400,000 ids that I need to send to Kafka. I don't know if the best option is to send them split across n messages with x transactions, or if it is better to send one message after adjusting the limits as described in this post:
How can I send large messages with Kafka (over 15MB)?
This is a very generic question and it depends on how you want to process it.
If your consumer is capable of processing each of the id entries quickly, then you can put a lot of them into a single message.
OTOH, if the processing is slow, it's better to publish more messages (across multiple partitions), so if you use consumer groups you wouldn't get group membership loss events etc.
Not to forget, there's also a limit on message size (as you've linked), with a default of around 1 MB.
In other words - you might need to perf-test on your own side, as it's hard to make a decision with so little information.
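As a rough illustration of the batching option, here is a sketch that packs the ids into chunks so that each record stays well under the broker's default ~1 MB limit. The topic name "ids" and the batch size of 1,000 are arbitrary assumptions to be tuned for your payload size and consumer speed.

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.List;
import java.util.Properties;

public class IdBatchProducer {
    public static void sendIds(List<String> ids) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        int batchSize = 1_000; // assumed chunk size; tune so each record stays under message.max.bytes
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < ids.size(); i += batchSize) {
                // one record per chunk of ids, here comma-separated (JSON/Avro would work equally well)
                String payload = String.join(",", ids.subList(i, Math.min(i + batchSize, ids.size())));
                producer.send(new ProducerRecord<>("ids", payload));
            }
            producer.flush();
        }
    }
}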

Retrieve info from Kafka that has a field matching one value of a very long list

I am kind of new to Kafka.
I have a conceptual question.
Let's assume that there is a Kafka topic (publish subscribe) which has messages (formatted in JSON). Each message has a field called "username".
There are multiple applications consuming this topic.
Assume that we have one application that handles messages for 100,000 users. This application has the list of those 100,000 usernames. So our application needs to watch the topic and process the messages whose username field matches any one of our 100,000 usernames.
One way of doing this is to read each published message, get the username in that message, and iterate through our list of 100,000 usernames. If one name in our list matches, we process the message; otherwise we ignore it.
Is there any other, more elegant way to do this like, is there any feature in Kafka streams or consumer api to do this?
Thanks
You must consume, deserialize, and inspect every record. You can't get around the consumer API basics using any higher-level library, but yes, ksqlDB or Kafka Streams make such code easier to write, just not any more performant.
If you want to check whether a field is in a list, use a HashSet.
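For illustration, here is a minimal sketch of that HashSet check inside a Kafka Streams filter, assuming hypothetical topic names "events" and "events-for-our-users", string/JSON values, and placeholder loadUsernames()/extractUsername() helpers standing in for your own username list and JSON parsing.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.*;
import org.apache.kafka.streams.kstream.KStream;
import java.util.*;

public class UsernameFilter {
    public static void main(String[] args) {
        Set<String> usernames = new HashSet<>(loadUsernames()); // O(1) membership checks

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "username-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("events");
        events.filter((key, json) -> usernames.contains(extractUsername(json)))
              .to("events-for-our-users"); // or .foreach(...) to process matches in place

        new KafkaStreams(builder.build(), props).start();
    }

    static Collection<String> loadUsernames() { return List.of(); }   // placeholder: your 100,000 names
    static String extractUsername(String json) { return ""; }         // placeholder: parse the "username" field
}

Whatever API you use, every record is still read and deserialized; the HashSet only makes the per-record membership check constant-time.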

What defines the scope of a kafka topic

I'm looking to try out Kafka for an existing system, to replace an older message protocol. Currently we have a number of types of messages (hundreds) used to communicate among ~40 applications. Some are sent asynchronously at high rates and some are sent on request from users/events.
Now looking at Kafka, it breaks things out into topics and partitions etc. But I'm a bit confused as to what constitutes a topic. Does every type of message my applications produce get its own topic, leading to hundreds of topics, or do I cluster them together into related message types? If the latter, is it bad practice for an application to read a message and drop it when its contents are not what it's looking for?
I'm also in a dilemma where there will be upwards of 10 copies of a single application (a display), all of which get a very large amount of data (in the form of a lightweight video stream of sorts) and would send out user commands from each particular node. Would Kafka be a sufficient form of communication for this? Assume at most 10 copies, but these particular applications may not always want the video stream.
A third and final question: I read a bit about replay-ability of messages. Is this only within a single topic, or can the replay-ability go over a slew of different topics?
Kafka itself doesn't care about "types" of message. The only type it knows about is bytes, meaning that you are completely flexible in how you serialize your datasets. Note, however, that the default max message size is just 1 MB, so "streaming video/images/media" is arguably the wrong use case for Kafka alone. A protocol like RTMP would probably make more sense.
Kafka consumer groups scale horizontally, not in response to load. Consumers poll data at a rate at which they can process it. If they don't need data, they can be stopped; if they need to reprocess data, they can independently seek back to an earlier offset.
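As a small illustration of that last point, a consumer that does not use group management can assign itself a partition and seek wherever it needs to reprocess; the topic name "display-feed" here is just a placeholder.

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.*;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("display-feed", 0);
            consumer.assign(List.of(tp));          // manual assignment, no consumer group needed
            consumer.seekToBeginning(List.of(tp)); // or seek(tp, someOffset) to replay from a known offset
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}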

Mark end of logical section in Kafka when multiple partitions are used

I want to share a problem and a solution I used, as I think it may be beneficial for others, if people have any other solutions please share.
I have a table with 1,000,000 rows which I want to send to Kafka, spreading the data between 20 partitions.
I want to notify the consumer when the producer has reached the end of the data, but I don't want to have a direct connection between producer and consumer.
I know Kafka is designed as a logically endless stream of data, but I still need to mark the end of the specific table.
There was a suggestion to count the number of items per logical section, and send this data (to a metadata topic), so the consumer will be able to count items, and know when the logical section ended.
There are several disadvantages for this approach:
As data is spread between partitions, I can say there are x items in total in my logical section, but if there are multiple consumers (one per partition), they'll need to share a counter of consumed messages per logical section. I want to avoid this complexity. Also, when a consumer is stopped and resumed, it will need to know how many items were already consumed and keep that context.
A regular producer session guarantees at-least-once delivery, which means I may have duplicate messages. Counting the messages will need to take this into account (and avoid counting duplicates).
There is also the case where I don't know the number of items per logical section in advance (I'm also a kind of consumer myself, consuming a stream of events and being signaled when the data has ended), so in this case the producer would also need to keep a counter, persist it across stops and resumes, etc. Several producers would need to share that counter as well. So it adds a lot of complexity to the process.
Solution 1:
I actually want the last message in each partition to indicate that it is the last message.
I can do some work in advance: create some random message keys, send messages partitioned by key, and test which partition each message is directed to. As partitioning by key is deterministic (for a given number of partitions), I can prepare a map of keys to target partitions. For example, key 'xyz' is directed to partition #0, key 'hjk' is directed to partition #1, etc., and finally I have the reverse map: for partition 0 use key 'xyz', for partition 1 use key 'hjk', etc.
Now I can send the entire table (except for the last 20 rows) with a random partitioning strategy, so the data is spread between partitions for almost the entire table.
When I come to the last 20 rows, I'll send them partitioned by key, setting for each message a key that hashes it to a different partition. This way, each of the 20 partitions will get one of the last 20 messages. For each of the last 20 messages, I'll set a header stating that it is the last one.
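Instead of empirically sending test messages, the key-to-partition map can also be computed offline, assuming the producer uses the default partitioner for keyed records (toPositive(murmur2(keyBytes)) % numPartitions); if a custom partitioner is configured this sketch will not match. The "end-marker-N" key format is an arbitrary choice.

import org.apache.kafka.common.utils.Utils;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class LastMessageKeys {
    public static Map<Integer, String> keyPerPartition(int numPartitions) {
        Map<Integer, String> keyForPartition = new HashMap<>();
        // keep generating candidate keys until every partition has one key that lands on it
        for (int i = 0; keyForPartition.size() < numPartitions; i++) {
            String candidate = "end-marker-" + i;
            byte[] keyBytes = candidate.getBytes(StandardCharsets.UTF_8); // matches StringSerializer output
            int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
            keyForPartition.putIfAbsent(partition, candidate);
        }
        return keyForPartition; // e.g. 0 -> "end-marker-7", 1 -> "end-marker-3", ...
    }
}

With that reverse map, each of the last 20 rows can be sent with the key that lands on its target partition (and the same map works for the metadata messages in the next solution).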
Solution 2:
Similar to solution 1, but send the entire table spread across random partitions. Then send 20 metadata messages, which I'll direct to the 20 partitions using the partition-by-key strategy (by setting appropriate keys).
Solution 3:
Have an additional control topic. After the table has been sent entirely to the data topic, send a message to the control topic saying the table is complete. The consumer will need to check the control topic from time to time; once it gets the 'end of data' message, it knows that when it reaches the end of a partition, it has actually reached the end of the data for that partition. This solution is less flexible and less recommended, but I wrote it down as well.
Another solution is to use an open-source analog of S3 (e.g. minio.io). Producers can upload the data and send a message with a link to the object in storage. Consumers remove the data from object storage after collecting it.

Consume only topic messages for a given account

I have a service calculating reputation scores for accounts. It puts the calculation results in a Kafka topic called "ReputationScores". Each message looks something like this:
{ "account" : 12345, "repScore" : 98765}
I'd like my consumer to be able to consume only those messages for a specific account.
For example, I'd like to have a single consumer instance consume only "ReputationScores" messages for account 12345. That instance should probably be the only member of its consumer group.
Can Kafka filter based on message contents? What's the best way to do this?
Thanks for your help.
Can Kafka filter based on message contents?
Since Kafka itself doesn't know what's in your data, it cannot index it, therefore it's not readily searchable. You would need to process the full topic and have an explicit check for which deserialized records you want to keep. For example, this is what a stream processing application with a simple filter operation would give you.
If you want to preserve the ability to do lookups by a particular item, you will either need to write a partitioner that segregates the data you're interested in, or create a topic per item (which really only works for certain use cases, not things like individual user accounts).
You could also look at inserting all events into an in-memory database, then performing queries against that.
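As a sketch of that last idea with Kafka Streams, the topic can be materialized into a local, queryable key-value store keyed by account. This assumes the JSON values shown above, an arbitrary store name "rep-scores", and a placeholder extractAccount() standing in for your own JSON parsing.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.*;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;
import java.util.Properties;

public class ReputationLookup {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "reputation-lookup");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> scores = builder.stream("ReputationScores");
        scores.selectKey((key, json) -> extractAccount(json))  // re-key by the account field
              .toTable(Materialized.as("rep-scores"));         // latest score per account, locally queryable

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Query the local store for one account (in real code, wait for the RUNNING state first).
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("rep-scores", QueryableStoreTypes.keyValueStore()));
        System.out.println(store.get("12345"));
    }

    static String extractAccount(String json) {
        return ""; // placeholder: parse the "account" field out of the JSON value
    }
}

Note that this still consumes and deserializes the whole topic; it only makes the per-account lookup cheap afterwards.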