Consume only topic messages for a given account - apache-kafka

I have a service calculating reputation scores for accounts. It puts the calculation results in a Kafka topic called "ReputationScores". Each message looks something like this:
{ "account" : 12345, "repScore" : 98765}
I'd like my consumer to be able to consume only those messages for a specific account.
For example, I'd like a single consumer instance to consume only the messages in the "ReputationScores" topic that belong to account 12345. That instance should probably be the only member of its consumer group.
Can Kafka filter based on message contents? What's the best way to do this?
Thanks for your help.

Can Kafka filter based on message contents?
Since Kafka itself doesn't know what's in your data, it cannot index it, so the data is not readily searchable. You would need to consume the full topic and explicitly check each deserialized record to decide which ones you want to process. This is exactly what a stream processing application with a simple filter operation would do for you.
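As a rough illustration of that consume-everything-then-filter approach, here is a minimal sketch in plain Python. It assumes the values are JSON-encoded as in the question; the `polled` list is only a stand-in for records returned by a real consumer, and `matching_scores` is a hypothetical helper name.

```python
import json

TARGET_ACCOUNT = 12345  # the account from the question


def matching_scores(raw_records, account):
    """Yield decoded messages whose "account" field equals `account`."""
    for raw in raw_records:
        message = json.loads(raw)  # the consumer receives opaque bytes
        if message.get("account") == account:
            yield message


# Stand-in for record values polled from the "ReputationScores" topic.
polled = [
    b'{"account": 12345, "repScore": 98765}',
    b'{"account": 67890, "repScore": 11111}',
    b'{"account": 12345, "repScore": 98770}',
]

for message in matching_scores(polled, TARGET_ACCOUNT):
    print(message["repScore"])
```

Every record is still read and deserialized; the filter only decides which ones get processed further.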
If you want to preserve the ability to do lookups by a particular item, you will either need to make a partitioner that segments all data you're interested in, or create a topic per item (which really only works for certain use cases, not things like individual user accounts).
You could also look at inserting all events into an in-memory database, then performing queries against that.

Related

Retrieve info from Kafka that has a field matching one value of a very long list

I am kind of new to Kafka.
I have a conceptual question.
Let's assume that there is a Kafka topic (publish subscribe) which has messages (formatted in JSON). Each message has a field called "username".
There are multiple applications consuming this topic.
Assume that we have one application that handles messages for 100,000 users. This application has the list of 100,000 user names. So our application needs to watch the topic and process the messages that have the username field that matches to any one of our 100,000 user names.
One way of doing this is we read each message published and get the username in that message and iterate through the list of 100,000 usernames we have. If one name in our list matches the username, we process that, else we ignore that message.
Is there a more elegant way to do this, such as a feature in Kafka Streams or the consumer API?
Thanks
You must consume, deserialize, and inspect every record. You can't get around the consumer API basics with any higher-level library. ksqlDB or Kafka Streams do make such code easier to write, just not any more performant.
If you want to check whether a field is in a list, use a HashSet, so that each lookup is a hash lookup rather than an iteration over all 100,000 names.
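A minimal sketch of the set-based check, in Python (where the built-in `frozenset` plays the role of a HashSet). The `username` field comes from the question; `should_process` and the generated user names are hypothetical.

```python
import json


def should_process(raw, wanted):
    """Return True if the record's "username" is one we handle."""
    message = json.loads(raw)
    # Set membership is a hash lookup, not a scan over 100,000 names.
    return message.get("username") in wanted


# Hypothetical list of the 100,000 user names this application owns.
wanted = frozenset(f"user{i}" for i in range(100_000))

print(should_process(b'{"username": "user42"}', wanted))
print(should_process(b'{"username": "stranger"}', wanted))
```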

ordering across partitions in Kafka

I am writing a Kafka producer and need help with creating partitions.
I have a group and a user table. Group contains different users and at a time a user can be a part of only one group.
There can be two types of events which I will receive as input and based on that I will add them to Kafka.
The events related to users.
The events related to groups.
Whenever an event related to a group happens, all the users in that group must be updated in bulk at consumer end.
Whenever an event related to a user happens, it must be executed as such at the consumer end.
Also, I want to maintain ordering on basis of time.
If I create user level partitioning, then the bulk update won't be possible at consumer end.
If I create group level partitioning, then the parallel update of user events won't happen.
I am trying to figure out the possibilities I can try here.
Also, I want to maintain ordering on basis of time.
This means that your topics, no matter how many you have, cannot have more than one partition, because otherwise you could receive messages out of order.
Unless, of course, you implement something like sequence IDs in your messages (and can share that sequence across possibly multiple producers).
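To make the sequence ID idea concrete, here is a minimal consumer-side sketch in Python. It assumes every message carries a gap-free, globally shared sequence number starting at 0, which is itself a strong assumption; `SequenceReorderer` is a hypothetical name, not part of any Kafka client.

```python
class SequenceReorderer:
    """Releases messages strictly in sequence order, buffering early arrivals."""

    def __init__(self, first_seq=0):
        self.next_seq = first_seq  # the sequence number we are waiting for
        self.pending = {}          # seq -> payload, for messages that came early

    def accept(self, seq, payload):
        """Buffer one message and return the payloads that are now in order."""
        self.pending[seq] = payload
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


reorderer = SequenceReorderer()
print(reorderer.accept(1, "user event"))   # arrives early, so it is held back
print(reorderer.accept(0, "group event"))  # fills the gap, releasing both
```

A lost message would stall this buffer forever, so a real implementation would also need some policy for gaps.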
If I create user level partitioning, then the bulk update won't be possible at consumer end.
If I create group level partitioning, then the parallel update of user events won't happen.
It sounds like a very simple messaging design, where you have a single queue (that's actually backed by a single topic with a single partition) that's consumed by multiple users. Actually any pub-sub messaging technology would be sufficient here (e.g. RabbitMQ's fanout exchanges).
The messages on the queue contain the information whether they are group updates or user updates - the consumers then filter the input depending on what they are interested in.
To discuss an alternative: a single queue for group updates and another for user updates. I understand that this would not be enough given the ordering requirement, because a group update could be consumed independently of a user update, breaking the ordering.
From the Kafka documentation:
https://kafka.apache.org/documentation/#intro_consumers
Kafka only provides a total order over records within a partition, not
between different partitions in a topic. Per-partition ordering
combined with the ability to partition data by key is sufficient for
most applications. However, if you require a total order over records
this can be achieved with a topic that has only one partition, though
this will mean only one consumer process per consumer group.
So the best you can do is to have a single topic with a single partition.

Ingesting data from REST api to Kafka

I have many REST APIs that pull data from different data sources, and now I want to publish these REST responses to different Kafka topics. I also want to make sure that duplicate data is not produced.
Are there any tools available for this kind of operation?
In general, a Kafka processing pipeline should be able to handle messages that are sent multiple times. Exactly-once delivery of Kafka messages is a feature that has only been around since mid-2017 and Kafka 0.11 (given that I'm writing this in January 2018), so unless your Kafka installation is on the bleeding edge, your pipeline should be able to handle multiple deliveries of the same message.
That's of course your pipeline. Now you have a problem where you have a data source that may deliver the message to you multiple times, to your HTTP -> Kafka microservice.
Theoretically you should design your pipeline to be idempotent, so that multiple applications of the same change message only affect the data once. This is, of course, easier said than done. But if you manage it, then "problem solved": just send duplicate messages through, and it doesn't matter. This is probably the best thing to aim for, regardless of whatever exactly-once-delivery, CAP-theorem-bending magic KIP-98 does. (And if you don't get why this is super magic, well, here's a homework topic :) )
Let's say your input data is posts about users. If your posted data includes some kind of updated_at date, you could create a transaction log Kafka topic. Set the key to be the user ID and the value to be all the (say) updated_at fields applied to that user. When you're processing an HTTP POST, look up the user in a local KTable for that topic and check whether your post has already been recorded. If it has, don't produce the change into Kafka.
Even without the updated_at field you could save the user document in the KTable. If Kafka is a stream of transaction log data (the database inside out) then KTables are the streams right side out: a database again. If the current value in the KTable (the accumulation of all applied changes) matches the object you were given in your post, then you've already applied the changes.
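The lookup-before-produce logic could be sketched like this in Python. A plain dict stands in for the local KTable and a list stands in for the Kafka topic, so this only shows the decision logic, not a real Kafka Streams application; `DedupingIngest` and `handle_post` are hypothetical names.

```python
class DedupingIngest:
    """Produces a change only if its (user_id, updated_at) pair is new."""

    def __init__(self):
        self.seen = {}      # user_id -> set of updated_at values (KTable stand-in)
        self.produced = []  # stand-in for the Kafka topic

    def handle_post(self, user_id, updated_at, document):
        """Return True if the change was produced, False if it was a duplicate."""
        applied = self.seen.setdefault(user_id, set())
        if updated_at in applied:
            return False  # already recorded, so do not produce it again
        applied.add(updated_at)
        self.produced.append((user_id, document))
        return True


ingest = DedupingIngest()
print(ingest.handle_post("u1", "2018-01-10T12:00:00Z", {"name": "Ann"}))
print(ingest.handle_post("u1", "2018-01-10T12:00:00Z", {"name": "Ann"}))
```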

apache-kafka with 100 million topics

I'm trying to replace RabbitMQ with Apache Kafka, and while planning I bumped into several conceptual problems.
First, we use RabbitMQ with a queue-per-user policy, meaning each user gets one queue. This suits our needs because each user represents some job to be done for that particular user, and if that user causes a problem, it never affects other users because the queues are separated. (By "problem" I mean this: messages in the queue are dispatched to users via HTTP requests. If a user refuses to receive a message (server down, perhaps?), it goes back into a retry queue, so no messages are lost unless the queue itself goes down.)
Now, Kafka is fault tolerant and failure safe because it writes to disk.
And that's exactly why I am trying to bring Kafka into our architecture.
But there are problems with my plan.
First, I was thinking of creating one topic per user, so each user would have their own topic. (What problems will this cause? My max estimate is that I will have around 1~5 million topics.)
Second, if I decide to go for topics based on operation and partition by a random hash of the user's ID, and one user is currently not consuming messages, will all the users in that partition have to wait? What would be the best way to structure this situation?
So, in conclusion: 1~5 million users. We do not want one user blocking a large number of other users from being processed. Having a topic per user would solve this issue, but it seems like there might be an issue with ZooKeeper if such a large number of topics is created. (Is this true?)
What would be the best solution for structuring this, considering scalability?
First, I was thinking of creating one topic per user, so each user would have their own topic. (What problems will this cause? My max estimate is that I will have around 1~5 million topics.)
I would advise against modeling like this.
Google around for "kafka topic limits", and you will find the relevant considerations for this subject. I think you will find you won't want to make millions of topics.
Second, if I decide to go for topics based on operation and partition by a random hash of the user's ID
Yes, have a single topic for these messages and then route those messages based on the relevant field, like user_id or conversation_id. This field can be present as a field on the message and serves as the ProducerRecord key that is used to determine which partition in the topic this message is destined for. I would not include the operation in the topic name, but in the message itself.
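To illustrate how a record key maps to a partition, here is a minimal Python sketch. It uses `zlib.crc32` purely as a stand-in hash; the real Kafka clients use their own hash function, so the partition numbers here will not match what a real producer picks. `partition_for`, `build_record`, and the field names are hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a record key to a partition: same key, same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def build_record(user_id, operation, payload):
    """Key the record by user; carry the operation inside the message."""
    return {"key": user_id, "value": {"operation": operation, "payload": payload}}


record = build_record("user-42", "send_message", "hello")
first = partition_for(record["key"], 12)
second = partition_for(record["key"], 12)
print(first == second)  # the same user always lands on the same partition
```

Because every message for a given user lands on one partition, per-user ordering is preserved while different users are still spread across partitions.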
and one user is currently not consuming messages, will all the users in that partition have to wait? What would be the best way to structure this situation?
This depends on how the users are consuming messages. You could set up a timeout, after which the message is routed to some "failed" topic. Or you could send messages to users UDP-style, without acks. There are many ways to model this, and it's tough to offer advice without knowing how your consumers are forwarding messages to your clients.
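The timeout-then-"failed"-topic option might look roughly like this in Python. The `send` callable is a hypothetical stand-in for whatever delivers the message to the user over HTTP, assumed here to raise `TimeoutError` when the user does not respond in time; the `failed_topic` list stands in for a real topic.

```python
def forward(message, send, failed_topic, timeout_seconds=2.0):
    """Try to deliver one message; park it in the failed topic on timeout."""
    try:
        send(message, timeout_seconds)
        return True
    except TimeoutError:
        # Do not block the partition: set the message aside and move on.
        failed_topic.append(message)
        return False


def flaky_send(message, timeout_seconds):
    """Hypothetical sender: user "u2" never answers in time."""
    if message["user_id"] == "u2":
        raise TimeoutError("user did not respond")


failed = []
for msg in [{"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u3"}]:
    forward(msg, flaky_send, failed)
print(failed)
```

The slow user costs at most one timeout per message, and the other users in the same partition keep being served.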
Also, if you are using Kafka Streams, make note of the StreamPartitioner interface. This interface appears in KStream and KTable methods that materialize messages to a topic, and it may be useful in chat applications where you have clients idling on a specific TCP connection.

Kafka: Is it possible to share data among consumers in a consumer group?

I have multiple messages (more specifically, log messages) in a certain topic. A block of messages shares the same ID (the IDs keep changing, but stay the same within a block), and I need to find a way to group all the messages with that ID, or to share the data contained in those messages between all the consumers in a consumer group.
So is there any way I could share data among various consumers in a consumer group?
This sounds like a sessionization use case to me. Kafka doesn't provide any means of grouping or nesting messages together so you'd have to do that yourself by keeping state in the consumer while processing and wrap the group of messages with some kind of header. Then you could push this to a new topic of wrapped message groups.
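A minimal Python sketch of that do-it-yourself grouping follows. It assumes the messages of one block arrive contiguously at a single consumer (for example because they share a partition), and the field names `id` and `line` are hypothetical. The wrapped groups are what you would then push to the new topic.

```python
from itertools import groupby


def wrap_blocks(messages):
    """Wrap each run of consecutive messages sharing an id into one group."""
    for block_id, block in groupby(messages, key=lambda m: m["id"]):
        yield {"id": block_id, "messages": [m["line"] for m in block]}


# Stand-in for log messages consumed in order from one partition.
consumed = [
    {"id": "a1", "line": "start"},
    {"id": "a1", "line": "done"},
    {"id": "b7", "line": "start"},
]

for group in wrap_blocks(consumed):
    print(group)
```

If blocks can interleave or span consumers, this contiguous-run assumption breaks and the state has to live somewhere shared.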
A better approach would probably be to make use of an external database or other system with more flexible means of selecting or organizing data based on fields. You can have a look at this blogpost for an example using Spark streaming + HBase.
There are two ways you can do that.
When you publish the message, give it a partition key so that all messages with the same ID go to a single partition. On the consumer side, they will then always be consumed by a single consumer. [https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+Producer+Example]
If you use Spark Streaming on the consumer side, you could use the sliding window concept to group all messages with the same ID. [http://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations]