Consumer-side behavior when using CoGroupByKey in Apache Beam - apache-beam

I have a Beam job that reads data from two Kafka producers and joins the two streams on a common key. I am not joining on the partition key used by Kafka. So essentially Kafka partitions the data in both streams by some key; my consumer/Beam job consumes the data from the two streams, extracts the actual key I want to join on into a PCollection, and then runs CoGroupByKey.
I see the join happen for several events, but if I query for specific events, I do not see the join happen. I have applied the same windowing to both streams. This makes me question whether a consumer is getting the right data from the two streams to perform the join. Say consumer 0 consumes from partition 0 of both streams. Is there a chance that Kafka partitions the data using some key x and my consumer 0 does not get the right data to join across the streams? I was told that CoGroupByKey ensures that the right data lands in each consumer, but I am not able to visualize this. How can using CoGroupByKey affect the input-side behavior?

CoGroupByKey will join data across all input partitions. I suspect the issue is windowing--are the unjoined items in the same window? (CoGroupByKey does not join across windows, so items that land in separate windows do not get joined. You could look at using session windows if fixed windows don't work.)
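For illustration, here is a minimal sketch of how both inputs are windowed identically before the CoGroupByKey (the names stream1/stream2, the String values, and the 1-minute fixed window are assumptions, not taken from your pipeline):

// keyed1/keyed2 are the PCollections produced by your key-extraction step.
PCollection<KV<String, String>> keyed1 = stream1
    .apply("WindowStream1", Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(1))));
PCollection<KV<String, String>> keyed2 = stream2
    .apply("WindowStream2", Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(1))));

final TupleTag<String> tag1 = new TupleTag<>();
final TupleTag<String> tag2 = new TupleTag<>();

// Elements are grouped per key AND per window: two elements with the same key
// that fall into different windows never end up in the same CoGbkResult.
PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(tag1, keyed1)
        .and(tag2, keyed2)
        .apply(CoGroupByKey.create());

If the unjoined events have timestamps that land just either side of a window boundary, they will show up under the same key but in different CoGbkResults.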

Related

Kafka event producer on RDBMS data & reading it at the consumer in the same order as produced, in the case of multiple topics

I have two business entities in an RDBMS: Associate & AssociateServingStore. I currently have two topics, with ADD/UPDATE/DELETE events written into AssociateTopic & AssociateServingStoreTopic, and these two topics are consumed by several downstream systems which use them for their own business needs.
Whenever an Associate/AssociateServingStore is added from the UI, I currently write Associate & AssociateServingStore into the two separate topics, and I have a single consumer at my end reading both topics. The problem is the order in which messages can be read from the two separate topics: this follows a workflow, so I cannot read an AssociateServingStore without reading its Associate first. How do I read them in order? (With a partition key I can read data in order within a partition of the same topic.) But here I have two separate topics and want to read them in order, first Associate and then AssociateServingStore. How do I design it so that Associate is always read before AssociateServingStore?
Thinking as the consumer myself, I am planning to read the first 50 rows of Associate and then 50 rows from AssociateServingStore and process the messages. The problem is that if one of the 50 consumed AssociateServingStore records has no match among the first 50 Associate events already read/processed, I will get errors on my end saying the parent record was not found while inserting the child.
How do I design the consumer in these cases of RDBMS business events where we have multiple topics but need to read them in order, so that I never read a child topic message before its parent topic message and hit insert/update errors like parent record not found? Is there a way to stage the data in a staging table and process it by timestamp? I couldn't come up with a design that would guarantee the read order and process the messages accordingly.
Any suggestions?
This seems like a streaming join use-case, supported by some stream-processing frameworks/libraries.
For instance, with Kafka Streams or ksqlDB you can treat these topics as either tables or streams, and apply joins between tables, streams, or stream to table joins.
These joins handle the stream-specific considerations that do not arise in traditional databases, such as how long to wait when the time on one of the streams is ahead of the other[1][2].
This presentation[3] goes into the details of how joins work on both Kafka Streams and ksqlDB.
[1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-353%3A+Improve+Kafka+Streams+Timestamp+Synchronization
[2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-695%3A+Further+Improve+Kafka+Streams+Timestamp+Synchronization
[3] https://www.confluent.io/events/kafka-summit-europe-2021/temporal-joins-in-kafka-streams-and-ksqldb/
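As a rough sketch only (the topic names are taken from your question, but the associateId key, the String values, and the output topic are assumptions), a Kafka Streams stream-table version of the parent/child join could look like this:

StreamsBuilder builder = new StreamsBuilder();

// Parent entity as a changelog-backed table, assumed to be keyed by associateId.
KTable<String, String> associates = builder.table("AssociateTopic");

// Child events as a stream, assumed to be keyed by the same associateId.
KStream<String, String> servingStores = builder.stream("AssociateServingStoreTopic");

// Inner stream-table join: a child event with no matching parent row at processing
// time is dropped (use leftJoin if you want to keep and handle such events instead).
KStream<String, String> enriched =
    servingStores.join(associates, (store, associate) -> associate + "|" + store);

enriched.to("AssociateWithServingStore");

Note that this does not buffer children until their parent arrives; it relies on Kafka Streams' timestamp synchronization to prefer older records across the two inputs, which is exactly what the KIPs referenced above improve.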

Kafka Joins - How can we achieve it?

I have the below topics in Kafka (10 partitions each) and want to join them as shown in the example below. Could someone please advise how this can be done?
Topic:"orders"
Topic: "referencedata"
Joined Topic:"merged_ref_orders"
Basically, we are loading all of the orders and reference data into a hash map and then joining them. This causes a lot of performance issues when there are many orders. I came across KTable/GlobalKTable but I am not sure how they operate internally.
Would a KTable/GlobalKTable load all the data into heap memory? If so, it won't help keep the Java memory footprint low.
Could someone please advise how this scenario can be handled?
I suggest turning the referencedata topic into a KTable keyed on tickerString. This means that you will have a materialised view that is auto-updated when new referencedata arrives in the topic. It also means that you can query the KTable by key (i.e. tickerString). Then you can create a KStream consuming from the orders topic, join this stream to the referencedata KTable, and push the resulting join to a new topic (i.e. merged_ref_orders).
Key both the referencedata KTable and the orders topic with the same key (i.e. tickerString), so that you do not have to use a GlobalKTable; a regular KTable (much more scalable) will be enough. A KTable keeps memory usage very low, because it works only on the partitioned data and stores its local state in RocksDB rather than on the JVM heap.
Two official Confluent examples of how to join/enrich a KStream with a KTable:
https://github.com/confluentinc/kafka-streams-examples/blob/6.0.1-post/src/main/java/io/confluent/examples/streams/PageViewRegionLambdaExample.java
https://github.com/confluentinc/kafka-streams-examples/blob/6.0.1-post/src/main/java/io/confluent/examples/streams/microservices/InventoryService.java
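A minimal sketch of that approach (the topic names come from the question; the String values, the assumption that orders is already keyed by tickerString, and the join logic are illustrative):

StreamsBuilder builder = new StreamsBuilder();

// Reference data as a KTable keyed by tickerString; each update replaces the previous value.
KTable<String, String> referenceData = builder.table("referencedata");

// Orders as a stream, assumed to already be keyed by tickerString
// (otherwise re-key it first with selectKey(...) so both sides are co-partitioned).
KStream<String, String> orders = builder.stream("orders");

// Enrich each order with the latest reference data for its ticker.
KStream<String, String> merged =
    orders.join(referenceData, (order, refData) -> order + "|" + refData);

merged.to("merged_ref_orders");

Because both topics have 10 partitions and share the same key, they are co-partitioned, so a plain KTable is sufficient and no GlobalKTable is needed.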

Can Kafka streams deal with joining streams efficiently?

I'm new to Kafka and I'd like to know if what I'm planning is possible and reasonable to implement.
Suppose we have two sources, s1 and s2 that emit some messages to topics t1 and t2 respectively. Now, I'd like to have a sink which listens to both topics and I'd like it to process tuples of messages <m1, m2> where m1.key == m2.key.
If m1.key was never found in some message of s2, then the sink completely ignores m1.key (will never process it).
In summary, the sink will work only on keys that s1 and s2 worked on.
A traditional and maybe naive solution would be to have some sort of cache or storage and to work on an item only when both messages are in the cache.
I'd like to know if Kafka offers a solution to this problem.
Most modern stream processing engines, such as Apache Flink, Kafka Streams or Spark Streaming can solve this problem for you. All three have battle tested Kafka consumers built for use cases like this.
Even within those frameworks, there are multiple different ways to achieve a streaming join like the above.
In Flink for example, one could use the Table API which has a SQL-like syntax.
What I have used in the past looks a bit like the example in this SO answer (you can just replace fromElements with a Kafka Source).
One thing to keep in mind when working with streams is that you do NOT have any ordering guarantees when consuming data from two Kafka topics t1 and t2. Your code needs to account for messages arriving in any order.
Edit - Just realised your question was probably about how you can implement the join using Kafka Streams as opposed to a stream of data from Kafka. In this case you will probably find relevant info here
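If you do go the Kafka Streams route, an inner stream-stream join is roughly what you described: a pair is emitted only when both sides produce a record with the same key within the join window, and unmatched keys are simply never emitted. A sketch (topic names, String values, and the 10-minute window are assumptions; JoinWindows.ofTimeDifferenceWithNoGrace needs Kafka Streams 3.x):

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> s1 = builder.stream("t1");
KStream<String, String> s2 = builder.stream("t2");

// Inner join: <m1, m2> is produced only when m1.key == m2.key and both records
// arrive within 10 minutes of each other; lone keys are dropped.
KStream<String, String> joined = s1.join(
    s2,
    (m1, m2) -> m1 + "|" + m2,
    JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(10)));

joined.to("joined-output");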

Join data from 4 topics in broker using Kafka Streams when updates are not same in each of the topics

I am working on a requirement to process data ingested from a SQL data store into a Kafka broker, in 4 different topics corresponding to 4 different tables in the SQL data store. I am using Kafka Connect to ingest the data into the topics.
I now want to join the data from these topics, aggregate them, and write them back to another topic. This topic will in turn be subscribed to by a consumer to populate a NoSQL data store which will be used to render the UI.
I know Kafka Streams can be used to join topics.
My query is, the data being ingested from the SQL data store tables may not always have data for all 4 tables. Only 2 of the tables will have regular updates. One will get updated, but not at the same frequency as the other 2. The remaining one is static (a sort of master table).
So, I am not sure how we can actually join them with Kafka Streams when the record counts will mismatch in topics.
Has anyone faced a similar issue? If so, can you please share your thoughts/code snippets on it?
The number of rows doesn't matter at all... Why should it have any impact on the join result?
You can just read all 4 topics as a KTable each, and do the join. Finally, you apply an aggregation to the join-result KTable and write the final result to a topic. Something like this:
KTable t1 = builder.table("topic1");
KTable t2 = builder.table("topic2");
KTable t3 = builder.table("topic3");
KTable t4 = builder.table("topic4");

// Each join emits an updated row whenever any of the input tables changes.
KTable joinResult = t1.join(t2, ...).join(t3, ...).join(t4, ...);

// Re-group, aggregate, and write the final result back out as a topic.
joinResult.groupBy(...).aggregate(...).toStream().to("result-topic");

Apache Kafka order of messages with multiple partitions

As per the Apache Kafka documentation, message ordering can only be achieved within a partition of a topic. In that case, what parallelism benefit are we getting, and isn't it then equivalent to a traditional MQ?
In Kafka the parallelism is equal to the number of partitions for a topic.
For example, assume that your messages are partitioned based on user_id and consider 4 messages having user_ids 1, 2, 3 and 4. Assume that you have a "users" topic with 4 partitions.
Since partitioning is based on user_id, assume that the message with user_id 1 will go to partition 1, the message with user_id 2 will go to partition 2, and so on.
Also assume that you have 4 consumers for the topic. Since you have 4 consumers, Kafka will assign each consumer to one partition. So in this case as soon as 4 messages are pushed, they are immediately consumed by the consumers.
If you had 2 consumers for the topic instead of 4, then each consumer will be handling 2 partitions and the consuming throughput will be almost half.
To completely answer your question,
Kafka only provides a total order over messages within a partition, not between different partitions in a topic.
i.e., if consumption is very slow on partition 2 and very fast on partition 4, then a message with user_id 4 will be consumed before a message with user_id 2. This is how Kafka is designed.
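To make the partition-to-consumer assignment concrete, here is a minimal consumer sketch (the broker address, topic name, and group id are made up): if you start 4 instances of this in the same consumer group against the 4-partition "users" topic, Kafka assigns each instance one partition.

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "users-consumer-group");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

// All instances sharing this group.id split the topic's partitions among themselves.
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("users"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        // Within a single partition, records are seen in the order they were written.
        System.out.printf("partition=%d key=%s value=%s%n",
            record.partition(), record.key(), record.value());
    }
}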
I decided to move my comment to a separate answer as I think it makes sense to do so.
While John is 100% right about what he wrote, you may consider rethinking your problem. Do you really need ALL messages to stay in order? Or do you need all messages for specific user_id (or whatever) to stay in order?
If the first, then there's not much you can do: you should use 1 partition and lose all the parallelism.
But in the second case, you might consider partitioning your messages by some key; all messages for that key will then arrive at one partition (they might actually go to a different partition if you resize the topic, but that's a different case), which guarantees that all messages for that key are in order.
In Kafka, messages with the same key from the same producer are delivered to the consumer in order.
On top of that, data within a partition is stored in the order in which it is written; therefore, data read from a partition is read in order for that partition.
So if you want your messages ordered across multiple partitions, you really need to group your messages by a key, so that messages with the same key go to the same partition, and within that partition the messages are ordered.
In a nutshell, you need to design a two-level solution like the above to get messages logically ordered across multiple partitions.
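A small sketch of the keying side (the broker address, topic, and key are illustrative): every record sent with the same key is routed to the same partition, which is what gives you per-key ordering.

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

KafkaProducer<String, String> producer = new KafkaProducer<>(props);

// Both records carry the key "user-42", so they hash to the same partition
// and a consumer of that partition sees them in the order they were sent.
producer.send(new ProducerRecord<>("events", "user-42", "created"));
producer.send(new ProducerRecord<>("events", "user-42", "updated"));
producer.close();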
You may consider having a field that carries the timestamp/date of when the record was created at the source.
Once the data is consumed you can load it into a database. The data then needs to be sorted at the database level before the dataset is used for any use case. Well, this is an attempt to help you think in multiple ways.
Let's say the message key is the timestamp generated when the data is created, and the value is the actual message string.
As and when a message is picked up by the consumer, it is written into HBase with the row key being the Kafka key and the value being the Kafka value.
Since HBase is a sorted map, having the timestamp as the key automatically keeps the data in order. You can then serve the data from HBase to the downstream apps.
This way you are not losing the parallelism of Kafka, and you also get the ability to sort the data and run multiple processing steps on it at the database level.
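As a rough illustration of that consumer-to-HBase step (the table and column family names are made up, and an existing HBase Connection plus a consumed Kafka record are assumed):

// Write each consumed record into HBase keyed by the Kafka message key.
Table table = connection.getTable(TableName.valueOf("events"));

// HBase stores rows sorted lexicographically by row key, so this only yields
// time order if the timestamp key is fixed-width/zero-padded (an assumption here).
Put put = new Put(Bytes.toBytes(record.key()));
put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes(record.value()));
table.put(put);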
Note: no distributed message broker guarantees overall ordering. If you insist on that, you may need to rethink and use another message broker, or use a single partition in Kafka, which is not a good idea. Kafka is all about parallelism, achieved by increasing the number of partitions and the number of consumers in a group.
A traditional MQ allows a bunch of subscribers to pull a message, or a batch of messages, from the end of the queue. Queues usually allow for some level of transaction when pulling a message off, to ensure that the desired action was executed before the message gets removed; but once a message has been processed, it is removed from the queue.
With Kafka on the other hand, you publish messages/events to topics, and they get persisted. They don’t get removed when consumers receive them. This allows you to replay messages, but more importantly, it allows a multitude of consumers to process logic based on the same messages/events.
You can still scale out to get parallel processing in the same domain, but more importantly, you can add different types of consumers that execute different logic based on the same event. In other words, with Kafka, you can adopt a reactive pub/sub architecture.
ref: https://hackernoon.com/a-super-quick-comparison-between-kafka-and-message-queues-e69742d855a8
Well, this is an old thread, but it is still relevant, hence I decided to share my view.
I think this question is a bit confusing.
If you need strict ordering of messages, then the same strict ordering should be maintained while consuming the messages. There is absolutely no point in ordering messages in the queue but not while consuming them. Kafka allows the best of both worlds: it orders messages within a partition right from production through consumption, while allowing parallelism between multiple partitions. Hence:
If you need absolute ordering of all events published on a topic, use a single partition. You will not have parallelism, nor do you need it (again, parallelism and strict ordering don't go together).
Otherwise, go for multiple partitions and consumers, and use consistent hashing so that all messages that need to follow a relative order go to a single partition.