Timeout for aggregated records in Kafka table? - apache-kafka

I use Kafka for processing messages. A message can be divided into a few parts (it is a composite message). So in the stream I can have, for example, one composite message that is split into three parts; in other words, it is three records in the Kafka stream, but one big message. I want to use a Kafka table to merge the parts of a composite message into one Kafka record. After the merge, the single message will be inserted into a database (Postgres). Every part carries its part number and the total number of parts. For example, if one message has three parts (three Kafka records) in the stream, every part has a "total number of parts" field with value 3.
As I understand it, the task is simple in the positive scenario: aggregate the parts in a table, create a stream from the table, filter records whose number of aggregated parts equals the total number of parts, map each filtered record into one merged message, and insert it into the database (Postgres).
But a negative scenario is also possible. In rare cases one of the parts may never be written to Kafka at all (or it may arrive much later, after a timeout). So, for example, only two of the three parts of a composite message may be present in the stream. In that case I must insert into the database (Postgres) a partially constructed message (consisting of only two parts instead of three). How can I implement this negative scenario in Kafka?
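For reference, here is a minimal Kafka Streams sketch of the positive-scenario pipeline described above. The topic names, the Part/PartsAggregate types, their methods, and partSerde are hypothetical placeholders, and records are assumed to be keyed by the composite message id:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

StreamsBuilder builder = new StreamsBuilder();

// Each record is one part of a composite message, keyed by the composite message id.
KTable<String, PartsAggregate> aggregated = builder
        .stream("message-parts", Consumed.with(Serdes.String(), partSerde))
        .groupByKey()
        .aggregate(
                PartsAggregate::new,                      // empty aggregate
                (messageId, part, agg) -> agg.add(part),  // collect the part
                Materialized.<String, PartsAggregate, KeyValueStore<Bytes, byte[]>>as("parts-store"));

// Positive scenario: forward only composites whose collected part count equals
// the declared total, merge them into one message, and write them to the topic
// that feeds the Postgres writer.
aggregated.toStream()
        .filter((messageId, agg) -> agg.partCount() == agg.totalParts())
        .mapValues(PartsAggregate::merge)
        .to("complete-messages");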

I would recommend checking out punctuations: https://docs.confluent.io/current/streams/developer-guide/processor-api.html#defining-a-stream-processor
Also note that you can mix and match the Processor API and the DSL: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#applying-processors-and-transformers-processor-api-integration
If you provide a store name for the KTable aggregation, you can connect that store to a custom processor that registers a punctuation. Overall, though, it might be better to use the Processor API for the whole application instead of the DSL.
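As an illustration (not the only way to do it), a custom processor along these lines could be connected to the "parts-store" from the aggregation and flush incomplete composites after a timeout. This is only a sketch against the classic Processor API; PartsAggregate, its isComplete()/firstSeenMs() methods, and the 5-minute timeout are assumptions:

import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class PartsTimeoutProcessor extends AbstractProcessor<String, PartsAggregate> {

    private static final Duration TIMEOUT = Duration.ofMinutes(5); // assumption

    private KeyValueStore<String, PartsAggregate> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        super.init(context);
        store = (KeyValueStore<String, PartsAggregate>) context.getStateStore("parts-store");

        // Wall-clock punctuation: periodically scan the store and forward
        // composites that are still incomplete after the timeout.
        context.schedule(TIMEOUT, PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (KeyValueIterator<String, PartsAggregate> it = store.all()) {
                while (it.hasNext()) {
                    KeyValue<String, PartsAggregate> entry = it.next();
                    if (!entry.value.isComplete()
                            && timestamp - entry.value.firstSeenMs() >= TIMEOUT.toMillis()) {
                        context().forward(entry.key, entry.value); // emit the partial message downstream
                        store.delete(entry.key);                   // drop it from the aggregation store
                    }
                }
            }
        });
    }

    @Override
    public void process(String key, PartsAggregate value) {
        // The parts themselves are aggregated by the DSL; this processor only handles timeouts.
    }
}

It could then be attached with something like aggregated.toStream().process(PartsTimeoutProcessor::new, "parts-store"); the exact wiring depends on your topology.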

Related

Kafka event producer on RDBMS data and reading it at the consumer in the same order as produced, in the case of multiple topics

I have two business entities in an RDBMS: Associate and AssociateServingStore. I currently plan to have two topics, writing ADD/UPDATE/DELETE events into AssociateTopic and AssociateServingStoreTopic; these two topics are consumed by several downstream systems for their own business needs.
Whenever an Associate/AssociateServingStore is added from the UI, I currently write Associate and AssociateServingStore into the two separate topics, and I have a single consumer on my end reading both topics. The problem is the order of messages read from two separate topics: since this follows a workflow, I cannot read an AssociateServingStore without reading its Associate first. How do I read them in order? (With a partition key I can read data in order within a partition of the same topic, but here I have two separate topics.) I want to read them in order: first Associate, then AssociateServingStore. How can I design it so that Associate is always read before AssociateServingStore?
Thinking as the consumer myself, I planned to read the first 50 rows of Associate and then 50 rows of AssociateServingStore and process the messages. The problem is that if one of the 50 consumed AssociateServingStore records references an Associate that is not among the first 50 already read/processed Associate events, I will get an error on my end saying the parent record was not found during the child insert.
How should I design the consumer for these RDBMS business events spread over multiple topics that must be read in order, so that I never read a child-topic message before its parent-topic message and hit insert/update errors such as "parent record not found"? Is there a way to stage the data in a staging table and process it according to timestamps? I couldn't come up with a design that guarantees the read order and processes the events accordingly.
Any suggestions ?
This seems like a streaming-join use case, which is supported by several stream-processing frameworks/libraries.
For instance, with Kafka Streams or ksqlDB you can treat these topics as either tables or streams and apply table-table, stream-stream, or stream-table joins.
These joins handle the considerations specific to streams that do not arise in traditional databases, such as how long to wait when the time on one of the streams is more recent than on the other [1][2].
This presentation [3] goes into the details of how joins work in both Kafka Streams and ksqlDB.
[1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-353%3A+Improve+Kafka+Streams+Timestamp+Synchronization
[2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-695%3A+Further+Improve+Kafka+Streams+Timestamp+Synchronization
[3] https://www.confluent.io/events/kafka-summit-europe-2021/temporal-joins-in-kafka-streams-and-ksqldb/
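As a concrete illustration of the stream-table join suggested above, here is a minimal Kafka Streams sketch. It assumes both topics are keyed by the associate id and carry String payloads; the output topic name is also just an example:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

// Parent entity as a table: the latest Associate record per associate id.
KTable<String, String> associates = builder.table("AssociateTopic");

// Child entity as a stream of change events, keyed by the same associate id.
KStream<String, String> servingStores = builder.stream("AssociateServingStoreTopic");

// Stream-table join: each serving-store event is enriched with its parent,
// so downstream consumers never see a child without its Associate.
KStream<String, String> enriched = servingStores.join(
        associates,
        (servingStore, associate) -> associate + "|" + servingStore);

enriched.to("associate-with-serving-store");

Note that a stream-table join only produces output when the table side already contains the parent; if the child event can arrive before its Associate, a windowed stream-stream join (or buffering on the consumer side) would be needed.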

Mark end of logical section in Kafka when multiple partitions are used

I want to share a problem and the solution I used, as I think it may be beneficial for others. If people have other solutions, please share them.
I have a table with 1,000,000 rows that I want to send to Kafka, spreading the data across 20 partitions.
I want to notify the consumer when the producer has reached the end of the data, without a direct connection between producer and consumer.
I know Kafka is designed as a logically endless stream of data, but I still need to mark the end of this specific table.
One suggestion was to count the number of items per logical section and send that count (to a metadata topic), so the consumer can count consumed items and know when the logical section has ended.
This approach has several disadvantages:
Since the data is spread across partitions, I can only say there are x items in total in my logical section. If there are multiple consumers (one per partition), they would need to share a counter of consumed messages per logical section, and I want to avoid that complexity. Also, when a consumer is stopped and resumed, it needs to know how many items it has already consumed and keep that context.
A regular producer session guarantees at-least-once delivery, which means I may have duplicate messages. Counting the messages would need to take this into account (and avoid counting duplicates).
There is also the case where I don't know the number of items per logical section in advance (I am a kind of consumer myself, consuming a stream of events and being signaled when the data has ended). In that case the producer also needs a counter, has to persist it across stops and resumes, and several producers would need to share it. All of this adds a lot of complexity to the process.
Solution 1:
I actually want the last message in each partition to indicate that it is the last message.
I can do some work in advance: create some random message keys, send messages partitioned by key, and test which partition each message is directed to. Since partitioning by key is deterministic (for a given number of partitions), I can prepare a map of keys to target partitions. For example, key 'xyz' is directed to partition #0, key 'hjk' to partition #1, and so on. Finally I invert the map, so for partition 0 I use key 'xyz', for partition 1 key 'hjk', etc.
Now I can send the entire table (except the last 20 rows) with a random partitioning strategy, so the data is spread across the partitions for almost the whole table.
When I get to the last 20 rows, I send them using partition keys, setting for each message a key that hashes it to a different partition. This way each of the 20 partitions receives one of the last 20 messages. For each of these last 20 messages I set a header stating that it is the last one.
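A rough sketch of the key-probing step for solution 1. It sends probe records and reads the chosen partition back from the returned RecordMetadata; the topic name, bootstrap servers, and the producerProps() helper are assumptions:

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class PartitionKeyProbe {

    // Build a map: partition number -> a key that the default partitioner hashes to it.
    // The probe records can be sent to a throwaway topic that has the same number of
    // partitions as the data topic, so the data topic is not polluted.
    static Map<Integer, String> keyPerPartition(String probeTopic, int numPartitions) throws Exception {
        Map<Integer, String> keyForPartition = new HashMap<>();
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps())) {
            while (keyForPartition.size() < numPartitions) {
                String candidateKey = UUID.randomUUID().toString();
                // RecordMetadata tells us which partition the key-hash partitioner picked.
                RecordMetadata md = producer
                        .send(new ProducerRecord<>(probeTopic, candidateKey, "probe"))
                        .get();
                keyForPartition.putIfAbsent(md.partition(), candidateKey);
            }
        }
        return keyForPartition;
    }

    static Properties producerProps() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        return p;
    }
}

The last 20 rows can then be sent with keyForPartition.get(p) as the key plus a header marking them as last. Note that ProducerRecord also has a constructor that takes an explicit partition number (new ProducerRecord<>(topic, partition, key, value)), which would avoid the key probing altogether.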
Solution 2:
Similar to solution 1, but send the entire table spread across random partitions. Then send 20 metadata messages, directed to the 20 partitions using the partition-by-key strategy (by setting the appropriate keys).
Solution 3:
Have an additional control topic. After the table has been sent entirely to the data topic, send a message to the control topic saying the table is complete. The consumer needs to check the control topic from time to time; once it gets the 'end of data' message, it knows that when it reaches the end of a partition, it has actually reached the end of the data for that partition. This solution is less flexible and less recommended, but I wrote it down as well.
Yet another solution is to use an open-source analog of S3 (e.g. minio.io). Producers can upload the data and send a message with a link to the object in storage. Consumers remove the data from object storage after collecting it.

Kafka Streams for counting a total sum?

I have a topic named "addcash" with 3 partitions (the Kafka cluster also has 3 machines), and a lot of user recharge messages flow through it. I want to count the total amount of money every day.
I learned from some articles about Kafka Streams that Kafka Streams runs the topology as tasks, that the number of tasks depends on the number of the topic's partitions, and that every task has its own state store.
So when I count the total amount of money with a state store, will three partial values be returned instead of one total value? What is the right way to do it?
Thanks!
That is correct.
You have two ways to do this:
You compute the partial sums and follow up with a KTable.groupBy(...).reduce(...) that maps everything onto a single global key, bringing all partial aggregates together.
You can get the total sum by creating an additional single-partition topic, writing the partial results into this topic, reading the data back with Kafka Streams, and doing a second aggregation that adds those partial numbers together. You can express this in a single program, using through("my-single-partition-topic") to connect the first and second parts of the aggregation. For this solution you would need to use transform() rather than the DSL for the second aggregation step.
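A minimal sketch of the first option in the DSL, assuming the "addcash" records carry a Long amount and are keyed by user id (both assumptions); the groupBy onto a constant key routes all partial sums through an internal repartition topic to a single task:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

// Per-key partial sums, maintained in each task's local state store.
KTable<String, Long> partialSums = builder
        .stream("addcash", Consumed.with(Serdes.String(), Serdes.Long()))
        .groupByKey()
        .reduce(Long::sum);

// Re-key every partial sum onto a single constant key; all partials then end
// up in one task that maintains the global total (the subtractor removes the
// old value whenever a per-key partial sum is updated).
KTable<String, Long> totalSum = partialSums
        .groupBy((key, value) -> KeyValue.pair("total", value),
                 Grouped.with(Serdes.String(), Serdes.Long()))
        .reduce(Long::sum, (aggValue, oldValue) -> aggValue - oldValue);

totalSum.toStream().to("addcash-total");

For per-day totals, a windowed aggregation (e.g. one-day tumbling windows) could be used for the first step instead of a plain reduce.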

How to merge multiple kafka streams in order to do a session windowing over all events of the resulting stream

We have multiple input topics with different business events (page views, clicks, scroll events, etc.). As far as I understand Kafka Streams, they all get an event timestamp, which can be used in KStream joins with other streams or tables to align the times.
What we want to do is: merge all the different events (originating from the different topics mentioned above) for a user id (i.e. group by user id) and apply a session window to them.
This should be possible by using groupByKey and then aggregate/reduce (specifying the inactivity gap there) on a stream containing all events. This combined stream must contain all events from the different input topics in event-time order (or at least in a way that the Kafka Streams methods above honor the event times).
The only challenge left is to create this combined/merged stream.
When I look at the Kafka Streams API, there is the KStreamBuilder#merge operation, for which the Javadoc says: "There is no ordering guarantee for records from different KStreams." Does this mean the session windowing will produce incorrect results?
If yes, what is the alternative to #merge?
I was also thinking about joining, but it really depends on whether you have one event per topic per ID, or potentially multiple events with the same ID within one input topic. For the first case, joining is a good strategy, but not for the latter, as you would get some unnecessary duplication:
stream A: <a,1> <a,2>
stream B: <a,3>
join-output plus session: <a,1-3 + 2-3>
Number 3 would be a duplicate.
Also keep in mind that joining slightly modifies the timestamps, so your session windows might differ depending on whether you apply them to the join result or to the raw data.
About merge() and ordering: you can use merge() safely, as the session windows will be built based on record timestamps, not offset order. All window operations in Kafka Streams handle out-of-order data gracefully.
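A minimal sketch of the merge-then-session-window approach described above, assuming the input topics are already keyed by user id and carry String payloads; the topic names and the 30-minute inactivity gap are placeholders:

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.SessionWindows;
import org.apache.kafka.streams.kstream.Windowed;

StreamsBuilder builder = new StreamsBuilder();

KStream<String, String> pageViews = builder.stream("page-views");
KStream<String, String> clicks = builder.stream("clicks");
KStream<String, String> scrolls = builder.stream("scroll-events");

// merge() just interleaves the streams; the session windowing below is driven
// by record timestamps, so the lack of ordering guarantees across the merged
// streams does not affect the result.
KStream<String, String> allEvents = pageViews.merge(clicks).merge(scrolls);

KTable<Windowed<String>, String> sessions = allEvents
        .groupByKey()
        .windowedBy(SessionWindows.with(Duration.ofMinutes(30)))
        .reduce((left, right) -> left + "," + right);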
What we want to do is: Merge all different events (originating from the above mentioned different topics) for a user id (i.e. group by user id) and apply a session window to them.
From what I understand, you'd need to join the streams (and use groupBy to ensure that they can be properly joined by user id), not merge them. You can then follow up with a session-windowed aggregation.

Processing records in order in Storm

I'm new to Storm and I'm having trouble figuring out how to process records in order.
I have a dataset which contains records with the following fields:
user_id, location_id, time_of_checking
Now, I would like to identify users who have followed the path I specified (for example, users that went from location A to location B to location C).
I'm using a Kafka producer and reading these records from a file to simulate live data. The data is sorted by date.
So, to check whether my pattern is fulfilled, I need to process the records in order. The problem is that, due to parallelization (bolt replication), I don't receive a user's check-ins in order, and because of that the pattern matching won't work.
How to overcome this problem? How to process records in order?
There is no general system support for ordered processing in Storm. Either you use a different system that supports ordered stream processing, like Apache Flink (disclaimer: I am a committer at Flink), or you take care of it yourself in your bolt code.
The only support Storm provides is via Trident. You can put the tuples of a certain time period (for example one minute) into a single batch and thus process all tuples of that minute at once. However, this only works if your use case allows for it, because you cannot relate tuples from different batches to each other. In your case, that would only work if you know there are points in time at which all users have reached their destination (and no other user has started a new interaction); i.e., you need points in time at which no two users' check-in sequences overlap. (It seems to me that your use case cannot fulfill this requirement.)
For a non-system, i.e. custom user-code based, solution there are two approaches:
You could, for example, buffer up tuples and sort them by timestamp within a bolt before processing (a sketch of this approach follows below). To make this work properly, you need to inject punctuations/watermarks that ensure no tuple with a timestamp larger than the punctuation arrives after the punctuation. Once you have received a punctuation from each parallel input substream, you can safely trigger sorting and processing.
Another way is to buffer tuples per incoming substream in distinct buffers (within a substream, order is preserved) and merge the tuples from the buffers in order. This has the advantage that sorting is avoided. However, you need to ensure that each operator emits tuples in order. Furthermore, to avoid blocking (i.e., when no input is available for a substream), punctuations might be needed here too. (I implemented this approach; feel free to use the code or adapt it to your needs: https://github.com/mjsax/aeolus/blob/master/queries/utils/src/main/java/de/hub/cs/dbis/aeolus/utils/TimestampMerger.java)
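A rough sketch of the first (buffer-and-sort) approach as a Storm bolt, assuming recent Storm APIs. A fixed out-of-orderness bound stands in for real punctuations/watermarks, and the field names follow the dataset above; both are assumptions:

import java.util.Comparator;
import java.util.Map;
import java.util.PriorityQueue;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SortingBolt extends BaseRichBolt {

    private static final long MAX_OUT_OF_ORDER_MS = 60_000L; // assumed lateness bound

    private OutputCollector collector;
    private PriorityQueue<Tuple> buffer;
    private long maxSeenTimestamp = Long.MIN_VALUE;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.buffer = new PriorityQueue<>(
                Comparator.comparingLong((Tuple t) -> t.getLongByField("time_of_checking")));
    }

    @Override
    public void execute(Tuple tuple) {
        buffer.add(tuple);
        maxSeenTimestamp = Math.max(maxSeenTimestamp, tuple.getLongByField("time_of_checking"));

        // Flush every buffered tuple that can no longer be overtaken by a late arrival.
        while (!buffer.isEmpty()
                && buffer.peek().getLongByField("time_of_checking")
                        <= maxSeenTimestamp - MAX_OUT_OF_ORDER_MS) {
            Tuple next = buffer.poll();
            collector.emit(next, new Values(
                    next.getStringByField("user_id"),
                    next.getStringByField("location_id"),
                    next.getLongByField("time_of_checking")));
            collector.ack(next);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("user_id", "location_id", "time_of_checking"));
    }
}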
Storm supports this use case. You just have to ensure that order is maintained throughout your flow in all the involved components. As a first step, in the Kafka producer all the messages for a particular user id should go to the same Kafka partition. For this you can implement a custom Partitioner in your KafkaProducer; please refer to the link here for implementation details.
Since a partition in Kafka can be read by one and only one KafkaSpout instance in Storm, the messages in that partition arrive in order at that spout instance, thereby ensuring that all the messages with the same user id arrive at the same spout.
Now comes the tricky part: to maintain order in the bolt, you want to use fields grouping on the bolt, based on the "user_id" field emitted from the Kafka spout. The provided KafkaSpout does not break the message apart into fields, so you would have to override the KafkaSpout to parse the message and emit a "user_id" field. One way of doing this is to have an intermediate bolt that reads the message from the KafkaSpout and emits a stream with a "user_id" field.
When you finally declare a bolt with fields grouping on "user_id", all messages with a particular user_id value go to the same instance of that bolt, whatever the degree of parallelism of the bolt.
A sample topology that works for your case could look as follows:
builder.setSpout("KafkaSpout", Kafkaspout);
builder.setBolt("FieldsEmitterBolt", FieldsEmitterBolt).shuffleGrouping("KafkaSpout");
builder.setBolt("CalculatorBolt", CalculatorBolt).fieldsGrouping("FieldsEmitterBolt", new Fields("user_id")); //user_id field emitted by Bolt2
Beware: there could be a case where all user_id values land on the same CalculatorBolt instance if you have a limited number of user_ids. This in turn would decrease the effective parallelism!
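For completeness, a possible shape of the intermediate FieldsEmitterBolt, assuming the spout emits the raw message under a "value" field and the payload is a CSV line "user_id,location_id,time_of_checking" (both assumptions):

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class FieldsEmitterBolt extends BaseRichBolt {

    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        // Field name depends on what your spout declares; "value" is an assumption.
        String raw = tuple.getStringByField("value");
        String[] parts = raw.split(",");
        // Re-emit the message with an explicit "user_id" field so that the
        // downstream bolt can use fieldsGrouping on it.
        collector.emit(tuple, new Values(parts[0], parts[1], Long.parseLong(parts[2])));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("user_id", "location_id", "time_of_checking"));
    }
}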