Kafka: Adding batches of old data - apache-kafka

We use Kafka for time-based events, with windowing in Kafka Streams to group events into sessions (sessionization).
How should we handle the arrival of a set of data from a different source, which consists of old data?
Say for example, you are doing web analytics for a client.
You receive event data from a client in an event topic, where you receive all event types.
For some reason you did not receive order (purchase) data from a client; you only received pageview data, from which you build sessions.
Some time later, you receive a batch of time-based orders, say from the last year, so you can match them with the sessions (associate which sessions are related to which order).
Output of this process is sent to "orders", "pageviews", "sessions", etc. topics.
However, if you just add them to the (end of the) topic, they will be "unordered", so even if you recalculate data, your results will not be correct.
This is somewhat similar to the out-of-order events problem in streaming, but over a much longer time span (e.g. a year).
One possibility would be to "delete and rewrite": on a topic with compaction, delete all data from that client, resend it in order, and then launch a recalculation for that client.
But that's quite cumbersome.
Is there a better way to handle this?


Kafka Streams DSL over Kafka Consumer API

Recently, in an interview, I was asked a question about Kafka Streams. More specifically, the interviewer wanted to know why and when you would use the Kafka Streams DSL over the plain Kafka Consumer API to read and process streams of messages. I could not provide a convincing answer, and I am wondering if others who have used these two styles of stream processing can share their thoughts and opinions. Thanks.
As usual it depends on the use case when to use KafkaStreams API and when to use plain KafkaProducer/Consumer. I would not dare to select one over the other in general terms.
First of all, KafkaStreams is built on top of KafkaProducers/Consumers, so everything that is possible with KafkaStreams is also possible with plain Consumers/Producers.
I would say the KafkaStreams API is less complex but also less flexible compared to the plain Consumers/Producers. Now we could start long discussions on what "less" means.
When it comes to developing with the Kafka Streams API, you can jump directly into your business logic, applying methods like filter, map, join, or aggregate, because all the consuming and producing is abstracted behind the scenes.
When you are developing applications with plain Consumers/Producers, you need to think about how you build your clients at the level of subscribe, poll, send, flush, etc.
If you want even less complexity (but also less flexibility), ksqlDB is another option for building your Kafka applications.
Here are some of the scenarios where you might prefer the Kafka Streams over the core Producer / Consumer API:
It allows you to build a complex processing pipeline with ease. Let's assume (a contrived example) you have a topic containing customer orders, and you want to filter the orders by delivery city and save them into a DB table for persistence and an Elasticsearch index for a quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders by city using the Streams DSL filter function, write the filtered data to a separate Kafka topic (using KStream.to(), or KTable.toStream().to() for a table), and finally, using Kafka Connect, store the messages in the database table and Elasticsearch. You can do the same thing with the core Producer / Consumer API, but it would take much more code.
In a data processing pipeline, you can do the consume-process-produce cycle in a single transaction. So, in the above example, Kafka can provide exactly-once semantics from the source topic to the filtered topic; whether that guarantee carries through to the DB and Elasticsearch depends on the sink connectors used. Within the Kafka part, no duplicate messages are introduced by network glitches and retries. This feature is especially useful when you are computing aggregates such as the count of orders per product. In such scenarios, duplicates would give you a wrong result.
You can also enrich your incoming data with very low latency. Let's assume, in the above example, that you want to enrich the order data with the customer email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API over the network for each incoming order, which is an expensive operation that hurts your throughput. Instead, you might store the required customer data in a compacted Kafka topic and load it into the streaming application using a KTable or GlobalKTable. Now all you need to do is a simple local lookup in the KTable for the customer email address. Note that the KTable data here is stored in the embedded RocksDB that comes with Kafka Streams, and because the KTable is backed by a Kafka topic, the data in your streaming application is continuously updated in real time. In other words, there won't be stale data. This is essentially an example of the materialized view pattern.
Let's say you want to join two different streams of data. In the above example, you want to process only the orders that have successful payments, and the payment data arrives through another Kafka topic. The payment may be delayed, or the payment event may arrive before the order event. In such a case, you may want to do a one-hour windowed join, so that if the order and the corresponding payment event arrive within one hour of each other, the order is allowed to proceed down the pipeline for further processing. As you can see, you need to keep state for the one-hour window, and that state is stored in the RocksDB of Kafka Streams.
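To make the one-hour windowed join concrete, here is a minimal sketch in plain Python. It is not the Kafka Streams API: the function name `join_within_window`, the `(order_id, timestamp)` tuples, and the in-memory lists are all assumptions made for illustration, and timestamps are taken to be in seconds.

```python
from typing import Dict, List, Tuple

WINDOW_SECONDS = 3600  # one-hour join window, as in the example above


def join_within_window(
    orders: List[Tuple[str, int]],
    payments: List[Tuple[str, int]],
    window: int = WINDOW_SECONDS,
) -> List[str]:
    """Return the ids of orders that have a payment within `window` seconds.

    Each input is a list of (order_id, timestamp_seconds). The payment may
    arrive before or after the order, so the absolute difference is compared.
    """
    # Hypothetical stand-in for the join state that Kafka Streams keeps in RocksDB.
    payment_times: Dict[str, List[int]] = {}
    for order_id, ts in payments:
        payment_times.setdefault(order_id, []).append(ts)

    matched = []
    for order_id, order_ts in orders:
        times = payment_times.get(order_id, [])
        if any(abs(order_ts - pay_ts) <= window for pay_ts in times):
            matched.append(order_id)
    return matched


orders = [("o1", 1000), ("o2", 2000), ("o3", 3000)]
payments = [("o1", 1500), ("o2", 9000), ("o3", 2900)]
print(join_within_window(orders, payments))
```

Here the payment for `o2` arrives 7000 seconds after the order, so it falls outside the window and the order is dropped, while `o3` is kept even though its payment arrived first.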

How to process events which are out of order using Kafka Streams

I have an application where events are sent to a Kafka topic based on user actions such as User Login, the user's intermediate actions (optional), and User Logout. Each event carries some information in an event object along with a userId; for example, a Login event has loginTime, an Add Note event (an intermediate action) has notes, and a Logout event has logoutTime. The requirement is to aggregate the information from all these events into one object after receiving the Logout event for each user and send it downstream.
For various reasons (network delay, multiple event producers), events may not arrive in order (the User Logout event may arrive before an intermediate event). So the question is how to handle such scenarios. I cannot simply wait for intermediate events after receiving the User Logout event, since intermediate events are optional and depend on the user's actions.
The only option I can think of is to wait for some time after receiving the User Logout event, process any intermediate events received within that wait time, and then send the processed event, but I am not sure how to achieve this.
Kafka does not guarantee order across a topic; it guarantees order within a partition. One topic can have more than one partition, and within a consumer group each partition is consumed by exactly one consumer. That is how Kafka achieves scalability. So what you are experiencing is normal behavior (it isn't a bug or something related to network delay). What you can do is make sure that all messages you want to process in order are sent to the same partition. You can do that by setting the number of partitions to 1, which is the crudest way. When you send a message with a producer, by default Kafka looks at the key, hashes it, and uses that hash to decide which partition the message goes to. You can make sure that the key is the same for all messages. That way all the key hashes will be the same and all messages will go to the same partition. You can also implement a custom partitioner and override the default way Kafka chooses the partition for a message. In this way, all messages will arrive in order. If you cannot do any of these, then you will receive events out of order and you will have to think about how to consume them out of order, but that is not a question related to Kafka.
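The key-to-partition mapping can be sketched in a few lines of plain Python. This is only an illustration: Kafka's default partitioner hashes the serialized key with murmur2, whereas this sketch uses CRC32 because it ships with the standard library, and `choose_partition` is a hypothetical name.

```python
import zlib
from typing import Dict, List, Tuple


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically.

    Stand-in for the producer's default partitioner (which uses murmur2);
    CRC32 is used here only because it is in the standard library.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "add-note"),
    ("user-1", "logout"),
]

# Simulate six partitions as append-only lists.
partitions: Dict[int, List[Tuple[str, str]]] = {}
for key, action in events:
    partitions.setdefault(choose_partition(key, 6), []).append((key, action))

user1_partition = partitions[choose_partition("user-1", 6)]
print([action for key, action in user1_partition if key == "user-1"])
```

Because the same key always maps to the same partition, the events of `user-1` stay in send order relative to each other, regardless of where other users' events land.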
If you are not able to preserve the order of events (so that Logout is the last event), you can meet your requirements using the Processor API from Kafka Streams. The Kafka Streams DSL can be combined with the Processor API (more details here).
You can have several partitions, but all events for a particular user have to be sent to the same partition.
You have to implement a custom Processor/Transformer.
Your processor will put each event/activity into a state store (aggregating all events from a particular user under the same key).
The Processor API gives you the ability to create a kind of scheduler (a Punctuator).
You can schedule a check of the events for each user every X seconds. If the Logout was long enough ago, you fetch all events/activities for that user, aggregate them, and send the result downstream.
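The buffer-then-punctuate idea above can be sketched in plain Python. This is not the Processor API itself: the class `SessionAggregator`, its dictionaries (standing in for a state store), and the explicit `now` arguments (standing in for the scheduler's clock) are assumptions made so that the sketch runs on its own.

```python
from typing import Dict, List


class SessionAggregator:
    """Buffers events per user and emits them once the logout is old enough."""

    def __init__(self, grace_seconds: int) -> None:
        self.grace_seconds = grace_seconds
        self.events: Dict[str, List[dict]] = {}   # stand-in for a state store
        self.logout_seen_at: Dict[str, int] = {}  # arrival time of each logout

    def process(self, user_id: str, event: dict, now: int) -> None:
        """Store every event under the user's key; remember when logout arrived."""
        self.events.setdefault(user_id, []).append(event)
        if event["type"] == "logout":
            self.logout_seen_at[user_id] = now

    def punctuate(self, now: int) -> List[dict]:
        """Called periodically; returns aggregates whose grace period has expired."""
        results = []
        for user_id, seen_at in list(self.logout_seen_at.items()):
            if now - seen_at >= self.grace_seconds:
                buffered = sorted(self.events.pop(user_id), key=lambda e: e["time"])
                del self.logout_seen_at[user_id]
                results.append(
                    {"userId": user_id, "events": [e["type"] for e in buffered]}
                )
        return results


agg = SessionAggregator(grace_seconds=30)
agg.process("u1", {"type": "login", "time": 100}, now=100)
agg.process("u1", {"type": "logout", "time": 300}, now=301)
agg.process("u1", {"type": "add_note", "time": 200}, now=310)  # arrives late
too_early = agg.punctuate(now=320)
flushed = agg.punctuate(now=331)
print(too_early, flushed)
```

The late `add_note` event is still included because it arrived within the grace period after the logout; the size of that period is a trade-off between latency and completeness.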
As said in other answers, in Kafka order is maintained on a per-partition basis.
Since you are talking about user events, why don't you use the userId as your Kafka message key? That way, all events related to a specific user will always be ordered (provided they are produced by a single producer).
You should ensure (by design) that only one Kafka producer pushes all the user change events to the given topic. In this way, you can avoid out-of-order messages due to multiple producers.
On the Streams side, you might also want to look at windows in Kafka Streams. Tumbling windows, for example, are non-overlapping and of fixed size. You aggregate records over a period of time.
You may then want to sort the aggregated records by their timestamp (you said you have logout time, login time, etc.) and act accordingly.
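As a rough illustration of tumbling windows followed by a sort, here is a plain-Python sketch. It does not use Kafka Streams; the helper names `tumbling_window_start` and `group_and_sort` are hypothetical, and timestamps are assumed to be in milliseconds.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def tumbling_window_start(timestamp_ms: int, size_ms: int) -> int:
    """Start of the fixed-size, non-overlapping window containing the timestamp."""
    return timestamp_ms - (timestamp_ms % size_ms)


def group_and_sort(
    events: List[Tuple[int, str]], size_ms: int
) -> Dict[int, List[Tuple[int, str]]]:
    """Group (timestamp, name) events by window, then sort each window by timestamp."""
    windows: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
    for ts, name in events:
        windows[tumbling_window_start(ts, size_ms)].append((ts, name))
    return {start: sorted(items) for start, items in sorted(windows.items())}


# Events arrive out of order.
events = [(1500, "logout"), (200, "login"), (900, "add_note"), (2100, "login")]
grouped = group_and_sort(events, size_ms=1000)
print(grouped)
```

Note that a login and its logout can land in different tumbling windows, as they do here, so this alone does not reassemble a whole session.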
Simple and effective solution
Use synchronous send and set delivery.timeout.ms and retries to a maximum value.
To ensure fault tolerance set acks=all with min.insync.replicas=2 (topic configuration) and use a single producer to push to that topic.
You should also set max.block.ms to some max value so that your send() does not return immediately if there is an error in fetching the metadata (for example, when Kafka is down).
Benchmark the synchronous send with your rate and check to see if it meets your requirements or benchmark number.
This ensures that a message that came first is sent first to Kafka and then the next message is not sent until the previous message is successfully acknowledged.
If your benchmark figure is not met, try adding a back-pressure mechanism such as an in-memory or persistent queue:
Add event to a queue in Thread-1
Peek (not dequeue) event from the queue in Thread-2
Call producer.send(...).get() in Thread-2
Dequeue the event in Thread-2
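The peek, send, dequeue steps can be sketched as follows. To stay self-contained, the sketch is single-threaded and takes a `send` callable as a hypothetical stand-in for a blocking `producer.send(...).get()`; the function name `drain_in_order` and the retry limit are assumptions as well.

```python
from collections import deque
from typing import Callable, Deque, List


def drain_in_order(
    queue: Deque[str], send: Callable[[str], bool], max_attempts: int = 3
) -> List[str]:
    """Send queued events one at a time, dequeuing only after a successful send."""
    delivered = []
    while queue:
        event = queue[0]  # peek: the event stays in the queue for now
        for _ in range(max_attempts):
            if send(event):  # stand-in for a synchronous, acknowledged send
                break
        else:
            break  # all attempts failed: stop, leaving the event at the head
        queue.popleft()  # dequeue only after the acknowledgement
        delivered.append(event)
    return delivered


attempts = {}


def flaky_send(event: str) -> bool:
    """Hypothetical sender that fails the first attempt for event 'b'."""
    attempts[event] = attempts.get(event, 0) + 1
    return not (event == "b" and attempts[event] == 1)


pending = deque(["a", "b", "c"])
print(drain_in_order(pending, flaky_send))
```

Because an event is removed only after it has been acknowledged, a failed send never lets a later event overtake an earlier one.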
The key is to make your frontend tracker send ordered events to the backend service, which then produces the events to Kafka.
You can achieve that by batching the events, and sending the batched events to the backend only after the previous batched events are successfully delivered.

Kafka Streams Sort Within Processing Time Window

I wonder if there's any way to sort records within a window using Kafka Streams DSL or Processor API.
Imagine the following situation as an example (arbitrary one, but similar to what I need):
There is a Kafka topic of some events, let's say user clicks. Let's say topic has 10 partitions. Messages are partitioned by key, but each key is unique, so it's sort of a random partitioning. Each record contains a user id, which is used later to repartition the stream.
We consume the stream and publish each message to another topic, partitioning the records by user id (that is, we repartition the original stream by user id).
Then we consume this repartitioned stream and store the consumed records in a local state store, windowed by 10 minutes. All clicks of a particular user are always in the same partition, but order is not guaranteed, because the original topic had 10 partitions.
I understand the windowing model of Kafka Streams, and that time advances when new records come in, but I need this window to use processing time, not event time, and when the window expires, I need to be able to sort the buffered events and emit them in that order to another topic.
We need to be able to flush/process records within the window using processing time, not the event time. We can't wait for the next click to advance the time, because it may never happen.
We need to remove all the records from the store as soon as the window is sorted and flushed.
If the application crashes, we need to recover (in the same or another instance of the application) and process all the windows that were not processed, without waiting for new records to come in for a particular user.
I know Kafka Streams 1.0.0 allows the use of wall-clock time in the Processor API, but I'm not sure what the right way to implement what I need would be (more importantly, taking into account the recovery requirement described above).
You can see my answer to a similar question here:
Since your message keys are already unique you can ignore my comments about de-duplication.
Now that KIP-138 (wall-clock punctuation semantics) has been released in 1.0.0 you should be able to implement the outlined algorithm without issues. It uses the Processor API. I don't know of a way of doing this with only the DSL.
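The linked answer's algorithm is not reproduced here, so the following is only one possible reading of the requirements, sketched in plain Python rather than with the Processor API. The class `WindowSorter`, the dictionary standing in for a fault-tolerant state store, and the explicit `wall_clock_ms` arguments are all assumptions.

```python
from typing import Dict, List, Optional, Tuple

WINDOW_MS = 10 * 60 * 1000  # 10-minute processing-time window

StoreKey = Tuple[str, int]           # (user_id, window_start_ms)
StoreValue = List[Tuple[int, str]]   # list of (event_time_ms, payload)


class WindowSorter:
    """Buffers records per user and processing-time window, flushes them sorted."""

    def __init__(self, store: Optional[Dict[StoreKey, StoreValue]] = None) -> None:
        # Passing an existing store simulates restoring state after a crash.
        self.store: Dict[StoreKey, StoreValue] = store if store is not None else {}

    def add(self, user_id: str, event_time_ms: int, payload: str, wall_clock_ms: int) -> None:
        """Assign the record to a window by arrival (wall-clock) time."""
        window_start = wall_clock_ms - (wall_clock_ms % WINDOW_MS)
        self.store.setdefault((user_id, window_start), []).append((event_time_ms, payload))

    def punctuate(self, wall_clock_ms: int) -> List[str]:
        """Emit and delete every expired window, sorted by event time."""
        emitted: List[str] = []
        for key in sorted(self.store):
            _, window_start = key
            if window_start + WINDOW_MS <= wall_clock_ms:
                emitted.extend(payload for _, payload in sorted(self.store.pop(key)))
        return emitted


sorter = WindowSorter()
sorter.add("u1", event_time_ms=50, payload="click-b", wall_clock_ms=1000)
sorter.add("u1", event_time_ms=10, payload="click-a", wall_clock_ms=2000)
not_yet = sorter.punctuate(WINDOW_MS - 1)

# Simulated recovery: a new instance starts from the restored store.
restored = WindowSorter(store=dict(sorter.store))
after_restart = restored.punctuate(WINDOW_MS)
print(not_yet, after_restart)
```

Since flushing depends only on the wall clock and the contents of the store, the first punctuation after a restart picks up every window that expired in the meantime, without waiting for new records.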

Ingesting data from REST api to Kafka

I have many REST APIs that pull data from different data sources, and now I want to publish these REST responses to different Kafka topics. I also want to make sure that duplicate data is not produced.
Are there any tools available for this kind of operation?
So in general a Kafka processing pipeline should be able to handle messages that are sent multiple times. Exactly-once delivery of Kafka messages is a feature that has only been around since mid-2017 (given that I'm writing this in January 2018) and Kafka 0.11, so unless you're on the bleeding edge with your Kafka installation, your pipeline should be able to handle multiple deliveries of the same message.
That's your pipeline, of course. Now you have a problem where a data source may deliver the same message multiple times to your HTTP -> Kafka microservice.
Theoretically you should design your pipeline to be idempotent, so that multiple applications of the same change message affect the data only once. This is, of course, easier said than done. But if you manage it, then "problem solved": just send the duplicate messages through, and it doesn't matter. This is probably the best thing to aim for, regardless of whatever exactly-once delivery, CAP-theorem-bending magic KIP-98 does. (And if you don't see why this is super magic, well, here's a homework topic :) )
Let's say your input data is posts about users. If your posted data includes some kind of updated_at date, you could create a transaction log Kafka topic. Set the key to be the user ID and the value to be all the (say) updated_at fields applied to that user. When you're processing an HTTP POST, look up the user in a local KTable for that topic and check whether your post has already been recorded. If it has, don't produce the change into Kafka.
Even without the updated_at field you could save the user document in the KTable. If Kafka is a stream of transaction log data (the database inside out) then KTables are the streams right side out: a database again. If the current value in the KTable (the accumulation of all applied changes) matches the object you were given in your post, then you've already applied the changes.
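The updated_at lookup can be sketched in plain Python. A dictionary stands in for the local KTable and a list for the Kafka topic; the class `DedupingPublisher` and its method `handle_post` are hypothetical names, and timestamps are assumed to be ISO 8601 strings.

```python
from typing import Dict, List, Set, Tuple


class DedupingPublisher:
    """Produces a change only if its updated_at was not already recorded."""

    def __init__(self) -> None:
        # Stand-in for the KTable: user_id -> all updated_at values applied so far.
        self.seen: Dict[str, Set[str]] = {}
        # Stand-in for the Kafka topic that receives the changes.
        self.produced: List[Tuple[str, str, dict]] = []

    def handle_post(self, user_id: str, updated_at: str, document: dict) -> bool:
        """Return True if the change was produced, False if it was a duplicate."""
        applied = self.seen.setdefault(user_id, set())
        if updated_at in applied:
            return False  # already recorded: do not produce again
        applied.add(updated_at)
        self.produced.append((user_id, updated_at, document))
        return True


publisher = DedupingPublisher()
first = publisher.handle_post("u1", "2018-01-05T10:00:00Z", {"name": "Ann"})
repeat = publisher.handle_post("u1", "2018-01-05T10:00:00Z", {"name": "Ann"})
newer = publisher.handle_post("u1", "2018-01-06T09:30:00Z", {"name": "Anne"})
print(first, repeat, newer, len(publisher.produced))
```

In a real deployment the lookup table would have to be rebuilt from the topic after a restart, which is what backing it with a KTable provides.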

Concurrent writes for event sourcing on top of Kafka

I've been considering using Apache Kafka as the event store in an event sourcing configuration. The published events will be associated with specific resources, delivered to a topic associated with the resource type, and sharded into partitions by resource id. So, for instance, the creation of a resource of type Folder with id 1 would produce a FolderCreate event that would be delivered to the "folders" topic, in a partition determined by sharding the id 1 across the total number of partitions in the topic. However, I don't know how to handle concurrent events that make the log inconsistent.
The simplest scenario would be having two concurrent actions that can invalidate each other, such as one to update a folder and one to destroy that same folder. In that case the partition for that topic could end up containing the invalid sequence [FolderDestroy, FolderUpdate]. That situation is often fixed by versioning the events as explained here, but Kafka does not support such a feature.
What can be done to ensure the consistency of the Kafka log itself in those cases?
I think it's probably possible to use Kafka for event sourcing of aggregates (in the DDD sense), or 'resources'. Some notes:
1. Serialise writes per partition, using a single process per partition (or partitions) to manage this. Ensure you send messages serially down the same Kafka connection, and use acks=all before reporting success to the command sender, if you can't afford rollbacks. Ensure the producer process keeps track of the current successful event offset/version for each resource, so it can do the optimistic check itself before sending the message.
2. Since a write failure might be returned even if the write actually succeeded, you need to retry writes and deal with deduplication, say by including an ID in each event, or by reinitializing the producer and re-reading (recent messages in) the stream to see whether the write actually worked or not.
3. Writing multiple events atomically: just publish a composite event containing a list of events.
4. Lookup by resource id. This can be achieved by reading all events from a partition at startup (or all events from a particular cross-resource snapshot), and storing the current state either in RAM or cached in a DB.
https://issues.apache.org/jira/browse/KAFKA-2260 would solve 1 in a simpler way, but seems to be stalled.
Kafka Streams appears to provide a lot of this for you. For example, 4 is a KTable, which your event producer can use to work out whether an event is valid for the current resource state before sending it.
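The optimistic check done by a single writer before sending can be sketched in plain Python. A list stands in for the partition, and the class `ResourceLog` with its `append` method is a hypothetical name; Kafka itself offers no such conditional append, which is why the check lives in the producing process.

```python
from typing import Dict, List, Tuple


class ResourceLog:
    """Single-writer log that rejects events based on a stale resource version."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, str]] = []  # stand-in for one partition
        self.versions: Dict[str, int] = {}       # resource_id -> events accepted

    def append(self, resource_id: str, event_type: str, expected_version: int) -> bool:
        """Append only if the caller saw the latest version of the resource."""
        current = self.versions.get(resource_id, 0)
        if expected_version != current:
            return False  # another command got in first: reject, do not write
        self.events.append((resource_id, event_type))
        self.versions[resource_id] = current + 1
        return True


log = ResourceLog()
created = log.append("folder-1", "FolderCreate", expected_version=0)

# Two concurrent commands both read version 1 before writing.
destroyed = log.append("folder-1", "FolderDestroy", expected_version=1)
updated = log.append("folder-1", "FolderUpdate", expected_version=1)
print(created, destroyed, updated, log.events)
```

The rejected `FolderUpdate` never reaches the log, so the invalid sequence [FolderDestroy, FolderUpdate] from the question cannot occur as long as all writes for a partition go through this one process.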