Kafka streams: join on ingestion time - scala

I have two topics of fairly varied volumetry (could be something like 1000 events emitted in the left topic for every event in the right topic).
I'm trying to leftJoin those two topics together and I'm having the impression that the join window is computed over processing time and not ingestion time, causing the smaller stream to "run out" way too soon.
Is it possible to specify the time semantics of a stream-stream join to ingestion time (or event time)?
I could see why it's not such an easy thing to use ingestion time but it seems to be a necessity when processing historical streams.

Kafka Streams join is based on event-time, i.e., whatever TimestampExtractor returns (by default the message timestamp as stored in the topic) and you cannot modify it (you can only use a different timestamp extractor to indirectly modify the result).
Note though, that the join is executed "eagerly", and thus for all left side records, the lookup into the right stream is done immediately, what can lead to additional <key, (left-value,null)> results.
It also depends on the processing order that is base on event-time in a best effort manner. The guarantees got improve in the 2.3 release and using config parameter max.task.idle.ms might help to mitigate the issue.
It's on the mid-term roadmap to improve left- and outer-join behavior to avoid those additional result records. As Kafka is an open-source projects and you would like to pick it up, please help to get this fixed sooner :)
The only other alternative would be to implement a custom join operator via the Processor API.

Related

Is there a way to tell which event occurred first in two kafka topics

If I have two topics in kafka, is there a way to tell if one event in one topic "occured" before an event in another topic if they both come in within a millisecond of each other ie they have the same timestamp?
Background:
I am building an event sourcing based event drive architecture. Often, when an event occurs in one topic, I need to do a scan to find if a separate event has already occurred in a second topic. Likewise, if the event in the second topic comes in, I need to scan to see if the event in topic one occurred.
In order to not duplicate processing, I need a deterministic way to order the events. If the events are more than 1 millisecond apart, I can just use the timestamp in the event. But, because kafka timestamps only go to the millisecond, when two events occur close together, I can no longer use this approach.
In reality, I don't care which topic occured "first", ie if kafka posted one before another, even if they came in a different order, I don't care. I just need a deterministic way to order them.
In reality, I can use some method, such as arranging the events by topic alphabetically, but was hoping there was a built-in mechanism. (don't want to introduce weird bugs because I always process event A before event B; unlikely, but I've seen it happen)
PS I am open to other ideas. I'm thinking this approach because it was possible in redis streams. However, because of things I can't control, I am restricted to kafka. I do want to avoid using an external data store as then I need to start worrying about data synchronization in there.
You're going to run into synchronization issues, regardless. For example - you could try using a stream-topic join in Kafka Streams. If the event doesn't exist for the join, then it hasn't happened yet, but then you're reliant on having absolutely zero lag in the consumer processes building that KTable.
You could try storing nanoseconds as part of the value or header when you create the record if you need higher precision, but again, you're going to either need absolute zero lag or very precise consumer poll events with some comparison window as Kafka does not provide any processing guarantees across multiple topics

Apache Flink - Partitioning the stream equally as the input Kafka topic

I would like to implement in Apache Flink the following scenario:
Given a Kafka topic having 4 partitions, I would like to process the intra-partition data independently in Flink using different logics, depending on the event's type.
In particular, suppose the input Kafka topic contains the events depicted in the previous images. Each event have a different structure: partition 1 has the field "a" as key, partition 2 has the field "b" as key, etc. In Flink I would like to apply different business logics depending on the events, so I thought I should split the stream in some way. To achieve what's described in the picture, I thought to do something like that using just one consumer (I don't see why I should use more):
FlinkKafkaConsumer<..> consumer = ...
DataStream<..> stream = flinkEnv.addSource(consumer);
stream.keyBy("a").map(new AEventMapper()).addSink(...);
stream.keyBy("b").map(new BEventMapper()).addSink(...);
stream.keyBy("c").map(new CEventMapper()).addSink(...);
stream.keyBy("d").map(new DEventMapper()).addSink(...);
(a) Is it correct? Also, if I would like to process each Flink partition in parallel, since I'm just interested to process in-order the events sorted by the same Kafka partition, and not considering them globally, (b) how can I do? I know the existence of the method setParallelism(), but I don't know where to apply it in this scenario.
I'm looking for an answer about questions marked (a) and (b). Thank you in advance.
If you can build it like this, it will perform better:
Specifically, what I'm proposing is
Set the parallelism of the entire job to exactly match the number of Kafka partitions. Then each FlinkKafkaConsumer instance will read from exactly one partition.
If possible, avoid using keyBy, and avoid changing the parallelism. Then the source, map, and sink will all be chained together (this is called operator chaining), and no serialization/deserialization and no networking will be needed (within Flink). Not only will this perform well, but you can also take advantage of fine-grained recovery (streaming jobs that are embarrassingly parallel can recover one failed task without interrupting the others).
You can write a general purpose EventMapper that checks to see what type of event is being processed, and then does whatever is appropriate. Or you can try to be clever and implement a RichMapFunction that in its open() figures out which partition is being handled, and loads the appropriate mapper.

Is it ok to use Apache Kafka "infinite retention policy" as a base for an Event sourced system with CQRS?

I'm currently evaluating options for designing/implementing Event Sourcing + CQRS architectural approach to system design. Since we want to use Apache Kafka for other aspects (normal pub-sub messaging + stream processing), the next logical question would be, "Can we use the Apache Kafka store as event store for CQRS"?, or more importantly would that be a smart decision?
Right now I'm unsure about this.
This source seems to support it: https://www.confluent.io/blog/okay-store-data-apache-kafka/
This other source recommends against that: https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
In my current tests/experiments, I'm having problems similar to those described by the 2nd source, those are:
recomposing an entity: Kafka doesn't seem to support fast retrieval/searching of specific events within a topic (for example: all commands related to an order's history - necessary for the reconstruction of the entity's instance, seems to require the scan of all the topic's events and filter only those matching some entity instance identificator, which is a no go). [This other person seems to have arrived to a similar conclusion: Query Kafka topic for specific record -- that is, it is just not possible (without relying on some hacky trick)]
- write consistency: Kafka doesn't support transactional atomicity on their store, so it seems a common practice to just put a DB with some locking approach (usually optimistic locking) before asynchronously exporting the events to the Kafka queue (I can live with this though, the first problem is much more crucial to me).
The partition problem: On the Kafka documentation, it is mentioned that "order guarantee", exists only within a "Topic's partition". At the same time they also say that the partition is the basic unit of parallelism, in other words, if you want to parallelize work, spread the messages across partitions (and brokers of course). But this is a problem, because an "Event store" in an event sourced system needs the order guarantee, so this means I'm forced to use only 1 partition for this use case if I absolutely need the order guarantee. Is this correct?
Even though this question is a bit open, It really is like that: Have you used Kafka as your main event store on an event sourced system? How have you dealt with the problem of recomposing entity instances out of their command history (given that the topic has millions of entries scanning all the set is not an option)? Did you use only 1 partition sacrificing potential concurrent consumers (given that the order guarantee is restricted to a specific topic partition)?
Any specific or general feedback would the greatly appreciated, as this is a complex topic with several considerations.
Thanks in advance.
EDIT
There was a similar discussion 6 years ago here:
Using Kafka as a (CQRS) Eventstore. Good idea?
Consensus back then was also divided, and a lot of people that suggest this approach is convenient, mention how Kafka deals natively with huge amounts of real time data. Nevertheless the problem (for me at least) isn't related to that, but is more related to how inconvenient are Kafka's capabilities to rebuild an Entity's state- Either by modeling topics as Entities instances (where the exponential explosion in topics amount is undesired), or by modelling topics es entity Types (where amounts of events within the topic make reconstruction very slow/unpractical).
your understanding is mostly correct:
kafka has no search. definitely not by key. there's a seek to timestamp, but its imperfect and not good for what youre trying to do.
kafka actually supports a limited form of transactions (see exactly once) these days, although if you interact with any other system outside of kafka they will be of no use.
the unit of anything in kafka (event ordering, availability, replication) is a partition. there are no guarantees across partitions of the same topic.
all these dont stop applications from using kafka as the source of truth for their state, so long as:
your problem can be "sharded" into topic partitions so you dont care about order of events across partitions
youre willing to "replay" an entire partition if/when you lose your local state as bootstrap.
you use log compacted topics to try and keep a bound on their size (because you will need to replay them to bootstrap, see above point)
both samza and (IIUC) kafka-streams back their state stores with log-compacted kafka topics. internally to kafka offset and consumer group management is stored as a log compacted topic with brokers holding a "materialized view" in memory - when ownership of a partition of __consumer_offsets moves between brokers the new leader replays the partition to rebuild this view.
I was in several projects that uses Kafka as long term storage, Kafka has no problem with it, specially with the latest versions of Kafka, they introduced something called tiered storage, which give you the possibility in Cloud environment to transfer the older data to slower/cheaper storage.
And you should not worry that much about transactions, in todays IT there are other concepts to deal with it like Event Sourcing, [Boundary Context][3,] yes, you should differently when you are designing your applications, how?, that is explained in this video.
But you are right, your choice about query this data will be limited, easiest way is to use Kafka Streams and KTable but this will be a Key/Value database so you can only ask questions about your data over primary key.
Your next best choice is to implement the Query part of the CQRS with the help of Frameworks like Akka Projection, I wrote a blog about how can you use Akka Projection with Elasticsearch, which you can find here and here.

How to join multiple Kafka topics?

So I have...
1st topic that has general application logs (log4j). Stores things like HTTP API requests/responses and warnings, exceptions etc... There can be multiple logs associated to one logical business request. (These logs happen within seconds of each other)
2nd topic contains commands from the above business request which other services take action on. (The commands also happen within seconds of each other, but maybe couple minutes from the original request)
3rd topic contains events generated from actions of those other services. (Most events complete within seconds, but some can take up to 3-5 days to be received)
So a single logical business request can have multiple logs, commands and events associated to it by a uuid which the microservices pass to each other.
So what are some of the technologies/patterns that can be used to read the 3 topics and join them all together as a single json document and then dump them to lets say Elasticsearch?
Streaming?
You can use Kafka Streams, or KSQL, to achieve this. Which one depends on your preference/experience with Java, and also the specifics of the joins you want to do.
KSQL is the SQL streaming engine for Apache Kafka, and with SQL alone you can declare stream processing applications against Kafka topics. You can filter, enrich, and aggregate topics. Currently only stream-table joins are supported. You can see an example in this article here
The Kafka Streams API is part of Apache Kafka, and a Java library that you can use to do stream processing of data in Apache Kafka. It is actually what KSQL is built on, and supports greater flexibility of processing, including stream-stream joins.
You can use KSQL to join the streams.
There are 2 constructs in KSQL Table/Stream.
Currently, the Join is supported for a Stream & a table. So you need to identify the which is a good fit for what?
You don't need windowing for joins.
Benefits of using KSQL.
KSQL is easy to set up.
KSQL is SQL language which helps you to query your data quickly.
Drawback.
It's not production ready but in April-2018 the release is coming up.
Its little buggy right now but certainly will improve in a few months.
Please have a look.
https://github.com/confluentinc/ksql
Same as question Is it possible to use multiple left join in Confluent KSQL query? tried to join stream with more than 1 tables , if not then whats the solution?
And seems like you can not have multiple join keys within same query.

Processing records in order in Storm

I'm new to Storm and I'm having problems to figure out how to process records in order.
I have a dataset which contains records with the following fields:
user_id, location_id, time_of_checking
Now, I would like to identify users which have fulfilled the path I specified (for example, users that went from location A to location B to location C).
I'm using Kafka producer and reading this records from a file to simulate live data. Data is sorted by date.
So, to check if my pattern is fulfilled I need to process records in order. The thing is, due to parallelization (bolt replication) I don't get check-ins of user in order. Because of that patterns won't work.
How to overcome this problem? How to process records in order?
There is no general system support for ordered processing in Storm. Either you use a different system that supports ordered steam processing like Apache Flink (Disclaimer, I am a committer at Flink) or you need to take care of it in your bolt code by yourself.
The only support Storm delivers is using Trident. You can put tuples of a certain time period (for example one minute) into a single batch. Thus, you can process all tuples within a minute at once. However, this only works if your use case allows for it because you cannot related tuples from different batches to each other. In your case, this would only be the case, if you know that there are points in time, in which all users have reached their destination (and no other use started a new interaction); ie, you need points in time in which no overlap of any two users occurs. (It seems to me, that your use-case cannot fulfill this requirement).
For non-system, ie, customized user-code based solution, there would be two approaches:
You could for example buffer up tuples and sort on time stamp within a bolt before processing. To make this work properly, you need to inject punctuations/watermarks that ensure that no tuple with larger timestamp than the punctuation comes after a punctuation. If you received a punctuation from each parallel input substream you can safely trigger sorting and processing.
Another way would be to buffer tuples per incoming substream in district buffers (within a substream order is preserved) and merge the tuples from the buffers in order. This has the advantage that sorting is avoided. However, you need to ensure that each operator emits tuples ordered. Furthermore, to avoid blocking (ie, if no input is available for a substream) punctuations might be needed, too. (I implemented this approach. Feel free to use the code or adapt it to your needs: https://github.com/mjsax/aeolus/blob/master/queries/utils/src/main/java/de/hub/cs/dbis/aeolus/utils/TimestampMerger.java)
Storm supports this use case. For this you just have to ensure that order is maintained throughout your flow in all the involved components. So as first step, in Kafka producer, all the messages for a particular user id should go to the same partition in Kafka. For this you can implement a custom Partitioner in your KafkaProducer. Please refer to the link here for implementation details.
Since a partition in Kafka can be read by one and only one kafkaSpout instance in Storm, the messages in that partition come in order in the spout instance. Thereby ensuring that all the messages of the same user id arrive to the same spout.
Now comes the tricky part - to maintain order in bolt, you want to ensure that you use field grouping on bolt based on "user_id" field emitted from the Kafka spout. A provided kafkaSpout does not break the message to emit field, you would have to override the kafkaSpout to read the message and emit a "user_id" field from the spout. One way of doing so is to have an intermediate bolt which reads the message from the Kafkaspout and emits a stream with "user_id" field.
When finally you specify a bolt with field grouping on "user_id", all messages of a particular user_id value would go to the same instance of the bolt, whatever be the degree of parallelism of the bolt.
A sample topology which work for your case could be as follow -
builder.setSpout("KafkaSpout", Kafkaspout);
builder.setBolt("FieldsEmitterBolt", FieldsEmitterBolt).shuffleGrouping("KafkaSpout");
builder.setBolt("CalculatorBolt", CalculatorBolt).fieldsGrouping("FieldsEmitterBolt", new Fields("user_id")); //user_id field emitted by Bolt2
--Beware, there could be case when all the user_id values come to the same CalculatorBolt instance if you have limited number of user_ids. This in turn would decrease the effective 'parallelism'!