Kafka - Difference between Events with batch data and Streams

What is the fundamental difference between an event with a batch of data attached and a Kafka stream that occasionally sends data? Can they be used interchangeably? When should I use the first and when the latter? Could you provide some simple use cases?
Note: There is some info in the comments of this question, but I would ask for a more well-rounded answer.

I assume that with "difference" between streams and events with batched data you are thinking of:
Stream: Every event of interest is sent immediately to the stream. Those individual events are therefore fine-grained, small(er) in size.
Events with data batch: Multiple individual events get aggregated into a larger batch, and when the batch reaches a certain size, a certain time has passed, or a business transaction has completed, the batch event is sent to the stream. Those batch events are therefore more coarse-grained and large(r) in size.
Here is a list of characteristics that I can think of:
Realtime/latency: End-to-end processing time will typically be shorter for individual events and longer for batch events, because the publisher may hold back a batch event until enough individual events have accumulated.
Throughput: Message brokers differ in performance characteristics regarding the max. # of in/out events/sec they can handle at comparable amounts of in/out data. For example, comparing Kinesis vs. Kafka, Kinesis can handle a lower max. # of in/out events/sec than a finely tuned Kafka cluster. So if you were to use Kinesis, batch events may make more sense to achieve the desired throughput in terms of # of individual events. Note: From what I know, the Kinesis client library has a feature to transparently batch individual events, if desired/possible, to increase throughput.
Order and correlation: If multiple individual events belong to one business transaction and need to be processed by consumers together and/or possibly in order, then batch events may make this task easier because all related data becomes available to consumers at once. With individual events, you would have to put appropriate measures in place like selecting appropriate partition keys to guarantee that individual events get processed in order and possibly by the same consumer worker instance.
Failure case: If batch events contain independent individual events, then it may happen that a subset of individual events in a batch fails to process (irrelevant whether the failure is temporary or permanent). In such a case, consumers may not be able to simply retry the entire event because parts of the batch event have already caused state changes. Explicit logic (= additional effort) may be necessary to handle partial processing failure of batch events.
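For illustration, here is a minimal, hypothetical sketch (in Java; the String event type and publishBatch() are placeholders) of the "events with data batch" approach, flushing on a size or time threshold:

import java.util.ArrayList;
import java.util.List;

final class BatchingPublisher {
    private final List<String> buffer = new ArrayList<>();
    private final int maxBatchSize;
    private final long maxDelayMillis;
    private long lastFlushMillis = System.currentTimeMillis();

    BatchingPublisher(int maxBatchSize, long maxDelayMillis) {
        this.maxBatchSize = maxBatchSize;
        this.maxDelayMillis = maxDelayMillis;
    }

    // Called for every fine-grained event; flushes once a threshold is met.
    // (A real implementation would also flush on a timer, so a quiet period
    // cannot delay a partially filled batch indefinitely.)
    synchronized void onEvent(String event) {
        buffer.add(event);
        boolean sizeReached = buffer.size() >= maxBatchSize;
        boolean delayReached = System.currentTimeMillis() - lastFlushMillis >= maxDelayMillis;
        if (sizeReached || delayReached) {
            publishBatch(new ArrayList<>(buffer)); // one coarse-grained batch event
            buffer.clear();
            lastFlushMillis = System.currentTimeMillis();
        }
    }

    private void publishBatch(List<String> batch) {
        // Placeholder: send the batch as a single message to the stream.
    }
}

Note that the Kafka producer and the Kinesis client library do a form of this transparently on the wire (cf. batch.size/linger.ms in Kafka), which is separate from batching at the event-model level as sketched here.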
To answer your question whether the two can be used interchangeably: in theory yes, but depending on the specific use case, one of the two approaches will likely result in better performance or in a less complex design/code/configuration.
I'll edit my answer if I can think of more differentiating characteristics.

Related

Process Pub/Sub messages at a constant rate, using streaming and serverless

The scenario:
I have thousands of requests I need to issue each day.
I know the number at the beginning of the day, and ideally I want to send all the data about the requests to Pub/Sub, one message per request.
I want to make the requests at a constant rate. For example, if I have 172,800 requests, I want to process 2 each second.
The ultimate approach would involve Pub/Sub push and Cloud Run.
Using pull with long-running instances is also an option.
Any other options are also welcome.
I want to avoid running in a loop and fetching records from a database with a limit.
This is how I am doing it today.
You can use batching and flow control settings to fine-tune Pub/Sub performance, which will help in processing messages at a constant rate.
Batching
A batch, within the context of Cloud Pub/Sub, refers to a group of one or more messages published to a topic by a publisher in a single publish request. Batching is done by default in the client library or explicitly by the user. The purpose of this feature is to allow a higher throughput of messages while also providing a more efficient way for messages to travel through the various layers of the service. Adjusting the batch size (i.e. how many messages or bytes are sent in a publish request) can be used to achieve the desired level of throughput.
Features specific to batching on the publisher side include setElementCountThreshold(), setRequestByteThreshold(), and setDelayThreshold() as part of setBatchSettings() on a publisher client (the naming varies slightly in the different client libraries). These features can be used to finely tune the behavior of batching to find a better balance among cost, latency, and throughput.
Note: The maximum number of messages that can be published in a single batch is 1000 messages or 10 MB.
An example of these batching properties can be found in the Publish with batching settings documentation.
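As a sketch of how those settings are wired up (project/topic names and thresholds are examples; depending on the client version, Duration comes from org.threeten.bp or java.time):

import com.google.api.gax.batching.BatchingSettings;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.pubsub.v1.TopicName;
import org.threeten.bp.Duration;

TopicName topicName = TopicName.of("my-project", "my-topic");
BatchingSettings batchingSettings =
    BatchingSettings.newBuilder()
        .setElementCountThreshold(100L)            // max messages per publish request
        .setRequestByteThreshold(1024L)            // max bytes per publish request
        .setDelayThreshold(Duration.ofMillis(500)) // max wait before sending a batch
        .build();
Publisher publisher =
    Publisher.newBuilder(topicName).setBatchingSettings(batchingSettings).build();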
Flow Control
Flow control features on the subscriber side can help guard against unhealthy behavior of tasks in the pipeline by allowing the subscriber to regulate the rate at which messages are ingested. These features provide the added functionality to adjust how sensitive the service is to sudden spikes or drops in published throughput.
Some features that are helpful for adjusting flow control and other settings on the subscriber are setMaxOutstandingElementCount(), setMaxOutstandingRequestBytes(), and setMaxAckExtensionPeriod().
Examples of these settings being used can be found in the Subscribe with flow control documentation.
For more information refer to this link.
If you have long-running instances as subscribers, then you will need to set the relevant flow control settings, for example .setMaxOutstandingElementCount(1000L).
Once you have set it to the desired number (for example 1000), this controls the maximum number of messages the subscriber receives before pausing the message stream, as explained in the code below from this documentation:
// The subscriber will pause the message stream and stop receiving more messages from the
// server if any one of the conditions is met.
FlowControlSettings flowControlSettings =
    FlowControlSettings.newBuilder()
        // 1,000 outstanding messages. Must be >0. It controls the maximum number of messages
        // the subscriber receives before pausing the message stream.
        .setMaxOutstandingElementCount(1000L)
        // 100 MiB. Must be >0. It controls the maximum size of messages the subscriber
        // receives before pausing the message stream.
        .setMaxOutstandingRequestBytes(100L * 1024L * 1024L)
        .build();
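These settings are then attached to a subscriber roughly as follows (a sketch; the project/subscription names and the message handler are examples):

import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;

ProjectSubscriptionName subscription =
    ProjectSubscriptionName.of("my-project", "my-subscription");
MessageReceiver receiver =
    (message, consumer) -> {
        // ... process the message (placeholder) ...
        consumer.ack(); // acking frees up a flow control slot
    };
Subscriber subscriber =
    Subscriber.newBuilder(subscription, receiver)
        .setFlowControlSettings(flowControlSettings) // from the snippet above
        .build();
subscriber.startAsync().awaitRunning();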

Category projections using Kafka and Cassandra for event-sourcing

I'm using Cassandra and Kafka for event-sourcing, and it works quite well. But I've just recently discovered a potentially major flaw in the design/set-up. A brief intro to how it is done:
The aggregate command handler is basically a Kafka consumer, which consumes messages of interest on a topic:
1.1 When it receives a command, it loads all events for the aggregate and replays the aggregate event handler for each event to get the aggregate up to its current state.
1.2 Based on the command and business logic, it then applies one or more events to the event store. This involves inserting the new event(s) into the event store table in Cassandra. The events are stamped with a version number for the aggregate - starting at version 0 for a new aggregate - making projections possible. In addition, it sends the event to another topic (for projection purposes).
1.3 A Kafka consumer listens on the topic to which these events are published. This consumer acts as a projector. When it receives an event of interest, it loads the current read model for the aggregate. It checks that the version of the event it has received is the expected version, and then updates the read model.
This seems to work very well. The problem is when I want to have what EventStore calls category projections. Let's take the Order aggregate as an example. I can easily project one or more read models per Order. But if I want to, for example, have a projection which contains a customer's last 30 orders, then I would need a category projection.
I'm just scratching my head over how to accomplish this. I'm curious to know if any others are using Cassandra and Kafka for event sourcing. I've read in a couple of places that some people discourage it. Maybe this is the reason.
I know EventStore has support for this built in. Maybe using Kafka as event store would be a better solution.
With this kind of architecture, you have to choose between:
Global event stream per type - simple
Partitioned event stream per type - scalable
Unless your system is fairly high throughput (say at least 10s or 100s of events per second for sustained periods to the stream type in question), the global stream is the simpler approach. Some systems (such as Event Store) give you the best of both worlds, by having very fine-grained streams (such as per aggregate instance) but with the ability to combine them into larger streams (per stream type/category/partition, per multiple stream types, etc.) in a performant and predictable way out of the box, while still being simple by only requiring you to keep track of a single global event position.
If you go partitioned with Kafka:
Your projection code will need to handle concurrent consumer groups accessing the same read models when processing events for different partitions that need to go into the same models. Depending on your target store for the projection, there are lots of ways to handle this (transactions, optimistic concurrency, atomic operations, etc.), but it could be a problem for some target stores.
Your projection code will need to keep track of the stream position of each partition, not just a single position. If your projection reads from multiple streams, it has to keep track of lots of positions.
Using a global stream removes both of those concerns - performance is usually good enough.
In either case, you'll likely also want to get the stream position into the long term event storage (i.e. Cassandra) - you could do this by having a dedicated process reading from the event stream (partitioned or global) and just updating the events in Cassandra with the global or partition position of each event. (I have a similar thing with MongoDB - I have a process reading the 'oplog' and copying oplog timestamps into events, since oplog timestamps are totally ordered).
Another option is to drop Cassandra from the initial command processing and use Kafka Streams instead:
Partitioned command stream is processed by joining with a partitioned KTable of aggregates
Command result and events are computed
Atomically, KTable is updated with changed aggregate, events are written to event stream and command response is written to command response stream.
You would then have a downstream event processor that copies the events into Cassandra for easier querying etc. (and which can add the Kafka stream position to each event as it does so, to give the category ordering). This can help with catch-up subscriptions etc. if you don't want to use Kafka for long-term event storage. (To catch up, you'd just read as far as you can from Cassandra and then switch to streaming from Kafka from the position of the last Cassandra event.) On the other hand, Kafka itself can store events forever, so this isn't always necessary.
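A rough sketch of that Kafka Streams topology (topic names are examples; the Command/Order/CommandResult types and the handle() function are hypothetical, and serde configuration is omitted):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

KStream<String, Command> commands = builder.stream("order-commands"); // keyed by aggregate id
KTable<String, Order> aggregates = builder.table("order-aggregates"); // compacted topic

// Join each command with the current aggregate state and compute the result.
KStream<String, CommandResult> results =
    commands.leftJoin(aggregates, (command, order) -> handle(command, order));

// Fan out: updated aggregate, emitted events, and the command response.
results.mapValues(CommandResult::newState).to("order-aggregates");
results.flatMapValues(CommandResult::events).to("order-events");
results.mapValues(CommandResult::response).to("order-responses");
// With processing.guarantee=exactly_once, these writes happen atomically.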
I hope this helps a bit with understanding the tradeoffs and problems you might encounter.

How do you ensure that events are applied in order to read model?

This is easy for projections that subscribe to all events from the stream - you just keep the version of the last event applied to your read model. But what do you do when the projection is a composite of multiple streams? Do you keep the version of each stream that is participating in the projection? But then what about the gaps, if you are not subscribing to all events? At most you can assert that the version is greater than the last one. How do others deal with this? Do you respond to every event and bump up the version(s)?
For the EventStore, I would suggest using the $all stream as the default stream for any read-model subscription.
I have used the category stream that essentially produces the snapshot of a given entity type but I stopped doing so since read-models serve a different purpose.
It might not be desirable to use the $all stream, as it might also get events which aren't domain events. Integration events could be an example. In this case, adding some attributes either to event contracts or to the metadata might help to create an internal (JS) projection that will create a special all stream for domain events, or any event category in that regard, which you can subscribe to. You can also use a negative condition, for example, filter out all system events and those that have an original stream name starting with Integration.
As well as processing messages in the correct order, you also have the problem of resuming a projection after it is restarted - how do you ensure you start from the right place when you restart?
The simplest option is to use an event store or message broker that both guarantees order and provides some kind of global stream position field (such as a global event number or an ordered timestamp with a disambiguating component such as MongoDB's Timestamp type). Event stores where you pull the events directly from the store (such as eventstore.org or homegrown ones built on a database) tend to guarantee this. Also, some message brokers like Apache Kafka guarantee ordering (again, this is pull-based). You want at-least-once ordered delivery, ideally.
This approach limits write scalability (reads scale fine, using read replicas) - you can shard your streams across multiple event store instances in various ways, then you have to track the position on a per-shard basis, which adds some complexity.
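A minimal sketch of such position tracking (the checkpointStore, eventStore and readModel APIs here are hypothetical):

// Resume from the last processed global position on startup.
long lastPosition = checkpointStore.load("order-projection");
for (RecordedEvent event : eventStore.readAllFrom(lastPosition)) {
    readModel.apply(event);                              // update the read model
    checkpointStore.save("order-projection", event.globalPosition());
}
// Ideally the read-model update and the checkpoint save commit atomically,
// or apply() is idempotent, so a crash between the two is harmless.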
If you don't have these ordering, delivery and position guarantees, your life is much harder, and it may be hard to make the system completely reliable. You can:
Hold onto messages for a while after receiving them, before processing them, to allow other ones to arrive
Have code to detect missing or out-of-order messages. As you mention, this only works if you receive all events with a global sequence number or if you track all stream version numbers, and even then it isn't reliable in all cases.
For each individual stream, you keep things in order by fetching them from a data store that knows the correct order. A way of thinking about this is that you query the data store and get a Document Message back.
It may help to review Greg Young's Polyglot Data talk.
As for synchronization of events in multiple streams; a thing that you need to recognize is that events in different streams are inherently concurrent.
You can get some loose coordination between different streams if you have happens-before data encoded into your messages. "Event B happened in response to Event A, therefore A happened-before B". That gets you a partial ordering.
If you really do need a total ordering of everything everywhere, then you'll need to be looking into patterns like Lamport Clocks.
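For reference, a minimal sketch of a Lamport clock (in practice timestamps are made totally ordered by breaking ties with a node id):

final class LamportClock {
    private long time = 0;

    // Tick before a local event or before sending a message;
    // attach the returned timestamp to the outgoing message.
    synchronized long tick() {
        return ++time;
    }

    // Merge the sender's timestamp when receiving a message.
    synchronized long onReceive(long remoteTime) {
        time = Math.max(time, remoteTime) + 1;
        return time;
    }
}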

Synchronize Data From Multiple Data Sources

Our team is trying to build a predictive maintenance system whose task is to look at a set of events and predict whether these events depict a set of known anomalies or not.
We are at the design phase and the current system design is as follows:
The events may occur on multiple sources of an IoT system (such as cloud platform, edge devices or any intermediate platforms)
The events are pushed by the data sources into a message queueing system (currently we have chosen Apache Kafka).
Each data source has its own queue (Kafka Topic).
From the queues, the data is consumed by multiple inference engines (which are actually neural networks).
Depending upon the feature set, an inference engine will subscribe to multiple Kafka topics and stream data from those topics to continuously output the inference.
The overall architecture follows the single-responsibility principle meaning that every component will be separate from each other and run inside a separate Docker container.
Problem:
In order to classify a set of events as an anomaly, the events have to occur in the same time window. E.g. say there are three data sources pushing their respective events into Kafka topics, but for some reason the data is not synchronized.
So one of the inference engines pulls the latest entries from each of the Kafka topics, but the corresponding events in the pulled data do not belong to the same time window (say 1 hour). That will result in invalid predictions due to out-of-sync data.
Question
We need to figure out how we can make sure that the data from all three sources is pushed in order, so that when an inference engine requests entries (say the last 100 entries) from multiple Kafka topics, the corresponding entries in each topic belong to the same time window.
I would suggest KSQL, which is a streaming SQL engine that enables real-time data processing against Apache Kafka. It also provides nice functionality for Windowed Aggregation etc.
There are 3 ways to define windows in KSQL: hopping windows, tumbling windows, and session windows. Hopping and tumbling windows are time windows, because they're defined by fixed durations that you specify. Session windows are dynamically sized based on incoming data and defined by periods of activity separated by gaps of inactivity.
In your context, you can use KSQL to query and aggregate the topics of interest using Windowed Joins. For example,
SELECT t1.id, ...
FROM topic_1 t1
INNER JOIN topic_2 t2
WITHIN 1 HOURS
ON t1.id = t2.id;
Some suggestions -
Handle delay at the producer end -
Ensure all three producers always send data in sync to Kafka topics by using batch.size and linger.ms.
E.g., if linger.ms is set to 1000, a batch waits at most 1 second before being sent to Kafka.
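A sketch of those producer settings (broker address and values are examples):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024); // up to 64 KiB per batch
props.put(ProducerConfig.LINGER_MS_CONFIG, 1000);       // wait up to 1 s for a batch to fill
KafkaProducer<String, String> producer = new KafkaProducer<>(props);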
Handle delay at the consumer end -
Any streaming engine on the consumer side (be it Kafka Streams, Spark Streaming, or Flink) provides windowing functionality to join/aggregate stream data based on keys while accounting for delayed data.
See the Flink windows documentation for reference on how to choose the right window type.
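For example, a windowed join in Kafka Streams analogous to the KSQL query above might look like this (a sketch; topic names and value types are examples, serdes omitted; newer Kafka Streams versions prefer JoinWindows.ofTimeDifferenceWithNoGrace):

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source1 = builder.stream("events-source-1");
KStream<String, String> source2 = builder.stream("events-source-2");

// Pair up records with the same key whose timestamps lie within one hour.
KStream<String, String> joined =
    source1.join(source2, (v1, v2) -> v1 + "," + v2, JoinWindows.of(Duration.ofHours(1)));
joined.to("joined-events");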
To handle this scenario, the data sources must provide some mechanism for the consumer to realize that all relevant data has arrived. The simplest solution is to publish a batch from each data source with a batch id (a GUID of some form). Consumers can then wait until the next batch id shows up, marking the end of the previous batch. This approach assumes sources will not skip a batch, otherwise they will get permanently misaligned. There is no algorithm to detect this, but you might have some fields in the data that show discontinuity and allow you to realign the data. A sketch of this convention follows below.
A weaker version of this approach is to either just wait x seconds and assume all sources succeed in this much time, or look at some form of timestamps (logical or wall clock) to detect that a source has moved on to the next time window, implicitly showing completion of the last window.
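A hypothetical sketch of the batch-id convention (the Record type and onBatchComplete() are placeholders): the consumer buffers records per source and treats a change of batch id as the end of the previous batch.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final Map<String, List<Record>> pending = new HashMap<>();  // per-source buffer
final Map<String, String> currentBatch = new HashMap<>();   // per-source batch id

void onRecord(String source, Record record) {
    String batchId = record.batchId();
    if (!batchId.equals(currentBatch.get(source))) {
        // A new batch id marks the previous batch for this source as complete.
        List<Record> complete = pending.remove(source);
        if (complete != null) {
            onBatchComplete(source, complete); // hand off for cross-source alignment
        }
        currentBatch.put(source, batchId);
    }
    pending.computeIfAbsent(source, s -> new ArrayList<>()).add(record);
}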
The following recommendations should maximize success of event synchronization for the anomaly detection problem using timeseries data.
Use a network time synchronizer on all producer/consumer nodes
Use a heartbeat message from producers every x units of time with a fixed start time. For example: the messages are sent every two minutes at the start of the minute.
Build predictors for producer message delay. Use the heartbeat messages to compute this.
With these primitives, we should be able to align the timeseries events, accounting for time drifts due to network delays.
On the inference engine side, expand your windows at a per-producer level to sync up events across producers.

Kafka Streams: KTable materialization

How to identify when the KTable materialization to a topic has completed?
For example, assume the KTable has a few million rows. Pseudo code below:
KTable<String, String> kt = kstream.groupByKey(..).reduce(..); // assume this produces a few million rows
At some point in time, I wanted to schedule a thread to invoke the following, which writes to the topic:
kt.toStream().to("output_topic_name");
I wanted to ensure all the data is written as part of the above invocation. Also, once the above "to" method is invoked, can it be invoked again in the next schedule, OR will the first invocation always stay active?
Follow-up Question:
Constraints
1) OK, I see that the KStream and the KTable are unbounded/infinite once the KafkaStreams application is kicked off. However, wouldn't KTable materialization (to a compacted topic) send multiple entries for the same key within a specified period?
So, unless the compaction process cleans these up and retains only the latest entry, a downstream application querying the topic will consume all available entries for the same key, causing duplicates. Even if the compaction process does some level of cleanup, at a given point in time some keys may still have more than one entry, because the compaction process is always catching up.
I assume the KTable will only have one record for a given key in RocksDB. If we have a way to schedule the materialization, that will help to avoid the duplicates. It would also reduce the amount of data being persisted in the topic (and thus the storage), the network traffic, and the additional overhead on the compaction process to clean it up.
2) Perhaps a ReadOnlyKeyValueStore would allow a controlled retrieval from the store, but it still lacks a way to schedule the retrieval of keys and values and write them to a topic, which requires additional coding.
Can the API be improved to allow a controlled materialization?
A KTable materialization never finishes and you cannot "invoke" a to() either.
When you use the Streams API, you "plug together" a DAG of operators. The actual method calls don't trigger any computation but modify the DAG of operators.
Data is processed only after you start the computation via KafkaStreams#start(). Note that all operators that you specified will run continuously and concurrently after the computation is started.
There is no "end of a computation", because the input is expected to be unbounded/infinite, as upstream applications can write new data into the input topics at any time. Thus, your program never terminates by itself. If required, you can stop the computation via KafkaStreams#close() though.
During execution, you cannot change the DAG. If you want to change it, you need to stop the computation and create a new KafkaStreams instance that takes the modified DAG as input.
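A sketch of that lifecycle (topic names and properties are examples; default serdes assumed):

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;

Properties props = new Properties(); // application.id, bootstrap.servers, default serdes, ...
StreamsBuilder builder = new StreamsBuilder();

KTable<String, String> kt = builder.<String, String>stream("input_topic_name")
    .groupByKey()
    .reduce((oldValue, newValue) -> newValue); // builds the DAG; nothing runs yet
kt.toStream().to("output_topic_name");         // still only modifying the DAG

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start(); // the unbounded computation runs from here on
// ... later, if required:
streams.close(); // stops it explicitly; it never terminates by itself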
Follow up:
Yes. You have to think of a KTable as a "versioned table" that evolves over time as entries are updated. Thus, all updates are written to the changelog topic and sent downstream as change-records (note that KTables do some caching, too, to "de-duplicate" consecutive updates to the same key: cf. https://docs.confluent.io/current/streams/developer-guide/memory-mgmt.html).
will consume all available entries for the same key, causing duplicates.
I would not consider those as "duplicates" but as updates. And yes, the application needs to be able to handle those updates correctly.
if we have a way to schedule the materialization, that will help to avoid the duplicates.
Materialization is a continuous process and the KTable is updated whenever new input records are available in the input topic and processed. Thus, at any point in time there might be an update for a specific key. Thus, even if you have full control over when to send updates to the changelog topic and/or downstream, there might be a new update later on. That is the nature of stream processing.
It would also reduce the amount of data being persisted in the topic (and thus the storage), the network traffic, and the additional overhead on the compaction process to clean it up.
As mentioned above, caching is used to save resources.
Can the API be improved to allow a controlled materialization?
If the provided KTable semantics don't meet your requirement, you can always write a custom operator as a Processor or Transformer, attach a key-value store to it, and implement whatever you need.
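For example, a sketch of such a custom operator using the older transform() API (store/topic names are examples; builder and a KStream<String, String> stream are assumed to exist): it absorbs every update into a key-value store and only emits the latest value per key on a wall-clock schedule.

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

builder.addStateStore(Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore("latest-values"),
    Serdes.String(), Serdes.String()));

stream.transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        store = (KeyValueStore<String, String>) context.getStateStore("latest-values");
        // Emit the current value for every key once per interval.
        context.schedule(Duration.ofMinutes(5), PunctuationType.WALL_CLOCK_TIME, ts -> {
            try (KeyValueIterator<String, String> it = store.all()) {
                while (it.hasNext()) {
                    KeyValue<String, String> entry = it.next();
                    context.forward(entry.key, entry.value);
                }
            }
        });
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        store.put(key, value); // absorb the update; emit nothing here
        return null;
    }

    @Override
    public void close() {}
}, "latest-values").to("output_topic_name");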