How often put() is triggered in Kafka Connect sink tasks? - apache-kafka

Can I control the intervals at which the put() method of my Kafka Connect Sink tasks is triggered? What is the expected behavior of the Kafka Connect framework in this respect? Ideally, I would like to specify, for example, "don't call me unless you have X new records/Y new bytes, or Z milliseconds passed since the last invocation". This could potentially make the batching logic within the sink task simpler (quoting the documentation, "in many cases internal buffering will be useful so an entire batch of records can be sent at once, reducing the overhead of inserting events into the downstream data store).

Today, put from a SinkTask is only called when deliverMessages is invoked in a WorkerSinkTask. The good news is that the only time deliverMessages happens is within poll so you should have some control over how often you poll for new records by overriding consumer properties.
If you want to do internal buffering, you could have a look at how the HDFSConnector is handling this in its implementation of SinkTask. However, right now, Connect will immediately put any records that get returned by the poll.
All of that said, if you are really looking to batch messages before they hit the downstream system, you might consider looking into offset.flush.interval.ms and offset.flush.timeout.ms which control how often flush() is invoked.

Related

Event sourcing - why a dedicated event store?

I am trying to implement event sourcing/CQRS/DDD for the first time, mostly for learning purposes, where there is the idea of an event store and a message queue such as Apache Kafka, and you have events flowing from event store => Kafka Connect JDBC/Debezium CDC => Kafka.
I am wondering why there needs to be a separate event store when it sounds like its purpose can be fulfilled by Kafka itself with its main features and log compaction or configuring log retention for permanent storage. Should I store my events in a dedicated store like RDBMS to feed into Kafka or should I feed them straight into Kafka?
Much of the literature on event-sourcing and cqrs comes from the [domain driven design] community; in its earliest form, CQRS was called DDDD... Distributed domain driven design.
One of the common patterns in domain driven design is to have a domain model ensuring the integrity of the data in your durable storage, which is to say, ensuring that there are no internal contradictions...
I am wondering why there needs to be a separate event store when it sounds like its purpose can be fulfilled by Kafka itself with its main features and log compaction or configuring log retention for permanent storage.
So if we want an event stream with no internal contradictions, how do we achieve that? One way is to ensure that only a single process has permission to modify the stream. Unfortunately, that leaves you with a single point of failure -- the process dies, and everything comes to an end.
On the other hand, if you have multiple processes updating the same stream, then you have risk of concurrent writes, and data races, and contradictions being introduced because one writer couldn't yet see what the other one did.
With an RDBMS or an Event Store, we can solve this problem by using transactions, or compare and swap semantics; and attempt to extend the stream with new events is rejected if there has been a concurrent modification.
Furthermore, because of its DDD heritage, it is common for the durable store to be divided into many very fine grained partitions (aka "aggregates"). One single shopping cart might reasonably have four streams dedicated to it.
If Kafka lacks those capabilities, then it is going to be a lousy replacement for an event store. KAFKA-2260 has been open for more than four years now, so we seem to be lacking the first. From what I've been able to discern from the Kakfa literature, it isn't happy about fine grained streams either (although its been a while since I checked, perhaps things have changed).
See also: Jesper Hammarbäck writing about this 18 months ago, and reaching similar conclusions to those expressed here.
Kafka can be used as a DDD event store, but there are some complications if you do so due to the features it is missing.
Two key features that people use with event sourcing of aggregates are:
Load an aggregate, by reading the events for just that aggregate
When concurrently writing new events for an aggregate, ensure only one writer succeeds, to avoid corrupting the aggregate and breaking its invariants.
Kafka can't do either of these currently, since 1 fails since you generally need to have one stream per aggregate type (it doesn't scale to one stream per aggregate, and this wouldn't necessarily be desirable anyway), so there's no way to load just the events for one aggregate, and 2 fails since https://issues.apache.org/jira/browse/KAFKA-2260 has not been implemented.
So you have to write the system in such as way that capabilities 1 and 2 aren't needed. This can be done as follows:
Rather than invoking command handlers directly, write them to
streams. Have a command stream per aggregate type, sharded by
aggregate id (these don't need permanent retention). This ensures that you only ever process a single
command for a particular aggregate at a time.
Write snapshotting code for all your aggregate types
When processing a command message, do the following:
Load the aggregate snapshot
Validate the command against it
Write the new events (or return failure)
Apply the events to the aggregate
Save a new aggregate snapshot, including the current stream offset for the event stream
Return success to the client (via a reply message perhaps)
The only other problem is handling failures (such as the snapshotting failing). This can be handled during startup of a particular command processing partition - it simply needs to replay any events since the last snapshot succeeded, and update the corresponding snapshots before resuming command processing.
Kafka Streams appears to have the features to make this very simple - you have a KStream of commands that you transform into a KTable (containing snapshots, keyed by aggregate id) and a KStream of events (and possibly another stream containing responses). Kafka allows all this to work transactionally, so there is no risk of failing to update the snapshot. It will also handle migrating partitions to new servers, etc. (automatically loading the snapshot KTable into a local RocksDB when this happens).
there is the idea of an event store and a message queue such as Apache Kafka, and you have events flowing from event store => Kafka Connect JDBC/Debezium CDC => Kafka
In the essence of DDD-flavoured event sourcing, there's no place for message queues as such. One of the DDD tactical patterns is the aggregate pattern, which serves as a transactional boundary. DDD doesn't care how the aggregate state is persisted, and usually, people use state-based persistence with relational or document databases. When applying events-based persistence, we need to store new events as one transaction to the event store in a way that we can retrieve those events later in order to reconstruct the aggregate state. Thus, to support DDD-style event sourcing, the store needs to be able to index events by the aggregate id and we usually refer to the concept of the event stream, where such a stream is uniquely identified by the aggregate identifier, and where all events are stored in order, so the stream represents a single aggregate.
Because we rarely can live with a database that only allows us to retrieve a single entity by its id, we need to have some place where we can project those events into, so we can have a queryable store. That is what your diagram shows on the right side, as materialised views. More often, it is called the read side and models there are called read-models. That kind of store doesn't have to keep snapshots of aggregates. Quite the opposite, read-models serve the purpose to represent the system state in a way that can be directly consumed by the UI/API and often it doesn't match with the domain model as such.
As mentioned in one of the answers here, the typical command handler flow is:
Load one aggregate state by id, by reading all events for that aggregate. It already requires for the event store to support that kind of load, which Kafka cannot do.
Call the domain model (aggregate root method) to perform some action.
Store new events to the aggregate stream, all or none.
If you now start to write events to the store and publish them somewhere else, you get a two-phase commit issue, which is hard to solve. So, we usually prefer using products like EventStore, which has the ability to create a catch-up subscription for all written events. Kafka supports that too. It is also beneficial to have the ability to create new event indexes in the store, linking to existing events, especially if you have several systems using one store. In EventStore it can be done using internal projections, you can also do it with Kafka streams.
I would argue that indeed you don't need any messaging system between write and read sides. The write side should allow you to subscribe to the event feed, starting from any position in the event log, so you can build your read-models.
However, Kafka only works in systems that don't use the aggregate pattern, because it is essential to be able to use events, not a snapshot, as the source of truth, although it is of course discussable. I would look at the possibility to change the way how events are changing the entity state (fixing a bug, for example) and when you use events to reconstruct the entity state, you will be just fine, snapshots will stay the same and you'll need to apply correction events to fix all the snapshots.
I personally also prefer not to be tightly coupled to any infrastructure in my domain model. In fact, my domain models have zero dependencies on the infrastructure. By bringing the snapshotting logic to Kafka streams builder, I would be immediately coupled and from my point of view it is not the best solution.
Theoretically you can use Kafka for Event Store but as many people mentioned above that you will have several restrictions, biggest of those, only able to read event with the offset in the Kafka but no other criteria.
For this reason they are Frameworks there dealing with the Event Sourcing and CQRS part of the problem.
Kafka is only part of the toolchain which provides you the capability of replaying events and back pressure mechanism that are protecting you from overload.
If you want to see how all fits together, I have a blog about it

Avoid Data Loss While Processing Messages from Kafka

Looking out for best approach for designing my Kafka Consumer. Basically I would like to see what is the best way to avoid data loss in case there are any
exception/errors during processing the messages.
My use case is as below.
a) The reason why I am using a SERVICE to process the message is - in future I am planning to write an ERROR PROCESSOR application which would run at the end of the day, which will try to process the failed messages (not all messages, but messages which fails because of any dependencies like parent missing) again.
b) I want to make sure there is zero message loss and so I will save the message to a file in case there are any issues while saving the message to DB.
c) In production environment there can be multiple instances of consumer and services running and so there is high chance that multiple applications try to write to the
same file.
Q-1) Is writing to file the only option to avoid data loss ?
Q-2) If it is the only option, how to make sure multiple applications write to the same file and read at the same time ? Please consider in future once the error processor
is build, it might be reading the messages from the same file while another application is trying to write to the file.
ERROR PROCESSOR - Our source is following a event driven mechanics and there is high chance that some times the dependent event (for example, the parent entity for something) might get delayed by a couple of days. So in that case, I want my ERROR PROCESSOR to process the same messages multiple times.
I've run into something similar before. So, diving straight into your questions:
Not necessarily, you could perhaps send those messages back to Kafka in a new topic (let's say - error-topic). So, when your error processor is ready, it could just listen in to the this error-topic and consume those messages as they come in.
I think this question has been addressed in response to the first one. So, instead of using a file to write to and read from and open multiple file handles to do this concurrently, Kafka might be a better choice as it is designed for such problems.
Note: The following point is just some food for thought based on my limited understanding of your problem domain. So, you may just choose to ignore this safely.
One more point worth considering on your design for the service component - You might as well consider merging points 4 and 5 by sending all the error messages back to Kafka. That will enable you to process all error messages in a consistent way as opposed to putting some messages in the error DB and some in Kafka.
EDIT: Based on the additional information on the ERROR PROCESSOR requirement, here's a diagrammatic representation of the solution design.
I've deliberately kept the output of the ERROR PROCESSOR abstract for now just to keep it generic.
I hope this helps!
If you don't commit the consumed message before writing to the database, then nothing would be lost while Kafka retains the message. The tradeoff of that would be that if the consumer did commit to the database, but a Kafka offset commit fails or times out, you'd end up consuming records again and potentially have duplicates being processed in your service.
Even if you did write to a file, you wouldn't be guaranteed ordering unless you opened a file per partition, and ensured all consumers only ran on a single machine (because you're preserving state there, which isn't fault-tolerant). Deduplication would still need handled as well.
Also, rather than write your own consumer to a database, you could look into Kafka Connect framework. For validating a message, you can similarly deploy a Kafka Streams application to filter out bad messages from an input topic out into a topic to send to the DB

Kafka Streams - Processor context commit

should we ever invoke processorContext.commit() in Processor implementation by ourselves? I mean invoking commit method inside scheduled Punctuator implementation or inside process method.
in which use cases should we do that, and do we need that at all? the question relates to both Kafka DSL with transform() and Processor API.
seems Kafka Streams handles it by itself, also invoking processorContext.commit() does not guarantee that it will be done immediately.
It is ok to call commit() -- either from the Processor or from a Punctuation -- that's why this API is offered.
While Kafka Streams commits on a regular (configurable) interval, you can request intermediate commits when you use it. One example use case would be, that you usually do cheap computation, but sometimes you do something expensive and want to commit asap after this operation instead of waiting for the next commit interval (to reduce the likelihood of a failure after the expensive operation and the next commit interval). Another use case would be, if you set the commit interval to MAX_VALUE what effectively "disables" regular commits and to decide when to commit base on your business logic.
I guess, that calling commit() is not necessary for most use cases thought.
For the use case I am batching certain number of record in processor process method and writing the batched records to File from process function if the batch size reaches like certain number(lets say 10).
Lets say we write one batch of records to file and system crashes at the point before commit happens (Since we cann't call explicit commits). Next time the stream starts and processor processes the records from the last committed offset. This means we could be writing some duplicate data to files. Is there anyway to avoid writing duplicate data??

Kafka Stream: KTable materialization

How to identify when the KTable materialization to a topic has completed?
For e.g. assume KTable has few million rows. Pseudo code below:
KTable<String, String> kt = kgroupedStream.groupByKey(..).reduce(..); //Assume this produces few million rows
At somepoint in time, I wanted to schedule a thread to invoke the following, that writes to the topic:
kt.toStream().to("output_topic_name");
I wanted to ensure all the data is written as part of the above invoke. Also, once the above "to" method is invoked, can it be invoked in the next schedule OR will the first invoke always stay active?
Follow-up Question:
Constraints
1) Ok, I see that the kstream and the ktable are unbounded/infinite once the kafkastream is kicked off. However, wouldn't ktable materialization (to a compacted topic) send multiple entries for the same key within a specified period.
So, unless the compaction process attempts to clean these and retain only the latest one, the downstream application will consume all available entries for the same key querying from the topic, causing duplicates. Even if the compaction process does some level of cleanup, it is always not possible that at a given point in time, there are some keys that have more than one entries as the compaction process is catching up.
I assume KTable will only have one record for a given key in the RocksDB. If we have a way to schedule the materialization, that will help to avoid the duplicates. Also, reduce the amount of data being persisted in topic (increasing the storage), increase in the network traffic, additional overhead to the compaction process to clean it up.
2) Perhaps a ReadOnlyKeyValueStore would allow a controlled retrieval from the store, but it still lacks the way to schedule the retrieval of key, value and write to a topic, which requires additional coding.
Can the API be improved to allow a controlled materialization?
A KTable materialization never finishes and you cannot "invoke" a to() either.
When you use the Streams API, you "plug together" a DAG of operators. The actual method calls, don't trigger any computation but modify the DAG of operators.
Only after you start the computation via KafkaStreams#start() data is processed. Note, that all operators that you specified will run continuously and concurrently after the computation gets started.
There is no "end of a computation" because the input is expected to be unbounded/infinite as upstream application can write new data into the input topics at any time. Thus, your program never terminates by itself. If required, you can stop the computation via KafkaStreams#close() though.
During execution, you cannot change the DAG. If you want to change it, you need to stop the computation and create a new KafkaStreams instance that takes the modified DAG as input
Follow up:
Yes. You have to think of a KTable as a "versioned table" that evolved over time when entries are updated. Thus, all updates are written to the changelog topic and sent downstream as change-records (note, that KTables do some caching, too, to "de-duplicate" consecutive updates to the same key: cf. https://docs.confluent.io/current/streams/developer-guide/memory-mgmt.html).
will consume all available entries for the same key querying from the topic, causing duplicates.
I would not consider those as "duplicates" but as updates. And yes, the application needs to be able to handle those updates correctly.
if we have a way to schedule the materialization, that will help to avoid the duplicates.
Materialization is a continuous process and the KTable is updated whenever new input records are available in the input topic and processed. Thus, at any point in time there might be an update for a specific key. Thus, even if you have full control when to send updates to the changelog topic and/or downstream, there might be a new update later on. That is the nature of stream processing.
Also, reduce the amount of data being persisted in topic (increasing the storage), increase in the network traffic, additional overhead to the compaction process to clean it up.
As mentioned above, caching is used to save resources.
Can the API be improved to allow a controlled materialization?
If the provided KTable semantics don't meet your requirement, you can always write a custom operator as a Processor or Transformer, attach a key-value store to it, and implement whatever you need.

Writing directly to a kafka state store

We've started experimenting with Kafka to see if it can be used to aggregate our application data. I think our use case is a match for Kafka streams, but we aren't sure if we are using the tool correctly. The proof of concept we've built seems to be working as designed, I'm not sure that we are using the APIs appropriately.
Our proof of concept is to use kafka streams to keep a running tally of information about a program in an output topic, e.g.
{
"numberActive": 0,
"numberInactive": 0,
"lastLogin": "01-01-1970T00:00:00Z"
}
Computing the tally is easy, it is essentially executing a compare and swap (CAS) operation based on the input topic & output field.
The local state contains the most recent program for a given key. We join an input stream against the state store and run the CAS operation using a TransformSupplier, which explictly writes the data to the state store using
context.put(...)
context.commit();
Is this an appropriate use of the local state store? Is there another another approach to keeping a stateful running tally in a topic?
Your design sounds right to me (I presume you are using PAPI not the Streams DSL), that you are reading in one stream, calling transform() on the stream in which an state store is associated with the operator. Since your update logic seems to be only key-dependent and hence can be embarrassingly parallelizable via Streams library based on key partitioning.
One thing to note that, it seems you are calling "context.commit()" after every single put call, which is not a recommended pattern. This is because commit() operation is a pretty heavy call that will involves flushing the state store, sending commit offset request to the Kafka broker etc, calling it on every single call would result in very low throughput. It is recommended to only call commit() only after a bunch of records are processed, or you can just rely on the Streams config "commit.interval.ms" to rely on Streams library to only call commit() internally after every time interval. Note that this will not affect your processing semantics upon graceful shutting down, since upon shutdown Streams will always enforce a commit() call.