Kafka Streams - Processor context commit

Should we ever invoke processorContext.commit() ourselves in a Processor implementation? I mean invoking the commit method inside a scheduled Punctuator implementation or inside the process method.
In which use cases should we do that, and do we need it at all? The question relates to both the Kafka DSL with transform() and the Processor API.
It seems Kafka Streams handles this by itself; also, invoking processorContext.commit() does not guarantee that the commit happens immediately.

It is ok to call commit() -- either from the Processor or from a Punctuator -- that's why this API is offered.
While Kafka Streams commits on a regular (configurable) interval, you can request intermediate commits when you need them. One example use case would be that you usually do cheap computation, but sometimes you do something expensive and want to commit as soon as possible after this operation instead of waiting for the next commit interval (to reduce the likelihood of a failure between the expensive operation and the next commit). Another use case would be setting the commit interval to MAX_VALUE, which effectively "disables" regular commits, and deciding when to commit based on your business logic.
I guess that calling commit() is not necessary for most use cases, though.
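For example, a minimal sketch of the first use case, assuming the (older) Processor API with AbstractProcessor and hypothetical helper methods; note that commit() is only a request, and Streams performs the actual commit at the next opportunity rather than synchronously:

import org.apache.kafka.streams.processor.AbstractProcessor;

public class ExpensiveStepProcessor extends AbstractProcessor<String, String> {

    @Override
    public void process(String key, String value) {
        if (isExpensive(value)) {      // hypothetical predicate
            doExpensiveWork(value);    // hypothetical costly side effect
            context().commit();        // ask Streams to commit early
        } else {
            doCheapWork(value);        // hypothetical cheap path
        }
    }

    private boolean isExpensive(String value) { return value.length() > 1_000; }
    private void doExpensiveWork(String value) { /* ... */ }
    private void doCheapWork(String value) { /* ... */ }
}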

For my use case, I am batching a certain number of records in the processor's process method and writing the batched records to a file once the batch reaches a certain size (let's say 10).
Let's say we write one batch of records to the file and the system crashes before the commit happens (since we can't force an immediate commit). The next time the stream starts, the processor processes records from the last committed offset, which means we could write some duplicate data to the files. Is there any way to avoid writing duplicate data?
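Following the answer above, one way to narrow that window (though not fully close it) is to request a commit immediately after each batch is flushed. A sketch with a hypothetical writeBatchToFile helper; truly avoiding duplicates would additionally require the file write itself to be idempotent, e.g. deriving the file name from the offset range it covers:

import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.processor.AbstractProcessor;

public class BatchingProcessor extends AbstractProcessor<String, String> {
    private static final int BATCH_SIZE = 10;
    private final List<String> batch = new ArrayList<>();

    @Override
    public void process(String key, String value) {
        batch.add(value);
        if (batch.size() >= BATCH_SIZE) {
            writeBatchToFile(batch);  // hypothetical; should be idempotent to be fully safe
            batch.clear();
            context().commit();       // request a commit as soon as the batch is durable
        }
    }

    private void writeBatchToFile(List<String> records) { /* ... */ }
}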

Related

Kafka Stream: KTable materialization

How to identify when the KTable materialization to a topic has completed?
For example, assume the KTable has a few million rows. Pseudo code below:
KTable<String, String> kt = kstream.groupByKey(..).reduce(..); // assume this produces a few million rows
At some point in time, I wanted to schedule a thread to invoke the following, which writes to the topic:
kt.toStream().to("output_topic_name");
I wanted to ensure all the data is written as part of the above invocation. Also, once the "to" method has been invoked, can it be invoked again in the next schedule, or will the first invocation always stay active?
Follow-up Question:
Constraints
1) Ok, I see that the KStream and the KTable are unbounded/infinite once the KafkaStreams instance is kicked off. However, wouldn't KTable materialization (to a compacted topic) send multiple entries for the same key within a given period?
So, unless the compaction process cleans these up and retains only the latest one, a downstream application querying the topic will consume all available entries for the same key, causing duplicates. Even if the compaction process does some level of cleanup, it is always possible that at a given point in time some keys have more than one entry while the compaction process is catching up.
I assume the KTable will only have one record for a given key in RocksDB. If we had a way to schedule the materialization, that would help to avoid the duplicates. It would also reduce the amount of data being persisted in the topic (and thus the storage), the network traffic, and the additional overhead for the compaction process to clean it up.
2) Perhaps a ReadOnlyKeyValueStore would allow a controlled retrieval from the store, but it still lacks a way to schedule the retrieval of the keys and values and the write to a topic, which requires additional coding.
Can the API be improved to allow a controlled materialization?
A KTable materialization never finishes, and you cannot "invoke" a to() either.
When you use the Streams API, you "plug together" a DAG of operators. The actual method calls don't trigger any computation but modify the DAG of operators.
Only after you start the computation via KafkaStreams#start() is data processed. Note that all operators you specified will run continuously and concurrently after the computation is started.
There is no "end of the computation" because the input is expected to be unbounded/infinite, as upstream applications can write new data into the input topics at any time. Thus, your program never terminates by itself. If required, you can stop the computation via KafkaStreams#close() though.
During execution, you cannot change the DAG. If you want to change it, you need to stop the computation and create a new KafkaStreams instance that takes the modified DAG as input.
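To make the lifecycle concrete, here is a minimal sketch; topic names, serdes, and the reduce logic are placeholders:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;

public class LifecycleSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        // These calls only build the DAG; no data is processed yet.
        KTable<String, String> kt = builder.<String, String>stream("input_topic")
                .groupByKey()
                .reduce((oldValue, newValue) -> newValue);
        kt.toStream().to("output_topic_name");  // still just modifying the DAG

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ktable-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();  // computation begins here and runs until stopped

        // The program never terminates by itself; close() stops the computation.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}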
Follow up:
Yes. You have to think of a KTable as a "versioned table" that evolves over time as entries are updated. Thus, all updates are written to the changelog topic and sent downstream as change-records (note that KTables do some caching, too, to "de-duplicate" consecutive updates to the same key: cf. https://docs.confluent.io/current/streams/developer-guide/memory-mgmt.html).
"will consume all available entries for the same key querying from the topic, causing duplicates."
I would not consider those to be "duplicates" but updates. And yes, the application needs to be able to handle those updates correctly.
"if we have a way to schedule the materialization, that will help to avoid the duplicates."
Materialization is a continuous process, and the KTable is updated whenever new input records are available in the input topic and processed. Thus, at any point in time there might be an update for a specific key. So even if you have full control over when to send updates to the changelog topic and/or downstream, there might be a new update later on. That is the nature of stream processing.
"Also, reduce the amount of data being persisted in topic (increasing the storage), increase in the network traffic, additional overhead to the compaction process to clean it up."
As mentioned above, caching is used to save resources.
"Can the API be improved to allow a controlled materialization?"
If the provided KTable semantics don't meet your requirements, you can always write a custom operator as a Processor or Transformer, attach a key-value store to it, and implement whatever you need.
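For example, here is a hedged sketch of such a custom operator; the store name, topic, and emit policy are illustrative. It keeps only the latest value per key and leaves it entirely up to you when, if ever, anything is forwarded downstream:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class ControlledEmitTransformer
        implements Transformer<String, String, KeyValue<String, String>> {

    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        store = (KeyValueStore<String, String>) context.getStateStore("latest-store");
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        store.put(key, value);  // keep only the latest value per key
        return null;            // emit nothing here; you decide when to forward
    }

    @Override
    public void close() {}

    // Wiring: register the store with the builder and reference it by name.
    public static StreamsBuilder buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("latest-store"),
                Serdes.String(), Serdes.String()));
        builder.<String, String>stream("input_topic")
                .transform(ControlledEmitTransformer::new, "latest-store");
        return builder;
    }
}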

Writing directly to a Kafka state store

We've started experimenting with Kafka to see if it can be used to aggregate our application data. I think our use case is a match for Kafka Streams, but we aren't sure if we are using the tool correctly. The proof of concept we've built seems to be working as designed, but I'm not sure that we are using the APIs appropriately.
Our proof of concept uses Kafka Streams to keep a running tally of information about a program in an output topic, e.g.
{
    "numberActive": 0,
    "numberInactive": 0,
    "lastLogin": "01-01-1970T00:00:00Z"
}
Computing the tally is easy; it is essentially executing a compare-and-swap (CAS) operation based on the input topic and output field.
The local state contains the most recent program for a given key. We join an input stream against the state store and run the CAS operation using a TransformSupplier, which explicitly writes the data to the state store using
store.put(...);
context.commit();
Is this an appropriate use of the local state store? Is there another approach to keeping a stateful running tally in a topic?
Your design sounds right to me (I presume you are using the PAPI, not the Streams DSL): you are reading in one stream and calling transform() on the stream, with a state store associated with the operator. Since your update logic seems to be key-dependent only, it can be embarrassingly parallelized by the Streams library based on key partitioning.
One thing to note is that you seem to be calling context.commit() after every single put call, which is not a recommended pattern. This is because commit() is a pretty heavy call that involves flushing the state store, sending a commit offset request to the Kafka broker, etc.; calling it on every single record would result in very low throughput. It is recommended to call commit() only after a bunch of records have been processed, or you can just rely on the Streams config commit.interval.ms and let the Streams library call commit() internally after each time interval. Note that this will not affect your processing semantics upon graceful shutdown, since on shutdown Streams will always enforce a commit() call.
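A sketch of that recommendation; the store name, merge logic, and batch size are placeholders, and values are modeled as plain Strings for brevity:

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class CasTransformer implements Transformer<String, String, KeyValue<String, String>> {
    private static final int COMMIT_EVERY = 1000;

    private ProcessorContext context;
    private KeyValueStore<String, String> store;
    private int sinceLastCommit = 0;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, String>) context.getStateStore("tally-store");
    }

    @Override
    public KeyValue<String, String> transform(String key, String update) {
        String current = store.get(key);
        String merged = merge(current, update);  // the CAS/merge step from the question
        store.put(key, merged);
        if (++sinceLastCommit >= COMMIT_EVERY) { // commit per batch, not per record
            context.commit();
            sinceLastCommit = 0;
        }
        return KeyValue.pair(key, merged);
    }

    private String merge(String current, String update) {
        return update;  // placeholder for the real compare-and-swap logic
    }

    @Override
    public void close() {}
}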

Queues: How to process dependent jobs

I am working on an application where multiple clients will be writing to a queue (or queues), and multiple workers will be processing jobs off the queue. The problem is that in some cases, jobs are dependent on each other. By 'dependent', I mean they need to be processed in order.
This typically happens when an entity is created by the user, then deleted shortly after. Obviously I want the first job (i.e. the creation) to take place before the deletion. The problem is that creation can take a lot longer than deletion, so I can't guarantee that it will be complete before the deletion job commences.
I imagine that this type of problem is reasonably common with asynchronous processing. What strategies are there to deal with it? I know that I can assign priorities to queues to have some control over the processing order, but this is not good enough in this case. I need concrete guarantees.
This may not fit your model, but the model I have used involves not providing the deletion functionality until the creation functionality is complete.
When the Create_XXX command completes, it is responsible for raising an XXX_Created event, which also gets put on the queue. This event can then be handled to enable the deletion functionality, allowing the deletion of the newly created item.
The process of a Command completing and then raising an event, which in turn is handled and creates another Command, is a common method of ensuring Commands get processed in the desired order.
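A minimal sketch of that chain, assuming hypothetical message types and a hypothetical queue interface:

// Hypothetical message types and queue, for illustration only.
record CreateXxxCommand(String name) {}
record XxxCreatedEvent(long id) {}

interface MessageQueue {
    void publish(Object message);
}

class CreateXxxHandler {
    private final MessageQueue queue;

    CreateXxxHandler(MessageQueue queue) { this.queue = queue; }

    void handle(CreateXxxCommand cmd) {
        long id = performCreate(cmd);            // the slow creation work
        queue.publish(new XxxCreatedEvent(id));  // raised only after creation completes
    }

    private long performCreate(CreateXxxCommand cmd) { return 42L; /* ... */ }
}

class XxxCreatedHandler {
    void handle(XxxCreatedEvent evt) {
        // Only now is it safe to allow a delete for this id, e.g. by releasing
        // any deletion request that was parked waiting for this event.
        enableDeletionFor(evt.id());
    }

    private void enableDeletionFor(long id) { /* ... */ }
}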
I think a handy feature for your use case is job chaining:
https://laravel.com/docs/5.5/queues#job-chaining

How often put() is triggered in Kafka Connect sink tasks?

Can I control the intervals at which the put() method of my Kafka Connect sink tasks is triggered? What is the expected behavior of the Kafka Connect framework in this respect? Ideally, I would like to specify, for example, "don't call me unless you have X new records/Y new bytes, or Z milliseconds have passed since the last invocation". This could potentially make the batching logic within the sink task simpler (quoting the documentation: "in many cases internal buffering will be useful so an entire batch of records can be sent at once, reducing the overhead of inserting events into the downstream data store").
Today, put from a SinkTask is only called when deliverMessages is invoked in a WorkerSinkTask. The good news is that the only time deliverMessages happens is within poll, so you should have some control over how often you poll for new records by overriding consumer properties.
If you want to do internal buffering, you could have a look at how the HDFSConnector handles this in its implementation of SinkTask. However, right now, Connect will immediately put any records that get returned by the poll.
All of that said, if you are really looking to batch messages before they hit the downstream system, you might consider looking into offset.flush.interval.ms and offset.flush.timeout.ms, which control how often flush() is invoked.
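To illustrate, a hedged sketch of a sink task that only accumulates in put() and drains the buffer in flush(), whose cadence those settings govern; the downstream write is a placeholder:

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class BufferingSinkTask extends SinkTask {
    private final List<SinkRecord> buffer = new ArrayList<>();

    @Override public String version() { return "0.1"; }
    @Override public void start(Map<String, String> props) {}
    @Override public void stop() {}

    @Override
    public void put(Collection<SinkRecord> records) {
        buffer.addAll(records);  // no I/O here; just accumulate
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> offsets) {
        writeBatchDownstream(buffer);  // hypothetical bulk write
        buffer.clear();
    }

    private void writeBatchDownstream(List<SinkRecord> batch) { /* ... */ }
}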

Looking for message bus implementations that offer something between full ACID and nothing

Anyone know of a message bus implementation which offers granular control over consistency guarantees? Full ACID is too slow and no ACID is too wrong.
We're currently using Rhino ESB wrapping MSMQ for our messaging. When using durable, transactional messaging with distributed transactions, MSMQ can block the commit for considerable time while it waits on I/O completion.
Our messages fall into two general categories: business logic and denormalisation. The latter account for a significant percentage of message bus traffic.
Business logic messages require the guarantees of full ACID and MSMQ has proven quite adequate for this.
Denormalisation messages:
1) MUST be durable.
2) MUST NOT be processed until after the originating transaction completes.
3) MAY be processed multiple times.
4) MAY be processed even if the originating transaction rolls back, as long as 2) is adhered to.
(In some specific cases the durability requirements could probably be relaxed, but identifying and handling those cases as exceptions to the rule adds complexity.)
All denormalisation messages are handled in-process so there is no need for IPC.
If the process is restarted, all transactions may be assumed to have completed (committed or rolled back) and all denormalisation messages not yet processed must be recovered. It is acceptable to replay denormalisation messages which were already processed.
As far as I can tell, messaging systems which deal with transactions tend to offer a choice between full ACID or nothing, and ACID carries a performance penalty. We're seeing calls to TransactionScope#Commit() taking as long as a few hundred milliseconds in some cases depending on the number of messages sent.
Using a non-transactional message queue causes messages to be processed before their originating transaction completes, resulting in consistency problems.
Another part of our system which has similar consistency requirements but lower complexity is already using a custom implementation of something akin to a transaction log, and generalising that for this use case is certainly an option, but I'd rather not implement a low-latency, concurrent, durable, transactional messaging system myself if I don't have to :P
In case anyone's wondering, the reason for requiring durability of denormalisation messages is that detecting desyncs and fixing desyncs can be extremely difficult and extremely expensive respectively. People do notice when something's slightly wrong and a page refresh doesn't fix it, so ignoring desyncs isn't an option.
It's not exactly the answer you're looking for, but Jonathan Oliver has written extensively on how to avoid using distributed transactions in messaging and yet maintain transactional integrity:
http://blog.jonathanoliver.com/2011/04/how-i-avoid-two-phase-commit/
http://blog.jonathanoliver.com/2011/03/removing-2pc-two-phase-commit/
http://blog.jonathanoliver.com/2010/04/idempotency-patterns/
Not sure if this helps you but, hey.
It turns out that MSMQ+SQL+DTC don't even offer the consistency guarantees we need. We previously encountered a problem where messages were being processed before the distributed transaction which queued them had been committed to the database, resulting in out-of-date reads. This is a side-effect of using ReadCommitted isolation to consume the queue, since:
1. Start transaction A.
2. Update database table in A.
3. Queue message in A.
4. Request commit of A.
5. Message queue commits A.
6. Start transaction B.
7. Read message in B.
8. Read database table in B, using ReadCommitted <- gets pre-A data.
9. Database commits A.
Our requirement is that B's read of the table block on A's commit, which requires Serializable transactions, and that carries a performance penalty.
It looks like the normal thing to do is indeed to implement the necessary constraints and guarantees oneself, even though it sounds like reinventing the wheel.
Anyone got any comments on this?
If you want to do this by hand, here is a reliable approach. It satisfies (1) and (2), and it doesn't even need the liberties that you allow in (3) and (4).
1. Producer (business logic) starts transaction A.
2. Insert/update whatever into one or more tables.
3. Insert a corresponding message into PrivateMessageTable (part of the domain, and unshared, if you will). This is what will be distributed.
4. Commit transaction A. The producer has now simply and reliably performed its writes, including the insertion of a message, or rolled everything back.
5. A dedicated distributor job queries a batch of unprocessed messages from PrivateMessageTable.
6. The distributor starts transaction B.
7. Mark the unprocessed messages as processed, rolling back if the number of rows modified is different than expected (two instances running at the same time?).
8. Insert a public representation of the messages into PublicMessageTable (a publicly exposed table, in whatever way). Assign new, strictly sequential Ids to the public representations. Because only one process is doing these inserts, this can be guaranteed. Note that the table must be on the same host to avoid 2PC.
9. Commit transaction B. The distributor has now distributed each message to the public table exactly once, with strictly sequential Ids.
10. A consumer (there can be several) queries the next batch of messages from PublicMessageTable with Id greater than its own LastSeenId.
11. The consumer starts transaction C.
12. The consumer inserts its own representation of the messages into its own table, ConsumerMessageTable (thus advancing LastSeenId). Insert-ignore can help protect against multiple instances running. Note that this table can be on a completely different server.
13. Commit transaction C. The consumer has now consumed each message exactly once, in the same order the messages were made publicly available, without ever skipping a message.
14. We can do whatever we want based on the consumed messages.
Of course, this requires very careful implementation.
It is even suitable for database clusters, as long as there is only a single write node, and both reads and writes perform causality checks. It may well be that having one of these is sufficient, but I'd have to consider the implications more carefully to make that claim.
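For illustration, a minimal JDBC sketch of the producer side of this scheme (its first four steps); table and column names are made up. The point is that the business write and the message insert share one local transaction, so no distributed transaction is needed:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ProducerSide {
    public void createWidget(Connection conn, String widgetName) throws SQLException {
        conn.setAutoCommit(false);  // begin transaction A
        try (PreparedStatement insertWidget = conn.prepareStatement(
                     "INSERT INTO Widget (name) VALUES (?)");
             PreparedStatement insertMessage = conn.prepareStatement(
                     "INSERT INTO PrivateMessageTable (payload) VALUES (?)")) {
            insertWidget.setString(1, widgetName);
            insertWidget.executeUpdate();
            insertMessage.setString(1, "WidgetCreated:" + widgetName);
            insertMessage.executeUpdate();
            conn.commit();          // commit transaction A: both writes or neither
        } catch (SQLException e) {
            conn.rollback();        // rolls back the message along with the data
            throw e;
        }
    }
}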