Let's suppose there is a Kafka topic orders. Data is stored in JSON format:
{
"order_id": 1,
"status": 1
}
Status defines the status of the order (pending - 1, completed - 2).
How do I change it to completed when the order is finished?
As far as I know, a Kafka topic is immutable and I cannot change a message's JSON, only produce a new message with the changed value, right?
If your order changes state, the process that changes the state should produce a new message with the new state to the topic. A Kafka Streams application can react to new messages, perform transformations, aggregations or similar, and output the modified/aggregated messages to new topics. So you need a Kafka producer that, when the order state changes, produces a message to the orders topic.
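For illustration, a minimal kafka-python sketch of such a producer (the broker address and serialization are assumptions; status 2 means completed, as in the question):

from kafka import KafkaProducer
import json

# Serialize dicts as UTF-8 JSON; assumes a broker on localhost
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# When the order is finished, append a new event with the new status;
# the earlier "pending" message stays in the topic unchanged.
producer.send('orders', key=b'1', value={'order_id': 1, 'status': 2})
producer.flush()

Consumers that only need the current state keep the latest message per order_id; if the topic is keyed by order_id and configured with cleanup.policy=compact, older statuses are eventually compacted away.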
Related
I have a use case where I want to implement synchronous request/response on top of Kafka. For example, when a user sends an HTTP request, I want to produce a message on a specific Kafka input topic that triggers a dataflow eventually resulting in a response produced on an output topic. I then want to consume the message from the output topic and return the response to the caller.
The workflow is:
HTTP Request -> produce message on input topic -> (consume message from input topic -> app logic -> produce message on output topic) -> consume message from output topic -> HTTP Response.
To implement this case, upon receiving the first HTTP request I want to be able to create a consumer on the fly that will consume from the output topic, before producing a message on the input topic. Otherwise there is a possibility that messages on the output topic are "lost". Consumers in my case have a random group.id and auto.offset.reset = latest for application reasons.
My question is how I can make sure that the consumer is ready before producing messages. I make sure that I call SubscribeTopics before producing messages, but in my tests so far, when there are no committed offsets and Kafka is resetting offsets to latest, there is a possibility that messages are lost and never read by my consumer, because Kafka sometimes thinks that the consumer registered after the messages had been produced.
My workaround so far is to sleep for a bit after I create the consumer, to allow Kafka to finish the offset reset workflow before I produce messages.
I have also tried to implement logic in a rebalance callback (triggered by the consumer subscribing to a topic), in which I call assign with offset = latest for the topic partition, but this doesn't seem to have fixed my issue.
Hopefully there is a better solution out there than sleep.
Most HTTP client libraries have an implicit timeout. There's no guarantee your consumer will ever consume an event or that a downstream producer will send data to the "response topic".
Instead, have your initial request immediately return a 202 Accepted status (or 400, for example, if you do request validation) along with some tracking ID. Then require polling GET requests by ID for status updates, returning either a 404 status or a 200 with some status field in the response body.
You'll need a database to store intermediate state.
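As a rough sketch of that pattern, assuming Flask and kafka-python (the endpoint paths, topic name, and the in-memory dict standing in for a real database are all made up for illustration):

import json
import uuid
from flask import Flask, jsonify
from kafka import KafkaProducer

app = Flask(__name__)
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
statuses = {}  # stand-in for the database holding intermediate state

@app.route('/requests', methods=['POST'])
def accept_request():
    tracking_id = str(uuid.uuid4())
    statuses[tracking_id] = {'status': 'PENDING'}
    # Trigger the dataflow and return immediately instead of blocking
    producer.send('input-topic', {'tracking_id': tracking_id})
    return jsonify({'tracking_id': tracking_id}), 202

@app.route('/requests/<tracking_id>', methods=['GET'])
def poll_request(tracking_id):
    result = statuses.get(tracking_id)
    if result is None:
        return '', 404
    return jsonify(result), 200

# A separate consumer (not shown) reads the output topic and updates
# statuses[tracking_id] when the response arrives.

if __name__ == '__main__':
    app.run()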
In event-driven design we strive to identify the events we are interested in. Using Kafka, we can easily subscribe (with a new group.id) to a topic and start consuming events. If the retention policy is the default one, we could also consume messages up to one week old by specifying auto.offset.reset=earliest. Right? But what if we want to start from the very beginning? I guess a KTable should be used, but I'm not sure what will happen when a new client subscribes to a stateful stream. Could you tell me whether it is true that the new subscriber will receive all aggregated messages?
You can't consume data that has been deleted.
That's why KTables are built on top of compacted topics, which store the latest value for each key and have infinite retention.
If you want to read the "current state" of the table, to get all aggregated messages, then you can use Interactive Queries.
not sure what will happen when a new client subscribes to a stateful stream
It needs to read the entire compacted topic, starting from the beginning (the earliest available offset, not necessarily the first message ever produced), since it cannot easily find where in the topic each unique key may start.
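Outside of Kafka Streams, the same idea can be sketched with a plain kafka-python consumer that replays the compacted topic from the earliest available offset and keeps only the last value seen per key (the topic name and broker address are assumptions):

import json
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers='localhost:9092')
topic = 'orders-compacted'
partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
consumer.assign(partitions)
consumer.seek_to_beginning(*partitions)

# Replay the whole topic; later records overwrite earlier ones,
# so the dict ends up holding the latest value for every key.
table = {}
end_offsets = consumer.end_offsets(partitions)
while any(consumer.position(tp) < end_offsets[tp] for tp in partitions):
    for batch in consumer.poll(timeout_ms=1000).values():
        for record in batch:
            table[record.key] = json.loads(record.value)

That is roughly the situation the answer describes: a new subscriber has to scan the compacted topic end to end before it has the full aggregated state.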
I have a database with time series data and this data is sent to Kafka.
Many consumers build aggregations and reporting based on this data.
My Kafka cluster stores data with a TTL of 1 day.
But how can I build a new report and run a new consumer from the 0th position, which no longer exists in Kafka but does exist in the source storage?
For example, is there some callback to the producer if I request an offset that does not exist in Kafka?
If that is not possible, please advise other architectural solutions. I want to use the same codebase to aggregate this data.
For example, is there some callback to the producer if I request an offset that does not exist in Kafka?
If the data does not exist in Kafka, you cannot consume it, much less do any aggregation on top of it.
Moreover, there is no concept of a consumer requesting data from a producer. A producer sends data to the Kafka broker(s) and consumers consume from those broker(s). There is no direct interaction between a producer and a consumer as such.
Since you say that the data still exists in the source DB, you can fetch it from there and re-produce it to Kafka.
When you produce that data again, it will arrive as new messages, which will eventually be consumed by the consumers as usual.
In case you would like to differentiate between the initial consumption and the re-consumption, you can produce these messages to a new topic and have your consumers consume from it, as in the sketch below.
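A rough sketch of that replay step, assuming the source storage is a sqlite database and using kafka-python (the database file, table, column, and topic names are invented for illustration):

import json
import sqlite3
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Re-read the historical rows from the source storage...
conn = sqlite3.connect('timeseries.db')
rows = conn.execute('SELECT ts, metric, value FROM measurements ORDER BY ts')

# ...and produce them to a dedicated replay topic so the same
# aggregation code can consume them without touching the live topic.
for ts, metric, value in rows:
    producer.send('timeseries-replay', {'ts': ts, 'metric': metric, 'value': value})
producer.flush()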
Another way is to increase your TTL (I suppose you mean Kafka retention when you say TTL), and then you can seek back to a timestamp in the consumers using the offsetsForTimes(Map<TopicPartition,Long> timestampToSearch) and seek(TopicPartition topicPartition, long offset) methods.
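Those are the Java client methods; with kafka-python the equivalent looks roughly like this (the topic name, broker address, group id, and the one-day window are assumptions):

import time
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers='localhost:9092', group_id='reports')
topic = 'timeseries'
partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
consumer.assign(partitions)

# Find the first offset at or after a timestamp (here: 24 hours ago)...
one_day_ago_ms = int((time.time() - 24 * 3600) * 1000)
offsets = consumer.offsets_for_times({tp: one_day_ago_ms for tp in partitions})

# ...and rewind each partition to it before consuming as usual.
for tp, offset_and_timestamp in offsets.items():
    if offset_and_timestamp is not None:
        consumer.seek(tp, offset_and_timestamp.offset)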
I have an application which produces to a Kafka topic and consumes from the same topic. It consumes using a KStream and creates a materialization such that state can be queried. Consider the following scenario:
1. The application produces to the topic.
2. Immediately after producing, it queries the local state store.
Is it possible to guarantee that the state store query will return the state reflected by the message produced in (1)? It seems that there will be a lag in the state store reflecting the true state, since the message must undergo a roundtrip to the topic and back again via the consumer.
In a Python program I would like to write some messages to Kafka, then read the response, with the same number of messages, from a remote app on a different topic. The problem is that by the time I am done sending messages, the other end has already started responding, and when I start reading I only get the tail end of the message batch, or no messages at all, depending on timing. This contradicts my understanding of the package, i.e. I thought that if I create a consumer with auto_offset_reset='latest' and subscribe to a topic, it remembers the offset at subscription time, and when I iterate over the consumer object it starts reading messages from that offset.
Here is what I do:
I create a consumer first and subscribe to the output topic:
from kafka import KafkaConsumer, KafkaProducer
import json

consumer = KafkaConsumer(
    bootstrap_servers=host + ':' + broker_port,
    group_id="0",
    auto_offset_reset='latest',
    consumer_timeout_ms=10000
)
consumer.subscribe(topics=[topic_out])
then create a producer and send messages to topic_in:
producer = KafkaProducer(
    bootstrap_servers=host + ':' + broker_port
)
future = producer.send(topic_in, json.dumps(record).encode('utf-8'))
future.get(timeout=5)
Then I start reading from the consumer:
results = []
for msg in consumer:
    message = json.loads(msg.value)
    results.append(message)
I tried consumer.pause() before sending and consumer.resume() after sending - it does not help.
Is there something I am missing in the configuration, or do I misunderstand how the consumer works?
Thanks in advance!
Sounds like you have a race condition.
One solution would be to store a local dictionary or sqlite table that's built from consuming this "lookup topic", then when you consume from the "action topic", you are doing lookups locally rather than starting a consumer to scan over the "lookup topic" for the data you need.
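A minimal sketch of that idea with kafka-python ("lookup-topic" and "action-topic" are placeholders, like the names in the answer, and the id field is assumed):

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'lookup-topic', 'action-topic',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    value_deserializer=lambda v: json.loads(v)
)

lookup = {}  # in-memory stand-in; a sqlite table would survive restarts

for msg in consumer:
    if msg.topic == 'lookup-topic':
        # Keep the local table up to date as lookup messages arrive
        lookup[msg.value['id']] = msg.value
    else:
        # Enrich action messages from the local table instead of
        # re-scanning the lookup topic for every message.
        enriched = {**msg.value, 'lookup': lookup.get(msg.value['id'])}
        print(enriched)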