Is it possible in Spring Kafka to send a message that will expire on a per-message (not per-template or higher) basis - apache-kafka

I am trying to use Kafka as a request-response system between two clients, much like RabbitMQ, and I was wondering if it is possible to set an expiration on a message so that after it is posted it will automatically get deleted from the Kafka servers.
I'd like to do this at the per-message level (per-topic would also be acceptable, as long as I can keep using the same template).
I checked ProducerRecord, but all it has is a timestamp. I also don't see any mention of expiration in KafkaHeaders.

Kafka records are deleted in segments (groups of messages) based on the topic's overall retention settings.
Spring is just a client. It doesn't control the server-side logic of the log cleaner.
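Since expiry can only be controlled through retention, the closest workaround is to give the request/response topic a short retention of its own. A minimal sketch with the plain Kafka AdminClient, assuming a hypothetical "rpc-requests" topic and a 60-second retention (the topic name, broker address, and values are illustrative):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.config.TopicConfig;

    public class CreateShortLivedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // retention.ms applies to the whole topic and is enforced per log segment,
                // not per record; old segments are removed roughly 60s after their newest record.
                NewTopic requests = new NewTopic("rpc-requests", 1, (short) 1)
                        .configs(Collections.singletonMap(TopicConfig.RETENTION_MS_CONFIG, "60000"));
                admin.createTopics(Collections.singleton(requests)).all().get();
            }
        }
    }

Anything finer-grained than that (a true per-message TTL) would have to live in your own consumer logic, e.g. a deadline carried in a header or in the payload that the consumer checks before processing.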

Related

kafka message expiry event - how to capture

I am a beginner to Kafka and recently started using it in my projects at work. One important thing I want to know is whether it is possible to capture an event when messages expire in Kafka. The intent is to trap these expired messages and back them up in a backup store.
I believe the goal you want to achieve is similar to Apache Kafka Tiered Storage, which is still under development in open-source Apache Kafka.
Messages don't expire as such. There are two different scenarios that can look like message expiry.
A topic is configured with cleanup.policy = delete. After retention.ms or retention.bytes it looks like messages expire. What actually happens is that a whole log segment is deleted once its newest message is older than retention.ms or the partition exceeds retention.bytes. A segment is only considered for deletion if it is not the active segment that Kafka is currently writing to.
A topic is configured with cleanup.policy = compact. When log segments are compacted, Kafka makes sure that only the latest value for each distinct key is kept. To "delete" a message you send a record with the same key and a null value - also called a tombstone (see the sketch below).
There's no hook or event you could subscribe to in order to figure out whether either of these two cases will happen. You'd have to take care of that logic on the client side, which is hard because the Kafka API does not expose any details about the log segments within a partition.
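For the compaction case, a tombstone is just a normal produce call with a null value. A minimal sketch, assuming a hypothetical compacted topic "user-profiles" keyed by user id:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class TombstoneExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // A null value is the tombstone: after compaction, earlier records with this key are removed.
                producer.send(new ProducerRecord<>("user-profiles", "user-42", null));
                producer.flush();
            }
        }
    }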

Graphql subscriptions in a distributed system with Kafka (and spring boot)

I have the following situation:
I have 5 instances of the same service, all in the same kafka consumer group. One of them has a websocket connection to the client (the graphql subscription). I use graphql-java and Spring Boot.
When that connection is opened, I produce events from any of the 5 instances (with a message key defined so they go to the same partition and stay ordered) and I need all those events to be consumed by the same instance that opened that connection. Not by the other 4.
Even if the partition assignment happens to play in my favor, a rebalance can occur at any time, leaving me out of luck.
My implementation is using reactor-kafka but I think it's just an implementation detail.
The options I see are:
Start listening on that topic with a new group id each time, so that the service holding the connection always receives the messages from that topic (but the 5 instances in the original group id receive them too)
Create a new topic for each websocket connection, so only the producer knows that topic (but the topic id would have to be sent in the Kafka events so that the producers of those events know where to publish them)
If I receive a message and I'm not the instance with the connection, don't ack it. But this would slow things down and seems hacky
Start using something different altogether like Redis PubSub to receive all messages in all consumers and check for the connection.
I see there's an implementation for Node, but I don't see how it solves the problem.
A similar question explains how to program a subscription, but doesn't address this distributed aspect.
Is the cleanest approach one of the ones I suggested? Is there an approach with Kafka that I'm not seeing? Or am I misunderstanding some piece?
I ended up using one consumer group id per listener, with a topic specifically for those events.
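A sketch of that pattern with reactor-kafka (which the question already uses), assuming each instance generates its own group id so every instance sees every event, and a hypothetical "subscription-events" topic; the instance holding the websocket then filters for its own subscriptions:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.UUID;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import reactor.core.publisher.Flux;
    import reactor.kafka.receiver.KafkaReceiver;
    import reactor.kafka.receiver.ReceiverOptions;
    import reactor.kafka.receiver.ReceiverRecord;

    public class SubscriptionEventListener {
        public Flux<ReceiverRecord<String, String>> listen() {
            Map<String, Object> props = new HashMap<>();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            // A unique group id per instance means every instance receives every record,
            // so the instance holding the websocket can filter for its own subscriptions.
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "graphql-ws-" + UUID.randomUUID());
            // Start at the latest offset so a restarted instance doesn't replay old events.
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");

            ReceiverOptions<String, String> options = ReceiverOptions.<String, String>create(props)
                    .subscription(Collections.singleton("subscription-events"));
            return KafkaReceiver.create(options).receive();
        }
    }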

Filter Stream (Kafka?) to Dynamic Number of Clients

We are designing a big data processing chain. We are currently looking at NiFi/MiNiFi to ingest data, do some normalization, and then export to a DB. After the normalization we plan to fork the data so that we can have a real-time feed which can be consumed by clients using different filters.
We are looking at both NiFi and Kafka to send data to the clients, but are running into design problems with both of them.
With NiFi we are considering adding a websocket server which listens for new connections and adds their filter to a custom stateful processor block. That block would filter the data, tag it with the appropriate socket id if it matched a user's filter, and then generate X number of flow files to be sent to the matching clients. That part seems like it would work, but we would also like to be able to queue data in the event that a client connection drops for a short period of time.
As an alternative we are looking at Kafka, but it doesn't appear to support the idea of a dynamic number of clients connecting. If we used Kafka Streams to filter the data, it appears we would need one topic per client, which would likely eventually overrun our ZooKeeper instance. Or we could use NiFi to filter and then insert into different partitions, sort of like here: Kafka work queue with a dynamic number of parallel consumers. But there is still a limit to the number of partitions that are supported, correct? Not to mention we would have to juggle the producer and consumers to read from the right partition as we scaled up and down.
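For reference, a rough sketch of what the Kafka Streams filtering mentioned above could look like without one topic per client: a single hypothetical "raw-feed" input topic fanned out to a single "filtered-feed" output topic keyed by socket id (all topic names, and the in-memory filter registry, are illustrative assumptions):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Predicate;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class ClientFilterStream {
        // Filters per connected socket id, maintained by the (hypothetical) websocket layer.
        static final Map<String, Predicate<String>> FILTERS = new ConcurrentHashMap<>();

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "client-filter");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> raw = builder.stream("raw-feed");

            // Fan each record out to every client whose filter matches, keyed by socket id,
            // into one shared output topic instead of one topic per client.
            raw.flatMap((key, value) -> {
                List<KeyValue<String, String>> matches = new ArrayList<>();
                for (Map.Entry<String, Predicate<String>> e : FILTERS.entrySet()) {
                    if (e.getValue().test(value)) {
                        matches.add(KeyValue.pair(e.getKey(), value));
                    }
                }
                return matches;
            }).to("filtered-feed");

            new KafkaStreams(builder.build(), props).start();
        }
    }

The websocket service could then consume "filtered-feed" and route each record by its key, which avoids both one-topic-per-client and juggling partitions by hand.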
Am I missing anything for either NiFi or Kafka? Or is there another project out there I should consider for this filtered data sending?

Ingesting data from REST APIs to Kafka

I have many REST APIs to pull data from different data sources, and now I want to publish these REST responses to different Kafka topics. I also want to make sure that duplicate data is not produced.
Are there any tools available to do this kind of operation?
In general, a Kafka processing pipeline should be able to handle messages that are delivered multiple times. Exactly-once delivery of Kafka messages is a feature that has only been around since mid-2017 (given that I'm writing this in January 2018) and Kafka 0.11, so unless your Kafka installation is on the bleeding edge, your pipeline should be built to handle multiple deliveries of the same message.
That covers your pipeline. But you also have the problem that your data source may deliver the same message multiple times to your HTTP -> Kafka microservice.
Ideally you should design your pipeline to be idempotent: multiple applications of the same change message should only affect the data once. This is, of course, easier said than done. But if you manage it, then the problem is solved: duplicate messages can just flow through and it doesn't matter. This is probably the best thing to aim for, regardless of whatever exactly-once, CAP-theorem-bending magic KIP-98 does. (And if you don't get why that is such magic, well, here's a homework topic :) )
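For completeness, the KIP-98 piece on the producer side is a single config switch; it only prevents duplicates caused by the producer's own retries and does nothing about the upstream REST source handing you the same data twice. A minimal sketch (broker address and serializers are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class IdempotentProducerFactory {
        static KafkaProducer<String, String> create() {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // KIP-98 / Kafka 0.11+: broker-side deduplication of this producer's retries.
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
            return new KafkaProducer<>(props);
        }
    }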
Let's say your input data is posts about users. If your posted data includes some kind of updated_at date, you could create a transaction-log Kafka topic: set the key to be the user ID and the values to be all the (say) updated_at fields applied to that user. When you're processing an HTTP POST, look up the user in a local KTable for that topic and check whether your post has already been recorded. If it has already been recorded, don't produce the change into Kafka.
Even without the updated_at field you could save the whole user document in the KTable. If Kafka is a stream of transaction-log data (the database turned inside out), then a KTable is the stream turned right side out: a database again. If the current value in the KTable (the accumulation of all applied changes) matches the object you were given in your POST, then you've already applied the changes.
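A rough sketch of that idea with Kafka Streams, assuming a hypothetical "incoming-posts" topic that the HTTP layer produces to and a "user-updates" topic acting as the deduplicated transaction log, both keyed by user ID (this illustrates the shape of the solution, not a drop-in implementation):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    public class DedupeStream {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "rest-dedupe");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Current state per user ID, built from everything already accepted.
            KTable<String, String> current = builder.table("user-updates");
            KStream<String, String> posts = builder.stream("incoming-posts");

            // Keep a post only if it differs from the value already recorded for that key,
            // then append it to the deduplicated log.
            posts.leftJoin(current, (posted, existing) -> posted.equals(existing) ? null : posted)
                 .filter((userId, change) -> change != null)
                 .to("user-updates");

            new KafkaStreams(builder.build(), props).start();
        }
    }

Note that this only deduplicates against what has already landed in "user-updates"; a just-produced change may not yet be visible in the KTable, so it narrows the duplicate window rather than closing it completely.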

Using Kafka to Transfer Files between two clients

I have a Kafka cluster set up between two machines (machine#1 and machine#2) with the following configuration:
1) Each machine is configured to have one broker and one ZooKeeper instance running.
2) Server and ZooKeeper properties are configured for a multi-broker, multi-node ZooKeeper setup.
I currently have the following understanding of KafkaProducer and KafkaConsumer:
1) If I send a file from machine#1 to machine#2, it's broken down into lines using some default delimiter (LF or \n).
2) Therefore, if machine#1 publishes 2 different files to the same topic, that doesn't mean that machine#2 will receive the two files separately. Instead, every line will be appended to the topic's log partitions and machine#2 will read them from the log partitions in the order of arrival, i.e. the order is not guaranteed to be
file1-line1
file1-line2
end-of-file1
file2-line1
file2-line2
end-of-file2
but it might be something like:
file1-line1
file2-line1
file1-line2
end-of-file1
file2-line2
end-of-file2
Assuming that the above is correct (I'm happy to be wrong), I believe simple producer/consumer usage is not the correct approach for transferring files (the Connect API is probably the solution here). Since the Kafka website says that "log aggregation" is a very popular use case, I was wondering if someone has any example projects or websites that demonstrate file exchange using Kafka.
P.S. I know that by definition the Connect API is for reliable data exchange between Kafka and "other" systems - but I don't see why the other system cannot also be Kafka. So I am hoping that my question doesn't have to focus on "other" non-Kafka systems.
Your understanding is correct. However, if you want the same order you can use just one partition for that topic,
so the order in which machine#2 reads will be the same as the order in which you sent.
This will be inefficient, though, and will lack the parallelism for which Kafka is widely used.
Kafka only guarantees ordering within a partition. To quote the documentation:
"Kafka only provides a total order over records within a partition, not between different partitions in a topic."
To send all the lines from one file to a single partition, send a key with each record; the producer client hashes the key so that all messages with the same key land on the same partition (see the sketch below).
This will make sure machine#2 receives the events from one file in the same order. If you have any questions feel free to ask, as we use Kafka in production for ordering guarantees on events generated from multiple sources, which is basically your use case as well.
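A minimal sketch of that keying approach, assuming a hypothetical "file-transfer" topic and brokers reachable at machine1:9092 and machine2:9092; with the file name as the key, every line of a given file hashes to the same partition and therefore arrives in order:

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class FileLineProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "machine1:9092,machine2:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            Path file = Paths.get("file1.txt");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
                 BufferedReader reader = Files.newBufferedReader(file)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Key = file name, so every line of this file goes to the same partition.
                    producer.send(new ProducerRecord<>("file-transfer", file.toString(), line));
                }
                producer.flush();
            }
        }
    }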