Replicate file over kafka and prevent duplicate data - apache-kafka

I'm interested in publishing a file's contents over a Kafka topic in real time (I can do this in Python), but I'm wondering what strategy might be effective to prevent sending duplicate data in case my publisher crashes and I need to restart it. Is there anything in Kafka that can help with this directly, or must I explicitly track the file offset I've published so far?
I suppose another way might be for the publisher to bootstrap by reading back the data already published, counting the bytes received, and then seeking into the file to recover?
Are there any existing scripts or apps that already handle this which I could perhaps leverage instead?
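For reference, here is a minimal sketch of the offset-tracking approach I have in mind, assuming the kafka-python client (the file, checkpoint, and topic names are just placeholders). It still leaves a small window where a record is acked but the checkpoint write fails, so duplicates remain possible on restart, which is why I'm asking whether Kafka can do better:

    # Sketch: publish a file line by line and checkpoint the byte position that
    # has been acked, so a restarted publisher can seek and resume.
    import os
    from kafka import KafkaProducer

    DATA_PATH = "data.log"
    CHECKPOINT_PATH = "data.log.offset"   # last file position successfully published
    TOPIC = "file-replication"

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # Resume from the last checkpoint after a crash and restart.
    start = 0
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            start = int(f.read().strip() or 0)

    with open(DATA_PATH, "rb") as f:
        f.seek(start)
        for line in f:
            producer.send(TOPIC, value=line)
            producer.flush()              # wait for the broker ack before checkpointing
            start += len(line)
            with open(CHECKPOINT_PATH, "w") as cp:
                cp.write(str(start))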

Instead of publishing it yourself, I strongly recommend using Kafka Connect. In addition to not having to write custom code, some connectors can also handle the "exactly-once" feature for you.
More details about connectors can be found here: https://www.confluent.io/product/connectors/

You might want to check Kafka's log compaction feature. It does the deduplication for you, provided you have a unique key for all the duplicate messages.
https://kafka.apache.org/documentation/#compaction
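For instance, if each chunk of the file is keyed by its byte offset, a chunk that gets re-published after a crash carries the same key, and compaction will eventually keep only one copy. A rough sketch with the kafka-python client (topic name and chunk size are made up, and the topic must be created with cleanup.policy=compact):

    # Publish fixed-size file chunks keyed by their byte offset so that a
    # compacted topic keeps at most one copy per offset.
    from kafka import KafkaProducer

    TOPIC = "file-replication-compacted"
    CHUNK_SIZE = 4096

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    with open("data.log", "rb") as f:
        offset = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            # The byte offset is the unique key; duplicates sent after a restart
            # collapse into a single record once the log is compacted.
            producer.send(TOPIC, key=str(offset).encode("utf-8"), value=chunk)
            offset += len(chunk)

    producer.flush()

Note that compaction runs periodically in the background, so a consumer reading in real time can still see a duplicate before it is cleaned up.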

Related

How to ensure exactly once processing in Kafka for older version 0.10.0

I have a scenario where I need to call an API on the notification service, which will send an email to the user.
Based on the message that appears on the topic, I need to consume it and call the API.
But if the consumer fails in between, then due to the at-least-once configuration of the consumer it's hard to avoid duplicate processing.
Sorry for the newbie question. I tried to find blogs on this, but most of them use a newer version of Kafka, while our infrastructure is still on an older version.
Wondering how we can achieve this behavior?
Links that I referred to earlier:
If exactly-once semantics are impossible, what theoretical constraint is Kafka relaxing?
I also saw the documentation where they mention storing the processing result and the offset in the same storage. But my scenario is an API call: imagine the consumer calls the API and dies before committing the offset; then the next poll will get that message and call the API again. In this way the email will be sent twice.
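To make the failure window concrete, my consumer loop is roughly like this (a simplified sketch with the kafka-python client; send_email is a stand-in for the notification API call):

    # Simplified at-least-once loop: if the process dies after send_email() but
    # before commit(), the next poll re-delivers the record and the email is
    # sent a second time.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "notifications",
        bootstrap_servers="localhost:9092",
        group_id="email-sender",
        enable_auto_commit=False,       # commit manually, only after the API call
        auto_offset_reset="earliest",
    )

    def send_email(payload):
        ...                             # call the notification service API here

    for record in consumer:
        send_email(record.value)        # <-- crash after this line ...
        consumer.commit()               # ... but before this one => duplicate email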
Any help is appreciated.

Does it make sense to use Apache Kafka for this Scenario?

There are several applications which have to be integrated and have to exchange Issues. One of them will get the Issue, do something with it, and later on change the Status of this Issue. The other applications involved with this Issue should then get the new information. This continues until the Issue reaches the final Status, Closed. The problem is that the Issues have to be mapped, because these applications do not all support the same data format.
I'm not sure whether to always send the whole Issue or just the new Status as an event.
How does Kafka Support Data Transformation?
What if my Issue has an attachment? (>5 MB)
Thanks for your advice
Yes it does make sense.
Kafka can do transformations through both the Kafka Streams API, and KSQL which is a streaming SQL engine built on top of Kafka Streams.
Typically Kafka is used for smaller messages; one pattern to consider for larger content is to store it in an object store (e.g. S3, or similar depending on your chosen architecture) and include a reference to it in your Kafka message.
I'm not sure whether to always send the whole Issue or just the new Status as an event.
You can do this either way. If you send the whole Issue and publish all subsequent updates to the same Issue as Kafka messages that share a common message key (perhaps a unique Issue ID number), then you can configure your Kafka topic as a compacted topic, and the brokers will automatically delete any older copies of the data to save disk space.
If you choose to only send deltas (changes), then you need to be careful to have a retention period that's long enough that the initial complete record will never expire while the issue is still open and publishing updates. The default retention period is 7 days.
How does Kafka Support Data Transformation?
In Kafka Connect via Single Message Transforms (SMT), or in Kafka Streams using native Streams code (in Java).
What if my Issue has an attachment? (>5 MB)
You can configure Kafka for large messages, but if they are much larger than 5 or 10 MB then it's usually better to follow a claim-check pattern: store them externally to Kafka and publish just a reference back to the externally stored data, so the consumer can retrieve the attachment out of band from Kafka.
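A rough sketch of that claim-check approach, assuming S3 via boto3 and the kafka-python client (the bucket, topic, and field names are just placeholders):

    # Store the attachment in an object store and publish only a small JSON
    # reference on the Kafka topic; consumers fetch the attachment out of band.
    import json
    import uuid

    import boto3
    from kafka import KafkaProducer

    s3 = boto3.client("s3")
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def publish_issue_update(issue_id, status, attachment=None):
        message = {"issue_id": issue_id, "status": status}
        if attachment:
            object_key = "issues/{}/{}".format(issue_id, uuid.uuid4())
            s3.put_object(Bucket="issue-attachments", Key=object_key, Body=attachment)
            message["attachment_ref"] = "s3://issue-attachments/" + object_key
        # Keying by issue_id also makes the topic suitable for compaction.
        producer.send("issue-updates", key=issue_id.encode("utf-8"), value=message)
        producer.flush()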

Kafka: How to Produce multiple incoming files to Kafka?

We have a scenario where Kafka Producer should read a list of incoming files and produce them to Kafka Topics. I've read about FileSourceConnector (http://docs.confluent.io/3.1.0/connect/connect-filestream/filestream_connector.html) but it reads only one file and sends new lines added to that file. File rotation is not handled. A few questions:
1) Is it better to implement our own producer code to meet this requirement, or can we extend the File Connector class so that it reads new files and sends them to Kafka topics?
2) Is there any other source connector that can be used in this scenario?
In terms of performance and ease of development, which approach is better, i.e., developing our own producer code to read files and send them to Kafka, or extending the connector code and making changes to it?
Any kind of feedback will be greatly appreciated!
Thank you!
I personally used the Producer API directly. I handled file rotation and could publish in real time. The tricky part was making sure the files were exactly the same on the source and sink systems (exactly-once processing).
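For illustration, a minimal version of that approach with the kafka-python client might look like the following (the directory and topic names are placeholders, and a real version also has to persist which files and offsets were already published so it survives a restart):

    # Watch a directory for new files and publish each line to a Kafka topic.
    import glob
    import time

    from kafka import KafkaProducer

    INCOMING_DIR = "/data/incoming"
    TOPIC = "incoming-files"

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    already_sent = set()                  # in-memory only; persist this in real code

    while True:
        for path in sorted(glob.glob(INCOMING_DIR + "/*")):
            if path in already_sent:
                continue
            with open(path, "rb") as f:
                for line in f:
                    producer.send(TOPIC, value=line)
            producer.flush()
            already_sent.add(path)
        time.sleep(5)                     # poll the directory every few seconds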
Have you taken a look at Akka Streams - Reactive Kafka? https://github.com/akka/reactive-kafka
Check this example: https://github.com/ktoso/akka-streams-alpakka-talk-demos-2016/blob/master/src/main/java/javaone/step1_file_to_kafka/Step1KafkaLogStreamer.java
You could write a producer as you suggested, or better yet, write your own connector using the developer API.

Kafka consumer does not continue from where it left off

I'm currently using consumer groups to read messages from kafka. I have noticed however, that if my consumer goes down and I bring it back up again, it does not consume messages from where it left off. After reading the documentation here, it seems like I would have to implement this functionality myself. I know there's an autooffset.reset config for the consumer, but that just seems to allow me to either consume everything from the beginning, or consume from the last message currently on the queue. Is my understanding correct? That I would have to implement this myself? Or am I missing something here. It seems like a pretty basic feature that any queueing system should provide out of the box.
The version I'm using is 0.8.1.1 with scala version 2.10.
Based on the link, you're trying to use SimpleConsumer. With SimpleConsumer you need to take care of low-level details like managing offsets yourself. It is more difficult, but it allows you to have more control over how data is consumed.
If all you want is to read data without worrying much about low-level details, take a look at the high-level consumer:
https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example
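For comparison, with a newer broker and client the same consumer-group idea looks like this in Python using kafka-python (this is not the 0.8.1.1 API, just an illustration of how committed offsets let a restarted consumer resume where its group left off):

    # A consumer with a group_id commits its offsets, so after a restart it
    # resumes from the committed position instead of the beginning or the end.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "my-topic",
        bootstrap_servers="localhost:9092",
        group_id="my-consumer-group",    # offsets are tracked per group
        auto_offset_reset="earliest",    # only used when the group has no committed offsets yet
        enable_auto_commit=True,
    )

    for record in consumer:
        print(record.offset, record.value)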

Is Kafka ready for production use?

I have an application in production that has to process several gigabytes of messages per day. I like the Kafka architecture and performance a lot; it perfectly fits my needs.
I'd like to replace my messaging layer with Kafka at some point. Is the 0.7.1 version good enough for production use in terms of stability and consistency in performance?
It is definitely in use at several Big Data companies already, including LinkedIn, where it was created (and later open sourced), and Tumblr. Just Tumblr by itself handles many gigabytes of messages per day. I'm sure LinkedIn is way up there too. You can see a list of companies known to currently use it here:
https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
Also, be sure to subscribe to their mailing list; there are lots of people actively trying it out and using it in production environments.
I'm sure it can handle whatever volume you can throw at it.
There is one critical feature I think Kafka is missing before it is ready for production.
"Flushing messages to disc if the producer can't reach any Kafka broker"
The issue has been filed a long time ago here:
https://issues.apache.org/jira/browse/KAFKA-156
This feature would make the complete Kafka event pipeline even more robust for use cases where the producer always has to be able to send events, for example when you track pageviews or like-button clicks and don't want to miss any events, even if all Kafka brokers are unreachable.
I must agree with Dave: Kafka is a good tool, but it is missing some basic features. Some of them can be done manually, but then you need to ask what Kafka itself provides. Some missing things are:
(As Dave said) Flushing messages to disk when the producer fails to send them
The consumer's ability to track which messages were handled (not just consumed) and which weren't, in case of a restart.
Monitoring - a way to get the current status of the entities in the system, like the current size of the producer's queue or the write/read pace at the brokers (these can be done, but are not part of the tool).
I have used Kafka for quite some time. Using the native Java and Python clients is preferred.
I had to struggle a lot to find a proper Node.js client, and literally rewrote my whole code many times using different clients, as they had a lot of bugs.
I finally settled on franz-kafka for Node.js.
Apart from that, maintaining the consumer offsets is a bit difficult. Kafka is missing some good features, like the exchanges that exist in the AMQP-based Apache Qpid or RabbitMQ.
Still, it's distributed, supports offline messages, and the performance is really impressive. I too preferred it :)