Sending from external system to Kafka without duplicates in a transaction - apache-kafka

I have a requirement to send data from an external system to a Kafka topic with exactly once semantics.
The source has an offset; we can consume messages from a given offset.
Looking at the Kafka documentation, I see there are two ways to do this:
1) Kafka Source Connector
2) Use a plain Kafka producer with transactions.
It looks like option 1 doesn't support exactly-once semantics yet; Kafka JIRA 6080 is still unresolved. I would also like to understand how we can do this directly with the producer APIs.
For option 2, the (consume, transform, produce) loop in all the documents shows committing consumer offsets using AddOffsetsToTxn. What is the recommended strategy if the source is not a Kafka topic? It looks like writing the source offset to a different topic as part of the transaction, and using it during recovery, would work. Is this the recommended way?
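Something like the sketch below is what I have in mind; the topic names (data-topic, source-offsets) and the source-reading helpers are only placeholders:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExternalSourceLoader {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // A stable transactional.id lets the broker fence stale "zombie" instances
        // and resolve in-flight transactions after a crash/restart.
        props.put("transactional.id", "external-source-loader-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            long offset = recoverSourceOffset(); // latest value previously written to "source-offsets"

            while (true) {
                List<String> batch = readFromExternalSource(offset); // placeholder for the external read
                if (batch.isEmpty()) continue;

                producer.beginTransaction();
                try {
                    for (String value : batch) {
                        producer.send(new ProducerRecord<>("data-topic", value));
                    }
                    long nextOffset = offset + batch.size();
                    // The source offset goes into the SAME transaction, so either both
                    // the data and the offset become visible, or neither does.
                    producer.send(new ProducerRecord<>("source-offsets", "my-source",
                            Long.toString(nextOffset)));
                    producer.commitTransaction();
                    offset = nextOffset;
                } catch (Exception e) {
                    // Real code must distinguish fatal producer exceptions; this is only a sketch.
                    producer.abortTransaction();
                }
            }
        }
    }

    // Placeholders for whatever the external system provides; not part of any Kafka API.
    private static long recoverSourceOffset() { return 0L; }
    private static List<String> readFromExternalSource(long offset) { return List.of(); }
}
```

On restart, the idea would be to read the latest committed value from source-offsets with a read_committed consumer and resume the external read from there.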

Related

Processing Unprocessed Records in Kafka on Recovery/Rebalance

I'm using Spring Kafka to interface with my Kafka instance. Assume that I have a single topic with, say, 2+ partitions.
In the instances where, for example, my Spring Kafka-based application crashes (or even rebalances) and then comes back online with messages waiting in the topic, I currently store the latest committed offset for each partition in an external store; when a consumer is assigned a partition, I look up that offset and seek to it to resume processing.
(This is based on a strategy I'd read about in an O'Reilly book.)
Is there a better way of handling this situation in order to implement "exactly once" semantics and not miss any waiting messages? Or is there a better/more idiomatic way to handle this with Spring Kafka?
Thanks in advance.
Is there a reason you don't checkpoint your offsets to Kafka itself?
Generally, your options for "exactly once" processing are:
1) Store your offsets and your side effects together transactionally. This is only possible if your side effects go into a transaction-capable system (say, a database).
2) Use Kafka transactions (a sketch follows below). This is a simplified variant of 1, as long as your side effects go to the same Kafka cluster you read from.
3) Come up with a scheme that allows you to detect and disregard duplicates downstream of your Kafka pipeline (a.k.a. idempotence).
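A bare-bones sketch of option 2 with the plain clients; the topic names (input, output) and the transformation are placeholders:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReadProcessWriteTxn {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put("bootstrap.servers", "localhost:9092");
        cProps.put("group.id", "processor");
        cProps.put("enable.auto.commit", "false");          // offsets are committed via the producer below
        cProps.put("isolation.level", "read_committed");    // skip records from aborted transactions
        cProps.put("key.deserializer", StringDeserializer.class.getName());
        cProps.put("value.deserializer", StringDeserializer.class.getName());

        Properties pProps = new Properties();
        pProps.put("bootstrap.servers", "localhost:9092");
        pProps.put("transactional.id", "processor-txn-1");
        pProps.put("key.serializer", StringSerializer.class.getName());
        pProps.put("value.serializer", StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(List.of("input"));
            producer.initTransactions();

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) continue;

                producer.beginTransaction();
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> rec : records) {
                    producer.send(new ProducerRecord<>("output", rec.key(), rec.value().toUpperCase()));
                    offsets.put(new TopicPartition(rec.topic(), rec.partition()),
                            new OffsetAndMetadata(rec.offset() + 1));
                }
                // Consumer offsets ride in the same transaction as the produced records.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            }
        }
    }
}
```

Downstream consumers of output that also use isolation.level=read_committed will only ever see records from committed transactions.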

Is Kafka a message queue and can Kafka be used as the database?

Some places mention that Kafka is publish-subscribe messaging; other sources mention that Kafka is a message queue. May I ask what the differences between those are, and can Kafka be used as a database?
Publish-Subscribe and Message Queue are two distinct patterns, and their differences are discussed in various places.
Kafka supports both of these patterns. For the publish-subscribe pattern, Kafka has publishers and subscribers: a publisher sends messages to a topic, and any subscriber can subscribe to that topic and receive the messages. For the queueing pattern, Kafka has a concept named consumer group: within the same consumer group, all consumers share the work, hence balancing the workload.
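For illustration (the topic and group names here are invented), the same consumer code serves both patterns; only how you assign group.id decides which one you get:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Queue style: start several instances with the SAME group.id and the topic's
        // partitions are divided among them, so each record is handled by one instance.
        // Pub-sub style: give each application its OWN group.id and every application
        // independently receives every record.
        props.put("group.id", "billing-service");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            rec.partition(), rec.offset(), rec.value());
                }
            }
        }
    }
}
```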
Because of this flexible design, Kafka is broadly used for many software patterns when designing a system.
Personally, I would not call Kafka itself a database, but you can use Kafka as storage, especially through mechanisms such as log compaction. Ref1 Ref2
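For example, a compacted topic could be created like this (the topic name and sizing are made up):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // cleanup.policy=compact keeps the latest value per key instead of deleting by age,
            // which is what makes using a topic as a key-value store workable.
            NewTopic topic = new NewTopic("customer-profiles", 6, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```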
At its base, Kafka is storage, like a database but without indexes, where every query is a full scan of your data. Kafka stores data in files that cannot be modified. For example, if you use event sourcing, you can save all the events of your system in Kafka and reprocess them all when your system has a bug.
Imagine that Kafka can split a very large file (10 TB or more) across multiple servers and provide a way to read that file in a distributed manner using partitions (the more partitions you have, the more applications can read in parallel).
Because it is storage at its base, Kafka can also be used as a message queue or as a publish-subscribe system.
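As a sketch of that reprocessing idea (the topic name and the applyEvent helper are placeholders), a fresh consumer group starting from the earliest offset replays the whole log:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayEvents {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "replay-" + System.currentTimeMillis()); // fresh group => no committed offsets
        props.put("auto.offset.reset", "earliest");                    // start from the first retained record
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("account-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> rec : records) {
                    applyEvent(rec.key(), rec.value()); // rebuild state with the fixed logic
                }
            }
        }
    }

    // Placeholder for whatever rebuilds the application state from an event.
    private static void applyEvent(String key, String value) { }
}
```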

Apache Kafka Lineage Information

Is there any way to obtain lineage data for Kafka jobs?
Like for example we have a Job History URL for all the MapReduce jobs.
Is there anything similar for Kafka where I can get the metadata of the Producer producing information to a particular topic? (eg: IP address of the producer)
Out of the box, no, not for Producers.
You can list consumer client addresses, but not see the producers into a topic.
One option would be to look into Apache Atlas for lineage information, which is one component of Hortonworks' Streams Messaging Manager for analyzing Kafka connection information.
Otherwise, you would be stuck trying to enforce that producers send this data along with the messages.
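For example, producers could attach origin metadata as record headers (the header names, topic, and values below are made up), which downstream tooling could then read:

```java
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerWithOriginHeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("events", "some-key", "some-value");
            // Record headers (Kafka 0.11+) can carry origin metadata alongside the payload.
            record.headers()
                  .add("producer-app", "billing-service".getBytes(StandardCharsets.UTF_8))
                  .add("producer-host",
                       InetAddress.getLocalHost().getHostAddress().getBytes(StandardCharsets.UTF_8));
            producer.send(record);
            producer.flush();
        }
    }
}
```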

Where to run the processing code in Kafka?

I am trying to setup a data pipeline using Kafka.
Data goes in (with producers), gets processed, enriched, and cleaned, and moves out to different databases or storage systems (with consumers or Kafka Connect).
But where do you run the actual pipeline processing code to enrich and clean the data? Should it be part of the producers or the consumers? I think I missed something.
In a data-pipeline use case, Kafka clients can serve as both consumers and producers.
For example, if you have raw data being streamed into ClientA, where it is cleaned before being passed to ClientB for enrichment, then ClientA is serving as both a consumer (listening to a topic for raw data) and a producer (publishing cleaned data to a topic).
Where you draw those boundaries is a separate question.
It can be part of either the producer or the consumer.
Or you could set up an environment dedicated to something like Kafka Streams processes or a KSQL cluster.
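For instance, a minimal Kafka Streams sketch of the clean-and-enrich step (the topic names and the transformations are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class CleanAndEnrichApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "clean-and-enrich");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw =
                builder.stream("raw-data", Consumed.with(Serdes.String(), Serdes.String()));

        raw.filter((key, value) -> value != null && !value.isBlank())    // "clean"
           .mapValues(value -> value.trim().toLowerCase() + "|enriched") // "enrich" (placeholder)
           .to("clean-data", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```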
It is possible either way. Consider all the options and choose the one that suits you best. Let's assume you have a source of raw data, in CSV or some DB (Oracle), and you want to do your ETL work and load it back into some different datastores.
1) Use Kafka Connect to produce your data to Kafka topics.
Have a consumer which consumes from these topics (it could be Kafka Streams, KSQL, Akka, or Spark).
Produce back to a Kafka topic for further use, or to some datastore (any sink, basically).
This has the benefit of ingesting your data with little or no code, as it is easy to set up Kafka Connect source connectors.
2) Write custom producers and do your transformations in the producers before writing to a Kafka topic, or write directly to a sink unless you want to reuse the produced data for some further processing (a rough sketch follows below).
Then read from the Kafka topic, do some further processing, and write the results back to a persistent store.
It all boils down to your design choices, the throughput you need from the system, and how complicated your data structures are.
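As a rough illustration of option 2, here is a custom producer doing a trivial transformation before writing to a topic (the file path, topic name, and transformation are placeholders):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import java.util.stream.Stream;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CsvToKafkaProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Stream<String> lines = Files.lines(Path.of("/tmp/source.csv"))) {
            lines.skip(1)                                     // drop the CSV header row
                 .map(line -> line.trim().replace(";", ",")) // trivial "transformation" placeholder
                 .forEach(value ->
                         producer.send(new ProducerRecord<>("etl-input", value)));
            producer.flush();
        }
    }
}
```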

Kafka Streams use case

I am building a simple application which does the following, in order:
1) Reads messages from a remote IBM MQ(legacy system only works with IBM MQ)
2) Writes these messages to Kafka Topic
3) Reads these messages from the same Kafka Topic and calls a REST API.
4) There could be other consumers reading from this topic in future.
I came to know that Kafka has the new Streams API, which is supposed to be better than the Kafka consumer in terms of speed/simplicity etc. Can someone please let me know if the Streams API is a good fit for my use case, and at what point in my process I can plug it in?
It is true that the Kafka Streams API has a simpler way to consume records compared to the Kafka Consumer API (e.g. you don't need to poll, or manage a thread and loop), but it also comes with a cost (e.g. a local data store, if you do stateful processing).
I would say that if you need to consume records one by one and call a REST API, use the Consumer API; if you need stateful processing, to query the topic state, etc., use the Streams API.
For more info take a look at this blog post: https://balamaci.ro/kafka-streams-for-stream-processing/
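A bare-bones sketch of the Consumer API route (the topic name and REST endpoint are made up):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RestCallingConsumer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "rest-forwarder");
        props.put("enable.auto.commit", "false"); // commit only after the REST calls succeeded
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        HttpClient http = HttpClient.newHttpClient();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("mq-messages"));
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    HttpRequest request = HttpRequest
                            .newBuilder(URI.create("http://example.com/api/events"))
                            .POST(HttpRequest.BodyPublishers.ofString(rec.value()))
                            .build();
                    http.send(request, HttpResponse.BodyHandlers.discarding());
                }
                consumer.commitSync();
            }
        }
    }
}
```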
Reads messages from a remote IBM MQ (legacy system only works with IBM MQ)
Writes these messages to Kafka Topic
I'd use Kafka Connect for (1) and (2). It is part of the Kafka project, and there are many free as well as commercial "connectors" available for hundreds of systems.
Reads these messages from the same Kafka Topic and calls a REST API.
You can use Kafka Streams as well as the lower-level Consumer API of Kafka, depending on what you prefer. I'd go with Kafka Streams as it is easier to use and far more powerful. (Both are part of the Kafka project.)
There could be other consumers reading from this topic in future.
This works out-of-the-box -- once data is stored in a Kafka topic according to step 2, many different applications and "consumers" can read this data independently.
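If you go with Kafka Streams for step 3, a rough sketch could look like this (again, the topic name and endpoint are only placeholders):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;

public class TopicToRestStreamsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic-to-rest");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        HttpClient http = HttpClient.newHttpClient();
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("mq-messages", Consumed.with(Serdes.String(), Serdes.String()))
               .foreach((key, value) -> {
                   try {
                       HttpRequest request = HttpRequest
                               .newBuilder(URI.create("http://example.com/api/events"))
                               .POST(HttpRequest.BodyPublishers.ofString(value))
                               .build();
                       http.send(request, HttpResponse.BodyHandlers.discarding());
                   } catch (Exception e) {
                       throw new RuntimeException(e); // error handling is left out of this sketch
                   }
               });

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```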
Looks like you are not doing any processing/transformation once you consume your message from IBM MQ, or even after it lands in your Kafka topic.
First -> from IBM MQ to your Kafka topic is kind of a pipeline, and
Second -> you are just calling the REST API (I assume without any processing).
Considering these facts, it seems to be a good fit for using a simple consumer.
Let's not use a technology only because it's there :)