Exactly-once semantics in spring Kafka - apache-kafka

I need to apply transactions in a system that comprises of below components:
A Kafka producer, this is some external application which would publish messages on a kafka topic.
A Kafka consumer, this is a spring boot application where I have configured the kafka listener and after processing the message, it needs to be saved to a NoSQL database.
I have gone through several blogs like this & this, and all of them talks about the transactions in context of streaming application, where the messages would be read-processed-written back to a Kafka topic.
I don't see any clear example or blog around achieving transactionality in the use case similar to mine i.e. producing-processing-writing to a DB in a single atomic transaction. I believe it to be very common scenario & there must be some support for it as well.
Can someone please guide me on how to achieve this? Any relevant code snippet would be greatly appreciated.

in a single atomic transaction.
There is no way to do it; Kafka doesn't support XA transactions (nor do most NoSQL DBs). You can use Spring's transaction synchronization for best-effort 1PC.
See the documentation.
Spring for Apache Kafka implements normal Spring transaction synchronization.
It provides "best efforts 1PC" - see Distributed transactions in Spring, with and without XA for more understanding and the limitations.

I'm guessing you're trying to solve the scenario where your consumer goes down after writing to the database but before committing the offsets, or other similar problems. Unfortunately this means you have to build your own fault-tolerance.
In the case of the problem I mentioned above, this means you would have to manage the consumer offsets in your end-output database, updating them in the same database transaction that you're writing the output of your consumer application to.

Related

Keeping services in sync in a kafka event driven backbone

Say I am using Kafka as the event-driven backbone for all my microservices in my system design. Many microservices use the events data to populate their internal databases.
Now there is a requirement where I need to create a new service and it uses some events data. The service will only be able to consume events after the time it comes live and hence, won't have a lot of data that it missed. I want a strategy such that I don't have to backfill my internal databases by writing out scripts.
What are some cool strategies I can have which do not create a huge load on Kafka & does not account for a lot of scripting to backfill data in the new services that I ever create?
There are a few strategies you can have here, depending on how you publish data to a kafka topic. Here are a few ideas:
first, you can set the retention of a kafka topic to be forever, meaning that it will store all the data. This is OK as kafka is built for this purpose as well. See this. By doing this, any new service that come alive can start consuming data from the start.
if you are using kafka for latest state publishing for a given entity/aggregate, you can also consider configuring the topic to be a compacted. This will let you store at least the latest state of your entity/aggregate on the topic, and new consumers that starts listening on the topic will have less data to configure. However, your consumers still need to know how to process multiple messages per entity/aggregate as you cannot guarantee it will have exactly one message in the topic.

Apache Kafka StateStore

I am learning Apache Kafka (as a messaging system) and in that process came to know of term StateStore , link here
I am also aware of Apache kafka streams, the client API.
Is StateStore applicable for Apache kafka in the context of messaging systems or it is applicable to Apache Kafka Streams.
Does Apache have their "own" implementation of StateStore or use third party implementation (for example, rockdsb.
Can anyone help me understand this.
Adding an overview to the good concise explanation about StateStore in the context of Kafka Streams and your question.
Kafka Broker in a nutshell
In a messaging context your work simplified would be:
Publishing state (producing messages)
Saving messages for a period of time for later consumption (retention time)
Consuming state (getting the messages)
And in a nutshell #2 plus fault tolerance and keeping track of the position of your consumer groups' reads (offsets) is what a Kafka broker does for you.
Kafka client API's
Apart from that Kafka provides client libraries for your common patterns of working with messages:
Producer - Publish messages to Kafka topics
Consumer - Subscribe to Kafka topics
Connect - Create reliable integrations with external stores such as various DBMS.
Streams - DSL and utilities aimed to simplify development of common streaming application patterns.
Admin - Programmatically manage / monitor Kafka resources.
Kafka Streams State Stores
I'll quote the great explanation from the Streams Architecture docs (I highly recommend Kafka docs as they are built very good and for any level of experience).
Kafka Streams provides so-called state stores, which can be used by stream processing applications to store and query data, which is an important capability when implementing stateful operations. The Kafka Streams DSL, for example, automatically creates and manages such state stores when you are calling stateful operators such as join() or aggregate(), or when you are windowing a stream.
As you can see the StateStore is used as a helper for extending the built-in abilities from a single message processing context to multi-message processing, thus enabling more complex functions over a bunch of messages (all the messages passed in a time window, aggregation functions over several messages, etc.)
I'll add to that that RocksDB is the default implementation used by Kafka and can be changed as was mentioned in previous answer.
Also if you want to explore more here is a link to the great intro videos form Apache Kafka's official docs:
Streams API intro videos
Have an awesome learning experience!
StateStore is applicable to kafka streams context.
Some processors like reduce or aggregate are stateful operations.
Kafka streams use state stores to manage this. By default, it uses rocksDB, but it is customizable.

Kafka user - project design advise

I am new to Kafka and data streaming and need some advice for the following requirement,
Our system is expecting close to 1 million incoming messages per day. The message carries a project identifier. The message should be pushed to users of only that project. For our case, lets say we have projects A, B and C. Users who opens project A's dashboard only sees / receives messages of project A.
This is my idea so far on implementing solution for the requirement,
The messages should be pushed to a Kafka Topic as they arrive, lets call this topic as Root Topic. The messages once pushed to the Root Topic, can be read by a Kafka Consumer/Listener and based on the project identifier in the message can push that message to a project specific Topic. So any message can end up at Topic A or B or C. Thinking of using websockets to update the message as they arrive on the project users' dashboards. There will be N Consumers/Listeners for the N project Topics. These consumers will push the project specific message to the project specifc websocket endpoints.
Please advise if I can make any improvements to the above design.
Chose Kafka as the messaging system here as it is highly scalable and fault tolerant.
There is no complex transformation or data enrichment before it gets sent to the client. Will it makes sense to use Apache Flink or Hazelcast Jet for the streaming or Kafka streaming is good enough for this simple requirement.
Also, when should I consider using Hazelcast Jet or Apache Flink in my project.
Should i use Flink say when I have to update few properties in the message based on a web service call or database lookup before sending it to the users?
Should I use Hazelcast Jet only when I need the entire dataset in memory to arrive at a property value? or will using Jet bring some benefits even for my simple use case specified above. Please advise.
Kafka Streams are a great tool to convert one Kafka topic to another Kafka topic.
What you need is a tool to move data from a Kafka topic to another system via web sockets.
Stream processor gives you a convenient tooling to build this data pipeline (among others connectors to Kafka and web sockets and scalable, fault-tolerant execution environment). So you might want use stream processor even if you don't transform the data.
The benefit of Hazelcast Jet is it's embedded scalable caching layer. You might want to cache your database/web service calls so that the enrichment is performed locally, reducing remote service calls.
See how to use Jet to read from Kafka and how to write data to a TCP socket (not websocket).
I would like to give you another option. I'm not Spark/Jet expert at all, but I've studying them for a few weeks.
I would use Pentaho Data Integration(kettle) to consume from the Kafka and I would write a kettle step (or User Defined Java Class step) to write the messages to a Hazelcast IMAP.
Then, would use this approach http://www.c2b2.co.uk/middleware-blog/hazelcast-websockets.php to provided the Websockets for the end-users.

Kafka connect or Kafka Client

I need to fetch messages from Kafka topics and notify other systems via HTTP based APIs. That is, get message from topic, map to the 3rd party APIs and invoke them. I intend to write a Kafka Sink Connector for this.
For this use case, is Kafka Connect the right choice or I should go with Kafka Client.
Kafka clients when you have full control on your code and you are expert developer, you want to connect an application to Kafka and can modify the code of the application.
push data into Kafka
pull data from Kafka.
https://cwiki.apache.org/confluence/display/KAFKA/Clients
Kafka Connect when you don’t have control on third party code new in Kafka and to you have to connect Kafka to datastores that you can’t modify code.
Kafka Connect’s scope is narrow: it focuses only on copying streaming data to and from Kafka and does not handle other tasks.
http://docs.confluent.io/2.0.0/connect/
I am adding few lines form other blogs to explain differences
Companies that want to adopt Kafka write a bunch of code to publish their data streams. What we’ve learned from experience is that doing this correctly is more involved than it seems. In particular, there are a set of problems that every connector has to solve:
• Schema management: The ability of the data pipeline to carry schema information where it is available. In the absence of this capability, you end up having to recreate it downstream. Furthermore, if there are multiple consumers for the same data, then each consumer has to recreate it. We will cover the various nuances of schema management for data pipelines in a future blog post.
• Fault tolerance: Run several instances of a process and be resilient to failures
• Parallelism: Horizontally scale to handle large scale datasets
• Latency: Ingest, transport and process data in real-time, thereby moving away from once-a-day data dumps.
• Delivery semantics: Provide strong guarantees when machines fail or processes crash
• Operations and monitoring: Monitor the health and progress of every data integration process in a consistent manner
These are really hard problems in their own right, it just isn’t feasible to solve them separately in each connector. Instead you want a single infrastructure platform connectors can build on that solves these problems in a consistent way.
Until recently, adopting Kafka for data integration required significant developer expertise; developing a Kafka connector required building on the client APIs.
https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/
Kafka Connect will work well for this purpose, but this would also be a pretty straightforward consumer application as well because consumers also have the benefits of fault tolerance/scalability and in this case you're probably just doing simple message-at-a-time processing within each consumer instance. You can also easily use enable.auto.commit for this application, so you will not encounter the tricky parts of using the consumer directly. The main thing using Kafka Connect would give you compared to using the consumer in this case would be that the connector could be made generic for different input formats, but that may not be important to you for a custom connector.
you should use kafka connect sink when you are using kafka connect source for producing messages to a specific topic.
for e.g. when you are using file-source then you should use file-sink to consume what source have been produced. or when you are using jdbc-source you should use jdbc-sink to consume what you have produced.
because the schema of the producer and sink consumer should be compatible then you should use compatible source and sink in both sides.
if in some cases the schemas are not compatible you can use SMT (Simple message transform) capability that is added since version 10.2 of kafka onward and you will be able to write message transformers to transfer message between incompatible producers and consumers.
Note: if you want to transfer messages faster I suggest that you use avro and schema registry to transfer message more efficiently.
If you can code with java you can use java kafka stream, Spring-Kafka project or stream processing to achieve what you desire.
In the book that is called Kafka In Actionis explained like following:
The purpose of Kafka Connect is to help move data in or out of Kafka without having to deal with writing our own producers and clients. Connect is a framework that is already part of Kafka that really can make it simple to use pieces that have been already been built to start your streaming journey.
As for your problem, Firstly, one of the simpliest questions that one should ask is if you can modify the application code of the systems from which you need data interaction.
Secondly, If you would write custom connector which have the in-depth knowledge the ability and this connector will be used by others, it worth it. Because it may help others that may not be the experts in those systems. Otherwise, this kafka connector is used only by yourself, I think you should write Kafka connector. So you can get more flexibility and can write more easily implementing.

How to model topics and partitions for Kafka when used to store all business events?

We're considering using Kafka as a way to store all our business events forever. The purpose is to be able to spin up new "microservices" that we haven't yet thought of that will be able to leverage on all previous events to build up their projections/state. Another use case might be an existing service where we'd like to "replay" all events that is of interest to this service to recreate its state.
Note that we're not planning to use Kafka as an "event store" in the sense that events will be projected/loaded into an aggregate on "every request".
Also (as far as I can tell) we don't know how consumers will consume the events. A new microservice might need all sorts of different events in order to create its internal projection/state.
Is Kafka suitable for this or is there a better alternative?
If so, what's a good way to model this (topics/partitions)?
We're currently using RabbitMQ for messaging (business events are sent to RabbitMQ). It would be great if we could migrate away from RabbitMQ in the future and move entirely to Kafka. I assume that this could change the way topics and partitions are modelled since now we have a better understanding of how consumers will consume the events. Would this be compatible with the other use case (infinite retention and replay)?
This is very good that you are switching to KAFKA and Yes it is possible to keep data in KAFKA BROKERs but i would suggest rather than keeping all the data in KAFKA-BROKERs for all time why can't you dump this data into HDFS or S3(AWS) it will be cheaper and you will have all the features of HDFS available with your data.
Storing all data in Brokers will increase overhead on Zookeeper as well.