Spring Cloud Stream vs Kafka Stream for Exactly Once feature - apache-kafka

I was not able to find, here or on the Spring website and blogs, whether Spring Cloud Stream can provide the "exactly once" semantics offered by the Kafka Streams API.
Maybe there is no single configuration/annotation for it. In the thread "Is it possible to get exactly once processing with Spring Cloud Stream?" I can find something related, but the answer there is very high-level.
Thanks for the help.

Spring Cloud Stream does not do anything in particular regarding processing guarantees. You can delegate that to Kafka Streams by setting the processing.guarantee property to exactly_once. See this for more details. When using the Spring Cloud Stream Kafka Streams binder, you can provide this as a property to the Spring Boot application as below:
spring.cloud.stream.kafka.streams.binder.configuration.processing.guarantee=exactly_once
Keep in mind that Kafka Streams' exactly-once guarantee only applies if you are writing the results back to Kafka.
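For reference, here's a minimal sketch of what the same setting looks like when configuring a plain Kafka Streams application directly; the application id and broker address are assumptions, not anything from the question:

    import java.util.Properties;

    import org.apache.kafka.streams.StreamsConfig;

    public class ExactlyOnceConfig {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Made-up application id and broker address.
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-eos-app");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            // The same "processing.guarantee" setting the binder property above maps to.
            props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once");
            // ...pass props to new KafkaStreams(topology, props) as usual.
        }
    }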

Related

Apache Kafka vs. HTTP API

I am new to Apache Kafka and its streaming services and APIs, but was wondering if there is any formal documentation on where to obtain the initial raw data required for ingestion.
In essence, I want to try my hand at building a rudimentary crypto trading bot, but was under the impression that HTTP APIs may have more latency than APIs that integrate with Kafka Streams. For example, I know RapidAPI has a library of HTTP APIs that could help pull data, but I was unsure if there is something similar if I want the data to be ingested through Kafka Streams. I am under the impression that the two data sources will differ in some way, but am also unsure whether that is actually the case.
I tried digging around on Google, but it's not very clear what APIs or source data Kafka Streams takes, or whether they are the same/similar and just handled differently.
If you have any insight or documentation, that would be greatly appreciated. Also, feel free to let me know if my understanding is completely off.
any formal documentation on where to obtain the initial raw data required for ingestion?
Kafka accepts binary data. You can feed in serialized data from anywhere (although, you are restricted by (configurable) message size limits).
APIs that integrate with Kafka Streams
Kafka Streams is an intra-cluster library; it doesn't integrate with anything but Kafka.
If you want to periodically poll/fetch an HTTP/1 API, then you would use a regular HTTP client and a Kafka Producer (see the sketch below).
The answer is probably similar for streaming HTTP/2 or websockets; you still wouldn't be able to use Kafka Streams, and you'd have to deal with batching records into Kafka Producer requests.
Instead, you should look for Kafka Connect projects on the web that work with HTTP, or opt for something like Apache NiFi as a broader project with lots of different "processors" such as GetHTTP and PublishKafka.
Once the data is in Kafka, you are welcome to use Kafka Streams/KSQL to do some processing.
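To illustrate the poll-and-produce approach mentioned above, here's a minimal sketch; the endpoint URL, topic name, broker address, and poll interval are all placeholders:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class HttpToKafkaPoller {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            HttpClient http = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.com/api/prices")) // hypothetical endpoint
                    .build();

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                while (true) {
                    // Poll the HTTP API and forward the raw response body to a Kafka topic.
                    HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
                    producer.send(new ProducerRecord<>("raw-prices", response.body()));
                    Thread.sleep(5_000); // arbitrary poll interval
                }
            }
        }
    }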

Kafka to MongoDB using Spring Cloud Dataflow

I'm working on a project where I have to process data coming from a Kafka cluster and send it to MongoDB. The application should be deployable on Pivotal Cloud Foundry. After doing some research on the internet, I found the toolkit Spring Cloud Data Flow to be interesting since it can be deployed on PCF. I'm wondering how we can use it to create our real-time streaming pipeline. For the moment, I'm thinking about using Kafka Streams and Spring Cloud Stream to process and transform the streams of topics, but I don't know how to integrate that into SCDF, nor how we can send those streams to MongoDB. I'm sorry if my question is not clear; I'm entirely new to those frameworks.
Thanks in advance
You could use the named-destination support in SCDF to directly consume events from Kafka or any other Spring Cloud Stream supported message broker implementations.
Now, for the write portion, you can use the out-of-the-box MongoDB-sink application that we build, maintain, and ship.
If you have to do some processing before you write to MongoDB, you can create a custom Spring Cloud Stream application with the desired binder implementation [see: dev-guide/docs].
To put this all together, assume you have events coming from a Kafka topic named Customers, a custom processor doing some transformation on each of the received payloads (let's call the processor CustomerTransformer), and finally the MongoDB sink writing the results out.
Here's how this streaming data pipeline use-case could be designed from SCDF's Dashboard:
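If it helps, below is a minimal sketch of what such a custom processor application could look like with Spring Cloud Stream's functional programming model; the class name, function name, and the trivial transformation are placeholders. In SCDF the stream definition would then wire the named destination, this processor, and the MongoDB sink together (something along the lines of :Customers > customer-transformer | mongodb, adjusted to your registered app names).

    import java.util.function.Function;

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.context.annotation.Bean;

    @SpringBootApplication
    public class CustomerTransformerApplication {

        public static void main(String[] args) {
            SpringApplication.run(CustomerTransformerApplication.class, args);
        }

        // Placeholder transformation: the binder routes the configured input
        // destination to this function and sends its return value to the output.
        @Bean
        public Function<String, String> customerTransformer() {
            return payload -> payload.toUpperCase();
        }
    }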

Apache Kafka StateStore

I am learning Apache Kafka (as a messaging system) and in that process came to know of the term StateStore, link here.
I am also aware of Apache Kafka Streams, the client API.
Is StateStore applicable to Apache Kafka in the context of messaging systems, or is it applicable to Apache Kafka Streams?
Does Apache have their own implementation of StateStore, or do they use a third-party implementation (for example, RocksDB)?
Can anyone help me understand this?
Adding an overview to the good, concise explanation about StateStore in the context of Kafka Streams, to address your question.
Kafka Broker in a nutshell
In a messaging context your work simplified would be:
Publishing state (producing messages)
Saving messages for a period of time for later consumption (retention time)
Consuming state (getting the messages)
And in a nutshell, #2 plus fault tolerance and keeping track of your consumer groups' read positions (offsets) is what a Kafka broker does for you.
Kafka client API's
Apart from that Kafka provides client libraries for your common patterns of working with messages:
Producer - Publish messages to Kafka topics
Consumer - Subscribe to Kafka topics
Connect - Create reliable integrations with external stores such as various DBMS.
Streams - DSL and utilities aimed to simplify development of common streaming application patterns.
Admin - Programmatically manage / monitor Kafka resources.
Kafka Streams State Stores
I'll quote the great explanation from the Streams Architecture docs (I highly recommend the Kafka docs, as they are very well written and suitable for any level of experience).
Kafka Streams provides so-called state stores, which can be used by stream processing applications to store and query data, which is an important capability when implementing stateful operations. The Kafka Streams DSL, for example, automatically creates and manages such state stores when you are calling stateful operators such as join() or aggregate(), or when you are windowing a stream.
As you can see, the StateStore is used as a helper for extending the built-in abilities from a single-message processing context to multi-message processing, thus enabling more complex functions over a group of messages (all the messages received in a time window, aggregation functions over several messages, etc.).
I'll add to that that RocksDB is the default implementation used by Kafka Streams, and it can be changed, as mentioned in the other answer.
Also, if you want to explore more, here is a link to the great intro videos from Apache Kafka's official docs:
Streams API intro videos
Have an awesome learning experience!
StateStore is applicable to the Kafka Streams context.
Some processors, like reduce or aggregate, are stateful operations.
Kafka Streams uses state stores to manage this. By default it uses RocksDB, but it is customizable.
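As a small illustration of where a state store shows up in practice, here is a minimal Kafka Streams sketch (topic names, store name, application id, and broker address are made up) in which count() keeps its running totals in a RocksDB-backed store:

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Materialized;

    public class CountsWithStateStore {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "counts-app");        // made-up id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
                   .groupByKey()                              // stateful step
                   .count(Materialized.as("counts-store"));   // RocksDB-backed state store by default

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }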

Exactly-once semantics in spring Kafka

I need to apply transactions in a system that comprises of below components:
A Kafka producer: this is some external application which publishes messages on a Kafka topic.
A Kafka consumer: this is a Spring Boot application where I have configured the Kafka listener, and after processing the message, it needs to be saved to a NoSQL database.
I have gone through several blogs like this & this, and all of them talk about transactions in the context of streaming applications, where the messages are read, processed, and written back to a Kafka topic.
I don't see any clear example or blog about achieving transactionality in a use case similar to mine, i.e. producing, processing, and writing to a DB in a single atomic transaction. I believe it to be a very common scenario, and there must be some support for it as well.
Can someone please guide me on how to achieve this? Any relevant code snippet would be greatly appreciated.
in a single atomic transaction.
There is no way to do it; Kafka doesn't support XA transactions (nor do most NoSQL DBs). You can use Spring's transaction synchronization for best-effort 1PC.
See the documentation.
Spring for Apache Kafka implements normal Spring transaction synchronization.
It provides "best efforts 1PC" - see Distributed transactions in Spring, with and without XA for more understanding and the limitations.
I'm guessing you're trying to solve the scenario where your consumer goes down after writing to the database but before committing the offsets, or other similar problems. Unfortunately this means you have to build your own fault-tolerance.
In the case of the problem I mentioned above, this means you would have to manage the consumer offsets in your end-output database, updating them in the same database transaction that you're writing the output of your consumer application to.
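For what it's worth, here is a minimal sketch of the simpler write-then-acknowledge ordering; it assumes the listener container is configured with manual acknowledgments (AckMode.MANUAL), and the Order/OrderRepository types are hypothetical stand-ins for your real NoSQL access layer:

    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.kafka.support.Acknowledgment;
    import org.springframework.stereotype.Component;

    @Component
    public class OrderListener {

        private final OrderRepository repository; // hypothetical NoSQL access layer

        public OrderListener(OrderRepository repository) {
            this.repository = repository;
        }

        // Assumes the container factory is configured with AckMode.MANUAL.
        @KafkaListener(topics = "orders", groupId = "order-consumer") // made-up names
        public void onMessage(String payload, Acknowledgment ack) {
            // Persist first; acknowledge the offset only after the write succeeds.
            // If the app crashes in between, the record is redelivered, so the
            // write should be idempotent (e.g. a keyed upsert).
            repository.save(new Order(payload));
            ack.acknowledge();
        }
    }

    // Hypothetical stand-ins for your real document type and repository.
    record Order(String payload) {}
    interface OrderRepository { void save(Order order); }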

Kafka Stream API vs Consumer API

I need to read from a specific Kafka topic, do a VERY short processing on the message and pass it on to a different Kafka cluster.
Currently, I'm using a consumer that's also a producer on the other Kafka cluster.
However, the Streams API supposedly offers a more lightweight, high-throughput option.
So the questions are:
Assuming my processing code doesn't require much horse power, is the streaming API better?
Does the Streams API support writing to a different Kafka cluster?
What are the Streaming API cons comparing to the Consumer API?
Unfortunately, Kafka Streams doesn't currently support writing to a different Kafka cluster.
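So the consumer-plus-producer approach you're already using remains the way to go. Here's a minimal sketch of it, with the two clients pointed at different clusters (broker addresses, topic names, and the trivial "processing" step are all placeholders):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class CrossClusterBridge {
        public static void main(String[] args) {
            // Consumer reads from the source cluster (address is an assumption).
            Properties consumerProps = new Properties();
            consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "source-cluster:9092");
            consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "bridge");
            consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            // Producer writes to the target cluster.
            Properties producerProps = new Properties();
            producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "target-cluster:9092");
            producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                consumer.subscribe(List.of("source-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        String transformed = record.value().trim(); // placeholder for the short processing step
                        producer.send(new ProducerRecord<>("target-topic", record.key(), transformed));
                    }
                }
            }
        }
    }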