Implement Kafka Streams Processor in .Net? - apache-kafka

Is that possible?
The official .Net client confluent-kafka-dotnet only seems to provide consumer and producer functionality.
And (from what I remember looking into Kafka streams quite a while back) I believe Kafka Streams processors always run on the JVMs that run Kafka itself. In that case, it would be principally impossible.

Yes, it is possible to re-implement Apache Kafka's Streams client library (a Java library) in .NET. But at the moment there doesn't exist such a ready-to-use Kafka Streams implementation for .NET.
And (from what I remember looking into Kafka streams quite a while back) I believe Kafka Streams processors always run on the JVMs that run Kafka itself. In that case, it would be principally impossible.
No, Kafka Streams "processors" as you call them do not run in (the JVMs of) the Kafka brokers, which would be server-side.
Instead, the Kafka Streams client library is used to implement client-side Java/Scala/Clojure/... applications for stream processing. These applications talk to the Kafka brokers (which form the Kafka cluster) over the network.

May 2020 there seems to be a project in the making to support Kafka Streams in .NET:
https://github.com/LGouellec/kafka-stream-net
As per their road-map they are now in early beta and intend to get to v1 but the end of the year or beginning of next

Related

Kafka Streams without Sink

I'm currently planning the architecture for an application that reads from a Kafka topic and after some conversion puts data to RabbitMq.
I'm kind new for Kafka Streams and they look a good choice for my task. But the problem is that Kafka server is hosted at another vendor's place, so I can't even install Cafka Connector to RabbitMq Sink plugin.
Is it possible to write Kafka steam application that doesn't have any Sink points, but just processes input stream? I can just push to RabbitMQ in foreach operations, but I'm not sure will Stream even work without a sink point.
foreach is a Sink action, so to answer your question directly, no.
However, Kafka Streams should be limited to only Kafka Communication.
Kafka Connect can be installed and ran anywhere, if that is what you wanted to use... You can also use other Apache tools like Camel, Spark, NiFi, Flink, etc to write to RabbitMQ after consuming from Kafka, or write any application in a language of your choice. For example, the Spring Integration or Cloud Streams frameworks allows a single contract between many communication channels

Kafka Producer vs Kafka Connector

I need to push data from database(say Oracle DB) to a kafka topic by calling a stored procedure.I need to do some validations too on message
Should i use a Producer API or Kafka Connector.
You should use Kafka Connect. Either that, or use the producer API and write some code that handles:
Scale-out
Failover
Restarts
Schemas
Serialisation
Transformations
and that integrates with hundreds of other technologies in a standardised manner for ease of portability and management.
At which point, you'll have reinvented Kafka Connect ;-)
To learn more about Kafka Connect see this talk. To learn about the specifics of database->Kafka ingestion see this talk and this blog.

Apache Flink State Store vs Kafka Streams

As far as I know handles Kafka Streams its States localy in memory or on disc or in a Kafka topic because all the input date is from a partition, where all the messages are keyed by a defined value. Most of the time the computations can be done without knowing the state of other Processors. If so, you have another Streams instance whichs calculsates the result. Like in this picture:
Where exactly does Flink store its States? Can Flink also store the states locally or does it always publish them always to all instances (tasks)? Is it possible to configure Flink so that it stores the States in a Kafka Broker?
Flink also uses local stores (that can be keyed), similar to Kafka Streams. However, it does not write state into Kafka topics.
For fault-tolerance, it takes so-called "distributed snapshots", that are stored in a configurable state backend (eg, HDFS).
Check out the docs for more details:
https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html
https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/state/checkpointing.html
https://ci.apache.org/projects/flink/flink-docs-stable/internals/stream_checkpointing.html
https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/state/state_backends.html
There is a distinction between Flink and Kafka Streams. Flink is cluster framework, your code is deployed and runs as job in Flink Cluster. Kafka streams is API that you embed in your standard java application. Stream processing logic runs inside the your application java process. They both can sink results to Kafka, key value store, database or external systems. Flinkā€™s master node implements its own high availability mechanism based on ZooKeeper and ensures the availability interim states after the disaster. If you are using Kafka Streams once you managed to save your interim states to Kafka Cluster you will have the same HA features provided by Kafka Cluster.

Apache Beam over Apache Kafka Stream processing

What are the differences between Apache Beam and Apache Kafka with respect to Stream processing?
I am trying to grasp the technical and programmatic differences as well.
Please help me understand by reporting from your experience.
Beam is an API that uses an underlying stream processing engine like Flink, Storm, etc... in one unified way.
Kafka is mainly an integration platform that offers a messaging system based on topics that standalone applications use to communicate with each other.
On top of this messaging system (and the Producer/Consummer API), Kafka offers an API to perform stream processing using messages as data and topics as input or output. Kafka Stream processing applications are standalone Java applications and act as regular Kafka Consummer and Producer (this is important to understand how these applications are managed and how workload is shared among stream processing application instances).
Shortly said, Kafka Stream processing applications are standalone Java applications that run outside the Kafka Cluster, feed from the Kafka Cluster and export results to the Kafka Cluster. With other stream processing platforms, stream processing applications run inside the cluster engine (and are managed by this engine), feed from somewhere else and export results to somewhere else.
One big difference between Kafka and Beam Stream API is that Beam makes the difference between bounded and unbounded data inside the data stream whereas Kafka does not make that difference. Thereby, handling bounded data with Kafka API has to be done manually using timed/sessionized windows to gather data.
Beam is a programming API but not a system or library you can use. There are multiple Beam runners available that implement the Beam API.
Kafka is a stream processing platform and ships with Kafka Streams (aka Streams API), a Java stream processing library that is build to read data from Kafka topics and write results back to Kafka topics.

Apache Camel vs Apache Kafka [duplicate]

This question already has answers here:
Difference Between Apache Kafka and Camel (Broker vs Integration)
(4 answers)
Closed 5 years ago.
As far as I know, Apache Kafka is asynchronous messaging platform, where as Apache Camel is a platform implementing the enterprise integration patterns.
So, what are the practical differences of Apache Camel and Apache Kafka? We planned to implement the system with Apache Camel, which is relatively easy, but our customer wanted the Apache Kafka instead without rational.
What would be the advantages of choosing Apache Kafka to implement a message queue functionality, which could be implemented with Apache Camel as well? I'm concerned Kafka would just introduce unnecessary overhead to project. Are we comparing apples and oranges?
What we need is straightforward API's to setup and use clustered message queues. Our initial plan was to use Camel to consume/produce on clustered JMS or ActiveMQ queues. How would Kafka make this task easier? The application itself would run on WebLogic server on either case.
The messaging would be point-to-point type, where there are multiple instances of same service running, but only one instance should process the message and emit the result according to load balancing policy. Message queues are also clustered, so neither failure of service instance or queue instance is SPOF.
Camel and Kafka are totally different things. In many use cases, camel is just used as a client of kafka/activemq/... .
Kafka and activemq are similar, but also different things, refer What is the difference between Apache kafka vs ActiveMQ. Kafka has higher throughput, and data always on disk, so a little more reliable than activemq.
Kafka is usually used as real-time data streaming, and in general activemq is mainly used for integration between applications, the book says so. But in most real world cases ,kafka and activemq can replace each other easily.
It is very hard to compare those two. They are not covering the same areas of work, but exist some systems, where you can replace one by the other.
So very shortly.
Kafka is messaging platform with streaming ability to process messages Apache Kafka.
Camel is ETL framework it can transform messages/events/data from "any" (see endpoint list by Camel) input point and send it to "any" output Apache Camel - Enterprise Integration Patterns.
You may use Camel without Kafka at all, and vice versa. But there are of course possibilities to use succesfully both together.
Case 1. You process mail and store in PostgreSQL DB. Kafka is useless
here.
Case 2. You process messages from ActiveMQ and send them to
Kafka. You may use both.