What is the difference between Apache Kafka and Kafka Streams on Spring Cloud Stream?

On the Spring Cloud Stream website (https://spring.io/projects/spring-cloud-stream), the available binder options are listed, and among them are the Apache Kafka and Kafka Streams binders.
What's the difference between them?
For what purpose should we choose one over the other?

The Apache Kafka binder is used for basic Kafka client usage via the consumer/producer API.
The Kafka Streams binder is built on top of the base Apache Kafka binder and adds the ability to use the Kafka Streams API.
The Kafka Streams API is a lightweight client library that gives you the functionality to manipulate data from one or more Kafka topics into other Kafka topics, allowing you to transform, enhance, filter, join, aggregate, and more.
The Apache Kafka Binder implementation maps each destination to an Apache Kafka topic. The consumer group maps directly to the same Apache Kafka concept. Partitioning also maps directly to Apache Kafka partitions.
The binder currently uses the Apache Kafka kafka-clients version 2.3.1. This client can communicate with older brokers (see the Kafka documentation), but certain features may not be available. For example, with versions earlier than 0.11.x.x, native headers are not supported. Also, 0.11.x.x does not support the autoAddPartitions property.
https://docs.spring.io/spring-cloud-stream-binder-kafka/docs/3.1.3/reference/html/spring-cloud-stream-binder-kafka.html#_apache_kafka_binder
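To make the distinction concrete, here is a minimal sketch of a plain Kafka binder application using the Spring Cloud Stream functional model (the bean name and the transformation are made up for illustration):

    import java.util.function.Function;

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.context.annotation.Bean;

    @SpringBootApplication
    public class UppercaseApplication {

        public static void main(String[] args) {
            SpringApplication.run(UppercaseApplication.class, args);
        }

        // Each record is consumed, transformed, and produced individually;
        // under the covers this uses the plain Kafka consumer/producer API.
        @Bean
        public Function<String, String> uppercase() {
            return String::toUpperCase;
        }
    }

The input and output topics would be bound through properties such as spring.cloud.stream.bindings.uppercase-in-0.destination and spring.cloud.stream.bindings.uppercase-out-0.destination.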
Spring Cloud Stream includes a binder implementation designed explicitly for Apache Kafka Streams binding. With this native integration, a Spring Cloud Stream "processor" application can directly use the Apache Kafka Streams APIs in the core business logic.
Kafka Streams binder implementation builds on the foundations provided by the Spring for Apache Kafka project.
Kafka Streams binder provides binding capabilities for the three major types in Kafka Streams - KStream, KTable and GlobalKTable.
Kafka Streams applications typically follow a model in which records are read from an inbound topic, business logic is applied, and the transformed records are then written to an outbound topic. Alternatively, a Processor application with no outbound destination can be defined as well.
https://docs.spring.io/spring-cloud-stream-binder-kafka/docs/3.1.3/reference/html/spring-cloud-stream-binder-kafka.html#_kafka_streams_binder
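By contrast, a sketch of the same kind of processor against the Kafka Streams binder (again, the bean name and logic are illustrative only) operates on the whole KStream, so the full Kafka Streams DSL is available:

    import java.util.function.Function;

    import org.apache.kafka.streams.kstream.KStream;
    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.context.annotation.Bean;

    @SpringBootApplication
    public class ProcessorApplication {

        public static void main(String[] args) {
            SpringApplication.run(ProcessorApplication.class, args);
        }

        // The function receives the whole KStream, so filter, map, join,
        // aggregate, windowing, and the rest of the DSL are available.
        @Bean
        public Function<KStream<String, String>, KStream<String, String>> process() {
            return input -> input
                    .filter((key, value) -> value != null && !value.isEmpty())
                    .mapValues(String::toUpperCase);
        }
    }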

Related

Kafka Connect or Kafka Streams?

I have a requirement to read messages from a topic, enrich each message based on provided configuration (the data required for enrichment is sourced from external systems), and publish the enriched message to an output topic. Messages on both the source and output topics should be in Avro format.
Is this a good use case for a custom Kafka connector, or should I use Kafka Streams?
Why am I considering Kafka Connect?
Lightweight in terms of code and deployment
Configuration driven
Connection and error handling
Scalability
I like the plugin-based approach in Connect. If there is a new type of message that needs to be handled, I just deploy a new connector without having to deploy a full-scale Java app.
Why am I not sure this is a good candidate for Kafka Connect?
Calls to external system
Can Kafka be both source and sink for a connector?
Can we use Avro schemas in connectors?
Performance under load
Cannot do stateful processing (currently there is no requirement)
I have experience with Kafka Streams but not with Connect
Use both?
Use Kafka Connect to source an external database into a topic.
Use Kafka Streams to build that topic into a stream/table that can then be manipulated.
Use Kafka Connect to sink back into a database, or some system other than Kafka, as necessary.
Kafka Streams can also be config-driven, can use plugins (i.e. via reflection), is just as scalable, and has no different connection modes (to Kafka). Performance should be similar. Error handling is really the only complex part. ksqlDB is entirely "config driven" via SQL statements, and it can connect to external Connect clusters or embed its own.
Avro works for both, yes.
Some connectors, such as the S3 or JDBC sink connectors, are temporarily stateful, as they build in-memory batches.
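To make the Streams option concrete, a rough sketch of such an enrichment topology might look like the following (topic names, the application id, and the enrich() lookup are placeholders; the actual use case would swap the String serdes for Avro serdes):

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class EnrichmentTopology {

        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Read raw records, enrich each one via an external lookup,
            // and write the result to the output topic.
            builder.<String, String>stream("raw-events")
                   .mapValues(EnrichmentTopology::enrich)
                   .to("enriched-events");

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrichment-app");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            new KafkaStreams(builder.build(), props).start();
        }

        // Placeholder for the call to the external system; in practice
        // this would be an HTTP or database client lookup.
        private static String enrich(String value) {
            return value + ",enriched";
        }
    }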

Kafka Streams without Sink

I'm currently planning the architecture for an application that reads from a Kafka topic and, after some conversion, puts data into RabbitMQ.
I'm kind of new to Kafka Streams, and it looks like a good choice for my task. But the problem is that the Kafka server is hosted by another vendor, so I can't even install the Kafka Connect RabbitMQ sink plugin.
Is it possible to write a Kafka Streams application that doesn't have any sink points but just processes the input stream? I could just push to RabbitMQ in a foreach operation, but I'm not sure whether the stream will even work without a sink point.
foreach is a sink action, so to answer your question directly: no.
However, Kafka Streams should really be limited to Kafka-to-Kafka communication only.
Kafka Connect can be installed and run anywhere, if that is what you wanted to use... You can also use other Apache tools like Camel, Spark, NiFi, Flink, etc. to write to RabbitMQ after consuming from Kafka, or write any application in a language of your choice. For example, the Spring Integration or Spring Cloud Stream frameworks allow a single contract across many communication channels.
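If you do accept losing Kafka's delivery guarantees on the RabbitMQ leg, a rough sketch of the foreach approach could look like this (host, queue, and topic names are placeholders, and the RabbitMQ amqp-client library is assumed):

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class KafkaToRabbit {

        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("rabbitmq-host");
            Connection connection = factory.newConnection();
            Channel channel = connection.createChannel();
            channel.queueDeclare("target-queue", true, false, false, null);

            StreamsBuilder builder = new StreamsBuilder();

            // foreach is a terminal (sink) operation, so the topology ends
            // here instead of writing back to a Kafka topic.
            builder.<String, String>stream("input-topic")
                   .foreach((key, value) -> {
                       try {
                           channel.basicPublish("", "target-queue", null, value.getBytes());
                       } catch (Exception e) {
                           throw new RuntimeException(e);
                       }
                   });

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafka-to-rabbit");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            new KafkaStreams(builder.build(), props).start();
        }
    }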

Apache Beam over Apache Kafka Stream processing

What are the differences between Apache Beam and Apache Kafka with respect to Stream processing?
I am trying to grasp the technical and programmatic differences as well.
Please help me understand by sharing from your experience.
Beam is an API that uses an underlying stream processing engine like Flink, Storm, etc. in one unified way.
Kafka is mainly an integration platform that offers a messaging system based on topics that standalone applications use to communicate with each other.
On top of this messaging system (and the Producer/Consumer API), Kafka offers an API to perform stream processing, using messages as data and topics as inputs or outputs. Kafka stream processing applications are standalone Java applications that act as regular Kafka consumers and producers (this is important to understand how these applications are managed and how workload is shared among stream processing application instances).
In short, Kafka stream processing applications are standalone Java applications that run outside the Kafka cluster, read from the Kafka cluster, and write results back to the Kafka cluster. With other stream processing platforms, stream processing applications run inside the cluster engine (and are managed by that engine), read from somewhere else, and write results somewhere else.
One big difference between the Kafka and Beam stream APIs is that Beam distinguishes between bounded and unbounded data inside the data stream, whereas Kafka does not make that distinction. Consequently, handling bounded data with the Kafka API has to be done manually, using timed/sessionized windows to gather the data.
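For instance, a tumbling window in the Kafka Streams DSL is how one would bound the stream by hand (the topic name and window size are arbitrary, and serde configuration is omitted for brevity):

    import java.time.Duration;

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Printed;
    import org.apache.kafka.streams.kstream.TimeWindows;

    public class WindowedCount {

        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // A 5-minute tumbling window turns the unbounded stream into
            // a series of bounded per-window aggregates.
            builder.<String, String>stream("events")
                   .groupByKey()
                   .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
                   .count()
                   .toStream()
                   .print(Printed.toSysOut());
        }
    }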
Beam is a programming API, not a system or library you can run on its own. There are multiple Beam runners available that implement the Beam API.
Kafka is a stream processing platform and ships with Kafka Streams (aka the Streams API), a Java stream processing library that is built to read data from Kafka topics and write results back to Kafka topics.

Implement Kafka Streams Processor in .Net?

Is that possible?
The official .Net client confluent-kafka-dotnet only seems to provide consumer and producer functionality.
And (from what I remember from looking into Kafka Streams quite a while back) I believe Kafka Streams processors always run on the JVMs that run Kafka itself. In that case, it would be impossible in principle.
Yes, it is possible to re-implement Apache Kafka's Streams client library (a Java library) in .NET. But at the moment no such ready-to-use Kafka Streams implementation exists for .NET.
"And (from what I remember from looking into Kafka Streams quite a while back) I believe Kafka Streams processors always run on the JVMs that run Kafka itself. In that case, it would be impossible in principle."
No, Kafka Streams "processors" as you call them do not run in (the JVMs of) the Kafka brokers, which would be server-side.
Instead, the Kafka Streams client library is used to implement client-side Java/Scala/Clojure/... applications for stream processing. These applications talk to the Kafka brokers (which form the Kafka cluster) over the network.
As of May 2020, there seems to be a project in the making to support Kafka Streams in .NET:
https://github.com/LGouellec/kafka-stream-net
As per their roadmap, they are now in early beta and intend to reach v1 by the end of the year or the beginning of the next.

Kafka Producer API vs JMS Producer API

High-level design of the application:
The upstream system sends a stream of data, which is received by a Java application. Using Kafka as the data store, Logstash will publish the stored data to an Elasticsearch index, and all applications will use Elasticsearch queries to get the data.
Problem: to publish data from the Java application to Kafka, which API should be used, the Kafka JMS client or the Java Kafka Producer/Consumer API?
As per the Kafka documentation, if you are interested in writing new Java applications, you are encouraged to use the Java Kafka Producer/Consumer APIs, as they provide advanced features not available when using the kafka-jms-client: https://docs.confluent.io/current/clients/kafka-jms-client/docs/index.html
Also, as per the Kafka documentation, Kafka is not a typical messaging broker, and not all JMS concepts map 1:1 to Kafka.
Is there any benefit to using the JMS API for Kafka, given that Kafka is not a typical messaging broker [and the application will still be tightly coupled to Kafka] and not all JMS concepts can be mapped to Kafka?
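For reference, a minimal sketch with the native Java producer (the broker address, topic, and payload are placeholders) shows how little the recommended API requires:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class NativeProducerExample {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

            // try-with-resources closes the producer, flushing pending sends.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // send() is asynchronous; the callback reports the record's
                // partition/offset on success or the exception on failure.
                producer.send(new ProducerRecord<>("ingest-topic", "key", "payload"),
                        (metadata, exception) -> {
                            if (exception != null) {
                                exception.printStackTrace();
                            }
                        });
            }
        }
    }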