Kafka on Masstransit - apache-kafka

I am studying the possibility to use Kafka with Masstransit on our Information System. Masstransit documentation says Kafka may be used but as a Rider; As the usage of a transport is mandatory we must use another tool (as ActiveMQ, RabbitMQ) in addition of Kafka with Masstransit.
What does transport do that can not be done by Kafka, that prevent exclusive usage of Kafka?

MassTransit is designed with publish-subscribe with competing consumers in mind, which is the default pattern when using a message broker.
Riders were introduced, as mentioned in the docs, as a gateway between event streaming and pub-sub. So, the main purpose of riders is to support the scenario when you need to consume a message from a rider and publish it to the bus. Even producers are optional for riders, and they are implementation-specific. As event streaming infrastructure like Kafka doesn't even support competing consumers, it makes little sense to make something like Kafka transport, as it introduces a heavy impedance mismatch between MassTransit concepts and what Kafka does.
You can still only use Kafka with MassTransit, but you'd need to use the in-memory transport to configure and start the bus.

Related

Process messages pushed through Kafka

I haven't used Kafka before and wanted to know if messages are published through Kafka what are the possible ways to capture that info?
Is Kafka only way to receive that info via "Consumers" or can Rest APIs be also used here?
Haven't used Kafka before and while reading up I did find that Kafka needs ZooKeeper running too.
I don't need to publish info just process data received from Kafka publisher.
Any pointers will help.
Kafka is a distributed streaming platform that allows you to process streams of records in near real-time.
Producers publish records/messages to Topics in the cluster.
Consumers subscribe to Topics and process those messages as they are available.
The Kafka docs are an excellent place to get up to speed on the core concepts: https://kafka.apache.org/intro
Is Kafka only way to receive that info via "Consumers" or can Rest APIs be also used here?
Kafka has its own TCP based protocol, not a native HTTP client (assuming that's what you actually mean by REST)
Consumers are the only way to get and subsequently process data, however plenty of external tooling exists to make it so you don't have to write really any code if you don't want to in order to work on that data

Apache Kafka StateStore

I am learning Apache Kafka (as a messaging system) and in that process came to know of term StateStore , link here
I am also aware of Apache kafka streams, the client API.
Is StateStore applicable for Apache kafka in the context of messaging systems or it is applicable to Apache Kafka Streams.
Does Apache have their "own" implementation of StateStore or use third party implementation (for example, rockdsb.
Can anyone help me understand this.
Adding an overview to the good concise explanation about StateStore in the context of Kafka Streams and your question.
Kafka Broker in a nutshell
In a messaging context your work simplified would be:
Publishing state (producing messages)
Saving messages for a period of time for later consumption (retention time)
Consuming state (getting the messages)
And in a nutshell #2 plus fault tolerance and keeping track of the position of your consumer groups' reads (offsets) is what a Kafka broker does for you.
Kafka client API's
Apart from that Kafka provides client libraries for your common patterns of working with messages:
Producer - Publish messages to Kafka topics
Consumer - Subscribe to Kafka topics
Connect - Create reliable integrations with external stores such as various DBMS.
Streams - DSL and utilities aimed to simplify development of common streaming application patterns.
Admin - Programmatically manage / monitor Kafka resources.
Kafka Streams State Stores
I'll quote the great explanation from the Streams Architecture docs (I highly recommend Kafka docs as they are built very good and for any level of experience).
Kafka Streams provides so-called state stores, which can be used by stream processing applications to store and query data, which is an important capability when implementing stateful operations. The Kafka Streams DSL, for example, automatically creates and manages such state stores when you are calling stateful operators such as join() or aggregate(), or when you are windowing a stream.
As you can see the StateStore is used as a helper for extending the built-in abilities from a single message processing context to multi-message processing, thus enabling more complex functions over a bunch of messages (all the messages passed in a time window, aggregation functions over several messages, etc.)
I'll add to that that RocksDB is the default implementation used by Kafka and can be changed as was mentioned in previous answer.
Also if you want to explore more here is a link to the great intro videos form Apache Kafka's official docs:
Streams API intro videos
Have an awesome learning experience!
StateStore is applicable to kafka streams context.
Some processors like reduce or aggregate are stateful operations.
Kafka streams use state stores to manage this. By default, it uses rocksDB, but it is customizable.

can vert.x event bus replace the need for Kafka?

I am evaluating the vert.x framework to see if I can reduce the Kafka based communications between my microservices developed using spring boot.
The question is:
Can I replace
Kafka with vert.x event bus and
spring boot microservices with vert.x based verticles
To answer quickly, I would say it depends on your needs.
Yes, the eventbus can be a good way to handle natively communication between microservices verticles using an asynchronous and non-blocking paradigm.
But in some cases you could need:
to handle some common enterprises patterns like replay mechanisms, persistence of messages, transactional reading
to be able to process some kind of messages in a chronological order
to handle communication between multiples kind of microservices that aren't all written with the same framework/toolkit or even programming language
to handle reliability, resilience and
failure recovery when all your consumers/microservices/verticles are died
to handle dynamic horizontal scalability and monitoring of your consumers/microservices/verticles
to be able to work with a single cluster deployed in multi-datacenters and multi-regions
In those cases I'd prefer to choose Apache Kafka over the native eventbus or an old fascioned JMS compliant system.
It's not forbidden to use both eventbus and kafka in the same microservices architecture according to your real needs. For example, you could have one kafka consumers group reading a kafka topic to handle scaling, monitoring, failure recovery and reply mechanism and then handle communication between your sub-verticles through the eventbus.
I'll clarify a little bit for the scalability and monitoring part and explain why I think it's more simple to handle that with Kafka over the native eventbus and cluster mode with vert.x : Kafka allow us to know in real time (through JMX metrics and the describe command):
the "lag" of a topic which corresponds to
the number of unread messages
the number of consumers of each group that are listening a topic
the number of partitions of a topic affected of each consumers
i/o metrics
So it's possible to use an ElasticStack or Prometheus+Grafana solution to monitor those metrics and use them to handle a dynamic scalability (when you know that there's a need to increase temporarily the number of consumers for example according to the lag metric and the number of partitions and the cpu/ram/swap metrics of your hosts).
To answer the second question vert.x or SpringBoot my answer will be not very objective but I'd vote for vert.x for its performances on the JVM and especially for its simplicity. I'm a little tired of the Spring factory and its big layers of abstraction that hides a lot of issues under a mountain of annotations triggering a mountain of AOP.
Moreover, In the Java world of microservices, there's other alternatives to SpringBoot like the different implementations of Microprofile (thorntail project for example).
The event-bus is not persistent. You should use it for fast verticle-to-verticle communications, and more generally to dispatch events where you know that you can loose them if you have some crash.
Kafka streams are persistent, and you should send events there because either you want other (possibly non-Vert.x) applications to consume them, and/or because you want to ensure that these events are not being lost in case of failure.
A reactive (read "scalable and fault-tolerant") Vert.x application typically uses a combination of both the event-bus and some replicable messaging systems like AMQP / Kafka / etc.
On the question:
Can I replace spring boot microservices with vert.x based verticles?
Yes, definitely, although the 2 have different programming models.
If you want a more progressive approach and use Spring for structuring your application while using Vert.x for resource efficiency over your I/O and event processing then you can mix them, see https://github.com/vert-x3/vertx-examples/tree/master/spring-examples for examples.
Take a look at the Quarkus framework: in the workshop section you'll find Vert.x and Apache Kafka combined!

Kafka connect or Kafka Client

I need to fetch messages from Kafka topics and notify other systems via HTTP based APIs. That is, get message from topic, map to the 3rd party APIs and invoke them. I intend to write a Kafka Sink Connector for this.
For this use case, is Kafka Connect the right choice or I should go with Kafka Client.
Kafka clients when you have full control on your code and you are expert developer, you want to connect an application to Kafka and can modify the code of the application.
push data into Kafka
pull data from Kafka.
https://cwiki.apache.org/confluence/display/KAFKA/Clients
Kafka Connect when you don’t have control on third party code new in Kafka and to you have to connect Kafka to datastores that you can’t modify code.
Kafka Connect’s scope is narrow: it focuses only on copying streaming data to and from Kafka and does not handle other tasks.
http://docs.confluent.io/2.0.0/connect/
I am adding few lines form other blogs to explain differences
Companies that want to adopt Kafka write a bunch of code to publish their data streams. What we’ve learned from experience is that doing this correctly is more involved than it seems. In particular, there are a set of problems that every connector has to solve:
• Schema management: The ability of the data pipeline to carry schema information where it is available. In the absence of this capability, you end up having to recreate it downstream. Furthermore, if there are multiple consumers for the same data, then each consumer has to recreate it. We will cover the various nuances of schema management for data pipelines in a future blog post.
• Fault tolerance: Run several instances of a process and be resilient to failures
• Parallelism: Horizontally scale to handle large scale datasets
• Latency: Ingest, transport and process data in real-time, thereby moving away from once-a-day data dumps.
• Delivery semantics: Provide strong guarantees when machines fail or processes crash
• Operations and monitoring: Monitor the health and progress of every data integration process in a consistent manner
These are really hard problems in their own right, it just isn’t feasible to solve them separately in each connector. Instead you want a single infrastructure platform connectors can build on that solves these problems in a consistent way.
Until recently, adopting Kafka for data integration required significant developer expertise; developing a Kafka connector required building on the client APIs.
https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/
Kafka Connect will work well for this purpose, but this would also be a pretty straightforward consumer application as well because consumers also have the benefits of fault tolerance/scalability and in this case you're probably just doing simple message-at-a-time processing within each consumer instance. You can also easily use enable.auto.commit for this application, so you will not encounter the tricky parts of using the consumer directly. The main thing using Kafka Connect would give you compared to using the consumer in this case would be that the connector could be made generic for different input formats, but that may not be important to you for a custom connector.
you should use kafka connect sink when you are using kafka connect source for producing messages to a specific topic.
for e.g. when you are using file-source then you should use file-sink to consume what source have been produced. or when you are using jdbc-source you should use jdbc-sink to consume what you have produced.
because the schema of the producer and sink consumer should be compatible then you should use compatible source and sink in both sides.
if in some cases the schemas are not compatible you can use SMT (Simple message transform) capability that is added since version 10.2 of kafka onward and you will be able to write message transformers to transfer message between incompatible producers and consumers.
Note: if you want to transfer messages faster I suggest that you use avro and schema registry to transfer message more efficiently.
If you can code with java you can use java kafka stream, Spring-Kafka project or stream processing to achieve what you desire.
In the book that is called Kafka In Actionis explained like following:
The purpose of Kafka Connect is to help move data in or out of Kafka without having to deal with writing our own producers and clients. Connect is a framework that is already part of Kafka that really can make it simple to use pieces that have been already been built to start your streaming journey.
As for your problem, Firstly, one of the simpliest questions that one should ask is if you can modify the application code of the systems from which you need data interaction.
Secondly, If you would write custom connector which have the in-depth knowledge the ability and this connector will be used by others, it worth it. Because it may help others that may not be the experts in those systems. Otherwise, this kafka connector is used only by yourself, I think you should write Kafka connector. So you can get more flexibility and can write more easily implementing.

What is the difference between MQTT broker and Apache Kafka

I am developing a mobile messaging app. I was going through technology needed and found two MQTT & Apache Kafta. To me both seems doing the same thing in the same way (in terms of subscribing & publishing to a topic).
I heard that MQTT is fit for mobiles as it is very light weight ? So basically what is the difference between these two and what are the advantage of each on other?
The main motive behind Kafka is scalability.
MQTT is a protocol with public specification for lightweight client / message broker communications, allowing publish/subscribe exchanges. Multiple implementations of client libraries and brokers (Mosquitto, JoramMQ...) exist and are virtually compatible. MQTT just specifies the transport, and vaguely the application part (i.e. how data is handled and possibly stored, how clients are authorized...). The spec is not clear if data consumed on a topic is only real-time or possibly persistent. The spec doesn't state anything about how the message broker implementing MQTT could/should scale.
On the other hand, Apache Kafka is a message broker based on an internal "commit log": its focus is storing massive amounts of data on disk, and allowing consumption in real-time or later (as long as data is still available on disk). It's designed to be deployable as cluster of multiple nodes, with good scalability properties. Kafka uses its own network protocol.
So you are comparing two different things here: a standard pub/sub protocol (with multiple implementations), and a specific message storing/distributing software, vaguley of the same family with its own protocol.
I'd say that if you need to store massive amount of messages, to ensure batch processing, look more at Kafka. If you have lots of clients/apps exchanging messages in real-time on many independent topics look more at the MQTT (or even AMQP) message broker implementations.
MQTT is a standard protocol (with many implementations). Kafka (which is also a protocol) is normally used by downloading it from the Apache website or e.g. a Confluent Docker image.
It is like comparing apples and oranges, both exist for very different reasons.
Most use cases I see in IoT environments combine both MQTT and Apache Kafka. The edge devices speak MQTT protocol (for the benefits it has in edge environments. These are then forwarded to Apache Kafka to get the events into the rest of the enterprise architecture.
You can do this either via a MQTT Broker like HiveMQ + Apache Kafka or via a MQTT Proxy (so that you don't need the MQTT Broker). Both options have trade-offs, of course.
See this example of how to combine MQTT with Apache Kafka. Or go directly to the Github code: "Deep Learning UDF for KSQL for Streaming Anomaly Detection of MQTT IoT Sensor Data".
I also created a live demo about how to integrate Apache Kafka and MQTT.