Create custom receiver for MQTT in spark streaming - scala

I need to use Spark to analyze data coming from IoT devices via an MQTT broker. My Spark job connects to the MQTT broker, where I can subscribe to a specific topic. I have used the MQTTUtils library in Spark to connect to the broker, but I have doubts about how the library works internally. What I noticed is that "MQTTUtils.createStream" opens a connection to the MQTT broker for a single topic. In that case, if I have to subscribe to 100 topics on one MQTT broker, it may establish 100 connections to the broker. That is not desirable in a real scenario. Please let me know if it does not work the way I think.
So I have decided to create a custom receiver for the MQTT broker so that I can manage the connection in my own MQTT client. I have gone through the documentation on how to implement a custom receiver, but I did not succeed in implementing it properly.
If somebody has hands-on experience with such a custom receiver, please help me make it work.
I appreciate your support, since this is critical for my solution.
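For illustration, here is a minimal sketch of what such a custom receiver could look like. It assumes the Eclipse Paho Java client (org.eclipse.paho.client.mqttv3) is on the classpath next to Spark Streaming. The class name MultiTopicMqttReceiver, the (topic, payload) output type and the default storage level are choices made for this example, not something taken from MQTTUtils.

```scala
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import org.eclipse.paho.client.mqttv3.{IMqttDeliveryToken, MqttCallback, MqttClient, MqttMessage}
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence

// Hypothetical receiver: one MQTT connection shared by all topic filters.
// Emits (topic, payload) pairs so downstream code can tell the topics apart.
class MultiTopicMqttReceiver(
    brokerUrl: String,
    topics: Array[String],
    level: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
) extends Receiver[(String, String)](level) {

  // Created in onStart, which runs on the executor that hosts the receiver.
  @transient private var client: MqttClient = _

  override def onStart(): Unit = {
    val c = new MqttClient(brokerUrl, MqttClient.generateClientId(), new MemoryPersistence())
    c.setCallback(new MqttCallback {
      // Called on a Paho thread for every message on any subscribed topic.
      override def messageArrived(topic: String, message: MqttMessage): Unit =
        store((topic, new String(message.getPayload, StandardCharsets.UTF_8)))

      override def deliveryComplete(token: IMqttDeliveryToken): Unit = ()

      // Ask Spark to stop and start the receiver again after a lost connection.
      override def connectionLost(cause: Throwable): Unit =
        restart("MQTT connection lost", cause)
    })
    c.connect()
    c.subscribe(topics) // a single connection, many topic filters
    client = c
  }

  override def onStop(): Unit = {
    val c = client
    if (c != null && c.isConnected) c.disconnect()
  }
}
```

It would be plugged in with something like ssc.receiverStream(new MultiTopicMqttReceiver("tcp://localhost:1883", Array("sensors/a", "sensors/b"))), where the broker URL and topic names are placeholders. Since MQTT topic filters support the + and # wildcards, a single filter such as "sensors/#" may already cover many topics.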

Related

Integrating a MQTT broker inside server

I am learning about MQTT brokers, and I have a question I cannot answer. Is it possible to integrate an MQTT broker inside a server that acts as a client in a client/server architecture? The reason I would need this is that the client retrieves data from an API.
I have tried to depict what I mean. If it is not possible, how would one approach it then, in case the data from the API is needed?
There is no reason for the broker to be part of the client.
The client receives the data and then publishes it as a message to a separate broker, where subscribers receive the message. There is no benefit to combining the two.
Building adapters like this is common practice (it's one of the reasons tools like Node-RED were created).

Process messages pushed through Kafka

I haven't used Kafka before. If messages are published through Kafka, what are the possible ways to capture that info?
Are Kafka "Consumers" the only way to receive that info, or can REST APIs also be used here?
While reading up, I found that Kafka needs ZooKeeper running too.
I don't need to publish anything, just process the data received from a Kafka publisher.
Any pointers will help.
Kafka is a distributed streaming platform that allows you to process streams of records in near real-time.
Producers publish records/messages to Topics in the cluster.
Consumers subscribe to Topics and process those messages as they are available.
The Kafka docs are an excellent place to get up to speed on the core concepts: https://kafka.apache.org/intro
Are Kafka "Consumers" the only way to receive that info, or can REST APIs also be used here?
Kafka has its own TCP-based protocol and no native HTTP interface (assuming that's what you actually mean by REST).
Consumers are the only way to get and subsequently process data. However, plenty of external tooling exists so that you barely have to write any code to work on that data if you don't want to.
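If you do end up writing the consumer yourself, a minimal read-only sketch with the official Java client (kafka-clients) called from Scala 2.13 could look like the following. The broker address, group id and topic name "events" are placeholders, and the object name ReadOnlyConsumer is made up for this example.

```scala
import java.time.Duration
import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer

import scala.jdk.CollectionConverters._

object ReadOnlyConsumer {
  // Builds the consumer configuration; all values passed in are placeholders.
  def consumerProps(bootstrapServers: String, groupId: String): Properties = {
    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers)
    props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId)
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    // Start from the oldest retained message when the group has no stored offset.
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
    props
  }

  def main(args: Array[String]): Unit = {
    val consumer = new KafkaConsumer[String, String](consumerProps("localhost:9092", "demo-readers"))
    try {
      consumer.subscribe(Collections.singletonList("events"))
      while (true) {
        // poll() fetches whatever is available, waiting up to the given timeout.
        val records = consumer.poll(Duration.ofMillis(500))
        for (r <- records.asScala)
          println(s"partition=${r.partition()} offset=${r.offset()} value=${r.value()}")
      }
    } finally consumer.close()
  }
}
```

The consumer only reads; nothing in this sketch publishes back to Kafka.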

IoT - multiple Kafka producers to publish messages to same topic

I am trying to develop an IoT-based application in which multiple devices generate data and send it to a Kafka broker. The expected number of devices is around 60-70 thousand.
There could be 5-10 different device types; however, the device count for each type would be in the thousands, such as 20-40 thousand each.
I want to understand how Kafka topics should be used to support this many devices (20-40 thousand devices on average).
Also, please let me know whether an MQTT-based implementation is required for this kind of application.
Thanks in advance,
Avinash Deshmukh
The number of partitions mostly influences the consumer side, because a partition is the unit of parallelism for reading messages.
On the producer side, consider that each partition leader is hosted by a broker, so the producer has to connect to different brokers to write to different partitions, and multiple TCP connections are needed.
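To make the partition point concrete, here is a rough sketch of a producer that sends device readings with the device ID as the record key. With the default partitioner, records with the same key land in the same partition (as long as the partition count does not change), so each device's readings stay in order. The one-topic-per-device-type naming ("devices.<type>") is only an assumption for this example, not a recommendation from the answer, and DeviceProducer is a made-up name.

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object DeviceProducer {
  // Assumption for this sketch: one topic per device type, e.g. "devices.thermostat".
  def topicFor(deviceType: String): String = s"devices.$deviceType"

  // The device ID is the key, so all readings of one device share a partition.
  def recordFor(deviceType: String, deviceId: String, payload: String): ProducerRecord[String, String] =
    new ProducerRecord[String, String](topicFor(deviceType), deviceId, payload)

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder address
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

    val producer = new KafkaProducer[String, String](props)
    try {
      producer.send(recordFor("thermostat", "dev-42", """{"tempC": 21.5}"""))
    } finally {
      producer.close() // close() also flushes records that are still buffered
    }
  }
}
```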
Regarding MQTT, it means that you have to run an MQTT broker and then Kafka Connect with the related MQTT connector. Whether that makes sense depends on the kind of devices you have. MQTT is a lightweight protocol well suited to embedded/IoT devices, so for low-power devices it could make more sense than having the Kafka protocol in the device stack. It could also depend on whether you are using an IoT gateway at the edge that gathers messages from multiple devices in the field and then sends them to Kafka.
I want to understand how Kafka topics should be used to support this many devices (20-40 thousand devices on average).
To be honest, I don't understand the question. Are you asking about topic sizing? The number of partitions? Something else?
Are you going to provide your devices direct access to the Kafka cluster?
Also, please let me know whether an MQTT-based implementation is required for this kind of application.
Apache Kafka doesn't support MQTT per se. Are you talking about some kind of commercial solution for that?

Kafka connect or Kafka Client

I need to fetch messages from Kafka topics and notify other systems via HTTP based APIs. That is, get message from topic, map to the 3rd party APIs and invoke them. I intend to write a Kafka Sink Connector for this.
For this use case, is Kafka Connect the right choice, or should I go with a Kafka client?
Use Kafka clients when you have full control over your code and are an experienced developer: you want to connect an application to Kafka and can modify the code of that application. Clients let you:
push data into Kafka
pull data from Kafka.
https://cwiki.apache.org/confluence/display/KAFKA/Clients
Use Kafka Connect when you don't have control over the third-party code, are new to Kafka, and have to connect Kafka to datastores whose code you can't modify.
Kafka Connect’s scope is narrow: it focuses only on copying streaming data to and from Kafka and does not handle other tasks.
http://docs.confluent.io/2.0.0/connect/
I am adding a few lines from other blogs to explain the differences:
Companies that want to adopt Kafka write a bunch of code to publish their data streams. What we’ve learned from experience is that doing this correctly is more involved than it seems. In particular, there are a set of problems that every connector has to solve:
• Schema management: The ability of the data pipeline to carry schema information where it is available. In the absence of this capability, you end up having to recreate it downstream. Furthermore, if there are multiple consumers for the same data, then each consumer has to recreate it. We will cover the various nuances of schema management for data pipelines in a future blog post.
• Fault tolerance: Run several instances of a process and be resilient to failures
• Parallelism: Horizontally scale to handle large scale datasets
• Latency: Ingest, transport and process data in real-time, thereby moving away from once-a-day data dumps.
• Delivery semantics: Provide strong guarantees when machines fail or processes crash
• Operations and monitoring: Monitor the health and progress of every data integration process in a consistent manner
These are really hard problems in their own right, it just isn’t feasible to solve them separately in each connector. Instead you want a single infrastructure platform connectors can build on that solves these problems in a consistent way.
Until recently, adopting Kafka for data integration required significant developer expertise; developing a Kafka connector required building on the client APIs.
https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/
Kafka Connect will work well for this purpose, but this would also be a pretty straightforward consumer application, because consumers also have the benefits of fault tolerance and scalability, and in this case you're probably just doing simple message-at-a-time processing within each consumer instance. You can also easily use enable.auto.commit for this application, so you will not encounter the tricky parts of using the consumer directly. The main thing Kafka Connect would give you compared to using the consumer in this case is that the connector could be made generic for different input formats, but that may not be important to you for a custom connector.
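A rough sketch of that plain-consumer approach follows, assuming kafka-clients, Scala 2.13 and the JDK 11+ HTTP client. The endpoint URL, topic name "notifications" and group id are placeholders, HttpNotifier is a made-up name, and retry or error handling for failed HTTP calls is deliberately left out.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.time.Duration
import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.KafkaConsumer

import scala.jdk.CollectionConverters._

object HttpNotifier {
  // Maps one Kafka message to a POST against a hypothetical third-party endpoint.
  def toRequest(endpoint: String, body: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(endpoint))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "http-notifier")
    props.put("enable.auto.commit", "true") // offsets are committed periodically from poll()
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val http = HttpClient.newHttpClient()
    val consumer = new KafkaConsumer[String, String](props)
    try {
      consumer.subscribe(Collections.singletonList("notifications"))
      while (true) {
        // Simple message-at-a-time processing: one HTTP call per record.
        for (r <- consumer.poll(Duration.ofSeconds(1)).asScala) {
          val resp = http.send(
            toRequest("https://example.invalid/notify", r.value()),
            HttpResponse.BodyHandlers.ofString())
          if (resp.statusCode() >= 400)
            println(s"offset ${r.offset()} rejected with status ${resp.statusCode()}")
        }
      }
    } finally consumer.close()
  }
}
```

Running several instances with the same group.id spreads the topic's partitions across them, which is where the scalability and fault tolerance mentioned above come from.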
You should use a Kafka Connect sink when you are using a Kafka Connect source to produce messages to a specific topic.
For example, when you are using file-source you should use file-sink to consume what the source has produced, and when you are using jdbc-source you should use jdbc-sink to consume what you have produced.
Because the schemas of the producer and the sink consumer should be compatible, you should use a compatible source and sink on both sides.
If the schemas are not compatible, you can use the SMT (Single Message Transform) capability, available since Kafka 0.10.2, to write message transformers that move messages between incompatible producers and consumers.
Note: if you want to transfer messages faster, I suggest using Avro and Schema Registry to transfer messages more efficiently.
If you can code in Java, you can use Kafka Streams, the Spring-Kafka project, or another stream processing framework to achieve what you want.
The book Kafka in Action explains it as follows:
The purpose of Kafka Connect is to help move data in or out of Kafka without having to deal with writing our own producers and clients. Connect is a framework that is already part of Kafka that really can make it simple to use pieces that have been already been built to start your streaming journey.
As for your problem: first, one of the simplest questions to ask is whether you can modify the application code of the systems you need to exchange data with.
Second, if you have in-depth knowledge of those systems and the ability to write a custom connector, and the connector will be used by others, it is worth it, because it may help people who are not experts in those systems. Otherwise, if the connector would be used only by yourself, I think you should write a Kafka client instead, which gives you more flexibility and is easier to implement.

What is the difference between MQTT broker and Apache Kafka

I am developing a mobile messaging app. I was going through the technology needed and found two options: MQTT and Apache Kafka. To me, both seem to do the same thing in the same way (in terms of subscribing and publishing to a topic).
I heard that MQTT is a good fit for mobile because it is very lightweight. So basically, what is the difference between these two, and what are the advantages of each over the other?
The main motive behind Kafka is scalability.
MQTT is a protocol with public specification for lightweight client / message broker communications, allowing publish/subscribe exchanges. Multiple implementations of client libraries and brokers (Mosquitto, JoramMQ...) exist and are virtually compatible. MQTT just specifies the transport, and vaguely the application part (i.e. how data is handled and possibly stored, how clients are authorized...). The spec is not clear if data consumed on a topic is only real-time or possibly persistent. The spec doesn't state anything about how the message broker implementing MQTT could/should scale.
On the other hand, Apache Kafka is a message broker based on an internal "commit log": its focus is storing massive amounts of data on disk, and allowing consumption in real-time or later (as long as data is still available on disk). It's designed to be deployable as cluster of multiple nodes, with good scalability properties. Kafka uses its own network protocol.
So you are comparing two different things here: a standard pub/sub protocol (with multiple implementations), and a specific piece of message storing/distributing software, vaguely of the same family, with its own protocol.
I'd say that if you need to store massive amounts of messages, to ensure batch processing, look more at Kafka. If you have lots of clients/apps exchanging messages in real-time on many independent topics, look more at the MQTT (or even AMQP) message broker implementations.
MQTT is a standard protocol (with many implementations). Kafka (which is also a protocol) is normally used by downloading it from the Apache website or e.g. a Confluent Docker image.
It is like comparing apples and oranges, both exist for very different reasons.
Most use cases I see in IoT environments combine both MQTT and Apache Kafka. The edge devices speak the MQTT protocol (for the benefits it has in edge environments). Their messages are then forwarded to Apache Kafka to get the events into the rest of the enterprise architecture.
You can do this either via an MQTT broker like HiveMQ plus Apache Kafka, or via an MQTT Proxy (so that you don't need the MQTT broker). Both options have trade-offs, of course.
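Just to illustrate the forwarding step, and not as a replacement for either option above or for Kafka Connect, a hand-rolled bridge could look roughly like this. It assumes the Eclipse Paho Java client and kafka-clients. The broker addresses, the "sensors/#" filter, the Kafka topic "iot-events" and the name MqttToKafkaBridge are all placeholders for this sketch.

```scala
import java.nio.charset.StandardCharsets
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.eclipse.paho.client.mqttv3.{IMqttDeliveryToken, MqttCallback, MqttClient, MqttMessage}
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence

object MqttToKafkaBridge {
  // Uses the MQTT topic as the Kafka record key so the origin is preserved.
  def toRecord(kafkaTopic: String, mqttTopic: String, payload: Array[Byte]): ProducerRecord[String, String] =
    new ProducerRecord[String, String](kafkaTopic, mqttTopic, new String(payload, StandardCharsets.UTF_8))

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    val mqtt = new MqttClient("tcp://localhost:1883", MqttClient.generateClientId(), new MemoryPersistence())
    mqtt.setCallback(new MqttCallback {
      // Every MQTT message is forwarded to Kafka as one record.
      override def messageArrived(topic: String, message: MqttMessage): Unit =
        producer.send(toRecord("iot-events", topic, message.getPayload))

      override def deliveryComplete(token: IMqttDeliveryToken): Unit = ()

      override def connectionLost(cause: Throwable): Unit =
        System.err.println(s"MQTT connection lost: ${cause.getMessage}")
    })
    mqtt.connect()
    mqtt.subscribe("sensors/#") // wildcard filter: everything below sensors/

    println("Bridging MQTT to Kafka; press Enter to stop.")
    scala.io.StdIn.readLine()
    mqtt.disconnect()
    producer.close()
  }
}
```

A sketch like this has no reconnect logic or delivery guarantees, which is part of why a broker plus connector or a proxy is usually used for this.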
See this example of how to combine MQTT with Apache Kafka. Or go directly to the Github code: "Deep Learning UDF for KSQL for Streaming Anomaly Detection of MQTT IoT Sensor Data".
I also created a live demo about how to integrate Apache Kafka and MQTT.