Spark Streaming Receivers - pyspark

I was wondering what some examples of reliable and unreliable receivers in Spark Streaming are. For example, the socket receiver may be (I'm not sure) an unreliable receiver.

Related

Are Kafka and Kafka Streams the right tools for our case?

I'm new to Kafka and will be grateful for any advice.
We are updating a legacy application and, as part of that, moving it from IBM MQ to something different.
The application currently does the following:
Reads batch XML messages (up to 5 MB)
Parses them into something meaningful
Processes the data, manually parallelizing this procedure somehow across parts of the batch; this involves some external legacy API calls that result in DB changes
Sends several kinds of email notifications
Sends a reply to some other queue
Input messages are profiled to disk
We are considering using Kafka with Kafka Streams, as it would be nice to:
Scale processing easily
Have messages stored persistently out of the box
Get built-in partitioning, replication, and fault tolerance
Use the Confluent Schema Registry to let us move to schema-on-write
Reuse it for service-to-service communication with other applications as well
But I have some concerns.
We are thinking about splitting those huge messages logically and putting them into Kafka that way since, as I understand it, Kafka is not a huge fan of big messages. It would also let us parallelize processing on a per-partition basis (a rough producer sketch of what I mean follows after these questions).
After that we would use Kafka Streams for the actual processing and, further on, for aggregating the batch responses back together using a state store, as well as for pushing some messages to other topics (e.g. for sending emails).
But I wonder whether it is a good idea to do the actual processing in Kafka Streams at all, since it involves external API calls.
I'm also not sure what the best way is to handle the case when this external API is down for any reason, which means a temporary failure for the current message and all subsequent ones. Is there any way to stop Kafka Streams processing for some time? I can see that there are pause and resume methods on the Consumer API; can they somehow be used in Streams?
Is it better to use a regular Kafka consumer here, possibly adding Streams as a next step to merge those batch messages back together? That sounds like an overcomplication.
Is Kafka a good tool for these purposes at all?
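To illustrate the splitting idea, here is a rough sketch of what I mean (Python confluent-kafka client; the topic name, the key scheme, and the split_batch() parser are just placeholders):

    # Split one large XML batch into per-part records, keyed so that the parts of a
    # batch can be re-grouped later. Topic, keys and split_batch() are illustrative.
    import json
    import uuid
    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": "localhost:9092"})

    def split_batch(xml_payload):
        # Placeholder: parse the up-to-5 MB XML batch and yield logical parts as dicts.
        yield {"order_id": 1, "body": "..."}
        yield {"order_id": 2, "body": "..."}

    def publish_batch(xml_payload, topic="batch-parts"):
        batch_id = str(uuid.uuid4())
        for i, part in enumerate(split_batch(xml_payload)):
            record = {"batch_id": batch_id, "part_no": i, **part}
            # Keying by "batch_id-part_no" spreads the parts across partitions for
            # parallel processing; re-aggregating them later then needs a repartition
            # by batch_id (e.g. in Kafka Streams before the state store).
            producer.produce(topic,
                             key="%s-%d" % (batch_id, i),
                             value=json.dumps(record).encode("utf-8"))
        producer.flush()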
Overall I think you would be fine using Kafka, and probably Kafka Streams as well. I would recommend using Streams for any logic you need to do, i.e. the filtering or mapping you have to do, and then writing the results out with a connector or a standard producer.
While it is ideal to have smaller messages, I have seen Streams users with messages in the GBs.
You can make remote calls, e.g. to send an email, from a Kafka Streams Processor, but that is not recommended. It would probably be better to write the "send an email" event to an output topic and use a normal consumer to read and send the messages. This would also take care of your concern about the API being down, since you can always remember the last offset and restart from there, or use the pause and resume methods (a sketch of that follows below).
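For example, a plain consumer for that output topic could look roughly like this (Python confluent-kafka client; the topic name, send_email(), and the health check are placeholders):

    # Read "send an email" events with auto-commit disabled and pause consumption
    # while the external API is down, committing only after a successful send.
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "email-sender",
        "enable.auto.commit": False,     # commit only after the email was really sent
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["email-events"])

    def send_email(payload):             # placeholder for the real call
        ...

    def api_is_healthy():                # placeholder health check for the external API
        return True

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        if not api_is_healthy():
            # Stop fetching more records until the API recovers; the record already
            # polled is still processed below once the API is back.
            consumer.pause(consumer.assignment())
            while not api_is_healthy():
                consumer.poll(1.0)       # paused partitions return nothing; keeps the consumer alive
            consumer.resume(consumer.assignment())
        send_email(msg.value())
        consumer.commit(message=msg)     # the application-level "ack"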

TCP delivery and processing guarantees, and exactly-once guarantees in event/streaming-based systems

To the best of my knowledge, TCP provides at-least-once delivery (retransmission until an ACK) and exactly-once processing at the receiver (duplicate packets are simply ignored and only one copy is delivered to the app). If this is true, why do application-layer messaging systems (e.g., Kafka) and streaming systems (e.g., Spark) require their own application-level protocols to provide exactly-once processing guarantees? Why not just rely on TCP for one-time delivery and/or processing?
The reliability guarantees of TCP only cover data delivery between systems, not data delivery between applications. The recipient system sends an ACK back once the data has been received by the OS and put into the receive buffer of the socket. This means the ACK might be sent before the application reads and processes the data. To guarantee that the application has read and processed the data, some kind of ACK is therefore needed inside the application protocol.
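As a minimal illustration with plain sockets (the single-message exchange and the ACK format are invented for the example):

    # TCP's own ACK only means the bytes reached the receiving OS socket buffer.
    # The receiver below sends its own application-level "ACK" only after it has
    # actually processed the message, which is what messaging systems have to do too.
    import socket

    def process(message: bytes):
        ...  # placeholder: persist or handle the message

    def serve(host="127.0.0.1", port=9000):
        with socket.create_server((host, port)) as srv:
            conn, _ = srv.accept()
            with conn:
                data = conn.recv(65536)   # TCP has already ACKed these bytes by now
                process(data)             # ...but the sender doesn't know we handled them
                conn.sendall(b"ACK")      # application-level ack, sent only after processing

    def send_with_ack(message: bytes, host="127.0.0.1", port=9000):
        with socket.create_connection((host, port)) as sock:
            sock.sendall(message)
            if sock.recv(16) != b"ACK":
                raise RuntimeError("TCP delivered the bytes, but the app never acknowledged them")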

Create custom receiver for MQTT in spark streaming

I have a requirement to do analysis using Spark on data coming from IoT devices via an MQTT broker. My Spark job connects to the MQTT broker, where I can subscribe to a specific topic. I have used the MQTTUtils library in Spark to connect to the broker, but I have doubts about how the library works internally. What I noticed is that MQTTUtils.createStream connects to the MQTT broker for a single topic. In that case, if I have to subscribe to 100 topics on an MQTT broker, it may establish 100 connections to the broker, which is not desirable in a real scenario. Please let me know if it does not work the way I think.
So I have decided to create a custom receiver for the MQTT broker so that I can manage the connection in my own MQTT client. I have gone through the documentation on how to implement a custom receiver, but I did not succeed in implementing it properly.
If somebody has hands-on experience with such a custom receiver, please help me make it work.
I appreciate your support, since this is critical for my solution.
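The kind of connection management I am after looks roughly like the following single paho-mqtt client (1.x-style constructor) that handles many topics over one connection; here I bridge it into Kafka as a stand-in for pushing the data into Spark, and all broker addresses and topic names are just placeholders:

    # One MQTT client connection subscribing to many topics and forwarding every
    # message to a single Kafka topic that the Spark Streaming job consumes.
    import paho.mqtt.client as mqtt
    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": "localhost:9092"})
    MQTT_TOPICS = [("sensors/device%d/telemetry" % i, 0) for i in range(100)]  # 100 topics, 1 connection

    def on_message(client, userdata, msg):
        # Forward the MQTT payload to Kafka, keyed by the originating MQTT topic.
        producer.produce("iot-telemetry", key=msg.topic, value=msg.payload)
        producer.poll(0)                  # serve delivery callbacks without blocking

    client = mqtt.Client()                # paho-mqtt 1.x style constructor
    client.on_message = on_message
    client.connect("mqtt-broker.local", 1883)
    client.subscribe(MQTT_TOPICS)         # one subscribe call, one TCP connection
    client.loop_forever()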

Which is best, polling or real-time, for Google applications like Gmail or Google Drive?

In general everyone says real-time is best for application performance, but is it good to have all applications be real-time?
There are some cases where polling might be better than real-time streaming. Essentially, it's when you have a massive event stream and the client cannot easily cope with this stream in real time. For example, you are pushing tons of events to a mobile device that dequeues the data more slowly than the producer sends it. In such a case, thanks to polling, the client can ask for a new batch of data, process it quietly, then ask for another batch. Of course, all this makes sense if the data producer (the server) is able to resample the data flow so that, at each request, it doesn't need to send all the same data it would send when streaming.
So, to go back to your specific question, Gmail and Google Drive do not produce so much real-time data as to need polling (I know this sounds counterintuitive!), and I would therefore say that real-time streaming will always be better than polling. But streaming is a bit more delicate than polling: you must monitor whether the connection is healthy. It could be half-closed or half-open, and you need bidirectional heartbeats to make sure it's fully alive. In case of disconnection, you must be able to reconnect automatically and restore the state from before the connection broke.
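For completeness, cursor-based polling looks roughly like this (the endpoint, the cursor parameter, and the response shape are just assumptions):

    # The client asks only for data newer than what it has already seen, so the
    # server never has to resend events and the client controls its own pace.
    import time
    import requests

    def handle(event):
        ...  # placeholder for application-specific processing

    def poll_forever(url="https://api.example.com/events", interval=5.0):
        cursor = None
        while True:
            params = {"after": cursor} if cursor else {}
            resp = requests.get(url, params=params, timeout=10)
            resp.raise_for_status()
            batch = resp.json()                  # e.g. {"events": [...], "cursor": "token"}
            for event in batch["events"]:
                handle(event)                    # process the batch at the client's own pace
            cursor = batch.get("cursor", cursor)
            time.sleep(interval)                 # the client decides when to ask again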

Flafka (HTTP -> Flume -> Kafka -> Spark Streaming)

I have a use case for real-time streaming: we will be using Kafka (0.9) as the message buffer and Spark Streaming (1.6) for stream processing (HDP 2.4). We will receive ~80-90K events/sec over HTTP. Can you please suggest a recommended architecture for data ingestion into the Kafka topics that will be consumed by Spark Streaming?
We are considering a Flafka architecture.
Is Flume listening to HTTP and sending to Kafka (Flafka) a good option for real-time streaming?
Please share any other possible approaches.
One approach could be Kafka Connect. Look for a source connector that fits your needs, or develop a custom one.
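If nothing off the shelf fits, the ingestion layer itself is conceptually simple; here is a rough sketch of an HTTP endpoint producing straight into Kafka (Flask, the topic name, and the producer settings are assumptions, and this shows the shape of the path rather than a sizing recommendation for 80-90K events/sec):

    # Accept events over HTTP and hand them to Kafka asynchronously.
    from flask import Flask, request
    from confluent_kafka import Producer

    app = Flask(__name__)
    producer = Producer({
        "bootstrap.servers": "localhost:9092",
        "linger.ms": 5,                   # small batching window helps at high event rates
        "compression.type": "snappy",
    })

    @app.route("/events", methods=["POST"])
    def ingest():
        producer.produce("raw-events", value=request.get_data())
        producer.poll(0)                  # serve delivery callbacks without blocking the request
        return "", 202

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)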