Storm for batch like processing - apache-kafka

I am new to Storm and have few basic questions. My use case for storm is both stream and batch processing.
Use Case #1: Storm topology takes in the tuples as stream and processes it.
Use Case #2: Storm topology should take in the tuples as a batch of tuples and process it.
I'm using Kafka as the queuing mechanism to feed Storm topology.
Question: Is there a way, where I can tell that the a particular tuple is the end of the stream and storm should tell me when the processing of all the tuples is finished?
Is Storm not the correct framework to do this, as it is meant for stream processing (Use case #1). Would Storm Trident help for use case #2?

You cannot tell Storm, that a tuple is the last of a stream. However, if you know that you just emitted the last tuple from your Spout, you can set an internal flag for yourself and furthermore wait until you received all acks in the Spout. When all acks are received, you know that all tuples got processed completely by Storm.
For question 2, it is not clear to me, what you mean by "do the same processing"? It seems, you want to process the same data twice in different modes (or did I understand something wrong)? Why do you distinguish between "stream" and "batch" case? What is the different semantics you want to get? And what do you mean by "take in the tuples as a batch of tuples". Do you know that you have an finite data stream? Do you want to put all tuples into a single batch? Or do you want to do some micro-batching?
For micro-batching, Trident would be useful. If you have a real batch job, Storm is no good fit. For this, you might want to check out Apache Flink (disclaimer, I am a committer to Flink) or Apache Spark. Both are hybrid systems supporting batch and streaming. Depending on your needs on streaming semantics the one or other might be the better fit. Spark provides micro-batching to emulate streaming while Flink does real streaming.

Related

How to ensure exactly once semantics while processing kafka messages in Apache Storm

I needed exactly once delivery in my app. I explored kafka and realised that to have message produced exactly once, I have to set idempotence=true in producer config. This also sets acks=all, making producer resend messages till all replicas have committed it. To ensure that consumer does not do duplicate processing or leave any message unprocessed, it is advised to commit the processing output and offset to external database in same database transaction, so that either both of them will be persisted or none avoiding duplicate and no processing.
In consumer, message is left processed if consumer first commits it but fails before processing it and message is processed more than once if consumers first processes it but fails before committing it.
Q1. Now I was guessing how can I imitate the same with Apache Storm. I guess exactly once production of message can be ensured by setting idemptence=true in KafkaBolt. Am I right?
I was guessing how I can ensure missed and duplicate message processing in Storm. For example, this doc page says if I anchor a tuple (by passing it as first parameter to OutputCollector.emit()) and then pass the tuple to OutputCollector.ack() or OutputCollector.fail(), Storm will ensure data loss. This is what it exactly says:
Now that you understand the reliability algorithm, let's go over all the failure cases and see how in each case Storm avoids data loss:
A tuple isn't acked because the task died: In this case the spout tuple ids at the root of the trees for the failed tuple will time out and be replayed.
Acker task dies: In this case all the spout tuples the acker was tracking will time out and be replayed.
Spout task dies: In this case the source that the spout talks to is responsible for replaying the messages. For example, queues like Kestrel and RabbitMQ will place all pending messages back on the queue when a client disconnects.
Q2. I guess this ensures that message is not left unprocessed, but does not avoid duplicate processing of messages. Am I correct with this? Also is there anything else that Storm offers to ensure exactly once semantics like kafka that I am missing?
Regarding Q1: Yes, you can get the same behavior from the KafkaBolt by setting that property, the KafkaBolt simply wraps a KafkaProducer.
Regarding semantics on the consuming side, you have the same options with Storm as you do with Kafka. When you read a message from Kafka, you can choose to commit before or after you do your processing (e.g. write to a database). If you do it before, and the program crashes, you will lose the message. Let's call this at-most-once processing. If you do it after, you risk processing the same message twice if the program crashes after the processing but before the commit, called at-least-once processing.
So, regarding Q2: Yes, using anchored tuples and acking will provide you with at-least-once semantics. Not using anchored tuple would give you at-most-once.
Yes, there is something else Storm offers to ensure exactly once semantics called Trident, but it requires you to write your topology differently, and your data store has to be adapted to it so message deduplication can happen. See the documentation at https://storm.apache.org/releases/2.0.0/Trident-tutorial.html.
Also just to caution you: When documentation for Storm (or Kafka) talk about exactly-once semantics, there are some assumptions made about what kind of processing you'll do. For example, when Storm's Trident docs talk about exactly-once, there's an assumption that you'll adapt your database so you can decide when given a message whether it has already been stored. When Kafka's documentation talks about exactly-once, the assumption is that your processing will be reading from Kafka, doing some computation (most likely with no side effects) and writing back to Kafka.
This is just to say that for some types of processing, you may still need to pick between at-least-once and at-most-once. If you can make your processing idempotent, at-least-once is a good option.
Finally if your processing fits the "read from Kafka, do computation, write to Kafka" model, you can likely get nicer semantics out of Kafka Streams than Storm, as Storm can't provide the exactly-once semantics Kafka can provide in that case.

storm failing to process all the tuples

I'm using Apache Storm to process huge data coming off a Kafka spout. Currently, there are over 3k json messages already published to Kafka and it's continuing. I have to process all the messages published from beginning. So, I have set a Kafka spout parameter accordingly.
This results in a lot of failures in tuple processing. I got this info from the storm UI.
I suspect the storm is not able to handle all the messages bombarded towards it in a single shot.
Any help is appreciated.
1) increase the parallelism hint for the bolts so that there's no backlog slowing down the processing for any tuple emitted by the spout, or
2) use the topology.max.spout.pending property to limit the number of tuples the spout can emit before having to wait for one of those tuples to complete.
try combination of both solutions. In production usually you need to run many iterations to get proper value of both the values (parallelism,topology.max.spout.pending)

Spark/Spark Streaming in production without HDFS

I have been developing applications using Spark/Spark-Streaming but so far always used HDFS for file storage. However, I have reached a stage where I am exploring if it can be done (in production, running 24/7) without HDFS. I tried sieving though Spark user group but have not found any concrete answer so far. Note that I do use checkpoints and stateful stream processing using updateStateByKey.
Depending on the streaming(I've been using Kafka), you do not need to use checkpoints etc.
Since spark 1.3 they have implemented a direct approach with so many benefits.
Simplified Parallelism: No need to create multiple input Kafka streams
and union-ing them. With directStream, Spark Streaming will create as
many RDD partitions as there is Kafka partitions to consume, which
will all read data from Kafka in parallel. So there is one-to-one
mapping between Kafka and RDD partitions, which is easier to
understand and tune.
Efficiency: Achieving zero-data loss in the first approach required
the data to be stored in a Write Ahead Log, which further replicated
the data. This is actually inefficient as the data effectively gets
replicated twice - once by Kafka, and a second time by the Write Ahead
Log. This second approach eliminate the problem as there is no
receiver, and hence no need for Write Ahead Logs.
Exactly-once semantics: The first approach uses Kafka’s high level API
to store consumed offsets in Zookeeper. This is traditionally the way
to consume data from Kafka. While this approach (in combination with
write ahead logs) can ensure zero data loss (i.e. at-least once
semantics), there is a small chance some records may get consumed
twice under some failures. This occurs because of inconsistencies
between data reliably received by Spark Streaming and offsets tracked
by Zookeeper. Hence, in this second approach, we use simple Kafka API
that does not use Zookeeper and offsets tracked only by Spark
Streaming within its checkpoints. This eliminates inconsistencies
between Spark Streaming and Zookeeper/Kafka, and so each record is
received by Spark Streaming effectively exactly once despite failures.
If you are using Kafka, you can found out more here:
https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html
Approach 2.

Does it make sense to build a data processing pipeline using only Kafka?

I am building a data processing pipeline using Kafka.
The pipeline is linear with 4 stages.
The data volume is medium (will need more than one machine but not hundreds or thousands; data volume is a few tens of gigabytes)
My question: can I use only Kafka, having a pipeline stage consume from a topic and produce on another topic? Should I be using Spark or Storm and why? Of course, I prefer the simplest possible architecture. If I can do it all with Kafka, I'd prefer that. In the future I may need some additional machine learning stages and that may affect the answer. I have no strong once-only semantics, I can accept some message loss and some duplication with no problem.
My question: can I use only Kafka, having a pipeline stage consume from a topic and produce on another topic? Should I be using Spark or Storm and why?
Technically yes you can. If you are ready to handle the whole distributed architecture on your own. Writing your own multi-threaded producers, managing those consumers and so on. You also need to consider in terms of Scalability, performance, durability etc. And here comes the beauty of using computation engine like Storm, Spark etc. So you can simply concentrate on the core logic and leave the infrastructure be maintained by them.
For example using a combination of Kafka and Storm for your architecture, you can store terabytes of data using kafka and feed them to storm for processing. If you are familiar with storm then a sample topology can be something like this:
(kafka-spout consuming messages from topic) --> ( Bolt-A for processing the data receive through spout & feeding it to bolt B) --> (Bolt-B for pushing back the processed data into another kafka topic)
Using such architecture offers great deal in scalability, throughput, performance etc.Making some easy configuration changes you will be able to tune your application based on your requirements.

Data ingestion with Apache Storm

I have been reading a lot of articles where implementations of Apache Storm are explained for ingesting data from either Apache Flume or Apache Kafka. My main question remains unanswered after reading several articles. What is the main benefit of using Apache Kafka or Apache Flume? Why not collecting data from a source directly into Apache Storm?
To understand this I looked into these frameworks. Correct me if I am wrong.
Apache Flume is about collecting data from a source and pushing data to a sink. The sink being in this case Apache Storm.
Apache Kafka is about collecting data from a source and storing them in a message queue until Apache Storm processes it.
I am assuming you are dealing with the use case of Continuous Computation Algorithms or Real Time Analytics.
Given below is what you will have to go through if you DO NOT use Kafka or any message queue:
(1) You will have to implement functionality like consistency of data.
(2) You are ready to implement replication on your own
(3) You are ready to tackle a variety of failures and ready to build a fault tolerant system.
(4) You will need to create a good design so that your producer and consumer are completely decoupled.
(5) You will have to implement persistence. What happens if your consumer fails?
(6) What happens to fault resilience? Do you want to take the entire system down when your consumer fails?
(7) You will have to implement delivery guarantees as well as ordering guarantees.
All of the above are inherent features of a message queue (Kafka etc.) and you will of-course not like to re-invent the wheel here.
I think the reason for having different configurations could be a matter of how the source data is obtained. Storm spouts (the first elements in the Storm topologies) are meant to synchronously polling for the data, while Flume agents (agent=source+channel+sink) are meant to asynchronously receive the data at the source. Thus, if you have a system that notifies certain events then a Flume agent is required; then this agent would be in charge of receiving the data and putting into any queue management system (ActiveMQ, RabbitMQ...) in order to be cosumed by Storm. The same would apply to Kafka.