How to integrate Storm and Kafka [closed] - streaming

I have worked with Storm and developed a basic program that uses a local text file as its input source. Now I have to work on streaming data that arrives continuously from external systems, and Kafka seems to be the best choice for this purpose.
The problem is how to make my spout read streaming data from Kafka, i.e. how to integrate Storm with Kafka so that I can process the data coming from Kafka.

Look for KafkaSpout.
It's a normal Storm spout implementation that reads from a Kafka cluster. All you need to do is configure that spout with parameters such as the list of brokers, the topic name, etc. You can then simply chain its output to the corresponding bolts for further processing.
Following the KafkaSpout documentation, the configuration goes like this:
SpoutConfig spoutConfig = new SpoutConfig(
        ImmutableList.of("kafkahost1", "kafkahost2"), // list of Kafka brokers
        8,              // number of partitions per host
        "clicks",       // topic to read from
        "/kafkastorm",  // root path in ZooKeeper for the spout to store the consumer offsets
        "discovery");   // an id for this consumer, used for storing the consumer offsets in ZooKeeper
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
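To actually process the messages, you wire that spout into a topology and attach your bolts to it. Below is a minimal sketch continuing from the snippet above; ClickCounterBolt and the parallelism hints are made-up placeholders, and the classes shown live in backtype.storm.* on old Storm versions (org.apache.storm.* on newer ones):

// Continuing from the snippet above: wire the KafkaSpout into a topology.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka-spout", kafkaSpout, 1);

// ClickCounterBolt is a hypothetical bolt containing your processing logic;
// shuffleGrouping distributes the spout's tuples evenly across its instances.
builder.setBolt("click-counter", new ClickCounterBolt(), 2)
       .shuffleGrouping("kafka-spout");

// Run locally for testing; use StormSubmitter.submitTopology(...) on a real cluster.
new LocalCluster().submitTopology("kafka-storm-demo", new Config(), builder.createTopology());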

Related

Kafka Connect for editing/customizing the message before sending it to Kafka [duplicate]

As I've read in the book Kafka: The Definitive Guide, Kafka Connect can simplify the task of loading CSV files into Kafka. But since Kafka Connect doesn't involve writing any business-logic code (such as Python or Java), what should I do if I want to read data from a CSV file, combine it with data from other sources (or even with data generated from system logs) into a new message, and only then load it into Kafka? Is Kafka Connect still a good approach for this use case?
The source for this answer is this Stack Overflow thread: Kafka Connect- Modifying records before writing into HDFS sink
You have several options.
Single Message Transforms (SMTs) are great for lightweight changes applied to messages as they pass through Connect. They are purely configuration based, and extensible through the provided API if no existing transform does what you want (a sketch of a connector configuration using an SMT follows below). See the discussion in the linked thread on when an SMT is suitable for a given requirement.
KSQL is a streaming SQL engine for Kafka. You can use it to modify your streams of data before sending them to HDFS.
KSQL is built on the Kafka Streams API, which is a Java library and gives you the power to transform your data as much as you'd like.
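For illustration, here is a rough sketch of the SMT option in a source connector's configuration. The connector name, file path, topic, and field values are all made up for the example, and FileStreamSourceConnector simply reads raw lines (a dedicated CSV source connector would be used in practice); InsertField is one of the transforms shipped with Kafka Connect and stamps a static field onto every record:

name=csv-source-example
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
file=/path/to/input.csv
topic=csv-events

# Single Message Transform: tag every record with its origin as it passes through Connect
transforms=addSourceTag
transforms.addSourceTag.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.addSourceTag.static.field=source_system
transforms.addSourceTag.static.value=csv-import

Anything beyond this kind of per-record tweak, such as joining several sources or pulling in data from system logs, is where KSQL or Kafka Streams is the better fit, as described above.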

Why does a Kafka topic not become empty when messages have been taken by a consumer? [duplicate]

I'm learning Kafka, and I'd appreciate it if someone could help me understand one thing.
A "producer" sends a message to a Kafka topic. It stays there for some time (7 days by default, right?).
But a "consumer" receives that message, and there is not much sense in keeping it there forever.
I expected these messages to disappear once a consumer gets them.
Otherwise, when I connect to Kafka again, I will download the same messages again, so I have to manage duplicate avoidance myself.
What's the logic behind this?
Regards
"Producer" send message to Kafka topic. It stays there some time (7 days by default, right?).
Yes, a Producer send the data to a Kafka topic. Each topic has its own configurable cleanup.policy. By default it is set to a retention period of 7 days. You can also configure the retention of the topic based on byte size.
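You can inspect or change those retention settings programmatically with the Java AdminClient. This is only a sketch with made-up values (broker address, topic name, and a 7-day retention expressed in milliseconds):

import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class TopicRetentionExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            // Keep messages for 7 days (in milliseconds), whether or not they have been consumed
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)),
                    AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates = new HashMap<>();
            updates.put(topic, Collections.singletonList(setRetention));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}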
But "consumer" receives such message and there is not much sense to keep it there eternally.
Kafka can be seen as a Publisher/Subscribe messaging system (although mainly being a streaming platform). It has the great benefit that more than one single Consumer can read the same messages of a topic. Compared to other messaging systems the data is not deleted after acknowledged by a consumer.
Otherwise, when I connect to Kafka again, I will download the same messages again. So I have to manage duplicate avoidance.
Kafka has the concept of "Offsets" and "ConsumerGroups" and I highly recommend to get familiar with them as they are essential when working with Kafka. Each consumer is part of a ConsumerGroup and each message in a topic has a unique identifer called "offset". An offset is like a unique identifer that stays with the same message for its life-time.
Each ConsumerGroup keeps track of the messages (offsets) that it already consumed. Now, if you do not want to read the same messages again your ConsumerGroup just have to commit those offsets and it will not read them again.
That way you will not consume duplicates, but still other consumers (with a differen ConsumerGroup) are able to read all messages again.
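As a small illustration of how a consumer group and committed offsets keep you from re-reading messages, here is a minimal consumer sketch. The broker address, topic name, and group id are placeholders; the key points are the group.id and the offset commit:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetCommitExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");   // the consumer group that tracks offsets
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");     // commit offsets explicitly below

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                // Committing tells Kafka where this group left off; the messages themselves stay
                // in the topic until retention removes them, but this group won't re-read them.
                consumer.commitSync();
            }
        }
    }
}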

Best practice for sending messages to many applications in Kafka [closed]

I have a topic with many partitions, and I want many applications to read all the messages from this topic. Some applications read frequently, others only at midnight.
I haven't found any help for this problem on Stack Overflow or in a book.
How can I implement that in Kafka?
Thanks
You'd need Kafka consumers in order to let your applications consume messages from Kafka topics. If each of your applications needs to consume all messages, then you have to assign a distinct group.id to every application.
Kafka assigns the partitions of a topic to the consumers in a group, and it guarantees that a message is only ever read by a single consumer within that group. Partitions are the unit of parallelism in Kafka: with 2 partitions you can have at most 2 active consumers in the same consumer group.
Example:
Say you have a topic example-topic with 5 partitions and 2 applications, each of which must consume all the messages from example-topic. You will need two distinct consumer groups (one per application), say group.id=app-group-1 for the first app and group.id=app-group-2 for the second. Within each consumer group you can start at most 5 consumers reading from this topic, so up to 5 consumers can subscribe to example-topic in app-group-1 and another 5 in app-group-2.
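A minimal sketch of how the two applications differ only in their group.id (the broker address and the helper method are made up for the example; everything else about the consumers can be identical):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ExampleTopicConsumers {
    // Builds a consumer for one application; only the group.id differs per application.
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("example-topic"));
        return consumer;
    }

    public static void main(String[] args) {
        KafkaConsumer<String, String> app1 = consumerFor("app-group-1"); // in application 1
        KafkaConsumer<String, String> app2 = consumerFor("app-group-2"); // in application 2
        // Both groups independently receive every message of example-topic,
        // no matter how often (or how rarely) each application polls.
    }
}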

Is there any difference between KafkaConsumer and KafkaStreams? [duplicate]

I'm using Apache Kafka 0.8.2.1 and planning to upgrade my application to Apache Kafka 1.0.0.
While looking into Kafka Streams, I got some questions about the difference between KafkaConsumer and KafkaStreams.
Basically, a KafkaConsumer has to consume from the broker by polling. I can specify a poll duration, and whenever I get ConsumerRecords I can handle them to produce some useful information.
With KafkaStreams, on the other hand, I don't have to specify any polling duration; I just call the start() method.
I know that a KafkaConsumer is basically used to consume, literally, from the broker, while KafkaStreams can do various things like map/reduce-style processing, interact with a database, or even re-produce the results to another Kafka cluster or any other system.
So here is my question: is there any fundamental difference between KafkaConsumer and KafkaStreams (in other words, at the level of the Apache Kafka libraries)?
Yes, there is a difference between the Kafka Consumer and Kafka Streams.
The Kafka Consumer is used at the receiving end to fetch data (per topic and partition) and process it for further computation.
The Kafka Streams API is used to store, transform, and distribute published content, in real time, to the various applications and systems that make it available to readers.
As you've said, they offer different functionalities:
KafkaStreams allows you to perform complex processing on records
KafkaConsumer allows you to receive records from a Kafka cluster
KafkaStreams uses regular KafkaConsumer and KafkaProducer clients under the covers in order to retrieve records and send the results of its processing to the brokers. It uses predefined values for many configurations but still exposes a lot of the client configurations.
KafkaStreams is a regular (although pretty advanced) Kafka application built on the Kafka clients (Consumer and Producer). Its APIs allow higher-level applications to focus on the business logic rather than on the Kafka details.
Also, being part of the Apache Kafka distribution, it uses best practices and tricks to make the most of Kafka.
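To make the contrast concrete, here is a minimal Kafka Streams sketch (the topic names, the application.id, and the uppercase transformation are all made up for the example). Note there is no explicit poll loop; the library runs its own consumers and producers once start() is called:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-example");   // also used as the consumer group id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");
        // A trivial transformation; joins, aggregations, windowing, etc. are built the same way
        input.mapValues(value -> value.toUpperCase())
             .to("output-topic");

        // Under the covers this starts regular KafkaConsumers/KafkaProducers and runs the poll loop for you
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}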

The benefits of Flink Kafka Stream over Spark Kafka Stream? And Kafka Stream over Flink? [closed]

In Spark Streaming, we set the batch interval for near-realtime micro-batch processing. In Flink (DataStream) or Storm, the stream is processed in real time, so I guess there is no such concept of a batch interval there.
In Kafka, the consumer pulls. I imagine Spark uses the batch interval parameter to pull messages from the Kafka brokers, so how do Flink and Storm do it? I imagine that Flink and Storm pull the Kafka messages in a fast loop to form the realtime stream source. If so, and if I set the Spark batch interval to something small such as 100ms, 50ms, or even less, is there still a significant difference between Spark Streaming and Flink or Storm?
Meanwhile, in Spark, if the streaming data volume is large and the batch interval is too small, we may end up with lots of data waiting to be processed, and there is a chance we will see OutOfMemory errors. Could that happen in Flink or Storm?
I have implemented an application that does topic-to-topic transformation. The transformation is easy, but the source data could be huge (consider it an IoT app). My original implementation is backed by reactive-kafka and works fine in my standalone Scala/Akka app. I did not implement the application to be clustered, because if I need that, Flink/Storm/Spark are already there. Then I found Kafka Streams, which to me looks similar to reactive-kafka from the point of view of client usage. So, if I use Kafka Streams or reactive-kafka in standalone applications or microservices, do I have to worry about the reliability/availability of the client code?
Your understanding of micro-batch vs. stream processing is correct. You are also right that all three systems use the standard Java consumer provided by Kafka to pull data for processing in an infinite loop.
The main difference is that Spark needs to schedule a new job for each micro-batch it processes. This scheduling overhead is quite high, so Spark cannot handle very low batch intervals like 100ms or 50ms efficiently, and throughput goes down for such small batches.
Flink and Storm are both true streaming systems, so they deploy the job only once at startup (and the job runs continuously until it is explicitly shut down by the user). They can therefore handle each individual input record with low overhead and very low latency.
Furthermore, for Flink, JVM main memory is not a limitation, because Flink can use off-heap memory and can also write to disk if the available main memory is too small. (By the way, since project Tungsten, Spark can also use off-heap memory, but it can only spill to disk to some extent, differently from Flink, AFAIK.) Storm, AFAIK, does neither and is limited to JVM memory.
I am not familiar with reactive-kafka.
Kafka Streams is a fully fault-tolerant, stateful stream processing library. It is designed for microservice-style development (you do not need a dedicated processing cluster as with Flink/Storm/Spark); you can deploy your application instances anywhere and in any way you want, and you scale your application simply by starting up more instances. Check out the documentation for more details: http://docs.confluent.io/current/streams/index.html (there are also interesting posts about Kafka Streams on the Confluent blog: http://www.confluent.io/blog/)