I have one Kafka partition and one Spark Streaming application, running on a server with 10 cores. When Spark Streaming gets one message from Kafka, the subsequent processing takes 5 seconds (this is my code). So I found that Spark Streaming reads Kafka messages very slowly, and I'm guessing that once Spark reads a message it waits until that message has been processed, so reading and processing are synchronized.
I was wondering whether I can make Spark read asynchronously, so that reading from Kafka isn't dragged down by the subsequent processing. Then Spark would consume data from Kafka very quickly, and I could focus on the slow data processing inside Spark. By the way, I'm using the foreachRDD function.
You can increase the number of partitions in Kafka, which should improve the parallelism. You can also try the direct Kafka approach ("Direct kafka receiver"), which really improves performance when your app is reading from Kafka.
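For reference, here is a minimal sketch of the direct approach combined with foreachRDD (assuming the spark-streaming-kafka-0-10 connector; the topic, broker address, and the 5-second processing stand-in are hypothetical). Because each Kafka partition maps to one RDD partition, adding Kafka partitions directly adds parallel processing tasks:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    val conf = new SparkConf().setAppName("direct-kafka-example")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "group.id" -> "slow-processing-app"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      // One task per Kafka partition; the slow per-message work runs here in parallel.
      rdd.foreach { record =>
        Thread.sleep(5000)          // stand-in for the 5-second processing from the question
        println(record.value())
      }
    }

    ssc.start()
    ssc.awaitTermination()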
I am using Spark 2.1 and Kafka 0.10.1.
I want to process the data by reading all the data from specific Kafka topics once a day.
For Spark Streaming, I know that createDirectStream only needs a list of topics and some configuration information as arguments.
However, I realized that createRDD requires all of the topic, partition, and offset information.
I want to make batch processing as convenient as streaming in Spark.
Is it possible?
I suggest you read this text from Cloudera.
That example shows how to read data from Kafka exactly once, persisting the offsets in Postgres and relying on its ACID guarantees.
So I hope that will solve your problem.
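As a rough sketch of what such a daily batch job can look like (this is not the Cloudera/Postgres recipe from the link; it just shows the shape of the createRDD call, assuming the spark-streaming-kafka-0-10 connector and hypothetical topic/broker names), you can look the partitions and offsets up from Kafka itself so that only the topic name needs to be configured:

    import scala.collection.JavaConverters._

    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}

    val topic = "events"                                  // hypothetical topic name
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",               // hypothetical broker address
      "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "group.id" -> "daily-batch"
    )

    // Use a plain KafkaConsumer only to discover the partitions and their offset ranges,
    // so the job itself just needs the topic name.
    val probe = new KafkaConsumer[String, String](kafkaParams.asJava)
    val partitions = probe.partitionsFor(topic).asScala.map(p => new TopicPartition(topic, p.partition()))
    val begin = probe.beginningOffsets(partitions.asJava).asScala
    val end = probe.endOffsets(partitions.asJava).asScala
    probe.close()

    val offsetRanges = partitions.map(tp => OffsetRange(tp.topic, tp.partition, begin(tp), end(tp))).toArray

    val sc = new SparkContext(new SparkConf().setAppName("daily-kafka-batch"))
    val rdd = KafkaUtils.createRDD[String, String](
      sc, kafkaParams.asJava, offsetRanges, LocationStrategies.PreferConsistent)
    println(s"Read ${rdd.count()} records from $topic")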
In Spark Streaming, we set the batch interval for near-real-time micro-batch processing. In Flink (DataStream) or Storm, the stream is real-time, so I guess there is no such concept of a batch interval.
In Kafka, the consumer pulls. I imagine Spark uses the batch interval parameter to pull messages from the Kafka broker, so how do Flink and Storm do it? I imagine Flink and Storm pull the Kafka messages in a tight loop to form the real-time stream source. If so, and if I set the Spark batch interval to be small, such as 100 ms, 50 ms, or even less, is there still a significant difference between Spark Streaming and Flink or Storm?
Meanwhile, in Spark, if the streaming data volume is large and the batch interval is too small, we may end up with lots of data waiting to be processed, and therefore there is a chance we will see OutOfMemory errors. Would that happen in Flink or Storm?
I have implemented an application that does topic-to-topic transformation. The transformation is easy, but the source data could be huge (consider it an IoT app). My original implementation is backed by reactive-kafka, and it works fine in my standalone Scala/Akka app. I did not implement the application to be clustered, because if I need that, Flink/Storm/Spark are already there. Then I found Kafka Streams, which to me looks similar to reactive-kafka from the client-usage point of view. So, if I use Kafka Streams or reactive-kafka in standalone applications or microservices, do we have to worry about the reliability/availability of the client code?
Your understanding of micro-batch vs. stream processing is correct. You are also right that all three systems use the standard Java consumer provided by Kafka to pull data for processing in an infinite loop.
The main difference is that Spark needs to schedule a new job for each micro-batch it processes. This scheduling overhead is quite high, so Spark cannot handle very low batch intervals like 100 ms or 50 ms efficiently, and throughput goes down for such small batches.
Flink and Storm are both true streaming systems, so both deploy the job only once at startup (the job runs continuously until explicitly shut down by the user), and thus they can handle each individual input record without overhead and with very low latency.
Furthermore, for Flink, JVM main memory is not a limitation because Flink can use off-heap memory as well as write to disk if the available main memory is too small. (By the way: since project Tungsten, Spark can also use off-heap memory and can spill to disk to some extent -- but differently than Flink, AFAIK.) Storm, AFAIK, does neither and is limited to JVM memory.
I am not familiar with reactive-kafka.
As for Kafka Streams, it is a fully fault-tolerant, stateful stream-processing library. It is designed for microservice development (you do not need a dedicated processing cluster as for Flink/Storm/Spark); you can deploy your application instances anywhere and in any way you want, and you scale your application simply by starting more instances. Check out the documentation for more details: http://docs.confluent.io/current/streams/index.html (there are also interesting posts about Kafka Streams on the Confluent blog: http://www.confluent.io/blog/)
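For a feel of what the question's topic-to-topic transformation looks like as a Kafka Streams application, here is a minimal sketch (it uses the newer StreamsBuilder API rather than the 0.10-era KStreamBuilder; the topic names, application id, and the toUpperCase transformation are placeholders):

    import java.util.Properties

    import org.apache.kafka.common.serialization.Serdes
    import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
    import org.apache.kafka.streams.kstream.ValueMapper

    object TopicToTopic {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic-to-topic-transform") // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")           // hypothetical broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

        val builder = new StreamsBuilder()
        builder.stream[String, String]("input-topic")
          .mapValues(new ValueMapper[String, String] {
            override def apply(value: String): String = value.toUpperCase // placeholder transformation
          })
          .to("output-topic")

        val streams = new KafkaStreams(builder.build(), props)
        streams.start()                       // scale out by simply starting more instances
        sys.addShutdownHook(streams.close())
      }
    }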
I am experiencing an issue starting Spark Streaming on a really big Kafka topic: there are already around 150 million records in this topic, and it is growing super fast.
When I try to start Spark Streaming and read from the beginning of this topic by setting the Kafka parameter ("auto.offset.reset" -> "smallest"), it always tries to process all 150 million records in the first batch and fails with "java.lang.OutOfMemoryError: GC overhead limit exceeded". There isn't a lot of computation in this Spark Streaming app, though.
Is there a way to process the history data in this topic over the first several batches rather than all in the first batch?
Bunch of thanks in advance!
James
You can control the Kafka input read rate with the Spark configuration spark.streaming.kafka.maxRatePerPartition.
It sets the maximum number of records read per second from each Kafka partition.
sparkConf.set("spark.streaming.kafka.maxRatePerPartition", "<records-per-second>")
With this setting, each batch processes at most <records-per-second> * <batch interval in seconds> * <number of partitions> records.
You can find more info about this config here.
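A minimal sketch of how this could look for the 150-million-record topic (the numbers are made up; spark.streaming.backpressure.enabled is an additional, optional setting):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("bounded-history-read")
      .set("spark.streaming.kafka.maxRatePerPartition", "10000") // at most 10,000 records/sec per partition
      .set("spark.streaming.backpressure.enabled", "true")       // let Spark adapt the rate after the first batches

    val ssc = new StreamingContext(conf, Seconds(10))
    // With one partition, each 10-second batch now pulls at most 10,000 * 10 = 100,000 records,
    // so the 150 million historical records are spread over many batches instead of the first one.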
I am completely new to Big Data; for the last few weeks I have been trying to build a log analysis application.
I read many articles, and I found that Kafka + Spark Streaming is the most reliable configuration.
Now I am able to process data sent from my simple Kafka Java producer to Spark Streaming.
Can someone please suggest a few things, like:
1) How can I read server logs in real time and pass them to a Kafka broker?
2) Are any frameworks available to push data from logs to Kafka?
3) Any other suggestions?
Thanks,
Chowdary
There are many ways to collect logs and send them to Kafka. If you are looking to send log files as a stream of events, I would recommend reviewing Logstash/Filebeat: just set up your input as a file input and your output to Kafka.
You may also push data to Kafka using the log4j KafkaAppender, or pipe logs to Kafka using one of the many CLI tools already available.
In case you need to guarantee sequence, pay attention to the partition configuration and partition selection logic. For example, the log4j appender will distribute messages across all partitions. Since Kafka guarantees sequence per partition only, your Spark Streaming jobs may start processing events out of sequence.
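If you just want to see the mechanics without Logstash/Filebeat or a log4j appender, here is a minimal Scala sketch that reads an existing log file and produces each line to Kafka (the file path, broker address, and topic name are hypothetical; it does not tail the file, it only ships what is currently there):

    import java.util.Properties
    import scala.io.Source

    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object LogShipper {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "broker:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        // Keying by host name keeps one host's lines in one partition, preserving their order.
        val host = java.net.InetAddress.getLocalHost.getHostName
        for (line <- Source.fromFile("/var/log/app/server.log").getLines()) {
          producer.send(new ProducerRecord[String, String]("server-logs", host, line))
        }
        producer.close()
      }
    }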
I have been developing applications using Spark/Spark Streaming, but so far I have always used HDFS for file storage. However, I have reached a stage where I am exploring whether it can be done (in production, running 24/7) without HDFS. I tried sifting through the Spark user group but have not found a concrete answer so far. Note that I do use checkpoints and stateful stream processing with updateStateByKey.
Depending on the streaming source (I've been using Kafka), you do not need to use checkpoints etc.
Since Spark 1.3 there is a direct approach with many benefits:
Simplified Parallelism: No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.
Efficiency: Achieving zero data loss in the first approach required the data to be stored in a Write Ahead Log, which further replicated the data. This is actually inefficient as the data effectively gets replicated twice -- once by Kafka, and a second time by the Write Ahead Log. This second approach eliminates the problem as there is no receiver, and hence no need for Write Ahead Logs.
Exactly-once semantics: The first approach uses Kafka's high-level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write-ahead logs) can ensure zero data loss (i.e. at-least-once semantics), there is a small chance some records may get consumed twice under some failures. This occurs because of inconsistencies between data reliably received by Spark Streaming and offsets tracked by Zookeeper. Hence, in this second approach, we use a simple Kafka API that does not use Zookeeper; offsets are tracked only by Spark Streaming within its checkpoints. This eliminates inconsistencies between Spark Streaming and Zookeeper/Kafka, so each record is received by Spark Streaming effectively exactly once despite failures.
If you are using Kafka, you can find out more under "Approach 2" here:
https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html
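As a sketch of running the direct approach without an HDFS checkpoint directory (this uses the newer spark-streaming-kafka-0-10 integration rather than the 1.3 API from the link; topic, broker, and group names are hypothetical), you can let Spark track the offsets and commit them back to Kafka after each batch:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._

    val ssc = new StreamingContext(new SparkConf().setAppName("no-hdfs-direct"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "group.id" -> "no-hdfs-app",
      "enable.auto.commit" -> (false: java.lang.Boolean)   // offsets are committed manually per batch
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreach(record => println(record.value()))                  // your processing goes here
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) // progress lives in Kafka, not HDFS
    }

    ssc.start()
    ssc.awaitTermination()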