Why does the kafka consumer code freeze when I start spark stream?

Why does the kafka consumer code freeze when I start spark stream? - scala

I am new to Kafka and trying to implement Kafka consumer logic in spark2 and when I run all my code in the shell and start the streaming it shows nothing.
I have viewed many posts in StackOverflow but nothing helped me. I have even downloaded all the dependency jars from maven and tried to run but it still shows nothing.
Spark Version: 2.2.0
Scala version 2.11.8
jars I downloaded are kafka-clients-2.2.0.jar and spark-streaming-kafka-0-10_2.11-2.2.0.jar
but it still I face the same issue.
Please find the below code snippet
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{StreamingContext, Seconds}
import org.apache.spark.streaming.kafka010.{KafkaUtils, ConsumerStrategies, LocationStrategies}
val brokers = "host1:port, host2:port"
val groupid = "default"
val topics = "kafka_sample"
val topicset = topics.split(",").toSet
val ssc = new StreamingContext(sc, Seconds(2))
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupid,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
)
val msg = KafkaUtils.createDirectStream[String, String](
ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String, String](topicset, kafkaParams)
)
msg.foreachRDD{
rdd => rdd.collect().foreach(println)
}
ssc.start()
I am expecting SparkStreaming to start but it doesn't do anything. What mistake have I done here? Or is this a known issue?

The driver will be sitting idle unless you call ssc.awaitTermination() at the end. If you're using spark-shell then it's not a good tool for streaming jobs.
Please, use interactive tools like Zeppelin or Spark notebook for interacting with streaming or try building your app as jar file and then deploy.
Also, if you're trying out spark streaming, Structured Streaming would be better as it is quite easy to play with.
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

After ssc.start() use ssc.awaitTermination() in your code.
For testing, write your code in a Main Object and run it in any IDE like Intellij
You can use command shell and publish messages from the Kafka producer.
I have written all these steps in a simple example in a blog post with working code in GitHub. Please refer to: http://softwaredevelopercentral.blogspot.com/2018/10/spark-streaming-and-kafka-integration.html

Related

Difference between running spark application as standalone vs spark submit / spark launcher?

I am exploring different options to package spark application and i am confused what is the best mode and what are the differences between the following modes?
Submit spark application's jar to spark-submit
Construct a fat jar out of spark gradle project and run the jar as stand alone java application.
I have tried both the ways , but my requirement is to package the spark application inside docker container , running fat jar looks easy for me but as am a newbie i don't have any idea about the restrictions that i may face if i go with fat jar approach(leaving aside fat jar may grow in size)
Can you please let us know your inputs
Is it possible to setup spark cluster including driver and executors programatically ?
val conf = new SparkConf()
conf.setMaster("local")
conf.set("deploy-mode", "client")
conf.set("spark.executor.instances", "2")
conf.set("spark.driver.bindAddress", "127.0.0.1")
conf.setAppName("local-spark-kafka-consumer")
val sparkSession = SparkSession
.builder()
.master("local[*]")
.config(conf)
.appName("Spark SQL data sources example")
.getOrCreate()
val sc = sparkSession.sparkContext
val ssc = new StreamingContext(sparkSession.sparkContext, Seconds(5))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092,localhost:9093",
"key.deserializer" -> classOf[LongDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "consumerGroup10",
"auto.offset.reset" -> "earliest",
"max.poll.records" -> "1",
"enable.auto.commit" -> (false: java.lang.Boolean))
val topics = Array("topic1")
val stream = KafkaUtils.createDirectStream[String, String](...)
ssc.start()
ssc.awaitTermination()
} catch {
case e: Exception => println(e)
}

Using fat jars for deploying spark jobs is an old and even ancient practice. You can do this, trust me :) Just be careful about what you're writing inside it.

Flink - producing to kinesis not working

I'm trying to run a simple program which reads from one kinesis stream, does a trivial transformation, and writes the result to another kinesis stream.
Running locally on Flink 1.4.0 (this is the version supported on EMR currently, so no way of upgrading).
Here is the code:
def main(args: Array[String]) {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val consumerConfig = new Properties()
consumerConfig.put(AWSConfigConstants.AWS_REGION, "us-east-1")
consumerConfig.put(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST")
val kinesisMaps = env.addSource(new FlinkKinesisConsumer[String](
"source-stream", new SimpleStringSchema, consumerConfig))
val jsonMaps = kinesisMaps.map { jsonStr => JSON.parseFull(jsonStr).get.asInstanceOf[Map[String, String]] }
val values = jsonMaps.map(jsonMap => jsonMap("field_name"))
values.print()
val producerConfig = new Properties()
producerConfig.put(AWSConfigConstants.AWS_REGION, "us-east-1")
val kinesisProducer = new FlinkKinesisProducer[String](new SimpleStringSchema, producerConfig)
kinesisProducer.setFailOnError(true)
kinesisProducer.setDefaultStream("target-stream")
kinesisProducer.setDefaultPartition("0")
values.addSink(kinesisProducer)
// execute program
env.execute("Flink Kinesis")
}
If I comment out the producing code, the program runs as expected and prints the correct values.
As soon as I add the producer code, I get the following exception:
org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.
at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:176)
at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer.addUserRecord(KinesisProducer.java:477)
at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer.invoke(FlinkKinesisProducer.java:248)
at org.apache.flink.streaming.api.functions.sink.SinkFunction.invoke(SinkFunction.java:52)
at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.collect(OperatorChain.java:608)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.collect(OperatorChain.java:569)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809)
at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:41)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809)
at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:41)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$NonTimestampContext.collect(StreamSourceContexts.java:104)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$NonTimestampContext.collectWithTimestamp(StreamSourceContexts.java:111)
at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.emitRecordAndUpdateState(KinesisDataFetcher.java:486)
at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.deserializeRecordForCollectionAndUpdateState(ShardConsumer.java:264)
at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:210)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Any idea what's the cause of this?

Apparently, this is an issue with the old version of Amazon KPL which is used in Flink 1.4.
There are at least two possible solutions for this:
Upgrade to Flink version 1.5.
You can still use it on EMR, if you install it as described here, under the section Custom EMR Installation:
https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/deployment/aws.html
When building the Kinesis connector for Flink 1.4, you can build it with newer AWS dependencies: I've cherry-picked the aws dependency changes in pom.xml of the connector from 1.5, and built the connector with them. Looks like it's working as expected.

Why does Spark Streaming application work fine using sbt run but does not on Tomcat (as web application)?

I have a Spark application in Scala which grabs records from Kafka every 10 seconds and saves them as files. This is SBT project and I run my app with sbt run command. Everything works fine until I deploy my app on Tomcat. I managed to generate WAR file with this plugin but it looks like my app does not do anything when deployed on Tomcat.
This is my code:
object SparkConsumer {
def main (args: Array[String]) {
val conf = new SparkConf().setMaster("local[*]").setAppName("KafkaReceiver")
val ssc = new StreamingContext(conf, Seconds(10))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "group_id",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("mytopic")
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
stream.map(record => (record.key, record.value)).print
val arr = new ArrayBuffer[String]();
val lines = stream.map(record => (record.key, record.value));
stream.foreachRDD { rdd =>
if (rdd.count() > 0 ) {
val date = System.currentTimeMillis()
rdd.saveAsTextFile ("/tmp/sparkout/mytopic/" + date.toString)
rdd.foreach { record => println("t=" + record.topic + " m=" + record.toString()) }
}
println("Stream had " + rdd.count() + " messages")
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
rdd.foreachPartition { iter =>
val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
println(o)
}
}
stream.saveAsTextFiles("/tmp/output")
ssc.start()
ssc.awaitTermination()
}
}
The strange thing is that the app works completely fine when ran via sbt run command. It reads the records from Kafka properly and saves them as files in the desired directory. I have no idea what is happening. I tried to enable logging with log4j but it doesn't even log anything when on Tomcat. I've been looking for an answer but haven't found the solution.
To sum up
My Scala Spark app (which is SBT project) should read records from Kafka and save them as files every 10 seconds. It works when ran via sbt run command but it doesn't when deployed on Tomcat.
Additional info:
Scala 2.12
Tomcat 7
SBT 0.13.15
ask for more
Q: What is the problem?

tl;dr The standalone application SparkConsumer behaves properly on Tomcat and so does Tomcat itself.
I'm very surprised to have read the question because your code is not something that I'd expect working ever on Tomcat. Sorry.
Tomcat is a servlet container and as such requires servlets in a web application.
Even though you managed to create a WAR and deploy it to Tomcat, you did not "trigger" anything from this web application to start a Spark Streaming application (the code inside main method).
The Spark Streaming application does work fine when executed using sbt run because that's the goal of sbt run, i.e. execute standalone application in a sbt-managed project.
Given you have only one standalone application in your sbt project, sbt run has managed to find SparkConsumer and execute its main entry method. No surprise here.
It however won't work on Tomcat. You'd have to expose the application as a POST or GET endpoint and use a HTTP client (a browser or command-line tool like curl, wget or httpie) to execute it.
Spark does not support Scala 2.12 so...how did you manage to use the Scala version with Spark?! Impossible!

flink read data from kafka

I write a simple example
val env = StreamExecutionEnvironment.getExecutionEnvironment
val properties = new Properties()
properties.setProperty("bootstrap.servers","xxxxxx")
properties.setProperty("zookeeper.connect","xxxxxx")
properties.setProperty("group.id", "caffrey")
val stream = env
.addSource(new FlinkKafkaConsumer082[String]("topic", new SimpleStringSchema(), properties))
.print()
env.execute("Flink Kafka Example")
when I run this code I got an error like this:
[error] Class
org.apache.flink.streaming.api.checkpoint.CheckpointNotifier not found
- continuing with a stub.
I google this error and find CheckpointNotifier is an interface.
I really don't understand where did I do wrong.

Since CheckpointNotifier is a class from an older Flink version, I suspect that you are mixing different Flink dependencies in your pom file.
Make sure all Flink dependencies have the same version.

Flink with Kafka Consumer doesn't work

I want to benchmark Spark vs Flink, for this purpose I am making several tests. However Flink doesn't work with Kafka, meanwhile with Spark works perfect.
The code is very simple:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("group.id", "myGroup")
println("topic: "+args(0))
val stream = env.addSource(new FlinkKafkaConsumer09[String](args(0), new SimpleStringSchema(), properties))
stream.print
env.execute()
I use kafka 0.9.0.0 with the same topics (in consumer[Flink] and producer[Kafka console]), but when I send my jar to the cluster, nothing happens:
Cluster Flink
What it could be happening?

Your stream.print will not print in console on flink .It will write to flink0.9/logs/recentlog. Other-wise you can add your own logger for confirming output.

For this particular case (a Source chained into a Sink) the Webinterface will never report Bytes/Records sent/received. Note that this will change in the somewhat near future.
Please check whether the job-/taskmanager logs do not contain any output.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Why does the kafka consumer code freeze when I start spark stream? - scala

Related

Difference between running spark application as standalone vs spark submit / spark launcher?

Flink - producing to kinesis not working

Why does Spark Streaming application work fine using sbt run but does not on Tomcat (as web application)?

flink read data from kafka

Flink with Kafka Consumer doesn't work

Categories

Resources