Spark consumer not receiving Kafka messages - Scala

I have a Spark Scala consumer that connects to my Kafka brokers on another cluster (the Kafka cluster is separate from the CDH cluster). params holds my Kafka params, which are being picked up correctly.
val incomingstream = KafkaUtils.createDirectStream[String, String](
streamingContext, .....](topicSet, params))
print(incomingstream)
I am able to produce and consume from the console on my Kafka cluster. But when I run the Spark consumer with the above code, it just keeps waiting, and even though I send messages through the Kafka console producer, nothing shows up in the log prints. incomingstream doesn't get printed.
I have connectivity from the node where the Spark job is running to the Kafka cluster. I am submitting in YARN mode, and the logs show a connection to the Kafka brokers. (Not sure if the issue is because of Kerberos... the logs don't say so.)
Using CDH 5.10
Spark 2.2
Kafka 0.10
Scala 2.11.8
EDIT: The Kafka params are passed in as below. The Spark job connects fine to my Kafka brokers - I verified this in the printed logs.
"bootstrap.servers" -> "<domain>:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean),
"security.protocol" -> "PLAINTEXT"
My Kafka listener is configured as plaintext (not SSL), but if I pass the above, it complains with:
Selector:375 - Connection with 10.18.63.18 disconnected
java.io.EOFException
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:83)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
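For reference, a fully spelled-out version of the createDirectStream call above usually looks like the sketch below. This assumes the elided arguments are the standard location and consumer strategies, and that topicSet and params are as defined in the EDIT; note also that print is an output operation called on the DStream itself, and that the context has to be started before any batches run.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Batch interval is illustrative; the SparkConf is filled in by spark-submit in yarn mode
val streamingContext = new StreamingContext(new SparkConf(), Seconds(5))

// topicSet and params as defined in the question above
val incomingstream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topicSet, params)
)

// print() on the DStream registers an output operation; print(incomingstream)
// only prints the DStream object's toString, not the records
incomingstream.map(record => (record.key, record.value)).print()

streamingContext.start()
streamingContext.awaitTermination()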

Related

How to set a specific offset number while consuming messages from a Kafka topic through Spark Streaming in Scala

I am using the Spark Streaming Scala code below to consume real-time Kafka messages from a producer topic.
But the issue is that sometimes my job fails due to server connectivity or some other reason, and because the auto-commit property is set to true in my code, some messages are lost and never stored in my database.
So I just want to know: is there any way to pull old Kafka messages from a specific offset number?
I tried setting "auto.offset.reset" to earliest or latest, but it fetches only new messages that have not yet been committed.
For example, say my current offset number is 1060 and the auto offset reset property is earliest, so when I restart my job it starts reading messages from 1061. But if in some case I want to read old Kafka messages from offset number 1020, is there any property we can use to start consuming messages from a specific offset number?
import org.apache.kafka.common.serialization.StringDeserializer
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val topic = "test123"

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[KafkaAvroDeserializer],
  "schema.registry.url" -> "http://abc.test.com:8089",
  "group.id" -> "spark-streaming-notes",
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (true: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, Object](
  ssc,
  PreferConsistent,
  Subscribe[String, Object](Seq(topic), kafkaParams)
)

stream.print()
ssc.start()
ssc.awaitTermination()
From Spark Streaming, you can't. You'd need to use the kafka-consumer-groups CLI to commit offsets specific to your group id, or manually construct a KafkaConsumer instance and invoke commitSync before starting the Spark context.
import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.common.TopicPartition

val c = new KafkaConsumer[String, String](...) // use the same group.id as the streaming job
val toCommit: java.util.Map[TopicPartition, OffsetAndMetadata] = ...
c.commitSync(toCommit) // But don't do this every run of your app
c.close()

ssc.start()
Alternatively, Structured Streaming does offer a startingOffsets config (sketched below).
Note that auto.offset.reset only applies when the group.id has no committed offsets yet.
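To illustrate the Structured Streaming alternative: the Kafka source takes a startingOffsets option that can be "earliest", "latest", or a JSON map of per-partition offsets. A minimal sketch, reusing the broker, topic and the offset 1020 from the question (partition 0 is assumed):

import org.apache.spark.sql.SparkSession

// Requires the spark-sql-kafka-0-10 artifact on the classpath
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("StartFromSpecificOffset")
  .getOrCreate()

// Start partition 0 of test123 at offset 1020; startingOffsets only takes
// effect when the query starts without an existing checkpoint
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test123")
  .option("startingOffsets", """{"test123":{"0":1020}}""")
  .load()

// key/value arrive as binary columns; cast before printing
val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .start()

query.awaitTermination()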

Alpakka Kafka consumer offset

I am using Alpakka Kafka in Scala to consume a Kafka topic. Here's my code:
val kafkaConsumerSettings: ConsumerSettings[String, String] =
  ConsumerSettings(actorSystem, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers(kafkaConfig.server)
    .withGroupId(kafkaConfig.group)
    .withProperties(
      ConsumerConfig.MAX_POLL_RECORDS_CONFIG -> "100",
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest",
      CommonClientConfigs.SECURITY_PROTOCOL_CONFIG -> "SSL"
    )

Consumer
  .plainSource(kafkaConsumerSettings, Subscriptions.topics(kafkaConfig.topic))
  .runWith(Sink.foreach(println))
However, the consumer only starts polling from the first uncommitted message in the topic. I would like to always start from offset 0, regardless of which messages have been committed.
With the Alpakka consumer, how do I specify the offset manually?
I think you want to add a couple of config entries (applied to the question's settings in the sketch below):
ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false", so your job never saves any offsets
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest", so your job starts from the beginning
If your job has already committed offsets in the past, you may have to reset its offsets to earliest.
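Applied to the settings from the question, those two entries would look roughly like this (actorSystem and kafkaConfig are the same values used in the question; the property values are the ones suggested above):

import akka.kafka.ConsumerSettings
import org.apache.kafka.clients.CommonClientConfigs
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer

val kafkaConsumerSettings: ConsumerSettings[String, String] =
  ConsumerSettings(actorSystem, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers(kafkaConfig.server)
    .withGroupId(kafkaConfig.group)
    .withProperties(
      ConsumerConfig.MAX_POLL_RECORDS_CONFIG -> "100",
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest", // read from the beginning when no offsets exist
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false",   // never commit, so every run starts fresh
      CommonClientConfigs.SECURITY_PROTOCOL_CONFIG -> "SSL"
    )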

java.lang.RuntimeException for Flink consumer connecting to Kafka cluster with multiple partitions

Flink Version 1.9.0
Scala Version 2.11.12
Kafka Cluster Version 2.3.0
I am trying to connect a Flink job I made to a Kafka cluster that has 3 partitions. I have tested my job against a Kafka topic running on my localhost that has one partition, and it works for both reading and writing to the local Kafka. When I attempt to connect to a topic that has multiple partitions, I get the following error (topicName is the name of the topic I am trying to consume). Weirdly, I don't have any issues when producing to a multi-partition topic.
java.lang.RuntimeException: topicName
at org.apache.flink.streaming.connectors.kafka.internal.KafkaPartitionDiscoverer.getAllPartitionsForTopics(KafkaPartitionDiscoverer.java:80)
at org.apache.flink.streaming.connectors.kafka.internals.AbstractPartitionDiscoverer.discoverPartitions(AbstractPartitionDiscoverer.java:131)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.open(FlinkKafkaConsumerBase.java:508)
at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102)
at org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:529)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:393)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
at java.lang.Thread.run(Thread.java:748)
My consumer code looks like this:
def defineKafkaDataStream[A: TypeInformation](topic: String,
                                              env: StreamExecutionEnvironment,
                                              SASL_username: String,
                                              SASL_password: String,
                                              kafkaBootstrapServer: String = "localhost:9092",
                                              zookeeperHost: String = "localhost:2181",
                                              groupId: String = "test"
                                             )(implicit c: JsonConverter[A]): DataStream[A] = {
  val properties = new Properties()
  properties.setProperty("bootstrap.servers", kafkaBootstrapServer)
  properties.setProperty("security.protocol", "SASL_SSL")
  properties.setProperty("sasl.mechanism", "PLAIN")

  val jaasTemplate = "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"%s\" password=\"%s\";"
  val jaasConfig = String.format(jaasTemplate, SASL_username, SASL_password)
  properties.setProperty("sasl.jaas.config", jaasConfig)
  properties.setProperty("group.id", "MyConsumerGroup")

  env
    .addSource(new FlinkKafkaConsumer(topic, new JSONKeyValueDeserializationSchema(true), properties))
    .map(x => x.convertTo[A](c))
}
Is there another property I should be setting to allow for a single job to consume from multiple partitions?
After digging around and questioning everything in my process, I found the issue.
I looked at the Java code of the KafkaPartitionDiscoverer class that threw the runtime exception.
One section I noticed throws that RuntimeException:
if (kafkaPartitions == null) {
throw new RuntimeException("Could not fetch partitions for %s. Make sure that the topic exists.".format(topic));
}
I was working off of a Kafka cluster that I don't maintain, with a topic name that was given to me and that I had not verified first. When I did verify it using:
kafka-topics --describe --zookeeper serverIP:2181 --topic topicName
It returned this response:
Error while executing topic command : Topics in [] does not exist
ERROR java.lang.IllegalArgumentException: Topics in [] does not exist
at kafka.admin.TopicCommand$.kafka$admin$TopicCommand$$ensureTopicExists(TopicCommand.scala:435)
at kafka.admin.TopicCommand$ZookeeperTopicService.describeTopic(TopicCommand.scala:350)
at kafka.admin.TopicCommand$.main(TopicCommand.scala:66)
at kafka.admin.TopicCommand.main(TopicCommand.scala)
After I got the correct topic name, everything worked.
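If you would rather fail fast inside the job than check with the CLI, one option is a small pre-flight check with the Kafka AdminClient. This is only a sketch: the broker address and topic name are placeholders, and a SASL_SSL cluster like the one above would also need the same security properties passed to the admin client.

import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}

// Placeholder connection details; add the security.protocol / sasl.* properties
// from the consumer code above if the cluster requires SASL_SSL
val adminProps = new Properties()
adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")

val admin = AdminClient.create(adminProps)
try {
  val existingTopics = admin.listTopics().names().get() // blocks until the metadata call returns
  require(existingTopics.contains("topicName"), "Topic 'topicName' does not exist on the cluster")
} finally {
  admin.close()
}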

AbstractMethodError creating Kafka stream

I'm trying to open a Kafka stream (I tried Kafka versions 0.11.0.2 and 1.0.1) using the createDirectStream method, and I'm getting this AbstractMethodError:
Exception in thread "main" java.lang.AbstractMethodError
at org.apache.spark.internal.Logging$class.initializeLogIfNecessary(Logging.scala:99)
at org.apache.spark.streaming.kafka010.KafkaUtils$.initializeLogIfNecessary(KafkaUtils.scala:39)
at org.apache.spark.internal.Logging$class.log(Logging.scala:46)
at org.apache.spark.streaming.kafka010.KafkaUtils$.log(KafkaUtils.scala:39)
at org.apache.spark.internal.Logging$class.logWarning(Logging.scala:66)
at org.apache.spark.streaming.kafka010.KafkaUtils$.logWarning(KafkaUtils.scala:39)
at org.apache.spark.streaming.kafka010.KafkaUtils$.fixKafkaParams(KafkaUtils.scala:201)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.<init>(DirectKafkaInputDStream.scala:63)
at org.apache.spark.streaming.kafka010.KafkaUtils$.createDirectStream(KafkaUtils.scala:147)
at org.apache.spark.streaming.kafka010.KafkaUtils$.createDirectStream(KafkaUtils.scala:124)
This is how I'm calling it:
val preferredHosts = LocationStrategies.PreferConsistent
val kafkaParams = Map(
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[IntegerDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> groupId,
  "auto.offset.reset" -> "earliest"
)
val aCreatedStream = createDirectStream[String, String](ssc, preferredHosts,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))
I have Kafka running on 9092, and I'm able to create producers and consumers and pass messages between them, so I'm not sure why it isn't working from the Scala code. Any ideas appreciated.
Turns out I was using Spark 2.3 when I should've been using Spark 2.2. Apparently that method was made abstract in the later version, so I was getting that error.
I had the same exception; in my case I had built the application jar with a dependency on spark-streaming-kafka-0-10_2.11 version 2.1.0 while trying to deploy to a Spark 2.3.0 cluster.
I received the same error. I set my dependencies to the same version as my Spark interpreter:
%spark2.dep
z.reset()
z.addRepo("MavenCentral").url("https://mvnrepository.com/")
z.load("org.apache.spark:spark-streaming-kafka-0-10_2.11:2.3.0")
z.load("org.apache.kafka:kafka-clients:2.3.0")
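The same principle applies outside Zeppelin: pin the Kafka connector to the Spark version of the cluster you submit to. A build.sbt sketch (the version number is illustrative and should match your cluster):

// Keep the connector version in lockstep with the cluster's Spark version
val sparkVersion = "2.2.0" // must match the Spark runtime you deploy to

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
)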

Cannot pass broker list parameter from Scala to Kafka: Property bootstrap.servers is not valid

I need to consume messages from a topic on a remote Kafka cluster using Scala and Spark. The Kafka port on the remote machine is set to 7072, not the default 9092. Also, the remote machine has the following versions installed:
Kafka 0.10.1.0
Scala 2.11
This means I have to pass the broker list (with port 7072) from Scala to the remote Kafka, because otherwise it will try to use the default port.
The problem is that, according to the logs, the parameter bootstrap.servers is not recognized by the remote machine. I also tried renaming this parameter to metadata.broker.list, broker.list and listeners, but every time the same error appears in the logs, Property bootstrap.servers is not valid, and then port 9092 is used by default (and the messages are obviously not consumed).
In my POM file I use the following dependencies for Kafka and Spark:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.10</artifactId>
  <version>1.6.2</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kafka_2.10</artifactId>
  <version>1.6.2</version>
</dependency>
So, I use Scala 2.10, not 2.11.
This is my Scala code (it works absolutely fine with my own Kafka installation in the Amazon cloud, on EMR machines, where Kafka uses port 9092):
val testTopicMap = testTopic.split(",").map((_, kafkaNumThreads.toInt)).toMap
val kafkaParams = Map[String, String](
  "broker.list" -> "XXX.XX.XXX.XX:7072",
  "zookeeper.connect" -> "XXX.XX.XXX.XX:2181",
  "group.id" -> "test",
  "zookeeper.connection.timeout.ms" -> "10000",
  "auto.offset.reset" -> "smallest")

val testEvents: DStream[String] =
  KafkaUtils
    .createStream[String, String, StringDecoder, StringDecoder](
      ssc,
      kafkaParams,
      testTopicMap,
      StorageLevel.MEMORY_AND_DISK_SER_2
    ).map(_._2)
I was reading this documentation, but it looks like everything I did is correct. Should I use some other Kafka client API (a different Maven dependency)?
UPDATE #1:
I also tried the direct stream (without ZooKeeper), but it runs into this error:
val testTopicMap = testTopic.split(",").toSet
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> "XXX.XX.XXX.XX:7072,XXX.XX.XXX.XX:7072,XXX.XX.XXX.XX:7072",
  "bootstrap.servers" -> "XXX.XX.XXX.XX:7072,XXX.XX.XXX.XX:7072,XXX.XX.XXX.XX:7072",
  "auto.offset.reset" -> "smallest")
val testEvents = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, testTopicMap).map(_._2)
testEvents.print()
17/01/02 12:23:15 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: java.io.EOFException: Received -1 when reading from channel, socket has likely been closed.
java.io.EOFException: Received -1 when reading from channel, socket has likely been closed.
java.io.EOFException: Received -1 when reading from channel, socket has likely been closed.
UPDATE #2:
I found this related topic. The suggested solution says "Fixed it by setting the property 'advertised.host.name' as instructed by the comments in the kafka configuration (config/server.properties)". Do I understand correctly that config/server.properties should be changed on the remote machine where Kafka is installed?
Kafka : How to connect kafka-console-consumer to fetch remote broker topic content?
I think I ran into the same issue recently (EOFException), and the reason was a Kafka version mismatch.
If I look here https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka_2.10/1.6.2, the compile-time Kafka dependency of that streaming artifact is 0.8, whereas you use 0.10.
As far as I know, 0.9 is already not compatible with 0.8. Can you try to set up a local 0.8 or 0.9 broker and try to connect?