After deploying my ZooKeeper and Kafka clusters on an Alibaba Cloud server, I use my local IDEA to create a StreamingContext and try to connect to the Kafka cluster on the cloud server and consume data. However, the following error is reported:
ERROR StreamingContext: Error starting the context, marking it as stopped
org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition first-1 could be determined
My code is as follows:
val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingWC")
val ssc = new StreamingContext(conf, Seconds(3))
ssc.checkpoint("cpp")
val kafkaPara = Map[String, String](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "hadoop102:9092",
ConsumerConfig.GROUP_ID_CONFIG -> "ryan1",
"key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
"value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer"
)
val kafkaDS = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](Set("first"), kafkaPara)
)
kafkaDS.map(_.value()).print()
ssc.start()
ssc.awaitTermination()
I am learning Kafka in Scala. The attached code is just a word count implementation using Kafka and Spark Streaming.
How do I have a separate consumer running per partition while streaming? Please help!
Here is my code:
class ConsumerM(topics: String, bootstrap_server: String, group_name: String) {
Logger.getLogger("org").setLevel(Level.ERROR)
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
.setMaster("local[*]")
.set("spark.executor.memory","1g")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val topicsSet = topics.split(",")
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> bootstrap_server,
ConsumerConfig.GROUP_ID_CONFIG -> group_name,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
"auto.offset.reset" ->"earliest")
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
Assuming your input topic has multiple partitions, the direct stream already creates one Spark partition (backed by one Kafka consumer) per Kafka topic partition. Setting local[*] additionally gives you one worker thread per CPU core, so up to that many partitions are consumed and processed in parallel.
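If you want to see that per-partition execution explicitly, here is a minimal sketch, assuming messages is the direct stream created in the ConsumerM class above:
messages.foreachRDD { rdd =>
  // the direct stream yields one Spark partition per Kafka partition
  println(s"Spark partitions in this batch: ${rdd.getNumPartitions}")
  rdd.foreachPartition { records =>
    // each partition is handled by its own task (a separate thread in local[*] mode)
    records.foreach(r => println(s"partition ${r.partition()} -> ${r.value()}"))
  }
}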
I am trying to create a Kafka producer connected to a Spark consumer. The producer works fine; however, the consumer in Spark does not read the data from the topic for some reason. I run Kafka using the spotify/kafka image in docker-compose.
Here is my consumer:
object SparkConsumer {
def main(args: Array[String]) {
val spark = SparkSession
.builder()
.appName("KafkaSparkStreaming")
.master("local[*]")
.getOrCreate()
val ssc = new StreamingContext(spark.sparkContext, Seconds(3))
val topic1 = "topic1"
def kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "group1",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val lines = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](Set(topic1), kafkaParams)
)
lines.print()
ssc.start()
ssc.awaitTermination()
}
}
Kafka Producer looks like this:
object KafkaProducer {
def main(args: Array[String]) {
val events = 10
val topic = "topic1"
val brokers = "localhost:9092"
val random = new Random()
val props = new Properties()
props.put("bootstrap.servers", brokers)
props.put("client.id", "KafkaProducerExample")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
val t = System.currentTimeMillis()
for (nEvents <- Range(0, events)) {
val key = null
val values = "2017-11-07 04:06:03"
val data = new ProducerRecord[String, String](topic, key, values)
producer.send(data)
System.out.println("sent : " + data.value())
}
System.out.println("sent per second: " + events * 1000 / (System.currentTimeMillis() - t))
producer.close()
}
}
UPDATE:
My docker-compose file with Kafka:
version: '3.3'
services:
  kafka:
    image: spotify/kafka
    ports:
      - "9092:9092"
This is a common problem when using Kafka with Docker. First, check the configuration ZooKeeper holds for your topic; you can use the ZooKeeper scripts inside the Kafka container. Probably when your topic is created, ADVERTISED_HOST is the name of your service, so when the consumer tries to connect, the broker returns "kafka" as its location. Because you are running the consumer outside the Docker network, it will never reach the broker to consume. Try setting the env for your Kafka container to ADVERTISED_HOST=localhost.
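For example, a minimal sketch of the compose file with that setting (the spotify/kafka image reads the ADVERTISED_HOST and ADVERTISED_PORT environment variables; adjust as needed):
version: '3.3'
services:
  kafka:
    image: spotify/kafka
    ports:
      - "9092:9092"
    environment:
      ADVERTISED_HOST: localhost
      ADVERTISED_PORT: 9092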
I am reading messages from Kafka using Spark Kafka direct streaming. I want to implement zero message loss: after a restart, Spark has to read the missed messages from Kafka. I am using a checkpoint to save all read offsets, so that next time Spark starts reading from the stored offsets. That is my understanding.
I used the code below. I stopped my Spark job and pushed a few messages to Kafka. After the restart, Spark is not reading the missed messages; it reads only the latest messages from Kafka. How do I read the missed messages from Kafka?
val ssc = new StreamingContext(spark.sparkContext, Milliseconds(6000))
ssc.checkpoint("C:/cp")
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "use_a_separate_group_id_for_each_stream",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("test")
val msgStream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
Note: the application logs show auto.offset.reset being overridden to none instead of latest. Why?
WARN KafkaUtils: overriding auto.offset.reset to none for executor
SBT
scalaVersion := "2.11.8"
val sparkVersion = "2.2.0"
val connectorVersion = "2.0.7"
val kafka_stream_version = "1.6.3"
Windows: 7
If you want to read the missed messages, try a commit process instead of checkpointing.
Please understand that Spark can't read old messages with the property:
"auto.offset.reset" -> "latest"
Try this:
val kafkaParams = Map[String, Object](
//...
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (false: java.lang.Boolean)
//...
)
stream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
//Your processing goes here
//Then commit after completing your process.
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
Hope this helps.
I would rather suggest not relying on checkpointing; instead, you can use an external data store to save your processed Kafka message offsets. Please follow this link to get some insight:
https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/
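A minimal sketch of that approach, assuming ssc, topics and kafkaParams from the question, plus two hypothetical helpers loadOffsets/saveOffsets backed by whatever store you choose (RDBMS, ZooKeeper, HBase, ...):
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Hypothetical helpers: implement them against your own store.
def loadOffsets(): Map[TopicPartition, Long] = Map.empty   // an empty map falls back to auto.offset.reset
def saveOffsets(ranges: Array[OffsetRange]): Unit =
  ranges.foreach(r => println(s"${r.topic}-${r.partition}: ${r.untilOffset}"))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams, loadOffsets())   // resume from stored offsets
)

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // process the batch here, then persist the offsets (ideally atomically with your results)
  saveOffsets(offsetRanges)
}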
I made a connector that reads from a database with JDBC, and I am consuming it from a Spark application. The app reads the database data well, BUT it reads only the first 10 rows and seems to ignore the rest of them. How can I get the rest, so I can compute with all the data?
Here is my Spark code:
val brokers = "http://127.0.0.1:9092"
val topics = List("postgres-accounts2")
val sparkConf = new SparkConf().setAppName("KafkaWordCount")
//sparkConf.setMaster("spark://sda1:7077,sda2:7077")
sparkConf.setMaster("local[2]")
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.registerKryoClasses(Array(classOf[Record]))
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpoint")
// Create direct kafka stream with brokers and topics
//val topicsSet = topics.split(",")
val kafkaParams = Map[String, Object](
"schema.registry.url" -> "http://127.0.0.1:8081",
"bootstrap.servers" -> "http://127.0.0.1:9092",
"key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
"value.deserializer" -> "io.confluent.kafka.serializers.KafkaAvroDeserializer",
"group.id" -> "use_a_separate_group_id_for_each_stream",
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val messages = KafkaUtils.createDirectStream[String, Record](
ssc,
PreferConsistent,
Subscribe[String, Record](topics, kafkaParams)
)
val data = messages.map(record => {
println( record) // print only first 10
// compute here?
(record.key, record.value)
})
data.print()
// Start the computation
ssc.start()
ssc.awaitTermination()
I believe the issue is that Spark is lazy and will only read the data that is actually used.
By default, print shows the first 10 elements of each batch. Since the code does not contain any action other than the two prints, there is no need to read more than 10 rows of data. Try using count or another action to confirm that it is working.
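For example, a minimal sketch, reusing data from the question, that touches every row instead of only the first 10:
data.foreachRDD { rdd =>
  // foreach is an action, so every row of the batch is read and processed,
  // unlike print(), which only takes the first 10 elements
  rdd.foreach { case (key, value) =>
    // your per-row computation goes here
    println(s"$key -> $value")
  }
}
// or, just to verify the volume consumed per batch:
data.count().print()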
I have set up a Spark-Kafka consumer in Scala that receives messages from multiple topics:
val properties = readProperties()
val streamConf = new SparkConf().setMaster("local[*]").setAppName("Kafka-Stream")
val ssc = new StreamingContext(streamConf, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> properties.getProperty("broker_connection_str"),
"zookeeper.connect" -> properties.getProperty("zookeeper_connection_str"),
"group.id" -> properties.getProperty("group_id"),
"auto.offset.reset" -> properties.getProperty("offset_reset")
)
// Kafka integration with receiver
val msgStream = KafkaUtils.createStream[Array[Byte], String, DefaultDecoder, StringDecoder](
ssc, kafkaParams, Map(properties.getProperty("topic1") -> 1,
properties.getProperty("topic2") -> 2,
properties.getProperty("topic3") -> 3),
StorageLevel.MEMORY_ONLY_SER).map(_._2)
I need to develop corresponding action code for messages (which will be in JSON format) from each topic.
I referred to the following question, but the answer in it didn't help me:
get topic from Kafka message in spark
So, is there any method on the received DStream that can be used to fetch topic name along with the message to determine what action should take place?
Any help on this would be greatly appreciated. Thank you.
You can get the topic name along with the message via a foreachRDD and a map operation on the DStream, as shown below (this works on the ConsumerRecord stream returned by createDirectStream):
msgStream.foreachRDD(rdd => {
val pairRdd = rdd.map(i => (i.topic(), i.value()))
})
The code below is an example of the createDirectStream setup that I am using.
val ssc = new StreamingContext(configLoader.sparkConfig, Seconds(conf.getInt(Conf.KAFKA_PULL_INTERVAL)))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> conf.getString(Conf.KAFKA_BOOTSTRAP_SERVERS),
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> conf.getString(Conf.KAFKA_CONSUMER_GID),
"auto.offset.reset" -> conf.getString(Conf.KAFKA_AUTO_OFFSET_RESET),
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics: Array[String] = conf.getString(Conf.KAFKA_TOPICS).split(",")
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
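Putting the two pieces together, a minimal sketch (with hypothetical topic names standing in for the ones from your properties file) that dispatches a different action per topic could look like this:
stream.foreachRDD { rdd =>
  rdd.foreach { record =>
    // record is a ConsumerRecord, so the topic name travels with every message
    record.topic() match {
      case "topic1" => println(s"action for topic1: ${record.value()}") // parse topic1 JSON here
      case "topic2" => println(s"action for topic2: ${record.value()}") // parse topic2 JSON here
      case other    => println(s"unhandled topic $other: ${record.value()}")
    }
  }
}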