Spark consumer doesn't read Kafka producer messages Scala - scala

I am trying to create Kafka producer connected to Spark consumer. The producer works fine, however, the consumer in Spark does not read the data from the topic for some reason. I run kafka using spotify/kafka image in docker-compose.
Here is my consumer:
object SparkConsumer {
def main(args: Array[String]) {
val spark = SparkSession
.builder()
.appName("KafkaSparkStreaming")
.master("local[*]")
.getOrCreate()
val ssc = new StreamingContext(spark.sparkContext, Seconds(3))
val topic1 = "topic1"
def kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "group1",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val lines = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](Set(topic1), kafkaParams)
)
lines.print()
}
Kafka Producer looks like this:
object KafkaProducer {
def main(args: Array[String]) {
val events = 10
val topic = "topic1"
val brokers = "localhost:9092"
val random = new Random()
val props = new Properties()
props.put("bootstrap.servers", brokers)
props.put("client.id", "KafkaProducerExample")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
val t = System.currentTimeMillis()
for (nEvents <- Range(0, events)) {
val key = null
val values = "2017-11-07 04:06:03"
val data = new ProducerRecord[String, String](topic, key, values)
producer.send(data)
System.out.println("sent : " + data.value())
}
System.out.println("sent per second: " + events * 1000 / (System.currentTimeMillis() - t))
producer.close()
}
}
UPDATE:
My docker-compose file with Kafka:
version: '3.3'
services:
kafka:
image: spotify/kafka
ports:
- "9092:9092"

This is a common problem using Kafka with Docker. First, you should check what is the configuration in zookeeper for your topic. You can use the Zookeeper scripts inside the Kafka container. Probably when your topic is created the ADVERTISED_HOST is the name of your service. So when the consumer tries to connect to the broker, this returns "kafka" as the broker location. Because you are running the consumer outside the docker network, your consumer will never connect to the broker for consuming. Try to set the env for your kafka container with ADVERTISED_HOST=localhost.

Related

SparkStreaming can't connect Kafka

After I have deployed my zoomeeper and Kafka clusters on Alibaba cloud server, I use my local idea to establish sparkstreamingcontext and try to connect to the Kafka cluster of the cloud server and consume data. However, the following error is reported. My code is as follows:
ERROR StreamingContext: Error starting the context, marking it as stopped
org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition first-1 could be determined
val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingWC")
val ssc = new StreamingContext(conf, Seconds(3))
ssc.checkpoint("cpp")
val kafkaPara = Map[String, String](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "hadoop102:9092",
ConsumerConfig.GROUP_ID_CONFIG -> "ryan1",
"key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
"value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer"
)
val kafkaDS = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](Set("first"), kafkaPara)
)
kafkaDS.map(_.value()).print()
ssc.start()
ssc.awaitTermination()

How to turn this simple Spark Streaming code into a Multi threaded one?

I am learning Kafka in Scala. The attached code is just a word count implementation using Kafka and Spark Streaming.
How do I have a separate consumer execution per partition whilst streaming? Please help!
Here is my code:
class ConsumerM(topics: String, bootstrap_server: String, group_name: String) {
Logger.getLogger("org").setLevel(Level.ERROR)
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
.setMaster("local[*]")
.set("spark.executor.memory","1g")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val topicsSet = topics.split(",")
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> bootstrap_server,
ConsumerConfig.GROUP_ID_CONFIG -> group_name,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
"auto.offset.reset" ->"earliest")
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
Assuming your input topic has multiple partitions, then additionally setting local[*] means you'll have one Spark executor per CPU core, and at least one partition can be consumed by each

If Kafka Consumer fails (Spark Job), how to fetch the last offset committed by Kafka Consumer. (Scala)

Before I give any details, please note, I am NOT asking how to fetch the latest offset from Console using kafka-run-class.sh kafka.tools.ConsumerOffsetChecker.
I am trying to make a kafka consumer (kafka version 0.10) in Spark (2.3.1) using Scala (2.11.8), which will be fault tolerant. By fault tolerant, I mean, if for some reason the kafka consumer dies and restarts, it should resume consuming the messages from the last offset.
For achieving this, I commit the Kafka offset once it has been consumed using the below code,
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "group_101",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean), /*because messages successfully polled by the consumer may not yet have resulted in a Spark output operation*/
"session.timeout.ms" -> (30000: java.lang.Integer),
"heartbeat.interval.ms" -> (3000: java.lang.Integer)
)
val topic = Array("topic_1")
val offsets = Map(new org.apache.kafka.common.TopicPartition("kafka_cdc_1", 0) -> 2L) /*Edit: Added code to fetch offset*/
val kstream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topic, kafkaParams, offsets) /*Edit: Added offset*/
)
kstream.foreachRDD{ rdd =>
val offsetRange = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
if(!rdd.isEmpty()) {
val rawRdd = rdd.map(record =>
(record.key(),record.value())).map(_._2).toDS()
val df = spark.read.schema(tabSchema).json(rawRdd)
df.createOrReplaceTempView("temp_tab")
df.write.insertInto("hive_table")
}
kstream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRange) /*Doing Async Commit Here */
}
I have tried many thing to fetch the latest Offset for the given topic, but could not get it to work.
Can anyone help me out with the scala code to achieve this, please?
Edit:
In the above code, I am trying to fetch the last offset by using
val offsets = Map(new org.apache.kafka.common.TopicPartition("kafka_cdc_1", 0) -> 2L) /*Edit: Added code to fetch offset*/
but the offset fetched by the above code is 0, not the latest. Is there anyway to fetch the latest offset?
Found the solution to the above issue. Here it is. Hope it helps someone in need.
Language: Scala, Spark Job
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "group_101",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean), /*because messages successfully polled by the consumer may not yet have resulted in a Spark output operation*/
"session.timeout.ms" -> (30000: java.lang.Integer),
"heartbeat.interval.ms" -> (3000: java.lang.Integer)
)
import java.util.Properties
//create a new properties object with Kafaka Parameters as done previously. Note: Both needs to be present. We will use the proprty object just to fetch the last offset
val kafka_props = new Properties()
kafka_props.put("bootstrap.servers", "localhost:9092")
kafka_props.put("key.deserializer",classOf[StringDeserializer])
kafka_props.put("value.deserializer",classOf[StringDeserializer])
kafka_props.put("group.id","group_101")
kafka_props.put("auto.offset.reset","latest")
kafka_props.put("enable.auto.commit",(false: java.lang.Boolean))
kafka_props.put("session.timeout.ms",(30000: java.lang.Integer))
kafka_props.put("heartbeat.interval.ms",(3000: java.lang.Integer))
val topic = Array("topic_1")
/*val offsets = Map(new org.apache.kafka.common.TopicPartition("topic_1", 0) -> 2L) Edit: Added code to fetch offset*/
val topicAndPartition = new org.apache.kafka.common.TopicPartition("topic_1", 0) //Using 0 as the partition because this topic does not have any partitions
val consumer = new KafkaConsumer[String,String](kafka_props) //create a 2nd consumer to fetch last offset
import java.util
consumer.subscribe(util.Arrays.asList("topic_1")) //Subscribe to the 2nd consumer. Without this step, the offsetAndMetadata can't be fetched.
val offsetAndMetadata = consumer.committed(topicAndPartition) //Find last committed offset for the given topicAndPartition
val endOffset = offsetAndMetadata.offset().toLong //fetch the last committed offset from offsetAndMetadata and cast it to Long data type.
val fetch_from_offset = Map(new org.apache.kafka.common.TopicPartition("topic_1", 0) -> endOffset) // create a Map with data type (TopicPartition, Long)
val kstream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topic, kafkaParams, fetch_from_offset) //Pass the offset Map of datatype (TopicPartition, Long) created eariler
)
kstream.foreachRDD{ rdd =>
val offsetRange = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
if(!rdd.isEmpty()) {
val rawRdd = rdd.map(record =>
(record.key(),record.value())).map(_._2).toDS()
val df = spark.read.schema(tabSchema).json(rawRdd)
df.createOrReplaceTempView("temp_tab")
df.write.insertInto("hive_table")
}
kstream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRange) /*Doing Async offset Commit Here */
}

Inconsistent fetch by spark kafka consumer

I have written a code to fetch records from kafka into spark. I have come across some strange behaviour. It is consuming in inconsistent order.
val conf = new SparkConf()
.setAppName("Test Data")
.set("spark.cassandra.connection.host", "192.168.0.40")
.set("spark.cassandra.connection.keep_alive_ms", "20000")
.set("spark.executor.memory", "1g")
.set("spark.driver.memory", "2g")
.set("spark.submit.deployMode", "cluster")
.set("spark.executor.instances", "4")
.set("spark.executor.cores", "3")
.set("spark.cores.max", "12")
.set("spark.driver.cores", "4")
.set("spark.ui.port", "4040")
.set("spark.streaming.backpressure.enabled", "true")
.set("spark.streaming.kafka.maxRatePerPartition", "30")
.set("spark.local.dir", "//tmp//")
.set("spark.sql.warehouse.dir", "/tmp/hive/")
.set("hive.exec.scratchdir", "/tmp/hive2")
val spark = SparkSession
.builder
.appName("Test Data")
.config(conf)
.getOrCreate()
import spark.implicits._
val sc = SparkContext.getOrCreate(conf)
val ssc = new StreamingContext(sc, Seconds(10))
val topics = Map("topictest" -> 1)
val kafkaParams = Map[String, String](
"zookeeper.connect" -> "192.168.0.40:2181",
"group.id" -> "=groups",
"auto.offset.reset" -> "smallest")
val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER)
}
kafkaStream.foreachRDD(rdd =>
{
if (!rdd.partitions.isEmpty) {
try {
println("Count of rows " + rdd.count())
} catch {
case e: Exception => e.printStackTrace
}
} else {
println("blank rdd")
}
})
So, Initially I produced 10 million records in kafka. Now producer is stopped and then started Spark Consumer Application. I checked Spark UI, initially I received 700,000-900,000 records per batch(every 10 seconds ) per stream, afterwards started getting 4-6K records per batch. So wanted to understand why the fetch count fell so badly despite the fact that data is present in Kafka so instead of giving 4k per batch , I'am open to consumer directly big size batch. What can be done and how ?
Thanks,

Kafka - How to read more than 10 rows

I made connector that read from database with jdbc, and consuming it from a Spark application. The app read the database data well, BUT it read only first 10 row and seems to ignore rest of them. How should I get rest, so I can compute with all data.
Here are my spark code:
val brokers = "http://127.0.0.1:9092"
val topics = List("postgres-accounts2")
val sparkConf = new SparkConf().setAppName("KafkaWordCount")
//sparkConf.setMaster("spark://sda1:7077,sda2:7077")
sparkConf.setMaster("local[2]")
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.registerKryoClasses(Array(classOf[Record]))
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpoint")
// Create direct kafka stream with brokers and topics
//val topicsSet = topics.split(",")
val kafkaParams = Map[String, Object](
"schema.registry.url" -> "http://127.0.0.1:8081",
"bootstrap.servers" -> "http://127.0.0.1:9092",
"key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
"value.deserializer" -> "io.confluent.kafka.serializers.KafkaAvroDeserializer",
"group.id" -> "use_a_separate_group_id_for_each_stream",
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val messages = KafkaUtils.createDirectStream[String, Record](
ssc,
PreferConsistent,
Subscribe[String, Record](topics, kafkaParams)
)
val data = messages.map(record => {
println( record) // print only first 10
// compute here?
(record.key, record.value)
})
data.print()
// Start the computation
ssc.start()
ssc.awaitTermination()
I believe the issue lies in that Spark is lazy and will only read the data that is actually used.
By default, print will show the first 10 elements in a stream. Since the code does not contain any other actions in addition to the two print there is no need to read more than 10 rows of data. Try using count or another action to confirm that it is working.