Spark Streaming Direct Kafka API, OffsetRanges: How to handle first run - scala

My spark-streaming application reads from Kafka using the direct stream approach, without the help of ZooKeeper. I would like to handle failures so that exactly-once semantics are maintained in my application. I am following this for reference. Everything looks perfect except for:
val stream: InputDStream[(String, Long)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, Long)](
  ssc, kafkaParams, fromOffsets,
  // we're just going to count messages per topic, don't care about the contents, so convert each message to (topic, 1)
  (mmd: MessageAndMetadata[String, String]) => (mmd.topic, 1L))
In the very first run of the application there are no previously saved offsets to read, so what value should I pass for the fromOffsets Map parameter? I am certainly missing something.
Thanks and appreciate any help!

The first offset isn't necessarily 0L, depending on how long the topics have existed.
I personally just pre-insert the appropriate offsets into the database separately. Then the spark job reads the offsets from the database at startup.
The file KafkaCluster.scala in the Spark Kafka integration has methods that make it easier to query Kafka for the earliest available offsets. That class was private, but has been made public in recent Spark releases.
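For the first run, one option is to seed fromOffsets with the earliest offsets Kafka itself reports. The sketch below is only an illustration of that idea using KafkaCluster from the kafka-0-8 integration (assuming a Spark version where the class is public); the broker list and topic name are placeholders:

import kafka.common.TopicAndPartition
import org.apache.spark.streaming.kafka.KafkaCluster

// The broker list is a placeholder; kafkaParams needs at least metadata.broker.list.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val kc = new KafkaCluster(kafkaParams)

// Ask Kafka which partitions the topic has, then ask their leaders for the
// earliest available offset of each one. Both calls return an Either.
val fromOffsets: Map[TopicAndPartition, Long] =
  (for {
    partitions    <- kc.getPartitions(Set("myTopic")).right
    leaderOffsets <- kc.getEarliestLeaderOffsets(partitions).right
  } yield leaderOffsets.map { case (tp, lo) => tp -> lo.offset })
    .fold(errs => throw new RuntimeException(errs.mkString(", ")), identity)

// fromOffsets can now be passed to createDirectStream as in the question;
// on later runs you would load the previously stored offsets instead.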

Related

Kafka consumers (same group-id) stuck always reading from the same partition

I have 2 consumers running with the same group-id, reading from a topic that has 3 partitions and parsing messages with KafkaAvroDeserializer. The consumer has these settings:
def avroConsumerSettings[T <: SpecificRecordBase](schemaRegistry: String, bootstrapServer: String, groupId: String)(implicit
    actorSystem: ActorSystem): ConsumerSettings[String, T] = {
  val kafkaAvroSerDeConfig = Map[String, Any](
    AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG -> schemaRegistry,
    KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG -> true.toString
  )
  val kafkaAvroDeserializer = new KafkaAvroDeserializer()
  kafkaAvroDeserializer.configure(kafkaAvroSerDeConfig.asJava, false)
  val deserializer = kafkaAvroDeserializer.asInstanceOf[Deserializer[T]]
  ConsumerSettings(actorSystem, new StringDeserializer, deserializer)
    .withBootstrapServers(bootstrapServer)
    .withGroupId(groupId)
    .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
    .withProperty(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true")
}
I sent a malformed message to test error handling, and now my consumer is stuck (it always retries reading from the same partition because I'm using RestartSource.onFailuresWithBackoff). What is strange to me is that if I run another consumer, it gets stuck as well, because it again reads from the same partition where the unreadable message is (AFAIK consumers in the same group-id cannot read from the same partition).
Can someone help me understand what I am doing wrong?
When you restart the Kafka source after a failure, that results in a new consumer being created; eventually the consumer in the failed source is declared dead by Kafka, triggering a rebalance. In that rebalance, there are no external guarantees of which consumer in the group will be assigned which partition. This would explain why your other consumer in the group reads that partition.
The issue here of a poison message derailing consumption is a major reason I've developed a preference for treating keys and values from Kafka as blobs: use the ByteArrayDeserializer and do the deserialization myself in the stream. That gives me the ability to record that there was a malformed message in the topic (e.g. by logging it; producing the message to a dead-letter topic for later inspection can also work) and move on by committing the offset. Scala's Either is particularly well suited to carrying either the decoded value or the malformed message straight through to the committer.
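A rough sketch of that approach with Alpakka Kafka follows (assuming a recent Akka/Alpakka version; the broker address, group id, topic, event type, and decode function are all placeholders standing in for your own Avro handling):

import akka.actor.ActorSystem
import akka.kafka.{CommitterSettings, ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.{Committer, Consumer}
import org.apache.kafka.common.serialization.ByteArrayDeserializer
import scala.util.{Failure, Success, Try}

implicit val system: ActorSystem = ActorSystem("blob-consumer")

// Read raw bytes, so a malformed payload can never fail the consumer itself.
val settings = ConsumerSettings(system, new ByteArrayDeserializer, new ByteArrayDeserializer)
  .withBootstrapServers("localhost:9092")
  .withGroupId("my-group")

final case class MyEvent(payload: String)          // placeholder event type
def decode(bytes: Array[Byte]): MyEvent =          // placeholder for your own Avro decoding
  MyEvent(new String(bytes, "UTF-8"))

Consumer
  .committableSource(settings, Subscriptions.topics("my-topic"))
  .map { msg =>
    Try(decode(msg.record.value())) match {
      case Success(event) => println(s"processing $event") // real processing goes here
      case Failure(e)     => println(s"skipping malformed record at offset ${msg.record.offset()}: $e")
    }
    msg.committableOffset                                  // commit either way, so the poison message is skipped
  }
  .runWith(Committer.sink(CommitterSettings(system)))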

Sending to Kafka from a spark job taking too much time

I have a Spark Streaming job which consumes data from Kafka and sends results back to Kafka after doing some processing on each record.
For this I am doing some map operations on the data:
val lines = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicNameMap, StorageLevel.MEMORY_AND_DISK)
var ad = ""
val abc = lines.map(_._2).map { x =>
  val jsonObj = new JSONObject(x)
  val data = someMethod(schema, jsonObj)
  data
}
Then I am doing a foreach operation on it. I am not collecting the data to the driver here, since I want to send the records from inside the executors themselves.
abc.foreachRDD(rdd => {
  rdd.foreach { toSend =>
    val producer = KafkaProducerUtils.getKafkaProducer(kafkaBrokers)
    println("toSend---->" + toSend)
    producer.send(new ProducerRecord[String, String](topicToSend, toSend))
  }
})
I tried this code with 1405 records over a 10-second batch period, but it took approximately 2.5 minutes to complete the job. I know creating a KafkaProducer is costly; is there any other way to reduce the processing time? For my testing I am using 2 executors with 2 cores and 1 GB each.
After searching a lot I found this article about KafkaSink. It gives you an idea of how to produce data to Kafka from inside Spark Streaming in an effective way.
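For reference, here is a minimal sketch of that KafkaSink idea (not the article's exact code): a serializable wrapper around a lazily created producer, broadcast once from the driver and reused on every executor, combined with foreachPartition so that all records of a partition share one producer. kafkaBrokers, topicToSend, and abc are taken from the question's code.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Serializable wrapper: only the factory function is shipped to the executors;
// the producer itself is created lazily on each executor, once.
class KafkaSink(createProducer: () => KafkaProducer[String, String]) extends Serializable {
  lazy val producer = createProducer()
  def send(topic: String, value: String): Unit =
    producer.send(new ProducerRecord[String, String](topic, value))
}

object KafkaSink {
  def apply(brokers: String): KafkaSink = new KafkaSink(() => {
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    sys.addShutdownHook(producer.close())
    producer
  })
}

// Broadcast once on the driver, then reuse inside every partition:
val kafkaSink = ssc.sparkContext.broadcast(KafkaSink(kafkaBrokers))
abc.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    records.foreach(record => kafkaSink.value.send(topicToSend, record))
  }
}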
There could be several reasons for the enormous delay in processing this amount of messages:
The problem could reside in the consume phase. If you use createStream, at least in older versions of Spark, it relies on the high-level consumer implementation, which needs ZooKeeper to store the offsets of the consumers that belong to a specific group. So I'd check that communication, because it could take too much time in the commit phase; and if for any reason offsets are committed one by one, your consume rate could suffer. So first of all, check this.
Another possible reason is the write-ahead log written to the file system. Although your configuration indicates memory and disk, as you can see in the Spark documentation:
Efficiency: Achieving zero-data loss in the first approach required the data to be stored in a Write Ahead Log, which further replicated the data. This is actually inefficient as the data effectively gets replicated twice - once by Kafka, and a second time by the Write Ahead Log. This second approach eliminates the problem as there is no receiver, and hence no need for Write Ahead Logs. As long as you have sufficient Kafka retention, messages can be recovered from Kafka
For better consume rates I would use createDirectStream instead.
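A short sketch of that change, reusing kafkaBrokers and topicNameMap from the question (the direct stream takes broker addresses and a set of topic names, with no receiver and no ZooKeeper-based offset storage):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val directParams = Map("metadata.broker.list" -> kafkaBrokers)
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, directParams, topicNameMap.keySet)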

Using Kafka to communicate between long running Spark jobs

I am new to Apache Spark and have a need to run several long-running processes (jobs) on my Spark cluster at the same time. Often, these individual processes (each of which is its own job) will need to communicate with each other. Tentatively, I'm looking at using Kafka to be the broker in between these processes. So the high-level job-to-job communication would look like:
Job #1 does some work and publishes message to a Kafka topic
Job #2 is set up as a streaming receiver (using a StreamingContext) to that same Kafka topic, and as soon as the message is published to the topic, Job #2 consumes it
Job #2 can now do some work, based on the message it consumed
From what I can tell, streaming contexts are blocking listeners that run on the Spark Driver node. This means that once I start the streaming consumer like so:
def createKafkaStream(ssc: StreamingContext,
    kafkaTopics: String, brokers: String): DStream[(String, String)] = {
  // some configs here
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, props, topicsSet)
}

def consumerHandler(): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(10))
  createKafkaStream(ssc, "someTopic", "my-kafka-ip:9092").foreachRDD(rdd => {
    rdd.collect().foreach { msg =>
      // Now do some work as soon as we receive a message from the topic
    }
  })
  ssc
}

StreamingContext.getActive.foreach {
  _.stop(stopSparkContext = false)
}

val ssc = StreamingContext.getActiveOrCreate(consumerHandler)
ssc.start()
ssc.awaitTermination()
...that there are now 2 implications:
The Driver is now blocking and listening for work to consume from Kafka; and
When work (messages) are received, they are sent to any available Worker Nodes to actually be executed upon
So first, if anything that I've said above is incorrect or misleading, please begin by correcting me! Assuming I'm more or less correct, I'm simply wondering if there is a more scalable or performant way to accomplish this, given my criteria. Again, I have two long-running jobs (Job #1 and Job #2) that are running on my Spark nodes, and one of them needs to be able to 'send work to' the other one. Any ideas?
From what I can tell, streaming contexts are blocking listeners that
run on the Spark Driver node.
A StreamingContext (singular) isn't a blocking listener. Its job is to create the graph of execution for your streaming job.
When you start reading from Kafka, you specify that you want to fetch new records every 10 seconds. What happens from then on depends on which Kafka abstraction you're using: either the receiver approach via KafkaUtils.createStream, or the receiver-less approach via KafkaUtils.createDirectStream.
In both approaches, generally speaking, data is consumed from Kafka and then dispatched to the Spark workers to be processed in parallel.
then I'm simply wondering if there is a more scalable or performant
way to accomplish this
This approach is highly scalable. When using the receiver-less approach, each Kafka partition maps to a Spark partition in a given RDD. You can increase parallelism either by increasing the number of partitions in Kafka, or by re-partitioning the data inside Spark (using DStream.repartition), as sketched below. I suggest testing this setup to determine if it suits your performance requirements.
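A small sketch of the second option, reusing createKafkaStream from the question (the repartition count is an arbitrary example value):

// Each Kafka partition initially becomes exactly one Spark partition per batch;
// repartition trades a shuffle for more downstream tasks than Kafka partitions.
val stream = createKafkaStream(ssc, "someTopic", "my-kafka-ip:9092")
val widened = stream.repartition(12)
widened.foreachRDD(rdd => println(s"processing ${rdd.partitions.length} partitions"))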

Serial consumption of Kafka topics from Spark

Given the following code:
def createKafkaStream(ssc: StreamingContext,
    kafkaTopics: String, brokers: String): DStream[(String, String)] = {
  // some configs here
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, props, topicsSet)
}

def consumerHandler(): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(10))
  createKafkaStream(ssc, "someTopic", "my-kafka-ip:9092").foreachRDD(rdd => {
    rdd.foreach { msg =>
      // Now do some DataFrame-intensive work.
      // As I understand things, DataFrame ops must be run
      // on Workers as well as streaming consumers.
    }
  })
  ssc
}

StreamingContext.getActive.foreach {
  _.stop(stopSparkContext = false)
}

val ssc = StreamingContext.getActiveOrCreate(consumerHandler)
ssc.start()
ssc.awaitTermination()
My understanding is that Spark and Kafka will automagically work together to figure out how many consumer threads to deploy to available Worker Nodes, which likely results in parallel processing of messages off a Kafka topic.
But what if I don't want multiple, parallel consumers? What if I want 1-and-only-1 consumer reading the next message from a topic, processing it completely, and then starting back over again and polling for the next message?
Also, when I call:
val ssc = new StreamingContext(sc, Seconds(10))
Does this mean:
That a single consumer thread will receive all messages that were published to the topic in the last 10 seconds; or
That a single consumer thread will receive the next (single) message from the topic, and that it will poll for the next message every 10 seconds?
But what if I don't want multiple, parallel consumers? What if I want
1-and-only-1 consumer reading the next message from a topic,
processing it completely, and then starting back over again and
polling for the next message?
If that is your use-case, I'd say why use Spark at all? Its entire advantage is that you can read in parallel. The only hacky workaround I can think of is creating a Kafka topic with a single partition, which would make Spark assign the entire offset range to a single worker, but that is ugly.
Does that mean that a single consumer thread will receive all messages that were
published to the topic in the last 10 seconds or that a single
consumer thread will receive the next (single) message from the topic,
and that it will poll for the next message every 10 seconds?
Neither. Since you're using the direct (receiver-less) stream approach, every 10 seconds your driver will ask Kafka for the offset ranges that have changed since the last batch, for each partition of that topic. Spark then takes each such offset range and sends it to one of the workers to consume directly from Kafka. This means that with the direct stream approach there is a 1:1 correspondence between Kafka partitions and Spark partitions.
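You can see those per-batch offset ranges, and the 1:1 partition mapping, directly from the stream. A small sketch, reusing createKafkaStream from the question:

import org.apache.spark.streaming.kafka.HasOffsetRanges

createKafkaStream(ssc, "someTopic", "my-kafka-ip:9092").foreachRDD { rdd =>
  // Each OffsetRange corresponds to one Kafka partition and one Spark partition.
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    println(s"${r.topic} / partition ${r.partition}: offsets ${r.fromOffset} until ${r.untilOffset}")
  }
}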

Spark Streaming Update on Kafka Direct Stream parameter

I have the following code:
// Set basic spark parameters
val conf = new SparkConf()
  .setAppName("Cartographer_jsonInsert")

val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(2))

val messagesDStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, Tuple4[String, Int, Long, String]](
  ssc, getKafkaBrokers, getKafkaTopics("processed"),
  (mmd: MessageAndMetadata[String, String]) => {
    (mmd.topic, mmd.partition, mmd.offset, mmd.message().toString)
  })
getKafkaBrokers and getKafkaTopics call an API that checks the database for new topics as we add them to our system. Does the StreamingContext re-evaluate these variables on each iteration while it is running, so that messagesDStream is re-created with the new values each time?
It does not look like it does; is there any way to make that happen?
Tathagata Das, one of the creators of Spark Streaming, answered a similar question on the Spark User List regarding modifications of existing DStreams.
Currently Spark Streaming does not support addition/deletion/modification of DStream after the streaming context has been started.
Nor can you restart a stopped streaming context.
Also, multiple spark contexts (and therefore multiple streaming contexts) cannot be run concurrently in the same JVM.
I don't see a straightforward way of implementing this with Spark Streaming, as you have no way of updating your graph; you need much more control than is currently available. Maybe a solution based on Reactive Kafka, the Akka Streams connector for Kafka, would work, or any other streaming-based solution where you control the source.
Is there any reason you are not using Akka Streams with reactive-kafka (https://github.com/akka/reactive-kafka)? It is very easy to build a reactive stream where the source is given a topic, a flow processes the messages, and a sink writes out the result, as in the sketch below.
I have built a sample application using the same approach: https://github.com/asethia/akka-streaming-graph
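As a rough illustration of that source → flow → sink shape (not taken from the linked repository; the broker address and topic names are placeholders, and a recent Akka/Alpakka version is assumed):

import akka.actor.ActorSystem
import akka.kafka.scaladsl.{Consumer, Producer}
import akka.kafka.{ConsumerSettings, ProducerSettings, Subscriptions}
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}

implicit val system: ActorSystem = ActorSystem("kafka-graph")

val consumerSettings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
  .withBootstrapServers("localhost:9092")
  .withGroupId("graph-consumer")

val producerSettings = ProducerSettings(system, new StringSerializer, new StringSerializer)
  .withBootstrapServers("localhost:9092")

// Source: consume a topic; flow: process each message; sink: write the result to another topic.
Consumer
  .plainSource(consumerSettings, Subscriptions.topics("in-topic"))
  .map(record => new ProducerRecord[String, String]("out-topic", record.value().toUpperCase))
  .runWith(Producer.plainSink(producerSettings))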