Given the following code:
def createKafkaStream(ssc: StreamingContext,
                      kafkaTopics: String, brokers: String): DStream[(String, String)] = {
  // some configs here
  KafkaUtils.createDirectStream[String, String, StringDecoder,
    StringDecoder](ssc, props, topicsSet)
}

def consumerHandler(): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(10))

  createKafkaStream(ssc, "someTopic", "my-kafka-ip:9092").foreachRDD(rdd => {
    rdd.foreach { msg =>
      // Now do some DataFrame-intensive work.
      // As I understand things, DataFrame ops must be run
      // on Workers as well as streaming consumers.
    }
  })

  ssc
}

StreamingContext.getActive.foreach {
  _.stop(stopSparkContext = false)
}
val ssc = StreamingContext.getActiveOrCreate(consumerHandler)
ssc.start()
ssc.awaitTermination()
My understanding is that Spark and Kafka will automagically work together to figure out how many consumer threads to deploy to available Worker Nodes, which likely results in parallel processing of messages off a Kafka topic.
But what if I don't want multiple, parallel consumers? What if I want 1-and-only-1 consumer reading the next message from a topic, processing it completely, and then starting back over again and polling for the next message?
Also, when I call:
val ssc = new StreamingContext(sc, Seconds(10))
Does this mean:
That a single consumer thread will receive all messages that were published to the topic in the last 10 seconds; or
That a single consumer thread will receive the next (single) message from the topic, and that it will poll for the next message every 10 seconds?
But what if I don't want multiple, parallel consumers? What if I want
1-and-only-1 consumer reading the next message from a topic,
processing it completely, and then starting back over again and
polling for the next message.
If that is your use-case, I'd say why use Spark at all? Its entire advantage is that you can read in parallel. The only hacky workaround I can think of is creating a Kafka topic with a single partition, which would make Spark assign the entire offset range to a single worker, but that is ugly.
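If you do go that route, here is a rough sketch of creating such a topic with the Kafka AdminClient (available in newer kafka-clients; the same can be done with the kafka-topics command-line tool). The topic name is a placeholder, and the broker address is taken from your snippet:
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "my-kafka-ip:9092") // assumed broker address

val admin = AdminClient.create(props)
// One partition means the direct stream yields a single Spark partition,
// so the whole offset range lands on a single worker.
admin.createTopics(Collections.singletonList(new NewTopic("single-consumer-topic", 1, 1.toShort)))
admin.close()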
Does that mean that a single consumer thread will receive all messages that were
published to the topic in the last 10 seconds or that a single
consumer thread will receive the next (single) message from the topic,
and that it will poll for the next message every 10 seconds?
Neither. Since you're using the direct (receiverless) stream approach, it means that every 10 seconds your driver asks Kafka for the offset ranges that have changed since the last batch, for each partition of the topic. Spark then takes each such offset range and sends it to one of the workers, which consumes it directly from Kafka. This means that with the direct stream approach there is a 1:1 correspondence between Kafka partitions and Spark partitions.
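You can see that 1:1 mapping for yourself by inspecting the offset ranges inside foreachRDD. A small sketch, using the topic and broker from your snippet (this assumes the 0.8-style spark-streaming-kafka integration your code already uses, e.g. inside your consumerHandler where ssc is in scope):
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

createKafkaStream(ssc, "someTopic", "my-kafka-ip:9092").foreachRDD { rdd =>
  // One OffsetRange per Kafka partition, and one Spark partition per OffsetRange.
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    println(s"topic=${r.topic} partition=${r.partition} offsets=[${r.fromOffset}, ${r.untilOffset})")
  }
}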
I have 2 consumers running with the same group-id, reading from a topic that has 3 partitions and parsing messages with KafkaAvroDeserializer. The consumer has these settings:
def avroConsumerSettings[T <: SpecificRecordBase](schemaRegistry: String, bootstrapServer: String, groupId: String)(implicit
    actorSystem: ActorSystem): ConsumerSettings[String, T] = {
  val kafkaAvroSerDeConfig = Map[String, Any](
    AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG -> schemaRegistry,
    KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG -> true.toString
  )

  val kafkaAvroDeserializer = new KafkaAvroDeserializer()
  kafkaAvroDeserializer.configure(kafkaAvroSerDeConfig.asJava, false)
  val deserializer = kafkaAvroDeserializer.asInstanceOf[Deserializer[T]]

  ConsumerSettings(actorSystem, new StringDeserializer, deserializer)
    .withBootstrapServers(bootstrapServer)
    .withGroupId(groupId)
    .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
    .withProperty(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true")
}
I tried to send a malformed message to test error handling, and now my consumer is stuck (it always retries reading from the same partition because I'm using RestartSource.onFailuresWithBackoff). But what is strange to me (AFAIK two consumers with the same group-id cannot read from the same partition) is that if I run another consumer, it gets stuck as well, because it again reads from the same partition where the unreadable message is.
Can someone help me understand what I am doing wrong?
When you restart the Kafka source after a failure, that results in a new consumer being created; eventually the consumer in the failed source is declared dead by Kafka, triggering a rebalance. In that rebalance, there are no external guarantees of which consumer in the group will be assigned which partition. This would explain why your other consumer in the group reads that partition.
The issue of a poison message derailing consumption is a major reason I've developed a preference for treating keys and values from Kafka as blobs: use the ByteArrayDeserializer and do the deserialization myself in the stream. That gives me the ability to record that there was a malformed message in the topic (e.g. by logging it; producing it to a dead-letter topic for later inspection can also work) and move on by committing the offset. Scala's Either is particularly good here: a Left can carry the malformed record straight to the committer, while a Right carries the decoded value on for processing.
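A rough sketch of that shape with Alpakka Kafka, assuming Akka 2.6+ (so the implicit ActorSystem provides the materializer); decodeAvro and handleEvent are hypothetical helpers, and the broker, group id and topic are placeholders:
import akka.actor.ActorSystem
import akka.kafka.{CommitterSettings, ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.{Committer, Consumer}
import org.apache.kafka.common.serialization.ByteArrayDeserializer

implicit val system: ActorSystem = ActorSystem("consumer")

// Consume raw bytes so a malformed record can never blow up inside the Kafka source itself.
val rawSettings =
  ConsumerSettings(system, new ByteArrayDeserializer, new ByteArrayDeserializer)
    .withBootstrapServers("localhost:9092") // assumed broker
    .withGroupId("my-group")                // assumed group id

Consumer
  .committableSource(rawSettings, Subscriptions.topics("my-topic")) // assumed topic
  .map { msg =>
    decodeAvro(msg.record.value()) match {       // hypothetical: Array[Byte] => Either[Throwable, Event]
      case Right(event) => handleEvent(event)    // hypothetical happy-path handler
      case Left(error)  => system.log.warning("Skipping malformed record: {}", error)
    }
    msg.committableOffset                        // committed either way, so the stream moves on
  }
  .runWith(Committer.sink(CommitterSettings(system)))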
So I have a problem with Kafka sinks in Spark Streaming when sending JSONs to multiple topics through unreliable Kafka brokers. Here are some parts of the code:
val kS = KafkaUtils.createDirectStream[String, TMapRecord](
  ssc,
  PreferConsistent,
  Subscribe[String, TMapRecord](topicsSetT, kafkaParamsInT))
Then I iterate over the RDDs:
kSMapped.foreachRDD { rdd: RDD[TMsg] =>
  rdd.foreachPartition { part =>
    part.foreach { ...........
And inside foreach I do
kafkaSink.value.send(kafkaTopic, strJSON)
kafkaSinkMirror.value.send(kafkaTopicMirrorBroker, strJSON)
When the mirror broker is down, the entire streaming application waits for it, and we end up not sending anything to the main broker either.
How would you handle it?
For the easiest solution you could propose, imagine that we just skip the messages that were meant to be sent to the broker that went down (say that's CASE 1);
for CASE 2 we'd do some buffering.
P.S. Later on I will use Kafka Mirror, but currently I don't have such an option, so I need to build some solution into my code.
I've found several possible solutions to this problem:
You could throw a timeout exception on the worker and rely on checkpoints. Spark retries a failed task the number of times set by the spark.task.maxFailures property, and it is possible to increase the number of retries. If the streaming job still fails after the maximum retries, it can simply be restarted from the checkpoint once the broker is available again, or you could manually stop the job when it fails.
You could enable backpressure with spark.streaming.backpressure.enabled=true, which makes the application receive data only as fast as it can process it.
You could send both results to a technical Kafka topic of your own and handle them later with another streaming job.
You could build a Hive or HBase buffer for these cases and send the unhandled data later in batch mode.
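For CASE 1 from your question (just skipping messages meant for the broker that is down), a minimal sketch would wrap the mirror send so a failure is logged and dropped instead of stalling the batch. This assumes your kafkaSink broadcast wraps a KafkaProducer configured to fail fast (e.g. a low max.block.ms), since by default a producer blocks for a while before throwing; toJson stands in for whatever serialization you already do:
import scala.util.{Failure, Success, Try}

kSMapped.foreachRDD { rdd =>
  rdd.foreachPartition { part =>
    part.foreach { msg =>
      val strJSON = toJson(msg)                      // hypothetical serialization step
      kafkaSink.value.send(kafkaTopic, strJSON)      // main broker: always attempted

      // CASE 1: best effort for the mirror -- skip the record if the mirror broker is down.
      Try(kafkaSinkMirror.value.send(kafkaTopicMirrorBroker, strJSON)) match {
        case Success(_) => // mirrored
        case Failure(e) => // log and continue instead of failing the task
      }
    }
  }
}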
I am currently writing a Spark streaming application that reads data from Kafka and tries to decode it before applying some transformations.
The current code structure looks like this:
val stream = KafkaUtils.createDirectStream[String, String](...)

stream
  .map(record => decode(record.value()))
  .filter(...)
  .foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    ...
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }
The decoding and filtering of failures happens on the DStream, and the offset management is done inside the foreachRDD, which means that I will only commit successful records.
To commit the failed records, I could move everything inside the foreachRDD loop:
val stream = KafkaUtils.createDirectStream[String, String](...)

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ...
  // Decoding and filtering here
  ...
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
However, I am wondering whether there is another way to commit the failed records. Maybe it would be acceptable to not commit the failed records?
I am assuming you are using the spark-streaming-kafka library.
Reading the documentation of OffsetRange: it stores a range of offsets for a single topic partition. It does not filter out or mark individual offsets within that range based on the client's .filter(…) actions. So if you commit those offsetRanges, the highest offset per partition will be committed, regardless of your filter actions.
That makes sense, as your consumer is telling the Kafka broker, or more precisely the Group Coordinator, that it has consumed these messages. The coordinator is not interested in what you actually do with the data; it just wants to know whether that particular Consumer Group has read a given message/offset or not.
Coming back to your questions...
I am wondering whether there is another way to commit the failed records.
Although it doesn't look like you need it, yes, there is another way of committing "failed" records. You can enable auto commit. Together with the consumer configuration auto.commit.interval.ms, you can periodically commit the offsets your consumer has polled from the topic.
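With the 0.10-style direct stream you're using, that is just a matter of the consumer properties you pass in. A hedged sketch (broker, group id and interval are placeholders, not recommendations):
import org.apache.kafka.common.serialization.StringDeserializer

val kafkaParams = Map[String, Object](
  "bootstrap.servers"       -> "my-kafka-ip:9092",        // assumed broker
  "key.deserializer"        -> classOf[StringDeserializer],
  "value.deserializer"      -> classOf[StringDeserializer],
  "group.id"                -> "my-group",                // assumed group id
  "auto.offset.reset"       -> "latest",
  "enable.auto.commit"      -> (true: java.lang.Boolean), // let the consumer commit on its own
  "auto.commit.interval.ms" -> (5000: java.lang.Integer)  // commit every 5 seconds
)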
Maybe it would be acceptable to not commit the failed records?
I don't have knowledge of your particular use case, but it is acceptable not to commit the failed records. As mentioned above, the Group Coordinator is only interested in the highest offset per partition that your consumer has consumed. If you consume a topic with 10 messages (offsets 0 through 9, since offset counting starts at 0), read them all from the beginning and commit the position after the last one, then the next time you start your consumer it will skip those first ten messages.
You could check out the Kafka internal topic __consumer_offsets to see what is stored for each Consumer Group: Topic, Partition, Offset (… among others).
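If you'd rather not read __consumer_offsets directly, a small sketch with the AdminClient (newer kafka-clients) shows the committed position per partition; the group id and broker are placeholders:
import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}
import scala.collection.JavaConverters._

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "my-kafka-ip:9092") // assumed broker

val admin = AdminClient.create(props)
val committed = admin
  .listConsumerGroupOffsets("my-group")          // assumed consumer group
  .partitionsToOffsetAndMetadata()
  .get()

committed.asScala.foreach { case (tp, om) =>
  println(s"$tp -> committed offset ${om.offset()}")
}
admin.close()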
My Kafka direct consumer started limiting reads to 450 events (5 * 90 partitions) per batch (5 seconds); it had been running fine for 1 or 2 days before that (about 5,000 to 40,000 events per batch).
I'm using a Spark standalone cluster (Spark and spark-streaming-kafka version 1.6.1) running in AWS, with an S3 bucket for the checkpoint directory (StreamingContext.getOrCreate(config.sparkConfig.checkpointDir, createStreamingContext)). There are no scheduling delays and there is enough disk space on each worker node.
I didn't change any Kafka client initialization parameters, and I'm pretty sure Kafka's structure hasn't changed:
val kafkaParams = Map("metadata.broker.list" -> kafkaConfig.broker)
val topics = Set(kafkaConfig.topic)
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
Also, I can't understand why, when the direct consumer documentation says the consumed offsets are tracked by the stream itself, I still need to use a checkpoint directory when creating the streaming context.
This is usually the result of enabling backpressure by setting spark.streaming.backpressure.enabled to true. Usually, when the backpressure algorithm sees more data coming in than it's used to, it starts capping each batch to a rather small size until it can "stabilize" itself again. This sometimes has false positives and causes your stream to slow down its processing rate.
If you want to tweak the heuristic a little, there are some undocumented flags it is using (just make sure you know what you're doing):
val proportional = conf.getDouble("spark.streaming.backpressure.pid.proportional", 1.0)
val integral = conf.getDouble("spark.streaming.backpressure.pid.integral", 0.2)
val derived = conf.getDouble("spark.streaming.backpressure.pid.derived", 0.0)
val minRate = conf.getDouble("spark.streaming.backpressure.pid.minRate", 100)
If you want the gory details, PIDRateEstimator is what you're looking for.
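If you do decide to tweak them, they are set like any other Spark property. A sketch (the numbers are illustrative, not recommendations):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  // Don't let the PID estimator cap the ingestion rate below 1000 records/second.
  .set("spark.streaming.backpressure.pid.minRate", "1000")
  // Also worth knowing: a hard per-partition cap, independent of backpressure.
  .set("spark.streaming.kafka.maxRatePerPartition", "2000")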
I am new to Apache Spark and need to run several long-running processes (jobs) on my Spark cluster at the same time. Often, these individual processes (each of which is its own job) will need to communicate with each other. Tentatively, I'm looking at using Kafka as the broker between these processes. So the high-level job-to-job communication would look like:
Job #1 does some work and publishes message to a Kafka topic
Job #2 is set up as a streaming receiver (using a StreamingContext) to that same Kafka topic, and as soon as the message is published to the topic, Job #2 consumes it
Job #2 can now do some work, based on the message it consumed
From what I can tell, streaming contexts are blocking listeners that run on the Spark Driver node. This means that once I start the streaming consumer like so:
def createKafkaStream(ssc: StreamingContext,
                      kafkaTopics: String, brokers: String): DStream[(String, String)] = {
  // some configs here
  KafkaUtils.createDirectStream[String, String, StringDecoder,
    StringDecoder](ssc, props, topicsSet)
}

def consumerHandler(): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(10))

  createKafkaStream(ssc, "someTopic", "my-kafka-ip:9092").foreachRDD(rdd => {
    rdd.collect().foreach { msg =>
      // Now do some work as soon as we receive a message from the topic
    }
  })

  ssc
}

StreamingContext.getActive.foreach {
  _.stop(stopSparkContext = false)
}

val ssc = StreamingContext.getActiveOrCreate(consumerHandler)
ssc.start()
ssc.awaitTermination()
...that there are now 2 implications:
The Driver is now blocking and listening for work to consume from Kafka; and
When work (messages) are received, they are sent to any available Worker Nodes to actually be executed upon
So first, if anything I've said above is incorrect or misleading, please begin by correcting me! Assuming I'm more or less correct, I'm simply wondering whether there is a more scalable or performant way to accomplish this, given my criteria. Again, I have two long-running jobs (Job #1 and Job #2) running on my Spark nodes, and one of them needs to be able to 'send work to' the other one. Any ideas?
From what I can tell, streaming contexts are blocking listeners that
run on the Spark Driver node.
A StreamingContext (singular) isn't a blocking listener. Its job is to create the execution graph for your streaming job.
When you start reading from Kafka, you specify that you want to fetch new records every 10 seconds. What happens from then on depends on which abstraction you're using for Kafka: either the receiver approach via KafkaUtils.createStream, or the receiver-less approach via KafkaUtils.createDirectStream.
In both approaches, generally speaking, data is consumed from Kafka and then dispatched to the Spark workers to be processed in parallel.
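Roughly, the two entry points look like this (parameters are placeholders; the direct variant matches the 0.8-style API in your snippet):
// Receiver approach: a long-running receiver on an executor pulls data and
// Spark replicates it before processing.
val receiverStream = KafkaUtils.createStream(
  ssc, "zk-host:2181", "my-group", Map("someTopic" -> 1))

// Receiver-less (direct) approach: the driver computes offset ranges each batch
// and executors read those ranges straight from Kafka.
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "my-kafka-ip:9092"), Set("someTopic"))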
then I'm simply wondering if there is a more scalable or performant
way to accomplish this
This approach is highly scalable. When using the receiver-less approach, each Kafka partition maps to a Spark partition in a given RDD. You can increase parallelism either by increasing the number of partitions in Kafka, or by repartitioning the data inside Spark (using DStream.repartition). I suggest testing this setup to determine whether it suits your performance requirements.
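As a sketch of the second option (the partition count is illustrative, and createKafkaStream is the helper from your question):
val stream = createKafkaStream(ssc, "someTopic", "my-kafka-ip:9092")

// Shuffle the received records across more Spark partitions than there are Kafka partitions,
// trading a shuffle for extra downstream parallelism.
val widened = stream.repartition(32)

widened.foreachRDD { rdd =>
  rdd.foreach { msg =>
    // process each record on whichever executor owns its partition
  }
}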