I have a Structured Streaming application that reads messages from Kafka and then shuts down. However, if the Kafka broker is down or unreachable, my application will attempt to reconnect indefinitely. I would prefer that it shut down after 2-3 retries and let my orchestration engine handle alerting and retrying.
It seems like Kafka does not expose any configuration options on the consumer side that would allow us to stop attempting reconnects after a certain time period or retry count.
https://github.com/apache/kafka/blob/6ab4d047d563e0fe42a7c0ed6f10ddecda135595/clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerConfig.java
The only configurations that I found in the ConsumerConfig that might assist us with this are the following:
kafka.reconnect.backoff.ms
kafka.reconnect.backoff.max.ms
kafka.socket.connection.setup.timeout.max.ms
Here is my Structured Streaming code for reference:
val checkpointUrl = "s3://mybucket/experiment/checkpoint/"
val url = "s3://mybucket/experiment/triggerOnce/"
import org.apache.spark.sql.streaming.Trigger

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "X.X.X.X:9092")
  .option("kafka.reconnect.backoff.max.ms", 1000)
  .option("kafka.socket.connection.setup.timeout.max.ms", 10000)
  .option("subscribe", "experiment3")
  .load()

df.writeStream
  .format("csv")
  .option("checkpointLocation", checkpointUrl)
  .trigger(Trigger.Once())
  .start(url)
  .awaitTermination()
I am currently looking into whether there is a way to leverage SparkListeners to intervene in the reconnection logic.
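One workaround I am evaluating, shown in the sketch below, is to bound the overall run time instead of the reconnect count: await termination with a timeout and stop the query myself, so the orchestrator sees a non-zero exit instead of an indefinite hang. The 5-minute budget and the exit code are my own assumptions, not anything Kafka or Spark prescribes.

// Sketch: give the Trigger.Once run a fixed time budget (assumed 5 minutes here).
val query = df.writeStream
  .format("csv")
  .option("checkpointLocation", checkpointUrl)
  .trigger(Trigger.Once())
  .start(url)

// awaitTermination(timeoutMs) returns false if the query is still running when the timeout expires.
val finished = query.awaitTermination(5 * 60 * 1000L)
if (!finished) {
  query.stop()  // give up; the broker is most likely unreachable
  sys.exit(1)   // non-zero exit so the orchestration engine can alert and retry
}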
Related
I'm trying to stream messages out of Kafka with Spark Structured Streaming in Scala, as per the Spark documentation, like this:
val sparkConfig = new SparkConf()
.setAppName("Some.app.name")
.setMaster("local")
val spark = SparkSession
.builder
.config(sparkConfig)
.getOrCreate()
val dataframe = spark
.readStream
.format("kafka")
.option("subscribe", kafkaTopic)
.option("kafka.bootstrap.servers", kafkaEndpoint)
.option("kafka.security.protocol", "SASL_PLAINTEXT")
.option("kafka.sasl.username", "$ConnectionString")
.option("kafka.sasl.password", kafkaConnectionString)
.option("kafka.sasl.mechanism", "PLAIN")
.option("spark.kafka.clusters.cluster.sasl.token.mechanism", "SASL_PLAINTEXT")
.option("includeHeaders", "true")
.load()
val outputAllToConsoleQuery = dataframe
.writeStream
.format("console")
.start()
outputAllToConsoleQuery.awaitTermination()
Which of course fails with: Could not find a 'KafkaClient' entry in the JAAS configuration. System property 'java.security.auth.login.config' is not set
As per the Spark documentation here, "..the application can be configured via Spark parameters and may not need JAAS login configuration".
I have also read the Kafka documentation.
I think I can get the idea, but I haven't found a way to actually code it, nor have I found any example.
Could someone provide the code in scala that configures spark structured streaming to authenticate against kafka and use delegation token, without JAAS configuration file?
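In case it helps anyone, here is a sketch of the most direct way I know to get rid of that error without an external JAAS file: pass the JAAS entry inline through the kafka.sasl.jaas.config source option. Note that this is plain SASL/PLAIN authentication, not the delegation-token route the documentation describes; also note that the spark.kafka.clusters.<id>.* entries used for delegation tokens are Spark configurations (set on SparkConf or via spark-submit), not DataFrame reader options as in the snippet above. The variable names below reuse the ones from the question.

// Sketch, assuming SASL/PLAIN: build the JAAS entry as a string and pass it as a source option.
val jaasEntry =
  "org.apache.kafka.common.security.plain.PlainLoginModule required " +
  s"""username="$$ConnectionString" password="$kafkaConnectionString";"""

val dataframe = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaEndpoint)
  .option("subscribe", kafkaTopic)
  .option("kafka.security.protocol", "SASL_PLAINTEXT")
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("kafka.sasl.jaas.config", jaasEntry) // inline entry instead of java.security.auth.login.config
  .option("includeHeaders", "true")
  .load()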
I am using Spark Structured Streaming to read from a Kafka topic (say topic1) and a Kafka sink to write to another topic (topic1-result). I can see that the messages are not being removed from topic1 after they are written to the other topic via the sink.
// Subscribe to 1 topic
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1")
.option("subscribe", "topic1")
.load()
//SINK to another topic
val ds = df
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1")
.option("checkpointLocation", "/tmp/checkpoint1")
.option("topic", "topic1-result")
.start()
The documentation says we cannot use auto-commit for structured streams:
enable.auto.commit: Kafka source doesn’t commit any offset.
But how do I acknowledge messages and remove the processed messages from the topic (topic1)?
Two considerations:
Messages are not removed from Kafka once you have committed. When your consumer commits, Kafka advances the offset of this topic with respect to the consumer group that has been created, but the messages remain in the topic for as long as the retention time you configured for the topic.
Indeed, the Kafka source does not commit; the stream stores the offset that points to the next message in the streaming query's checkpoint directory. So when your stream restarts, it takes the last offset and consumes from there.
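A minimal sketch of what that means in practice (paths and topic name are just the placeholders from the question): startingOffsets only matters on the very first run, while on every restart the offsets stored under the checkpoint directory win. The records themselves stay in topic1 until the broker's retention for that topic expires; Spark never deletes them.

// Sketch: restart behaviour of the Kafka source when a checkpoint already exists.
val resumed = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1")
  .option("subscribe", "topic1")
  .option("startingOffsets", "earliest") // applies only on the first run; afterwards the offsets in the write-side checkpointLocation (/tmp/checkpoint1) take precedence
  .load()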
I am observing weird behaviour while running a Spark Structured Streaming program. I am using an S3 bucket for metadata checkpointing.
The kafka topic has 310 partitions.
When I start the streaming job for the first time, after the completion of every batch Spark creates a new file, named after the batch_id, in the offsets directory at the checkpointing location. After the successful completion of a few batches, the Spark job fails after a few retries with the warning "WARN KafkaMicroBatchReader:66 - Set(logs-2019-10-04-77, logs-2019-10-04-85, logs-2019-10-04-71, logs-2019-10-04-93, logs-2019-10-04-97, logs-2019-10-04-101, logs-2019-10-04-89, logs-2019-10-04-81, logs-2019-10-04-103, logs-2019-10-04-104, logs-2019-10-04-102, logs-2019-10-04-98, logs-2019-10-04-94, logs-2019-10-04-90, logs-2019-10-04-74, logs-2019-10-04-78, logs-2019-10-04-82, logs-2019-10-04-86, logs-2019-10-04-99, logs-2019-10-04-91, logs-2019-10-04-73, logs-2019-10-04-79, logs-2019-10-04-87, logs-2019-10-04-83, logs-2019-10-04-75, logs-2019-10-04-92, logs-2019-10-04-70, logs-2019-10-04-96, logs-2019-10-04-88, logs-2019-10-04-95, logs-2019-10-04-100, logs-2019-10-04-72, logs-2019-10-04-76, logs-2019-10-04-84, logs-2019-10-04-80) are gone. Some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the data was aged out by Kafka or the topic may have been deleted before all the data in the topic was processed. If you don't want your streaming query to fail on such cases, set the source option "failOnDataLoss" to "false"."
The weird thing here is that the previous batch's offset file contains partition info for all 310 partitions, but the current batch is reading only selected partitions (see the warning message above).
I reran the job with .option("failOnDataLoss", false) but got the same warning as above, without job failure. I observed that Spark was processing the correct offsets for a few partitions, and for the rest of the partitions it was reading from the starting offset (0).
There were no connectivity issues between Spark and Kafka while this error was occurring (we checked the Kafka logs as well).
Could someone help with this? Am I doing something wrong or missing something?
Below is the read and write stream code snippet.
val kafkaDF = ss.readStream.format("kafka")
.option("kafka.bootstrap.servers", kafkaBrokers /*"localhost:9092"*/)
.option("subscribe", logs)
.option("fetchOffset.numRetries",5)
.option("maxOffsetsPerTrigger", 30000000)
.load()
val query = logDS // logDS is presumably derived from kafkaDF (the parsing step is omitted in this snippet)
  .writeStream
  .foreachBatch { (batchDS: Dataset[Row], batchId: Long) =>
    batchDS
      .repartition(noofpartitions, batchDS.col("abc"), batchDS.col("xyz"))
      .write
      .mode(SaveMode.Append)
      .partitionBy("date", "abc", "xyz")
      .format("parquet")
      .saveAsTable(hiveTableName /*"default.logs"*/)
  }
  .trigger(Trigger.ProcessingTime(1800 + " seconds"))
  .option("checkpointLocation", s3bucketpath)
  .start()
Thanks in advance.
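Not a fix, but a diagnostic sketch that might narrow it down (broker address and topic name are placeholders; the partition count is taken from the question): compare the offsets recorded in the latest checkpoint offsets file with the earliest offsets the brokers still hold. If retention aged records out between the 30-minute triggers, beginningOffsets will be ahead of the checkpointed offsets for exactly the partitions listed in the warning.

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

// Sketch: ask the brokers for the earliest offset still available per partition.
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // placeholder, same brokers as kafkaBrokers above
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](props)
val partitions = (0 until 310).map(p => new TopicPartition("logs", p)).asJava // "logs" is a placeholder topic name
val earliest = consumer.beginningOffsets(partitions).asScala
earliest.toSeq.sortBy(_._1.partition).foreach { case (tp, offset) =>
  // If this is larger than the offset stored for the same partition in the
  // checkpoint's offsets/<batchId> file, that data was aged out by retention.
  println(s"$tp earliest available offset: $offset")
}
consumer.close()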
I'm running into some odd errors when I consume and sink Kafka messages. I'm running Spark 2.3.0, and I know this was working in some earlier version.
val event = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", <server list>)
.option("subscribe", <topic>)
.load()
val filesink_query = outputdf.writeStream // outputdf is presumably derived from event (transformation omitted in the question)
.partitionBy(<some column>)
.format("parquet")
.option("path", <some path in EMRFS>)
.option("checkpointLocation", "/tmp/ingestcheckpoint")
.trigger(Trigger.ProcessingTime(10.seconds))
.outputMode(OutputMode.Append)
.start
java.lang.IllegalStateException: /tmp/outputagent/_spark_metadata/0 doesn't exist when compacting batch 9 (compactInterval: 10)
I'm rather confused; is this an error in the newest version of Spark?
The issue seemed to be related to using s3n instead of s3a, and to having the checkpoints only on HDFS rather than on S3. This is highly annoying since I would like to avoid hard-coding DNS names or IPs in my code.
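For what it's worth, a sketch of the layout this points to (bucket name and paths are placeholders): keep both the sink output, and therefore its _spark_metadata directory, and the checkpoint on the same s3a filesystem instead of mixing s3n output with an HDFS-only checkpoint.

// Sketch with placeholder s3a paths: sink output and checkpoint on the same filesystem.
val filesink_query = outputdf.writeStream
  .partitionBy("some_column") // placeholder for the real partition column
  .format("parquet")
  .option("path", "s3a://my-bucket/ingest/output/")
  .option("checkpointLocation", "s3a://my-bucket/ingest/checkpoint/")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .outputMode(OutputMode.Append)
  .start()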
I'm writing a Spark (v2.2) batch job which reads from a Kafka topic. The Spark jobs are scheduled with cron.
I can't use Spark Structured Streaming because non-time-based windows are not supported.
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("subscribe", "kafka_topic")
  .load()
I need to keep track of the offsets for the Kafka topic so that the next batch job knows where to start reading from. How can I do that?
I guess you are using KafkaUtils to create the stream; you can pass the offsets as a parameter.
// fromOffsets: Map[TopicPartition, Long]; kafkaParams: Map[String, Object]
val inputDStream = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent,
  Assign[String, String](fromOffsets.keys, kafkaParams, fromOffsets))
Hope this helps!
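With the DataFrame batch source shown in the question (spark.read.format("kafka"), rather than KafkaUtils/DStreams), the usual pattern, sketched below with placeholder offsets, is to pass explicit per-partition offsets through the startingOffsets and endingOffsets options and to persist the last offsets yourself between cron runs.

import org.apache.spark.sql.functions.max

// Sketch: offsets are passed as JSON keyed by topic and partition.
// These numbers are placeholders; in practice they come from wherever the
// previous run stored them (a table, a file on HDFS/S3, etc.).
val startingOffsetsJson = """{"kafka_topic":{"0":4200,"1":3100}}"""

val batchDf = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("subscribe", "kafka_topic")
  .option("startingOffsets", startingOffsetsJson)
  .option("endingOffsets", "latest")
  .load()

// After processing, record the highest offset read per partition plus one
// (startingOffsets is inclusive) and feed it back in on the next cron run.
val nextStartingOffsets = batchDf
  .groupBy("topic", "partition")
  .agg(max("offset").as("lastOffset"))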