Spark structured streaming acknowledge messages - scala

I am using Spark Structured Streaming to read from a Kafka topic (say topic1) and using SINK to write to another topic (topic1-result). I can see the messages are not being removed from Topic1 after writing to another topic using Sink.
// Subscribe to 1 topic
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1")
.option("subscribe", "topic1")
.load()
//SINK to another topic
val ds = df
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1")
.option("checkpointLocation", "/tmp/checkpoint1")
.option("topic", "topic1-result")
.start()
the documentation says we can not use auto-commit for structured streams
enable.auto.commit: Kafka source doesn’t commit any offset.
but how to acknowledge messages and remove the processed messages from the topic (topic1)

Two considerations:
Messages are not removed from Kafka once you have committed. When your consumer executes commit, Kafka increases the offset of this topic respect to the consumer-group that has been created. But messages remain in the topic depending on the retention time that you configure for the topic.
Indeed, Kafka source doesn´t make the commit, the stream storages the offset that points to the next message in the streaming´s checkpoint dir. So when you stream restarts it takes the last offset to consume from it.

Related

How can I stop Kafka from retrying connections?

I have a structured streaming application that reads messages from Kafka and then shuts down. However, if the kafka broker is down or unreachable, my application will attempt to reconnect indefinitely. I would prefer that it shuts down after 2-3 retries and let my orchestration engine handle alerting and retrying.
It seems like Kafka does not expose any configuration options on the consumer side that would allow us to stop attempting reconnects after a certain time period or retry count.
https://github.com/apache/kafka/blob/6ab4d047d563e0fe42a7c0ed6f10ddecda135595/clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerConfig.java
The only configurations that I found in the ConsumerConfig that might assist us with this are the following:
kafka.reconnect.backoff.ms
kafka.reconnect.backoff.max.ms
kafka.socket.connection.setup.timeout.max.ms
Here is my Structured Streaming code for reference:
val checkpointUrl = "s3://mybucket/experiment/checkpoint/"
val url = "s3://mybucket/experiment/triggerOnce/"
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "X.X.X.X:9092")
.option("kafka.reconnect.backoff.max.ms", 1000)
.option("kafka.socket.connection.setup.timeout.max.ms", 10000)
.option("subscribe", "experiment3")
.load()
.writeStream
.format("csv")
.option("checkpointLocation", checkpointUrl)
.trigger(Trigger.Once())
.start(url)
.awaitTermination()
I am currently looking into if there is a way to leverage SparkListeners to intervene within the reconnection logic.

Spark structured streaming with kafka throwing error after running for a while

I am observing weired behaviour while running spark structured streaming program. I am using S3 bucket for metadata checkpointing.
The kafka topic has 310 partitions.
When i start streaming job for the first time, after completion of every batch spark creates a new file named after batch_id gets created in offset directory at checkpinting location. After successful
completion of few batches, spark job fails after few retries giving warning "WARN KafkaMicroBatchReader:66 - Set(logs-2019-10-04-77, logs-2019-10-04-85, logs-2019-10-04-71, logs-2019-10-04-93, logs-2019-10-04-97, logs-2019-10-04-101, logs-2019-10-04-89, logs-2019-10-04-81, logs-2019-10-04-103, logs-2019-10-04-104, logs-2019-10-04-102, logs-2019-10-04-98, logs-2019-10-04-94, logs-2019-10-04-90, logs-2019-10-04-74, logs-2019-10-04-78, logs-2019-10-04-82, logs-2019-10-04-86, logs-2019-10-04-99, logs-2019-10-04-91, logs-2019-10-04-73, logs-2019-10-04-79, logs-2019-10-04-87, logs-2019-10-04-83, logs-2019-10-04-75, logs-2019-10-04-92, logs-2019-10-04-70, logs-2019-10-04-96, logs-2019-10-04-88, logs-2019-10-04-95, logs-2019-10-04-100, logs-2019-10-04-72, logs-2019-10-04-76, logs-2019-10-04-84, logs-2019-10-04-80) are gone. Some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you want your streaming query to fail on such cases, set the source
option "failOnDataLoss" to "false"."
The weired thing here is previous batch's offset file contains partition info of all 310 partitions but current batch is reading only selected partitions(see above warning message).
I reran the job by setting ".option("failOnDataLoss", false)" but got same warning above without job failure. It was observed that spark was processing correct offsets for few partitions and for rest of the partitions it was reading from starting offset(0).
There were no connection issues with spark-kafka while this error coming (we checked kafka logs also).
Could someone help with this?Am i doing something wrong or missing something?
Below is the read and write stream code snippet.
val kafkaDF = ss.readStream.format("kafka")
.option("kafka.bootstrap.servers", kafkaBrokers /*"localhost:9092"*/)
.option("subscribe", logs)
.option("fetchOffset.numRetries",5)
.option("maxOffsetsPerTrigger", 30000000)
.load()
val query = logDS
.writeStream
.foreachBatch {
(batchDS: Dataset[Row], batchId: Long) =>
batchDS.repartition(noofpartitions, batchDS.col("abc"), batchDS.col("xyz")).write.mode(SaveMode.Append).partitionBy("date", "abc", "xyz").format("parquet").saveAsTable(hiveTableName /*"default.logs"*/)
}
.trigger(Trigger.ProcessingTime(1800 + " seconds"))
.option("checkpointLocation", s3bucketpath)
.start()
Thanks in advance.

Control micro batch of Structured Spark Streaming

I'm reading data from a Kafka topic and I'm putting it to Azure ADLS (HDFS Like) in partitioned mode.
My code is like below :
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("subscribe", topic)
.option("failOnDataLoss", false)
.load()
.selectExpr(/*"CAST(key AS STRING)",*/ "CAST(value AS STRING)").as(Encoders.STRING)
df.writeStream
.partitionBy("year", "month", "day", "hour", "minute")
.format("parquet")
.option("path", outputDirectory)
.option("checkpointLocation", checkpointDirectory)
.outputMode("append")
.start()
.awaitTermination()
I have about 2000 records/sec, and my problem is that Spark is inserting the data every 45sec, and I want the data to be inserted immediately.
Anyone know how to control the size of micro batch ?
From the Spark 2.3 version it is available the Continuous processing mode. In the official doc. you can read that only three sinks are supported for this mode and only the Kafka sink is ready for production, and "the end-to-end low-latency processing can be best observed with Kafka as the source and sink"
df
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "/tmp/0")
.option("topic", "output0")
.trigger(Trigger.Continuous("0 seconds"))
.start()
So, it seems that, at the moment, you can´t use HDFS as sink using the Continuous mode. In your case maybe you can test Akka Streams and the Alpakka connector

Offset Management For Apache Kafka With Apache Spark Batch

I'm writing a Spark (v2.2) batch job which reads from a Kafka topic. Spark jobs are scheduling with cron.
I can't use Spark Structured Streaming because non based-time windows are not supported.
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "...")
.option("subscribe", s"kafka_topic")
I need to set the offset for the kafka topic to know from where to start the next batch job. How can I do that?
I guess you are using KafkaUtils to create stream, you can pass this as parameter.
val inputDStream = KafkaUtils.createDirectStream[String,String](ssc,PreferConsistent,
Assign[String, String](fromOffsets.keys,kafkaParams,fromOffsets))
Hoping this helps !

How to use Spark Structured Streaming with Kafka Direct Stream?

I came across Structured Streaming with Spark, it has an example of continuously consuming from an S3 bucket and writing processed results to a MySQL DB.
// Read data continuously from an S3 location
val inputDF = spark.readStream.json("s3://logs")
// Do operations using the standard DataFrame API and write to MySQL
inputDF.groupBy($"action", window($"time", "1 hour")).count()
.writeStream.format("jdbc")
.start("jdbc:mysql//...")
How can this be used with Spark Kafka Streaming?
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
Is there a way to combine these two examples without using stream.foreachRDD(rdd => {})?
Is there a way to combine these two examples without using
stream.foreachRDD(rdd => {})?
Not yet. Spark 2.0.0 doesn't have Kafka sink support for Structured Streaming. This is a feature that should come out in Spark 2.1.0 according to Tathagata Das, one of the creators of Spark Streaming.
Here is the relevant JIRA issue.
Edit: (29/11/2018)
Yes, It's possible with Spark version 2.2 onwards.
stream
.writeStream // use `write` for batch, like DataFrame
.format("kafka")
.option("kafka.bootstrap.servers", "brokerhost1:port1,brokerhost2:port2")
.option("topic", "target-topic1")
.start()
Check this SO post(read and write on Kafka topic with Spark streaming) for more.
Edit: (06/12/2016)
Kafka 0.10 integration for Structured Streaming is now expiramentaly supported in Spark 2.0.2:
val ds1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load()
ds1
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
I was having a similar issue w.r.t reading from Kafka source and writing to a Cassandra sink. Created a simple project here kafka2spark2cassandra, sharing in case it could be helpful for anyone.