Spark Structured Streaming write errors - apache-kafka

I'm running into some odd errors when I consume Kafka messages and write them to a sink. I'm running Spark 2.3.0, and I know this worked in an earlier version.
    val event = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", <server list>)
      .option("subscribe", <topic>)
      .load()

    val filesink_query = outputdf.writeStream
      .partitionBy(<some column>)
      .format("parquet")
      .option("path", <some path in EMRFS>)
      .option("checkpointLocation", "/tmp/ingestcheckpoint")
      .trigger(Trigger.ProcessingTime(10.seconds))
      .outputMode(OutputMode.Append)
      .start
java.lang.IllegalStateException: /tmp/outputagent/_spark_metadata/0 doesn't exist when compacting batch 9 (compactInterval: 10)
I'm rather confused. Is this a bug in the newest version of Spark?

The issue seemed to be related to using s3n rather than s3a, and to having checkpoints only on HDFS, not S3. This is highly annoying, since I would like to avoid hard-coding DNS names or IPs in my code.

Related

How can I stop Kafka from retrying connections?

I have a Structured Streaming application that reads messages from Kafka and then shuts down. However, if the Kafka broker is down or unreachable, my application will attempt to reconnect indefinitely. I would prefer that it shut down after 2-3 retries and let my orchestration engine handle alerting and retrying.
It seems like Kafka does not expose any configuration options on the consumer side that would allow us to stop attempting reconnects after a certain time period or retry count.
https://github.com/apache/kafka/blob/6ab4d047d563e0fe42a7c0ed6f10ddecda135595/clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerConfig.java
The only configurations that I found in the ConsumerConfig that might assist us with this are the following:
kafka.reconnect.backoff.ms
kafka.reconnect.backoff.max.ms
kafka.socket.connection.setup.timeout.max.ms
Here is my Structured Streaming code for reference:
    val checkpointUrl = "s3://mybucket/experiment/checkpoint/"
    val url = "s3://mybucket/experiment/triggerOnce/"

    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "X.X.X.X:9092")
      .option("kafka.reconnect.backoff.max.ms", 1000)
      .option("kafka.socket.connection.setup.timeout.max.ms", 10000)
      .option("subscribe", "experiment3")
      .load()
      .writeStream
      .format("csv")
      .option("checkpointLocation", checkpointUrl)
      .trigger(Trigger.Once())
      .start(url)
      .awaitTermination()
I am currently looking into whether there is a way to leverage SparkListeners to intervene in the reconnection logic.
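One possible workaround, which is not from the thread and only covers a broker that is already down when the job starts, is to probe the broker a bounded number of times before starting the query and exit if it never answers. The sketch below uses a plain TCP connect; `BrokerPreflight`, `reachable`, and all parameter names are made up for illustration, and a successful TCP connect does not prove the broker is healthy.

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

object BrokerPreflight {
  /** Tries to open a TCP connection to host:port at most `attempts` times.
    * Returns true on the first success, false if every attempt fails. */
  def reachable(host: String, port: Int, attempts: Int,
                timeoutMs: Int, backoffMs: Long): Boolean =
    (1 to attempts).exists { attempt =>
      val ok = Try {
        val socket = new Socket()
        try socket.connect(new InetSocketAddress(host, port), timeoutMs)
        finally socket.close()
      }.isSuccess
      // Wait between failed attempts, but not after the last one.
      if (!ok && attempt < attempts) Thread.sleep(backoffMs)
      ok
    }
}

// Hypothetical usage before building the streaming query:
// if (!BrokerPreflight.reachable("X.X.X.X", 9092, attempts = 3, timeoutMs = 2000, backoffMs = 1000))
//   sys.exit(1)
```

With this in place the orchestration engine sees a non-zero exit code instead of a job that hangs. It does not help if the broker disappears after the query has started.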

Authenticate spark structured streaming against kafka with delegation token, in scala

I'm trying to stream messages out of kafka with spark structured streaming in scala as per spark documentation like this:
    val sparkConfig = new SparkConf()
      .setAppName("Some.app.name")
      .setMaster("local")

    val spark = SparkSession
      .builder
      .config(sparkConfig)
      .getOrCreate()

    val dataframe = spark
      .readStream
      .format("kafka")
      .option("subscribe", kafkaTopic)
      .option("kafka.bootstrap.servers", kafkaEndpoint)
      .option("kafka.security.protocol", "SASL_PLAINTEXT")
      .option("kafka.sasl.username", "$ConnectionString")
      .option("kafka.sasl.password", kafkaConnectionString)
      .option("kafka.sasl.mechanism", "PLAIN")
      .option("spark.kafka.clusters.cluster.sasl.token.mechanism", "SASL_PLAINTEXT")
      .option("includeHeaders", "true")
      .load()

    val outputAllToConsoleQuery = dataframe
      .writeStream
      .format("console")
      .start()

    outputAllToConsoleQuery.awaitTermination()
This of course fails with: Could not find a 'KafkaClient' entry in the JAAS configuration. System property 'java.security.auth.login.config' is not set
As per spark documentation here "..the application can be configured via Spark parameters and may not need JAAS login configuration".
I have also read kafka documentation.
I think I can get the idea, but I haven't found a way to actually code it, nor have I found any example.
Could someone provide the code in scala that configures spark structured streaming to authenticate against kafka and use delegation token, without JAAS configuration file?
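For what it's worth, the error quoted above can usually be avoided without a JAAS file by passing the Kafka client property `sasl.jaas.config` inline, using the same `kafka.` prefix as the other options. This is not delegation-token authentication; it is only a sketch of the inline-JAAS route for the PLAIN mechanism. `KafkaSaslOptions` and its method names are invented here, and the sketch assumes the credentials contain no double quotes. As far as I know, `sasl.username` and `sasl.password` are not Kafka client properties, so those two options would be replaced by the JAAS string.

```scala
object KafkaSaslOptions {
  // Wraps a value in double quotes; assumes the value itself has none.
  private def quoted(value: String): String = {
    require(value.indexOf('"') < 0, "value must not contain a double quote")
    "\"" + value + "\""
  }

  /** Inline JAAS entry for Kafka's PlainLoginModule. */
  def plainJaasConfig(username: String, password: String): String =
    "org.apache.kafka.common.security.plain.PlainLoginModule required " +
      "username=" + quoted(username) + " password=" + quoted(password) + ";"

  /** Source options carrying the SASL settings; `protocol` is e.g. SASL_PLAINTEXT or SASL_SSL. */
  def options(username: String, password: String, protocol: String): Map[String, String] = Map(
    "kafka.security.protocol" -> protocol,
    "kafka.sasl.mechanism"    -> "PLAIN",
    "kafka.sasl.jaas.config"  -> plainJaasConfig(username, password)
  )
}

// Hypothetical usage with the names from the question:
// spark.readStream.format("kafka")
//   .option("subscribe", kafkaTopic)
//   .option("kafka.bootstrap.servers", kafkaEndpoint)
//   .options(KafkaSaslOptions.options("$ConnectionString", kafkaConnectionString, "SASL_PLAINTEXT"))
//   .load()
```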

Control micro batch of Structured Spark Streaming

I'm reading data from a Kafka topic and writing it to Azure ADLS (HDFS-like) in partitioned mode.
My code is below:
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", bootstrapServers)
      .option("subscribe", topic)
      .option("failOnDataLoss", false)
      .load()
      .selectExpr(/*"CAST(key AS STRING)",*/ "CAST(value AS STRING)").as(Encoders.STRING)

    df.writeStream
      .partitionBy("year", "month", "day", "hour", "minute")
      .format("parquet")
      .option("path", outputDirectory)
      .option("checkpointLocation", checkpointDirectory)
      .outputMode("append")
      .start()
      .awaitTermination()
I have about 2000 records/sec. My problem is that Spark writes the data only every 45 seconds, and I want the data to be written immediately.
Does anyone know how to control the size of the micro-batch?
Continuous processing mode has been available since Spark 2.3. According to the official documentation, only three sinks are supported in this mode, only the Kafka sink is ready for production, and "the end-to-end low-latency processing can be best observed with Kafka as the source and sink":
    df
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("checkpointLocation", "/tmp/0")
      .option("topic", "output0")
      .trigger(Trigger.Continuous("0 seconds"))
      .start()
So it seems that, at the moment, you can't use HDFS as a sink in Continuous mode. In your case, maybe you can try Akka Streams and the Alpakka connector.
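If you stay with micro-batches and the parquet sink, another thing to try is capping the batch size on the Kafka source with the `maxOffsetsPerTrigger` option and setting an explicit `Trigger.ProcessingTime` on the writer. Below is a rough sketch of how the cap could be derived from the 2000 records/sec figure in the question. `MicroBatchSizing`, the 5-second target, and the headroom factor are assumptions for illustration, not tuned values.

```scala
object MicroBatchSizing {
  /** Cap on records per micro-batch: ingest rate times target interval,
    * multiplied by a headroom factor so a backlog can still drain. */
  def maxOffsetsPerTrigger(recordsPerSecond: Long, targetIntervalSeconds: Long,
                           headroom: Double = 2.0): Long = {
    require(recordsPerSecond > 0 && targetIntervalSeconds > 0 && headroom >= 1.0,
      "rate and interval must be positive, headroom must be at least 1.0")
    math.ceil(recordsPerSecond * targetIntervalSeconds * headroom).toLong
  }

  /** Kafka source options from the question plus the batch-size cap. */
  def sourceOptions(bootstrapServers: String, topic: String,
                    recordsPerSecond: Long, targetIntervalSeconds: Long): Map[String, String] = Map(
    "kafka.bootstrap.servers" -> bootstrapServers,
    "subscribe"               -> topic,
    "failOnDataLoss"          -> "false",
    "maxOffsetsPerTrigger"    -> maxOffsetsPerTrigger(recordsPerSecond, targetIntervalSeconds).toString
  )
}

// Hypothetical usage:
// val opts = MicroBatchSizing.sourceOptions(bootstrapServers, topic, 2000L, 5L)
// val df = spark.readStream.format("kafka").options(opts).load()
// df.writeStream ... .trigger(Trigger.ProcessingTime("5 seconds")) ...
```

Smaller batches combined with partitioning down to the minute will produce many small parquet files, so there is a trade-off between latency and file count.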

Offset Management For Apache Kafka With Apache Spark Batch

I'm writing a Spark (v2.2) batch job that reads from a Kafka topic. The Spark jobs are scheduled with cron.
I can't use Spark Structured Streaming because non-time-based windows are not supported.
    val df = spark
      .read
      .format("kafka")
      .option("kafka.bootstrap.servers", "...")
      .option("subscribe", "kafka_topic")
      .load()
I need to set the offset for the kafka topic to know from where to start the next batch job. How can I do that?
I guess you are using KafkaUtils to create the stream; if so, you can pass the offsets as a parameter.

    val inputDStream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Assign[String, String](fromOffsets.keys, kafkaParams, fromOffsets)
    )

Hope this helps!
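If you stay with the batch DataFrame reader from the question instead, the Kafka source accepts a `startingOffsets` option as a JSON string of the form `{"topic":{"partition":offset}}`. Here is a small sketch of a helper that renders such a string from offsets you saved after the previous run. `KafkaBatchOffsets` is a made-up name, and where the offsets are stored between cron runs (a file, a table) is left open.

```scala
object KafkaBatchOffsets {
  /** Renders (topic, partition) -> offset pairs as the JSON expected by
    * the `startingOffsets` / `endingOffsets` options, sorted for stable output. */
  def toJson(offsets: Map[(String, Int), Long]): String =
    offsets
      .groupBy { case ((topic, _), _) => topic }
      .toSeq
      .sortBy { case (topic, _) => topic }
      .map { case (topic, parts) =>
        val body = parts.toSeq
          .map { case ((_, partition), offset) => (partition, offset) }
          .sortBy { case (partition, _) => partition }
          .map { case (partition, offset) => "\"" + partition + "\":" + offset }
          .mkString(",")
        "\"" + topic + "\":{" + body + "}"
      }
      .mkString("{", ",", "}")
}

// Hypothetical usage: `saved` holds, per partition, the last offset read by the previous job plus one.
// val df = spark.read.format("kafka")
//   .option("kafka.bootstrap.servers", "...")
//   .option("subscribe", "kafka_topic")
//   .option("startingOffsets", KafkaBatchOffsets.toJson(saved))
//   .option("endingOffsets", "latest")
//   .load()
```

The rows returned by the Kafka source carry `topic`, `partition`, and `offset` columns, so the next run's starting point can be computed as the maximum offset per partition plus one at the end of each job.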

How to use Spark Structured Streaming with Kafka Direct Stream?

I came across Structured Streaming with Spark; it has an example of continuously consuming from an S3 bucket and writing processed results to a MySQL DB.
    // Read data continuously from an S3 location
    val inputDF = spark.readStream.json("s3://logs")

    // Do operations using the standard DataFrame API and write to MySQL
    inputDF.groupBy($"action", window($"time", "1 hour")).count()
      .writeStream.format("jdbc")
      .start("jdbc:mysql//...")
How can this be used with Spark Kafka Streaming?
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )
Is there a way to combine these two examples without using stream.foreachRDD(rdd => {})?
Not yet. Spark 2.0.0 doesn't have Kafka sink support for Structured Streaming. This is a feature that should come out in Spark 2.1.0 according to Tathagata Das, one of the creators of Spark Streaming.
Here is the relevant JIRA issue.
Edit: (29/11/2018)
Yes, it's possible with Spark version 2.2 onwards.
    stream
      .writeStream // use `write` for batch, like DataFrame
      .format("kafka")
      .option("kafka.bootstrap.servers", "brokerhost1:port1,brokerhost2:port2")
      .option("topic", "target-topic1")
      .start()
Check this SO post (read and write on Kafka topic with Spark streaming) for more.
Edit: (06/12/2016)
Kafka 0.10 integration for Structured Streaming is now experimentally supported in Spark 2.0.2:
    val ds1 = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
      .option("subscribe", "topic1")
      .load()

    ds1
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .as[(String, String)]
I was having a similar issue with reading from a Kafka source and writing to a Cassandra sink. I created a simple project, kafka2spark2cassandra, and am sharing it in case it helps anyone.