Azure Databricks to eventhub: com.microsoft.azure.eventhubs.TimeoutException - pyspark

Please help me to fix this issue. I am writing data to Event Hub from a Databricks (PySpark) streaming query as below:
from pyspark.sql.functions import to_json, struct

def foreach_batch_function(df, epoch_id):
    # Serialize each row of the micro-batch as a JSON payload and write it to Event Hubs
    df.orderBy("_commit_timestamp") \
      .select(to_json(struct("*")).alias("body")) \
      .write.format("eventhubs").options(**ehConf).save()

w_hubble_account_df = hubble_account_df.writeStream \
    .option("checkpointLocation", hubble_account_checkpoint_location) \
    .foreachBatch(foreach_batch_function) \
    .start()

w_hubble_account_df.awaitTermination()
This runs for a few minutes, successfully writes a few thousand records to Event Hub, and then throws the TimeoutException. It is supposed to run continuously until stopped explicitly. What could be the possible root cause?
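For context, a minimal sketch of the connector's timeout knobs, which are a common first thing to check for this TimeoutException. It uses the Scala EventHubsConf API of the azure-event-hubs-spark connector rather than the PySpark options dict, and the values are illustrative assumptions, not settings from the original job:

import java.time.Duration
import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf}

// Illustrative only: raise the Event Hubs operation/receiver timeouts.
// "<event-hubs-connection-string>" is a placeholder.
val connStr = ConnectionStringBuilder("<event-hubs-connection-string>").build

val ehConfWithTimeouts = EventHubsConf(connStr)
  .setOperationTimeout(Duration.ofSeconds(120)) // time allowed per send/receive operation
  .setReceiverTimeout(Duration.ofSeconds(120))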

Related

Flink SQL does not honor "table.exec.source.idle-timeout" setting

I have a Flink job running FlinkSQL with the following setup:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
final EnvironmentSettings settings =
EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
final StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);
env.setMaxParallelism(env.getParallelism() * 8);
env.getConfig().setAutoWatermarkInterval(config.autowatermarkInterval());
final TableConfig tConfig = tEnv.getConfig();
tConfig.setIdleStateRetention(Duration.ofMinutes(60));
tConfig.getConfiguration().setString("table.exec.source.idle-timeout", "180000 ms");
To test this locally with a Kafka source, I fired a few events at the Flink job. The Flink UI shows it produced one watermark. I then waited 3 minutes to see if watermarks advance without sending new events (i.e. an idle partition). However, no watermark advancement occurred.
Note: I use a local Kafka broker with three partitions, and my test data is keyed and hence gets sent to the same partition. However, I am not seeing watermarks advance even though the other partitions are idle and I wait 3 minutes.
Is there any place in the job UI where I could see whether the 3-minute value I set is actually picked up? Am I using the right units (seconds vs. ms)?
Anything else I could check to test this setting?
We are running Flink 1.12.1.
Update: I see this exception in my Flink SQL job under Exceptions; I wonder if there is a mismatch of versions.
2021-10-26 16:38:14
java.lang.NoClassDefFoundError: org/apache/kafka/common/requests/OffsetsForLeaderEpochRequest$PartitionData
at org.apache.kafka.clients.consumer.internals.OffsetsForLeaderEpochClient.lambda$null$0(OffsetsForLeaderEpochClient.java:52)
at java.base/java.util.Optional.ifPresent(Unknown Source)
at org.apache.kafka.clients.consumer.internals.OffsetsForLeaderEpochClient.lambda$prepareRequest$1(OffsetsForLeaderEpochClient.java:51)
at java.base/java.util.HashMap.forEach(Unknown Source)
at org.apache.kafka.clients.consumer.internals.OffsetsForLeaderEpochClient.prepareRequest(OffsetsForLeaderEpochClient.java:51)
at org.apache.kafka.clients.consumer.internals.OffsetsForLeaderEpochClient.prepareRequest(OffsetsForLeaderEpochClient.java:37)
at org.apache.kafka.clients.consumer.internals.AsyncClient.sendAsyncRequest(AsyncClient.java:37)
at org.apache.kafka.clients.consumer.internals.Fetcher.lambda$validateOffsetsAsync$5(Fetcher.java:798)
at java.base/java.util.HashMap.forEach(Unknown Source)
at org.apache.kafka.clients.consumer.internals.Fetcher.validateOffsetsAsync(Fetcher.java:774)
at org.apache.kafka.clients.consumer.internals.Fetcher.validateOffsetsIfNeeded(Fetcher.java:498)
at org.apache.kafka.clients.consumer.KafkaConsumer.updateFetchPositions(KafkaConsumer.java:2328)
at org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1271)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1235)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1168)
at org.apache.flink.streaming.connectors.kafka.internals.KafkaConsumerThread.run(KafkaConsumerThread.java:249)
The issue was that this setting does not work in Flink 1.12.0 or 1.12.1. I had to upgrade to Flink 1.13.2 and the setting was honored and worked as expected.
The exception was a red herring and not consistently reproducible.
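For reference, here is the same option set through the Table API on Flink 1.13+, where it is honored; a minimal Scala sketch (the environment setup is reduced to the essentials, and the value is a standard Flink duration string, so "180000 ms" and "3 min" are equivalent):

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment

// Sketch for Flink 1.13+: set the idle timeout on the table environment's configuration.
val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = StreamTableEnvironment.create(env)
tEnv.getConfig.getConfiguration.setString("table.exec.source.idle-timeout", "3 min")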

Flink job can't use savepoint in a batch job

Let me start in a generic fashion to see if I somehow missed some concepts: I have a streaming Flink job from which I created a savepoint. A simplified version of this job looks like this.
Pseudo-code:
val flink = StreamExecutionEnvironment.getExecutionEnvironment
val stream = if (batchMode) {
  flink.readFile(path)
} else {
  flink.addKafkaSource(topicName)
}
val processed = stream
  .keyBy(key)
  .process(new ProcessorWithKeyedState())
CassandraSink.addSink(processed)
This works fine as long as I run the job without a savepoint. If I start the job from a savepoint, I get an exception which looks like this:
Caused by: java.lang.UnsupportedOperationException: Checkpoints are not supported in a single key state backend
at org.apache.flink.streaming.api.operators.sorted.state.NonCheckpointingStorageAccess.resolveCheckpoint(NonCheckpointingStorageAccess.java:43)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1623)
at org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:362)
at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:292)
at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:249)
I could work around this if I set the option:
execution.batch-state-backend.enabled: false
but this eventually results in another error:
Caused by: java.lang.IllegalArgumentException: The fraction of memory to allocate should not be 0. Please make sure that all types of managed memory consumers contained in the job are configured with a non-negative weight via `taskmanager.memory.managed.consumer-weights`.
at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:160)
at org.apache.flink.runtime.memory.MemoryManager.validateFraction(MemoryManager.java:673)
at org.apache.flink.runtime.memory.MemoryManager.computeMemorySize(MemoryManager.java:653)
at org.apache.flink.runtime.memory.MemoryManager.getSharedMemoryResourceForManagedMemory(MemoryManager.java:526)
Of course I tried to set the config key taskmanager.memory.managed.consumer-weights (I used DATAPROC:70,PYTHON:30), but this doesn't seem to have any effect.
So I wonder whether I have a conceptual error and can't reuse savepoints from a streaming job in a batch job, or whether I simply have a problem in my configuration. Any hints?
After a hint from the Flink user group it turned out that it is NOT possible to reuse a savepoint from the streaming job in batch mode (https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/datastream/execution_mode/#state-backends--state). So instead of running the job in batch mode (flink.setRuntimeMode(RuntimeExecutionMode.BATCH)), I just run it in the default execution mode (STREAMING). This has the minor downside that it runs forever and has to be stopped by someone once all data has been processed.
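A minimal sketch of that workaround (Scala DataStream API; the rest of the pipeline from the pseudo-code above is unchanged):

import org.apache.flink.api.common.RuntimeExecutionMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

// Keep the default STREAMING runtime mode so the savepoint can be restored,
// even when the source is a bounded file instead of Kafka.
val flink = StreamExecutionEnvironment.getExecutionEnvironment
flink.setRuntimeMode(RuntimeExecutionMode.STREAMING) // i.e. simply do not switch to BATCH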

Spark structured streaming with kafka throwing error after running for a while

I am observing weird behaviour while running a Spark Structured Streaming program. I am using an S3 bucket for metadata checkpointing.
The kafka topic has 310 partitions.
When I start the streaming job for the first time, after completion of every batch Spark creates a new file named after the batch_id in the offsets directory at the checkpointing location. After successful completion of a few batches, the Spark job fails after a few retries, giving the warning: "WARN KafkaMicroBatchReader:66 - Set(logs-2019-10-04-77, logs-2019-10-04-85, logs-2019-10-04-71, logs-2019-10-04-93, logs-2019-10-04-97, logs-2019-10-04-101, logs-2019-10-04-89, logs-2019-10-04-81, logs-2019-10-04-103, logs-2019-10-04-104, logs-2019-10-04-102, logs-2019-10-04-98, logs-2019-10-04-94, logs-2019-10-04-90, logs-2019-10-04-74, logs-2019-10-04-78, logs-2019-10-04-82, logs-2019-10-04-86, logs-2019-10-04-99, logs-2019-10-04-91, logs-2019-10-04-73, logs-2019-10-04-79, logs-2019-10-04-87, logs-2019-10-04-83, logs-2019-10-04-75, logs-2019-10-04-92, logs-2019-10-04-70, logs-2019-10-04-96, logs-2019-10-04-88, logs-2019-10-04-95, logs-2019-10-04-100, logs-2019-10-04-72, logs-2019-10-04-76, logs-2019-10-04-84, logs-2019-10-04-80) are gone. Some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you don't want your streaming query to fail on such cases, set the source
option "failOnDataLoss" to "false"."
The weird thing here is that the previous batch's offset file contains partition info for all 310 partitions, but the current batch is reading only selected partitions (see the warning message above).
I reran the job with .option("failOnDataLoss", false) and got the same warning as above, but without job failure. It was observed that Spark was processing the correct offsets for a few partitions, and for the rest of the partitions it was reading from the starting offset (0).
There were no connection issues between Spark and Kafka while this error occurred (we also checked the Kafka logs).
Could someone help with this? Am I doing something wrong or missing something?
Below is the read and write stream code snippet.
val kafkaDF = ss.readStream.format("kafka")
  .option("kafka.bootstrap.servers", kafkaBrokers /*"localhost:9092"*/)
  .option("subscribe", logs)
  .option("fetchOffset.numRetries", 5)
  .option("maxOffsetsPerTrigger", 30000000)
  .load()

val query = logDS
  .writeStream
  .foreachBatch { (batchDS: Dataset[Row], batchId: Long) =>
    batchDS.repartition(noofpartitions, batchDS.col("abc"), batchDS.col("xyz"))
      .write.mode(SaveMode.Append)
      .partitionBy("date", "abc", "xyz")
      .format("parquet")
      .saveAsTable(hiveTableName /*"default.logs"*/)
  }
  .trigger(Trigger.ProcessingTime(1800 + " seconds"))
  .option("checkpointLocation", s3bucketpath)
  .start()
Thanks in advance.

Why is a streaming query still up and running after StreamingQueryManager.awaitAnyTermination?

I want to terminate the Spark mapping after a specific time. I'm using sqlContext.streams.awaitAnyTermination(long timeoutMs) for that, but the mapping is not stopping after the given timeout.
I have tried reading from Azure Event Hub and provided 2 minutes (120000 ms) as the timeout for the awaitAnyTermination method, but the mapping is not stopping on the Azure Databricks cluster.
Below is my code. I'm reading from Azure Event Hub, writing to the console, and passing 120000 ms to awaitAnyTermination.
import org.apache.spark.eventhubs._
// Event hub configurations
// Replace values below with yours
import org.apache.spark.eventhubs.ConnectionStringBuilder
val connStr = ConnectionStringBuilder()
.setNamespaceName("iisqaeventhub")
.setEventHubName("devsource")
.setSasKeyName("RootManageSharedAccessKey")
.setSasKey("saskey")
.build
val customEventhubParameters = EventHubsConf(connStr).setMaxEventsPerTrigger(5).setStartingPosition(EventPosition.fromEndOfStream)
// reading from the Azure event hub
val incomingStream = spark.readStream.format("eventhubs").options(customEventhubParameters.toMap).load()
// write to console
val query = incomingStream.writeStream
.outputMode("append")
.format("console")
.start()
// awaitAnyTermination for shutting down the query
sqlContext.streams.awaitAnyTermination(120000)
I am expecting the mapping to end after the timeout. There is no error, but the mapping is not stopping.
tl;dr Works as designed.
From the official documentation:
awaitAnyTermination(timeoutMs: Long): Boolean
Returns whether any query has terminated or not (multiple may have terminated).
In other words, no streaming query is going to be terminated at any point in time (before or after the timeoutMs) unless there is an exception or stop.
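If the intent is "stop everything after 2 minutes", here is a minimal sketch of how to use the returned Boolean and then stop the remaining queries yourself (assuming the spark SparkSession available in a Databricks notebook; the timeout is the 120000 ms from the question):

// Let awaitAnyTermination act as a 2-minute timer, then stop whatever is still running.
val anyTerminated = spark.streams.awaitAnyTermination(120000) // false if nothing terminated in time
if (!anyTerminated) {
  spark.streams.active.foreach(_.stop()) // explicitly stop the queries that are still active
}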
When using Databricks and prototyping, this is what I use to stop Spark Structured Streaming apps in a separate notebook pane:
import org.apache.spark.streaming._
StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) }

IllegalStateException: _spark_metadata/0 doesn't exist while compacting batch 9

We have a streaming application implemented using Spark Structured Streaming which tries to read data from Kafka topics and write it to an HDFS location.
Sometimes the application fails with the exception:
_spark_metadata/0 doesn't exist while compacting batch 9
java.lang.IllegalStateException: history/1523305060336/_spark_metadata/9.compact doesn't exist when compacting batch 19 (compactInterval: 10)
We are not able to resolve this issue.
The only solution I found is to delete the checkpoint location files, which makes the job read the topic/data from the beginning as soon as we run the application again. However, this is not a feasible solution for a production application.
Does anyone have a solution for this error that does not involve deleting the checkpoint, so that I can continue from where the last run failed?
Sample code of the application:
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", <server list>)
.option("subscribe", <topic>)
.load()
[...] // do some processing
dfProcessed.writeStream
  .format("csv")
  .option("format", "append")
  .option("path", hdfsPath)
  .option("checkpointLocation", "")
  .outputMode("append")
  .start()
The error message
_spark_metadata/n.compact doesn't exist when compacting batch n+10
can show up when you
process some data into a FileSink with checkpoint enabled, then
stop your streaming job, then
change the output directory of the FileSink while keeping the same checkpointLocation, then
restart the streaming job
Quick Solution (not for production)
Just delete the files in checkpointLocation and restart the application.
Stable Solution
As you do not want to delete your checkpoint files, you could simply copy the missing spark metadata files from the old file sink output path to the new output path. See below to understand what the "missing spark metadata files" are.
Background
To understand why this IllegalStateException is being thrown, we need to understand what is happening behind the scenes in the provided file output path. Let outPathBefore be the name of this path. When your streaming job is running and processing data, the job creates a folder outPathBefore/_spark_metadata. In that folder you will find a file named after each micro-batch identifier, containing the list of files (partition files) the data has been written to, e.g.:
/home/mike/outPathBefore/_spark_metadata$ ls
0 1 2 3 4 5 6 7
In this case we have details for 8 micro batches. The content of one of the files looks like
/home/mike/outPathBefore/_spark_metadata$ cat 0
v1
{"path":"file:///tmp/file/before/part-00000-99bdc705-70a2-410f-92ff-7ca9c369c58b-c000.csv","size":2287,"isDir":false,"modificationTime":1616075186000,"blockReplication":1,"blockSize":33554432,"action":"add"}
By default, on every tenth micro-batch these files get compacted, meaning the contents of the files 0, 1, 2, ..., 9 are stored in a compacted file called 9.compact.
This procedure continues for the subsequent ten batches, i.e. in micro-batch 19 the job aggregates the last 10 files, which are 9.compact, 10, 11, 12, ..., 19.
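As an aside (an assumption on my part, not part of the original explanation): the compaction cadence of 10 corresponds to the internal Spark SQL setting spark.sql.streaming.fileSink.log.compactInterval, which can in principle be changed before the query starts:

// Assumed internal setting; 10 is the default compact interval.
spark.conf.set("spark.sql.streaming.fileSink.log.compactInterval", "10")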
Now, imagine you had the streaming job running until micro batch 15 which means the job has created the following files:
/home/mike/outPathBefore/_spark_metadata/0
/home/mike/outPathBefore/_spark_metadata/1
...
/home/mike/outPathBefore/_spark_metadata/8
/home/mike/outPathBefore/_spark_metadata/9.compact
/home/mike/outPathBefore/_spark_metadata/10
...
/home/mike/outPathBefore/_spark_metadata/15
After the fifteenth micro-batch you stopped the streaming job and changed the output path of the file sink to, say, outPathAfter. As you keep the same checkpointLocation, the streaming job will continue with micro-batch 16. However, it now creates the metadata files in the new output path:
/home/mike/outPathAfter/_spark_metadata/16
/home/mike/outPathAfter/_spark_metadata/17
...
Now, this is where the exception is thrown: when reaching micro-batch 19, the job tries to compact the latest ten files in the spark metadata folder. However, it can only find the files 16, 17, and 18; it does not find 9.compact, 10, etc. Hence the error message says:
java.lang.IllegalStateException: history/1523305060336/_spark_metadata/9.compact doesn't exist when compacting batch 19 (compactInterval: 10)
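The "Stable Solution" above then amounts to copying those missing files from outPathBefore/_spark_metadata into outPathAfter/_spark_metadata before restarting the query. A sketch using the Hadoop FileSystem API, with the example paths from above:

import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Copy the old sink's metadata files into the new sink's _spark_metadata folder.
val fs  = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val src = new Path("/home/mike/outPathBefore/_spark_metadata")
val dst = new Path("/home/mike/outPathAfter/_spark_metadata")

fs.listStatus(src).foreach { status =>
  // copy without deleting the source files
  FileUtil.copy(fs, status.getPath, fs, new Path(dst, status.getPath.getName), false, fs.getConf)
}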
Documentation
The Structured Streaming Programming Guide explains on Recovery Semantics after Changes in a Streaming Query:
"Changes to output directory of a file sink are not allowed: sdf.writeStream.format("parquet").option("path", "/somePath") to sdf.writeStream.format("parquet").option("path", "/anotherPath")"
Databricks has also written some details in the article Streaming with File Sink: Problems with recovery if you change checkpoint or output directories
The error is caused by the checkpointLocation, because it stores old or deleted data information. You just need to delete the checkpointLocation folder.
Explore more: https://kb.databricks.com/streaming/file-sink-streaming.html
Example:
df.writeStream
.format("parquet")
.outputMode("append")
.option("checkpointLocation", "D:/path/dir/checkpointLocation")
.option("path", "D:/path/dir/output")
.trigger(Trigger.ProcessingTime("5 seconds"))
.start()
.awaitTermination()
You need to delete the checkpointLocation directory.
This article introduces the mechanism and gives a good way to recover from a deleted _spark_metadata folder in Spark Structured Streaming:
https://dev.to/kevinwallimann/how-to-recover-from-a-deleted-sparkmetadata-folder-546j
"Create dummy log files:
If the metadata log files are irrecoverable, we could create dummy log files for the missing micro-batches.
In our example, this could be done like this:
for i in {0..1}; do echo v1 > "/tmp/destination/_spark_metadata/$i"; done
This will create the files
/tmp/destination/_spark_metadata/0
/tmp/destination/_spark_metadata/1
Now, the query can be restarted and should finish without errors."
As my previous output folder was not recoverable anymore, I tried this dummy solution, which could work to get rid of the IllegalStateException: _spark_metadata/... doesn't exist exception.