Dataproc Spark checkpoint best practices? What should I set the checkpoint dir to? - google-cloud-dataproc

I am running a very long-running batch job. It generates a lot of OOM exceptions. To minimize this problem, I added checkpoint() calls.
Where should I set the checkpoint dir to? The location has to be accessible to all the executors. Currently, I am using a bucket. Based on the log files, I can see that my code has progressed past several of the checkpoint() calls; however, the bucket is empty.
sparkContext.setCheckpointDir("gs://myBucket/checkpointDir/")
Based on CPU utilization and log messages, it looks like my job is still running and making progress after those calls. Any idea where Spark is writing the checkpoint data?
2022-01-22 18:38:06 WARN DAGScheduler:69 - Broadcasting large task binary with size 4.9 MiB
2022-01-22 18:47:23 WARN BlockManagerMasterEndpoint:69 - No more replicas available for broadcast_50_piece0 !
2022-01-22 18:47:23 WARN BlockManagerMaster:90 - Failed to remove broadcast 50 with removeFromMaster = true - org.apache.spark.SparkException: Could not find BlockManagerEndpoint1.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:176)
kind regards
Andy

Did you manually trigger the checkpoint in your code? If not, it won't be triggered automatically. See https://programmer.help/blogs/spark_-correct-use-of-checkpoint-in-spark-and-its-difference-from-cache.html for details. Checkpointing is generally not a way to solve OOM problems in Spark.
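To make the "manually trigger" point concrete, here is a minimal PySpark sketch. It assumes the job uses the RDD API (the question does not say); the helper name checkpoint_and_materialize, the demo() function, the local master and the /tmp path are hypothetical and only for illustration. On an RDD, checkpoint() only marks the RDD, and the files are written once an action runs a job on it, so a directory that stays empty after checkpoint() has been called can simply mean no action has materialized that RDD yet.

```python
def checkpoint_and_materialize(sc, rdd, checkpoint_dir):
    """Mark `rdd` for checkpointing, then force the write with an action.

    `sc` is expected to be a pyspark SparkContext and `rdd` a pyspark RDD.
    The helper name and the directory are hypothetical.
    """
    # Set the directory before calling checkpoint(). On a cluster it should be
    # a location every executor can reach (the question uses a gs:// bucket).
    sc.setCheckpointDir(checkpoint_dir)

    # Persist first so the checkpoint job does not recompute the whole lineage.
    rdd.cache()

    # This only marks the RDD for checkpointing; nothing is written yet.
    rdd.checkpoint()

    # An action runs a job; the checkpoint files are written as part of it.
    rdd.count()

    return rdd.isCheckpointed(), rdd.getCheckpointFile()


def demo():
    # Requires pyspark. Local mode and the /tmp path are assumptions.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "checkpoint-demo")
    try:
        rdd = sc.parallelize(range(1000)).map(lambda x: x * 2)
        print(checkpoint_and_materialize(sc, rdd, "/tmp/checkpointDir"))
    finally:
        sc.stop()
```

Calling demo() on a machine with pyspark installed should report whether the RDD was checkpointed and where the files went. If the job uses DataFrames instead, DataFrame.checkpoint() is eager by default, so the behaviour differs from this RDD sketch.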

Related

spark streaming - waiting for a dead executor

I have a spark streaming application running inside a k8s cluster (using spark-operator).
I have 1 executor, reading batches every 5s from a Kinesis stream.
The Kinesis stream has 12 shards, and the executor creates 1 receiver per shard. I gave it 24 cores, so it should be more than enough to handle it.
For some unknown reason, sometimes the executor crashes. I suspect it is due to memory going over the k8s pod memory limit, which would cause k8s to kill the pod. But I have not been able to prove this theory yet.
After the executor crashes, a new one is created.
However, the "work" stops. The new executor is not doing anything.
I investigated a bit:
Looking at the logs of the pod - I saw that it did execute a few tasks successfully after it was created, and then it stopped because it did not get any more tasks.
Looking in Spark Web UI - I see that there is 1 “running batch” that is not finishing.
I found some docs that say there can always be only 1 active batch at a time. So this is why the work stopped - it is waiting for this batch to finish.
Digging a bit deeper in the UI, I found this page that shows details about the tasks.
So executor 2 finished doing all the tasks it was assigned.
There are 12 tasks that were assigned to executor 1 which are still waiting to finish, but executor 1 is dead.
Why does this happen? Why does Spark not know that executor 1 is dead and is never going to finish its assigned tasks? I would expect Spark to reassign these tasks to executor 2.

High I/O wait when Spark Structured Streaming checkpoint changed to EFS

I'm currently running a Spark Structured Streaming application written in Python (PySpark), where my source is a Kafka topic and my sink is MongoDB. I changed my checkpoint location to Amazon EFS, which is mounted on all Spark workers, and after that I/O wait increased, averaging 8%.
Currently I have 6000 messages coming into Kafka every second, and every once in a while I get a WARN message:
22/02/25 13:12:31 WARN HDFSBackedStateStoreProvider: Error cleaning up files for HDFSStateStoreProvider[id = (op=0,part=90),dir = file:/mnt/efs_max_io/spark/state/0/90]
java.lang.NumberFormatException: For input string: ""
I'm not quite sure whether that message has anything to do with the high I/O wait. Is this behavior expected, or is it something to be concerned about?

Standby tasks not writing updates to .checkpoint files

I have a Kafka Streams application that is configured to have 1 standby replica created for each task. I have two instances of the application running. When the application starts, it writes .checkpoint files for each of the partitions it is responsible for. It writes these files for partitions owned by both active and standby tasks.
When sending a new Kafka event to be processed by the application, the instance containing that active task for the partition updates the offsets in the .checkpoint file. However, the .checkpoint file for the standby task on the second instance is never updated. It remains at the old offset.
I believe this is causing OffsetOutOfRangeExceptions to be thrown when we rebalance, which results in tasks being torn down and created from scratch.
Am I right in thinking that offsets should be written for partitions in both standby and active tasks?
Is this an indication that my standby tasks are not consuming, or could it be that they are simply unable to write the offset?
Any ideas what could be causing this behaviour?
Streams version: 2.3.1
This issue has been fixed in Kafka 2.4.0, which resolves the following bug: issues.apache.org/jira/browse/KAFKA-8755
Note: The issue appears to affect only applications that are configured with OPTIMIZE="all".

Apache Flink - duplicate message processing during job deployments, with ActiveMQ as source

Given,
I have a Flink job that reads from an ActiveMQ source and writes to a MySQL database, keyed on an identifier. I have enabled checkpoints for this job every second. I point the checkpoints to a MinIO instance, and I verified that the checkpoints are working with the job ID. I deploy this job in OpenShift (Kubernetes underneath), and I can scale the job up or down as required.
Problem
When the job is redeployed (rolling) or goes down due to a bug or error, and there are unconsumed messages in ActiveMQ or unacknowledged messages in Flink (already written to the database), the recovered (or newly deployed) job processes messages that were already processed, resulting in duplicate records inserted in the database.
Question
Shouldn't the checkpoints help the job recover from where it left?
Should I take the checkpoint before I (rolling) deploy new job?
What happens if the job quit with error or cluster failure?
As the jobid keeps changing on every deployment, how does the recovery happen?
Edit: As I cannot expect idempotency from the database, can I avoid saving duplicates to the database (Exactly-Once) by writing a database-specific upsert query that updates the record if it is present and inserts it if not?
The JDBC sink currently supports only at-least-once delivery, meaning you get duplicate messages upon recovery. There is currently a draft to add support for exactly-once, which would probably be released with 1.11.
Shouldn't the checkpoints help the job recover from where it left?
Yes, but anything processed between the last successful checkpoint and the failure is replayed on recovery, which could produce the observed duplicates. I gave a more detailed answer on a somewhat related topic.
Should I take the checkpoint before I (rolling) deploy new job?
Absolutely. You should actually use cancel with savepoint. That is the only reliable way to change the topology. Additionally, cancel with savepoints avoids any duplicates in the data as it gracefully shuts down the job.
What happens if the job quit with error or cluster failure?
It should automatically restart (depending on your restart settings). It would use the latest checkpoint for recovery. That would most certainly result in duplicates.
As the jobid keeps changing on every deployment, how does the recovery happen?
You usually point explicitly to the same checkpoint directory (on S3?).
As I cannot expect idempotency from the database, is upsert the only way to achieve Exactly-Once processing?
Currently, I do not see a way around it. It should change with 1.11.
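The replay-and-upsert reasoning in this answer can be sketched with a small simulation. This is not Flink code: an in-memory SQLite database stands in for the MySQL sink, a Python list stands in for the ActiveMQ source, and all names (make_sink, write, run_with_crash, the table names) are hypothetical. It only shows why messages written after the last checkpoint come back as duplicates with a plain insert, and why a write keyed on the identifier absorbs them.

```python
import sqlite3


def make_sink():
    # In-memory SQLite stands in for the MySQL sink (an assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE plain (id TEXT, payload TEXT)")
    conn.execute("CREATE TABLE keyed (id TEXT PRIMARY KEY, payload TEXT)")
    return conn


def write(conn, msg_id, payload):
    # Plain insert: a replayed message becomes a second row.
    conn.execute("INSERT INTO plain (id, payload) VALUES (?, ?)", (msg_id, payload))
    # Upsert keyed on the identifier: a replayed message overwrites its own row.
    # INSERT OR REPLACE is SQLite syntax; in MySQL the counterpart would be
    # INSERT ... ON DUPLICATE KEY UPDATE.
    conn.execute(
        "INSERT OR REPLACE INTO keyed (id, payload) VALUES (?, ?)", (msg_id, payload)
    )


def run_with_crash(conn, messages, checkpoint_every, crash_after):
    """Process messages, 'crash', then replay from the last checkpoint."""
    last_checkpoint = 0
    for position, (msg_id, payload) in enumerate(messages[:crash_after], start=1):
        write(conn, msg_id, payload)
        if position % checkpoint_every == 0:
            last_checkpoint = position  # source position saved in the checkpoint
    # Recovery: the source rewinds to the checkpointed position, so messages
    # written after that checkpoint are delivered a second time.
    for msg_id, payload in messages[last_checkpoint:]:
        write(conn, msg_id, payload)
    return last_checkpoint


def count(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


messages = [(f"id-{i}", f"payload-{i}") for i in range(10)]
sink = make_sink()
last_checkpoint = run_with_crash(sink, messages, checkpoint_every=4, crash_after=7)
plain_rows = count(sink, "plain")
keyed_rows = count(sink, "keyed")
print(last_checkpoint, plain_rows, keyed_rows)  # prints: 4 13 10
```

With 10 messages, a checkpoint after every 4 and a crash after 7, the three messages written after the last checkpoint are replayed: the plain table ends up with 13 rows, the keyed table with 10.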

Will flink resume from the last offset after executing yarn application kill and running again?

I use FlinkKafkaConsumer to consume from Kafka and have checkpointing enabled. Now I'm a little confused about offset management and the checkpoint mechanism.
I already know that Flink will start reading partitions from the consumer group's committed offsets:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-consumers-start-position-configuration
and that the offsets are stored in the checkpoint on a remote file system:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-consumers-and-fault-tolerance
What happens if I stop the application by executing yarn application -kill appid and then run the start command, like ./bin/flink run ...?
Will Flink get the offsets from the checkpoint or from the group ID managed by Kafka?
If you run the job again without defining a savepoint ($ bin/flink run -s :savepointPath [:runArgs]), Flink will try to get the offsets of your consumer group from Kafka (in older versions, from ZooKeeper). But you will lose all other state of your Flink job (which might be ignorable if you have a stateless Flink job).
I must admit that this behaviour is quite confusing. By default, starting a job without a savepoint is like starting from zero. As far as I know, only the implementation of the Kafka source differs from that behaviour. If you want to change it, you can choose a different start position on the FlinkKafkaConsumer[08/09/10], for example setStartFromEarliest() or setStartFromLatest() instead of the default setStartFromGroupOffsets(). This is described here: Kafka Consumers Start Position Configuration
It might be worth having a closer look at the documentation of flink: What is a savepoint and how does it differ from checkpoints.
In a nutshell
Checkpoints:
The primary purpose of Checkpoints is to provide a recovery mechanism in case of unexpected job failures. A Checkpoint’s lifecycle is managed by Flink
Savepoints:
Savepoints are created, owned, and deleted by the user. Their use-case is for planned, manual backup and resume
There are ongoing discussions on how to "unify" savepoints and checkpoints. You can find a lot of technical details here: Flink Improvement Proposal 47 (FLIP-47): Checkpoints vs. Savepoints