spark streaming - waiting for a dead executor - scala

I have a spark streaming application running inside a k8s cluster (using spark-operator).
I have 1 executor, reading batches every 5s from a Kinesis stream.
The Kinesis stream has 12 shards, and the executor creates 1 receiver per shard. I gave it 24 cores, so it should be more than enough to handle it.
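For context, the receiver setup follows the usual one-receiver-per-shard pattern, roughly like the sketch below (a simplified sketch; the app name, stream name, endpoint, and region are placeholders, and the batch processing is elided):

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

val conf = new SparkConf().setAppName("my-streaming-app")  // placeholder app name
val ssc  = new StreamingContext(conf, Seconds(5))          // 5s batches

val numShards = 12
// One receiver per shard; each receiver permanently occupies one executor core.
val shardStreams = (1 to numShards).map { _ =>
  KinesisUtils.createStream(
    ssc,
    "my-kinesis-app",                           // KCL application name (placeholder)
    "my-stream",                                // Kinesis stream name (placeholder)
    "https://kinesis.us-east-1.amazonaws.com",  // endpoint (placeholder)
    "us-east-1",                                // region (placeholder)
    InitialPositionInStream.TRIM_HORIZON,
    Seconds(5),                                 // KCL checkpoint interval
    StorageLevel.MEMORY_AND_DISK_2)
}

// Union the 12 receiver streams and process each batch.
val unified = ssc.union(shardStreams)
unified.foreachRDD { rdd =>
  // placeholder batch processing
  println(s"records in batch: ${rdd.count()}")
}

ssc.start()
ssc.awaitTermination()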
For some unknown reason, sometimes the executor crashes. I suspect it is due to memory going over the k8s pod memory limit, which would cause k8s to kill the pod. But I have not been able to prove this theory yet.
After the executor crashes, a new one is created.
However, the "work" stops. The new executor is not doing anything.
I investigated a bit:
Looking at the logs of the pod - I saw that it did execute a few tasks successfully after it was created, and then it stopped because it did not get any more tasks.
Looking in Spark Web UI - I see that there is 1 “running batch” that is not finishing.
I found some docs that say there can only ever be 1 active batch at a time. So this is why the work stopped: it is waiting for this batch to finish.
Digging a bit deeper in the UI, I found this page that shows details about the tasks.
So executor 2 finished doing all the tasks it was assigned.
There are 12 tasks that were assigned to executor 1 which are still waiting to finish, but executor 1 is dead.
Why does this happen? Why does Spark not know that executor 1 is dead and is never going to finish its assigned tasks? I would expect Spark to reassign these tasks to executor 2.

Related

dataproc spark checkpoint best practices? what should I set the checkpoint dir to?

I am running a very long-running batch job. It generates a lot of OOM exceptions. To mitigate this problem, I added checkpoint() calls.
What should I set the checkpoint dir to? The location has to be accessible to all the executors. Currently, I am using a bucket. Based on the log files I can see that my code has progressed past several of the checkpoint() calls; however, the bucket is empty.
sparkContext.setCheckpointDir("gs://myBucket/checkpointDir/")
Based on CPU utilization and log messages, it looks like my job is still running and making progress afterwards. Any idea where Spark is writing the checkpoint data?
2022-01-22 18:38:06 WARN DAGScheduler:69 - Broadcasting large task binary with size 4.9 MiB
2022-01-22 18:47:23 WARN BlockManagerMasterEndpoint:69 - No more replicas available for broadcast_50_piece0 !
2022-01-22 18:47:23 WARN BlockManagerMaster:90 - Failed to remove broadcast 50 with removeFromMaster = true - org.apache.spark.SparkException: Could not find BlockManagerEndpoint1.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:176)
kind regards
Andy
Did you manually trigger checkpoint() in your code? If not, it won't be triggered automatically. See https://programmer.help/blogs/spark_-correct-use-of-checkpoint-in-spark-and-its-difference-from-cache.html Checkpointing is generally not a way to solve OOM problems in Spark.
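To illustrate, a minimal sketch of manually triggered RDD checkpointing (the bucket path comes from the question; the input and transformations are placeholders). Note that checkpoint() only marks the RDD - the checkpoint files are written when an action materializes it:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-example").getOrCreate()
val sc = spark.sparkContext

// Checkpoint files will be written under this directory.
sc.setCheckpointDir("gs://myBucket/checkpointDir/")

val data        = sc.parallelize(1 to 1000000)              // placeholder input
val transformed = data.map(_ * 2).filter(_ % 3 == 0)

transformed.cache()       // avoids recomputing the lineage a second time for the checkpoint
transformed.checkpoint()  // only *marks* the RDD for checkpointing
transformed.count()       // the first action actually writes the checkpoint data to the bucket

If none of the actions that depend on the marked RDDs have run yet, that would be consistent with the empty bucket you are seeing.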

CeleryExecutor: Does the airflow metric "executor.queued_tasks" report the number of tasks in the celery broker?

Using its statsd plugin, Airflow can report on the metric executor.queued_tasks, as well as some others.
I am using CeleryExecutor and need to know how many tasks are waiting in the Celery broker, so I know when new workers should be spawned. Indeed, I configured my workers so they cannot take many tasks concurrently. Is this metric what I need?
Nope. If you want to know how many task instances (TIs) are waiting in the broker, you'll have to connect to it.
Task instances that are waiting to get picked up in the celery broker are queued according to the Airflow DB, but running according to the CeleryExecutor. This is because the CeleryExecutor considers that any task instance that was successfully sent to the broker is now running (unlike the DB, which waits for a worker to pick it up before marking it as running).
Metric executor.queued_tasks reports the number of tasks queued according to the executor, not the DB.
The number of queued task instances according to the DB is not exactly what you need either, because it reports the number of task instances that are waiting in the broker plus the number of task instances queued to the executor. But when would TIs be stuck in the executor's queue, you ask? When the parallelism setting of Airflow prevents the executor from sending them to the broker.
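If you do decide to connect to the broker directly, here is a minimal sketch assuming the broker is Redis, using the Jedis client; the host, port, and queue name are assumptions (Airflow's default_queue is usually "default", but check your config). With Celery on Redis, each queue is a Redis list, so LLEN gives the number of messages waiting to be picked up by a worker:

import redis.clients.jedis.Jedis

// Assumes a Redis broker; host, port, and queue name are placeholders.
val jedis = new Jedis("my-redis-broker", 6379)
try {
  val waiting: Long = jedis.llen("default")  // "default" is Airflow's usual default_queue
  println(s"Task instances waiting in the broker: $waiting")
} finally {
  jedis.close()
}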

Kafka Connect assigns same task to multiple workers

I'm using Kafka Connect in distributed mode. A strange behavior I have observed multiple times now is that, after some time (it can be hours, it can be days), what appears to be a rebalancing error happens: the same tasks get assigned to multiple workers. As a result, they run concurrently and, depending on the nature of the connector, fail or produce "unpredictable" outputs.
The simplest configuration I was able to use to reproduce the behavior is: two Kafka Connect workers, two connectors, each connector with one task only. Kafka Connect is deployed into Kubernetes. Kafka itself is in Confluent Cloud. Both Kafka Connect and Kafka are of the same version (5.3.1).
Relevant messages from the log:
Worker A:
[2019-10-30 12:44:23,925] INFO [Worker clientId=connect-1, groupId=some-kafka-connect-cluster] Successfully joined group with generation 488 (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:469)
[2019-10-30 12:44:23,926] INFO [Worker clientId=connect-1, groupId=some-kafka-connect-cluster] Joined group at generation 488 and got assignment: Assignment{error=0, leader='connect-1-d5c19893-b33c-4f07-85fb-db9736795759', leaderUrl='http://10.16.0.15:8083/', offset=250, connectorIds=[some-hdfs-sink, some-mqtt-source], taskIds=[some-hdfs-sink-0, some-mqtt-source-0], revokedConnectorIds=[], revokedTaskIds=[], delay=0} (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1397)
Worker B:
[2019-10-30 12:44:23,930] INFO [Worker clientId=connect-1, groupId=some-kafka-connect-cluster] Successfully joined group with generation 488 (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:469)
[2019-10-30 12:44:23,936] INFO [Worker clientId=connect-1, groupId=some-kafka-connect-cluster] Joined group at generation 488 and got assignment: Assignment{error=0, leader='connect-1-d5c19893-b33c-4f07-85fb-db9736795759', leaderUrl='http://10.16.0.15:8083/', offset=250, connectorIds=[some-mqtt-source], taskIds=[some-mqtt-source-0], revokedConnectorIds=[], revokedTaskIds=[], delay=0} (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1397)
In the log extracts above you can observe that the same task (some-mqtt-source-0) is assigned to two workers. After this message, I can also see log messages from the task instances on both workers.
This behavior doesn't depend on the connector (I observed it with other tasks as well). It also doesn't happen immediately after the workers are started, but only after some time.
My question is: what can be the cause of this behavior?
EDIT 1:
I've tried running 3 workers, instead of two, thinking that it might be a distributed consensus issue. It appears not to be, and having 3 workers doesn't fix the issue.
EDIT 2:
I've noticed that just before worker A is assigned a task that originally ran on worker B, that worker (B) observes an error joining the group. For example, if tasks get "duplicated" in generation N, worker B would not have a "Successfully joined group with generation N" message in its logs. What's more, between generations N-1 and N+1, worker B typically logs errors like Attempt to heartbeat failed for since member id and Group coordinator bx-xxx-xxxxx.europe-west1.gcp.confluent.cloud:9092 (id: 1234567890 rack: null) is unavailable or invalid. Worker B typically joins generation N+1 shortly after generation N (sometimes only about 3 seconds later). It is now clear what triggers the behavior. However:
although I understand that there may be temporary issues like these, and they are probably normal in the general case, why doesn't rebalancing fix the issue after all workers successfully join the next generation? Although more rebalances follow, they don't correctly distribute the tasks and keep the "duplicates" forever (until I restart the workers).
it appears that in some periods a rebalance happens only once every few hours, and in other periods it happens every 5 minutes (precise to the second); what could be the reason, and what is normal?
what could be the reason for the "Group coordinator is unavailable or invalid" errors, given that I use Confluent Cloud, and are there any configuration parameters that can be tweaked in Kafka Connect to make it more resilient to this error? I know there are session.timeout.ms and heartbeat.interval.ms, but the documentation is so minimal that it is not even clear what the practical impact of changing these parameters to smaller or bigger values would be.
EDIT 3:
I observed that the issue is not critical for sink tasks: although the same sink tasks get assigned to multiple workers, the corresponding consumers are assigned to different partitions as they normally should be, and everything works almost as it should - I simply get more tasks than I originally asked for. However, in the case of source tasks the behavior is breaking - the tasks run concurrently and compete for resources on the source side.
EDIT 4:
Meanwhile, I downgraded Kafka Connect to version 2.2 (Confluent Platform 5.2.3) - a pre-"Incremental Cooperative Rebalancing" version. It has been working fine for the last 2 days. So, I assume the behavior is related to the new rebalancing mechanism.
As mentioned in the comments, Jira KAFKA-9184 was created to address this problem, and it has been resolved.
The fix is available in versions 2.3.2 and above.
As such the answer is now: upgrading to a recent version should prevent this problem from occurring.
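To confirm which version a worker is actually running, you can query the root of its REST API, which returns the worker's version and commit (the URL below is a placeholder for one of your workers):

import scala.io.Source

val workerUrl = "http://localhost:8083/"  // placeholder: one of your Connect workers
val src = Source.fromURL(workerUrl)       // GET / returns JSON like {"version":"2.3.2","commit":"...","kafka_cluster_id":"..."}
try {
  println(src.mkString)
} finally {
  src.close()
}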

Spark Streaming scheduling best practices

We have a Spark Streaming job that runs every 30 mins and takes 15s to complete. What are the suggested best practices in this scenario? I am thinking I can schedule AWS Data Pipeline to run every 30 mins, so that EMR terminates after the 15 seconds and is recreated. Is that the recommended approach?
For a job that takes 15 seconds, running it on EMR is a waste of time and resources; you will likely wait a few minutes for an EMR cluster to bootstrap.
AWS Data Pipeline or AWS Batch only make sense if you have a long-running job.
First, make sure that you really need Spark, since from what you described it could be overkill.
Lambda with CloudWatch Events scheduling might be all you need for such a quick job, with no infrastructure to manage.
For streaming-related jobs, the key would be to avoid I/O in your case, as the job seems to take only 15 seconds. Push your messages to a queue (AWS SQS). Have a CloudWatch event (which implements a cron-like schedule, in your case every 30 mins) trigger an AWS Step Function that reads the messages from SQS and processes them, ideally in a Lambda.
So one option (serverless):
Streaming messages --> AWS SQS --> (every 30 mins a CloudWatch event triggers a Step Function) --> the Step Function triggers a Lambda that processes all messages in the queue
https://aws.amazon.com/getting-started/tutorials/scheduling-a-serverless-workflow-step-functions-cloudwatch-events/
Option 2:
Streaming messages --> AWS SQS --> process the messages using a Python application or a Java Spring application with a scheduled task that wakes up every 30 mins, reads the messages from the queue, and processes them in memory.
I have used Option 2 for solving analytical problems, although my analytical problem took 10 mins and was data-intensive. In addition, Option 2 requires monitoring the virtual machine (container) where the process is running, whereas Option 1 is serverless. Finally, it all comes down to the software stack you already have in place and the software needed to process the streaming data.
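For illustration, a minimal Scala sketch of Option 2 using the AWS SDK for Java v1 (the queue URL, schedule, and processing logic are placeholders):

import java.util.concurrent.{Executors, TimeUnit}

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.ReceiveMessageRequest

import scala.jdk.CollectionConverters._

object ScheduledSqsConsumer {
  private val queueUrl  = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  // placeholder
  private val sqs       = AmazonSQSClientBuilder.defaultClient()
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def main(args: Array[String]): Unit = {
    // Wake up every 30 minutes, drain the queue, and process the messages in memory.
    scheduler.scheduleAtFixedRate(() => drainQueue(), 0, 30, TimeUnit.MINUTES)
  }

  private def drainQueue(): Unit = {
    var done = false
    while (!done) {
      val request = new ReceiveMessageRequest(queueUrl)
        .withMaxNumberOfMessages(10)
        .withWaitTimeSeconds(5)
      val messages = sqs.receiveMessage(request).getMessages.asScala
      if (messages.isEmpty) done = true
      else messages.foreach { msg =>
        process(msg.getBody)                               // placeholder processing
        sqs.deleteMessage(queueUrl, msg.getReceiptHandle)  // acknowledge the message
      }
    }
  }

  private def process(body: String): Unit = println(s"processing: $body")
}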

Partitioning of Apache Spark

I have a cluster consisting of 1 master and 10 worker nodes. When I set the number of partitions to 3, does the master node use only 3 worker nodes, or all of them? It shows that all of them are used.
The question is not entirely clear, but the following points might help.
When you start the job with 10 executors, the Spark application master gets all the resources from YARN, so all the executors are already associated with the Spark job.
However, if the number of data partitions is less than the number of executors available, the remaining executors will sit idle. Hence, it is not a good idea to keep the number of partitions lower than the executor count.
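To illustrate with a small sketch (the dataset is a placeholder): with only 3 partitions, at most 3 tasks can run in parallel in a given stage, so on a 10-executor cluster the remaining executors idle during that stage. Repartitioning (the usual recommendation is 2-3 partitions per CPU core) lets every executor participate:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-example").getOrCreate()
val sc = spark.sparkContext

val data = sc.parallelize(1 to 1000000, numSlices = 3)  // only 3 partitions
println(data.getNumPartitions)                          // 3 -> at most 3 parallel tasks per stage

// Spread the data so a stage can keep all 10 executors busy.
val spread = data.repartition(30)
println(spread.getNumPartitions)                        // 30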