Zombie giant unkillable task blocks Druid at restart - druid

I'm running Apache Druid 0.17 deploying with nohup ./bin/start-nano-quickstart > mylog.log. As the deep storage I am using s3 and I have parquet extension enabled and all work fine. I could ingest with several small spark partitioned parquet datasources from s3 correctly. All the remaining configurations are untouched.
As I tried loading a giant datasource to test the performance and resource usage the task died after a couple of hours because of OutOfMemory.(It was expected)
2020-02-07T17:32:20,519 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver - New segment[arc_2016-09-29T12:00:00.000Z_2016-09-29T13:00:00.000Z_2020-02-07T17:22:45.965Z] for sequenceName[index_parallel_arc_chgindko_2020-02-07T14:59:32.337Z].
Terminating due to java.lang.OutOfMemoryError: GC overhead limit exceeded
Now every time I restart Druid, it starts that giant task and it is impossible to kill it. Even when the task apparently dies or turns in waiting status the CPU usage is about 140% and I cannot submit new tasks to Druid. I tried to access the Derby database manually to find the task and remove it but I was not successful and this solution is really nasty. I know that I can change the database in the configuration so the next time I will have a fresh Druid but it is not a good solution as I will miss all other datasources. How can I get ready of this long running zombie task?

Related

Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

Thank you for reading this SO question, it may seem long, but I'll try to get as most information as possible in it to help to get the answer.
Summary
We are currently experiencing a scheduling issue with our Flink cluster.
The symptoms are that some/most/all (it depends, the symptoms are not always the same) of our tasks are shown as SCHEDULED but fail after a timeout. The jobs are then shown as RUNNING.
The failing exception is the following one:
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout
After analysis, we assume (we cannot prove it, as there are not that much logs for that part of the code) that the failure is due to a deadlock/race condition that is happening when several jobs are being submitted at the same time to the Flink cluster, even though we have enough slots available in the cluster.
We actually have the error with 52 available task slots, and have 12 jobs that are not scheduled.
Additional information
Flink version: 1.13.1 commit a7f3192
Flink cluster in session mode
2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, limits sets on memory to 4Gb)
50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No limits set).
Our Flink cluster is shut down every night, and restarted every morning. The error seems to occur when a lot of jobs needs to be scheduled. The jobs are configured to restore their state, and we do not see any issues for jobs that are being scheduled and run correctly, it seems to really be related to a scheduling issue.
Questions
May it be that the issue described in FLINK-23409 is actually the same, but occurs only when there is a race condition when scheduling several jobs?
Is there any way to increase logging in the scheduler to debug this issue?
Is it a known issue? If yes, is there any workaround/solution to resolve it?
P.S: a while ago, I asked more or less the same question on the ML, but dropped it, I'm sorry if this is considered as cross-asking, it's not intended t. We are just opening a new thread as we have more information and the issue re-occur.

Airflow Memory Error: Task exited with return code -9

According to both of these Link1 and Link2, my Airflow DAG run is returning the error INFO - Task exited with return code -9 due to an out-of-memory issue. My DAG run has 10 tasks/operators, and each task simply:
makes a query to get one of my BigQuery tables, and
writes the results to a collection in my Mongo database.
The size of the 10 BigQuery tables range from 1MB to 400MB, and the total size of all 10 tables is ~1GB. My docker container has default 2GB of memory and I've increased this to 4GB, however I am still receiving this error from a few of the tasks. I am confused about this, as 4GB should be plenty of memory for this. I am also concerned because, in the future, these tables may become larger (a single table query could be 1-2GB), and I'd like to avoid these return code -9 errors at that time.
I'm not quite sure how to handle this issue, since the point of the DAG is to transfer data from BigQuery to Mongo daily, and the queries / data in-memory for the DAG's tasks is necessarily fairly large then, based on the size of the tables.
As you said, the error message you get corresponds to an out of memory issue.
Referring to the official documentation:
DAG execution is RAM limited. Each task execution starts with two
Airflow processes: task execution and monitoring. Currently, each node
can take up to 6 concurrent tasks. More memory can be consumed,
depending on the size of the DAG.
High memory pressure in any of the GKE nodes will lead the Kubernetes scheduler to evict pods from nodes in an attempt to relieve that pressure. While many different Airflow components are running within GKE, most don't tend to use much memory, so the case that happens most frequently is that a user uploaded a resource-intensive DAG. The Airflow workers run those DAGs, run out of resources, and then get evicted.
You can check it with following steps:
In the Cloud Console, navigate to Kubernetes Engine -> Workloads
Click on airflow-worker, and look under Managed pods
If there are pods that show Evicted, click each evicted pod and look for the The node was low on resource: memory message at the top of the window.
What are the possible ways to fix OOM issue?
Create a new Cloud Composer environment with a larger machine type than the current machine type.
Ensure that the tasks in the DAG are idempotent, which means that the result of running the same DAG run multiple times should be the same as the result of running it once.
Configure task retries by setting the number of retries on the task - this way when your task gets -9'ed by the scheduler it will go to up_for_retry instead of failed
Additionally you can check the behavior of CPU:
In the Cloud Console, navigate to Kubernetes Engine -> Clusters
Locate Node Pools at the bottom of the page, and expand the default-pool section
Click the link listed under Instance groups
Switch to the Monitoring tab, where you can find CPU utilization
Ideally, the GCE instances shouldn't be running over 70% CPU at all times, or the Composer environment may become unstable during resource usage.
I hope you find the above pieces of information useful.
I am going to chunk the data so that less is loaded into any 1 task at any given time. I'm not sure yet whether I will need to use GCS/S3 for intermediary storage.

Cassandra pod is taking more bootstrap time than expected

I am running Cassandra as a Kubernetes pod . One pod is having one Cassandra container.we are running Cassandra of version 3.11.4 and auto_bootstrap set to true.I am having 5 node in production and it holds 20GB data.
Because of some maintenance activity and if I restart any Cassandra pod it is taking 30 min for bootstrap then it is coming UP and Normal state.In production 30 min is a huge time.
How can I reduce the bootup time for cassandra pod ?
Thank you !!
If you're restarting the existing node, and data is still there, then it's not a bootstrap of the node - it's just restart.
One of the potential problems that you have is that you're not draining the node before restart, and all commit logs need to be replayed on the start, and this can take a lot of time if you have a lot of data in commit log (you can just check system.log on what Cassandra is doing at that time). So the solution could be is to execute nodetool drain before stopping the node.
If the node is restarted before crash or something like, you can thing in the direction of the regular flush of the data from memtable, for example via nodetool flush, or configuring tables with periodic flush via memtable_flush_period_in_ms option on the most busy tables. But be careful with that approach as it may create a lot of small SSTables, and this will add more load on compaction process.

Intermittant file not found using Google Cloud Storage from Dataproc - flushing writes?

I have a series of dataproc jobs that run to import some data received each morning. The process creates a cluster, runs four jobs in sequence, then shuts down the cluster. The input file is read from Google Cloud Storage, and the intermediate results are also saved in Avro form in GCS with the final output going to Cloud SQL.
Fairly often the jobs will fail trying to read the Avro written by the previous job. It appears that GCS hasn't "caught up" and the results from the previous job haven't been fully written. I was getting failures trying to read files that appeared to be from the previous day's run and partway through those files would disappear and be replaced by the new ones. I have changed my script that runs the files to clear the work area before starting the jobs, but still have problems where sometimes it starts reading and all the parts haven't been written fully.
I could change the code to simply store the intermediate files on the cluster, tho I like having them available outside for diagnosing other problems. I could also just write to both locations with the cluster for working and GCS for diagnostics.
But assuming this is some kind of sync issue, is there a way to force GCS to flush writes / be fully synced between jobs? Or is there some check I can do to make sure everything has been written before starting the next job in my chain?
EDIT: To answer the comment below, the sequence of jobs all run on the same cluster. The cluster is started, each job run in turn on that cluster, and then the cluster is shut down.
For now, I have worked around this by having the jobs write to HDFS on the cluster in addition to GCS, and the subsequent jobs reading from the cluster. The GCS output is now strictly for diagnostics in case of a problem. But even tho my immediate problem is (I believe) fixed I still would like to know what's happening and why GCS seems out of sync for a bit.

Spark: long delay between jobs

So we are running spark job that extract data and do some expansive data conversion and writes to several different files. Everything is running fine but I'm getting random expansive delays between resource intensive job finish and next job start.
In below picture, we can see that job that was scheduled at 17:22:02 took 15 min to finish, which means I'm expecting next job to be scheduled around 17:37:02. However, next job was scheduled at 22:05:59, which is +4 hours after job success.
When I dig into next job's spark UI it show <1 sec scheduler delay. So I'm confused to where does this 4 hours long delay is coming from.
(Spark 1.6.1 with Hadoop 2)
Updated:
I can confirm that David's answer below is spot on about how IO ops are handled in Spark is bit unexpected. (It makes sense to that file write essentially does "collect" behind the curtain before it writes considering ordering and/or other operations.) But I'm bit discomforted by the fact that I/O time is not included in job execution time. I guess you can see it in "SQL" tab of spark UI as queries are still running even with all jobs being successful but you cannot dive into it at all.
I'm sure there are more ways to improve but below two methods were sufficient for me:
reduce file count
set parquet.enable.summary-metadata to false
I/O operations often come with significant overhead that will occur on the master node. Since this work isn't parallelized, it can take quite a bit of time. And since it is not a job, it does not show up in the resource manager UI. Some examples of I/O tasks that are done by the master node
Spark will write to temporary s3 directories, then move the files using the master node
Reading of text files often occur on the master node
When writing parquet files, the master node will scan all the files post-write to check the schema
These issues can be solved by tweaking yarn settings or redesigning your code. If you provide some source code, I might be able to pinpoint your issue.
Discussion of writing I/O Overhead with Parquet and s3
Discussion of reading I/O Overhead "s3 is not a filesystem"
Problem:
I faced similar issue when writing parquet data on s3 with pyspark on EMR 5.5.1. All workers would finish writing data in _temporary bucket in output folder & Spark UI would show that all tasks have completed. But Hadoop Resource Manager UI would not release resources for the application neither mark it as complete. On checking s3 bucket, it seemed like spark driver was moving the files 1 by 1 from _temporary directory to output bucket which was extremely slow & all the cluster was idle except Driver node.
Solution:
The solution that worked for me was to use committer class by AWS ( EmrOptimizedSparkSqlParquetOutputCommitter ) by setting the configuration property spark.sql.parquet.fs.optimized.committer.optimization-enabled to true.
e.g.:
spark-submit ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
or
pyspark ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
Note that this property is available in EMR 5.19 or higher.
Result:
After running the spark job on EMR 5.20.0 using above solution, it did not create any _temporary directory & all the files were directly written to the output bucket, hence job finished very quickly.
Fore more details:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html