Pauses between jobs - Concourse

We run a Bosh deployment on GCE and it is generally very stable. We scale workers from 2 at off-peak to 6 during the day; the workers are 4-core / 3.6 GB RAM / 100 GB SSD machines.
Performance is generally good, but on occasion there are long pauses (2-3 minutes) between jobs, particularly when a lot of jobs are in progress. The spinner in the UI stops moving for the previously completed job, but the next one does not start.
I presume this is related to provisioning containers for the next job, but is there some way we can speed up that operation? Is there some resource we can increase or tune that would reduce that lag?
Thanks.

Related

Druid Cluster going into Restricted Mode

We have a Druid cluster with the following specs:
3X Coordinators & Overlords - m5.2xlarge
6X Middle Managers(Ingest nodes with 5 slots) - m5d.4xlarge
8X Historical - i3.4xlarge
2X Router & Broker - m5.2xlarge
The cluster often goes into restricted mode, and all calls to the cluster get rejected with a 502 error.
Even with 30 available slots for the index_parallel tasks, the cluster only runs 10 at a time and the other tasks go into a waiting state.
Loader task submission time has been increasing monotonically, from 1 s, 2 s, ... 6 s, ... up to 10 s (we submit a job to load the data sitting in S3); after recycling the cluster the submission time decreases, then increases again over time.
We submit around 100 jobs per minute, but we need to scale to 300 to catch up with our current incoming load.
Could someone help us with the following questions?
How should we tune the specs of the cluster?
What parameters should be optimized to run the maximum number of tasks in parallel without increasing the load on the master nodes?
Why is the loader task submission time increasing, and what parameters should be monitored here?
At 100 jobs per minute, the overlord is probably being overloaded.
The overlord initiates a job by communicating with the middle managers across the cluster. It defines the tasks that each middle manager will need to complete and monitors task progress until completion. The startup of each job has some overhead, so that many jobs will likely keep the overlord busy and prevent it from processing the other jobs you are submitting. This might explain why the job submission time increases over time. You could increase the resources on the overlord, but it sounds like there may be a better way to ingest the data.
The recommendation would be to use far fewer jobs and have each job do more work.
If the flow of data is as continuous as you describe, perhaps a Kafka queue would be the best target, followed by a Druid Kafka ingestion job, which is fully scalable.
If you need to do batch ingestion, a single index_parallel job that reads many files would likely be much more efficient than many small jobs; a rough sketch of how such a job might be submitted follows.
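As an illustration only, the sketch below submits one consolidated index_parallel task that reads an entire S3 prefix instead of one task per file. The Overlord URL, datasource name, schema, and S3 prefix are placeholders, and the exact spec fields should be checked against your Druid version.

```python
# Sketch: submit a single index_parallel task that reads many S3 files at once,
# instead of one task per file. All names and URLs below are placeholders.
import requests  # assumes the 'requests' package is available

OVERLORD_URL = "http://overlord:8090/druid/indexer/v1/task"  # placeholder host/port

spec = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "events",  # placeholder datasource
            "timestampSpec": {"column": "ts", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user_id", "country"]},
            "granularitySpec": {
                "segmentGranularity": "hour",
                "queryGranularity": "none",
            },
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
                "type": "s3",
                # One prefix covering many files, rather than one task per file.
                "prefixes": ["s3://my-bucket/events/2020-07-15/"],
            },
            "inputFormat": {"type": "json"},
        },
        "tuningConfig": {
            "type": "index_parallel",
            "maxNumConcurrentSubTasks": 20,  # spread sub-tasks over the available slots
        },
    },
}

resp = requests.post(OVERLORD_URL, json=spec)
resp.raise_for_status()
print("Submitted task:", resp.json())  # the Overlord responds with the task id
```

With a spec like this, one submission fans out into sub-tasks on the middle managers, so the overlord only has to track a handful of parent tasks rather than hundreds of tiny ones.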
Also consider that each task in an ingestion job creates a set of segments. By running a lot of really small jobs, you create a lot of very small segments, which is not ideal for query performance. Here is some info on how to think about segment size optimization which I think might help.

Slow reads on MongoDB from Spark - weird task allocation

I have a MongoDB 4.2 cluster with 15 shards; the database stores a sharded collection of 6GB (i.e., about 400MB per machine).
I'm trying to read the whole collection from Apache Spark, which runs on the same machines. The Spark application runs with --num-executors 8 and --executor-cores 6; the connection is made through the spark-connector by configuring the MongoShardedPartitioner.
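For reference, a read configured that way might look roughly like the following PySpark sketch; the URI, database, collection, and shard key are placeholders, and the option names may differ slightly between connector versions.

```python
# Sketch of a sharded-collection read through the MongoDB Spark connector.
# The URI, database/collection, and shard key below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mongo-full-scan")
    # The connector jar must be on the classpath, e.g. via --packages.
    .config("spark.mongodb.input.uri", "mongodb://mongos-host:27017/mydb.mycoll")
    .getOrCreate()
)

df = (
    spark.read.format("mongo")
    # Partition the read based on the collection's shard chunks.
    .option("partitioner", "MongoShardedPartitioner")
    .option("partitionerOptions.shardkey", "_id")  # placeholder shard key
    .load()
)

print(df.rdd.getNumPartitions())  # how many read tasks Spark will schedule
df.count()
```

Checking getNumPartitions is a quick way to see whether the partitioner is actually producing one task per chunk, which in turn determines how many executors can read in parallel.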
Besides the reading being very slow (about 1.5 minutes; but, as far as I understand, full scans are generally bad on MongoDB), I'm experiencing this weird behavior in Spark's task allocation:
The issues are the following:
For some reason, only one of the executors starts reading from the database, while all the others wait 25 seconds before they begin reading. The red bars correspond to "Task Deserialization Time", but my understanding is that they are simply idle (if there are concurrent stages, these executors work on something else and then come back to this stage only after the 25 seconds).
For some other reason, after some time the concurrent allocation of tasks is suspended and then resumes all at once (at about 55 seconds from the start of the job); you can see it in the middle of the picture, where a whole bunch of tasks starts at the same time.
Overall, the full scan could be completed in far less time if tasks were allocated properly.
What is the reason for these behaviors and who is responsible (is it Spark, the spark-connector, or MongoDB)? Is there some configuration parameter that could cause these problems?

CPU Utilization while using locust

We are planning to use Locust for performance testing. I have started Locust in distributed mode on Kubernetes, with 800 users for a duration of 5 minutes and a hatch rate of 100. After a couple of minutes, I can see the warning below in the worker log.
[2020-07-15 07:03:15,990] pipeline1-locust-worker-1-gp824/WARNING/root: Loadgen CPU usage above 90%! This may constrain your throughput and may even give inconsistent response time measurements!
I am unable to figure out what the 90% refers to here, since I have not specified any resource limits. Is it 90% of node capacity? That seems unlikely, since we use beefy nodes with 16 vCPUs and 128 GB of memory. Can anyone give any insight?
It is 90% of one core (which is all a single Locust process can utilize because of the Python GIL), measured using https://psutil.readthedocs.io/en/latest/#psutil.Process.cpu_percent.
If you have 16 vCPUs, you need 16 worker processes to utilize the whole node.
I guess we should clarify the message.
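To make that measurement concrete, here is a small standalone illustration (not Locust code) of the psutil call linked above; it shows that cpu_percent is reported per process, with 100% corresponding to one fully busy core.

```python
# Illustration of the per-process CPU measurement referenced above.
# psutil.Process.cpu_percent() reports usage of *this* process, where 100%
# corresponds to one fully busy core.
import time
import psutil

proc = psutil.Process()          # the current (single) process
proc.cpu_percent(interval=None)  # prime the measurement

# Busy-loop for a second to saturate one core.
end = time.time() + 1.0
while time.time() < end:
    pass

print(f"CPU usage of this process: {proc.cpu_percent(interval=None):.1f}%")
```

Run on an otherwise idle 16-vCPU node, this still prints close to 100%, which is why a single Locust worker can hit the 90% warning long before the node as a whole is busy.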

Spark unfinished stages. Spark app is idle

I've faced a situation, when running a cluster on AWS EMR, where one stage remained "running" while the execution plan continued to progress. Look at the screenshot from the Spark UI (job 4 has running tasks, yet job 7 is in progress). My question is how to debug such a situation; are there any hints I can find in the DAG?
My thought is that it could be some memory issue, because the data is heavy and there are a lot of spills to disk. However, I am wondering why Spark stays idle for an hour. Is it related to driver memory issues?
UPD1:
Based on Ravi's requests:
(1) Check the time since they are running and the GC time as well. If GC time is >20% of the execution time, it means that you are throttled by memory.
No, it is not an issue. (A sketch for checking this via the Spark REST API follows after this list.)
(2) Check the number of active tasks on the same page.
That's really weird: there are executors with more active tasks than their core capacity (3x more for some executors), yet I do not see any executor failures.
(3) See whether all executors are spending equal time running the job.
Not an issue.
(4) What you showed above is the job; what about stages etc.? Are they also paused forever?
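As a concrete way to check item (1) without clicking through the UI, something like the sketch below could pull GC time as a fraction of task time for each executor from the Spark REST API. The UI host, port, and application id are placeholders, and the field names should be verified against your Spark version.

```python
# Sketch: GC time as a fraction of total task time per executor, via the
# Spark UI REST API. Host, port, and application id are placeholders.
import requests

SPARK_UI = "http://driver-host:4040"        # or the EMR Spark history server
APP_ID = "application_1234567890_0001"      # placeholder application id

executors = requests.get(f"{SPARK_UI}/api/v1/applications/{APP_ID}/executors").json()

for ex in executors:
    duration = ex.get("totalDuration", 0)   # total task time, in ms
    gc_time = ex.get("totalGCTime", 0)      # total JVM GC time, in ms
    if duration:
        ratio = gc_time / duration
        flag = "  <-- possible memory pressure" if ratio > 0.2 else ""
        print(f"executor {ex['id']}: GC is {ratio:.0%} of task time{flag}")
```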

Spark over YARN: some tasks are extremely slow

I am using a cluster of 12 virtual machines, each of which has 16 GB of memory and 6 cores (except the master node, which has only 2 cores). Each worker node was assigned 12 GB of memory and 4 cores.
When I submit a Spark application to YARN, I set the number of executors to 10 (1 node acts as the master/manager, 1 as the application master), and to maximize the parallelism of my application, most of my RDDs have 40 partitions, the same as the total number of executor cores.
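For context, that resource layout corresponds roughly to the hedged sketch below; the values are taken from the description above, the executor memory is reduced slightly to leave room for overhead, and the same settings could equally be passed to spark-submit.

```python
# Sketch of the resource configuration described above, expressed as Spark
# properties. Values mirror the description; "10g" leaves headroom for overhead.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("yarn-app")
    .master("yarn")
    .config("spark.executor.instances", "10")    # 10 executors, one per worker node
    .config("spark.executor.cores", "4")         # 4 cores per worker node
    .config("spark.executor.memory", "10g")      # out of the 12 GB assigned per node
    .config("spark.default.parallelism", "40")   # match 40 partitions to 40 total cores
    .getOrCreate()
)
```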
The following is the problem I encountered: in some random stages, some tasks take far longer to process than others, which results in poor parallelism. As we can see in the first picture, executor 9 spent over 30 s on its tasks while other tasks finished within 1 s. Furthermore, the reason for the extra time also varies: sometimes it is just computation, but sometimes it is scheduler delay, deserialization, or shuffle read. As we can see, the reason in the second picture is different from the first.
My guess is that once a task gets assigned to a specific slot, there are not enough resources on the corresponding machine, so the JVM ends up waiting for CPUs. Is my guess correct? And how should I configure my cluster to avoid this situation?
[First picture: tasks dominated by computing time]
[Second picture: tasks dominated by scheduler delay & deserialization]
To get a specific answer you would need to share more about what you're doing, but most likely the partitions you get in one or more of your stages are unbalanced, i.e. some are much bigger than others. The result is a slowdown, since each partition is handled by a single task. One way to solve it is to increase the number of partitions or change the partitioning logic (a rough sketch follows below).
When a big task finishes, shipping its data to other tasks takes longer as well, which is why other tasks may also take long.
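If skew is the culprit, one generic way to probe and mitigate it is sketched below in PySpark; the toy RDD and the salted key are made up for illustration.

```python
# Sketch: detect and mitigate unbalanced partitions. The toy RDD and the key
# being salted are placeholders for whatever the real job operates on.
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skew-check").getOrCreate()
sc = spark.sparkContext

# Toy skewed data: one "hot" key with far more records than the others.
rdd = sc.parallelize([("hot", 1)] * 100_000 + [("cold", 1)] * 100)

# 1) Inspect how many records land in each partition.
print("records per partition:", rdd.glom().map(len).collect())

# 2a) Simple fix: more partitions spread the work across more tasks.
balanced = rdd.repartition(40)

# 2b) For a skewed key, add a random "salt" so one key spreads over many tasks,
#     pre-aggregate per (key, salt), then aggregate again without the salt.
SALT_BUCKETS = 8
salted = rdd.map(lambda kv: ((kv[0], random.randrange(SALT_BUCKETS)), kv[1]))
partial = salted.reduceByKey(lambda a, b: a + b)
result = partial.map(lambda kv: (kv[0][0], kv[1])).reduceByKey(lambda a, b: a + b)
print(result.collect())  # [('hot', 100000), ('cold', 100)]
```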