dataflow allocated too much disk space - google-cloud-storage

I am running a Dataflow job that is scaling up from 40 workers. The job is now using 9.77 TB of persistent disk storage and hit the following error:
Autoscaling: Unable to reach resize target in zone us-central1-c. QUOTA_EXCEEDED:
Quota 'DISKS_TOTAL_GB' exceeded. Limit: 10240.0 in region us-central1.
The job shouldn't emit that much data as its result, so I am wondering what role the allocated persistent disks (PDs) play in this case, and how the size is estimated for each worker.
Here's the dataflow job link: https://console.cloud.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2018-06-24_11_41_19-5444495474275650044?project=wikidetox&angularJsUrl=%2Fdataflow%2FjobsDetail%2Flocations%2Fus-central1%2Fjobs%2F2018-06-24_11_41_19-5444495474275650044%3Fproject%3Dwikidetox&authuser=1
Thank you,
Yiqing

The DISKS_TOTAL_GB quota is for hard drives allocated to your job, not for data emitted by it.
Is this a streaming job? I believe streaming jobs use pretty large persistent disks to store data about job execution. You can increase the DISKS_TOTAL_GB quota for that project/region, and you should be fine.
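For batch jobs, the total persistent disk footprint is roughly the number of workers times the per-worker disk size, and both can be capped through pipeline options. A minimal sketch with the Apache Beam Python SDK, using placeholder project and bucket names (the 20-worker / 100 GB values are illustrative, not a recommendation):

```python
# Sketch: cap worker count and per-worker disk so that
# workers * disk_size_gb stays well under the 10,240 GB DISKS_TOTAL_GB quota.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                    # placeholder project id
    region="us-central1",
    temp_location="gs://my-bucket/temp",     # placeholder bucket
    max_num_workers=20,                      # 20 workers * 100 GB = 2,000 GB of PD
    disk_size_gb=100,                        # per-worker persistent disk size
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
     | "Filter" >> beam.Filter(lambda line: line.strip())
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/part"))
```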

Related

How to estimate RAM and CPU per Kubernetes Pod for a Spring Batch processing job?

I'm trying to estimate hardware resources for a Kubernetes Cluster to be able to handle the following scenarios:
On a daily basis I need to read 46.3 million XML messages of roughly 10 KB each from a queue and then insert them into a Spark instance and a Sybase DB instance. I need to come up with an estimate of how many pods I will need to process this amount of data, and how much RAM and how many vCPUs will be required per pod, in order to determine the characteristics of the cluster nodes. The reason behind all this is that we have some budget restrictions and we need an idea of the sizing before starting the corresponding development.
The second scenario is the same as the one already described, but 18.65 times bigger, i.e. 833.33 million XML messages per day. This is expected to be the case within a couple of years.
So far we plan to use Spring Batch with partitioned steps. I need orientation on how to determine the ideal Spring Batch configuration, the required RAM and CPU per pod, and the number of pods.
I will greatly appreciate any comments from your side.
Thanks in advance.
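For what it's worth, a back-of-the-envelope calculation of the raw throughput the two scenarios imply, using only the numbers above (no Spring Batch or Kubernetes assumptions). Per-pod RAM/CPU would then follow from measuring how many messages per second a single worker pod can actually parse and write to Spark and Sybase:

```python
# Rough sizing arithmetic from the figures stated in the question.
MSG_SIZE_KB = 10

def scenario(msgs_per_day: float) -> None:
    msgs_per_sec = msgs_per_day / 86_400            # seconds in a day
    gb_per_day = msgs_per_day * MSG_SIZE_KB / 1e6   # KB -> GB (decimal units)
    print(f"{msgs_per_day:,.0f} msgs/day ≈ {msgs_per_sec:,.0f} msgs/s ≈ {gb_per_day:,.0f} GB/day")

scenario(46_300_000)     # current:  ≈ 536 msgs/s, ≈ 463 GB/day
scenario(833_330_000)    # future:   ≈ 9,645 msgs/s, ≈ 8,333 GB/day
```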

Druid Cluster going into Restricted Mode

We have a Druid Cluster with the following specs
3X Coordinators & Overlords - m5.2xlarge
6X Middle Managers (ingest nodes with 5 slots) - m5d.4xlarge
8X Historical - i3.4xlarge
2X Router & Broker - m5.2xlarge
Cluster often goes into Restricted mode
All calls to the cluster get rejected with a 502 error.
Even with 30 available slots for index_parallel tasks, the cluster only runs 10 at a time and the other tasks go into a waiting state.
Loader task submission time has been increasing monotonically from 1s, 2s, ..., 6s, ..., 10s (we submit a job to load the data from S3); after recycling the cluster, the submission time decreases and then increases again over time.
We submit around 100 jobs per minute, but we need to scale to 300 to catch up with our current incoming load.
Could someone help us with the following questions?
How should we tune the specs of the cluster?
What parameters should be optimized to run the maximum number of tasks in parallel without increasing the load on the master nodes?
Why is the loader task submission time increasing, and what parameters should be monitored here?
At 100 jobs per minute, the overlord is probably being overloaded.
The overlord initiates a job by communicating with the middle managers across the cluster. It defines the tasks that each middle manager will need to complete and monitors their progress until completion. The startup of each job has some overhead, so submitting that many jobs will likely keep the overlord busy and prevent it from processing the other jobs you are requesting. This might explain why the job submission time increases over time. You could increase the resources on the overlord, but it sounds like there may be a better way to ingest the data.
The recommendation would be to use far fewer jobs and have each job do more work.
If the flow of data is as continuous as you describe, perhaps a Kafka topic would be the best target, followed by a Druid Kafka ingestion job, which is fully scalable.
If you need to do batch ingestion, a single index_parallel job that reads many files would be much more efficient than many small jobs; see the sketch below.
Also consider that each task in an ingestion job creates a set of segments. By running a lot of really small jobs, you create a lot of very small segments, which is not ideal for query performance. Here is some info around how to think about segment size optimization which I think might help.
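To make the "one big index_parallel job" idea concrete, a hedged sketch of submitting a single parallel batch ingestion task to the overlord with Python's requests library. The bucket, datasource, columns and concurrency value are placeholders; check the spec fields against the native batch ingestion docs for your Druid version.

```python
# One index_parallel job that sweeps many S3 files, instead of
# submitting hundreds of tiny jobs to the overlord.
import requests

spec = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "events",                                   # placeholder datasource
            "timestampSpec": {"column": "ts", "format": "iso"},       # placeholder columns
            "dimensionsSpec": {"dimensions": ["user", "action"]},
            "granularitySpec": {"segmentGranularity": "hour", "queryGranularity": "minute"},
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {"type": "s3", "prefixes": ["s3://my-bucket/events/dt=2018-06-24/"]},
            "inputFormat": {"type": "json"},
        },
        "tuningConfig": {
            "type": "index_parallel",
            "maxNumConcurrentSubTasks": 10,                           # fan out inside ONE job
            "partitionsSpec": {"type": "dynamic", "maxRowsPerSegment": 5_000_000},
        },
    },
}

# Submit once to the overlord instead of ~100 times per minute.
resp = requests.post("http://overlord:8090/druid/indexer/v1/task", json=spec)
resp.raise_for_status()
print(resp.json())  # e.g. {'task': '<task id>'}
```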

GCP CloudSQL (PostgreSQL) Crash During Stored Procedure Execution and Failover

I have a stored procedure in GCP CloudSQL (PostgreSQL v9.0.23). It works fine in lower environments, but when it runs in production (with significantly more volume), it crashes the database itself, which results in a failover.
When we checked the metrics, we found that memory usage is above 90% just before it crashes (15 GB out of the 16 GB instance memory). The read/write rates are also very high, more than 1000 ops per second.
The stored procedure runs some SELECT and INSERT statements. Any suggestions to improve this situation would help.
Thanks in advance.
Since you mentioned that the Cloud SQL instance runs smoothly with a small workload but crashes in the production environment where the workload is much more intensive, it seems the issue is with the instance size. So I would suggest increasing the instance size to match your needs.
You also mentioned that memory usage is 15 GB out of 16 GB, which amounts to nearly 94%. As per this document, your Cloud SQL instance will not be covered by the Cloud SQL SLA if memory usage is over 90% for more than 6 hours. So I would suggest keeping memory usage below 90%, and likewise keeping CPU utilization within the limits mentioned in this document. To know when your instance reaches any threshold, I would suggest setting up a monitoring alert for those metrics, as mentioned here.
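As a hedged illustration of the alerting suggestion, a sketch using the google-cloud-monitoring Python client to alert when Cloud SQL memory utilization stays above 90% for five minutes. "my-project" is a placeholder, and in practice you would also attach notification channels to the policy:

```python
# Create an alert policy on the Cloud SQL memory utilization metric.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

client = monitoring_v3.AlertPolicyServiceClient()
project = "projects/my-project"  # placeholder project id

policy = monitoring_v3.AlertPolicy(
    display_name="Cloud SQL memory > 90%",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="memory utilization above 0.9 for 5 min",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=(
                    'resource.type = "cloudsql_database" AND '
                    'metric.type = "cloudsql.googleapis.com/database/memory/utilization"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=0.9,
                duration=duration_pb2.Duration(seconds=300),
            ),
        )
    ],
)

created = client.create_alert_policy(name=project, alert_policy=policy)
print(created.name)
```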
If increasing your instance size doesn't help, I would recommend creating a support ticket with Google Cloud Support so that they can investigate in detail.

Apache Spark Auto Scaling properties - Add Worker on the Fly

Take a Spark program that, let's say,
reads 10 GB of data into memory, does a filter and a map, and then saves the result to another storage system.
Can I auto-scale the cluster based on the load, for instance adding more worker nodes to the program, if it eventually needs to handle 1 TB instead of 10 GB?
If this is possible, how can it be done?
It is possible to some extent, using dynamic allocation, but the behavior depends on job latency, not on the direct usage of a particular resource.
You have to remember that, in general, Spark can handle data larger than memory just fine; memory problems are usually caused by user mistakes or vicious garbage collection cycles. Neither of these can easily be solved by "adding more resources".
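As a minimal illustration of dynamic allocation (not the asker's exact job), a PySpark sketch with illustrative executor bounds; it assumes an external shuffle service, or shuffle tracking on Spark 3.x, and placeholder paths and column names:

```python
# Dynamic allocation: Spark grows/shrinks executors between the configured bounds.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("filter-map-save")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "50")
    # One of the following is needed so executors can be released safely:
    .config("spark.shuffle.service.enabled", "true")                       # external shuffle service
    # .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")   # Spark 3.x alternative
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/input/")          # placeholder input
(df.filter(df["value"] > 0)                                 # the "filter"
   .select("id", "value")                                   # the "map"
   .write.parquet("s3a://my-bucket/output/"))               # save to other storage
```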
If you are using any of the cloud platforms to create the cluster, you can use their auto-scaling functionality, which will scale the cluster horizontally (the number of nodes will change).
Agree with #user8889543 - you can read much more data than you have memory.
As for adding more resources on the fly, it depends on your cluster type.
I use standalone mode, and I have code that adds machines on the fly; they attach to the master automatically, and then my cluster has more cores and memory.
If you only have one job/program in the cluster, then it is pretty simple. Just set spark.cores.max to a very high number and the job will always take all the cores of the cluster (see the sketch below).
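A minimal sketch of that standalone-mode setup, with an illustrative master URL and values:

```python
# Grab (up to) all cores in the standalone cluster by setting spark.cores.max very high.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://master-host:7077")     # placeholder standalone master URL
    .appName("take-the-whole-cluster")
    .config("spark.cores.max", "10000")     # effectively "all cores currently in the cluster"
    .config("spark.executor.memory", "8g")  # illustrative executor memory
    .getOrCreate()
)
```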
If you have several jobs in the cluster, it becomes complicated, as mentioned in #user8889543's answer.

Redshift Workload Management: Memory

I want to set up Redshift Workload Management to handle:
50% ETL
30% Tableau reports
20% ad hoc queries
I'm wondering what happens to unallocated memory, as my ETL only runs at night.
What happens to the 50% of memory allocated to my ETL queue during the daytime, when that queue is idle?
I read the Redshift documentation, and it only says
Any unallocated memory is managed by Amazon Redshift
which is not very descriptive.
Workload Management (WLM) is a way of dividing available memory amongst queues.
If you allocate 50% to the ETL queue and you are not running any ETL jobs, then you have wasted 50% of the cluster's memory for that period of time.
A better approach might be to create separate queues based upon the workload. For example:
One queue for small, quick queries (e.g. used by real-time dashboards)
Another queue for larger queries
Amazon Redshift is getting 'smarter' at figuring out how to prioritize queries but you can certainly tweak it with thoughtful use of WLM.
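For reference, a hedged sketch of what a manual WLM configuration along those lines might look like when applied with boto3; the queue split, concurrency values and parameter group name are illustrative. Queries are then routed to a queue by setting query_group (or user_group) for the session, e.g. SET query_group TO 'dashboard';

```python
# Apply a manual WLM configuration to a Redshift parameter group.
import json
import boto3

wlm_config = [
    {"query_group": ["dashboard"], "query_concurrency": 10, "memory_percent_to_use": 30},  # small, quick queries
    {"query_group": ["batch"],     "query_concurrency": 3,  "memory_percent_to_use": 50},  # larger queries / ETL
    {"query_concurrency": 5,       "memory_percent_to_use": 20},                           # default queue (ad hoc)
]

redshift = boto3.client("redshift")
redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-wlm-parameter-group",   # placeholder parameter group
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(wlm_config),
        "ApplyType": "dynamic",
    }],
)
```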