I want to set up Redshift Workload Management to handle:
- 50% ETL
- 30% Tableau reports
- 20% ad hoc queries
I'm wondering what happens to unallocated memory, since my ETL only runs at night. What happens to the 50% of memory allocated to my ETL queue during the day, when that queue is idle?
I read the Redshift documentation, but it only says
"Any unallocated memory is managed by Amazon Redshift"
which is not very descriptive.
Workload Management (WLM) is a way of dividing available memory amongst queues.
If you allocate 50% to the ETL queue and you are not running any ETL jobs, then you have wasted 50% of the cluster's memory for that period of time.
A better approach might be to create separate queues based upon the workload. For example:
- One queue for small, quick queries (e.g. those used by real-time dashboards)
- Another queue for larger queries
Amazon Redshift is getting 'smarter' at figuring out how to prioritize queries but you can certainly tweak it with thoughtful use of WLM.
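As a rough sketch of how such a split could be expressed, manual WLM is configured through the wlm_json_configuration parameter of the cluster's parameter group. The example below uses boto3; the queue names, query groups, percentages and parameter group name are all made-up illustrations, not a recommendation:

    import json
    import boto3  # AWS SDK for Python

    # Hypothetical manual WLM layout mirroring the two-queue idea above:
    # a small fast queue for dashboard-style queries, a bigger queue for
    # large/ETL queries, and the default queue for everything else.
    wlm_config = [
        {"query_group": ["dashboard"], "query_concurrency": 10, "memory_percent_to_use": 20},
        {"query_group": ["large"], "query_concurrency": 3, "memory_percent_to_use": 45},
        {"query_concurrency": 5, "memory_percent_to_use": 20},  # default queue
        # The remaining 15% is left unallocated, i.e. the pool that
        # "is managed by Amazon Redshift".
    ]

    redshift = boto3.client("redshift")
    redshift.modify_cluster_parameter_group(
        ParameterGroupName="my-wlm-parameter-group",  # assumed parameter group name
        Parameters=[{
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        }],
    )

This assumes the parameter group is already associated with the cluster; some WLM changes only take effect after a reboot. Queries are routed by query group (e.g. SET query_group TO 'dashboard';) or by user group, and any memory left unassigned is the unallocated pool the documentation mentions, which Redshift can temporarily lend to a queue that asks for more.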
Related
How does Scylla guarantee/keep write latency low for a write-heavy workload, given that more writes would produce more memtable flushes and compaction? Is there any throttling applied? It would be really helpful if someone could answer.
In essence, Scylla provides consistent low latency by parallelizing the problem, and then properly prioritizing user-facing vs. back-office tasks.
Parallelizing - Scylla uses a shard-per-thread architecture. Each thread is responsible for all activities for its token range. Reads, writes, compactions, repairs, etc.
Prioritizing - Each thread schedules its work according to task priority. User-facing tasks such as reads (queries) and writes (commitlog) receive the highest priority. Back-office tasks such as memtable flushes, compaction and repair are only done when there are spare cycles - which, given the nanosecond scale of modern CPUs, there usually are.
If there are not enough spare cycles and RAM or disk starts to fill, Scylla will bump the priority of the back-office tasks in order to save the node. That will, in fact, inject some latency, but it is also an indication that you are probably undersized and should add some resources.
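To make that scheduling idea concrete, here is a deliberately simplified toy sketch in Python (not Scylla code, which is C++ and far more sophisticated): user-facing work always runs first, back-office work gets the spare cycles, and its priority is bumped once invented memory/disk thresholds are crossed.

    import heapq

    # Toy model of per-shard task prioritization (NOT Scylla's implementation).
    # Lower number = higher priority.
    FOREGROUND, BOOSTED_BACKGROUND, BACKGROUND = 0, 1, 2

    def priority(task, mem_used_ratio, disk_used_ratio):
        """Choose a priority class for a task given current resource pressure."""
        if task["kind"] in ("read", "write"):
            return FOREGROUND                       # user-facing work runs first
        # back-office work: memtable flush, compaction, repair
        if mem_used_ratio > 0.75 or disk_used_ratio > 0.85:  # invented thresholds
            return BOOSTED_BACKGROUND               # bump priority to save the node
        return BACKGROUND

    def run_round(tasks, mem_used_ratio, disk_used_ratio):
        """Execute one round of tasks in priority order."""
        heap = [(priority(t, mem_used_ratio, disk_used_ratio), i, t)
                for i, t in enumerate(tasks)]
        heapq.heapify(heap)
        while heap:
            _, _, task = heapq.heappop(heap)
            task["run"]()

Under low pressure, reads and writes complete before flush/compaction ever run; under high pressure, the back-office tasks are boosted so they are not starved, which is where the extra latency mentioned above comes from.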
I would recommend starting with the Scylla Architecture whitepaper at https://go.scylladb.com/real-time-big-data-database-principles-offer.html. There are also many in-depth talks from Scylla developers at https://www.scylladb.com/resources/tech-talks/
For example, https://www.scylladb.com/2020/03/26/avi-kivity-at-core-c-2019/ talks at great depth about shard-per-core.
https://www.scylladb.com/tech-talk/oltp-or-analytics-why-not-both/ talks at great depth about task prioritization.
Memtable flush is more urgent than regular compaction: on one hand we want to flush late, in order to create fewer sstables in level 0, and on the other hand we want to evacuate memory from RAM. We have a memory controller which automatically determines the flush condition. Flushing is done in the background while writes keep going to the commitlog, which is flushed according to the configured criteria.
Compaction is more of a background operation and we have controllers for it too. Go ahead and search the blog for compaction.
My DB is reaching 100% CPU utilization, and increasing the number of CPUs is not working anymore.
What kind of information should I consider when creating my Google Cloud SQL instance? How do you set up the DB configuration?
Info I have:
For 10-50 minutes a day I get 120 requests/second, and the CPU reaches 100% utilization
Memory usage peaks at 2.5 GB during this critical period
Storage usage is currently around 1.3 GB
Current configuration:
vCPUs: 10
Memory: 10 GB
SSD storage: 50 GB
Unfortunately, there is no magic formula for determining the correct database size. This is because queries have variable load - some are small and simple and take no time at all, others are complex or huge and take lots of resources to complete.
There are generally two strategies for dealing with high load: reduce your load (use connection pooling, optimize your queries, cache results), or increase the size of your database (add additional CPUs, storage, or read replicas).
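As one example of the "reduce your load" option, connection pooling keeps a bounded set of reusable connections instead of opening a new one per request. A minimal sketch with SQLAlchemy (the pool sizes and the connection URL are placeholder assumptions):

    from sqlalchemy import create_engine, text

    # Placeholder connection string for a Cloud SQL (Postgres) instance; swap in
    # your own host/credentials or a Cloud SQL connector.
    engine = create_engine(
        "postgresql+psycopg2://app_user:app_password@10.0.0.5:5432/app_db",
        pool_size=5,        # steady-state connections kept open
        max_overflow=2,     # extra connections allowed during bursts
        pool_timeout=30,    # seconds to wait for a free connection before erroring
        pool_recycle=1800,  # recycle connections to avoid stale server-side state
    )

    def handle_request():
        # Each request borrows a pooled connection instead of opening a new one,
        # which keeps connection churn (and the CPU spent on it) down.
        with engine.connect() as conn:
            return conn.execute(text("SELECT 1")).scalar()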
Usually, high CPU utilization occurs because the CPU is overloaded or because there are too many database tables in the same instance. Here are some common issues and fixes provided by Google's documentation:
If CPU utilization is over 98% for 6 hours, your instance is not properly sized for your workload, and it is not covered by the SLA.
If you have 10,000 or more database tables on a single instance, it could result in the instance becoming unresponsive or unable to perform maintenance operations, and the instance is not covered by the SLA.
When the CPU is overloaded, it is recommended to follow the documentation to view the percentage of available CPU your instance is using on the Instance details page in the Google Cloud Console.
It is also recommended to monitor your CPU usage and receive alerts at a specified threshold by setting up a Stackdriver alert.
Increasing the number of CPUs for your instance should reduce the strain on your instance. Note that changing the number of CPUs requires an instance restart. If your instance is already at the maximum number of CPUs, shard your database across multiple instances.
Google has this very interesting documentation about investigating high utilization and determining whether a system or user task is causing high CPU utilization. You could use it to troubleshoot your instance and find what's causing the high CPU utilization.
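If you want to pull that CPU metric programmatically rather than from the console, something along these lines should work with the Cloud Monitoring client library (the project ID and time window are assumptions; double-check the metric filter against the current docs):

    import time
    from google.cloud import monitoring_v3

    project_id = "my-project"  # assumed project ID
    client = monitoring_v3.MetricServiceClient()

    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
    )

    # Cloud SQL exposes CPU usage as a ratio between 0 and 1.
    results = client.list_time_series(
        request={
            "name": f"projects/{project_id}",
            "filter": 'metric.type = "cloudsql.googleapis.com/database/cpu/utilization"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )

    for series in results:
        for point in series.points:
            print(point.interval.end_time, f"{point.value.double_value:.1%}")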
According to the documentation defining query queues, "Any unallocated memory is managed by Amazon Redshift and can be temporarily given to a queue if the queue requests additional memory for processing".
I set up:
- the default queue (60% resources)
- a specific queue (0% resources)
All queries in the specific queue hang indefinitely; I would expect Redshift to utilise the remaining 40% of memory to complete my query.
The reason I want this set up is to divert a monthly batch of many queries to a separate queue that uses at most 40% of available resources. This is only required for a short period of time, however, and the rest of the time I want that memory available for general use.
I tried leaving the allocation blank instead of 0%, and this lets queries run, but the system table STV_WLM_SERVICE_CLASS_CONFIG implies that the specific queue has the full 40% available at all times (the value does not change).
Thanks.
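For reference, this is roughly how the per-queue memory figures mentioned above can be pulled out of STV_WLM_SERVICE_CLASS_CONFIG. Connection details below are placeholders, and the column names are taken from the Redshift system table docs, so verify them against your cluster version:

    import psycopg2  # Redshift speaks the Postgres wire protocol

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
        port=5439, dbname="dev", user="admin", password="...",
    )

    with conn.cursor() as cur:
        # Per the docs, service classes 6 and above are generally the
        # user-defined WLM queues; query_working_mem is per-slot memory in MB.
        cur.execute("""
            SELECT service_class, name, num_query_tasks, query_working_mem
            FROM stv_wlm_service_class_config
            WHERE service_class >= 6
            ORDER BY service_class;
        """)
        for row in cur.fetchall():
            print(row)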
I am running a Dataflow job that is scaling up from 40 workers. The job is using 9.77 TB of persistent disk storage now and hit the following error:
Autoscaling: Unable to reach resize target in zone us-central1-c. QUOTA_EXCEEDED:
Quota 'DISKS_TOTAL_GB' exceeded. Limit: 10240.0 in region us-central1.
The job shouldn't emit that much data as its result, so I am wondering what the role of the allocated persistent disk (PD) is in this case. Also, how is it estimated for each worker?
Here's the dataflow job link: https://console.cloud.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2018-06-24_11_41_19-5444495474275650044?project=wikidetox&angularJsUrl=%2Fdataflow%2FjobsDetail%2Flocations%2Fus-central1%2Fjobs%2F2018-06-24_11_41_19-5444495474275650044%3Fproject%3Dwikidetox&authuser=1
Thank you,
Yiqing
The DISKS_TOTAL_GB quota is for hard drives allocated to your job, not for data emitted by it.
Is this a streaming job? I believe streaming jobs use pretty large hard drives to persist data about job execution. You can increase the DISKS_TOTAL_GB quota for that project/region, and you should be fine.
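If the defaults are what is eating the quota, you can also cap what the workers request through the pipeline options. A sketch for the Beam Python SDK (the specific values are just examples; whether they suit your job depends on how much state it keeps):

    from apache_beam.options.pipeline_options import PipelineOptions

    # Example Dataflow options that bound how much persistent disk the job can
    # request: fewer maximum workers and a smaller per-worker disk.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",               # placeholder project
        region="us-central1",
        temp_location="gs://my-bucket/tmp", # placeholder bucket
        streaming=True,
        max_num_workers=20,  # cap autoscaling so total PD stays under the quota
        disk_size_gb=100,    # per-worker persistent disk size in GB
    )

    # with beam.Pipeline(options=options) as p:
    #     ...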
I would like to know what happens when AWS RDS CPU utilisation reaches 100%.
Do the database requests fail or are the requests put on hold until the CPU utilisation drops below 100%?
I'm using RDS Postgres. Thanks in advance for your help.
Your query performance will degrade. Further queries will fail.
If your RDS is the sole database instance for your application, your entire application could come to a standstill.
You will need to figure out whether the CPU is peaking due to overall high load or whether there is a single query that is consuming all the resources.
If it's under heavy load, adding another replica might help if the workload is read-heavy. If it's write-heavy, you may need to scale up to the next instance size or think about sharding your datasets.
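A quick way to tell whether the CPU is pinned continuously or only spiking is to pull the CloudWatch metric for the instance. A sketch with boto3 (the instance identifier and time window are placeholders):

    from datetime import datetime, timedelta, timezone
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-postgres-instance"}],  # placeholder
        StartTime=end - timedelta(hours=3),
        EndTime=end,
        Period=300,  # 5-minute buckets
        Statistics=["Average", "Maximum"],
    )

    # A long run of near-100% averages points at sustained load (scale up or add
    # a read replica); isolated spikes point at specific expensive queries.
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], round(point["Average"], 1), round(point["Maximum"], 1))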
This could lead to a number of issues, namely:
- There is a good chance that if the CPU remains at 100% consistently, your instance will crash. If this is a Multi-AZ instance, an automatic failover can reduce the downtime incurred by an unexpected reboot to around 2 minutes. However, if this is a Single-AZ instance, the downtime can be significantly longer.
- Your DB instance won't accept any new connections despite not hitting the value of 'max_connections'. There is a good chance that some of the existing transactions will roll back due to performance degradation.
- Continuous spikes to 100% CPU may lead to memory pressure, i.e. very high swap usage and low freeable memory, and an eventual instance crash.
- The workload will hit the instance's throughput threshold, and read and write IOPS won't go any higher.