Aurora Postgresql Reader Performace - postgresql

I am using AWS Aurora postgresql Cluster(One Reader Instance, One Writer Instance).
Until yesterday, my application was operating to use only the writer instance. So far, the CPU metric of the wirter instance has never exceeded 20%.
Today's read-only operation was modified to use a reader instance, but the CPU metric of the reader instance continued to increase around 5% every 30 minutes. (In the first 30 minutes after deployment, it went up to 20% and then decreased again, after 60 minutes it rose to 25% and then decreased again...65%)
Currently, it has been changed to use only the Writer instance again.
Replication was still ongoing, and no other monitoring indicators were found to be unusual.
Why is this happening?
Is it the process of memory warn up of the reader instance?

Related

GCP CloudSQL (PostgreSQL) Crash During Stored Procedure Execution and Failover

I have a stored procedure in GCP CloudSQL (PostgreSQL v9.0.23). It works find in lower environments; but when it runs in Production (with significantly more volume), it crashes the DB itself which results in a Failover.
When we checked the metrics, what we found out is that the memory is more than 90% just before it crashes (15 GB out of the 16GB instance memory). Also the Read / Writes are very high >1000 Ops per second.
The SP does some select and insert statements. Any suggestions to improve this situation helps.
Thanks in advance.
As you have mentioned that the Cloud SQL instance is running smoothly with a small amount of workload but crashing with the Production environment where more intensive workloads are there, it seems the issue is with the instance size. So I would suggest you increase the instance size as per your need.
Also you have mentioned that the memory usage is 15 GB out of 16 GB which amounts to nearly 94%. As per this document your Cloud SQL instance will not be covered under Cloud SQL SLA if memory usage is over 90% for more than 6 hours of duration. So I would suggest you keep the memory usage within 90%. Also I would suggest keeping the CPU utilization as mentioned in this document. To know when your instance reaches any threshold I will suggest you set a monitoring alert for that metrics as mentioned here.
If increasing your instance size doesn’t help I would recommend you to create a support ticket with Google Cloud Support so that they can investigate in detail.

Slow reads on MongoDB from Spark - weird task allocation

I have a MongoDB 4.2 cluster with 15 shards; the database stores a sharded collection of 6GB (i.e., about 400MB per machine).
I'm trying to read the whole collection from Apache Spark, which runs on the same machine. Spark's application runs with --num-executors 8 and --executor-cores 6; the connection is made through the spark-connector by configuring the MongoShardedPartitioner.
Besides the reading being very slow (about 1.5 minutes; but, as far as I understand, full scans are generally bad on MongoDB), I'm experiencing this weird behavior in Spark's task allocation:
The issues are the following:
For some reason, only one of the executors starts reading from the database, while all the others wait 25 seconds to begin their readings. The red bars correspond to "Task Deserialization Time", but my understanding is that they are simply idle (if there are concurrent stages, these executors work on something else and then come back to this stage only after the 25 seconds).
For some other reason, after some time the concurrent allocation of tasks is suspended and then it resumes altogether (at about 55 seconds from the start of the job); you can see it in the middle of picture, as a whole bunch of tasks is started at the same time.
Overall, the full scan could be completed in far less time if tasks were allocated properly.
What is the reason for these behaviors and who is responsible (is it Spark, the spark-connector, or MongoDB)? Is there some configuration parameter that could cause these problems?

"frozen" Google compute engine instance with PostgreSQL

We run several Debian instances with PostgreSQL on Google compute engine and lately we have already seen several occurrences of the following problem.
Instance becomes suddenly non responsive. We cannot ssh it and we cannot connect to the database. Internal monitoring using telegraf is also not running during that period, no monitoring data collected.
Google monitoring of CPU activity shows very low usage during that period. GCP logs do not show any migration in fact do not show anything at all. Also all internal logs for instance - postgresql log, syslog, logs from periodical cronjobs - show the same gap. Looks like the instance was sort of frozen during that time. We so far noticed it only with PostgreSQL instances since these are heavily used.
Instances run these variants of OS and PG:
Debian 9 with PG 11.9
Debian 9 with PG 10.13
These incidents usually take 10-15 minutes, but in one case it was 1:20 hours. At the end of the incident some PG process is killed by an OOM killer but activity on the database immediately before the incident starts is usually relatively low, CPU usage and memory usage too. So it looks more like an instance has limited resources when it starts again? If it is even possible...
Any idea what could be the cause of these issues or what shall we look for? As I mentioned generally no info in internal logs on Debian during the period of the incident.
UPDATE: To avoid misunderstanding - instances in question are data warehouse database running on N1-highmen-8 machine (8 CPUs and 52 GB RAM) with 5 TB SSD. Or database collecting metrics from internet - custom machine 20 CPUs with 90 GB RAM and 3 TB SSD. All SW up to date.
UPDATE 2: Neither syslog, nor kern.log nor messages do not show anything for the time intervals during instance was non responsive. Immediately after incident telegraf recorded huge average load on CPUs but actually quite small CPU usage and Google monitoring shows very small CPU usage during the whole incident. Also immediately after the end of the incident always one of postgresql processes is killed by OOM killer causing database to go to the recovery mode.
As for PG work_mem parameter - instance collecting metrics (20 CPUs 90 GB RAM, 3 TB SSD) uses 8MB - it only inserts data but usually runs like 500 - 1000 connections.
Second instance is data warehouse analytical database and uses work_mem 128MB because lower numbers caused very bad query plans on majority of queries and usually runs only like 10 - 30 connections.
There was no unusual number of connections immediately before incidents happened on both databases.
UPDATE 3: Analytical database had today several small incidents of the same character. During the last one we stopped instance from GCP GUI and started it again after few minutes. Maybe it caused migration to the different HW. Since this operation instance is running OK.
I experienced a similar issue but with a MySQL Instance in GCP, the first issue was related with the type of the VM instance I used, I had a f1-micro machine type on this VM Instance and suddenly I wasn’t able to access the ssh. As this type of VM Instance has only 0.6GB of memory, it became out of memory soon, I changed it to a e2-medium that is value by default and it resolved my problems this time.
As the Instance was out of memory the services in the instance started to fail, it was the reason that I can't access my instance.
At another time I started again with similar issues, but this time, the problem was the disk, I only had 10 GB and there was a process filling my disk, when a partition was out of space, the instance started to fail again.
I only resized my disk, now my instance disk is 20GB and is working fine.
Having said that, I suggest increasing your resources per your convenience to enhance your performance, because to have the problems you described is a good indicator that your existing machine type is not a good fit for your workloads you run on that instance.
If your situation is the same as mine, you could change the machine type to adjust your memory and you can follow the next steps for these tasks please visit the following link to get further information about it.
Changing a machine type
1.- Go to the VM Instances page.
2.- In the Name column, click your instance.
From the instance details page, complete the following steps:
a) Click the Stop button to stop the instance, if you have not stopped it yet.
b) After the instance stops, click the Edit button at the top of the page.
c) Under the Machine configuration section, select the machine type you want to use, or create a custom machine type to increase only the Memory.
d) Save your changes and start again your VM Instance.
You can resize your disk following this guide or with the following command:
gcloud compute disks resize DISK_NAME --size DISK_SIZE
Or with the Console:
Go to the Disks page to see a list of zonal persistent disks in your project.
Click the name of the disk that you want to resize.
On the disk details page, click Edit.
In the Size field, enter the new size for your disk.
Click Save to apply your changes to the disk.
After you resize the disk, you must resize the file system so that the operating system can access the additional space.
Note: Do not resize boot disks beyond 2 TB because this is the limit.
Edit1
You mentioned that the logs don’t show information about the issue when the instance is frozen.
Did you try with the kernel logs? I think it could provide a wealth of diagnostic information about this issue.
For Debian, this logs should be in the following path:
/var/log/kern.log
Also the messages log could help
/var/log/messages
You can obtain more information about the logs in this link.
Also, I think it could be a PostgreSQL config problem, for example you could take a look at "work_mem", this parameter specifies the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files. The value defaults is four megabytes (4MB).
You can consult this URL to get more information.
Also I have found a good article that explains how to configure the PostgreSQL for Data Warehouse Usage
Another option could be that the kernel process in charge of identifying memory that could be paged out. You could configure your process to check smaller chunks more often.
This link explains better this configuration.
Additionally, as far as I know a data warehouse server consumes a lot of resources, so it could be a good idea to check if your Instance has enough resources for your workload.
Edit2
I have found an article that describes a similar problem and it said that:
When you consume more memory than is available on your machine you can start to see out of out of memory errors within your Postgres logs, or in worse cases the OOM killer can start to randomly kill running processes to free up memory. An out of memory error in Postgres simply errors on the query you’re running, where as the the OOM killer in Linux begins killing running processes which in some cases might even include Postgres itself.
And this is the recommendation they give.
When you see an out of memory error you either want to increase the overall RAM on the machine itself by upgrading to a larger instance OR you want to decrease the amount of memory that work_mem uses. Yes, you read that right: out-of-memory it’s better to decrease work_mem instead of increase since that is the amount of memory that can be consumed by each process and too many operations are leveraging up to that much memory.
You could see the complete explanation of this article “Configuring memory for Postgres” here, it may help you with this issue.

Mongodb pod consuming memory even though it is in idle state

When Inserting data in mongodb its memory usage increases then
the data base is dropped and connections are closed, but still the memory usage continue to increase.
I have already configured wiredTiger to 700mb
As you can see the graph in the screen shot attached down,at every 30 mins data insertion and deletion takes place , which consumes max 10 minutes of time and then the connection breaks but as you can see in graph the memory usage continues to increase which then reaches its max limit and then the kuberntes pod starts showing trouble

How to calculate the number of CPU, memory and storage that my Google Cloud SQL needs

My DB is reaching the 100% of CPU utilization and increasing the number of CPU is not working anymore.
What kind of information should I consider to create my Google Cloud SQL? How do you set up the DB configuration?
Info I have:
For 10-50 minute a day I have 120 request/seconds and the CPU reaches 100% of utilization
Memory usage is the maximum 2.5GB during this critical period
Storage usage is currently around 1.3GB
Current configuration:
vCPUs: 10
Memory: 10 GB
SSD storage: 50 GB
Unfortunately, there is no magic formula for determining the correct database size. This is because queries have variable load - some are small and simple and take no time at all, others are complex or huge and take lots of resources to complete.
There are generally two strategies to dealing with high load: Reduce your load (use connection pooling, optimize your queries, cache results), or increase the size of your database (add additional CPUs, Storage, or Read replicas).
Usually, when we have CPU utilization, it is because the CPU is overloaded or we have too many database tables in the same instances. Here are some common issues and fixes provided by Google’s documentation:
If CPU utilization is over 98% for 6 hours, your instance is not properly sized for your workload, and it is not covered by the SLA.
If you have 10,000 or more database tables on a single instance, it could result in the instance becoming unresponsive or unable to perform maintenance operations, and the instance is not covered by the SLA.
When the CPU is overloaded, it is recommended to use this documentation to view the percentage of available CPU your instance is using on the Instance details page in the Google Cloud Console.
It is also recommended to monitor your CPU usage and receive alerts at a specified threshold, set up a Stackdriver alert.
Increasing the number of CPUs for your instance should reduce the strain of your instance. Note that changing CPUs requires an instance restart. If your instance is already at the maximum number of CPUs, shard your database to multiple instances.
Google has this very interesting documentation about investigating high utilization and determining whether a system or user task is causing high CPU utilization. You could use it to troubleshoot your instance and find what's causing the high CPU utilization.