I have a production environment running on a Postgres Cloud SQL instance. My database is around 30GB, with 8GB of RAM on the master and 16GB on the slave. Oddly, memory usage on both master and slave is stuck at 43%. I am not sure why. Can anyone help?
I cannot tell what number exactly the graph represents, but I assume it is allocated memory.
Then that would be fine, because the "free" RAM is actually used by the kernel to cache files, and PostgreSQL uses that memory indirectly via the kernel cache.
We are running a load test against an application that hits a Postgres database.
During the test, we suddenly get an increase in error rate.
After analysing the platform and application behaviour, we notice that:
CPU of Postgres RDS is 100%
Freeable memory drops on this same server
And in the postgres logs, we see:
2018-08-21 08:19:48 UTC::#:[XXXXX]:LOG: server process (PID XXXX) was terminated by signal 9: Killed
After investigating and reading the documentation, one possibility appears to be that the Linux OOM killer ran and killed the process.
But since we're on RDS, we cannot access the system logs in /var/log/messages to confirm.
So can somebody:
confirm that the OOM killer really runs on AWS RDS for Postgres?
give us a way to check this?
give us a way to compute the maximum memory used by Postgres based on the number of connections?
I didn't find the answer here:
http://postgresql.freeideas.cz/server-process-was-terminated-by-signal-9-killed/
https://www.postgresql.org/message-id/CAOR%3Dd%3D25iOzXpZFY%3DSjL%3DWD0noBL2Fio9LwpvO2%3DSTnjTW%3DMqQ%40mail.gmail.com
https://www.postgresql.org/message-id/04e301d1fee9%24537ab200%24fa701600%24%40JetBrains.com
AWS maintains a page with best practices for their RDS service: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_BestPractices.html
In terms of memory allocation, this is their recommendation:
An Amazon RDS performance best practice is to allocate enough RAM so
that your working set resides almost completely in memory. To tell if
your working set is almost all in memory, check the ReadIOPS metric
(using Amazon CloudWatch) while the DB instance is under load. The
value of ReadIOPS should be small and stable. If scaling up the DB
instance class—to a class with more RAM—results in a dramatic drop in
ReadIOPS, your working set was not almost completely in memory.
Continue to scale up until ReadIOPS no longer drops dramatically after
a scaling operation, or ReadIOPS is reduced to a very small amount.
For information on monitoring a DB instance's metrics, see Viewing DB Instance Metrics.
Also, this is their recommendation for troubleshooting possible OS issues:
Amazon RDS provides metrics in real time for the operating system (OS)
that your DB instance runs on. You can view the metrics for your DB
instance using the console, or consume the Enhanced Monitoring JSON
output from Amazon CloudWatch Logs in a monitoring system of your
choice. For more information about Enhanced Monitoring, see Enhanced
Monitoring
There are a lot of good recommendations there, including query tuning.
Note that, as a last resort, you could switch to Aurora, which is compatible with PostgreSQL:
Aurora features a distributed, fault-tolerant, self-healing storage
system that auto-scales up to 64TB per database instance. Aurora
delivers high performance and availability with up to 15 low-latency
read replicas, point-in-time recovery, continuous backup to Amazon S3,
and replication across three Availability Zones.
EDIT: talking specifically about your issue w/ PostgreSQL, check this Stack Exchange thread -- they had a long connection with auto commit set to false.
We had a long connection with auto commit set to false:
connection.setAutoCommit(false)
During that time we were doing a lot
of small queries and a few queries with a cursor:
statement.setFetchSize(SOME_FETCH_SIZE)
In JDBC you create a connection object, and from that connection you
create statements. When you execute the statements you get a result
set.
Now, every one of these objects needs to be closed, but if you close
the statement, the result set is closed, and if you close the connection
all the statements are closed along with their result sets.
We were used to short-lived queries with connections of their own, so
we never closed statements, assuming the connection would handle
things once it was closed.
The problem was now with this long transaction (~24 hours) which never
closed the connection. The statements were never closed. Apparently,
the statement object holds resources both on the server that runs the
code and on the PostgreSQL database.
My best guess as to what resources are left in the DB is the things
related to the cursor. The statements that used the cursor were never
closed, so the result sets they returned were never closed either. This
meant the database didn't free the relevant cursor resources,
and since it was over a huge table it took a lot of RAM.
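The pattern described above (release each statement or cursor as soon as you are done with it, even when the connection itself stays open for a long time) can be sketched as follows. This is only an illustration in Python, with the standard-library sqlite3 module standing in for PostgreSQL. The table name, row count, and batch size are made up for the example. In JDBC the equivalent is to close each Statement and ResultSet explicitly, for instance with try-with-resources.

```python
import sqlite3
from contextlib import closing

# Long-lived connection, standing in for the ~24 hour connection.
# The "items" table and its contents are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (name) VALUES (?)",
    [("item-%d" % i,) for i in range(1000)],
)
conn.commit()


def count_in_batches(connection, batch_size=100):
    """Read every row in batches and release the cursor when done."""
    total = 0
    # closing() calls cur.close() even if the loop raises, so the
    # cursor does not stay open for as long as the connection does.
    with closing(connection.cursor()) as cur:
        cur.execute("SELECT id, name FROM items")
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:
                break
            total += len(rows)
    return total


print(count_in_batches(conn))
```

The connection is deliberately left open at the end. The point is that each cursor has a short, bounded lifetime of its own.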
Hope it helps!
TLDR: If you need PostgreSQL on AWS and you need rock solid stability, run PostgreSQL on EC2 (for now) and do some kernel tuning for overcommitting
I'll try to be concise, but you're not the only one who has seen this and it is a known (internal to Amazon) issue with RDS and Aurora PostgreSQL.
OOM Killer on RDS/Aurora
The OOM killer does run on RDS and Aurora instances because they are backed by Linux VMs, and the OOM killer is an integral part of the kernel.
Root Cause
The root cause is that the default Linux kernel configuration assumes that you have swap space (a swap file or partition), but EC2 instances (and the VMs that back RDS and Aurora) do not have swap by default. There is a single partition and no swap file is defined. When Linux thinks it has swap, it uses a strategy called "overcommitting", which means that it allows processes to request and be granted more memory than the amount of RAM the system actually has. Two tunable parameters govern this behavior:
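vm.overcommit_memory - governs whether the kernel allows overcommitting (0 = heuristic overcommit, the default; 1 = always overcommit; 2 = never overcommit)
vm.overcommit_ratio - the percentage of physical RAM that counts toward the commit limit when vm.overcommit_memory = 2; swap is added on top in full. If you have 8GB of RAM and 8GB of swap, and your vm.overcommit_ratio = 75, the kernel will grant up to 8GB + 75% of 8GB = 14GB of memory to processes.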
We set up an EC2 instance (where we could tune these parameters) and the following settings completely stopped PostgreSQL backends from getting killed:
vm.overcommit_memory = 2
vm.overcommit_ratio = 75
vm.overcommit_memory = 2 tells Linux not to overcommit (work within the constraints of system memory), and vm.overcommit_ratio = 75 tells Linux, on a machine with no swap, not to grant requests beyond 75% of RAM in total (user processes can only get up to 75% of memory).
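As a rough check on what these settings allow, here is a small sketch of the commit limit the kernel applies when vm.overcommit_memory = 2. It assumes the formula from the kernel's overcommit accounting documentation (swap plus overcommit_ratio percent of RAM) and ignores huge pages. The function name is made up for this example.

```python
def commit_limit_gb(ram_gb, swap_gb, overcommit_ratio):
    """Approximate commit limit in GB when vm.overcommit_memory = 2.

    Assumes the documented kernel formula with huge pages ignored:
    swap + RAM * overcommit_ratio / 100.
    """
    return swap_gb + ram_gb * overcommit_ratio / 100


# An 8GB instance with no swap and vm.overcommit_ratio = 75.
print(commit_limit_gb(8, 0, 75))
```

For an 8GB instance without swap this gives 6GB, which leaves the remaining 2GB outside the reach of user allocations.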
We have an open case with AWS and they have committed to coming up with a long-term fix (using kernel tuning params or cgroups, etc) but we don't have an ETA yet. If you are having this problem, I encourage you to open a case with AWS and reference case #5881116231 so they are aware that you are impacted by this issue, too.
In short, if you need stability in the near term, use PostgreSQL on EC2. If you must use RDS or Aurora PostgreSQL, you will need to oversize your instance (at additional cost to you) and hope for the best as oversizing doesn't guarantee you won't still have the problem.
I have a MongoDB instance in the cloud on an AWS EC2 t2.micro (30GB storage, 1GB RAM) running in Docker. In that database I have a single collection that stores 411 thousand documents, and this takes ~700MB of disk space.
On my local computer, if I run this in mongo shell:
db.my_collection.find().skip(200000).limit(1)
then I get the correct results, but if I run this
db.my_collection.find().skip(220000).limit(1)
then MongoDB shuts down. Why? What should I do to access this data?
It appears that your system doesn't have enough RAM to meet MongoDB's demand. When a Linux system is critically low on memory, the kernel starts killing processes to keep the system itself from crashing.
I believe this is what is happening in your case too. MongoDB is not even getting a chance to write a log. I'd recommend increasing the RAM or, if that's not feasible, adding more swap space. This will prevent the kill, and MongoDB will keep working, though very slowly.
Please visit these excellent resources on Linux and its behavior.
https://unix.stackexchange.com/questions/136291/will-linux-start-killing-my-processes-without-asking-me-if-memory-gets-short
https://serverfault.com/questions/480266/how-to-know-if-the-server-runs-out-of-ram-before-crashing-down
I attempted to migrate my production environment from native Postgres (hosted on AWS EC2) to RDS Postgres (9.4.4), but it failed miserably. The CPU utilisation of the RDS Postgres instances shot up drastically compared to that of the native Postgres instances.
My environment details are as follows:
Master: db.m3.2xlarge instance
Slave1: db.m3.2xlarge instance
Slave2: db.m3.2xlarge instance
Slave3: db.m3.xlarge instance
Slave4: db.m3.xlarge instance
[Note: All the slaves were at Level 1 replication]
I had configured the master to receive only write requests, and this instance was fine. The write count was 50 to 80 per second and the CPU utilisation was around 20 to 30%.
But apart from this instance, all my slaves performed very badly. The slaves were configured to receive only read requests, and I assume all the writes that were happening were due to replication.
Provisioned IOPS on these boxes were 1000
On average there were 5 to 7 read requests hitting each slave, and the CPU utilisation was 60%.
Whereas in native Postgres, we stay well within 30% for this traffic.
I couldn't figure out what's going wrong with the RDS setup, and AWS support is not able to provide good leads.
Did anyone face similar things with RDS Postgres?
Lots of factors can drive up CPU utilization on PostgreSQL, such as:
Free disk space
CPU Usage
I/O usage etc.
I came across the same issue a few days ago. For me the reason was that some transactions were getting stuck and had been running for a long time, which increased CPU utilization. I found this out by running a PostgreSQL monitoring query:
SELECT max(now() - xact_start) FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'active');
This query shows how long the oldest open transaction has been running. That time should not be greater than one hour. Killing the transactions that had been running for a long time, or that were stuck at some point, worked for me. I followed this post for monitoring and solving my issue. The post includes lots of useful commands for monitoring this situation.
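To see which sessions are responsible rather than just the maximum age, the same check can be applied per row. The sketch below assumes you have already fetched (pid, state, xact_start) tuples from pg_stat_activity with whatever client library you use. The function name, the sample rows, and the timestamp are hypothetical, and the one-hour threshold is the rule of thumb mentioned above.

```python
from datetime import datetime, timedelta, timezone


def long_running(rows, now, limit=timedelta(hours=1)):
    """Return (pid, age) for open transactions older than limit, oldest first.

    rows is assumed to hold (pid, state, xact_start) tuples as selected
    from pg_stat_activity; xact_start is None when no transaction is open.
    """
    flagged = [
        (pid, now - xact_start)
        for pid, state, xact_start in rows
        if xact_start is not None
        and state in ("idle in transaction", "active")
        and now - xact_start > limit
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical sample data, not taken from a real server.
now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    (101, "active", now - timedelta(minutes=5)),
    (102, "idle in transaction", now - timedelta(hours=3)),
    (103, "idle", None),
]
print(long_running(rows, now))
```

With the sample rows, only PID 102 is reported, since it has been idle in a transaction for three hours.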
I would suggest increasing your work_mem value, as it might be too low, and doing normal query optimization research to see if you're using queries without proper indexes.
I'm loading the same amount of data (~100kb) on both my local server and a test Amazon EC2 server, but the response is 2x slower on EC2. Both are running Apache 2 and MongoDB on the same machine. On my local server, the response is about 209ms versus approximately 455ms on EC2.
I've set up a simple query and AJAX call that grabs point data to display on the map based on the current viewport of the device.
How can I debug this issue? How can I make it as fast as my local server? I even tried experimenting with different types of instances to make sure the specs are the same, but no luck. I also realize it could be because of network latency.
Local computer specs:
Intel Core i5 @ 3.30GHz
8GB RAM
64-bit Windows 8
Amazon EC2 specs (m4.large):
2.4 GHz Intel Xeon Haswell (2 vCPUs)
8GB RAM
Amazon Linux
A remote query to EC2 is unlikely to return the result of the AJAX query as fast as your local server because it has network latency, while your local server does not. Measure the time in your AJAX handler from the start of the query to the point where it is ready to return data to get a meaningful baseline for comparison.
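One way to get that baseline is to time only the query inside the handler and log the result on the server, so the network round trip is excluded. Below is a minimal sketch in Python. The helper name `timed` is made up, and `fake_query` is a hypothetical stand-in for the real MongoDB call.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed milliseconds) measured server-side."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def fake_query(n):
    # Hypothetical stand-in for the MongoDB query in the AJAX handler.
    return [i * i for i in range(n)]


result, ms = timed(fake_query, 1000)
print(f"query returned {len(result)} rows in {ms:.2f} ms")
```

Running the same measurement on both machines, several times in a row, gives numbers that can be compared directly.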
MongoDB is very sensitive to data being in RAM vs on disk. Depending on how you configured your EC2 instance, and on your local hardware, chances are pretty good that your local hardware is faster. EC2 instances can be configured to use SSD storage, and you can configure a guaranteed IOPS figure.
Is the 100KB the size of the result set, or the amount of data needed to form the result set? If you process 4GB of data down to get a 100KB result set, there's a good chance that disk IO is involved. If the amount of data you need to pull is small, repeat the test a few times to ensure data is entirely in RAM.
Finally, if both local and EC2 are pulling data from RAM, there's a good chance that your local CPU core is just faster than the EC2 CPU core, and that your RAM access is faster as well. EC2 is designed to provide low-cost commodity hardware. Developer setups are often much faster.
If you cannot account for the speed differences given the factors above, update your question with time measurements that exclude network latency and provide more detailed specifications of your hardware. Also indicate whether the data you are retrieving from MongoDB should be entirely in RAM, given its size and the amount of RAM on your instance.
We started using memcached on the test server for our social media project and are having some problems with RAM usage.
We have created a cluster with 1 server node running just 1 cache bucket sized 128 MB, but when we check memcached.exe in Task Manager, its RAM usage rises continuously by about 1 MB per second.
Any workaround on this?
Thanks!
If you're using our 1.0.3 product (the current version of our Memcached server) there is a known issue where deleting the default bucket causes a memory leak. Can you let me know whether you deleted the default bucket?
Also, we just released beta 4 of our 1.6.0 product which has support for both Membase buckets as well as Memcached buckets. I would certainly appreciate you taking a look and trying it out. I know it has fixed the memory leak issue.
Thanks so much.
Perry