Memory issues with new version of Mongodb - mongodb

I'm using Mongodb on my Windows server 2012 for more than two years. Since the last update some weird issues started to happen which in the end lead to usage of the entire RAM memory.
The service Iv'e configured for Mongodb is as follows:
logpath=d:\data\log\mongod.log
dbpath=d:\data\db
storageEngine=wiredTiger
rest=true
#override port
port=27017
#configsvr = true
shardsvr = true
And in order to limit the Cache memory usage Iv'e added the following line:
wiredTigerCacheSizeGB=10
And this is where the weird stuff started happening. When I check the task manager it says that now Mongodb is really limited to 10GB as I defined in the service but it is actually using a lot more than 10GB.
In the first image you can see the memory consumption sorted by RAM consumption
While in fact the machine I'm using has 28GB in total
This crazy consumption leads to failure in the scripts I'm running, even the most basic ones, even when I only run simple queries like 'count' or 'distinct', I believe that this is a direct results of the memory consumption.
When I checked the log files I saw that there are many open connections that even when the session ends it indicates that still the same amount of connections is opened:
So in the end I have two major questions:
1. Is there a way of solving this issue without downgrading the Mongodb version?
2. The config file looks right? is everything there is necessary?

Memory usage in WiredTiger is a two-level cache:
First is the WiredTiger cache as controlled by --wiredTigerCacheSizeGB
Second is the Operating System filesystem cache. MongoDB automatically uses all free memory that is not used by the WiredTiger cache or by other processes
See also WiredTiger memory usage
For OS filesystem cache, MongoDB doesn't manage the memory it uses directly - it lets the OS manage it. Windows will try to use every last scrap of physical memory if it can - but lots of it should and will be thrown out if other processes request memory.
An alternative is to run mongod in a container (e.g. lxc, cgroups, Docker, etc.) that does not have access to all of the RAM available in a system.
Having said the above:
You are also running another database in the server i.e. mysqld. MongoDB, like some databases will perform better on a dedicated server to reduce memory contention.
Task Manager shows mongod is using 10GB, although the machine is using up to ~28GB. This may or may not be mongod as you have other processes as well.
Useful resources:
FAQ: Memory diagnostics for WiredTiger
FAQ: MongoDB Cache Handling
MongoDB Production Notes

Related

How to limit mongodb's RAM usage?

I am running tests on the server, as a result of which mongodb starts using a lot of RAM. Mongo version 3.6.8, arm architecture on the server, there is no swap file. I've tried all the solutions from here Is there any option to limit mongodb memory usage? . In particular, when using cgroups, mongo exceeds the limit and dies.
I will also say that the memory that mongodb uses for the cache is not returned if necessary.
I will be glad of any help.

AWS RDS with Postgres : Is OOM killer configured

We are running load test against an application that hits a Postgres database.
During the test, we suddenly get an increase in error rate.
After analysing the platform and application behaviour, we notice that:
CPU of Postgres RDS is 100%
Freeable memory drops on this same server
And in the postgres logs, we see:
2018-08-21 08:19:48 UTC::#:[XXXXX]:LOG: server process (PID XXXX) was terminated by signal 9: Killed
After investigating and reading documentation, it appears one possibility is linux oomkiller running having killed the process.
But since we're on RDS, we cannot access system logs /var/log messages to confirm.
So can somebody:
confirm that oom killer really runs on AWS RDS for Postgres
give us a way to check this ?
give us a way to compute max memory used by Postgres based on number of connections ?
I didn't find the answer here:
http://postgresql.freeideas.cz/server-process-was-terminated-by-signal-9-killed/
https://www.postgresql.org/message-id/CAOR%3Dd%3D25iOzXpZFY%3DSjL%3DWD0noBL2Fio9LwpvO2%3DSTnjTW%3DMqQ%40mail.gmail.com
https://www.postgresql.org/message-id/04e301d1fee9%24537ab200%24fa701600%24%40JetBrains.com
AWS maintains a page with best practices for their RDS service: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_BestPractices.html
In terms of memory allocation, that's the recommendation:
An Amazon RDS performance best practice is to allocate enough RAM so
that your working set resides almost completely in memory. To tell if
your working set is almost all in memory, check the ReadIOPS metric
(using Amazon CloudWatch) while the DB instance is under load. The
value of ReadIOPS should be small and stable. If scaling up the DB
instance class—to a class with more RAM—results in a dramatic drop in
ReadIOPS, your working set was not almost completely in memory.
Continue to scale up until ReadIOPS no longer drops dramatically after
a scaling operation, or ReadIOPS is reduced to a very small amount.
For information on monitoring a DB instance's metrics, see Viewing DB Instance Metrics.
Also, that's their recommendation to troubleshoot possible OS issues:
Amazon RDS provides metrics in real time for the operating system (OS)
that your DB instance runs on. You can view the metrics for your DB
instance using the console, or consume the Enhanced Monitoring JSON
output from Amazon CloudWatch Logs in a monitoring system of your
choice. For more information about Enhanced Monitoring, see Enhanced
Monitoring
There's a lot of good recommendations there, including query tuning.
Note that, as a last resort, you could switch to Aurora, which is compatible with PostgreSQL:
Aurora features a distributed, fault-tolerant, self-healing storage
system that auto-scales up to 64TB per database instance. Aurora
delivers high performance and availability with up to 15 low-latency
read replicas, point-in-time recovery, continuous backup to Amazon S3,
and replication across three Availability Zones.
EDIT: talking specifically about your issue w/ PostgreSQL, check this Stack Exchange thread -- they had a long connection with auto commit set to false.
We had a long connection with auto commit set to false:
connection.setAutoCommit(false)
During that time we were doing a lot
of small queries and a few queries with a cursor:
statement.setFetchSize(SOME_FETCH_SIZE)
In JDBC you create a connection object, and from that connection you
create statements. When you execute the statments you get a result
set.
Now, every one of these objects needs to be closed, but if you close
statement, the entry set is closed, and if you close the connection
all the statements are closed and their result sets.
We were used to short living queries with connections of their own so
we never closed statements assuming the connection will handle the
things once it is closed.
The problem was now with this long transaction (~24 hours) which never
closed the connection. The statements were never closed. Apparently,
the statement object holds resources both on the server that runs the
code and on the PostgreSQL database.
My best guess to what resources are left in the DB is the things
related to the cursor. The statements that used the cursor were never
closed, so the result set they returned never closed as well. This
meant the database didn't free the relevant cursor resources in the
DB, and since it was over a huge table it took a lot of RAM.
Hope it helps!
TLDR: If you need PostgreSQL on AWS and you need rock solid stability, run PostgreSQL on EC2 (for now) and do some kernel tuning for overcommitting
I'll try to be concise, but you're not the only one who has seen this and it is a known (internal to Amazon) issue with RDS and Aurora PostgreSQL.
OOM Killer on RDS/Aurora
The OOM killer does run on RDS and Aurora instances because they are backed by linux VMs and OOM is an integral part of the kernel.
Root Cause
The root cause is that the default Linux kernel configuration assumes that you have virtual memory (swap file or partition), but EC2 instances (and the VMs that back RDS and Aurora) do not have virtual memory by default. There is a single partition and no swap file is defined. When linux thinks it has virtual memory, it uses a strategy called "overcommitting" which means that it allows processes to request and be granted a larger amount of memory than the amount of ram the system actually has. Two tunable parameters govern this behavior:
vm.overcommit_memory - governs whether the kernel allows overcommitting (0=yes=default)
vm.overcommit_ratio - what percent of system+swap the kernel can overcommit. If you have 8GB of ram and 8GB of swap, and your vm.overcommit_ratio = 75, the kernel will grant up to 12GB or memory to processes.
We set up an EC2 instance (where we could tune these parameters) and the following settings completely stopped PostgreSQL backends from getting killed:
vm.overcommit_memory = 2
vm.overcommit_ratio = 75
vm.overcommit_memory = 2 tells linux not to overcommit (work within the constraints of system memory) and vm.overcommit_ratio = 75 tells linux not to grant requests for more than 75% of memory (only allow user processes to get up to 75% of memory).
We have an open case with AWS and they have committed to coming up with a long-term fix (using kernel tuning params or cgroups, etc) but we don't have an ETA yet. If you are having this problem, I encourage you to open a case with AWS and reference case #5881116231 so they are aware that you are impacted by this issue, too.
In short, if you need stability in the near term, use PostgreSQL on EC2. If you must use RDS or Aurora PostgreSQL, you will need to oversize your instance (at additional cost to you) and hope for the best as oversizing doesn't guarantee you won't still have the problem.

MongoDB cant access document above at specific skip

I have a MongoDB instance in a cloud on AWS EC2 t2.micro (30GB storage, 1GB ram) running in Docker and in that database I have a single collection which stores 411 thousand documents, an this takes ~700MB disk space.
On my local computer, if I run this in mongo shell:
db.my_collection.find().skip(200000).limit(1)
then I get the correct results, but if I run this
db.my_collection.find().skip(220000).limit(1)
then MongoDB shuts down. Why? What should I do, to access these data?
It appears that your system doesn't have enough RAM to fulfill mongodb demand. When a Linux system is critically low in memory, kernel starts killing processes to avoid system crash itself.
I believe, this is what happening in your case too. Mongodb is not even getting chance to write a log. I'd recommend to increase RAM or if it's not feasible, add more swap space. This will prevent system crash but mongodb will keep working though very very slow.
Please visit these excellent resources on Linux and it's behavior.
https://unix.stackexchange.com/questions/136291/will-linux-start-killing-my-processes-without-asking-me-if-memory-gets-short
https://serverfault.com/questions/480266/how-to-know-if-the-server-runs-out-of-ram-before-crashing-down

Mongodb hotfix KB2731284

I installed MongoDb on a windows server 2008 R2 and the hotfix KB2731284 is not installed, but I cannot restart the server easily.
In the hotfix description, I got this message "You run an application that uses the FlushViewOfFile() function to clean up memory-mapped files from the paged memory pool." (see https://support.microsoft.com/en-us/kb/2731284)
My question is, when the funtion FlushViewOfFile() is called? My application is just writing in a collection and get data from it. Do I risk to get some wrong behaviors?
I think you can run MongoDb without applying the Hotfix, but I would not recommend it. In long time you may run into problems. They have included some fixes in MongoDB to workaround the problem.
A detailed description of the problem can be found here and here.
See also this.
On Windows, Memory Mapped File flushes are synchronous operations. When the OS Virtual Memory Manager is asked to flush a memory mapped file, it makes a synchronous write request to the file cache manager in the OS. This causes large I/O stalls on Windows systems with high Disk IO latency, while on Linux the same writes are asynchronous.
The problem becomes critical on high-latency disk drives like Azure persistent storage (10ms). This behavior results in very long bg flush times, capping disk IOPS at 100. On low latency storage (local storage and AWS) the problem is not that visible.
On Windows 7 and Windows Server 2008 R2 when applying the hotfix you get a better file allocation performance what is relevant for MongoDB

mongodb flushing mmap takes around 20 secs with no updates being required

Hi One of our customers is running mongodb V2.2.3 on a 64 bit windows server 2008 R2 Enterprise.
We're currently seeing mmap flush times of over 20 seconds every minute.
What is confusing me is that it isn't doing any writes to the disk. (Disk write bytes is next to 0)
Our programme which access the data has been temporary turned off.
so all that is connected is a mongo shell.
Mongostat and mongotop aresn't showing anything
The database has 130 million records. There are 356 files for mmap.
Any sugestions on what could be causing this?
Thanks
If your working set is significantly larger than memory, and MongoDB is constantly going to disk for reads (and not just the normal spikes when syncing writes to disk), then you really should be sharding to spread the data across multiple machines/instances.
Given the behaviour you have described and that you have a large number of files for mmap, I suspect the underlying performance issue is SERVER-12401 in the MongoDB Jira issue tracker:
On Windows, Memory Mapped File flushes are synchronous operations. When the OS Virtual Memory Manager is asked to flush a memory mapped file, it makes a synchronous write request to the file cache manager in the OS. This causes large I/O stalls on Windows systems with high Disk IO latency, while on Linux the same writes are asynchronous.
There are a few possible ways to improve the flush performance on Windows, including code changes in both the MongoDB server and the Windows O/S. There is some ongoing work to address these issues, now that the synchronous flushing behaviour on Windows has been confirmed.
If you are using higher latency local storage (for example, spinning disks) you may be able to mitigate the issue by upgrading to SSD or better spec'd drives.
I would suggest upvoting/watching SERVER-12401 and the related Jira issues for updates.
It would also be worth upgrading from MongoDB 2.2 to a newer version as 2.2 is now past end-of-life for updates. There have been two major production release branches since then, including significant improvements in general performance/features as well as Windows support.