I run a MongoDB server in an EC2 instance on AWS. For a while it ran flawlessly, never spiking above 50% of memory usage.
Suddenly, the behaviour changed to an ever-increasing curve: it never goes down significantly, but keeps growing throughout the day until it reaches a peak (sometimes 100%, sometimes 90 or 80%) and then suddenly drops back to 50% (perhaps some hypervisor activity here?).
This behavior is NOT, in any way, compatible with the usage pattern of the application server this database is serving. Below is a graph comparing incoming requests vs. DB memory usage.
What should I be looking into to diagnose this memory issue? I already looked at the number of open connections at any given time, and it is very low (<20), so I don't think it could be that. Any other ideas?
This issue doesn't seem to impact database performance, but I can't run any maintenance jobs (like archiving and backup) with memory that high - I had cases in which the database crashed.
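For this kind of diagnosis, a minimal mongo-shell sketch of the counters worth checking first; field names come from db.serverStatus() and may vary slightly between MongoDB versions:

```
// Run in the mongo shell against the affected instance
var s = db.serverStatus()

// Resident vs. virtual memory of the mongod process (in MB)
printjson(s.mem)

// WiredTiger cache usage vs. its configured ceiling (in bytes);
// by default the cache is roughly 50% of (RAM - 1 GB)
print("cache in use: " + s.wiredTiger.cache["bytes currently in the cache"])
print("cache max:    " + s.wiredTiger.cache["maximum bytes configured"])

// Cursor counters - a steadily growing open.total can indicate a cursor leak
printjson(s.metrics.cursor)
```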
Related
We run a small NodeJS app that manages subscriptions for our mobile apps. Its backend is a MongoDB instance with only 100 MB of memory. Currently the data size is around 120 MB. It's all hosted on a PaaS called Nodechef.
After running for about a week the Mongo server hit 98MB/100MB in memory usage. Not knowing what would happen, we forced a restart and it dropped back to 70MB or so. It's slowly creeping back up.
A few questions:
Is this normal behavior for Mongo to keep growing in memory up to the max?
What happens when it hits max? Will it reboot or crash, or do some kind of garbage collection?
Is restarting weekly a pretty normal fix for this type of issue?
According to this you can try setting hostInfo.system.memLimitMB, but I am surprised MongoDB runs at all with only 100 MB of available memory (if that figure is accurate).
If the MongoDB process runs out of memory (i.e. requests a memory allocation which is denied) it is likely to immediately terminate.
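For what it's worth, you can check what limit mongod actually sees with db.hostInfo() (memLimitMB is only reported by reasonably recent MongoDB versions), and if the limit is not picked up you can cap the WiredTiger cache yourself. A hedged sketch:

```
// In the mongo shell: what mongod believes the machine's memory situation is.
// In a small container this often still reports the host's total RAM,
// which makes the default WiredTiger cache far too large.
var h = db.hostInfo()
print("memSizeMB:  " + h.system.memSizeMB)
print("memLimitMB: " + h.system.memLimitMB)

// If the limit is wrong, cap the cache explicitly in mongod.conf:
//   storage:
//     wiredTiger:
//       engineConfig:
//         cacheSizeGB: 0.25   # the minimum allowed value
```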
This is a Nodechef specific answer based on how they handle this. Other PaaS may handle differently:
"When it hits 125%, SWAP included, it will auto restart itself. It does a graceful shutdown so there should not be any problems there.
In regards to if this is normal, depends, i have seen cases where the app does not close cursors properly causing a cursor leak on the database server which results in memory continuously increasing. Another issue could also be memory fragmentation on the server itself. As long as the restarts are not happening hourly, you are more than fine. Taking a couple week to hit its peak is ok."
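If the cursor-leak scenario is the suspect, the fix is on the application side: always exhaust or close the cursors you open. A minimal Node.js sketch with the official mongodb driver (the collection name and query are made up for illustration):

```
const { MongoClient } = require('mongodb');

async function listActiveSubscriptions(db) {
  const cursor = db.collection('subscriptions').find({ active: true });
  try {
    const results = [];
    // Iterating the cursor to the end (or calling toArray()) lets the server release it
    for await (const doc of cursor) {
      results.push(doc);
    }
    return results;
  } finally {
    // Close explicitly in case we bailed out early (error, break, etc.)
    await cursor.close();
  }
}

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  try {
    console.log(await listActiveSubscriptions(client.db('app')));
  } finally {
    await client.close();
  }
}
```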
I've seen articles recommending that resource requests and limits should be set. However, none that I've found discuss what numbers to actually put in.
For example, consider a container that uses nearly zero CPU while idle, 80% of a core under normal user requests, and 200% when hit by some rare requests:
If I put the maximum, 2000m, as the CPU request, then a core would sit idle most of the time.
On the other hand, if I request 800m and several pods hit their CPU limit at the same time, the context-switching overhead kicks in.
There are also cases like:
Internal tools that sit idle most of the time, then jump to 200% on active use.
Apps that have different peak times. For example, a SaaS that people use during working hours and a chatbot that starts getting load after people leave work. It would be nice if they could share the unused capacity.
Ideally the Vertical Pod Autoscaler would solve these problems automatically, but it is still in alpha today.
What I've been doing is use Telegraf to collect resource usage and set the request to the 95th percentile of observed usage, with the limit set to 1 CPU and twice the memory request.
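As a concrete illustration (the numbers are made up, not a recommendation), that rule of thumb translates into a container spec along these lines:

```
# Plug in your own 95th-percentile figures from telegraf
resources:
  requests:
    cpu: 800m        # ~95th percentile of observed CPU usage
    memory: 400Mi    # ~95th percentile of observed memory usage
  limits:
    cpu: "1"         # cap at one core
    memory: 800Mi    # twice the memory request
```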
The problems with this method are:
Apps that utilize multiple cores during startup but stay under one core for the rest of their life take longer to start. I've observed a 2-minute Spring startup become 5 minutes.
Apps that are rarely used have fewer resources reserved, and so have to rely on burst capacity when they do get invoked. This could be a problem if they see a surge in popularity.
I'm using server-side rendering with Angular Universal, and PM2 as the process manager, on a Digital Ocean droplet with 8 GB memory / 80 GB disk / Ubuntu 16.04.3 x64 / 4 vCPUs.
I use a 6 GB swap file, and the output of "free -m" is the following:
total used free shared buff/cache available
Mem: 7983 1356 5290 16 1335 6278
Swap: 6143 88 6055
The RAM usage looks fine. There are 4 processes running in PM2's cluster mode.
Every 6-8 hours, when memory reaches ~88% in my Digital Ocean panel, the CPU goes very high, the web application stops responding correctly, and PM2 has to restart the process; I'm not sure for how long the web application misbehaves.
Here is an image of what happens:
Performance is fine when working normally:
I think I'm missing some sort of configuration or something, since this always happens at roughly the same intervals.
EDIT1: So far I have fixed some incompatibilities in my code (the app was working, but sometimes failed because of them) and added a memory limit of 1 GB in PM2. I'm not sure if this is the way to go, since I'm a bit new to process management, but the CPU levels are fine now. Any comment is appreciated. I leave a picture of the current behaviour; every time one of the four processes reaches 1 GB, it's restarted:
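(For reference, a 1 GB limit in PM2 usually means something like the following in the ecosystem file; the app name, script path and instance count here are placeholders, not the actual project's:)

```
// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'ssr-app',
    script: 'dist/server.js',
    instances: 4,                // one worker per vCPU, cluster mode
    exec_mode: 'cluster',
    max_memory_restart: '1G'     // PM2 restarts a worker once it exceeds 1 GB
  }]
};
```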
EDIT2: I've added 3 more images, 2 showing top processes from Digital Ocean, and one showing the Keymetrics status:
EDIT3: I found some memory leaks in my Angular app (I forgot to unsubscribe from a couple of subscriptions) and the system behaviour improved, but the memory line is still going up. I'll keep investigating memory leaks in Angular and see if I've made some other mistakes:
It looks like your Angular Universal app is leaking memory; its usage should not keep increasing over time as you observe, but stay mostly flat.
You can try to find the memory leak (it looks like you already found one issue and have a suspicion about what else it could be).
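The most common source of such leaks in Angular is subscriptions that never get torn down. A minimal sketch of the usual takeUntil pattern (the component name and the interval source are just for illustration):

```
import { Component, OnDestroy, OnInit } from '@angular/core';
import { interval, Subject } from 'rxjs';
import { takeUntil } from 'rxjs/operators';

@Component({ selector: 'app-ticker', template: '{{ ticks }}' })
export class TickerComponent implements OnInit, OnDestroy {
  ticks = 0;
  private destroy$ = new Subject<void>();

  ngOnInit() {
    // Without takeUntil (or an explicit unsubscribe in ngOnDestroy),
    // this subscription would keep the component alive forever - a classic leak.
    interval(1000)
      .pipe(takeUntil(this.destroy$))
      .subscribe(() => this.ticks++);
  }

  ngOnDestroy() {
    this.destroy$.next();
    this.destroy$.complete();
  }
}
```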
Another thing you can try is periodically restarting your app.
See here, for example, how to restart your PM2 process every couple of hours to reset memory and prevent the OOM situation you've been running into.
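If you go that route, PM2 can do the scheduled restart itself via cron_restart in the same kind of ecosystem file; a small sketch (the schedule and names are examples only):

```
// ecosystem.config.js - stop-gap scheduled restart while the leak is being hunted down
module.exports = {
  apps: [{
    name: 'ssr-app',
    script: 'dist/server.js',
    cron_restart: '0 */6 * * *'   // restart every 6 hours
  }]
};
```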
In our (edge) case, the Kubernetes healthcheck was the cause of the issue. The healthcheck accessed the main page via an internal IP. The page used the caller URL (in this case its IP) to load some resources, which it couldn't find that way. This led to an error that was somehow cached and slowly used up all the memory. We saw the same very linear rise in memory even during the night because of the regularity of the healthcheck.
We solved the problem by having the healthcheck call "/health", where we only return a 200 code - as one should do anyway.
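For reference, the endpoint can be as dumb as this (an Express sketch; Angular Universal apps are typically served by Express, but adapt it to whatever server you use):

```
const express = require('express');
const app = express();

// Liveness endpoint: no rendering, no resource loading, no caching - just a 200
app.get('/health', (req, res) => res.sendStatus(200));

app.listen(3000);
```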
My system has been behaving strangely for the last couple of days. I am a regular user of PyCharm, and it used to run on my system very smoothly with no hiccups at all. But for the last couple of days, whenever I start PyCharm, my CPU utilization behaves strangely, like in the image: Unusual CPU util
I am confused because when I look at the process list or run ps/top in a terminal, there is no process using more than 1 or 2% CPU. So I am not sure where these resources are being consumed.
By unusual CPU utilization I mean that first CPU1 is used 100% for a couple of minutes, then CPU2, and so on: only one CPU's utilization goes to 100% for some time, followed by another's. This goes on for 10 to 20 minutes, then the system comes back to normal.
P.S.: I don't think this problem is related to PyCharm, as I face similar issues while doing other work as well; it's just that with PyCharm I hit it every time.
POSSIBLE CAUSE: I suspect you have a thrashing problem. The CPU usage of your applications is low because none of them are actually getting much useful work done; all the processing time is being spent moving memory pages to and from disk. Your CPU usage probably settles down after a while because your application has entered a state where its working set has shrunk to the point where it can all be held in memory at once.
This has probably happened because one of the apps on your machine is handling a larger data set than before, and so requires more addressable memory. Another possibility is that, for some reason, a lot more apps are running on your machine.
POTENTIAL SOLUTION: There are several ways you can address this. The simplest is to put more RAM on your machine. If this doesn't work or isn't possible, you'll have to figure out which app is the memory hog. You may simply have to work with smaller problems/data-sets or offload some of the apps onto a different box.
MIGRATING CPU LOAD: Operating systems move tasks (user apps, kernel threads) between processors for many different reasons, ranging from plain randomness to certain apps having more of their addressable memory in one bank than in another. Given that you are probably thrashing heavily, I'm not surprised that the processor your app runs on appears to change at random over time.
I dropped out of the CS program at my university... So, can someone who has a full understanding of Computer Science please tell me: what is the meaning of Dirty and Resident, as relates to Virtual Memory? And, for bonus points, what the heck is Virtual Memory anyway? I am using the Allocations/VM Tracker tool in Instruments to analyze an iOS app.
*Hint - try to explain as if you were talking to an 8-year old kid or a complete imbecile.
Thanks guys.
"Dirty memory" is memory which has been changed somehow - that's memory which the garbage collector has to look at, and then decide what to do with it. Depending on how you build your data structures, you could cause the garbage collector to mark a lot of memory as dirty, having each garbage collection cycle take longer than required. Keeping this number low means your program will run faster, and will be less likely to experience noticeable garbage collection pauses. For most people, this is not really a concern.
"Resident memory" is memory which is currently loaded into RAM - memory which is actually being used. While your application may require that a lot of different items be tracked in memory, it may only require a small subset be accessible at any point in time. Keeping this number low means your application has lower loading times, plays well with others, and reduces the risk you'll run out of memory and crash as your application is running. This is probably the number you should be paying attention to, most of the time.
"Virtual memory" is the total amount of data that your application is keeping track of at any point in time. This number is different from what is in active use (what's being used is marked as "Resident memory") - the system will keep data that's tracked but not used by your application somewhere other than actual memory. It might, for example, save it to disk.
WWDC 2013 Session 410, "Fixing Memory Issues", explains this nicely. It is well worth a watch, since it also explains some of the practical implications of dirty, resident and virtual memory.