What to put for Kubernetes resource requests/limits?

I've seen articles recommending that resource requests/limits should be set. However, none of the ones I've found discuss what numbers to actually fill in.
For example, consider a container that uses zero CPU while idle, 80% of a core under normal user requests, and 200% when hit by some rare requests:
If I put the maximum, 2000m, as the CPU request, then a core would sit idle most of the time
On the other hand, if I request 800m and several pods hit their CPU limit at the same time, context-switching overhead kicks in
There are also cases like
Internal tools that sit idle most of the time, then jump to 200% on active use
Apps that have different peak times. For example, a SaaS that people use during working hours and a chatbot that starts getting load after people leave work; it'd be nice if they could share the unused capacity.
Ideally the Vertical Pod Autoscaler would solve these problems automatically, but it is still in alpha today.

What I've been doing is to collect resource usage with Telegraf and set the request to the 95th percentile, with the limit set to 1 CPU and twice the memory request (a rough sketch of the calculation is shown after the list below).
The problems with this method are:
Apps that use multiple cores during startup but stay under one core for the rest of their life take longer to start. I've observed a 2-minute Spring startup become 5 minutes.
Apps that are rarely used have fewer resources reserved, and so have to rely on burst capacity when they are invoked. This could be a problem if they see a surge in popularity.
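As a rough illustration of the method above (not the asker's actual tooling), here is a minimal Python sketch that takes usage samples, such as those exported from Telegraf, and derives the request from the 95th percentile, with the limits set as described: 1 CPU and twice the memory request. The sample values and the helper function are made up for the example.

    # Minimal sketch: derive resource requests from usage samples.
    # Assumes `cpu_samples` (millicores) and `mem_samples` (MiB) were exported
    # from a metrics store such as Telegraf/InfluxDB; the values here are made up.

    def percentile(samples, q):
        """Return the q-th percentile (0-100) using nearest-rank on sorted data."""
        ordered = sorted(samples)
        index = round(q / 100 * (len(ordered) - 1))
        return ordered[index]

    cpu_samples = [0, 50, 800, 820, 790, 2000, 760, 810]      # millicores
    mem_samples = [300, 310, 450, 420, 400, 900, 410, 430]    # MiB

    cpu_request_m = percentile(cpu_samples, 95)
    mem_request_mi = percentile(mem_samples, 95)

    print(f"CPU request:    {cpu_request_m}m  (limit: 1000m, per the method above)")
    print(f"Memory request: {mem_request_mi}Mi (limit: {2 * mem_request_mi}Mi, i.e. twice the request)")

The resulting numbers are what would go into the container's resources.requests, with resources.limits set as described above.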

Related

Server collapses when memory reaches 88%

I'm using server side rendering with Angular Universal, and PM2 as the process manager, on a Digital Ocean droplet with 8 GB memory / 80 GB disk / Ubuntu 16.04.3 x64 / 4 vCPUs.
I use a 6 GB swap file, and the output of "free -m" is the following:
             total        used        free      shared  buff/cache   available
Mem:          7983        1356        5290          16        1335        6278
Swap:         6143          88        6055
The RAM usage looks fine. There are 4 processes running in PM2's cluster mode.
Every 6-8 hours, when memory reaches ~88% in my Digital Ocean panel, CPU usage goes very high, the web application stops responding correctly, and PM2 has to restart the process; I'm not sure for how long the web application misbehaves.
Here is an image of what happens:
Performance is fine when working normally:
I think I'm missing some sort of configuration or something, since this always happens at the same periods of time.
EDIT 1: So far I've fixed some incompatibilities in my code (the app was working, but sometimes failed because of them) and added a memory limit of 1 GB in PM2. I'm not sure if this is the way to go since I'm a bit new to process management, but the CPU levels are fine now. Any comment is appreciated. I leave a picture of the current behaviour; every time one of the four processes reaches 1 GB, it's restarted:
EDIT 2: I've added 3 more images, 2 showing the top processes from Digital Ocean and one showing the Keymetrics status:
EDIT 3: I found some memory leaks in my Angular app (I forgot to unsubscribe from a couple of subscriptions) and the system behaviour improved, but the memory line is still going up. I'll keep investigating memory leaks in Angular and see if I've made any other mistakes:
It looks like your Angular Universal app is leaking memory; its usage should not increase over time as you observe, but stay mostly flat.
You can try to find the memory leak (it looks like you already found one issue and have a suspicion about what else it could be).
Another thing you can try is to periodically restart your app.
See here, for example, for how to restart your PM2 process every couple of hours to reset it and prevent the OOM situation that you've been running into.
In our (edge) case, the Kubernetes healthcheck was the cause of the issue. The healthcheck accessed the main page via an internal IP. The page used the caller URL (in this case its IP) to load some resources, which it couldn't find that way. This led to an error that was somehow cached and slowly used up all the memory. We had the same very linear rise in memory even during nights, because of the regularity of the healthcheck.
We solved the problem by pointing the healthcheck at "/health", where we only return a 200 status code, as one should do anyway.
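The app discussed above is Node-based, but the pattern is stack-agnostic. As a purely illustrative sketch in Python (Flask), a dedicated health endpoint just returns 200 and never touches any caller-URL-dependent resources:

    # Minimal sketch of a dedicated healthcheck endpoint (illustrative only;
    # the app in the thread above runs on Node). The point is that the probe
    # path does no rendering and no lookups based on the caller's URL/IP.
    from flask import Flask

    app = Flask(__name__)

    @app.route("/health")
    def health():
        # Return a plain 200 so the healthcheck never exercises the error
        # path that was being cached and leaking memory.
        return "OK", 200

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)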

MongoDB memory consumption - it keeps rising

I run a MongoDB server on an EC2 instance on AWS. For a while it ran flawlessly, never spiking above 50% memory usage.
Suddenly, the behaviour changed to an ever-increasing curve: it never goes significantly down, and it keeps growing throughout the day until it reaches a peak (sometimes 100%, sometimes 90 or 80) and then suddenly drops to 50% (perhaps some hypervisor activity here?).
This behavior is NOT, in any way, compatible with the usage pattern of the application server this database is serving. Below is a graph comparing incoming requests against DB memory usage.
What are the things I should be looking into to diagnose this memory issue? I already looked at the number of open connections at any given time, and it is very low (<20), so I don't think it could be that. Any other ideas?
This issue doesn't seem to impact database performance, but I can't run any maintenance jobs (like archiving and backup) with memory that high - I had cases in which the database crashed.
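One hedged starting point for the diagnosis (my assumption, not something from the thread): the WiredTiger storage engine keeps a large internal cache, so it can help to compare the cache size and resident memory reported by serverStatus with what you expect. A minimal PyMongo sketch, with a placeholder connection string; the exact field names can vary between MongoDB versions:

    # Rough diagnostic sketch (assumes the `pymongo` package and a reachable
    # server; the URI below is a placeholder). Treat the stat keys as an
    # assumption and check them against your server's serverStatus output.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    status = client.admin.command("serverStatus")

    resident_mb = status["mem"]["resident"]
    cache = status.get("wiredTiger", {}).get("cache", {})
    cache_max = cache.get("maximum bytes configured", 0)
    cache_used = cache.get("bytes currently in the cache", 0)

    print(f"Resident memory:        {resident_mb} MB")
    print(f"WiredTiger cache limit: {cache_max / 1024 ** 2:.0f} MB")
    print(f"WiredTiger cache used:  {cache_used / 1024 ** 2:.0f} MB")
    print(f"Open connections:       {status['connections']['current']}")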

Unusual spikes in CPU utilization in CentOS 6.6 while starting pycharm

My system has been behaving strangely for the last couple of days. I am a regular user of PyCharm, and it used to run on my system very smoothly with no hiccups at all. But for the last couple of days, whenever I start PyCharm, my CPU utilization behaves strangely, as in the image: Unusual CPU util
I am confused because when I look at the process list or try ps/top in a terminal, there is no process using more than 1-2% CPU, so I am not sure where these resources are being consumed.
By unusual CPU utilization I mean that first CPU1 is used at 100% for a couple of minutes, then CPU2; that is, only one CPU's utilization goes to 100% for some time, followed by another's. This goes on for 10 to 20 minutes, then the system comes back to normal.
P.S.: I don't think this problem is specific to PyCharm, as I face similar issues while doing other work too; it's just that I always hit it with PyCharm for sure.
POSSIBLE CAUSE: I suspect you have a thrashing problem. The CPU usage of your applications is low because none of them is actually getting much useful work done; all the processing is being taken up by moving memory pages to and from the disk. Your CPU usage probably settles down after a while because your application has entered a state where its memory working set has shrunk to the point where it can all be held in memory at one time.
This has probably happened because one of the apps on your machine is handling a larger data set than before, and so requires more addressable memory. Another possibility is that, for some reason, a lot more apps are running on your machine.
POTENTIAL SOLUTION: There are several ways you can address this. The simplest is to put more RAM on your machine. If this doesn't work or isn't possible, you'll have to figure out which app is the memory hog. You may simply have to work with smaller problems/data-sets or offload some of the apps onto a different box.
MIGRATING CPU LOAD: Operating systems will move tasks (user apps, kernel) around for many different reasons. The reasons can range anywhere from being just plain random to certain apps having more of their addressable memory in one bank vs another. Given that you are probably doing a lot of thrashing, I'm not surprised that the processor your app runs on appears randomized over time.
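One way to confirm (or rule out) the thrashing hypothesis is to watch swap-in/swap-out activity over a short window while the CPU spikes are happening; sustained, non-trivial page traffic is a strong hint. A small sketch, assuming the third-party psutil package is available (vmstat or sar would show the same counters):

    # Sketch: sample swap-in/out counters for about a minute to check for thrashing.
    # Assumes the third-party `psutil` package; `vmstat 5` gives similar numbers.
    import time
    import psutil

    previous = psutil.swap_memory()
    for _ in range(12):               # ~1 minute at 5-second intervals
        time.sleep(5)
        current = psutil.swap_memory()
        swapped_in = current.sin - previous.sin
        swapped_out = current.sout - previous.sout
        print(f"swap in: {swapped_in / 1024 ** 2:.1f} MiB, "
              f"swap out: {swapped_out / 1024 ** 2:.1f} MiB, "
              f"swap used: {current.percent:.0f}%")
        previous = current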

OSB: Analyzing memory of proxy service

I have multiple proxies in a message flow. Is there a way in OSB to monitor the memory utilization of each proxy? I'm getting OOM errors and want to investigate which proxy is eating up most of the memory.
Thanks!
If you're getting OOME, then it's either because a proxy is not freeing up all the memory it uses (so it will eventually fail even with one request at a time), or because you use too much memory per invocation and it dies above a certain load threshold but is fine under low load. Do you know which it is?
Either way, you will want to generate a heap dump on OOME so you can investigate what's going on. It's annoying but sometimes necessary. A colleague had to do that recently to fix some issues (one problem was an SB-transport platform bug, one was a thread starvation issue due to a platform work manager bug, and the last one was due to a Muxer bug when used on Exalogic).
If it just performs poorly under load, then you'll need to do the usual OSB optimisations, like using fewer Assign steps (but assigning more variables per step) and doing a lot more in XQuery rather than in proxy steps, especially for loops that don't need a service callout, since they can easily be rolled into a for loop in XQuery; you know, all the standard stuff.

How does Scala's Lift manage state?

I'm quite impressed by what Lift 2.0 brings to the table with Actors and StatefulSnippets, etc., but I'm a little worried about the memory overhead of these things. My question is twofold:
How does Lift determine when to garbage collect state objects?
What does the memory footprint of a page request look like?
If a web crawler dances across the footprint of the site, is it going to open up enough state objects to drown a modest VPS (512 MB)? The question is very obviously application dependent, but I'm curious if anyone has any real-world figures they can throw out at me.
Lift stores state information in a session, so once the session is destroyed the state associated with that session goes away.
Within the session, Lift tracks each page that state is allocated for (e.g., the mapping between an ajax button in the browser and a function on the server) and has a heartbeat from the browser. Functions for pages that have not seen the heartbeat in 10 minutes are unreferenced so the JVM can garbage-collect them. All of this is tunable, so you can change the heartbeat frequency, function lifespan, etc., but in practice the defaults work quite well.
In terms of session explosion, yeah... that's a minor issue. Popular sites (including http://demo.liftweb.net/) experience it. The example code (see http://github.com/lift/lift/tree/master/examples/example/) detects sessions that were created by a single request and then abandoned, and expires those early. I'm running demo.liftweb.net with a 256MB heap (that would fit in a 512MB VPS), and occasionally the session count rises over 1,000, but that's quickly tamped down for search engine traffic.
I think the question about memory footprint was once answered somewhere on the mailing list, but I can’t find it at the moment.
Garbage collection is done after some idle time. There is, however, an example on the wiki which uses some better heuristics to kill off sessions spawned by web crawlers.
Of course, for your own project it makes sense to check memory consumption with something like VisualVM while spawning a couple of sessions yourself.
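To make the heartbeat/expiry mechanism described above more concrete, here is a small sketch of the general idea in Python. This is not Lift's API (Lift is Scala); the names are hypothetical, and the 10-minute lifespan and single-request crawler heuristic follow the answers above:

    # Illustrative sketch of heartbeat-based expiry of per-page server state,
    # modelled on the behaviour described above. NOT Lift's actual API.
    import time

    FUNCTION_LIFESPAN_SECONDS = 10 * 60   # default lifespan mentioned above

    class PageStateRegistry:
        def __init__(self):
            self._last_heartbeat = {}     # page_id -> timestamp of last heartbeat

        def register_page(self, page_id):
            self._last_heartbeat[page_id] = time.time()

        def heartbeat(self, page_id):
            self._last_heartbeat[page_id] = time.time()

        def expire_stale(self):
            """Drop state for pages whose heartbeat stopped arriving, so the
            garbage collector can reclaim the associated functions."""
            cutoff = time.time() - FUNCTION_LIFESPAN_SECONDS
            self._last_heartbeat = {
                page: ts for page, ts in self._last_heartbeat.items() if ts >= cutoff
            }

    # Usage: a background task calls expire_stale() periodically; a crawler
    # heuristic (as in the Lift example code) could additionally expire
    # sessions that made a single request and never sent a heartbeat.
    registry = PageStateRegistry()
    registry.register_page("page-1")
    registry.expire_stale()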