Taking a look at one of my applications running in a production server, I noticed that there is a "normal but strange" behavior of memory usage.
Let me explain: watching my JBoss 4.2.2 run without any application deployed, I can see that it constantly grows and frees the used heap space, usually by a few megabytes on a development server. When I deploy my application, the pattern is the same, but it uses more memory on average.
Well, on the production server I can see that my JBoss, even without any workload, has a minimum memory usage of 1.5GB. Still without any workload, the heap usage grows to 3.6GB, at which point a minor GC runs and the heap usage comes back to 1.5GB. Every 40s my JBoss grows its heap usage from 1.5GB to 3.6GB, and this pattern repeats indefinitely. When the workload grows, the only difference is that the period of this cycle falls to 8s.
So, my question is: is this normal?
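One way to check whether a sawtooth like this is just short-lived garbage being reclaimed is to sample the used heap together with the collection count. The following is a minimal sketch that uses the standard java.lang.management beans. The class name HeapSampler and its methods are made up for illustration, and it samples the JVM it runs in, so for JBoss it would have to run inside the server (or the same beans would have to be read over JMX).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class HeapSampler {

    /** Bytes of heap currently in use, as reported by the JVM. */
    public static long usedHeapBytes() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        return memory.getHeapMemoryUsage().getUsed();
    }

    /** Sum of the collection counts of all collectors (young and old). */
    public static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical defaults: 5 samples, one per second.
        int samples = args.length > 0 ? Integer.parseInt(args[0]) : 5;
        long intervalMs = args.length > 1 ? Long.parseLong(args[1]) : 1000L;
        for (int i = 0; i < samples; i++) {
            System.out.printf("used=%d MB, collections=%d%n",
                    usedHeapBytes() / (1024 * 1024), totalGcCount());
            Thread.sleep(intervalMs);
        }
    }
}
```

If the used heap drops back to roughly the same floor each time the collection count increases, the growth in between is garbage rather than retained data.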
After upgrading from PostgreSQL 9.6 to 10 and also updating the overlying application (Trend Micro Deep Security), we see an increase in the overall shared memory utilization by more than 300%.
Currently the shared memory is around 3GB, which aligns with the shared_buffers parameter, set to 3072MB. Using ps, top and pmap I was able to tell that almost the entire shared memory is used by postgres-related processes.
However, I would like to know the actual cause of that increase. Is there any way to identify the real root cause?
We're running our application in Wildfly 14.0.1, with a -Xmx of 4096, running with OpenJDK 11.0.2. I've been using VisualVM 1.4.2 to monitor our heap since we previously were having OOM exceptions (because our -Xmx was only 512 which was incredibly bad).
We are well within our memory allocation now and have no more OOM exceptions. Even with a good number of clients and processing happening, we're nowhere near the -Xmx4096 (the servers have 16GB, so memory isn't an issue). However, I'm seeing some strange heap behavior and I can't figure out where it's coming from.
Using VisualVM, Eclipse MemoryAnalyzer, as well as heaphero.io, I get summaries like the following:
Total Bytes: 460,447,623
Total Classes: 35,708
Total Instances: 2,660,155
Classloaders: 1,087
GC Roots: 4,200
Number of Objects Pending for Finalization: 0
However, in watching the Heap Monitor, I see the Used Heap over a 4 minute time period increase by about 450MB before the GC runs and drops back down only to spike again. Here's an image:
This is when no clients are connected and nothing is actively happening in our application. We do use Apache File IO to monitor remote directories, we have JMS topics, etc. so it's not like the application is completely idle, but there's zero logging and all that.
My largest objects are the well-known io.netty.buffer.PoolChunk, which in the heap dumps are about 60% of my memory usage. The total is still around 460MB, so I'm confused why the heap monitor is going from ~425MB to ~900MB repeatedly. No matter where I take my snapshots, I can't see any large increase in object counts or memory usage.
I'm just seeing a disconnect between the heap monitor and the .hprof analysis, so there doesn't seem to be a way to tell what's causing the heap to hit that 900MB peak.
My question is whether these heap spikes are totally expected when running within Wildfly, or whether there is something within our application that is spinning up a bunch of objects that then get GC'd. In a Component report, objects in our application's package structure make up an extremely small amount of the dump. That doesn't clear us, though; we could easily be calling things without closing them appropriately, etc.
I have a question based on my experience trying to implement memory requests/limits correctly in an OpenShift OKD cluster. I started by setting no request, then watching to see what cluster metrics reported for memory use, then setting something close to that as a request. I ended up with high-memory-pressure nodes, thrashing, and OOM kills. I have found I need to set the requests to something closer to the VIRT size in 'top' (which includes the program binary size) to keep performance up. Does this make sense? I'm confused by the asymmetry between request (and apparent need) and reported use in metrics.
You always need to leave a bit of memory headroom for overhead and memory spills. If for some reason the container exceeds its memory limit, whether because of your application, your binary, or some garbage collection system, it will get killed. For example, this is common in Java apps, where you specify a heap size and need extra overhead for the garbage collector and other things such as:
Native JRE
Perm / metaspace
JIT bytecode
JNI
NIO
Threads
This blog explains some of them.
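As a rough illustration of how much memory a JVM uses outside the heap, here is a minimal sketch that reads the standard memory-pool and buffer-pool beans. The class name NonHeapPools and its methods are made up for illustration. It only covers what the JVM itself reports (metaspace, code cache, NIO buffer pools); thread stacks and memory allocated by native code through JNI are not visible this way, so the real footprint of the process is larger than the numbers printed here.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class NonHeapPools {

    /** Bytes used by all JVM-managed non-heap pools (metaspace, code cache, ...). */
    public static long nonHeapUsedBytes() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (pool.getType() == MemoryType.NON_HEAP && usage != null) {
                total += usage.getUsed();
            }
        }
        return total;
    }

    /** Bytes held by NIO buffer pools (direct and mapped buffers). */
    public static long nioBufferBytes() {
        long total = 0;
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            long used = pool.getMemoryUsed(); // -1 when the value is unknown
            if (used > 0) {
                total += used;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long heapUsed = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
        System.out.printf("heap used:     %d KB%n", heapUsed / 1024);
        System.out.printf("non-heap used: %d KB%n", nonHeapUsedBytes() / 1024);
        System.out.printf("NIO buffers:   %d KB%n", nioBufferBytes() / 1024);
    }
}
```

Comparing these figures with the container's reported usage gives a first idea of how much headroom to add on top of the configured heap size.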
I'm testing running MongoDB on the Kubernetes platform, where I can limit the resources used by the running container.
Say I set the memory limit to 256MB. The problem is that, for example, while making a backup, memory consumption increases to the limit and the container gets restarted by Kubernetes.
So the question is: is there a way to limit MongoDB's memory consumption in my case so that it does not crash by exceeding the memory limit set by the platform?
I could of course increase the limit, but I'm interested in a principled solution and would like to understand this process better, because I don't really know how memory is consumed by MongoDB and the container OS. Is it possible to tune MongoDB or the underlying Linux OS to work inside the existing limits?
The limits that you have set are good enough for a MongoDB pod; these are the limits used by the community as well.
The only way I think you can get around this for backups is to increase the memory limits, but it still might fail, because elsewhere on Stack Overflow people have experienced OOM killing on VMs with gigabytes of memory. MongoDB basically tries to use any and all memory that is made available to it.
Also, there are other ways to back up MongoDB: https://dba.stackexchange.com/questions/76130/how-to-backup-large-mongodb-database
I am not sure how this aligns in the k8s world.
On my server, GC runs every day at 3:00 AM and the heap space fills up in a flash.
This is causing a site outage. Any input?
The following are my JVM settings. I am using a JBoss server.
-Dprogram.name=run.sh -server -Xms1524m -Xmx1524m -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -XX:NewSize=512m -XX:MaxNewSize=512m -Djava.net.preferIPv4Stack=true -XX:MaxPermSize=512m -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -Djavax.net.ssl.trustStorePassword=changeit -Dcom.sun.management.jmxremote.port=8888 -Djava.rmi.server.hostname=192.168.100.140 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote
Any suggestions would be really helpful.
(This turned out somewhat long; there is an actual suggestion for a fix at the end.)
Very very briefly, garbage collection when you use -XX:+UseConcMarkSweepGC works like this:
All objects are allocated in the so-called young generation. This is typically a couple of hundred megs up to a gig in size, depending on VM settings, number of CPUs and total heap size. The young generation is collected in a stop-the-world pause, using a parallel (multiple CPU), compacting (object-moving) collection. The young generation is sized so as to keep this pause reasonably short.
When objects have survived (are still reachable) young gen they get promoted to "old-gen" (old generation).
The old generation is where -XX:+UseConcMarkSweepGC kicks in. In the default mode (without -XX:+UseConcMarkSweepGC), when the old generation becomes full, the entire heap is collected and compacted (moving objects around, eliminating fragmentation) at once in a stop-the-world pause. That pause will typically be longer than young-gen pauses because the entire heap, which is bigger, is involved.
With CMS (-XX:+UseConcMarkSweepGC) the work to collect the old generation is mostly concurrent (meaning, running in the background with the application not paused). This work is also not compacting; it works more like malloc()/free() and you are subject to fragmentation.
The main upside of CMS is that when things work well, you avoid long pause times that are linear in the size of the heap, because the main work is done concurrently (there are some stop-the-world steps involved but they are supposed to usually be short).
The two primary downsides are that:
You are subject to fragmentation because old-gen is not compacted.
If you don't finish a concurrent collection cycle before old-gen fills up, or if fragmentation prevents allocation, the resulting full collection of the entire heap is not parallel as it is with the default collector. I.e., only one CPU is used. That means that when/if you do hit a full garbage collection, the pause will be longer than it would have been with the default collector.
Now... your logs. "Concurrent mode failure" is intended to convey that the concurrent mark/sweep work did not complete in time for another young-gen GC that needs to promote surviving objects into the old generation. The "promotion failed" is rather that during promotion from young-gen to old-gen, an object was unable to be allocated in old-gen due to fragmentation.
Unless you are hitting a true bug in the JVM, the sudden increase in heap usage is almost certainly from your application, JBoss, or some external entity acting on your application. So I can't really help with that. However, what is likely happening is a combination of two things:
The spike in activity is causing heap usage to increase too quickly for the concurrent collection to complete in time.
Old-gen is too fragmented, causing problems especially when the old-gen is almost full.
I should also point out now that the default behavior of CMS is to try to postpone concurrent collections as long as possible (yet not too long) for performance reasons. The later it happens, the more efficient (in terms of CPU usage) the collection is. However, the trade-off is that you're increasing the risk of not finishing in time (which, again, will trigger a full GC and a long pause). It should also (I have not made empirical tests here, but it stands to reason) result in fragmentation being a greater concern; basically, the fuller old-gen is when an object is promoted, the greater the likelihood that the object's promotion will worsen fragmentation (too long to go into details here).
In your case, I would do two things:
1. Keep figuring out what is causing the activity. I would say it's fairly unlikely that it is a GC/JVM bug.
2. Re-configure the JVM to trigger concurrent collection cycles earlier, in order to avoid the heap ever becoming so full that fragmentation becomes a particularly huge concern, and to give the collector more time to complete even during your sudden spikes of activity.
You can accomplish (2) most easily by using the JVM options
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
in order to explicitly force the JVM to kick-start a CMS cycle at a certain level of old-generation occupancy (in this example 75%; you may need to change that, and the lower the percentage, the earlier it will kick in).
Note that depending on what your live size is (the number of bytes that are in fact live and reachable) in your application, forcing an earlier CMS cycle may also require that you increase your heap size to avoid CMS running constantly (not a good use of CPU).
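To make the 75% figure concrete for the settings in the question (-Xmx1524m with -XX:NewSize=512m), the arithmetic can be sketched as below. This is a simplification: it assumes the old generation is simply the total heap minus the configured young generation, and that the occupancy fraction applies to that old generation. The class and method names are made up for illustration.

```java
public class CmsThreshold {

    /** Old-generation size in MB, assuming old gen = total heap - young gen. */
    public static long oldGenMb(long heapMb, long youngGenMb) {
        return heapMb - youngGenMb;
    }

    /** Old-gen occupancy in MB at which a CMS cycle would be started. */
    public static long triggerMb(long oldGenMb, int occupancyFractionPercent) {
        return oldGenMb * occupancyFractionPercent / 100;
    }

    public static void main(String[] args) {
        long oldGen = oldGenMb(1524, 512);     // values from the posted JVM settings
        long trigger = triggerMb(oldGen, 75);  // -XX:CMSInitiatingOccupancyFraction=75
        System.out.println("old gen: " + oldGen + " MB");           // 1012 MB
        System.out.println("CMS starts at about " + trigger + " MB"); // 759 MB
    }
}
```

With these numbers, a cycle would start once roughly 759MB of the 1012MB old generation is occupied. If the live size of the application is already close to that, CMS would run almost constantly, which is the case where increasing the heap size becomes necessary.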