How to identify and monitor "stop the world" garbage collection / memory leak in Java third party "blackbox" Restful application

I have been given the interesting task of identifying a stop-the-world garbage collection / memory leak in a third-party "black box" RESTful application, which is in production.
The application is load balanced, and recently, the application had a stop-the-world garbage collection on all server instances, which led to a production service outage.
I (we) don't have access to the third-party code.
This is what I have done so far:
I have verified that the JVM command-line parameters are correct. The container is Jetty, on OpenJDK 8, with the CMS garbage collector.
I have successfully been using VisualVM, with the Memory Pools and Visual GC plugins, to profile the app (-verbose:gc is enabled).
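For reference, I can also sample the same counters from the command line with jstat (a sketch; <pid> is the Jetty process id and the 5-second interval is arbitrary):
jstat -gcutil <pid> 5000
Old-gen occupancy (the O column) creeping upward after every CMS cycle, together with a rising FGC count, would point at a leak rather than an undersized heap.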
My intention is to look at the amount of traffic we get in production (for each API endpoint) and run a soak test, increasing the load with the intention of triggering the stop-the-world GC.
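During the soak test I also plan to take periodic class histograms so I can diff the top consumers between runs (a sketch; note that -histo:live forces a full GC, so only run it when a pause is acceptable):
jmap -histo:live <pid> | head -n 30 > histo_$(date +%s).txt
A class whose instance count keeps growing across histograms, and never drops after a full collection, is a good leak candidate even without access to the source code.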
There is no specific out-of-memory exception, "just" a stop-the-world pause with the application threads suspended. After 5-10 minutes, the application starts to accept requests again (the 502s on the load balancer go away).
I have already looked at How to find a Java Memory Leak
I am at a disadvantage not being able to look at the source code.
Can someone please give me any further tips or strategies on how to track down what is causing the stop-the-world GC and the memory leak?
Here are the JVM parameters which are being used:
java -Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=9010
-Dcom.sun.management.jmxremote.local.only=true
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Xms6g -Xmx6g -XX:MetaspaceSize=2g -XX:MaxMetaspaceSize=2g
-server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-Dsun.net.client.defaultConnectTimeout=10000
-Dsun.net.client.defaultReadTimeout=30000
-XX:+DisableExplicitGC -d64 -verbose:gc -Xloggc:/var/log/gc.log
-XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/heapdump.hprof
-XX:+UseCMSCompactAtFullCollection -XX:+CMSClassUnloadingEnabled
-XX:+ParallelRefProcEnabled
-XX:+UseLargePagesInMetaspace -XX:MaxGCPauseMillis=100
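Since -Xloggc and -XX:+PrintGCDetails are already enabled, I am also grepping the GC log for full collections and CMS failures (a sketch; the exact message wording depends on the JDK 8 build):
grep -E "Full GC|concurrent mode failure|promotion failed" /var/log/gc.log
Repeated "concurrent mode failure" or "promotion failed" entries just before the outage would mean CMS cannot free or compact the old generation fast enough and is falling back to single-threaded, stop-the-world full collections.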
Thanks
Miles.

Related

kubernetes pod high cache memory usage

I have a Java process which is running on Kubernetes.
I set Xms and Xmx for the process.
java -Xms512M -Xmx1G -XX:SurvivorRatio=8 -XX:NewRatio=6 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -jar automation.jar
My expectation is that the pod should consume 1.5 or 2 GB of memory, but it consumes much more, nearly 3.5 GB. That is too much.
If I run my process on a virtual machine, it consumes much less memory.
When I check the memory stats for the pod, I realise that the pod allocates too much cache memory.
RSS of nearly 1.5 GB is OK, because Xmx is 1 GB. But why is the cache nearly 3 GB?
Is there any way to tune or control this usage?
/app $ cat /sys/fs/cgroup/memory/memory.stat
cache 2881228800
rss 1069154304
rss_huge 446693376
mapped_file 1060864
swap 831488
pgpgin 1821674
pgpgout 966068
pgfault 467261
pgmajfault 47
inactive_anon 532504576
active_anon 536588288
inactive_file 426450944
active_file 2454777856
unevictable 0
hierarchical_memory_limit 16657932288
hierarchical_memsw_limit 9223372036854771712
total_cache 2881228800
total_rss 1069154304
total_rss_huge 446693376
total_mapped_file 1060864
total_swap 831488
total_pgpgin 1821674
total_pgpgout 966068
total_pgfault 467261
total_pgmajfault 47
total_inactive_anon 532504576
total_active_anon 536588288
total_inactive_file 426450944
total_active_file 2454777856
total_unevictable 0
A Java process may consume much more physical memory than specified in -Xmx - I explained it in this answer.
However, in your case, it's not even the memory of a Java process, but rather an OS-level page cache. Typically you don't need to care about the page cache, since it's shared, reclaimable memory: when an application wants to allocate more memory but there are not enough immediately available free pages, the OS will likely free a part of the page cache automatically. In this sense, the page cache should not be counted as "used" memory - it's more like spare memory used by the OS for a good purpose while the application does not need it.
The page cache often grows when an application does a lot of file I/O, and this is fine.
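If you want to confirm that the cache really is reclaimable, one crude check (assuming you have root on the node; it usually cannot be done from inside an unprivileged pod, and it is harmless but temporarily hurts I/O performance) is to drop the clean page cache and watch the cgroup counter fall:
sync && echo 1 > /proc/sys/vm/drop_caches
grep ^cache /sys/fs/cgroup/memory/memory.stat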
Async-profiler may help to find the exact source of growth:
run it with -e filemap:mm_filemap_add_to_page_cache
I demonstrated this approach in my presentation.
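For example, a minimal invocation might look like this (assuming async-profiler's profiler.sh launcher; flag spellings can differ slightly between versions):
./profiler.sh -d 60 -e filemap:mm_filemap_add_to_page_cache -f pagecache.html <pid>
The resulting flame graph shows which Java code paths are populating the page cache.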

Catalina-utility-1 and Catalina-utility-2 excessive memory allocation

What are these Catalina-utility-X threads, and why do they burn through so much heap? They consume 20x more memory than my worker thread, which I was trying to debug because I thought it was memory-intensive.
This snapshot was taken on a Tomcat instance in my Eclipse IDE, which is receiving zero HTTP traffic - only doing the work on that worker thread (or so I thought).
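In case it helps narrow this down, a quick way to see what those threads are actually doing at a given moment (a sketch; <pid> is the Tomcat process id) is:
jstack <pid> | grep -A 20 "Catalina-utility"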

CPU usage of Jboss JVM goes upto 99% and stays there

I am doing load testing on my application using JMeter, and I have a situation where the CPU usage of the application's JVM goes to 99% and stays there. The application still works; I am able to log in and do some activity, but it is understandably slower.
Details of environment:
Server: AMD Opteron, 2.20 GHz, 8 core, 64-bit, 24 GB RAM, Windows Server 2008 R2 Standard
Application server: jboss-4.0.4.GA
JAVA: jdk1.6.0_25, Java HotSpot(TM) 64-Bit Server VM
JVM settings:
-Xms1G -Xmx10G -XX:MaxNewSize=3G -XX:MaxPermSize=12G -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCompressedOops -Dsun.rmi.dgc.client.gcInterval=1800000 -Dsun.rmi.dgc.server.gcInterval=1800000
Database: MySql 5.6 (in a different machine)
Jmeter: 2.13
My scenario is that I make 20 users of my application log into it and perform normal activity that should not create a huge load. A few minutes into the process, the CPU usage of the JBoss JVM goes up and never comes back down. CPU usage remains like that until the JVM is killed.
To help better understand, here are a few screenshots.
I found a few posts which had CPU at 100%, but nothing there was the same as my situation and I could not find a solution.
Any suggestion on what’s to be done will be great.
Regards,
Sreekanth.
To understand the root cause of the high CPU utilization, we need to check the CPU data and thread dumps at the same time.
Capture 5-6 thread dumps at the time of the issue. Similarly, capture CPU consumption on a thread-by-thread basis.
Generally the root cause of a high-CPU issue is a problem with threads: BLOCKED threads, long-running threads, deadlocks, long-running loops, etc. These can be identified by going through the thread stacks.
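As a rough sketch (assuming the JDK's jstack is on the PATH and <pid> is the JBoss JVM's process id), on Linux a loop like the following captures six dumps ten seconds apart; on Windows, run jstack -l <pid> by hand a few times and use Process Explorer for per-thread CPU:
for i in 1 2 3 4 5 6; do jstack -l <pid> > threaddump_$i.txt; sleep 10; done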

How to calculate launch configuration properties?

In Eclipse, JBoss 7.1 VM arguments.
My machine has 8 GB RAM.
The VM arguments contain statements like these:
-server -Xms64m
-Xmx512m
-XX:MaxPermSize=1024m
How do I calculate these numbers?
Caused by: java.lang.OutOfMemoryError: Java heap space
You are getting that error because your server used up all of its available memory (in your case, 512 MB). You can increase the Xmx parameter, which sets the maximum amount of memory your server can use.
An OutOfMemoryError can happen because of an insufficient memory allocation, or because of memory leaks (objects that Java's garbage collector can't delete, despite no longer being needed).
There is no magic rule to calculate those parameters; they depend on what you are deploying to JBoss, how many concurrent users you have, etc.
You can try increasing the Xmx parameter and checking the memory usage with jvisualvm to see how it behaves.
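Purely as an illustration (these numbers are not recommendations; the right values depend on your application and load), a slightly larger configuration on an 8 GB machine might look like:
-server -Xms512m
-Xmx2048m
-XX:MaxPermSize=256m
Then watch the heap and PermGen graphs in jvisualvm under realistic load and adjust from there.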

Heroku memory leak with Play2 scala

I was doing a stress (ab) test against my single Heroku dyno and a dev database with a 20-connection limit.
During the calls (which access the database with Squeryl), the heap allocation keeps increasing, causing R14 errors (memory usage above 512 MB).
I cannot seem to reproduce the problem (at those levels, at least) locally.
Is there any way to get a heap dump from Heroku and analyze it to get some clue?
Are there any known issues with Play2, Scala, Squeryl and Heroku memory leaks?
Update
If I call System.gc() at the end of the controller, everything seems to be fine, and slower of course. I create a lot of objects in that call, but shouldn't Heroku's JVM take care of GC? Also, scheduling a GC call periodically does not free the memory.
There's a great article for troubleshooting memory issues on Heroku:
https://devcenter.heroku.com/articles/java-memory-issues
In your case, you can add the GC flags to JAVA_OPTS to see memory details. I'd suggest the following flags:
heroku config:add JAVA_OPTS="-Xmx384m -Xss512k -XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCDateStamps"
There's also a simple java agent that you can add to your process if you want a little more info from JMX about your memory. You can also take a look at monitoring addons like New Relic if you want to go into more depth, but I think you should be fine with the flags and java agent.
I had this issue as well, and answered it here.
I had the same issue. Heroku is telling you the machine is running out of memory, not the Java VM. There is actually a bug in the Heroku Play 2.2 deployment: the startup script reads java_opts, not JAVA_OPTS.
I fixed it by setting both:
heroku config:add java_opts='-Xmx384m -Xms384m -Xss512k -XX:+UseCompressedOops'
heroku config:add JAVA_OPTS='-Xmx384m -Xms384m -Xss512k -XX:+UseCompressedOops'
I also had to set -Xms, otherwise I got an error saying the min and max were incompatible. I guess Play 2.2 was using a default higher than 384m.