What are these Catalina-utility-X threads and why do they burn through so much heap? They consume 20x more memory than my worker thread, which I was trying to debug because I thought it was memory-intensive.
This snapshot was taken on a Tomcat instance running inside my Eclipse IDE, which is receiving zero HTTP traffic and only doing work on that worker thread (or so I thought).
Related
In Slurm, what happens if the resources I requested turn out not to be enough while the job is running?
For example, myscript.sh contains #SBATCH --mem=10G; #SBATCH --cpus-per-task=2; python mytrain.py. After I run sbatch myscript.sh, the job is allocated the requested CPUs (2) and memory (10 GB) successfully. But when, during the run, the program needs more memory than 10 GB (like loading a big video dataset), I found that the job is not killed; it keeps working normally.
So my question is: is there any side effect when I underestimate the resources I need? (Memory seems okay, but is it still okay if the requested number of CPUs is not enough?)
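For reference, a minimal sketch of what myscript.sh contains (just the lines mentioned above):

#!/bin/bash
#SBATCH --mem=10G
#SBATCH --cpus-per-task=2
python mytrain.py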
Slurm can be configured to constrain jobs to their resource requests (the most usual setup), which does not seem to be the case on the cluster you are using.
If it were, your job would be killed when trying to use more memory than requested, and it would be limited to the physical CPUs you requested.
In your case, using more memory than requested can lead to memory exhaustion on the node your job is running on, possibly getting your processes (but also possibly processes of other jobs on the same node!) killed by the OOM killer. Using more CPUs than requested means the processes started by your job compete with the processes of other jobs for the same physical CPUs, leading to a general slow-down of all jobs on the node because of the large number of context switches. Jobs that are slowed down can then exceed their maximum allowed time and get killed.
Underestimating resources can thus lead to the loss of your jobs. If nodes are shared among jobs, it can also lead to the loss of other users' jobs.
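If you want to check how your cluster is set up (assuming you have shell access to a login node), the task plugin and select parameters reported by scontrol usually tell you whether enforcement is active; a quick, non-authoritative way to look:

scontrol show config | grep -i taskplugin
scontrol show config | grep -i selecttypeparameters

If the output mentions task/cgroup and a CR_*_Memory value, memory and CPU requests are typically enforced; if not, they are only used for scheduling and accounting, which matches what you observe.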
We are trying to deploy an Apache Flink job on a K8s cluster, but we are noticing odd behavior: when we start the job, the task manager memory starts at the amount assigned, in our case 3 GB.
taskmanager.memory.process.size: 3g
Eventually, the memory starts decreasing until it reaches about 160 MB; at that point it recovers a little memory, so it never quite runs out.
That very low memory often causes the job to be terminated due to a task manager heartbeat exception, even while we are just watching the logs on the Flink dashboard or while the job is doing its processing.
Why is it going so low on memory? We expected this behavior, but in the range of GB, because we assigned those 3 GB to the task manager. Even if we change the task manager memory size, we get the same behavior.
Our Flink conf looks like this:
flink-conf.yaml: |+
  taskmanager.numberOfTaskSlots: 1
  blob.server.port: 6124
  taskmanager.rpc.port: 6122
  taskmanager.memory.process.size: 3g
  metrics.reporters: prom
  metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
  metrics.reporter.prom.port: 9999
  metrics.system-resource: true
  metrics.system-resource-probing-interval: 5000
  jobmanager.rpc.address: flink-jobmanager
  jobmanager.rpc.port: 6123
Is there a recommended memory configuration for K8s, or something we are missing in our flink-conf.yaml?
Thanks.
Your configuration looks fine. It's most likely an issue with your code and some kind of memory leak. This is a very good answer describing what may be the problem.
You can try setting a limit on the JVM heap with taskmanager.memory.task.heap.size so that you give the JVM some extra room to do GC, etc. But in the end, if you keep allocating objects that are never released, you will run into the same situation.
Presumably you are using your memory to store your state, in which case you can also try RocksDB as a state backend in case you are storing large objects.
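As a rough sketch of both suggestions in flink-conf.yaml (the heap size below is just a placeholder to illustrate, not a recommended value, and assumes a Flink version that supports the taskmanager.memory.* options):

taskmanager.memory.task.heap.size: 1g
state.backend: rocksdb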
What are your requests/limits in your deployment templates? If no request sizes are specified, you may be seeing your cluster resources get eaten.
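For example, a hedged sketch of the container resources section of a TaskManager deployment (the sizes here are assumptions to illustrate; the point is that the memory limit should be at least taskmanager.memory.process.size):

resources:
  requests:
    memory: "3Gi"
    cpu: "1"
  limits:
    memory: "4Gi"
    cpu: "2"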
I am doing load testing on my application using JMeter, and I have a situation where the CPU usage of the application's JVM goes to 99% and stays there. The application still works; I am able to log in and do some activity, but it is understandably slower.
Details of environment:
Server: AMD Opteron, 2.20 GHz, 8 cores, 64-bit, 24 GB RAM, Windows Server 2008 R2 Standard
Application server: jboss-4.0.4.GA
JAVA: jdk1.6.0_25, Java HotSpot(TM) 64-Bit Server VM
JVM settings:
-Xms1G -Xmx10G -XX:MaxNewSize=3G -XX:MaxPermSize=12G -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCompressedOops -Dsun.rmi.dgc.client.gcInterval=1800000 -Dsun.rmi.dgc.server.gcInterval=1800000
Database: MySql 5.6 (in a different machine)
JMeter: 2.13
My scenario is that I have 20 users of my application log into it and perform normal activity that should not create a huge load. Some minutes into the process, the CPU usage of the JBoss JVM goes up and never comes back down. It stays like that until the JVM is killed.
To help better understand, here are a few screenshots.
I found a few posts where the CPU was at 100%, but nothing there matched my situation and I could not find a solution.
Any suggestion on what to do would be great.
Regards,
Sreekanth.
To understand the root cause of the high CPU utilization, we need to look at the CPU data and thread dumps at the same time.
Capture 5-6 thread dumps at the time of the issue. Similarly, capture CPU consumption on a thread-by-thread basis.
Generally the root cause of a CPU issue is a problem with threads: BLOCKED threads, long-running threads, deadlocks, long-running loops, etc. That can be identified by going through the stacks of the threads.
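A minimal sketch of how to get per-thread CPU alongside thread state from inside the affected JVM, using the standard java.lang.management API (which also works on Windows); it is shown as a standalone main for brevity, but you would trigger it from inside the application, e.g. from a debug endpoint, while the CPU is pegged:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadCpuDump {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Per-thread CPU timing may be disabled by default on some JVMs.
        if (mx.isThreadCpuTimeSupported() && !mx.isThreadCpuTimeEnabled()) {
            mx.setThreadCpuTimeEnabled(true);
        }
        // Print every live thread with its state and accumulated CPU time.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            long cpuNanos = mx.getThreadCpuTime(info.getThreadId());
            System.out.printf("%-45s %-15s cpu=%d ms%n",
                    info.getThreadName(), info.getThreadState(), cpuNanos / 1000000L);
        }
    }
}

In parallel, a few jstack <pid> snapshots taken 5-10 seconds apart give you the stacks to match against whichever thread names show the highest CPU.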
Can anyone confirm if Netty 3.5.7 introduced a change that causes an NIO threadpool of 200 threads to be created?
We have a webapp that we're running in Tomcat 7, and I've noticed that at some point there is a new block of 200 NIO threads, all labeled "New I/O Worker #". I've verified that with 3.5.6 this thread pool is not initialized with 200 threads, only a boss thread. As soon as I replaced the jar with 3.5.7, I now have 200 NIO threads plus the boss thread.
If this change was introduced with 3.5.7, is it possible to control the pool size with some external configuration? I ask because we don't explicitly use Netty, it's used by a 3rd party JAR.
Thanks,
Bob
Netty switched to no longer lazily starting workers because of the synchronization overhead. I guess that could be the problem you see.
The only remedy here is to change the worker count when creating the Nio*ChannelFactory. 200 is way too high anyway.
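For Netty 3.x that looks roughly like the sketch below when you control the bootstrap yourself; the worker count of 8 is only an illustrative assumption, and if the factory is created inside the 3rd-party JAR you would need a configuration hook from that library instead:

import java.util.concurrent.Executors;
import org.jboss.netty.bootstrap.ServerBootstrap;
import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory;

public class NettyBootstrapFactory {
    public static ServerBootstrap create() {
        int workerCount = 8; // assumption: pick something on the order of 2 * cores
        return new ServerBootstrap(
                new NioServerSocketChannelFactory(
                        Executors.newCachedThreadPool(), // boss executor
                        Executors.newCachedThreadPool(), // worker executor
                        workerCount));                   // caps the "New I/O worker" pool
    }
}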
I've developed a Netty-based TCP server to maintain connections with GSM/GPRS-based devices and to persist their data in a MySQL database. Currently 5K connections are handled. Devices send periodic messages at intervals of 30-60 seconds, but connections are kept alive to maintain duplex communication.
The server application consumes 1-2% CPU in normal operation, with peaks up to 10%; the average load is very low. However, after 6 to 48 hours of normal operation, the server application hangs with constant 100% CPU consumption, and a thread dump indicates that the epoll selector is the reason for the high CPU usage. The application still keeps its connections for a few hours, then CPU consumption increases to 200% and most of the connections are released.
At the beginning of the project we used MINA and had the same issue with 1K active connections, which is why we switched to Netty. Up to 5K connections Netty was much more stable, and the hang-up period was 1-2 weeks.
Our server configuration:
i7-2600 quad-core CPU,
8 GB RAM, CentOS 5.0,
OpenJDK 6.0,
Netty 3.2.4 (Netty was updated to 3.5.2 a few hours ago)
In order to overcome this problem we will update the JDK to 7.0 (which has a new I/O implementation optimized for asynchronous operations) and try different OSes, including FreeBSD and Windows Server, since each operating system has a different strategy for handling I/O.
Any help will be appreciated, thanks.
This sounds like the Epoll bug.
The app is proxying connections to backend systems. The proxy has a pool of channels that it can use to send requests to the backend systems. If the pool is low on channels, new channels are spawned and put into the pool so that requests sent to the proxy can be serviced. The pools get populated on app startup, so that is why it doesn't take long at all for the CPU to spike through the roof (22 seconds into the app lifecycle). Source
Netty has a workaround built in. I am not sure from which version, though; I will have to update this later.
System.setProperty("org.jboss.netty.epollBugWorkaround", "true");