Why is my Netty-based TCP server hanging with 100% CPU usage on CentOS?

I've developed a Netty-based TCP server to maintain connections with GSM/GPRS-based devices and to persist their data in a MySQL database. Currently 5K connections are handled. Devices send periodic messages at 30-60 second intervals, but connections are kept alive to maintain duplex communication.
The server application consumes 1-2% CPU in normal operation with peaks up to 10%, and the average load is very low. However, after 6 to 48 hours of normal operation, the server application hangs with constant 100% CPU consumption; a thread dump indicates that the epoll selector is the reason for the high CPU usage. The application still keeps connections for a few hours, then CPU consumption increases to 200% and most of the connections are released.
At the beginning of the project we used MINA and had the same issue with 1K active connections, which is why we switched to Netty. Up to 5K connections Netty was much more stable, and the hang-up period was 1-2 weeks.
Our server configuration:
i7-2600 quad-core CPU,
8 GB RAM, CentOS 5.0,
OpenJDK 6,
Netty 3.2.4 (updated to 3.5.2 a few hours ago)
In order to overcome this problem we will update to JDK 7 (which has a new I/O implementation optimized for asynchronous operations) and try different operating systems, including FreeBSD and Windows Server, since each OS has different strategies for handling I/O.
Any help will be appreciated, thanks.

This sounds like the epoll bug.
A similar report (quoted from the linked source): "The app is proxying connections to backend systems. The proxy has a pool of channels that it can use to send requests to the backend systems. If the pool is low on channels, new channels are spawned and put into the pool so that requests sent to the proxy can be serviced. The pools get populated on app startup, so that is why it doesn't take long at all for the CPU to spike through the roof (22 seconds into the app lifecycle)."
Netty has a built-in workaround; I'm not sure from which version, I will have to check and update later:
System.setProperty("org.jboss.netty.epollBugWorkaround", "true");

Related

grpc unary-stream with redis pubsub - degradation with too many clients

We have a Python gRPC (grpcio with asyncio) server which performs server-side streaming of data consumed from Redis PUB/SUB (using aioredis 2.x), combining up to 25 channels per stream. With low traffic everything works fine, but as soon as we reach 2000+ concurrent streams, message delivery starts falling behind.
Some setup details and what we tried so far:
The client connections to gRPC are load-balanced over a Kubernetes cluster with the Ingress-NGINX controller, and scaling (we tried 9 pods with 10 process instances each) doesn't seem to help at all (the load is distributed evenly).
We are running a five-node Redis 7.x cluster with 96 threads per replica.
Connecting to Redis with the CLI client while gRPC falls behind shows the individual channels are on time while the gRPC streams are falling behind.
Messages are small (40 B) with a variable rate anywhere between 20-200 per second on each stream.
aioredis seems to open a new connection for each pub/sub subscriber even though we're using a capped connection pool for each gRPC instance.
Memory/CPU utilisation is not dramatic, nor is network I/O, so we're not being bottlenecked there.
We tried an identical setup with a very similar gRPC server written in Rust, with similar results.
@mike_t, as you mentioned in the comment, switching from Redis Pub/Sub to zmq has helped resolve the issue.
ZeroMQ (also known as ØMQ, 0MQ, or zmq) is an open-source universal messaging library. It looks like an embeddable networking library but acts like a concurrency framework. It gives you sockets that carry atomic messages across various transports such as in-process, inter-process, TCP, and multicast.
You can connect sockets N-to-N with patterns like fan-out, pub-sub, task distribution, and request-reply. It's fast enough to be the fabric for clustered products. Its asynchronous I/O model gives you scalable multicore applications, built as asynchronous message-processing tasks.
It has a score of language APIs and runs on most operating systems.
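For illustration, a minimal publish/subscribe sketch using the JeroMQ Java binding; the endpoint, topic, and payload below are placeholders, not taken from the setup above:

import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

public class ZmqPubSubSketch {
    public static void main(String[] args) throws InterruptedException {
        try (ZContext ctx = new ZContext()) {
            ZMQ.Socket pub = ctx.createSocket(SocketType.PUB);
            pub.bind("tcp://*:5556");                      // placeholder endpoint

            ZMQ.Socket sub = ctx.createSocket(SocketType.SUB);
            sub.connect("tcp://localhost:5556");
            sub.subscribe("ticker".getBytes(ZMQ.CHARSET)); // topic prefix filter

            // PUB/SUB has the "slow joiner" issue: give the subscription time to propagate.
            Thread.sleep(200);

            pub.send("ticker 42.0");
            System.out.println(sub.recvStr());             // prints "ticker 42.0"
        }
    }
}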

Wildfly 8.2 suddenly stops accepting incoming connections

We have a WildFly 8.2 app server with 6 GB of RAM allocated to it. Sometimes, due to a heavy transaction count, WildFly stops receiving incoming connections. But when I check the server's memory (not the app server; it is our VM), it uses 4 GB of RAM. Then I checked the WildFly app server's heap memory and it did not even use 25% of the allocated heap size. Why is that? When I restart the WildFly app server, everything works normally, and when that kind of load comes again, the above scenario repeats.
Try increasing the connection-limit as suggested in this SO question.
You can also dump HTTP requests as described here.
Also, are you getting any errors in your console? Please post them as well.

CPU usage of JBoss JVM goes up to 99% and stays there

I am doing load testing on my application using JMeter, and I have a situation where the CPU usage of the application's JVM goes to 99% and stays there. The application still works; I am able to log in and do some activity, but it's understandably slower.
Details of environment:
Server: AMD Opteron, 2.20 GHz, 8 cores, 64-bit, 24 GB RAM, Windows Server 2008 R2 Standard
Application server: jboss-4.0.4.GA
Java: jdk1.6.0_25, Java HotSpot(TM) 64-Bit Server VM
JVM settings:
-Xms1G -Xmx10G -XX:MaxNewSize=3G -XX:MaxPermSize=12G -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCompressedOops -Dsun.rmi.dgc.client.gcInterval=1800000 -Dsun.rmi.dgc.server.gcInterval=1800000
Database: MySql 5.6 (in a different machine)
Jmeter: 2.13
My scenario is that I have 20 users of my application log in and perform normal activity that should not create a huge load. Some minutes into the process, the CPU usage of the JBoss JVM goes up and never comes back down; it remains like that until the JVM is killed.
To help better understand, here are a few screenshots.
I found a few posts about CPU at 100%, but nothing there matched my situation and I could not find a solution.
Any suggestion on what’s to be done will be great.
Regards,
Sreekanth.
To understand the root cause of the high CPU utilization, we need to look at CPU data and thread dumps taken at the same time.
Capture 5-6 thread dumps at the time of the issue. Similarly, capture CPU consumption on a thread-by-thread basis.
Generally the root cause of a CPU issue is a problem with threads: BLOCKED threads, long-running threads, deadlocks, long-running loops, etc. That can be diagnosed by going through the thread stacks.
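As one way to capture per-thread CPU time together with stacks from inside the JVM (a sketch, not the only approach; jstack plus an OS-level per-thread CPU view works too), the standard ThreadMXBean API can be used:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadCpuDump {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (long id : mx.getAllThreadIds()) {
            long cpuNanos = mx.getThreadCpuTime(id);    // -1 if unsupported or thread has died
            ThreadInfo info = mx.getThreadInfo(id, 20); // top 20 stack frames
            if (info == null) continue;
            System.out.printf("%s  state=%s  cpu=%d ms%n",
                    info.getThreadName(), info.getThreadState(), cpuNanos / 1000000L);
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
        }
    }
}

Repeating this a few times during the spike shows which threads are actually burning CPU and what they are stuck doing.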

mongodb flushing mmap takes around 20 secs with no updates being required

Hi, one of our customers is running MongoDB v2.2.3 on a 64-bit Windows Server 2008 R2 Enterprise machine.
We're currently seeing mmap flush times of over 20 seconds every minute.
What is confusing me is that it isn't doing any writes to the disk (disk write bytes is next to 0).
Our program which accesses the data has been temporarily turned off,
so all that is connected is a mongo shell.
mongostat and mongotop aren't showing anything.
The database has 130 million records. There are 356 files for mmap.
Any suggestions on what could be causing this?
Thanks
If your working set is significantly larger than memory, and MongoDB is constantly going to disk for reads (and not just the normal spikes when syncing writes to disk), then you really should be sharding to spread the data across multiple machines/instances.
Given the behaviour you have described and that you have a large number of files for mmap, I suspect the underlying performance issue is SERVER-12401 in the MongoDB Jira issue tracker:
On Windows, Memory Mapped File flushes are synchronous operations. When the OS Virtual Memory Manager is asked to flush a memory mapped file, it makes a synchronous write request to the file cache manager in the OS. This causes large I/O stalls on Windows systems with high Disk IO latency, while on Linux the same writes are asynchronous.
There are a few possible ways to improve the flush performance on Windows, including code changes in both the MongoDB server and the Windows O/S. There is some ongoing work to address these issues, now that the synchronous flushing behaviour on Windows has been confirmed.
If you are using higher latency local storage (for example, spinning disks) you may be able to mitigate the issue by upgrading to SSD or better spec'd drives.
I would suggest upvoting/watching SERVER-12401 and the related Jira issues for updates.
It would also be worth upgrading from MongoDB 2.2 to a newer version as 2.2 is now past end-of-life for updates. There have been two major production release branches since then, including significant improvements in general performance/features as well as Windows support.

What are possible reasons for memcached to be significantly slower on a remote server?

I have a PHP/Apache server with 12GB of RAM. I have been running Memcached on the same machine with 6GB of allotted RAM.
I wanted to run Memcached on a separate server (same datacenter, VLAN, and subnet), just as I do for MySQL. I set up a separate, identical server with the same memcached configuration.
I am seeing roughly 10x the page load time when using Memcached on the remote server compared to running it locally. I have primed both caches and I still see the 10x load time from the remote server.
I'm having trouble troubleshooting this.
You're loading 500 KB of data per page load, all in small keys? How many keys per page load is this?
Latency to a remote server is very low, but running many roundtrips is still a bad idea. Memcached clients support multi-get operations, where you batch many keys into a single request/response with much lower overall latency.
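For example, a minimal multi-get sketch using the spymemcached Java client; the host name and keys below are placeholders, and PHP's Memcached extension offers an equivalent multi-get call:

import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import net.spy.memcached.MemcachedClient;

public class MultiGetSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port for the remote memcached box.
        MemcachedClient client = new MemcachedClient(
                new InetSocketAddress("memcached.internal", 11211));

        // Placeholder keys; in practice this would be every key the page needs.
        List<String> keys = Arrays.asList("user:42", "session:42", "prefs:42");

        // One network roundtrip for all keys instead of one roundtrip per key.
        Map<String, Object> values = client.getBulk(keys);
        values.forEach((k, v) -> System.out.println(k + " -> " + v));

        client.shutdown();
    }
}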
Just for info, DDR3-1333 memory bandwidth is about 10667 MB/s.
If you have, let's say, Gigabit Ethernet, I guess that can explain some of the problems you are experiencing...