Wildfly 9 exception "Too many open files"

I have already configured the ulimit and /proc/sys/fs/file-max as 1000000 on our server.
When I find the "Too many open files" exception in server.log, I run "lsof | wc -l", and the result is larger than 1000000, so it is not a configuration issue. Over time the result of "lsof | wc -l" grows from 300000 to 1000000, while "lsof -p {wildfly pid} | wc -l" gives 1534.
If I use "lsof" to list all open files, the result is:
java 23032 23570 star 803r FIFO 0,8 0t0 159435626 pipe
java 23032 23570 star 804r FIFO 0,8 0t0 159427236 pipe
java 23032 23570 star 805r FIFO 0,8 0t0 159416919 pipe
java 23032 23570 star 806r FIFO 0,8 0t0 159425566 pipe
"23032" is the wildfly pid. This pid has more than 200 tid (thread id), each tid(thread?) open 1534 files. When sometime passed, the total number of opened files is larger than 1000000 and "Too many open files" is thrown.
Could anyone help explain what the root cause is and how to fix the issue, or suggest a workaround?
Many thanks for the help.

Try Wildfly 10.1. If you are using SSL, there have been some bug fixes that may be relevant.


The BPF filter did not work with VLAN packets

I captured some packets with PcapPlusPlus on our Ubuntu server and wrote them to .pcap files. When I read the .pcap files back, it worked fine; but when I set a filter with BPF syntax, it could not read anything from my .pcap files. The filter is just the string "tcp", and it worked well with the example input.pcap, but not with my pcap files.
pcpp::IFileReaderDevice* reader = pcpp::IFileReaderDevice::getReader("input.pcap");

// verify that a reader interface was indeed created
if (reader == NULL)
{
    printf("Cannot determine reader for file type\n");
    exit(1);
}

// open the reader for reading
if (!reader->open())
{
    printf("Cannot open input.pcap for reading\n");
    exit(1);
}

// create a pcap file writer. Specify file name and link type of all packets that
// will be written to it
pcpp::PcapFileWriterDevice pcapWriter("output.pcap", pcpp::LINKTYPE_ETHERNET);

// try to open the file for writing
if (!pcapWriter.open())
{
    printf("Cannot open output.pcap for writing\n");
    exit(1);
}

// create a pcap-ng file writer. Specify file name. Link type is not necessary because
// pcap-ng files can store multiple link types in the same file
pcpp::PcapNgFileWriterDevice pcapNgWriter("output.pcapng");

// try to open the file for writing
if (!pcapNgWriter.open())
{
    printf("Cannot open output.pcapng for writing\n");
    exit(1);
}

// set a BPF filter for the reader - only packets that match the filter will be read
if (!reader->setFilter("tcp"))
{
    printf("Cannot set filter for file reader\n");
    exit(1);
}

// the packet container
pcpp::RawPacket rawPacket;

// a while loop that will continue as long as there are packets in the input file
// matching the BPF filter
while (reader->getNextPacket(rawPacket))
{
    // write each packet to both writers
    printf("matched ...\n");
    pcapWriter.writePacket(rawPacket);
    pcapNgWriter.writePacket(rawPacket);
}
Here is a screenshot of some of the packets: https://i.stack.imgur.com/phYA0.png. Can anyone help?
TL;DR: you need to use the filter vlan and tcp to catch TCP packets with a VLAN tag.
Explanation
We can look at what BPF filter is generated when using only tcp:
$ tcpdump -d -i eth0 tcp
(000) ldh [12]
(001) jeq #0x86dd jt 2 jf 7
(002) ldb [20]
(003) jeq #0x6 jt 10 jf 4
(004) jeq #0x2c jt 5 jf 11
(005) ldb [54]
(006) jeq #0x6 jt 10 jf 11
(007) jeq #0x800 jt 8 jf 11
(008) ldb [23]
(009) jeq #0x6 jt 10 jf 11
(010) ret #262144
(011) ret #0
We can see that 2 bytes are first loaded from offset 12 in the packet. That corresponds to the Ethertype in the Ethernet header. It is then used to check whether we are parsing an IPv6 (jeq #0x86dd) or IPv4 (jeq #0x800) packet.
However, when there is a VLAN tag, the Ethertype field is shifted by 4 bytes (the length of the VLAN tag field). For packets with a VLAN tag, we should therefore read the Ethertype at offset 16.
Using filter vlan and tcp implements this change, by first checking that there is a VLAN tag and then taking it into account when reading the Ethertype. Therefore, to filter TCP packets regardless of whether they have a VLAN tag, you'll need tcp or (vlan and tcp).
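To verify this from the command line, here is a quick hedged sketch (eth0 and mycapture.pcap are placeholders for your own interface and capture file):
tcpdump -d -i eth0 'vlan and tcp'                   # dump the BPF program; note the extra VLAN-tag checks before the Ethertype load
tcpdump -r mycapture.pcap 'tcp or (vlan and tcp)'   # replay your capture through the combined filter
In the PcapPlusPlus code above, the same combined expression goes into the filter call: reader->setFilter("tcp or (vlan and tcp)").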
@pchaigno is correct; you need to use vlan and tcp or, to catch both VLAN-encapsulated and non-VLAN-encapsulated TCP packets, tcp or (vlan and tcp).

Writing messages longer than 1024 B (MTU) fails in SoftRoCE mode

When the message length I am writing is more than 1024 B (the MTU), it fails in SoftRoCE mode; please help check why.
Using the standard tool ib_write_lat to test:
ib_write_lat -s 1024 -n 5 works.
ib_write_lat -s 1025 -n 5 fails.
My SoftRoCE setup is on Red Hat Enterprise Linux Server release 7.4 (Maipo).
Is it a bug in SoftRoCE?
No, it isn't a bug. I had similar problems.
What did you configure in your interface configuration?
I expect that you have an MTU of 1500 bytes configured (or left the default value); this results in RoCE using 1024. If you configure your interface MTU to 4200, you can use the ib_write_lat command with up to 4096 bytes.
The InfiniBand protocol's Maximum Transmission Unit (MTU) is one of several fixed sizes: 256, 512, 1024, 2048 or 4096 bytes.
RoCE-based applications that use RDMA over Ethernet should take into account that the RoCE MTU is smaller than the Ethernet MTU (normally 1500 is the default).
https://community.mellanox.com/docs/DOC-1447
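A minimal sketch of the suggested change, assuming eth0 is the Ethernet interface backing the SoftRoCE (rxe) device; adjust the name to your setup:
ip link set dev eth0 mtu 4200   # raise the Ethernet MTU so RoCE can step up to a 4096-byte path MTU
ibv_devinfo | grep active_mtu   # should now report active_mtu: 4096
ib_write_lat -s 4096 -n 5       # retry with a message size that failed before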

PlayFramework Hangs After Days

The server ran successfully at first, but it hangs after a few days with no error logs. From then on, no request gets a response.
This is the start command with options:
sudo /opt/dev -Dhttps.port=443 -Dhttp.port=9000 -J-Xms3277m -J-Xmx3277m -J-XX:ParallelGCThreads=2 -J-Xmn2574M -J-XX:+UseConcMarkSweepGC -J-XX:+CMSClassUnloadingEnabled -J-server &
/opt/dev is the script file generated by activator stage.
===========server info==========
linux: Ubuntu 14.04.5 LTS
ram: 4G
openjdk version "1.8.0_141"
===========process info========
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME COMMAND
15037 root 20 0 5978800 2.280g 31216 S 0.0 58.3 63:33.82 java
===========port info ===================
tcp6 :::9000 :::* LISTEN 15037/java
tcp6 :::443 :::* LISTEN 15037/java
===========other info==========
play version 2.3.2
scala version 2.11.1
akka setting
akka.jvm-exit-on-fatal-error = false
play.akka.jvm-exit-on-fatal-error = false
akka.default-dispatcher.fork-join-executor.pool-size-max =64
akka.actor.debug.receive = on
===========================================
These steps could help identify the problem, or at least serve as first steps in that direction (a shell sketch of the checks follows this list).
Try starting with -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/where/to/put/hprof added; given your start script params, I think you need to use -J-XX instead of -XX. This will create a heap dump in case of an OOM.
Add logging to your endpoints (at the start and at the end) to be able to check whether Play even receives the request.
While Play is unresponsive, check the open file descriptors and compare the count with your limits. Find the pid of your java process and run sudo ls -al /proc/<pid>/fd/ | wc -l; to see your limits, use ulimit -a.
It would also be good to keep an eye on the Akka queues, in case you use the same dispatcher for frontend requests and for some back-office processing (the dispatcher could be filled with long background tasks).
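A hedged shell sketch of those checks (15037 is the pid from the process info above; the dump path is an arbitrary example):
# Extra start-script flags for the heap dump on OOM (the -J prefix hands them to the JVM):
#   -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:HeapDumpPath=/tmp/play-oom.hprof
# While the app is unresponsive, compare open descriptors against the limit:
sudo ls /proc/15037/fd | wc -l
grep 'open files' /proc/15037/limits
ulimit -a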
I would do all the diagnostic steps that Evgeny suggested, plus:
Change akka.jvm-exit-on-fatal-error and play.akka.jvm-exit-on-fatal-error to true; setting them to false may be masking your problem.
Take a stack dump of the running process when it is in this state and use it to identify the problem, or post it here. See: How to get a complete stack trace of a running java program that is taking 100% cpu?
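For the stack dump, a minimal sketch (15037 is the pid from the process info above; jstack ships with the JDK):
jstack 15037 > /tmp/play-threads.txt   # take a thread dump; repeat a few times to spot stuck threads
kill -3 15037                          # fallback: SIGQUIT makes the JVM print the dump to its own stdout/log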

Cannot kill python program on port 8000 causing Tryton server socket.error

I have been getting more deeply involved in Python for scientific computing (as a hobby) over the last 2 years, and as I also have a medical degree, I really, really want to get a copy of GNU Health running on my new Kubuntu 15.10 OS so I can learn how it all works and play around with it! I followed the instructions to install it on this page: https://en.wikibooks.org/wiki/GNU_Health/Installation
I got pretty much to the end, but when I try to launch the Tryton server with ./trytond I get this error message:
[Thu Oct 29 10:25:02 2015] INFO:trytond.server:using /home/gnuhealth/gnuhealth/tryton/server/config/trytond.conf as configuration file
[Thu Oct 29 10:25:02 2015] INFO:trytond.server:initialising distributed objects services
Traceback (most recent call last):
File "./trytond", line 80, in <module>
trytond.server.TrytonServer(options).run()
File "/home/gnuhealth/gnuhealth/tryton/server/trytond-3.4.6/trytond/server.py", line 71, in run
self.start_servers()
File "/home/gnuhealth/gnuhealth/tryton/server/trytond-3.4.6/trytond/server.py", line 178, in start_servers
self.jsonrpcd.append(JSONRPCDaemon(hostname, port, ssl))
File "/home/gnuhealth/gnuhealth/tryton/server/trytond-3.4.6/trytond/protocols/jsonrpc.py", line 382, in __init__
self.server = server_class((interface, port), handler_class, 0)
File "/home/gnuhealth/gnuhealth/tryton/server/trytond-3.4.6/trytond/protocols/jsonrpc.py", line 317, in __init__
bind_and_activate)
File "/usr/lib/python2.7/SocketServer.py", line 420, in __init__
self.server_bind()
File "/home/gnuhealth/gnuhealth/tryton/server/trytond-3.4.6/trytond/protocols/jsonrpc.py", line 346, in server_bind
SimpleJSONRPCServer.server_bind(self)
File "/usr/lib/python2.7/SocketServer.py", line 434, in server_bind
self.socket.bind(self.server_address)
File "/usr/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
socket.error: [Errno 98] Address already in use
On further investigation with sudo netstat -pant | grep 8000, I get:
tcp 0 0 127.0.0.1:8000 0.0.0.0:* LISTEN 2516/python
I have tried to kill this Python program running on port 8000 every way I could find, but it keeps coming back with a new PID in front, i.e.
tcp 0 0 127.0.0.1:8000 0.0.0.0:* LISTEN 916/python
I kill it, then...
tcp 0 0 127.0.0.1:8000 0.0.0.0:* LISTEN some_other_number etc../python
Can someone please explain what is going on with this Python program that keeps restarting, and how I can fix this one little problem that is getting in the way of starting the server?
I was looking at the install instructions you mentioned.
Look at this section:
Activate Network Devices for the JSON-RPC Protocol
The Tryton GNU Health server listens on localhost at port 8000, not allowing direct connections from other workstations.
Run editconf to open the configuration file.
You can edit the listen parameter in the [jsonrpc] section to activate the network device so workstations on your network can connect. For example, the following block
[jsonrpc]
listen = *:8000
will allow the different devices on your network to connect to the server.
Check whether you can change the value of the port and see if it works.
Use a port number that is unused; it has to be greater than 1024. Use this command to check whether a port number is available:
netstat -nlp | grep <self-chosen-hopefully-unused-port-number>
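For example, a hedged sketch using 8001 as the candidate port; the config path is taken from the first log line of the traceback, and the sed pattern assumes the listen line looks like the [jsonrpc] block quoted above:
netstat -nlp | grep 8001   # no output means 8001 is free
sed -i 's/^listen = .*:8000/listen = *:8001/' /home/gnuhealth/gnuhealth/tryton/server/config/trytond.conf
./trytond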
Hope this helps.

JBoss 6.0.0 server crashed when I use the AJP protocol - Too many open files

JBoss 6.0.0 server crashed when I use the AJP protocol.
The system shows the below exception continuously.
2012-08-21 16:12:51,750 ERROR [org.apache.tomcat.util.net.JIoEndpoint] (ajp-0.0.0.0-8009-Acceptor-0) Socket accept failed: java.net.SocketException: Too many open files
at java.net.PlainSocketImpl.socketAccept(Native Method) [:1.6.0_24]
at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408) [:1.6.0_24]
at java.net.ServerSocket.implAccept(ServerSocket.java:462) [:1.6.0_24]
at java.net.ServerSocket.accept(ServerSocket.java:430) [:1.6.0_24]
at org.apache.tomcat.util.net.DefaultServerSocketFactory.acceptSocket(DefaultServerSocketFactory.java:61) [:6.0.0.Final]
at org.apache.tomcat.util.net.JIoEndpoint$Acceptor.run(JIoEndpoint.java:343) [:6.0.0.Final]
at java.lang.Thread.run(Thread.java:662) [:1.6.0_24]
Check the maximum number of fds in the system by entering cat /proc/sys/fs/file-max; if you have 65535 it should be OK, but you can increase it to 200000.
Check the ulimit by entering ulimit -n; on my side that gave 1024, so I increased it in the file /etc/security/limits.conf by adding:
* soft nofile 2048
* hard nofile 2048
Finally, you can check the fds by entering lsof -p xxx | wc -l.
All these explanations come from this, which saves my life each time I face this issue.
The issue is because of the maxThreads and connectionTimeout values set in server.xml on the JBoss server.
The AJP protocol's connectionTimeout default value is infinite,
so I set the connectionTimeout value to 120000 (2 minutes).
After that, the problem (Too many open files) never reappeared. It is always better to set an optimal configuration instead of the default values.
To assist with this issue, some other configuration changes were made; those are:
the "max threads" value for the AJP protocol was changed from 1500 to 150;
the "ulimit -n" value was changed from 1024 to 8192.
The problem with AJP is that its connectionTimeout default value is infinite. To resolve the issue, I suggest that you switch to HTTP or, even better, HTTPS.