Max number of sockets on Linux

It seems that the server is limited to ~32720 sockets...
I have tried every known variable change to raise this limit.
But the server stays limited to 32720 open sockets, even though there are still 4 GB of free memory and 80% idle CPU...
Here's the configuration:
~# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 63931
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 798621
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 2048
cpu time (seconds, -t) unlimited
max user processes (-u) 63931
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
net.netfilter.nf_conntrack_max = 999999
net.ipv4.netfilter.ip_conntrack_max = 999999
net.nf_conntrack_max = 999999
Any thoughts?

If you're dealing with OpenSSL and threads, check your /proc/sys/vm/max_map_count and try raising it.

In IPv4, the TCP header has 16 bits for the destination port and 16 bits for the source port.
See http://en.wikipedia.org/wiki/Transmission_Control_Protocol
Seeing that your limit is ~32K, I would expect that you are actually hitting the limit on the number of outbound TCP connections you can make. You should be able to get a maximum of ~65K sockets (this would be the protocol limit); it is the limit on the total number of named connections. Fortunately, binding a port for incoming connections only uses one. But if you are trying to test the number of connections from the same machine, you can only have about 65K total outgoing TCP connections. To test the number of incoming connections, you will need multiple computers.
Note: you can call socket(AF_INET, ...) up to the number of file descriptors available, but
you cannot bind them without increasing the number of ports available. To increase the range, do this:
echo "1024 65535" > /proc/sys/net/ipv4/ip_local_port_range
(cat it to see what you currently have; the default is 32768 to 61000.)
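As a hedged illustration (my own sketch, not part of the original answer): the short C program below keeps making outbound connections to a single destination without calling bind(). Each connect() consumes one ephemeral port from ip_local_port_range, and once the range is exhausted connect() fails with EADDRNOTAVAIL. The 127.0.0.1:8080 target is a placeholder for a listener you start yourself.
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Keep connecting to one destination until the ephemeral port range runs out.
   Placeholder target: a listener you start yourself on 127.0.0.1:8080. */
int main(void)
{
    struct sockaddr_in dst;
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(8080);
    inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

    int count = 0;
    for (;;) {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        if (s < 0) {
            perror("socket");          /* likely EMFILE: raise RLIMIT_NOFILE first */
            break;
        }
        if (connect(s, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
            /* EADDRNOTAVAIL here means the local port range is exhausted. */
            printf("connect #%d failed: %s\n", count + 1, strerror(errno));
            break;
        }
        ++count;                        /* sockets deliberately left open to hold ports */
    }
    printf("%d outbound connections established\n", count);
    return 0;
}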
Perhaps it is time for a new TCP-like protocol that allows 32 bits for the source and destination ports? But how many applications really need more than 65 thousand outbound connections?
The following will allow 100,000 incoming connections on Linux Mint 16 (64-bit).
(You must run it as root to set the limits.)
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

void ShowLimit(void)
{
    struct rlimit lim;
    int err = getrlimit(RLIMIT_NOFILE, &lim);
    printf("%d limit: %lu,%lu\n", err,
           (unsigned long)lim.rlim_cur, (unsigned long)lim.rlim_max);
}

int main(void)
{
    ShowLimit();

    /* Raise the per-process open-file limit (run as root to raise the hard limit). */
    struct rlimit lim;
    lim.rlim_cur = 100000;
    lim.rlim_max = 100000;
    int err = setrlimit(RLIMIT_NOFILE, &lim);
    printf("set returned %d\n", err);
    ShowLimit();

    /* Listen on TCP port 80 on all interfaces. */
    int sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    struct sockaddr_in maddr;
    memset(&maddr, 0, sizeof(maddr));
    maddr.sin_family = AF_INET;
    maddr.sin_port = htons(80);
    maddr.sin_addr.s_addr = INADDR_ANY;
    err = bind(sock, (struct sockaddr *)&maddr, sizeof(maddr));
    err = listen(sock, 1024);

    /* Accept connections forever and count them. */
    int sockets = 0;
    while (1)
    {
        struct sockaddr_in raddr;
        socklen_t rlen = sizeof(raddr);
        int client = accept(sock, (struct sockaddr *)&raddr, &rlen);
        if (client >= 0)
        {
            ++sockets;
            printf("%d sockets accepted\n", sockets);
        }
    }
}

Which server are you talking about? It might have a hardcoded maximum, or run into other limits (maximum threads, out of address space, etc.).
http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-1 has some of the tuning needed to achieve a large number of connections, but it doesn't help if the server application limits it in some way or another.

Check the real limits of the running process with:
cat /proc/{pid}/limits
The maximum for nofile is determined by the kernel. The following, run as root, would increase the maximum to 100,000 "files", i.e. 100k concurrent connections:
echo 100000 > /proc/sys/fs/file-max
To make it permanent, edit /etc/sysctl.conf:
fs.file-max = 100000
You then need the server to ask for more open files; this is different for each server. In nginx, for example, you set
worker_rlimit_nofile 100000;
Restart nginx and check /proc/{pid}/limits again.
To test this you need 100,000 sockets in your client; during testing you are limited by the number of TCP ports available per IP address.
To increase the local port range to the maximum:
echo "1024 65535" > /proc/sys/net/ipv4/ip_local_port_range
This gives you ~64000 ports to test with.
If that is not enough, you need more IP addresses. When testing on localhost you can bind the source/client to an IP other than 127.0.0.1/localhost.
For example, you can bind your test clients to IPs randomly selected from 127.0.0.1 to 127.0.0.5.
Using apache-bench you would set
-B 127.0.0.x
Node.js sockets would use
localAddress
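To show what binding a client to an alternative loopback address looks like at the socket level, here is a minimal C sketch of mine (not part of the original answer); the source address 127.0.0.2 and the destination 127.0.0.1:8080 are placeholders.
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Bind the client end to 127.0.0.2 before connecting, so each extra loopback
   address contributes its own pool of local ports for testing. */
int main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in src;
    memset(&src, 0, sizeof(src));
    src.sin_family = AF_INET;
    src.sin_port = 0;                              /* let the kernel pick the port */
    inet_pton(AF_INET, "127.0.0.2", &src.sin_addr);
    if (bind(s, (struct sockaddr *)&src, sizeof(src)) < 0)
        perror("bind");

    struct sockaddr_in dst;
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(8080);                    /* placeholder server port */
    inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);
    if (connect(s, (struct sockaddr *)&dst, sizeof(dst)) < 0)
        perror("connect");
    return 0;
}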
/etc/security/limits.conf configures PAM limits; it's usually irrelevant for a server.
If the server is proxying requests over TCP, using upstream or mod_proxy for example, the server is limited by ip_local_port_range. This could easily be the source of a ~32,000 limit.

If you're considering an application where you believe you need to open thousands of sockets, you will definitely want to read about The C10k Problem. That page discusses many of the issues you will face as you scale up your number of client connections to a single server.

On GNU/Linux, the maximum is what you wrote. This number is (probably) stated somewhere in the networking standards. I doubt you really need that many sockets. You should optimize the way you are using sockets instead of creating dozens of them all the time.

In net/socket.c the fd is allocated in sock_alloc_fd(), which calls get_unused_fd().
Looking at linux/fs/file.c, the only limit on the number of fds is sysctl_nr_open, which is limited to
int sysctl_nr_open_max = 1024 * 1024; /* raised later */
/* later... */
sysctl_nr_open_max = min((size_t)INT_MAX, ~(size_t)0/sizeof(void *)) & -BITS_PER_LONG;
and can be read using sysctl fs.nr_open, which gives 1M by default here. So the fds are probably not your problem.
Edit: you probably checked this as well, but would you care to share the output of
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

int main() {
    struct rlimit limit;
    getrlimit(RLIMIT_NOFILE, &limit);
    /* rlim_t is an unsigned type, so cast before printing */
    printf("cur: %lu, max: %lu\n",
           (unsigned long)limit.rlim_cur,
           (unsigned long)limit.rlim_max);
    return 0;
}
with us?

Generally, having too many live connections is a bad thing. However, everything depends on the application and the patterns in which it communicates with its clients.
I suppose there is a pattern where clients have to be permanently async-connected and it is the only way a distributed solution can work.
Assuming there are no bottlenecks in memory/CPU/network for the current load, and keeping in mind that leaving idle connections open may be the only way a distributed application consumes fewer resources (say, connection setup time and overall/peak memory), overall OS network performance might be higher than if you followed the usual best practices.
It is a good question and it needs a solution, but the problem is that nobody else can answer it for you. I would suggest using a divide-and-conquer approach, and when the bottleneck is found, returning to us.
Please take your application apart on a testbed and you will find the bottleneck.

Related

What is the latency of `clwb` and `ntstore` on Intel's Optane Persistent Memory?

In this paper, it is written that the 8-byte sequential write latencies of clwb and ntstore on Optane PM are 90 ns and 62 ns respectively, and sequential reading is 169 ns.
But in my test with an Intel 5218R CPU, clwb takes about 700 ns and ntstore about 1200 ns. Of course, there are differences between my test method and the paper's, but the results are so much worse that they seem unreasonable. And my test is closer to actual usage.
During the test, did the Write Pending Queue of the CPU's iMC or the WC buffer in the Optane PM become the bottleneck, causing blockage, so that the measured latency is inaccurate? If this is the case, is there a tool to detect it?
#include "libpmem.h"
#include "stdio.h"
#include "x86intrin.h"
//gcc aep_test.c -o aep_test -O3 -mclwb -lpmem
int main()
{
size_t mapped_len;
char str[32];
int is_pmem;
sprintf(str, "/mnt/pmem/pmmap_file_1");
int64_t *p = pmem_map_file(str, 4096 * 1024 * 128, PMEM_FILE_CREATE, 0666, &mapped_len, &is_pmem);
if (p == NULL)
{
printf("map file fail!");
exit(1);
}
if (!is_pmem)
{
printf("map file fail!");
exit(1);
}
struct timeval start;
struct timeval end;
unsigned long diff;
int loop_num = 10000;
_mm_mfence();
gettimeofday(&start, NULL);
for (int i = 0; i < loop_num; i++)
{
p[i] = 0x2222;
_mm_clwb(p + i);
// _mm_stream_si64(p + i, 0x2222);
_mm_sfence();
}
gettimeofday(&end, NULL);
diff = 1000000 * (end.tv_sec - start.tv_sec) + end.tv_usec - start.tv_usec;
printf("Total time is %ld us\n", diff);
printf("Latency is %ld ns\n", diff * 1000 / loop_num);
return 0;
}
Any help or correction is much appreciated!
The main reason is that repeatedly flushing the same cache line is delayed dramatically [1].
You are testing the average latency instead of the best-case latency as in the FAST20 paper.
ntstore is more expensive than clwb, so its latency is higher; I guess that's a typo in your first paragraph.
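As a rough, hedged illustration of what "best-case" timing looks like (my own sketch, not from the paper or the original answer): time each store + clwb + sfence individually with rdtscp and keep the minimum, after warming the line. The TSC_GHZ constant and the stand-in DRAM buffer in main() are assumptions; point p at your pmem mapping to measure the persistent-memory path.
// gcc -O3 -mclwb best_case.c   (hypothetical file name)
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define TSC_GHZ 2.1   /* assumption: replace with your CPU's measured TSC frequency */

/* Time one store + clwb + sfence at a time and keep the minimum (best case). */
double best_case_ns(volatile int64_t *p, int iters)
{
    uint64_t best = UINT64_MAX;
    unsigned aux;
    for (int i = 0; i < iters; i++) {
        (void)p[0];                  /* warm the line so the timed store hits cache */
        _mm_mfence();
        uint64_t t0 = __rdtscp(&aux);
        p[0] = 0x2222;               /* one 8-byte store */
        _mm_clwb((const void *)p);   /* flush (evicts on Cascade Lake) */
        _mm_sfence();
        uint64_t t1 = __rdtscp(&aux);
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best / TSC_GHZ;           /* cycles -> ns under the TSC_GHZ assumption */
}

int main(void)
{
    /* Stand-in DRAM buffer; point this at a pmem mapping (as in the question's
       code) to measure the persistent-memory path instead of DRAM. */
    static int64_t line[8] __attribute__((aligned(64)));
    printf("best-case store+clwb+sfence: %.1f ns\n", best_case_ns(line, 100000));
    return 0;
}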
Appended on 4.14:
Q: Are there tools to detect a possible bottleneck in the WPQ or buffers?
A: You can get a baseline when the PM is idle, and use this baseline to indicate a possible bottleneck.
Tools:
Intel Memory Bandwidth Monitoring
Read two hardware counters from the performance monitoring unit (PMU) in the processor: 1) UNC_M_PMM_WPQ_OCCUPANCY.ALL, which counts the accumulated number of WPQ entries in each cycle, and 2) UNC_M_PMM_WPQ_INSERTS, which counts how many entries have been inserted into the WPQ. Then calculate the queueing delay of the WPQ: UNC_M_PMM_WPQ_OCCUPANCY.ALL / UNC_M_PMM_WPQ_INSERTS. [2]
[1] Chen, Youmin, et al. "Flatstore: An efficient log-structured key-value storage engine for persistent memory." Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 2020.
[2] Imamura, Satoshi, and Eiji Yoshida. “The analysis of inter-process interference on a hybrid memory system.” Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops. 2020.
https://www.usenix.org/system/files/fast20-yang.pdf describes what they're measuring: the CPU side of doing one store + clwb + mfence for a cached write (see footnote 1). So it is the CPU-pipeline latency of getting a store "accepted" into something persistent.
This isn't the same thing as making it all the way to the Optane chips themselves; the Write Pending Queue (WPQ) of the memory controllers is part of the persistence domain on Cascade Lake Intel CPUs like yours (wikichip quotes an Intel diagram showing this).
Footnote 1: Also note that clwb on Cascade Lake works like clflushopt: it just evicts. So store + clwb + mfence in a loop would test the cache-cold case if you don't do something to load the line before the timed interval. (From the paper's description, I think they do.) Future CPUs will hopefully properly support clwb, but at least CSL got the instruction supported, so future libraries won't have to check CPU features before using it.
You're doing many stores, which will fill up any buffers in the memory controller or elsewhere in the memory hierarchy. So you're measuring the throughput of a loop, not the latency of one store plus mfence in a previously idle CPU pipeline.
Separate from that, rewriting the same line repeatedly seems to be slower than sequential writes, for example. This Intel forum post reports "higher latency" for "flushing a cacheline repeatedly" than for flushing different cache lines. (The controller inside the DIMM does do wear leveling, by the way.)
Fun fact: later generations of Intel CPUs (perhaps CPL or ICX) will have even the caches (L3?) in the persistence domain, hopefully making clwb even cheaper. I don't know whether that would affect back-to-back movnti throughput to the same location, though, or even clflushopt.
During the test, did the Write Pending Queue of the CPU's iMC or the WC buffer in the Optane PM become the bottleneck, causing blockage, so that the measured latency is inaccurate?
Yes, that would be my guess.
If this is the case, is there a tool to detect it?
I don't know, sorry.

Snort not showing blocked/dropped packets

I'm trying to detect ping flood attacks with Snort. I have included the rule
(drop icmp any any -> any any (itype:8; threshold, track by_src, count 20, seconds; msg:"Ping flood attack detected"; sid:100121))
in Snort's ddos.rule file.
I'm attacking using the command
hping3 -1 --fast
The ping statistics on the attacking machine say
100% packet loss
However, the Snort action stats show the verdict as
Block -> 0
Why is this happening?
A few things to note:
1) This rule is missing the value for seconds. You need to specify a timeout value; you currently have "seconds;" but you need something like "seconds 5;". Since this is not valid, I'm not sure when Snort will actually generate an alert, which means it may just be dropping all of the ICMP packets but not generating any alerts.
2) This rule is going to drop EVERY ICMP packet with itype 8. The threshold only specifies when to alert, not when to drop. So this is going to drop all packets that match and then generate one alert per 20 that it drops. See the manual on rule thresholds here.
3) If you do not have Snort configured in inline mode, you will not be able to actually block any packets. See more information about the three different modes here.
If you just want to detect and drop ping floods, you should probably use the detection_filter option instead of threshold. If you want to allow legitimate pings and drop only ping floods, you do not want to use threshold, because as written the rule will block all ICMP itype 8 packets. With detection_filter you can write a rule so that if Snort sees 20 pings in 5 seconds from the same source host, it drops them. Here is an example of what your rule might look like:
drop icmp any any -> any any (itype:8; detection_filter:track by_src, count 20, seconds 5; sid:100121)
If Snort sees 20 pings from the same source host within 5 seconds of each other, it will then drop and generate an alert. See the Snort manual for detection filters here.
With this configuration, you can allow legitimate pings on the network and block ping floods from the same source host.

How much memory does an inet stream socket use in Node.js?

Of course data can be buffered and grow if the client is too slow to read the server's writes [1].
But what is the default buffer size? I assume it's whatever is configured in /proc/sys/net/ipv4/tcp_rmem and tcp_wmem (assuming Linux)...
I'm trying to do some basic capacity planning. If I have a VPS with 512 MB RAM, and I assume the OS et al will use ~ 100MB, my app has ~ 400MB for whatever it wants to do. If each connected client (regular old TCP/IP socket) requires say 8KB (4KB read, 4KB write) by default, I have capacity for 400MB / 8KB = ~ 50000 clients.
[1] http://nodejs.org/docs/v0.4.7/api/all.html#socket.bufferSize
I don't know off the top of my head, and it probably varies from platform to platform, but here's how you can find out!
Use this code:
var net = require('net');
net.createServer(function (socket) {
    socket.on('data', function (data) {
        console.log('chunk length: ' + data.length);
    });
}).listen(function () {
    console.log("Server listening on %j", this.address());
});
Then cat a large file (like an ISO) through 'nc localhost $port', using the port number that the script prints when it starts up, and watch the output to see what the largest chunk size is. On my OS X machine, the largest buffer looks to be 40960 bytes, but it might be different on yours.
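If you want a ballpark for the kernel's default per-socket buffer sizes on your own Linux box (the tcp_rmem/tcp_wmem defaults the question mentions), here is a small C sketch of mine, purely illustrative:
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    /* A freshly created, unconnected TCP socket reports the kernel's defaults. */
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int rcv = 0, snd = 0;
    socklen_t len = sizeof(rcv);

    getsockopt(s, SOL_SOCKET, SO_RCVBUF, &rcv, &len);
    len = sizeof(snd);
    getsockopt(s, SOL_SOCKET, SO_SNDBUF, &snd, &len);

    printf("default SO_RCVBUF: %d bytes, SO_SNDBUF: %d bytes\n", rcv, snd);
    return 0;
}
Multiplying those two numbers by the expected client count gives a rough upper bound for capacity planning, though the kernel only fills the buffers as data actually queues.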

What is the max size of AF_UNIX datagram message in Linux?

Currently I'm hitting a hard limit of 130688 bytes. If I try to send anything larger in one message, I get an ENOBUFS error.
I have checked the net.core.rmem_default, net.core.wmem_default, net.core.rmem_max, net.core.wmem_max, and net.unix.max_dgram_qlen sysctl options and increased them all, but they have no effect, because they deal with the total buffer size rather than the message size.
I have also set the SO_SNDBUF and SO_RCVBUF socket options, but this has the same issue as above. The default socket buffer sizes are set based on those defaults anyway.
I've looked at the kernel source where ENOBUFS is returned in the socket stack, but it wasn't clear to me where it was coming from. The only places that seem to return this error have to do with failing to allocate memory.
Is the max size actually 130688? If not, can this be changed without recompiling the kernel?
AF_UNIX SOCK_DGRAM/SOCK_SEQPACKET datagrams need contiguous memory. Contiguous physical memory is hard to find, so the allocation fails, logging something similar to this in the kernel log:
udgc: page allocation failure. order:7, mode:0x44d0
[...snip...]
DMA: 185*4kB 69*8kB 34*16kB 27*32kB 11*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3788kB
Normal: 13*4kB 6*8kB 100*16kB 62*32kB 24*64kB 10*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 7012kB
[...snip...]
unix_dgram_sendmsg() calls sock_alloc_send_skb() [lxr1], which calls sock_alloc_send_pskb() with data_len = 0 and header_len = size of datagram [lxr2]. sock_alloc_send_pskb() allocates header_len bytes from "normal" skbuff buffer space and data_len from scatter/gather pages [lxr3]. So it looks like AF_UNIX sockets don't support scatter/gather on current Linux.
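As a quick, hedged way to probe the cutoff on a given kernel (my own sketch, not part of the original answer): send steadily larger datagrams over an unnamed AF_UNIX/SOCK_DGRAM socketpair until the kernel refuses one. The exact error (ENOBUFS vs EMSGSIZE) and the size at which it appears depend on SO_SNDBUF and on whether a large enough contiguous allocation succeeds.
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

/* Probe the largest AF_UNIX SOCK_DGRAM payload the kernel will accept. */
int main(void)
{
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) != 0) {
        perror("socketpair");
        return 1;
    }

    size_t size = 1024;
    char *buf = malloc(1 << 20);        /* 1 MiB scratch buffer */
    memset(buf, 'x', 1 << 20);

    while (size <= (1 << 20)) {
        ssize_t n = send(fds[0], buf, size, MSG_DONTWAIT);
        if (n < 0) {
            printf("send of %zu bytes failed: %s\n", size, strerror(errno));
            break;
        }
        /* Drain the datagram so the send buffer doesn't fill up. */
        recv(fds[1], buf, 1 << 20, 0);
        printf("send of %zu bytes OK\n", size);
        size += 1024;
    }
    free(buf);
    return 0;
}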

Getting the IO count

I am using the Xen hypervisor. I am trying to get the I/O counts of the VMs running on top of Xen. Can someone suggest a way or a tool to get these counts? I tried using xenmon and virt-top; virt-top doesn't give any value and xenmon always shows 0. Any suggestions for getting the number of read or write calls made by a VM, or the read/write (block I/O) bandwidth of a particular VM? Thanks!
Regards,
Sethu
You can read this directly from sysfs on most systems. You want to look in the following directory:
/sys/devices/xen-backend
and look for directories starting with vbd-.
The naming convention is:
vbd-{domain_id}-{vbd_id}/statistics
Inside, you'll find what you need, which is:
br_req - Number of barrier requests
oo_req - Number of 'out of' requests (no room left in the list to service a given request)
rd_req - Number of read requests
rd_sect - Number of sectors read
wr_sect - Number of sectors written
The br_req value will be an aggregate count of things like write barriers, aborts, etc.
Note: for this to work, the kernel has to be told to export Xen attributes via sysfs, but most Xen packages have this enabled. Additionally, the location in sysfs might be different in earlier versions of Xen.
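If you would rather poll these counters from a program than cat them by hand, a small C sketch along these lines could work; the domain id and vbd id in the path are placeholders you would replace with the directories you actually see under /sys/devices/xen-backend:
#include <stdio.h>

/* Read a single integer counter from a sysfs statistics file. */
static long read_counter(const char *path)
{
    long value = -1;
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &value) != 1)
            value = -1;
        fclose(f);
    }
    return value;
}

int main(void)
{
    /* Placeholder ids: substitute the vbd-{domain_id}-{vbd_id} directory you see. */
    const char *base = "/sys/devices/xen-backend/vbd-1-51712/statistics";
    char path[256];

    snprintf(path, sizeof(path), "%s/rd_req", base);
    printf("read requests:   %ld\n", read_counter(path));

    snprintf(path, sizeof(path), "%s/wr_sect", base);
    printf("sectors written: %ld\n", read_counter(path));
    return 0;
}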
Have you tried xentop?
There is also bwm-ng (check your distro). It shows block utilization per disk (real or virtual). If you know the name of the virtual disk attached to the VM, you can use bwm-ng to get those stats.