Best way to generate a million TCP connections - sockets

I need to find the best way to generate a million TCP connections (more is better, fewer is worse), as quickly as the machine can manage. :D
Why do I need this? I am testing a NAT, and I want to load it with as many entries as possible.
My current method is to create a subnet on a dummy Ethernet interface and serially connect from that dummy interface through the real interface, across the LAN and the NAT, to the host:
subnetnicfake----routeToRealEth----RealEth---cable---lan----nat---host.
|<-------------on my machine-------------------->|

One million simultaneous TCP sessions might be difficult: if you rely on the standard connect(2) sockets API to create the connections, you're going to use a lot of physical memory: each session requires a struct inet_sock, which includes a struct sock, which includes a struct sock_common.
I quickly guessed at sizes: struct sock_common requires roughly 58 bytes, struct sock roughly 278 bytes, and struct inet_sock roughly 70 bytes.
That's roughly 406 bytes per session, or about 387 megabytes for a million sessions, before you have receive and send buffers. (See tcp_mem, tcp_rmem, tcp_wmem in tcp(7) for some information.)
If you choose to go this route, I'd suggest setting the per-socket memory controls as low as they go. I wouldn't be surprised if 4096 is the lowest you can set it. (SK_MEM_QUANTUM is PAGE_SIZE, stored into sysctl_tcp_rmem[0] and sysctl_tcp_wmem[0].)
That's another eight gigabytes of memory -- four for receive buffers, four for send buffers.
And that's in addition to what the system requires for your programs to open one million file descriptors. (See /proc/sys/fs/file-max in proc(5).)
All of this memory is not swappable -- the kernel pins its memory -- so you're really only approaching this problem on a 64-bit machine with at least eight gigabytes of memory. Probably 10-12 GB would do better.
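
If you do go the connect(2) route, a minimal Python sketch of the client side might look like the following (the subnet, destination address, and port are placeholders); it requests the smallest buffers the kernel allows and cycles through source addresses on the dummy interface so a single source IP's ~64k ephemeral ports are not the bottleneck. It assumes the file-descriptor limits above have already been raised.

    import socket

    # Hedged sketch: open many client connections with minimal socket buffers.
    # The addresses below are made up; substitute your dummy-interface subnet
    # and the host behind the NAT.
    SRC_ADDRS = [f"10.99.{i}.2" for i in range(1, 255)]   # addresses on the dummy interface
    DST_ADDR, DST_PORT = "192.168.1.10", 8080

    conns = []
    for i in range(1_000_000):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)  # ask for tiny buffers;
        s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)  # the kernel clamps to its floor
        s.bind((SRC_ADDRS[i % len(SRC_ADDRS)], 0))               # 0 = pick an ephemeral port
        s.connect((DST_ADDR, DST_PORT))
        conns.append(s)                                          # keep a reference so it stays open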
One approach taken by the Paketto Keiretsu tools is to open a raw socket, perform all the TCP three-way handshakes over that single raw socket, and compute whatever is needed rather than store it, in order to handle much larger amounts of data than usual. Try to store as little as possible for each connection, and don't use naive lists or trees of structures.
The Paketto Keiretsu tools were last updated around 2003, so they still might not scale into the million range well, but they would definitely be my starting point if this were my problem to solve.
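
This is not the Paketto code itself, but a hedged sketch of the same idea in Python using scapy (an assumption on my part): drive the three-way handshake from user space so the kernel never allocates a socket per connection. You need root, and typically a firewall rule that drops outbound RSTs, because the kernel will otherwise reset handshakes it knows nothing about. The addresses and ports are placeholders.

    from scapy.all import IP, TCP, sr1, send

    def raw_handshake(src_ip, sport, dst_ip, dport):
        """Complete one TCP three-way handshake without a kernel socket."""
        ip = IP(src=src_ip, dst=dst_ip)
        syn = TCP(sport=sport, dport=dport, flags="S", seq=1000)
        synack = sr1(ip / syn, timeout=1, verbose=0)      # send SYN, wait for SYN-ACK
        if synack is None or not synack.haslayer(TCP):
            return False
        ack = TCP(sport=sport, dport=dport, flags="A",
                  seq=synack.ack, ack=synack[TCP].seq + 1)
        send(ip / ack, verbose=0)                         # final ACK; the NAT entry now exists
        return True

    # Walk source ports (and, if needed, source addresses) to create many NAT entries.
    for sport in range(1024, 2024):
        raw_handshake("10.99.0.2", sport, "192.168.1.10", 8080)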

After searching for many days, I found the answer. Apparently this problem has been thought about a great deal, as it should be, since it is so fundamental. The trouble was that I didn't know what the problem was called. Among those in the know, it is apparently called the C10k problem; what I want is the C1M version. There does, however, seem to have been some effort toward C500k, i.e. 500,000 concurrent connections.
http://www.kegel.com/c10k.html AND
http://urbanairship.com/blog/2010/09/29/linux-kernel-tuning-for-c500k/
@deadalnix: read the links above and enlighten yourself.

Have you tried using tcpreplay? You could prepare - or capture - one or more PCAP network capture files with the traffic that you need, and have one or more instances of tcpreplay replay them to stress-test your firewall/NAT.

As long as you have only 65,536 ports available in TCP, this is impossible to achieve unless you have an army of servers to connect to.
So, then, what is the best way? Just open as many connections as you can against your servers and see what happens.

What if my mmap virtual memory exceeds my computer’s RAM?

Background and Use Case
I have around 30 GB of data that never changes: specifically, every dictionary of every language.
A client requests the definition of a word, and I simply respond with it.
On every request I have to run some search algorithm of my choice so that I don't have to loop through the more than two hundred million words stored in my .txt file.
If I opened the .txt file and read through it to find the word, it would take forever because of the size of the file (and even breaking it into smaller files is neither feasible nor what I want to do).
I came across the concept of mmap, mentioned to me as a possible solution to my problem by a very kind gentleman on discord.
Problem
As I was learning about mmap, I came across the fact that mmap does not load the data into RAM up front but rather maps it into virtual memory. Regardless of which it is, my server or Docker instances may have no more than 64 GB of RAM, and having that chunk of data take 30 of them is quite painful; it makes me feel like there must be a better alternative. Even in the worst case, if my server or Docker container does not have enough RAM for the data behind the mmap, then the approach is not feasible, unless I am wrong about how this works, which is why I am asking this question.
Questions
Is there a better solution than mmap for my use case?
If I access such a large amount of data through mmap (so I don't have to open and read the file on every request), will RAM be allocated for however much of the file I am accessing?
Lastly, if anything I have stated so far is wrong, please do correct me, as I am still learning a lot about mmap.
Requirements For My Specific Use Case
I may get a request from one client containing tens of words that I have to look up, so I need to be able to retrieve lots of data from the .txt file efficiently.
The response to the client has to be as quick as possible, the quicker the better; ideally under three seconds, or, if that is impossible, as quick as it can be.
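
A minimal sketch of what memory-mapping the file looks like in Python is below (the file name and the word<TAB>definition line layout are assumptions). The mapping consumes virtual address space, but physical RAM is only used for the pages you actually touch, and the kernel can evict those pages under memory pressure because they are backed by the file, so a 30 GB mapping does not need 30 GB of RAM.

    import mmap

    # Hedged sketch: map the dictionary read-only and search it without loading
    # the whole file. Only the pages that are touched get paged into RAM.
    with open("dictionary.txt", "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

        def lookup(word):
            needle = b"\n" + word.encode() + b"\t"
            pos = mm.find(needle)            # linear scan; a sorted file plus binary
            if pos == -1:                    # search, or a prebuilt offset index,
                return None                  # would be far faster than find()
            start = pos + 1
            end = mm.find(b"\n", start)
            if end == -1:
                end = len(mm)
            return mm[start:end]

        print(lookup("serendipity"))
        mm.close()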

What is the fastest way to react to UDP datagrams in Python?

I am writing a specific application which must render an image on-screen based on data arriving over UDP, with the lowest latency possible. Overall program design doesn't matter; code cleanliness and maintainability don't matter either. I need latencies of 1 ms or below for processing the datagram and calling a callback which flips the video buffer.
Right now, I am considering the following approaches: socket.socket().recvfrom(), selectors.DefaultSelector().register(), asyncore.dispatcher, multiprocessing.Process, concurrent.futures.ProcessPoolExecutor().submit(), twisted.internet.protocol.ConnectedDatagramProtocol
While socket.recvfrom() is the simplest approach, it implies a while-True loop underneath, I believe; it is blocking and doesn't allow listening for datagrams that were sent before recvfrom() was called (i.e. it doesn't have a buffer), and my application is going to receive datagrams at a 1500 Hz rate.
Is twisted framework fast enough? Is it select-based or callback-based?
My personal preference would be to use ProcessPoolExecutor, although I suspect having to call a callback wastes some fraction of a millisecond, while procedural single-threaded code is the fastest.
I also want to avoid function call overhead, therefore cannot afford calling some sort of a callback for every datagram I receive.
Which should I choose?
This question is not really answerable as asked.
Most likely, handling UDP datagrams is not going to be the limiting factor in this application. At the very least, you're probably going to need to authenticate or at least error-check this data, which means you'll need to do some cryptography.
You'll need to implement something and measure in the environments you care about. An answer that might be correct for one version of linux for one profile of data might be completely wrong on a slightly different platform with slightly different input data characteristics. All of the approaches you mentioned are probably fine performance-wise, so I'd focus on something that allows you to organize your code in a reasonable way and then only switch strategies if it doesn't meet your performance budget.
To answer your more specific question here, Twisted itself is based around multiplexors (we try to avoid select which has performance and scalability issues, but yes, select is one mechanism it can use).
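
For reference, the simplest variant looks like the sketch below: a plain blocking recvfrom() loop (the port and the handle() callback are placeholders). Datagrams that arrive while you are still processing the previous one are queued in the kernel's socket receive buffer once the socket is bound, so a tight loop like this only drops packets if that buffer overflows.

    import socket

    def handle(payload):
        pass                          # placeholder for the videobuffer-flip callback

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1 << 20)   # enlarge the kernel queue
    sock.bind(("0.0.0.0", 9999))

    while True:
        data, addr = sock.recvfrom(2048)   # blocks; returns as soon as a datagram is queued
        handle(data)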

What is the Riak per-key overhead using the Bitcask backend?

It's a simple question with apparently a multitude of answers.
Findings have ranged anywhere from:
a. 22 bytes as per Basho's documentation:
http://docs.basho.com/riak/latest/references/appendices/Bitcask-Capacity-Planning/
b. ~450 bytes over here:
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-August/005178.html
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-May/004292.html
c. And anecdotal records that state overheads anywhere in the range of 45 to 200 bytes.
Why isn't there a straight answer to this? I understand it's an intricate problem - one of the mailing list entries above makes it clear! - but is even coming up with a consistent ballpark so difficult? Why isn't Basho's documentation clear about this?
I have another set of problems related to how I am to structure my logic based on the key overhead (storing lots of small values versus "collecting" them in larger structures), but I guess that's another question.
The static overhead is stated on our capacity planner as 22 bytes because that's the size of the C struct. As noted on that page, the capacity planner is simply providing a rough estimate for sizing.
The old post on the mailing list by Nico that you link to is probably the best complete accounting of Bitcask internals you will find, and it is accurate. Figuring in the 8 bytes for a pointer to the entry and the 13 bytes of Erlang overhead on the bucket/key pair, you arrive at 43 bytes per key on a 64-bit system.
As for there not being a straight answer: actually asking us (via email, the mailing list, IRC, carrier pigeon, etc.) will always produce an actual answer.
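
To make the arithmetic concrete, here is a hedged back-of-the-envelope sketch in Python. The ~43 bytes of fixed per-key overhead comes from the accounting above; the key count and bucket/key size are made-up inputs.

    # Rough Bitcask keydir RAM estimate (illustrative only).
    def bitcask_ram_estimate(num_keys, avg_bucket_key_bytes, fixed_overhead=43):
        return num_keys * (fixed_overhead + avg_bucket_key_bytes)

    # e.g. 100 million keys with ~36 bytes of bucket + key data each:
    print(bitcask_ram_estimate(100_000_000, 36) / 2**30, "GiB")   # roughly 7.4 GiB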
Bitcask requires all keys to be held in memory. As far as I can see the overhead referenced in a) is the one to be used when estimating the total amount of RAM bitcask will require across the cluster due to this requirement.
When writing data to disk, Riak stores the actual value together with various metadata, e.g. the vector clock. The post mentioning 450 bytes listed in b) appears to be an estimate of the storage overhead on disk and would therefore probably apply also to other backends.
Nico's post seems to contain a good and accurate explanation.

Why can't DBMSes rely on the OS buffer pool?

Stonebraker's paper (Operating System Support for Database Management) explains that "the overhead to fetch a block from the buffer pool manager usually includes that of a system call and a core-to-core move." Forget about the buffer-replacement strategy, etc. The only point I question is the quoted one.
My understanding is that when a DBMS wants to read block x it issues an ordinary read request; there should be no difference from any other application requesting a read.
I'm not looking for generic answers (I got them, and read papers). I seek a detailed answer of the described problem.
See Does a file read from a Java application invoke a system call?
Reading from your other question, and working forward:
When the DBMS must bring a page in from disk, it will involve at least one system call. At this point most DBMSs place the page into their own buffer. (They also end up in the OS's buffer, but that's unimportant.)
So, we have one system call. However, we can avoid any further system calls. This is possible because the DBMS is caching pages in its own memory space. The first thing the DBMS will do when it decides it needs a page is check and see if it has it in its cache. If it does, it retrieves it from there without ever invoking a system call.
The DBMS is free to expire pages in its cache in whatever way is most beneficial for its IO needs. The OS's cache is expired in a more general way since the OS has other things to worry about. One example of this is that a DBMS will typically use a great deal of memory to cache pages as it knows that disk IO is one of the most expensive things it can do. The OS won't do this as it has to balance the cost of disk IO against having memory for other applications to use.
The operating system disk i/o must be generalised to work for a variety of situations. The DBMS can sometimes gain significant performance using less general code that is optimised to its own needs.
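
As an illustration of that cache-first logic, here is a minimal, hedged sketch of a user-space page cache in Python (the 8 KiB page size and the use of a plain pread are assumptions; a real buffer pool also handles dirty pages, pinning, and concurrency). A cache hit is served without any system call; only a miss pays for one.

    import os
    from collections import OrderedDict

    PAGE_SIZE = 8192   # assumed page size

    class BufferPool:
        def __init__(self, fd, capacity_pages):
            self.fd = fd
            self.capacity = capacity_pages
            self.cache = OrderedDict()          # page_no -> bytes, in LRU order

        def get_page(self, page_no):
            if page_no in self.cache:           # hit: no system call at all
                self.cache.move_to_end(page_no)
                return self.cache[page_no]
            data = os.pread(self.fd, PAGE_SIZE, page_no * PAGE_SIZE)   # miss: one syscall
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict the least recently used page
            self.cache[page_no] = data
            return data

    # Usage sketch (file name is a placeholder):
    # fd = os.open("table.db", os.O_RDONLY)
    # pool = BufferPool(fd, capacity_pages=1024)
    # page = pool.get_page(42)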
The DBMS does its own caching, so doesn't want to work through the O/S caching. It "owns" the patch of disk, so it doesn't need to worry about sharing with other processes.
Update
The link to the paper is a help.
Firstly, the paper is almost thirty years old and refers to long-obsolete hardware. Notwithstanding that, it makes quite interesting reading.
Secondly, understand that disk I/O is a layered process. It was in 1981 and is even more so now. At the lowest level, a device driver issues physical read/write instructions to the hardware. Above that may be the o/s kernel code, then the o/s user-space code, then the application. Between a C program's fread() and the disk heads moving there are at least three or four levels, and there might be considerably more. A DBMS seeking to improve performance might bypass some of those layers and talk directly to the kernel, or even lower.
I recall some years ago installing Oracle on a Sun box. It had an option to dedicate a disk as a "raw" partition, where Oracle would format the disk in its own manner and then talk straight to the device driver. The O/S had no access to the disk at all.
It's mainly a performance issue. A dbms has highly specific and unusual I/O demands.
The OS may have any number of processes doing I/O and filling its buffers with the assorted cached data that this produces.
And of course there is the issue of size and of what gets cached (a dbms may be able to perform caching better suited to its needs than the more generic device buffer caching).
And then there is the issue that a generic “block” may in fact amount to a considerably larger I/O burden (this depends on partitioning and such like) than what a dbms ideally would like to bear; its own cache may be tuned to work better with the layout of the data on the disk and thereby able to minimise I/O.
A further thing is the issue of indexes and similar means to speed up queries, which of course works rather better if the cache actually knows what these mean in the first place.
The real issue is that the file buffer cache is not in the filesystem used by the DBMS; it's in the kernel and shared by all of the filesystems resident in the system. Any memory read out of the kernel must be copied into user space: this is the core-to-core move you read about.
Beyond this, some other reasons you can't rely on the system buffer pool:
A DBMS often has a very good idea of its upcoming access patterns, but it can't communicate those patterns to the kernel. This can lead to lower performance.
The buffer cache is traditionally stored in a fixed-size kernel memory range, so it cannot grow or shrink. That also means the cache is much smaller than main memory, so by relying on the buffer cache a DBMS would be unable to take full advantage of the system's resources.
I know this is old, but it came up as unanswered.
Essentially:
The OS uses a separate address space for every process.
Retrieving information from any other address space requires a system call or page fault. **(see below)
The DBMS is a process with its own address space.
The OS buffer pool Stonebraker describes is in the kernel address space.
So ... to get data from the kernel address space to the DBMS's address space, a system call or page fault is unavoidable.
You're correct that accessing data from the OS buffer pool manager is no more expensive than a normal read() call. (In fact, it's done with a normal read call.) However, Stonebraker is not talking about that. He's specifically discussing the caching needs of DBMSes, after the data has been read from the disk and is present in RAM.
In essence, he's saying that the OS's buffer pool cache is too slow for the DBMS to use because it's stored in a different address space. He's suggesting using a local cache in the same process (and therefore same address space), which can give you a significant speedup for applications like DBMSes which hit the cache heavily, because it will eliminate that syscall overhead.
Here's the exact paragraph where he discusses using a local cache in the same process:
"However, many DBMSs including INGRES [20] and System R [4] choose to put a DBMS managed buffer pool in user space to reduce overhead. Hence, each of these systems has gone to the trouble of constructing its own buffer pool manager to enhance performance."
He also mentions multi-core issues in the excerpt you quote above. Similar effects apply here, because if you can have just one cache per core, you may be able to avoid the slowdowns from CPU cache flushes when multiple CPUs are reading and writing the same data.
** BTW, I believe Stonebraker's 1981 paper is actually pre-mmap. He mentions it as future work. "The trend toward providing the file system as a part of shared virtual memory (e.g., Pilot [16]) may provide a solution to this problem."

How to efficiently process 300+ Files concurrently in scala

I'm going to work on comparing around 300 binary files, 4 MB each, byte by byte, using Scala. However, judging from what I've already done, processing 15 files at the same time using java.io.BufferedInputStream took me around 90 seconds on my machine, so I don't think my solution would scale well to a large number of files.
Ideas and suggestions are highly appreciated.
EDIT: The actual task is not just comparing the differences but processing those files in the same sequential order. Let's say I have to look at the i-th byte in every file at the same time, then move on to byte (i+1).
Did you notice your hard drive slowly evaporating as you read the files? Reading that many files in parallel is not something mechanical hard drives are designed to do at full-speed.
If the files will always be this small (4MB is plenty small enough), I would read the entire first file into memory, and then compare each file with it in series.
I can't comment on solid-state drives, as I have no first-hand experience with their performance.
You are quite screwed, indeed.
Let's see... 300 * 4 MB = 1.2 GB. Does that fit your memory budget? If it does, by all means read them all into memory. But, to speed things up, you might try the following:
Read 512 KB of every file, sequentially. You might try reading from 2 to 8 at the same time -- perhaps through Futures, and see how well it scales. Depending on your I/O system, you may gain some speed by reading a few files at the same time, but I do not expect it to scale much. EXPERIMENT! BENCHMARK!
Process those 512 KB using Futures.
Go back to step 1, unless you are finished with the files.
Get the result back from the processing Futures.
On step number 1, by limiting the parallel reads you avoid thrashing your I/O subsystem. Push it as far as you can, maybe a bit less than that, but definitely not more than that.
By not reading the files in full on step number 1, you spend some of the time that would otherwise go to reading them doing useful CPU work. You may experiment with lowering the number of bytes read in step 1 as well; a sketch of the overall shape follows below.
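
Here is a hedged sketch of that read-then-process loop, written in Python purely to illustrate the shape of the pipeline (the chunk size, worker count, file_paths, and the compare_chunks step are placeholders); in Scala you would express the same structure with Futures as described above. It assumes the files are all the same length.

    from concurrent.futures import ThreadPoolExecutor

    CHUNK = 512 * 1024                       # 512 KB per file per round, as suggested above

    def compare_chunks(chunks):
        """Placeholder processing step: are all chunks in this round identical?"""
        return all(c == chunks[0] for c in chunks[1:])

    def process_all(file_paths, read_workers=4):
        results = []
        handles = [open(p, "rb") for p in file_paths]
        try:
            with ThreadPoolExecutor(max_workers=read_workers) as pool:
                while True:
                    # Step 1: read the next chunk of every file, a few files at a time.
                    chunks = list(pool.map(lambda f: f.read(CHUNK), handles))
                    if not chunks[0]:                    # files exhausted
                        break
                    # Step 2: do the CPU work on this round (could itself go to a pool).
                    results.append(compare_chunks(chunks))
        finally:
            for f in handles:
                f.close()
        return results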
Are the files exactly the same number of bytes? If they are not, the files can be compared simply via the File.length() method to determine a first-order guess of equality.
Of course you may be wanting to do a much deeper comparison than just "are these files the same?"
If you are just looking to see whether they are the same, I would suggest using a hashing algorithm like SHA-1 to see if they match.
Here is some java source to make that happen
Many large systems that handle data use SHA-1, including the NSA and Git.
It's simply more efficient to use a hash instead of a byte-by-byte compare. The hashes can also be stored for later, to see whether the data has been altered.
Here is a talk by Linus Torvalds specifically about Git; it also mentions why he uses SHA-1.
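
A hedged sketch of the hashing approach, shown in Python for brevity (file_paths is a placeholder for your ~300 files): hash each file in streamed chunks, then group files by digest; files sharing a digest are, for practical purposes, identical.

    import hashlib

    def sha1_of(path, chunk_size=1 << 20):
        """Stream the file through SHA-1 so memory use stays small."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    groups = {}
    for path in file_paths:                  # file_paths: the files to compare (placeholder)
        groups.setdefault(sha1_of(path), []).append(path)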
I would suggest using NIO if possible. Introduction To Java NIO and NIO2 seems like a decent guide to NIO if you are not familiar with it. I would not suggest reading a file and comparing it byte by byte, if that is what you are currently doing. You can create a ByteBuffer to read in chunks of data from a file and then do the comparisons from that.