Sending a large file over the network continuously - sockets

We need to write software that continuously (i.e. new data is sent as it becomes available) sends very large files (several TB) to several destinations simultaneously. Some destinations have a dedicated fiber connection to the source, while others do not.
Several questions arise:
We plan to use TCP sockets for this task. What failover procedure would you recommend in order to handle network outages and dropped connections?
What should happen upon upload completion: should the server close the socket? If so, then is it a good design decision to have another daemon provide file checksums on another port?
Could you recommend a method to handle corrupted files, aside from downloading them again? Perhaps I could break them into 10 MB chunks and calculate a checksum for each chunk separately?
Thanks.

Since no answers have been given, I'm sharing our own decisions here:
There is a separate daemon for providing checksums for chunks and whole files.
We have decided to abandon the idea of using multicast over VPN for now; we use a multi-process server to distribute the files. The socket is closed and the worker process exits as soon as the file download is complete; any corrupted chunks need to be downloaded separately.
We use a filesystem monitor to capture new data as soon as it arrives at the tier 1 distribution server.
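For reference, here is a minimal sketch of the per-chunk checksum idea, assuming zlib's crc32 and the 10 MB chunk size mentioned above (both the chunk size and the choice of CRC-32 are illustrative, not a spec of our daemon):

```c
/* Per-chunk checksums: read a file in fixed-size chunks and print a
 * CRC-32 for each chunk plus one for the whole file.
 * Illustrative only; chunk size and CRC-32 are assumptions.
 * Build with: cc chunksum.c -lz */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

#define CHUNK_SIZE (10 * 1024 * 1024)  /* 10 MB, as discussed above */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    unsigned char *buf = malloc(CHUNK_SIZE);
    if (!buf) { perror("malloc"); return 1; }

    uLong whole = crc32(0L, Z_NULL, 0);   /* running CRC of the whole file */
    size_t n;
    long chunk = 0;
    while ((n = fread(buf, 1, CHUNK_SIZE, f)) > 0) {
        uLong c = crc32(crc32(0L, Z_NULL, 0), buf, (uInt)n);
        whole = crc32(whole, buf, (uInt)n);
        printf("chunk %ld: %08lx\n", chunk++, c);
    }
    printf("whole file: %08lx\n", whole);

    free(buf);
    fclose(f);
    return 0;
}
```

A client that detects a whole-file mismatch can then ask the checksum daemon for the per-chunk values and re-download only the chunks that differ.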

Related

Sharing memory between processes on different computers

Can someone help me with sharing memory between three or more machines, each machine having its own copy of the memory, to speed up read operations?
For example, I first create a socket to communicate between these processes, but how can I make memory visible between the machines? I know how to make it visible on one machine.
EDIT: Maybe we should use a server machine to manage the shared memory read and write operations?
You cannot share memory across machine boundaries. You have to serialize the data being shared, such as with an IPC mechanism like a named pipe or a socket. Transmit the shared data to each machine, where it is then copied into that machine's own local memory. Any changes to the local memory have to be transmitted to the other machines so they have an updated local copy.
If you are having problems implementing that, then you need to show what you have actually attempted.
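A minimal sketch of that serialize-and-transmit approach (the update record format of offset, length, payload is invented for illustration): each machine keeps its own copy of the buffer, and a writer sends an update record to the other machines, which apply it to their local copies.

```c
/* Sketch: keeping a "shared" buffer in sync between machines over a
 * TCP socket. The (offset, length, payload) record format is an
 * assumption invented for this example, not a standard protocol. */
#include <stdint.h>
#include <unistd.h>      /* read / write */
#include <arpa/inet.h>   /* htonl / ntohl */

#define SHARED_SIZE 4096
static unsigned char local_copy[SHARED_SIZE];  /* this machine's copy */

/* read()/write() may transfer fewer bytes than asked; loop until done */
static int io_all(int fd, void *buf, size_t len, int writing)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = writing ? write(fd, (char *)buf + done, len - done)
                            : read(fd, (char *)buf + done, len - done);
        if (n <= 0) return -1;
        done += (size_t)n;
    }
    return 0;
}

/* Announce "bytes [off, off+len) changed" to one peer. */
int send_update(int sock, uint32_t off, uint32_t len)
{
    uint32_t hdr[2] = { htonl(off), htonl(len) };
    if (io_all(sock, hdr, sizeof hdr, 1) < 0) return -1;
    return io_all(sock, local_copy + off, len, 1);
}

/* Receive one update from a peer and apply it to our local copy. */
int recv_update(int sock)
{
    uint32_t hdr[2];
    if (io_all(sock, hdr, sizeof hdr, 0) < 0) return -1;
    uint32_t off = ntohl(hdr[0]), len = ntohl(hdr[1]);
    if (len > SHARED_SIZE || off > SHARED_SIZE - len)
        return -1;                         /* reject bogus updates */
    return io_all(sock, local_copy + off, len, 0);
}
```

The EDIT's idea of a coordinating server fits naturally here: clients send their writes only to the server, and the server fans the updates out to everyone, which sidesteps two machines writing the same region at once.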

Downloading multiple files vs. one big file & unzipping, via socket

I need my client to download 30 MB worth of files.
The setup is as follows:
They comprise 3000 small files.
They are downloaded through a TCP BSD socket.
They are stored in the client's DB as they are downloaded.
The server can keep all necessary files in memory (no file access on the server side).
I've not seen many cases where a client downloads such a large number of files, which I suspect is due to server-side file access.
I'm also worried whether the multiplexer (select/epoll) will be overwhelmed by excessive network request handling. (Do I need to worry about this?)
With the above suspicions, I zipped the 3000 files up into 30 files.
(The overall size doesn't change much because the files are already compressed (PNG).)
Testing shows:
Downloading 3000 files is 25% faster than downloading 30 files and unzipping them.
I suspect it's because the client device is unable to download while unzipping and inserting into the DB; I'm testing on handheld devices (iPhone).
(I've threaded the unzipping and DB operations separately from the networking code, but the DB operation seems to take over the whole system. I profiled a bit, and unzipping doesn't take long; DB insertion does. On the server side, files are zipped and placed in memory beforehand.)
I'm contemplating switching back to downloading 3000 files because it's faster for clients.
I wonder what experienced network people would say about the two strategies:
1. many small pieces of data
2. a small number of big pieces of data, plus unzipping
EDIT
For experienced iPhone developers: I'm threading out the DB operation using NSOperationQueue.
Does NSOperationQueue actually thread things out well?
I'm very suspicious of its performance.
-- I tried POSIX threads; no significant difference.
I'm answering my own question.
It turned out that inserting many images into the SQLite DB at once on the client takes a long time, and as a result network packets in transit are not consumed by the client fast enough.
http://www.sqlite.org/faq.html#q19
After I adopted the FAQ's suggestion to speed up the many inserts, the zipped approach actually outperforms the "download many files individually" strategy.
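For anyone hitting the same wall: the FAQ entry amounts to wrapping the inserts in a single transaction, so SQLite performs one disk sync for the whole batch instead of one per row. A minimal sketch in C (the table and column names are made up for illustration):

```c
/* Sketch: batching many inserts in one SQLite transaction.
 * Table/column names are illustrative. Error handling is abbreviated. */
#include <stdio.h>
#include <sqlite3.h>

int insert_blobs(sqlite3 *db, const void **blobs, const int *sizes, int count)
{
    sqlite3_stmt *stmt;
    /* Without BEGIN/COMMIT, each INSERT is its own transaction and
     * SQLite must sync to disk per row -- that is the slowdown. */
    sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
    sqlite3_prepare_v2(db, "INSERT INTO images(data) VALUES(?)", -1, &stmt, NULL);
    for (int i = 0; i < count; i++) {
        sqlite3_bind_blob(stmt, 1, blobs[i], sizes[i], SQLITE_STATIC);
        if (sqlite3_step(stmt) != SQLITE_DONE) {
            fprintf(stderr, "insert failed: %s\n", sqlite3_errmsg(db));
            sqlite3_finalize(stmt);
            sqlite3_exec(db, "ROLLBACK", NULL, NULL, NULL);
            return -1;
        }
        sqlite3_reset(stmt);
        sqlite3_clear_bindings(stmt);
    }
    sqlite3_finalize(stmt);
    sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);  /* one sync for all rows */
    return 0;
}
```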

Per-socket data transfer limitations. Are download managers still useful?

The idea behind breaking up a download into multiple segments with different ranges is to increase download speed. This works if the server has a per-connection limit. A server without that limitation theoretically serves the same bytes whether one or more connections are used.
My question is whether download managers still speed up downloading from such a server, or whether it's just useless effort. In other words, are there any per-TCP-connection limitations by default or not?
No, there are no limitations per socket. Most OSes will try to share the bandwidth equally between all sockets unless QoS is specified.
While a server could throttle bandwidth usage per connection, servers typically do not bother: if a response is big enough that throttling it would be effective, the impact on slower clients is about the same whether or not a fast download simply completes sooner.
Splitting a download into pieces may actually hurt your client's performance because of the way TCP operates: it has a "slow start" mechanism that reduces throughput on new connections.
Websites that implement throttling will typically do so between their various virtual hosts (so that the download site doesn't starve out a more interactive one) or will do so based on the remote IP address.
By far the primary benefit of a download manager is that it will simply continue the download if the connection gets broken.
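That resumption typically works via the HTTP Range header: after a broken connection the client requests only the bytes it doesn't already have, and a server that supports ranges answers with 206 Partial Content. A sketch of such a request, assuming an already-connected socket (the host and path are placeholders):

```c
/* Sketch: resuming an HTTP download with a Range request.
 * Assumes `sock` is already connected; example.com and /big.iso are
 * placeholders. A server that supports ranges answers "206 Partial
 * Content"; one that doesn't answers "200 OK" with the whole body. */
#include <stdio.h>
#include <unistd.h>

int request_resume(int sock, long bytes_already_have)
{
    char req[512];
    int len = snprintf(req, sizeof req,
        "GET /big.iso HTTP/1.1\r\n"
        "Host: example.com\r\n"
        "Range: bytes=%ld-\r\n"     /* everything from this offset on */
        "Connection: close\r\n"
        "\r\n",
        bytes_already_have);
    return write(sock, req, len) == len ? 0 : -1;
}
```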

Append-only server performance

I'm building a small web server for learning purposes.
For each incoming POST request I'm planning to append the content to a file.
I'm using ZeroMQ sockets for communicating with the file-append process. Do I need to take special care with the file operations (fopen, fseek)?
Considering a typical Amazon EC2 instance, and given that each request is at most 1 KB, how many file-append operations per second can my server handle?
Thanks!
The basic concerns apply: what happens if multiple processes run and receive messages? What happens if you run out of disk space, or a write fails?
Are you after synchronous writes to disk, or is buffered writing, with its potential for log corruption, acceptable? fopen and friends are buffered; consider open and friends for unbuffered writes.
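As a sketch of that distinction, here is what such a file-append worker might look like, assuming a ZeroMQ PULL socket as described in the question (the endpoint and file name are placeholders). It uses open/write, so data is handed to the kernel immediately; the commented-out fsync is what full synchronous durability would require, at a significant cost in throughput:

```c
/* Sketch: append-only worker fed by a ZeroMQ PULL socket.
 * Endpoint and file name are placeholders. Unbuffered open/write as
 * discussed above; uncomment fsync() for synchronous durability. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <zmq.h>

int main(void)
{
    void *ctx = zmq_ctx_new();
    void *pull = zmq_socket(ctx, ZMQ_PULL);
    zmq_bind(pull, "tcp://*:5555");           /* placeholder endpoint */

    /* O_APPEND makes each write atomically position at end-of-file,
     * which also keeps multiple writer processes from interleaving
     * in the middle of a record. */
    int fd = open("append.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[1024];                           /* requests are <= 1 KB */
    for (;;) {
        int n = zmq_recv(pull, buf, sizeof buf, 0);
        if (n < 0) { perror("zmq_recv"); break; }
        if (n > (int)sizeof buf) n = sizeof buf;  /* zmq_recv truncates */
        if (write(fd, buf, n) != n) {         /* disk full? write error? */
            perror("write");
            break;
        }
        /* fsync(fd);  -- only if synchronous durability is required */
    }
    close(fd);
    zmq_close(pull);
    zmq_ctx_term(ctx);
    return 0;
}
```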
Performance is tied to whether you can batch writes, use buffering, or want synchronous writes to disk. I think Amazon provides some IOPS details; certainly other developers have published results:
http://www.thebitsource.com/featured-posts/rackspace-cloud-servers-versus-amazon-ec2-performance-analysis/
http://blog.dt.org/index.php/2010/06/amazon-ec2-io-performance-local-emphemeral-disks-vs-raid0-striped-ebs-volumes/
https://forums.aws.amazon.com/thread.jspa?messageID=132387

Serving a large file using select, epoll or kqueue

Nginx uses epoll or other multiplexing techniques (select) to handle multiple clients, i.e. it does not spawn a new thread for every request, unlike Apache.
I tried to replicate the same thing in my own test program using select. I could accept connections from multiple clients by creating a non-blocking socket and using select to decide which client to serve. My program simply echoes each client's data back to it. It works fine for small data transfers (a few bytes per client).
The problem occurs when I need to send a large file over a connection to a client. Since I have only one thread to serve all clients, I cannot resume serving the other clients until I have finished reading the file and writing it out to the socket.
Is there a known solution to this problem, or is it best to create a thread for every such request ?
When using select you should not send the whole file at once. If you use, for example, sendfile to do this, it will block until the whole file has been sent. Instead, use a small buffer and send a little data at a time to each client. Then use select to identify when the socket is ready to be written to again, and send some more, until all the data has been sent. This allows you to handle multiple clients in parallel.
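A sketch of that pattern (struct client is an invented per-connection record; a real server would also track the open file and refill the buffer in pieces): a client sits in the write fd_set only while it has pending data, and each wakeup writes whatever the socket will accept:

```c
/* Sketch: non-blocking partial writes driven by select().
 * `struct client` is an invented per-connection record. */
#include <sys/select.h>
#include <unistd.h>

struct client {
    int fd;
    const char *data;   /* bytes still to send */
    size_t len;         /* how many remain */
};

/* Called when select() reports c->fd writable: send what we can. */
static int service_write(struct client *c)
{
    ssize_t n = write(c->fd, c->data, c->len);
    if (n < 0) return -1;       /* EAGAIN etc.: caller decides */
    c->data += n;               /* advance past what was accepted */
    c->len  -= (size_t)n;
    return c->len == 0;         /* 1 when this client is finished */
}

/* One iteration of the event loop for `count` clients. */
static void loop_once(struct client *clients, int count)
{
    fd_set wfds;
    FD_ZERO(&wfds);
    int maxfd = -1;
    for (int i = 0; i < count; i++) {
        if (clients[i].len > 0) {            /* only while data pending */
            FD_SET(clients[i].fd, &wfds);
            if (clients[i].fd > maxfd) maxfd = clients[i].fd;
        }
    }
    if (maxfd < 0) return;                   /* nothing to send */
    if (select(maxfd + 1, NULL, &wfds, NULL, NULL) <= 0) return;
    for (int i = 0; i < count; i++)
        if (clients[i].len > 0 && FD_ISSET(clients[i].fd, &wfds))
            service_write(&clients[i]);
}
```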
The simplest approach is to create a thread per request, but it's certainly not the most scalable one. I think that at this time basically all high-performance web servers use various asynchronous approaches built on things like epoll (Linux), kqueue (BSD), or IOCP (Windows).
Since you don't provide any information about your performance requirements, and since all the non-threaded approaches require restructuring your application to use these often-complex asynchronous techniques (as described in the C10K article and others found from there), your best bet for now is just to use the threaded approach.
Please update your question with concrete performance requirements and other relevant data if you need more.
For background, this may be useful reading: http://www.kegel.com/c10k.html
I think you are using your callback to handle a single connection. That is not how select was designed to be used. Your callback has to handle the however-many-thousand connections you are planning to serve; i.e., from the file descriptor you get as a parameter, you have to know (by reading your global state) what to do with that client, whether that's read(), send(), or whatever.