I'm adding peer-to-peer bluetooth using GameKit to an iPhone shoot-em-up, so speed is vital. I'm sending about 40 messages a second each way, most of them with the faster GKSendDataUnreliable, all serializing with NSCoding. In testing between a 3G and 3GS, this is slowing the 3G down a lot more than I'd like. I'm wondering where I should concentrate my efforts to speed it up.
How much slower is GKSendDataReliable? For the few packets that have to get there, would it be faster to send a GKSendDataUnreliable and have the peer send an acknowledgement so I can send again if I don't get the Ack within, say, 100ms?
How much faster would it be to create the NSData instance using a regular C array rather than archiving with the NSCoding protocol? Is this serialization process (for about a dozen floats) just as slow as you'd expect from an object creation/deallocation overhead, or is something particularly slow happening?
I heard that (for example) sending four seperate sets of data is much, much slower, than sending one piece of data four times the size. Would I make a significant saving by sending separate packets of data that wouldn't always go together in the same packet when they happen at the same time?
Are there any other bluetooth performance secrets I've missed?
Thanks for your help.
I'm not a bluetooth expert, but in general sending data using reliable is 1.5x the speed of sending data unreliable. I would avoid trying to send an ACK back using an unreliable method because then you're going to have to put in all kinds of ridiculous logic to detect whether the ACK failed to arrive which will slow you down much more than just using a reliable send.
Sending data has a high latency which means that sending 4 small packets is going to take more time than sending 1 packet with a 4x sized payload. Any time you can increase the payload size to make fewer sends you will get a performance benefit.
If you know the size and shape of the data that you are sending and receiving, you can also squeeze out some performance by sending byte arrays or arrays of numbers rather than using NSCoding because the NSCoding is going to consume some time to serialize and de-serialize (a step you can skip if you're just sending arrays) and the amount of data you send will be slightly more with NSCoder than it would be with a raw array.
Related
I am not able to retrive file through sockets, it divides the byte array if data increases than a certain length. How can I resolve this issue?
Sockets i/o (and especially when using UDP) is a low level communication protocol that gives you a lot of power but also requires you to solve problems like packets getting chopped up, getting dropped or arriving out of order. You can figure that all out yourself, but you may also want to look at using a package like socket.io-client-dart that takes care of all those edge cases.
I have a REST API that also serves SSE's to send events to clients. The expected load can be anywhere up to 10k concurrent. Luckily since the client never sends data we don't have to worry about polling the connections, however our next bottleneck becomes sending data to the FDs.
We broadcast a static payload out to either all the connections, or a subset of them. Is there a good way to do this without without spending over 60% of our profiled CPU time inside of syscall?
I've seen some talk around tcp splicing such that we vmsplice our payload into a pipe then tee it into n many pipes, then splice into n many socket. But is this the ideal way of achieving it?
Could something like memfd_create + sendfile work?
Is going to these lengths even worth it to save copying the payload ~10k times?
I am doing an application that will imply reading a lot of data sent to my socket.
The problem I have is whether if should I rely on the socket setReceiveBufferSize, put a big value there to hope that it will gather all the data that I have until I am able to process it, or use a BlockingQueue to put everything there and then process it from another thread that keeps pooling and processing data?
Also is it a bad design if I let the queue with the max number of elements? ( so I'm just telling it, "yeah receive as many element as you'd like"), I'm referring to the memory consumption if I will receive a really big number of elements?
Regards,
Aurelian
A large socket receive buffer is always a good idea, but for TCP windowing/throughput reasons, not because you may be slow reading. You should be aiming to read the input as fast as possible, certainly as fast as it arrives. The proposed BlockingQueue is a complete waste of time and space. If the socket receive buffer fills up, the sender will stall. That's all you need.
I'm using select() to listen for data on multiple sockets. When I'm notified that there is data available, how much should I read()?
I could loop over read() until there is no more data, process the data, and then return back to the select-loop. However, I can imagine that the socket recieves so much data so fast that it temporarily 'starves' the other sockets. Especially since I am thinking of using select also for inter-thread communication (message-passing style), I'd like to keep latency low. Is this an issue in reality?
The alternative would be to always read a fixed size of bytes, and then return to the loop. The downside here would be added overhead when there is more data available than fits into my buffer.
What's the best practice here?
Not sure how this is implemented on other platforms, but on Windows the ioctlsocket(FIONREAD) call tells you how many bytes can be read by a single call to recv(). More bytes could be in the socket's queue by the time you actually call recv(). The next call to select() will report the socket is still readable, though.
The too-common approach here is to read everything that's pending on a given socket, especially if one moves to platform-specific advanced polling APIs like kqueue(2) and epoll(7) enabling edge-triggered events. But, you certainly don't have to! Flip a bit associated with that socket somewhere once you think you got enough data (but not everything), and do more recv(2)'s later, say at the very end of the file-descriptor checking loop, without calling select(2) again.
Then the question is too general. What are your goals? Low latency? Hight throughput? Scalability? There's no single answer to everything (well, except for 42 :)
In order not to flood the remote endpoint my server app will have to implement a "to-send" queue of packets I wish to send.
I use Windows Winsock, I/O Completion Ports.
So, I know that when my code calls "socket->send(.....)" my custom "send()" function will check to see if a data is already "on the wire" (towards that socket).
If a data is indeed on the wire it will simply queue the data to be sent later.
If no data is on the wire it will call WSASend() to really send the data.
So far everything is nice.
Now, the size of the data I'm going to send is unpredictable, so I break it into smaller chunks (say 64 bytes) in order not to waste memory for small packets, and queue/send these small chunks.
When a "write-done" completion status is given by IOCP regarding the packet I've sent, I send the next packet in the queue.
That's the problem; The speed is awfully low.
I'm actually getting, and it's on a local connection (127.0.0.1) speeds like 200kb/s.
So, I know I'll have to call WSASend() with seveal chunks (array of WSABUF objects), and that will give much better performance, but, how much will I send at once?
Is there a recommended size of bytes? I'm sure the answer is specific to my needs, yet I'm also sure there is some "general" point to start with.
Is there any other, better, way to do this?
Of course you only need to resort to providing your own queue if you are trying to send data faster than the peer can process it (either due to link speed or the speed that the peer can read and process the data). Then you only need to resort to your own data queue if you want to control the amount of system resources being used. If you only have a few connections then it is likely that this is all unnecessary, if you have 1000s then it's something that you need to be concerned about. The main thing to realise here is that if you use ANY of the asynchronous network send APIs on Windows, managed or unmanaged, then you are handing control over the lifetime of your send buffers to the receiving application and the network. See here for more details.
And once you have decided that you DO need to bother with this you then don't always need to bother, if the peer can process the data faster than you can produce it then there's no need to slow things down by queuing on the sender. You'll see that you need to queue data because your write completions will begin to take longer as the overlapped writes that you issue cannot complete due to the TCP stack being unable to send any more data due to flow control issues (see http://www.tcpipguide.com/free/t_TCPWindowSizeAdjustmentandFlowControl.htm). At this point you are potentially using an unconstrained amount of limited system resources (both non-paged pool memory and the number of memory pages that can be locked are limited and (as far as I know) both are used by pending socket writes)...
Anyway, enough of that... I assume you already have achieved good throughput before you added your send queue? To achieve maximum performance you probably need to set the TCP window size to something larger than the default (see http://msdn.microsoft.com/en-us/library/ms819736.aspx) and post multiple overlapped writes on the connection.
Assuming you already HAVE good throughput then you need to allow a number of pending overlapped writes before you start queuing, this maximises the amount of data that is ready to be sent. Once you have your magic number of pending writes outstanding you can start to queue the data and then send it based on subsequent completions. Of course, as soon as you have ANY data queued all further data must be queued. Make the number configurable and profile to see what works best as a trade off between speed and resources used (i.e. number of concurrent connections that you can maintain).
I tend to queue the whole data buffer that is due to be sent as a single entry in a queue of data buffers, since you're using IOCP it's likely that these data buffers are already reference counted to make it easy to release then when the completions occur and not before and so the queuing process is made simpler as you simply hold a reference to the send buffer whilst the data is in the queue and release it once you've issued a send.
Personally I wouldn't optimise by using scatter/gather writes with multiple WSABUFs until you have the base working and you know that doing so actually improves performance, I doubt that it will if you have enough data already pending; but as always, measure and you will know.
64 bytes is too small.
You may have already seen this but I wrote about the subject here: http://www.lenholgate.com/blog/2008/03/bug-in-timer-queue-code.html though it's possibly too vague for you.