I running freeRTOS and lwip 1.4.1 with the socket api in use on an stm32 processor (stm32f407).
Overall it works pretty fine.
I can send and receive data with udp and tcp.
But in a timewindow of 3 to 7 days I see a strange behavior.
My Problem
Every 3 to 7 days my client (Windows 10, which sends 1-2 HTTP-Requests per second) fails to send those requests. When this happens, there are ~10 Requests successively, which are failing. In very few moments, the stack won't regenerate at all.
My Guess
I think I have possibly missconfigured something in my LWIP config.
Because the stack is well used and shouldn't have any bugs in this direction
My Ethernet settings
server and client are directly connected, no switch,hub or router in between.
server (stm32/lwip):
static, 192.168.168.2
netmask, 255.255.255.0
client (win10) eth0:
static, 192.168.168.1
netmask, 255.255.255.0
client (win10) eth1:
dhcp, to normal working network
My Tries
At the moment I have tests running which are sending ~7-8 Requests per second, but the error doesn't apply more often.
I played around with the lwip config:
more memory for the stack
more pbufs
bigger pbufs
with/without backlog
But everything without improving of this connection problem.
Could it be because of the often reused port numbers from the client, which could make this problem?
Here I have the relevant part of the lwip debuging output:
tcp debugging output
https://pastebin.com/a9JabhET
Here the Wireshark log:
orig screenshot
hole wireshark log:
https://www.file-upload.net/download-12682664/debug_tcp_00001_20170828172950.html
And here my lwipopts.h:
lwip configuration:
https://pastebin.com/cW0v4hF6
It seems a memory problem, but as it is temporary, it could be a timeout on something.
I suggest to use the memory stats functions of LwIP, and also to enable the ARP debug messages.
I have a new Job and am no longer working on this issue.
Befor I stated my new job I could show that it was not a memory Problem on LwIP (I defined unreasonable large pbufs and memorypools) they never reached their limits.
The problem was in the DMA driver for the ETH. When once reached the memory chain end of the DMA driver the chain elements never got freed, so I run into RBU (Receive Buffer Underrun) problems and the RBU Flag never got reseted again and the DMA ETH driver was hanging in this RBU interrupt (even if there where enough LwIP buffs to write to from DMA chain). So I added a sledgehammer fix to the DMA driver and disabled the RBU interrupt (I am polling the RBU flag in multiple situations and clear it if needed, and start to read again from ETH).
I think since then the problem is more or less "solved". Not nice, but it worked.
I've got some information of my coworker at my old working place:
The RBU Interrupt and the clear did not work, because our used CAN stack did not work very well with FreeRTOS, the CAN stack used on busy systems much over 90% of CPU time, which let to the strange behaviour in ETH driver and LWIP.
Related
I have a program using a bi-directional TCP socket to send messages from the host PC to a VLinx ethernet-to-serial converter and then on to a PLC via RS-232. During heavy traffic the socket will intermittently stop communicating although all soft tests of the connection show that it is connected, active and writeable. I suspect that something is interrupting the connection causing the socket to close with out FIN/ACK. How can I test to see where this disconnect might be occuring?
The program itself is written in VB6 and uses Catalyst SocketTools/SocketWrench as opposed to the standard Winsock library. The methodology, properties and code seem to be sound since the same setup works reliably at two other sites. It's just this one site in particular where this problem occurs. It only happens during production when there is traffic on the network and can lose connection anywhere between 20 - 100 times per 10-hour day.
There are redundant tests in place to catch this loss of communication and keep the system running. We have tests on ACK messages, message queue size, time between transmissions (tokens on 2s interval), etc. Typically, the socket will not be unresponsive for more than 30 seconds before it is caught, closed and re-established which works properly >99% of the time.
Previously I had enabled the SocketTools logging capabilities which did not capture any relevant information. Most recently I have tried to have the system ping the VLinx on the first sign of a missed message (2.5 seconds). Those pings have always been successful, meaning that if there is a momentary loss of connection at a switch or AP it does not stay disconnected for long.
I do not have access to the network hardware aside from the PC and VLinx that we own. The facility's IT is also not inclined to help track these kinds of things down because they work on a project-based model.
Does anyone have any suggestions what I can do to try and determine where the problem is occurring so that I can then try to come up with a permanent solution to this issue rather than the band-aid of reconnecting multiple times per day?
A tool like Wireshark may be helpful in seeing what's going on at the network level. The logging facility in SocketTools/SocketWrench can only report what's going on at the API level, and it sounds like whatever the underlying problem is occurs at a lower level in the TCP stack.
If this is occurring after periods of relative inactivity, followed by a burst of activity, one thing you could try doing is enabling keep-alive and see if that makes any difference.
We keep hearing about unreliability of udp that it may reach or not reach or just reach out of order (Signifying delay).
Where is it held until delivered?
Since its connection less if you keep sending packets without a network connection where will it go? Driver buffer?
Similarly when the receiver is not reachable is the packet immediately lost or does it float around a bit expecting host to be available soon? if yes then where?
On a direct connection from one device to another, with no intervening devices, there shouldn't be a problem. Where you can run into problems is where you go through a bunch of switches and routers (like the Internet).
A few reasons:
If a switch drops a frame, there is no mechanism to resend the frame.
Routers will buffer packets when they get congested, and packets can
be dropped if the buffers are full, or they may be purposely dropped to prevent congestion.
Load balancing can cause packets to be delivered out of order.
You have no control over anything outside your network.
Where is it held until delivered?
Packet buffering can occur if packets arrive faster than the device can read. Buffering can be either at NIC of the device or software queue of device driver or in the software queue between driver and stack. But, if the rate of arrival
is much higher such that it cannot be handled by these buffering mechanisms, then it will get dropped at those appropriate layer/location (based on design).
Since its connection less if you keep sending packets without a
network connection where will it go? Driver buffer?
If there is no network, there might be no other intermediate network devices and hence there should not be significant problems. But, it also depends on your architecture / design / configurations. If the configured value of internal OS receive buffer limit / socket buffer size (SO_RCVBUF, rmem_max, rmem_default) is exceeded, there can be drops here. And, if the software queue in respective device driver overflows or the software queue between device driver & stack of the device overflows, there can be drops here. Also, if the CPU is busy addressing another priority task where by it suspends reception, there can be drops here.
Similarly when the receiver is not reachable is the packet immediately
lost or does it float around a bit expecting host to be available
soon? if yes then where?
If there is no reachable destination, it shall be dropped by router.
Also, note that the particular router shall also drop the packet if the TTL/hoplimit count (in IP) is zero by the time the packet reaches this router.
Recently we ran into what looked like a connectivity issue when a particular customer of ours installed our product. We ultimately traced it to a low MTU (~1300 bytes) being configured on one of the devices in the network. In this particular deployment, we had two Windows machines running our application communicating with one another, and their link MTUs were set at 1500.
One thing that made this particularly difficult to troubleshoot, was that our application would work fine during the handshake phase (where only small requests are sent), but would sometimes fail sending a specific request of size ~4KB across the network. If it makes a difference, the application is written in C# and these are WCF messages.
What could account for this indeterminism? I would have expected this to always fail, as the message size we were sending was always larger than the link MTU perceived by the Windows client, which would lead to at least one full 1500-byte packet, which would lead to problems. Is there something in TCP that could make it prefer smaller packets, but only sometimes?
Some other things that we thought might be related:
1) The sockets were constantly being set up and torn down (as the application received what it interpreted as a network failure), so this doesn't appear to be related to TCP slow start.
2) I'm assuming that WCF "quickly" pushes the entire 4KB message to the socket, so there's always something to send that's larger than 1500 bytes.
3) Using WireShark, I didn't spot any TCP retransmissions which might explain why only subsets of the buffer were being sent.
4) Using WireShark, I saw a single 4KB IP packet being sent, which perhaps indicates that TCP Segment Offloading is being implemented by the NIC? (I'm not sure how TSO would look on WireShark). I didn't see in WireShark the 4KB request being broken down to multiple IP packets, in either successful or unsuccessful instances.
5) The customer claims that there's no route between the two Windows machines that circumvents the "problematic" device with the small MTU.
Any thoughts on this would be appreciated.
Recently, I started using the System.Net.Sockets class introduced in the Mango release of WP7 and have generally been enjoying it, but have noticed a disparity in the latency of transmitting data in debug mode vs. running normally on the phone.
I am writing a "remote control" app which transmits a single byte to a local server on my LAN via Wifi as the user taps a button in the app. Ergo, the perceived responsiveness/timeliness of the app is highly important for a good user experience.
With the phone connected to my PC via USB cable and running the app in debug mode, the TCP connection seems to transmit packets as quickly as the user taps buttons.
With the phone disconnected from the PC, the user can tap up to 7 buttons (and thus case 7 "send" commands with 1 byte payloads before all 7 bytes are sent.) If the user taps a button and waits a little between taps, there seems to be a latency of 1 second.
I've tried setting Socket.NoDelay to both True and False, and it seems to make no difference.
To see what was going on, I used a packet sniffer to see what the traffic looked like.
When the phone was connected via USB to the PC (which was using a Wifi connection), each individual byte was in its own packet being spaced ~200ms apart.
When the phone was operating on its own Wifi connection (disconnected from USB), the bytes still had their own packets, but they were all grouped together in bursts of 4 or 5 packets and each group was ~1000ms apart from the next.
btw, Ping times on my Wifi network to the server are a low 2ms as measured from my laptop.
I realize that buffering "sends" together probably allows the phone to save energy, but is there any way to disable this "delay"? The responsiveness of the app is more important than saving power.
This is an interesting question indeed! I'm going to throw my 2 cents in but please be advised, I'm not an expert on System.Net.Sockets on WP7.
Firstly, performance testing while in the debugger should be ignored. The reason for this is that the additional overhead of logging the stack trace always slows applications down, no matter the OS/language/IDE. Applications should be profiled for performance in release mode and disconnected from the debugger. In your case its actually slower disconnected! Ok so lets try to optimise that.
If you suspect that packets are being buffered (and this is a reasonable assumption), have you tried sending a larger packet? Try linearly increasing the packet size and measuring latency. Could you write a simple micro-profiler in code on the device ie: using DateTime.Now or Stopwatch class to log the latency vs. packet size. Plotting that graph might give you some good insight as to whether your theory is correct. If you find that 10 byte (or even 100byte) packets get sent instantly, then I'd suggest simply pushing more data per transmission. It's a lame hack I know, but if it aint broke ...
Finally you say you are using TCP. Can you try UDP instead? TCP is not designed for real-time communications, but rather accurate communications. UDP by contrast is not error checked, you can't guarantee delivery but you can expect faster (more lightweight, lower latency) performance from it. Networks such as Skype and online gaming are built on UDP not TCP. If you really need acknowledgement of receipt you could always build your own micro-protocol over UDP, using your own Cyclic Redundancy Check for error checking and Request/Response (acknowledgement) protocol.
Such protocols do exist, take a look at Reliable UDP discussed in this previous question. There is a Java based implementation of RUDP about but I'm sure some parts could be ported to C#. Of course the first step is to test if UDP actually helps!
Found this previous question which discusses the issue. Perhaps a Wp7 issue?
Poor UDP performance with Windows Phone 7.1 (Mango)
Still would be interested to see if increasing packet size or switching to UDP works
ok so neither suggestion worked. I found this description of the Nagle algorithm which groups packets as you describe. Setting NoDelay is supposed to help but as you say, doesn't.
http://msdn.microsoft.com/en-us/library/system.net.sockets.socket.nodelay.aspx
Also. See this previous question where Keepalive and NoDelay were set on/off to manually flush the queue. His evidence is anecdotal but worth a try. Can you give it a go and edit your question to post more up to date results?
Socket "Flush" by temporarily enabling NoDelay
Andrew Burnett-Thompson here already mentioned it, but he also wrote that it didn't work for you. I do not understand and I do not see WHY. So, let me explain that issue:
Nagle's algorithm was introduced to avoid a scenario where many small packets had to been sent through a TCP network. Any current state-of-the-art TCP stack enables Nagle's algorithm by default!
Because: TCP itself adds a substantial amount of overhead to any the data transfer stuff that is passing through an IP connection. And applications usually do not care much about sending their data in an optimized fashion over those TCP connections. So, after all that Nagle algorithm that is working inside of the TCP stack of the OS does a very, very good job.
A better explanation of Nagle's algorithm and its background can be found on Wikipedia.
So, your first try: disable Nagle's algorithm on your TCP connection, by setting option TCP_NODELAY on the socket. Did that already resolve your issue? Do you see any difference at all?
If not so, then give me a sign, and we will dig further into the details.
But please, look twice for those differences: check the details. Maybe after all you will get an understanding of how things in your OS's TCP/IP-Stack actually work.
Most likely it is not a software issue. If the phone is using WiFi, the delay could be upwards of 70ms (depending on where the server is, how much bandwidth it has, how busy it is, interference to the AP, and distance from the AP), but most of the delay is just the WiFi. Using GMS, CDMA, LTE or whatever technology the phone is using for cellular data is even slower. I wouldn't imagine you'd get much lower than 110ms on a cellular device unless you stood underneath a cell tower.
Sounds like your reads/writes are buffered. You may try setting the NoDelay property on the Socket to true, you may consider trimming the Send and Receive buffer sizes as well. The reduced responsiveness may be a by-product of there not being enough wifi traffic, i'm not sure if adjusting MTU is an option, but reducing MTU may improve response times.
All of these are only options for a low-bandwidth solution, if you intend to shovel megabytes of data in either direction you will want larger buffers over wifi, large enough to compensate for transmit latency, typically in the range of 32K-256K.
var socket = new System.Net.Sockets.Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp)
{
NoDelay = true,
SendBufferSize = 3,
ReceiveBufferSize = 3,
};
I didn't test this, but you get the idea.
Have you tried setting SendBufferSize = 0? In the 'C', you can disable winsock buffering by setting SO_SNDBUF to 0, and I'm guessing SendBufferSize means the same in C#
Were you using Lumia 610 and mikrotik accesspoint by any chance?
I have experienced this problem, it made Lumia 610 turn off wifi radio as soon as last connection was closed. This added perceivable delay, compared to Lumia 800 for example. All connections were affected - simply switching wifi off made all apps faster. My admin says it was some feature mikrotiks were not supporting at the time combined with WMM settings. Strangely, most other phones were managing just fine, so we blamed cheapness of the 610 at the beginning.
If you still can replicate the problem, I suggest trying following:
open another connection in the background and ping it all the time.
use 3g/gprs instead of wifi (requires exposing your server to the internet)
use different (or upgraded) phone
use different (or upgraded) AP
I am using Java (although I think the socket options is implement in most languages) to implement a client and server. The server sends data to the client for processing which the client acknowledges. On another port the client then sends the results of the processing back to the server. When it comes to options such as
SO_LINGER
SO_KEEPALIVE
SO_NODELAY
SO_REUSEADDRESS
SO_SENDBUFFER
SO_RECBUFFER
TCP_NODELAY
We have noticed that the connection between the client and server occasionally breaks. There will be a timeout on the send or the receive. When this happens will kill the socket and open a new one to continue.
What would be the best options to set in terms of the above scenario and is there anything that we could do from our side (programmatically or options-wise) to try minimize the amount of times the connection is dropped. We are using normal TCP/IP.
UPDATE:
The bounty on this ends soon. I haven't had a satisfactory answer yet so it is still open. I think everyone is missing the point of the quest. What is the best practice with regards to the options above for sockets that continuously chat. I have already got a ping packet in that if there is no work to be done (hardly ever the scenario) the normal message is sent with no inner elements so there is always processing.
Strictly speaking, you don't need any of these socket options:
* SO_LINGER
You need to set SO_LINGER only if your application still has outstanding packets to send when close(2) or shutdown(2) has been called. Not really applicable for your application.
* SO_KEEPALIVE
Sending keepalive-pings every two hours would really only help very long-lived but -very- quiet connections going through stateful firewalls with very long session timeouts. (Two hours between pings is entirely too long to be practical in today's Internet.)
* SO_NODELAY
This (presumably an alias for TCP_NODELAY) disables Nagle's algorithm, which is just a small-packet-avoidance problem. Perhaps Nagle is getting in the way in your application, but it takes special sequences of packets to introduce 500ms delays into processing; it never just hangs connections.
* SO_REUSEADDRESS
Useful for all 'servers' that listen on well-known port numbers; use on 'clients' is almost always covering up some bug or other, but it is sometimes necessary if requests must come from a well-known port number.
* SO_SENDBUFFER
* SO_RECBUFFER
These buffer sizes influence the kernel-side buffer sizes maintained for receiving or sending data while your program (receive buffer) or the socket (send buffer) isn't yet ready to accept more data. If these are set too small, your application might not transfer data as smoothly as possible, reducing throughput, but it should not lead to any stalls if these are set smaller than optimal. Of course, too large may put unreasonable demands on kernel memory, but there should be a reasonable system-wide maximum allowed size.
* TCP_NODELAY
Disables Nagle. Not likely to do more than introduce 500ms delays if your application sends multiple small packets before attempting a blocking read.
Really, you shouldn't need to set any socket options.
Can you distill your code into something that could be pasted here and tested or inspected? I'm used to TCP sessions surviving for days or weeks without trouble, so this is pretty surprising.
First I think that this page is relevant, regarding half-open connections.
http://nitoprograms.blogspot.com/2009/05/detection-of-half-open-dropped.html
That being said, TCP is designed to hide connection problems, so you may often find yourself in cases where the connection is broken, but neither side thinks it is. You have addressed this partially by using timeouts and taking that as a sign the connection is broken.
Since you are writing the client and server, I would avoid relying on TCP to tell you when the connection is broken altogether. I would just have the server also acknowledge the receipt of the result from the client. Then both sides will expect immediate responses to their messages, and you can track which messages have been ack'd and set an appropriately small timeout for receiving the ack. This is not a timeout on the send or receive, but a timeout on the time between sending a message and receiving the ack for that message. Then you can set the timeout appropriately depending on the quality of your connection (e.g. very small if you are running on loopback, but large if running over wireless with a weak signal).
Regarding the options you list, you will want to use SO_REUSEADDRESS so that you won't be prevented from reopening the socket, for example if it hasn't finished closing from a previously killed process.
You probably have, but it is best to check the obvious....
Have you verified that it IS the socket that is timing out, and not your code? Sockets are fairly stable, and while there might be an issue somewhere, it seems more likely that it is in your code. I would use logs, timestamps, and synchronised clocks to be sure.
There may be an issue that you genuinely DO take a long time to do the calculation, so maybe adding a 'I'm still thinking about it' message to your protocol that gets sent regularly, to keep the connection alive?
Of course networks will drop out from time to time regardless of what you do, and it sounds like you are already handling that case nicely.
try these options
SO_LINGER - for specyfying when the Socket close s called while some unsent data in the queue
TCP_NODELAY - For non blocking datat transfer
I would strongly encourage you to use a ping/echo model between client and server, so that if no data is sent for x seconds a ping message needs to be send. A typical reason for a break might be a firewall, which shuts down socketss because of inactivity.
The typical issue where the TCP model fails are physical problems e.g. a pulled/broken cable and hangs on one side, where technically someone is listening until a queue overrun kicks in (which might never happen given your amount of data).
What are the chances the connection is going through a NAT firewall somewhere along the way? Stateful firewalls maintain a table of open connections so that packets belonging to an allowed connection can quickly pass through the system, without forcing firewall admins to write overly-complex rule sets.
The downside is that this table can grow immensely large, so it must be pruned as connections are closed or as they appear to have simply grown stale and died quietly. A connection that has gone silent for 20 minutes is usually quiet enough to reaped. (Which is really very quick, as the TCP KEEPALIVE is typically two hours, making it nearly useless in the face of NAT firewalls.)
So: is this going through a NAT firewall? Is the connection quiet for long stretches? If so, add a ping/pong to your protocol, and fire it every few minutes.