SocketCAN stops read after RX overflow, is it normal? - sockets

I’m doing tests on embedded hardware with integrated CAN bus interface.
The driver provides Linux Socket API.
I try to see the limits:
I have one transmitter that writes CAN frames as fast as possible and a receiver that reads continuously.
After a moment the receiver gets an error frame signalling RX overflow.
I have no problem with that, it's normal and expected.
But my question is why at this point no more frame is received ?
(The restart-ms option is set)
I expected some dropped frames and others RX buffer errors but not the end of reception.

After exchanging emails with socket-can developers and my device provider, it was an bug in the driver.
In the mean time a patch was released to move at91_can to new rx_fifo architecture.
this patch fix the issue.

Related

Why is my TCP socket showing connected but not responding?

I have a program using a bi-directional TCP socket to send messages from the host PC to a VLinx ethernet-to-serial converter and then on to a PLC via RS-232. During heavy traffic the socket will intermittently stop communicating although all soft tests of the connection show that it is connected, active and writeable. I suspect that something is interrupting the connection causing the socket to close with out FIN/ACK. How can I test to see where this disconnect might be occuring?
The program itself is written in VB6 and uses Catalyst SocketTools/SocketWrench as opposed to the standard Winsock library. The methodology, properties and code seem to be sound since the same setup works reliably at two other sites. It's just this one site in particular where this problem occurs. It only happens during production when there is traffic on the network and can lose connection anywhere between 20 - 100 times per 10-hour day.
There are redundant tests in place to catch this loss of communication and keep the system running. We have tests on ACK messages, message queue size, time between transmissions (tokens on 2s interval), etc. Typically, the socket will not be unresponsive for more than 30 seconds before it is caught, closed and re-established which works properly >99% of the time.
Previously I had enabled the SocketTools logging capabilities which did not capture any relevant information. Most recently I have tried to have the system ping the VLinx on the first sign of a missed message (2.5 seconds). Those pings have always been successful, meaning that if there is a momentary loss of connection at a switch or AP it does not stay disconnected for long.
I do not have access to the network hardware aside from the PC and VLinx that we own. The facility's IT is also not inclined to help track these kinds of things down because they work on a project-based model.
Does anyone have any suggestions what I can do to try and determine where the problem is occurring so that I can then try to come up with a permanent solution to this issue rather than the band-aid of reconnecting multiple times per day?
A tool like Wireshark may be helpful in seeing what's going on at the network level. The logging facility in SocketTools/SocketWrench can only report what's going on at the API level, and it sounds like whatever the underlying problem is occurs at a lower level in the TCP stack.
If this is occurring after periods of relative inactivity, followed by a burst of activity, one thing you could try doing is enabling keep-alive and see if that makes any difference.

lwip stm32 - http requests failing

I running freeRTOS and lwip 1.4.1 with the socket api in use on an stm32 processor (stm32f407).
Overall it works pretty fine.
I can send and receive data with udp and tcp.
But in a timewindow of 3 to 7 days I see a strange behavior.
My Problem
Every 3 to 7 days my client (Windows 10, which sends 1-2 HTTP-Requests per second) fails to send those requests. When this happens, there are ~10 Requests successively, which are failing. In very few moments, the stack won't regenerate at all.
My Guess
I think I have possibly missconfigured something in my LWIP config.
Because the stack is well used and shouldn't have any bugs in this direction
My Ethernet settings
server and client are directly connected, no switch,hub or router in between.
server (stm32/lwip):
static, 192.168.168.2
netmask, 255.255.255.0
client (win10) eth0:
static, 192.168.168.1
netmask, 255.255.255.0
client (win10) eth1:
dhcp, to normal working network
My Tries
At the moment I have tests running which are sending ~7-8 Requests per second, but the error doesn't apply more often.
I played around with the lwip config:
more memory for the stack
more pbufs
bigger pbufs
with/without backlog
But everything without improving of this connection problem.
Could it be because of the often reused port numbers from the client, which could make this problem?
Here I have the relevant part of the lwip debuging output:
tcp debugging output
https://pastebin.com/a9JabhET
Here the Wireshark log:
orig screenshot
hole wireshark log:
https://www.file-upload.net/download-12682664/debug_tcp_00001_20170828172950.html
And here my lwipopts.h:
lwip configuration:
https://pastebin.com/cW0v4hF6
It seems a memory problem, but as it is temporary, it could be a timeout on something.
I suggest to use the memory stats functions of LwIP, and also to enable the ARP debug messages.
I have a new Job and am no longer working on this issue.
Befor I stated my new job I could show that it was not a memory Problem on LwIP (I defined unreasonable large pbufs and memorypools) they never reached their limits.
The problem was in the DMA driver for the ETH. When once reached the memory chain end of the DMA driver the chain elements never got freed, so I run into RBU (Receive Buffer Underrun) problems and the RBU Flag never got reseted again and the DMA ETH driver was hanging in this RBU interrupt (even if there where enough LwIP buffs to write to from DMA chain). So I added a sledgehammer fix to the DMA driver and disabled the RBU interrupt (I am polling the RBU flag in multiple situations and clear it if needed, and start to read again from ETH).
I think since then the problem is more or less "solved". Not nice, but it worked.
I've got some information of my coworker at my old working place:
The RBU Interrupt and the clear did not work, because our used CAN stack did not work very well with FreeRTOS, the CAN stack used on busy systems much over 90% of CPU time, which let to the strange behaviour in ETH driver and LWIP.

"Failed to send big message" in Unity networking

Unity suggests that custom synchronization of more complex objects can be done by overriding OnSerialize in your NetworkBehaviour. However, it seems NetworkWriter just gives up and errors out if you try and feed it more than about 1500 bytes.
In a lower-level context, I would expect this kind of size limitation on network messages and chunk them up myself. The HLAPI doesn't really expose the functionality necessary to do this in OnSerialize, though; how can I work around it?
Unity 5.2.3 added packet queueing for the ReliableFragmented QoS channel. Switching to this channel solved my problem.
(726466) - Networking: Added support for HLAPI packet queuing on channels using the ReliableFragmented QoS.

How long does a UDP packet keeps floating and where?

We keep hearing about unreliability of udp that it may reach or not reach or just reach out of order (Signifying delay).
Where is it held until delivered?
Since its connection less if you keep sending packets without a network connection where will it go? Driver buffer?
Similarly when the receiver is not reachable is the packet immediately lost or does it float around a bit expecting host to be available soon? if yes then where?
On a direct connection from one device to another, with no intervening devices, there shouldn't be a problem. Where you can run into problems is where you go through a bunch of switches and routers (like the Internet).
A few reasons:
If a switch drops a frame, there is no mechanism to resend the frame.
Routers will buffer packets when they get congested, and packets can
be dropped if the buffers are full, or they may be purposely dropped to prevent congestion.
Load balancing can cause packets to be delivered out of order.
You have no control over anything outside your network.
Where is it held until delivered?
Packet buffering can occur if packets arrive faster than the device can read. Buffering can be either at NIC of the device or software queue of device driver or in the software queue between driver and stack. But, if the rate of arrival
is much higher such that it cannot be handled by these buffering mechanisms, then it will get dropped at those appropriate layer/location (based on design).
Since its connection less if you keep sending packets without a
network connection where will it go? Driver buffer?
If there is no network, there might be no other intermediate network devices and hence there should not be significant problems. But, it also depends on your architecture / design / configurations. If the configured value of internal OS receive buffer limit / socket buffer size (SO_RCVBUF, rmem_max, rmem_default) is exceeded, there can be drops here. And, if the software queue in respective device driver overflows or the software queue between device driver & stack of the device overflows, there can be drops here. Also, if the CPU is busy addressing another priority task where by it suspends reception, there can be drops here.
Similarly when the receiver is not reachable is the packet immediately
lost or does it float around a bit expecting host to be available
soon? if yes then where?
If there is no reachable destination, it shall be dropped by router.
Also, note that the particular router shall also drop the packet if the TTL/hoplimit count (in IP) is zero by the time the packet reaches this router.

How to speed up slow / laggy Windows Phone 7 (WP7) TCP Socket transmit?

Recently, I started using the System.Net.Sockets class introduced in the Mango release of WP7 and have generally been enjoying it, but have noticed a disparity in the latency of transmitting data in debug mode vs. running normally on the phone.
I am writing a "remote control" app which transmits a single byte to a local server on my LAN via Wifi as the user taps a button in the app. Ergo, the perceived responsiveness/timeliness of the app is highly important for a good user experience.
With the phone connected to my PC via USB cable and running the app in debug mode, the TCP connection seems to transmit packets as quickly as the user taps buttons.
With the phone disconnected from the PC, the user can tap up to 7 buttons (and thus case 7 "send" commands with 1 byte payloads before all 7 bytes are sent.) If the user taps a button and waits a little between taps, there seems to be a latency of 1 second.
I've tried setting Socket.NoDelay to both True and False, and it seems to make no difference.
To see what was going on, I used a packet sniffer to see what the traffic looked like.
When the phone was connected via USB to the PC (which was using a Wifi connection), each individual byte was in its own packet being spaced ~200ms apart.
When the phone was operating on its own Wifi connection (disconnected from USB), the bytes still had their own packets, but they were all grouped together in bursts of 4 or 5 packets and each group was ~1000ms apart from the next.
btw, Ping times on my Wifi network to the server are a low 2ms as measured from my laptop.
I realize that buffering "sends" together probably allows the phone to save energy, but is there any way to disable this "delay"? The responsiveness of the app is more important than saving power.
This is an interesting question indeed! I'm going to throw my 2 cents in but please be advised, I'm not an expert on System.Net.Sockets on WP7.
Firstly, performance testing while in the debugger should be ignored. The reason for this is that the additional overhead of logging the stack trace always slows applications down, no matter the OS/language/IDE. Applications should be profiled for performance in release mode and disconnected from the debugger. In your case its actually slower disconnected! Ok so lets try to optimise that.
If you suspect that packets are being buffered (and this is a reasonable assumption), have you tried sending a larger packet? Try linearly increasing the packet size and measuring latency. Could you write a simple micro-profiler in code on the device ie: using DateTime.Now or Stopwatch class to log the latency vs. packet size. Plotting that graph might give you some good insight as to whether your theory is correct. If you find that 10 byte (or even 100byte) packets get sent instantly, then I'd suggest simply pushing more data per transmission. It's a lame hack I know, but if it aint broke ...
Finally you say you are using TCP. Can you try UDP instead? TCP is not designed for real-time communications, but rather accurate communications. UDP by contrast is not error checked, you can't guarantee delivery but you can expect faster (more lightweight, lower latency) performance from it. Networks such as Skype and online gaming are built on UDP not TCP. If you really need acknowledgement of receipt you could always build your own micro-protocol over UDP, using your own Cyclic Redundancy Check for error checking and Request/Response (acknowledgement) protocol.
Such protocols do exist, take a look at Reliable UDP discussed in this previous question. There is a Java based implementation of RUDP about but I'm sure some parts could be ported to C#. Of course the first step is to test if UDP actually helps!
Found this previous question which discusses the issue. Perhaps a Wp7 issue?
Poor UDP performance with Windows Phone 7.1 (Mango)
Still would be interested to see if increasing packet size or switching to UDP works
ok so neither suggestion worked. I found this description of the Nagle algorithm which groups packets as you describe. Setting NoDelay is supposed to help but as you say, doesn't.
http://msdn.microsoft.com/en-us/library/system.net.sockets.socket.nodelay.aspx
Also. See this previous question where Keepalive and NoDelay were set on/off to manually flush the queue. His evidence is anecdotal but worth a try. Can you give it a go and edit your question to post more up to date results?
Socket "Flush" by temporarily enabling NoDelay
Andrew Burnett-Thompson here already mentioned it, but he also wrote that it didn't work for you. I do not understand and I do not see WHY. So, let me explain that issue:
Nagle's algorithm was introduced to avoid a scenario where many small packets had to been sent through a TCP network. Any current state-of-the-art TCP stack enables Nagle's algorithm by default!
Because: TCP itself adds a substantial amount of overhead to any the data transfer stuff that is passing through an IP connection. And applications usually do not care much about sending their data in an optimized fashion over those TCP connections. So, after all that Nagle algorithm that is working inside of the TCP stack of the OS does a very, very good job.
A better explanation of Nagle's algorithm and its background can be found on Wikipedia.
So, your first try: disable Nagle's algorithm on your TCP connection, by setting option TCP_NODELAY on the socket. Did that already resolve your issue? Do you see any difference at all?
If not so, then give me a sign, and we will dig further into the details.
But please, look twice for those differences: check the details. Maybe after all you will get an understanding of how things in your OS's TCP/IP-Stack actually work.
Most likely it is not a software issue. If the phone is using WiFi, the delay could be upwards of 70ms (depending on where the server is, how much bandwidth it has, how busy it is, interference to the AP, and distance from the AP), but most of the delay is just the WiFi. Using GMS, CDMA, LTE or whatever technology the phone is using for cellular data is even slower. I wouldn't imagine you'd get much lower than 110ms on a cellular device unless you stood underneath a cell tower.
Sounds like your reads/writes are buffered. You may try setting the NoDelay property on the Socket to true, you may consider trimming the Send and Receive buffer sizes as well. The reduced responsiveness may be a by-product of there not being enough wifi traffic, i'm not sure if adjusting MTU is an option, but reducing MTU may improve response times.
All of these are only options for a low-bandwidth solution, if you intend to shovel megabytes of data in either direction you will want larger buffers over wifi, large enough to compensate for transmit latency, typically in the range of 32K-256K.
var socket = new System.Net.Sockets.Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp)
{
NoDelay = true,
SendBufferSize = 3,
ReceiveBufferSize = 3,
};
I didn't test this, but you get the idea.
Have you tried setting SendBufferSize = 0? In the 'C', you can disable winsock buffering by setting SO_SNDBUF to 0, and I'm guessing SendBufferSize means the same in C#
Were you using Lumia 610 and mikrotik accesspoint by any chance?
I have experienced this problem, it made Lumia 610 turn off wifi radio as soon as last connection was closed. This added perceivable delay, compared to Lumia 800 for example. All connections were affected - simply switching wifi off made all apps faster. My admin says it was some feature mikrotiks were not supporting at the time combined with WMM settings. Strangely, most other phones were managing just fine, so we blamed cheapness of the 610 at the beginning.
If you still can replicate the problem, I suggest trying following:
open another connection in the background and ping it all the time.
use 3g/gprs instead of wifi (requires exposing your server to the internet)
use different (or upgraded) phone
use different (or upgraded) AP