I'm having a strange problem trying to maintain TCP connections from my local PC to Azure (oddly, Remote Desktop works fine). I first noticed the problem with my own software, but it's not limited to my code. What I've noticed is:
TCP 3 way handshake completes
Some data is successfully sent and received
Something bad happens and no more data is sent
To rule out my software, I tested netcat. On my Azure machine I set up a netcat server to echo a large text file. On my local PC I established the netcat connection to the Azure server and observed some of the text file being printed and then it just stopped.
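For reference, the client side of that test in C# is little more than the following sketch (the host name and port are placeholders for my Azure endpoint):

using System;
using System.Net.Sockets;
using System.Text;

class DumpClient
{
    static void Main()
    {
        // Placeholder endpoint; the real test used my Azure VM's address.
        using (var client = new TcpClient("your-vm.cloudapp.net", 5000))
        using (var stream = client.GetStream())
        {
            var buffer = new byte[4096];
            int n;
            // Print whatever the server sends until the stream ends --
            // in my case the output just stops partway through the file.
            while ((n = stream.Read(buffer, 0, buffer.Length)) > 0)
                Console.Write(Encoding.ASCII.GetString(buffer, 0, n));
        }
    }
}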
The first Wireshark image is from the Azure server, and the second image is from my PC. Both were captured at the same time doing the netcat test I described above.
Here is my Azure endpoint configuration (same result with both endpoints):
I'm currently at a loss, and don't know enough about what the problem may be to continue my debugging efforts. Any suggestions would be greatly appreciated.
Thanks!
Someone from Azure support told me that the Azure network's maximum TCP packet size is 1350 bytes. So if your packets are larger than this, it might be the problem. Try limiting them to 1300 and test again.
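If you want to test that theory, something like the sketch below can help. It assumes an echo server is listening on the VM (the host name and port are placeholders) and sends increasingly large payloads until replies stop coming back:

using System;
using System.IO;
using System.Net.Sockets;

class PayloadSizeProbe
{
    static void Main()
    {
        // Placeholder endpoint; assumes the server echoes back what it receives.
        using (var client = new TcpClient("your-vm.cloudapp.net", 5000))
        using (var stream = client.GetStream())
        {
            stream.ReadTimeout = 5000; // give up quickly if the echo never arrives

            for (int size = 1000; size <= 1600; size += 50)
            {
                var payload = new byte[size];
                stream.Write(payload, 0, payload.Length);

                try
                {
                    int total = 0;
                    while (total < size)
                    {
                        int n = stream.Read(payload, total, size - total);
                        if (n == 0) throw new IOException("connection closed");
                        total += n;
                    }
                    Console.WriteLine(size + " bytes echoed OK");
                }
                catch (IOException)
                {
                    // A read timeout here suggests segments of this size are
                    // being dropped somewhere on the path.
                    Console.WriteLine("No echo for " + size + " bytes");
                    break;
                }
            }
        }
    }
}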
There are several things that could cause this kind of network problem, such as:
Azure SLB/SNAT
Azure physical network issues (ACLs on the Azure routers where your VMs reside)
Proxies in front of your client applications, etc.
We should systematically prune the problem space when tackling these kinds of problems. For example, in this instance, you should run your client app on one of the Azure VMs and verify whether the issue is reproducible there. If it is not reproducible, the problem almost certainly lies in your local PC's network (or in a proxy your machine sits behind).
If the issue is reproducible from another Azure VM too, then something is almost certainly wrong on the Azure side.
Some tips for identifying issues on the Azure side:
Check whether your TCP connection sits idle (not sending data) for 4 minutes. If it does, the Azure SLB/SNAT layer drops the TCP connection on the Azure side. You can prevent this by either sending TCP keep-alives or increasing the VM endpoint idle timeout via the Azure endpoint configuration.
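As an illustration, here is a minimal C# sketch of the keep-alive approach (the host name and port are placeholders, not real Azure values):

using System;
using System.Net.Sockets;

class KeepAliveDemo
{
    static void Main()
    {
        // Placeholder endpoint standing in for the Azure VM.
        using (var client = new TcpClient("your-vm.cloudapp.net", 5000))
        {
            // Ask the OS to probe the connection periodically so it never
            // looks idle to the Azure SLB/SNAT layer.
            client.Client.SetSocketOption(
                SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);

            // ... send/receive as usual ...
        }
    }
}

Note that the default OS keep-alive interval is often two hours, far longer than the 4-minute SNAT timeout, so you may also need to lower it (on .NET Core 3.0+ there are SocketOptionName.TcpKeepAliveTime/TcpKeepAliveInterval options; otherwise tune it at the OS level) or simply send a small application-level heartbeat more often than every 4 minutes.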
Hope this helps.
Related
I have an Ubuntu VM installed on a client's VMware system. Recently, the client's IT informed us that his firewall has been detecting consistent potential port scans to our VM's internal IP address (coming from 87.238.57.227). He asked if this was part of a known package update process on our VM.
He sent us a firewall output where we can see several instances of the port scan, but there are also instances of our Ubuntu VM trying to communicate back to the external server on port 37258 (this is dropped by the firewall).
Based on a Google lookup, the hostname of the external IP address is "feris.postgresql.org", with the ASN pointing to a European company called Redpill-Linpro. As far as I can tell, they offer IT consulting services, specializing in open-source software (like PostgreSQL, which is installed on our VM). I have never heard of them before, though, and have no idea why our VM would be communicating with them or vice versa. I'm also not sure if I'm interpreting the IP lookup information correctly: https://ipinfo.io/87.238.57.227
I'm looking for a way to confirm or disprove that this is just our VM pinging for a standard postgres update. If that's the case I'd like to restrict this behaviour. We would prefer to do these types of updates manually and limit the communication outside of the VM to what is strictly necessary for the functionality of our application.
Update
I sent an email to Redpill's abuse account. They responded quickly saying that the server should not be port scanning anyone and if it appears that way, something is wrong.
The server is part of a cluster of machines that serves apt.postgresql.org, among other Postgres download sites. I don't think we have anything like Ansible or Puppet installed that would automatically check for updates, but I will look into that to make sure. I'm wondering whether Ubuntu reaching out to update the MOTD with the number of available packages would explain why our VM is trying to contact the external Postgres server.
The abuse rep said that in any case there should only be outgoing connections from the VM, not incoming. He asked for some additional info, so I will keep communicating with him and try to update this post accordingly.
My communication with the client's IT dropped off so I did not get a definitive answer on this, but I'll provide some new details:
I reached out to the abuse email for Redpill-Linpro. He got back to me and confirmed the server corresponding to the detected IP address is part of a cluster that hosts postgres download sites, including apt.postgresql.org. He was surprised to learn we had detected a port scan from their server and seems eager to figure out why that is happening.
He asked if the client IT could pass along some necessary info for them to set up tracking on that server. But the client IT never got back to me. I think he was satisfied that it wasn't malicious and stopped pursuing it.
Here's one of the messages the abuse rep sent me that may be relevant:
That does look a lot like the TCP to the apt download server, yes. It's strange that your firewall reports that many incoming connections, but they could be fallout from some connection tracking that's not operating as intended. The timing appears to be matching up more or less perfectly. And there should definitely not be any ping-back connections from it.
Since you appear to be using the http version of the server (and not https), bringing the data in cleartext, they should be able to just dump the TCP connection contents and verify exactly what it does. But I bet they are going to see a number of http requests initiated by the apt client that is checking for updates.
We made a chat module in our project using socket.io. When the load is balanced and a manual deploy happens, if socket connections are switched to different servers, the connections are dropped and some of the messaging events are never processed. I solved the load-balancing problem with the socket.io-redis library; it acts as a gateway and solves this problem thanks to Redis.
Another problem is that when I deploy manually, the PIDs of the server processes change and socket.io connections are instantly disconnected on the client; after that, the client is not really connected even though it reports being connected.
Do you think using a tool such as Travis CI would solve the problems with the manual deploy process?
Another question: if a system scales out to 3 servers under the load balancer and then scales back down to 2, the socket connections will be closed again. What method could solve this? I have thought about separating the socket.io service from the monolithic structure, keeping it on a single server, and scaling that server vertically when the load increases.
We are using AWS Elastic Beanstalk, which performs load balancing automatically.
Summary:
I am guessing that the issue here is something to do with how Windows and Linux handle TCP connections, or sockets, but I have no idea what it is. I'm initiating a TCP connection to a piece of custom hardware that someone else has developed and I am trying to understand its behaviour. In doing so, I've created a .NET Core 2.2 application; run on a Windows system, I can initiate the connection successfully, but on Linux (latest Raspbian), I cannot.
It appears that it may be because Linux systems do not retry/retransmit a SYN after a RST, whereas Windows ones do, and this behaviour seems key to how this peculiar piece of hardware works.
Background:
We have a black box piece of hardware that can be controlled and queried over a network, by using a manufacturer-supplied Windows application. Data is unencrypted and requires no authentication to connect to it and the application has some other issues. Ultimately, we want to be able to relay data from it to another system, so we decided to make our own application.
I've spent quite a long time trying to understand the packet format and have created a library, targeting .NET Core 2.2, that can be used to successfully communicate with this kit. In doing so, I discovered that the device seems to require a kind of "request to connect" command to be sent, via UDP. Straight afterwards, I am able to initiate a TCP connection on port 16000, although the first TCP attempt always results in a RST,ACK being returned, so a second attempt needs to be made.
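To make that concrete, here is a rough C# sketch of the sequence. The device address and the UDP payload are placeholders, and I'm assuming here that the UDP "request to connect" goes to the same port; the real values come from the reverse-engineered protocol:

using System;
using System.Net.Sockets;
using System.Threading;

class DeviceConnector
{
    static TcpClient Connect(string host)
    {
        // Placeholder "request to connect" datagram; the real payload (and
        // possibly a different UDP port) comes from the device's protocol.
        using (var udp = new UdpClient())
        {
            byte[] wakeUp = { 0x01 };
            udp.Send(wakeUp, wakeUp.Length, host, 16000);
        }

        for (int attempt = 1; attempt <= 2; attempt++)
        {
            try
            {
                return new TcpClient(host, 16000); // connect happens in the constructor
            }
            catch (SocketException ex)
                when (ex.SocketErrorCode == SocketError.ConnectionRefused && attempt == 1)
            {
                // The device always resets the first SYN; wait briefly and
                // try once more, which is when it normally accepts.
                Thread.Sleep(100);
            }
        }
        throw new InvalidOperationException("unreachable");
    }

    static void Main()
    {
        using (var client = Connect("192.0.2.10")) // placeholder device address
            Console.WriteLine("Connected: " + client.Connected);
    }
}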
What I've developed works absolutely fine on both Windows (x86) and Linux (Raspberry Pi/ARM) systems and I can send and receive data. However, when run on the Raspbian system, there seem to be problems when initiating the TCP connection. I could have sworn that we had it working absolutely fine on a previous build, but none of the previous commits seem to work, so it may well be a system/kernel update that has changed something.
The issue:
When initiating a TCP connection to this device, it will - straight away - reset the connection. It does this even with the manufacturer-supplied software, which itself then immediately re-attempts the connection again and it succeeds; so this kind of reset-once-then-it-works-the-second-time behaviour in itself isn't a "problem" that I have any control over.
What I am trying to understand is why a Windows system immediately re-attempts the connection through a retransmission...
...but the Linux system just gives up after one attempt (this is the end of the packet capture...)
To prove it is not an application-specific issue, I've tried using ncat/netcat on both the Windows system and the Raspbian system, as well as a Kali system on a separate laptop to prove it isn't an ARM/Raspberry issue. Since the UDP "request" hasn't been sent, the connection will never succeed anyway, but this simply demonstrates different behaviour between the OSes.
Linux versions look pretty much the same as above, whereby they send a single packet that gets reset, whereas the Windows attempt demonstrates the multiple retransmissions.
So, does anyone have an answer for this behaviour difference? I am guessing it isn't a .NET Core specific issue, but is there any way I can set socket options to attempt a retransmission? Or can it be set at the OS level with sysctl or something? I did try to see if there are any SocketOptionNames, in .NET, that look like they'd control attempts/retries, as this answer had me wonder, but no luck so far.
If anyone has any suggestions as to how to better align this behaviour across platforms, or can explain the reason for this difference at all, I would very much appreciate it!
Nice find! According to this, Windows' TCP will retry a connection if it receives a RST/ACK from the remote host after sending a SYN:
... Upon receiving the ACK/RST from the target host, the client determines that there is indeed no service listening there. In the Microsoft Winsock implementation of TCP, a pending connection will keep attempting to issue SYN packets until a maximum retry value is reached (set in the registry, this value defaults to 3 extra times)...
The value used to limit those retries is set in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\TcpMaxConnectRetransmissions according to the same article. At least in Win10 Pro it doesn't seem to be present by default.
Although this is a convenience for Windows machines, an application should still determine its own criteria for handling a failed connect attempt IMO (i.e. number of attempts, timeouts, etc.).
Anyhow, as I said, surprising fact! Living and learning I guess ...
Cristian.
I'm having an issue with TeamCity that is proving very difficult to solve for a number of reasons. I've looked around and not found any useful answers so far.
We have a TeamCity server running on port 8080, with two agents connecting to it on ports 9090 and 9091 respectively. The agents register successfully and can accept new builds just fine. When the build is complete, the tests have passed, and the logs state "Sending artifacts", things stop and the artifacts never reach the server. Having left this sitting overnight, I made requests to stop the build, which failed.
We recently switched to a new firewall, but things had been working after we set the required port rules for 8080, 9090, and 9091. No changes have been made since we got things working, but now things do not work.
To the logs...
The server is aware of the failure as I can see logs in several places stating:
jetbrains.buildServer.SERVER - Failed to upload artifact, due to error: org.apache.commons.fileupload.FileUploadBase$IOFileUploadException: Processing of multipart/form-data request failed. Read timed out
The agent also has logs stating a similar reason:
jetbrains.buildServer.AGENT - Failed to publish artifacts because of error: java.net.SocketException: Connection reset by peer: socket write error, will try again.
During all this the firewall logs show that all traffic on the expected ports is being allowed through. What is odd though are some logs that look like this:
2016-04-01 10:45:00 Deny [sourceIp] [targetIP] 49426/tcp 8080 49426 0-External Firebox tcp syn checking failed (expecting SYN packet for new TCP connection, but received ACK, FIN, or RST instead). 558 113 (Internal Policy) proc_id="firewall" rc="101" msg_id="3000-0148" tcp_info="offset 5 A 478076245 win 258"
Examining port 49426 on the agent shows that it was being used by java.exe. I'm assuming this has something to do with TeamCity, as it runs in the JVM. The next step was to scour every bit of config I could find to figure out where this port number comes from. After a while, the agent decided to retry and the port changed. It looks to me like Java is just using whatever port it wants (as if unassigned in code), so could there be something missing in the agent config instructing it which port to use for artifact uploads?
I did read somewhere that the server or the firewall might not like requests or file uploads that exceed a certain size (the largest file is 81 MB), but we found nothing to suggest there was such a rule in place.
The TeamCity version is old (v7.1.1), but we are currently unable to upgrade (I am waiting on approval to use a newer, bigger server due to hard disk space issues).
UPDATE
We very briefly opened up a bit of the firewall to see if it was the cause of the issues to no avail. At this point I'm not convinced the firewall is the problem.
Any ideas?
Thanks in advance.
UPDATE 2
I've ended up setting up a whole new build server and things work just fine there. The new server has the latest TeamCity version but the agents are the same machines and artifact uploads appear to work just fine. This isn't really a solution to the question but at least I have a working setup now.
This can happen when the agent is too slow to start sending data for whatever reason. This workaround by JetBrains employee Pavel Sher might help:
Increase the connectionTimeout value in the server.xml file:
<Connector port="8111" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8543"
           enableLookup="false"
           useBodyEncodingForURI="true" />
Change it from 20000 to 60000 or even more.
I have written a small client server socket application. It is a proof of concept for some socket programming that I want to apply to a much bigger project.
For the moment I want to use Wireshark to analyse the traffic that goes between them. They are both running on my local machine.
I have installed a loopback interface, and have tried to use wireshark with it.
No joy. Any ideas?
I have successfully analysed traffic between my machine and other machines no problems.
I have had a look here,
http://wiki.wireshark.org/CaptureSetup/Loopback
And I am not using the address 127.0.0.1, which it mentions, saying you can't capture traffic on 127.0.0.1.
Thanks.
You might try creating a virtual machine to run your application and using Wireshark on it.
Save yourself some grief and download Microsoft Network Monitor.
As good as Wireshark is on Unixen, Windows is a "special" case :)