TCP socket over HTTP proxy disconnects after idle timeout

I have a problem with a TCP socket when using HTTP tunneling through a proxy.
The client (C++) opens a TCP socket to a server (Java). I added support for an HTTP proxy. Everything worked well: the client sends an HTTP CONNECT request like this and continues with the plain TCP connection after the proxy's 200 response:
CONNECT servername:5555 HTTP/1.1
Host: servername:5555
Proxy-Connection: Keep-Alive
HTTP/1.1 200
However, if an idle timeout is configured on the proxy and no actual data is sent, the connection is terminated even though the client sends TCP keep-alive packets every 60 seconds. The idle timeout is configured to 10 minutes.
TCP keep-alive is configured as follows:
WSAIoctl(socket, SIO_KEEPALIVE_VALS, &alive, sizeof(alive), NULL, 0, &dwBytesRet, NULL, NULL)
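For reference, a minimal sketch of how those keep-alive values might be set up (the struct values and the 60-second idle time are assumptions based on the description above, not the poster's actual code):

#include <winsock2.h>
#include <mstcpip.h>   // tcp_keepalive, SIO_KEEPALIVE_VALS

// Sketch: enable TCP keep-alive with a 60-second idle time (assumed values).
bool enableKeepAlive(SOCKET s)
{
    tcp_keepalive alive = {};
    alive.onoff             = 1;       // turn keep-alive on
    alive.keepalivetime     = 60000;   // idle time before first probe, in ms
    alive.keepaliveinterval = 1000;    // interval between probes, in ms

    DWORD dwBytesRet = 0;
    return WSAIoctl(s, SIO_KEEPALIVE_VALS, &alive, sizeof(alive),
                    NULL, 0, &dwBytesRet, NULL, NULL) == 0;   // 0 means success
}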
client IP - 192.168.91.xxx
Proxy IP - 192.168.92.yyy
244 47.133017000 192.168.91.xxx 192.168.92.yyy TCP 55 [TCP Keep-Alive] 64351 > 808 [ACK] Seq=4336 Ack=13084 Win=65700 Len=1
245 47.133336000 192.168.92.yyy 192.168.91.xxx TCP 66 [TCP Keep-Alive ACK] 808 > 64351 [ACK] Seq=13084 Ack=4337 Win=65536 Len=0 SLE=4336 SRE=4337
Any ideas how to keep the connection alive?
I tried adding a "Connection: Keep-Alive" header, although HTTP/1.1 should keep connections alive by default. It didn't help.

This is a timeout at the application layer, i.e. the connection is idle because no application data is sent. What you've tried will not work because:
Connection: keep-alive is for having multiple HTTP requests over a single connection. This does not apply here because, from the proxy's point of view, there is only a single request (CONNECT).
TCP keep-alive is for noticing that the peer is no longer reachable (it died without closing the connection, or the connection broke somewhere in the middle). It does not apply to cases where the TCP connection is still alive but idle (no application data).
Having an idle timeout on the proxy makes sense. The idea of HTTP is that the client sends a request and the server sends a response. If the connection is idle while the request or response is being received, something is usually broken (or you have a really slow connection). If it is idle after the request and response have finished, it is perfectly valid to close the connection, even if the client asked for Connection: keep-alive, because keep-alive is not a requirement on the server but only a suggestion to keep the connection open for more requests if the server has enough resources to do so.
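To illustrate what "application data" means here: only a real payload sent through the tunnel resets the proxy's idle timer. A minimal sketch, assuming the Java server is written to tolerate (or ignore) a small heartbeat message of your own design:

#include <winsock2.h>

// Sketch: send a tiny application-level heartbeat so the proxy sees real
// traffic inside the CONNECT tunnel. "PING\n" is an assumed message; the
// server must know to ignore it.
void sendHeartbeat(SOCKET s)
{
    const char ping[] = "PING\n";
    send(s, ping, (int)sizeof(ping) - 1, 0);   // counts as data, unlike a TCP keep-alive probe
}

// Call this e.g. once a minute while the connection is otherwise idle.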

Related

TCP socket state becomes persist after changing IP address, even though keep-alive was configured earlier

I ran into a problem with TCP socket keepalive.
TCP keep-alive is enabled and configured after the socket is connected, and the system has its own TCP keep-alive configuration.
'ss -to' shows the keep-alive information for the connection.
The network interface is a PPPoE device; if we ifup the interface, it gets a new IP address, and the old TCP connection stays established until the keep-alive timeout.
But sometimes 'ss -to' shows that the TCP connection has become 'persist', which takes a long time (about 15 minutes) to close.
This is the result of 'ss -to':
ESTAB 0 591 172.0.0.60:46402 10.184.20.2:4335 timer:(persist,1min26sec,14)
The source address is '172.0.0.60', but the network interface's actual address has been updated to '172.0.0.62'.
This is the correct result of 'ss -to':
ESTAB 0 0 172.0.0.62:46120 10.184.20.2:4335 timer:(keepalive,4.480ms,0)
I don't know why the "timer" has changed to 'persist', which effectively disables keep-alive.
In short: TCP keepalive is only relevant if the connection is idle, i.e. there is no data to send. If instead there is still data to send but sending is currently impossible due to a missing ACK or a window of 0, then other timeouts are relevant. This is likely the problem in your case.
For the deeper details see The Cloudflare Blog: When TCP sockets refuse to die.
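The article linked above discusses TCP_USER_TIMEOUT as the knob that bounds how long unacknowledged data may stay outstanding (the zero-window/persist situation where plain keepalive never fires). A minimal sketch, assuming a Linux socket and an arbitrary 30-second limit:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Sketch: cap the time transmitted data may remain unacknowledged before the
// kernel drops the connection (Linux only; TCP_USER_TIMEOUT is in milliseconds).
int setUserTimeout(int fd)
{
    unsigned int timeoutMs = 30000;   // assumed example value
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeoutMs, sizeof(timeoutMs));
}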

Long request returns an empty response after 120 seconds, caused by Network Load Balancer

I have a GKE cluster with 2 nodes and a service of type LoadBalancer.
When I call the service internally, a long request does not time out, even after 120 seconds.
But if I call the external IP of the Network Load Balancer that forwards to the internal service, I get an "Empty reply from server" response.
External call example:
curl -v "http://<public-ip>/longResponse"
* Trying <public-ip>...
* TCP_NODELAY set
* Connected to <public-ip> (<public-ip>) port 80 (#0)
> GET /longResponse HTTP/1.1
> Host: <public-ip>
> User-Agent: curl/7.54.0
> Accept: */*
>
* Empty reply from server
* Connection #0 to host <public-ip> left intact
curl: (52) Empty reply from server
Internal call example:
/ # wget -O - -S <service-name>/longResponse
Connecting to location-service (10.3.255.181:80)
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Content-Type: application/json
Content-Length: 15
Date: Thu, 28 Feb 2019 10:31:14 GMT
Connection: close
- 100% |*********************************************************************************************************************************************************************************************************************| 15 0:00:00 ETA
/ #
I've tried to find documentation for a request or socket timeout at the load-balancer level, but I didn't find anything. Any ideas?
Thanks.
Are you sure that's not a client-side timeout? A Network LB doesn't process packets other than to route them, so it should never send any response back.
Try the -m flag to curl?
Also, maybe capture a tcpdump on the client side so you can see what the network is actually doing.
Get the load-balancer's backend name with:
gcloud compute backend-services list
then
BACKEND=name-of-your-backend
gcloud compute backend-services update $BACKEND --timeout=600s
otherwise, in the console: Network services ⇒ Load balancing ⇒ Backends, then you can click your HTTP backend(s) and edit the settings, including the timeout.
On a wider note, this may be one of several hops between server and client, each of which might time out. You're better off either living with the timeout (and making your long polls complete before the timeout) or drip-feeding data down the line, as sketched below: for instance, you can prepend whitespace to JSON, so send a space character every 30 seconds until you have a proper response body. This will keep the load balancer from timing out.
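As an illustration of the drip-feed idea only (not GKE-specific and not the poster's service): assuming the handler has already written its headers without a fixed Content-Length, something along these lines keeps every hop busy until the real body is ready:

#include <chrono>
#include <future>
#include <string>
#include <sys/socket.h>

// Sketch: emit one space every 30 seconds until the JSON body is ready, then
// send the body. Leading whitespace is ignored by JSON parsers, so the client
// still gets a valid document.
void writeLongResponse(int clientFd, std::future<std::string> body)
{
    while (body.wait_for(std::chrono::seconds(30)) != std::future_status::ready)
        send(clientFd, " ", 1, 0);                 // keeps intermediaries from seeing an idle connection
    const std::string json = body.get();
    send(clientFd, json.data(), json.size(), 0);
}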

HAProxy closes the connection after 1 minute

This is the config:
frontend https_frontend
    bind *:4055
    mode tcp
    maxconn 8192
    use_backend https_web
backend https_web
    mode tcp
    balance roundrobin
    option http-keep-alive
    server haproxy2 xxx.xxx.xxx.xxx:4055 send-proxy-v2
A new connection sends keep-alive packets every 30 seconds, but the connection drops after 1 minute.
I think this is because you're using mode tcp, but option http-keep-alive is a mode http option. In this case, it would most likely be using whatever value you have for timeout client or timeout server before dropping the connection.
For more details about option http-keep-alive and mode http, see:
https://www.haproxy.com/documentation/aloha/7-5/traffic-management/lb-layer7/http-modes/#http-modes-in-haproxy
frontend https_frontend
    bind *:4055
    mode tcp
    maxconn 8192
    use_backend https_web
backend https_web
    mode tcp
    balance roundrobin
    timeout client 600000
    timeout server 600000
    server haproxy2 147.78.65.172:4055 send-proxy-v2
Now I send keep-alive packets and real data every 30 seconds,
but it still drops after 2 minutes.
It's not an HTTP/HTTPS query; it's plain TCP communication with random data. Maybe that's the problem?

Kube-proxy or ELB "delaying" packets of HTTP requests

We're running a web API app on Kubernetes (1.9.3) in AWS (set up with KOPS). The app is a Deployment and is exposed by a Service (type: LoadBalancer), which is actually an ELB (v1) on AWS.
This generally works, except that some packets (fragments of HTTP requests) are "delayed" somewhere between the client and the app container (with both HTTP and HTTPS, which terminates on the ELB).
From the node side:
(Note: almost all packets on the server side arrive duplicated 3 times.)
We use keep-alive, so the TCP socket is open and requests arrive and return pretty fast. Then the problem happens:
First, a packet with only the headers arrives [PSH, ACK] (I see the headers in the payload with tcpdump).
An [ACK] is sent back by the container.
The TCP socket/stream is quiet for a very long time (up to 30 s and more, but the interval is not consistent; we consider >1 s a problem).
Another [PSH, ACK] with the HTTP data arrives, and the request can finally be processed by the app.
From the client side:
I've run some traffic from my computer, recording it on the client side to see the other end of the problem, but I'm not 100% sure it represents the real client side.
A [PSH, ACK] with the headers goes out.
A couple of [ACK]s with parts of the payload start going out.
No response arrives for a few seconds (or more) and no more packets go out.
An [ACK] marked as [TCP Window update] arrives.
After a short pause, [ACK]s start arriving and the session continues until the end of the payload.
This is only happening under load.
To my understanding, this is somewhere between the ELB and the Kube-Proxy, but I'm clueless and desperate for help.
These are the arguments kube-proxy runs with:
Commands: /bin/sh -c mkfifo /tmp/pipe; (tee -a /var/log/kube-proxy.log < /tmp/pipe & ) ; exec /usr/local/bin/kube-proxy --cluster-cidr=100.96.0.0/11 --conntrack-max-per-core=131072 --hostname-override=ip-10-176-111-91.ec2.internal --kubeconfig=/var/lib/kube-proxy/kubeconfig --master=https://api.internal.prd.k8s.local --oom-score-adj=-998 --resource-container="" --v=2 > /tmp/pipe 2>&1
And we use Calico as the CNI.
So far I've tried:
Using service.beta.kubernetes.io/aws-load-balancer-type: "nlb" - the issue remained.
(Playing around with ELB settings hoping something would do the trick ¯\_(ツ)_/¯.)
Looking for errors in the kube-proxy logs, I found rare occurrences of the following:
E0801 04:10:57.269475 1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.Endpoints: Get https://api.internal.prd.k8s.local/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp: lookup api.internal.prd.k8s.local on 10.176.0.2:53: no such host
...and...
E0801 04:09:48.075452 1 proxier.go:1667] Failed to execute iptables-restore: exit status 1 (iptables-restore: line 7 failed
)
I0801 04:09:48.075496 1 proxier.go:1669] Closing local ports after iptables-restore failure
I couldn't find anything describing such an issue and would appreciate any help. Ideas on how to continue troubleshooting are welcome.
Best,
A

Fiddler doesn't work

Fiddler almost doesn't work for me. The problem seems to be only with HTTPS.
For example, to open https://google.com I need to wait around 40 seconds.
Screenshots:
immediately after request
after ~40 seconds
Fiddler log:
18:02:46:3326 Fiddler Running...
18:02:46:3922 Windows 8+ AppContainer isolation feature detected.
18:03:09:5427 Assembly 'C:\Program Files (x86)\Fiddler2\CertMaker.dll' was not found. Using default Certificate Generator.
18:03:09:5467 /Fiddler.CertMaker> Using Fiddler.DefaultCertificateProvider+CertEnrollEngine for certificate generation; UseWildcards=False.
18:03:11:3745 HTTPSLint> Warning: ClientHello record was 508 bytes long. Some servers have problems with ClientHello's greater than 255 bytes. github.com/ssllabs/research/wiki/Long-Handshake-Intolerance
18:03:11:3855 HTTPSLint> Warning: ClientHello record was 508 bytes long. Some servers have problems with ClientHello's greater than 255 bytes. github.com/ssllabs/research/wiki/Long-Handshake-Intolerance
18:03:11:3895 HTTPSLint> Warning: ClientHello record was 508 bytes long. Some servers have problems with ClientHello's greater than 255 bytes. github.com/ssllabs/research/wiki/Long-Handshake-Intolerance
18:03:11:3915 HTTPSLint> Warning: ClientHello record was 508 bytes long. Some servers have problems with ClientHello's greater than 255 bytes. github.com/ssllabs/research/wiki/Long-Handshake-Intolerance
18:03:11:3945 HTTPSLint> Warning: ClientHello record was 508 bytes long. Some servers have problems with ClientHello's greater than 255 bytes. github.com/ssllabs/research/wiki/Long-Handshake-Intolerance
18:03:20:2192 [Fiddler] No HTTPS request was received from (chrome:10428) new client socket, port 6091.
18:03:20:3110 [Fiddler] No HTTP request was received from (chrome:10428) new client socket, port 6134.
18:03:20:3120 [Fiddler] No HTTP request was received from (chrome:10428) new client socket, port 6130.
18:03:28:8160 HTTPSLint> Warning: ClientHello record was 508 bytes long. Some servers have problems with ClientHello's greater than 255 bytes. github.com/ssllabs/research/wiki/Long-Handshake-Intolerance
18:03:30:2198 [Fiddler] No HTTPS request was received from (chrome:10428) new client socket, port 6095.
18:03:30:2198 [Fiddler] No HTTPS request was received from (chrome:10428) new client socket, port 6097.
18:03:30:2198 [Fiddler] No HTTPS request was received from (chrome:10428) new client socket, port 6099.
18:03:30:2198 [Fiddler] No HTTPS request was received from (chrome:10428) new client socket, port 6101.
18:03:50:2219 [Fiddler] No HTTPS request was received from (chrome:10428) new client socket, port 6163.
18:03:50:2219 [Fiddler] No HTTPS request was received from (chrome:10428) new client socket, port 6141.
18:03:50:2219 [Fiddler] No HTTPS request was received from (chrome:10428) new client socket, port 6167.
18:04:10:2230 [Fiddler] No HTTPS request was received from (chrome:10428) new client socket, port 6176.
18:04:10:2230 [Fiddler] No HTTPS request was received from (chrome:10428) new client socket, port 6179.
Many times in Chrome I see "Waiting for proxy tunnel..." and the site shows "This webpage is not available (ERR_TIMED_OUT)".
In Edge I can't even open an HTTP site; for all sub-requests I see a blue up arrow, which means Fiddler is trying to load them (after ~40 seconds all those requests finish loading).
I tried to reset Internet Properties -> Advanced tab -> Restore advanced settings; it didn't help.
I also restarted my system, and I restarted Fiddler after every change I made.
Fiddler settings:
Certificates are generated by the CertEnroll engine. I tried changing it to MakeCert. A few times I reset all certificates, and I also manually removed certificates.
Browsers: Chrome/Firefox
Gateway info in fiddler: No upstream gateway proxy is configured.
Recently I did a clean installation of Windows 10.
I do not have any antivirus.
Windows 10 Pro x64
Fiddler v4.6.2.0
I need Fiddler for my work. Please help me.
UPDATE:
This could be an issue with protocols. Currently in Fiddler I see the following:
fiddler.network.https> HTTPS handshake to www.bing.com (for #4) failed. System.IO.IOException Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. < An existing connection was forcibly closed by the remote host
and
fiddler.network.https> HTTPS handshake to www.google.com.ua (for #23) failed. System.ComponentModel.Win32Exception The client and server cannot communicate, because they do not possess a common algorithm
As I thought, the issue was with the protocols enabled in Windows Internet Options and in Fiddler's protocol list.
I ticked "Use SSL 3.0" and "Use TLS 1.0" in Internet Properties (all others should be unticked),
and in Fiddler's protocols I typed: ;ssl3;tls1.0
After these changes everything works perfectly.
Yes, enable "Use SSL 3.0" and "Use TLS 1.0" in Internet Properties for it to work. Previously I made the same mistake, but now it's working fine. I also verified the same settings in Fiddler.