EDIT
It seems that the second server DOES occasionally get this error, which makes me near certain it's a config problem. Could it be one of:
net.ipv4.tcp_fin_timeout = 2
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
Version information as requested:
Meteor: 1.5.0
OS: Ubuntu 16.04
Provider: AWS EC2
I'm getting the following error, intermittently and seemingly randomly, on both processes running on one server (of a pair). The other server never gets this error, and the error doesn't refer to any code I've written, so I can only assume it's (a) a bug in Meteor or (b) a bug with my server config. The server whose processes are crashing is also hosting two other Meteor sites, both of which occasionally get this error:
Error: write after end
at writeAfterEnd (_stream_writable.js:167:12)
at PassThrough.Writable.write (_stream_writable.js:212:5)
at IncomingMessage.ondata (_stream_readable.js:542:20)
at emitOne (events.js:77:13)
at IncomingMessage.emit (events.js:169:7)
at IncomingMessage.Readable.read (_stream_readable.js:368:10)
at flow (_stream_readable.js:759:26)
at resume_ (_stream_readable.js:739:3)
at nextTickCallbackWith2Args (node.js:511:9)
at process._tickDomainCallback (node.js:466:17)
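For context, "write after end" is Node's generic complaint about writing to a writable stream after end() has been called; a minimal reproduction, nothing to do with Meteor itself:

const { PassThrough } = require('stream');
const s = new PassThrough();
s.end();
s.write('late'); // emits Error: write after end; with no 'error' listener this becomes an uncaught exception and kills the process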
Things I've already checked:
memory limits (nowhere near close)
connection limits - very small, around 20 per server at the time of failure, and the processes were bumped to the second server within 1 minute, which handled them plus its own just fine
process limits - both processes on server 1 failed within 7 minutes of each other.
server config - while I was trying to eke out a little extra performance during load testing, I modified sysctl.conf based on a post I saw for high-load node.js servers. This is the contents of the faulty server's sysctl.conf; however, the functioning server has an identical config:
fs.file-max = 1000000
fs.nr_open = 1000000
ifs.file-max = 70000
net.nf_conntrack_max = 1048576
net.ipv4.netfilter.ip_conntrack_max = 32768
net.ipv4.tcp_fin_timeout = 2
net.ipv4.tcp_max_orphans = 8192
net.ipv4.ip_local_port_range = 16768 61000
net.ipv4.tcp_max_syn_backlog = 10024
net.ipv4.tcp_max_tw_buckets = 360000
net.core.netdev_max_backlog = 2500
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.core.somaxconn = 20048
I have an NGINX balancer on server1 which load balances across the 4 processes (2 per server). The NGINX error log is littered with lines like the following:
2017/08/17 16:15:01 [warn] 1221#1221: *6233472 an upstream response is buffered to a temporary file /var/lib/nginx/proxy/1/46/0000029461 while reading upstream, client: 164.68.80.47, server: server redacted, request: "GET path redacted HTTP/1.1", upstream: "path redacted", host: "host redacted", referrer: "referrer redacted"
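If the buffering warnings themselves ever become a concern, the proxy buffers (and whether responses may spill to disk at all) are tunable per location - a hypothetical snippet with illustrative values, not taken from this setup:

location / {
    proxy_pass http://meteor_upstream;   # hypothetical upstream name
    proxy_buffer_size 16k;
    proxy_buffers 32 16k;
    proxy_max_temp_file_size 0;          # 0 stops nginx spilling responses to /var/lib/nginx/proxy
}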
At the time of the error, I see a pair of lines like this:
2017/08/17 15:07:19 [error] 1222#1222: *6215301 connect() failed (111: Connection refused) while connecting to upstream, client: ip redacted, server: server redacted, request: "GET /admin/sockjs/info?cb=o2ziavvsua HTTP/1.1", upstream: "http://127.0.0.1:8080/admin/sockjs/info?cb=o2ziavvsua", host: "hostname redacted", referrer: "referrer redacted"
2017/08/17 15:07:19 [warn] 1222#1222: *6215301 upstream server temporarily disabled while connecting to upstream, client: ip redacted, server: server redacted, request: "GET /admin/sockjs/info?cb=o2ziavvsua HTTP/1.1", upstream: "http://127.0.0.1:8080/admin/sockjs/info?cb=o2ziavvsua", host: "hostname redacted", referrer: "referrer redacted"
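The "upstream server temporarily disabled" warning is nginx's passive health checking reacting to the refused connection; how quickly a backend is disabled, and for how long, is set per server in the upstream block - a hypothetical example, names and addresses are illustrative:

upstream meteor_upstream {
    server 127.0.0.1:8080 max_fails=3 fail_timeout=10s;
    server 127.0.0.1:8081 max_fails=3 fail_timeout=10s;
    server 10.0.0.2:8080  max_fails=3 fail_timeout=10s;
    server 10.0.0.2:8081  max_fails=3 fail_timeout=10s;
}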
If it matters at all, I'm using a 3-node mongo replica set, where both servers are pointing at all 3 nodes.
I'm also using a custom-hosted version of Kadira (since the hosted service went offline).
If there is no way to stop the errors, is there any way to stop them from taking down the entire process? There are times when 50-100 users are connected per process; booting them all because of one error seems excessive.
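As a stopgap rather than a fix (and with the usual caveats about continuing after an unexpected error), Node does let a process-level handler log the error instead of exiting - a minimal sketch, not Meteor-specific:

process.on('uncaughtException', (err) => {
  // Log and keep serving the other connected clients; be aware state may be inconsistent afterwards.
  console.error('Uncaught exception:', err.stack || err);
});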
It's been two days without a crash, so I think the solution was changing:
net.ipv4.tcp_fin_timeout = 2
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
to
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_tw_reuse = 0
I don't know which of those was causing the problem (probably the timeout). I still think it's a "bug" that a single "write after end" error crashes the entire Meteor process. Perhaps this should simply be logged.
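For anyone applying the same change, a quick way to set and verify the values (assuming they live in /etc/sysctl.conf):

sudo sysctl -w net.ipv4.tcp_fin_timeout=15 net.ipv4.tcp_tw_recycle=0 net.ipv4.tcp_tw_reuse=0
sudo sysctl -p     # reload /etc/sysctl.conf after editing it
sysctl net.ipv4.tcp_fin_timeout net.ipv4.tcp_tw_recycle net.ipv4.tcp_tw_reuse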
Related
I have a Spring Boot application that uses Java on the backend and React.js on the frontend. Some API calls are too slow in the production environment (they take 1 min in production and 4 ms locally). The slow APIs are not fetching any large data set. I was trying to debug the code and found Nginx error logs. The logs are as follows:
[error] 29755#29755: *632803 upstream timed out (110: Connection timed out) while connecting to upstream, client: 172.XX.X.XX, server: test.apps.com, request: "GET /apiv1/master/modules?&isglobal=all&startIndex=0&pageSize=10&sortBy=id HTTP/1.1", upstream: "https://172.XX.X.XX:443/mhk-cmt-app/master/modules?&isglobal=all&startIndex=0&pageSize=10&sortBy=id", host: "test.apps.com", referrer: "https://test.apps.com/admin/app-setting"
How can I improve the API response time in production?
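If the upstream genuinely needs longer than nginx's 60-second defaults while the slow endpoint is being profiled, the relevant proxy timeouts can be raised - a sketch that treats the symptom only, with an illustrative location and a placeholder upstream host:

location /apiv1/ {
    proxy_pass https://<upstream-host>;
    proxy_connect_timeout 75s;   # time allowed to establish the upstream connection (the 110 error above)
    proxy_send_timeout    120s;  # max gap between successive writes to the upstream
    proxy_read_timeout    120s;  # max gap between successive reads from the upstream
}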
I was successfully using Spilo (HA PostgreSQL Cluster with Docker) in Docker Swarm behind HAProxy. I used one of the HAProxy configurations posted by one of the users.
It was working fine with HAProxy 2.1. I updated HAProxy to 2.2 and suddenly it doesn't work anymore. In the announcement of HAProxy 2.2 I found that there were some changes to the health checks.
This is my backend section of the master that was working before:
backend backend_master
option httpchk OPTIONS /master
server dbnode1 spilo1:5432 maxconn 100 check port 8008 resolvers docker_resolver resolve-prefer ipv4
server dbnode2 spilo2:5432 maxconn 100 check port 8008 resolvers docker_resolver resolve-prefer ipv4
server dbnode3 spilo3:5432 maxconn 100 check port 8008 resolvers docker_resolver resolve-prefer ipv4
After reading the HAProxy 2.2 documentation I'm not sure why the current configuration doesn't work anymore.
This is the message from the logs:
Server be-postgres-master/dbnode1 is DOWN, reason: Layer7 invalid response, info: "TCPCHK got an empty response at step 1", check duration: 5ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Server be-postgres-master/dbnode2 is DOWN, reason: Layer7 invalid response, info: "TCPCHK got an empty response at step 1", check duration: 4ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Server be-postgres-master/dbnode3 is DOWN, reason: Layer7 invalid response, info: "TCPCHK got an empty response at step 1", check duration: 4ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 235/144508 (6) : backend 'be-postgres-master' has no server available!
I downgraded HAProxy to 2.1 and it works again, but how can I make it work with 2.2?
I don't know whether you're still struggling with the issue or not, but changing the request method from OPTIONS to GET in the httpchk line helped me.
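For reference, a minimal version of the backend from the question with only that change applied:

backend backend_master
option httpchk GET /master
server dbnode1 spilo1:5432 maxconn 100 check port 8008 resolvers docker_resolver resolve-prefer ipv4
server dbnode2 spilo2:5432 maxconn 100 check port 8008 resolvers docker_resolver resolve-prefer ipv4
server dbnode3 spilo3:5432 maxconn 100 check port 8008 resolvers docker_resolver resolve-prefer ipv4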
I am getting occasional layer 7 health check failures. This happens on the production machine seemingly at random, maybe once a minute or every few minutes on average. Here is the configuration:
backend api
mode http
option httpchk GET /api/v1/status HTTP/1.0
http-check expect status 200
balance roundrobin
server api1 127.0.0.1:8001 check fall 3 rise 2
server api2 127.0.0.1:8002 check fall 3 rise 2
The HAProxy log tells me the following:
Health check for server api/api2 failed, reason: Layer7 timeout, check duration: 10001ms, status: 2/3 UP.
The strange thing is that when I run a script to fetch the same URL at a much faster pace than HAProxy, it never fails to return a 200 response. It never hangs like it seems to do for HAProxy.
In addition, I'm getting occasional HAProxy errors for various API calls, not just health checks, all looking quite similar:
https-in~ api/api1 45/0/0/-1/30045 504 194 - - sHVN 50/49/13/10/0 0/0 "POST /api/v1/accounts HTTP/1.1"
What could be the issue here? This one really got me stumped.
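For orientation, these are the HAProxy settings that govern the two symptoms - the health-check timing and the 504s with the sH termination state - shown here with illustrative values rather than recommendations:

backend api
mode http
option httpchk GET /api/v1/status HTTP/1.0
http-check expect status 200
balance roundrobin
timeout connect 5s
timeout server 60s      # an expiring server timeout produces the 504 / sH responses
timeout check 10s       # additional read timeout once a health-check connection is established
server api1 127.0.0.1:8001 check inter 2s fall 3 rise 2
server api2 127.0.0.1:8002 check inter 2s fall 3 rise 2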
I have a GKE cluster with 2 nodes, with a service of type LoadBalancer.
When I call the service internally, a long request will not time out after 120 seconds.
But if I call the external IP of the Network Load Balancer that forwards to the internal service, I get an "Empty reply from server" response.
External call example:
curl -v "http://<public-ip>/longResponse"
* Trying <public-ip>...
* TCP_NODELAY set
* Connected to <public-ip> (<public-ip>) port 80 (#0)
> GET /longResponse HTTP/1.1
> Host: <public-ip>
> User-Agent: curl/7.54.0
> Accept: */*
>
* Empty reply from server
* Connection #0 to host <public-ip> left intact
curl: (52) Empty reply from server
Internal call example:
/ # wget -O - -S <service-name>/longResponse
Connecting to location-service (10.3.255.181:80)
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Content-Type: application/json
Content-Length: 15
Date: Thu, 28 Feb 2019 10:31:14 GMT
Connection: close
- 100% |*********************************************************************************************************************************************************************************************************************| 15 0:00:00 ETA
/ #
I've tried to find documentation for a request or socket timeout at the load balancer level, but I didn't encounter anything. Any idea?
Thanks.
Are you sure that's not a client-side timeout? A Network LB doesn't process packets other than to route them, so it should never send any response back.
Try the -m flag to curl?
Also, maybe capture a tcpdump on your client side so you can see what the network is actually doing.
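Concretely, those two checks might look like this (the timeout value is illustrative):

curl -v -m 300 "http://<public-ip>/longResponse"              # -m raises curl's own maximum time to 5 minutes
sudo tcpdump -i any -nn host <public-ip> -w client-side.pcap  # capture the exchange for later inspection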
Get the load-balancer's backend name with:
gcloud compute backend-services list
then
BACKEND=name-of-your-backend
gcloud compute backend-services update $BACKEND --timeout=600s
Otherwise, in the console: Network services ⇒ Load balancing ⇒ Backends, then you can click your HTTP backend(s) and edit the settings, including the timeout.
On a wider note, this may be one of several hops between server and client, each of which might time out. You're better off either living with the timeout (and making your long polls complete before the timeout), or drip-feeding data down the line... for instance, you can prepend whitespace to JSON, so you could send a space character every 30 seconds until you have a proper response body; see the sketch below. This will keep the load balancer from timing out.
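A minimal sketch of that drip-feed idea for a long-running JSON endpoint; Express and the doLongWork() helper are assumptions for illustration only:

const express = require('express');
const app = express();

app.get('/longResponse', (req, res) => {
  res.setHeader('Content-Type', 'application/json');
  // Leading whitespace is legal before a JSON body, so it keeps the connection
  // alive without corrupting the eventual response.
  const keepAlive = setInterval(() => res.write(' '), 30000);
  doLongWork()                                               // hypothetical long-running operation
    .then((result) => res.end(JSON.stringify(result)))
    .catch((err) => res.end(JSON.stringify({ error: String(err) })))
    .finally(() => clearInterval(keepAlive));
});

app.listen(8080);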
We're running a web API app on Kubernetes (1.9.3) in AWS (set up with kops). The app is a Deployment exposed by a Service (type: LoadBalancer), which is actually an ELB (v1) on AWS.
This generally works - except that some packets (fragments of HTTP requests) are "delayed" somewhere between the client <-> app container. (This happens both over HTTP and over HTTPS, which terminates on the ELB.)
From the node side:
(Note: almost all packets on the server side arrive duplicated 3 times.)
We use keep-alive, so the TCP socket is open and requests arrive and return pretty fast. Then the problem happens:
first, a packet with only the headers arrives [PSH,ACK] (I see the headers in the payload with tcpdump).
an [ACK] is sent back by the container.
The TCP socket/stream is quiet for a very long time (up to 30s and more - but the interval is not consistent; we consider >1s a problem).
another [PSH, ACK] with the HTTP data arrives, and the request can finally be processed in the app.
From the client side:
I've run some traffic from my computer, recording it on the client side to see the other end of the problem, but I'm not 100% sure it represents the real client side.
a [PSH,ACK] with the headers goes out.
a couple of [ACK]s with parts of the payload start going out.
no response arrives for a few seconds (or more) and no more packets go out.
[ACK] marked as [TCP Window update] arrives.
a short pause again and [ACK]s start arriving and the session continues until the end of the payload.
This is only happening under load.
To my understanding, this is happening somewhere between the ELB and kube-proxy, but I'm clueless and desperate for help.
These are the arguments kube-proxy runs with:
Commands: /bin/sh -c mkfifo /tmp/pipe; (tee -a /var/log/kube-proxy.log < /tmp/pipe & ) ; exec /usr/local/bin/kube-proxy --cluster-cidr=100.96.0.0/11 --conntrack-max-per-core=131072 --hostname-override=ip-10-176-111-91.ec2.internal --kubeconfig=/var/lib/kube-proxy/kubeconfig --master=https://api.internal.prd.k8s.local --oom-score-adj=-998 --resource-container="" --v=2 > /tmp/pipe 2>&1
And we use Calico as the CNI.
So far I've tried:
Using service.beta.kubernetes.io/aws-load-balancer-type: "nlb" - the issue remained.
(Playing around with ELB settings hoping something will do the trick ¯\_(ツ)_/¯)
Looking for errors in the kube-proxy logs; I found rare occurrences of the following:
E0801 04:10:57.269475 1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.Endpoints: Get https://api.internal.prd.k8s.local/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp: lookup api.internal.prd.k8s.local on 10.176.0.2:53: no such host
...and...
E0801 04:09:48.075452 1 proxier.go:1667] Failed to execute iptables-restore: exit status 1 (iptables-restore: line 7 failed
)
I0801 04:09:48.075496 1 proxier.go:1669] Closing local ports after iptables-restore failure
I couldn't find anything describing such issue and will appreciate any help. Ideas on how to continue and troubleshoot are welcome.
Best,
A
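A possible starting point for the node-level troubleshooting asked for above, assuming shell access to a worker node; the tools are standard, the port placeholder is illustrative:

sudo conntrack -S                                            # conntrack-tools; watch insert_failed / drop counters under load
cat /proc/sys/net/netfilter/nf_conntrack_count               # compare against the configured conntrack-max-per-core budget
sudo iptables-save -t nat | grep -c KUBE-                    # check the service rules survived the iptables-restore failures
sudo tcpdump -i any -nn -w node-side.pcap port <app-port>    # node-side capture to correlate with the client-side one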