Theia IDE websocket disonnects every 30 sec when serving in Kubernetes behind ingress - kubernetes

I have a kubernetes ingress configured on google cloud with a managed certificate. Then I have the theia/theia-full docker image as a pod and a kubernetes service connecting the ingress and the pod.
The initial load of the theia page in my browser works and all plugins are started in the backend. After that every 30sec the browser issues another websocket request to wss://mytheiadomain. The theia backend logs
root ERROR [hosted-plugin: 59] Error: connection is closed
at Object.create (/home/theia/node_modules/#theia/plugin-ext/lib/common/rpc-protocol.js:82:30)
at Object.<anonymous> (/home/theia/node_modules/#theia/plugin-ext/lib/common/rpc-protocol.js:108:56)
at Object.disposable.dispose (/home/theia/node_modules/#theia/core/lib/common/disposable.js:101:13)
at DisposableCollection.dispose (/home/theia/node_modules/#theia/core/lib/common/disposable.js:78:40)
at RPCProtocolImpl.dispose (/home/theia/node_modules/#theia/plugin-ext/lib/common/rpc-protocol.js:129:24)
at /home/theia/node_modules/#theia/plugin-ext/lib/hosted/node/plugin-host.js:142:21
at step (/home/theia/node_modules/#theia/plugin-ext/lib/hosted/node/plugin-host.js:48:23)
at Object.next (/home/theia/node_modules/#theia/plugin-ext/lib/hosted/node/plugin-host.js:29:53)
at fulfilled (/home/theia/node_modules/#theia/plugin-ext/lib/hosted/node/plugin-host.js:20:58)
at processTicksAndRejections (internal/process/task_queues.js:97:5) {
code: 'RPC_PROTOCOL_CLOSED'
}
root INFO [e894a0b2-e9cd-4f35-8167-89eb28e840d8][typefox.yang-vscode]: Disconnected.
root INFO [e894a0b2-e9cd-4f35-8167-89eb28e840d8][rebornix.ruby]: Disconnected.
root INFO [e894a0b2-e9cd-4f35-8167-89eb28e840d8][ms-python.python]: Disconnected.
...
and all plugins disconnect and initialize again. (sometimes I don't even get this error message and the plugins just disconnect and initialize)
If I cut the wifi connection of my browser this does not happen! So the browsers wss request seems to trigger the restart. The disconnect every 30sec does not happen if I run theia-full locally on plain docker.
This is as far as I got tracing the error after a few hours of searching. Any hint would be appreciated. I can provide more log output and my configuration files.

The default timeout for Google Load Balancers is 30 seconds.
For external HTTP(S) load balancers and internal HTTP(S) load balancers, if the HTTP connection is upgraded to a WebSocket, the backend service timeout defines the maximum amount of time that a WebSocket can be open, whether idle or not.
You need to create a custom BackendCondig with the timeout that you want.

Related

How to force kubernetes pod to route through the internet?

I have an issue with my kubernetes routing.
The issue is that one of the pods makes a GET request to auth.domain.com:443 but the internal routing is directing it to auth.domain.com:8443 which is the container port.
Because the host returning the SSL negotiation identifies itself as auth.domain.com:8443 instead of auth.domain.com:443 the connection times out.
[2023/01/16 18:03:45] [provider.go:55] Performing OIDC Discovery...
[2023/01/16 18:03:55] [main.go:60] ERROR: Failed to initialise OAuth2 Proxy: error intiailising provider: could not create provider data: error building OIDC ProviderVerifier: could not get verifier builder: error while discovery OIDC configuration: failed to discover OIDC configuration: error performing request: Get "https://auth.domain.com/realms/master/.well-known/openid-configuration": net/http: TLS handshake timeout
(If someone knows the root cause of why it is not identifying itself with the correct port 443 but instead the container port 8443, that would be extremely helpful as I could fix the root cause.)
To workaround this issue, I have the idea to force it to route out of the pod onto the internet and then back into the cluster.
I tested this by setting up the file I am trying to GET on a host external to the cluster, and in this case the SSL negoiation works fine and the GET request succeeds. However, I need to server the file from within the cluster, so this isn't a viable option.
However, if I can somehow force the pod to route through the internet, I believe it would work. I am having trouble with this though, because everytime the pod looks up auth.domain.com it sees that it is an internal kubernetes IP, and it rewrites the routing so that it is routed locally to the 10.0.0.0/24 address. After doing this, it seems to always return with auth.domain.com:8443 with the wrong port.
If I could force the pod to route through the full publicly routable IP, I believe it would work as it would come back with the external facing auth.domain.com:443 with the correct 443 port.
Anyone have any ideas on how I can achieve this or how to fix the server from identifying itself with the wrong auth.domain.com:8443 port instead of auth.domain.com:443 causing the SSL negotiation to fail?

Failed to accept an incoming connection: connection from "9.42.x.x" rejected, allowed hosts: "zabbix-server"

SUMMARY
I have installed zabbix on OpenShift cluster. I am trying to monitor a host(vm) outside the cluster but the zabbix server is unable to connect to it. In the /etc/zabbix/zabbix_agentd.conf file I have mentioned the DNS name of the server zabbix-server but it looks like there server is trying to connect through a different public IP. I am not sure what this IP is.
OS / ENVIRONMENT / Used docker-compose files
I applied the kubernetes.yaml file present in this repo - https://github.com/zabbix/zabbix-docker/blob/6.2/kubernetes.yaml - on an OpenShift cluster.
CONFIGURATION
In the /etc/zabbix/zabbix_agentd.conf file Server=zabbix-server.
STEPS TO REPRODUCE
Apply the kubernetes.yaml file on Openshift cluster and try to monitor any external vm.
EXPECTED RESULTS
The zabbix server should be able to connect to the vm.
ACTUAL RESULTS
Zabbix server logs.
Defaulted container "zabbix-server" out of: zabbix-server, zabbix-snmptraps
\*\* Updating '/etc/zabbix/zabbix_server.conf' parameter "DBHost": 'mysql-server'...added
287:20230120:060843.131 Zabbix agent item "system.cpu.load\[all,avg5\]" on host "Host-C" failed: first network error, wait for 15 seconds
289:20230120:060858.592 Zabbix agent item "system.cpu.num" on host "Host-C" failed: another network error, wait for 15 seconds
289:20230120:060913.843 Zabbix agent item "system.sw.arch" on host "Host-C" failed: another network error, wait for 15 seconds
289:20230120:060929.095 temporarily disabling Zabbix agent checks on host "Host-C": interface unavailable
Logs from the agent installed on the vm.
350446:20230122:103232.230 failed to accept an incoming connection: connection from "9.x.x.219" rejected, allowed hosts: "zabbix-server"
350444:20230122:103332.525 failed to accept an incoming connection: connection from "9.x.x.219" rejected, allowed hosts: "zabbix-server"
350445:20230122:103432.819 failed to accept an incoming connection: connection from "9.x.x.210" rejected, allowed hosts: "zabbix-server"
350446:20230122:103533.114 failed to accept an incoming connection: connection from "9.x.x.217" rejected, allowed hosts: "zabbix-server"
If I add this IP in /etc/zabbix/zabbix_agentd.conf it will work. But what IP is this? Is this a service? Or any node/pod IP? It keeps on changing. Everytime I cannot change this id in the conf file. I need something more stable.
Kindly help me out with this issue.
So I don't know zabbix. So I have to make some educated guesses both in how the agent works and how the server works.
But, to summarize, unlike something like docker compose where you are running the zabbix server on a known server, in Openshift/Kubernetes you are deploying into a cluster of machines with their own networking. In other words, the whole point of OpenShift is that OpenShift will control where the application's pod gets deployed and will relocate/restart that pod as needed. With a different IP every time. (And the DNS name is meaningless since the two systems aren't sharing DNS anyway.) Most likely the IP's you are seeing are the pod's randomly assigned IPs.
So, what are you to do when you have a situation like yours where an external application requires a predicable IP? Well, option 1, is to remove that requirement. Using something like a certificate is obviously more secure and more reliable than depending on an IP anyway. But another option is to use an egress IP. This is a feature of OpenShift where you essentially use a proxy to provide an external application with a consistent IP.

Elastic APM Intake API timeout except on python applications

I have an elastic apm server version 7.17.1. There are only to Django application on the server. The APM service is using about 140MB of memory. When connecting new agents, I receive a timeout error.
node.js error
{"log.level":"error","#timestamp":"2022-07-07T23:15:34.033Z","log":{"logger":"elastic-apm-node"},"ecs":{"version":"1.6.0"},"message":"APM Server transport error: intake response timeout: APM server did not respond within 10s of gzip stream finish"}
Java error
2022-07-07 14:31:36,503 [elastic-apm-server-reporter] ERROR
co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Error
sending data to APM server: Read timed out, response code is -1
If I use flask or Django or PHP, new apps are registered.
(I couldn't find logs from Go, but that APM agent failed as well.)
APM server logs did not appear relevant as the errors occurred in both instances
Check your IPS. In our env, we saw the IPS swatting down packets from Java and Go-based apps. I'm no developer, but once we got those sources white-listed, we saw the Agents information coming through :)

Are service addresses available to the dc/os host OS?

I’m trying to have my dc/os 1.8 docker containers send log messages to a logstash that is also running in dc/os by using the service address of the logstash service.
that doesn’t appear to work as docker throws an error: logstash.marathon.l4lb.thisdcos.directory: no such host
are service addresses not exposed to the host systems (or do I need to configure something for this)?
on dc/os 1.7 I used a fixed host port in my logstash config and logstash.marathon.mesos as host, but these .marathon.mesos hostnames seem to not exist in 1.8 anymore.
the service addresses work fine when I try to use them from within a container (for example to link my prometheus service to my alertmanager service). but from the host level they don’t exist.
EDIT:
my statement about the missing marathon.mesos urls was wrong. they do work, but I uses the wrong one. for now this fixes my problem kind of. I configured logging using this host and a fixed container port.
for everybody trying the same thing: you have to configure the fixed host port everytime you make changes to the service config in the ui via the json mode. the fixed host port config is no longer available in the network tab of the ui, so the dc/os ui will DELETE the host port config on every load.
still no idea why the l4lb urls don't work.
EDIT2
still no idea, but i figured out that minuteman generates crash and error logs every other second:
/opt/mesosphere/active/minuteman/minuteman/error.log:
CRASH REPORT Process <0.25809.2> with 0 neighbours exited with reason: {timeout,{gen_server,call,[{lashup_kv,'navstar#10.2.140.216'},{start_kv_sync_fsm,'minuteman#10.2.103.143',<0.25809.2>}]}} in gen_server:call/2 line 204
/opt/mesosphere/active/minuteman/minuteman/log/crash.log
2016-10-12 13:16:49 =CRASH REPORT====
crasher:
initial call: lashup_kv_sync_tx_fsm:init/1
pid: <0.29002.2>
registered_name: []
exception exit: {{timeout,{gen_server,call,[{lashup_kv,'navstar#10.2.140.216'},{start_kv_sync_fsm,'minuteman#10.2.103.143',<0.29002.2>}]}},[{gen_server,call,2,[{file,"gen_server.erl"},{line,204}]},{lashup_kv_sync_tx_fsm,init,1,[{file,"/pkg/src/minuteman/_build/default/lib/lashup/src/lashup_kv_sync_tx_fsm.erl"},{line,23}]},{gen_statem,init_it,6,[{file,"gen_statem.erl"},{line,554}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
ancestors: [lashup_kv_aae_sup,lashup_kv_sup,lashup_platform_sup,lashup_sup,<0.916.0>]
messages: []
links: [<0.992.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 610
stack_size: 27
reductions: 127
neighbours:
the dc/os ui claims spartan and minuteman are healthy, but while the crash.log of the dns dispatcher is empty the l4lb gets new crashes every other second.
They should certainly be available from the host OS. Are these host services running the "Spartan" and "Minuteman" services?
my problem was twofold:
the l4b did not properly run, that was only fixed after a total reinstall of the cluster
the l4b only supports TCP traffic. because i wanted to use it to send container-logs to logstash using udp (docker-gelf only supports UDP) this failed

Nginx proxy hangs in a while when idle

We are using nginx proxy_pass feature for bridging RESTful calls to a backend app, and use nginx web socket proxy for the same system at eh same time. Sometimes (guess when the system has no client request for a while) the nginx freezes any request till we restart it and then anything works well. What is the problem? DO I have to change keep-alive settings? I have turned off buffer and cache feature for proxy in nginx.conf.
I found the problem. By checking nginx error log and a bit a hackery sniff and guess, I found out that the web socket connections usually disconnect and reconnect (mobile devices) and the nginx peer tries to keep the connection alive, and then maximum connection limit reaches. I just decreased timeouts and increased max connections.