Spurious failures on discovery client connecting to discovery server - spring-cloud

We are noticing a number of random failures in the Spring Cloud services that we created when they try to connect to the discovery server. They happen every now and then and so far have not caused discovery to fail.
However, our operations group is reporting this to us as a possible issue and we are trying to investigate. Is there any condition that we should be concerned about?
The example below is from my laptop, where I replicated the issue with the service and the discovery server (Eureka) running on the same machine, so networking does not seem to be the cause.
2015-08-10 10:41:51.592 INFO 9897 --- [pool-8-thread-1] o.a.http.impl.client.DefaultHttpClient : I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://localhost:8761: The target server failed to respond
2015-08-10 10:41:51.593 INFO 9897 --- [pool-8-thread-1] o.a.http.impl.client.DefaultHttpClient : Retrying request to {}->http://localhost:8761

Related

Elastic APM Intake API timeout except on python applications

I have an Elastic APM server, version 7.17.1. There are only two Django applications on the server. The APM service is using about 140MB of memory. When connecting new agents, I receive a timeout error.
node.js error
{"log.level":"error","@timestamp":"2022-07-07T23:15:34.033Z","log":{"logger":"elastic-apm-node"},"ecs":{"version":"1.6.0"},"message":"APM Server transport error: intake response timeout: APM server did not respond within 10s of gzip stream finish"}
Java error
2022-07-07 14:31:36,503 [elastic-apm-server-reporter] ERROR
co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Error
sending data to APM server: Read timed out, response code is -1
If I use flask or Django or PHP, new apps are registered.
(I couldn't find logs from Go, but that APM agent failed as well.)
APM server logs did not appear relevant as the errors occurred in both instances
Check your IPS (intrusion prevention system). In our env, we saw the IPS swatting down packets from Java and Go-based apps. I'm no developer, but once we got those sources white-listed, we saw the agents' information coming through :)

Theia IDE websocket disconnects every 30 sec when serving in Kubernetes behind ingress

I have a kubernetes ingress configured on google cloud with a managed certificate. Then I have the theia/theia-full docker image as a pod and a kubernetes service connecting the ingress and the pod.
The initial load of the theia page in my browser works and all plugins are started in the backend. After that every 30sec the browser issues another websocket request to wss://mytheiadomain. The theia backend logs
root ERROR [hosted-plugin: 59] Error: connection is closed
at Object.create (/home/theia/node_modules/@theia/plugin-ext/lib/common/rpc-protocol.js:82:30)
at Object.<anonymous> (/home/theia/node_modules/@theia/plugin-ext/lib/common/rpc-protocol.js:108:56)
at Object.disposable.dispose (/home/theia/node_modules/@theia/core/lib/common/disposable.js:101:13)
at DisposableCollection.dispose (/home/theia/node_modules/@theia/core/lib/common/disposable.js:78:40)
at RPCProtocolImpl.dispose (/home/theia/node_modules/@theia/plugin-ext/lib/common/rpc-protocol.js:129:24)
at /home/theia/node_modules/@theia/plugin-ext/lib/hosted/node/plugin-host.js:142:21
at step (/home/theia/node_modules/@theia/plugin-ext/lib/hosted/node/plugin-host.js:48:23)
at Object.next (/home/theia/node_modules/@theia/plugin-ext/lib/hosted/node/plugin-host.js:29:53)
at fulfilled (/home/theia/node_modules/@theia/plugin-ext/lib/hosted/node/plugin-host.js:20:58)
at processTicksAndRejections (internal/process/task_queues.js:97:5) {
code: 'RPC_PROTOCOL_CLOSED'
}
root INFO [e894a0b2-e9cd-4f35-8167-89eb28e840d8][typefox.yang-vscode]: Disconnected.
root INFO [e894a0b2-e9cd-4f35-8167-89eb28e840d8][rebornix.ruby]: Disconnected.
root INFO [e894a0b2-e9cd-4f35-8167-89eb28e840d8][ms-python.python]: Disconnected.
...
and all plugins disconnect and initialize again. (sometimes I don't even get this error message and the plugins just disconnect and initialize)
If I cut the wifi connection of my browser this does not happen! So the browser's wss request seems to trigger the restart. The disconnect every 30 sec does not happen if I run theia-full locally on plain Docker.
This is as far as I got tracing the error after a few hours of searching. Any hint would be appreciated. I can provide more log output and my configuration files.
The default timeout for Google Load Balancers is 30 seconds.
For external HTTP(S) load balancers and internal HTTP(S) load balancers, if the HTTP connection is upgraded to a WebSocket, the backend service timeout defines the maximum amount of time that a WebSocket can be open, whether idle or not.
You need to create a custom BackendConfig with the timeout that you want.

Application keeps crashing because of database

Our application keeps crashing once per day (at the start of the workday). The cause appears to be the connection to the database.
error: [SSL-QTEH-TD] E01000-SYSTEM_ERROR: [IBM][CLI Driver]
SQL30081N A communication error has been detected. Communication
protocol being used: "TCP/IP". Communication API being used:
"SOCKETS". Location where the error was detected: "000.00.00.00".
Communication function detecting the error: "recv". Protocol specific
error code(s): "104", "*", "0". SQLSTATE=08001
I'm unable to determine why this is happening.
You have a communication related SQL Error at the start of each working day. This implies that the network connectivity between your application and the database server was broken overnight, most likely for a scheduled downtime.
The break could be in any one or more of the following: your app, the server your app runs on, any proxy or firewall servers between your app and the database server, the database, or the server the database runs on.
Most likely it will be the database, allowing it to run reorgs and make backups. Next likely is a firewall, shutting down to allow maintenance. In any case your app needs to be able to detect the disconnect and recover.
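As a rough illustration of "detect the disconnect and recover", here is a minimal sketch of a reconnect-and-retry wrapper. Everything named in it is hypothetical: `ConnectionLost` stands in for whatever communication error your database driver raises (for example, the one carrying SQL30081N / SQLSTATE 08001), and `connect` and `operation` are callables you would supply yourself. It is not tied to any particular driver's API.

```python
import time


class ConnectionLost(Exception):
    """Hypothetical stand-in for a driver's communication error."""


def run_with_reconnect(connect, operation, max_attempts=3,
                       base_delay=0.5, sleep=time.sleep):
    """Run operation(conn); on ConnectionLost, reconnect and try again.

    connect   -- callable returning a fresh connection (assumed)
    operation -- callable taking a connection and returning a result (assumed)
    """
    conn = None
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            if conn is None:
                conn = connect()  # open (or reopen) the connection
            return operation(conn)
        except ConnectionLost as exc:
            last_error = exc
            conn = None  # discard the dead connection
            if attempt < max_attempts:
                # exponential backoff before the next attempt
                sleep(base_delay * 2 ** (attempt - 1))
    raise last_error
```

In a real application you would translate the driver's own exception into the retry condition. If you use a connection pool, validating connections when they are checked out may achieve the same effect with less code.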

Network Timeout for Containerized Service in Service Fabric

We lifted and shifted our cloud services to Azure Service Fabric. We converted our services to Windows containers and successfully deployed them to the cluster. However, we're getting a lot of timeout issues while uploading files to SharePoint and accessing Azure SQL Server.
See the example below:
SQL: A transport-level error has occurred when receiving results from the server. (provider: Session Provider, error: 19 - Physical connection is not usable)
SharePoint: The underlying connection was closed: A connection that was expected to be kept alive was closed by the server.
Note: These issues are not happening in cloud services.

PeopleSoft Webserver crashing, losing connection to AppServer

On our Webserver, we're seeing a ton of these errors:
Application Server last connected //psoftapp.company.net_8850
bea.jolt.ServiceException: bea.jolt.JoltRemoteService(GetCertificate)call(): Timeout\nbea.jolt.SessionException: Connection recv error\nbea.jolt.JoltException: [3] NwHdlr.recv(): Timeout Error
and on our Appserver:
PSPUBDSP_dflt.27505 (0) 07/20/11 08:13:33 (JNIUTIL): Java exception thrown: java.net.SocketException: Connection reset
I'm reading some tuning documents from PeopleSoft & I found a suggestion that I've seen in a couple of places -- Reducing the tcp_wait_time_interval to 60 seconds. I think I sort of understand what this is doing - It seems that network (or socket?) connections that are no longer being used are "recycled" or made available? Can someone confirm this? Also, why are these connections unused/stale? Is it caused by people not properly logging out of the app (and just closing the browser)?
Thanks!
PSPUBDSP is part of the Integration Broker application messaging framework. You could look at the Tuxedo logs or the Integration Broker Monitor to see what is going on. You may be running a high number of messages and overloading the server, or you may have a message with errors that is somehow causing the crashes.