Tidal Workload Automation: Fault Monitor lost the connection with Master - tidal-scheduler

02/01 21:55:41:394[197:RdOnly-13]: (mem=2286410344/2508193792) FTNodeMessageHandler: Connection is lost from node

I restarted the Fault Monitor service on the Fault Monitor server. That resolved the issue.

Related

FabricGateway.exe goes into a boot loop after a server reboot

I have an on-prem Service Fabric 3-node cluster running 8.2.1571.9590. It has been running for months without any problems.
The cluster nodes were rebooted overnight as part of operating system patching, and the cluster will now not establish connections.
If I run Connect-ServiceFabricCluster -Verbose, I get the timeout error
System.Fabric.FabricTransientException: Could not ping any of the provided Service Fabric gateway endpoints.
Looking at the running processes, I can see that all the expected processes start and are stable, with the exception of FabricGateway.exe, which goes into a boot loop cycle.
I have confirmed that:
I can do a TCP/IP ping between the nodes in the cluster
I can open a PowerShell remote session between the nodes in the cluster
No cluster certificates have expired.
Any suggestions as to how to debug this issue?
Actual Problem
On checking the Windows event logs (Admin Tools > Event Viewer > Application & Service Logs > Microsoft Service Fabric > Admin), I could see errors related to the startup of the FabricGateway process. The errors and warnings came in repeated sets in the following basic order:
CreateSecurityDescriptor: failed to convert mydomain\admin-old to SID: NotFound
failed to set security settings to { provider=Negotiate protection=EncryptAndSign remoteSpn='' remote='mydomain\admin-old, mydomain\admin-new, mydomain\sftestuser, ServiceFabricAdministrators, ServiceFabricAllowedUsers' isClientRoleInEffect=true adminClientIdentities='mydomain\admin-old, mydomain\admin-new, ServiceFabricAdministrators' claimBasedClientAuthEnabled=false }: NotFound
Failed to initialize external channel error : NotFound
EntreeService proxy open failed : NotFound
FabricGateway failed with error code = S_OK
client-sfweb1:19000/[::1]:19000: error = 2147943625, failureCount=9082. This is conclusive that there is no listener. Connection failure is expected if listener was never started, or listener / its process was stopped before / during connecting. Filter by (type~Transport.St && ~"(?i)sfweb1:19000") on the other node to get listener lifecycle, or (type~Transport.GlobalTransportData) for all listeners
Using Windows Task Manager (or a similar tool), you could see the FabricGateway.exe process starting and terminating every few seconds.
The net effect was that Service Fabric cluster communication could not be established.
Solution
The problem was that the domain account mydomain\admin-old (an old historic account, not used for a long time) had been deleted from Active Directory, so no SID for the account could be found. This failure was causing the restart loop, even though the other admin accounts were valid.
The fix was to remove the deleted account from each cluster node's currently active Settings.xml file. The process I used was:
RDP onto a cluster node VM
Stop the Service Fabric service
Find the current Service Fabric cluster configuration, e.g. the newest folder of the form D:\SvcFab\VM0\Fabric\Fabric.Config.4.123456
Edit the Settings.xml and remove the deleted account mydomain\admin-old from the AdminClientIdentities block (a scripted sketch of this edit follows these steps), so I ended up with
<Section Name="Security">
<Parameter Name="AdminClientIdentities" Value="mydomain\admin-new" />
...
Once the file is saved, restart the Service Fabric service; it should start as normal. Remember, it will take a minute or two to start up.
Repeat the process on the other nodes in the cluster.
Once this was completed, the cluster started and operated as expected.
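For reference, below is a minimal Python sketch of the Settings.xml edit, assuming Python is available on the node. It is only a sketch: the path and account name are the ones from this incident, it should be run with the Service Fabric service stopped, and you should keep a backup of the original file.

import re
import xml.etree.ElementTree as ET

SETTINGS = r"D:\SvcFab\VM0\Fabric\Fabric.Config.4.123456\Settings.xml"   # newest Fabric.Config.* folder on this node
DELETED_ACCOUNT = r"mydomain\admin-old"

# Preserve the file's default XML namespace so ElementTree does not rewrite it as ns0:.
ns = re.match(r"\{(.+)\}", ET.parse(SETTINGS).getroot().tag)
if ns:
    ET.register_namespace("", ns.group(1))

tree = ET.parse(SETTINGS)
for elem in tree.getroot().iter():
    if elem.tag.endswith("Parameter") and elem.get("Name") == "AdminClientIdentities":
        # Drop the deleted account, keep the remaining identities.
        kept = [i.strip() for i in elem.get("Value", "").split(",")
                if i.strip().lower() != DELETED_ACCOUNT.lower()]
        elem.set("Value", ", ".join(kept))

tree.write(SETTINGS, xml_declaration=True, encoding="utf-8")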

Cadence - Cadence Canary failing to start

I am running into this error after Cadence Canary gets started on my cluster nodes.
After the error "error starting cron workflow...", Cadence Canary does nothing and just hangs there.
Any thoughts/suggestions?
UPDATE: I have turned on debug level logging and I am getting hammered with the following (note: it's a fresh cluster):
This error message says that cadence-canary was not able to call the cadence-frontend service. This might indicate that cadence-frontend is not running or is not reachable. Check that cadence-frontend is running and that your cadence-canary config points to the correct cadence-frontend address.
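A quick way to rule out the connectivity side is a plain TCP check from the canary host to the frontend address in your canary config. This is a sketch with assumptions: the host name below is a placeholder, and 7933 is only the common default TChannel port for cadence-frontend; use whatever address and port your config actually specifies.

import socket

FRONTEND_HOST = "cadence-frontend"   # placeholder; use the address from your cadence-canary config
FRONTEND_PORT = 7933                 # common default TChannel port; adjust if your deployment differs

try:
    # A successful TCP connect only proves the endpoint is reachable and listening.
    with socket.create_connection((FRONTEND_HOST, FRONTEND_PORT), timeout=5):
        print("cadence-frontend is reachable")
except OSError as exc:
    print(f"cadence-frontend is NOT reachable: {exc}")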

Dramatiq worker getting killed every so often

I have started a dramatiq worker to run some tasks. After a point it just gets stuck, and after some time it throws the error mentioned below.
[MainThread] [dramatiq.MainProcess] [CRITICAL] Worker with PID 53 exited unexpectedly (code -9). Shutting down...
What could be the potential reason for this to occur? Are system resources a constraint?
This queuing task runs inside a Kubernetes pod.
Please check the kernel logs (/var/log/kern.log and /var/log/kern.log.1).
The worker might be getting killed by the OOM killer (out of memory).
To resolve this, try increasing the memory limit if you are running in a Docker container or a pod.
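To confirm the OOM suspicion from inside the pod, you can compare the container's memory usage against its limit. A minimal sketch, assuming cgroup v1 paths (on cgroup v2 nodes the equivalent files are /sys/fs/cgroup/memory.max and /sys/fs/cgroup/memory.current):

def read_int(path):
    # Both files contain a single integer number of bytes.
    with open(path) as f:
        return int(f.read().strip())

limit = read_int("/sys/fs/cgroup/memory/memory.limit_in_bytes")
usage = read_int("/sys/fs/cgroup/memory/memory.usage_in_bytes")
print(f"using {usage / 2**20:.0f} MiB of a {limit / 2**20:.0f} MiB limit "
      f"({100 * usage / limit:.1f}%)")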

Getting error "no such device or address" on Kubernetes pods

I have some .NET Core applications running as microservices in GKE (Google Kubernetes Engine).
Usually everything works fine, but sometimes, if my microservice isn't in use, something happens and my application shuts down (the same behavior as Ctrl+C in a terminal).
I know this is Kubernetes behavior, but if I send a request to an application that is not running, my first request returns the error "No such device or address" or a timeout error.
I will post some logs and setups:
The key to what's happening is this logged error:
TNS: Connect timeout occured ---> OracleInternal.Network....
Since your application is not being used, the Oracle database just shuts down its idle connection. To solve this problem, you can do a few things:
Handle the disconnection inside your application and simply reconnect (a retry sketch follows this list).
Define a livenessProbe to restart the pod automatically once the application is down.
Make your application do something with the connection from time to time; this can be done with a probe too.
Configure your Oracle database not to close idle connections.
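For the first option, the reconnect logic is easy to sketch. The helper below is a generic Python illustration (the application in the question is .NET, but the pattern carries over unchanged); the oracledb.connect call mentioned in the docstring refers to the python-oracledb package and is only an example of a connection factory.

import time

def connect_with_retry(connect_fn, attempts=5, delay_seconds=2):
    """Call connect_fn until it succeeds or the attempts are exhausted.

    connect_fn is any zero-argument callable that opens and returns a connection,
    e.g. lambda: oracledb.connect(user=..., password=..., dsn=...).
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect_fn()
        except Exception as exc:          # e.g. a TNS connect timeout on a dropped idle connection
            last_error = exc
            print(f"connect attempt {attempt} failed: {exc}; retrying in {delay_seconds}s")
            time.sleep(delay_seconds)
            delay_seconds *= 2            # simple exponential backoff
    raise last_error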

Mesos-master: Shutdown failed on fd=25: Transport endpoint is not connected [107]

When I run 3 mesos-masters with QUORUM=2, they fail one minute after being elected as the leader, giving errors:
E1015 11:50:35.539562 19150 socket.hpp:174] Shutdown failed on fd=25: Transport endpoint is not connected [107]
E1015 11:50:35.539897 19150 socket.hpp:174] Shutdown failed on fd=24: Transport endpoint is not connected [107]
They keep electing one another in a loop, consistently failing and re-electing.
If I set QUORUM=1, everything works well. What could be the reason for this?
One problem was that the AWS firewall was blocking access to the servers' public IPs, while ZooKeeper was broadcasting the public IP (set in advertise_ip), so nothing could connect to anything else. Slaves also couldn't connect to the masters, failing with the same error.
When I set the local IP in advertise_ip (so that ZooKeeper broadcast local IPs), the masters could communicate and QUORUM=2 worked. When I removed the firewall rule, the slaves could connect to the masters as well.
We had a similar problem yesterday: Marathon was behaving a little strangely because some applications were not being deployed. The odd part was that the application would come up but its health check never turned green, so nixy wasn't updating nginx.
After a lot of investigation we came to this very same error:
E0718 18:51:05.836688 5049 socket.hpp:107] Shutdown failed on fd=46: Transport endpoint is not connected [107]
In the end we discovered that the problem was in the election: even though our QUORUM=1 (we have 2 masters), it somehow got lost and one master wasn't communicating with the other.
To solve this, we triggered a new election using the Marathon API's DELETE /v2/leader method (a sketch of the call follows), and everything worked fine after that.
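For completeness, the API call is a single DELETE request. A minimal sketch in Python, assuming the requests package; the Marathon URL is a placeholder, and if your Marathon needs authentication you would add it to the request.

import requests

MARATHON_URL = "http://marathon.example.com:8080"   # placeholder; point at your current Marathon leader

# Asking the current leader to abdicate triggers a fresh leader election.
resp = requests.delete(f"{MARATHON_URL}/v2/leader", timeout=10)
resp.raise_for_status()
print(resp.text)   # Marathon replies with a short confirmation message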
We had the same problem, with the mesos-master log flooded with messages like:
mesos-master[27499]: E0616 14:29:39.310302 27523 socket.hpp:174] Shutdown failed on fd=67: Transport endpoint is not connected [107]
It turned out to be the load balancer's health check against /stats.json.