FabricGateway.exe goes into a boot loop after a server reboot - azure-service-fabric

I have an on-prem 3-node Service Fabric cluster running 8.2.1571.9590. It has been running for months without any problems.
The cluster nodes were rebooted overnight as part of operating system patching, and the cluster will no longer establish connections.
If I run Connect-ServiceFabricCluster -Verbose, I get the timeout error
System.Fabric.FabricTransientException: Could not ping any of the provided Service Fabric gateway endpoints.
Looking at the running processes, I can see that all the expected processes start and are stable, with the exception of FabricGateway.exe, which goes into a boot loop.
I have confirmed that:
I can do a TCP-IP Ping between the nodes in the cluster
I can do a PowerShell remote session between the nodes in the cluster
No cluster certs have expired.
Any suggestions as to how to debug this issue?

Actual Problem
On checking the Windows event logs (Admin Tools > Event Viewer > Application & Service Logs > Microsoft Service Fabric > Admin), I could see errors related to the startup of the FabricGateway process. The errors and warnings come in repeated sets, in the following basic order:
CreateSecurityDescriptor: failed to convert mydomain\admin-old to SID: NotFound
failed to set security settings to { provider=Negotiate protection=EncryptAndSign remoteSpn='' remote='mydomain\admin-old, mydomain\admin-new, mydomain\sftestuser, ServiceFabricAdministrators, ServiceFabricAllowedUsers' isClientRoleInEffect=true adminClientIdentities='mydomain\admin-old, mydomain\admin-new, ServiceFabricAdministrators' claimBasedClientAuthEnabled=false }: NotFound
Failed to initialize external channel error : NotFound
EntreeService proxy open failed : NotFound
FabricGateway failed with error code = S_OK
client-sfweb1:19000/[::1]:19000: error = 2147943625, failureCount=9082. This is conclusive that there is no listener. Connection failure is expected if listener was never started, or listener / its process was stopped before / during connecting. Filter by (type~Transport.St && ~"(?i)sfweb1:19000") on the other node to get listener lifecycle, or (type~Transport.GlobalTransportData) for all listeners
Using Windows Task Manager (or a similar tool), you can see the FabricGateway.exe process starting and terminating every few seconds.
The net effect of this was the Service Fabric cluster communication could not be established.
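The numeric error in the transport message can be decoded to confirm what the log text says. As a minimal sketch (the helper name decode_hresult is made up for illustration, and Python is used only for the arithmetic), the value 2147943625 splits into the standard HRESULT fields like this:

```python
def decode_hresult(value: int) -> dict:
    """Split a 32-bit HRESULT into its standard fields."""
    value &= 0xFFFFFFFF  # treat the input as an unsigned 32-bit number
    return {
        "hex": f"0x{value:08X}",
        "failure": bool(value >> 31),        # top bit set means failure
        "facility": (value >> 16) & 0x7FF,   # 11-bit facility field
        "code": value & 0xFFFF,              # low 16 bits: the error code
    }


# The error value copied from the log line above.
print(decode_hresult(2147943625))
```

The value is 0x800704C9. Facility 7 is FACILITY_WIN32 and Win32 error 1225 is ERROR_CONNECTION_REFUSED, which agrees with the log's statement that there is no listener on port 19000.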
Solution
The problem was that the domain account mydomain\admin-old (an old historic account, not used for a long period) had been deleted from Active Directory, so no SID could be found for it. This failure was causing the loop, even though the other admin accounts were valid.
The fix was to remove this deleted ID from each cluster node's currently active settings.xml file. The process I used was:
RDP onto a cluster node VM
Stop the Service Fabric service
Find the current Service Fabric cluster configuration, e.g. the newest folder of the form D:\SvcFab\VM0\Fabric\Fabric.Config.4.123456
Edit settings.xml and remove the deleted account mydomain\admin-old from the AdminClientIdentities block, so I ended up with
<Section Name="Security">
<Parameter Name="AdminClientIdentities" Value="mydomain\admin-new" />
...
Once the file is saved, restart the Service Fabric service; it should start as normal. Remember, it will take a minute or two to start up.
Repeat the process on the other nodes in the cluster.
Once completed, the cluster starts and operates as expected.
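If the same edit has to be repeated on several nodes, the text change itself can be scripted. The following is only a sketch under stated assumptions: remove_identity is a hypothetical helper, the namespace URI is assumed to be the one the Service Fabric settings file declares, and ClientIdentities is included on the assumption that the deleted account may be listed there as well as in AdminClientIdentities. It works on the XML text only; stopping and starting the service, and locating the right Fabric.Config folder, are still the manual steps described above.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace declared by the settings.xml root element.
FABRIC_NS = "http://schemas.microsoft.com/2011/01/fabric"


def _local(tag: str) -> str:
    """Return the tag name without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def remove_identity(settings_xml: str, identity: str,
                    parameter_names=("AdminClientIdentities", "ClientIdentities")) -> str:
    """Drop one account from the comma-separated identity lists in the Security section."""
    ET.register_namespace("", FABRIC_NS)  # keep the default namespace unprefixed on output
    root = ET.fromstring(settings_xml)
    for section in root.iter():
        if _local(section.tag) != "Section" or section.get("Name") != "Security":
            continue
        for param in section:
            if _local(param.tag) != "Parameter" or param.get("Name") not in parameter_names:
                continue
            entries = [v.strip() for v in param.get("Value", "").split(",")]
            # Windows account names are case-insensitive, so compare in lower case.
            kept = [v for v in entries if v and v.lower() != identity.lower()]
            param.set("Value", ",".join(kept))
    return ET.tostring(root, encoding="unicode")
```

ElementTree does not preserve XML comments or the XML declaration, so treat the output as something to review against a backup copy of the original file rather than a drop-in replacement.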

Related

Disconnect service fabric cluster connection

I know that we can connect to a Service Fabric cluster using Connect-ServiceFabricCluster, as described on Microsoft Learn, and this works flawlessly.
I use this in a script, and it prints the following every time it connects to Service Fabric again:
WARNING: Cluster connection with the same name already existed, the old connection will be deleted
So, is there a way to safely disconnect from Service Fabric before executing the next steps or closing, other than letting the connection time out?
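As far as I know, there is no dedicated disconnect cmdlet. Note that Remove-ServiceFabricCluster is not one: it removes a standalone cluster deployment rather than closing the connection, so do not use it for this.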
WARNING: Cluster connection with the same name already existed, the old connection will be deleted
The warning indicates that you are connecting to a cluster you are already connected to.
The warning itself says that the old connection will be deleted and a new one created.
AFAIK, you can continue without explicitly disconnecting.
Reference taken from MSDoc.

Getting error no such device or address on kubernetes pods

I have some .NET Core applications running as microservices in GKE (Google Kubernetes Engine).
Usually everything works fine, but sometimes, if a microservice isn't in use, something causes the application to shut down (the same behavior as CTRL + C in a terminal).
I know that this is Kubernetes behavior, but if I send a request to an application that is not running, my first request returns the error "No such Device or Address" or a timeout error.
I will post some logs and setups:
The key to what's happening is this logged error:
TNS: Connect timeout occured ---> OracleInternal.Network....
Since your application is not being used, the Oracle database simply closes its idle connection. To solve this problem, you have a few options:
Handle the disconnection inside your application to just reconnect.
Define a livenessProbe to restart the pod automatically once the application is down.
Make your application do something with the connection from time to time -> this can be done with a probe too.
Configure your Oracle database not to close idle connections.
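The first option, reconnecting inside the application, follows the same pattern in any language. The sketch below uses Python purely for illustration (the application in the question is .NET Core): run_with_reconnect, connect and operation are hypothetical names, and the exception types to retry on would need to be swapped for whatever the real database driver raises.

```python
import time


def run_with_reconnect(connect, operation, retries=3, delay_seconds=1.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Run operation(connection); on a connection error, reconnect and try again."""
    connection = connect()
    for attempt in range(retries + 1):
        try:
            return operation(connection)
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: surface the original error
            sleep(delay_seconds)
            connection = connect()  # discard the stale connection and open a fresh one
```

A caller would pass its own connection factory and query function, so the first request after an idle period pays for one reconnect instead of failing outright.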

mongodb mms monitoring agent does not find group members

I have installed the latest mongodb mms agent (6.5.0.456) on ubuntu 16.04 and initialised the replicaset. Hence I am running a single node replicaset with the monitoring agent enabled. The agent works fine, however it does not seem to actually find the replicaset member:
[2018/05/26 18:30:30.222] [agent.info] [components/agent.go:Iterate:170] Received new configuration: Primary agent, Assigned 0 out of 0 plus 0 chunk monitor(s)
[2018/05/26 18:30:30.222] [agent.info] [components/agent.go:Iterate:182] Nothing to do. Either the server detected the possibility of another monitoring agent running, or no Hosts are configured on the Group.
[2018/05/26 18:30:30.222] [agent.info] [components/agent.go:Run:199] Done. Sleeping for 55s...
[2018/05/26 18:30:30.222] [discovery.monitor.info] [components/discovery.go:discover:746] Performing discovery with 0 hosts
[2018/05/26 18:30:30.222] [discovery.monitor.info] [components/discovery.go:discover:803] Received discovery responses from 0/0 requests after 891ns
I can see two processes for monitor agents:
/bin/sh -c /usr/bin/mongodb-mms-monitoring-agent -conf /etc/mongodb-mms/monitoring-agent.config >> /var/log/mongodb-mms/monitoring-agent.log 2>&1
/usr/bin/mongodb-mms-monitoring-agent -conf /etc/mongodb-mms/monitoring-agent.config
However if I terminate one, it also tears down the other, so I do not think that is the problem.
So, question is what is the Group that the agent is referring to. Where is that configured? Or how do I find out which Group the agent refers to and how do I check if the group is configured correctly.
The rs.config() looks fine, with one replicaset member, which has a host field that also looks fine. I can use that value to connect to the instance using the mongo command. No auth is configured.
EDIT
It kind of looks that the cloud manager now needs to be configured with the seed host. Then it starts to discover all the other nodes in the replicaset. This seems to be different to pre-cloud-manager days, where the agent was able to track the rs - if I remember correctly... Probably there still is a way to get this done easier, so I am leaving this question open for now...
> So, question is what is the Group that the agent is referring to. Where is that configured? Or how do I find out which Group the agent refers to and how do I check if the group is configured correctly.
Configuration values for the Cloud Manager agent (such as mmsGroupId and mmsApiKey) are set in the config file, which is /etc/mongodb-mms/monitoring-agent.config by default. The agent needs this information in order to communicate with the Cloud Manager servers.
For more details, see Install or Update the Monitoring Agent and Monitoring Agent Configuration in the Cloud Manager documentation.
> It kind of looks that the cloud manager now needs to be configured with the seed host. Then it starts to discover all the other nodes in the replicaset.
Unless a MongoDB process is already managed by Cloud Manager automation, I believe it has always been the case that you need to add an existing MongoDB process to monitoring to start the process of initial topology discovery. Once a deployment is monitored, any changes in deployment membership should automatically be discovered by the Cloud Manager agent.
Production deployments should have authentication and access control enabled, so in addition to adding a seed hostname and port via the Cloud Manager UI you usually need to provide appropriate credentials.

Is it possible to control Service Fabric hosted service restart behaviour?

I can't find much documentation on the action that Service Fabric takes when a service it is hosting fails. I have performed some experimentation (using a stateless service in a local cluster), the results of which are below. My question is: is it possible to change this behaviour?
There are two distinct scenarios that I tested.
(1) An exception thrown from the RunAsync() method.
The hosted service is restarted immediately on another cluster node. If no other node is available then it is restarted on the same node. There does not appear to be any limit to the number of times the restart will be attempted or any kind of back-off in terms of the interval between attempts.
(2) The hosted service fails to start (e.g. an exception is thrown before RunAsync() is called).
The hosted service is restarted on the same node. In my test environment there appears to be a fixed interval between restart attempts (15 seconds) but no limit to the number of attempts.
I can see in the cluster configuration that there are some parameters in the Hosting section that look like they might be relevant (ActivationMaxRetryInterval, ActivationRetryBackoffInterval, ActivationMaxFailureCount) and I am guessing that these cover scenario (2) above (assuming that Activation == service start). These affect the entire cluster by the looks of it.
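One possible reading of those three Hosting parameters, based on my understanding of their documented descriptions and not verified against the runtime, is that the delay grows linearly with the number of consecutive activation failures, is capped by ActivationMaxRetryInterval, and retries stop after ActivationMaxFailureCount. The sketch below only illustrates that reading; the function names and the sample values are made up and are not defaults.

```python
def activation_retry_delay(failure_count, backoff_interval_s, max_retry_interval_s):
    """Assumed model: delay = consecutive failures x backoff interval, capped at the maximum."""
    return min(failure_count * backoff_interval_s, max_retry_interval_s)


def retry_schedule(backoff_interval_s, max_retry_interval_s, max_failure_count):
    """Delays before each retry until the assumed failure-count limit is reached."""
    return [
        activation_retry_delay(n, backoff_interval_s, max_retry_interval_s)
        for n in range(1, max_failure_count + 1)
    ]


# Illustrative values only: 5 s backoff, 20 s cap, 6 attempts.
print(retry_schedule(5, 20, 6))
```

Under this model the schedule for the sample values is 5, 10, 15, 20, 20, 20 seconds. Whether this applies to scenario (2) as observed would need to be checked on a real cluster.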

Removing service fabric application fails

I have deployed an application to a 5 node standalone cluster. Deployment succeeded, but the application did not start because of a bug in the application.
I tried removing the application from the cluster using the Service Fabric Explorer but this fails.
The health State of the application is “Error” and the status is “Deleting”
The application has 9 services. 6 services show a Health State “Unknown” with a question mark and a Status “Unknown”. 3 services show a health state “Ok” but with a Status “Deleting”.
I have also tried to remove it using powershell:
Remove-ServiceFabricApplication -ApplicationName fabric:/appname -Force -ForceRemove
The result was an "Operation timed out" error.
I also tried the script below that I found in some other post.
Connect-ServiceFabricCluster -ConnectionEndpoint localhost:19000
$nodes = Get-ServiceFabricNode
foreach($node in $nodes)
{
    $replicas = Get-ServiceFabricDeployedReplica -NodeName $node.NodeName - ApplicationName "fabric:/MyApp"
    foreach ($replica in $replicas)
    {
        Remove-ServiceFabricReplica -ForceRemove -NodeName $node.NodeName -PartitionId $replica.Partitionid -ReplicaOrInstanceId $replica.ReplicaOrInstanceId
    }
}
Also no result, the script did not find any replica to remove.
At the same time we started removing the application one of the system services also changed state.
The fabric:/System/NamingService service shows a “Warning” health state.
This is on partition 00000000-0000-0000-0000-000000001002.
The primary replica shows:
Unhealthy event: SourceId='System.NamingService', Property='Duration_PrimaryRecovery', HealthState='Warning', ConsiderWarningAsError=false.
The PrimaryRecovery started at 2016-10-06 07:55:21.252 is taking longer than 30:00.000.
I also restarted every node (one at a time) with no result.
How can I force removal of the application without recreating the cluster? Recreating it is not an option for a production environment.
Yeah, this can happen if your code never exits RunAsync, or the Open/Close of your ICommunicationListener.
Some background:
Your service has a lifecycle that is driven by Service Fabric. A small component in your service - you know it as FabricRuntime - drives this. For stateless service instances, it's a simple open/close lifecycle. For stateful services, it's a bit more complex. A stateful service replica opens and closes, but also changes role, between primary, secondary, and none. Lifecycle changes are initiated by Service Fabric and show up as a method call or cancellation token trigger in your code. For example, when a replica is switched to primary, we call your RunAsync method. When it switches from primary to something else, or needs to shut down, the cancellation token is triggered. Either way, the system waits for you to finish your work.
When you go delete a service, we tell your service to change role and close. If your code doesn't respond, then it will get stuck in that state.
To get out of that state, you can run Remove-ServiceFabricReplica -ForceRemove. This essentially drops the replica from the system - as far as Service Fabric is concerned, the replica is gone. But your process is still running. So you have to go in and kill the process too.
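The "system waits for you to finish" contract is easy to break with a work loop that never checks for cancellation. As an analogy only (this is Python asyncio, not the Service Fabric API; run_async, demo and the asyncio.Event standing in for the cancellation token are assumptions for illustration), a loop that cooperates with shutdown looks roughly like this:

```python
import asyncio


async def run_async(stop: asyncio.Event, poll_seconds: float = 0.01) -> int:
    """Work loop that re-checks the stop signal on every iteration and then returns."""
    iterations = 0
    while not stop.is_set():
        iterations += 1  # placeholder for one unit of real work
        try:
            # Wait for either the stop signal or the next work interval.
            await asyncio.wait_for(stop.wait(), timeout=poll_seconds)
        except asyncio.TimeoutError:
            pass  # no stop request yet, so do another unit of work
    return iterations


async def demo() -> int:
    stop = asyncio.Event()
    task = asyncio.create_task(run_async(stop))
    await asyncio.sleep(0.05)  # let the loop run for a moment
    stop.set()                 # the host asks the service to shut down
    # A cooperative loop finishes promptly; a loop that ignored the signal would time out here.
    return await asyncio.wait_for(task, timeout=1.0)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A loop that never looks at the stop signal would leave the host waiting, which is the stuck "Deleting" state described in the question.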
The error in the script is the stray space in '- ApplicationName'; it should be '-ApplicationName'.
After correcting the parameter, the script DID remove the hosed-up pieces, which let me correct and redeploy the application to the cluster.