ImageStoreService IdleSecondary replica - azure-service-fabric

I have a Service Fabric cluster with 5 nodes deployed to Azure. A few days ago one of the nodes in the cluster started failing to start the ImageStoreService, and I'm seeing the following error:
Unhealthy event: SourceId='System.RAP', Property='IReplicator.BuildReplica(131347642528671415)Duration', HealthState='Warning', ConsiderWarningAsError=false.
The api IReplicator.BuildReplica(131347642528671415) on node _backend_0 is stuck. Start Time (UTC): 2018-07-19 12:47:20.764.
I have tried rebooting the virtual machine but it still remains in this state. It looks like it's currently in the InBuild replica state but I don't know what to look for or how to resolve it.
Edit
I logged into the virtual machine that is failing and looked at the Event Log for Service Fabric. I see tons of warning messages in the logs that look like this:
CopyFile: no new token is found. current token count: 2
and a lot that have this:
ImpersonateAndCopyFile for SourcePath:\10.10.10.1\StoreShare__node_3\131347742554673412\ApplicationName\ServiceName\Code\131347741584517844_1047972020224_30.File.dll, DestinationPath:C:\ProgramData\Microsoft\SF_App__FabricSystem_App4294967295\work\Store\131347742554673413\ApplicationName\ServiceName\Code\131347741584517844_1047972020224_30.File.dll failed: E_ACCESSDENIED. Have tried all access tokens.
Seems like it is failing to either read or write the files. I'm not sure why.

Related

FabricGateway.exe goes into a boot loop after a server reboot

I have an on-prem, 3-node Service Fabric cluster running 8.2.1571.9590. It has been running for months without any problems.
The cluster nodes were rebooted overnight as part of operating system patching, and the cluster will no longer establish connections.
If I run Connect-ServiceFabricCluster -Verbose, I get a timeout error:
System.Fabric.FabricTransientException: Could not ping any of the provided Service Fabric gateway endpoints.
Looking at the processes running I can see all the expected processes start and are stable with the exception of FabricGateway.exe which goes into a boot loop cycle.
I have confirmed that:
I can do a TCP/IP ping between the nodes in the cluster
I can do a PowerShell remote session between the nodes in the cluster
No cluster certs have expired.
Any suggestions as to how to debug this issue?
Actual Problem
On checking the Windows event logs (Admin Tools > Event Viewer > Application & Service Logs > Microsoft Service Fabric > Admin), I could see errors related to the startup of the FabricGateway process. The errors and warnings come in repeated sets, in the following basic order:
CreateSecurityDescriptor: failed to convert mydomain\admin-old to SID: NotFound
failed to set security settings to { provider=Negotiate protection=EncryptAndSign remoteSpn='' remote='mydomain\admin-old, mydomain\admin-new, mydomain\sftestuser, ServiceFabricAdministrators, ServiceFabricAllowedUsers' isClientRoleInEffect=true adminClientIdentities='mydomain\admin-old, mydomain\admin-new, ServiceFabricAdministrators' claimBasedClientAuthEnabled=false }: NotFound
Failed to initialize external channel error : NotFound
EntreeService proxy open failed : NotFound
FabricGateway failed with error code = S_OK
client-sfweb1:19000/[::1]:19000: error = 2147943625, failureCount=9082. This is conclusive that there is no listener. Connection failure is expected if listener was never started, or listener / its process was stopped before / during connecting. Filter by (type~Transport.St && ~"(?i)sfweb1:19000") on the other node to get listener lifecycle, or (type~Transport.GlobalTransportData) for all listeners
Using Windows Task Manager (or a similar tool), you could see the FabricGateway.exe process starting and terminating every few seconds.
The net effect was that Service Fabric cluster communication could not be established.
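When the Admin log is full of these repeated sets, it can help to pull out just the account names that failed SID conversion and to decode the numeric transport error. The following is a minimal sketch, assuming the log has been exported to plain text lines in the format quoted above; the function names are hypothetical. The value 2147943625 decodes to 0x800704C9, i.e. a Win32-facility HRESULT wrapping Win32 error 1225 (ERROR_CONNECTION_REFUSED), which is consistent with the "no listener" message.

```python
import re

# Matches the "failed to convert <account> to SID" warning quoted above.
SID_FAILURE = re.compile(
    r"CreateSecurityDescriptor: failed to convert (?P<account>.+?) to SID: (?P<error>\w+)"
)


def unresolved_accounts(log_lines):
    """Return the distinct accounts that could not be converted to a SID, in first-seen order."""
    seen = []
    for line in log_lines:
        match = SID_FAILURE.search(line)
        if match and match.group("account") not in seen:
            seen.append(match.group("account"))
    return seen


def decode_hresult(value):
    """Split a 32-bit HRESULT into (hex string, facility, code)."""
    value &= 0xFFFFFFFF
    facility = (value >> 16) & 0x7FF  # 11-bit facility field
    code = value & 0xFFFF             # low 16 bits carry the error code
    return f"0x{value:08X}", facility, code


if __name__ == "__main__":
    sample = [
        r"CreateSecurityDescriptor: failed to convert mydomain\admin-old to SID: NotFound",
        "FabricGateway failed with error code = S_OK",
    ]
    print(unresolved_accounts(sample))
    print(decode_hresult(2147943625))  # ('0x800704C9', 7, 1225)
```

Any account this reports is a candidate for a deleted or renamed identity and is worth checking in Active Directory.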
Solution
The problem was that the domain account mydomain\admin-old (an old historic account, not used for a long period) had been deleted from Active Directory, so no SID for the account could be found. This failure was causing the loop, even though the other admin accounts were valid.
The fix was to remove this deleted ID from each cluster node's currently active settings.xml file. The process I used was:
RDP onto a cluster node VM
Stop the service fabric service
Find the current Service Fabric cluster configuration, e.g. the newest folder of the form D:\SvcFab\VM0\Fabric\Fabric.Config.4.123456
Edit the settings.xml and remove the deleted account mydomain\admin-old from the AdminClientIdentities block, so I ended up with:
<Section Name="Security">
<Parameter Name="AdminClientIdentities" Value="mydomain\admin-new" />
...
Once the file is saved, restart the Service Fabric service; it should start as normal. Remember, it will take a minute or two to start up.
Repeat the process on the other nodes in the cluster.
Once completed, the cluster starts and operates as expected.
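If the same edit has to be repeated on several nodes, the manual step can be sketched as a small text transformation. This is only an illustration, assuming the identities are stored as a comma-separated list in a Value="..." attribute as in the fragment above; the function names are hypothetical, and it works on the file's text rather than on a parsed XML tree so that the rest of the file is left byte-for-byte unchanged. Back up settings.xml and stop the Service Fabric service before applying anything like this.

```python
import re

# Any Value="..." attribute; only lists that contain the stale identity are rewritten.
VALUE_ATTR = re.compile(r'Value="([^"]*)"')


def remove_identity(value, stale):
    """Drop one identity from a comma-separated list (Windows account names are case-insensitive)."""
    target = stale.strip().lower()
    kept = [
        item.strip()
        for item in value.split(",")
        if item.strip() and item.strip().lower() != target
    ]
    return ", ".join(kept)


def strip_identity_from_settings(xml_text, stale):
    """Return xml_text with the stale identity removed from every identity list that contains it."""
    target = stale.strip().lower()

    def fix(match):
        value = match.group(1)
        identities = [item.strip().lower() for item in value.split(",")]
        if target not in identities:
            return match.group(0)  # leave unrelated parameters untouched
        return 'Value="{}"'.format(remove_identity(value, stale))

    return VALUE_ATTR.sub(fix, xml_text)


if __name__ == "__main__":
    before = '<Parameter Name="AdminClientIdentities" Value="mydomain\\admin-old, mydomain\\admin-new" />'
    print(strip_identity_from_settings(before, "mydomain\\admin-old"))
```

Rewriting every list that contains the account, rather than only AdminClientIdentities, is a design choice here: the event log above shows the deleted account in more than one identity list.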

AWS EC2 free tier instance is automatically stopping frequently

I am using Ubuntu 18.04 on an AWS EC2 free tier instance, running websites on an Apache server and Node.js with a PostgreSQL database. All deployments complete without problems and the web apps work fine without any exceptions or errors.
However, I am facing an annoying issue: the instance stops frequently without any exception or error logs. After I reboot the instance everything works fine again, but after some time it stops on its own, either within a few hours of the reboot or 1-2 days later.
I created another free tier instance under a separate account and it has the same issue. I cannot find any logs or troubleshooting options to get rid of this problem.
I would like to know how this can be troubleshot, or where I can find logs of any errors or exceptions for this instance.
The suggestions given by AWS under "Instance Status Check" (attached below) are not a practical solution to apply every time.
Something with your VM itself is causing its health checks to fail.
Have a look at the syslogs and your application logs. Also take a look at the CloudWatch metrics to see whether any of them change dramatically close to the time the instance stops.
You can also add a CloudWatch alarm with a recovery action to automatically reboot if there’s an issue with your VM.
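As a rough illustration of such an alarm, the sketch below builds the arguments for CloudWatch's put_metric_alarm call. It assumes the boto3 SDK is installed and credentials are configured; the helper names, the alarm name, and the period and evaluation counts are hypothetical choices, not recommended values. The reboot action is paired with the instance status check and the recover action with the system status check.

```python
def recovery_alarm_params(instance_id, region, action="reboot"):
    """Build keyword arguments for CloudWatch put_metric_alarm (action: 'reboot' or 'recover')."""
    if action not in ("reboot", "recover"):
        raise ValueError("action must be 'reboot' or 'recover'")
    # System-level failures pair with recover; instance-level failures pair with reboot.
    metric = "StatusCheckFailed_System" if action == "recover" else "StatusCheckFailed_Instance"
    return {
        "AlarmName": f"{action}-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": metric,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,            # seconds; illustrative value
        "EvaluationPeriods": 3,  # illustrative value
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:{action}"],
    }


def create_alarm(instance_id, region, action="reboot"):
    """Create the alarm; requires boto3 and valid AWS credentials."""
    import boto3  # third-party SDK, imported lazily so the builder works without it

    client = boto3.client("cloudwatch", region_name=region)
    client.put_metric_alarm(**recovery_alarm_params(instance_id, region, action))
```

An automatic reboot only masks the symptom, so it is still worth checking the logs and metrics for the underlying cause.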

Compute Engine unhealthy instance down 50% of the time

I started using Google Cloud about 3 days ago, so I am completely new to it.
I have 4 pods deployed to Google Kubernetes Engine:
Frontend: react app,
Redis,
Backend: made up of 2 containers, a Node.js server and a cloudsql-proxy,
Nginx-ingress-controller
I also have a SQL instance running for my PostgreSQL database, hence the cloudsql-proxy container.
This setup works well 50% of the time, but every now and then all the pods crash and/or the containers are recreated.
I tried to check all the relevant logs, but I really don't know which ones are actually relevant. There is one thing I found that correlates with my issue: I have 2 VM instances running, and one of them might be the faulty one:
When I hover over the loading spinner, it says "Instance is being verified", and it seems to be in this state 80% of the time. When it is not, there is a yellow warning beside the name of the instance saying "The resource is not ready".
Here is the CPU usage of the instance (the trend is the same for all the hardware). I checked the logs of my frontend and backend containers; here are the last log entries that correspond to a CPU drop:
2019-03-13 01:45:23.533 CET - 🚀 Server ready
2019-03-13 01:45:33.477 CET - 2019/03/13 00:45:33 Client closed local connection on 127.0.0.1:5432
2019-03-13 01:54:07.270 CET - yarn run v1.10.1
As you can see here, all the pods are being recreated...
I think that it might come from the fact that the faulty instance is unhealthy:
Instance gke-*****-production-default-pool-0de6d459-qlxk is unhealthy for ...
...the health check is proceeding and recreating/restarting the instance again and again. Tell me if I am wrong.
So, how can I discover what is making this instance unhealthy?

Why would running a container on GCE get stuck Metadata request unsuccessful forbidden (403)

I'm trying to run a container in a custom VM on Google Compute Engine. This is to perform a heavy ETL process, so I need a large machine but only for a couple of hours a month. I have two versions of my container with small startup changes. Both versions were built and pushed to the same Google Container Registry by the same computer using the same Google login. The older one works fine, but the newer one fails by getting stuck in an endless stream of the following error:
E0927 09:10:13 7f5be3fff700 api_server.cc:184 Metadata request unsuccessful: Server responded with 'Forbidden' (403): Transport endpoint is not connected
Can anyone tell me exactly what's going on here? Can anyone please explain why one of my images doesn't have this problem (well, it gives a few of these messages but gets past them) while the other does (thousands of these messages, running for over 24 hours before I killed it)?
If I ssh in to a GCE instance then both versions of the container pull and run just fine. I'm suspecting the INTEGRITY_RULE checking from the logs but I know nothing about how that works.
MORE INFO: this is down to "restart policy: never". Even a simple Centos:7 container that says "hello world" deployed from the console triggers this if the restart policy is never. At least in the short term I can fix this in the entrypoint script as the instance will be destroyed when the monitor realises that the process has finished
I suggest you try creating a 3rd container that's focused on the metadata service functionality to isolate the issue. It may be that there's a timing difference between the 2 containers that's not being overcome.
Make sure you can ‘curl’ the metadata service from the VM and that the request to the metadata service is using the VM's service account.
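For reference, the same check can be done from Python instead of curl. This is a minimal sketch, assuming it runs on the VM itself (or in a container on it) and that the instance has a default service account attached; the function names are hypothetical. The metadata server only answers requests that carry the Metadata-Flavor: Google header.

```python
import urllib.request

# Metadata server path for the email of the instance's default service account.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/email"
)


def build_metadata_request(url=METADATA_URL):
    """Build a request carrying the header the metadata server requires."""
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def fetch_service_account_email(timeout=5):
    """Return the service account email; raises urllib.error.HTTPError on a 403 or other failure."""
    with urllib.request.urlopen(build_metadata_request(), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


if __name__ == "__main__":
    # Only meaningful on a Compute Engine VM; elsewhere the hostname will not resolve.
    print(fetch_service_account_email())
```

Running this both on the host and inside each container version would show whether the 403 is specific to one environment.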

Is it possible to control Service Fabric hosted service restart behaviour?

I can't find much documentation on the action that Service Fabric takes when a service it is hosting fails. I have performed some experimentation (using a stateless service in a local cluster), the results of which are below. My question is: is it possible to change this behaviour?
There are two distinct scenarios that I tested.
(1) An exception thrown from the RunAsync() method.
The hosted service is restarted immediately on another cluster node. If no other node is available then it is restarted on the same node. There does not appear to be any limit to the number of times the restart will be attempted or any kind of back-off in terms of the interval between attempts.
(2) The hosted service fails to start (e.g. an exception is thrown before RunAsync() is called).
The hosted service is restarted on the same node. In my test environment there appears to be a fixed interval between restart attempts (15 seconds) but no limit to the number of attempts.
I can see in the cluster configuration that there are some parameters in the Hosting section that look like they might be relevant (ActivationMaxRetryInterval, ActivationRetryBackoffInterval, ActivationMaxFailureCount) and I am guessing that these cover scenario (2) above (assuming that Activation == service start). These affect the entire cluster by the looks of it.
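To make the guess concrete, here is a sketch of one plausible way three such settings could interact: a delay that grows with each consecutive failure, capped at a maximum, for a limited number of attempts. This is an assumption for illustration only, not confirmed Service Fabric behaviour; the function name and the sample values are hypothetical.

```python
def retry_schedule(backoff_seconds, max_interval_seconds, max_failure_count):
    """Delay before each retry: consecutive failure count times the backoff, capped at the maximum.

    Assumed model only: the arguments stand in for ActivationRetryBackoffInterval,
    ActivationMaxRetryInterval and ActivationMaxFailureCount respectively.
    """
    return [
        min(failure * backoff_seconds, max_interval_seconds)
        for failure in range(1, max_failure_count + 1)
    ]


if __name__ == "__main__":
    # Illustrative values, not cluster defaults.
    print(retry_schedule(5, 15, 5))  # [5, 10, 15, 15, 15]
```

Comparing a schedule like this against the intervals actually observed in a test cluster would be one way to check whether the guess about scenario (2) holds.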