Disconnect Service Fabric cluster connection - PowerShell

I know that we can connect to the Service Fabric cluster using Connect-ServiceFabricCluster as described on Microsoft Learn, and that works flawlessly.
I use this in a script; every time it reconnects to Service Fabric it prints the following:
WARNING: Cluster connection with the same name already existed, the old connection will be deleted
So, is there a way to safely disconnect from Service Fabric before executing the next steps or closing, other than letting the connection time out?

To disconnect a Service Fabric cluster connection, there is the Remove-ServiceFabricCluster command.
WARNING: Cluster connection with the same name already existed, the old connection will be deleted
The warning indicates that you are trying to connect to a cluster that is already connected.
The warning itself says that the old connection will be deleted and a new one created.
AFAIK, you can continue without disconnecting/removing the connection first.
Reference taken from MS Docs.
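If the goal is simply to silence the repeated warning in a script, one pattern (a sketch, not an official disconnect API) is to reuse the session's existing connection and only reconnect when there isn't a live one. Test-ServiceFabricClusterConnection probes the current connection and throws if none exists; the endpoint below is a placeholder:

# Reconnect only if this session has no working cluster connection.
# The endpoint is a placeholder - substitute your own cluster address.
$endpoint = "mycluster.westus.cloudapp.azure.com:19000"
$connected = $false
try {
    # Throws when the session has no current cluster connection.
    $connected = Test-ServiceFabricClusterConnection
} catch {
    $connected = $false
}
if (-not $connected) {
    Connect-ServiceFabricCluster -ConnectionEndpoint $endpoint | Out-Null
}

Because Connect-ServiceFabricCluster is then only called when needed, the "connection already existed" warning never fires.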

Related

FabricGateway.exe goes into a boot loop after a server reboot

I have an on-prem Service Fabric 3-node cluster running 8.2.1571.9590. This has been running for months without any problems.
The cluster nodes were rebooted overnight, as part of operating system patching, and the cluster will now not establish connections.
If I run Connect-ServiceFabricCluster -Verbose, I get the timeout error
System.Fabric.FabricTransientException: Could not ping any of the provided Service Fabric gateway endpoints.
Looking at the processes running, I can see all the expected processes start and remain stable, with the exception of FabricGateway.exe, which goes into a boot-loop cycle.
I have confirmed that:
I can do a TCP/IP ping between the nodes in the cluster
I can open a PowerShell remote session between the nodes in the cluster
No cluster certificates have expired.
Any suggestions as to how to debug this issue?
Actual Problem
On checking the Windows event logs (Admin Tools > Event Viewer > Applications and Services Logs > Microsoft-Service Fabric > Admin) I could see errors related to the startup of the FabricGateway process. The errors and warnings come in repeated sets, in the following basic order:
CreateSecurityDescriptor: failed to convert mydomain\admin-old to SID: NotFound
failed to set security settings to { provider=Negotiate protection=EncryptAndSign remoteSpn='' remote='mydomain\admin-old, mydomain\admin-new, mydomain\sftestuser, ServiceFabricAdministrators, ServiceFabricAllowedUsers' isClientRoleInEffect=true adminClientIdentities='mydomain\admin-old, mydomain\admin-new, ServiceFabricAdministrators' claimBasedClientAuthEnabled=false }: NotFound
Failed to initialize external channel error : NotFound
EntreeService proxy open failed : NotFound
FabricGateway failed with error code = S_OK
client-sfweb1:19000/[::1]:19000: error = 2147943625, failureCount=9082. This is conclusive that there is no listener. Connection failure is expected if listener was never started, or listener / its process was stopped before / during connecting. Filter by (type~Transport.St && ~"(?i)sfweb1:19000") on the other node to get listener lifecycle, or (type~Transport.GlobalTransportData) for all listeners
Using Windows Task Manager (or a similar tool) you would see the FabricGateway.exe process starting and terminating every few seconds.
The net effect of this was that Service Fabric cluster communication could not be established.
Solution
The problem was that the domain account mydomain\admin-old (an old historic account, not used for a long time) had been deleted in Active Directory, so no SID for the account could be found. This failure was causing the loop, even though the other admin accounts were valid.
The fix was to remove this deleted ID from each cluster node's currently active settings.xml file. The process I used was as follows (a scripted sketch of the same steps appears after them):
RDP onto a cluster node VM
Stop the service fabric service
Find the current Service Fabric cluster configuration, e.g. the newest folder of the form D:\SvcFab\VM0\Fabric\Fabric.Config.4.123456
Edit the settings.xml and remove the deleted account mydomain\admin-old from the AdminClientIdentities block, so I ended up with
<Section Name="Security">
  <Parameter Name="AdminClientIdentities" Value="mydomain\admin-new" />
  ...
Once the file is saved, restart the Service Fabric service; it should start as normal. Remember, it will take a minute or two to start up.
Repeat the process on the other nodes in the cluster.
Once completed, the cluster starts and operates as expected.
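As promised above, here is a rough scripted sketch of the per-node steps. It assumes a standalone cluster where the Windows service is FabricHostSvc and the configuration lives under D:\SvcFab\VM0\Fabric (both taken from the layout above; adjust for your deployment), and that the deleted account is not the last entry in the list:

# Sketch: run on one node at a time. Service name and paths are assumptions.
Stop-Service FabricHostSvc

# Pick the newest Fabric.Config.* folder as the active configuration.
$configDir = Get-ChildItem 'D:\SvcFab\VM0\Fabric' -Directory -Filter 'Fabric.Config.*' |
    Sort-Object LastWriteTime -Descending | Select-Object -First 1
$settings = Join-Path $configDir.FullName 'Settings.xml'

# Back up the file, then strip the deleted account (and its trailing comma)
# from the AdminClientIdentities value. Naive text replace; verify the result.
Copy-Item $settings "$settings.bak"
(Get-Content $settings -Raw) -replace 'mydomain\\admin-old,\s*', '' | Set-Content $settings

Start-Service FabricHostSvc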

Getting error no such device or address on kubernetes pods

I have some .NET Core applications running as microservices in GKE (Google Kubernetes Engine).
Usually everything works fine, but sometimes, if my microservice isn't in use, something happens and my application shuts down (the same behavior as Ctrl+C in a terminal).
I know that this is Kubernetes behavior, but if I send a request to an application that is not running, my first request returns the error "No such device or address" or a timeout error.
I will post some logs and setup below:
The key to what's happening is this logged error:
TNS: Connect timeout occured ---> OracleInternal.Network....
Since your application is not used, the Oracle database simply shuts down its idle connection. To solve this problem, you can do a few things:
Handle the disconnection inside your application and just reconnect.
Define a livenessProbe to restart the pod automatically once the application is down (see the sketch after this list).
Make your application do something with the connection from time to time - this can be done with a probe too.
Configure your Oracle database not to close idle connections.
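Here is a minimal livenessProbe sketch for the pod spec, assuming the service exposes an HTTP health endpoint at /health on port 80 (the container name, image, port and path are all placeholders):

# Fragment of a pod/deployment spec; only the probe settings matter here.
containers:
- name: my-microservice          # placeholder name
  image: gcr.io/my-project/my-microservice:latest   # placeholder image
  ports:
  - containerPort: 80
  livenessProbe:
    httpGet:
      path: /health              # assumed health endpoint
      port: 80
    initialDelaySeconds: 30      # give the app time to start
    periodSeconds: 15            # probe every 15 seconds
    failureThreshold: 3          # restart after 3 consecutive failures

If the probe fails three times in a row, the kubelet restarts the container, which covers the "application shut down while idle" case.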

Google Cloud SQL unable to connect and restart master instance

This morning my application could not connect to my MySQL master instance in Google Cloud SQL. The master instance has no further logs, but the replica instance's log shows that replication could not connect to the master either.
I tried to restart MySQL but, an hour later, it still had not started.
What should I do?
There are several possible reasons for this issue. For instance, your master instance may have failed due to an error while a dump was being created, or the instance may have been under maintenance and now it cannot restart correctly, etc. If that were the case, you would need to get in touch with Google Cloud Platform Support to have your Cloud SQL instance manually restarted.
Alternatively, you can also check the documentation on instance connection issues and on diagnosing connection problems.
If none of this applies to your case, you should consider adding more information to your question, since there could be a problem with the expiration of your SSL server certificate, with the proxy, etc.
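Before contacting support, it can also help to gather basic diagnostics with the Cloud SDK. For example (the instance name is a placeholder):

# List recent operations (failed backups, maintenance, restarts) on the instance.
gcloud sql operations list --instance=my-master-instance --limit=10

# Inspect the instance's current state and settings.
gcloud sql instances describe my-master-instance

The operations list in particular will show whether a backup, maintenance or restart operation failed around the time the instance became unreachable.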

Why would running a container on GCE get stuck Metadata request unsuccessful forbidden (403)

I'm trying to run a container in a custom VM on Google Compute Engine. This is to perform a heavy ETL process, so I need a large machine, but only for a couple of hours a month. I have two versions of my container with small startup changes. Both versions were built and pushed to the same Google Container Registry by the same computer using the same Google login. The older one works fine, but the newer one fails by getting stuck in an endless stream of the following error:
E0927 09:10:13 7f5be3fff700 api_server.cc:184 Metadata request unsuccessful: Server responded with 'Forbidden' (403): Transport endpoint is not connected
Can anyone tell me exactly what's going on here? Can anyone explain why one of my images doesn't have this problem (well, it prints a few of these messages but gets past them) while the other does (thousands of these messages, and it ran for over 24 hours before I killed it)?
If I SSH into a GCE instance, both versions of the container pull and run just fine. I suspect the INTEGRITY_RULE checking from the logs, but I know nothing about how that works.
MORE INFO: this comes down to "restart policy: never". Even a simple centos:7 container that prints "hello world", deployed from the console, triggers this if the restart policy is Never. At least in the short term I can work around this in the entrypoint script, since the instance will be destroyed when the monitor realises that the process has finished.
I suggest you try creating a third container that's focused on the metadata-service functionality, to isolate the issue. It may be that there's a timing difference between the two containers that's not being overcome.
Make sure you can curl the metadata service from the VM and that the request to the metadata service is using the VM's service account.
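For example, a quick check from inside the VM (the Metadata-Flavor header is required; without it the metadata server rejects the request):

# Ask the metadata server which service account the VM is using.
curl -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"

If this returns the expected service account, the metadata service itself is reachable and the problem is more likely in how the container addresses it.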

Force delete of IBM Bluemix Kubernetes Cluster

I tried to set up a new Kubernetes cluster on IBM Bluemix and after a while I received the message that the deploy had failed. To start over I have tried to delete the cluster from the Bluemix interface, with no success. The error messages are not consistent, ranging from elaborate error messages to the most common: 500: internal server error.
The command line does not help either. I expected this to work:
bx cs cluster-rm k8s_demo
But most of the time it leads to an EOF error. Somehow internal connections are an issue, because
bx cs clusters
leads to the error
FAILED
unable to connect to https://us-south.containers.bluemix.net/v1/clusters, please check your Internet Connection
most of the time. Every so often a list including the k8s_demo cluster appears, but being equally persistent with the cluster-rm command has not resulted in the cluster being deleted.
Is there any other way I can try to start over? Apart from setting up another Bluemix account, of course, something I would prefer to avoid.
If this problem is continuing, I would suggest contacting IBM Support. They'll have the tools to troubleshoot what has happened with the cluster provisioning and/or deletion.