Cluster provisioning failed after adding a secondary certificate - azure-service-fabric

I deployed a single node Service Fabric test cluster then followed this instructions https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-security-update-certs-azure to add a secondary certificate. After more than an hour the cluster provisioning fails and stays with the unhealty node with code "PendingClusterUpgradeCannotBeInterrupted".
Here the log.
"code": "ClusterUpgradeFailed",
"message": "Cluster upgrade failed. Reason Code: 'HealthCheck'
"description": "Exception occured while creating an object of type FabricDCA.AzureBlobEtwCsvUploader
(assembly AzureFileUploader, Version=6.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35) for creating consumer AzureBlobServiceFabricEtw. Exception occured while creating an object of type FabricDCA.AzureTableQueryableEventUploader (assembly AzureTableUploader, Version=6.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35) for creating consumer AzureTableServiceFabric EtwQueryable."
Then I deleted and recreated the cluster with primary and secondary certificate and it worked. However if I try to swap the certificates by the portal the cluster hangs with "ClusterUpgradeFailed" (no error description given) and drives to "Cluster provisioning failed".
After serveral attempts this is the situation.
I am not able to add a secondary certificate on an existing cluster
neither by ARM nor by Add-AzureRmServiceFabricClusterCertificate
Powershell command.
Swapping certificates does not work (the only way
is by the portal).
The only way to change a certificate is deleting
and recreating the cluster.
Any ideas?

Related

FabricGateway.exe goes into a boot loop after a server reboot

I have a on prem Service Fabric 3 Node cluster running 8.2.1571.9590. This has been running for months without any problems.
The cluster node were rebooted overnight, as part of operating system patching, and the cluster will now not establish connections.
If I run connect-servicefabriccluster -verbose, I get the timeout error
System.Fabric.FabricTransientException: Could not ping any of the provided Service Fabric gateway endpoints.
Looking at the processes running I can see all the expected processes start and are stable with the exception of FabricGateway.exe which goes into a boot loop cycle.
I have confirmed that
I can do a TCP-IP Ping between the nodes in the cluster
I can do a PowerShell remote session between the nodes in the cluster
No cluster certs have expired.
Any suggestions as to how to debug this issue?
Actual Problem
On checking the Windows event logs Admin Tools > Event Viewer > Application & Service Logs > Microsoft Service Fabric > Admin I could see errors related to the startup of the FabricGateway process. The errors and warnings come in repeated sets with the following basic order
CreateSecurityDescriptor: failed to convert mydomain\admin-old to SID: NotFound
failed to set security settings to { provider=Negotiate protection=EncryptAndSign remoteSpn='' remote='mydomain\admin-old, mydomain\admin-new, mydomain\sftestuser, ServiceFabricAdministrators, ServiceFabricAllowedUsers' isClientRoleInEffect=true adminClientIdentities='mydomain\admin-old, mydomain\admin-new, ServiceFabricAdministrators' claimBasedClientAuthEnabled=false }: NotFound
Failed to initialize external channel error : NotFound
EntreeService proxy open failed : NotFound
FabricGateway failed with error code = S_OK
client-sfweb1:19000/[::1]:19000: error = 2147943625, failureCount=9082. This is conclusive that there is no listener. Connection failure is expected if listener was never started, or listener / its process was stopped before / during connecting. Filter by (type~Transport.St && ~"(?i)sfweb1:19000") on the other node to get listener lifecycle, or (type~Transport.GlobalTransportData) for all listeners
Using Windows Task Manager (or similar tool) you would see the Fabricgateway.exe process was starting and terminating every few seconds.
The net effect of this was the Service Fabric cluster communication could not be established.
Solution
The problem was the domain account mydomain\admin-old (an old historic account, not use for a long period) had been deleted in the Active Directory, so no SID for the account could be found. This failure was causing then loop, even though the admin accounts were valid.
The fix was to remove this deleted ID from the cluster nodes current active setting.xml file. The process I used was
RDP onto a cluster node VM
Stop the service fabric service
Find the current service fabric cluster configuration e.g. the newest folder on the form D:\SvcFab\VM0\Fabric\Fabric.Config.4.123456
Edit the settings.xml and remove the deleted account mydomain\admin-old from the AdminClientIdentities block, so I ended up with
<Section Name="Security">
<Parameter Name="AdminClientIdentities" Value="mydomain\admin-new" />
...
Once the file is saved, restart the service fabric service, it should start as normal. Remember,it will take a minute or two startup
Repeat the process on the other nodes in the cluster.
Once completed the cluster starts and operates as expected

Disconnect service fabric cluster connection

I know that we can connect to the service fabric cluster using Connect-ServiceFabricCluster as mentioned in Microsoft learn, which works flawlessly.
I use this in a script - it prints the following every time it tries to connect to service fabric again.
WARNING: Cluster connection with the same name already existed, the old connection will be deleted
So, is there a way to safely disconnect from sf before executing the next steps or closing, other than letting the connection time out?
To Disconnect service fabric cluster connection we have a Remove-ServiceFabricCluster command.
WARNING: Cluster connection with the same name already existed, the old connection will be deleted
The warning indicates that you are trying to connect the already connected cluster.
The warning itself says that the old one will be deleted and new one will be created.
AFAIK, you can continue without disconnecting/ removing the cluster.
Reference taken from MSDoc.

Container App Environment creation timing out

Where I work has just started migrating to the cloud. We've successfully deployed a number of resources using Terraform and Pipelines into Azure.
Where we are running into issues is deploying a Container App Environment, we have code that was working in a less locked down environment (setup for Proof of Concept), but are now having issues using that code in our go-forward.
When deploying, the Container App Environment spends 30mins attempting to create before it returns a context deadline exceeded error. Looking in Azure Portal, I can see the resource in "Waiting" provisioning state and I can also see the MC_ and AKS resources that get generated. It then fails around 4hrs later.
Any advice?
I am suspecting it's related to security on the Virtual Network that the subnets are sitting on, but I'm not seeing any logs on the deployment to confirm. The original subnets had a Network Security Group (NSG) assigned and I configured the rules that Microsoft provide before I added a couple of subnets without an NSG assigned and no luck.
My next step is to try provisioning it via the GUI and see if that works.
I managed to break our build in the "anything goes" environment.
The root cause is an incomplete configuration of the Virtual Network which has custom DNS entries. This has now been passed to our network architects to resolve. If I can get more details on the fix they apply I'll include that here for anyone else that runs into the issue.

Azure Service Fabric Cluster Update

I have a cluster in Azure and it failed to update automatically so I'm trying a manual update. I tried via the portal, it failed so I kicked off an update using PS, it failed also. The update starts then just sits at "UpdatingUserConfiguration" then after an hour or so fails with a time out. I have removed all application types and check my certs for "NETWORK SERVCIE". The cluster is 5 VM single node type, Windows.
Error
Set-AzureRmServiceFabricUpgradeType : Code: ClusterUpgradeFailed,
Message: Cluster upgrade failed. Reason Code: 'UpgradeDomainTimeout',
Upgrade Progress:
'{"upgradeDescription":{"targetCodeVersion":"6.0.219.9494","
targetConfigVersion":"1","upgradePolicyDescription":{"upgradeMode":"UnmonitoredAuto","forceRestart":false,"u
pgradeReplicaSetCheckTimeout":"37201.09:59:01","kind":"Rolling"}},"targetCodeVersion":"6.0.219.9494","target
ConfigVersion":"1","upgradeState":"RollingBackCompleted","upgradeDomains":[{"name":"1","state":"Completed"},
{"name":"2","state":"Completed"},{"name":"3","state":"Completed"},{"name":"4","state":"Completed"}],"rolling
UpgradeMode":"UnmonitoredAuto","upgradeDuration":"02:02:07","currentUpgradeDomainDuration":"00:00:00","unhea
lthyEvaluations":[],"currentUpgradeDomainProgress":{"upgradeDomainName":"","nodeProgressList":[]},"startTime
stampUtc":"2018-05-17T03:13:16.4152077Z","failureTimestampUtc":"2018-05-17T05:13:23.574452Z","failureReason"
:"UpgradeDomainTimeout","upgradeDomainProgressAtFailure":{"upgradeDomainName":"1","nodeProgressList":[{"node
Name":"_mstarsf10_1","upgradePhase":"PreUpgradeSafetyCheck","pendingSafetyChecks":[{"kind":"EnsureSeedNodeQu
orum"}]}]}}'.
Any ideas on what I can do about a "EnsureSeedNodeQuorum" error ?
The root cause was only 3 seed nodes in the cluster as a result of the cluster being build with a VM scale set that had "overprovision" set to true. Lesson learned, remember to set "overprovision" to false.
I ended up deleting the cluster and scale set and recreated using my stored ARM template.

NServiceBus distributor cannot create queues on clustered MSMQ

I'm trying to set up a NServiceBus distributor on a Windows failover cluster. I've successfully followed the "official" guides and most of the things seem to work nicely. Except for actually starting the distributor on the cluster. When it starts it tries to create it's queues on the clustered MSMQ, but is denied permission:
Unhandled Exception: Magnum.StateMachine.StateMachineException: Exception occurred in Topshelf.Internal.ServiceController`1[[NServiceBus.Hosting.Windows.WindowsHost, NServiceBus.Host, Version=3.2.0.0, Culture=neutral, PublicKeyToken=9fc386479f8a226c]] during state Initial while handling OnStart ---> System.Exception: Exception when starting endpoint, error has been logged. Reason: The queue does not exist or you do not have sufficient permissions to perform the operation. ---> System.Messaging.MessageQueueException: The queue does not exist or you do not have sufficient permissions to perform the operation.
I'm able to create queues when opening the clustered MSMQ manager, but even if I run the distributor using my own account it gets this error.
Something that might be related, is that I cannot change properties on the Message Queuing object in the clustered MSMQ manager. For instance, I try to change the message storage limit, I get this error:
The properties of TEST-CLU-MSMQ cannot be set
Error: This operation is not supported for Message Queuing installed in workgroup mode
I can however change this setting on the node's MSMQ settings, and those are also installed in workgroup mode.
Any ideas? I've tried reinstalling the cluster and services and just about everything, to no avail. Environment is Windows Server 2008R2