Error starting Service Fabric Cluster (v 1.5.175) - azure-service-fabric

I've been trying to install and start the new preview SDK, and even after several installs/uninstalls/reboots I always get this error when running DevClusterSetup:
Start-Service : Failed to start service 'Microsoft Service Fabric Host Service (FabricHostSvc)'.
At C:\Program Files\Microsoft SDKs\Service Fabric\Tools\Scripts\ClusterSetupUtilities.psm1:433 char:5
(full log below)
What I've tried, from other posts on Stack Overflow:
repairing the performance counters with lodctr /R
used the system file checker with SFC /SCANNOW
checked that the windows firewall is running (and tried disabling it for the domain networks)
made sure I have enough disk space
The windows service "Microsoft Service Fabric Host Service" is always "Starting", but never starts.
I have two hints as to what the source of the problem might be, but can't solve it:
a) In the event viewer (Microsoft-Service Fabric > Admin) there are 4 errors that occur every time the service attempts to start:
Unable to stop data collector for performance counters. The command
"logman stop FabricCounters" failed with error code -2147287038.
System.Fabric.FabricDeployer.InvalidDeploymentException: Failed to
start performance counter collection when creating or updating
deployment
FabricDeployer::Install failed with error 0xffffffff
FabricDeployer::Install failed with error 0xffffffff, Rolling back
b) In the C:\SfDevCluster\Log\Traces folder there are files named something like FabricSetup-131034051696570691.trace . All of them have the same content, and in the middle there are warnings like these:
FabricSetup.FabricSetup.EventTraceInstaller,Method QueryDataCollectorSet failed with HRESULT: -2144337918
FabricSetup.FabricSetup.EventTraceInstaller,Method StopPlaTraceSession failed with HRESULT: -2144337918
and then further down the error:
FabricSetup.FabricSetup.FabricDeployer,Configuration Deployment failed with error 0xffffffff
If I go and check the Fabric deployer files (eg, fabricdeployer-635945286697202537.trace), I have a single error at the end, after a series of Performance counter deletes:
FabricDeployer.FabricDeployer,Executing command: logman stop FabricCounters
FabricDeployer.FabricDeployer,Unable to stop data collector for performance counters. The command "logman stop FabricCounters" failed with error code -2147287038.
but this error seems to come after some other error, as part of the rollback.
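The negative numbers in these traces are signed 32-bit renderings of HRESULT values, which are usually easier to search for in hexadecimal. A minimal sketch of the conversion follows; the function name ConvertTo-HResultHex is made up for illustration and is not part of the SDK.

```powershell
# Hypothetical helper: render a signed 32-bit error code as an 8-digit hex HRESULT.
function ConvertTo-HResultHex {
    param(
        [Parameter(Mandatory = $true)]
        [int]$Code
    )
    # The X8 format specifier prints the two's-complement bit pattern of an Int32.
    '0x{0:X8}' -f $Code
}

# Code reported for "logman stop FabricCounters"
ConvertTo-HResultHex -Code (-2147287038)
# Code reported by QueryDataCollectorSet and StopPlaTraceSession
ConvertTo-HResultHex -Code (-2144337918)
```

The two codes come out as 0x80030002 and 0x80300002. As far as I know both are "not found" style errors, which would fit a data collector set that no longer exists, but that reading should be checked against the HRESULT reference rather than taken from here.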
Any ideas? This is very frustrating and there is very little info on the net.
I've also tried cleaning the installation with CleanCluster.ps1 and installing the dev cluster to a different folder, always with the same result.
I am running Win10 with VS2015 Update 1 and Azure SDK 2.8.2.1. My user is a Live ID account with local admin rights.
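For anyone collecting the same evidence, here is a rough sketch of how the events and trace warnings described above could be pulled together from an elevated PowerShell session. The event channel names are discovered with a wildcard rather than hard-coded, since the post only gives the Event Viewer folder name; the trace folder is the default one mentioned in the question.

```powershell
# Run from an elevated PowerShell session.
# Discover the Service Fabric event channels; exact channel names are an unknown here.
$logs = Get-WinEvent -ListLog '*Fabric*' -ErrorAction SilentlyContinue

foreach ($log in $logs) {
    # Level 1 = Critical, 2 = Error. Take the 20 most recent from each channel.
    Get-WinEvent -FilterHashtable @{ LogName = $log.LogName; Level = 1, 2 } -MaxEvents 20 -ErrorAction SilentlyContinue |
        Select-Object TimeCreated, Id, LogName, Message
}

# Scan the setup traces for the "failed with" lines quoted in the question.
Get-ChildItem -Path 'C:\SfDevCluster\Log\Traces' -Filter 'FabricSetup-*.trace' -ErrorAction SilentlyContinue |
    Select-String -Pattern 'failed with' |
    Select-Object -Last 20

# Check the state of the host service that is stuck in "Starting".
Get-Service -Name FabricHostSvc | Select-Object Name, Status
```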

I'll start with a short answer to unblock you. From an elevated PowerShell session, run:
Unregister-ScheduledTask FabricCounters

I had the exact same issue, but in my case the FabricCounters task wasn't there. So I searched for other "Fabric*" tasks via Get-ScheduledTask Fabric* and found that both FabricAppInfoTraces and FabricQueryTraces were still present after the uninstall.
I removed both tasks using Unregister-ScheduledTask <name>, reinstalled the SDK and was able to start my local cluster again!
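The two answers above can be combined into one pass that lists every leftover "Fabric*" scheduled task and previews its removal. This is a sketch, not an official cleanup procedure; it assumes the SDK has already been uninstalled and that no unrelated task on the machine has a name starting with "Fabric".

```powershell
# Run from an elevated PowerShell session, after uninstalling the SDK.
# List any scheduled tasks whose name starts with "Fabric".
$leftovers = Get-ScheduledTask -TaskName 'Fabric*' -ErrorAction SilentlyContinue

if ($leftovers) {
    $leftovers | Select-Object TaskPath, TaskName, State

    # Preview the removal first; drop -WhatIf once the list looks right.
    $leftovers | Unregister-ScheduledTask -WhatIf
}
else {
    'No Fabric* scheduled tasks found.'
}
```

Once the preview shows only Service Fabric tasks, rerun the last pipeline without -WhatIf and reinstall the SDK.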


Failed to start service VisualStudioRemoteDeployer

We run on-site DevOps and have a problem similar to the one described in the linked example from SO.
But ours is intermittent.
Our environment uses two build and deploy machines, with each deploy machine having two worker agents.
For one of our projects, when it is deployed, we constantly get the error:
The VisualStudioRemoteDeployerc4d3852f-411b-48ba-97d8-5e09c8d07ce4 service failed to start due to the following error:
%%2
But here is the rub, not every time. Sometimes the deployment completes without error.
Other projects that use the same deployment machine and the same target server work each and every time without fail.
The deployment log reports "The WSMan provider host process did not return a proper response." as an error.
We checked the allocated memory, as described in PowerShell Out of Memory, and found ours set at 2.1 billion.
This is an interesting issue that I have uncovered. The problem stems from an interaction with McAfee Endpoint Security.
The antivirus reported that when the remote PowerShell script was called over WSMan, McAfee treated it as a viral payload and canceled the deployment by stopping the service from running and deleting the payload. This has been reported to McAfee as an issue. In the meantime, the internal network security settings for McAfee have had to be modified to ignore the processes used by PowerShell in remote deployment.
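The bare "%%2" in the service error is an unresolved Win32 error code 2, and resolving it supports the explanation above: a missing file is what you would expect if the antivirus deleted the payload before the service could start. A small sketch of how to resolve the code and check where the deployer service expects its binary; the LIKE filter on the service name is an assumption based on the name shown in the error.

```powershell
# Resolve Win32 error code 2 to its system message text.
$win32Error = New-Object System.ComponentModel.Win32Exception -ArgumentList 2
$win32Error.Message

# List any remote deployer services and the binary path each one points at.
# The name prefix is taken from the error message in the question.
Get-CimInstance -ClassName Win32_Service -Filter "Name LIKE 'VisualStudioRemoteDeployer%'" -ErrorAction SilentlyContinue |
    Select-Object Name, State, PathName
```

On an English Windows installation the first command should print a "cannot find the file specified" message. If a listed PathName no longer exists on disk, something removed the binary between copy and start.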

Azure Service Fabric publish upgrade from Visual Studio - PowerShell Script Error

I am trying to publish an upgrade of a Service Fabric application from Visual Studio 2017 to our Azure Service Fabric Cluster. In mid-September, I successfully published an upgrade of this same app with same PowerShell script to SFC with no issues. I am now trying to upgrade it at the next version number and suddenly getting this error.
I get the following error during Publish, related to Powershell.
2>Started executing script 'Deploy-FabricApplication.ps1'.
2>powershell -NonInteractive -NoProfile -WindowStyle Hidden -ExecutionPolicy Bypass -Command ". 'C:\Users\pj\Source\Workspaces\VDevelopment\trunk\Services\Sources\src\For.Application.ServiceFabric.Sources\Scripts\Deploy-FabricApplication.ps1' -ApplicationPackagePath 'C:\Users\pj\Source\Workspaces\VDevelopment\trunk\Services\Sources\src\For.Application.ServiceFabric.Sources\pkg\Debug' -PublishProfileFile 'C:\Users\pj\Source\Workspaces\VDevelopment\trunk\Services\Sources\src\For.Application.ServiceFabric.Sources\PublishProfiles\Cloud.xml' -DeployOnly:$false -ApplicationParameter:@{} -UnregisterUnusedApplicationVersionsAfterUpgrade $false -OverrideUpgradeBehavior 'None' -OverwriteBehavior 'SameAppTypeAndVersion' -SkipPackageValidation:$false -ErrorAction Stop"
2>Copying application package to image store...
2>Upload to Image Store succeeded
2>Registering application type...
2>Register application type started. Use Get-ServiceFabricApplicationType to query for status.
2>Running Image Builder process ...
2>Application package is registered.
2>Start upgrading application...
2>aka.ms/upgrade-defaultservices
2>Start-ServiceFabricApplicationUpgrade : aka.ms/upgrade-defaultservices
2>At C:\Program Files\Microsoft SDKs\Service
2>Fabric\Tools\PSModule\ServiceFabricSDK\Publish-UpgradedServiceFabricApplication.ps1:317 char:13
2>+ Start-ServiceFabricApplicationUpgrade @UpgradeParameters
2>+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2> + CategoryInfo : InvalidOperation: (Microsoft.Servi...usterConnection:ClusterConnection) [Start-ServiceFa
2> bricApplicationUpgrade], FabricException
2> + FullyQualifiedErrorId : UpgradeApplicationErrorId,Microsoft.ServiceFabric.Powershell.StartApplicationUpgrade
2>
2>Finished executing script 'Deploy-FabricApplication.ps1'.
2>Time elapsed: 00:07:39.0407526
2>The PowerShell script failed to execute.
========== Build: 1 succeeded, 0 failed, 10 up-to-date, 0 skipped ==========
========== Publish: 0 succeeded, 1 failed, 0 skipped ==========
Any idea what's going on here? Again, when I last published this in September, with the same script, no issues at all, and I haven't made any changes to the solution other than upgrading the Manifest versions to push it out as a new upgraded version.
I noted this S/O thread: Getting error as part of trying to upgrade Service Fabric Application using Start-ServiceFabricApplicationUpgrade and saw the user's error was similar, but the answer does not apply to my issue because all three steps in the answer provided are definitely included in my powershell deploy script.
I can add the deployment script if helpful, but will wait until that is requested as it's long, and I only want to post it here if someone feels it's needed to diagnose.
You are getting this error because you are changing some parameters of a default service, and changing them during an upgrade is not allowed by default.
The link aka.ms/upgrade-defaultservices shown in the error log explains this:
Some default service parameters defined in the application manifest can also be upgraded as part of an application upgrade. Only the service parameters that support being changed through Update-ServiceFabricService can be changed as part of an upgrade. The behavior of changing default services during application upgrade is as follows:
1. Default services in the new application manifest that do not already exist in the cluster are created.
2. Default services that exist in both the previous and new application manifests are updated. The parameters of the default service in the new application manifest overwrite the parameters of the existing service. The application upgrade will rollback automatically if updating a default service fails.
3. Default services that do not exist in the new application manifest are deleted if they exist in the cluster. Note that deleting a default service will result in deleting all that service's state and cannot be undone.
Also, there is this other SO question about the same thing: Default service descriptions can not be modified as part of upgrade set EnableDefaultServicesUpgrade to true
Item 1 above is the common case: new services are added to the solution and are created during the upgrade without errors. Items 2 and 3 are the restricted cases that require EnableDefaultServicesUpgrade.
Item 2 is what the answer you added describes. You changed MinReplicaSetSize and TargetReplicaSetSize to 1 during a manual update. When SF validated the state of your service for the upgrade, it found the difference and stopped the upgrade. If the cluster setting EnableDefaultServicesUpgrade had been set to true, the upgrade would have continued and overridden the existing values.
Item 3 would occur if you removed the service and added it again, or changed or misspelled its name. SF's default settings would prevent the deletion of the service.
The solution you found (delete and recreate) is not ideal.
With stateful services running in production it would be risky to apply, because you would have to back up the state, redeploy the services, and restore the backup. In some cases, depending on what the changes are, you would not be able to restore the backup at all, because it has to match the original service definition (partition type, partition count, and so on). You would also lose the benefits of rolling upgrades, and your service could be down for a while if the backups are large.
The issue had to do with us trying to push out the application with mismatched replica settings. We have a stateful service running under this application that is supposed to have MinReplicaSetSize and TargetReplicaSetSize set to 3. Yesterday, due to an issue, we deleted and re-created this service inside the SF Explorer. Doing so reset the replica size parameters back to 1. So we used a PowerShell script to change them back to 3, but that script did not include all the necessary commands to get the service back to the exact state it was in before we deleted it. So today when we went to upgrade the app, the app in SFC wouldn't accept an upgrade from the VS deployment, because of mismatches between the parameters in the solution and what was in our SFC. To resolve, we re-deleted those services first, then deployed from VS, and no more error.
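Before resorting to delete and redeploy, it may be worth comparing what the cluster currently holds for the service against the application manifest. The sketch below assumes an existing connection made with Connect-ServiceFabricCluster; the service name is a placeholder, not taken from the post. As the poster found, replica set sizes may not be the only parameters that drifted, so treat this as a starting point.

```powershell
# Assumes Connect-ServiceFabricCluster has already succeeded in this session.
# 'fabric:/MyApp/MyStatefulService' is a placeholder name for illustration only.
$serviceName = 'fabric:/MyApp/MyStatefulService'

# Inspect the description the cluster currently holds for the service.
Get-ServiceFabricServiceDescription -ServiceName $serviceName

# Bring the replica set sizes back in line with the manifest (3 in the post).
Update-ServiceFabricService -Stateful -ServiceName $serviceName -MinReplicaSetSize 3 -TargetReplicaSetSize 3
```

If the description still differs from the manifest after that, the upgrade will most likely keep failing until the remaining differences are resolved or EnableDefaultServicesUpgrade is enabled on the cluster.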

Local cluster installation .\DevClusterSetup.ps1 fails Waiting for Naming Service to be ready

When trying to set up the local cluster with the PowerShell script, I get the following error:
Is there any way of continuing the installation or fixing the cause of this error?
Cheers,
Mike
I have completely removed the SDK and started over but I am still having the same issues. Everything boils down to the 'Connect-ServiceFabricCluster' just doesn't work at all (I have followed all of the suggestions provided).
Surely the warnings about the naming services must point to something?
Each attempt I see the following:
WARNING: Failed to contact Naming Service. Attempting to contact Failover Manager Service...
2>WARNING: Failed to connect Failover Manager Service, Attempting to contact FMM...
2>Connect-ServiceFabricCluster : A communication error caused the operation to fail.
2>At D:\Source\Play\ServiceFabricApplication\ServiceFabricApplication\Scripts\Deploy-FabricApplication.ps1:158 char:16
2>+ ... [void](Connect-ServiceFabricCluster @ClusterConnectionParameters ...
2>+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2> + CategoryInfo : InvalidOperation: (:) [Connect-ServiceFabricCluster], FabricTransientException
2> + FullyQualifiedErrorId : CreateClusterConnectionErrorId,Microsoft.ServiceFabric.Powershell.ConnectCluster
Attempting a reset from the tray:
Tray output
In my case the Cluster was not running (ie no Fabric.exe processes in Task Manager).
I was able to get things working again by opening PowerShell as Admin and running:
& "$ENV:ProgramFiles\Microsoft SDKs\Service Fabric\ClusterSetup\DevClusterSetup.ps1"
After that close the powershell window and open a new one (as Admin). Then Connect-ServiceFabricCluster worked.
This usually indicates that the main service host isn't running. If this is on our just-released public preview SDK, you can usually resolve these situations by resetting the cluster (just right click on the service fabric tray icon and click reset). If this is an older rev, well, then first you should upgrade :) But other than that you can check inside services.msc and make sure FabricHostSvc is running.
The error is a temporary communication error. Open Task Manager, go to the 'Details' tab and check whether 'FabricHost.exe' and 'Fabric.exe' are running. This indicates whether the cluster has been set up and is running.
Open a new administrator PowerShell window and try to connect to the cluster using 'Connect-ServiceFabricCluster'.
If the connection still fails, try to remove the cluster using 'CleanCluster.ps1' and set it up again using 'DevClusterSetup.ps1'. This should fix the issue.
Please visit Troubleshoot your local development cluster setup.
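The checks from the answers above (service state, running processes, then a connection attempt) can also be run from PowerShell instead of Task Manager and services.msc. A minimal sketch, assuming the default local dev cluster and an elevated session:

```powershell
# Run from a new elevated PowerShell window.
# 1. Is the host service running?
Get-Service -Name FabricHostSvc | Select-Object Name, Status

# 2. Are the cluster processes present? An empty result means the cluster is not up.
Get-Process -Name FabricHost, Fabric -ErrorAction SilentlyContinue |
    Select-Object Name, Id

# 3. Try to connect; with no parameters this targets the local cluster.
Connect-ServiceFabricCluster
```

If step 2 returns nothing, resetting the cluster or rerunning DevClusterSetup.ps1 as described above is the next thing to try.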
I recently had a similar situation where all the TCP connections were erroring out with a FabricTransientException exception.
The underlying cause turned out to be the Windows Firewall. Once I disabled the firewall for the domain network, the connections were successful and the services were again accessible.
P.S. In case someone faces the same issue: initially the problem was that after the installation the Fabric Host service was just stalling in the "Starting" status. The main cause was that the Windows Firewall service was disabled on the server. After enabling and starting the Windows Firewall service, the Fabric Host service started as expected.
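Since both reports above point at the firewall (one at a profile setting, the other at the service being disabled), it helps to look at both pieces before changing anything. A short sketch; MpsSvc is the short name of the Windows Firewall service, and Get-NetFirewallProfile needs Windows 8 / Server 2012 or later.

```powershell
# State of the Windows Firewall service itself (short name MpsSvc).
Get-Service -Name MpsSvc | Select-Object Name, DisplayName, Status

# Per-profile firewall state (Domain, Private, Public).
Get-NetFirewallProfile | Select-Object Name, Enabled
```

A stopped or disabled MpsSvc matches the second report, where FabricHostSvc hangs in "Starting". A running service with the Domain profile enabled matches the first, where connections fail with FabricTransientException.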

Start-Service : Failed to start service 'Microsoft Service Fabric Host Service (FabricHostSvc)'

I want to start working with Azure Service Fabric technology.
I am working according to this document and installed the latest SDK.
After installation, I opened a PowerShell command-line window ("Run as administrator") and entered these lines:
# Set-ExecutionPolicy -ExecutionPolicy Unrestricted -Force -Scope CurrentUser
# cd "$env:ProgramW6432\Microsoft SDKs\Service Fabric\ClusterSetup"
# .\DevClusterSetup.ps1
As an answer, got this error:
Cleaning existing cluster ...
NOTE: If this powershell command window exits, please re-run the script in a new powershell command window.
Stopping service FabricHostSvc. This may take a few minutes...
Removing cluster configuration
Remove node configuration succeeded
Cleaning existing certificates
Stopping all logman sessions
Cleaning log and data folder, the powershell window may close automatically.
ClusterPath not provided, will use C:\SfDevCluster
FabricDataRoot not provided, will use C:\SfDevCluster\Data
FabricLogRoot not provided, will use C:\SfDevCluster\Log
Directory: C:\
Mode LastWriteTime Length Name
---- ------------- ------ ----
d---- 4/11/2015 12:47 PM SfDevCluster
Directory: C:\SfDevCluster
Mode LastWriteTime Length Name
---- ------------- ------ ----
d---- 4/11/2015 12:47 PM Manifests
True
Create node configuration succeeded
Starting service FabricHostSvc. This may take a few minutes...
Start-Service : Failed to start service 'Microsoft Service Fabric Host Service (FabricHostSvc)'.
At C:\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup\DevClusterSetup.ps1:167 char:1
+ Start-Service FabricHostSvc -WarningAction SilentlyContinue
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : OpenError: (System.ServiceProcess.ServiceController:ServiceController) [Start-Service],
ServiceCommandException
+ FullyQualifiedErrorId : StartServiceFailed,Microsoft.PowerShell.Commands.StartServiceCommand
WARNING: Could not start FabricHostSvc
The bottom line is "Failed to start service". This output is printed to the screen after 3 minutes of waiting.
Things I've already tried:
Restarted the computer a few times (I read somewhere that this solves the problem).
Turned off my antivirus/firewall software.
Attached screenshot of the PowerShell Command line.
I'm using:
Visual studio 2015 Enterprise edition
Windows 8.1
Azure Service Fabric SDK v1.0.328
I also fought with this problem just this morning. I did NOT have to reinstall Windows.
I too found events in the event log talking about corrupt performance counters. I'm not sure if it's related or not, but I ran this command from a cmd window as administrator to rebuild the performance counters and the error cleared up:
lodctr /r
I then went to Programs and Features and uninstalled anything that mentioned Service Fabric.
I then reinstalled the Service Fabric SDK and followed the instructions on the Azure Service Fabric environment setup page here, and my cluster started working fine.
I was facing the same issue and tried many times one evening and next morning I got the answer. Well the answer is "Ensure that Firewall is on".
UPDATE2: This is a very old issue and I have not seen this reoccur since Nov 2015. (added just so this post doesn't get down-voted any more :-/ ).
UPDATE: I have not had this issue since the November update.
ORIGINAL:I had this issue the other day and tried everything. I had uninstalled, rebooted, reinstalled everything from Service Fabric down to Azure SDK's and Visual Studio.
The fix - pretty bad. Reinstall windows.
At one point, I found a trace that indicated a registry corruption. Something about unable to find a performance counter.
Right now I have a new issue (which I'll post separately) but just repeating here to let people know there is some buggy infrastructure under this service right now.... My stateful/stateless app deploys. The stateless app deploys and runs. The stateful service deploys but will not replicate. If I run exactly the same code on another machine (and I mean copy/paste to other machine then run it), it all works.

PowerShell remote sessions: Problems with ESET Nod32 AntiVirus

I am making my first attempts at using PowerShell remoting features. I've set up the "destination" server using the instructions in the help docs. But when I attempt to start a remote session (by executing an "Enter-PSSession servername1" command), it sits there for a long time, and eventually gives this error:
Enter-PSSession : Connecting to remote server failed with the following error message : The WinRM client cannot complete the operation within the time specified. Check if the machine name is valid and is reachable over the network and firewall exception for Windows Remote Management service is enabled. For more information, see the about_Remote_Troubleshooting Help topic.
I also noticed that while it was sitting there, my computer's performance had degraded. Looking at Task Manager, I see that ekrn.exe, which is the kernel process for Nod32 Antivirus, was using a lot of CPU (~50%, sometimes edging higher). It seems to never stop using the CPU until I kill the process, and I did some testing, and it clearly begins to use all that CPU as soon as I execute that Enter-PSSession command.
I then tried disabling the Nod32 anti-virus, executed the same command, and voilà, it worked, and the remote session started properly.
But obviously disabling my anti-virus isn't a solution. Can anyone suggest a better one?
It turns out I wasn't running the latest version of Nod32. I was running version 3; I was able to upgrade to version 4, and the problem went away.
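When a remote session times out like this, two quick checks can help separate a network or listener problem from local interference such as an antivirus filter. This is only a sketch: 'servername1' is the host name from the question, 5985 is the default WinRM HTTP port, and Test-NetConnection is only available on Windows 8.1 / Server 2012 R2 or later.

```powershell
# 'servername1' is the destination host used in the question.
$target = 'servername1'

# Is the default WinRM HTTP port reachable at all?
Test-NetConnection -ComputerName $target -Port 5985

# Does a WinRM listener answer with its identity information?
Test-WSMan -ComputerName $target
```

If the port is reachable but Test-WSMan also hangs while the antivirus process spikes, that points at local traffic inspection rather than the destination server's configuration.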