Failed to Create Development Service Fabric Cluster on Windows Server 2016 Standard - azure-service-fabric

I am attempting to create a local development (unsecured) Service Fabric Cluster on Windows Server 2016 Standard. I have followed the instructions found in this article. However, I'm getting a rather interesting error and cannot find anything to help me resolve this.
FabricHostSvc was not installed by FabricInstallerSvc on machine localhost. FabricSetup may have failed.
CreateCluster Error: System.AggregateException: One or more errors occurred. ---> System.Fabric.FabricServiceNotFoundException: FabricHostSvc was not installed by FabricInstallerSvc on machine localhost. FabricSetup may have failed.
at Microsoft.ServiceFabric.DeploymentManager.DeploymentManagerInternal.StartAndValidateInstallerServiceCompletion(String machineName, ServiceController installerSvc)
at System.Threading.Tasks.Parallel.<>c__DisplayClass17_0`1.<ForWorker>b__1()
at System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
at System.Threading.Tasks.Task.<>c__DisplayClass176_0.<ExecuteSelfReplicating>b__0(Object )
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
at System.Threading.Tasks.Parallel.ForWorker[TLocal](Int32 fromInclusive, Int32 toExclusive, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Func`4 bodyWithLocal, Func`1 localInit, Action`1 localFinally)
at System.Threading.Tasks.Parallel.ForEachWorker[TSource,TLocal](IEnumerable`1 source, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Action`3 bodyWithStateAndIndex, Func`4 bodyWithStateAndLocal, Func`5 bodyWithEverything, Func`1 localInit, Action`1 localFinally)
at System.Threading.Tasks.Parallel.ForEach[TSource](IEnumerable`1 source, Action`1 body)
at Microsoft.ServiceFabric.DeploymentManager.DeploymentManagerInternal.RunFabricServices(List`1 machines, FabricPackageType fabricPackageType)
at Microsoft.ServiceFabric.DeploymentManager.DeploymentManagerInternal.<CreateClusterAsyncInternal>d__7.MoveNext()
---> (Inner Exception #0) System.Fabric.FabricServiceNotFoundException: FabricHostSvc was not installed by FabricInstallerSvc on machine localhost. FabricSetup may have failed.
at Microsoft.ServiceFabric.DeploymentManager.DeploymentManagerInternal.StartAndValidateInstallerServiceCompletion(String machineName, ServiceController installerSvc)
at System.Threading.Tasks.Parallel.<>c__DisplayClass17_0`1.<ForWorker>b__1()
at System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
at System.Threading.Tasks.Task.<>c__DisplayClass176_0.<ExecuteSelfReplicating>b__0(Object )<---
Cleaning up faulted installation.
FabricRoot not found in registry of target machine localhost.
Create Cluster failed. For more information please look at traces in FabricLogRoot.
Create Cluster failed with exception: System.AggregateException: One or more errors occurred. ---> System.AggregateException: One or more errors occurred.
at Microsoft.ServiceFabric.DeploymentManager.DeploymentManagerInternal.<CreateClusterAsyncInternal>d__7.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.DeploymentManager.DeploymentManager.d__0.MoveNext()
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
at Microsoft.ServiceFabric.Powershell.ClusterCmdletBase.NewCluster(String clusterConfigurationFilePath, String fabricPackageSourcePath, Boolean cleanupOnFailure)
---> (Inner Exception #0) System.AggregateException: One or more errors occurred.
at Microsoft.ServiceFabric.DeploymentManager.DeploymentManagerInternal.<CreateClusterAsyncInternal>d__7.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.DeploymentManager.DeploymentManager.d__0.MoveNext()<---
Has anyone encountered this error before and fixed it? How is this error resolved?
Side Note: After receiving this error I ran the CleanFabric PowerShell script and removed all the Service Fabric files from the server and tried running the installation again with the same error message.
In addition, there are no Service Fabric SDKs installed on the machine (the ones you'd use on a local development machine). This follows the official prerequisites stated by Microsoft, shown below.
Prerequisites for each machine that you want to add to the cluster:
1. A minimum of 16 GB of RAM is recommended.
2. A minimum of 40 GB of available disk space is recommended.
3. A 4 core or greater CPU is recommended.
4. Connectivity to a secure network or networks for all machines.
5. Windows Server 2012 R2 or Windows Server 2012 (you need to have KB2858668 installed).
6. .NET Framework 4.5.1 or higher, full install.
7. Windows PowerShell 3.0. The RemoteRegistry service should be running on all the machines.
The cluster administrator deploying and configuring the cluster must have administrator privileges on each of the machines. You cannot install Service Fabric on a domain controller.
I cannot help but feel there is something obvious missing but I've followed the docs very closely so this is rather perplexing.

The Service Fabric drivers have a signing issue that prevents them from being installed on Windows Server 2016 and Windows 10 Anniversary Update. Please wait for the next release, or try version 5.2.

Related

Service Fabric Application fails to find the managed identity endpoint

The Service Fabric cluster exists, and the applications exist and are running. The user-assigned managed identity exists in the same resource group as the cluster. NOTE: I do not know how to verify whether it is assigned to the cluster or not.
Code is trying to create a Storage queues client using the identity and I get the error below, which I think means that the fabric:/System/ManagedIdentityTokenService is not running. NOTE: I do not know how to verify whether the service is running or not.
NOTE: Very similar code worked in other clusters.
NOTE: the underlying VMSS does have the managed identity associated to it.
NOTE: I am using Storage SDK 12. The C# code does the following:
// using System; using Azure.Identity; using Azure.Storage.Queues;
ManagedIdentityCredential cred = new ManagedIdentityCredential(clientId: "XYZ...");
string queueEndpoint = string.Format("https://{0}.queue.core.windows.net/{1}", accountName, queueName);
QueueClient qc = new QueueClient(new Uri(queueEndpoint), cred);
var response = await qc.CreateIfNotExistsAsync(); // This one throws the error below.
Any guidance to fix this issue would be appreciated.
Error:
Trying to create a queue (using MSI) failed with exception Azure.Identity.CredentialUnavailableException: No managed identity endpoint found.
at Azure.Identity.ExtendedAccessToken.GetTokenOrThrow()
at Azure.Identity.ManagedIdentityCredential.d__8.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Azure.Core.Pipeline.BearerTokenAuthenticationPolicy.AccessTokenCache.d__11.MoveNext()
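One way to narrow this down is to check whether the managed identity settings are visible in the service's environment at all, since "No managed identity endpoint found" suggests the credential could not detect any. Below is a rough sketch, written in Python for brevity; the same check maps onto Environment.GetEnvironmentVariable inside the C# service. The variable names are what I recall from Microsoft's Service Fabric managed identity documentation and should be treated as an assumption to verify, and missing_identity_vars is a made-up helper name.

```python
import os
from typing import List, Mapping

# Assumed names: environment variables that the Service Fabric managed
# identity documentation describes as injected into service processes.
EXPECTED_VARS = ("IDENTITY_ENDPOINT", "IDENTITY_HEADER", "IDENTITY_SERVER_THUMBPRINT")


def missing_identity_vars(env: Mapping[str, str]) -> List[str]:
    """Return the expected variables that are absent or empty in `env`."""
    return [name for name in EXPECTED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_identity_vars(os.environ)
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All managed identity variables are present.")
```

If any of them turn out to be missing inside the service process, that would point at how the identity is assigned to the application rather than at the storage code, though I have not confirmed this for this cluster.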

System.Fabric.FabricNotPrimaryException on GetStateAsync inside Actor

I've asked this question on GitHub also - https://github.com/Azure/service-fabric-issues/issues/379
I have (n) actors that are executing on a continuous reminder every second.
These actors had been running fine for the last 4 days when, out of nowhere, every instance received the exception below on calling StateManager.GetStateAsync. Subsequently, I see that all the actors are deactivated.
I cannot find any information relating to this exception being encountered by reliable actors.
Once this exception occurs and the actors are deactivated, they do not get re-activated.
What are the conditions for this error to occur and how can I further diagnose the problem?
"System.Fabric.FabricNotPrimaryException: Exception of type 'System.Fabric.FabricNotPrimaryException' was thrown. at Microsoft.ServiceFabric.Actors.Runtime.ActorStateProviderHelper.d__81.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.Actors.Runtime.ActorStateManager.d__181.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.Actors.Runtime.ActorStateManager.d__7`1.MoveNext()
Having a look at the cluster explorer, I can now see the following warnings on one of the partitions for that actor service:
Unhealthy event: SourceId='System.FM', Property='State', HealthState='Warning', ConsiderWarningAsError=false.
Partition reconfiguration is taking longer than expected.
fabric:/Ism.TvcRecognition.App/TvChannelMonitor 3 3 4dcca5ee-2297-44f9-b63e-76a60df3bc3d
S/S IB _Node1_4 Up 131456742276273986
S/P RD _Node1_2 Up 131456742361691499
P/S RD _Node1_0 Down 131457861497316547
(Showing 3 out of 4 replicas. Total available replicas: 1.)
With a warning in the primary replica of that partition:
Unhealthy event: SourceId='System.RAP', Property='IReplicator.CatchupReplicaSetDuration', HealthState='Warning', ConsiderWarningAsError=false.
And a warning in the ActiveSecondary:
Unhealthy event: SourceId='System.RAP', Property='IStatefulServiceReplica.CloseDuration', HealthState='Warning', ConsiderWarningAsError=false. Start Time (UTC): 2017-08-01 04:51:39.740 _Node1_0
3 out of 5 Nodes are showing the following error:
Unhealthy event: SourceId='FabricDCA', Property='DataCollectionAgent.DiskSpaceAvailable', HealthState='Warning', ConsiderWarningAsError=false. The Data Collection Agent (DCA) does not have enough disk space to operate. Diagnostics information will be left uncollected if this continues to happen.
More Information:
My cluster setup consists of 5 nodes of D1 virtual machines.
Event viewer errors in Microsoft-Service Fabric application:
I see quite a lot of
Failed to read some or all of the events from ETL file D:\SvcFab\Log\QueryTraces\query_traces_5.6.231.9494_131460372168133038_1.etl.
System.ComponentModel.Win32Exception (0x80004005): The handle is invalid
at Tools.EtlReader.TraceFileEventReader.ReadEvents(DateTime startTime, DateTime endTime)
at System.Fabric.Dca.Utility.PerformWithRetries[T](Action`1 worker, T context, RetriableOperationExceptionHandler exceptionHandler, Int32 initialRetryIntervalMs, Int32 maxRetryCount, Int32 maxRetryIntervalMs)
at FabricDCA.EtlProcessor.ProcessActiveEtlFile(FileInfo etlFile, DateTime lastEndTime, DateTime& newEndTime, CancellationToken cancellationToken)
and a heap of warnings like:
Api IStatefulServiceReplica.Close() slow on partition {4dcca5ee-2297-44f9-b63e-76a60df3bc3d} replica 131457861497316547, StartTimeUTC = 2017-08-01T04:51:39.789083900Z
And finally I think I might be at the root of all this. Event Viewer Application Logs has a whole ream of errors like:
Ism.TvcRecognition.TvChannelMonitor (3688) (4dcca5ee-2297-44f9-b63e-76a60df3bc3d:131457861497316547): An attempt to write to the file "D:\SvcFab_App\Ism.TvcRecognition.AppType_App1\work\P_4dcca5ee-2297-44f9-b63e-76a60df3bc3d\R_131457861497316547\edbres00002.jrs" at offset 5242880 (0x0000000000500000) for 0 (0x00000000) bytes failed after 0.000 seconds with system error 112 (0x00000070): "There is not enough space on the disk. ". The write operation will fail with error -1808 (0xfffff8f0). If this error persists then the file may be damaged and may need to be restored from a previous backup.
Ok so, that error is pointing to the D drive, which is Temporary Storage. It has 549 MB free of 50 GB.
Should Service Fabric really be persisting to Temporary Storage?
Re: the errors - yeah looks like disk full causing failures. Just to close the loop here - looks like you found out that your state wasn't actually getting distributed in the cluster, and once you fixed that you stopped seeing disk full. Your capacity planning should hopefully make more sense now.
Regarding safety: TLDR: Using the temporary drive is fine because you're using Service Fabric. If you weren't then using that drive for real data would be a very bad idea.
Those drives are "temporary" from Azure's perspective in the sense that they're the local drives on the machine. Azure doesn't know what you're doing with the drives, and it doesn't want any single-machine app to think that data written there is safe, since Azure may service-heal the VM in response to a bunch of different things.
In SF we replicate the data to multiple machines, so using the local disks is fine/safe. SF also integrates with Azure so that a lot of the management operations that would destroy that data are managed in the cluster to prevent exactly that from happening. When Azure announces that it's going to do an update that will destroy the data on that node, we move your service somewhere else before allowing that to happen, and try to stall the update in the meantime. Some more info on that is here.
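For anyone who wants to catch the disk-full condition before replicas start failing, here is a minimal sketch of a free-space check for the drive that holds the Service Fabric data. It is in Python purely for illustration; the 5% threshold and the function names are my own assumptions, not anything Service Fabric prescribes. For reference, the 549 MB free of 50 GB mentioned in the question is only about 1% free.

```python
import shutil


def free_fraction(free_bytes: int, total_bytes: int) -> float:
    """Return the share of the disk that is still free, between 0.0 and 1.0."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def is_low_on_space(path: str, threshold: float = 0.05) -> bool:
    """True when the volume containing `path` has less than `threshold` free.

    The 5% default is an arbitrary illustrative value.
    """
    usage = shutil.disk_usage(path)
    return free_fraction(usage.free, usage.total) < threshold


if __name__ == "__main__":
    MB = 1024 ** 2
    GB = 1024 ** 3
    # Figures from the question: 549 MB free on a 50 GB temporary drive.
    print(f"{free_fraction(549 * MB, 50 * GB):.2%} free")
    # Replace "." with the data drive, e.g. "D:\\" on the affected node.
    print("low on space:", is_low_on_space("."))
```

A check like this only reports the symptom; as noted above, the underlying fix in this case was getting the state distributed across the cluster.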

Patch Orchestration Application issue - NodeAgentSFUtility.exe crashing

So I'm working on getting POA going. The issue I'm running into is that as soon as the Node Agent NT Service (POSNodeSvc) starts, it runs NodeAgentSFUtility.exe, which then fails with the exception below and an HRESULT of 80071c43, which seems to mean "connection denied". No logs are present. They both run as SYSTEM. This is an on-prem cluster using Windows security. BTW, all the SF services for POA are showing green in SF Explorer, so there seems to be room for better health reporting when this exe is not running correctly.
Application: NodeAgentSFUtility.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.Runtime.InteropServices.COMException
at System.Fabric.Interop.NativeClient+IFabricQueryClient9.EndGetApplicationList2(IFabricAsyncOperationContext)
at System.Fabric.FabricClient+QueryClient.GetApplicationListAsyncEndWrapper(IFabricAsyncOperationContext)
at System.Fabric.Interop.AsyncCallOutAdapter2`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].Finish(IFabricAsyncOperationContext, Boolean)
Exception Info: System.Fabric.FabricConnectionDeniedException
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(System.Threading.Tasks.Task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
at Microsoft.ServiceFabric.PatchOrchestration.NodeAgentSFUtility.Helpers.CoordinatorServiceHelper+<GetApplicationDeployedStatusAsync>d__1.MoveNext()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(System.Threading.Tasks.Task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
at Microsoft.ServiceFabric.PatchOrchestration.NodeAgentSFUtility.CommandProcessor+<GetApplicationDeployedStatusAsync>d__10.MoveNext()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(System.Threading.Tasks.Task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
at Microsoft.ServiceFabric.PatchOrchestration.NodeAgentSFUtility.CommandProcessor+<ProcessArguments>d__5.MoveNext()
Exception Info: System.AggregateException
at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean)
at System.Threading.Tasks.Task.Wait(Int32, System.Threading.CancellationToken)
at Microsoft.ServiceFabric.PatchOrchestration.NodeAgentSFUtility.Program.Main(System.String[])
I was able to make this work by adding the following to the cluster manifest:
"ClientIdentities": [
  {
    "Identity": "NT AUTHORITY\\SYSTEM",
    "IsAdmin": true
  }
]
Not quite sure if this really is needed? Can someone please confirm. There is no mention of this in the POA docs - https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-patch-orchestration-application
Thanks,
Hans
There appears to be a POA fix coming to address this. See link in above comment.

TFS Build Agent stopped

I have a problem with a running build machine where the agent suddenly will not start. It has been part of a remote controller, and to troubleshoot this issue I've started a local controller. The symptoms are that the agent(s) initialize correctly (status says 'Ready') but show the stopped icon, and the status area says 'BuildController has not been started in 1 minutes'. The AD account running the build service works on another build machine (separate controller + build agents). I've tried the following:
Reinstalled the build service
Ran with machine name, fully qualified domain name, and IP address as the endpoint address
Unregistered and re-registered the build service
Rebooted
Cleaned up build agent registrations with a script
If I change the service account running the build service to my own AD account, it works. However, running under our dedicated build user fails on this particular machine, but not on the other. Any suggestions on what to do? Here's the error from the event log:
Service 'Default Agent - tfs2010build1' had an exception:
Exception Message: There was no endpoint listening at http://tfs2010build1:9191/Build/v3.0/Services/Controller/31 that could accept the message. This is often caused by an incorrect address or SOAP action. See InnerException, if present, for more details. (type EndpointNotFoundException)
Exception Stack Trace:
Server stack trace:
at System.ServiceModel.Channels.HttpOutput.WebRequestHttpOutput.GetOutputStream()
at System.ServiceModel.Channels.HttpOutput.Send(TimeSpan timeout)
at System.ServiceModel.Channels.HttpChannelFactory.HttpRequestChannel.HttpChannelRequest.SendRequest(Message message, TimeSpan timeout)
at System.ServiceModel.Channels.RequestChannel.Request(Message message, TimeSpan timeout)
at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, Object[] outs, TimeSpan timeout)
at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)
at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)
Exception rethrown at [0]:
at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
at Microsoft.TeamFoundation.Build.Machine.IBuildControllerService.TestConnectionFromController(String agentUri)
at Microsoft.TeamFoundation.Build.Machine.ServiceProxies.ServiceProxy`1.<>c__DisplayClass3.<Do>b__2(T channel)
at Microsoft.TeamFoundation.Build.Machine.ServiceProxies.ServiceProxy`1.Do[TResult](Func`2 action)
at Microsoft.TeamFoundation.Build.Machine.BuildAgentService.<>c__DisplayClass12.<TestConnection>b__11(Object )
Inner Exception Details:
Exception Message: Unable to connect to the remote server (type WebException)
Exception Stack Trace: at System.Net.HttpWebRequest.GetRequestStream(TransportContext& context)
at System.Net.HttpWebRequest.GetRequestStream()
at System.ServiceModel.Channels.HttpOutput.WebRequestHttpOutput.GetOutputStream()
Inner Exception Details:
Exception Message: No connection could be made because the target machine actively refused it 127.0.0.1:38742 (type SocketException)
Exception Stack Trace: at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
at System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Int32 timeout, Exception& exception)
Found the problem. Proxy server was enabled under IE options. Not sure why the build service worked under my AD user account and not the dedicated build user, but it solved the problem.
---->>>>>Update!
So we have 2 machines (B1 & B2), each with 2 agents. B1 had the initial problem, which was solved by disabling the proxy settings under IE. Yesterday B2 suddenly started showing the same symptoms and error messages on its 2 agents. The proxy setting is NOT enabled there. So while it did fix B1, it's not the universal solution for this particular problem.
It's hard work keeping these build agents running :( - Miss TeamCity...
---->>>>Update again!
So yesterday when I looked at the proxy configuration, it wasn't set. However this morning the checkbox was checked. Disabled the proxy and the agents went online. Very strange behavior! Wonder if Windows Update changes these settings...
I often get this same issue with the proxy when I am forced to manually stop a build. I have not been able to find any decent resolutions for this.
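Since IE proxy settings are normally stored per user, one thing worth checking is which proxy configuration the build service account itself sees, which might also explain why a personal AD account behaved differently. A small sketch, in Python for illustration: on Windows, urllib.request.getproxies() falls back to the Internet Settings registry values of the account running the script when no proxy environment variables are set. Running it as the build user is my suggestion rather than something confirmed in this thread, and summarize_proxies is a hypothetical helper name.

```python
import urllib.request
from typing import Dict, List


def summarize_proxies(proxies: Dict[str, str]) -> List[str]:
    """Turn a scheme-to-proxy mapping into readable lines, sorted by scheme."""
    if not proxies:
        return ["no proxy configured"]
    return [f"{scheme}: {url}" for scheme, url in sorted(proxies.items())]


if __name__ == "__main__":
    # getproxies() checks proxy environment variables first, then the
    # platform's system settings (the registry on Windows).
    for line in summarize_proxies(urllib.request.getproxies()):
        print(line)
```

Running this once under each account and comparing the output would show whether the two accounts really see different proxy settings.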

Crystal report 9.2, incorrect log on parameters

Background:
The web application is developed for .NET Framework 4.0 and integrates Crystal Reports 9.2. The application runs under Integrated Windows Authentication. The reports work fine when executed from the solution running under Visual Studio 2010. When the same reports are deployed to the server (web server OS: Windows Server 2003 SP2 32-bit; DB server OS: Windows Server 2003 32-bit), the following error occurs: incorrect log on parameters. The Crystal Reports runtime engine for .NET Framework 4.0 has been installed on the web server. The reports are configured to use ODBC, via a System DSN with the SQL Server driver. The driver connects with a SQL Server user account that has permission on the database. No logon parameters are passed from the application. Just to verify, we also tried passing the logon parameters from the application, but that did not resolve the problem.
Note: As an attempted fix, full access has been granted to the IIS_WPG account on these folders: C:\Windows\Temp, C:\WINDOWS\Microsoft.NET\Framework\v4.0.30319\Temporary ASP.NET Files, and the web application folder.
Server Error in '/XXXX' Application.
________________________________________
Error in File E:\WebApps\XXXX\Reports\CompanyStandard.rpt:
Unable to connect: incorrect log on parameters.
Description: An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code.
Exception Details: CrystalDecisions.CrystalReports.Engine.LogOnException: Error in File E:\WebApps\XXXX\Reports\CompanyStandard.rpt:
Unable to connect: incorrect log on parameters.
Source Error:
An unhandled exception was generated during the execution of the current web request. Information regarding the origin and location of the exception can be identified using the exception stack trace below.
Stack Trace:
[LogOnException: Error in File E:\WebApps\XXXX\Reports\CompanyStandard.rpt:
Unable to connect: incorrect log on parameters.]
. N(String -, EngineExceptionErrorID 0) +582
. I(Int16 !, Int32 ") +277
CrystalDecisions.CrystalReports.Engine.FormatEngine.GetPage(PageRequestContext reqContext) +429
CrystalDecisions.ReportSource.LocalReportSourceBase.GetPage(PageRequestContext pageReqContext) +172
CrystalDecisions.Web.ReportAgent.|(Boolean Z) +223
CrystalDecisions.Web.CrystalReportViewer.OnPreRender(EventArgs e) +165
System.Web.UI.Control.PreRenderRecursiveInternal() +103
System.Web.UI.Control.PreRenderRecursiveInternal() +175
System.Web.UI.Control.PreRenderRecursiveInternal() +175
System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint) +2496
You need to change the account that IIS is running under to a domain account, or change your web.config file to impersonate the user (<identity impersonate="true" />). Otherwise you will need to apply logon info to the report.