I'm trying to set up an NServiceBus distributor on a Windows failover cluster. I've successfully followed the "official" guides and most things seem to work nicely, except for actually starting the distributor on the cluster. When it starts, it tries to create its queues on the clustered MSMQ, but is denied permission:
Unhandled Exception: Magnum.StateMachine.StateMachineException: Exception occurred in Topshelf.Internal.ServiceController`1[[NServiceBus.Hosting.Windows.WindowsHost, NServiceBus.Host, Version=3.2.0.0, Culture=neutral, PublicKeyToken=9fc386479f8a226c]] during state Initial while handling OnStart ---> System.Exception: Exception when starting endpoint, error has been logged. Reason: The queue does not exist or you do not have sufficient permissions to perform the operation. ---> System.Messaging.MessageQueueException: The queue does not exist or you do not have sufficient permissions to perform the operation.
I'm able to create queues when I open the clustered MSMQ manager, but the distributor gets this error even if I run it under my own account.
Something that might be related is that I cannot change properties on the Message Queuing object in the clustered MSMQ manager. For instance, if I try to change the message storage limit, I get this error:
The properties of TEST-CLU-MSMQ cannot be set
Error: This operation is not supported for Message Queuing installed in workgroup mode
I can, however, change this setting in the nodes' own MSMQ settings, and those are also installed in workgroup mode.
Any ideas? I've tried reinstalling the cluster and services and just about everything, to no avail. The environment is Windows Server 2008 R2.
I have an on-prem Service Fabric 3-node cluster running 8.2.1571.9590. This has been running for months without any problems.
The cluster nodes were rebooted overnight as part of operating system patching, and the cluster will now not establish connections.
If I run connect-servicefabriccluster -verbose, I get this timeout error:
System.Fabric.FabricTransientException: Could not ping any of the provided Service Fabric gateway endpoints.
Looking at the running processes, I can see all the expected processes start and stay stable, with the exception of FabricGateway.exe, which goes into a boot loop.
I have confirmed that:
I can do a TCP/IP ping between the nodes in the cluster
I can do a PowerShell remote session between the nodes in the cluster
No cluster certificates have expired.
Any suggestions as to how to debug this issue?
Actual Problem
On checking the Windows event logs (Admin Tools > Event Viewer > Application & Service Logs > Microsoft Service Fabric > Admin) I could see errors related to the startup of the FabricGateway process. The errors and warnings came in repeated sets in the following basic order:
CreateSecurityDescriptor: failed to convert mydomain\admin-old to SID: NotFound
failed to set security settings to { provider=Negotiate protection=EncryptAndSign remoteSpn='' remote='mydomain\admin-old, mydomain\admin-new, mydomain\sftestuser, ServiceFabricAdministrators, ServiceFabricAllowedUsers' isClientRoleInEffect=true adminClientIdentities='mydomain\admin-old, mydomain\admin-new, ServiceFabricAdministrators' claimBasedClientAuthEnabled=false }: NotFound
Failed to initialize external channel error : NotFound
EntreeService proxy open failed : NotFound
FabricGateway failed with error code = S_OK
client-sfweb1:19000/[::1]:19000: error = 2147943625, failureCount=9082. This is conclusive that there is no listener. Connection failure is expected if listener was never started, or listener / its process was stopped before / during connecting. Filter by (type~Transport.St && ~"(?i)sfweb1:19000") on the other node to get listener lifecycle, or (type~Transport.GlobalTransportData) for all listeners
Using Windows Task Manager (or a similar tool), you could see the FabricGateway.exe process starting and terminating every few seconds.
The net effect was that Service Fabric cluster communication could not be established.
Solution
The problem was that the domain account mydomain\admin-old (an old historic account, not used for a long period) had been deleted in Active Directory, so no SID for the account could be found. This failure was causing the boot loop, even though the other admin accounts were valid.
The fix was to remove this deleted identity from each cluster node's current active settings.xml file. The process I used was:
RDP onto a cluster node VM
Stop the Service Fabric service
Find the current Service Fabric cluster configuration, e.g. the newest folder of the form D:\SvcFab\VM0\Fabric\Fabric.Config.4.123456
Edit the settings.xml and remove the deleted account mydomain\admin-old from the AdminClientIdentities block, so I ended up with:
<Section Name="Security">
<Parameter Name="AdminClientIdentities" Value="mydomain\admin-new" />
...
Once the file is saved, restart the Service Fabric service; it should start as normal. Remember, it will take a minute or two to start up.
Repeat the process on the other nodes in the cluster.
Once this is completed, the cluster starts and operates as expected.
I'm trying to run this command: confluent local services start
I don't know why it gives me an error each time before passing to the next step, so I have to run it again over and over until it passes all the steps.
What is the reason for the error, and how do I solve the problem?
You need to open the log files to inspect any errors that are actually happening.
But it's possible the services are hitting a race condition: Schema Registry requires Kafka, and REST Proxy and Connect require the Schema Registry... Maybe they are not waiting for the previous components to finish starting.
Or maybe your machine does not have enough resources to start all the services. E.g. I believe at least 6 GB of RAM is necessary. If you have 8 GB on the machine, and Chrome and lots of other services are running, for example, then you wouldn't have 6 GB readily available.
I have some .NET Core applications running as microservices in GKE (Google Kubernetes Engine).
Usually everything works right, but sometimes, if my microservice isn't in use, something happens that shuts my application down (the same behavior as Ctrl + C in a terminal).
I know that this is Kubernetes behavior, but if I make a request to an application that is not running, my first request returns the error "No such Device or Address" or a timeout error.
I will post some logs and setups:
The key to what's happening is this logged error:
TNS: Connect timeout occured ---> OracleInternal.Network....
Since your application is not being used, the Oracle database just shuts down its idle connection. To solve this problem, you can do one or more of the following:
Handle the disconnection inside your application and simply reconnect (see the sketch after this list).
Define a livenessProbe to restart the pod automatically once the application is down.
Make your application do something with the connection from time to time (this can be done with a probe too).
Configure your Oracle database not to close idle connections.
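For the first option, here is a minimal sketch of the reconnect approach, assuming the managed ODP.NET provider (Oracle.ManagedDataAccess.Client); the helper name, retry count, and backoff are illustrative, not a prescribed implementation:

using System;
using System.Threading;
using Oracle.ManagedDataAccess.Client; // assumes the managed ODP.NET provider

public static class ResilientOracle
{
    // Opens a fresh connection, retrying a few times when the first attempt
    // fails because the previously idle connection was dropped by the server.
    public static OracleConnection OpenConnection(string connectionString, int maxAttempts = 3)
    {
        for (int attempt = 1; ; attempt++)
        {
            var connection = new OracleConnection(connectionString);
            try
            {
                connection.Open(); // throws OracleException on TNS connect timeouts
                return connection;
            }
            catch (OracleException) when (attempt < maxAttempts)
            {
                connection.Dispose();
                Thread.Sleep(TimeSpan.FromSeconds(2 * attempt)); // simple backoff before retrying
            }
        }
    }
}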
I am having an issue with MSMQ in a clustered environment. I have the following setup:
2 nodes set up in a Windows failover cluster; let's call them "Node A" and "Node B".
I have then set up a clustered instance of MSMQ; let's call it "MSMQ Instance".
I have also set up a clustered instance of the DTC; let's call it "DTC Instance".
Within the DTC instance, I have allowed access both locally and also through the clustered instance; basically, I have taken all authentication off to test.
I have also created a clustered instance of our in-house application; let's call it "Application Instance". Within this Application Instance I have added other resources, which are other services the application uses and also the Net.MSMQ adapter.
The Issue.......
When I cluster the Application Instance, it always seems to set the owner to the opposite node from the one I am using, so if I am creating the clustered instance on Node A it always sets the current owner to Node B; however, that is not the issue.
The issue I have is that as long as the Application Instance is running on Node B, MSMQ seems to work.
The outbound queues are created locally and receive messages, which are then processed through the MSMQ cluster.
If I then fail over to Node A, MSMQ refuses to work. The outbound queues are not created and therefore no messages are processed.
I get an error in Event Viewer:
"The version check failed with the error: 'Unrecognized error -1072824309 (0xc00e000b)'. The version of MSMQ cannot be detected All operations that are on the queued channel will fail. Ensure that MSMQ is installed and is available"
If I then fail over back to Node B, it works.
The application has been set up to use the MSMQ instance and all the permissions are correct.
Do I need to have a clustered instance of DTC, or can I just configure it as a resource within the MSMQ instance?
Can anybody shed any light on this, as I am at a brick wall with it?
Yes, you will need to have a clustered DTC setup.
For your clustered MSMQ instance you will then need to configure the clustered DTC as a "dependency": right-click on MSMQ -> Properties -> Dependencies.
I do not know if this is mandatory in all cases, but on our cluster we also have a file share configured as a dependency for MSMQ. To my understanding, this should ensure that temporary files needed by MSMQ are still available after a node switch.
Additionally, here are two articles that I found very helpful in setting up the cluster nodes. They may help you confirm step by step that your configuration is correct:
"Building MSMQ cluster". You will find several other links in that article that will guide you further.
Microsoft also has a detailed document: "Deploying Message Queuing (MSMQ) 3.0 in a Server Cluster".
We have a two node cluster running on Windows 2008 R2. I've installed MSMQ with the Message Queue Server and Directory Service Integration options on both nodes. I've created a clustered MSMQ resource named TESTV0Msmq (we use transactional queues so a DTC resource had been created previously).
The virtual resource resolves correctly when I ping it.
I created a small console executable in C# using the MessageQueue constructor to allow us to send basic messages (to both transactional and non-transactional queues).
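The sender boils down to roughly this (simplified here; the transactional case would wrap the Send in a MessageQueueTransaction):

using System.Messaging;

class QueueSender
{
    static void Main(string[] args)
    {
        // args[0] is the queue path under test, e.g. @".\private$\clustertest"
        using (var queue = new MessageQueue(args[0]))
        {
            queue.Send("test message"); // non-transactional send
        }
    }
}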
From the active node these paths work:
.\private$\clustertest
{machinename}\private$\clustertest
but TESTV0Msmq\private$\clustertest returns "Invalid queue path name".
According to this article:
http://technet.microsoft.com/en-us/library/cc776600(WS.10).aspx
I should be able to do this?
In particular, queues can be created on a virtual server, and
messages can be sent to them. Such queues are addressed using the
VirtualServerName\QueueName syntax.
Sounds like classic Clustering MSMQ problem:
Clustering MSMQ applications - rule #1
If you can access ".\private$\clustertest" or "{machinename}\private$\clustertest" from the active node then that means there is a queue called clustertest hosted by the LOCAL MSMQ queue manager. It doesn't work on the passive node because there is no queue called clustertest there yet. If you fail over the resource, it should fail.
You need to create a queue in the clustered resource instead. TESTV0Msmq\private$\clustertest fails because the queue was created on the local machine and not on the virtual server.
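To illustrate with a minimal System.Messaging sketch (queue and host names are taken from your question; the DIRECT format name shown is one common way to address a remote or clustered private queue, not the only one):

using System.Messaging;

class ClusterQueueExample
{
    static void Main()
    {
        // Create with a local path name: the queue lands on whichever queue manager
        // the process is bound to. From a normal console session that is the node's
        // local MSMQ, not the clustered TESTV0Msmq instance.
        if (!MessageQueue.Exists(@".\private$\clustertest"))
            MessageQueue.Create(@".\private$\clustertest", false); // false = non-transactional

        // Once the queue exists under the clustered instance, it can be addressed
        // with a direct format name rather than a path name.
        using (var queue = new MessageQueue(@"FormatName:DIRECT=OS:TESTV0Msmq\private$\clustertest"))
        {
            queue.Send("hello clustertest");
        }
    }
}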
Cheers
John Breakwell