SQL Server 2019 DAG WFC - Manual Failover won't work (MSSQL Error 41131) - sql-server-2019

We set up a distributed failover cluster with 2 Windows Server 2019 Datacenter nodes, each of them running SQL Server 2019 Enterprise + SSMS18.
The two nodes are located in two different sites with two different IP-Subnets.
Each host is an ESXi VM with only one NIC (Host A in Subnet A, Host B in Subnet B).
Both sites are connected via an S2S VPN connection, with routing for traffic between them.
Problem
We double-checked every likely cause, but we cannot manually fail over an availability group with a synchronized DB via SSMS:
Instance -> Always On High Availability -> Availability Groups -> (availability group) -> Right-click "Failover"
SQL Server error 41131 (see attachment)
Troubleshooting
The connection between the hosts is up, and the dashboard shows that both hosts are communicating, up and synchronized.
Defender Firewall rules are in place for the DAG listeners, the Agent and the Browser service. On a Palo Alto firewall at site A, traffic between both SQL hosts is visible, and none of it is denied.
Both hosts run via a separate service user for SQL Server Agent and SQL Server engine, so there should not be any trouble with missing rights for the NT Authority\SYSTEM.
Rights to the AD-Clusterobject are there, to create and update any child objects. Two DNS entries for the listener and one for the cluster object are also there after the creation.
Even automatic seeding between both hosts works (rows inserted on host A replicate to host B); only the failover through SSMS 18 fails.
Questions
Any ideas where we should troubleshoot next?
I attached the error message but could not find any useful information online: the only solution that comes up is to change the permissions of the NT AUTHORITY\SYSTEM account, which we do not use for the Agent or the Engine.

Not sure if you were able to resolve this, but here is the answer for future reference.
From https://learn.microsoft.com/en-us/troubleshoot/sql/availability-groups/error-41131-create-availability-group you can reference the following:
The [NT AUTHORITY\SYSTEM] account is used by SQL Server Always On health detection to connect to the SQL Server computer and to monitor
health. When you create an availability group and the primary replica
in the availability group comes online, health detection is initiated.
If the [NT AUTHORITY\SYSTEM] account doesn't exist or have sufficient
permissions, health detection can't be initiated, and the availability
group can't come online during the creation process. Make sure that
these permissions exist on each SQL Server computer that could host
the primary replica of the availability group.
Even if the SQL Server instance and the SQL Server Agent run under different service accounts, a process in the cluster uses [NT AUTHORITY\SYSTEM] to connect to the SQL Server instance hosting the primary replica. It runs the procedure sp_server_diagnostics, which is used for health detection and is an essential part of the availability group.
If you check the cluster log around the time of the failover attempt on the node that was supposed to take the primary role, you will see something like this:
INFO [RES] SQL Server Availability Group: [hadrag] Connect to SQL Server ...
INFO [RES] SQL Server Availability Group: [hadrag] The connection was established successfully
INFO [RES] SQL Server Availability Group: [hadrag] Run 'EXEC sp_server_diagnostics 10' returns following information
ERR [RES] SQL Server Availability Group: [hadrag] ODBC Error: [42000] [Microsoft][SQL Server Native Client 11.0][SQL Server]The user does not have permission to perform this action. (297)
ERR [RES] SQL Server Availability Group: [hadrag] Failed to run diagnostics command. See previous log for error message
INFO [RES] SQL Server Availability Group: [hadrag] Disconnect from SQL Server
Basically, the failover failed because the [NT AUTHORITY\SYSTEM] account on the new primary does not exist or does not have the necessary permissions to start the health monitor process.
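To spot this pattern in a longer cluster log, a small filter can help. The following is a minimal Python sketch, not an official tool: the function name and the matching rule are assumptions, and the sample text is just the excerpt quoted above. It reports [hadrag] ODBC errors with native error 297 that appear after an sp_server_diagnostics call.

```python
import re

# Sample lines copied from the cluster log excerpt quoted above.
SAMPLE_LOG = """\
INFO [RES] SQL Server Availability Group: [hadrag] Connect to SQL Server ...
INFO [RES] SQL Server Availability Group: [hadrag] The connection was established successfully
INFO [RES] SQL Server Availability Group: [hadrag] Run 'EXEC sp_server_diagnostics 10' returns following information
ERR [RES] SQL Server Availability Group: [hadrag] ODBC Error: [42000] [Microsoft][SQL Server Native Client 11.0][SQL Server]The user does not have permission to perform this action. (297)
ERR [RES] SQL Server Availability Group: [hadrag] Failed to run diagnostics command. See previous log for error message
INFO [RES] SQL Server Availability Group: [hadrag] Disconnect from SQL Server
"""

# The native error code is the number in parentheses at the end of the line.
ODBC_ERROR = re.compile(r"\[hadrag\] ODBC Error: \[(?P<state>\w+)\].*\((?P<code>\d+)\)\s*$")


def find_diagnostics_permission_errors(log_text):
    """Return [hadrag] ODBC error lines with code 297 seen after an sp_server_diagnostics call."""
    hits = []
    saw_diagnostics = False
    for line in log_text.splitlines():
        if "[hadrag]" not in line:
            continue  # not written by the availability group resource
        if "sp_server_diagnostics" in line:
            saw_diagnostics = True
            continue
        match = ODBC_ERROR.search(line)
        if saw_diagnostics and match and match.group("code") == "297":
            hits.append(line.strip())
    return hits


for hit in find_diagnostics_permission_errors(SAMPLE_LOG):
    print(hit)
```

If a line is reported, the linked Microsoft article describes the fix: make sure the [NT AUTHORITY\SYSTEM] login exists on every replica and has the CONNECT SQL, VIEW SERVER STATE and ALTER ANY AVAILABILITY GROUP permissions.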
I hope this helps!

Related

Monitoring SQL service on a remote computer and start or stop a local service based on the result

I have a local service which interacts with a SQL database.
This service stays up when SQL database goes down.
What I need is a PowerShell script that checks the remote SQL service and based on the result it must start or stop the local service.
Any help would be highly appreciated
You can check whether the SQL Server port is open on the remote host.
See the answer to "How to check Network port access and display useful message?" for details on how to do this.
The port number depends on the database server you are using. For example, the default port is 1433 for MS SQL Server and 3306 for MySQL.
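The question asks for PowerShell; the same port check is sketched here in Python instead, purely as an illustration. The function names and the start/stop decision are assumptions. Note that an open port only shows that something is listening there, not that the SQL service itself is healthy.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def desired_service_action(sql_reachable):
    """Decide what to do with the local service (hypothetical policy)."""
    return "start" if sql_reachable else "stop"
```

A scheduled job could call desired_service_action(is_port_open("sql-host", 1433)), where "sql-host" is a placeholder, and then start or stop the local service with whatever service-control command the machine provides.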

AWS RDS Postgresql Pgadmin - Server doesn't listen

I followed the aws tutorial found here.
Everything went smoothly up until connecting to the postgresql instance via pgadmin.
I entered the appropriate user/pw info and copy/pasted the address of the db appropriately.
The port is indeed 5432 on my aws dashboard.
I am receiving the following error message:
Server doesn't listen
The server doesn't accept connections: the connection library reports
could not connect to server: Operation timed out Is the server running on host "my_database_name.some_stuff.us-west-2.rds.amazonaws.com" (52.10.228.18) and accepting TCP/IP connections on port 5432?
If you encounter this message, please check if the server you're trying to contact is actually running PostgreSQL on the given port. Test if you have network connectivity from your client to the server host using ping or equivalent tools. Is your network / VPN / SSH tunnel / firewall configured correctly?
For security reasons, PostgreSQL does not listen on all available IP addresses on the server machine initially. In order to access the server over the network, you need to enable listening on the address first.
For PostgreSQL servers starting with version 8.0, this is controlled using the "listen_addresses" parameter in the postgresql.conf file. Here, you can enter a list of IP addresses the server should listen on, or simply use '*' to listen on all available IP addresses. For earlier servers (Version 7.3 or 7.4), you'll need to set the "tcpip_socket" parameter to 'true'.
You can use the postgresql.conf editor that is built into pgAdmin III to edit the postgresql.conf configuration file. After changing this file, you need to restart the server process to make the setting effective.
If you double-checked your configuration but still get this error message, it's still unlikely that you encounter a fatal PostgreSQL misbehaviour. You probably have some low level network connectivity problems (e.g. firewall configuration). Please check this thoroughly before reporting a bug to the PostgreSQL community.
Step 1
You are getting the same dialog I was seeing above. Crap!
Step 2
Go to your RDS instances
Step 3
Go to your security groups
Step 4
If your account was like mine you see this text:
Your account does not support the EC2-Classic Platform in this region.
DB Security Groups are only needed when the EC2-Classic Platform is supported.
Instead, use VPC Security Groups to control access to your DB Instances.
Go to the EC2 Console to view and manage your VPC Security Groups.
For more information, see AWS Documentation on Supported Platforms and Using RDS in VPC.
Step 5
Go back and check your RDS security group name (RDS -> Instances, right-click your instance). Under "Security Groups" (described as "List of VPC Security Groups associated with this DB Instance") you will see something like:
default (sg-********) ( active )
Step 6
In your VPC security groups, find the sg-******** that matches your database. Right-click it and edit the inbound/outbound rules to add PostgreSQL.
Try to connect again.
This solved my problem.
If this does not solve your problem I am very sorry, but I hope this documentation brings me some debugging karma.
In AWS Services, go to Security Groups and click the security group ID. From the "Actions" button, click "Edit inbound rules" and change the "Source" to "My IP".
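The same inbound rule can also be expressed as data. The sketch below is only an illustration in Python: the function name, the description text and the client address (taken from a documentation-only range) are assumptions. It builds one rule in the shape that boto3's authorize_security_group_ingress expects in its IpPermissions list, restricted to a single client IP on the PostgreSQL port.

```python
import ipaddress


def postgres_ingress_rule(client_ip, port=5432):
    """Build one inbound rule that lets a single IPv4 address reach the PostgreSQL port."""
    # IPv4Address raises ValueError for a malformed address.
    cidr = f"{ipaddress.IPv4Address(client_ip)}/32"
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr, "Description": "pgAdmin client"}],
    }


rule = postgres_ingress_rule("203.0.113.10")  # placeholder address
print(rule)

# With boto3 installed and credentials configured, the rule could be applied with
# (replace the group ID with the one shown for your DB instance):
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId="sg-********", IpPermissions=[rule])
```

Limiting the source to one /32 address mirrors the "My IP" choice in the console and avoids opening port 5432 to the whole internet.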

What's the correct MSDTC configuration for a clustered SQL server for BizTalk WCF SQL adapter

I have an issue connecting to a clustered SQL Server instance using the WCF-SQL adapter.
The SQL cluster infrastructure is:
We have 2 servers, SRV1 and SRV2. Each has a named SQL instance INST1 installed, and these 2 servers are clustered. On SRV1, a clustered MSDTC is installed and assigned the NetBIOS name DTCCLUSTER1. SRV1, SRV2 and DTCCLUSTER1 each have their own IP address.
When I tried to connect to this SQL Server using the WCF-SQL adapter, I got a timeout error and eventually found out that it is caused by an MSDTC connection issue. The DTCPing test failed in both directions: SRV1 to the BizTalk server and BizTalk to SRV1.
SRV1, which hosts DTCCLUSTER1, has been configured to allow both inbound and outbound connections. For security reasons, we cannot allow "No Auth" in MSDTC, so we chose "Mutual Auth required" on both SRV1 and the BizTalk server.
On server side, the firewall was configured to allow DCE RPC inbound and outbound. We even disabled the firewall in BizTalk server side. Also no port blocking by network.
We are still doing the troubleshooting now, but my question here is kind of more general: what's the proper configuration of the MSDTC for a clustered SQL Server?
For now, there MIGHT be a workaround: setting the UseAmbientTransaction property to false.
Of course, the MSDTC issue is your main concern :)
Are you sure you checked the Network DTC access checkbox as described here:
http://msdn.microsoft.com/en-us/library/dd897483(v=bts.10).aspx
For more information on troubleshooting these specific issues, please see here: http://msdn.microsoft.com/en-us/library/aa561924(v=bts.10).aspx
This link provides you with some good advice on how to set these properties.
More specifically, if you enable the mutual auth required option, take a look at this paragraph:
If either the Mutual Authentication Required or the Incoming Caller
Authentication Required configuration options are enabled then the
client(s) computer account must be granted the Access this computer
from the network user right. If the computer account for a client
computer is not granted the Access this computer from the network user
right, or is included in the Deny access to this computer from the
network user right, then DTC communication between the client and
server computer will fail.
Typically I always set no auth. It might be worth it to try the setting and see if this makes it work. Also be aware that MSDTC settings need to be the same across your BizTalk and SQL servers, including your MSDTC cluster (IIRC: if you have a windows 2008 R2 msdtc cluster).

Sql Server Times Out Twice - Connects on 3rd Attempt

I have a WinForms application installed on multiple PCs in an office, and a SQL Server 2012 Express database on the server to which the client application connects.
Each machine fails to connect on the first two attempts giving an error -
Timeout Expired: The timeout period elapsed prior to the completion of the operation or the server is not responding.
However, it always works on the 3rd attempt on all machines!
The server is SBS 2008, the machines are running Windows 7.
The issue was that I had used a named instance on SQL Server, which by default uses dynamic ports. Hence each connection attempt used a different port, and each time I was asking the server administrator to allow additional ports. The successful logins were simply because the dynamic port chosen just so happened to be one previously allowed.
The answer was to use SQL Server Configuration Manager to remove the dynamic port setting and specify a single port to use for all connection attempts, and ensure firewalls etc had an exception for that particular port.
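On the client side, the fixed port can then be put straight into the connection string. The helper below is a small Python sketch with hypothetical names (the function, the host SBS01 and the port 14330 are placeholders, not values from the post). It shows the two address forms: "host\instance", which makes the client ask the SQL Server Browser service for the port, and "tcp:host,port", which connects directly to the given port.

```python
def sql_server_address(host, port=None, instance=None):
    """Return the server part of a SQL Server connection string.

    With an explicit port the client connects straight to it; with only an
    instance name it has to look the port up through SQL Server Browser.
    """
    if port is not None:
        return f"tcp:{host},{port}"
    if instance:
        return f"{host}\\{instance}"
    return host


# Placeholder host, instance and port values for illustration only.
print(sql_server_address("SBS01", instance="SQLEXPRESS"))
print(sql_server_address("SBS01", port=14330))
```

With the explicit form, only that one TCP port needs a firewall exception.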

Database Mirroring - App Can't Connect to Mirror - Named Pipes Provider: Could not open a connection to SQL Server [53]

I have an application that can connect to the Principal, but can't connect to the Mirror during a failover.
(Note to moderator: please let me know if this question is more appropriate for serverfault. I posted it here because I found more questions similar to this issue than on serverfault.)
This is the error I receive when my application attempts to connect to the Mirror after a failover:
Named Pipes Provider: Could not open a connection to SQL Server [53].
Cannot open database "MY_DB_NAME" requested by the login. The login failed.
I am familiar with the fact that when initially connected to the Principal, the name of the Mirror server is cached to be used during the failover and that the failover partner I specify in my connection string is only used if the initial connection to the Principal fails.
This clearly describes the problem I'm having:
http://blogs.msdn.com/b/spike/archive/2010/12/15/running-a-database-mirror-setup-with-the-sqlbrowser-service-off-may-produce-unexpected-results.aspx
...but the SQL Browser Service is running and I can't figure out why the name won't resolve when connecting to the mirror.
I'm assuming there is a service that must be running to enable NetBIOS name resolution that is not running, because this is what I see in WireShark consistently without a response from the Mirror:
Source Destination Protocol Length Info
10.200.3.111 10.200.5.255 NBNS 92 Name query NB SQL-02-SVR-<00>
Question 1: What could be causing the problem? ;-)
Question 2: I really don't want to enable NetBIOS (for security reasons) and I'm using IP addresses (no FQDNs) in the mirror configuration and in the connection string. Given the caching behavior of the mirror partner when connecting to the Principal, is there a way to force TCP/IP to be used so the value that is cached is the IP address and not the name? Do I need to run the SQL Server Browser/Computer Browser services?
The configuration:
App Is Delphi XE2 using SDAC 6.5.9 (I don't think this is relevant to the component I'm using because it works in other installations with mirroring and has no issues)
SQL Server 2012 Enterprise installed as a default instance on Principal, Mirror and Witness in a non-domain configuration using certificate authentication.
Windows Server 2008 R2 SP1 64-bit on all machines
Firewalls disabled on Principal, Mirror and Client (where app is running)
TCP/IP and Named Pipes enabled on Principal and Mirror
SQL Server Browser service running on Mirror
Computer Browser service running on Mirror
Mirroring is configured for automatic failover with a witness and works properly (I can fail back and forth between mirror and principal without issue)
SQL Native Client 2012 installed on Client machine
Same app login (with same SID and user rights) exists on both Principal and Mirror
Correct server, failover partner, database name, user name and password verified in my app log
In connection string, principal server is 'tcp:10.200.3.15,1433' and failover partner is 'tcp:10.200.3.16,1433' using the SQL Native client
I can ping both servers from the Client machine
NetBIOS over TCP/IP has been enabled in the adapter under the WINS tab (on the Mirror and Client machines)
I've been able to get the application working with mirroring on several other installations, but this one is baffling me.
I found the problem, which was that the customer had the Principal and Mirror in one VLAN and the Client(s) in another. Although the IP addressing scheme was the same, the policy for communication between the VLANs prevented broadcast messages, which is why the NetBIOS query was failing on the client. A WINS or DNS server will be implemented to resolve this issue.
However, I am still interested in an answer to my Question #2, above.