What's the correct MSDTC configuration for a clustered SQL Server for the BizTalk WCF SQL adapter - sql-server-2008-r2

I have an issue connecting to a clustered SQL Server instance using the WCF-SQL adapter.
The SQL cluster infrastructure is as follows:
We have two servers, SRV1 and SRV2, each with a named SQL instance INST1 installed, and the two servers are clustered. A clustered MSDTC is installed on SRV1 and assigned the NetBIOS name DTCCLUSTER1. SRV1, SRV2, and DTCCLUSTER1 each have their own IP address.
When I try to connect to this SQL Server using the WCF-SQL adapter, I get a timeout error, and I finally found out that it is caused by an MSDTC connection issue. The DTCPing test failed in both directions: SRV1 to the BizTalk server and the BizTalk server to SRV1.
SRV1, which hosts DTCCLUSTER1, has been configured to allow both inbound and outbound connections. For security reasons we cannot allow "No Authentication Required" in MSDTC, so we chose "Mutual Authentication Required" on both the SRV1 and the BizTalk server side.
On the server side, the firewall was configured to allow DCE RPC inbound and outbound. We even disabled the firewall on the BizTalk server side. There is also no port blocking on the network.
We are still troubleshooting, but my question here is more general: what is the proper MSDTC configuration for a clustered SQL Server?

For now, there MIGHT be a workaround: setting the UseAmbientTransaction property to false.
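For illustration, a minimal sketch of what that could look like in the WCF-SQL binding configuration; the binding name here is a placeholder, and you should verify the property against your adapter pack version:

<bindings>
  <sqlBinding>
    <!-- useAmbientTransaction="false" stops the adapter from enlisting in MSDTC -->
    <binding name="noDtcSqlBinding" useAmbientTransaction="false" />
  </sqlBinding>
</bindings>

Be aware that turning this off means the operation is no longer protected by a distributed transaction.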
Of course, the MSDTC issue is your main concern :)
Are you sure you checked the Network DTC access checkbox as described here:
http://msdn.microsoft.com/en-us/library/dd897483(v=bts.10).aspx
For more information on troubleshooting these specific issues, please see here: http://msdn.microsoft.com/en-us/library/aa561924(v=bts.10).aspx
This link provides you with some good advice on how to set these properties.
More specifically, if you enable the Mutual Authentication Required option, take a look at this paragraph:
If either the Mutual Authentication Required or the Incoming Caller Authentication Required configuration options are enabled then the client(s) computer account must be granted the Access this computer from the network user right. If the computer account for a client computer is not granted the Access this computer from the network user right, or is included in the Deny access to this computer from the network user right, then DTC communication between the client and server computer will fail.
Typically I always set No Auth. It might be worth trying that setting to see if it makes things work. Also be aware that the MSDTC settings need to be the same across your BizTalk and SQL servers, including your MSDTC cluster (IIRC: if you have a Windows 2008 R2 MSDTC cluster).
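For reference, those checkboxes map to registry values. A sketch of what "Network DTC Access" with "Mutual Authentication Required" looks like in .reg form (on a clustered DTC the same values live under the cluster resource's MSDTC key, so treat the exact path as an assumption to verify):

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSDTC\Security]
"NetworkDtcAccess"=dword:00000001
"NetworkDtcAccessInbound"=dword:00000001
"NetworkDtcAccessOutbound"=dword:00000001
"NetworkDtcAccessTransactions"=dword:00000001
; Mutual Authentication Required = secure RPC only, no unsecure fallback
"AllowOnlySecureRpcCalls"=dword:00000001
"FallbackToUnsecureRPCIfNecessary"=dword:00000000
"TurnOffRpcSecurity"=dword:00000000

Comparing this key across the BizTalk server, both SQL nodes, and the DTC cluster resource is a quick way to confirm the settings really match.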


SQL Server 2019 DAG WSFC - Manual Failover won't work (MSSQL Error 41131)

We set up a distributed failover cluster with two Windows Server 2019 Datacenter nodes, each running SQL Server 2019 Enterprise + SSMS 18.
The two nodes are located in two different sites with two different IP subnets.
Each host is an ESXi VM with only one NIC (host A in subnet A, host B in subnet B).
Both sites are connected via a site-to-site VPN, with routing in place for traffic between them.
Problem
We double-checked every possible problem, but we cannot manage to manually fail over an availability group with a synchronized DB via SSMS:
Instance -> Always On High Availability -> Availability Groups -> [availability group name] -> right-click "Failover"
SQL Server error 41131 (see attachment)
Troubleshooting
The connection between the hosts is up, and the dashboard shows that both hosts are communicating, up, and synchronized.
Defender Firewall rules are in place for the DAG listeners, the Agent, and the Browser service. A Palo Alto firewall at site A sees traffic between both SQL hosts, and no traffic is denied.
Both hosts run the SQL Server Agent and the SQL Server engine under a separate service user, so there should not be any trouble with missing rights for NT AUTHORITY\SYSTEM.
Rights on the AD cluster object to create and update any child objects are in place. Two DNS entries for the listener and one for the cluster object are also there after creation.
Even the automatic seeding between both hosts works; only the failover through SSMS 18 fails (inserted rows replicate from host A to host B).
Questions
Are there any ideas where we can troubleshoot further?
I attached the error message, but I was not able to find any useful information online, since the only related solution is always to change rights for the NT account, which we do not use for the Agent or the Engine.
Not sure if you were able to resolve this, but here is the answer for future reference.
From https://learn.microsoft.com/en-us/troubleshoot/sql/availability-groups/error-41131-create-availability-group you can reference the following:
The [NT AUTHORITY\SYSTEM] account is used by SQL Server Always On health detection to connect to the SQL Server computer and to monitor health. When you create an availability group and the primary replica in the availability group comes online, health detection is initiated. If the [NT AUTHORITY\SYSTEM] account doesn't exist or have sufficient permissions, health detection can't be initiated, and the availability group can't come online during the creation process. Make sure that these permissions exist on each SQL Server computer that could host the primary replica of the availability group.
Even if the SQL Server instance and the SQL Server Agent are running under a different service account, there is a process in the cluster that uses [NT AUTHORITY\SYSTEM] to connect to the SQL Server instance hosting the primary replica and run the procedure named sp_server_diagnostics, which performs health detection and is an essential part of the availability group.
If you check the cluster log around the time of the failover attempt on the node that was supposed to take the primary role, you will see something like this:
INFO [RES] SQL Server Availability Group: [hadrag] Connect to SQL Server ...
INFO [RES] SQL Server Availability Group: [hadrag] The connection was established successfully
INFO [RES] SQL Server Availability Group: [hadrag] Run 'EXEC sp_server_diagnostics 10' returns following information
ERR [RES] SQL Server Availability Group: [hadrag] ODBC Error: [42000] [Microsoft][SQL Server Native Client 11.0][SQL Server]The user does not have permission to perform this action. (297)
ERR [RES] SQL Server Availability Group: [hadrag] Failed to run diagnostics command. See previous log for error message
INFO [RES] SQL Server Availability Group: [hadrag] Disconnect from SQL Server
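(For reference, the cluster log quoted above can be generated on demand with the Failover Clustering PowerShell module; the node name and destination path below are placeholders:)

# Write the last 15 minutes of the cluster log for one node to C:\Temp
Get-ClusterLog -Node "NodeA" -TimeSpan 15 -Destination "C:\Temp"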
Basically, the failover failed because the [NT AUTHORITY\SYSTEM] account on the new primary does not exist or does not have the necessary permissions to start the health monitor process.
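The fix from the linked article is to make sure the account exists and holds the required permissions on every instance that can host the primary replica. In T-SQL:

-- Permissions Always On health detection needs (per the linked article)
GRANT ALTER ANY AVAILABILITY GROUP TO [NT AUTHORITY\SYSTEM];
GRANT CONNECT SQL TO [NT AUTHORITY\SYSTEM];
GRANT VIEW SERVER STATE TO [NT AUTHORITY\SYSTEM];

-- Verify what the account currently holds
SELECT pr.name, pe.permission_name, pe.state_desc
FROM sys.server_permissions AS pe
JOIN sys.server_principals AS pr
  ON pe.grantee_principal_id = pr.principal_id
WHERE pr.name = N'NT AUTHORITY\SYSTEM';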
I hope this helps!

Using Kerberos for RDP

We are in the process of turning off NTLM in our environment for both inbound and outbound traffic via GPO. In our lab testing we have encountered the following when blocking inbound NTLM on a remote host:
RDP'ing cross-forest to the remote host with inbound NTLM blocked generated a CredSSP error message.
Setting Encryption Oracle Remediation to either Mitigated or Vulnerable as a workaround did not work.
Turning off NLA on the remote host as a workaround will allow cross-forest RDP.
I have tried applying "Allow delegating fresh credentials" via policy on the remote host, but it still gets the CredSSP error.
I have also tried setting the policy on the remote host to use SSL for "Require use of specific security layer for remote (RDP) connections", and I still got the same CredSSP error.
What did work: if I RDP from the same forest to the remote host, the connection is allowed, and I can confirm it is using Kerberos for RDP instead of NTLM.
Another observation: once the same-forest RDP has worked on the remote host, cross-forest RDP to that host with inbound NTLM blocked will then work.
Has anyone encountered something similar like this before?
If so, has anyone found a solution for cross-forest RDP to work on a remote host with blocked inbound NTLM without the need to pre-auth on the remote host in the same forest?
The Encryption Oracle Remediation error is a red herring, because it uses the same error code as the "NTLM is not available" error. Unless you haven't patched in three years, it'll likely never be the Encryption Oracle Remediation issue. It's really just that the connection tried to fall back to NTLM and policy said no.
In all likelihood the issue is that the client can't find or communicate with a domain controller to do NLA.
The client must find the user's domain first (domain A). From there it authenticates their password. It then asks to get a ticket to the machine. The machine isn't in the user's domain so it creates a referral ticket to where it thinks the machine is (domain B).
The referral is handed back to the client and the client tries to find a DC to where the referral is supposed to go (domain B). The client sends the referral to domain B and asks for a ticket to the machine. The domain controller either finds the machine and issues a ticket for it, or says it doesn't know and offers a referral to another domain (domain C) and you try again, or it just fails saying no machine can be found.
All of this occurs from the client's perspective, not the target machine's perspective. This happens before the client even pings the target machine (ish). This is why disabling NLA appears to resolve the issue.
So there are a handful of reasons why this happens:
You used an IP address: this is a straight-to-NTLM scenario. Kerberos doesn't do IP addresses by default. You can turn it on, but it won't scale.
The client can't communicate with a DC in the user's domain (domain A). A networking issue: the client needs line of sight to a domain controller, plus DNS.
The client can't communicate with a DC in the target machine's domain (domain B). Still a networking issue: the client needs line of sight to a domain controller, plus DNS.
You're not providing a proper fully qualified name, and the user's DC can't figure out which forest it should refer you to. You can enable Forest Search Order, which may help, or you can type in the fully qualified machine name.
This isn't an exhaustive list but these are the most common causes.
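If you want to check where the chain breaks, you can test the DC lookups from the client; the domain and host names below are placeholders:

REM Can the client locate a DC in the target machine's domain (domain B)?
nltest /dsgetdc:domainb.example.com
REM Can the client actually obtain a service ticket to the machine?
klist get host/machine1.domainb.example.com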
References:
https://syfuhs.net/windows-and-domain-trusts
https://syfuhs.net/how-authentication-works-when-you-use-remote-desktop
I also had a similar issue when using the DOMAIN\username login; using the UPN (username@domain.com) worked for me.
My understanding is that using the UPN lets the client know the DNS domain name, which then allows it to discover the DC of the remote domain through DNS resolution.
NB: my setup was from a workgroup server, so not exactly the same as yours; YMMV.

Does the fact that I'm running a VM alter the whitelisting status of my regular IP address?

Our DevOps team has whitelisted my home IP address so that I can connect to our Postgres database on Azure, and I am able to connect to the database because of this.
Today I set up a VM in order to run Docker. I am running a container for RStudio, an app that, among many other things, allows me to connect to our database using ODBC.
After configuring the odbcinst and odbc.ini files, I believe those are configured correctly, because when I try to connect I get the following error:
Error: nanodbc/nanodbc.cpp:983: 00000: FATAL: SSL connection is required. Please specify SSL options and retry.
Thus I think my ODBC setup is correct: this error suggests my connection settings are fine; it's just that Azure will not allow the connection without SSL.
Searching that error message took me to this SO post with the following accepted answer:
By default, Azure Database for PostgreSQL enforces SSL connections between your server and your client applications to protect against MITM (man in the middle) attacks. This is done to make the connection to your server as secure as possible.
Although not recommended, you have the option to disable requiring SSL for connecting to your server if your client application does not support SSL connectivity. Please check How to Configure SSL Connectivity for your Postgres server in Azure for more details. You can disable requiring SSL connections from either the portal or using CLI. Note that Azure does not recommend disabling requiring SSL connections when connecting to your server.
My question is: if I am already able to connect to our database outside of my VM, thanks to my home IP being whitelisted and using a Postgres driver with the DBeaver SQL client, is there anything I can do to connect from within my VM?
I can get my VM's IP address, but I am not sure whether sending that to our developers to whitelist would work.
Is there a prescribed course of action here?
I added this parameter to my .odbc.ini file and was able to connect:
sslmode=require
According to the Azure Postgres documentation, this parameter may take different forms depending on the context:
"for example "ssl=true" or "sslmode=require" or "sslmode=required" and other variations"

AWS RDS Postgresql Pgadmin - Server doesn't listen

I followed the AWS tutorial found here.
Everything went smoothly up until connecting to the PostgreSQL instance via pgAdmin.
I entered the appropriate user/password info and copy/pasted the address of the DB appropriately.
The port is indeed 5432 on my aws dashboard.
I am receiving the following error message:
Server doesn't listen
The server doesn't accept connections: the connection library reports
could not connect to server: Operation timed out Is the server running on host "my_database_name.some_stuff.us-west-2.rds.amazonaws.com" (52.10.228.18) and accepting TCP/IP connections on port 5432?
If you encounter this message, please check if the server you're trying to contact is actually running PostgreSQL on the given port. Test if you have network connectivity from your client to the server host using ping or equivalent tools. Is your network / VPN / SSH tunnel / firewall configured correctly?
For security reasons, PostgreSQL does not listen on all available IP addresses on the server machine initially. In order to access the server over the network, you need to enable listening on the address first.
For PostgreSQL servers starting with version 8.0, this is controlled using the "listen_addresses" parameter in the postgresql.conf file. Here, you can enter a list of IP addresses the server should listen on, or simply use '*' to listen on all available IP addresses. For earlier servers (Version 7.3 or 7.4), you'll need to set the "tcpip_socket" parameter to 'true'.
You can use the postgresql.conf editor that is built into pgAdmin III to edit the postgresql.conf configuration file. After changing this file, you need to restart the server process to make the setting effective.
If you double-checked your configuration but still get this error message, it's still unlikely that you encounter a fatal PostgreSQL misbehaviour. You probably have some low level network connectivity problems (e.g. firewall configuration). Please check this thoroughly before reporting a bug to the PostgreSQL community.
Step 1
You are getting the same dialog I was seeing above. Crap!
Step 2
Go to your RDS instances
Step 3
Go to your security groups
Step 4
If your account was like mine, you will see this text:
Your account does not support the EC2-Classic Platform in this region.
DB Security Groups are only needed when the EC2-Classic Platform is supported.
Instead, use VPC Security Groups to control access to your DB Instances.
Go to the EC2 Console to view and manage your VPC Security Groups.
For more information, see AWS Documentation on Supported Platforms and Using RDS in VPC.
Step 5: Go back and check your RDS security group name (RDS -> Instances, right-click your instance). You will see a "Security Groups" section listing the VPC security groups associated with this DB instance.
You will see something like:
default (sg-********) ( active )
Step 6: In your VPC security groups, find the sg-******** that matches your database. Right-click it and edit the inbound/outbound rules to add PostgreSQL.
Try to connect again.
This solved my problem.
If this does not solve your problem I am very sorry, but I hope this documentation brings me some debugging karma.
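For reference, the same inbound rule can be added from the AWS CLI; the group ID and client address below are placeholders:

aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 5432 \
    --cidr 203.0.113.25/32

Using a /32 CIDR limits the rule to a single client IP rather than opening the port to everyone.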
Go to AWS services, and in Security Groups click on the security group ID. From the "Actions" button, click "Edit inbound rules", and then change the "Source" to "My IP".

Database Mirroring - App Can't Connect to Mirror - Named Pipes Provider: Could not open a connection to SQL Server [53]

I have an application that can connect to the Principal, but can't connect to the Mirror during a failover.
(Note to moderator: please let me know if this question is more appropriate for serverfault. I posted it here because I found more questions similar to this issue than on serverfault.)
This is the error I receive when my application attempts to connect to the Mirror after a failover:
Named Pipes Provider: Could not open a connection to SQL Server [53].
Cannot open database "MY_DB_NAME" requested by the login. The login failed.
I am familiar with the fact that, when initially connected to the Principal, the name of the Mirror server is cached for use during the failover, and that the failover partner I specify in my connection string is only used if the initial connection to the Principal fails.
This clearly describes the problem I'm having:
http://blogs.msdn.com/b/spike/archive/2010/12/15/running-a-database-mirror-setup-with-the-sqlbrowser-service-off-may-produce-unexpected-results.aspx
...but the SQL Browser service is running, and I can't figure out why the name won't resolve when connecting to the Mirror.
I'm assuming there is a service required for NetBIOS name resolution that is not running, because this is what I see in Wireshark, consistently with no response from the Mirror:
Source Destination Protocol Length Info
10.200.3.111 10.200.5.255 NBNS 92 Name query NB SQL-02-SVR-<00>
Question 1: What could be causing the problem? ;-)
Question 2: I really don't want to enable NetBIOS (for security reasons), and I'm using IP addresses (no FQDNs) in the mirror configuration and in the connection string. Given the caching behavior of the mirror partner name when connecting to the Principal, is there a way to force TCP/IP to be used, so that the cached value is the IP address and not the name? Do I need to run the SQL Server Browser/Computer Browser services?
The configuration:
The app is Delphi XE2 using SDAC 6.5.9 (I don't think the component I'm using is relevant, because it works in other installations with mirroring and has no issues)
SQL Server 2012 Enterprise installed as a default instance on Principal, Mirror and Witness in a non-domain configuration using certificate authentication.
Windows Server 2008 R2 SP1 64-bit on all machines
Firewalls disabled on Principal, Mirror and Client (where app is running)
TCP/IP and Named Pipes enabled on Principal and Mirror
SQL Server Browser service running on Mirror
Computer Browser service running on Mirror
Mirroring is configured for automatic failover with a witness and works properly (I can fail back and forth between mirror and principal without issue)
SQL Native Client 2012 installed on Client machine
Same app login (with same SID and user rights) exists on both Principal and Mirror
Correct server, failover partner, database name, user name and password verified in my app log
In connection string, principal server is 'tcp:10.200.3.15,1433' and failover partner is 'tcp:10.200.3.16,1433' using the SQL Native client
I can ping both servers from the Client machine
NetBIOS over TCP/IP has been enabled in the adapter under the WINS tab (on the Mirror and Client machines)
I've been able to get the application working with mirroring on several other installations, but this one is baffling me.
I found the problem: the customer had the Principal and Mirror in one VLAN and the client(s) in another. Although the IP addressing scheme was the same, the policy for communication between the VLANs blocked broadcast messages, which is why the NetBIOS query was failing on the client. A WINS or DNS server will be implemented to resolve the issue.
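In the meantime, a static LMHOSTS entry on the client machine can stand in for the blocked broadcast resolution. A sketch, assuming SQL-02-SVR is the Mirror at 10.200.3.16 (inferred from the failover partner address above; verify the mapping before using it):

# %SystemRoot%\System32\drivers\etc\lmhosts
10.200.3.16    SQL-02-SVR    #PRE

After editing the file, nbtstat -R reloads the NetBIOS name cache.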
However, I am still interested in an answer to my Question #2, above.