Can you move a Service Fabric cluster to another subnet? - azure-service-fabric

We have a Service Fabric cluster that runs in a 10.0.0.0/24 subnet inside a 10.0.0.0/8 VNET. Our customer wants to join this via VPN to their own network. However, the IP range we are using conflicts with the range our customer wants us to use (10.90.15.0/24; size is not an issue).
We tried creating a new subnet, 10.90.15.0/24, but when we changed the subnet reference of the underlying virtual machine scale set to this new subnet, the cluster refused to start and the following can be seen in the Event Viewer:
Throwing coding error - Seed node '35ee85474352dcc2e88fa9ad6af912b1' with address
'10.90.15.4:1025' mismatches configured address '10.0.0.4:1025'
Symbol paths: C:\Program Files\Microsoft Service Fabric\bin\Fabric\Fabric.Code;
C:\Program Files\Microsoft Service Fabric\bin\Fabric\Fabric.Code;;
Symbol loading time: 00.161
Stack trace:
00007ff7:25f5478c( windows_error(487): Attempt to access invalid address. )
00007ff7:25f06592( windows_error(487): Attempt to access invalid address. )
00007ff7:25f06413( windows_error(487): Attempt to access invalid address. )
00007ff7:26186be7( windows_error(487): Attempt to access invalid address. )
00007ff7:2616ef25( windows_error(487): Attempt to access invalid address. )
00007ff7:25eaceeb( windows_error(487): Attempt to access invalid address. )
00007ff7:25eb7471( windows_error(487): Attempt to access invalid address. )
00007ff7:25eba4f9( windows_error(487): Attempt to access invalid address. )
00007ff7:25ed5090( windows_error(487): Attempt to access invalid address. )
00007ff7:25ec7c25( windows_error(487): Attempt to access invalid address. )
00007ff7:25ec7c84( windows_error(487): Attempt to access invalid address. )
00007ff7:25f37de8( windows_error(487): Attempt to access invalid address. )
00007ff7:25eae132( windows_error(487): Attempt to access invalid address. )
00007ff7:25ec617a( windows_error(487): Attempt to access invalid address. )
00007ff7:25ea7a5a( windows_error(487): Attempt to access invalid address. )
00007ff7:25eab10a( windows_error(487): Attempt to access invalid address. )
00007ff7:25f07607( windows_error(487): Attempt to access invalid address. )
RtlReleaseSRWLockExclusive + 0x445e
RtlReleaseSRWLockExclusive + 0x2674
BaseThreadInitThunk + 0x14
Now, while I understand that the cluster's configuration hasn't changed but the IP has, I am somewhat stumped when it comes to a solution. Could it be that moving subnets is impossible without reinstalling the Service Fabric virtual machine scale set extension, meaning a full recreation of the cluster and a restore from backups? That is by all means possible, but undesirable.
Has anyone done this, or does someone perhaps have a completely different idea of how this can be accomplished?
Edit: Due to illness I haven't been able to test what has been suggested; I will do that as soon as possible.
Edit 2: nicPrefixOverride was already changed as part of the ARM template, so changing that setting doesn't seem to make a difference.

If you have a look at the SF directory on the nodes, you will discover they hold IP references to the other cluster members. Simply switching the VMSS subnet won't update these references, so the cluster won't know how to communicate anymore. For these sorts of changes, it's always cleaner to just deploy a new cluster and move the workloads over.
While adding a new VMSS in the correct subnet may well work, in my own experience I've found it more consistently successful to start from scratch. Plus, I've previously had a cluster bricked by an Azure update, so having experience reprovisioning a cluster in this situation was really helpful, rather than having to try it for the first time during an outage.
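For what it's worth, here is a minimal sketch (with placeholder resource names) of checking which subnet the scale set model currently points at; the Service Fabric configuration stored on the nodes will still list the old 10.0.0.x addresses regardless:
    # Placeholder resource group and VMSS names; prints only the subnet the
    # scale set model references, not what the on-node SF config still holds.
    az vmss show \
        --resource-group my-sf-rg \
        --name my-sf-vmss \
        --query "virtualMachineProfile.networkProfile.networkInterfaceConfigurations[0].ipConfigurations[0].subnet.id" \
        --output tsv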

Related

Internal DNS names not resolving

I was doing a quick lab exercise when I noticed this issue: a ping to an internal IP works, but if I ping by machine name it does not. Here is what I did:
a.) Create a GCP project. Leave all the default firewall rules in place
b.) Create a VM in us-central1 (any region) and call it mynet-us-vm
c.) Create a VM in europe-west1 (any region) and call it mynet-eu-vm
d.) SSH to mynet-us-vm from the console
e.) Run this command: ping -c 3 <Enter mynet-eu-vm's internal IP here> - it works
f.) Run this command: ping -c 3 mynet-eu-vm - does not work! I get the following error:
"ping: mynet-eu-vm: Name or service not known"
For internal DNS resolution to work, multiple factors come into play:
On the client instance running ping, the resolv.conf file must have the metadata server (169.254.169.254) as its nameserver, and the search domains must be set as in the example in the documentation. If you are using a Google-provided image, this configuration should already be correct.
Additionally, verify the hostname registered for the instance "mynet-eu-vm". This can be done by running curl against the metadata server; the output is the full FQDN, which confirms whether resolv.conf should be set up for zonal or global DNS and whether the instance's hostname matches the one being used with ping.
If running "dig FQDN @169.254.169.254" works but ping still fails, the instance is trying to resolve against a different nameserver, or the search list in resolv.conf is incorrect.
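A minimal sketch of those checks, assuming the instance names from the question (zone and project ID are placeholders):
    # On mynet-us-vm: the nameserver should be the metadata server (169.254.169.254).
    cat /etc/resolv.conf
    # On mynet-eu-vm: ask the metadata server which FQDN it registered for this instance.
    curl -s -H "Metadata-Flavor: Google" \
        "http://metadata.google.internal/computeMetadata/v1/instance/hostname"
    # Back on mynet-us-vm: query the metadata server directly for that FQDN.
    dig +short mynet-eu-vm.<zone>.c.<project-id>.internal @169.254.169.254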
If the above steps fail, I suggest raising a support case with Google Cloud Platform or opening a new Public Issue Tracker issue, since following the steps provided does not reproduce the behavior and it is likely something specific to your setup.

Failed Name Resolution for Redshift endpoint from Customer Client Machine

We failed to connect to the Redshift endpoint from the customer's client machine, with the message below:
FAILED!
[Amazon][Amaxon Redshift](10) Error occurred while trying to connect:[SQLState 08S01] could not translate host name "XXXX.XXXX.ap-northeast-1.redshift.amazoneaws.com" to address : Unknown server error
It seems like a name resolution error.
We can connect when we use the current (tentative) IP address of the Redshift cluster instead of the Redshift endpoint -- we get that IP address by running the nslookup command for the Redshift endpoint on an EC2 instance, where we can connect to the Redshift endpoint without any problem.
As you know, the above tentative solution is not a good idea, since the Redshift cluster's IP address can change on a cluster reboot or similar, so we'd like to connect via the Redshift endpoint. We understand some setting is probably missing on our customer's client machine, but so far we have no idea what to do.
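One way we can narrow this down is to run the same lookup on both machines and compare the answers (the XXXX parts below keep the redaction from the error message; note that the standard Redshift endpoint domain is redshift.amazonaws.com):
    # Run on the EC2 instance (where the endpoint works) and on the customer's
    # client machine (where it fails), and compare the results.
    nslookup XXXX.XXXX.ap-northeast-1.redshift.amazonaws.com
    # Optionally query a specific public resolver to see whether the name
    # resolves outside the client's configured DNS server.
    nslookup XXXX.XXXX.ap-northeast-1.redshift.amazonaws.com 8.8.8.8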
Any advice would be highly appreciated.
Best regards.

Connectivity issue with AWS DMS with Postgresql on RDS

I have two AWS RDS instances (running PostgreSQL). They are in different accounts and different regions. I want to set up data replication between them using AWS DMS.
I tried VPC peering.
I followed this video to enable VPC peering:
https://www.youtube.com/watch?v=KmCEFGDTb8U
The problem:
When I try setting up AWS DMS, I add the hostname, username, password, etc. for the source endpoint (which exists in the other account), and when I hit Test Connection, I get the following error.
Test Endpoint failed: Application-Status: 1020912, Application-Message: Failed to connect Network error has occurred, Application-Detailed-Message: RetCode: SQL_ERROR SqlState: 08001 NativeError: 101 Message: [unixODBC]timeout expired ODBC general error.
To my surprise, I get a similar error when I hit Test Connection for the target RDS instance, which is in the same account:
Test Endpoint failed: Application-Status: 1020912, Application-Message: Cannot connect to ODBC provider Network error has occurred, Application-Detailed-Message: RetCode: SQL_ERROR SqlState: 08001 NativeError: 101 Message: [unixODBC]timeout expired ODBC general error.
Google suggests some sort of firewall issue, but looking at the NACLs I can see we allow 0.0.0.0/0 for both VPCs.
If you're attempting to access the private IP ranges of one VPC from another VPC, then in addition to creating the VPC peering connection you'll have to:
create route table entries in both VPCs to route traffic to the remote VPC's IP range(s) through the peering connection;
allow the connections in the security groups, both from the source CIDR range in the destination security group and, if you're filtering outgoing connections from the source, also in its outbound rules. Note that you can't use a security group ID to allow this traffic, because that doesn't apply to cross-region peering;
allow the connection in the underlying software (probably allowed by default);
allow the network ACLs to pass the traffic (you've verified that's already allowed by default).
Since you're seeing timeouts, I'd suspect the security group rules, but it could also be a bad route.
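A hedged CLI sketch of the route and security group changes; every ID and CIDR below is a placeholder, and the same needs to be done on both sides of the peering:
    # Route the remote VPC's CIDR through the peering connection
    # (repeat in the other VPC's route table, pointing back the other way).
    aws ec2 create-route \
        --route-table-id rtb-0123456789abcdef0 \
        --destination-cidr-block 10.1.0.0/16 \
        --vpc-peering-connection-id pcx-0123456789abcdef0
    # Allow PostgreSQL from the remote CIDR in the RDS instance's security group
    # (a CIDR rule, since security group IDs can't be referenced across regions).
    aws ec2 authorize-security-group-ingress \
        --group-id sg-0123456789abcdef0 \
        --protocol tcp \
        --port 5432 \
        --cidr 10.1.0.0/16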
As suggested here: https://aws.amazon.com/premiumsupport/knowledge-center/dms-endpoint-connectivity-failures/
When modifying the replication instance used to test the connection to the endpoint, take note of:
Private IP address
VPC security group
Either change the security group to a suitable one, or edit the security group in use by adding an inbound rule that allows PostgreSQL traffic from the private IP address of the replication instance.
The solution below worked for me.
Create the replication instance, then the endpoints.
If the endpoint test fails, pick up the private IP of the replication instance (if the DMS replication instance and the database are located within the same VPC) and add it to the inbound rules of the corresponding security group.
If the VPCs are in different regions, you might need VPC peering to get this sorted.
Since I had both running in the same VPC, adding the private IP to the inbound rules worked fine and the connection succeeded.

Cannot access Google Cloud SQL from Google Container Engine

I'm still having problems accessing the Cloud SQL instance from a container running on Google Container Engine. When I try to open a mysql connection, I get the following error:
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial
communication packet', system error: 0
The connection works fine from my local machine, though (the instance has a public IP and I have added my office's IP to the allowed networks). So the instance is accessible over the internet just fine.
I guess the database's access control is blocking my access from the GCE network, but I'm unable to figure out how to configure this.
I added my project to "Authorized App Engine Applications" in the Cloud SQL control panel, but that doesn't seem to help.
EDIT:
If I add "0.0.0.0/0" to the allowed networks, everything works. That is obviously not what I want, so what do I need to enter instead?
EDIT 2: I could also take all the public IPs of my Kubernetes cluster's nodes (obtained through gcloud compute instances list) and add them to the Cloud SQL access list manually. But that doesn't seem right, does it?
The recommended solution is to use an SSL connection together with the 0.0.0.0/0 CIDR, so that connections are limited to clients holding the correct key. I also read that Google won't promise you a specific IP range, so the /14 CIDR might not work at times. I had to use an SSL connection with my Cloud SQL instance for the same reasons.
You should use the public IP addresses of the GCE instances to correctly allow traffic to your Cloud SQL instance (as you mentioned in EDIT 2).
You can find more information in the Cloud SQL documentation: https://cloud.google.com/sql/docs/gce-access
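A rough sketch of that approach with placeholder instance names and IPs (not an official recipe):
    # List the nodes' public (NAT) IPs.
    gcloud compute instances list \
        --format="value(name,networkInterfaces[0].accessConfigs[0].natIP)"
    # Authorize them on the Cloud SQL instance; note that this flag replaces the
    # existing list, so include any entries you want to keep.
    gcloud sql instances patch my-sql-instance \
        --authorized-networks=203.0.113.10/32,203.0.113.11/32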
If you add the /14 CIDR block for your Container Engine cluster as the source address range, does that work?
To find the CIDR block for your cluster, click on the cluster name in the Google Cloud Console and find the row labeled "Container address range".
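If you prefer the CLI, the same value can be read like this (cluster name and zone are placeholders):
    # Prints the cluster's container address range, e.g. 10.244.0.0/14.
    gcloud container clusters describe my-cluster \
        --zone us-central1-a \
        --format="value(clusterIpv4Cidr)"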

Is it possible to see connection attempts to a Google Cloud SQL instance?

We are currently encountering the following error when trying to connect to a Cloud SQL instance: Lost connection to MySQL server at 'reading initial communication packet', system error: 0.
This is a familiar error, and as detailed here usually means the IP address needs to be whitelisted. However, we believe we have done so.
Is there a way to see connection attempts and their IP addresses that have been made (and refused) to the Cloud SQL instance?
Currently we don't expose that information, but it is something we would like to fix. :-)
According to @Razvan, as of September 2014 this information isn't exposed.
We ended up using CIDR blocks to search the space and find the actual IP address. This is unsatisfying, obviously, but it's a way to pin down the problem.
If other people want to sanity-check that the problem is their IP being refused, you can add 0.0.0.0/0 to accept all ranges and try to connect. If it works, you know what the problem is.
Be absolutely sure to remove this accepted range after you are done, however!
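A quick sketch of that sanity check (placeholder instance name); the second command removes the wide-open rule again:
    # Temporarily accept every range - for the test only.
    gcloud sql instances patch my-sql-instance --authorized-networks=0.0.0.0/0
    # ...try the connection from the client, then clear the rule immediately.
    gcloud sql instances patch my-sql-instance --clear-authorized-networks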
Figured I might help someone who stumbles here.
I had essentially the same issue trying to connect to a Cloud SQL instance from a hosting provider.
Whitelisting the IP address shown in my cPanel did not let it connect. (It used to, but the provider made some infrastructure changes lately and it stopped working.)
Putting 0.0.0.0/0 in my Cloud Platform whitelist let it connect with no problem.
So now I know that my cPanel IP is not the IP actually connecting to GCP.
After some hair-pulling (I figured the bare-metal server had a different IP than my cPanel IP; it did, but that didn't work either), I finally tried the IP addresses of the name servers that point to my domain, and bam, all is good.
If you are facing this issue, try your name servers (usually something like NS1.hostingprovider.com). I put both the NS1 and NS2 IPs in the whitelist and we are working fine.
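If you are not sure what your name servers are, a quick lookup sketch (domain and provider names are placeholders):
    # Find the name servers for your domain, then resolve their IPs to whitelist.
    dig +short NS example.com
    dig +short ns1.hostingprovider.com
    dig +short ns2.hostingprovider.com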