Cannot connect to kafka connect cluster running on AWS from outside EC2 - apache-kafka

I have an ECS cluster with 3 EC2 instances all sitting in private subnets. I created a task definition to run the kafka-connect image provided by Confluent with the following environment variables:
CONNECT_CONFIG_STORAGE_TOPIC=quickstart-config
CONNECT_GROUP_ID=quickstart
CONNECT_INTERNAL_KEY_CONVERTER=org.apache.kafka.connect.json.JsonConverter
CONNECT_INTERNAL_VALUE_CONVERTER=org.apache.kafka.connect.json.JsonConverter
CONNECT_KEY_CONVERTER=org.apache.kafka.connect.json.JsonConverter
CONNECT_OFFSET_STORAGE_TOPIC=quickstart-offsets
CONNECT_PLUGIN_PATH=/usr/share/java
CONNECT_REST_ADVERTISED_HOST_NAME=localhost
CONNECT_REST_ADVERTISED_PORT=8083
CONNECT_SECURITY_PROTOCOL=SSL
CONNECT_SSL_ENDPOINT_IDENTIFICATION_ALGORITHM=
CONNECT_STATUS_STORAGE_TOPIC=quickstart-status
CONNECT_VALUE_CONVERTER=org.apache.kafka.connect.json.JsonConverter
I have an application load balancer in front of this cluster with a listener on port 8083. I have set up the target group to include the EC2 instances running kafka-connect, so the load balancer should forward requests to the cluster. It does, but I always get back a 502 Bad Gateway response. I can ssh into the EC2 instances, curl localhost:8083, and get a response from kafka-connect, but from outside the EC2 instances I don't get a response.
To rule out networking issues between the load balancer and the cluster, I created a separate task definition running Nginx on port 80, and I'm able to successfully hit it from outside the EC2 instances through the load balancer.
I have a feeling that I have not set CONNECT_REST_ADVERTISED_HOST_NAME to the correct value. It's my understanding that this is the host clients should connect to. However, because my EC2 instances are in a private subnet, I have no idea what to set this to, which is why I've set it to localhost. I tried setting it to the load balancer's DNS name, but that doesn't work.

You need to set CONNECT_REST_ADVERTISED_HOST_NAME to the host or IP that the other Kafka Connect workers can resolve and connect to.
It's used for the internal communication between workers. If your REST request (via your load balancer) hits a worker that is not the current leader of the cluster, that worker will try to forward the request to the leader, using the host name that the leader advertises through CONNECT_REST_ADVERTISED_HOST_NAME. If that value is localhost, the worker simply forwards the request to itself, and hence things won't work.
For more details see https://rmoff.net/2019/11/22/common-mistakes-made-when-configuring-multiple-kafka-connect-workers/
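As a rough illustration of the point above, here is a minimal Python sketch that rejects advertised host values which only make sense on the worker itself, and builds the two REST variables from a worker's private address. The function names and the 10.0.1.23 address are hypothetical, and how you obtain each instance's private IP is left open.

```python
import ipaddress


def is_usable_advertised_host(host: str) -> bool:
    """Return False for values that only resolve to the worker itself."""
    if host.strip().lower() in {"", "localhost"}:
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so treat it as a DNS name the other workers must resolve.
        return True
    return not (addr.is_loopback or addr.is_unspecified)


def connect_rest_env(private_ip: str, port: int = 8083) -> dict:
    """Build the REST advertisement variables for one worker (hypothetical helper)."""
    if not is_usable_advertised_host(private_ip):
        raise ValueError(f"{private_ip!r} is not reachable from other workers")
    return {
        "CONNECT_REST_ADVERTISED_HOST_NAME": private_ip,
        "CONNECT_REST_ADVERTISED_PORT": str(port),
    }


# 10.0.1.23 is an assumed private address, used only for illustration.
print(connect_rest_env("10.0.1.23"))
```

Each worker needs its own value here, so the variable has to be set per instance rather than once in a shared task definition.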

Related

Azure Kubernetes Service: Route outbound traffic through LoadBalancer

Right now I'm setting up a Kubernetes cluster with Azure Kubernetes Service (AKS).
I'm using the "Bring your own Subnet" feature with Kubenet as the network mode.
As you can see in the diagram, on the left side is an example vm.
In the middle is a load balancer I set up in the cluster, which directs incoming traffic to all pods with the label "webserver". This works fine.
On the right side is an example node of the cluster.
My problem is the outgoing traffic of the nodes. As you would expect, if you ssh into a VM in subnet 1 from a node in subnet 2, the connection uses the node's IP, the .198, as its source address. (Red Line)
I would like to route the traffic over the load balancer, so that the incoming ssh connection at the VM in subnet 1 has a source address of .196. (Green Line)
Reason: We have a central firewall. To open ports, I have to specify the IP address that the packets come from. For this case, I would like to route the traffic over one central load balancer so that only one IP has to be allowed through the firewall. Otherwise, every packet would have the source IP of its node.
Is this possible?
I have tried to look this use case up in the Azure docs, but most of the time they talk about the usage of public IPs, which I am not using in this case.

OpenVPN into a K8S cluster can reach pods on the host where the server runs, but not pods on other hosts in the cluster

I deployed the OpenVPN server in the K8S cluster and deployed the OpenVPN client on a host outside the cluster. However, when I connect through the client, I can only access the pods on the host where the OpenVPN server is located, and cannot access the pods on other hosts in the cluster.
The network used by the cluster is Calico. I also added the following iptables rules to the openVPN server host in the cluster:
When I captured packets on tun0 on the server, I found that no reply packets came back.
When the server is deployed with hostNetwork, a FORWARD rule is missing from iptables.
Not sure how you set up iptables inside the server pod as iptables/netfilter was not accessible on most kube clusters I saw.
If you want to have full access to cluster networking over that OpenVPN server, you probably want to use hostNetwork: true on your VPN server. The problem is that you still need a proper MASQ/SNAT rule to get the response back to your client.
You should investigate your traffic going out of the server pod to see if it has a properly rewritten source address, otherwise the nodes in cluster will have no knowledge on how to route the response.
You probably have a common gateway for your nodes. Depending on your kube implementation, you might get around this issue by setting the route back to your VPN, but that will likely require some scripting around the VPN server itself to make sure the route is updated each time the server pod is rescheduled.
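To make the MASQ/SNAT suggestion concrete, here is a small Python sketch that assembles the kind of iptables command that would rewrite the source address of VPN client traffic. The 10.8.0.0/24 client pool and the eth0 interface are assumptions, not values from the question, and the helper name is hypothetical. The sketch only prints the command and does not run it.

```python
import ipaddress
import shlex


def masquerade_rule(vpn_subnet: str, out_interface: str) -> list:
    """Build an iptables argv that source-NATs traffic from the VPN client pool."""
    network = ipaddress.ip_network(vpn_subnet)  # raises ValueError on a bad CIDR
    return [
        "iptables", "-t", "nat", "-A", "POSTROUTING",
        "-s", str(network),
        "-o", out_interface,
        "-j", "MASQUERADE",
    ]


# Assumed values: 10.8.0.0/24 for the VPN client pool, eth0 for the node interface.
print(shlex.join(masquerade_rule("10.8.0.0/24", "eth0")))
```

With a rule of this shape in place, packets leaving the VPN host carry that host's own address, so the other nodes can route replies without knowing about the VPN subnet.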

ECS+NLB does not support dynamic port hence only 1 task per EC2 instance?

Please confirm whether the following are true, or point to the official AWS documentation that describes how to use dynamic port mapping with NLB and run multiple copies of the same task on an ECS EC2 instance. I am not using Fargate.
ECS+NLB does NOT support dynamic port mapping, hence
ECS+NLB can only allow 1 task (docker container) per EC2 instance in an ECS service
This is because:
The AWS ECS Developer Guide - Creating a Load Balancer only mentions that ALB can use dynamic ports, and does not mention NLB.
Application Load Balancers offer several features that make them attractive for use with Amazon ECS services:
* Application Load Balancers allow containers to use dynamic host port mapping (so that multiple tasks from the same service are allowed per container instance).
ECS task creation page clearly states that dynamic port is for ALB.
Network Load Balancer for inter-service communication quotes a response from the AWS support:
"However, I would like to point out that there is currently an ongoing issue with the NLB functionality with ECS, mostly seen with dynamic port mapping where the container is not able to stabilize due to health check errors, I believe the error you're seeing is related to that issue. I can only recommend that you use the ALB for now, as the NLB is still quite new so it's not fully compatible with ECS yet."
Updates
Found a document stating that NLB supports dynamic ports. However, if I switch from ALB to NLB, the ECS service does not work. When I log into an EC2 instance, the ECS agent is running but no docker container is running.
If someone has managed to make ECS (EC2 type) + NLB work, please provide the steps you followed.
Amazon ECS Developer Guide - Service Load Balancing - Load Balancer Types - NLB
Network Load Balancers support dynamic host port mapping. For example, if your task's container definition specifies port 80 for an NGINX container port, and port 0 for the host port, then the host port is dynamically chosen from the ephemeral port range of the container instance (such as 32768 to 61000 on the latest Amazon ECS-optimized AMI). When the task is launched, the NGINX container is registered with the Network Load Balancer as an instance ID and port combination, and traffic is distributed to the instance ID and port corresponding to that container. This dynamic mapping allows you to have multiple tasks from a single service on the same container instance.
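The port mapping described in that passage can be sketched as a container definition fragment, built here as a Python dict and printed as JSON. The container name, image, and memory value are assumptions for illustration. As far as I know, dynamic host ports apply to the bridge network mode.

```python
import json


def nginx_container_definition() -> dict:
    """Container definition fragment using dynamic host port mapping."""
    return {
        "name": "nginx",   # assumed container name
        "image": "nginx",  # assumed image
        "memory": 128,     # assumed memory limit in MiB
        "portMappings": [
            # hostPort 0 asks ECS to pick a free port from the ephemeral range.
            {"containerPort": 80, "hostPort": 0, "protocol": "tcp"}
        ],
    }


print(json.dumps(nginx_container_definition(), indent=2))
```

Because every task gets its own host port, the security group on the container instances has to allow the ephemeral port range from wherever the traffic and health checks originate.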

Connect to On Premises Service Fabric Cluster

I've followed the steps from Microsoft to create a Multi-Node On-Premises Service Fabric cluster. I've deployed a stateless app to the cluster and it seems to be working fine. When I have been connecting to the cluster I have used the IP Address of one of the nodes. Doing that, I can connect via Powershell using Connect-ServiceFabricCluster nodename:19000 and I can connect to the Service Fabric Explorer website (http://nodename:19080/explorer/index.html).
The examples online suggest that if I hosted in Azure I could connect to http://mycluster.eastus.cloudapp.azure.com:19000 and it would resolve, however I can't work out what the equivalent is for my local cluster. I tried connecting to my sample cluster: Connect-ServiceFabricCluster sampleCluster.domain.local:19000 but that returns:
WARNING: Failed to contact Naming Service. Attempting to contact Failover Manager Service...
WARNING: Failed to contact Failover Manager Service, Attempting to contact FMM...
False
WARNING: No such host is known
Connect-ServiceFabricCluster : No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue.
Am I missing something in my setup? Should there be a central DNS entry somewhere that allows me to connect to the cluster? Or am I trying to do something that isn't supported On-Premises?
Yup, you're missing a load balancer.
This is the best resource I could find to help. I'll paste the relevant contents in case it becomes unavailable.
Reverse Proxy — When you provision a Service Fabric cluster, you have an option of installing Reverse Proxy on each of the nodes on the cluster. It performs the service resolution on the client’s behalf and forwards the request to the correct node which contains the application. In majority of the cases, services running on the Service Fabric run only on the subset of the nodes. Since the load balancer will not know which nodes contain the requested service, the client libraries will have to wrap the requests in a retry-loop to resolve service endpoints. Using Reverse Proxy will address the issue since it runs on each node and will know exactly on what nodes is the service running on. Clients outside the cluster can reach the services running inside the cluster via Reverse Proxy without any additional configuration.
Source: Azure Service Fabric is amazing
I have an Azure Service Fabric resource running, but the same rules apply. As the article states, you'll need a reverse proxy/load balancer to resolve not only what nodes are running the API, but also to balance the load between the nodes running that API. So, health probes are necessary too so that the load balancer knows which nodes are viable options for sending traffic to.
As an example, Azure creates 2 rules off the bat:
1. LBHttpRule on TCP/19080 with a TCP probe on port 19080 every 5 seconds with a 2 count error threshold.
2. LBRule on TCP/19000 with a TCP probe on port 19000 every 5 seconds with a 2 count error threshold.
What you need to add to make this forward-facing is a rule where you forward port 80 to your service http port. Then the health probe can be an http probe that hits a path to test a 200 return.
Once you get into the cluster, you can resolve the services normally and SF will take care of availability.
In Azure-land, this is abstracted again to using something like API Management to further reverse proxy it to SSL. What a mess but it works.
Once your load balancer is set up, you'll have a single IP to hit for management, publishing, and regular traffic.

Not able to call a web service hosted in Service Fabric

I've published an OWIN-hosted web service to my remote cluster. I'm using a custom port, 4444, created during cluster creation. I see the AppPort rule for 4444. I'm also able to remote into one of the VMs and invoke the service locally. However, I'm still not able to call it remotely. It hangs for a while and doesn't return anything.
Start with this guide and make sure you have the Azure Load Balancer configured properly: https://azure.microsoft.com/en-us/documentation/articles/service-fabric-connect-and-communicate-with-services/#service-fabric-in-azure
The trick is to make sure that when the load balancer sends traffic on a particular port to a node in the cluster there is a service instance there listening on that port. By default, the load balancer simply sends traffic to all nodes, so you have to make sure that you have a service instance listening on each node, or if not then have a load balancer probe actively checking which nodes do have a service instance listening on that port.
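To check by hand what a TCP probe would see, a minimal Python sketch like the following can tell you which nodes actually have something listening on the service port. The function names are hypothetical, and this only imitates the idea of a load balancer probe, not Azure's implementation.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something accepts a TCP connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat the node as not listening.
        return False


def nodes_listening(nodes: list, port: int) -> list:
    """Keep only the nodes that pass the probe on the given port."""
    return [node for node in nodes if tcp_probe(node, port)]
```

Running nodes_listening over the node addresses with port 4444 from a machine inside the virtual network would show whether the service is listening on every node or only on some of them.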