Communication between tasks in ECS with App Mesh and Cloudmap - amazon-ecs

My ECS task (connected to App Mesh with Cloud Map) cannot reach another ECS task (also connected to App Mesh with Cloud Map).
dig +short products.services.local returns nothing.
curl -v products.services.local:4000/graphql returns
Could not resolve host: products.services.local.
Any help with debugging this would be much appreciated.
By the way, running these commands from Cloud9 (within the same VPC) works fine, so the problem seems limited to ECS or to App Mesh with Cloud Map.
There is a private hosted zone in Route 53 with the correct A record (pointing to the correct task IP), which was created automatically by Cloud Map.
There is a virtual service and a virtual node in App Mesh for both services. Both use AWS Cloud Map as the service discovery type and point to the correct service name and namespace in Cloud Map.
DNS hostnames are enabled in the VPC configuration.
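A few checks that may narrow this down (a sketch only; the VPC ID is a placeholder, and the namespace and service name are taken from the dig command above):

# Both attributes must be true for the private hosted zone created by Cloud Map to resolve
aws ec2 describe-vpc-attribute --vpc-id vpc-0123456789abcdef0 --attribute enableDnsSupport
aws ec2 describe-vpc-attribute --vpc-id vpc-0123456789abcdef0 --attribute enableDnsHostnames

# Ask Cloud Map directly whether healthy instances are registered for the target service
aws servicediscovery discover-instances --namespace-name services.local --service-name products

# From inside the failing task, confirm it uses the VPC resolver (the .2 address of the VPC CIDR)
cat /etc/resolv.conf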

Related

What will happen if AWS Fargate Tasks are provisioned in private subnet with VPC Endpoints and NAT Gateway enabled?

Firstly, I have Fargate tasks in private subnets of a VPC, with a NAT Gateway enabled so they can pull images from ECR and reach other on-premise servers via the internet. This works perfectly. Later I set up VPC endpoints for ECR (api & dkr), S3, Secrets Manager, and CloudWatch Logs and removed the NAT Gateway; communication with AWS services still works, but communication with the on-premise servers breaks. So I re-enabled the NAT Gateway, and the application works with the on-premise servers again. What I am still unclear about is whether the communication with the AWS services (ECR, S3, Secrets Manager, and CloudWatch) happens over the internet or over the private network through the VPC endpoints. Please suggest how to debug which path the traffic takes.
Thank you in advance for your advice.
I followed Use a private subnet with internet access, and I can SSH into the tasks with neither VPC endpoints nor the NAT Gateway enabled. I cannot SSH when I try the VPC endpoints method, since the communication happens via PrivateLink. I still cannot SSH with the VPC endpoints method even with the NAT Gateway enabled.
I think I should be able to SSH now that the NAT Gateway is enabled.
The VPC endpoints you are creating are specifically "Interface Endpoints". When you create an interface endpoint, AWS adds an elastic network interface (ENI) to your specified subnets and assigns it a private IP address in your subnet's address space. In general, you'll also tell AWS to add a DNS entry for that ENI which resolves the service's domain name to that IP (instead of the public IP). You can disable this, but it rather defeats the purpose.
This effectively means that any time you try to resolve the hostname for that service, it should resolve to your ENI's IP address and thus go over PrivateLink. However, it is important to note that you need to configure your CLI/SDK for the region your ENI is in. Otherwise, it may use the generic DNS entry (which may point to us-east-1 specifically). That will resolve just fine (thanks to your NAT Gateway), but if you are running in another region, your traffic may route unexpectedly over the internet.
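A quick way to verify which path the AWS-service traffic takes is to resolve the endpoint hostnames from inside the VPC. This is only a sketch; the region and VPC ID are placeholders:

# With private DNS enabled on the interface endpoints, these should return private IPs
# from your subnets; public IPs would mean the traffic leaves via the NAT Gateway.
dig +short api.ecr.us-east-1.amazonaws.com
dig +short secretsmanager.us-east-1.amazonaws.com
dig +short logs.us-east-1.amazonaws.com

# S3 uses a gateway endpoint instead: look for a route whose target is the vpce- gateway
aws ec2 describe-route-tables --filters Name=vpc-id,Values=vpc-0123456789abcdef0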
All of this is independent of SSH. Remember, VPC interface endpoints are only used to create a private IP address that can be used to route to AWS services. If you are trying to SSH into a Fargate task, that task just needs to be routable. In this particular case, your Fargate tasks are running in your VPC and are apparently directly routable, so no NAT Gateway or interface endpoints should be necessary to reach them.

How to expose AWS Fargate ECS containers to internet with Route53 DNS?

I have an ECS task running on Fargate, and my service automatically gets a public IP address. How do I simply expose a Fargate task to the internet with a Route 53 DNS record? I spent a whole day looking around the internet and couldn't find an example of this simplest possible scenario, where I just want to expose a port from the task to the internet and point a Route 53 record at its IP address.
If I understood correctly from the minimal information I found, I would have to create a VPC endpoint, but I couldn't find information about how to route the traffic to a task/container.
I am also having this problem. I was able to route traffic to a single-container service in public subnets via an Application Load Balancer; however, now I cannot reproduce this. Did you try an ALB yet?
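For what it's worth, the DNS-only approach can be scripted, with the caveat that a Fargate task's public IP changes whenever the task is replaced, so the ALB mentioned above is the more robust option. A sketch, where the cluster/service names, hosted zone ID, and domain are all placeholders:

# Find the running task's public IP
TASK_ARN=$(aws ecs list-tasks --cluster my-cluster --service-name my-service --query 'taskArns[0]' --output text)
ENI_ID=$(aws ecs describe-tasks --cluster my-cluster --tasks "$TASK_ARN" \
  --query "tasks[0].attachments[0].details[?name=='networkInterfaceId'].value" --output text)
PUBLIC_IP=$(aws ec2 describe-network-interfaces --network-interface-ids "$ENI_ID" \
  --query 'NetworkInterfaces[0].Association.PublicIp' --output text)

# Upsert an A record in Route 53 (Z0000000000000 and app.example.com are placeholders)
aws route53 change-resource-record-sets --hosted-zone-id Z0000000000000 --change-batch "{
  \"Changes\": [{ \"Action\": \"UPSERT\", \"ResourceRecordSet\": {
    \"Name\": \"app.example.com\", \"Type\": \"A\", \"TTL\": 60,
    \"ResourceRecords\": [{ \"Value\": \"$PUBLIC_IP\" }] } }] }"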

ECS Fargate health check issue

I am using the ECS Fargate launch type. I have two task definitions, and each definition runs one service (one container per task definition).
Then I have two ECS services which run these two task definitions.
The network mode used is awsvpc (required for Fargate).
I have created a load balancer with two target groups, and I have defined rules to route the traffic.
In the target groups I have added the IPs of the tasks running from the two task definitions (since the network mode is awsvpc, each task gets its own ENI).
For one of the target groups, the health check is failing continuously with HTTP code 502.
I am observing that the IP of that service's ENI keeps changing, and the new IP is updated in the target group.
Questions:
Does ECS change the IP in the target group automatically?
How do I troubleshoot this HTTP 502? Since this is Fargate, I cannot even log in to the container.
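Two things may help here (a sketch; the cluster, task, container, and target group identifiers are placeholders): ECS Exec can open a shell inside a running Fargate container (it requires enableExecuteCommand on the service plus the SSM permissions in the task role), and the ELB API reports why each target is considered unhealthy.

# Open a shell in the running container to inspect the app and its listening port
aws ecs execute-command --cluster my-cluster --task 0123456789abcdef0 \
  --container app --interactive --command "/bin/sh"

# Show the health state and the reason each registered target is failing
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef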

Cannot create API Management VPC Link in AWS Console

I'm failing to add a VPC Link to my API Gateway that will link to my application load balancer. The symptom in the AWS Console is that the dropdown box for Target NLB is empty. If I attempt to force the issue via the AWS CLI, an entry is created, but the status says the NLB ARN is malformed.
I've verified the following:
My application load balancer is in the same account and region as my API Gateway.
My user account has admin privileges. I created and added the recommended policy just in case I was missing something.
The NLB ARN was copied directly from the application load balancer page for the AWS CLI creation scenario.
I can invoke my API directly on the ECS instance (it has a public IP for now).
I can invoke my API through the application load balancer public IP.
Possible quirks with my configuration:
My application load balancer has a security group which limits access to a narrow range of IPs. I didn't think this would matter, since VPC links are supposed to connect over private DNS.
My ECS instance has private DNS enabled.
My ECS uses EC2 launch type, not Fargate.
Indeed, as suggested in a related post, my problem stemmed from initially creating an ALB (Application Load Balancer) rather than an NLB (Network Load Balancer). Once I had an NLB configured properly, I was able to configure the VPC Link as described in the AWS documentation.
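For reference, a minimal sketch of the working setup with the AWS CLI (all names, subnets, and ARNs below are placeholders): create a Network Load Balancer, then point the VPC link at its ARN; passing an ALB ARN is what triggers the "malformed" error described above.

# Create an internal Network Load Balancer in the subnets that can reach the ECS service
aws elbv2 create-load-balancer --name my-nlb --type network --scheme internal \
  --subnets subnet-0123456789abcdef0 subnet-0fedcba9876543210

# Create the VPC link against the NLB ARN (only NLB ARNs are accepted here)
aws apigateway create-vpc-link --name my-vpc-link \
  --target-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/my-nlb/0123456789abcdef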

Google Cloud Build deploy to GKE Private Cluster

I'm running a Google Kubernetes Engine cluster with the "private-cluster" option.
I've also defined "master authorized networks" to be able to remotely access the environment - this works just fine.
Now I want to setup some kind of CI/CD pipeline using Google Cloud Build -
after successfully building a new docker image, this new image should be automatically deployed to GKE.
When I first fired off the new pipeline, the deployment to GKE failed - the error message was something like: "Unable to connect to the server: dial tcp xxx.xxx.xxx.xxx:443: i/o timeout".
As I had the "authorized master networks" option under suspicion for being the root cause for the connection timeout, I've added 0.0.0.0/0 to the allowed networks and started the Cloud Build job again - this time everything went well and after the docker image was created it was deployed to GKE. Good.
The only problem that remains is that I don't really want to allow the whole Internet being able to access my Kubernetes master - that's a bad idea, isn't it?
Are there more elegant solutions that narrow down access using master authorized networks while still allowing deployment via Cloud Build?
It's currently not possible to add Cloud Build machines to a VPC. Similarly, Cloud Build does not announce the IP ranges of the build machines. So you can't do this today without creating an "SSH bastion instance" or a "proxy instance" on GCE within that VPC.
I suspect this will change soon. GCB existed before GKE private clusters, and private clusters are still a beta feature.
We ended up doing the following:
1) Remove the deployment step from cloudbuild.yaml
2) Install Keel inside the private cluster and give it Pub/Sub Editor privileges in the Cloud Build / registry project (the IAM grant is sketched below)
Keel will monitor changes in images and deploy them automatically based on your settings.
This has worked out great: we now get SHA-tagged image updates deployed automatically, without adding VMs or setting up any kind of bastion/SSH host.
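For reference, the Pub/Sub Editor grant from step 2 can be done with gcloud. This is only a sketch; the project ID and the service account Keel runs as are placeholders that depend on how Keel is installed:

# Allow the service account used by Keel to manage the Pub/Sub subscriptions for GCR push events
gcloud projects add-iam-policy-binding my-builder-project \
  --member="serviceAccount:keel@my-builder-project.iam.gserviceaccount.com" \
  --role="roles/pubsub.editor"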
Updated answer (02/22/2021)
Unfortunately, while the method below works, IAP tunnels seem to suffer from rate limiting. If a lot of resources are deployed via kubectl, the tunnel times out after a while. I had to use another trick, which is to dynamically whitelist the Cloud Build IP address via Terraform and then apply directly, which works every time.
Original answer
It is also possible to create an IAP tunnel inside a Cloud Build step:
- id: kubectl-proxy
  name: gcr.io/cloud-builders/docker
  entrypoint: sh
  args:
    - -c
    - docker run -d --net cloudbuild --name kubectl-proxy
      gcr.io/cloud-builders/gcloud compute start-iap-tunnel
      bastion-instance 8080 --local-host-port 0.0.0.0:8080 --zone us-east1-b &&
      sleep 5
This step starts a background Docker container named kubectl-proxy on the cloudbuild network, which is shared by all of the other Cloud Build steps. The container establishes an IAP tunnel using the Cloud Build service account's identity. The tunnel connects to a GCE instance with a SOCKS or an HTTPS proxy pre-installed on it (an exercise left to the reader).
Inside subsequent steps, you can then access the cluster simply as
- id: setup-k8s
  name: gcr.io/cloud-builders/kubectl
  entrypoint: sh
  args:
    - -c
    - HTTPS_PROXY=socks5://kubectl-proxy:8080 kubectl apply -f config.yml
The main advantages of this approach compared to the others suggested above:
No need to have a "bastion" host with a public IP - the kubectl-proxy host can be entirely private, thus maintaining the privacy of the cluster
The tunnel connection relies on the default Google credentials available to Cloud Build, so there's no need to store or pass any long-term credentials like an SSH key
I got Cloud Build working with my private GKE cluster by following this Google document:
https://cloud.google.com/architecture/accessing-private-gke-clusters-with-cloud-build-private-pools
This allows me to use Cloud Build and Terraform to manage a GKE cluster with authorized network access to the control plane enabled. I considered trying to maintain a ridiculous whitelist, but that would ultimately defeat the purpose of using authorized network access control in the first place.
I would note that Cloud Build private pools are generally slower than non-private pools, due to the serverless nature of private pools. I have not experienced the rate limiting that others have mentioned.
Our workaround was to add steps in the CI/CD pipeline to whitelist Cloud Build's IP via the master authorized networks setting.
Note: the Cloud Build service account needs an additional permission: Kubernetes Engine Cluster Admin.
In cloudbuild.yaml, add the whitelist step before the deployment step(s).
This step fetches Cloud Build's external IP and then updates the cluster settings:
# Authorize Cloud Build to Access the Private Cluster (Enable Control Plane Authorized Networks)
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
  id: 'Authorize Cloud Build'
  entrypoint: 'bash'
  args:
    - -c
    - |
      apt-get install dnsutils -y &&
      cloudbuild_external_ip=$(dig @resolver4.opendns.com myip.opendns.com +short) &&
      gcloud container clusters update my-private-cluster --zone=$_ZONE --enable-master-authorized-networks --master-authorized-networks $cloudbuild_external_ip/32 &&
      echo $cloudbuild_external_ip
Since Cloud Build's IP has been whitelisted, the deployments will proceed without the i/o timeout error.
This removes the complexity of setting up a VPN or private worker pools.
Disable control plane authorized networks after the deployment:
# Disable Control Plane Authorized Networks after Deployment
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
  id: 'Disable Authorized Networks'
  entrypoint: 'gcloud'
  args:
    - 'container'
    - 'clusters'
    - 'update'
    - 'my-private-cluster'
    - '--zone=$_ZONE'
    - '--no-enable-master-authorized-networks'
This approach works well even in cross-project / cross-environment deployments.
Update: I suppose this won't hold up in production, for the same reason as @dinvlad's update above, i.e., rate limiting in IAP. I'll leave my original post here because it does solve the network connectivity problem and illustrates the underlying networking mechanism.
Furthermore, even if we don't use it for Cloud Build, my method provides a way to tunnel from my laptop to a private K8s master node. I can therefore edit K8s YAML files on my laptop (e.g., using VS Code) and immediately run kubectl from my laptop, rather than having to ship the code to a bastion host and run kubectl inside the bastion host. I find this a big boost to development productivity.
Original answer
================
I think I might have an improvement to the great solution provided by @dinvlad above.
I think the solution can be simplified without installing an HTTP proxy server, though a bastion host is still needed.
I offer the following proof of concept (without an HTTP proxy server). This PoC illustrates the underlying networking mechanism without the distraction of Google Cloud Build (GCB). (When I have time in the future, I'll test out the full implementation on Google Cloud Build.)
Suppose:
I have a GKE cluster whose master node is private, e.g., with an IP address of 10.x.x.x.
I have a bastion Compute Engine instance named my-bastion. It has only a private IP and no external IP. The private IP is within the master authorized networks CIDR of the GKE cluster. Therefore, from within my-bastion, kubectl works against the private GKE master node. Because my-bastion doesn't have an external IP, my home laptop connects to it through IAP.
My laptop at home, with my home internet public IP address, doesn't readily have connectivity to the private GKE master node above.
The goal is for me to execute kubectl on my laptop against that private GKE cluster. From a network architecture perspective, my home laptop's position is like that of the Google Cloud Build server.
Theory: Knowing that gcloud compute ssh (and the associated IAP) is a wrapper for SSH, the SSH Dynamic Port Forwarding should achieve that goal for us.
Practice:
## On laptop:
LAPTOP~$ kubectl get ns
^C <<<=== Without setting anything up, this hangs (no connectivity to GKE).
## Set up SSH Dynamic Port Forwarding (SOCKS proxy) from laptop's port 8443 to my-bastion.
LAPTOP~$ gcloud compute ssh my-bastion --ssh-flag="-ND 8443" --tunnel-through-iap
In another terminal of my laptop:
## Without using the SOCKS proxy, this returns my laptop's home public IP:
LAPTOP~$ curl https://checkip.amazonaws.com
199.xxx.xxx.xxx
## Using the proxy, the same curl command above now returns a different IP address,
## i.e., the IP of my-bastion.
## Note: Although my-bastion doesn't have an external IP, I have a GCP Cloud NAT
## for its subnet (for purpose unrelated to GKE or tunneling).
## Anyway, this NAT is handy as a demonstration for our curl command here.
LAPTOP~$ HTTPS_PROXY=socks5://127.0.0.1:8443 curl -v --insecure https://checkip.amazonaws.com
* Uses proxy env variable HTTPS_PROXY == 'socks5://127.0.0.1:8443' <<<=== Confirming it's using the proxy
...
* SOCKS5 communication to checkip.amazonaws.com:443
...
* TLSv1.2 (IN), TLS handshake, Finished (20): <<<==== successful SSL handshake
...
> GET / HTTP/1.1
> Host: checkip.amazonaws.com
> User-Agent: curl/7.68.0
> Accept: */*
...
< Connection: keep-alive
<
34.xxx.xxx.xxx <<<=== Returns the GCP Cloud NAT'ed IP address for my-bastion
Finally, the moment of truth for kubectl:
## On laptop:
LAPTOP~$ HTTPS_PROXY=socks5://127.0.0.1:8443 kubectl --insecure-skip-tls-verify=true get ns
NAME          STATUS   AGE
default       Active   3d10h
kube-system   Active   3d10h
It is now possible to create a pool of VMs that are connected to your private VPC and can be accessed from Cloud Build.
Quickstart
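A minimal sketch of creating such a pool (names, region, and project are placeholders; it assumes the Service Networking peering for the VPC is already in place):

# Create a private worker pool peered with the VPC that can reach the GKE control plane
gcloud builds worker-pools create my-private-pool \
  --region=us-east1 \
  --peered-network=projects/my-project/global/networks/my-vpc
# Builds then reference this pool in their configuration, as described in the quickstart.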