Airflow scheduler cannot connect to the Kubernetes service API

I am trying to set up Airflow with the Kubernetes executor, and on scheduler container startup it hangs for a while and then I get an HTTPS timeout error as follows. The IP address in the message is correct, and inside the container I can run curl kubernetes:443, curl 10.96.0.1:443, or nc -zv 10.96.0.1 443, so I assume there is no firewall or similar blocking access.
I am using local Kubernetes as well as AWS EKS and get the same error; I can see that the IP changes between clusters.
I have searched Google for a solution but did not find similar cases.
│ File "/usr/local/lib/python3.6/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 335, in run │
│ self.worker_uuid, self.kube_config) │
│ File "/usr/local/lib/python3.6/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 359, in _run │
│ **kwargs): │
│ File "/usr/local/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 144, in stream │
│ for line in iter_resp_lines(resp): │
│ File "/usr/local/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines │
│ for seg in resp.read_chunked(decode_content=False): │
│ File "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 781, in read_chunked │
│ self._original_response.close() │
│ File "/usr/local/lib/python3.6/contextlib.py", line 99, in __exit__ │
│ self.gen.throw(type, value, traceback) │
│ File "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 430, in _error_catcher │
│ raise ReadTimeoutError(self._pool, None, "Read timed out.") │
│ urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='10.96.0.1', port=443): Read timed out.
Update: I found my problem, but no solution yet.
https://github.com/kubernetes-client/python/issues/990

There is an option to set the value via an environment variable. In your charts/airflow.yaml file, you can set it as follows, and that should solve your problem:
AIRFLOW__KUBERNETES__KUBE_CLIENT_REQUEST_ARGS: {"_request_timeout" : [50, 50]}
PR Reference: https://github.com/apache/airflow/pull/6643
Problem Discussion: https://issues.apache.org/jira/browse/AIRFLOW-6040
airflow.yaml full code
airflow:
  image:
    repository: airflow-docker-local
    tag: 1
  executor: Kubernetes
  service:
    type: LoadBalancer
  config:
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://postgres:airflow@airflow-postgresql:5432/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://postgres:airflow@airflow-postgresql:5432/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:airflow@airflow-redis-master:6379/0
    AIRFLOW__CORE__REMOTE_LOGGING: True
    AIRFLOW__CORE__REMOTE_LOG_CONN_ID: my_s3_connection
    AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER: s3://xxx-airflow/logs
    AIRFLOW__WEBSERVER__LOG_FETCH_TIMEOUT_SEC: 25
    AIRFLOW__CORE__LOAD_EXAMPLES: True
    AIRFLOW__WEBSERVER__EXPOSE_CONFIG: True
    AIRFLOW__CORE__FERNET_KEY: -xyz=
    AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY: airflow-docker-local
    AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG: 1
    AIRFLOW__KUBERNETES__WORKER_CONTAINER_IMAGE_PULL_POLICY: Never
    AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME: airflow
    AIRFLOW__KUBERNETES__DAGS_VOLUME_CLAIM: airflow
    AIRFLOW__KUBERNETES__NAMESPACE: airflow
    AIRFLOW__KUBERNETES__KUBE_CLIENT_REQUEST_ARGS: {"_request_timeout" : [50, 50]}
persistence:
  enabled: true
  existingClaim: ''
workers:
  enabled: true
postgresql:
  enabled: true
redis:
  enabled: true

Related

Issue with Terraform accessing list value of a key in YAML file

I am deploying Azure Service Bus using Terraform and a YAML conf file. I am creating
the Azure Service Bus namespace, network rules for the Service Bus, and a service authorization rule for the namespace using Terraform. But I want to define multiple topics, and multiple subscriptions under each topic, in a YAML file that Terraform reads as parameters when creating the "topic" and "subscription" resources. I have defined the multiple subscriptions as a list value of the topic. The topics are created successfully, but the multiple subscriptions are not. The error, YAML, and Terraform conf are given below.
Error: Incorrect attribute value type
│
│ on main.tf line 215, in resource "azurerm_servicebus_subscription" "subscription":
│ 215: name = each.value.servicebus_subscription
│ ├────────────────
│ │ each.value.servicebus_subscription is tuple with 2 elements
│
│ Inappropriate value for attribute "name": string required.
con.yaml
#-------
servicebus:
  - servicebus_topic: tesTopic1
    #enable_partitioning: "true"
    servicebus_subscription: ['test-service1', 'test-service1']
  - servicebus_topic: testTopic2
    servicebus_subscription: ['test-db1', 'test-service2']
pub_sub.tf
resource "azurerm_servicebus_subscription" "subscription" {
  for_each   = { for subscriptions in local.service_bus_conf : subscriptions.servicebus_topic => subscriptions }
  depends_on = [azurerm_servicebus_topic.topic]
  name       = each.value.servicebus_subscription
  topic_id   = data.azurerm_servicebus_topic.topic[each.value.servicebus_topic].id
}
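For context, the error happens because each.value.servicebus_subscription is the whole list (a tuple with 2 elements), while name expects a single string. Below is a hedged sketch of one way the data could be flattened so each subscription becomes its own resource instance; it is illustrative only, assumes the same local.service_bus_conf structure, and adds max_delivery_count because the azurerm provider expects it:

locals {
  # One object per (topic, subscription) pair, so each for_each value
  # carries a single subscription name instead of a tuple.
  topic_subscriptions = flatten([
    for t in local.service_bus_conf : [
      for s in t.servicebus_subscription : {
        topic        = t.servicebus_topic
        subscription = s
      }
    ]
  ])
}

resource "azurerm_servicebus_subscription" "subscription" {
  for_each = { for ts in local.topic_subscriptions : "${ts.topic}.${ts.subscription}" => ts }

  depends_on         = [azurerm_servicebus_topic.topic]
  name               = each.value.subscription
  topic_id           = data.azurerm_servicebus_topic.topic[each.value.topic].id
  max_delivery_count = 1
}

Note that the for_each map keys must be unique, so the duplicated 'test-service1' under the first topic would still need to be deduplicated.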

Are AWS EFS Mount Targets supported in AWS Local Zones?

I am trying to create an EFS mount target in the us-east-1-atl-1a AWS Local Zone using Terraform, but I received the following error. I attempted to create it manually using the UI, but I don't see an option to select us-east-1-atl-1a as an AZ (see screenshot). Does anyone know if this AWS Local Zone supports EFS mount targets? The AWS Local Zones info page doesn't mention EFS at all.
Terraform Error:
│ Error: UnsupportedAvailabilityZone: Mount targets are not supported in subnet's availability zone.
│ {
│ RespMetadata: {
│ StatusCode: 400,
│ RequestID: "23192f37-77e6-421b-b623-4a2b6dfb6217"
│ },
│ ErrorCode: "UnsupportedAvailabilityZone",
│ Message_: "Mount targets are not supported in subnet's availability zone."
│ }
│
│ with aws_efs_mount_target.efs-mounts[0],
│ on efs.tf line 7, in resource "aws_efs_mount_target" "efs-mounts":
│ 7: resource "aws_efs_mount_target" "efs-mounts" {
│
╵
╷
│ Error: creating EKS Cluster (d115): UnsupportedAvailabilityZoneException: Cannot create cluster 'd115' because us-east-1-atl-1a, the targeted availability zone, does not currently have sufficient capacity to support the cluster. Retry and choose from these availability zones: us-east-1a, us-east-1b, us-east-1c, us-east-1d, us-east-1f
│ {
│ RespMetadata: {
│ StatusCode: 400,
│ RequestID: "51476966-09d3-4976-b8a1-f381b9c29c17"
│ },
│ ClusterName: "d115",
│ Message_: "Cannot create cluster 'd115' because us-east-1-atl-1a, the targeted availability zone, does not currently have sufficient capacity to support the cluster. Retry and choose from these availability zones: us-east-1a, us-east-1b, us-east-1c, us-east-1d, us-east-1f",
│ ValidZones: [
│ "us-east-1a",
│ "us-east-1b",
│ "us-east-1c",
│ "us-east-1d",
│ "us-east-1f"
│ ]
│ }
EFS Mount Targets screenshot
EFS is not currently listed as a service available in Local Zones. You can see the list of services here - https://aws.amazon.com/about-aws/global-infrastructure/localzones/features/
EBS and FSx are the only storage options currently available.
This is not supported; only standard Availability Zones are supported, as the error indicates:
"us-east-1a",
"us-east-1b",
"us-east-1c",
"us-east-1d",
"us-east-1f"

Flink not starting TaskManagers on Kubernetes, job reached global-terminal state

I have deployed a Flink cluster to Kubernetes and I only see the JobManagers running.
I had Flink running on another Kubernetes cluster, where I took a savepoint using the FlinkDeployment resource from the Flink Operator. The savepoint was saved correctly. I then deployed the Flink app to the new Kubernetes cluster and patched the savepointLocationPath in the FlinkDeployment.
The Flink pods now log this error
WARN org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Ignoring JobGraph submission 'Windchill ESI Post Processing' because the job already reached a globally-terminal state (i.e. FAILED, CANCELED, FINISHED) in a previous execution.
...
io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LockException: Unable to update ConfigMapLock
...
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.0.0.1/api/v1/namespaces/post-processing-int2/configmaps/post-processing-cluster-config-map. Message: Operation cannot be fulfilled on configmaps "post-processing-cluster-config-map": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, kind=configmaps, name=post-processing-cluster-config-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on configmaps "post-processing-cluster-config-map": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
The ConfigMap mentioned in the error is present.
My question is: how do I start a new TaskManager now? I have numberOfTaskSlots: 4 set. I tried shelling into a JobManager pod and running bin/taskmanager.sh start, but that just started a process inside the pod, which didn't seem right, so I stopped it.
I am expecting to see a new TaskManager pod start up. Thank you.
The clue was in the first line of the logs:
WARN org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Ignoring JobGraph submission 'Windchill ESI Post Processing' because the job already reached a globally-terminal state (i.e. FAILED, CANCELED, FINISHED) in a previous execution
My mistake started with this command
kubectl patch flinkdeployment/<name-of-flink-deployment> --type=merge -p '{"spec": {"job": {"state": "suspended", "upgradeMode": "savepoint"}}}'
The problem was the upgradeMode. It should not have been edited and should have been left as last-state. last-state tells the Flink deployment to start from where it left off using the HA state (which in my case is stored in Azure Blob Storage). savepoint puts the deployment in a FINISHED state and will not start a new TaskManager upon deployment.
Below is the correct edit:
kubectl patch flinkdeployment/<name-of-flink-deployment> --type=merge -p '{"spec": {"job": {"state": "suspended"}}}'

Terraform Cloud, multiple applies, EKS

I'm having the following issue.
I'm trying to deploy an EKS cluster with EKS addons (VPC CNI, kube-proxy) and k8s addons (autoscaler, fluentbit). My ADO repo that has the .tf files is connected to TF Cloud, meaning my state is remote. I've recently found out that k8s/Terraform won't let you deploy an EKS cluster and its addons in the same run, for some reason (I would get many random errors, at random times). I had to have a separate terraform apply for EKS and the addons, respectively.
So, I've decided to modularize my code.
Before, my main folder looked like this:
├── Deployment
│ └── main.tf
│ └── eks.tf
│ └── addons.tf
Now, my folder looks like this:
└───Deployment
│ │
│ └─── eks_deploy
│ │ main.tf
│ │ eks.tf
│ │
│ └─── addons_deploy
│ │ main.tf
│ │ addons.tf
And so I initialize the same remote backend in both. So far, so good. I went ahead with a terraform apply in my eks_deploy folder. It deployed without problems: a clean EKS cluster with no addons. Now it was time to deploy the addons.
And that's where we have a problem.
My main.tf files are exactly the same in both folders, and the file looks like this:
terraform {
  backend "remote" {}

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 3.66.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = ">= 2.7.1"
    }
    helm = {
      source  = "hashicorp/helm"
      version = ">= 2.4.1"
    }
  }
}

data "aws_eks_cluster" "cluster" {
  name = module.eks-ssp.eks_cluster_id
}

data "aws_eks_cluster_auth" "cluster" {
  name = module.eks-ssp.eks_cluster_id
}

# I am aware you're not supposed to hardcode your creds
provider "aws" {
  access_key = "xxx"
  secret_key = "xxx"
  region     = "xxx"

  assume_role {
    role_arn = "xxx"
  }
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority.0.data)
  token                  = data.aws_eks_cluster_auth.cluster.token
}

provider "helm" {
  kubernetes {
    host                   = data.aws_eks_cluster.cluster.endpoint
    token                  = data.aws_eks_cluster_auth.cluster.token
    cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority.0.data)
  }
}
The EKS cluster deployed without problems because it had an eks.tf file that contains the module and all the information needed to deploy a cluster. However, my addon deployment throws the following errors:
╷
│ Error: Reference to undeclared module
│
│ on addons.tf line 60, in module "eks-ssp-kubernetes-addons":
│ 60: depends_on = [module.eks-ssp.self_managed_node_groups]
│
│ No module call named "eks-ssp" is declared in the root module.
╵
╷
│ Error: Reference to undeclared module
│
│ on main.tf line 22, in data "aws_eks_cluster" "cluster":
│ 22: name = module.eks-ssp.eks_cluster_id
│
│ No module call named "eks-ssp" is declared in the root module.
╵
╷
│ Error: Reference to undeclared module
│
│ on main.tf line 26, in data "aws_eks_cluster_auth" "cluster":
│ 26: name = module.eks-ssp.eks_cluster_id
│
│ No module call named "eks-ssp" is declared in the root module.
This is completely understandable, since the EKS cluster DOES NOT exist in the addon deployment, so the addon deployment has no clue where to actually deploy those addons.
So my question is: how do I perform two different applies for what is essentially the same resource (EKS), with each deployment being fully aware of the other (working as if they were in the same file and deployment)? People have mentioned Terragrunt, but I still don't understand how I could use it in my case, so if that's the solution you propose as well, please describe how to use it. There is also the question of how I would connect the same repo, with two different folders/deployments, to separate applies. Does TF Cloud even allow such a thing? At this point, I'm starting to think that a completely separate workspace, with hardcoded EKS values inside addons.tf, is the only way. Thank you.
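No answer is given in the thread, but for illustration: one common pattern (a hedged sketch, not the asker's code) is to let the addons root module look the cluster up by name rather than through module.eks-ssp, which only exists in the eks_deploy root module. var.eks_cluster_name here is an assumed input shared between the two deployments:

# addons_deploy/main.tf (sketch): reference the already-created cluster by name.
variable "eks_cluster_name" {
  type        = string
  description = "Name of the EKS cluster created by the eks_deploy workspace (assumed input)"
}

data "aws_eks_cluster" "cluster" {
  name = var.eks_cluster_name
}

data "aws_eks_cluster_auth" "cluster" {
  name = var.eks_cluster_name
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority.0.data)
  token                  = data.aws_eks_cluster_auth.cluster.token
}

Terraform Cloud can back this with two workspaces pointing at the same repo but different working directories (eks_deploy and addons_deploy), each with its own apply.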

Terraform Kubernetes metadata name from variable

I'm trying to inject a variable into the Terraform kubernetes_service -> metadata -> name field. I'm getting the following error:
│ Error: metadata.0.name a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
│
│ with module.collector.kubernetes_deployment.dp_collector,
│ on modules/collector/main.tf line 3, in resource "kubernetes_deployment" "dp_collector":
│ 3: name = var.name
Is there any way to do that? From the error description, I guess I can't.
var.name = "app_collector"
Why do I want to do this? I have a couple of microservices whose deployment is the same except for ports and names, so I want to abstract the service into a module.
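The value itself is what fails validation: app_collector contains an underscore, which a DNS-1123 subdomain does not allow. A hedged sketch of one workaround, normalizing the value before it reaches metadata.name (the local name k8s_name is illustrative):

variable "name" {
  type    = string
  default = "app_collector"
}

locals {
  # DNS-1123 allows lower-case alphanumerics, '-' and '.', so swap the
  # underscore for a hyphen and lower-case the result.
  k8s_name = lower(replace(var.name, "_", "-")) # yields "app-collector"
}

output "k8s_name" {
  value = local.k8s_name
}

Inside the module, metadata.name would then reference local.k8s_name instead of var.name; alternatively, a validation block on the variable can reject values that do not match the DNS-1123 regex shown in the error.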