Unable to list Nomad job/task logs despite being able to submit Nomad job (using same token to authenticate) - hashicorp-vault

I am using Vault for secrets management to generate short-lived Nomad ACL tokens with which my deployment agent (GoCD) can authenticate against Nomad:
vault read -field=secret_id nomad/creds/gocd
My ACL policy for gocd is:
namespace "default" {
policy = "write"
}
I am using GoCD to submit Nomad jobs for deployment:
nomad job run {{ temp_directory.path }}/{{ service_name }}.nomad
Both of the above steps work as expected. However, when I try to get the logs of a failed Nomad deployment using
nomad alloc logs -token {{ nomad_token.stdout }} -job {{ service_name }} {{ task_name }}
I am getting
"Error reading file: Unexpected response code: 403 (Permission denied)"
According to the Nomad documentation, the "write" policy includes the "read-logs" capability.

I found that adding a "read" policy for node in my ACL policy did the trick:
namespace "default" {
policy = "write"
}
node {
policy = "read"
}
The problem I was facing is explained well in an existing GitHub issue.
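For completeness, the updated policy can be pushed to Nomad with the ACL CLI so that tokens tied to the gocd policy (such as the ones Vault issues via nomad/creds/gocd) pick up the node read capability. A minimal sketch, assuming the policy is named gocd and saved as gocd-policy.hcl:
# Requires a management token; the policy name and file path are assumptions
nomad acl policy apply -description "GoCD deployment policy" gocd ./gocd-policy.hcl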

Related

Flux Terraform controller not picking the correct Terraform state

I have a terraform controller for Flux running with a GitHub provider; however, it seems to be picking up the wrong Terraform state, so it keeps trying to recreate the resources again and again (and fails because they already exist).
This is how it is configured:
apiVersion: infra.contrib.fluxcd.io/v1alpha1
kind: Terraform
metadata:
  name: saas-github
  namespace: flux-system
spec:
  interval: 2h
  approvePlan: "auto"
  workspace: "prod"
  backendConfig:
    customConfiguration: |
      backend "s3" {
        bucket         = "my-bucket"
        key            = "my-key"
        region         = "eu-west-1"
        dynamodb_table = "state-lock"
        role_arn       = "arn:aws:iam::11111:role/my-role"
        encrypt        = true
      }
  path: ./terraform/saas/github
  runnerPodTemplate:
    metadata:
      annotations:
        iam.amazonaws.com/role: pod-role
  sourceRef:
    kind: GitRepository
    name: infrastructure
    namespace: flux-system
Running terraform init locally with a state.config file that has a similar/the same configuration works fine and detects the current state properly:
bucket = "my-bucket"
key = "infrastructure-github"
region = "eu-west-1"
dynamodb_table = "state-lock"
role_arn = "arn:aws:iam::111111:role/my-role"
encrypt = true
Reading the documentation, I also saw a configPath that could be used, so I tried to point it to the state file, but then I got the error:
Failed to initialize kubernetes configuration: error loading config file couldn't get version/kind; json parse error
Which is weird; it looks like it tries to load Kubernetes configuration, not Terraform, or at least it expects a JSON file, which is not the case for my state configuration.
I'm running Terraform 1.3.1 both locally and on the tf runner pod.
On the runner pod I can see the generated_backend_config.tf, and it is the same configuration; .terraform/terraform.tfstate also points to the bucket.
The only suspicious thing on the logs that I could find is this:
- Finding latest version of hashicorp/github...
- Finding integrations/github versions matching "~> 4.0"...
- Finding latest version of hashicorp/aws...
- Installing hashicorp/github v5.9.1...
- Installed hashicorp/github v5.9.1 (signed by HashiCorp)
- Installing integrations/github v4.31.0...
- Installed integrations/github v4.31.0 (signed by a HashiCorp partner, key ID 38027F80D7FD5FB2)
- Installing hashicorp/aws v4.41.0...
- Installed hashicorp/aws v4.41.0 (signed by HashiCorp)
Partner and community providers are signed by their developers.
If you'd like to know more about provider signing, you can read about it here:
https://www.terraform.io/docs/cli/plugins/signing.html
Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.
Warning: Additional provider information from registry
The remote registry returned warnings for
registry.terraform.io/hashicorp/github:
- For users on Terraform 0.13 or greater, this provider has moved to
integrations/github. Please update your source in required_providers.
It seems that it installs two GitHub providers, one from hashicorp and one from integrations... I have changed Terraform/provider versions during development, and I have removed any reference to the hashicorp one, but this warning still happens.
However, it also happens locally, where it reads the correct state, so I don't think it is related.
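In case the duplicate provider install is still a concern: explicitly pinning the provider source in required_providers in every module normally stops Terraform from also resolving the legacy hashicorp/github alias. A minimal sketch, reusing the "~> 4.0" constraint from the init log; any module that references the github provider without such a block can still pull in hashicorp/github:
terraform {
  required_providers {
    github = {
      source  = "integrations/github"
      version = "~> 4.0"
    }
  }
}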

FluxCD on Azure AKS: Reconciler errors

I had to rerun flux bootstrap... on my cluster after a colleague accidentally ran flux bootstrap... on their new cluster using the existing branch and cluster from the same flux repo.
Running kubectl get gitrepositories -A has no errors -
flux-system flux-system ssh://git@git.group.net:7999/psmgsbb/flux.git stored artifact for revision 'master/252f6416c034bb67f06cc3e413e66704bc6b1069'
however I am seeing these errors now when I run flux logs --level=error
error ImagePolicy/post-processing-master-branch-policy.flux-system : Reconciler error cannot determine latest tag for policy version list argument cannot be empty
error HelmRelease/post-processing.post-processing-dev : Reconciler error previous release attempt remediation failed
error ImageRepository/post-processing-repository.flux-system : Reconciler error auth for "myacr.azurecr.io" not found in secret flux-system/psbombb-image-acr-auth-cc8mg5tk84
Regarding the secret above I ran:
kubectl get secret -n flux-system psbombb-image-acr-auth-cc8mg5tk84 -oyaml
which gave me
apiVersion: v1
data:
  .dockerconfigjson: ewoJImRhdGEiOiAie1xuICBcI...<redacted>
kind: Secret
which decodes to
"data": "{
"auths": {
"myacr.azurecr.io": {
"auth":
"YTNlMTNlOGItYWQwNi00M2IzLTkyMjgtMjA0ZmQ2ODllMD<redacted>"
}
}
}"
So the ACR above, myacr.azurecr.io, does match the ACR in the secret, so this error doesn't make sense to me:
Reconciler error auth for "myacr.azurecr.io" not found in secret flux-system/psbombb-image-acr-auth-cc8mg5tk84
So basically, do you know why reconcile fails now after a flux bootstrap?
Thank you
When flux bootstrap... was run accidentally on the cluster, it upgraded the kustomize-controller to version v0.30.2. This was causing an issue with the formatting of the encrypted dockerconfigjson secret being written to Kubernetes.
When the dockerconfigjson contents were base64 decoded, there were line feeds everywhere, which seems to have caused the reconciler error whereby it could not find the ACR reference -> myacr.azurecr.io.
I reverted the kustomize-controller version in gotk-components.yaml back to the version prior to the accidental flux bootstrap..., i.e. from v0.30.2 to v0.22.3.
Once the Kubernetes Secret was recreated with the correct dockerconfigjson format, reconciliation started working correctly.
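For reference, a docker-registry secret in the expected .dockerconfigjson layout can also be recreated by hand with kubectl; a minimal sketch with placeholder credentials (the real secret name carries a kustomize-generated suffix):
kubectl create secret docker-registry psbombb-image-acr-auth \
  --namespace flux-system \
  --docker-server=myacr.azurecr.io \
  --docker-username=<service-principal-id> \
  --docker-password=<service-principal-password>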

Why isn't my `KubernetesPodOperator` using the IRSA I've annotated worker pods with?

I've deployed an EKS cluster using the Terraform module terraform-aws-modules/eks/aws. I’ve deployed Airflow on this EKS cluster with Helm (the official chart, not the community one), and I’ve annotated worker pods with the following IRSA:
serviceAccount:
  # Specifies whether a ServiceAccount should be created
  create: true
  # The name of the ServiceAccount to use.
  # If not set and create is true, a name is generated using the release name
  name: "airflow-worker"
  # Annotations to add to worker kubernetes service account.
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::123456789:role/airflow-worker"
This airflow-worker role has a policy attached to it to enable it to assume a different role.
I have a Python program that assumes this other role and performs some S3 operations. I can exec into a running BashOperator pod, open a Python shell, assume this role, and issue the exact same S3 operations successfully.
But, when I create a Docker image with this program and try to call it from a KubernetesPodOperator task, I see the following error:
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the AssumeRole operation:
User: arn:aws:sts::123456789:assumed-role/core_node_group-eks-node-group-20220726041042973200000001/i-089c64b96cf7878d8 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::987654321:role/TheOtherRole
I don't really know what this role is, but I believe it was created automatically by the Terraform module. However, when I kubectl describe one of these failed pods, I see this:
Environment:
...
...
...
AWS_ROLE_ARN: arn:aws:iam::123456789:role/airflow-worker
My questions:
Why is this role being used, and not the IRSA airflow-worker that I've specified in the Helm chart's values?
What even is this role? It seems the Terraform module creates a number of roles automatically, but it is very difficult to tell what their purpose is or where they're used from the Terraform documentation.
How am I able to assume this role and do everything the Dockerized Python program does when in a shell in the pod? Okay, this is because other operators (such as BashOperator) do use the airflow-worker role. Just not KubernetesPodOperators.
What is the AWS_IAM_ROLE environment variable, and why isn't it being used?
Happy to provide more context if it's helpful.
In order to assume the other AWS role from an EKS pod, you need to add a statement like this to that role's trust policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789:role/airflow-worker"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Here you can find some information about AWS Security Token Service (STS).
Tasks running in the worker pod will use the role automatically, but if you create a new pod, it is separate from your worker pod, so you need to have it use the service account that the role is attached to in order for the AWS role credentials to be injected into the pod.
This is pretty much by design. The non-KubernetesPodOperator operators use an auto-generated pod template file that has the Helm chart values as default properties, while the KubernetesPodOperator needs its own pod template file. That, or it needs to essentially create one by passing arguments to KubernetesPodOperator(...).
I fixed the ultimate issue by passing service_account_name="airflow-worker" to KubernetesPodOperator(...).
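A minimal sketch of that fix; the DAG, task id, namespace, and image here are hypothetical placeholders, and only service_account_name comes from the setup above:
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

# Hypothetical DAG wrapping the Dockerized S3 program.
with DAG(dag_id="s3_job", start_date=datetime(2022, 1, 1), schedule_interval=None, catchup=False) as dag:
    s3_task = KubernetesPodOperator(
        task_id="run_s3_job",
        name="s3-job",
        namespace="airflow",  # assumption: the namespace the Airflow workers run in
        image="123456789.dkr.ecr.eu-west-1.amazonaws.com/my-s3-job:latest",  # placeholder image
        service_account_name="airflow-worker",  # IRSA-annotated service account from the Helm values
        get_logs=True,
    )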

How to fix kubernetes_config_map resource error on a newly provisioned EKS cluster via terraform?

I'm using Terraform to provision an EKS cluster (mostly following the example here). At the end of the tutorial, there's a method of outputting the configmap through the terraform output command, and then applying it to the cluster via kubectl apply -f <file>. I'm attempting to wrap this kubectl command into the Terraform file using the kubernetes_config_map resource, however when running Terraform for the first time, I receive the following error:
Error: Error applying plan:
1 error(s) occurred:
* kubernetes_config_map.config_map_aws_auth: 1 error(s) occurred:
* kubernetes_config_map.config_map_aws_auth: the server could not find the requested resource (post configmaps)
The strange thing is, every subsequent terraform apply works and applies the configmap to the EKS cluster. This leads me to believe it is perhaps a timing issue? I tried to perform a bunch of actions in between provisioning the cluster and applying the configmap, but that didn't work. I also put an explicit depends_on argument to ensure that the cluster has been fully provisioned before attempting to apply the configmap.
provider "kubernetes" {
config_path = "kube_config.yaml"
}
locals {
map_roles = <<ROLES
- rolearn: ${aws_iam_role.eks_worker_iam_role.arn}
username: system:node:{{EC2PrivateDNSName}}
groups:
- system:bootstrappers
- system:nodes
ROLES
}
resource "kubernetes_config_map" "config_map_aws_auth" {
metadata {
name = "aws-auth"
namespace = "kube-system"
}
data {
mapRoles = "${local.map_roles}"
}
depends_on = ["aws_eks_cluster.eks_cluster"]
}
I expect for this to run correctly the first time, however it only runs after applying the same file with no changes a second time.
I attempted to get more information by enabling the TRACE debug flag for terraform, however the only output I got was the exact same error as above.
Well, I don't know if this is still relevant, but I was dealing with the same trouble and found this:
https://github.com/terraform-aws-modules/terraform-aws-eks/issues/699#issuecomment-601136543
So, in other words, I changed the cluster's name in the aws_eks_cluster_auth block to a static name, and it worked. Perhaps this is a bug in TF.
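A minimal sketch of that workaround, assuming the kubernetes provider is switched to token auth and the cluster is literally named "my-eks-cluster"; the key point is that the aws_eks_cluster_auth name is a static string rather than a computed reference:
data "aws_eks_cluster_auth" "cluster" {
  name = "my-eks-cluster"  # static string instead of a reference to the cluster resource
}

provider "kubernetes" {
  host                   = aws_eks_cluster.eks_cluster.endpoint
  cluster_ca_certificate = base64decode(aws_eks_cluster.eks_cluster.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.cluster.token
}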
This seems like a timing issue while bootstrapping your cluster. Your kube-apiserver initially doesn't think there's a configmaps resource.
It's likely that the Role and RoleBinding that it's using to create the ConfigMap have not been fully configured in the cluster to allow it to create a ConfigMap (possibly within the EKS infrastructure), which uses the iam-authenticator and the following policies:
resource "aws_iam_role_policy_attachment" "demo-cluster-AmazonEKSClusterPolicy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
role = "${aws_iam_role.demo-cluster.name}"
}
resource "aws_iam_role_policy_attachment" "demo-cluster-AmazonEKSServicePolicy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSServicePolicy"
role = "${aws_iam_role.demo-cluster.name}"
}
The depends_on clause in Terraform will not do much, since it seems like the timing issue is happening within the EKS service.
I suggest you try the terraform-aws-eks module, which uses the same resource described in the doc. You can also browse through the code if you'd like to figure out how they solve the problem you are seeing.

Rabbitmq-ha helm chart, management plugin throwing error

I have deployed the rabbitmq-ha chart to Kubernetes, then used kubectl port-forwarding to access the management UI. I can log in, but I don't see any data in the UI; some tabs are showing the error:
TypeError: Cannot read property 'name' of undefined TypeError: Cannot read property 'name' of undefined at Array.process (eval at compile (http://localhost:15672/js/ejs-1.0.min.js:1:6654), :100:139) at EJS.render (http://localhost:15672/js/ejs-1.0.min.js:1:1885) at format (http://localhost:15672/js/main.js:1086:21) at http://localhost:15672/js/main.js:444:24 at with_reqs (http://localhost:15672/js/main.js:1068:9) at http://localhost:15672/js/main.js:1064:17 at XMLHttpRequest.req.onreadystatechange (http://localhost:15672/js/main.js:1144:17)
https://github.com/helm/charts/tree/master/stable/rabbitmq-ha
I have deployed in the following way. I have a chart with a single requirement, rabbitmq.
I run the commands
$ helm dependency build ./rabbitmq
$ helm template --namespace rabbitmq-test --name rabbitmq-test . --output-dir ./output
$ kubectl apply -n rabbitmq-test -Rf ./output
/rabbitmq/Chart.yaml
apiVersion: v1
appVersion: "1.0"
description: A Helm chart for Kubernetes
name: rabbitmq-ha
version: 0.1.0
/rabbitmq/requirements.yaml
dependencies:
  - name: rabbitmq-ha
    version: 1.19.0
    repository: https://kubernetes-charts.storage.googleapis.com
/rabbitmq/values.yaml (default settings from GitHub, indented under rabbitmq-ha)
rabbitmq-ha:
  ## RabbitMQ application credentials
  ## Ref: http://rabbitmq.com/access-control.html
  ##
  rabbitmqUsername: guest
  # rabbitmqPassword:
  ...
Everything appears to deploy correctly: I see no errors, I can enter the pod and use rabbitmqctl, the node_health_check command succeeds, and I can create queues, etc.
To access the management UI I run the command
kubectl port-forward -n rabbitmq-test rabbitmq-test-rabbitmq-ha-0 15672:15672
Then visit localhost:15672 and log in.
Which username are you logging in with? The Helm values define application and management credentials. I had the same errors when logging in with the management user; that user only has permission for health checks etc. You need to log in with the guest user.
charts/values.yaml
## RabbitMQ application credentials
## Ref: http://rabbitmq.com/access-control.html
##
rabbitmqUsername: guest
# rabbitmqPassword:
## RabbitMQ Management user used for health checks
managementUsername: management
managementPassword: E9R3fjZm4ejFkVFE
You need to add administrator permissions. I don't know how to do it at the chart level, but with rabbitmqctl you can do it this way:
rabbitmqctl set_user_tags YOUR_USER administrator
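If the user is tagged but still cannot see resources, it may also be missing vhost permissions; a hedged sketch (YOUR_USER and the default / vhost are placeholders):
# Grant configure/write/read permissions on the default vhost (adjust the vhost if needed)
rabbitmqctl set_permissions -p / YOUR_USER ".*" ".*" ".*"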