Create GKE cluster and namespace with Terraform


I need to create a GKE cluster, then create a namespace and install a database into that namespace through Helm. I currently have gke-cluster.tf, which creates the cluster with a node pool, and helm.tf, which contains the kubernetes provider and a helm_release resource. Terraform creates the cluster first, but then tries to install the database before the namespace exists, so I have to run terraform apply a second time for it to work. I want to avoid splitting the configuration across multiple folders and run terraform apply only once. What is good practice for a situation like this? Thanks for the answers.

The create_namespace argument of helm_release resource can help you.
create_namespace - (Optional) Create the namespace if it does not yet exist. Defaults to false.
https://registry.terraform.io/providers/hashicorp/helm/latest/docs/resources/release#create_namespace
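A minimal sketch of that option, assuming a Helm provider version that supports create_namespace. The release name, chart path and namespace are taken from the example in this answer and are only illustrative:

```hcl
resource "helm_release" "arango-crd" {
  name      = "arango-crd"
  chart     = "./kube-arangodb-crd"
  namespace = "prod"

  # Let Helm create the "prod" namespace if it does not exist yet,
  # so no separate kubernetes_namespace resource is needed.
  create_namespace = true
}
```

With this approach the namespace is managed by the Helm release rather than by Terraform directly, so labels or annotations on the namespace cannot be set this way.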
Alternatively, you can define a dependency between the namespace resource and helm_release like below:
resource "kubernetes_namespace" "prod" {
  metadata {
    annotations = {
      name = "prod-namespace"
    }
    labels = {
      namespace = "prod"
    }
    name = "prod"
  }
}
resource "helm_release" "arango-crd" {
  name       = "arango-crd"
  chart      = "./kube-arangodb-crd"
  namespace  = "prod"
  depends_on = [kubernetes_namespace.prod]
}
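As a variation on the explicit depends_on, the release can reference the namespace resource's attribute, which gives Terraform the same ordering implicitly. This is a sketch reusing the names from the answer, not a tested configuration:

```hcl
resource "kubernetes_namespace" "prod" {
  metadata {
    name = "prod"
  }
}

resource "helm_release" "arango-crd" {
  name  = "arango-crd"
  chart = "./kube-arangodb-crd"

  # Reading the name from the namespace resource makes Terraform
  # create the namespace before the release, without depends_on.
  namespace = kubernetes_namespace.prod.metadata[0].name
}
```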

The solution posted by user adp is correct, but I wanted to give more insight into using Terraform for this particular example with regard to running a single command:
$ terraform apply --auto-approve
Based on the following comments:
Can you tell how are you creating your namespace? Is it with kubernetes provider? - Dawid Kruk
resource "kubernetes_namespace" - Jozef Vrana
This setup needs a specific order of execution: first the cluster, then the resources inside it. By default, Terraform creates resources in parallel whenever it sees no dependency between them, so it is crucial to declare the ordering with depends_on = [VALUE].
The next issue is that the kubernetes provider will try to fetch the credentials from ~/.kube/config at the start of the process. It will not wait for the cluster to be provisioned to get the actual credentials. It could:
fail when there is no .kube/config
fetch credentials for the wrong cluster.
There is an ongoing feature request to resolve this kind of use case (there are also some workarounds):
Github.com: Hashicorp: Terraform: Issue: depends_on for providers
As an example:
# Create cluster
resource "google_container_cluster" "gke-terraform" {
  project            = "PROJECT_ID"
  name               = "gke-terraform"
  location           = var.zone
  initial_node_count = 1
}
# Get the credentials
resource "null_resource" "get-credentials" {
  depends_on = [google_container_cluster.gke-terraform]
  provisioner "local-exec" {
    # Use the same zone variable as the cluster so the two cannot drift apart
    command = "gcloud container clusters get-credentials ${google_container_cluster.gke-terraform.name} --zone=${var.zone}"
  }
}
# Create a namespace
resource "kubernetes_namespace" "awesome-namespace" {
  depends_on = [null_resource.get-credentials]
  metadata {
    name = "awesome-namespace"
  }
}
Assuming that you had previously configured a cluster to work on and did not delete it, the sequence is:
1. Credentials for the previously configured Kubernetes cluster are fetched.
2. Terraform creates a cluster named gke-terraform.
3. Terraform runs a local command to get the credentials for the gke-terraform cluster.
4. Terraform creates a namespace (using the old information):
   if you had another cluster configured in .kube/config, it will create the namespace in that (previous) cluster
   if you deleted your previous cluster, it will try to create the namespace in that (previous) cluster and fail
   if you had no .kube/config, it will fail at the start
Important!
The "helm_release" resource seems to get the credentials when provisioning the resources, not at the start!
As mentioned, you can use the helm provider to provision the resources on your cluster and avoid the issues I described above.
Example of running a single command to create a cluster and provision resources on it:
variable "zone" {
  type    = string
  default = "europe-west3-c"
}
resource "google_container_cluster" "gke-terraform" {
  project            = "PROJECT_ID"
  name               = "gke-terraform"
  location           = var.zone
  initial_node_count = 1
}
data "google_container_cluster" "gke-terraform" {
  project  = "PROJECT_ID"
  name     = "gke-terraform"
  location = var.zone
}
resource "null_resource" "get-credentials" {
  # do not start before resource gke-terraform is provisioned
  depends_on = [google_container_cluster.gke-terraform]
  provisioner "local-exec" {
    command = "gcloud container clusters get-credentials ${google_container_cluster.gke-terraform.name} --zone=${var.zone}"
  }
}
resource "helm_release" "mydatabase" {
  name  = "mydatabase"
  chart = "stable/mariadb"
  # do not start before the get-credentials resource is run
  depends_on = [null_resource.get-credentials]
  set {
    name  = "mariadbUser"
    value = "foo"
  }
  set {
    name  = "mariadbPassword"
    value = "qux"
  }
}
Using the above configuration will yield:
data.google_container_cluster.gke-terraform: Refreshing state...
google_container_cluster.gke-terraform: Creating...
google_container_cluster.gke-terraform: Still creating... [10s elapsed]
<--OMITTED-->
google_container_cluster.gke-terraform: Still creating... [2m30s elapsed]
google_container_cluster.gke-terraform: Creation complete after 2m38s [id=projects/PROJECT_ID/locations/europe-west3-c/clusters/gke-terraform]
null_resource.get-credentials: Creating...
null_resource.get-credentials: Provisioning with 'local-exec'...
null_resource.get-credentials (local-exec): Executing: ["/bin/sh" "-c" "gcloud container clusters get-credentials gke-terraform --zone=europe-west3-c"]
null_resource.get-credentials (local-exec): Fetching cluster endpoint and auth data.
null_resource.get-credentials (local-exec): kubeconfig entry generated for gke-terraform.
null_resource.get-credentials: Creation complete after 1s [id=4191245626158601026]
helm_release.mydatabase: Creating...
helm_release.mydatabase: Still creating... [10s elapsed]
<--OMITTED-->
helm_release.mydatabase: Still creating... [1m40s elapsed]
helm_release.mydatabase: Creation complete after 1m44s [id=mydatabase]
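A commonly used alternative to the local kubeconfig is to configure the providers directly from the cluster resource's attributes, so nothing is read from ~/.kube/config. The sketch below is an assumption layered on top of this answer, not part of it: it reuses the google_container_cluster.gke-terraform resource from above, assumes the google provider is already configured, and uses the nested kubernetes block syntax of the 2.x helm provider. Provider configuration that depends on a resource created in the same apply can behave unexpectedly in some situations, so treat this as something to test rather than a guaranteed fix.

```hcl
# Short-lived access token for the identity Terraform runs as.
data "google_client_config" "default" {}

provider "kubernetes" {
  host  = "https://${google_container_cluster.gke-terraform.endpoint}"
  token = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(
    google_container_cluster.gke-terraform.master_auth[0].cluster_ca_certificate
  )
}

provider "helm" {
  # Assumption: helm provider 2.x, which takes a nested "kubernetes" block.
  kubernetes {
    host  = "https://${google_container_cluster.gke-terraform.endpoint}"
    token = data.google_client_config.default.access_token
    cluster_ca_certificate = base64decode(
      google_container_cluster.gke-terraform.master_auth[0].cluster_ca_certificate
    )
  }
}
```

Because both providers reference the cluster resource, the null_resource that calls gcloud would no longer be needed for Terraform's own access to the cluster.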

Related

Restoring an AWS documentdb snapshot with terraform

I am unsure how to restore an AWS documentdb cluster that is managed by terraform.
My terraform setup looks like this:
resource "aws_docdb_cluster" "this" {
  cluster_identifier              = var.env_name
  engine                          = "docdb"
  engine_version                  = "4.0.0"
  master_username                 = "USERNAME"
  master_password                 = random_password.this.result
  db_cluster_parameter_group_name = aws_docdb_cluster_parameter_group.this.name
  availability_zones              = ["us-east-1a", "us-east-1b", "us-east-1c"]
  db_subnet_group_name            = aws_docdb_subnet_group.this.name
  deletion_protection             = true
  backup_retention_period         = 7
  preferred_backup_window         = "07:00-09:00"
  skip_final_snapshot             = false
  # Added on 6.25.22 to rollback an incorrect application of the namespace
  # migration, which occurred at 2AM EST on June 23.
  snapshot_identifier = "...the arn for the snapshot..."
}
resource "aws_docdb_cluster_instance" "this_2a" {
  count                      = 1
  engine                     = "docdb"
  availability_zone          = "us-east-1a"
  auto_minor_version_upgrade = true
  cluster_identifier         = aws_docdb_cluster.this.id
  instance_class             = "db.r5.large"
}
resource "aws_docdb_cluster_instance" "this_2b" {
  count                      = 1
  engine                     = "docdb"
  availability_zone          = "us-east-1b"
  auto_minor_version_upgrade = true
  cluster_identifier         = aws_docdb_cluster.this.id
  instance_class             = "db.r5.large"
}
resource "aws_docdb_subnet_group" "this" {
  name       = var.env_name
  subnet_ids = module.vpc.private_subnets
}
I added the snapshot_identifier parameter and applied it, expecting a rollback. However, this did not have the intended effect of restoring documentdb state to its settings on June 23rd. (As far as I can tell, nothing changed at all)
I wanted to avoid using the AWS console approach (described here) because that creates a new cluster which won't be tracked by terraform.
What is the proper way of accomplishing this rollback using terraform?

The snapshot_identifier parameter is only used when Terraform creates a new cluster. Setting it after the cluster has been created just tells Terraform "If you ever have to recreate this cluster, use this snapshot".
To actually get Terraform to recreate the cluster you would need to do something else to make Terraform think the cluster needs to be recreated. Possible options are:
Run terraform taint aws_docdb_cluster.this to signal to Terraform that the resource needs to be recreated. It will then recreate it the next time you run terraform apply.
Delete the cluster through some other means, like the AWS console, and then run terraform apply.
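One detail worth checking before forcing a recreate: the configuration in the question sets deletion_protection = true, and a protected cluster cannot be deleted, so the destroy step of the replacement would presumably be rejected. The fragment below is a sketch of the cluster arguments that matter on the recreate path, assuming protection is switched off and applied first. The final_snapshot_identifier value is a hypothetical name, included because skip_final_snapshot = false means a final snapshot is taken when the old cluster is destroyed.

```hcl
resource "aws_docdb_cluster" "this" {
  cluster_identifier = var.env_name
  # ... other arguments unchanged from the question ...

  # Must be false (and applied) before the old cluster can be destroyed.
  deletion_protection = false

  # Hypothetical name for the snapshot taken when the old cluster is destroyed.
  skip_final_snapshot       = false
  final_snapshot_identifier = "${var.env_name}-before-rollback"

  # Only read at creation time: the recreated cluster is restored from this snapshot.
  snapshot_identifier = "...the arn for the snapshot..."
}
```

Once the restored cluster is in place, deletion_protection can be set back to true.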

The general approach is as follows, but I have no experience with DocumentDB. Hope this helps.
0. Take a backup of your Terraform state file:
terraform state pull > backup_state_file_timestamp.json
1. Restore through the console to the point in time you want.
2. Remove the old instances and cluster from your Terraform state file:
terraform state rm aws_docdb_cluster_instance.this_2a
terraform state rm aws_docdb_cluster_instance.this_2b
terraform state rm aws_docdb_cluster.this
3. Import the manually restored cluster and instances into Terraform:
terraform import aws_docdb_cluster.this cluster_identifier
terraform import aws_docdb_cluster_instance.this_2a identifier
terraform import aws_docdb_cluster_instance.this_2b identifier
(see the Import section at the bottom of https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/docdb_cluster_instance and https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/docdb_cluster)
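On Terraform 1.5 or later, the import step can also be written declaratively with import blocks, so that it is carried out by a normal terraform apply. This is a sketch only: the id values are hypothetical placeholders for the identifiers of the restored cluster and instances, and the [0] index is there because the instance resources in the question set count = 1.

```hcl
import {
  to = aws_docdb_cluster.this
  id = "restored-cluster-identifier" # hypothetical placeholder
}

import {
  to = aws_docdb_cluster_instance.this_2a[0]
  id = "restored-instance-2a" # hypothetical placeholder
}

import {
  to = aws_docdb_cluster_instance.this_2b[0]
  id = "restored-instance-2b" # hypothetical placeholder
}
```

The import blocks can be removed from the configuration after the apply that performs the import.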

Terraform Unable to find Helm Release charts

I'm running Kubernetes on GCP and making changes via Terraform v0.11.14.
When running terraform plan I get the following error messages:
Error: Error refreshing state: 2 errors occurred:
* module.cls-xxx-us-central1-a-dev.helm_release.cert-manager: 1 error occurred:
* module.cls-xxx-us-central1-a-dev.helm_release.cert-manager: helm_release.cert-manager: error installing: the server could not find the requested resource
* module.cls-xxx-us-central1-a-dev.helm_release.nginx: 1 error occurred:
* module.cls-xxx-us-central1-a-dev.helm_release.nginx: helm_release.nginx: error installing: the server could not find the requested resource
Here's a copy of my helm.tf
resource "helm_release" "nginx" {
  depends_on = ["google_container_node_pool.tally-np"]
  name       = "ingress-nginx"
  chart      = "ingress-nginx/ingress-nginx"
  namespace  = "kube-system"
}
resource "helm_release" "cert-manager" {
  depends_on = ["google_container_node_pool.tally-np"]
  name       = "cert-manager"
  chart      = "stable/cert-manager"
  namespace  = "kube-system"
  set {
    name  = "ingressShim.defaultIssuerName"
    value = "letsencrypt-production"
  }
  set {
    name  = "ingressShim.defaultIssuerKind"
    value = "ClusterIssuer"
  }
  provisioner "local-exec" {
    command = "gcloud container clusters get-credentials ${var.cluster_name} --zone ${google_container_cluster.cluster.zone} && kubectl create -f ${path.module}/letsencrypt-prod.yaml"
  }
}
I've read that Helm deprecated most of the old chart repos, so I tried adding the repositories and installing the charts locally under the namespace kube-system, but so far the issue persists.
Here's the list of versions for Terraform and its providers:
Terraform v0.11.14
provider.google v2.17.0
provider.helm v0.10.2
provider.kubernetes v1.9.0
provider.random v2.2.1

As the community moves towards Helm v3, the maintainers have deprecated the old Helm model, in which there was a single mono repo called stable. In the new model, each product has its own repo. On November 13, 2020, the stable and incubator chart repositories reached the end of development and became archives.
The archived charts are now hosted at a new URL. To continue using the archived charts, you will have to make some tweaks to your Helm workflow.
Sample workaround:
helm repo add new-stable https://charts.helm.sh/stable
helm fetch new-stable/prometheus-operator
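In Terraform, the equivalent tweak would be to point each helm_release at a repository URL instead of relying on a locally registered repo alias. The sketch below assumes a helm provider version whose helm_release accepts a repository URL, and it assumes the charts from the question are still available at these locations; the ingress-nginx URL is that project's own chart repository, and whether a suitable cert-manager version exists in the archived stable repo is not confirmed here.

```hcl
resource "helm_release" "nginx" {
  name       = "ingress-nginx"
  repository = "https://kubernetes.github.io/ingress-nginx"
  chart      = "ingress-nginx"
  namespace  = "kube-system"
}

resource "helm_release" "cert-manager" {
  name = "cert-manager"
  # Archived location of the old "stable" repo, as given in the workaround above.
  repository = "https://charts.helm.sh/stable"
  chart      = "cert-manager"
  namespace  = "kube-system"
}
```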

Nextflow doesn't use the right service account to deploy workflows to kubernetes

We're trying to use nextflow on a k8s namespace other than our default, the namespace we're using is nextflownamespace. We've created our PVC and ensured the default service account has an admin rolebinding. We're getting an error that nextflow can't access the PVC:
"message": "persistentvolumeclaims \"my-nextflow-pvc\" is forbidden:
User \"system:serviceaccount:mynamespace:default\" cannot get resource
\"persistentvolumeclaims\" in API group \"\" in the namespace \"nextflownamespace\"",
In that error we see that system:serviceaccount:mynamespace:default is incorrectly pointing to our default namespace, mynamespace, not nextflownamespace which we created for nextflow use.
We tried adding debug.yaml = true to our nextflow.config but couldn't find the YAML it submits to k8s to validate the error. Our config file looks like this:
profiles {
standard {
k8s {
executor = "k8s"
namespace = "nextflownamespace"
cpus = 1
memory = 1.GB
debug.yaml = true
}
aws{
endpoint = "https://s3.nautilus.optiputer.net"
}
}
We did verify that when we change the namespace to another arbitrary value, the error message uses the new arbitrary namespace, but the service account name erroneously continues to point to the user's default namespace.
We've tried every variant of profiles.standard.k8s.serviceAccount = "system:serviceaccount:nextflownamespace:default" that we could think of but didn't get any change with those attempts.

I think it's best to avoid using nested config profiles with Nextflow. I would either remove the 'standard' layer from your profile or just make 'standard' a separate profile:
profiles {
standard {
process.executor = 'local'
}
k8s {
executor = "k8s"
namespace = "nextflownamespace"
cpus = 1
memory = 1.GB
debug.yaml = true
}
aws{
endpoint = "https://s3.nautilus.optiputer.net"
}
}

Unable to create windows nodepool on GKE cluster with google terraform GKE module

I am trying to provision a GKE cluster with a Windows node pool using the Google modules. I am calling the module:
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster-update-variant"
version = "9.2.0"
I had to define two pools: a Linux pool, which GKE requires, and the Windows pool we need. Terraform always succeeds in provisioning the Linux node pool but fails to provision the Windows one, with this error message:
module.gke.google_container_cluster.primary: Still modifying... [id=projects/uk-xxx-xx-xxx-b821/locations/europe-west2/clusters/gke-nonpci-dev, 24m31s elapsed]
module.gke.google_container_cluster.primary: Still modifying... [id=projects/uk-xxx-xx-xxx-b821/locations/europe-west2/clusters/gke-nonpci-dev, 24m41s elapsed]
module.gke.google_container_cluster.primary: Still modifying... [id=projects/uk-xxx-xx-xxx-b821/locations/europe-west2/clusters/gke-nonpci-dev, 24m51s elapsed]
module.gke.google_container_cluster.primary: Modifications complete after 24m58s [id=projects/xx-xxx-xx-xxx-b821/locations/europe-west2/clusters/gke-nonpci-dev]
module.gke.google_container_node_pool.pools["windows-node-pool"]: Creating...
Error: error creating NodePool: googleapi: Error 400: Workload Identity is not supported on Windows nodes. Create the nodepool without workload identity by specifying --workload-metadata=GCE_METADATA., badRequest
on .terraform\modules\gke\terraform-google-kubernetes-engine-9.2.0\modules\beta-private-cluster-update-variant\cluster.tf line 341, in resource "google_container_node_pool" "pools":
341: resource "google_container_node_pool" "pools" {
I tried setting this metadata value in many places but couldn't get it right.
From the Terraform side:
I tried adding the metadata inside the node_config scope in the module itself. In my main.tf file, where I call the module, I tried adding it to the windows node_pool scope of the node_pools list, but it was rejected with a message that setting WORKLOAD IDENTITY isn't expected here.
I also tried setting enable_shielded_nodes = false, but this didn't really help.
I then tested whether this is doable at all through the command line. These were my commands:
C:\>gcloud container node-pools --region europe-west2 list
NAME MACHINE_TYPE DISK_SIZE_GB NODE_VERSION
default-node-pool-d916 n1-standard-2 100 1.17.9-gke.600
C:\>gcloud container node-pools --region europe-west2 create window-node-pool --cluster=gke-nonpci-dev --image-type=WINDOWS_SAC --no-enable-autoupgrade --machine-type=n1-standard-2
WARNING: Starting in 1.12, new node pools will be created with their legacy Compute Engine instance metadata APIs disabled by default. To create a node pool with legacy instance metadata endpoints disabled, run `node-pools create` with the flag `--metadata disable-legacy-endpoints=true`.
This will disable the autorepair feature for nodes. Please see https://cloud.google.com/kubernetes-engine/docs/node-auto-repair for more information on node autorepairs.
ERROR: (gcloud.container.node-pools.create) ResponseError: code=400, message=Workload Identity is not supported on Windows nodes. Create the nodepool without workload identity by specifying --workload-metadata=GCE_METADATA.
C:\>gcloud container node-pools --region europe-west2 create window-node-pool --cluster=gke-nonpci-dev --image-type=WINDOWS_SAC --no-enable-autoupgrade --machine-type=n1-standard-2 --workload-metadata=GCE_METADATA --metadata disable-legacy-endpoints=true
This will disable the autorepair feature for nodes. Please see https://cloud.google.com/kubernetes-engine/docs/node-auto-repair for more information on node autorepairs.
ERROR: (gcloud.container.node-pools.create) ResponseError: code=400, message=Service account "874988475980-compute@developer.gserviceaccount.com" does not exist.
C:\>gcloud auth list
Credentialed Accounts
ACTIVE ACCOUNT
* tf-xxx-xxx-xx-xxx@xx-xxx-xx-xxx-xxxx.iam.gserviceaccount.com
The service account shown by gcloud auth list is the one I run Terraform with, but I don't know where the one in the error message comes from. Creating the Windows node pool through the command line, as shown above, didn't work either, so I am a bit stuck and don't know what to do.
Module 9.2.0 has been stable for us across all the Linux-based clusters we set up before, so I thought it might be too old for a Windows node pool. I tried 11.0.0 instead to see if it would make any difference, but ended up with a different error:
module.gke.google_container_node_pool.pools["default-node-pool"]: Refreshing state... [id=projects/uk-tix-p1-npe-b821/locations/europe-west2/clusters/gke-nonpci-dev/nodePools/default-node-pool-d916]
Error: failed to execute ".terraform/modules/gke.gcloud_delete_default_kube_dns_configmap/terraform-google-gcloud-1.4.1/scripts/check_env.sh": fork/exec .terraform/modules/gke.gcloud_delete_default_kube_dns_configmap/terraform-google-gcloud-1.4.1/scripts/check_env.sh: %1 is not a valid Win32 application.
on .terraform\modules\gke.gcloud_delete_default_kube_dns_configmap\terraform-google-gcloud-1.4.1\main.tf line 70, in data "external" "env_override":
70: data "external" "env_override" {
Error: failed to execute ".terraform/modules/gke.gcloud_wait_for_cluster/terraform-google-gcloud-1.3.0/scripts/check_env.sh": fork/exec .terraform/modules/gke.gcloud_wait_for_cluster/terraform-google-gcloud-1.3.0/scripts/check_env.sh: %1 is not a valid Win32 application.
on .terraform\modules\gke.gcloud_wait_for_cluster\terraform-google-gcloud-1.3.0\main.tf line 70, in data "external" "env_override":
70: data "external" "env_override" {
This is how I set the node_pools parameters:
node_pools = [
  {
    name               = "linux-node-pool"
    machine_type       = var.nodepool_instance_type
    min_count          = 1
    max_count          = 10
    disk_size_gb       = 100
    disk_type          = "pd-standard"
    image_type         = "COS"
    auto_repair        = true
    auto_upgrade       = true
    service_account    = google_service_account.gke_cluster_sa.email
    preemptible        = var.preemptible
    initial_node_count = 1
  },
  {
    name               = "windows-node-pool"
    machine_type       = var.nodepool_instance_type
    min_count          = 1
    max_count          = 10
    disk_size_gb       = 100
    disk_type          = "pd-standard"
    image_type         = var.nodepool_image_type
    auto_repair        = true
    auto_upgrade       = true
    service_account    = google_service_account.gke_cluster_sa.email
    preemptible        = var.preemptible
    initial_node_count = 1
  }
]
cluster_resource_labels = var.cluster_resource_labels
# health check and webhook firewall rules
node_pools_tags = {
  all = [
    "xx-xxx-xxx-local-xxx",
  ]
}
node_pools_metadata = {
  all = {
    // workload-metadata = "GCE_METADATA"
  }
  linux-node-pool = {
    ssh-keys               = join("\n", [for user, key in var.node_ssh_keys : "${user}:${key}"])
    block-project-ssh-keys = true
  }
  windows-node-pool = {
    workload-metadata = "GCE_METADATA"
  }
}
This is a shared VPC where I provision my cluster, with cluster version 1.17.9-gke.600.

Check out https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/issues/632 for the solution.
The error message is ambiguous, and GKE has an internal bug to track this issue. We will improve the error message soon.
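For reference, the --workload-metadata=GCE_METADATA flag from the error message corresponds to the workload_metadata_config block on the plain google_container_node_pool resource. The sketch below is not the module's interface and not necessarily the fix described in the linked issue. It assumes google provider 4.x or later, where the argument is called mode (3.x releases used node_metadata with different values), and it reuses the cluster name, region, machine type and image type from the question.

```hcl
resource "google_container_node_pool" "windows" {
  name     = "windows-node-pool"
  cluster  = "gke-nonpci-dev" # cluster name from the question
  location = "europe-west2"

  initial_node_count = 1

  node_config {
    machine_type = "n1-standard-2"
    image_type   = "WINDOWS_SAC" # image type used in the question's gcloud command

    # Equivalent of --workload-metadata=GCE_METADATA:
    # this pool does not use the Workload Identity metadata server.
    workload_metadata_config {
      mode = "GCE_METADATA"
    }
  }
}
```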

Specify namespace when creating kubernetes PV/PVC with Terraform

I am trying to create a PV/PVC on a Kubernetes GKE cluster using Terraform.
However, the documentation does not mention how one can specify the namespace that these resources should be created in.
I have tried adding it in both the spec and the metadata section, but I get an error message:
resource "kubernetes_persistent_volume" "jenkins-persistent-volume" {
  metadata {
    name = "${var.kubernetes_persistent_volume_metadata_name}"
    # tried placing it here -->> namespace = "${var.kubernetes_jenkins_namespace}"
  }
  spec {
    # tried placing it here -->> namespace = "${var.kubernetes_jenkins_namespace}"
    capacity = {
      storage = "${var.kubernetes_persistent_volume_spec_capacity_storage}"
    }
    storage_class_name = "standard"
    access_modes       = ["ReadWriteMany"]
    persistent_volume_source {
      gce_persistent_disk {
        fs_type = "ext4"
        pd_name = "${google_compute_disk.jenkins-disk.name}"
      }
    }
  }
}
Error: module.jenkins.kubernetes_persistent_volume.jenkins-persistent-volume: spec.0: invalid or unknown key: namespace
Where should such a configuration be placed?

Persistent volumes are cluster-global objects and do not live in specific namespaces. ("It is a resource in the cluster just like a node is a cluster resource.") Correspondingly, you can't include a namespace name anywhere on a kubernetes_persistent_volume resource.
If you're running in a cloud environment (and here your PV is creating a Google storage volume), it's typical to create only a persistent volume claim and let the cluster allocate the underlying volume for you. PVCs are namespace-scoped, and the Terraform kubernetes_persistent_volume_claim resource explicitly documents that you can include a namespace in the metadata block.
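A minimal sketch of such a claim, reusing the namespace variable and storage class from the question. The claim name and the requested size are hypothetical, and ReadWriteOnce is used here on the assumption that the volume is backed by a GCE persistent disk attached to a single node:

```hcl
variable "kubernetes_jenkins_namespace" {
  type = string
}

resource "kubernetes_persistent_volume_claim" "jenkins" {
  metadata {
    name = "jenkins-pvc" # hypothetical claim name
    # PVCs are namespace-scoped, so the namespace goes in metadata.
    namespace = var.kubernetes_jenkins_namespace
  }
  spec {
    access_modes       = ["ReadWriteOnce"]
    storage_class_name = "standard"
    resources {
      requests = {
        storage = "10Gi" # hypothetical size
      }
    }
  }
}
```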