How to convert/migrate existing google cloud platform infrastructure to terraform or other IaC - kubernetes

Currently we have our kubernetes cluster master set to zonal, and require it to be regional. My idea is to convert the existing cluster and all workloads/nodes/resources to some infrastructure-as-code - preferably terraform (but could be as simple as a set of gcloud commands).
I know with GCP I can generate raw command lines for commands I'm about to run, but I don't know how to convert existing infrastructure to the same (or whether I even can).
Based on my research, it looks like it isn't exactly possible to do what I'm trying to do [in a straight-forward fashion]. So I'm looking for any advice, even if it's just to read some other documentation (for a tool I'm not familiar with maybe).
TL;DR: I'm looking to take my existing Google Cloud Platform Kubernetes cluster and rebuild it in order to change the location type from zonal to regional - I don't actually care how this is done. What is a currently accepted best-practice way of doing this? If there isn't one, what is a quick and dirty way of doing this?
If you require me to specify further, I will - I have intentionally left out linking to specific research I've done.

Creating a Kubernetes cluster with terraform is very straightforward because ultimately making a Kubernetes cluster in GKE is straightforward, you'd just use the google_container_cluster and google_container_node_pool resources, like so:
resource "google_container_cluster" "primary" {
  name               = "${var.name}"
  region             = "${var.region}"
  project            = "${var.project_id}"
  min_master_version = "${var.version}"
  addons_config {
    kubernetes_dashboard {
      disabled = true
    }
  }
  maintenance_policy {
    daily_maintenance_window {
      start_time = "03:00"
    }
  }
  lifecycle {
    ignore_changes = ["node_pool"]
  }
  node_pool {
    name = "default-pool"
  }
}
resource "google_container_node_pool" "default" {
  name    = "default"
  project = "${var.project_id}"
  region  = "${var.region}"
  cluster = "${google_container_cluster.primary.name}"
  autoscaling {
    min_node_count = "${var.node_pool_min_size}"
    max_node_count = "${var.node_pool_max_size}"
  }
  management {
    auto_repair  = "${var.node_auto_repair}"
    auto_upgrade = "${var.node_auto_upgrade}"
  }
  lifecycle {
    ignore_changes = ["initial_node_count"]
  }
  node_config {
    machine_type = "${var.node_machine_type}"
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform",
    ]
  }
  depends_on = ["google_container_cluster.primary"]
}
For a more fully featured experience, there are terraform modules available like this one
Converting an existing cluster is considerably more fraught. If you want to go that route, you can use terraform import:
terraform import google_container_cluster.mycluster us-east1-a/my-cluster
However, in your comment, you mentioned wanting to convert a zonal cluster to a regional cluster. Unfortunately, that's not possible at this time:
You decide whether your cluster is zonal or regional when you create
it. You cannot convert an existing zonal cluster to regional, or vice
versa.
Your best bet, in my opinion, is to:
Create a regional cluster with terraform, giving the cluster a new name
Back up your existing zonal cluster, either using an etcd backup, or a more sophisticated backup using heptio-ark (since renamed Velero)
Restore that backup to your regional cluster
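As a rough sketch of the first step, a regional cluster under a new name could look like the following. This assumes google provider 3.0 or later, where a single location argument takes either a region or a zone; the cluster name, default region, and machine type are placeholders, not values from the question.

```hcl
variable "project_id" {
  type = string
}

variable "region" {
  type    = string
  default = "us-east1" # placeholder; must be a region name, not a zone
}

resource "google_container_cluster" "regional" {
  name     = "my-cluster-regional" # new name, distinct from the zonal cluster
  project  = var.project_id
  location = var.region # a region here yields a regional control plane

  # Drop the default pool and manage node pools as separate resources.
  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "default" {
  name       = "default"
  project    = var.project_id
  location   = var.region
  cluster    = google_container_cluster.regional.name
  node_count = 1 # in a regional cluster this count applies per zone

  node_config {
    machine_type = "e2-medium" # placeholder
    oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
  }
}
```

Because node_count applies per zone, a regional pool with node_count = 1 typically ends up with one node in each of the region's zones.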

I wanted to achieve exactly that: Take existing cloud infrastructure and bring it to infrastructure as code (IaC), i.e. put it in *.tf files
There were basically 2 options that I found and took into consideration:
terraform import (Documentation)
Because of the following limitation, terraform import did not achieve exactly what I was looking for: it requires you to write the resource configuration manually.
The current implementation of Terraform import can only import resources into the state. It does not generate configuration. A future version of Terraform will also generate configuration.
Because of this, prior to running terraform import it is necessary to write manually a resource configuration block for the resource, to which the imported object will be mapped.
Terraformer (GitHub Repo)
A CLI tool that generates tf/json and tfstate files based on existing infrastructure (reverse Terraform).
This tool is provider-agnostic and follows the same flow as terraform, i.e. plan and import. It was able to import specific resources or entire workspaces and convert them into *.tf files.
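For the terraform import option above, newer Terraform releases (1.5 and later) also accept a declarative import block, which narrows the "no generated configuration" gap somewhat. A minimal sketch, reusing the cluster address and ID from the earlier answer purely as an illustration:

```hcl
# Requires Terraform 1.5 or later.
# The address and ID are illustrative; substitute your own resource and location/name.
import {
  to = google_container_cluster.mycluster
  id = "us-east1-a/my-cluster"
}
```

With this block in place and no matching resource block written yet, running terraform plan -generate-config-out=generated.tf asks Terraform to draft the configuration into that file. The generated output is a starting point and usually needs review before it is applied.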

Related

How to upsize volume of Terraformed EKS node

We have been using Terraform for almost a year now to manage all kinds of resources on AWS from bastion hosts to VPCs, RDS and also EKS.
We are sometimes really baffled by the EKS module. It could however be due to lack of understanding (and documentation), so here it goes:
Problem: Upsizing Disk (volume)
module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "12.2.0"
  cluster_name    = local.cluster_name
  cluster_version = "1.19"
  subnets         = module.vpc.private_subnets
  #...
  node_groups = {
    first = {
      desired_capacity = 1
      max_capacity     = 5
      min_capacity     = 1
      instance_type    = "m5.large"
    }
  }
}
I thought the default 20 GB would easily be enough for this (dev) k8s cluster's node, but it's filling up fast, so I now want to change disk_size to, let's say, 40 GB.
=> I thought I could just add something like disk_size=40 and done.
terraform plan tells me I need to replace the node. This is a 1 node cluster, so not good. And even if it weren't, I don't want to e.g. drain nodes. That's why I thought we were using managed k8s like EKS.
Expected behaviour: since these are elastic volumes I should be able to upsize but not downsize, why is that not possible? I can def. do so from the AWS UI.
Sure with a slightly scary warning:
Are you sure that you want to modify volume vol-xx?
It may take some time for performance changes to take full effect.
You may need to extend the OS file system on the volume to use any newly-allocated space
But I can work with the provided docs on that: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html?icmpid=docs_ec2_console
Any guidelines on how to up the storage? If I do so with the UI but don't touch Terraform then my EKS state will be nuked/out of sync.
To my knowledge, there is currently no way to resize an EKS node volume using Terraform without recreating the node.
Fortunately, there is a workaround: as you also found out, you can change the volume size directly via the AWS UI or API. To update your state file afterward, you can run terraform apply -refresh-only to pull in the latest data (e.g., the increased volume size). After that, you can change the volume size in your Terraform configuration to keep configuration and state in sync.
For the future, you might want to look into moving to ephemeral nodes as (at least my) experience shows that you will have unforeseeable changes to clusters and nodes from time to time. Already planning with replaceable nodes in mind will make these changes substantially easier.
By using the terraform-aws-eks terraform module you are actually following the "ephemeral nodes" paradigm, because for both ways of creating instances (self-managed workers or managed node groups) the module is creating Autoscaling Groups that create EC2 instances out of a Launch Template.
ASG and Launch Templates are specifically designed so that you don't care anymore about specific nodes, and rather you just care about the number of nodes. This means that for updating the nodes, you just replace them with new ones, which will use the new updated launch template (with more GBs for example, or with a new updated AMI, or a new instance type).
This is called "rolling updates", and it can be done manually (adding new instances, then draining the node, then deleting the old node), with scripts (see: eks-rolling-update in github by Hellofresh), or it can be done automagically if you use the AWS managed nodes (the ones you are actually using when specifying "node_groups", that is why if you add more GB, it will replace the node automatically when you run apply).
And this paradigm is the most common when operating Kubernetes in the cloud (and also very common on-premise datacenters when using virtualization).
Option 1) Self Managed Workers
With self managed nodes, when you change a parameter like disk_size or instance_type, it will change the Launch Template. It will update the $Latest version, which is commonly what the ASG points to (although this can be changed). This means that old instances will not see any change, but new ones will have the updated configuration.
If you want to change the existing instances, you actually want to replace them with new ones. That is what this ephemeral nodes paradigm is.
One by one you can drain the old instances while increasing the desired capacity of the ASG, or let the cluster autoscaler do the job. Alternatively, you can use an automated script which does this for you for each ASG: https://github.com/hellofresh/eks-rolling-update
In the terraform-aws-eks module, you create self managed workers by using either the worker_groups or the worker_groups_launch_template (recommended) field.
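A self managed worker group with a larger root volume might be declared roughly as below. The module inputs outside the worker group are taken from the question; the keys inside the worker group (root_volume_size, asg_desired_capacity, and so on) are written as I recall them from the module's worker group defaults for the 12.x series, so treat them as assumptions and check them against the version you pin.

```hcl
module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "12.2.0"
  cluster_name    = local.cluster_name         # from the question
  cluster_version = "1.19"
  subnets         = module.vpc.private_subnets # from the question
  vpc_id          = module.vpc.vpc_id          # assumed output of the VPC module

  worker_groups_launch_template = [
    {
      name                 = "workers" # hypothetical group name
      instance_type        = "m5.large"
      root_volume_size     = 40 # GB; assumed key name, verify for your module version
      asg_desired_capacity = 1
      asg_min_size         = 1
      asg_max_size         = 5
    }
  ]
}
```

Changing root_volume_size here only alters the Launch Template; as described above, existing instances keep their old disks until they are replaced.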
Option 2) Managed Nodes
Managed nodes is an EKS-specific feature. You configure them very similarly, but in reality, it is an abstraction, and AWS will create the actual underlying ASG.
You can specify a Launch Template to be used by the ASG and its version. Some config can be specified at the managed node level (i.e. AMI and instance_types) and at the Launch Template (if it wasn't specified in the former).
Any change on the node group level config, or on the Launch Template version, will trigger an automatic rolling update, which will replace all old instances.
You can delay the rolling update by just not pointing to the $Latest version (or pointing to $Default, and not updating the default version when changing the LT).
In the terraform-aws-eks module, you create managed nodes by using the node_groups field. You can also play with these settings: create_launch_template=true and set_instance_types_on_lt=true if you want the module to create the LT for you (alternatively you can skip it, or pass a reference to an existing one) and to set the instance_type on that LT as described above.
But the behavior is similar to worker groups. In no case will your existing instances be changed. You can only change them manually.
However, there is an alternative: The manual way
You can use the EKS module to create the control plane, but then use a regular EC2 resource in terraform (https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance) to create one or multiple (using count or for_each) instances.
If you create the instances using the aws_instance resource, then terraform will patch those instances (update in-place) whenever the change allows it (i.e. increasing the root volume GB or changing the instance type; whereas changing the AMI will force a replacement).
The only tricky part, is that you need to configure the cloud-init script to make the instance join the cluster (something that is automatically done by the EKS module when using self/managed node groups).
However, it is very possible, and you can borrow the script from the module and plug it into the aws_instance's user_data field (https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance#user_data)
In this case (disk_size), however, you still need to manually grow the XFS filesystem (either over SSH, or by running a hacky exec from terraform) so that it sees the increased disk space.
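A stripped-down sketch of that manual approach is shown below. All variable names are hypothetical, and the bootstrap script passed as user_data is assumed to be the one borrowed from the module. The IAM instance profile, security groups, and cluster tags that a real worker needs are left out to keep the example short.

```hcl
# Hypothetical inputs; none of these names come from the EKS module.
variable "worker_ami_id" {
  type = string # an EKS-optimized AMI matching the cluster version
}

variable "worker_subnet_id" {
  type = string
}

variable "worker_user_data" {
  type = string # bootstrap script that joins the node to the cluster
}

resource "aws_instance" "worker" {
  count         = 1
  ami           = var.worker_ami_id
  instance_type = "m5.large"
  subnet_id     = var.worker_subnet_id
  user_data     = var.worker_user_data

  root_block_device {
    volume_size = 40 # GiB; raising this resizes the existing volume
  }

  tags = {
    Name = "eks-worker-${count.index}"
  }
}
```

Raising volume_size only grows the EBS volume; the filesystem on the node still has to be extended separately, as noted above.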
Another alternative: Consider Kubernetes storage
That said, there is also another alternative for certain use cases. If you want to increase the disk space of those instances because of one of your applications using a hostPath, then it might be the case that you can use a kubernetes built-in storage solution using the EBS CSI driver.
For example, I manage an ElasticSearch cluster in Kubernetes (and deploy it from terraform with the helm module), and it uses dynamic storage provisioning to request an EBS volume (note that performance is the same, because both root and this other volume are EBS volumes). EBS CSI driver supports volume expansion, so I can just increase this disk by changing a terraform variable.
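To illustrate the idea without the Helm chart, here is a sketch using the kubernetes provider directly, which is a swap from the Helm-based setup described above. It assumes the EBS CSI driver is already installed in the cluster; the resource names, the gp3 volume type, and the size variable are placeholders.

```hcl
variable "data_volume_size" {
  type    = string
  default = "40Gi" # raise this value to request a larger volume
}

resource "kubernetes_storage_class" "ebs_expandable" {
  metadata {
    name = "ebs-expandable" # hypothetical name
  }

  storage_provisioner    = "ebs.csi.aws.com"
  volume_binding_mode    = "WaitForFirstConsumer"
  allow_volume_expansion = true # needed for in-place growth of claims

  parameters = {
    type = "gp3"
  }
}

resource "kubernetes_persistent_volume_claim" "data" {
  metadata {
    name = "app-data" # hypothetical name
  }

  spec {
    access_modes       = ["ReadWriteOnce"]
    storage_class_name = kubernetes_storage_class.ebs_expandable.metadata[0].name

    resources {
      requests = {
        storage = var.data_volume_size
      }
    }
  }

  # The claim binds only once a pod uses it, so do not block the apply on binding.
  wait_until_bound = false
}
```

Volume expansion only goes one way: the requested size can be increased but not decreased.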
To conclude, I would not recommend the aws_instance way, unless you understand it and are sure you really want it. It may make sense in certain cases, but it is definitely not common.

terraform remote backend using postgres

I am planning to use Postgres as the remote backend instead of S3, as an enterprise standard.
terraform {
  backend "pg" {
    conn_str = "postgres://user:pass@db.example.com/schema_name"
  }
}
With the postgres remote backend, when we run terraform init we have to provide a schema that is specific to that terraform folder, because the backend supports only one table and creates a new record per workspace name.
I am stuck now: I have 50 projects, each with 2 tiers maintained in different folders, so we would need to create 100 schemas in postgres. It is also difficult to handle so many schemas in automated provisioning.
Can we handle this similarly to S3, where we have one bucket for all projects and multiple entries in the same bucket, each with a different key specified in each terraform script? Can we use a single schema for all projects, with multiple tables/records based on a key provided in the backend configuration of each terraform folder?
You can use a single database, and the pg backend will automatically create the schema you specify.
Something like this:
terraform {
  backend "pg" {
    conn_str    = "postgres://user:pass@db.example.com/terraform_backend"
    schema_name = "fooapp"
  }
}
This keeps the projects unique, at least. You could append a tier to that, too, or use Terraform Workspaces.
If you specify the config on the command line (aka partial configuration), as the backend documentation recommends, it might make it easier to set dynamically for your use case:
terraform init \
  -backend-config="conn_str=postgres://user:pass@db.example.com/terraform_backend" \
  -backend-config="schema_name=fooapp-prod"
This works pretty well in my scenario, which is similar to yours. Each project has a unique schema in a shared database, and no tasks beyond the initial creation/configuration of the database are needed - the backend creates the schema as specified.
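With 100 folders, one way to keep this manageable is to leave the backend block empty in every folder and keep one small settings file per project and tier. The sketch below assumes the pg backend's schema argument is named schema_name; the file path and schema name are hypothetical.

```hcl
# main.tf: the backend body stays empty; values are supplied at init time.
terraform {
  backend "pg" {}
}

# backend/fooapp-prod.hcl (hypothetical file, one per project and tier),
# passed as: terraform init -backend-config=backend/fooapp-prod.hcl
#
#   conn_str    = "postgres://user:pass@db.example.com/terraform_backend"
#   schema_name = "fooapp_prod"
```

Keeping the connection string out of the committed configuration also avoids storing the database password in version control.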

Kubernetes - Upgrading Kubernetes-cluster version through Terraform

I assume there are no stupid questions, so here is one that I could not find a direct answer to.
The situation
I currently have a Kubernetes cluster running 1.15.x on AKS, deployed and managed through Terraform. Azure recently announced that they would retire the 1.15 version of Kubernetes on AKS, and I need to upgrade the cluster to 1.16 or later. Now, as I understand the situation, upgrading the cluster directly in Azure would have no consequences for the content of the cluster, i.e. nodes, pods, secrets and everything else currently on there, but I cannot find any proper answer to what would happen if I upgrade the cluster through Terraform.
Potential problems
So what could go wrong? In my mind, the worst outcome would be that the entire cluster would be destroyed, and a new one would be created. No pods, no secrets, nothing. Since there is so little information out there, I am asking here, to see if there is anyone with more experience with Terraform and Kubernetes who could potentially help me out.
To summarize:
Terraform versions
Terraform v0.12.17
+ provider.azuread v0.7.0
+ provider.azurerm v1.37.0
+ provider.random v2.2.1
What I'm doing
$ terraform init
// running terraform plan with new Kubernetes version declared for AKS
$ terraform plan
// Following changes are announced by Terraform:
An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
~ update in-place
Terraform will perform the following actions:
# module.mycluster.azurerm_kubernetes_cluster.default will be updated in-place...
...
~ kubernetes_version = "1.15.5" -> "1.16.13"
...
Plan: 0 to add, 1 to change, 0 to destroy.
What I want to happen
Terraform will tell Azure to upgrade the existing AKS-service, not destroy before creating a new one. I assume that this will happen, as Terraform announces that it will "update in-place", instead of adding new and/or destroying existing clusters.
I found this question today and thought I'd add my experience as well. I made the following changes:
Changed the kubernetes_version under azurerm_kubernetes_cluster from "1.16.15" -> "1.17.16"
Changed the orchestrator_version under default_node_pool from "1.16.15" -> "1.17.16"
Increased the node_count under default_node_pool from 1 -> 2
A terraform plan showed that it was going to update in-place. I then performed a terraform apply which completed successfully. kubectl get nodes showed that an additional node was created, but both nodes in the pool were still on the old version. After further inspection in Azure Portal it was found that only the k8s cluster version was upgraded and not the version of the node pool. I then executed terraform plan again and again it showed that the orchestrator_version under default_node_pool was going to be updated in-place. I then executed terraform apply which then proceeded to upgrade the version of the node pool. It did that whole thing where it creates an additional node in the pool (with the new version) and sets the status to NodeSchedulable while setting the existing node in the pool to NodeNotSchedulable. The NodeNotSchedulable node is then replaced by a new node with the new k8s version and eventually set to NodeSchedulable. It did this for both nodes. Afterwards all nodes were upgraded without any noticeable downtime.
I'd say this shows that the Terraform method is non-destructive, even if there have at times been oversights in the upgrade process (but still non-destructive in this example): https://github.com/terraform-providers/terraform-provider-azurerm/issues/5541
If you need higher confidence for this change, then you could alternatively consider using the Azure-based upgrade method, refreshing the changes back into your state, and tweaking the code until a plan doesn't show anything intolerable. The two azurerm_kubernetes_cluster arguments dealing with version might be all you need to tweak.
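For reference, the two version arguments mentioned in the answers sit roughly as shown below. This sketch assumes a 2.x or later azurerm provider (the default_node_pool block is not available in the 1.37.0 provider listed in the question); the names, VM size, and variables are placeholders.

```hcl
variable "location" {
  type = string
}

variable "resource_group_name" {
  type = string
}

variable "kubernetes_version" {
  type    = string
  default = "1.17.16"
}

resource "azurerm_kubernetes_cluster" "default" {
  name                = "mycluster" # placeholder
  location            = var.location
  resource_group_name = var.resource_group_name
  dns_prefix          = "mycluster" # placeholder

  # Control plane version.
  kubernetes_version = var.kubernetes_version

  default_node_pool {
    name       = "default"
    vm_size    = "Standard_DS2_v2" # placeholder
    node_count = 2

    # Node pool version; kept on the same variable as the control plane.
    orchestrator_version = var.kubernetes_version
  }

  identity {
    type = "SystemAssigned"
  }
}
```

Driving both arguments from one variable states the intent in a single place. As the experience above suggests, it may still take a second plan and apply before the node pool catches up with the control plane.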

Redshift COPY command takes forever

I have created a brand new Redshift cluster with Terraform. I am able to connect to it and run the COPY command, but even for small CSV files with 10 lines it takes forever. I have rebooted the cluster but the result is the same. Also, there are no errors in the stl_load_errors table.
Did someone experience similar issue?
Here is Terraform code for creating VPC endpoint. Maybe at some point I will create complete template for Redshift and update this post.
resource "aws_vpc_endpoint" "s3_vpc_endpoint" {
  vpc_id       = aws_vpc.vpc.id
  service_name = "com.amazonaws.${var.region}.s3"
}
resource "aws_vpc_endpoint_route_table_association" "s3_vpc_endpoint_route_table_association" {
  route_table_id  = aws_route_table.route_table.id
  vpc_endpoint_id = aws_vpc_endpoint.s3_vpc_endpoint.id
}

In terraform, is there a way to refresh the state of a resource using TF files without using CLI commands?

I have a requirement to refresh the state of a resource "ibm_is_image" using TF files, without using CLI commands.
I know that we can import the state of a resource using "terraform import ". But I should do the same using IaC in TF files.
How can I achieve this?
Example:
In workspace1, I create a resource "f5_custom_image" which gets deleted later from command line. In workspace2, the same code in TF file will assume that "f5_custom_image" already exists and it fails to read the custom image resource. So, my code has to refresh the terraform state of this resource for every execution of "terraform apply":
resource "ibm_is_image" "f5_custom_image" {
  depends_on       = ["data.ibm_is_images.custom_images"]
  href             = "${local.image_url}"
  name             = "${var.vnf_vpc_image_name}"
  operating_system = "centos-7-amd64"
  timeouts {
    create = "30m"
    delete = "10m"
  }
}
In Terraform's model, an object is fully managed by a single Terraform configuration and nothing else. Having an object be managed by multiple configurations or having an object be created by Terraform but then deleted later outside of Terraform is not a supported workflow.
Terraform is intended for managing long-lived architecture that you will gradually update over time. It is not designed to manage build artifacts like machine images that tend to be created, used, and then destroyed.
The usual architecture for this sort of use-case is to consider the creation of the image as a "build" step, carried out using some other software outside of Terraform, and then we use Terraform only for the "deploy" step, at which point the long-lived infrastructure is either created or updated to use the new image.
That leads to a build and deploy pipeline with a series of steps like this:
Use separate image build software to construct the image, and record the id somewhere from which it can be retrieved using a data source in Terraform.
Run terraform apply to update the long-lived infrastructure to make use of the new image. The Terraform configuration should include a data block to read the image id from wherever it was recorded in the previous step.
If desired, destroy the image using software outside of Terraform once Terraform has completed.
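On the Terraform side, the "read the image id" part of the steps above could be as small as a data block. The sketch below assumes the IBM Cloud provider's ibm_is_image data source looks an image up by name, and reuses the variable name from the question; the output is only there to show where the id becomes available.

```hcl
variable "vnf_vpc_image_name" {
  type = string # name under which the build step registered the image
}

# Read-only lookup: Terraform does not create or delete the image here.
data "ibm_is_image" "f5_custom_image" {
  name = var.vnf_vpc_image_name
}

output "f5_custom_image_id" {
  value = data.ibm_is_image.f5_custom_image.id
}
```

Because a data block is re-read on every plan, each terraform apply picks up whatever image currently carries that name, without any import or refresh step for the image itself.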
When implementing a pipeline like this, it's optional but common to also consider a "rollback" process to use in case the new image is faulty:
Reset the recorded image id that Terraform is reading from back to the id that was stored prior to the new build step.
Run terraform apply to update the long-lived infrastructure back to using the old image.
Of course, supporting that would require retaining the previous image long enough to prove that the new image is functioning correctly, so the normal build and deploy pipeline would need to retain at least one historical image per run to roll back to. With that said, if you have a means to quickly recreate a prior image during rollback then this special workflow isn't strictly needed: instead, you can implement rollback instead by "rolling forward" to an image constructed with the prior configuration.
An example software package commonly used to prepare images for use with Terraform on other cloud vendors is HashiCorp Packer, but sadly it looks like it does not have IBM Cloud support and so you may need to look for some similar software that does support IBM Cloud, or write something yourself using the IBM Cloud SDK.