How to prevent hashicorp vault from sealing? - hashicorp-vault

Our hashicorp vault deployment on k8s (on premise) seem to seal itself after few days. I am unable to find a way to keep it always unsealed so that applications which are using it do not fail.

Activate auto-unsealing. Here's my GCP example, in Terraform (I am running Hashicorp Vault on Kubernetes).
resource "google_service_account" "hashicorp_vault" {
project = var.project
account_id = "hashicorp-vault"
display_name = "Hashicorp Vault Service Account"
}
resource "google_service_account_iam_member" "hashicorp_vault_iam_workload_identity_user_member" {
service_account_id = google_service_account.hashicorp_vault.name
role = "roles/iam.workloadIdentityUser"
member = "serviceAccount:${var.project}.svc.id.goog[${helm_release.hashicorp_vault.namespace}/hashicorp-vault]"
}
resource "google_project_iam_custom_role" "hashicorp_vault_role" {
project = var.project
role_id = "hashicorp_vault"
title = "Hashicorp Vault"
permissions = [
"cloudkms.cryptoKeyVersions.useToEncrypt",
"cloudkms.cryptoKeyVersions.useToDecrypt",
"cloudkms.cryptoKeys.get",
]
}
resource "google_project_iam_member" "cicd_bot_role_member" {
project = var.project
role = google_project_iam_custom_role.hashicorp_vault_role.name
member = "serviceAccount:${google_service_account.hashicorp_vault.email}"
}
resource "google_kms_key_ring" "hashicorp_vault" {
project = var.project
location = var.region
name = "hashicorp-vault"
}
resource "google_kms_crypto_key" "hashicorp_vault_recovery_key" {
name = "hashicorp-vault-recovery-key"
key_ring = google_kms_key_ring.hashicorp_vault.id
lifecycle {
prevent_destroy = true
}
}
resource "helm_release" "hashicorp_vault" {
name = "hashicorp-vault"
repository = "https://helm.releases.hashicorp.com"
chart = "vault"
version = var.hashicorp_vault_version
namespace = "hashicorp-vault"
create_namespace = true
set {
name = "server.extraEnvironmentVars.VAULT_SEAL_TYPE"
value = "gcpckms"
}
set {
name = "server.extraEnvironmentVars.GOOGLE_PROJECT"
value = var.project
}
set {
name = "server.extraEnvironmentVars.GOOGLE_REGION"
value = var.region
}
set {
name = "server.extraEnvironmentVars.VAULT_GCPCKMS_SEAL_KEY_RING"
value = google_kms_key_ring.hashicorp_vault.name
}
set {
name = "server.extraEnvironmentVars.VAULT_GCPCKMS_SEAL_CRYPTO_KEY"
value = google_kms_crypto_key.hashicorp_vault_recovery_key.name
}
set {
name = "server.serviceaccount.annotations.iam\\.gke\\.io/gcp-service-account"
value = google_service_account.hashicorp_vault.email
}
}
After doing this, I noticed that my Hashicorp Vault pod was in error state, so I deleted it so it could pick up the new environment variables. Then, it came online with a message indicating that it was ready for "migration" to the new unsealing strategy.
Then, use the operator to migrate to the new sealing strategy:
vault operator unseal -migrate

Related

Vault Helm chart run with terraform does not create an ingress on kubernetes

I'm trying to install Vault on a Kubernetes Cluster by running the Vault Helm chart out of Terraform. For some reason the ingress doesn't get created.
When I forward the pods port the ui comes up fine, so I assume everything is working, but the ingress not being available is tripping me up.
Edit: There are no errors while running terraform apply.
If there is another point where I should look, please tell me.
This is my helm_release resource:
name = "vault"
repository = "https://helm.releases.hashicorp.com"
chart = "vault"
namespace = "vault"
create_namespace = true
set {
name = "ui.enabled"
value = "true"
}
#Set ingress up to use cert-manager provided secret
set {
name = "ingress.enabled"
value = "true"
}
set {
name = "ingress.annotations.cert-manager\\.io/cluster-issuer"
value = "letsencrypt-cluster-prod"
}
set {
name = "ingress.annotations.kubernetes\\.io/ingress\\.class"
value = "nginx"
}
set {
name = "ingress.tls[0].hosts[0]"
value = var.vault_hostname
}
set {
name = "ingress.hosts[0].host"
value = var.vault_hostname
}
set {
name = "ingress.hosts[0].paths[0]"
value = "/"
}
}
I'm relatively new to all of these techs, having worked with puppet before, so if someone could point me in the right direction, I'd be much obliged.
I achieved enabling ingress with a local variable, here is the working example
locals {
values = {
server= {
ingress = {
enabled = var.server_enabled
labels = {
traffic = "external"
}
ingressClassName = "nginx"
annotations = {
"kubernetes.io/tls-acme" = "true"
"nginx.ingress.kubernetes.io/ssl-redirect" = "true"
}
hosts = [{
host = vault.example.com
paths = ["/"]
}]
tls = [
{
secretName = vault-tls-secret
hosts = ["vault.example.com"]
}
]
}
}
}
}
resource "helm_release" "vault" {
name = "vault"
namespace = "vault"
repository = "https://helm.releases.hashicorp.com"
chart = "vault"
version = "0.19.0"
create_namespace = true
# other value to set
#set {
# name = "server.ha.enabled"
#value = "true"
#}
values = [
yamlencode(local.values)
]
}

Automated .kube/config file created with Terraform GKE module doesn't work

We are using the below terraform module to create the GKE cluster and the local config file. But the kubectl command doesn't work. The terminal just keeps on waiting. Then we have to manually run the command gcloud container clusters get-credentials to fetch and set up the local config credentials post the cluster creation, and then the kubectl command works. It doesn't feel elegant to execute the gcloud command after the terraform setup. Therefore please help to correct what's wrong.
terraform {
required_providers {
google = {
source = "hashicorp/google"
version = ">= 3.42.0"
}
google-beta = {
source = "hashicorp/google-beta"
version = ">= 3.0.0"
}
}
}
provider "google" {
credentials = var.gcp_creds
project = var.project_id
region = var.region
zone = var.zone
}
provider "google-beta" {
credentials = var.gcp_creds
project = var.project_id
region = var.region
zone = var.zone
}
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster"
project_id = var.project_id
name = "cluster"
regional = true
region = var.region
zones = ["${var.region}-a", "${var.region}-b", "${var.region}-c"]
network = module.vpc.network_name
subnetwork = module.vpc.subnets_names[0]
ip_range_pods = "gke-pod-ip-range"
ip_range_services = "gke-service-ip-range"
horizontal_pod_autoscaling = true
enable_private_nodes = true
master_ipv4_cidr_block = "${var.control_plane_cidr}"
node_pools = [
{
name = "node-pool"
machine_type = "${var.machine_type}"
node_locations = "${var.region}-a,${var.region}-b,${var.region}-c"
min_count = "${var.node_pools_min_count}"
max_count = "${var.node_pools_max_count}"
disk_size_gb = "${var.node_pools_disk_size_gb}"
auto_repair = true
auto_upgrade = true
},
]
}
# Configure the authentication and authorisation of the cluster
module "gke_auth" {
source = "terraform-google-modules/kubernetes-engine/google//modules/auth"
depends_on = [module.gke]
project_id = var.project_id
location = module.gke.location
cluster_name = module.gke.name
}
# Define a local file that will store the necessary info such as certificate, user and endpoint to access the cluster
resource "local_file" "kubeconfig" {
content = module.gke_auth.kubeconfig_raw
filename = "~/.kube/config"
}
UPDATE:
We found that the automated .kube/config file was created with the wrong public IP of the API Server whereas the one created with the gcloud command contains the correct IP of the API Server. Any guesses why terraform one is fetching the wrong public IP?

Helm - Kubernetes cluster unreachable: the server has asked for the client to provide credentials

I'm trying to deploy an EKS self managed with Terraform. While I can deploy the cluster with addons, vpc, subnet and all other resources, it always fails at helm:
Error: Kubernetes cluster unreachable: the server has asked for the client to provide credentials
with module.eks-ssp-kubernetes-addons.module.ingress_nginx[0].helm_release.nginx[0]
on .terraform/modules/eks-ssp-kubernetes-addons/modules/kubernetes-addons/ingress-nginx/main.tf line 19, in resource "helm_release" "nginx":
resource "helm_release" "nginx" {
This error repeats for metrics_server, lb_ingress, argocd, but cluster-autoscaler throws:
Warning: Helm release "cluster-autoscaler" was created but has a failed status.
with module.eks-ssp-kubernetes-addons.module.cluster_autoscaler[0].helm_release.cluster_autoscaler[0]
on .terraform/modules/eks-ssp-kubernetes-addons/modules/kubernetes-addons/cluster-autoscaler/main.tf line 1, in resource "helm_release" "cluster_autoscaler":
resource "helm_release" "cluster_autoscaler" {
My main.tf looks like this:
terraform {
backend "remote" {}
required_providers {
aws = {
source = "hashicorp/aws"
version = ">= 3.66.0"
}
kubernetes = {
source = "hashicorp/kubernetes"
version = ">= 2.7.1"
}
helm = {
source = "hashicorp/helm"
version = ">= 2.4.1"
}
}
}
data "aws_eks_cluster" "cluster" {
name = module.eks-ssp.eks_cluster_id
}
data "aws_eks_cluster_auth" "cluster" {
name = module.eks-ssp.eks_cluster_id
}
provider "aws" {
access_key = "xxx"
secret_key = "xxx"
region = "xxx"
assume_role {
role_arn = "xxx"
}
}
provider "kubernetes" {
host = data.aws_eks_cluster.cluster.endpoint
cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority.0.data)
token = data.aws_eks_cluster_auth.cluster.token
}
provider "helm" {
kubernetes {
host = data.aws_eks_cluster.cluster.endpoint
token = data.aws_eks_cluster_auth.cluster.token
cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority.0.data)
}
}
My eks.tf looks like this:
module "eks-ssp" {
source = "github.com/aws-samples/aws-eks-accelerator-for-terraform"
# EKS CLUSTER
tenant = "DevOpsLabs2b"
environment = "dev-test"
zone = ""
terraform_version = "Terraform v1.1.4"
# EKS Cluster VPC and Subnet mandatory config
vpc_id = "xxx"
private_subnet_ids = ["xxx","xxx", "xxx", "xxx"]
# EKS CONTROL PLANE VARIABLES
create_eks = true
kubernetes_version = "1.19"
# EKS SELF MANAGED NODE GROUPS
self_managed_node_groups = {
self_mg = {
node_group_name = "DevOpsLabs2b"
subnet_ids = ["xxx","xxx", "xxx", "xxx"]
create_launch_template = true
launch_template_os = "bottlerocket" # amazonlinux2eks or bottlerocket or windows
custom_ami_id = "xxx"
public_ip = true # Enable only for public subnets
pre_userdata = <<-EOT
yum install -y amazon-ssm-agent \
systemctl enable amazon-ssm-agent && systemctl start amazon-ssm-agent \
EOT
disk_size = 10
instance_type = "t2.small"
desired_size = 2
max_size = 10
min_size = 0
capacity_type = "" # Optional Use this only for SPOT capacity as capacity_type = "spot"
k8s_labels = {
Environment = "dev-test"
Zone = ""
WorkerType = "SELF_MANAGED_ON_DEMAND"
}
additional_tags = {
ExtraTag = "t2x-on-demand"
Name = "t2x-on-demand"
subnet_type = "public"
}
create_worker_security_group = false # Creates a dedicated sec group for this Node Group
},
}
}
enable_amazon_eks_vpc_cni = true
amazon_eks_vpc_cni_config = {
addon_name = "vpc-cni"
addon_version = "v1.7.5-eksbuild.2"
service_account = "aws-node"
resolve_conflicts = "OVERWRITE"
namespace = "kube-system"
additional_iam_policies = []
service_account_role_arn = ""
tags = {}
}
enable_amazon_eks_kube_proxy = true
amazon_eks_kube_proxy_config = {
addon_name = "kube-proxy"
addon_version = "v1.19.8-eksbuild.1"
service_account = "kube-proxy"
resolve_conflicts = "OVERWRITE"
namespace = "kube-system"
additional_iam_policies = []
service_account_role_arn = ""
tags = {}
}
#K8s Add-ons
enable_aws_load_balancer_controller = true
enable_metrics_server = true
enable_cluster_autoscaler = true
enable_aws_for_fluentbit = true
enable_argocd = true
enable_ingress_nginx = true
depends_on = [module.eks-ssp.self_managed_node_groups]
}
OP has confirmed in the comment that the problem was resolved:
Of course. I think I found the issue. Doing "kubectl get svc" throws: "An error occurred (AccessDenied) when calling the AssumeRole operation: User: arn:aws:iam::xxx:user/terraform_deploy is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::xxx:user/terraform_deploy"
Solved it by using my actual role, that's crazy. No idea why it was calling itself.
For similar problem look also this issue.
I solved this error by adding dependencies in the helm installations.
The depends_on will wait for the step to successfully complete and then helm module runs.
module "nginx-ingress" {
depends_on = [module.eks, module.aws-load-balancer-controller]
source = "terraform-module/release/helm"
...}
module "aws-load-balancer-controller" {
depends_on = [module.eks]
source = "terraform-module/release/helm"
...}
module "helm_autoscaler" {
depends_on = [module.eks]
source = "terraform-module/release/helm"
...}

Terraform Plan deleting AKS Node Pool

Terraform plan always forces AKS cluster to be recreated if we increase worker node in node pool
Trying Creating AKS Cluster with 1 worker node, via Terraform, it went well , Cluster is Up and running.
Post that, i tried to add one more worker node in my AKS, Terraform Show Plan: 2 to add, 0 to change, 2 to destroy.
Not Sure how can we increase worker node in aks node pool, if it delate the existing node pool.
default_node_pool {
name = var.nodepool_name
vm_size = var.instance_type
orchestrator_version = data.azurerm_kubernetes_service_versions.current.latest_version
availability_zones = var.zones
enable_auto_scaling = var.node_autoscalling
node_count = var.instance_count
enable_node_public_ip = var.publicip
vnet_subnet_id = data.azurerm_subnet.subnet.id
node_labels = {
"node_pool_type" = var.tags[0].node_pool_type
"environment" = var.tags[0].environment
"nodepool_os" = var.tags[0].nodepool_os
"application" = var.tags[0].application
"manged_by" = var.tags[0].manged_by
}
}
Error
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement
Terraform will perform the following actions:
# azurerm_kubernetes_cluster.aks_cluster must be replaced
-/+ resource "azurerm_kubernetes_cluster" "aks_cluster" {
Thanks
Satyam
I tested the same in my environment by creating a cluster with 2 node counts and then changed it to 3 using something like below :
If you are using HTTP_proxy then it will by default force a replacement on that block and that's the reason the whole cluster will get replaced with the new configurations.
So, for a solution you can use lifecycle block in your code as I have done below:
lifecycle {
ignore_changes = [http_proxy_config]
}
The code will be :
resource "azurerm_kubernetes_cluster" "aks_cluster" {
name = "${var.global-prefix}-${var.cluster-id}-${var.envid}-azwe-aks-01"
location = data.azurerm_resource_group.example.location
resource_group_name = data.azurerm_resource_group.example.name
dns_prefix = "${var.global-prefix}-${var.cluster-id}-${var.envid}-azwe-aks-01"
kubernetes_version = var.cluster-version
private_cluster_enabled = var.private_cluster
default_node_pool {
name = var.nodepool_name
vm_size = var.instance_type
orchestrator_version = data.azurerm_kubernetes_service_versions.current.latest_version
availability_zones = var.zones
enable_auto_scaling = var.node_autoscalling
node_count = var.instance_count
enable_node_public_ip = var.publicip
vnet_subnet_id = azurerm_subnet.example.id
}
# RBAC and Azure AD Integration Block
role_based_access_control {
enabled = true
}
http_proxy_config {
http_proxy = "http://xxxx"
https_proxy = "http://xxxx"
no_proxy = ["localhost","xxx","xxxx"]
}
# Identity (System Assigned or Service Principal)
identity {
type = "SystemAssigned"
}
# Add On Profiles
addon_profile {
azure_policy {enabled = true}
}
# Network Profile
network_profile {
network_plugin = "azure"
network_policy = "calico"
}
lifecycle {
ignore_changes = [http_proxy_config]
}
}

Update K8 storage class, persistent volume, and persistent volume claim when K8 secret is updated

I have a K8 cluster that has smb mounted drives connected to an AWS Storage Gateway / file share. We've recently undergone a migration of that SGW to another AWS account and while doing that the IP address and password for that SGW changed.
I noticed that our existing setup has a K8 storage class that looks for a K8 secret called "smbcreds". In that K8 secret they have keys "username" and "password". I'm assuming it's in line with the setup guide for the Helm chart we're using "csi-driver-smb".
I assumed changing the secret used for the storage class would update everything downstream that uses that storage class, but apparently it does not. I'm obviously a little cautious when it comes to potentially blowing away important data, what do I need to do to update everything to use the new secret and IP config?
Here is a simple example of our setup in Terraform -
provider "kubernetes" {
config_path = "~/.kube/config"
config_context = "minikube"
}
resource "helm_release" "container_storage_interface_for_aws" {
count = 1
name = "local-filesystem-csi"
repository = "https://raw.githubusercontent.com/kubernetes-csi/csi-driver-smb/master/charts"
chart = "csi-driver-smb"
namespace = "default"
}
resource "kubernetes_storage_class" "aws_storage_gateway" {
count = 1
metadata {
name = "smbmount"
}
storage_provisioner = "smb.csi.k8s.io"
reclaim_policy = "Retain"
volume_binding_mode = "WaitForFirstConsumer"
parameters = {
source = "//1.2.3.4/old-file-share"
"csi.storage.k8s.io/node-stage-secret-name" = "smbcreds"
"csi.storage.k8s.io/node-stage-secret-namespace" = "default"
}
mount_options = ["vers=3.0", "dir_mode=0777", "file_mode=0777"]
}
resource "kubernetes_persistent_volume_claim" "aws_storage_gateway" {
count = 1
metadata {
name = "smbmount-volume-claim"
}
spec {
access_modes = ["ReadWriteMany"]
resources {
requests = {
storage = "10Gi"
}
}
storage_class_name = "smbmount"
}
}
resource "kubernetes_deployment" "main" {
metadata {
name = "sample-pod"
}
spec {
replicas = 1
selector {
match_labels = {
app = "sample-pod"
}
}
template {
metadata {
labels = {
app = "sample-pod"
}
}
spec {
volume {
name = "shared-fileshare"
persistent_volume_claim {
claim_name = "smbmount-volume-claim"
}
}
container {
name = "ubuntu"
image = "ubuntu"
command = ["sleep", "3600"]
image_pull_policy = "IfNotPresent"
volume_mount {
name = "shared-fileshare"
read_only = false
mount_path = "/data"
}
}
}
}
}
}
My original change was to change the K8 secret "smbcreds" and change source = "//1.2.3.4/old-file-share" to source = "//5.6.7.8/new-file-share"
The solution I settled on was to create a second K8 storage class and persistent volume claim that's connected to the new AWS Storage Gateway. I then switched the K8 deployments to use the new PVC.