Deployment manager fails with a cryptic error - kubernetes

So, I have a deployment manager configuration that is supposed to create a GKE cluster.
Here is the configuration:
resources:
- name: mycluster
  type: container.v1.cluster
  properties:
    zone: us-central1-a
    cluster:
      name: mycluster
      description: hello mycluster
      masterAuth:
        username: admin
        password: password
      loggingService: logging.googleapis.com
      monitoringService: monitoring.googleapis.com
      addonsConfig:
        httpLoadBalancing:
          disabled: false
        horizontalPodAutoscaling:
          disabled: false
      nodePools:
      - name: default
        initialNodeCount: 3
        config:
          machineType: n1-standard-1
          diskSizeGb: 100
          oauthScopes:
          - https://www.googleapis.com/auth/compute
          - https://www.googleapis.com/auth/devstorage.read_only
          - https://www.googleapis.com/auth/logging.write
          - https://www.googleapis.com/auth/monitoring
          labels:
            nodepool: default
        autoscaling:
          enabled: false
      - name: other
        initialNodeCount: 2
        config:
          machineType: n1-standard-2
          diskSizeGb: 100
          oauthScopes:
          - https://www.googleapis.com/auth/compute
          - https://www.googleapis.com/auth/devstorage.read_only
          - https://www.googleapis.com/auth/logging.write
          - https://www.googleapis.com/auth/monitoring
          labels:
            nodepool: other
        autoscaling:
          enabled: false
      locations:
      - us-central1-a
      - us-central1-b
      - us-central1-c
When I run gcloud deployment-manager deployments create mydeployment --config config.yaml, the deployment runs for a few minutes and then fails with:
ERROR: (gcloud.deployment-manager.deployments.create) Error in Operation operation-xxxxxxxxxx-xxxxxxxxx-xxxxxxx-xxxxxxx:
errors:
- code: RESOURCE_ERROR
location: /deployments/mydeployment/resources/mycluster
message: 'Unexpected response from resource of type container.v1.cluster: 404 {"statusMessage":"Not
Found","requestPath":null}'
The cluster actually does get successfully created in GKE, and I can interact with it as normal. Deleting the failed deployment with gcloud deployment-manager deployments delete mydeployment deletes the deployment, but leaves the cluster hanging around.
What am I doing wrong here? I've tried other container.v1.cluster samples from around the web (such as https://github.com/mkarthikworld/caddy/tree/master/gke-caddy); they all fail for me with the same error.
Not sure where else to look.
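One avenue that may be worth trying (a sketch only, not a confirmed fix): declare the cluster with the newer gcp-types/container-v1 typed resource instead of the legacy container.v1.cluster type. The [PROJECT_ID] placeholder is an assumption, and the rest of the cluster body can stay as in the config above:
resources:
- name: mycluster
  type: gcp-types/container-v1:projects.locations.clusters
  properties:
    parent: projects/[PROJECT_ID]/locations/us-central1-a
    cluster:
      name: mycluster
      description: hello mycluster
      nodePools:
      - name: default
        initialNodeCount: 3
        config:
          machineType: n1-standard-1
          diskSizeGb: 100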

Related

velero with velero-plugin-for-aws backup jobs failed

I have a k3s cluster. I installed Velero with Helm:
helm install vmware-tanzu/velero --namespace velero-minio -f helm-custom-values-minio.yaml --generate-name --create-namespace
and
helm install vmware-tanzu/velero --namespace velero-aws -f helm-custom-values-aws.yaml --generate-name --create-namespace
Custom helm values:
helm-custom-values-minio.yaml
configuration:
  provider: aws
  backupStorageLocation:
    bucket: k3s-backup
    name: minio
    default: false
    config:
      region: minio
      s3ForcePathStyle: true
      s3Url: http://10.10.5.15:9009
  volumeSnapshotLocation:
    name: minio
    config:
      region: minio
credentials:
  secretContents:
    cloud: |
      [default]
      aws_access_key_id=minioadm
      aws_secret_access_key=<password>
initContainers:
- name: velero-plugin-for-aws
  image: velero/velero-plugin-for-aws:latest
  imagePullPolicy: IfNotPresent
  volumeMounts:
  - mountPath: /target
    name: plugins
snapshotsEnabled: true
deployRestic: true
and helm-custom-values-aws.yaml
configuration:
  provider: aws
  backupStorageLocation:
    name: aws-s3
    bucket: k3s-backup-aws
    default: false
    provider: aws
    config:
      region: us-east-1
      s3ForcePathStyle: false
  volumeSnapshotLocation:
    name: aws-s3
    provider: aws
    config:
      region: us-east-1
credentials:
  secretContents:
    cloud: |
      [default]
      aws_access_key_id=A..............MJ
      aws_secret_access_key=qZ79rA/yVUq2c................xnIA
initContainers:
- name: velero-plugin-for-aws
  image: velero/velero-plugin-for-aws:latest
  imagePullPolicy: IfNotPresent
  volumeMounts:
  - mountPath: /target
    name: plugins
snapshotsEnabled: true
deployRestic: true
velero backup jobs:
velero create backup k3s-mongodb-restic-minio --include-namespaces mongodb --default-volumes-to-restic=true --storage-location minio -n velero-minio
velero create backup k3s-mongodb-restic-aws --include-namespaces mongodb --default-volumes-to-restic=true --storage-location aws-s3 -n velero-aws
....
They all failed:
Restic Backups:
Failed:
mongodb/mongodb-cluster-0: agent-scripts, data-volume, healthstatus, hooks, logs-volume, mongodb-cluster-keyfile, tmp
mongodb/mongodb-cluster-1: agent-scripts, data-volume, healthstatus, hooks, logs-volume, mongodb-cluster-keyfile, tmp
time="2022-10-17T17:42:32Z" level=error msg="Error backing up item" backup=velero-minio/k3s-mongodb-restic-minio error="pod volume backup failed: running Restic backup, stderr=Fatal: unable to open config file: Stat: The Access Key Id you provided does not exist in our records.\nIs there a repository at the following location?\ns3:http://10.10.5.15:9009/k3s-backup/restic/mongodb\n: exit status 1" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:199" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:417" name=mongodb-cluster-0
...
velero get backup-locations -n velero-aws
NAME PROVIDER BUCKET/PREFIX PHASE LAST VALIDATED ACCESS MODE DEFAULT
aws-s3 aws k3s-backup-aws Available 2022-10-17 14:12:46 -0400 EDT ReadWrite
...
velero get backup-locations -n velero-minio
NAME PROVIDER BUCKET/PREFIX PHASE LAST VALIDATED ACCESS MODE DEFAULT
minio aws k3s-backup Available 2022-10-17 14:16:25 -0400 EDT ReadWrite
The Velero backup part works without errors, but restic fails for all my jobs (MongoDB is just one example).
It looks like restic can't create snapshots for my NFS PVCs.
What am I doing wrong?
It looks like Velero doesn't work with multiple installations; at least the restic part fails (in my case, two instances in the namespaces velero-aws and velero-minio).
So, I installed only one instance of velero to work with minio.
Removed --default-volumes-to-restic=true from the backup job configuration.
Used opt-in pod volume backup with the restic integration.
Each pod that has a PVC volume needs to be annotated, like the following:
kubectl -n mongodb annotate pod/mongodb-cluster-0 backup.velero.io/backup-volumes=logs-volume,data-volume
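For pods that are recreated by a controller (a Deployment or StatefulSet you manage), the same opt-in annotation can also be set on the pod template so it survives pod restarts. A minimal sketch with hypothetical object names; backup.velero.io/backup-volumes is the annotation the restic integration reads:
# fragment of the workload spec, merged into the existing Deployment/StatefulSet
spec:
  template:
    metadata:
      annotations:
        # volumes listed here are backed up by Velero's restic (pod volume) backup
        backup.velero.io/backup-volumes: logs-volume,data-volume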
I have not tried velero-pvc-watcher; it probably works well too.
Now backup works with no errors.

Is EFS a good logs backup option if the Loki pod is terminated accidentally in EKS Fargate?

I am currently using Loki to store logs generated by my applications on EKS Fargate. The sidecar pattern with promtail is used to scrape logs. A single Loki pod is used, and S3 is configured as the destination to store logs. It works nicely as expected. However, when I tested the availability of the logging system by deleting pods, I discovered that if Loki's pod was deleted, some logs would be missing (roughly the 20 minutes of logs before the pod was deleted), even after the pod restarted.
To solve this problem, I tried to use EFS as the persistent volume of Loki's pod, mounted at /loki. I followed this article for the whole process (https://aws.amazon.com/blogs/aws/new-aws-fargate-for-amazon-eks-now-supports-amazon-efs/), but I got an error from the Loki pod with msg="error running loki" err="mkdir /loki/compactor: permission denied".
Therefore, I have 2 questions in my mind:
Should I use EFS as a solution for log backup in my case?
Why did I get a permission denied error inside the pod, and is there any way to solve this problem?
My Loki-config.yaml
auth_enabled: false
server:
  http_listen_port: 3100
  # grpc_listen_port: 9096
ingester:
  wal:
    enabled: true
    dir: /loki/wal
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    # final_sleep: 0s
  chunk_idle_period: 3m
  chunk_retain_period: 30s
  max_transfer_retries: 0
  chunk_target_size: 1048576
schema_config:
  configs:
  - from: 2020-05-15
    store: boltdb-shipper
    object_store: aws
    schema: v11
    index:
      prefix: index_
      period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
    shared_store: s3
  aws:
    bucketnames: bucketnames
    endpoint: s3.us-west-2.amazonaws.com
    region: us-west-2
    access_key_id: access_key_id
    secret_access_key: secret_access_key
    sse_encryption: true
compactor:
  working_directory: /loki/compactor
  shared_store: s3
  compaction_interval: 5m
limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 48h
chunk_store_config:
  max_look_back_period: 0s
table_manager:
  retention_deletes_enabled: true
  retention_period: 96h
querier:
  query_ingesters_within: 0
analytics:
  reporting_enabled: false
Deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: fargate-api-dev
  name: dev-loki
spec:
  selector:
    matchLabels:
      app: dev-loki
  template:
    metadata:
      labels:
        app: dev-loki
    spec:
      volumes:
      - name: loki-config
        configMap:
          name: dev-loki-config
      - name: dev-loki-efs-pv
        persistentVolumeClaim:
          claimName: dev-loki-efs-pvc
      containers:
      - name: loki
        image: loki:2.6.1
        args:
        - -print-config-stderr=true
        - -config.file=/tmp/loki.yaml
        resources:
          limits:
            memory: "500Mi"
            cpu: "200m"
        ports:
        - containerPort: 3100
        volumeMounts:
        - name: dev-loki-config
          mountPath: /tmp
          readOnly: false
        - name: dev-loki-efs-pv
          mountPath: /loki
Promtail-config.yaml
server:
  log_level: info
  http_listen_port: 9080
clients:
- url: http://loki.com/loki/api/v1/push
positions:
  filename: /run/promtail/positions.yaml
scrape_configs:
- job_name: api-log
  static_configs:
  - targets:
    - localhost
    labels:
      job: apilogs
      pod: ${POD_NAME}
      __path__: /var/log/*.log
I had a similar issue using EFS as the volume to store the logs, and I found this solution: https://github.com/grafana/loki/issues/2018#issuecomment-1030221498
Basically, the Loki container on its own is not able to create the directories it needs to start, so we used an initContainer to create them for it.
This solution worked like a charm for me.
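For reference, a minimal sketch of what that looks like in the Deploy.yaml above; the busybox image and the 10001 UID (the user the upstream Grafana Loki image runs as) are assumptions to check against your own image:
  template:
    spec:
      initContainers:
      - name: init-loki-dirs
        image: busybox:1.35   # assumption: any small image with a shell works
        command:
        - sh
        - -c
        # pre-create the directories Loki needs on the EFS mount and hand them to Loki's UID/GID
        - mkdir -p /loki/compactor /loki/wal /loki/index /loki/index_cache && chown -R 10001:10001 /loki
        volumeMounts:
        - name: dev-loki-efs-pv
          mountPath: /loki
      containers:
      - name: loki
        # ... unchanged from the Deploy.yaml in the question ...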

How to wait until env for appid is created in jelastic manifest installation?

I have the following manifest:
jpsVersion: 1.3
jpsType: install
application:
  id: shopozor-k8s-cluster
  name: Shopozor k8s cluster
  version: 0.0
  baseUrl: https://raw.githubusercontent.com/shopozor/services/dev
  settings:
    fields:
    - name: envName
      caption: Env Name
      type: string
      default: shopozor
    - name: topo
      type: radio-fieldset
      values:
        0-dev: '<b>Development:</b> one master (1) and one scalable worker (1+)'
        1-prod: '<b>Production:</b> multi master (3) with API balancers (2+) and scalable workers (2+)'
      default: 0-dev
    - name: version
      type: string
      caption: Version
      default: v1.16.3
  onInstall:
  - installKubernetes
  - enableSubDomains
  actions:
    installKubernetes:
      install:
        jps: https://github.com/jelastic-jps/kubernetes/blob/${settings.version}/manifest.jps
        envName: ${settings.envName}
        displayName: ${settings.envName}
        settings:
          deploy: cmd
          cmd: |-
            curl -fsSL ${baseUrl}/scripts/install_k8s.sh | /bin/bash
          topo: ${settings.topo}
          dashboard: version2
          ingress-controller: Nginx
          storage: true
          api: true
          monitoring: true
          version: ${settings.version}
          jaeger: false
    enableSubDomains:
    - jelastic.env.binder.AddDomains[cp]:
        domains: staging,api-staging,assets-staging,api,assets
Unfortunately, when I run that manifest, the k8s cluster gets installed, but the subdomains cannot be created (yet), because:
[15:26:28 Shopozor.cluster:3]: enableSubDomains: {"action":"enableSubDomains","params":{}}
[15:26:29 Shopozor.cluster:4]: api [cp]: {"method":"jelastic.env.binder.AddDomains","params":{"domains":"staging,api-staging,assets-staging,api,assets"},"nodeGroup":"cp"}
[15:26:29 Shopozor.cluster:4]: ERROR: api.response: {"result":2303,"source":"JEL","error":"env for appid [5ce25f5a6988fbbaf34999b08dd1d47c] not created."}
What jelastic API methods can I use to perform the necessary waiting until subdomain creation is possible?
My current workaround is to split that manifest into two manifests: one cluster installation manifest and one update manifest creating the subdomains. However, I'd like to have everything in the same manifest.
Please change this:
enableSubDomains:
- jelastic.env.binder.AddDomains[cp]:
    domains: staging,api-staging,assets-staging,api,assets
to:
enableSubDomains:
- jelastic.env.binder.AddDomains[cp]:
    envName: ${settings.envName}
    domains: staging,api-staging,assets-staging,api,assets

gke cluster deployment with custom network

I am trying to create a YAML file to deploy a GKE cluster in a custom network I created, but I get an error:
JSON payload received. Unknown name \"network\": Cannot find field."
I have tried a few names for the resources, but I am still seeing the same issue.
resources:
- name: myclus
  type: container.v1.cluster
  properties:
    network: projects/project-251012/global/networks/dev-cloud
    zone: "us-east4-a"
    cluster:
      initialClusterVersion: "1.12.9-gke.13"
      currentMasterVersion: "1.12.9-gke.13"
      ## Initial NodePool config.
      nodePools:
      - name: "myclus-pool1"
        initialNodeCount: 3
        version: "1.12.9-gke.13"
        config:
          machineType: "n1-standard-1"
          oauthScopes:
          - https://www.googleapis.com/auth/logging.write
          - https://www.googleapis.com/auth/monitoring
          - https://www.googleapis.com/auth/ndev.clouddns.readwrite
          preemptible: true
## Duplicates node pool config from v1.cluster section, to get it explicitly managed.
- name: myclus-pool1
  type: container.v1.nodePool
  properties:
    zone: us-east4-a
    clusterId: $(ref.myclus.name)
    nodePool:
      name: "myclus-pool1"
I expect it to place the cluster nodes in this network.
The network field needs to be part of the cluster spec. The top level of properties should contain just zone and cluster; network should be at the same indentation level as initialClusterVersion. See the container.v1.cluster API reference page for more details.
Your manifest should look more like:
EDIT: there is some confusion in the API reference docs concerning deprecated fields. I originally offered a YAML that applies to the new API, not the one you are using. I've updated it with the correct syntax for the basic v1 API, and further down I've added the newer API (which currently relies on gcp-types to deploy).
resources:
- name: myclus
  type: container.v1.cluster
  properties:
    projectId: [project]
    zone: us-central1-f
    cluster:
      name: my-clus
      zone: us-central1-f
      network: [network_name]
      subnetwork: [subnet] ### leave this field blank if using the default network
      initialClusterVersion: "1.13"
      nodePools:
      - name: my-clus-pool1
        initialNodeCount: 0
        config:
          imageType: cos
- name: my-pool-1
  type: container.v1.nodePool
  properties:
    projectId: [project]
    zone: us-central1-f
    clusterId: $(ref.myclus.name)
    nodePool:
      name: my-clus-pool2
      initialNodeCount: 0
      version: "1.13"
      config:
        imageType: ubuntu
The newer API (which provides more functionality, including access to the v1beta1 API and beta features) would look something like this:
resources:
- name: myclus
  type: gcp-types/container-v1:projects.locations.clusters
  properties:
    parent: projects/shared-vpc-231717/locations/us-central1-f
    cluster:
      name: my-clus
      zone: us-central1-f
      network: shared-vpc
      subnetwork: local-only ### leave this field blank if using the default network
      initialClusterVersion: "1.13"
      nodePools:
      - name: my-clus-pool1
        initialNodeCount: 0
        config:
          imageType: cos
- name: my-pool-2
  type: gcp-types/container-v1:projects.locations.clusters.nodePools
  properties:
    parent: projects/shared-vpc-231717/locations/us-central1-f/clusters/$(ref.myclus.name)
    nodePool:
      name: my-clus-separate-pool
      initialNodeCount: 0
      version: "1.13"
      config:
        imageType: ubuntu
Another note: you may want to modify your scopes. The current scopes will not allow the nodes to pull images from gcr.io, so some system pods may not spin up properly, and if you are using Google's container registry you will be unable to pull your own images.
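For example, the node config from the question could keep its existing scopes and add the read-only storage scope, which is what allows nodes to pull images from gcr.io (a sketch; the broad https://www.googleapis.com/auth/cloud-platform scope is an alternative if you prefer to control access with IAM alone):
config:
  machineType: n1-standard-1
  oauthScopes:
  - https://www.googleapis.com/auth/devstorage.read_only   # needed to pull images from gcr.io
  - https://www.googleapis.com/auth/logging.write
  - https://www.googleapis.com/auth/monitoring
  - https://www.googleapis.com/auth/ndev.clouddns.readwrite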
Finally, you don't want to repeat the node pool definition both in the cluster spec and as a separate resource. Instead, create the cluster with a basic (default) node pool, and create all additional node pools as separate resources so they can be managed without going through the cluster. There are very few updates you can perform on a node pool aside from resizing.

Cloud Foundry bosh Error 100: Can't find network

I'm attempting to set up a service broker to add Postgres to our Cloud Foundry installation. We're running our system on VMware. I'm using this release in order to do that:
cf-services-contrib-release
I need to set up the networks: section in the manifest, and what I'm setting there isn't working.
This is what my networks look like in the vmWare vCenter UI:
And this is what my clusters and resource pools look like in the vCenter UI:
I tried both with and without quotes around the 'name' of the network, but I'm now getting an error saying that bosh can't find the network:
Failed compiling packages > rootfs_lucid64/9b3f611b46e076b94b37645c98f9100e7bcef5dd: Can't find network: VLAN1130_LB_100.114.130.0 (00:00:01)
Failed compiling packages > postgresql93/06163819b694f8d9836586d024f64c11efe30180: Can't find network: VLAN1130_LB_100.114.130.0 (00:00:01)
Failed compiling packages > postgresql92/2867893e714aae6e6b76bd06e7aa30d47023c46e: Can't find network: VLAN1130_LB_100.114.130.0 (00:00:01)
Error 100: Can't find network: VLAN1130_LB_100.114.130.0
Task 2430 error
This was my latest configuration attempt:
networks:
- name: default
  type: manual
  subnets:
  - range: 100.114.130.0/24
    gateway: 100.114.130.1
    cloud_properties:
      name: VLAN1130_LB_100.114.130.0
I also tried using single quotes as below. But I got the same error as above!
networks:
- name: default
  type: manual
  subnets:
  - range: 100.114.130.0/24
    gateway: 100.114.130.1
    cloud_properties:
      name: 'VLAN1130_LB_100.114.130.0'
Our network that we're on is this one: 100.114.130.0/24
So it makes sense to select VLAN1130_LB_100.114.130.0 in the config.
I've tried setting all of these options in the yaml file with no quotes. And none of them seem to work!
- USH_UCS_CLOUD_FOUNDRY: postgres_2432_debug.txt (https://gist.github.com/bluethundr/18ac490e96a5e02fad65)
- USH_UCS_CLOUD_FOUNDRY_DVS: postgres_2433_debug.txt
- USH_UCS_CLOUD_FO-DVUplinks-435272: postgres_2434_debug.txt
- VLAN1129_LB_100.114.129.0: postgres_2435_debug.txt
- VLAN1130_LB_100.114.130.0: postgres_2436_debug.txt
- VLAN14-ESXI_MGMT-3.156.14.0: postgres_2437_debug.txt (https://gist.github.com/bluethundr/dbde624e63842721a133)
I wouldn't expect VLAN1129_LB_100.114.129.0 to work, but I tried it anyway, just to be complete.
I've supplied debug dumps of each failed attempt next to each setting you see above. Surely one of them must work! But as you can see none of them did.
Here's my complete yaml file that I deployed with the 'bosh deploy' command:
name: cf-22b9f4d62bb6f0563b71
director_uuid: fd713790-b1bc-401a-8ea1-b8209f1cc90c
releases:
- name: cf-services-contrib
  version: 6
compilation:
  workers: 3
  network: default
  reuse_compilation_vms: true
  cloud_properties:
    ram: 5120
    disk: 10240
    cpu: 2
update:
  canaries: 1
  canary_watch_time: 30000-60000
  update_watch_time: 30000-60000
  max_in_flight: 4
networks:
- name: default
  type: manual
  subnets:
  - range: 100.114.130.0/24
    gateway: 100.114.130.1
    cloud_properties:
      name: VLAN1130_LB_100.114.130.0
resource_pools:
- name: 'USH_UCS_CLOUD_FOUNDRY_NONPROD_01_RP'
  network: default
  stemcell:
    name: bosh-vsphere-esxi-ubuntu-trusty-go_agent
    version: '2865.1'
  cloud_properties:
    cpu: 2
    ram: 4096
    disk: 10240
    datacenters:
    - name: 'Universal City'
      clusters:
      - USH_UCS_CLOUD_FOUNDRY_NONPROD_01: {resource_pool: 'USH_UCS_CLOUD_FOUNDRY_NONPROD_01_RP'}
jobs:
- name: gateways
  release: cf-services-contrib
  templates:
  - name: postgresql_gateway_ng
  instances: 1
  resource_pool: 'USH_UCS_CLOUD_FOUNDRY_NONPROD_01_RP'
  networks:
  - name: default
    default: [dns, gateway]
  properties:
    # Service credentials
    uaa_client_id: "cf"
    uaa_endpoint: http://uaa.devcloudwest.example.com
    uaa_client_auth_credentials:
      username: admin
      password: secret
- name: postgresql_service_node
  release: cf-services-contrib
  template: postgresql_node_ng
  instances: 1
  resource_pool: 'USH_UCS_CLOUD_FOUNDRY_NONPROD_01_RP'
  persistent_disk: 10000
  properties:
    postgresql_node:
      plan: default
  networks:
  - name: default
    default: [dns, gateway]
properties:
  networks:
    apps: default
    management: default
  cc:
    srv_api_uri: http://api.devcloudwest.example.com
  nats:
    address: 100.114.130.11
    port: 25555
    user: nats #CHANGE
    password: secret
    authorization_timeout: 5
  service_plans:
    postgresql:
      default:
        description: "Developer, 250MB storage, 10 connections"
        free: true
        job_management:
          high_water: 230
          low_water: 20
        configuration:
          capacity: 125
          max_clients: 10
          quota_files: 4
          quota_data_size: 240
          enable_journaling: true
          backup:
            enable: false
          lifecycle:
            enable: false
            serialization: enable
            snapshot:
              quota: 1
  postgresql_gateway:
    token: f75df200-4daf-45b5-b92a-cb7fa1a25660
    default_plan: default
    supported_versions: ["9.3"]
    version_aliases:
      current: "9.3"
    cc_api_version: v2
  postgresql_node:
    supported_versions: ["9.3"]
    default_version: "9.3"
    max_tmp: 900
    password: secret
How can we get past this issue?
From Amit's comment:
The name used in cloud_properties must include any nested sub-folders. In the provided configuration the network is nested under USH_UCS_CLOUD_FOUNDRY, so the value for name should reflect that, i.e. USH_UCS_CLOUD_FOUNDRY/VLAN1130_LB_100.114.130.0; no quotes are required.
networks:
- name: default
  type: manual
  subnets:
  - range: 100.114.130.0/24
    gateway: 100.114.130.1
    cloud_properties:
      name: USH_UCS_CLOUD_FOUNDRY/VLAN1130_LB_100.114.130.0