Failing K8s rabbitmq-peer-discovery-k8s clustering

I'm trying to bring up a RabbitMQ cluster on Kubernetes using the rabbitmq-peer-discovery-k8s plugin, but I always have only one pod running and ready; the next one always fails.
I tried multiple changes to my configuration, and this is what got at least one pod running:
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: rabbitmq
namespace: namespace-dev
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
name: endpoint-reader
namespace: namespace-dev
rules:
- apiGroups: [""]
resources: ["endpoints"]
verbs: ["get"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
name: endpoint-reader
namespace: namespace-dev
subjects:
- kind: ServiceAccount
name: rabbitmq
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: endpoint-reader
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: "rabbitmq-data"
labels:
name: "rabbitmq-data"
release: "rabbitmq-data"
namespace: "namespace-dev"
spec:
capacity:
storage: 5Gi
accessModes:
- "ReadWriteMany"
nfs:
path: "/path/to/nfs"
server: "xx.xx.xx.xx"
persistentVolumeReclaimPolicy: Retain
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: "rabbitmq-data-claim"
namespace: "namespace-dev"
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 5Gi
selector:
matchLabels:
release: rabbitmq-data
---
# Headless service used to access pods by hostname
kind: Service
apiVersion: v1
metadata:
name: rabbitmq-headless
namespace: namespace-dev
spec:
clusterIP: None
# publishNotReadyAddresses, when set to true, indicates that DNS implementations must publish the
# notReadyAddresses of subsets for the Endpoints associated with the Service. The default value is
# false. The primary use case for setting this field is for a StatefulSet's headless Service to
# propagate SRV records for its Pods, regardless of their readiness, for the purpose of peer discovery.
# Since peer discovery needs the pods to be resolvable via DNS before they are ready,
# publishNotReadyAddresses is set to true so the DNS records exist before the readinessProbe passes.
publishNotReadyAddresses: true
ports:
- name: amqp
port: 5672
- name: http
port: 15672
selector:
app: rabbitmq
---
# Used to expose the dashboard to the external network
kind: Service
apiVersion: v1
metadata:
namespace: namespace-dev
name: rabbitmq-service
spec:
type: NodePort
ports:
- name: http
protocol: TCP
port: 15672
targetPort: 15672
nodePort: 31672
- name: amqp
protocol: TCP
port: 5672
targetPort: 5672
nodePort: 30672
selector:
app: rabbitmq
---
apiVersion: v1
kind: ConfigMap
metadata:
name: rabbitmq-config
namespace: namespace-dev
data:
enabled_plugins: |
[rabbitmq_management,rabbitmq_peer_discovery_k8s].
rabbitmq.conf: |
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
cluster_formation.k8s.address_type = hostname
cluster_formation.node_cleanup.interval = 10
cluster_formation.node_cleanup.only_log_warning = true
cluster_partition_handling = autoheal
queue_master_locator=min-masters
loopback_users.guest = false
cluster_formation.randomized_startup_delay_range.min = 0
cluster_formation.randomized_startup_delay_range.max = 2
cluster_formation.k8s.service_name = rabbitmq-headless
cluster_formation.k8s.hostname_suffix = .rabbitmq-headless.namespace-dev.svc.cluster.local
vm_memory_high_watermark.absolute = 1.6GB
disk_free_limit.absolute = 2GB
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: rabbitmq
namespace: namespace-dev
spec:
serviceName: rabbitmq-headless # Must be the same as the name of the headless service, used for hostname propagation access pod
selector:
matchLabels:
app: rabbitmq # In apps/v1, it needs to be the same as .spec.template.metadata.label for hostname propagation access pods, but not in apps/v1beta
replicas: 3
template:
metadata:
labels:
app: rabbitmq # In apps/v1, the same as .spec.selector.matchLabels
# setting podAntiAffinity
annotations:
scheduler.alpha.kubernetes.io/affinity: >
{
"podAntiAffinity": {
"requiredDuringSchedulingIgnoredDuringExecution": [{
"labelSelector": {
"matchExpressions": [{
"key": "app",
"operator": "In",
"values": ["rabbitmq"]
}]
},
"topologyKey": "kubernetes.io/hostname"
}]
}
}
spec:
serviceAccountName: rabbitmq
terminationGracePeriodSeconds: 10
containers:
- name: rabbitmq
image: rabbitmq:3.7.10
resources:
limits:
cpu: "0.5"
memory: 2Gi
requests:
cpu: "0.3"
memory: 2Gi
volumeMounts:
- name: config-volume
mountPath: /etc/rabbitmq
- name: rabbitmq-data
mountPath: /var/lib/rabbitmq/mnesia
ports:
- name: http
protocol: TCP
containerPort: 15672
- name: amqp
protocol: TCP
containerPort: 5672
livenessProbe:
exec:
command: ["rabbitmqctl", "status"]
initialDelaySeconds: 60
periodSeconds: 60
timeoutSeconds: 5
readinessProbe:
exec:
command: ["rabbitmqctl", "status"]
initialDelaySeconds: 20
periodSeconds: 60
timeoutSeconds: 5
imagePullPolicy: IfNotPresent
env:
- name: HOSTNAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: RABBITMQ_USE_LONGNAME
value: "true"
- name: RABBITMQ_NODENAME
value: "rabbit#$(HOSTNAME).rabbitmq-headless.namespace-dev.svc.cluster.local"
# If service_name is set in ConfigMap, there is no need to set it again here.
# - name: K8S_SERVICE_NAME
# value: "rabbitmq-headless"
- name: RABBITMQ_ERLANG_COOKIE
value: "mycookie"
volumes:
- name: config-volume
configMap:
name: rabbitmq-config
items:
- key: rabbitmq.conf
path: rabbitmq.conf
- key: enabled_plugins
path: enabled_plugins
- name: rabbitmq-data
persistentVolumeClaim:
claimName: rabbitmq-data-claim
I only get one pod running and ready instead of the 3 replicas:
[admin@devsvr3 yaml]$ kubectl get pods
NAME READY STATUS RESTARTS AGE
rabbitmq-0 1/1 Running 0 2m2s
rabbitmq-1 0/1 Running 1 43s
Inspecting the failing pod I got this.
[admin@devsvr3 yaml]$ kubectl logs rabbitmq-1
## ##
## ## RabbitMQ 3.7.10. Copyright (C) 2007-2018 Pivotal Software, Inc.
########## Licensed under the MPL. See http://www.rabbitmq.com/
###### ##
########## Logs: <stdout>
Starting broker...
2019-02-06 21:09:03.303 [info] <0.211.0>
Starting RabbitMQ 3.7.10 on Erlang 21.2.3
Copyright (C) 2007-2018 Pivotal Software, Inc.
Licensed under the MPL. See http://www.rabbitmq.com/
2019-02-06 21:09:03.315 [info] <0.211.0>
node : rabbit@rabbitmq-1.rabbitmq-headless.namespace-dev.svc.cluster.local
home dir : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.conf
cookie hash : XhdCf8zpVJeJ0EHyaxszPg==
log(s) : <stdout>
database dir : /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-1.rabbitmq-headless.namespace-dev.svc.cluster.local
2019-02-06 21:09:10.617 [error] <0.219.0> Unable to parse vm_memory_high_watermark value "1.6GB"
2019-02-06 21:09:10.617 [info] <0.219.0> Memory high watermark set to 103098 MiB (108106919116 bytes) of 257746 MiB (270267297792 bytes) total
2019-02-06 21:09:10.690 [info] <0.221.0> Enabling free disk space monitoring
2019-02-06 21:09:10.690 [info] <0.221.0> Disk free limit set to 2000MB
2019-02-06 21:09:10.698 [info] <0.224.0> Limiting to approx 1048476 file handles (943626 sockets)
2019-02-06 21:09:10.698 [info] <0.225.0> FHC read buffering: OFF
2019-02-06 21:09:10.699 [info] <0.225.0> FHC write buffering: ON
2019-02-06 21:09:10.702 [info] <0.211.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-1.rabbitmq-headless.namespace-dev.svc.cluster.local is empty. Assuming we need to join an existing cluster or initialise from scratch...
2019-02-06 21:09:10.702 [info] <0.211.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2019-02-06 21:09:10.702 [info] <0.211.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2019-02-06 21:09:10.702 [info] <0.211.0> Peer discovery backend does not support locking, falling back to randomized delay
2019-02-06 21:09:10.702 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2019-02-06 21:09:10.710 [info] <0.211.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},
{inet,[inet],nxdomain}]}
2019-02-06 21:09:10.711 [error] <0.210.0> CRASH REPORT Process <0.210.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164 in application_master:init/4 line 138
2019-02-06 21:09:10.711 [info] <0.43.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,\"{failed_connect,[{to_address,{\\"kubernetes.default.svc.cluster.local\\",443}},\n {inet,[inet],nxdomain}]}\"}},[{rabbit_mnesia,init_from_config,0,[{file,\"src/rabbit_mnesia.erl\"},{line,164}]},{rabbit_mnesia,init_with_lock,3,[{file,\"src/rabbit_mnesia.erl\"},{line,144}]},{rabbit_mnesia,init,0,[{file,\"src/rabbit_mnesia.erl\"},{line,111}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit_boot_steps,run_boot_steps,1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit,start,2,[{file,\"src/rabbit.erl\"},{line,815}]}]}}}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"{failed_connect,[{to_address,{\"kubernetes.defau
Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
[admin@devsvr3 yaml]$
What did I do wrong here?

Finally, I fixed it by adding this to the /etc/resolv.conf of my pods:
[my-rabbit-svc].[my-rabbitmq-namespace].svc.[cluster-name]
To add this to my pods, I used this setting in my StatefulSet:
dnsConfig:
searches:
- [my-rabbit-svc].[my-rabbitmq-namespace].svc.[cluster-name]
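With the names from the question above, that search entry would presumably be:
dnsConfig:
  searches:
    - rabbitmq-headless.namespace-dev.svc.cluster.local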
Full documentation here.

Try to set:
cluster_formation.k8s.host = [your kubernetes endpoint ip address]
cluster_formation.k8s.port = [your kubernetes endpoint port]
because it seems that your pod cannot resolve this name:
kubernetes.default.svc.cluster.local
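To find those values, you can inspect the endpoints of the default kubernetes Service; the ENDPOINTS column shows the address and port to use (standard kubectl):
kubectl get endpoints kubernetes -n default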

Try using this stable Helm chart: https://github.com/helm/charts/tree/master/stable/rabbitmq

One possible solution here, instead of attaching a DNS config to the pod, is to use a k8s proxy sidecar. So, instead of resolving
kubernetes.default.svc.cluster.local
you could set up a sidecar container like
- name: "k8s-api-sidecar"
image: "tommyvn/kubectl-proxy:latest"
in the StatefulSet/Deployment.
And change the ConfigMap to use it:
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
cluster_formation.k8s.host = localhost
cluster_formation.k8s.port = 8001
cluster_formation.k8s.scheme = http
If you take a look at this repository https://github.com/tommyvn/kubectl-proxy , you will find that it is just a call to kubectl proxy.
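Put together with the StatefulSet from the question, the containers list would gain one entry; a sketch, with only the sidecar lines new:
containers:
- name: rabbitmq
  image: rabbitmq:3.7.10
  # ... rest of the existing container spec ...
- name: "k8s-api-sidecar"
  image: "tommyvn/kubectl-proxy:latest"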

Related

How to install Loki+Promtail to forward K8S pod logs to Grafana Cloud

I am still new to K8S infrastructure, but I am trying to convert a VM infrastructure to K8S on GCP/GKE, and I am stuck at forwarding the logs properly after getting Prometheus metrics forwarded correctly. I am also trying to do this without Helm, to better understand K8S.
The logs of the Loki pod look as expected when compared to a Docker-based setup on a VM.
But I do not know how to start the Promtail service without a port, since in a Docker setup Promtail does not have to expose a port. I get the following error:
The Service "promtail" is invalid: spec.ports: Required value
My configuration files look like:
loki-config.yml
auth_enabled: false
server:
http_listen_port: 3100
grpc_listen_port: 9096
ingester:
wal:
enabled: true
dir: /tmp/wal
lifecycler:
address: 127.0.0.1
ring:
kvstore:
store: inmemory
replication_factor: 1
final_sleep: 0s
chunk_idle_period: 1h # Any chunk not receiving new logs in this time will be flushed
max_chunk_age: 1h # All chunks will be flushed when they hit this age, default is 1h
chunk_target_size: 1048576 # Loki will attempt to build chunks up to 1.5MB, flushing first if chunk_idle_period or max_chunk_age is reached first
chunk_retain_period: 30s # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
max_transfer_retries: 0 # Chunk transfers disabled
schema_config:
configs:
- from: 2020-10-24
store: boltdb-shipper
object_store: filesystem
schema: v11
index:
prefix: index_
period: 24h
storage_config:
boltdb_shipper:
active_index_directory: /tmp/loki/boltdb-shipper-active
cache_location: /tmp/loki/boltdb-shipper-cache
cache_ttl: 24h # Can be increased for faster performance over longer query periods, uses more disk space
shared_store: filesystem
filesystem:
directory: /tmp/loki/chunks
compactor:
working_directory: /tmp/loki/boltdb-shipper-compactor
shared_store: filesystem
limits_config:
reject_old_samples: true
reject_old_samples_max_age: 168h
ingestion_burst_size_mb: 16
ingestion_rate_mb: 16
chunk_store_config:
max_look_back_period: 0s
table_manager:
retention_deletes_enabled: false
retention_period: 0s
ruler:
storage:
type: local
local:
directory: /tmp/loki/rules
rule_path: /tmp/loki/rules-temp
alertmanager_url: http://localhost:9093
ring:
kvstore:
store: inmemory
enable_api: true
promtail-config.yml
server:
http_listen_port: 9080
grpc_listen_port: 0
# this is the place where promtail will store the progress about how far it has read the logs
positions:
filename: /tmp/positions.yaml
# address of loki server to which promtail should push the logs
clients:
- url: https://999999:...=@logs-prod3.grafana.net/api/prom/push
# which logs to read/scrape
scrape_configs:
- job_name: system
static_configs:
- targets:
- localhost
labels:
job: varlogs
__path__: /var/log/*log
- job_name: node
static_configs:
- targets:
- localhost
labels:
job: node # label-1
host: localhost # label-2
__path__: /var/lib/docker/containers/*/*log
Then the deployment files:
loki-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: loki
spec:
selector:
matchLabels:
app: loki
network: cluster-1
replicas: 1
template:
metadata:
labels:
app: loki
network: cluster-1
spec:
containers:
- name: loki
image: grafana/loki
ports:
- containerPort: 3100
volumeMounts:
- name: loki-config-volume
mountPath: /etc/loki/loki.yml
subPath: loki.yml
volumes:
- name: loki-config-volume
configMap:
name: "loki-config"
---
apiVersion: v1
kind: Service
metadata:
name: loki
namespace: monitoring
spec:
selector:
app: loki
type: NodePort
ports:
- name: loki
protocol: TCP
port: 3100
And finally promtail-deploy.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: promtail
spec:
selector:
matchLabels:
app: promtail
network: cluster-1
replicas: 1
template:
metadata:
labels:
app: promtail
network: cluster-1
spec:
containers:
- name: promtail
image: grafana/promtail
volumeMounts:
- name: promtail-config-volume
mountPath: /mnt/config/promtail-config.yml
subPath: promtail.yml
volumes:
- name: promtail-config-volume
configMap:
name: "promtail-config"
---
apiVersion: v1
kind: Service
metadata:
name: promtail
namespace: monitoring
The issue you're describing is answered exactly by the error message.
Your second Kubernetes Service manifest, named promtail, does not have any specification. For services, at least spec.ports is required. You should add a label selector as well, so the Service can pick up the Deployment's pods properly.
apiVersion: v1
kind: Service
metadata:
name: promtail
namespace: monitoring
spec:
selector:
app: promtail
ports:
- port: <ServicePort>
targetPort: <PodPort>
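With the promtail-config.yml above (http_listen_port: 9080), a filled-in version might look like this; a sketch assuming you only need Promtail's own HTTP endpoint:
apiVersion: v1
kind: Service
metadata:
  name: promtail
  namespace: monitoring
spec:
  selector:
    app: promtail
  ports:
    - name: http
      port: 9080
      targetPort: 9080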
However, if you do not need to communicate with the Promtail pods from external services, then simply skip creating the Service itself.
May I add, if you need to expose these logs to a service running outside of your cluster, such as Grafana Cloud, you should create a Service of LoadBalancer type for Loki instead. This will request a public IP for it, making it accessible worldwide - assuming your Kubernetes cluster is managed by some cloud provider.
Making Loki public is insecure, but a good first step towards consuming these logs externally.
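A minimal sketch of such a Service for Loki (port 3100 comes from loki-config.yml above; the name loki-external is made up here):
apiVersion: v1
kind: Service
metadata:
  name: loki-external
  namespace: monitoring
spec:
  type: LoadBalancer
  selector:
    app: loki
  ports:
    - name: loki
      port: 3100
      targetPort: 3100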

Unable to fetch metrics from custom metrics API: the server is currently unable to handle the request

I'm using a HPA based on a custom metric on GKE.
The HPA is not working and it's showing me this error log:
unable to fetch metrics from custom metrics API: the server is currently unable to handle the request
When I run kubectl get apiservices | grep custom I get
v1beta1.custom.metrics.k8s.io services/prometheus-adapter False (FailedDiscoveryCheck) 135d
this is the HPA spec config :
spec:
scaleTargetRef:
kind: Deployment
name: api-name
apiVersion: apps/v1
minReplicas: 3
maxReplicas: 50
metrics:
- type: Object
object:
target:
kind: Service
name: api-name
apiVersion: v1
metricName: messages_ready_per_consumer
targetValue: '1'
and this is the service's spec config :
spec:
ports:
- name: worker-metrics
protocol: TCP
port: 8080
targetPort: worker-metrics
selector:
app.kubernetes.io/instance: api
app.kubernetes.io/name: api-name
clusterIP: 10.8.7.9
clusterIPs:
- 10.8.7.9
type: ClusterIP
sessionAffinity: None
ipFamilies:
- IPv4
ipFamilyPolicy: SingleStack
What should I do to make it work?
First of all, confirm that the Metrics Server pod is running in your kube-system namespace. Also, you can use the following manifest:
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: metrics-server
namespace: kube-system
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: metrics-server
namespace: kube-system
labels:
k8s-app: metrics-server
spec:
selector:
matchLabels:
k8s-app: metrics-server
template:
metadata:
name: metrics-server
labels:
k8s-app: metrics-server
spec:
serviceAccountName: metrics-server
volumes:
# mount in tmp so we can safely use from-scratch images and/or read-only containers
- name: tmp-dir
emptyDir: {}
containers:
- name: metrics-server
image: k8s.gcr.io/metrics-server-amd64:v0.3.1
command:
- /metrics-server
- --kubelet-insecure-tls
- --kubelet-preferred-address-types=InternalIP
imagePullPolicy: Always
volumeMounts:
- name: tmp-dir
mountPath: /tmp
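A quick way to confirm the pod is running and to pull its logs, using the k8s-app=metrics-server label from the manifest above (standard kubectl):
kubectl get pods -n kube-system -l k8s-app=metrics-server
kubectl logs -n kube-system -l k8s-app=metrics-server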
If so, take a look at the logs and watch for any stackdriver adapter lines. This issue is commonly caused by a problem with the custom-metrics-stackdriver-adapter. It usually crashes in the metrics-server namespace. To solve that, use the resource from this URL, and for the deployment, use this image:
gcr.io/google-containers/custom-metrics-stackdriver-adapter:v0.10.1
Another common root cause of this is an OOM issue. In this case, adding more memory solves the problem. To assign more memory, you can specify the new memory amount in the configuration file, as the following example shows:
apiVersion: v1
kind: Pod
metadata:
name: memory-demo
namespace: mem-example
spec:
containers:
- name: memory-demo-ctr
image: polinux/stress
resources:
limits:
memory: "200Mi"
requests:
memory: "100Mi"
command: ["stress"]
args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]
In the above example, the Container has a memory request of 100 MiB and a memory limit of 200 MiB. In the manifest, the "--vm-bytes", "150M" argument tells the Container to attempt to allocate 150 MiB of memory. You can visit this Kubernetes Official Documentation to have more references about the Memory settings.
You can use the following threads for more reference GKE - HPA using custom metrics - unable to fetch metrics, Stackdriver-metadata-agent-cluster-level gets OOMKilled, and Custom-metrics-stackdriver-adapter pod keeps crashing.
What do you get for kubectl get pod -l "app.kubernetes.io/instance=api,app.kubernetes.io/name=api-name"?
There should be a pod to which the service refers.
If there is a pod, check its logs with kubectl logs <pod-name>. You can add -f to the kubectl logs command to follow the logs.
Adding this block to my EKS node security group rules solved the issue for me:
node_security_group_additional_rules = {
...
ingress_cluster_metricserver = {
description = "Cluster to node 4443 (Metrics Server)"
protocol = "tcp"
from_port = 4443
to_port = 4443
type = "ingress"
source_cluster_security_group = true
}
...
}

how to access google storage bucket while running jupyter notebook with pyspark on GKE kubernetes?

My goal is to run PySpark code in Jupyter on k8s while reading logs from a Google Storage bucket. Sounds simple, maybe.
After much clicking, sweat & tears, I've managed to run Jupyter with PySpark on k8s, but I fail to read from a Google Storage bucket. Or, to put it in code terms, I fail to run:
df = spark.read.parquet("gs://bucket_name/puppy.snappy.parquet")
I've built the setup as follows. First, a YAML file setting up a Jupyter notebook running on a StatefulSet:
apiVersion: v1
kind: ServiceAccount
metadata:
name: jupyter
namespace: spark
labels:
release: jupyter
secrets:
- name: bucket-key
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: jupyter
labels:
release: jupyter
namespace: spark
rules:
- apiGroups:
- ""
resources:
- pods
verbs:
- create
- get
- delete
- list
- watch
- apiGroups:
- ""
resources:
- services
verbs:
- get
- create
- apiGroups:
- ""
resources:
- pods/log
verbs:
- get
- list
- apiGroups:
- ""
resources:
- pods/exec
verbs:
- create
- get
- apiGroups:
- ""
resources:
- configmaps
verbs:
- get
- create
- list
- watch
- delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: jupyter
labels:
release: jupyter
namespace: spark
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: jupyter
subjects:
- kind: ServiceAccount
name: jupyter
namespace: spark
---
apiVersion: v1
kind: Service
metadata:
name: jupyter
labels:
release: jupyter
spec:
type: ClusterIP
selector:
release: jupyter
ports:
- name: http
port: 8888
protocol: TCP
- name: blockmanager
port: 7777
protocol: TCP
- name: driver
port: 2222
protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
name: jupyter-headless
labels:
release: jupyter
spec:
type: ClusterIP
clusterIP: None
publishNotReadyAddresses: false
selector:
release: jupyter
ports:
- name: http
port: 8888
protocol: TCP
- name: blockmanager
port: 7777
protocol: TCP
- name: driver
port: 2222
protocol: TCP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: jupyter
namespace: spark
labels:
release: jupyter
spec:
replicas:
updateStrategy:
type: RollingUpdate
serviceName: jupyter-headless
podManagementPolicy: Parallel
volumeClaimTemplates:
- metadata:
name: notebook-data
labels:
release: jupyter
spec:
accessModes:
- ReadWriteOnce
volumeMode: Filesystem
resources:
requests:
storage: 100Mi
selector:
matchLabels:
release: jupyter
template:
metadata:
labels:
release: jupyter
annotations:
spec:
restartPolicy: Always
terminationGracePeriodSeconds: 30
serviceAccountName: jupyter
dnsConfig:
options:
- name: ndots
value: "1"
volumes:
- name: bucket-service-account-vol
secret:
secretName: bucket-key
containers:
- name: jupyter
image: "jjgershon/spark:3.1.1-hadoop-3.2.0-gcp-jupyter"
imagePullPolicy: IfNotPresent
ports:
- name: http
containerPort: 8888
protocol: TCP
- name: blockmanager
containerPort: 7777
protocol: TCP
- name: driver
containerPort: 2222
protocol: TCP
volumeMounts:
- name: notebook-data
mountPath: /home/notebook
- name: bucket-service-account-vol
mountPath: /var/secrets/google
env:
- name: GOOGLE_APPLICATION_CREDENTIALS
value: /var/secrets/google/key.json
resources:
limits:
cpu: 500m
memory: 2048Mi
requests:
cpu: 500m
memory: 2048Mi
The YAML is mostly from this post.
Now, this YAML succeeds in granting the IAM service account to the Jupyter notebook itself, but not to the executor pods. When I run my code, the executor pods are created:
kubectl -n spark get pods
NAME READY STATUS RESTARTS AGE
gcplocalstack-736c087b06a73790-exec-1 1/1 Running 0 23s
gcplocalstack-736c087b06a73790-exec-2 1/1 Running 0 23s
jupyter-0 1/1 Running 0 151m
But I get this error:
2*****4-compute@developer.gserviceaccount.com does not have storage.objects.get access to the Google Cloud Storage object
The reason for the error is that the pod is pointing to the default IAM service account, which doesn't have access to the bucket I'm trying to read.
To solve this, I've added the following to the Jupyter notebook:
"spark.hadoop.google.cloud.auth.service.account.enable": "true",
"spark.kubernetes.authenticate.driver.serviceAccountName": "jupyter",
"spark.kubernetes.driver.secrets.bucket-key": "/var/secrets/google/key.json",
"spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS":
"/var/secrets/google/key.json",
"spark.kubernetes.driver.secrets.key": "/var/secrets/google",
"spark.kubernetes.executor.secrets.key": "/var/secrets/google",
"spark.kubernetes.executor.secrets.bucket-key": "/var/secrets/google/key.json",
"spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS": "/var/secrets/google/key.json",
"spark.hadoop.google.cloud.auth.service.account.json.keyfile":
"/var/secrets/google/key.json",
But now the executor pods get stuck on:
NAME READY STATUS RESTARTS AGE
gcplocalstack-1c6d917b06ac59fa-exec-1 0/1 ContainerCreating 0 53s
gcplocalstack-1c6d917b06ac59fa-exec-2 0/1 ContainerCreating 0 53s
jupyter-0 1/1 Running 0 157m
And I get this error message:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
This might be because the new IAM service account doesn't have permission for GKE, but I'm not sure. It also seems like this is the line of code causing the problems: "spark.kubernetes.executor.secrets.key": "/var/secrets/google",
The full PySpark code in the Jupyter notebook:
from pyspark import SparkConf
from pyspark.sql import SparkSession
config = {
"spark.kubernetes.driver.pod.name": "jupyter-0",
"spark.kubernetes.namespace": "spark",
"spark.kubernetes.container.image": "jjgershon/spark:3.1.1-hadoop-3.2.0-gcp",
"spark.executor.instances": "2",
"spark.executor.memory": "1g",
"spark.executor.cores": "1",
"spark.driver.blockManager.port": "7777",
"spark.driver.port": "2222",
"spark.driver.host": "jupyter.spark.svc.cluster.local",
"spark.driver.bindAddress": "0.0.0.0",
"spark.hadoop.google.cloud.auth.service.account.enable": "true",
"spark.kubernetes.authenticate.driver.serviceAccountName": "jupyter",
"spark.kubernetes.driver.secrets.bucket-key": "/var/secrets/google/key.json",
"spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS": "/var/secrets/google/key.json",
"spark.kubernetes.driver.secrets.key": "/var/secrets/google",
"spark.kubernetes.executor.secrets.key": "/var/secrets/google",
"spark.kubernetes.executor.secrets.bucket-key": "/var/secrets/google/key.json",
"spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS": "/var/secrets/google/key.json",
"spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/var/secrets/google/key.json",
}
def get_spark_session(app_name: str, conf: SparkConf):
conf.setMaster("k8s://https://kubernetes.default.svc.cluster.local")
for key, value in config.items():
conf.set(key, value)
return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
spark = get_spark_session("gcplocalstack",SparkConf())
df = spark.read.parquet("gs://bucket_nane/puppy.snappy.parquet")
i will be forever grateful for any help
thanks!!! :)

RabbitMQ - Error while waiting for Mnesia tables

I have installed RabbitMQ using a Helm chart on a Kubernetes cluster. The RabbitMQ pod keeps restarting. On inspecting the pod logs I get the below error:
2020-02-26 04:42:31.582 [warning] <0.314.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-02-26 04:42:31.582 [info] <0.314.0> Waiting for Mnesia tables for 30000 ms, 6 retries left
When I try to do kubectl describe pod I get this error:
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data-rabbitmq-0
ReadOnly: false
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: rabbitmq-config
Optional: false
healthchecks:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: rabbitmq-healthchecks
Optional: false
rabbitmq-token-w74kb:
Type: Secret (a volume populated by a Secret)
SecretName: rabbitmq-token-w74kb
Optional: false
QoS Class: Burstable
Node-Selectors: beta.kubernetes.io/arch=amd64
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 3m27s (x878 over 7h21m) kubelet, gke-analytics-default-pool-918f5943-w0t0 Readiness probe failed: Timeout: 70 seconds ...
Checking health of node rabbit@rabbitmq-0.rabbitmq-headless.default.svc.cluster.local ...
Status of node rabbit@rabbitmq-0.rabbitmq-headless.default.svc.cluster.local ...
Error:
{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :"$1", :_, :_}, [], [:"$1"]}]]}}
Error:
{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :"$1", :_, :_}, [], [:"$1"]}]]}}
I have provisioned the above on Google Cloud on a kubernetes cluster. I am not sure during what specific situation it started failing. I had to restart the pod and since then it has been failing.
What is the issue here?
TLDR
helm upgrade rabbitmq bitnami/rabbitmq --set clustering.forceBoot=true
Problem
The problem happens for the following reason:
All RMQ pods are terminated at the same time due to some reason (maybe because you explicitly set the StatefulSet replicas to 0, or something else)
One of them is the last one to stop (maybe just a tiny bit after the others). It stores this condition ("I'm standalone now") in its filesystem, which in k8s is the PersistentVolume(Claim). Let's say this pod is rabbitmq-1.
When you spin the StatefulSet back up, the pod rabbitmq-0 is always the first to start (see here).
During startup, pod rabbitmq-0 first checks whether it's supposed to run standalone. But as far as it can see on its own filesystem, it's part of a cluster. So it checks for its peers and doesn't find any. This results in a startup failure by default.
rabbitmq-0 thus never becomes ready.
rabbitmq-1 is never starting because that's how StatefulSets are deployed - one after another. If it were to start, it would start successfully because it sees that it can run standalone as well.
So in the end, it's a bit of a mismatch between how RabbitMQ and StatefulSets work. RMQ says: "if everything goes down, just start everything at the same time; one will be able to start, and as soon as this one is up, the others can rejoin the cluster." k8s StatefulSets say: "starting everything all at once is not possible; we'll start with number 0".
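As an aside, a StatefulSet can be asked to start its pods simultaneously instead of one by one, which removes the ordered-startup half of this mismatch. A minimal sketch, assuming your chart exposes the field:
spec:
  podManagementPolicy: Parallel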
Solution
To fix this, there is a force_boot command for rabbitmqctl which basically tells an instance to start standalone if it doesn't find any peers. How you can use this from Kubernetes depends on the Helm chart and container you're using. In the Bitnami Chart, which uses the Bitnami Docker image, there is a value clustering.forceBoot = true, which translates to an env variable RABBITMQ_FORCE_BOOT = yes in the container, which will then issue the above command for you.
But looking at the problem, you can also see why deleting PVCs will work (other answer). The pods will just all "forget" that they were part of a RMQ cluster the last time around, and happily start. I would prefer the above solution though, as no data is being lost.
I just deleted the existing Persistent Volume Claim, reinstalled rabbitmq, and it started working.
So every time after installing rabbitmq on a Kubernetes cluster, if I scale the pods down to 0 and then scale them back up at a later time, I get the same error. I also tried deleting the Persistent Volume Claim without uninstalling the rabbitmq Helm chart, but I still got the same error.
So it seems that each time I scale the cluster down to 0, I need to uninstall the rabbitmq Helm chart, delete the corresponding Persistent Volume Claims, and install the rabbitmq Helm chart again to make it work.
If you are in the same scenario as me and you don't know who deployed the Helm chart or how it was deployed... you can edit the StatefulSet directly to avoid messing up more things.
I was able to make it work without deleting the Helm chart:
kubectl -n rabbitmq edit statefulsets.apps rabbitmq
Under the spec section I added the env variable RABBITMQ_FORCE_BOOT = yes as follows:
spec:
containers:
- env:
- name: RABBITMQ_FORCE_BOOT # New Line 1 Added
value: "yes" # New Line 2 Added
And that should fix the issue too... but please first try to do it the proper way, as explained above by Ulli.
In my case the solution was simple:
Step 1: Downscale the StatefulSet; this will not delete the PVC.
kubectl scale statefulsets rabbitmq-1-rabbitmq --namespace teps-rabbitmq --replicas=1
Step 2: Access the RabbitMQ pod:
kubectl exec -it rabbitmq-1-rabbitmq-0 -n Rabbit
Step 3: Reset the cluster:
rabbitmqctl stop_app
rabbitmqctl force_boot
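As a side note, rabbitmqctl start_app is the usual counterpart to stop_app; it is not strictly needed here because the rescale in the next step restarts the pods anyway:
rabbitmqctl start_app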
Step 4: Rescale the StatefulSet:
kubectl scale statefulsets rabbitmq-1-rabbitmq --namespace teps-rabbitmq --replicas=4
I also got a similar kind of error as given below.
2020-06-05 03:45:37.153 [info] <0.234.0> Waiting for Mnesia tables for
30000 ms, 9 retries left 2020-06-05 03:46:07.154 [warning] <0.234.0>
Error while waiting for Mnesia tables:
{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-06-05 03:46:07.154 [info] <0.234.0> Waiting for Mnesia tables for
30000 ms, 8 retries left
In my case, the slave node (server) of the RabbitMQ cluster was down. Once I started the slave node, the master node started without an error.
Try this deployment:
kind: Service
apiVersion: v1
metadata:
namespace: rabbitmq-namespace
name: rabbitmq
labels:
app: rabbitmq
type: LoadBalancer
spec:
type: NodePort
ports:
- name: http
protocol: TCP
port: 15672
targetPort: 15672
nodePort: 31672
- name: amqp
protocol: TCP
port: 5672
targetPort: 5672
nodePort: 30672
- name: stomp
protocol: TCP
port: 61613
targetPort: 61613
selector:
app: rabbitmq
---
kind: Service
apiVersion: v1
metadata:
namespace: rabbitmq-namespace
name: rabbitmq-lb
labels:
app: rabbitmq
spec:
# Headless service to give the StatefulSet a DNS which is known in the cluster (hostname-#.app.namespace.svc.cluster.local, )
# in our case - rabbitmq-#.rabbitmq.rabbitmq-namespace.svc.cluster.local
clusterIP: None
ports:
- name: http
protocol: TCP
port: 15672
targetPort: 15672
- name: amqp
protocol: TCP
port: 5672
targetPort: 5672
- name: stomp
port: 61613
selector:
app: rabbitmq
---
apiVersion: v1
kind: ConfigMap
metadata:
name: rabbitmq-config
namespace: rabbitmq-namespace
data:
enabled_plugins: |
[rabbitmq_management,rabbitmq_peer_discovery_k8s,rabbitmq_stomp].
rabbitmq.conf: |
## Cluster formation. See http://www.rabbitmq.com/cluster-formation.html to learn more.
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
## Should RabbitMQ node name be computed from the pod's hostname or IP address?
## IP addresses are not stable, so using [stable] hostnames is recommended when possible.
## Set to "hostname" to use pod hostnames.
## When this value is changed, so should the variable used to set the RABBITMQ_NODENAME
## environment variable.
cluster_formation.k8s.address_type = hostname
## Important - this is the suffix of the hostname, as each node gets "rabbitmq-#", we need to tell what's the suffix
## it will give each new node that enters the way to contact the other peer node and join the cluster (if using hostname)
cluster_formation.k8s.hostname_suffix = .rabbitmq.rabbitmq-namespace.svc.cluster.local
## How often should node cleanup checks run?
cluster_formation.node_cleanup.interval = 30
## Set to false if automatic removal of unknown/absent nodes
## is desired. This can be dangerous, see
## * http://www.rabbitmq.com/cluster-formation.html#node-health-checks-and-cleanup
## * https://groups.google.com/forum/#!msg/rabbitmq-users/wuOfzEywHXo/k8z_HWIkBgAJ
cluster_formation.node_cleanup.only_log_warning = true
cluster_partition_handling = autoheal
## See http://www.rabbitmq.com/ha.html#master-migration-data-locality
queue_master_locator=min-masters
## See http://www.rabbitmq.com/access-control.html#loopback-users
loopback_users.guest = false
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: rabbitmq
namespace: rabbitmq-namespace
spec:
serviceName: rabbitmq
replicas: 3
selector:
matchLabels:
name: rabbitmq
template:
metadata:
labels:
app: rabbitmq
name: rabbitmq
state: rabbitmq
annotations:
pod.alpha.kubernetes.io/initialized: "true"
spec:
serviceAccountName: rabbitmq
terminationGracePeriodSeconds: 10
containers:
- name: rabbitmq-k8s
image: rabbitmq:3.8.3
volumeMounts:
- name: config-volume
mountPath: /etc/rabbitmq
- name: data
mountPath: /var/lib/rabbitmq/mnesia
ports:
- name: http
protocol: TCP
containerPort: 15672
- name: amqp
protocol: TCP
containerPort: 5672
livenessProbe:
exec:
command: ["rabbitmqctl", "status"]
initialDelaySeconds: 60
periodSeconds: 60
timeoutSeconds: 10
resources:
requests:
memory: "0"
cpu: "0"
limits:
memory: "2048Mi"
cpu: "1000m"
readinessProbe:
exec:
command: ["rabbitmqctl", "status"]
initialDelaySeconds: 20
periodSeconds: 60
timeoutSeconds: 10
imagePullPolicy: Always
env:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: HOSTNAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: RABBITMQ_USE_LONGNAME
value: "true"
# See a note on cluster_formation.k8s.address_type in the config file section
- name: RABBITMQ_NODENAME
value: "rabbit#$(HOSTNAME).rabbitmq.$(NAMESPACE).svc.cluster.local"
- name: K8S_SERVICE_NAME
value: "rabbitmq"
- name: RABBITMQ_ERLANG_COOKIE
value: "mycookie"
volumes:
- name: config-volume
configMap:
name: rabbitmq-config
items:
- key: rabbitmq.conf
path: rabbitmq.conf
- key: enabled_plugins
path: enabled_plugins
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes:
- "ReadWriteOnce"
storageClassName: "default"
resources:
requests:
storage: 3Gi
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: rabbitmq
namespace: rabbitmq-namespace
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
name: endpoint-reader
namespace: rabbitmq-namespace
rules:
- apiGroups: [""]
resources: ["endpoints"]
verbs: ["get"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
name: endpoint-reader
namespace: rabbitmq-namespace
subjects:
- kind: ServiceAccount
name: rabbitmq
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: endpoint-reader

unmarshalerDecoder: quantities must match the regular expression

When I install CoreDNS using this command (by the way, the OS version is CentOS 7.6 and the Kubernetes version is v1.15.2):
kubectl create -f coredns.yaml
The output is:
[root@ops001 coredns]# kubectl create -f coredns.yaml
serviceaccount/coredns created
clusterrole.rbac.authorization.k8s.io/system:coredns created
clusterrolebinding.rbac.authorization.k8s.io/system:coredns created
configmap/coredns created
service/kube-dns created
Error from server (BadRequest): error when creating "coredns.yaml": Deployment in version "v1" cannot be handled as a Deployment: v1.Deployment.Spec: v1.DeploymentSpec.Template: v1.PodTemplateSpec.Spec: v1.PodSpec.Containers: []v1.Container: v1.Container.Resources: v1.ResourceRequirements.Requests: Limits: unmarshalerDecoder: quantities must match the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$', error found in #10 byte of ...|__LIMIT__"},"request|..., bigger context ...|limits":{"memory":"__PILLAR__DNS__MEMORY__LIMIT__"},"requests":{"cpu":"100m","memory":"70Mi"}},"secu|...
this is my coredns.yaml:
# __MACHINE_GENERATED_WARNING__
apiVersion: v1
kind: ServiceAccount
metadata:
name: coredns
namespace: kube-system
labels:
kubernetes.io/cluster-service: "true"
addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
kubernetes.io/bootstrapping: rbac-defaults
addonmanager.kubernetes.io/mode: Reconcile
name: system:coredns
rules:
- apiGroups:
- ""
resources:
- endpoints
- services
- pods
- namespaces
verbs:
- list
- watch
- apiGroups:
- ""
resources:
- nodes
verbs:
- get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
annotations:
rbac.authorization.kubernetes.io/autoupdate: "true"
labels:
kubernetes.io/bootstrapping: rbac-defaults
addonmanager.kubernetes.io/mode: EnsureExists
name: system:coredns
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: system:coredns
subjects:
- kind: ServiceAccount
name: coredns
namespace: kube-system
---
apiVersion: v1
kind: ConfigMap
metadata:
name: coredns
namespace: kube-system
labels:
addonmanager.kubernetes.io/mode: EnsureExists
data:
Corefile: |
.:53 {
errors
health
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
upstream
fallthrough in-addr.arpa ip6.arpa
ttl 30
}
prometheus :9153
forward . /etc/resolv.conf
cache 30
loop
reload
loadbalance
}
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: coredns
namespace: kube-system
labels:
k8s-app: kube-dns
kubernetes.io/cluster-service: "true"
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/name: "CoreDNS"
spec:
# replicas: not specified here:
# 1. In order to make Addon Manager do not reconcile this replicas parameter.
# 2. Default is 1.
# 3. Will be tuned in real time if DNS horizontal auto-scaling is turned on.
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
selector:
matchLabels:
k8s-app: kube-dns
template:
metadata:
labels:
k8s-app: kube-dns
annotations:
seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
spec:
priorityClassName: system-cluster-critical
serviceAccountName: coredns
tolerations:
- key: "CriticalAddonsOnly"
operator: "Exists"
nodeSelector:
beta.kubernetes.io/os: linux
containers:
- name: coredns
image: gcr.azk8s.cn/google-containers/coredns:1.3.1
imagePullPolicy: IfNotPresent
resources:
limits:
memory: __PILLAR__DNS__MEMORY__LIMIT__
requests:
cpu: 100m
memory: 70Mi
args: [ "-conf", "/etc/coredns/Corefile" ]
volumeMounts:
- name: config-volume
mountPath: /etc/coredns
readOnly: true
ports:
- containerPort: 53
name: dns
protocol: UDP
- containerPort: 53
name: dns-tcp
protocol: TCP
- containerPort: 9153
name: metrics
protocol: TCP
livenessProbe:
httpGet:
path: /health
port: 8080
scheme: HTTP
initialDelaySeconds: 60
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 5
readinessProbe:
httpGet:
path: /health
port: 8080
scheme: HTTP
securityContext:
allowPrivilegeEscalation: false
capabilities:
add:
- NET_BIND_SERVICE
drop:
- all
readOnlyRootFilesystem: true
dnsPolicy: Default
volumes:
- name: config-volume
configMap:
name: coredns
items:
- key: Corefile
path: Corefile
---
apiVersion: v1
kind: Service
metadata:
name: kube-dns
namespace: kube-system
annotations:
prometheus.io/port: "9153"
prometheus.io/scrape: "true"
labels:
k8s-app: kube-dns
kubernetes.io/cluster-service: "true"
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/name: "CoreDNS"
spec:
selector:
k8s-app: kube-dns
clusterIP: 10.254.0.2
ports:
- name: dns
port: 53
protocol: UDP
- name: dns-tcp
port: 53
protocol: TCP
- name: metrics
port: 9153
protocol: TCP
Am I missing something?
From this error message
Error from server (BadRequest):
error when creating "coredns.yaml":
Deployment in version "v1" cannot be handled as a Deployment:
v1.Deployment.Spec:
v1.DeploymentSpec.Template: v
1.PodTemplateSpec.Spec:
v1.PodSpec.Containers: []v1.Container:
v1.Container.Resources:
v1.ResourceRequirements.Requests: Limits: unmarshalerDecoder: quantities must match the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$', error found in #10 byte of ...|__LIMIT__"},"request|..., bigger context ...|limits":{"memory":"__PILLAR__DNS__MEMORY__LIMIT__"},"requests":{"cpu":"100m","memory":"70Mi"}},"secu|...
This part is the root cause.
unmarshalerDecoder:
quantities must match the regular expression
'^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$'
What quantities are there?
Seems like
v1.ResourceRequirements.Requests: Limits:
So please change Requests.Limits from __PILLAR__DNS__MEMORY__LIMIT__ to another value.
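For example, with a plain quantity in place of the placeholder (170Mi is only an illustrative value; pick one that fits your cluster):
resources:
  limits:
    memory: 170Mi
  requests:
    cpu: 100m
    memory: 70Mi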
Please refer to coredns/deployment: in your deployment there are fields like limits: {"memory":"__PILLAR__DNS__MEMORY__LIMIT__"}.
As described in the docs, you can use your own script to override some parameters; for switching from kube-dns to CoreDNS there is a deploy script.
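The deploy script substitutes the __PILLAR__...__ placeholders for you; a typical invocation, assuming the flags from the coredns/deployment README and the clusterIP from your kube-dns Service above, looks like:
./deploy.sh -i 10.254.0.2 -d cluster.local | kubectl apply -f -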
Installing CoreDNS
In Kubernetes version 1.13 and later the CoreDNS feature gate is removed and CoreDNS is used by default.
So you can use your original installation and see the default values in the config map and deployment:
kubectl get configmap coredns -n kube-system -o yaml
Hope this helps.