filebeat cannot connect to logstash because of selinux - kubernetes

Filebeat cannot connect to Logstash. This happened when I upgraded Kubernetes to 1.24: the log path changed, since this version no longer uses Docker as the container runtime.
I only changed this path, and now Filebeat cannot connect to Logstash:
paths:
- "/var/log/containers/*.log"
But when I run this on Kubernetes with the Docker runtime, it works:
paths:
- /var/lib/docker/containers/*/*.log
Here is my filebeat.yml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: kube-system
  labels:
    k8s-app: filebeat
data:
  filebeat.yml: |-
    filebeat.config:
      prospectors:
        # Mounted `filebeat-prospectors` configmap:
        path: ${path.config}/prospectors.d/*.yml
        # Reload prospectors configs as they change:
        reload.enabled: false
      modules:
        path: ${path.config}/modules.d/*.yml
        # Reload module configs as they change:
        reload.enabled: false

    processors:
      - add_cloud_metadata:

    output.logstash:
      hosts: ["log.xxx.com:5044"]
      ssl.certificate_authorities: ["/usr/share/filebeat/certs/elk-ca.pem"]
      ssl.certificate: "/usr/share/filebeat/certs/beats-client.pem"
      ssl.key: "/usr/share/filebeat/certs/beats-client.key"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-prospectors
  namespace: kube-system
  labels:
    k8s-app: filebeat
data:
  kubernetes.yml: |-
    - type: log
      enabled: true
      symlinks: true
      paths:
        - "/var/log/containers/*.log"
      processors:
        - add_kubernetes_metadata:
            in_cluster: true
            default_matchers.enabled: false
            matchers:
              - logs_path:
                  logs_path: /var/log/containers/
Logs:
2022-12-19T13:06:35.189Z INFO instance/beat.go:611 Home path: [/usr/share/filebeat] Config path: [/usr/share/filebeat] Data path: [/usr/share/filebeat/data] Logs path: [/usr/share/filebeat/logs]
2022-12-19T13:06:35.190Z INFO instance/beat.go:618 Beat UUID: 68776622-4469-4ff4-bf1a-442d13456c75
2022-12-19T13:06:35.190Z INFO [seccomp] seccomp/seccomp.go:116 Syscall filter successfully installed
2022-12-19T13:06:35.190Z INFO [beat] instance/beat.go:931 Beat info {"system_info": {"beat": {"path": {"config": "/usr/share/filebeat", "data": "/usr/share/filebeat/data", "home": "/usr/share/filebeat", "logs": "/usr/share/filebeat/logs"}, "type": "filebeat", "uuid": "68776622-4469-4ff4-bf1a-442d13456c75"}}}
2022-12-19T13:06:35.191Z INFO [beat] instance/beat.go:940 Build info {"system_info": {"build": {"commit": "1d55b4bd9dbf106a4ad4bc34fe9ee425d922363b", "libbeat": "6.7.1", "time": "2019-04-02T15:01:15.000Z", "version": "6.7.1"}}}
2022-12-19T13:06:35.191Z INFO [beat] instance/beat.go:943 Go runtime info {"system_info": {"go": {"os":"linux","arch":"amd64","max_procs":4,"version":"go1.10.8"}}}
2022-12-19T13:06:35.193Z INFO [beat] instance/beat.go:947 Host info {"system_info": {"host": {"architecture":"x86_64","boot_time":"2022-12-15T13:22:10Z","containerized":true,"name":"filebeat-2gh9x","ip":["127.0.0.1/8","::1/128","10.220.48.144/32","fe80::74:65ff:fe3e:27f0/64"],"kernel_version":"3.10.0-1160.80.1.el7.x86_64","mac":["02:74:65:3e:27:f0"],"os":{"family":"redhat","platform":"centos","name":"CentOS Linux","version":"7 (Core)","major":7,"minor":6,"patch":1810,"codename":"Core"},"timezone":"UTC","timezone_offset_sec":0}}}
2022-12-19T13:06:35.195Z INFO [beat] instance/beat.go:976 Process info {"system_info": {"process": {"capabilities": {"inheritable":null,"permitted":["chown","dac_override","fowner","fsetid","kill","setgid","setuid","setpcap","net_bind_service","net_raw","sys_chroot","mknod","audit_write","setfcap"],"effective":["chown","dac_override","fowner","fsetid","kill","setgid","setuid","setpcap","net_bind_service","net_raw","sys_chroot","mknod","audit_write","setfcap"],"bounding":["chown","dac_override","fowner","fsetid","kill","setgid","setuid","setpcap","net_bind_service","net_raw","sys_chroot","mknod","audit_write","setfcap"],"ambient":null}, "cwd": "/usr/share/filebeat", "exe": "/usr/share/filebeat/filebeat", "name": "filebeat", "pid": 1, "ppid": 0, "seccomp": {"mode":"filter","no_new_privs":true}, "start_time": "2022-12-19T13:06:34.150Z"}}}
2022-12-19T13:06:35.195Z INFO instance/beat.go:280 Setup Beat: filebeat; Version: 6.7.1
2022-12-19T13:06:35.197Z INFO [publisher] pipeline/module.go:110 Beat name: filebeat-2gh9x
2022-12-19T13:06:35.197Z WARN [cfgwarn] beater/filebeat.go:89 DEPRECATED: config.prospectors are deprecated, Use `config.inputs` instead. Will be removed in version: 7.0.0
2022-12-19T13:06:35.198Z INFO instance/beat.go:402 filebeat start running.
2022-12-19T13:06:35.198Z INFO registrar/registrar.go:97 No registry file found under: /usr/share/filebeat/data/registry. Creating a new registry file.
2022-12-19T13:06:35.200Z INFO registrar/registrar.go:134 Loading registrar data from /usr/share/filebeat/data/registry
2022-12-19T13:06:35.200Z INFO registrar/registrar.go:141 States Loaded from registrar: 0
2022-12-19T13:06:35.200Z WARN beater/filebeat.go:367 Filebeat is unable to load the Ingest Node pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the Ingest Node pipelines or are using Logstash pipelines, you can ignore this warning.
2022-12-19T13:06:35.200Z INFO crawler/crawler.go:72 Loading Inputs: 0
2022-12-19T13:06:35.245Z INFO [monitoring] log/log.go:117 Starting metrics logging every 30s
2022-12-19T13:06:35.484Z INFO kubernetes/util.go:86 kubernetes: Using pod name filebeat-2gh9x and namespace kube-system to discover kubernetes node
2022-12-19T13:06:35.576Z INFO add_cloud_metadata/add_cloud_metadata.go:345 add_cloud_metadata: hosting provider type detected as ec2, metadata={"availability_zone":"ap-southeast-1a","instance_id":"i-01eae62cd2b63a734","machine_type":"t3.xlarge","provider":"ec2","region":"ap-southeast-1"}
2022-12-19T13:06:35.730Z INFO kubernetes/util.go:93 kubernetes: Using node ip-10-12-xxx.ap-southeast-1.compute.internal discovered by in cluster pod node query
2022-12-19T13:06:35.731Z INFO kubernetes/watcher.go:182 kubernetes: Performing a resource sync for *v1.PodList
2022-12-19T13:06:35.766Z INFO kubernetes/watcher.go:198 kubernetes: Resource sync done
2022-12-19T13:06:35.767Z INFO kubernetes/watcher.go:242 kubernetes: Watching API for resource events
2022-12-19T13:06:35.767Z INFO log/input.go:138 Configured paths: [/var/log/containers/*.log]
2022-12-19T13:06:35.767Z INFO cfgfile/reload.go:150 Config reloader started
2022-12-19T13:06:35.767Z INFO crawler/crawler.go:106 Loading and starting Inputs completed. Enabled inputs: 0
2022-12-19T13:06:35.768Z INFO cfgfile/reload.go:150 Config reloader started
2022-12-19T13:06:35.768Z INFO cfgfile/reload.go:205 Loading of config files completed.
2022-12-19T13:06:35.770Z INFO kubernetes/util.go:86 kubernetes: Using pod name filebeat-2gh9x and namespace kube-system to discover kubernetes node
2022-12-19T13:06:35.928Z INFO kubernetes/util.go:93 kubernetes: Using node ip-10-12-xxx.ap-southeast-1.compute.interna discovered by in cluster pod node query
2022-12-19T13:06:35.928Z INFO kubernetes/watcher.go:182 kubernetes: Performing a resource sync for *v1.PodList
2022-12-19T13:06:35.934Z INFO kubernetes/watcher.go:198 kubernetes: Resource sync done
2022-12-19T13:06:35.934Z INFO kubernetes/watcher.go:242 kubernetes: Watching API for resource events
2022-12-19T13:06:35.935Z INFO log/input.go:138 Configured paths: [/var/log/containers/*.log]
2022-12-19T13:06:35.935Z INFO input/input.go:114 Starting input of type: log; ID: 12715084902863065571
2022-12-19T13:06:35.935Z INFO cfgfile/reload.go:205 Loading of config files completed.
UPDATE
It's an SELinux problem.
Any idea how to fix this other than switching SELinux to permissive mode?
type=AVC msg=audit(1671531416.824:408337): avc: denied { read } for pid=10185 comm="filebeat" name="containers" dev="dm-3" ino=524357 scontext=system_u:system_r:container_t:s0:c579,c853 tcontext=system_u:object_r:container_log_t:s0 tclass=dir permissive=0
type=SYSCALL msg=audit(1671531416.824:408337): arch=c000003e syscall=257 success=no exit=-13 a0=ffffffffffffff9c a1=c4202b3b20 a2=80000 a3=0 items=0 ppid=10123 pid=10185 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="filebeat" exe="/usr/share/filebeat/filebeat" subj=system_u:system_r:container_t:s0:c579,c853 key=(null)
type=PROCTITLE msg=audit(1671531416.824:408337): proctitle=66696C6562656174002D63002F6574632F66696C65626561742E796D6C002D65
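A sketch of one option that keeps SELinux enforcing is to turn the recorded AVC denials into a local policy module with audit2allow (run on the affected node; the module name is arbitrary):
# Build and install a local SELinux policy module from the filebeat denials in the audit log
grep 'comm="filebeat"' /var/log/audit/audit.log | audit2allow -M filebeat-logread
semodule -i filebeat-logread.pp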

Related

Open-telemetry auto instrumentation does not work without sidecar

I work at a startup, and we recently migrated our workloads to use Kubernetes, specifically we are running inside a cluster in EKS (AWS).
I'm currently trying to implement an observability stack on our cluster. I'm running Signoz on a separate EC2 instance (for tests, and because our cluster uses small machines that are not supported by their Helm chart).
In the cluster, I'm running the Open Telemetry Operator, and have managed to deploy a Collector in deployment mode, and have validated that it is able to connect to the signoz instance. However, when I try to auto instrument my applications, I'm not able to do it without using sidecars.
My manifest file for the elements above is below.
apiVersion: v1
kind: Namespace
metadata:
  name: opentelemetry
  labels:
    name: opentelemetry
---
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: opentelemetry
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 15
      batch:
        send_batch_size: 10000
        timeout: 10s
    exporters:
      otlp:
        endpoint: obs.stg.company.domain:4317
        tls:
          insecure: true
      logging:
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp, logging]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp, logging]
        logs:
          receivers: [otlp]
          processors: []
          exporters: [otlp, logging]
---
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
  namespace: opentelemetry
spec:
  exporter:
    endpoint: http://otel-collector-collector.opentelemetry.svc.cluster.local:4317
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
  dotnet:
  nodejs:
When I apply the annotation instrumentation.opentelemetry.io/inject-dotnet=opentelemetry/auto-instrumentation to the deployment of the application, or even to the namespace, and delete the pod (so it is recreated), I can see that the init container for dotnet auto instrumentation runs without a problem, but I get no traces, metrics or logs, either on the Collector or in Signoz.
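For reference, a sketch of where the annotation goes, on the pod template of the workload rather than only on the Deployment metadata (the deployment name and image are placeholders; as far as I understand, the annotation value references the Instrumentation resource as <namespace>/<Instrumentation name>):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api                # placeholder name
  namespace: application
spec:
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
      annotations:
        # <namespace>/<Instrumentation name>
        instrumentation.opentelemetry.io/inject-dotnet: "opentelemetry/my-instrumentation"
    spec:
      containers:
        - name: my-api
          image: registry.example.com/my-api:latest   # placeholder image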
If I create another collector in sidecar mode, like the one below, point the instrumentation to this collector, and also apply annotation sidecar.opentelemetry.io/inject=sidecar to the namespace, everything works fine.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: sidecar
  namespace: application
spec:
  mode: sidecar
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      batch:
    exporters:
      logging:
      otlp:
        endpoint: "http://otel-collector-collector.opentelemetry.svc.cluster.local:4317"
        tls:
          insecure: true
    service:
      telemetry:
        logs:
          level: "debug"
      pipelines:
        traces:
          receivers: [otlp]
          processors: []
          exporters: [logging, otlp]
The reason I am trying to do it without sidecars is that, as I said before, we have quite a small cluster, and would like to keep overhead to a minimum.
So, first, I would like to understand whether I should even be worried about sidecars, i.e. whether their overhead is measurably different from not using them.
And second, I would like to understand what went wrong with my config, since I believe I followed all the instructions in Signoz's documentation.
Thank you for any help you guys can provide.

Can't Curl Services running in the kubernetes cluster from the vm in istio mesh

I am trying to deploy Istio on Virtual Machines. In my current architecture I have a Kubernetes cluster which runs the Istio control plane (istiod), and a VM which runs the ratings application from the well-known Bookinfo Istio sample. I am following the multi-network implementation as described here (https://istio.io/latest/docs/setup/install/virtual-machine/). I have followed each step of the documentation and have successfully completed all the setup.
Error:
When I try to call the service running in Kubernetes I get the error upstream connect error or disconnect/reset before headers. reset reason: connection failure.
I can successfully call the service running on the VM from Kubernetes.
Logs of the Istio services running on the VM:
2022-09-02T14:24:08.165388Z info FLAG: --domain=""
2022-09-02T14:24:08.165394Z info FLAG: --help="false"
2022-09-02T14:24:08.165396Z info FLAG: --log_as_json="false"
2022-09-02T14:24:08.165399Z info FLAG: --log_caller=""
2022-09-02T14:24:08.165401Z info FLAG: --log_output_level="dns:debug"
2022-09-02T14:24:08.165404Z info FLAG: --log_rotate=""
2022-09-02T14:24:08.165407Z info FLAG: --log_rotate_max_age="30"
2022-09-02T14:24:08.165409Z info FLAG: --log_rotate_max_backups="1000"
2022-09-02T14:24:08.165412Z info FLAG: --log_rotate_max_size="104857600"
2022-09-02T14:24:08.165414Z info FLAG: --log_stacktrace_level="default:none"
2022-09-02T14:24:08.165420Z info FLAG: --log_target="[stdout]"
2022-09-02T14:24:08.165423Z info FLAG: --meshConfig="./etc/istio/config/mesh"
2022-09-02T14:24:08.165426Z info FLAG: --outlierLogPath=""
2022-09-02T14:24:08.165428Z info FLAG: --proxyComponentLogLevel=""
2022-09-02T14:24:08.165431Z info FLAG: --proxyLogLevel="debug"
2022-09-02T14:24:08.165433Z info FLAG: --serviceCluster="istio-proxy"
2022-09-02T14:24:08.165436Z info FLAG: --stsPort="0"
2022-09-02T14:24:08.165438Z info FLAG: --templateFile=""
2022-09-02T14:24:08.165441Z info FLAG: --tokenManagerPlugin="GoogleTokenExchange"
2022-09-02T14:24:08.165450Z info FLAG: --vklog="0"
2022-09-02T14:24:08.165457Z info Version 1.13.2-91533d04e894ff86b80acd6d7a4517b144f9e19a-Clean
2022-09-02T14:24:08.165587Z info Proxy role ips=[10.243.0.35 fe80::3cff:fe38:afc8] type=sidecar id=istio-on-vm-three.ratings domain=ratings.svc.cluster.local
2022-09-02T14:24:08.165626Z info Apply mesh config from file defaultConfig:
discoveryAddress: istiod.istio-system.svc:15012
meshId: mesh1
proxyMetadata:
CANONICAL_REVISION: latest
CANONICAL_SERVICE: ratings
ISTIO_META_AUTO_REGISTER_GROUP: ratings
ISTIO_META_CLUSTER_ID: cc90a48f0mfd7shso5g0
ISTIO_META_DNS_CAPTURE: "true"
ISTIO_META_MESH_ID: mesh1
ISTIO_META_NETWORK: ""
ISTIO_META_WORKLOAD_NAME: ratings
ISTIO_METAJSON_LABELS: '{"app":"ratings","service.istio.io/canonical-name":"ratings","service.istio.io/canonical-revision":"latest"}'
POD_NAMESPACE: ratings
SERVICE_ACCOUNT: bookinfo-ratings
TRUST_DOMAIN: cluster.local
tracing:
zipkin:
address: zipkin.istio-system:9411
2022-09-02T14:24:08.166897Z info Apply proxy config from env
serviceCluster: ratings.ratings
controlPlaneAuthPolicy: MUTUAL_TLS
2022-09-02T14:24:08.167480Z info Effective config: binaryPath: /usr/local/bin/envoy
concurrency: 2
configPath: ./etc/istio/proxy
controlPlaneAuthPolicy: MUTUAL_TLS
discoveryAddress: istiod.istio-system.svc:15012
drainDuration: 45s
meshId: mesh1
parentShutdownDuration: 60s
proxyAdminPort: 15000
proxyMetadata:
CANONICAL_REVISION: latest
CANONICAL_SERVICE: ratings
ISTIO_META_AUTO_REGISTER_GROUP: ratings
ISTIO_META_CLUSTER_ID: cc90a48f0mfd7shso5g0
ISTIO_META_DNS_CAPTURE: "true"
ISTIO_META_MESH_ID: mesh1
ISTIO_META_NETWORK: ""
ISTIO_META_WORKLOAD_NAME: ratings
ISTIO_METAJSON_LABELS: '{"app":"ratings","service.istio.io/canonical-name":"ratings","service.istio.io/canonical-revision":"latest"}'
POD_NAMESPACE: ratings
SERVICE_ACCOUNT: bookinfo-ratings
TRUST_DOMAIN: cluster.local
serviceCluster: ratings.ratings
statNameLength: 189
statusPort: 15020
terminationDrainDuration: 5s
tracing:
zipkin:
address: zipkin.istio-system:9411
2022-09-02T14:24:08.167495Z info JWT policy is third-party-jwt
2022-09-02T14:24:13.167597Z info timed out waiting for platform detection, treating it as Unknown
2022-09-02T14:24:13.167892Z info Opening status port 15020
2022-09-02T14:24:13.168029Z debug dns initialized DNS search=[.] servers=[127.0.0.53:53]
2022-09-02T14:24:13.169553Z info dns Starting local udp DNS server on 127.0.0.1:15053
2022-09-02T14:24:13.169584Z info dns Starting local tcp DNS server on 127.0.0.1:15053
2022-09-02T14:24:13.169628Z info CA Endpoint istiod.istio-system.svc:15012, provider Citadel
2022-09-02T14:24:13.169647Z info Using CA istiod.istio-system.svc:15012 cert with certs: /etc/certs/root-cert.pem
2022-09-02T14:24:13.169782Z info citadelclient Citadel client using custom root cert: /etc/certs/root-cert.pem
2022-09-02T14:24:13.182361Z info ads All caches have been synced up in 5.02146778s, marking server ready
2022-09-02T14:24:13.182736Z info sds SDS server for workload certificates started, listening on "etc/istio/proxy/SDS"
2022-09-02T14:24:13.182795Z info xdsproxy Initializing with upstream address "istiod.istio-system.svc:15012" and cluster "cc90a48f0mfd7shso5g0"
2022-09-02T14:24:13.182770Z info sds Starting SDS grpc server
2022-09-02T14:24:13.183203Z info starting Http service at 127.0.0.1:15004
2022-09-02T14:24:13.184810Z info Pilot SAN: [istiod.istio-system.svc]
2022-09-02T14:24:13.186415Z info Starting proxy agent
2022-09-02T14:24:13.186444Z info Epoch 0 starting
2022-09-02T14:24:13.186463Z info Envoy command: [-c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --drain-strategy immediate --parent-shutdown-time-s 60 --local-address-ip-version v4 --file-flush-interval-msec 1000 --disable-hot-restart --log-format %Y-%m-%dT%T.%fZ %l envoy %n %v -l debug --concurrency 2]
2022-09-02T14:24:13.264923Z info xdsproxy connected to upstream XDS server: istiod.istio-system.svc:15012
2022-09-02T14:24:13.284519Z info cache generated new workload certificate latency=101.82115ms ttl=23h59m59.715492792s
2022-09-02T14:24:13.284554Z info cache Root cert has changed, start rotating root cert
2022-09-02T14:24:13.284578Z info ads XDS: Incremental Pushing:0 ConnectedEndpoints:0 Version:
2022-09-02T14:24:13.284993Z info cache returned workload trust anchor from cache ttl=23h59m59.715012276s
2022-09-02T14:24:13.327799Z info ads ADS: new connection for node:istio-on-vm-three.ratings-1
2022-09-02T14:24:13.327908Z info cache returned workload certificate from cache ttl=23h59m59.672096732s
2022-09-02T14:24:13.328260Z info ads SDS: PUSH request for node:istio-on-vm-three.ratings resources:1 size:4.0kB resource:default
2022-09-02T14:24:13.367689Z info ads ADS: new connection for node:istio-on-vm-three.ratings-2
2022-09-02T14:24:13.367769Z info cache returned workload trust anchor from cache ttl=23h59m59.63223465s
2022-09-02T14:24:13.367948Z info ads SDS: PUSH request for node:istio-on-vm-three.ratings resources:1 size:1.1kB resource:ROOTCA
2022-09-02T14:24:13.387123Z debug dns updated lookup table with 96 hosts
2022-09-02T14:24:22.280792Z info Agent draining Proxy
2022-09-02T14:24:22.280825Z info Status server has successfully terminated
2022-09-02T14:24:22.281118Z error accept tcp [::]:15020: use of closed network connection
2022-09-02T14:24:22.282028Z info Graceful termination period is 5s, starting...
2022-09-02T14:24:27.282551Z info Graceful termination period complete, terminating remaining proxies.
2022-09-02T14:24:27.282610Z warn Aborted all epochs
2022-09-02T14:24:27.282622Z warn Aborting epoch 0
2022-09-02T14:24:27.282889Z info Epoch 0 aborted normally
2022-09-02T14:24:27.282899Z info Agent has successfully terminated
2022-09-02T14:24:57.386419Z info FLAG: --concurrency="0"
2022-09-02T14:24:57.386463Z info FLAG: --domain=""
2022-09-02T14:24:57.386471Z info FLAG: --help="false"
2022-09-02T14:24:57.386474Z info FLAG: --log_as_json="false"
2022-09-02T14:24:57.386477Z info FLAG: --log_caller=""
2022-09-02T14:24:57.386480Z info FLAG: --log_output_level="dns:debug"
2022-09-02T14:24:57.386482Z info FLAG: --log_rotate=""
2022-09-02T14:24:57.386486Z info FLAG: --log_rotate_max_age="30"
2022-09-02T14:24:57.386489Z info FLAG: --log_rotate_max_backups="1000"
2022-09-02T14:24:57.386492Z info FLAG: --log_rotate_max_size="104857600"
2022-09-02T14:24:57.386495Z info FLAG: --log_stacktrace_level="default:none"
2022-09-02T14:24:57.386504Z info FLAG: --log_target="[stdout]"
2022-09-02T14:24:57.386507Z info FLAG: --meshConfig="./etc/istio/config/mesh"
2022-09-02T14:24:57.386510Z info FLAG: --outlierLogPath=""
2022-09-02T14:24:57.386512Z info FLAG: --proxyComponentLogLevel=""
2022-09-02T14:24:57.386515Z info FLAG: --proxyLogLevel="debug"
2022-09-02T14:24:57.386518Z info FLAG: --serviceCluster="istio-proxy"
2022-09-02T14:24:57.386521Z info FLAG: --stsPort="0"
2022-09-02T14:24:57.386533Z info FLAG: --templateFile=""
2022-09-02T14:24:57.386544Z info FLAG: --tokenManagerPlugin="GoogleTokenExchange"
2022-09-02T14:24:57.386553Z info FLAG: --vklog="0"
2022-09-02T14:24:57.386559Z info Version 1.13.2-91533d04e894ff86b80acd6d7a4517b144f9e19a-Clean
2022-09-02T14:24:57.386705Z info Proxy role ips=[10.243.0.35 fe80::3cff:fe38:afc8] type=sidecar id=istio-on-vm-three.ratings domain=ratings.svc.cluster.local
2022-09-02T14:24:57.386749Z info Apply mesh config from file defaultConfig:
discoveryAddress: istiod.istio-system.svc:15012
meshId: mesh1
proxyMetadata:
CANONICAL_REVISION: latest
CANONICAL_SERVICE: ratings
ISTIO_META_AUTO_REGISTER_GROUP: ratings
ISTIO_META_CLUSTER_ID: cc90a48f0mfd7shso5g0
ISTIO_META_DNS_CAPTURE: "true"
ISTIO_META_MESH_ID: mesh1
ISTIO_META_NETWORK: ""
ISTIO_META_WORKLOAD_NAME: ratings
ISTIO_METAJSON_LABELS: '{"app":"ratings","service.istio.io/canonical-name":"ratings","service.istio.io/canonical-revision":"latest"}'
POD_NAMESPACE: ratings
SERVICE_ACCOUNT: bookinfo-ratings
TRUST_DOMAIN: cluster.local
tracing:
zipkin:
address: zipkin.istio-system:9411
2022-09-02T14:24:57.387852Z info Apply proxy config from env
serviceCluster: ratings.ratings
controlPlaneAuthPolicy: MUTUAL_TLS
2022-09-02T14:24:57.388363Z info Effective config: binaryPath: /usr/local/bin/envoy
concurrency: 2
configPath: ./etc/istio/proxy
controlPlaneAuthPolicy: MUTUAL_TLS
discoveryAddress: istiod.istio-system.svc:15012
drainDuration: 45s
meshId: mesh1
parentShutdownDuration: 60s
proxyAdminPort: 15000
proxyMetadata:
CANONICAL_REVISION: latest
CANONICAL_SERVICE: ratings
ISTIO_META_AUTO_REGISTER_GROUP: ratings
ISTIO_META_CLUSTER_ID: cc90a48f0mfd7shso5g0
ISTIO_META_DNS_CAPTURE: "true"
ISTIO_META_MESH_ID: mesh1
ISTIO_META_NETWORK: ""
ISTIO_META_WORKLOAD_NAME: ratings
ISTIO_METAJSON_LABELS: '{"app":"ratings","service.istio.io/canonical-name":"ratings","service.istio.io/canonical-revision":"latest"}'
POD_NAMESPACE: ratings
SERVICE_ACCOUNT: bookinfo-ratings
TRUST_DOMAIN: cluster.local
serviceCluster: ratings.ratings
statNameLength: 189
statusPort: 15020
terminationDrainDuration: 5s
tracing:
zipkin:
address: zipkin.istio-system:9411
2022-09-02T14:24:57.388378Z info JWT policy is third-party-jwt
2022-09-02T14:25:02.388947Z info timed out waiting for platform detection, treating it as Unknown
2022-09-02T14:25:02.389180Z debug dns initialized DNS search=[.] servers=[127.0.0.53:53]
2022-09-02T14:25:02.389248Z info dns Starting local udp DNS server on 127.0.0.1:15053
2022-09-02T14:25:02.389249Z info Opening status port 15020
2022-09-02T14:25:02.389413Z info dns Starting local tcp DNS server on 127.0.0.1:15053
2022-09-02T14:25:02.389432Z info CA Endpoint istiod.istio-system.svc:15012, provider Citadel
2022-09-02T14:25:02.389445Z info Using CA istiod.istio-system.svc:15012 cert with certs: /etc/certs/root-cert.pem
2022-09-02T14:25:02.389532Z info citadelclient Citadel client using custom root cert: /etc/certs/root-cert.pem
2022-09-02T14:25:02.402154Z info ads All caches have been synced up in 5.019952409s, marking server ready
2022-09-02T14:25:02.402449Z info sds SDS server for workload certificates started, listening on "etc/istio/proxy/SDS"
2022-09-02T14:25:02.402475Z info xdsproxy Initializing with upstream address "istiod.istio-system.svc:15012" and cluster "cc90a48f0mfd7shso5g0"
2022-09-02T14:25:02.402487Z info sds Starting SDS grpc server
2022-09-02T14:25:02.402794Z info starting Http service at 127.0.0.1:15004
2022-09-02T14:25:02.403926Z info Pilot SAN: [istiod.istio-system.svc]
2022-09-02T14:25:02.405489Z info Starting proxy agent
2022-09-02T14:25:02.405522Z info Epoch 0 starting
2022-09-02T14:25:02.405560Z info Envoy command: [-c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --drain-strategy immediate --parent-shutdown-time-s 60 --local-address-ip-version v4 --file-flush-interval-msec 1000 --disable-hot-restart --log-format %Y-%m-%dT%T.%fZ %l envoy %n %v -l debug --concurrency 2]
2022-09-02T14:25:02.480867Z info xdsproxy connected to upstream XDS server: istiod.istio-system.svc:15012
2022-09-02T14:25:02.552937Z info ads ADS: new connection for node:istio-on-vm-three.ratings-1
2022-09-02T14:25:02.592884Z info ads ADS: new connection for node:istio-on-vm-three.ratings-2
2022-09-02T14:25:02.602362Z info cache generated new workload certificate latency=199.854356ms ttl=23h59m59.397649371s
2022-09-02T14:25:02.602401Z info cache Root cert has changed, start rotating root cert
2022-09-02T14:25:02.602421Z info ads XDS: Incremental Pushing:0 ConnectedEndpoints:2 Version:
2022-09-02T14:25:02.602531Z info cache returned workload trust anchor from cache ttl=23h59m59.397477611s
2022-09-02T14:25:02.602586Z info cache returned workload certificate from cache ttl=23h59m59.397417006s
2022-09-02T14:25:02.602881Z info cache returned workload trust anchor from cache ttl=23h59m59.397122534s
2022-09-02T14:25:02.604303Z info ads SDS: PUSH request for node:istio-on-vm-three.ratings resources:1 size:1.1kB resource:ROOTCA
2022-09-02T14:25:02.604361Z info cache returned workload trust anchor from cache ttl=23h59m59.395642519s
2022-09-02T14:25:02.604393Z info ads SDS: PUSH for node:istio-on-vm-three.ratings resources:1 size:1.1kB resource:ROOTCA
2022-09-02T14:25:02.604384Z info ads SDS: PUSH request for node:istio-on-vm-three.ratings resources:1 size:4.0kB resource:default
2022-09-02T14:25:02.622631Z debug dns updated lookup table with 96 hosts
2022-09-02T14:25:04.329218Z debug dns request ;; opcode: QUERY, status: NOERROR, id: 24280
;; flags: rd ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; QUESTION SECTION:
;details.default.svc. IN AAAA
;; ADDITIONAL SECTION:
;; OPT PSEUDOSECTION:
; EDNS: version 0; flags: ; udp: 1200
protocol=udp edns=true id=6240baac-c243-45be-9a10-dfe500a83cfa
2022-09-02T14:25:04.329282Z debug dns response for hostname "details.default.svc." (found=true): ;; opcode: QUERY, status: NOERROR, id: 24280
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;details.default.svc. IN AAAA
protocol=udp edns=true id=6240baac-c243-45be-9a10-dfe500a83cfa
2022-09-02T14:25:04.329305Z debug dns request ;; opcode: QUERY, status: NOERROR, id: 17619
;; flags: rd ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; QUESTION SECTION:
;details.default.svc. IN A
;; ADDITIONAL SECTION:
;; OPT PSEUDOSECTION:
; EDNS: version 0; flags: ; udp: 1200
protocol=udp edns=true id=30fd3d3c-efed-4a27-b8ba-113f56efb67d
2022-09-02T14:25:04.329371Z debug dns response for hostname "details.default.svc." (found=true): ;; opcode: QUERY, status: NOERROR, id: 17619
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;details.default.svc. IN A
;; ANSWER SECTION:
details.default.svc. 30 IN A 172.21.199.92
protocol=udp edns=true id=30fd3d3c-efed-4a27-b8ba-113f56efb67d
Gateway configuration for istiod
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: >
      {"apiVersion":"networking.istio.io/v1alpha3","kind":"Gateway","metadata":{"annotations":{},"name":"istiod-gateway","namespace":"istio-system"},"spec":{"selector":{"istio":"eastwestgateway"},"servers":[{"hosts":["*"],"port":{"name":"tls-istiod","number":15012,"protocol":"tls"},"tls":{"mode":"PASSTHROUGH"}},{"hosts":["*"],"port":{"name":"tls-istiodwebhook","number":15017,"protocol":"tls"},"tls":{"mode":"PASSTHROUGH"}}]}}
  creationTimestamp: '2022-09-02T13:54:17Z'
  generation: 1
  managedFields:
    - apiVersion: networking.istio.io/v1alpha3
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:spec:
          .: {}
          f:selector:
            .: {}
            f:istio: {}
          f:servers: {}
      manager: kubectl-client-side-apply
      operation: Update
      time: '2022-09-02T13:54:17Z'
  name: istiod-gateway
  namespace: istio-system
  resourceVersion: '3685'
  uid: 23f776c9-a4d1-43a7-8992-72be4f933d9d
spec:
  selector:
    istio: eastwestgateway
  servers:
    - hosts:
        - '*'
      port:
        name: tls-istiod
        number: 15012
        protocol: tls
      tls:
        mode: PASSTHROUGH
    - hosts:
        - '*'
      port:
        name: tls-istiodwebhook
        number: 15017
        protocol: tls
      tls:
        mode: PASSTHROUGH
Virtual service for istiod
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: >
      {"apiVersion":"networking.istio.io/v1alpha3","kind":"VirtualService","metadata":{"annotations":{},"name":"istiod-vs","namespace":"istio-system"},"spec":{"gateways":["istiod-gateway"],"hosts":["*"],"tls":[{"match":[{"port":15012,"sniHosts":["*"]}],"route":[{"destination":{"host":"istiod.istio-system.svc.cluster.local","port":{"number":15012}}}]},{"match":[{"port":15017,"sniHosts":["*"]}],"route":[{"destination":{"host":"istiod.istio-system.svc.cluster.local","port":{"number":443}}}]}]}}
  creationTimestamp: '2022-09-02T13:54:17Z'
  generation: 1
  managedFields:
    - apiVersion: networking.istio.io/v1alpha3
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:spec:
          .: {}
          f:gateways: {}
          f:hosts: {}
          f:tls: {}
      manager: kubectl-client-side-apply
      operation: Update
      time: '2022-09-02T13:54:17Z'
  name: istiod-vs
  namespace: istio-system
  resourceVersion: '3686'
  uid: d1b88fde-20a3-48dd-a549-dfe77407e206
spec:
  gateways:
    - istiod-gateway
  hosts:
    - '*'
  tls:
    - match:
        - port: 15012
          sniHosts:
            - '*'
      route:
        - destination:
            host: istiod.istio-system.svc.cluster.local
            port:
              number: 15012
    - match:
        - port: 15017
          sniHosts:
            - '*'
      route:
        - destination:
            host: istiod.istio-system.svc.cluster.local
            port:
              number: 443
Please let me know if you need more information to debug.
After a lot of debugging and trial and error I found the problem and solved it. First, the variables in the WorkloadGroup definition are not explained properly in the official Istio documentation. The docs say to set the network of the VM in the workload group, but not which network, and a VM can have interfaces mapped to both a public and a private network. The solution is to use the IP that is mapped to the default network interface; in my case the eth0 interface was mapped to the private IP of the VM, so for me the workload definition looked like this:
apiVersion: networking.istio.io/v1alpha3
kind: WorkloadGroup
metadata:
  name: "${VM_APP}"
  namespace: "${VM_NAMESPACE}"
spec:
  metadata:
    labels:
      app: "${VM_APP}"
  template:
    serviceAccount: "${SERVICE_ACCOUNT}"
    network: "${VM'S_PRIVATE_IP}"
  probe:
    periodSeconds: 5
    initialDelaySeconds: 1
    httpGet:
      port: 8080
      path: /ready
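To find which IP is bound to the default interface on the VM, something like this works (a sketch; the interface is not necessarily eth0):
# Print the source IP the VM uses for outbound traffic, i.e. the default interface's address
ip route get 8.8.8.8 | awk '{for (i = 1; i <= NF; i++) if ($i == "src") print $(i+1)}'
# Or inspect the interface directly
ip -4 addr show eth0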
Second, the command provided in the docs to create the workload entry is incomplete. To get mesh expansion working in a multi-network mesh, the command should be:
istioctl x workload entry configure -f workloadgroup.yaml -o "${WORK_DIR}" --clusterID "${CLUSTER}" --ingressIP ${EAST_WEST_GATEWAY_IP_ADDRESS} --externalIP ${PRIVATE_IP_OF_THE_VM or ETH0_IP_ADDRESS} --autoregister

Unable to login to high available keycloak cluster

I'm using the Bitnami Helm chart for Keycloak and trying to achieve high availability with 3 Keycloak replicas, using DNS ping.
Chart version: 5.2.8
Image version: 15.1.1-debian-10-r10
Helm repo: https://charts.bitnami.com/bitnami -> bitnami/keycloak
The modified parameters of the values.yaml file are as follows:
global: {}
image:
  registry: docker.io
  repository: bitnami/keycloak
  tag: 15.1.1-debian-10-r10
  pullPolicy: IfNotPresent
  pullSecrets: []
  debug: true
proxyAddressForwarding: true
serviceDiscovery:
  enabled: true
  protocol: dns.DNS_PING
  properties:
    - dns_query=keycloak.identity.svc.cluster.local
  transportStack: tcp
cache:
  ownersCount: 3
  authOwnersCount: 3
replicaCount: 3
ingress:
  enabled: true
  hostname: my-keycloak.keycloak.example.com
  apiVersion: ""
  ingressClassName: "nginx"
  path: /
  pathType: ImplementationSpecific
  annotations: {}
  tls: false
  extraHosts: []
  extraTls: []
  secrets: []
  existingSecret: ""
  servicePort: http
When logging in to the Keycloak UI, after entering the username and password the login does not happen; it redirects back to the login page.
From the pod logs I see the following error:
0:07:05,251 WARN [org.keycloak.events] (default task-1) type=CODE_TO_TOKEN_ERROR, realmId=master, clientId=security-admin-console, userId=null, ipAddress=10.244.5.46, error=invalid_code, grant_type=authorization_code, code_id=157e0483-67fa-4ea4-a964-387f3884cbc9, client_auth_method=client-secret
When I checked this error in forums, some suggestions were to set proxyAddressForwarding to true, but with this as well the issue remains the same.
Apart from this I have tried some other versions of the Helm chart, but with those the UI itself does not load correctly, giving "page not found" errors.
Update
I get the above error, i.e. CODE_TO_TOKEN_ERROR, in the logs when I use the headless service with the ingress. But if I use the service of type ClusterIP with the ingress, the error is as follows:
06:43:37,587 WARN [org.keycloak.events] (default task-6) type=LOGIN_ERROR, realmId=master, clientId=null, userId=null, ipAddress=10.122.0.26, error=expired_code, restart_after_timeout=true, authSessionParentId=453870cd-5580-495d-8f03-f73498cd3ace, authSessionTabId=1d17vpIoysE
Another piece of information I would like to post: I see the following INFO in all the Keycloak pod logs at startup.
05:27:10,437 INFO [org.jgroups.protocols.pbcast.GMS] (ServerService Thread Pool -- 58) my-keycloak-0: no members discovered after 3006 ms: creating cluster as coordinator
This sounds like the 3 members have not discovered each other and formed a Keycloak cluster.
One common scenario that may lead to such a situation is when the node that issued the access code is not the one that receives the code-to-token request. So the client gets the access code from node 1, but the second request reaches node 2 and the value is not yet in that node's cache. The safest approach to prevent such a scenario is to set up a session-sticky load balancer.
I suggest you try setting service.spec.sessionAffinity to ClientIP. Its default value is None.
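A sketch of how that could be applied to the existing Keycloak service (the service name and namespace are placeholders; adjust them to whatever the chart created):
# Patch the service so requests from one client IP keep hitting the same pod
kubectl -n identity patch service my-keycloak -p '{"spec": {"sessionAffinity": "ClientIP"}}'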
This part of the error
expired_code
might indicate a mismatch in timekeeping between the server and the client
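A quick sketch to rule that out is to compare the clock state on the machines involved:
# Run on each node (and the client machine) and compare the output
timedatectl status   # shows whether NTP is active and the clock is synchronized
date -u              # UTC time, easy to compare across machines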

Argo Workflow + Spark operator + App logs not generated

I am in the very early stages of exploring Argo with the Spark operator to run Spark samples on the Minikube setup on my EC2 instance.
The resource details are below; I am not sure why I am not able to see the Spark application logs.
WORKFLOW.YAML
kind: Workflow
metadata:
  name: spark-argo-groupby
spec:
  entrypoint: sparkling-operator
  templates:
    - name: spark-groupby
      resource:
        action: create
        manifest: |
          apiVersion: "sparkoperator.k8s.io/v1beta2"
          kind: SparkApplication
          metadata:
            generateName: spark-argo-groupby
          spec:
            type: Scala
            mode: cluster
            image: gcr.io/spark-operator/spark:v3.0.3
            imagePullPolicy: Always
            mainClass: org.apache.spark.examples.GroupByTest
            mainApplicationFile: local:///opt/spark/spark-examples_2.12-3.1.1-hadoop-2.7.jar
            sparkVersion: "3.0.3"
            driver:
              cores: 1
              coreLimit: "1200m"
              memory: "512m"
              labels:
                version: 3.0.0
            executor:
              cores: 1
              instances: 1
              memory: "512m"
              labels:
                version: 3.0.0
    - name: sparkling-operator
      dag:
        tasks:
          - name: SparkGroupBY
            template: spark-groupby
ROLES
# Role for spark-on-k8s-operator to create resources on cluster
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: spark-cluster-cr
  labels:
    rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit: "true"
rules:
  - apiGroups:
      - sparkoperator.k8s.io
    resources:
      - sparkapplications
    verbs:
      - '*'
---
# Allow airflow-worker service account access for spark-on-k8s
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: argo-spark-crb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: spark-cluster-cr
subjects:
  - kind: ServiceAccount
    name: default
    namespace: argo
ARGO UI
To dig deeper I tried all the steps listed at https://dev.to/crenshaw_dev/how-to-debug-an-argo-workflow-31ng, yet I could not get the application logs.
Basically, when I run these examples I expect the Spark application logs to be printed - in this case the output of the following Scala example:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/GroupByTest.scala
Interestingly, when listing the pods I expected to see driver and executor pods, but I always see only one pod, and it has the logs shown in the attached image. Please help me understand why the logs are not generated and how I can get them.
RAW LOGS
$ kubectl logs spark-pi-dag-739246604 -n argo
time="2021-12-10T13:28:09.560Z" level=info msg="Starting Workflow Executor" version="{v3.0.3 2021-05-11T21:14:20Z 02071057c082cf295ab8da68f1b2027ff8762b5a v3.0.3 clean go1.15.7 gc linux/amd64}"
time="2021-12-10T13:28:09.581Z" level=info msg="Creating a docker executor"
time="2021-12-10T13:28:09.581Z" level=info msg="Executor (version: v3.0.3, build_date: 2021-05-11T21:14:20Z) initialized (pod: argo/spark-pi-dag-739246604) with template:\n{\"name\":\"sparkpi\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"resource\":{\"action\":\"create\",\"manifest\":\"apiVersion: \\\"sparkoperator.k8s.io/v1beta2\\\"\\nkind: SparkApplication\\nmetadata:\\n generateName: spark-pi-dag\\nspec:\\n type: Scala\\n mode: cluster\\n image: gjeevanm/spark:v3.1.1\\n imagePullPolicy: Always\\n mainClass: org.apache.spark.examples.SparkPi\\n mainApplicationFile: local:///opt/spark/spark-examples_2.12-3.1.1-hadoop-2.7.jar\\n sparkVersion: 3.1.1\\n driver:\\n cores: 1\\n coreLimit: \\\"1200m\\\"\\n memory: \\\"512m\\\"\\n labels:\\n version: 3.0.0\\n executor:\\n cores: 1\\n instances: 1\\n memory: \\\"512m\\\"\\n labels:\\n version: 3.0.0\\n\"},\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"minio:9000\",\"bucket\":\"my-bucket\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"my-minio-cred\",\"key\":\"accesskey\"},\"secretKeySecret\":{\"name\":\"my-minio-cred\",\"key\":\"secretkey\"},\"key\":\"spark-pi-dag/spark-pi-dag-739246604\"}}}"
time="2021-12-10T13:28:09.581Z" level=info msg="Loading manifest to /tmp/manifest.yaml"
time="2021-12-10T13:28:09.581Z" level=info msg="kubectl create -f /tmp/manifest.yaml -o json"
time="2021-12-10T13:28:10.348Z" level=info msg=argo/SparkApplication.sparkoperator.k8s.io/spark-pi-daghhl6s
time="2021-12-10T13:28:10.348Z" level=info msg="Starting SIGUSR2 signal monitor"
time="2021-12-10T13:28:10.348Z" level=info msg="No output parameters"
As Michael mentioned in his answer, Argo Workflows does not know how other CRDs (such as SparkApplication that you used) work and thus could not pull the logs from the pods created by that particular CRD.
However, you can add the label workflows.argoproj.io/workflow: {{workflow.name}} to the pods generated by SparkApplication to let Argo Workflows know and then use argo logs -c <container-name> to pull the logs from those pods.
You can find an example here (it uses a Kubeflow CRD, but in your case you'll want to add the labels to the driver and executor sections of your SparkApplication CRD in the resource template): https://github.com/argoproj/argo-workflows/blob/master/examples/k8s-resource-log-selector.yaml
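For example, the driver and executor sections of the SparkApplication in the spark-groupby resource template could carry the workflow label like this (a sketch based on the manifest above):
driver:
  cores: 1
  coreLimit: "1200m"
  memory: "512m"
  labels:
    version: 3.0.0
    workflows.argoproj.io/workflow: "{{workflow.name}}"
executor:
  cores: 1
  instances: 1
  memory: "512m"
  labels:
    version: 3.0.0
    workflows.argoproj.io/workflow: "{{workflow.name}}"
After that, argo logs <workflow-name> -c spark-kubernetes-driver should pick up the driver output (the container name may differ depending on the Spark version).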
Argo Workflows' resource templates (like your spark-groupby template) are simplistic. The Workflow controller is running kubectl create, and that's where its involvement in the SparkApplication ends.
The logs you're seeing from the Argo Workflow pod describe the kubectl create process. Your resource is written to a temporary yaml file and then applied to the cluster.
time="2021-12-10T13:28:09.581Z" level=info msg="Loading manifest to /tmp/manifest.yaml"
time="2021-12-10T13:28:09.581Z" level=info msg="kubectl create -f /tmp/manifest.yaml -o json"
time="2021-12-10T13:28:10.348Z" level=info msg=argo/SparkApplication.sparkoperator.k8s.io/spark-pi-daghhl6s
Old answer:
To view the logs generated by your SparkApplication, you'll need to
follow the Spark docs. I'm not familiar, but I'm guessing the
application gets run in a Pod somewhere. If you can find that pod, you
should be able to view the logs with kubectl logs.
It would be really cool if Argo Workflows could pull Spark logs into
its UI. But building a generic solution would probably be
prohibitively difficult.
Update:
Check Yuan's answer. There's a way to pull the Spark logs into the Workflows CLI!

GKE - HPA using custom metrics - unable to fetch metrics

I have custom metrics exported to Google Cloud Monitoring and I want to scale my deployment according to them.
This is my HPA:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: <DEPLOYMENT>-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <DEPLOYMENT>
  minReplicas: 5
  maxReplicas: 100
  metrics:
    - type: External
      external:
        metricName: "custom.googleapis.com|rabbit_mq|test|messages_count"
        metricSelector:
          matchLabels:
            metric.labels.name: production
        targetValue: 1
When describing the HPA I see:
Warning FailedComputeMetricsReplicas 4m23s (x12 over 7m23s) horizontal-pod-autoscaler Invalid metrics (1 invalid out of 1), last error was: failed to get external metric custom.googleapis.com|rabbit_mq|test|messages_count: unable to get external metric production/custom.googleapis.com|rabbit_mq|test|messages_count/&LabelSelector{MatchLabels:map[string]string{metric.labels.name: production,},MatchExpressions:[],}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get custom.googleapis.com|rabbit_mq|test|messages_count.external.metrics.k8s.io)
Warning FailedGetExternalMetric 2m23s (x20 over 7m23s) horizontal-pod-autoscaler unable to get external metric production/custom.googleapis.com|rabbit_mq|test|messages_count/&LabelSelector{MatchLabels:map[string]string{metric.labels.name: production,},MatchExpressions:[],}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get custom.googleapis.com|rabbit_mq|test|messages_count.external.metrics.k8s.io)
And:
Metrics: ( current / target )
"custom.googleapis.com|rabbit_mq|test|messages_count" (target value): <unknown> / 1
Kubernetes is unable to get the metric.
I validated that the metric is available and updated through the Monitoring dashboard.
The cluster nodes have Full Control access for Stackdriver Monitoring.
The Kubernetes version is 1.15.
What may be causing this?
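A quick sketch for checking whether the external metrics API itself is serving requests:
# A healthy adapter should return an API resource list here instead of a server error
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"
kubectl get apiservices | grep metrics.k8s.io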
Edit 1
I discovered that the stackdriver-metadata-agent-cluster-level deployment is in CrashLoopBackOff.
kubectl -n=kube-system logs stackdriver-metadata-agent-cluster-level-f8dcd8b45-nl8dj -c metadata-agent
Logs from the container:
I0408 11:50:41.999214 1 log_spam.go:42] Command line arguments:
I0408 11:50:41.999263 1 log_spam.go:44] argv[0]: '/k8s_metadata'
I0408 11:50:41.999271 1 log_spam.go:44] argv[1]: '-logtostderr'
I0408 11:50:41.999277 1 log_spam.go:44] argv[2]: '-v=1'
I0408 11:50:41.999284 1 log_spam.go:46] Process id 1
I0408 11:50:41.999311 1 log_spam.go:50] Current working directory /
I0408 11:50:41.999336 1 log_spam.go:52] Built on Jun 27 20:15:21 (1561666521)
at gcm-agent-dev-releaser#ikle14.prod.google.com:/google/src/files/255462966/depot/branches/gcm_k8s_metadata_release_branch/255450506.1/OVERLAY_READONLY/google3
as //cloud/monitoring/agents/k8s_metadata:k8s_metadata
with gc go1.12.5 for linux/amd64
from changelist 255462966 with baseline 255450506 in a mint client based on //depot/branches/gcm_k8s_metadata_release_branch/255450506.1/google3
Build label: gcm_k8s_metadata_20190627a_RC00
Build tool: Blaze, release blaze-2019.06.17-2 (mainline #253503028)
Build target: //cloud/monitoring/agents/k8s_metadata:k8s_metadata
I0408 11:50:41.999641 1 trace.go:784] Starting tracingd dapper tracing
I0408 11:50:41.999785 1 trace.go:898] Failed loading config; disabling tracing: open /export/hda3/trace_data/trace_config.proto: no such file or directory
W0408 11:50:42.003682 1 client_config.go:549] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
E0408 11:50:43.999995 1 main.go:110] Will only handle some server resources due to partial failure: unable to retrieve the complete list of server APIs: custom.metrics.k8s.io/v1beta1: the server is currently unable to handle the request, custom.metrics.k8s.io/v1beta2: the server is currently unable to handle the request, external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request
I0408 11:50:44.000286 1 main.go:134] Initiating watch for { v1 nodes} resources
I0408 11:50:44.000394 1 main.go:134] Initiating watch for { v1 pods} resources
I0408 11:50:44.097181 1 main.go:134] Initiating watch for {batch v1beta1 cronjobs} resources
I0408 11:50:44.097488 1 main.go:134] Initiating watch for {apps v1 daemonsets} resources
I0408 11:50:44.098123 1 main.go:134] Initiating watch for {extensions v1beta1 daemonsets} resources
I0408 11:50:44.098427 1 main.go:134] Initiating watch for {apps v1 deployments} resources
I0408 11:50:44.098713 1 main.go:134] Initiating watch for {extensions v1beta1 deployments} resources
I0408 11:50:44.098919 1 main.go:134] Initiating watch for { v1 endpoints} resources
I0408 11:50:44.099134 1 main.go:134] Initiating watch for {extensions v1beta1 ingresses} resources
I0408 11:50:44.099207 1 main.go:134] Initiating watch for {batch v1 jobs} resources
I0408 11:50:44.099303 1 main.go:134] Initiating watch for { v1 namespaces} resources
I0408 11:50:44.099360 1 main.go:134] Initiating watch for {apps v1 replicasets} resources
I0408 11:50:44.099410 1 main.go:134] Initiating watch for {extensions v1beta1 replicasets} resources
I0408 11:50:44.099461 1 main.go:134] Initiating watch for { v1 replicationcontrollers} resources
I0408 11:50:44.197193 1 main.go:134] Initiating watch for { v1 services} resources
I0408 11:50:44.197348 1 main.go:134] Initiating watch for {apps v1 statefulsets} resources
I0408 11:50:44.197363 1 main.go:142] All resources are being watched, agent has started successfully
I0408 11:50:44.197374 1 main.go:145] No statusz port provided; not starting a server
I0408 11:50:45.197164 1 binarylog.go:95] Starting disk-based binary logging
I0408 11:50:45.197238 1 binarylog.go:265] rpc: flushed binary log to ""
Edit 2
The issue in edit 1 was fixed using the answer in:
https://stackoverflow.com/a/60549732/4869599
But the HPA still can't fetch the metric.
Edit 3
It seems like the issue is caused by the custom-metrics-stackdriver-adapter under the custom-metrics namespace, which is stuck in CrashLoopBackOff.
The logs of the adapter pod:
E0419 13:36:48.036494 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
E0419 13:36:48.832653 1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
E0419 13:36:48.832692 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
E0419 13:36:49.433150 1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
E0419 13:36:49.433191 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
E0419 13:36:51.032656 1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
E0419 13:36:51.032694 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
E0419 13:36:51.235248 1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
A related issue:
https://github.com/GoogleCloudPlatform/k8s-stackdriver/issues/303
The problem was with the custom-metrics-stackdriver-adapter. It was crashing in the metrics-server namespace.
Using the resource found here:
https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter.yaml
And using this image for the deployment (my version was v0.10.2):
gcr.io/google-containers/custom-metrics-stackdriver-adapter:v0.10.1
This fixed the crashing pod, and now the HPA fetches the custom metric.
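For example, the image can be swapped on the existing deployment with something like this (a sketch; the namespace and deployment name are taken from the linked adapter.yaml and may differ in your setup):
kubectl -n custom-metrics set image deployment/custom-metrics-stackdriver-adapter \
  "*=gcr.io/google-containers/custom-metrics-stackdriver-adapter:v0.10.1"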
Check that the metrics-server pod is running in your kube-system namespace, or else you can use this:
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metrics-server
  namespace: kube-system
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
  labels:
    k8s-app: metrics-server
spec:
  selector:
    matchLabels:
      k8s-app: metrics-server
  template:
    metadata:
      name: metrics-server
      labels:
        k8s-app: metrics-server
    spec:
      serviceAccountName: metrics-server
      volumes:
        # mount in tmp so we can safely use from-scratch images and/or read-only containers
        - name: tmp-dir
          emptyDir: {}
      containers:
        - name: metrics-server
          image: k8s.gcr.io/metrics-server-amd64:v0.3.1
          command:
            - /metrics-server
            - --kubelet-insecure-tls
            - --kubelet-preferred-address-types=InternalIP
          imagePullPolicy: Always
          volumeMounts:
            - name: tmp-dir
              mountPath: /tmp
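Once applied, a quick sketch to verify that metrics-server is up and serving:
kubectl -n kube-system get pods -l k8s-app=metrics-server
kubectl top nodes   # returns node metrics once metrics-server is serving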