kafka datadog not sending metrics correctly - apache-kafka

i am trying to send kafka consumer metrics to datadog but its not showing in monitoring when I select the node. The server is giving below check in status
Instance ID: kafka_consumer:d6........f5 [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/kafka_consumer.d/conf.yaml
Total Runs: 567
Metric Samples: Last Run: 0, Total: 0
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 162ms
Last Execution Date : 2021-01-14 10:49:06.000000 UTC
Last Successful Execution Date : 2021-01-14 10:49:06.000000 UTC
metadata:
version.major: 2
version.minor: 5
version.patch: 0
version.raw: 2.5.0
version.scheme: semver
JMXFetch
runtime_version : 11.0.9.1
version : 0.40.3
Initialized checks
kafka
instance_name : kafka-10.128.0.105-9999
message : <no value>
metric_count : 99
service_check_count : 0
status : OK
Failed checks
no checks
JMX is as above. Please help in finding what could be wrong.

Nothing is wrong there. DataDog conceals that the Kafka integration uses Dogstatsd under the hood. When use_dogstatsd: 'true within /etc/datadog-agent/datadog.yaml is set, metrics do appear in DataDog webUI. If that option is not set the default Broker data is available via JMXFetch using sudo -u dd-agent datadog-agent status as also via sudo -u dd-agent datadog-agent check kafka but not in the webUI.

Related

gcloud container clusters upgrade --node-pool: it timed out; how do i use gcloud to tell when it's done?

I'm trying to upgrade my gke cluster from this command:
gcloud container clusters upgrade CLUSTER_NAME --cluster-version=1.15.11-gke.3 \
--node-pool=default-pool --zone=ZONE
I get the following output:
Upgrading test-upgrade-172615287... Done with 0 out of 5 nodes (0.0%): 2 being processed...done.
Timed out waiting for operation <Operation
clusterConditions: []
detail: u'Done with 0 out of 5 nodes (0.0%): 2 being processed'
name: u'operation-NUM-TAG'
nodepoolConditions: []
operationType: OperationTypeValueValuesEnum(UPGRADE_NODES, 4)
progress: <OperationProgress
metrics: [<Metric
intValue: 5
name: u'NODES_TOTAL'>, <Metric
intValue: 0
name: u'NODES_FAILED'>, <Metric
intValue: 0
name: u'NODES_COMPLETE'>, <Metric
intValue: 0
name: u'NODES_DONE'>]
stages: []>
…
status: StatusValueValuesEnum(RUNNING, 2)
…>
ERROR: (gcloud.container.clusters.upgrade) Operation [DATA_SAME_AS_IN_TIMEOUT] is still running
I just discovered gcloud config set builds/timeout 3600 so I hope this doesn't happen again, like in my CI. But if it does, is there a gcloud command that lets me know that the upgrade is still in progress? These two didn't provide that:
gcloud container clusters describe CLUSTER_NAME --zone=ZONE
gcloud container node-pools describe default-pool --cluster=CLUSTER_NAME --zone=ZONE
Note: Doing this upgrade in the console took 2 hours so I'm not surprised the command-line attempt timed out. This is for a CI, so I'm fine looping and sleeping for 4 hours or so before giving up. But what's the command that will let me know when the cluster is being upgraded, and when it either finishes or fails? The UI is showing the cluster is still undergoing the upgrade, so I assume there is some command.
TIA as usual
Bumped into the same issue.
All gcloud commands, including gcloud container operations wait OPERATION_ID(https://cloud.google.com/sdk/gcloud/reference/container/operations/wait), have the same 1-hour timeout.
At this point, there is no other way to wait for the upgrade to complete than to query gcloud container operations list and check the STATUS in a loop.

Kubernetes 1.15.5 and romana 2.0.2 getting network errors when ANY pods added or removed

I have encountered some mysterious network errors in our kubernetes cluster. Although I originally encountered these errors using ingress, there are even more errors when I bypass our load balancer, bypass kube-proxy and bypass nginx-ingress. The most errors are present when going directly to services and straight to the pod IPs. I believe this is because the load balancer and nginx have some better error handling than the raw iptable routing.
To test the error I use apache benchmark from VM on same subnet, any concurrency level, no keep-alive, connect to the pod IP and use a high enough request number to give me time to either scale up or scale down a deployment. Odd thing is it doesn't matter at all which deployment I modify since it always causes the same sets of errors even when its not related to the pod I am modifying. ANY additions or removals of pods will trigger apache benchmark errors. Manual deletions, scaling up/down, auto-scaling all trigger errors. If there are no pod changes while the ab test is running then no errors get reported. Note keep-alive does seem to greatly reduce if not eliminate the errors, but I only tested that a handful of times and never saw an error.
Other than some bizarre iptable conflict, I really don't see how deleting pod A can effect network connections of pod B. Since the errors are brief and go away within seconds it seems more like a brief network outage.
Sample ab test: ab -n 5000 -c 2 https://10.112.0.24/
Errors when using HTTPS:
SSL handshake failed (5).
SSL read failed (5) - closing connection
Errors when using HTTP:
apr_socket_recv: Connection reset by peer (104)
apr_socket_recv: Connection refused (111)
Example ab output. I ctl-C after encountering first errors:
$ ab -n 5000 -c 2 https://10.112.0.24/
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 10.112.0.24 (be patient)
Completed 500 requests
Completed 1000 requests
SSL read failed (5) - closing connection
Completed 1500 requests
^C
Server Software: nginx
Server Hostname: 10.112.0.24
Server Port: 443
SSL/TLS Protocol: TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,2048,256
Document Path: /
Document Length: 2575 bytes
Concurrency Level: 2
Time taken for tests: 21.670 seconds
Complete requests: 1824
Failed requests: 2
(Connect: 0, Receive: 0, Length: 1, Exceptions: 1)
Total transferred: 5142683 bytes
HTML transferred: 4694225 bytes
Requests per second: 84.17 [#/sec] (mean)
Time per request: 23.761 [ms] (mean)
Time per request: 11.881 [ms] (mean, across all concurrent requests)
Transfer rate: 231.75 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 5 15 9.8 12 82
Processing: 1 9 9.0 6 130
Waiting: 0 8 8.9 6 129
Total: 7 23 14.4 19 142
Percentage of the requests served within a certain time (ms)
50% 19
66% 24
75% 28
80% 30
90% 40
95% 54
98% 66
99% 79
100% 142 (longest request)
Current sysctl settings that may be relevant:
net.netfilter.nf_conntrack_tcp_be_liberal = 1
net.nf_conntrack_max = 131072
net.netfilter.nf_conntrack_buckets = 65536
net.netfilter.nf_conntrack_count = 1280
net.ipv4.ip_local_port_range = 27050 65500
I didn't see any conntrack "full" errors. Best I could tell there isn't packet loss. We recently upgraded from 1.14 and didn't notice the issue but I can't say for certain it wasn't there. I believe we will be forced to migrate away from romana soon since it doesn't seem to be maintained anymore and as we upgrade to kube 1.16.x we are encountering problems with it starting up.
I have searched the internet all day today looking for similar problems and the closest one that resembles our problem is https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02 but I have no idea how to implement the iptable masquerade --random-fully option given we use romana and I read (https://github.com/kubernetes/kubernetes/pull/78547#issuecomment-527578153) that random-fully is the default for linux kernel 5 which we are using. Any ideas?
kubernetes 1.15.5
romana 2.0.2
centos7
Linux kube-master01 5.0.7-1.el7.elrepo.x86_64 #1 SMP Fri Apr 5 18:07:52 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux
====== Update Nov 5, 2019 ======
It has been suggested to test an alternate CNI. I chose calico since we used that in an older Debian based kube cluster. I rebuilt a VM with our most basic Centos 7 template (vSphere) so there is a little baggage coming from our customizations. I can't list everything we customized in our template but the most notable change is the kernel 5 upgrade yum --enablerepo=elrepo-kernel -y install kernel-ml.
After starting up the VM these are the minimal steps to install kubernetes and run the test:
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum -y install docker-ce-3:18.09.6-3.el7.x86_64
systemctl start docker
cat <<EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF
# Set SELinux in permissive mode (effectively disabling it)
setenforce 0
sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config
echo '1' > /proc/sys/net/bridge/bridge-nf-call-iptables
yum install -y kubeadm-1.15.5-0 kubelet-1.15.5-0 kubectl-1.15.5-0
systemctl enable --now kubelet
kubeadm init --pod-network-cidr=192.168.0.0/16
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
kubectl taint nodes --all node-role.kubernetes.io/master-
kubectl apply -f https://docs.projectcalico.org/v3.8/manifests/calico.yaml
cat <<EOF > /tmp/test-deploy.yml
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
name: test
spec:
selector:
matchLabels:
app: test
replicas: 1
template:
metadata:
labels:
app: test
spec:
containers:
- name: nginx
image: nginxdemos/hello
ports:
- containerPort: 80
EOF
# wait for control plane to become healthy
kubectl apply -f /tmp/test-deploy.yml
Now the setup is ready and this is the ab test:
$ docker run --rm jordi/ab -n 100 -c 1 http://192.168.4.4/
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.4.4 (be patient)...apr_pollset_poll: The timeout specified has expired (70007)
Total of 11 requests completed
The ab test gives up after this error. If I decrease the number of requests to see avoid the timeout this is what you would see:
$ docker run --rm jordi/ab -n 10 -c 1 http://192.168.4.4/
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.4.4 (be patient).....done
Server Software: nginx/1.13.8
Server Hostname: 192.168.4.4
Server Port: 80
Document Path: /
Document Length: 7227 bytes
Concurrency Level: 1
Time taken for tests: 0.029 seconds
Complete requests: 10
Failed requests: 0
Total transferred: 74140 bytes
HTML transferred: 72270 bytes
Requests per second: 342.18 [#/sec] (mean)
Time per request: 2.922 [ms] (mean)
Time per request: 2.922 [ms] (mean, across all concurrent requests)
Transfer rate: 2477.50 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 0.8 1 3
Processing: 1 2 1.2 1 4
Waiting: 0 1 1.3 0 4
Total: 1 3 1.4 3 5
Percentage of the requests served within a certain time (ms)
50% 3
66% 3
75% 4
80% 5
90% 5
95% 5
98% 5
99% 5
100% 5 (longest request)
This issue is technically different than the original issue I reported but this is a different CNI and there are still network issues. It does have the timeout error in common when I run the same test in the kube/romana cluster: run the ab test on the same node as the pod. Both encountered the same timeout error but in romana I could get a few thousand requests to finish before hitting the timeout. Calico encounters the timeout error before reaching a dozen requests.
Other variants or notes:
- net.netfilter.nf_conntrack_tcp_be_liberal=0/1 doesn't seem to make a difference
- higher -n values sometimes work but it is largely random.
- running the 'ab' test at low -n values several times in a row can sometimes trigger the timeout
At this point I am pretty sure it is some issue with our centos installation but I can't even guess what it could be. Are there any other limits, sysctl or other configs that could cause this?
====== Update Nov 6, 2019 ======
I observer that we had an older kernel installed in so I upgraded my kube/calico test VM with the same newer kernel 5.3.8-1.el7.elrepo.x86_64. After the update and a few reboots I can no longer reproduce the "apr_pollset_poll: The timeout specified has expired (70007)" timout errors.
Now that the timeout error is gone I was able to repeat the original test where I load test pod A and kill pod B on my vSphere VMs. On the romana environments the problem still existed but only when the load test is on a different host than where the pod A is located. If I run the test on the same host, no errors at all. Using Calico instead of romana, there are no load test errors on either host so the problem was gone. There may still be some setting to tweak that can help romana but I think this is "strike 3" for romana so I will start transitioning a full environment to Calico and do some acceptance testing there to ensure there are no hidden gotchas.
You mentioned that if there are no pod changes while the ab test is running, then no errors get reported. So it means that errors occur when you add pod or delete one.
This is normal behaviour as when pod gets deleted; it takes time for iptable rules changes to propagate. It may happen that container got removed, but iptable rules haven't got changed yet end packets are being forwarded to the nonexisting container, and this causes errors (it is sort of like a race condition).
The first thing you can do is always to create readiness probe as it will make sure that traffic will not be forwarded to the container until it is ready to handle requests.
The second thing to do is to handle deleting the container properly. This is a bit harder task because it may be handled at many levels, but the easiest thing you can do is adding PreStop hook to your container like this:
lifecycle:
preStop:
exec:
command:
- sh
- -c
- "sleep 5"
PreStop hook gets executed at the moment of the pod deletion request. From this moment, k8s start changing iptable rules and it should stop forwarding new traffic to the container that's about to get deleted. While sleeping you give some time for k8s to propagate iptable changes in the cluster while not interrupting already existing connections. After PreStop handle exits, the container will receive SIGTERM signal.
My suggestion would be to apply both of these mechanisms together and check if it helps.
You also mentioned that bypassing ingress is causing more errors. I would assume that this is due to the fact that ingress has implemented retries mechanism. If it's unable to open a connection to a container, it will try several times, and hopefully will get to a container that can handle its request.

configuring kafka with JMX-exporter- centos 7

I want to enable kafka monitoring and I am starting with single node deployment as test. I am following steps from https://alex.dzyoba.com/blog/jmx-exporter/
i tried following steps; the last command which checks for jmx-exporter HTTP server reports blank. i believe this is the reason, why I am not seeing metrics from kafka.(more on this below)
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.6/jmx_prometheus_javaagent-0.6.jar
wget https://raw.githubusercontent.com/prometheus/jmx_exporter/master/example_configs/kafka-0-8-2.yml
export KAFKA_OPTS='-javaagent:/opt/jmx-exporter/jmx_prometheus_javaagent-0.6.jar=7071:/etc/jmx-exporter/kafka-0-8-2.yml'
/opt/kafka_2.11-0.10.1.0/bin/kafka-server-start.sh /opt/kafka_2.11-0.10.1.0/conf/server.properties
netstat -plntu | grep 7071
kafka broker log on console does not have any ERROR message.
i have Prometheus running in a container and http://IP:9090/metrics shows bunch of metrics.
when i searched for "kafka" it returned following
# TYPE net_conntrack_dialer_conn_attempted_total counter
net_conntrack_dialer_conn_attempted_total{dialer_name="kafka"} 79
# TYPE net_conntrack_dialer_conn_closed_total counter
net_conntrack_dialer_conn_closed_total{dialer_name="kafka"} 0
net_conntrack_dialer_conn_established_total{dialer_name="kafka"} 0
# TYPE net_conntrack_dialer_conn_failed_total counter
net_conntrack_dialer_conn_failed_total{dialer_name="kafka",reason="refused"} 79
net_conntrack_dialer_conn_failed_total{dialer_name="kafka",reason="resolution"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="kafka",reason="timeout"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="kafka",reason="unknown"} 79
# TYPE prometheus_sd_discovered_targets gauge
prometheus_sd_discovered_targets{config="kafka",name="scrape"} 1
# HELP prometheus_target_sync_length_seconds Actual interval to sync the scrape pool.
# TYPE prometheus_target_sync_length_seconds summary
prometheus_target_sync_length_seconds{scrape_job="kafka",quantile="0.01"} NaN
prometheus_target_sync_length_seconds{scrape_job="kafka",quantile="0.05"} NaN
prometheus_target_sync_length_seconds{scrape_job="kafka",quantile="0.5"} NaN
prometheus_target_sync_length_seconds{scrape_job="kafka",quantile="0.9"} NaN
prometheus_target_sync_length_seconds{scrape_job="kafka",quantile="0.99"} NaN
prometheus_target_sync_length_seconds_sum{scrape_job="kafka"} 0.000198245
prometheus_target_sync_length_seconds_count{scrape_job="kafka"} 1
My guess is prometheous is not getting any metrics on port 7071; which aligns with earlier finding that JMX server is not respond on port 7071.
can you help me enabling kafka monitoring using JMX-exporter and Prometheus?
I have Prometheus running in a container
You need to configure Prometheus to scrape your external LAN IP, then because you're running Kafka outside of a container.
You can see on this line that the connection is being refused with your current setup
net_conntrack_dialer_conn_failed_total{dialer_name="kafka",reason="refused"} 79
You should either run Prometheus on your host and scrape localhost:7071
Or run Kafka in a container if you want kafka:7071 to be discoverable by Prometheus

Zalenium Readiness probe failed: HTTP probe failed with statuscode: 502

I am trying to deploy the zalenium helm chart in my newly deployed aks Kuberbetes (1.9.6) cluster in Azure. But I don't get it to work. The pod is giving the log below:
[bram#xforce zalenium]$ kubectl logs -f zalenium-zalenium-hub-6bbd86ff78-m25t2 Kubernetes service account found. Copying files for Dashboard... cp: cannot create regular file '/home/seluser/videos/index.html': Permission denied cp: cannot create directory '/home/seluser/videos/css': Permission denied cp: cannot create directory '/home/seluser/videos/js': Permission denied Starting Nginx reverse proxy... Starting Selenium Hub... ..........08:49:14.052 [main] INFO o.o.grid.selenium.GridLauncherV3 - Selenium build info: version: '3.12.0', revision: 'unknown' 08:49:14.120 [main] INFO o.o.grid.selenium.GridLauncherV3 - Launching Selenium Grid hub on port 4445 ...08:49:15.125 [main] INFO d.z.e.z.c.k.KubernetesContainerClient - Initialising Kubernetes support ..08:49:15.650 [main] WARN d.z.e.z.c.k.KubernetesContainerClient - Error initialising Kubernetes support. io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [Pod] with name: [zalenium-zalenium-hub-6bbd86ff78-m25t2] in namespace: [default] failed. at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62) at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:71) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:206) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:162) at de.zalando.ep.zalenium.container.kubernetes.KubernetesContainerClient.(KubernetesContainerClient.java:87) at de.zalando.ep.zalenium.container.ContainerFactory.createKubernetesContainerClient(ContainerFactory.java:35) at de.zalando.ep.zalenium.container.ContainerFactory.getContainerClient(ContainerFactory.java:22) at de.zalando.ep.zalenium.proxy.DockeredSeleniumStarter.(DockeredSeleniumStarter.java:59) at de.zalando.ep.zalenium.registry.ZaleniumRegistry.(ZaleniumRegistry.java:74) at de.zalando.ep.zalenium.registry.ZaleniumRegistry.(ZaleniumRegistry.java:62) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at java.lang.Class.newInstance(Class.java:442) at org.openqa.grid.web.Hub.(Hub.java:93) at org.openqa.grid.selenium.GridLauncherV3$2.launch(GridLauncherV3.java:291) at org.openqa.grid.selenium.GridLauncherV3.launch(GridLauncherV3.java:122) at org.openqa.grid.selenium.GridLauncherV3.main(GridLauncherV3.java:82) Caused by: javax.net.ssl.SSLPeerUnverifiedException: Hostname kubernetes.default.svc not verified: certificate: sha256/OyzkRILuc6LAX4YnMAIGrRKLmVnDgLRvCasxGXDhSoc= DN: CN=client, O=system:masters subjectAltNames: [10.0.0.1] at okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:308) at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:268) at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:160) at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:256) at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:134) at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:113) at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:125) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:56) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:107) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200) at okhttp3.RealCall.execute(RealCall.java:77) at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:379) at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:344) at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:313) at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:296) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:770) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:195) ... 16 common frames omitted 08:49:15.651 [main] INFO d.z.e.z.c.k.KubernetesContainerClient - About to clean up any left over selenium pods created by Zalenium Usage: [options] Options: --debug, -debug : enables LogLevel.FINE. Default: false --version, -version Displays the version and exits. Default: false -browserTimeout in seconds : number of seconds a browser session is allowed to hang while a WebDriver command is running (example: driver.get(url)). If the timeout is reached while a WebDriver command is still processing, the session will quit. Minimum value is 60. An unspecified, zero, or negative value means wait indefinitely. -matcher, -capabilityMatcher class name : a class implementing the CapabilityMatcher interface. Specifies the logic the hub will follow to define whether a request can be assigned to a node. For example, if you want to have the matching process use regular expressions instead of exact match when specifying browser version. ALL nodes of a grid ecosystem would then use the same capabilityMatcher, as defined here. -cleanUpCycle in ms : specifies how often the hub will poll running proxies for timed-out (i.e. hung) threads. Must also specify "timeout" option -custom : comma separated key=value pairs for custom grid extensions. NOT RECOMMENDED -- may be deprecated in a future revision. Example: -custom myParamA=Value1,myParamB=Value2 -host IP or hostname : usually determined automatically. Most commonly useful in exotic network configurations (e.g. network with VPN) Default: 0.0.0.0 -hubConfig filename: a JSON file (following grid2 format), which defines the hub properties -jettyThreads, -jettyMaxThreads : max number of threads for Jetty. An unspecified, zero, or negative value means the Jetty default value (200) will be used. -log filename : the filename to use for logging. If omitted, will log to STDOUT -maxSession max number of tests that can run at the same time on the node, irrespective of the browser used -newSessionWaitTimeout in ms : The time after which a new test waiting for a node to become available will time out. When that happens, the test will throw an exception before attempting to start a browser. An unspecified, zero, or negative value means wait indefinitely. Default: 600000 -port : the port number the server will use. Default: 4445 -prioritizer class name : a class implementing the Prioritizer interface. Specify a custom Prioritizer if you want to sort the order in which new session requests are processed when there is a queue. Default to null ( no priority = FIFO ) -registry class name : a class implementing the GridRegistry interface. Specifies the registry the hub will use. Default: de.zalando.ep.zalenium.registry.ZaleniumRegistry -role options are [hub], [node], or [standalone]. Default: hub -servlet, -servlets : list of extra servlets the grid (hub or node) will make available. Specify multiple on the command line: -servlet tld.company.ServletA -servlet tld.company.ServletB. The servlet must exist in the path: /grid/admin/ServletA /grid/admin/ServletB -timeout, -sessionTimeout in seconds : Specifies the timeout before the server automatically kills a session that hasn't had any activity in the last X seconds. The test slot will then be released for another test to use. This is typically used to take care of client crashes. For grid hub/node roles, cleanUpCycle must also be set. -throwOnCapabilityNotPresent true or false : If true, the hub will reject all test requests if no compatible proxy is currently registered. If set to false, the request will queue until a node supporting the capability is registered with the grid. -withoutServlet, -withoutServlets : list of default (hub or node) servlets to disable. Advanced use cases only. Not all default servlets can be disabled. Specify multiple on the command line: -withoutServlet tld.company.ServletA -withoutServlet tld.company.ServletB org.openqa.grid.common.exception.GridConfigurationException: Error creating class with de.zalando.ep.zalenium.registry.ZaleniumRegistry : null at org.openqa.grid.web.Hub.(Hub.java:97) at org.openqa.grid.selenium.GridLauncherV3$2.launch(GridLauncherV3.java:291) at org.openqa.grid.selenium.GridLauncherV3.launch(GridLauncherV3.java:122) at org.openqa.grid.selenium.GridLauncherV3.main(GridLauncherV3.java:82) Caused by: java.lang.ExceptionInInitializerError at de.zalando.ep.zalenium.registry.ZaleniumRegistry.(ZaleniumRegistry.java:74) at de.zalando.ep.zalenium.registry.ZaleniumRegistry.(ZaleniumRegistry.java:62) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at java.lang.Class.newInstance(Class.java:442) at org.openqa.grid.web.Hub.(Hub.java:93) ... 3 more Caused by: java.lang.NullPointerException at java.util.TreeMap.putAll(TreeMap.java:313) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.withLabels(BaseOperation.java:411) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.withLabels(BaseOperation.java:48) at de.zalando.ep.zalenium.container.kubernetes.KubernetesContainerClient.deleteSeleniumPods(KubernetesContainerClient.java:393) at de.zalando.ep.zalenium.container.kubernetes.KubernetesContainerClient.initialiseContainerEnvironment(KubernetesContainerClient.java:339) at de.zalando.ep.zalenium.container.ContainerFactory.createKubernetesContainerClient(ContainerFactory.java:38) at de.zalando.ep.zalenium.container.ContainerFactory.getContainerClient(ContainerFactory.java:22) at de.zalando.ep.zalenium.proxy.DockeredSeleniumStarter.(DockeredSeleniumStarter.java:59) ... 11 more ...........................................................................................................................................................................................GridLauncher failed to start after 1 minute, failing... % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 182 100 182 0 0 36103 0 --:--:-- --:--:-- --:--:-- 45500
A describe pod gives:
Warning Unhealthy 4m (x12 over 6m) kubelet, aks-agentpool-93668098-0 Readiness probe failed: HTTP probe failed with statuscode: 502
Zalenium Image Version(s):
dosel/zalenium:3
If using Kubernetes, specify your environment, and if relevant your manifests:
I use the templates as is from https://github.com/zalando/zalenium/tree/master/docs/k8s/helm
I guess it has to do something with rbac because of this part
"Error initialising Kubernetes support. io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [Pod] with name: [zalenium-zalenium-hub-6bbd86ff78-m25t2] in namespace: [default] failed. at "
I created a clusterrole and clusterrolebinding for the service account zalenium-zalenium that is automatically created by the Helm chart.
kubectl create clusterrole zalenium --verb=get,list,watch,update,delete,create,patch --resource=pods,deployments,secrets
kubectl create clusterrolebinding zalenium --clusterrole=zalnium --serviceaccount=zalenium-zalenium --namespace=default
Issue had to do with Azure's AKS and Kubernetes. It has been fixed.
See github issue 399
If the typo, mentioned by Ignacio, is not the case why don't you just do
kubectl create clusterrolebinding zalenium --clusterrole=cluster-admin serviceaccount=zalenium-zalenium
Note: no need to specify a namespace if creating cluster role binding, as it is cluster wide (role binding goes with namespaces).

Which jmx metric should be used to monitor the status of a connector in kafka connect?

I'm using the following jmx metrics for kafka connect.
Have a look at Connect Monitoring section in the Kafka docs, it lists all the Kafka Connect specific metrics.
For example there are overall metrics for each connector:
kafka.connect:type=connector-metrics,connector="{connector}" which contains a connector status (running, failed, etc)
kafka.connect:type=connector-task-metrics,connector="{connector}",task="{task}" which contains the status of individual tasks
If you want more than just the status, there are also additional metrics for both sink and source tasks:
kafka.connect:type=connector-task-metrics,connector="{connector}",task="{task}"
kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
I still don't have enough rep to comment but I can answer...
Elaborating on Mickael's answer, be careful: currently task metrics disappear when a task is in a failed state rather than show up with the FAILED status. A Jira can be found here and PR can be found here
Connector status is available under kafka.connect:type=connector-metrics.
With jmxterm, you may notice that the attributes are described as doubles instead of strings:
$>info
#mbean = kafka.connect:connector=dev-kafka-connect-mssql,type=connector-metrics
#class name = org.apache.kafka.common.metrics.JmxReporter$KafkaMbean
# attributes
%0 - connector-class (double, r)
%1 - connector-type (double, r)
%2 - connector-version (double, r)
%3 - status (double, r)
$>get status
#mbean = kafka.connect:connector=dev-kafka-connect-mssql,type=connector-metrics:
status = running;
This resulted in WARN logs from my monitoring agent:
2018-05-23 14:35:53,966 | WARN | JMXAttribute | Unable to get metrics from kafka.connect:type=connector-metrics,connector=dev-kafka-connect-rabbitmq-orders - status
java.lang.NumberFormatException: For input string: "running"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at org.datadog.jmxfetch.JMXAttribute.castToDouble(JMXAttribute.java:270)
at org.datadog.jmxfetch.JMXSimpleAttribute.getMetrics(JMXSimpleAttribute.java:32)
at org.datadog.jmxfetch.JMXAttribute.getMetricsCount(JMXAttribute.java:226)
at org.datadog.jmxfetch.Instance.getMatchingAttributes(Instance.java:332)
at org.datadog.jmxfetch.Instance.init(Instance.java:193)
at org.datadog.jmxfetch.App.instantiate(App.java:604)
at org.datadog.jmxfetch.App.init(App.java:658)
at org.datadog.jmxfetch.App.main(App.java:140)
Each monitoring system may have different fixes, but I suspect the cause may be the same?