Unable to resolve address for Kubernetes service - kubernetes

I have installed a single-node Kafka using Confluent. There is an error in the Kafka pod:
[WARN] 2022-04-26 14:29:47,008 [main-SendThread(zookeeper.confluent.svc.cluster.local:2181)] org.apache.zookeeper.ClientCnxn run - Session 0x0 for sever zookeeper.confluent.svc.cluster.local:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
java.lang.IllegalArgumentException: Unable to canonicalize address zookeeper.confluent.svc.cluster.local:2181 because it's not resolvable
at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:78)
at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41)
at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1161)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1210)
[INFO] 2022-04-26 14:29:47,273 [main] kafka.zookeeper.ZooKeeperClient info - [ZooKeeperClient Kafka server] Closing.
[ERROR] 2022-04-26 14:29:48,112 [main-SendThread(zookeeper.confluent.svc.cluster.local:2181)] org.apache.zookeeper.client.StaticHostProvider resolve - Unable to resolve address: zookeeper.confluent.svc.cluster.local:2181
java.net.UnknownHostException: zookeeper.confluent.svc.cluster.local
Error messages:
Unable to canonicalize address zookeeper.confluent.svc.cluster.local:2181 because it's not resolvable
Unable to resolve address: zookeeper.confluent.svc.cluster.local:2181
I checked my ZooKeeper ... it's good and works without a problem. I also checked DNS using dnsutils:
$ kubectl -n default exec -it dnsutils -- nslookup zookeeper.confluent.svc.cluster.local
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: zookeeper.confluent.svc.cluster.local
Address: 192.168.0.111
What can I do? Is this a k8s-related problem?
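(For anyone hitting the same symptom, a hedged sketch of two extra checks; the pod name and label selector below are assumptions, adjust to your cluster:)
# does the name resolve from inside the Kafka pod itself, not only from dnsutils?
# (nslookup may not exist in the Kafka image; "getent hosts <name>" is an alternative)
$ kubectl -n confluent exec -it <kafka-pod> -- nslookup zookeeper.confluent.svc.cluster.local
# are the CoreDNS pods healthy, and are their logs clean?
$ kubectl -n kube-system get pods -l k8s-app=kube-dns
$ kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50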

This happened to me as well, but on a docker-compose project.
In the end I found that the issue was caused by the server running out of disk space.
Cleaning up some space made it work again.
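A minimal sketch of how to confirm and free the space (whether the prune is safe to run depends on your environment):
$ df -h                               # which filesystem is full?
$ docker system df                    # how much space is Docker itself using?
$ docker system prune -a --volumes    # CAUTION: removes unused images, stopped containers and unused volumes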

Related

ERROR: for rabbitmq Cannot start service rabbitmq: driver failed programming external connectivity

WARNING: Host is already in use by another container
ERROR: for rabbitmq Cannot start service rabbitmq: driver failed programming external connectivity
Creating shopy-backend_db_1 ... done
Creating shopy-backend_redis_1 ... done
ERROR: for rabbitmq  Cannot start service rabbitmq: driver failed programming external connectivity on endpoint rabbitmq (4a8903254c20ef83fdd912bc2e22653ad1980de4a85021ce3bb993bab57993ba): Error starting userland proxy: listen tcp4 0.0.0.0:5672: bind: address already in use
ERROR: Encountered errors while bringing up the project.
If you face this error, it means that another RabbitMQ instance is already running and occupying port 5672.
You should stop that service as follows:
sudo systemctl stop rabbitmq-server.service
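If it is not the system RabbitMQ service that holds the port, a hedged way to find the culprit (adjust the commands to whatever tools your host has):
$ sudo ss -tlnp | grep 5672            # which process is listening on 5672?
$ docker ps --filter "publish=5672"    # is another container already publishing the port?
$ docker stop <container-id>           # stop it if it is a stale container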

Hyperledger fabric chaincode connection with peer getting dropped

I have a Hyperledger Fabric network, version 2.4.4, running on Kubernetes. The peers and other components run behind an Istio ingress. The chaincode runs in a dind (docker-in-docker) container and connects to the peer through its URL. The problem is that the chaincode connection is dropped after a few minutes. Below are the logs:
2022-07-14T04:31:13.057Z info [c-api:lib/handler.js] [assetschannel-ddc183b4] Calling chaincode Invoke() succeeded. Sending COMPLETED message back to peer
2022-07-14T04:33:04.197Z error [c-api:lib/handler.js] Chat stream with peer - on error: %j "Error: 14 UNAVAILABLE: Connection dropped\n at Object.callErrorFromStatus (/usr/local/src/node_modules/@grpc/grpc-js/build/src/call.js:31:26)\n at Object.onReceiveStatus (/usr/local/src/node_modules/@grpc/grpc-js/build/src/client.js:391:49)\n at Object.onReceiveStatus (/usr/local/src/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)\n at /usr/local/src/node_modules/@grpc/grpc-js/build/src/call-stream.js:187:78\n at processTicksAndRejections (node:internal/process/task_queues:78:11)"
I did set the following environment variables in the peer pod to keep the connection alive:
CORE_CHAINCODE_KEEPALIVE: 60000
CORE_PEER_KEEPALIVE_CLIENT_INTERVAL: 600s
CORE_PEER_KEEPALIVE_CLIENT_TIMEOUT: 2s
CORE_PEER_KEEPALIVE_DELIVERYCLIENT_INTERVAL: 20s
CORE_PEER_KEEPALIVE_MININTERVAL: 15s
but this did not resolve the issue.
Any suggestions would be appreciated.
It turned out to be an issue with the AWS ELB: its idle timeout was set to 60s, which broke the connection between the chaincode and the peer whenever there was no traffic between them. Increasing this timeout fixed the issue.
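As a hedged sketch (the load balancer name and the new timeout value are placeholders, not from the original post), the classic ELB idle timeout can be raised with the AWS CLI; if the ELB is created by a Kubernetes Service, the service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout annotation is the equivalent knob:
$ aws elb modify-load-balancer-attributes \
    --load-balancer-name my-fabric-elb \
    --load-balancer-attributes "ConnectionSettings={IdleTimeout=3600}"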

OC Cluster never goes up Error: timed out waiting for the condition

Whenever I try to bring the cluster up using "oc cluster up", I get the error below. Kindly help on how to fix this.
[mano@mano ~]$ oc cluster up
Getting a Docker client ...
Checking if image openshift/origin-control-plane:v3.11 is available ...
Checking type of volume mount ...
Determining server IP ...
Checking if OpenShift is already running ...
Checking for supported Docker version (=>1.22) ...
Checking if insecured registry is configured properly in Docker ...
Checking if required ports are available ...
Checking if OpenShift client is configured properly ...
Checking if image openshift/origin-control-plane:v3.11 is available ...
Starting OpenShift using openshift/origin-control-plane:v3.11 ...
I0923 13:40:32.364326 15396 config.go:40] Running "create-master-config"
I0923 13:40:59.938492 15396 config.go:46] Running "create-node-config"
I0923 13:41:10.721711 15396 flags.go:30] Running "create-kubelet-flags"
I0923 13:41:18.241285 15396 run_kubelet.go:49] Running "start-kubelet"
I0923 13:41:23.016238 15396 run_self_hosted.go:181] Waiting for the kube-apiserver to be ready ...
E0923 13:46:23.023479 15396 run_self_hosted.go:571] API server error: Get https://127.0.0.1:8443/healthz?timeout=32s: dial tcp 127.0.0.1:8443: connect: connection refused ()
Error: timed out waiting for the condition
OC version
[mano@mano ~]$ oc version
oc v3.11.0+0cbc58b
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO
I followed this article: https://github.com/openshift/origin/blob/release-3.11/docs/cluster_up_down.md
yet no luck.
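(A hedged sketch of how one might dig further, since the apiserver never answered on 127.0.0.1:8443; the container ID is a placeholder for whatever docker ps shows for the control plane:)
$ docker ps -a                                           # list the containers oc cluster up started
$ docker logs <apiserver-container-id> 2>&1 | tail -50   # why did the kube-apiserver exit or hang?
$ sudo ss -tlnp | grep 8443                              # is anything else already bound to 8443?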

Error restoring Rancher: This cluster is currently Unavailable; areas that interact directly with it will not be available until the API is ready

I am trying to back up and restore the Rancher server (single node install), as described here.
After the backup, I turned off the Rancher server node and ran a new Rancher container on a new node (in the same network, but with another IP address), then restored using the backup file.
After restoring, I logged in to the Rancher UI and it showed the error from the title ("This cluster is currently Unavailable; areas that interact directly with it will not be available until the API is ready").
So I checked the Rancher server logs, which showed the following:
2019-10-05 16:41:32.197641 I | http: TLS handshake error from 127.0.0.1:38388: EOF
2019-10-05 16:41:32.202442 I | http: TLS handshake error from 127.0.0.1:38380: EOF
2019-10-05 16:41:32.210378 I | http: TLS handshake error from 127.0.0.1:38376: EOF
2019-10-05 16:41:32.211106 I | http: TLS handshake error from 127.0.0.1:38386: EOF
2019/10/05 16:42:26 [ERROR] ClusterController c-4pgjl [user-controllers-controller] failed with : failed to start user controllers for cluster c-4pgjl: failed to contact server: Get https://192.168.94.154:6443/api/v1/namespaces/kube-system?timeout=30s: waiting for cluster agent to connect
2019/10/05 16:44:34 [ERROR] ClusterController c-4pgjl [user-controllers-controller] failed with : failed to start user controllers for cluster c-4pgjl: failed to contact server: Get https://192.168.94.154:6443/api/v1/namespaces/kube-system?timeout=30s: waiting for cluster agent to connect
2019/10/05 16:48:50 [ERROR] ClusterController c-4pgjl [user-controllers-controller] failed with : failed to start user controllers for cluster c-4pgjl: failed to contact server: Get https://192.168.94.154:6443/api/v1/namespaces/kube-system?timeout=30s: waiting for cluster agent to connect
2019-10-05 16:50:19.114475 I | mvcc: store.index: compact 75951
2019-10-05 16:50:19.137825 I | mvcc: finished scheduled compaction at 75951 (took 22.527694ms)
2019-10-05 16:55:19.120803 I | mvcc: store.index: compact 76282
2019-10-05 16:55:19.124813 I | mvcc: finished scheduled compaction at 76282 (took 2.746382ms)
After that, I checked the logs of the master nodes and found that the Rancher agent still tries to connect to the old Rancher server (the old IP address) instead of the new one, which makes the cluster unavailable.
How can I fix this?
You need to re-register the node in Rancher using the following steps.
Update the server-url in Rancher by going to Global -> Settings -> server-url.
This should be the full URL, including https://.
Then use this script to re-register the node in Rancher: https://github.com/mattmattox/cluster-agent-tool
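If you want to verify things by hand first, a hedged sketch (the namespace and deployment name are the Rancher defaults, confirm them on your downstream cluster):
# which Rancher URL is the cluster agent currently pointing at?
$ kubectl -n cattle-system get deployment cattle-cluster-agent \
    -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="CATTLE_SERVER")].value}'
# restart the agent once server-url has been corrected
$ kubectl -n cattle-system rollout restart deployment cattle-cluster-agent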

Failed to connect to backoff(async(tcp://ip:5044)): dial tcp ip:5044: i/o timeout

Filebeat is running on Machine B, reading logs and pushing them to Logstash (ELK) on Machine A.
But the Filebeat log on Machine B shows an i/o timeout error:
2019-08-24T12:13:10.065+0800 ERROR pipeline/output.go:100 Failed to connect to backoff(async(tcp://example.com:5044)): dial tcp xx.xx.xx.xx:5044: i/o timeout
2019-08-24T12:13:10.065+0800 INFO pipeline/output.go:93 Attempting to reconnect to backoff(async(tcp://example.com:5044)) with 1 reconnect attempt(s)
I've checked Logstash on Machine A: it is running fine and listening on 0.0.0.0:5044.
Here is the Logstash log:
[INFO ] 2019-08-24 12:09:35.217 [[main]-pipeline-manager] beats - Beats inputs: Starting input listener {:address=>"0.0.0.0:5044"}
And here is the netstat output:
$ sudo netstat -tlnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:5044 0.0.0.0:* LISTEN 20668/java
I also checked that the firewall on Machine A is turned off.
$ firewall-cmd --list-all
FirewallD is not running
$ sudo iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy DROP)
target prot opt source destination
I also used telnet to connect to Machine A, but I get this:
$ telnet example.com 5044
Trying xx.xx.xx.xx...
telnet: connect to address xx.xx.xx.xx: Connection timed out
I ran Filebeat with the same config on Machine A (local) to check whether the Filebeat config for Machine B (remote) was wrong; there it works well.
2019-08-24T14:17:35.195+0800 INFO pipeline/output.go:95 Connecting to backoff(async(tcp://localhost:5044))
2019-08-24T14:17:35.198+0800 INFO pipeline/output.go:105 Connection to backoff(async(tcp://localhost:5044)) established
In the end I found it was caused by the VPS provider (Aliyun), which only opens some common ports such as 22, 80 and 443 by default.
I needed to log in to the Aliyun VPS management page and open port 5044 so the provider lets traffic to that port through.
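After opening the port in the provider's security group, a quick hedged check from Machine B that 5044 is now reachable (nc flags differ slightly between netcat variants):
$ nc -zv example.com 5044
# or repeat the telnet test from above
$ telnet example.com 5044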
Note: attached are some other issues I encountered when configuring Filebeat with ELK.
Issue 1: Failed to connect to backoff(async(tcp://ip:5044)): dial tcp ip:5044: connect: connection refused
2019-08-26T10:25:41.955+0800 ERROR pipeline/output.go:100 Failed to connect to backoff(async(tcp://example.com:5044)): dial tcp xx.xx.xx.xx:5044: connect: connection refused
2019-08-26T10:25:41.955+0800 INFO pipeline/output.go:93 Attempting to reconnect to backoff(async(tcp://example:5044)) with 2 reconnect attempt(s)
Issue 2: Failed to publish events caused by: write tcp ip:46890->ip:5044: write: connection reset by peer
2019-08-26T10:28:32.274+0800 ERROR logstash/async.go:256 Failed to publish events caused by: write tcp xx.xx.xx.xx:46890->xx.xx.xx.xx:5044: write: connection reset by peer
2019-08-26T10:28:33.311+0800 ERROR pipeline/output.go:121 Failed to publish events: write tcp xx.xx.xx.xx:46890->xx.xx.xx.xx:5044: write: connection reset by peer
Issue 3: Filebeat error: lumberjack protocol error and Logstash error: OPENSSL_internal:WRONG_VERSION_NUMBER
Filebeat log error:
2019-08-26T08:49:09.505+0800 INFO pipeline/output.go:95 Connecting to backoff(async(tcp://example.com:5044))
2019-08-26T08:49:09.588+0800 INFO pipeline/output.go:105 Connection to backoff(async(tcp://example.com:5044)) established
2019-08-26T08:49:09.605+0800 ERROR logstash/async.go:256 Failed to publish events caused by: lumberjack protocol error
2019-08-26T08:49:09.606+0800 ERROR logstash/async.go:256 Failed to publish events caused by: client is not connected
Logstash log:
[INFO ] 2019-08-26 08:49:29.444 [defaultEventExecutorGroup-4-2] BeatsHandler - [local: 0.0.0.0:5044, remote: undefined] Handling exception: javax.net.ssl.SSLHandshakeException: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER
[WARN ] 2019-08-26 08:49:29.445 [nioEventLoopGroup-2-7] DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:472) ~[netty-all-4.1.30.Final.jar:4.1.30.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278) ~[netty-all-4.1.30.Final.jar:4.1.30.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[netty-all-4.1.30.Final.jar:4.1.30.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[netty-all-4.1.30.Final.jar:4.1.30.Final]
...
All three issues were caused by misconfiguration; here is the working config.
Logstash version:
/usr/share/logstash/bin/logstash -V
logstash 7.3.1
Filebeat version:
/usr/share/filebeat/bin/filebeat version
filebeat version 7.3.1 (amd64), libbeat 7.3.1 [a4be71b90ce3e3b8213b616adfcd9e455513da45 built 2019-08-19 19:30:50 +0000 UTC]
Logstash conf file /etc/logstash/conf.d/beat.conf:
input {
  beats {
    port => 5044
    ssl => true
    ssl_certificate_authorities => "/etc/pki/tls/certs/logstash-forwarder.crt"
    ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
    ssl_key => "/etc/pki/tls/private/logstash-forwarder.key"
    ssl_verify_mode => "peer"
  }
}
output {
  elasticsearch {
    hosts => "http://127.0.0.1:9200"
    manage_template => false
    index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
  }
}
Filebeat conf file /etc/filebeat/filebeat.yml:
#=========================== Filebeat inputs =============================
filebeat.inputs:

# Each - is an input. Most options can be set at the input level, so
# you can use different inputs for various configurations.
# Below are the input specific configurations.
- type: log

  # Change to true to enable this input configuration.
  enabled: true

  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - /data/error_logs/Log_error_201908

#----------------------------- Logstash output --------------------------------
output.logstash:
  # The Logstash hosts
  hosts: ["example.com:5044"]

  # Optional SSL. By default is off.
  # List of root certificates for HTTPS server verifications
  ssl.certificate_authorities: ["/etc/pki/tls/certs/logstash-forwarder.crt"]

  # Certificate for SSL client authentication
  ssl.certificate: "/etc/pki/tls/certs/logstash-forwarder.crt"

  # Client Certificate Key
  ssl.key: "/etc/pki/tls/private/logstash-forwarder.key"
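A quick way to sanity-check both sides after editing these files (a hedged sketch; the test subcommands exist in the 7.x packages, paths may differ on your install):
# validate the Filebeat config and test connectivity to the Logstash output
$ sudo filebeat test config -c /etc/filebeat/filebeat.yml
$ sudo filebeat test output -c /etc/filebeat/filebeat.yml
# validate the Logstash pipeline without starting it
$ sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/beat.conf --config.test_and_exit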