Can't talk to HBase from different kubernetes namespace: java.net.UnknownHostException: hregion-0.hregion - kubernetes

I am using kubernetes, where I have a Hadoop cluster running in namespace 'platform'.
I have an example application running in namespace 'example'
The example application needs to talk to HBase. When it does so, we see the following error:
java.net.UnknownHostException: hregion-0.hregion
at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)
at org.apache.hadoop.hbase.client.ConnectionUtils.getStubKey(ConnectionUtils.java:233)
at org.apache.hadoop.hbase.client.ConnectionImplementation.getClient(ConnectionImplementation.java:1192)
at org.apache.hadoop.hbase.client.ClientServiceCallable.setStubByServiceName(ClientServiceCallable.java:44)
at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:229)
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:386)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:360)
at org.apache.hadoop.hbase.MetaTableAccessor.getTableState(MetaTableAccessor.java:1078)
at org.apache.hadoop.hbase.MetaTableAccessor.tableExists(MetaTableAccessor.java:403)
at org.apache.hadoop.hbase.client.HBaseAdmin$6.rpcCall(HBaseAdmin.java:445)
at org.apache.hadoop.hbase.client.HBaseAdmin$6.rpcCall(HBaseAdmin.java:442)
at org.apache.hadoop.hbase.client.RpcRetryingCallable.call(RpcRetryingCallable.java:58)
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3084)
at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3076)
at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:442)
The command
> nslookup hregion-0.hregion
on the client machine fails, because the hregion service is in the platform namespace (where that command will succeed).
We suspected that the HBase region server has registered itself with zookeeper using an incomplete name, and verified by connecting to the zookeeper server:
[zk: localhost:2181(CONNECTED) 8] ls /hbase/rs
[hregion-0.hregion,16020,1560851357442]
The ConnectionUtils.getStubKey method simply uses java.net.InetAddress.getByName(hostname) and it is this method which fails.
Here is some zookeeper debugging info (this from the HBase master):
hbase(main):001:0> zk_dump
HBase is rooted at /hbase
Active master address: hmaster-0.hmaster.platform.svc.cluster.local,16000,1560851357485
Backup master addresses:
Region server holding hbase:meta: hregion-0.hregion,16020,1560851357442
Region servers:
hregion-0.hregion,16020,1560851357442
On the hregion-0 server, we have the following entries in /etc/hosts:
# Kubernetes-managed hosts file.
127.0.0.1 localhost
10.1.14.53 hregion-0.hregion.platform.svc.cluster.local hregion-0
And the /etc/resolv.conf file looks like this:
nameserver 10.96.0.10
search platform.svc.cluster.local svc.cluster.local cluster.local mycompany.com
options ndots:5
How do I fix this? I assume I need to tell HBase to register its nodes in zookeeper using their fully qualified domain name - how?

Related

Consul agent on kubernetes, on node or pod?

I deployed an aws eks cluster via terraform. I also deployed Consul following hasicorp’s tutorial and I see the nodes in consul’s UI.
Now I’m wondering how al the consul agents will know about the pods I deploy? I deploy something and it’s not shown anywhere on consul.
I can’t find any documentation as to how to register pods (services) on consul via the node’s consul agent, do I need to configure that somewhere? Should I not use the node’s agent and register the service straight from the pod? Hashicorp discourages this since it may increase resource utilization depending on how many pods one deploy on a given node. But then how does the node’s agent know about my services deployed on that node?
Moreover, when I deploy a pod in a node and ssh into the node, and install consul, consul’s agent can’t find the consul server (as opposed from the node, which can find it)
EDIT:
Bottom line is I can't find WHERE to add the configuration. If I execute ON THE POD:
consul members
It works properly and I get:
Node Address Status Type Build Protocol DC Segment
consul-consul-server-0 10.0.103.23:8301 alive server 1.10.0 2 full <all>
consul-consul-server-1 10.0.101.151:8301 alive server 1.10.0 2 full <all>
consul-consul-server-2 10.0.102.112:8301 alive server 1.10.0 2 full <all>
ip-10-0-101-129.ec2.internal 10.0.101.70:8301 alive client 1.10.0 2 full <default>
ip-10-0-102-175.ec2.internal 10.0.102.244:8301 alive client 1.10.0 2 full <default>
ip-10-0-103-240.ec2.internal 10.0.103.245:8301 alive client 1.10.0 2 full <default>
ip-10-0-3-223.ec2.internal 10.0.3.249:8301 alive client 1.10.0 2 full <default>
But if i execute:
# consul agent -datacenter=voip-full -config-dir=/etc/consul.d/ -log-file=log-file -advertise=$(wget -q -O - http://169.254.169.254/latest/meta-data/local-ipv4)
I get the following error:
==> Starting Consul agent...
Version: '1.10.1'
Node ID: 'f10070e7-9910-06c7-0e12-6edb6cc4c9b9'
Node name: 'ip-10-0-3-223.ec2.internal'
Datacenter: 'voip-full' (Segment: '')
Server: false (Bootstrap: false)
Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
Cluster Addr: 10.0.3.223 (LAN: 8301, WAN: 8302)
Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false
==> Log data will now stream in as it occurs:
2021-08-16T18:23:06.936Z [WARN] agent: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
2021-08-16T18:23:06.936Z [WARN] agent: Node name "ip-10-0-3-223.ec2.internal" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2021-08-16T18:23:06.946Z [WARN] agent.auto_config: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
2021-08-16T18:23:06.947Z [WARN] agent.auto_config: Node name "ip-10-0-3-223.ec2.internal" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2021-08-16T18:23:06.948Z [INFO] agent.client.serf.lan: serf: EventMemberJoin: ip-10-0-3-223.ec2.internal 10.0.3.223
2021-08-16T18:23:06.948Z [INFO] agent.router: Initializing LAN area manager
2021-08-16T18:23:06.950Z [INFO] agent: Started DNS server: address=127.0.0.1:8600 network=udp
2021-08-16T18:23:06.950Z [WARN] agent.client.serf.lan: serf: Failed to re-join any previously known node
2021-08-16T18:23:06.950Z [INFO] agent: Started DNS server: address=127.0.0.1:8600 network=tcp
2021-08-16T18:23:06.951Z [INFO] agent: Starting server: address=127.0.0.1:8500 network=tcp protocol=http
2021-08-16T18:23:06.951Z [WARN] agent: DEPRECATED Backwards compatibility with pre-1.9 metrics enabled. These metrics will be removed in a future version of Consul. Set `telemetry { disable_compat_1.9 = true }` to disable them.
2021-08-16T18:23:06.953Z [INFO] agent: started state syncer
2021-08-16T18:23:06.953Z [INFO] agent: Consul agent running!
2021-08-16T18:23:06.953Z [WARN] agent.router.manager: No servers available
2021-08-16T18:23:06.954Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2021-08-16T18:23:34.169Z [WARN] agent.router.manager: No servers available
2021-08-16T18:23:34.169Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
So where to add the config?
I also tried adding a service in k8s pointing to the pod, but the service doesn't come up on consul's UI...
What do you guys recommend?
Thanks
Consul knows where these services are located because each service
registers with its local Consul client. Operators can register
services manually, configuration management tools can register
services when they are deployed, or container orchestration platforms
can register services automatically via integrations.
if you planning to use manual option you have to register the service into the consul.
Something like
echo '{
"service": {
"name": "web",
"tags": [
"rails"
],
"port": 80
}
}' > ./consul.d/web.json
You can find the good example at : https://thenewstack.io/implementing-service-discovery-of-microservices-with-consul/
Also this is a very nice document for having detailed configuration of the health check and service discovery : https://cloud.spring.io/spring-cloud-consul/multi/multi_spring-cloud-consul-discovery.html
Official document : https://learn.hashicorp.com/tutorials/consul/get-started-service-discovery
BTW, I was finally able to figure out the issue.
consul-dns is not deployed by default, i had to manually deploy it, then forward all .consul requests from coredns to consul-dns.
All is working now. Thanks!

JMX Connection refused on Kubernetes with AdoptOpenJDK OpenJ9

With my team we are trying to move our micro-services to openj9, they are running on kubernetes. However, we encounter a problem on the configuration of JMX. (openjdk8-openj9)
We have a connection refused when we try a connection with jvisualvm (and a port-forwarding with Kubernetes).
We haven't changed our configuration, except for switching from Hotspot to OpenJ9.
The error :
E0312 17:09:46.286374 17160 portforward.go:400] an error occurred forwarding 1099 -> 1099: error forwarding port 1099 to pod XXXXXXX, uid : exit status 1: 2020/03/12 16:09:45 socat[31284] E connect(5, AF=2 127.0.0.1:1099, 16): Connection refused
The java options that we use :
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.port=1099
-Dcom.sun.management.jmxremote.rmi.port=1099
We are using the last adoptopenjdk/openjdk8-openj9 docker image.
Do you have any ideas?
Thank you !
Regards.
I managed to figure out why it wasn't working.
It turns out that to pass the JMX options to the service we were using the Kubernetes service descriptor in YAML. It looks like this:
- name: _JAVA_OPTIONS
value: -Dzipkinserver.listOfServers=http://zipkin:9411 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.port=1099 -Dcom.sun.management.jmxremote.rmi.port=1099
I realized that the JMX properties were not taken into account from _JAVA_OPTIONS when the application is not launch with ENTRYPOINT in the docker container.
So I pass the properties directly into the Dockerfile like this and it works.
CMD ["java", "-Dcom.sun.management.jmxremote", "-Dcom.sun.management.jmxremote.authenticate=false", "-Dcom.sun.management.jmxremote.ssl=false", "-Dcom.sun.management.jmxremote.local.only=false", "-Dcom.sun.management.jmxremote.port=1099", "-Dcom.sun.management.jmxremote.rmi.port=1099", "-Djava.rmi.server.hostname=127.0.0.1", "-cp","app:app/lib/*","OurMainClass"]
It's also possible to keep _JAVA_OPTIONS and setup an ENTRYPOINT in the dockerfile.
Thanks!

IBM Cloud Private 2.1.0 CE - Hostname should be resolved to a valid IP address

I'm trying to install IBM Cloud Private on a VM. I've created master, proxy, and worker nodes and I'm at the last stage installing the ICP. However, I'm having a problem with the hostnames. The errors are shown below:
The error is below:
TASK [check : Validating Hostname is resolvable]
*******************************************************************
skipping: [172.16.22.190]
fatal: [172.16.22.82] => Hostname should be resolved to a valid IP address
fatal: [172.16.22.81] => Hostname should be resolved to a valid IP address
NO MORE HOSTS LEFT
********************************************************************************
NO MORE HOSTS LEFT
********************************************************************************
PLAY RECAP
********************************************************************************
172.16.22.190 : ok=4 changed=3 unreachable=0 failed=0
172.16.22.81 : ok=4 changed=3 unreachable=0 failed=1
172.16.22.82 : ok=4 changed=3 unreachable=0 failed=1
Playbook run took 0 days, 0 hours, 0 minutes, 4 seconds
My /etc/hosts file:
172.16.22.190 icp
172.16.22.81 proxy
172.16.22.82 worker
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
My cluster hosts file:
[master]
172.16.22.190
[worker]
172.16.22.82
[proxy]
172.16.22.81
#[management]
#4.4.4.4
All nodes see and ping each other and I'm using Ubuntu.
It turned out that hostnames of each node had to be same everywhere, not just on the /etc/hosts files of different nodes but also in /etc/hostname file of the node itself. It's a minor but an important mistake you can do if you are installing the server OS from a pre-loaded image :)
what about the /etc/hosts file on host 82 and 81?
You haven't defined a master in the icp hosts file.
The host file needs to be updated on each node, not just master.
My working icp hosts and /etc/hosts and look like this, respectively:
icp hosts file:
[master]
10.121.9.226
[worker]
10.143.76.132
10.143.76.134
[proxy]
10.121.9.226
#[management]
#4.4.4.4
and
10.121.9.226 icpdemo1.xxx.com icpdemo1
10.143.76.132 icpdemo2.xxx.com icpdemo2
10.143.76.134 icpdemo3.xxx.com icpdemo3
I found out that with Ubuntu it maps the hostname to the localhost address.
You need to modify the /etc/hosts file and remove the line that points 127.0.1.1 IP to your hostname and make sure it points to your public IP.
All nodes of your cluster should be resolved to each other.
Make entry of hostnames of cluster nodes in each node /etc/hosts file. IBM OFFICIAL/configuring_your_cluster

Received AliveMessage from a peer with the same PKI-ID as myself

I am attempting to port the Hyperledger Fabric Getting Started to Kubernetes. But am struggling to get peer1's to deploy. If I enable CORE_PEER_GOSSIP_BOOTSTRAP, I receive errors "Received AliveMessage from a peer with the same PKI-ID as myself".
How can I debug a peer reportedly having the same PKI-ID as another?
Using this as a starting point:
https://hyperledger-fabric.readthedocs.io/en/latest/getting_started.html
I am able to create:
orderer and cli pods in default namespace
peer0's one in each org1|org2 namespace.
peer1's but only if I disable (comment out) CORE_PEER_GOSSIP_BOOTSTRAP
If I enable CORE_PEER_GOSSIP_BOOTSTRAP for the peer1's, I receive the following warning and error:
[gossip/gossip#10.0.0.10:7051] NewGossipService -> WARN 01c External endpoint is empty, peer will not be accessible outside of its organization
...
[gossip/discovery#10.0.0.10:7051] handleAliveMessage -> ERRO 02a Bad configuration detected: Received AliveMessage from a peer with the same PKI-ID as myself: tag:EMPTY alive_msg:<membership:<pki_id:"[[REDACTED]]" > timestamp:<inc_number:1495468533769417608 seq_num:416 > >
In order to better map the Orderer, Peers to DNS names, I'm using Kubernetes Namespaces and this configuration:
OrdererOrgs:
- Name: Orderer
Domain: default.svc.cluster.local
Specs:
- Hostname: orderer
PeerOrgs:
- Name: Org1
Domain: org1.svc.cluster.local
Template:
Count: 2
Users:
Count: 2
- Name: Org2
Domain: org2.svc.cluster.local
Template:
Count: 2
Users:
Count: 2
In order to expose the peer0's to the other peers in the org and to expose the orderer, I have ClusterIP services for the peer0's (selecting only the peer0's) and orderer. It's inelegant but I'm trying to get it to work before I get it working more beautifully.
I am able to resolve orderer.default.svc.cluster.local, peer0.org1.svc.cluster.local, `peer0.org2.svc.cluster.local' using nslookup from within a pod deployed to default on the cluster.
Absent a curl-like tool for gPRC, I am able to open sockets against these endpoints on 7051 and 7053.
First, make sure you are using the right certificates.
Second, verify that your environment/configuration for gossip is set correctly
environment:
- CORE_PEER_GOSSIP_EXTERNALENDPOINT=peer1.org1.example.com:8051
- CORE_PEER_GOSSIP_BOOTSTRAP=peer0.org1.example.com:7051
- CORE_PEER_GOSSIP_ENDPOINT=peer0.org1.example.com:7051
OR in core.yaml
peer:
gossip:
bootstrap: peer0.org1.example.com:7051
externalEndpoint: peer1.org1.example.com:8051
endpoint: peer0.org1.example.com:7051
Edited: Also make sure that you have properly setup your CA
Hope this helps, it worked for me. And I was successfully able to connect peers.
If the peers are started from the same node, its possible that you are mounting the same crypto-material (path to mspconfig directory) for both the peers. If that is the case, separate the directory structures for both the peers and keep their respective certificates in them, update the respective paths for msp in docker-compose file and try to run.

The Datastax cassandra community server 2.1.10 service on local computer started and then stopped

I am trying to configure a two node cluster with cassandra in windows r2 2008
So i installed cassandra community version in one server (10.xxx.0.1,10.xxx.0.2)
And then I stopped the service and then edited the configuraton.yaml file in the conf folder.
The changes are:
cluster_name
commented the num_tokens
gave the tokens in initial_token,
seeds as 10.xxx.0.1,10.xxx.0.2,
listen_addresses are their respective ip addresses which are 10.xxx.0.1,10.xxx.0.2,
rpc_addresses as 0.0.0.0,
endpointsnitch as gossip
I also changed the cassandra rackdc.properties file to dc=DC1 rack=RAC1.
I then saved and started back the service and opened the cqlsh, but it is not connecting. Below is the error:
2015-10-12 16:20:13 Commons Daemon procrun stderr initialized
If rpc_address is set to a wildcard address (0.0.0.0), then you must set broadcast_rpc_address to a value other than 0.0.0.0
Fatal configuration error; unable to start. See log for stacktrace.
..
ERROR 21:20:14 Fatal configuration error
org.apache.cassandra.exceptions.ConfigurationException: If rpc_address is set to a wildcard address (0.0.0.0), then you must set broadcast_rpc_address to a value other than 0.0.0.0
at org.apache.cassandra.config.DatabaseDescriptor.applyAddressConfig(DatabaseDescriptor.java:285) ~[apache-cassandra-2.1.10.jar:2.1.10]
at org.apache.cassandra.config.DatabaseDescriptor.applyConfig(DatabaseDescriptor.java:443) ~[apache-cassandra-2.1.10.jar:2.1.10]
at org.apache.cassandra.config.DatabaseDescriptor.<clinit>(DatabaseDescriptor.java:136) ~[apache-cassandra-2.1.10.jar:2.1.10]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:168) [apache-cassandra-2.1.10.jar:2.1.10]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:562) [apache-cassandra-2.1.10.jar:2.1.10]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:651) [apache-cassandra-2.1.10.jar:2.1.10]
If you out 0.0.0.0 to the rpc_address you have to change the broadcast_rpc_address like in http://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configCassandra_yaml_r.html , I think that the right broadcast_rpc_address can be the own ip address.