Consul agent on kubernetes, on node or pod? - kubernetes

I deployed an AWS EKS cluster via Terraform. I also deployed Consul following HashiCorp's tutorial, and I see the nodes in Consul's UI.
Now I'm wondering how all the Consul agents will know about the pods I deploy. I deploy something and it's not shown anywhere in Consul.
I can't find any documentation on how to register pods (services) with Consul via the node's Consul agent. Do I need to configure that somewhere? Should I skip the node's agent and register the service straight from the pod? HashiCorp discourages this, since it may increase resource utilization depending on how many pods one deploys on a given node. But then how does the node's agent know about the services deployed on that node?
Moreover, when I deploy a pod on a node, get a shell into it, and install Consul there, that Consul agent can't find the Consul server (as opposed to the node's agent, which can find it).
EDIT:
Bottom line is I can't find WHERE to add the configuration. If I execute ON THE POD:
consul members
It works properly and I get:
Node Address Status Type Build Protocol DC Segment
consul-consul-server-0 10.0.103.23:8301 alive server 1.10.0 2 full <all>
consul-consul-server-1 10.0.101.151:8301 alive server 1.10.0 2 full <all>
consul-consul-server-2 10.0.102.112:8301 alive server 1.10.0 2 full <all>
ip-10-0-101-129.ec2.internal 10.0.101.70:8301 alive client 1.10.0 2 full <default>
ip-10-0-102-175.ec2.internal 10.0.102.244:8301 alive client 1.10.0 2 full <default>
ip-10-0-103-240.ec2.internal 10.0.103.245:8301 alive client 1.10.0 2 full <default>
ip-10-0-3-223.ec2.internal 10.0.3.249:8301 alive client 1.10.0 2 full <default>
But if I execute:
# consul agent -datacenter=voip-full -config-dir=/etc/consul.d/ -log-file=log-file -advertise=$(wget -q -O - http://169.254.169.254/latest/meta-data/local-ipv4)
I get the following error:
==> Starting Consul agent...
Version: '1.10.1'
Node ID: 'f10070e7-9910-06c7-0e12-6edb6cc4c9b9'
Node name: 'ip-10-0-3-223.ec2.internal'
Datacenter: 'voip-full' (Segment: '')
Server: false (Bootstrap: false)
Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
Cluster Addr: 10.0.3.223 (LAN: 8301, WAN: 8302)
Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false
==> Log data will now stream in as it occurs:
2021-08-16T18:23:06.936Z [WARN] agent: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
2021-08-16T18:23:06.936Z [WARN] agent: Node name "ip-10-0-3-223.ec2.internal" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2021-08-16T18:23:06.946Z [WARN] agent.auto_config: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
2021-08-16T18:23:06.947Z [WARN] agent.auto_config: Node name "ip-10-0-3-223.ec2.internal" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2021-08-16T18:23:06.948Z [INFO] agent.client.serf.lan: serf: EventMemberJoin: ip-10-0-3-223.ec2.internal 10.0.3.223
2021-08-16T18:23:06.948Z [INFO] agent.router: Initializing LAN area manager
2021-08-16T18:23:06.950Z [INFO] agent: Started DNS server: address=127.0.0.1:8600 network=udp
2021-08-16T18:23:06.950Z [WARN] agent.client.serf.lan: serf: Failed to re-join any previously known node
2021-08-16T18:23:06.950Z [INFO] agent: Started DNS server: address=127.0.0.1:8600 network=tcp
2021-08-16T18:23:06.951Z [INFO] agent: Starting server: address=127.0.0.1:8500 network=tcp protocol=http
2021-08-16T18:23:06.951Z [WARN] agent: DEPRECATED Backwards compatibility with pre-1.9 metrics enabled. These metrics will be removed in a future version of Consul. Set `telemetry { disable_compat_1.9 = true }` to disable them.
2021-08-16T18:23:06.953Z [INFO] agent: started state syncer
2021-08-16T18:23:06.953Z [INFO] agent: Consul agent running!
2021-08-16T18:23:06.953Z [WARN] agent.router.manager: No servers available
2021-08-16T18:23:06.954Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2021-08-16T18:23:34.169Z [WARN] agent.router.manager: No servers available
2021-08-16T18:23:34.169Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
So where do I add the config?
I also tried adding a Kubernetes Service pointing to the pod, but it doesn't show up in Consul's UI...
What do you guys recommend?
Thanks

Consul knows where these services are located because each service
registers with its local Consul client. Operators can register
services manually, configuration management tools can register
services when they are deployed, or container orchestration platforms
can register services automatically via integrations.
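Since the question is specifically about Kubernetes, the "container orchestration platform" route usually means the official Helm chart's catalog sync feature, which registers Kubernetes Services into Consul automatically. A hedged sketch, assuming the hashicorp/consul chart and the release name "consul" used earlier in this thread; the exact value name should be checked against your chart version:
# enable catalog sync on an existing release named "consul" (hashicorp/consul chart);
# syncCatalog.enabled is the chart value assumed here -- verify it for your chart version
helm upgrade consul hashicorp/consul --reuse-values --set syncCatalog.enabled=true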
If you are planning to use the manual option, you have to register the service with Consul yourself.
Something like:
echo '{
  "service": {
    "name": "web",
    "tags": [
      "rails"
    ],
    "port": 80
  }
}' > ./consul.d/web.json
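A note on what happens after writing the file: the definition only takes effect once the local agent loads it. A minimal sketch, assuming the agent's -config-dir points at the same ./consul.d directory used above:
# either reload the agent so it re-reads its config directory ...
consul reload

# ... or register the definition directly against the local agent
consul services register ./consul.d/web.json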
You can find a good example at: https://thenewstack.io/implementing-service-discovery-of-microservices-with-consul/
This is also a very nice document for detailed configuration of health checks and service discovery: https://cloud.spring.io/spring-cloud-consul/multi/multi_spring-cloud-consul-discovery.html
Official document : https://learn.hashicorp.com/tutorials/consul/get-started-service-discovery

BTW, I was finally able to figure out the issue.
consul-dns is not deployed by default; I had to deploy it manually, then forward all .consul requests from CoreDNS to consul-dns.
All is working now. Thanks!
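For anyone following along, the forwarding step described above usually means pointing CoreDNS at the ClusterIP of the consul-dns Service. A hedged sketch; the Service name assumes the Helm release name "consul" used in this thread, and the exact placement of the stanza in the coredns ConfigMap should be checked against your cluster:
# find the ClusterIP of the Consul DNS service (add -n <namespace> if Consul is not in the current namespace)
kubectl get svc consul-consul-dns -o jsonpath='{.spec.clusterIP}'

# then add a stanza like this to the coredns ConfigMap in kube-system
# (kubectl edit configmap coredns -n kube-system), replacing the placeholder
# with the ClusterIP returned above:
#
#   consul {
#     errors
#     cache 30
#     forward . <consul-dns-cluster-ip>
#   }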

Related

Openshift 3.11 cloud integration fails with lookup RequestError: send request failed\\ncaused by: Post https://ec2.eu-west-.amazonaws.com

Following the docs: https://docs.openshift.com/container-platform/3.11/install_config/configuring_aws.html#aws-cluster-labeling
Configuring the cloud integration after the cluster build.
When the cluster services are restarted on the masters, it fails to look up AWS instances:
22 16:32:10.112895 75995 server.go:261] failed to run Kubelet: could not init cloud provider "aws": error finding instance i-0c5cbd50923f9c6d2: "error listing AWS instances: \"Request.service: main process exited, code=exited, status=255/n/a Error: send request failed\\ncaused by: Post https://ec2.eu-west-.amazonaws.com/: dial tcp: lookup ec2.eu-west-.amazonaws.com: no such host\""
On closer inspection it seems to be due to an incorrect hostname:
https://ec2.eu-west-.amazonaws.com/ VS https://ec2.eu-west-2.amazonaws.com/
So I double checked the config, which seems to be correct:
# cat /etc/origin/cloudprovider/aws.conf
[Global]
Zone = eu-west-2
Had a google and it seems to be a similar issue to this:
https://github.com/kubernetes-sigs/kubespray/issues/4345
Is there a way to work around this? Moving off 3.11 isn't an option right now.
Thanks.
Looks as though it needs to be the availability zone, rather than the region.
# cat /etc/origin/cloudprovider/aws.conf
[Global]
Zone = eu-west-2a
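If in doubt about which value to use, the EC2 instance metadata service (the same endpoint used in the Consul question above) reports the node's availability zone directly:
# prints the availability zone of the instance, e.g. eu-west-2a
curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone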

Can't talk to HBase from different kubernetes namespace: java.net.UnknownHostException: hregion-0.hregion

I am using Kubernetes, where I have a Hadoop cluster running in namespace 'platform'.
I have an example application running in namespace 'example'.
The example application needs to talk to HBase. When it does so, we see the following error:
java.net.UnknownHostException: hregion-0.hregion
at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)
at org.apache.hadoop.hbase.client.ConnectionUtils.getStubKey(ConnectionUtils.java:233)
at org.apache.hadoop.hbase.client.ConnectionImplementation.getClient(ConnectionImplementation.java:1192)
at org.apache.hadoop.hbase.client.ClientServiceCallable.setStubByServiceName(ClientServiceCallable.java:44)
at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:229)
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:386)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:360)
at org.apache.hadoop.hbase.MetaTableAccessor.getTableState(MetaTableAccessor.java:1078)
at org.apache.hadoop.hbase.MetaTableAccessor.tableExists(MetaTableAccessor.java:403)
at org.apache.hadoop.hbase.client.HBaseAdmin$6.rpcCall(HBaseAdmin.java:445)
at org.apache.hadoop.hbase.client.HBaseAdmin$6.rpcCall(HBaseAdmin.java:442)
at org.apache.hadoop.hbase.client.RpcRetryingCallable.call(RpcRetryingCallable.java:58)
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3084)
at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3076)
at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:442)
The command
> nslookup hregion-0.hregion
on the client machine fails, because the hregion service is in the platform namespace (where that command will succeed).
We suspected that the HBase region server has registered itself with ZooKeeper using an incomplete name, and verified this by connecting to the ZooKeeper server:
[zk: localhost:2181(CONNECTED) 8] ls /hbase/rs
[hregion-0.hregion,16020,1560851357442]
The ConnectionUtils.getStubKey method simply uses java.net.InetAddress.getByName(hostname), and it is this method which fails.
Here is some ZooKeeper debugging info (this is from the HBase master):
hbase(main):001:0> zk_dump
HBase is rooted at /hbase
Active master address: hmaster-0.hmaster.platform.svc.cluster.local,16000,1560851357485
Backup master addresses:
Region server holding hbase:meta: hregion-0.hregion,16020,1560851357442
Region servers:
hregion-0.hregion,16020,1560851357442
On the hregion-0 server, we have the following entries in /etc/hosts:
# Kubernetes-managed hosts file.
127.0.0.1 localhost
10.1.14.53 hregion-0.hregion.platform.svc.cluster.local hregion-0
And the /etc/resolv.conf file looks like this:
nameserver 10.96.0.10
search platform.svc.cluster.local svc.cluster.local cluster.local mycompany.com
options ndots:5
How do I fix this? I assume I need to tell HBase to register its nodes in zookeeper using their fully qualified domain name - how?
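One knob that may help here, although it has not been verified against this particular setup, is HBase's hbase.regionserver.hostname property, which makes the region server report a fixed hostname to ZooKeeper. Each region server pod would need its own fully qualified value; the fragment below is a sketch showing it for hregion-0 only:
<!-- hbase-site.xml on hregion-0 (sketch; adjust the value per pod) -->
<property>
  <name>hbase.regionserver.hostname</name>
  <value>hregion-0.hregion.platform.svc.cluster.local</value>
</property>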

ejabberd clustering problems and solutions

Setup Details
2 ejabberd nodes with PostgreSQL as the database (OS: Ubuntu 16.04)
Trying to cluster the two ejabberd nodes as described in
https://docs.ejabberd.im/admin/guide/clustering/
After starting the master node, the following steps were performed on the slave node:
copy .erlang.cookie to the slave node
copy ejabberd.yml from master to slave
The slave started successfully but shows the error below.
=====Error=========
Eshell V9.2 (abort with ^G)
(ejabberd#gim-Veriton-M6650G)1> 18:29:41.856 [notice] Changed loghwm of /usr/local/var/log/ejabberd/error.log to 100
18:29:41.856 [notice] Changed loghwm of /usr/local/var/log/ejabberd/ejabberd.log to 100
18:29:41.857 [info] Application lager started on node 'ejabberd#gim-Veriton-M6650G'
18:29:41.860 [info] Application crypto started on node 'ejabberd#gim-Veriton-M6650G'
18:29:41.865 [info] Application sasl started on node 'ejabberd#gim-Veriton-M6650G'
18:29:41.871 [info] Application asn1 started on node 'ejabberd#gim-Veriton-M6650G'
18:29:41.871 [info] Application public_key started on node 'ejabberd#gim-Veriton-M6650G'
18:29:41.880 [info] Application ssl started on node 'ejabberd#gim-Veriton-M6650G'
18:29:41.881 [info] Application p1_utils started on node 'ejabberd#gim-Veriton-M6650G'
18:29:41.883 [info] Application fast_yaml started on node 'ejabberd#gim-Veriton-M6650G'
18:29:41.888 [info] Application fast_tls started on node 'ejabberd#gim-Veriton-M6650G'
18:29:41.892 [info] Application fast_xml started on node 'ejabberd#gim-Veriton-M6650G'
18:29:41.895 [info] Application stringprep started on node 'ejabberd#gim-Veriton-M6650G'
18:29:41.899 [info] Application xmpp started on node 'ejabberd#gim-Veriton-M6650G'
18:29:41.903 [info] Application cache_tab started on node 'ejabberd#gim-Veriton-M6650G'
18:29:41.910 [info] Application eimp started on node 'ejabberd#gim-Veriton-M6650G'
18:29:41.910 [info] Loading configuration from /usr/local/etc/ejabberd/ejabberd.yml
18:29:41.913 [error] CRASH REPORT Process <0.67.0> with 0 neighbours exited with reason: no case clause matching <<>> in ejabberd_config:get_config_option_key/2 line 473 in application_master:init/4 line 134
18:29:41.913 [info] Application ejabberd exited with reason: no case clause matching <<>> in ejabberd_config:get_config_option_key/2 line 473
(ejabberd#gim-Veriton-M6650G)1>
I've also tried re-creating the Mnesia DB, but it didn't help.
ejabberdctl status shows ejabberd is not running on that node.
Can someone please look into the issue and help?
Finally I found the solution to the problem.
The issue is with the node names: the master's node name is a fully qualified name,
but the slave node's name has no domain.
I also added both node names to the /etc/hosts file.
For ejabberd clustering, please refer to the steps below.
Before starting, configure proper entries in the /etc/hosts files of both nodes,
i.e. the nodes should be able to resolve each other by host name.
Set the ejabberd node name in the ejabberdctl.cfg file; the two nodes should have different node names.
1. Configure ejabberd on the master node with a proper node name (either an FQDN or just a name of your convenience).
2. Configure the slave node with the same config as the master, i.e. both nodes should have the same configuration in the ejabberd.yml file.
3. Copy .erlang.cookie from the master node to the slave; the ejabberd user should be able to read the cookie file.
4. Start the master node in live mode (ejabberdctl live).
5. Start the slave node in live mode.
6. Check the cookie value in the Erlang console of both nodes using the command 'erlang:get_cookie().'; both nodes should have the same value.
7. If both nodes have the same value, execute "ejabberdctl --no-timeout join_cluster ejabberd@nodename" on the slave (see the condensed recap after this list).
Change ejabberd@nodename according to your environment.
In my case I ran ejabberd as the 'ejabberd' user with the node name ejabberd@cluster-node1 (if you want, you can also use an FQDN, like ejabberd@example.com).
8. If the above command executed without any error, then the nodes are in a cluster.
9. Confirm the cluster from either Erlang console using the command mnesia:info(); you will see the node details under "running_db_nodes".
10. Hurray, you are done!
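A condensed recap of the slave-side commands from the steps above (node names are the examples used there; adjust them to your environment):
# on the slave node, after copying .erlang.cookie and ejabberd.yml from the master:
ejabberdctl live                 # step 5: start the slave in live mode

# in the Erlang shell of BOTH nodes, the cookie values must match (step 6):
#   erlang:get_cookie().

# from another shell on the slave, join the master's cluster (step 7):
ejabberdctl --no-timeout join_cluster ejabberd@cluster-node1

# confirm from either Erlang console (step 9):
#   mnesia:info().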
For load balancing the cluster you can use HAProxy.
Please refer to https://blog.onefellow.com/post/76702632637/haproxy-and-ejabberd for details.
I've not done load balancing using any hardware load balancer; I still need to check on that.
If anyone has done that, please do post here.

Received AliveMessage from a peer with the same PKI-ID as myself

I am attempting to port the Hyperledger Fabric Getting Started guide to Kubernetes, but I am struggling to get the peer1's to deploy. If I enable CORE_PEER_GOSSIP_BOOTSTRAP, I receive the error "Received AliveMessage from a peer with the same PKI-ID as myself".
How can I debug a peer reportedly having the same PKI-ID as another?
Using this as a starting point:
https://hyperledger-fabric.readthedocs.io/en/latest/getting_started.html
I am able to create:
the orderer and cli pods in the default namespace
the peer0's, one in each of the org1 and org2 namespaces
the peer1's, but only if I disable (comment out) CORE_PEER_GOSSIP_BOOTSTRAP
If I enable CORE_PEER_GOSSIP_BOOTSTRAP for the peer1's, I receive the following warning and error:
[gossip/gossip#10.0.0.10:7051] NewGossipService -> WARN 01c External endpoint is empty, peer will not be accessible outside of its organization
...
[gossip/discovery#10.0.0.10:7051] handleAliveMessage -> ERRO 02a Bad configuration detected: Received AliveMessage from a peer with the same PKI-ID as myself: tag:EMPTY alive_msg:<membership:<pki_id:"[[REDACTED]]" > timestamp:<inc_number:1495468533769417608 seq_num:416 > >
In order to better map the orderer and peers to DNS names, I'm using Kubernetes namespaces and this configuration:
OrdererOrgs:
  - Name: Orderer
    Domain: default.svc.cluster.local
    Specs:
      - Hostname: orderer
PeerOrgs:
  - Name: Org1
    Domain: org1.svc.cluster.local
    Template:
      Count: 2
    Users:
      Count: 2
  - Name: Org2
    Domain: org2.svc.cluster.local
    Template:
      Count: 2
    Users:
      Count: 2
In order to expose the peer0's to the other peers in the org and to expose the orderer, I have ClusterIP services for the peer0's (selecting only the peer0's) and orderer. It's inelegant but I'm trying to get it to work before I get it working more beautifully.
I am able to resolve orderer.default.svc.cluster.local, peer0.org1.svc.cluster.local, and peer0.org2.svc.cluster.local using nslookup from within a pod deployed to default on the cluster.
Absent a curl-like tool for gRPC, I am able to open sockets against these endpoints on 7051 and 7053.
First, make sure you are using the right certificates.
Second, verify that your environment/configuration for gossip is set correctly:
environment:
  - CORE_PEER_GOSSIP_EXTERNALENDPOINT=peer1.org1.example.com:8051
  - CORE_PEER_GOSSIP_BOOTSTRAP=peer0.org1.example.com:7051
  - CORE_PEER_GOSSIP_ENDPOINT=peer0.org1.example.com:7051
OR in core.yaml
peer:
  gossip:
    bootstrap: peer0.org1.example.com:7051
    externalEndpoint: peer1.org1.example.com:8051
    endpoint: peer0.org1.example.com:7051
Edited: also make sure that you have properly set up your CA.
Hope this helps; it worked for me, and I was able to connect the peers successfully.
If the peers are started from the same node, it's possible that you are mounting the same crypto material (the path to the mspconfig directory) for both peers. If that is the case, separate the directory structures for the two peers, keep their respective certificates in them, update the respective MSP paths in the docker-compose file, and try again.
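One quick way to confirm the "same crypto material" theory is to compare the fingerprints of the signing certificates actually mounted into the two peer pods. A hedged sketch; the namespace, pod names, MSP path, and certificate file name are assumptions based on the layout described in the question and the default peer image, so adjust them to whatever your deployment uses:
# copy each peer's signing cert out of its pod (paths/names are placeholders)
kubectl cp org1/<peer0-pod>:/etc/hyperledger/fabric/msp/signcerts/<cert-file> peer0-cert.pem
kubectl cp org1/<peer1-pod>:/etc/hyperledger/fabric/msp/signcerts/<cert-file> peer1-cert.pem

# identical fingerprints mean both peers share the same MSP identity, i.e. the same PKI-ID
openssl x509 -noout -fingerprint -in peer0-cert.pem
openssl x509 -noout -fingerprint -in peer1-cert.pem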

How to debug the Akka association process?

Here is the scenario:
I have packaged a Scala project with Spray into a jar file.
I launch the jar file on RedHat 6.5 on VirtualBox (IP: 192.168.1.38).
I launch the jar file on RedHat 6.5 on VirtualBox (IP: 192.168.1.41).
Everything works locally: I can send a REST request to each virtual machine and get a response.
Problem
The Akka systems cannot form a cluster. I run 192.168.1.38 with default settings, but 192.168.1.41 has an additional property, akka.cluster.seed-nodes, which is set to akka.tcp://mySystem@192.168.1.38:2551. So I get:
[WARN] [12/09/2014 17:10:24.043] [mySystem-akka.remote.default-remote-dispatcher-8] [akka.tcp://mySystem@192.168.1.41:2551/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FmySystem%40192.168.1.38%3A2551-0] Association with remote system [akka.tcp://mySystem@192.168.1.38:2551] has failed, address is now gated for [5000] ms. Reason is: [Association failed with [akka.tcp://mySystem@192.168.1.38:2551]].
No other errors or warnings. Also, how can I test the Akka association or print/debug the Akka association settings?
Can Linux settings also influence the Akka association?
Most probably iptables is blocking the port; if this is just a test configuration, you can simply disable iptables:
service iptables save
service iptables stop
chkconfig iptables off
service ip6tables save
service ip6tables stop
chkconfig ip6tables off
If that does not help, try checking your SELinux configuration using the getenforce command; for test purposes you can also disable it completely (see the SELinux manual).
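Before touching the Akka configuration, it can also be worth confirming plain TCP reachability of the remoting port the question uses (2551 there, 6001 in the config below); a quick sketch:
# from 192.168.1.41: can the seed node's remoting port be reached at all?
nc -zv 192.168.1.38 2551

# on 192.168.1.38: is anything actually listening on that port?
ss -tlnp | grep 2551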
In your application.conf, try using the following configuration on each node:
akka {
  log-dead-letters = on
  loglevel = "debug"
  actor {
    provider = "akka.cluster.ClusterActorRefProvider"
  }
  extensions = ["akka.contrib.pattern.ClusterReceptionistExtension"]
  remote {
    log-remote-lifecycle-events = off
    netty.tcp {
      port = 6001
    }
  }
  cluster {
    seed-nodes = [
      "akka.tcp://ActorSystem@192.168.1.38:6001",
      "akka.tcp://ActorSystem@192.168.1.41:6001"
    ]
    auto-down-unreachable-after = 10s
  }
}
All logs related to the cluster nodes are logged at the info level, but having the debug log level in a test environment is in general a good idea.
When the second node joins the cluster, you should notice a log like the following:
INFO [ActorSystem-akka.actor.default-dispatcher-4] [Cluster(akka://ActorSystem)] - Cluster Node [akka.tcp://ActorSystem@10.0.1.41:6001] - Marking node(s) as REACHABLE [Member(address = akka.tcp://ActorSystem@10.0.1.41:6001, status = Up)]
Cluster state can also be monitored using the JMX akka.Cluster MXBean:
{ "self-address": "akka.tcp://ActorSystem@10.0.1.82:6001", "members": [ { "address": "akka.tcp://ActorSystem@10.0.1.82:6001", "status": "Up" } ], "unreachable": [ ] }