mesos slaves are not connecting with mesos masters cluster - apache-zookeeper

I have a setup where I am using 3 Mesos masters and 3 Mesos slaves. After making all the required configurations I can see the 3 Mesos masters are part of a cluster which is maintained by ZooKeeper.
Now I have set up 3 Mesos slaves, and when I start the mesos-slave service I expect the slaves to show up on the Mesos masters' web UI page. But I cannot see any of them in the Slaves tab.
SELinux, firewall, and iptables are all disabled, and SSH works between the nodes.
[cloud-user@slave1 ~]$ sudo systemctl status mesos-slave -l
mesos-slave.service - Mesos Slave
Loaded: loaded (/usr/lib/systemd/system/mesos-slave.service; enabled)
Active: active (running) since Sat 2016-01-16 16:11:55 UTC; 3s ago
Main PID: 2483 (mesos-slave)
CGroup: /system.slice/mesos-slave.service
├─2483 /usr/sbin/mesos-slave --master=zk://10.0.0.2:2181,10.0.0.6:2181,10.0.0.7:2181/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --executor_registration_timeout=5mins
├─2493 logger -p user.info -t mesos-slave[2483]
└─2494 logger -p user.err -t mesos-slave[2483]
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628670 2497 detector.cpp:482] A new leading master (UPID=master@127.0.0.1:5050) is detected
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628732 2497 slave.cpp:729] New master detected at master@127.0.0.1:5050
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628825 2497 slave.cpp:754] No credentials provided. Attempting to register without authentication
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628844 2497 slave.cpp:765] Detecting new master
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628872 2497 status_update_manager.cpp:176] Pausing sending status updates
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: E0116 16:11:55.628922 2503 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.629093 2502 slave.cpp:3215] master@127.0.0.1:5050 exited
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: W0116 16:11:55.629107 2502 slave.cpp:3218] Master disconnected! Waiting for a new master to be elected
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: E0116 16:11:55.983531 2503 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
Jan 16 16:11:57 slave1.novalocal mesos-slave[2494]: E0116 16:11:57.465049 2503 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected

So the problematic line is:
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.629093 2502 slave.cpp:3215] master@127.0.0.1:5050 exited
Specifically, note it's detecting the master as having the IP address 127.0.0.1. The Mesos agent[1] sees that IP address and tries to connect, which fails (the master isn't running on the same machine as the agent).
This happens because the master announces what it thinks its IP address is into ZooKeeper. In your case, the master thinks its IP is 127.0.0.1 and stores that into ZK. Mesos has several configuration flags to control this behavior, mainly --hostname, --no-hostname_lookup, --ip, --ip_discovery_command, and the environment variable LIBPROCESS_IP. See http://mesos.apache.org/documentation/latest/configuration/ for details about them and what they do.
The best thing you can do to make sure things work out of the box is to make sure the machines have resolvable hostnames. Mesos does a reverse-DNS lookup of the box's hostname in order to figure out which IP other machines will use to contact it.
If you can't get the hostnames set up properly, I would recommend setting --hostname and --ip manually, which should cause Mesos to announce exactly what you want.
[1]The mesos slave has been renamed to agent, see: https://issues.apache.org/jira/browse/MESOS-1478
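For example, a minimal sketch of pinning the advertised address on one master (the 10.0.0.2 address and master1.example.com hostname are placeholders; substitute each machine's real values):

```shell
# Advertise a routable address instead of 127.0.0.1
# (10.0.0.2 / master1.example.com are assumed values for this machine)
mesos-master --zk=zk://10.0.0.2:2181,10.0.0.6:2181,10.0.0.7:2181/mesos \
             --quorum=2 \
             --ip=10.0.0.2 \
             --hostname=master1.example.com

# Alternatively, set the address for libprocess before starting the service:
export LIBPROCESS_IP=10.0.0.2
```

The same --ip/--hostname flags exist on mesos-slave, so the agents can be pinned the same way.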

Related

PSQL timeline conflict prevents start of master

We had an outage on one of our PSQL 14 clusters (managed by Zalando) due to the k8s control plane being unreachable for 30 min.
The control plane is now OK, but the master PSQL does not want to start:
LOG,00000,"listening on IPv4 address ""0.0.0.0"", port 5432"
LOG,00000,"listening on IPv6 address ""::"", port 5432"
LOG,00000,"listening on Unix socket ""/var/run/postgresql/.s.PGSQL.5432"""
LOG,00000,"database system was shut down at 2023-01-30 02:51:10 UTC"
WARNING,01000,"specified neither primary_conninfo nor restore_command",,"The database server will regularly poll the pg_wal subdirectory to check for files placed there."
LOG,00000,"entering standby mode"
FATAL,XX000,"requested timeline 5 is not a child of this server's history","Latest checkpoint is at 2/82000028 on timeline 4, but in the history of the requested timeline, the server forked off from that timeline at 0/530000A0."
LOG,00000,"startup process (PID 23007) exited with exit code 1"
LOG,00000,"aborting startup due to startup process failure"
LOG,00000,"database system is shut down"
We can see in archive_status folder:
-rw-------. 1 postgres postgres 0 Jan 30 02:51 000000040000000200000081.ready
-rw-------. 1 postgres postgres 0 Jan 30 02:51 00000005.history.done
Would you know how we can recover safely from this?
I guess switching back to timeline 4 would be enough, as timeline 5 was created after the start of the outage.
The server is started in standby mode. Remove standby.signal if you want to start the server as a primary.
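A minimal sketch of that promotion, assuming $PGDATA points at the cluster's data directory (the path is an assumption; with a Zalando/Patroni-managed cluster you would normally let the operator handle promotion):

```shell
# Confirm the server is configured as a standby
ls "$PGDATA"/standby.signal

# Remove the signal file so the next startup runs as a primary
rm "$PGDATA"/standby.signal

# Start the server again
pg_ctl -D "$PGDATA" start
```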

connect to TDengine via restful, but the port is not listening

When I try to connect to TDengine via its RESTful API, its port 6041 is not listening.
Following is more detailed info.
systemctl status taosd
● taosd.service - TDengine server service
Loaded: loaded (/etc/systemd/system/taosd.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2021-10-20 15:08:33 CST; 48min ago
Process: 16246 ExecStartPre=/usr/local/taos/bin/startPre.sh (code=exited, status=0/SUCCESS)
Main PID: 16257 (taosd)
Tasks: 57
Memory: 30.1M
CGroup: /system.slice/taosd.service
└─16257 /usr/bin/taosd
Oct 20 15:08:33 ecs-29b3 systemd[1]: Starting TDengine server service...
Oct 20 15:08:33 ecs-29b3 systemd[1]: Started TDengine server service.
Oct 20 15:08:33 ecs-29b3 TDengine:[16257]: Starting TDengine service...
Oct 20 15:08:33 ecs-29b3 TDengine:[16257]: Started TDengine service successfully.
netstat -antp|grep 6030
tcp 0 0 0.0.0.0:6030 0.0.0.0:* LISTEN 16257/taosd
netstat -antp|grep 6041
Any suggestion?
You can check whether taosadapter is running with ps -e | grep taosadapter. If there is no taosadapter running, you should start it.
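Something along these lines (the taosadapter unit name is an assumption based on the default TDengine packaging):

```shell
# Check whether taosadapter is running
ps -e | grep taosadapter

# If it is not, start it and verify that port 6041 is now listening
systemctl start taosadapter
netstat -antp | grep 6041
```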

PG::ConnectionBad Postgres Cluster down

DigitalOcean disabled my droplet's internet access. After fixing the error (rolling back to an older backup) they restored the internet access. But afterwards I constantly get an error when deploying; I can't seem to get my Postgres database up and running.
I'm getting the error each time I try to deploy my application.
PG::ConnectionBad: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
So I used SSH to login to my server and check if my Postgres was actually running with:
pg_lsclusters
Results into:
Ver Cluster Port Status Owner Data directory Log file
9.5 main 5432 down postgres /var/lib/postgresql/9.5/main /var/log/postgresql/postgresql-9.5-main.log
Postgres server status
So my Postgres server seems to be down. I tried putting it 'up' again with:
pg_ctlcluster 9.5 main start
After doing so I got the error: Insecure directory in $ENV{PATH} while running with -T switch at /usr/bin/pg_ctlcluster line 403.
And /usr/bin/pg_ctlcluster on line 403 says:
system 'systemctl', 'is-active', '-q', "postgresql\@$version-$cluster";
But I'm not to sure what the problem could be here and how I could fix this.
Update
I also tried updating the permissions on /bin to 755 as mentioned here. Sadly that did not fix my problem.
Update 2
I changed the /usr/bin to 755. Now when I try pg_ctlcluster 9.5 main start, I get this:
Job for postgresql@9.5-main.service failed because the control process exited with error code. See "systemctl status postgresql@9.5-main.service" and "journalctl -xe" for details.
And inside the systemctl status postgresql@9.5-main.service:
postgresql@9.5-main.service - PostgreSQL Cluster 9.5-main
Loaded: loaded (/lib/systemd/system/postgresql@.service; disabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2018-01-28 17:32:38 EST; 45s ago
Process: 22473 ExecStart=postgresql@%i --skip-systemctl-redirect %i start (code=exited, status=1/FAILURE)
Jan 28 17:32:08 *url* systemd[1]: Starting PostgreSQL Cluster 9.5-main...
Jan 28 17:32:38 *url* postgresql@9.5-main[22473]: The PostgreSQL server failed to start.
Jan 28 17:32:38 *url* systemd[1]: postgresql@9.5-main.service: Control process exited, code=exited status=1
Jan 28 17:32:38 *url* systemd[1]: Failed to start PostgreSQL Cluster 9.5-main.
Jan 28 17:32:38 *url* systemd[1]: postgresql@9.5-main.service: Unit entered failed state.
Jan 28 17:32:38 *url* systemd[1]: postgresql@9.5-main.service: Failed with result 'exit-code'.
Thanks!
You'd better not mix systemctl and pg_ctlcluster. Let systemctl make the calls to pg_ctlcluster with the right user and permissions. You should start your PostgreSQL instance with
sudo systemctl start postgresql@9.5-main.service
Also, check the errors in the startup log. You can post them too, to help figure out what's going on.
Your systemctl status output also shows that the service is disabled, so when the server reboots you will have to start the service manually. To enable it, run:
sudo systemctl enable postgresql@9.5-main.service
I hope it helps.
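Put together, a possible sequence (the unit name comes from the status output above; the log path is the one pg_lsclusters reported):

```shell
# Start the cluster through systemd so pg_ctlcluster runs with the right user
sudo systemctl start postgresql@9.5-main.service

# Enable it so it starts automatically after a reboot
sudo systemctl enable postgresql@9.5-main.service

# If startup still fails, the real error is usually in the cluster log
sudo tail -n 50 /var/log/postgresql/postgresql-9.5-main.log
```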
It is mainly because the /etc/hosts file was somehow changed. I removed an extra space inside the /etc/hosts file. Use cat /etc/hosts to inspect it.
Add these lines into the file:
127.0.0.1 localhost
127.0.1.1 your-host-name
::1 ip6-localhost ip6-loopback
And I gave permission 644 to the /etc/hosts file. It is working for me even after a reboot of the system.
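The permission fix from the answer, spelled out as commands:

```shell
# Inspect the file for stray whitespace or missing entries
cat /etc/hosts

# Restore standard ownership and permissions
sudo chown root:root /etc/hosts
sudo chmod 644 /etc/hosts
```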

Kubernetes starts giving errors after few hours of uptime

I have installed K8S on OpenStack following this guide.
The installation went fine and I was able to run pods, but after some time my applications stop working. I can still create pods, but requests won't reach the services from outside the cluster, nor from within the pods. Basically, something in the networking gets messed up. iptables -L -vnt nat still shows the proper configuration, but things won't work.
To make it work I have to rebuild the cluster; removing all services and replication controllers doesn't help.
I tried to look into the logs. Below is the journal for kube-proxy:
Dec 20 02:12:18 minion01.novalocal systemd[1]: Started Kubernetes Proxy.
Dec 20 02:15:52 minion01.novalocal kube-proxy[1030]: I1220 02:15:52.269784 1030 proxier.go:487] Opened iptables from-containers public port for service "default/opensips:sipt" on TCP port 5060
Dec 20 02:15:52 minion01.novalocal kube-proxy[1030]: I1220 02:15:52.278952 1030 proxier.go:498] Opened iptables from-host public port for service "default/opensips:sipt" on TCP port 5060
Dec 20 03:05:11 minion01.novalocal kube-proxy[1030]: W1220 03:05:11.806927 1030 api.go:224] Got error status on WatchEndpoints channel: &{TypeMeta:{Kind: APIVersion:} ListMeta:{SelfLink: ResourceVersion:} Status:Failure Message:401: The event in requested index is outdated and cleared (the requested history has been cleared [1433/544]) [2432] Reason: Details:<nil> Code:0}
Dec 20 03:06:08 minion01.novalocal kube-proxy[1030]: W1220 03:06:08.177225 1030 api.go:153] Got error status on WatchServices channel: &{TypeMeta:{Kind: APIVersion:} ListMeta:{SelfLink: ResourceVersion:} Status:Failure Message:401: The event in requested index is outdated and cleared (the requested history has been cleared [1476/207]) [2475] Reason: Details:<nil> Code:0}
..
..
..
Dec 20 16:01:23 minion01.novalocal kube-proxy[1030]: E1220 16:01:23.448570 1030 proxier.go:161] Failed to ensure iptables: error creating chain "KUBE-PORTALS-CONTAINER": fork/exec /usr/sbin/iptables: too many open files:
Dec 20 16:01:23 minion01.novalocal kube-proxy[1030]: W1220 16:01:23.448749 1030 iptables.go:203] Error checking iptables version, assuming version at least 1.4.11: %vfork/exec /usr/sbin/iptables: too many open files
Dec 20 16:01:23 minion01.novalocal kube-proxy[1030]: E1220 16:01:23.448868 1030 proxier.go:409] Failed to install iptables KUBE-PORTALS-CONTAINER rule for service "default/kubernetes:"
Dec 20 16:01:23 minion01.novalocal kube-proxy[1030]: E1220 16:01:23.448906 1030 proxier.go:176] Failed to ensure portal for "default/kubernetes:": error checking rule: fork/exec /usr/sbin/iptables: too many open files:
Dec 20 16:01:23 minion01.novalocal kube-proxy[1030]: W1220 16:01:23.449006 1030 iptables.go:203] Error checking iptables version, assuming version at least 1.4.11: %vfork/exec /usr/sbin/iptables: too many open files
Dec 20 16:01:23 minion01.novalocal kube-proxy[1030]: E1220 16:01:23.449133 1030 proxier.go:409] Failed to install iptables KUBE-PORTALS-CONTAINER rule for service "default/repo-client:"
I found a few posts relating to "failed to install iptables", but they don't seem to be relevant, as initially everything works and it only gets messed up after a few hours.
What version of Kubernetes is this? A long time ago (~1.0.4) we had a bug in the kube-proxy where it leaked sockets/file descriptors.
If you aren't running a 1.1.3 binary, consider upgrading.
Also, you should be able to use lsof to figure out who has all of the files open.
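For example, a quick sketch of checking the descriptor count (the pidof lookup assumes a single kube-proxy process):

```shell
# Count open file descriptors held by kube-proxy
lsof -p "$(pidof kube-proxy)" | wc -l

# Compare against the per-process limit
grep "open files" /proc/"$(pidof kube-proxy)"/limits
```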

Changing port of OpenLdap on Centos installed with yum

I am trying to change the default port of OpenLDAP (I am not so experienced with OpenLDAP, so I might be doing something incorrectly).
Currently I am installing it through the yum package manager on CentOS 7.1.1503 as follows:
yum install openldap-servers
After installing 'openldap-servers' I can start the OpenLDAP server by invoking service slapd start;
however, when I try to change the port by editing /etc/sysconfig/slapd, for instance by changing SLAPD_URLS to the following:
# OpenLDAP server configuration
# see 'man slapd' for additional information
# Where the server will run (-h option)
# - ldapi:/// is required for on-the-fly configuration using client tools
# (use SASL with EXTERNAL mechanism for authentication)
# - default: ldapi:/// ldap:///
# - example: ldapi:/// ldap://127.0.0.1/ ldap://10.0.0.1:1389/ ldaps:///
SLAPD_URLS="ldapi:/// ldap://127.0.0.1:3421/"
# Any custom options
#SLAPD_OPTIONS=""
# Keytab location for GSSAPI Kerberos authentication
#KRB5_KTNAME="FILE:/etc/openldap/ldap.keytab"
(see SLAPD_URLS="ldapi:/// ldap://127.0.0.1:3421/"),
it fails to start:
service slapd start
Redirecting to /bin/systemctl start slapd.service
Job for slapd.service failed. See 'systemctl status slapd.service' and 'journalctl -xn' for details.
service slapd status
Redirecting to /bin/systemctl status slapd.service
slapd.service - OpenLDAP Server Daemon
Loaded: loaded (/usr/lib/systemd/system/slapd.service; disabled)
Active: failed (Result: exit-code) since Fri 2015-07-31 07:49:06 EDT; 10s ago
Docs: man:slapd
man:slapd-config
man:slapd-hdb
man:slapd-mdb
file:///usr/share/doc/openldap-servers/guide.html
Process: 41704 ExecStart=/usr/sbin/slapd -u ldap -h ${SLAPD_URLS} $SLAPD_OPTIONS (code=exited, status=1/FAILURE)
Process: 41675 ExecStartPre=/usr/libexec/openldap/check-config.sh (code=exited, status=0/SUCCESS)
Main PID: 34363 (code=exited, status=0/SUCCESS)
Jul 31 07:49:06 osboxes runuser[41691]: pam_unix(runuser:session): session opened for user ldap by (uid=0)
Jul 31 07:49:06 osboxes runuser[41693]: pam_unix(runuser:session): session opened for user ldap by (uid=0)
Jul 31 07:49:06 osboxes runuser[41695]: pam_unix(runuser:session): session opened for user ldap by (uid=0)
Jul 31 07:49:06 osboxes runuser[41697]: pam_unix(runuser:session): session opened for user ldap by (uid=0)
Jul 31 07:49:06 osboxes runuser[41699]: pam_unix(runuser:session): session opened for user ldap by (uid=0)
Jul 31 07:49:06 osboxes runuser[41701]: pam_unix(runuser:session): session opened for user ldap by (uid=0)
Jul 31 07:49:06 osboxes slapd[41704]: @(#) $OpenLDAP: slapd 2.4.39 (Mar 6 2015 04:35:49) $
mockbuild@worker1.bsys.centos.org:/builddir/build/BUILD/openldap-2.4.39/openldap-2.4.39/servers/slapd
Jul 31 07:49:06 osboxes systemd[1]: slapd.service: control process exited, code=exited status=1
Jul 31 07:49:06 osboxes systemd[1]: Failed to start OpenLDAP Server Daemon.
Jul 31 07:49:06 osboxes systemd[1]: Unit slapd.service entered failed state.
P.S. I also disabled firewalld.
The solution was provided when I ran journalctl -xn, which basically says:
SELinux is preventing /usr/sbin/slapd from name_bind access on the tcp_socket port 9312.
***** Plugin bind_ports (92.2 confidence) suggests ************************
If you want to allow /usr/sbin/slapd to bind to network port 9312
Then you need to modify the port type.
Do
# semanage port -a -t ldap_port_t -p tcp 9312
***** Plugin catchall_boolean (7.83 confidence) suggests ******************
If you want to allow nis to enabled
Then you must tell SELinux about this by enabling the 'nis_enabled' boolean.
You can read 'None' man page for more details.
Do
setsebool -P nis_enabled 1
***** Plugin catchall (1.41 confidence) suggests **************************
If you believe that slapd should be allowed name_bind access on the port 9312 tcp_socket by default.
Then you should report this as a bug.
You can generate a local policy module to allow this access.
Do
allow this access for now by executing:
# grep slapd /var/log/audit/audit.log | audit2allow -M mypol
# semodule -i mypol.pp
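Note the SELinux alert above mentions port 9312, while the question's SLAPD_URLS uses 3421; the same fix applies to whichever port slapd is actually trying to bind. A sketch for the port from the question (3421 is taken from SLAPD_URLS; substitute your own):

```shell
# Tell SELinux that slapd may bind to the custom LDAP port
semanage port -a -t ldap_port_t -p tcp 3421

# Verify the port is now labeled, then restart slapd
semanage port -l | grep ldap_port_t
systemctl restart slapd
```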