I'm new to GlusterFS and am trying to deploy it as a DaemonSet in a new K8s cluster.
My K8s cluster is set up on bare metal, and all the host machines are Debian 9 based.
I'm getting the GlusterFS DaemonSet from the official Kubernetes Incubator repo, which is here. The image being used is based on CentOS.
Now when I deploy the DaemonSet, all the pods stay in the Pending state. When I describe the pods, I get livenessProbe/readinessProbe failures with the following errors.
[glusterfspod-6h85 /]# systemctl status glusterd
● glusterd.service - GlusterFS, a clustered file-system server
Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Sat 2018-11-10 19:41:53 UTC; 2min 2s ago
Process: 68 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=1/FAILURE)
Nov 10 19:41:53 kubernetes-agent-4 systemd[1]: Starting GlusterFS, a clustered file-system server...
Nov 10 19:41:53 kubernetes-agent-4 systemd[1]: glusterd.service: control process exited, code=exited status=1
Nov 10 19:41:53 kubernetes-agent-4 systemd[1]: Failed to start GlusterFS, a clustered file-system server.
Nov 10 19:41:53 kubernetes-agent-4 systemd[1]: Unit glusterd.service entered failed state.
Nov 10 19:41:53 kubernetes-agent-4 systemd[1]: glusterd.service failed.
Then I exec into the pods and check the logs, and they say:
-- Unit sshd.service has begun starting up.
Nov 10 19:35:24 kubernetes-agent-4 sshd[93]: error: Bind to port 2222 on 0.0.0.0 failed: Address already in use.
Nov 10 19:35:24 kubernetes-agent-4 sshd[93]: error: Bind to port 2222 on :: failed: Address already in use.
Nov 10 19:35:24 kubernetes-agent-4 sshd[93]: fatal: Cannot bind any address.
Nov 10 19:35:24 kubernetes-agent-4 systemd[1]: sshd.service: main process exited, code=exited, status=255/n/a
Nov 10 19:35:24 kubernetes-agent-4 systemd[1]: Failed to start OpenSSH server daemon.
And
[2018-11-10 19:34:42.330154] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory
[2018-11-10 19:34:42.330165] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory
[2018-11-10 19:34:42.333893] E [socket.c:802:__socket_server_bind] 0-socket.management: binding to failed: Address already in use
[2018-11-10 19:34:42.333911] E [socket.c:805:__socket_server_bind] 0-socket.management: Port is already in use
[2018-11-10 19:34:42.333926] W [rpcsvc.c:1788:rpcsvc_create_listener] 0-rpc-service: listening on transport failed
[2018-11-10 19:34:42.333938] E [MSGID: 106244] [glusterd.c:1757:init] 0-management: creation of listener failed
[2018-11-10 19:34:42.333949] E [MSGID: 101019] [xlator.c:720:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2018-11-10 19:34:42.333965] E [MSGID: 101066] [graph.c:367:glusterfs_graph_init] 0-management: initializing translator failed
[2018-11-10 19:34:42.333974] E [MSGID: 101176] [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
[2018-11-10 19:34:42.334371] W [glusterfsd.c:1514:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x55adc15817dd] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x163) [0x55adc1581683] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x55adc1580b8b] ) 0-: received signum (-1), shutting down
And
[2018-11-10 19:34:03.299298] I [MSGID: 100030] [glusterfsd.c:2691:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 5.0 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2018-11-10 19:34:03.330091] I [MSGID: 106478] [glusterd.c:1435:init] 0-management: Maximum allowed open file descriptors set to 65536
[2018-11-10 19:34:03.330125] I [MSGID: 106479] [glusterd.c:1491:init] 0-management: Using /var/lib/glusterd as working directory
[2018-11-10 19:34:03.330135] I [MSGID: 106479] [glusterd.c:1497:init] 0-management: Using /var/run/gluster as pid file working directory
[2018-11-10 19:34:03.334414] W [MSGID: 103071] [rdma.c:4475:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2018-11-10 19:34:03.334435] W [MSGID: 103055] [rdma.c:4774:init] 0-rdma.management: Failed to initialize IB Device
[2018-11-10 19:34:03.334444] W [rpc-transport.c:339:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2018-11-10 19:34:03.334537] W [rpcsvc.c:1789:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2018-11-10 19:34:03.334549] E [MSGID: 106244] [glusterd.c:1798:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2018-11-10 19:34:05.496746] E [MSGID: 101032] [store.c:447:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
[2018-11-10 19:34:05.496843] E [MSGID: 101032] [store.c:447:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
[2018-11-10 19:34:05.496846] I [MSGID: 106514] [glusterd-store.c:2304:glusterd_restore_op_version] 0-management: Detected new install. Setting op-version to maximum : 50000
[2018-11-10 19:34:05.513644] I [MSGID: 106194] [glusterd-store.c:3983:glusterd_store_retrieve_missed_snaps_list] 0-management: No missed snaps list.
Is there something I have missed? The volumes section of the DaemonSet manifest mounts volumes from hostPath. Should I deploy glusterfs-server on my host machines as well? Or is this a CentOS/Debian mismatch issue?
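To narrow this down, here is a rough way to check from the affected node which processes already hold the ports the pod needs. Port 2222 comes from the sshd error above; 24007 is glusterd's default management port, which the log does not print, so that part is an assumption:
sudo ss -tlnp | grep -E ':(2222|24007)\b'        # who is already listening on these ports?
sudo systemctl status glusterd sshd --no-pager   # is a host-level glusterd/sshd running?
If the DaemonSet runs with hostNetwork: true (the incubator manifest does, as far as I can tell), any glusterd or sshd already running on the host will produce exactly these "Address already in use" errors.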
Related
I installed MongoDB 4.4.1 following the documentation on the MongoDB site. The installation was smooth. But when I check the status of the mongod service, I get the error below:
Loaded: loaded (/lib/systemd/system/mongod.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2020-09-16 22:22:23 IST; 6s ago
Docs: https://docs.mongodb.org/manual
Process: 36655 ExecStart=/usr/bin/mongod --config /etc/mongod.conf (code=exited, status=14)
Main PID: 36655 (code=exited, status=14)

Sep 16 22:22:22 Ubuntu-PC systemd[1]: Started MongoDB Database Server.
Sep 16 22:22:23 Ubuntu-PC systemd[1]: mongod.service: Main process exited, code=exited, status=14
Sep 16 22:22:23 Ubuntu-PC systemd[1]: mongod.service: Failed with result 'exit-code'.
Even when I try to open the mongo shell by typing mongo, I get the following error:
MongoDB shell version v4.4.1
connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Error connecting to 127.0.0.1:27017 :: caused by :: Connection refused :
connect@src/mongo/shell/mongo.js:374:17
@(connect):2:6
exception: connect failed
exiting with code 1
The service is showing as started and enabled. Please help. Where have I made a mistake? Thanks in advance.
Try this:
service mongod stop
systemctl start mongod
To start using MongoDB:
mongo
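If the restart succeeds, a quick way to verify before doing anything else (standard commands, nothing specific to this setup):
sudo systemctl status mongod --no-pager     # should show active (running)
mongo --eval 'db.runCommand({ ping: 1 })'   # output should end with { "ok" : 1 }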
I tried to uncomment unix_socket_directories = '/var/run/postgresql' in postgresql.conf.
But after doing that, when I try to restart PostgreSQL, I receive:
Job for postgresql.service failed because the control process exited with error code. See
"systemctl status postgresql.service" and "journalctl -xe" for details.
Looking at journalctl, I see:
LOG: could not bind Unix socket: Address already in use
HINT: Is another postmaster already running on port 5432? If not, remove socket file
"/var/run/postgresql/.s.PGSQL...and retry.
WARNING: could not create Unix-domain socket in directory "/var/run/postgresql"
Full status output:
Loaded: loaded (/usr/lib/systemd/system/postgresql.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2020-01-29 13:39:52 MSK; 27s ago
Process: 2937 ExecStop=/usr/bin/pg_ctl stop -D ${PGDATA} -s -m fast (code=exited, status=1/FAILURE)
Process: 2550 ExecStart=/usr/bin/pg_ctl start -D ${PGDATA} -s -o -p ${PGPORT} -w -t 300 (code=exited, status=0/SUCCESS)
Process: 2544 ExecStartPre=/usr/bin/postgresql-check-db-dir ${PGDATA} (code=exited, status=0/SUCCESS)
Main PID: 2554 (code=killed, signal=KILL)

plesk.iline.pro systemd[1]: Starting PostgreSQL database server...
plesk.iline.pro pg_ctl[2550]: LOG: could not bind Unix socket: Address already in use
plesk.iline.pro pg_ctl[2550]: HINT: Is another postmaster already running on port 5432? If not, remove socket file "/var/run/postgresql/.s.PGSQL...and retry.
plesk.iline.pro pg_ctl[2550]: WARNING: could not create Unix-domain socket in directory "/var/run/postgresql"
plesk.iline.pro systemd[1]: Started PostgreSQL database server.
plesk.iline.pro systemd[1]: postgresql.service: main process exited, code=killed, status=9/KILL
plesk.iline.pro pg_ctl[2937]: pg_ctl: could not send stop signal (PID: 2554): No such process
plesk.iline.pro systemd[1]: postgresql.service: control process exited, code=exited status=1
plesk.iline.pro systemd[1]: Unit postgresql.service entered failed state.
plesk.iline.pro systemd[1]: postgresql.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
Likely there is another PostgreSQL server running on the same port.
You should do what PostgreSQL recommends:
ls -l /var/run/postgresql/.s.PGSQL.5432
There should be a socket file present.
Then, as user root, see if there is a PostgreSQL process listening on the port:
sudo fuser /var/run/postgresql/.s.PGSQL.5432
If there is a result, there is really another PostgreSQL server running on port 5432. Either stop that server or choose a different port for your cluster.
If there is no result, the socket may be left over. Remove it and try again.
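Putting both cases together, a rough sketch (the socket path is the one from your error message; the service unit name may differ by distribution):
sudo fuser -v /var/run/postgresql/.s.PGSQL.5432   # case 1: identify the process holding the socket
sudo rm /var/run/postgresql/.s.PGSQL.5432         # case 2: nothing listening, so the socket is stale
sudo systemctl start postgresql                   # then retry the start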
I am trying to install Kubernetes on Debian 9.3. I followed the instructions in this document: https://kubernetes.io/docs/setup/independent/install-kubeadm/. It failed to create the cluster with a timeout error. The commands I used are as follows:
export HTTP_PROXY=http://192.168.56.1:1080 # this is my internet proxy
export HTTPS_PROXY=http://192.168.56.1:1080
export NO_PROXY=127.0.0.1,192.168.56.*,10.244.*,10.96.*
kubeadm init --apiserver-advertise-address=192.168.56.101 --pod-network-cidr=10.244.0.0/16
The last command hung for an hour and failed with a timeout. Running docker ps, I found that several containers were running; they included kube-controller-manager-amd64, etcd-amd64, kube-apiserver-amd64, kube-scheduler-amd64, and 4 instances of pause-amd64.
The error messages are as follows:
duler-debvm01_kube-system(660259102d57385a8043d025ac189c87)": Get https://192.168.56.101:6443/api/v1/namespaces/kube-system/pods/kube-scheduler-debvm01: net/http: TLS handshake timeout
Apr 06 21:44:49 DebVM01 kubelet[10665]: E0406 21:44:49.923017 10665 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://192.168.56.101:6443/api/v1/nodes?fieldSelector=metadata.name%3Ddebvm01&limit=500&resourceVersion=0: net/http: TLS handshake timeout
Apr 06 21:44:49 DebVM01 kubelet[10665]: E0406 21:44:49.924966 10665 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://192.168.56.101:6443/api/v1/services?limit=500&resourceVersion=0: net/http: TLS handshake timeout
Apr 06 21:44:49 DebVM01 kubelet[10665]: E0406 21:44:49.925892 10665 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get xxx/api/v1/pods?fieldSelector=spec.nodeName%3Ddebvm01&limit=500&resourceVersion=0: net/http: TLS handshake timeout
Apr 06 21:44:50 DebVM01 kubelet[10665]: E0406 21:44:50.029333 10665 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "debvm01" not found
Apr 06 21:44:50 DebVM01 kubelet[10665]: E0406 21:44:50.379543 10665 kubelet_node_status.go:106] Unable to register node "debvm01" with API server: Post xxx: net/http: TLS handshake timeout
Apr 06 21:44:52 DebVM01 kubelet[10665]: E0406 21:44:52.575452 10665 event.go:209] Unable to write event: 'Post xxxx: net/http: TLS handshake timeout' (may retry after sleeping)
Apr 06 21:44:57 DebVM01 kubelet[10665]: I0406 21:44:57.380498 10665 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
Apr 06 21:44:57 DebVM01 kubelet[10665]: I0406 21:44:57.430059 10665 kubelet_node_status.go:82] Attempting to register node debvm01
Apr 06 21:45:00 DebVM01 kubelet[10665]: E0406 21:45:00.030635 10665 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "debvm01" not found
Apr 06 21:45:01 DebVM01 kubelet[10665]: I0406 21:45:01.484580 10665 kubelet_node_status.go:85] Successfully registered node debvm01
The error messages above have been trimmed to remove a lot of repeated lines such as the following:
Apr 06 22:46:20 DebVM01 kubelet[10665]: E0406 22:46:20.773690 10665 kubelet.go:2104] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Apr 06 22:46:25 DebVM01 kubelet[10665]: W0406 22:46:25.779141 10665 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Kubernetes v1.9.3
Could anyone help me?
kubeadm init --apiserver-advertise-address=192.168.56.101
--pod-network-cidr=10.244.0.0/16
From the kubeadm documentation:
--apiserver-advertise-address ip-address The IP address the API Server will advertise it's listening on. Specify '0.0.0.0' to use the
address of the default network interface.
Unless otherwise specified, kubeadm uses the default gateway’s network
interface to advertise the master’s IP. If you want to use a different
network interface, specify --apiserver-advertise-address=ip-address
From the Kubernetes api-server documentation:
--advertise-address ip-address The IP address on which to advertise the apiserver to members of the cluster. This address must
be reachable by the rest of the cluster. If blank, the --bind-address
will be used. If --bind-address is unspecified, the host's default
interface will be used.
I've done a couple of experiments which confirm that the ip-address must be configured (or added as a secondary IP) on one of the master instance's interfaces.
Just double-check that the interface is up.
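For example, to verify that the advertised address is actually assigned to an interface and that the interface is up (plain iproute2 commands; the address is the one from your kubeadm init):
ip -o addr show | grep 192.168.56.101   # should list the interface carrying the address
ip link show up                         # that same interface should appear here as UP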
The last error message,
network plugin is not ready: cni config uninitialized
means that the Kubernetes networking subsystem is absent or broken. Try to install/reinstall it with:
kubectl apply -f https://docs.projectcalico.org/v3.0/getting-started/kubernetes/installation/hosted/kubeadm/1.7/calico.yaml
This part is described in the section "(3/4) Installing a pod network" of the document you mentioned.
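After applying the manifest, you can watch the CNI pods come up and the node become Ready (standard kubectl commands; exact pod names will vary):
kubectl get pods -n kube-system   # the calico (and kube-dns) pods should reach Running
kubectl get nodes                 # the node should flip from NotReady to Ready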
If you are stuck, try to reinstall your cluster following this manual.
DigitalOcean disabled my droplet's internet access. After I fixed the error (a rollback to an older backup), they restored the internet access. But afterwards I constantly get an error when deploying, and I can't seem to get my Postgres database up and running.
I'm getting an error each time I try to deploy my application.
PG::ConnectionBad: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
So I used SSH to log in to my server and checked whether my Postgres was actually running with:
pg_lsclusters
Results into:
Ver Cluster Port Status Owner Data directory Log file
9.5 main 5432 down postgres /var/lib/postgresql/9.5/main /var/log/postgresql/postgresql-9.5-main.log
So my Postgres server seems to be down. I tried putting it 'up' again with:
pg_ctlcluster 9.5 main start
After doing so I got the error: Insecure directory in $ENV{PATH} while running with -T switch at /usr/bin/pg_ctlcluster line 403.
And /usr/bin/pg_ctlcluster on line 403 says:
system 'systemctl', 'is-active', '-q', "postgresql\#$version-$cluster";
But I'm not too sure what the problem could be here or how I could fix it.
Update
I also tried updating the permissions on /bin to 755 as mentioned here. Sadly that did not fix my problem.
Update 2
I changed /usr/bin to 755. Now when I try pg_ctlcluster 9.5 main start, I get this:
Job for postgresql@9.5-main.service failed because the control process exited with error code. See "systemctl status postgresql@9.5-main.service" and "journalctl -xe" for details.
And inside the systemctl status postgresql@9.5-main.service:
postgresql@9.5-main.service - PostgreSQL Cluster 9.5-main
Loaded: loaded (/lib/systemd/system/postgresql@.service; disabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2018-01-28 17:32:38 EST; 45s ago
Process: 22473 ExecStart=postgresql@%i --skip-systemctl-redirect %i start (code=exited, status=1/FAILURE)
Jan 28 17:32:08 *url* systemd[1]: Starting PostgreSQL Cluster 9.5-main...
Jan 28 17:32:38 *url* postgresql@9.5-main[22473]: The PostgreSQL server failed to start.
Jan 28 17:32:38 *url* systemd[1]: postgresql@9.5-main.service: Control process exited, code=exited status=1
Jan 28 17:32:38 *url* systemd[1]: Failed to start PostgreSQL Cluster 9.5-main.
Jan 28 17:32:38 *url* systemd[1]: postgresql@9.5-main.service: Unit entered failed state.
Jan 28 17:32:38 *url* systemd[1]: postgresql@9.5-main.service: Failed with result 'exit-code'.
Thanks!
You'd better not mix systemctl and pg_ctlcluster. Let systemctl make the calls to pg_ctlcluster with the right user and permissions. You should start your PostgreSQL instance with:
sudo systemctl start postgresql@9.5-main.service
Also, check the errors in the startup log. You can post them too, to help you figure out what's going on.
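For example (the log file path matches the 9.5/main cluster layout shown by pg_lsclusters above; adjust it if yours differs):
sudo journalctl -u postgresql@9.5-main.service -n 50 --no-pager   # systemd's view of the failed start
sudo tail -n 50 /var/log/postgresql/postgresql-9.5-main.log       # PostgreSQL's own startup log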
Your systemctl status output also shows that the service is disabled, so when the server reboots you will have to start the service manually. To enable it, run:
sudo systemctl enable postgresql@9.5-main.service
I hope it helps.
It is mainly because the /etc/hosts file has somehow been changed. I removed the extra space inside the /etc/hosts file. Use cat /etc/hosts to check it.
Add these lines to the file:
127.0.0.1 localhost
127.0.1.1 your-host-name
::1 ip6-localhost ip6-loopback
And I gave permission 644 to the /etc/hosts file. It is working for me even after a reboot of the system.
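A minimal sketch of those steps (standard shell; root:root ownership is the usual default and an assumption here):
cat /etc/hosts                    # inspect the current entries for stray spaces
sudo chmod 644 /etc/hosts         # restore the standard permissions
sudo chown root:root /etc/hosts   # and the usual ownership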
I have a setup where I am using 3 Mesos masters and 3 Mesos slaves. After making all the required configurations, I can see that the 3 Mesos masters are part of a cluster maintained by ZooKeeper.
Now I have set up 3 Mesos slaves, and when I start the mesos-slave service, I expect the slaves to become visible on the Mesos masters' web UI page. But I cannot see any of them in the Slaves tab.
SELinux, the firewall, and iptables are all disabled. I am able to SSH between the nodes.
[cloud-user@slave1 ~]$ sudo systemctl status mesos-slave -l
mesos-slave.service - Mesos Slave
Loaded: loaded (/usr/lib/systemd/system/mesos-slave.service; enabled)
Active: active (running) since Sat 2016-01-16 16:11:55 UTC; 3s ago
Main PID: 2483 (mesos-slave)
CGroup: /system.slice/mesos-slave.service
├─2483 /usr/sbin/mesos-slave --master=zk://10.0.0.2:2181,10.0.0.6:2181,10.0.0.7:2181/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --executor_registration_timeout=5mins
├─2493 logger -p user.info -t mesos-slave[2483]
└─2494 logger -p user.err -t mesos-slave[2483]
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628670 2497 detector.cpp:482] A new leading master (UPID=master@127.0.0.1:5050) is detected
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628732 2497 slave.cpp:729] New master detected at master@127.0.0.1:5050
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628825 2497 slave.cpp:754] No credentials provided. Attempting to register without authentication
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628844 2497 slave.cpp:765] Detecting new master
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628872 2497 status_update_manager.cpp:176] Pausing sending status updates
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: E0116 16:11:55.628922 2503 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.629093 2502 slave.cpp:3215] master@127.0.0.1:5050 exited
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: W0116 16:11:55.629107 2502 slave.cpp:3218] Master disconnected! Waiting for a new master to be elected
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: E0116 16:11:55.983531 2503 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
Jan 16 16:11:57 slave1.novalocal mesos-slave[2494]: E0116 16:11:57.465049 2503 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
So the problematic line is:
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.629093 2502 slave.cpp:3215] master@127.0.0.1:5050 exited
Specifically, note that it's detecting the master as having the IP address 127.0.0.1. The Mesos agent[1] sees that IP address and tries to connect, which fails (the master isn't running on the same machine as the agent).
This happens because the master announces what it thinks its IP address is into ZooKeeper. In your case, the master thinks its IP is 127.0.0.1 and stores that into zk. Mesos has several configuration flags to control this behavior, mainly --hostname, --no-hostname_lookup, --ip, --ip_discovery_command, and the environment variable LIBPROCESS_IP. See http://mesos.apache.org/documentation/latest/configuration/ for details about them and what they do.
The best thing you can do to make sure things work out of the box is to make sure the machines have resolvable hostnames. Mesos does a reverse-DNS lookup of the box's hostname in order to figure out which IP others will contact it on.
If you can't get the hostnames set up properly, I would recommend setting --hostname and --ip manually, which should cause Mesos to announce exactly what you want. For example:
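A rough sketch of that second option (the ZooKeeper addresses are the ones from your slave's --master flag; the hostname is a placeholder, and --quorum=2 assumes your 3-master setup):
mesos-master --zk=zk://10.0.0.2:2181,10.0.0.6:2181,10.0.0.7:2181/mesos \
             --quorum=2 \
             --work_dir=/var/lib/mesos \
             --ip=10.0.0.2 \
             --hostname=master1.example.com   # announce this instead of 127.0.0.1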
[1] The Mesos slave has been renamed to agent; see: https://issues.apache.org/jira/browse/MESOS-1478