I was trying to build a customized image of PostgreSQL + repmgr + TimescaleDB on Docker.
Here is my Dockerfile:
FROM bitnami/postgresql-repmgr:12.4.0-debian-10-r90
USER root
RUN apt-get update \
    && apt-get -y install \
       gcc cmake git clang-format clang-tidy openssl libssl-dev \
    && git clone https://github.com/timescale/timescaledb.git
RUN cd timescaledb \
    && git checkout 2.8.1 \
    && ./bootstrap -DREGRESS_CHECKS=OFF -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
    && cd build \
    && make \
    && make install
RUN echo 'en_US.UTF-8 UTF-8' >> /etc/locale.gen && locale-gen
USER 1001
build command:
docker build -f dockerfile -t my/pg-repmgr-12-tsdb:12.4.0-debian-10-r90 .
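As a sanity check that the extension actually landed in the image, a quick look at the PostgreSQL library directory helps. This is only a sketch: it bypasses the Bitnami entrypoint and assumes pg_config is on the image's PATH (it must be, or the TimescaleDB build above would have failed):
docker run --rm --entrypoint bash my/pg-repmgr-12-tsdb:12.4.0-debian-10-r90 \
  -c 'ls "$(pg_config --pkglibdir)"/timescaledb*.so'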
When I tested it, the primary node ran perfectly, but when I tried to bring up a standby node, the instance stopped almost immediately after starting up, leaving only these logs:
postgresql-repmgr 18:51:11.00
postgresql-repmgr 18:51:11.00 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 18:51:11.00 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-postgresql-repmgr
postgresql-repmgr 18:51:11.00 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-postgresql-repmgr/issues
postgresql-repmgr 18:51:11.01
postgresql-repmgr 18:51:11.03 INFO ==> ** Starting PostgreSQL with Replication Manager setup **
postgresql-repmgr 18:51:11.05 INFO ==> Validating settings in REPMGR_* env vars...
postgresql-repmgr 18:51:11.06 INFO ==> Validating settings in POSTGRESQL_* env vars..
postgresql-repmgr 18:51:11.06 INFO ==> Querying all partner nodes for common upstream node...
postgresql-repmgr 18:51:11.13 INFO ==> Auto-detected primary node: 'pg-0:5432'
postgresql-repmgr 18:51:11.14 INFO ==> Preparing PostgreSQL configuration...
postgresql-repmgr 18:51:11.14 INFO ==> postgresql.conf file not detected. Generating it...
postgresql-repmgr 18:51:11.26 INFO ==> Preparing repmgr configuration...
postgresql-repmgr 18:51:11.27 INFO ==> Initializing Repmgr...
postgresql-repmgr 18:51:11.28 INFO ==> Waiting for primary node...
postgresql-repmgr 18:51:11.30 INFO ==> Cloning data from primary node...
postgresql-repmgr 18:51:12.11 INFO ==> Initializing PostgreSQL database...
postgresql-repmgr 18:51:12.11 INFO ==> Cleaning stale /bitnami/postgresql/data/standby.signal file
postgresql-repmgr 18:51:12.12 INFO ==> Custom configuration /opt/bitnami/postgresql/conf/postgresql.conf detected
postgresql-repmgr 18:51:12.13 INFO ==> Custom configuration /opt/bitnami/postgresql/conf/pg_hba.conf detected
postgresql-repmgr 18:51:12.16 INFO ==> Deploying PostgreSQL with persisted data...
postgresql-repmgr 18:51:12.19 INFO ==> Configuring replication parameters
postgresql-repmgr 18:51:12.23 INFO ==> Configuring fsync
postgresql-repmgr 18:51:12.25 INFO ==> Setting up streaming replication slave...
postgresql-repmgr 18:51:12.28 INFO ==> Starting PostgreSQL in background...
postgresql-repmgr 18:51:12.52 INFO ==> Unregistering standby node...
postgresql-repmgr 18:51:12.59 INFO ==> Registering Standby node...
postgresql-repmgr 18:51:12.64 INFO ==> Running standby follow...
postgresql-repmgr 18:51:12.71 INFO ==> Stopping PostgreSQL...
waiting for server to shut down.... done
server stopped
and then the same "normal"-looking sequence repeats through several restarts. The logs were confusing because no error was thrown.
Thanks to the first comment, I checked the PostgreSQL logs (which I later accessed via a mounted volume), and they said:
2022-10-17 12:37:51.070 GMT [171] LOG: pgaudit extension initialized
2022-10-17 12:37:51.070 GMT [171] LOG: starting PostgreSQL 12.4 on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2022-10-17 12:37:51.072 GMT [171] LOG: listening on IPv4 address "0.0.0.0", port 5432
2022-10-17 12:37:51.072 GMT [171] LOG: listening on IPv6 address "::", port 5432
2022-10-17 12:37:51.074 GMT [171] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2022-10-17 12:37:51.106 GMT [171] LOG: redirecting log output to logging collector process
2022-10-17 12:37:51.106 GMT [171] HINT: Future log output will appear in directory "/opt/bitnami/postgresql/logs".
2022-10-17 12:37:51.119 GMT [173] LOG: database system was interrupted; last known up at 2022-10-17 12:37:49 GMT
2022-10-17 12:37:51.242 GMT [173] LOG: entering standby mode
2022-10-17 12:37:51.252 GMT [173] LOG: redo starts at 0/E000028
2022-10-17 12:37:51.266 GMT [173] LOG: consistent recovery state reached at 0/E000100
2022-10-17 12:37:51.266 GMT [171] LOG: database system is ready to accept read only connections
2022-10-17 12:37:51.274 GMT [177] LOG: started streaming WAL from primary at 0/F000000 on timeline 1
2022-10-17 12:37:51.579 GMT [171] LOG: received fast shutdown request
2022-10-17 12:37:51.580 GMT [171] LOG: aborting any active transactions
2022-10-17 12:37:51.580 GMT [177] FATAL: terminating walreceiver process due to administrator command
2022-10-17 12:37:51.581 GMT [174] LOG: shutting down
2022-10-17 12:37:51.601 GMT [171] LOG: database system is shut down
Can someone please tell me where I went wrong? Much appreciated!
Additional information on reproducing:
the command used for the primary node instance:
docker run --detach --name pg-0 --network my-network --env REPMGR_PARTNER_NODES=pg-0,pg-1 --env REPMGR_NODE_NAME=pg-0 --env REPMGR_NODE_NETWORK_NAME=pg-0 --env REPMGR_PRIMARY_HOST=pg-0 --env REPMGR_PASSWORD=repmgrpass --env POSTGRESQL_POSTGRES_PASSWORD=adminpassword --env POSTGRESQL_USERNAME=customuser --env POSTGRESQL_PASSWORD=custompassword --env POSTGRESQL_DATABASE=customdatabase --env POSTGRESQL_SHARED_PRELOAD_LIBRARIES=repmgr,pgaudit,timescaledb -p 5420:5432 -v /etc/localtime:/etc/localtime:ro my/pg-repmgr-12-tsdb:12.4.0-debian-10-r90
the command used for the standby node instance:
docker run --name pg-1 --network my-network --env REPMGR_PARTNER_NODES=pg-0,pg-1 --env REPMGR_NODE_NAME=pg-1 --env REPMGR_NODE_NETWORK_NAME=pg-1 --env REPMGR_PRIMARY_HOST=pg-0 --env REPMGR_PASSWORD=repmgrpass --env POSTGRESQL_POSTGRES_PASSWORD=adminpassword --env POSTGRESQL_USERNAME=customuser --env POSTGRESQL_PASSWORD=custompassword --env POSTGRESQL_DATABASE=customdatabase --env POSTGRESQL_SHARED_PRELOAD_LIBRARIES=repmgr,pgaudit,timescaledb -v /etc/localtime:/etc/localtime:ro -p 5421:5432 my/pg-repmgr-12-tsdb:12.4.0-debian-10-r90
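For reference, the PostgreSQL logs mentioned above can be exposed by adding one more volume for the Bitnami log directory (the container path comes from the HINT in the logs; the host path here is just an example):
docker run --name pg-1 --network my-network \
  --env REPMGR_PARTNER_NODES=pg-0,pg-1 \
  --env REPMGR_NODE_NAME=pg-1 \
  --env REPMGR_NODE_NETWORK_NAME=pg-1 \
  --env REPMGR_PRIMARY_HOST=pg-0 \
  --env REPMGR_PASSWORD=repmgrpass \
  --env POSTGRESQL_POSTGRES_PASSWORD=adminpassword \
  --env POSTGRESQL_USERNAME=customuser \
  --env POSTGRESQL_PASSWORD=custompassword \
  --env POSTGRESQL_DATABASE=customdatabase \
  --env POSTGRESQL_SHARED_PRELOAD_LIBRARIES=repmgr,pgaudit,timescaledb \
  -v /etc/localtime:/etc/localtime:ro \
  -v "$PWD/pg-1-logs":/opt/bitnami/postgresql/logs \
  -p 5421:5432 \
  my/pg-repmgr-12-tsdb:12.4.0-debian-10-r90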
Sometimes the best way through is just to find another...
I changed the Dockerfile to:
FROM bitnami/postgresql-repmgr:13.6.0-debian-10-r90
USER root
RUN apt-get update \
    && apt-get -y install \
       gcc cmake git clang-format clang-tidy openssl libssl-dev \
    && git clone https://github.com/timescale/timescaledb.git
RUN cd timescaledb \
    && git checkout 2.8.0 \
    && ./bootstrap -DREGRESS_CHECKS=OFF -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
    && cd build \
    && make \
    && make install
RUN echo 'en_US.UTF-8 UTF-8' >> /etc/locale.gen && locale-gen
USER 1001
and the problem is solved.
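To confirm the rebuilt image ships the expected TimescaleDB version, querying pg_available_extensions on the running primary works; this is just a sketch reusing the container name and postgres password from the commands above:
docker exec -e PGPASSWORD=adminpassword pg-0 \
  psql -h 127.0.0.1 -U postgres -d customdatabase \
  -c "SELECT name, default_version, installed_version FROM pg_available_extensions WHERE name = 'timescaledb';"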
I honestly have no clue whether it was the base image version or the TimescaleDB version that did the magic, but anyhow my problem is solved. I hope anyone who encounters the same issue later can benefit from my struggle. >3<
I have two fresh Ubuntu VMs:
VM-1 (65.0.54.158)
VM-2 (65.2.136.2)
I am trying to set up an HA k3s cluster with embedded etcd. I am following the official documentation.
Here is what I have executed on VM-1
curl -sfL https://get.k3s.io | K3S_TOKEN=AtJMEyWR8pE3HR4RWgT6IsqglOkBm0sLC4n0aDBkng9VE1uqyNevR6oCMNCqQNaF sh -s - server --cluster-init
Here is the response from VM-1
curl -sfL https://get.k3s.io | K3S_TOKEN=AtJMEyWR8pE3HR4RWgT6IsqglOkBm0sLC4n0aDBkng9VE1uqyNevR6oCMNCqQNaF sh -s - server --cluster-init
[INFO] Finding release for channel stable
[INFO] Using v1.24.4+k3s1 as release
[INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.24.4+k3s1/sha256sum-amd64.txt
[INFO] Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.24.4+k3s1/k3s
[INFO] Verifying binary download
[INFO] Installing k3s to /usr/local/bin/k3s
[INFO] Skipping installation of SELinux RPM
[INFO] Creating /usr/local/bin/kubectl symlink to k3s
[INFO] Creating /usr/local/bin/crictl symlink to k3s
[INFO] Creating /usr/local/bin/ctr symlink to k3s
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s.service
[INFO] systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO] systemd: Starting k3s
Additionally, I have checked:
sudo kubectl get nodes
and this worked perfectly:
NAME STATUS ROLES AGE VERSION
ip-172-31-41-34 Ready control-plane,etcd,master 18m v1.24.4+k3s1
Now I SSH into VM-2 and make it join the server running on VM-1:
curl -sfL https://get.k3s.io | K3S_TOKEN=AtJMEyWR8pE3HR4RWgT6IsqglOkBm0sLC4n0aDBkng9VE1uqyNevR6oCMNCqQNaF sh -s - server --server https://65.0.54.158:6443
response
[INFO] Finding release for channel stable
[INFO] Using v1.24.4+k3s1 as release
[INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.24.4+k3s1/sha256sum-amd64.txt
[INFO] Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.24.4+k3s1/k3s
[INFO] Verifying binary download
[INFO] Installing k3s to /usr/local/bin/k3s
[INFO] Skipping installation of SELinux RPM
[INFO] Creating /usr/local/bin/kubectl symlink to k3s
[INFO] Creating /usr/local/bin/crictl symlink to k3s
[INFO] Creating /usr/local/bin/ctr symlink to k3s
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s.service
[INFO] systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO] systemd: Starting k3s
Job for k3s.service failed because the control process exited with error code.
See "systemctl status k3s.service" and "journalctl -xe" for details
Here are the contents of /var/log/syslog:
Sep 6 19:10:00 ip-172-31-46-114 systemd[1]: Starting Lightweight Kubernetes...
Sep 6 19:10:00 ip-172-31-46-114 sh[9516]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Sep 6 19:10:00 ip-172-31-46-114 sh[9517]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Sep 6 19:10:00 ip-172-31-46-114 k3s[9520]: time="2022-09-06T19:10:00Z" level=info msg="Acquiring lock file /var/lib/rancher/k3s/data/.lock"
Sep 6 19:10:00 ip-172-31-46-114 k3s[9520]: time="2022-09-06T19:10:00Z" level=info msg="Preparing data dir /var/lib/rancher/k3s/data/577968fa3d58539cc4265245941b7be688833e6bf5ad7869fa2afe02f15f1cd2"
Sep 6 19:10:02 ip-172-31-46-114 k3s[9520]: time="2022-09-06T19:10:02Z" level=info msg="Starting k3s v1.24.4+k3s1 (c3f830e9)"
Sep 6 19:10:22 ip-172-31-46-114 k3s[9520]: time="2022-09-06T19:10:22Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: Get \"https://65.0.54.158:6443/cacerts\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Sep 6 19:10:22 ip-172-31-46-114 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Sep 6 19:10:22 ip-172-31-46-114 systemd[1]: k3s.service: Failed with result 'exit-code'.
Sep 6 19:10:22 ip-172-31-46-114 systemd[1]: Failed to start Lightweight Kubernetes.
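The fatal line says VM-2 timed out fetching https://65.0.54.158:6443/cacerts, so a basic reachability check from VM-2 seems worth doing first (plain curl/nc, nothing k3s-specific; port 6443 also needs to be allowed between the VMs in the cloud firewall or security group):
# run these on VM-2
nc -zv -w 5 65.0.54.158 6443
curl -vk --max-time 5 https://65.0.54.158:6443/cacerts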
I have been stuck on this for two days. I would really appreciate some help. Thank you.
I have PostgreSQL running in a Docker container (Docker 17.09.0-ce-mac35 on OS X 10.11.6) and I'm inserting data from a Python application on the host. After a while I consistently get the following error in Python while there is still plenty of disk space available on the host:
psycopg2.OperationalError: could not extend file "base/16385/24599.49": wrote only 4096 of 8192 bytes at block 6543502
HINT: Check free disk space.
This is my docker-compose.yml:
version: "2"
services:
rabbitmq:
container_name: rabbitmq
build: ../messaging/
ports:
- "4369:4369"
- "5672:5672"
- "25672:25672"
- "15672:15672"
- "5671:5671"
database:
container_name: database
build: ../database/
ports:
- "5432:5432"
The database Dockerfile looks like this:
FROM ubuntu:17.04
RUN echo "deb http://apt.postgresql.org/pub/repos/apt/ zesty-pgdg main" > /etc/apt/sources.list.d/pgdg.list
RUN apt-get update && apt-get install -y --allow-unauthenticated python-software-properties software-properties-common postgresql-10 postgresql-client-10 postgresql-contrib-10
USER postgres
RUN /etc/init.d/postgresql start &&\
    psql --command "CREATE USER ****** WITH SUPERUSER PASSWORD '******';" &&\
    createdb -O ****** ******
RUN echo "host all all 0.0.0.0/0 md5" >> /etc/postgresql/10/main/pg_hba.conf
RUN echo "listen_addresses='*'" >> /etc/postgresql/10/main/postgresql.conf
EXPOSE 5432
VOLUME ["/etc/postgresql", "/var/log/postgresql", "/var/lib/postgresql"]
CMD ["/usr/lib/postgresql/10/bin/postgres", "-D", "/var/lib/postgresql/10/main", "-c", "config_file=/etc/postgresql/10/main/postgresql.conf"]
df -k output:
Filesystem 1024-blocks Used Available Capacity iused ifree %iused Mounted on
/dev/disk2 1088358016 414085004 674017012 39% 103585249 168504253 38% /
devfs 190 190 0 100% 658 0 100% /dev
map -hosts 0 0 0 100% 0 0 100% /net
map auto_home 0 0 0 100% 0 0 100% /home
Update 1:
It seems the container has now shut down. I'll start over and try df -k inside the container before it shuts down. This is what the container log showed before it went down:
2017-11-14 14:48:25.117 UTC [18] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2017-11-14 14:48:25.120 UTC [17] WARNING: terminating connection because of crash of another server process
2017-11-14 14:48:25.120 UTC [17] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-11-14 14:48:25.120 UTC [17] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2017-11-14 14:48:25.132 UTC [1] LOG: all server processes terminated; reinitializing
2017-11-14 14:48:25.175 UTC [1] FATAL: could not access status of transaction 0
2017-11-14 14:48:25.175 UTC [1] DETAIL: Could not write to file "pg_notify/0000" at offset 0: No space left on device.
2017-11-14 14:48:25.181 UTC [1] LOG: database system is shut down
Update 2:
This is df -k on the container, /dev/vda2 seems to be filling up quickly:
$ docker exec -it database df -k
Filesystem 1K-blocks Used Available Use% Mounted on
none 61890340 15022448 43700968 26% /
tmpfs 65536 0 65536 0% /dev
tmpfs 1023516 0 1023516 0% /sys/fs/cgroup
/dev/vda2 61890340 15022448 43700968 26% /etc/postgresql
shm 65536 8 65528 1% /dev/shm
tmpfs 1023516 0 1023516 0% /sys/firmware
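A related check that shows how much of that space Docker itself is consuming is docker system df (added in Docker 1.13, so it should be available on 17.09):
docker system df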
Update 3:
This seems to be related to the ~64 GB file size limit on Docker.qcow2. Solved using qemu and gparted as follows:
cd ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/
qemu-img info Docker.qcow2
qemu-img resize Docker.qcow2 +200G
qemu-img info Docker.qcow2
qemu-system-x86_64 -drive file=Docker.qcow2 -m 512 -cdrom ~/Downloads/gparted-live-0.30.0-1-i686.iso -boot d -device usb-mouse -usb
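Once the partition has been grown in GParted and Docker is running again, the extra space should show up with the same check as before:
docker exec -it database df -k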
I have successfully configured MPI with mpi4py support across three nodes, as verified by running the helloworld.py script from the mpi4py demo directory:
gms#host:~/development/mpi$ mpiexec -f machinefile -n 10 python ~/development/mpi4py/demo/helloworld.py
Hello, World! I am process 3 of 10 on host.
Hello, World! I am process 1 of 10 on worker1.
Hello, World! I am process 6 of 10 on host.
Hello, World! I am process 2 of 10 on worker2.
Hello, World! I am process 4 of 10 on worker1.
Hello, World! I am process 9 of 10 on host.
Hello, World! I am process 5 of 10 on worker2.
Hello, World! I am process 7 of 10 on worker1.
Hello, World! I am process 8 of 10 on worker2.
Hello, World! I am process 0 of 10 on host.
I am now trying to get this working in ipython and have added my machinefile to my $IPYTHON_DIR/profile_mpi/ipcluster_config.py file, as follows:
c.MPILauncher.mpi_args = ["-machinefile", "/home/gms/development/mpi/machinefile"]
I then start the IPython notebook on my head node using the command: ipython notebook --profile=mpi --ip=* --port=9999 --no-browser &
and, voila, I can access it just fine from another device on my local network. However, when I run helloworld.py from the IPython notebook, I only get a response from the head node: Hello, World! I am process 0 of 10 on host.
I started MPI from IPython with 10 engines, but...
I further configured these parameters, just in case
in $IPYTHON_DIR/profile_mpi/ipcluster_config.py
c.IPClusterEngines.engine_launcher_class = 'MPIEngineSetLauncher'
in $IPYTHON_DIR/profile_mpi/ipengine_config.py
c.MPI.use = 'mpi4py'
in $IPYTHON_DIR/profile_mpi/ipcontroller_config.py
c.HubFactory.ip = '*'
However, these did not help, either.
What am I missing to get this working correctly?
EDIT UPDATE 1
I now have NFS mounted directories on my worker nodes, and thus, am fulfilling the requirement "Currently ipcluster requires that the IPYTHONDIR/profile_/security directory live on a shared filesystem that is seen by both the controller and engines." to be able to use ipcluster to start my controller and engines, using the command ipcluster start --profile=mpi -n 6 &.
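For anyone following along, a quick check on each worker that the shared profile directory is really visible looks like this (paths as in my setup; just plain mount and ls, nothing IPython-specific):
# on each worker node
mount | grep /home/gms
ls -l /home/gms/.config/ipython/profile_mpi/security/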
So, I issue this on my head node, and then get:
2016-03-04 20:31:26.280 [IPClusterStart] Starting ipcluster with [daemon=False]
2016-03-04 20:31:26.283 [IPClusterStart] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:26.284 [IPClusterStart] Starting Controller with LocalControllerLauncher
2016-03-04 20:31:27.282 [IPClusterStart] Starting 6 Engines with MPIEngineSetLauncher
2016-03-04 20:31:57.301 [IPClusterStart] Engines appear to have started successfully
Then I proceed to issue the same command on the other nodes to start their engines, but I get:
2016-03-04 20:31:33.092 [IPClusterStart] Removing pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:33.095 [IPClusterStart] Starting ipcluster with [daemon=False]
2016-03-04 20:31:33.100 [IPClusterStart] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:33.111 [IPClusterStart] Starting Controller with LocalControllerLauncher
2016-03-04 20:31:34.098 [IPClusterStart] Starting 6 Engines with MPIEngineSetLauncher
[1]+ Stopped ipcluster start --profile=mpi -n 6
with no confirmation that the Engines appear to have started successfully ...
Even more confusingly, when I run ps au on the worker nodes, I get:
gms 3862 0.1 2.5 38684 23740 pts/0 T 20:31 0:01 /usr/bin/python /usr/bin/ipcluster start --profile=mpi -n 6
gms 3874 0.1 1.7 21428 16772 pts/0 T 20:31 0:01 /usr/bin/python -c from IPython.parallel.apps.ipcontrollerapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.co
gms 3875 0.0 0.2 4768 2288 pts/0 T 20:31 0:00 mpiexec -n 6 -machinefile /home/gms/development/mpi/machinefile /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new
gms 3876 0.0 0.4 5732 4132 pts/0 T 20:31 0:00 /usr/bin/ssh -x 192.168.1.1 "/usr/bin/hydra_pmi_proxy" --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0 -
gms 3877 0.0 0.1 4816 1204 pts/0 T 20:31 0:00 /usr/bin/hydra_pmi_proxy --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --proxy-id 1
gms 3878 0.0 0.4 5732 4028 pts/0 T 20:31 0:00 /usr/bin/ssh -x 192.168.1.201 "/usr/bin/hydra_pmi_proxy" --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0
gms 3879 0.0 0.6 8944 6008 pts/0 T 20:31 0:00 /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.config
gms 3880 0.0 0.6 8944 6108 pts/0 T 20:31 0:00 /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.config
The IP addresses in processes 3876 and 3878 belong to the other hosts in the cluster. But...
When I run a similar test directly using ipython, all I get is a response from localhost (even though, without ipython, this works directly with mpi and mpi4py, as noted in my original post):
gms#head:~/development/mpi$ ipython test.py
head[3834]: 0/1
gms#head:~/development/mpi$ mpiexec -f machinefile -n 10 ipython test.py
worker1[3961]: 4/10
worker1[3962]: 7/10
head[3946]: 6/10
head[3944]: 0/10
worker2[4054]: 5/10
worker2[4055]: 8/10
head[3947]: 9/10
worker1[3960]: 1/10
worker2[4053]: 2/10
head[3945]: 3/10
I still seem to be missing something obvious, although I am convinced my configuration is now correct. One thing that pops out is that when I start ipcluster on my worker nodes, I get this: 2016-03-04 20:31:33.092 [IPClusterStart] Removing pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
EDIT UPDATE 2
This is more to document what is happening and, hopefully, ultimately what gets this working:
I cleaned out my log files and reissued ipcluster start --profile=mpi -n 6 &
And now I see 6 log files for my engines and 1 for my controller:
drwxr-xr-x 2 gms gms 12288 Mar 6 03:28 .
drwxr-xr-x 7 gms gms 4096 Mar 6 03:31 ..
-rw-r--r-- 1 gms gms 1313 Mar 6 03:28 ipcontroller-15664.log
-rw-r--r-- 1 gms gms 598 Mar 6 03:28 ipengine-15669.log
-rw-r--r-- 1 gms gms 598 Mar 6 03:28 ipengine-15670.log
-rw-r--r-- 1 gms gms 499 Mar 6 03:28 ipengine-4405.log
-rw-r--r-- 1 gms gms 499 Mar 6 03:28 ipengine-4406.log
-rw-r--r-- 1 gms gms 499 Mar 6 03:28 ipengine-4628.log
-rw-r--r-- 1 gms gms 499 Mar 6 03:28 ipengine-4629.log
Looking at the ipcontroller log, it looks like only one engine registered:
2016-03-06 03:28:12.469 [IPControllerApp] Hub listening on tcp://*:34540 for registration.
2016-03-06 03:28:12.480 [IPControllerApp] Hub using DB backend: 'NoDB'
2016-03-06 03:28:12.749 [IPControllerApp] hub::created hub
2016-03-06 03:28:12.751 [IPControllerApp] writing connection info to /home/gms/.config/ipython/profile_mpi/security/ipcontroller-client.json
2016-03-06 03:28:12.754 [IPControllerApp] writing connection info to /home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json
2016-03-06 03:28:12.758 [IPControllerApp] task::using Python leastload Task scheduler
2016-03-06 03:28:12.760 [IPControllerApp] Heartmonitor started
2016-03-06 03:28:12.808 [IPControllerApp] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcontroller.pid
2016-03-06 03:28:14.792 [IPControllerApp] client::client 'a8441250-d3d7-4a0b-8210-dae327665450' requested 'registration_request'
2016-03-06 03:28:14.800 [IPControllerApp] client::client '12fd0bcc-24e9-4ad0-8154-fcf1c7a0e295' requested 'registration_request'
2016-03-06 03:28:18.764 [IPControllerApp] registration::finished registering engine 1:'12fd0bcc-24e9-4ad0-8154-fcf1c7a0e295'
2016-03-06 03:28:18.768 [IPControllerApp] engine::Engine Connected: 1
2016-03-06 03:28:20.800 [IPControllerApp] registration::purging stalled registration: 0
Shouldn't each of the 6 engines be registered?
Two of the engines' logs look like they registered fine:
2016-03-06 03:28:13.746 [IPEngineApp] Initializing MPI:
2016-03-06 03:28:13.746 [IPEngineApp] from mpi4py import MPI as mpi
mpi.size = mpi.COMM_WORLD.Get_size()
mpi.rank = mpi.COMM_WORLD.Get_rank()
2016-03-06 03:28:14.735 [IPEngineApp] Loading url_file u'/home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json'
2016-03-06 03:28:14.780 [IPEngineApp] Registering with controller at tcp://127.0.0.1:34540
2016-03-06 03:28:15.282 [IPEngineApp] Using existing profile dir:
u'/home/gms/.config/ipython/profile_mpi'
2016-03-06 03:28:15.286 [IPEngineApp] Completed registration with id 1
while the other registered with id 0.
But the other four engines gave a timeout error:
2016-03-06 03:28:14.676 [IPEngineApp] Initializing MPI:
2016-03-06 03:28:14.689 [IPEngineApp] from mpi4py import MPI as mpi
mpi.size = mpi.COMM_WORLD.Get_size()
mpi.rank = mpi.COMM_WORLD.Get_rank()
2016-03-06 03:28:14.733 [IPEngineApp] Loading url_file u'/home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json'
2016-03-06 03:28:14.805 [IPEngineApp] Registering with controller at tcp://127.0.0.1:34540
2016-03-06 03:28:16.807 [IPEngineApp] Registration timed out after 2.0 seconds
Hmmm... I think I may try a reinstall of ipython tomorrow.
EDIT UPDATE 3
Conflicting versions of ipython were installed (it looks like one via apt-get and one via pip). Uninstalling and reinstalling using pip install ipython[all]...
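For the record, duplicate installs like this can usually be spotted with something along these lines (generic commands; the dpkg line covers the apt-get copy, pip show the pip copy):
which -a ipython
ipython --version
dpkg -l | grep -i ipython
pip show ipython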
EDIT UPDATE 4
I hope someone is finding this useful AND I hope someone can weigh in at some point to help clarify a few things.
Anywho, I set up a virtualenv to isolate my environment, and it looks like some degree of success, I think. I fired up ipcluster start -n 4 --profile=mpi on each of my nodes, then ssh'ed back into my head node and ran a test script, which first calls ipcluster. The output shows it is doing some parallel computing.
However, when I run my test script that queries all the nodes, I just get the head node back.
But, again, if I just run the straight-up mpiexec command, everything is hunky dory.
To add to the confusion, when I look at the processes on the nodes, I see all sorts of behavior indicating they are working together.
And there is nothing out of the ordinary in my logs. Why am I not getting the other nodes back from my second test script (code included here)?
# test_mpi.py
import os
import socket
from mpi4py import MPI

comm = MPI.COMM_WORLD  # separate name so the MPI module is not shadowed
print("{host}[{pid}]: {rank}/{size}".format(
    host=socket.gethostname(),
    pid=os.getpid(),
    rank=comm.rank,
    size=comm.size,
))
Not sure why, but I recreated my ipcluster_config.py file and again added c.MPILauncher.mpi_args = ["-machinefile", "path_to_file/machinefile"] to it and this time it worked - for some bizarre reason. I could swear I had this in it before, but alas...
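To double-check the fix, counting the registrations in the controller log is a quick test (the grep string comes from the log excerpt above; the log path assumes the default profile layout):
ipcluster stop --profile=mpi
ipcluster start --profile=mpi -n 6 &
sleep 60   # give the engines time to register
grep "registration::finished" ~/.config/ipython/profile_mpi/log/ipcontroller-*.log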
Distributor ID: Ubuntu
Description: Ubuntu 12.04.4 LTS
Release: 12.04
Codename: precise
gunicorn (version 19.1.1)
nginx version: nginx/1.1.19
My gunicorn conf:
bind = ["unix:///tmp/someproj1.sock", "unix:///tmp/someproj2.sock"]
pythonpath = "/home/deploy/someproj/someproj"
workers = 5
worker_class = "eventlet"
worker_connections = 25
timeout = 3600
graceful_timeout = 3600
We started getting 502s at around 2PM yesterday in our dev env. This was in the Nginx error log:
connect() to unix:///tmp/someproj1.sock failed (2: No such file or directory) while connecting to upstream
Both gunicorn sockets were missing from /tmp.
At 11:55AM today I ran ps -eo pid,cmd,etime|grep gunicorn to get the uptime:
4156 gunicorn: master [myproj. 22:53:54
4161 gunicorn: worker [myproj. 22:53:54
4162 gunicorn: worker [myproj. 22:53:54
4163 gunicorn: worker [myproj. 22:53:54
4164 gunicorn: worker [myproj. 22:53:54
4165 gunicorn: worker [myproj. 22:53:53
5207 grep --color=auto gunicorn 00:00
So gunicorn and all its workers have been running uninterrupted since ~1:01PM yesterday. The Nginx access log confirms that requests were being served successfully for about an hour after gunicorn was started. Then, for some reason, both gunicorn sockets disappeared, and gunicorn continued running without writing any error logs.
Any ideas on what could cause that? Or how to fix it?
It turns out this was indeed a bug: eventlet workers would delete the socket when they themselves restarted.
The fix has already been merged into the master branch, but unfortunately it has not been released yet (version 19.3 still has the problem).
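Until a fixed release is out, one possible stopgap (not from the gunicorn project, just a workaround sketch; the restart command is a placeholder for however gunicorn is managed on your box) is a cron watchdog that restarts gunicorn whenever either socket file goes missing:
# crontab entry, runs every minute; "service gunicorn restart" is a placeholder
* * * * * { [ -S /tmp/someproj1.sock ] && [ -S /tmp/someproj2.sock ]; } || service gunicorn restart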