Filebeat failed connect to backoff? - docker-compose

I have two servers, A and B. When I run Filebeat via docker-compose on server A it works well, but on server B I get the following error:
pipeline/output.go:154 Failed to connect to backoff(async(tcp://logstash_ip:5044)): dial tcp logstash_ip:5044: connect: no route to host
So I think I'm missing some configuration on server B. How can I track down the problem and fix it?
[Edited] Added filebeat.yml and docker-compose
Notice: Filebeat failed when I ran it on server A, and it still works when I tested it on server B, so I guess I have a problem with the server configuration.
filebeat.yml
logging.level: error
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /usr/share/filebeat/mylog/**/*.log
    processors:
      - decode_json_fields:
          fields: ['message']
          target: 'json'
output.logstash:
  hosts: ['logstash_ip:5044']
console.pretty: true
processors:
  - add_docker_metadata:
      host: 'unix:///host_docker/docker.sock'
docker-compose
version: '3.3'
services:
  filebeat:
    user: root
    container_name: filebeat
    image: docker.elastic.co/beats/filebeat:7.9.3
    volumes:
      - /var/run/docker.sock:/host_docker/docker.sock
      - /var/lib/docker:/host_docker/var/lib/docker
      - ./logs/progress:/usr/share/filebeat/mylog
      - ./filebeat.yml:/usr/share/filebeat/filebeat.yml:z
    command: ['--strict.perms=false']
    ulimits:
      memlock:
        soft: -1
        hard: -1
    stdin_open: true
    tty: true
    network_mode: bridge
    deploy:
      mode: global
    logging:
      driver: 'json-file'
      options:
        max-size: '10m'
        max-file: '50'
Thanks in advance

Assumptions:
- The docker-compose file shown is for the Filebeat "concentrator" server, which runs in Docker on server B.
- Both servers are in the same network space and/or can reach each other.
- Server B, as the Filebeat server, has the correct firewall settings to accept connections on port 5044 (check with telnet from server A after starting the container).
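A quick way to test that last assumption from server A (a diagnostic sketch; 127.0.0.1 is a stand-in for the real logstash_ip from the question):

```shell
# Stand-ins for the real address and port from the question
TARGET_HOST=127.0.0.1
TARGET_PORT=5044

# Bash's built-in /dev/tcp gives a dependency-free TCP probe (telnet/nc work too)
if timeout 3 bash -c "</dev/tcp/${TARGET_HOST}/${TARGET_PORT}" 2>/dev/null; then
  echo "port ${TARGET_PORT} reachable"
else
  echo "port ${TARGET_PORT} unreachable"
fi
```

"no route to host" (as opposed to "connection refused" or a silent timeout) usually means a firewall on the target is actively rejecting the packets, so if the probe fails while the listener is up, check firewalld/iptables rules on the target host rather than the Filebeat config.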
docker-compose (assuming server B)
version: '3.3'
services:
  filebeat:
    user: root
    container_name: filebeat
    ports:
      - 5044:5044   # <- see open port
    image: docker.elastic.co/beats/filebeat:7.9.3
    volumes:
      - /var/run/docker.sock:/host_docker/docker.sock
      - /var/lib/docker:/host_docker/var/lib/docker
      - ./logs/progress:/usr/share/filebeat/mylog
      - ./filebeat.yml:/usr/share/filebeat/filebeat.yml:z
    command: ['--strict.perms=false']
    ulimits:
      memlock:
        soft: -1
        hard: -1
    stdin_open: true
    tty: true
    network_mode: bridge
    deploy:
      mode: global
    logging:
      driver: 'json-file'
      options:
        max-size: '10m'
        max-file: '50'
filebeat.yml (assuming both servers)
logging.level: error
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /usr/share/filebeat/mylog/**/*.log
    processors:
      - decode_json_fields:
          fields: ['message']
          target: 'json'
output.logstash:
  hosts: ['<SERVER-B-IP>:5044']   ## <- see server IP
console.pretty: true
processors:
  - add_docker_metadata:
      host: 'unix:///host_docker/docker.sock'

Related

What settings should I add to send external data to Graylog?

When I run the Graylog web service locally, I can reach it at 127.0.0.1:9000. I installed Docker on a server and wrote the server's IP address into these variables:
GRAYLOG_HTTP_EXTERNAL_URI
GRAYLOG_HTTP_BIND_ADDRESS
GRAYLOG_HTTP_PUBLISH_URI
but Graylog still only works locally on the server. What do I need to do so that data can be sent to Graylog from outside?
version: '3'
services:
  # MongoDB: https://hub.docker.com/_/mongo/
  mongo:
    image: mongo:5.0.13
    networks:
      - graylog
    volumes:
      - /var/lib/docker/volumes/mongo:/data/db
  # Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/7.10/docker.html
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2
    environment:
      - http.host=0.0.0.0
      - transport.host=localhost
      - network.host=0.0.0.0
      - "ES_JAVA_OPTS=-Dlog4j2.formatMsgNoLookups=true -Xms1024m -Xmx1024m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    deploy:
      resources:
        limits:
          memory: 1.5G
    volumes:
      - /var/lib/docker/volumes/elk:/usr/share/elasticsearch/data
    networks:
      - graylog
  # Graylog: https://hub.docker.com/r/graylog/graylog/
  graylog:
    image: graylog/graylog:5.0
    environment:
      - TZ=Europe/Istanbul
      # CHANGE ME (must be at least 16 characters)!
      - GRAYLOG_PASSWORD_SECRET=908bd4dee1
      - Password=Y71
      - GRAYLOG_ROOT_PASSWORD_SHA2=8c6976e5b541
      - GRAYLOG_HTTP_EXTERNAL_URI=http://10.90.104.143:9000/ # example ip
      - GRAYLOG_HTTP_BIND_ADDRESS=10.90.104.143:9000 # example ip
      - GRAYLOG_HTTP_PUBLISH_URI=http://10.90.104.143:9000/ # example ip
      #- GRAYLOG_TRANSPORT_EMAIL_ENABLED: "true"
      #- GRAYLOG_TRANSPORT_EMAIL_HOSTNAME: smtp
      #- GRAYLOG_TRANSPORT_EMAIL_PORT: 25
      #- GRAYLOG_TRANSPORT_EMAIL_USE_AUTH: "false"
      #- GRAYLOG_TRANSPORT_EMAIL_USE_TLS: "false"
      #- GRAYLOG_TRANSPORT_EMAIL_USE_SSL: "false"
    volumes:
      - graylog_data:/usr/share/graylog/data
      - graylog_journal:/usr/share/graylog/journal
    networks:
      - graylog
    restart: always
    depends_on:
      - mongo
      - elasticsearch
    ports:
      # Graylog web interface and REST API
      - 9000:9000
      # Syslog TCP
      - 1514:1514
      # Syslog UDP
      - 1514:1514/udp
      # GELF TCP
      - 12201:12201
      # GELF UDP
      - 12201:12201/udp
networks:
  graylog:
    driver: bridge
volumes:
  graylog_data:
  graylog_journal:
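Not part of the original post, but a common pattern when Graylog runs inside Docker: bind to 0.0.0.0 inside the container (the container has its own network namespace, so the host's IP address is not assigned to any interface there) and keep the external URI pointing at the address browsers use. A sketch under that assumption:

```yaml
graylog:
  environment:
    # Inside the container the host IP 10.90.104.143 does not exist,
    # so listen on all container interfaces instead
    - GRAYLOG_HTTP_BIND_ADDRESS=0.0.0.0:9000
    # Address that clients outside the server should use
    - GRAYLOG_HTTP_EXTERNAL_URI=http://10.90.104.143:9000/
```

With the existing 9000:9000 port mapping, this makes the web interface reachable from other machines, assuming the host firewall allows port 9000.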

Error 504 Gateway Timeout when trying to access a homeserver service through an SSH tunnel and traefik

Situation: I run Home Assistant on an Ubuntu server on my home LAN network. Because my home network is behind a double NAT, I have set up an SSH tunnel to tunnel the Home Assistant web interface to a VPS server running Ubuntu as well.
When I run the following on the VPS, I notice that the SSH tunnel works as expected:
$ curl localhost:8045 | grep -iPo '(?<=<title>)(.*)(?=</title>)'
Home Assistant
On the VPS, I run a bunch of web services via docker-compose and traefik. The other services (caddy, portainer) run without problems.
When I try to serve the Home Assistant service through traefik and access https://ha.mydomain.com through a web browser, I get an Error 504 Gateway Timeout.
Below are my configuration files. What am I doing wrong?
docker-compose yaml file:
version: "3.7"
services:
  traefik:
    container_name: traefik
    image: traefik:latest
    networks:
      - proxy
    extra_hosts:
      - host.docker.internal:host-gateway
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - ${HOME}/docker/data/traefik/traefik.yml:/traefik.yml:ro
      - ${HOME}/docker/data/traefik/credentials.txt:/credentials.txt:ro
      - ${HOME}/docker/data/traefik/config:/config
      - ${HOME}/docker/data/traefik/letsencrypt/acme.json:/acme.json
      - /var/run/docker.sock:/var/run/docker.sock:ro
    restart: unless-stopped
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=proxy"
      - "traefik.http.routers.dashboard.rule=Host(`traefik.mydomain.com`) && (PathPrefix(`/api`) || PathPrefix(`/dashboard`))"
      - "traefik.http.routers.dashboard.tls=true"
      - "traefik.http.routers.dashboard.tls.certresolver=letsencrypt"
      - "traefik.http.routers.dashboard.tls.domains[0].main=traefik.mydomain.com"
      - "traefik.http.routers.dashboard.tls.domains[0].sans=traefik.mydomain.com"
      - "traefik.http.routers.dashboard.service=api@internal"
      - "traefik.http.routers.dashboard.middlewares=auth"
      - "traefik.http.middlewares.auth.basicauth.usersfile=/credentials.txt"
  caddy:
    image: caddy:latest
    container_name: caddy
    restart: unless-stopped
    networks:
      - proxy
    volumes:
      - ${HOME}/docker/data/caddy/Caddyfile:/etc/caddy/Caddyfile
      - ${HOME}/docker/data/caddy/site:/srv
      - ${HOME}/docker/data/caddy/data:/data
      - ${HOME}/docker/data/caddy/config:/config
    labels:
      - "traefik.http.routers.caddy-secure.rule=Host(`vps.mydomain.com`)"
      - "traefik.http.routers.caddy-secure.service=caddy"
      - "traefik.http.services.caddy.loadbalancer.server.port=80"
  portainer:
    image: portainer/portainer-ce
    container_name: portainer
    networks:
      - proxy
    command: -H unix:///var/run/docker.sock --http-enabled
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ${HOME}/docker/data/portainer:/data
    labels:
      - "traefik.http.routers.portainer-secure.rule=Host(`portainer.mydomain.com`)"
      - "traefik.http.routers.portainer-secure.service=portainer"
      - "traefik.http.services.portainer.loadbalancer.server.port=9000"
    restart: unless-stopped
networks:
  # proxy is the network used for the traefik reverse proxy
  proxy:
    external: true
traefik static configuration file:
api:
  dashboard: true
  insecure: false
  debug: true
entryPoints:
  web:
    address: :80
    http:
      redirections:
        entryPoint:
          to: web_secure
  web_secure:
    address: :443
    http:
      middlewares:
        - secureHeaders@file
      tls:
        certResolver: letsencrypt
providers:
  docker:
    network: proxy
    endpoint: "unix:///var/run/docker.sock"
  file:
    filename: /config/dynamic.yml
    watch: true
certificatesResolvers:
  letsencrypt:
    acme:
      email: myname@mydomain.com
      storage: acme.json
      keyType: EC384
      httpChallenge:
        entryPoint: web
traefik dynamic configuration file:
# dynamic.yml
http:
  middlewares:
    secureHeaders:
      headers:
        sslRedirect: true
        forceSTSHeader: true
        stsIncludeSubdomains: true
        stsPreload: true
        stsSeconds: 31536000
    user-auth:
      basicAuth:
        users:
          - "username:hashedpassword"
  routers:
    home-assistant-secure:
      rule: "Host(`ha.mydomain.com`)"
      service: home-assistant
  services:
    home-assistant:
      loadBalancer:
        passHostHeader: true
        servers:
          - url: http://host.docker.internal:8045
tls:
  options:
    default:
      cipherSuites:
        - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
        - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
      minVersion: VersionTLS12
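Two things worth checking that are not visible in the post (both are assumptions, not confirmed causes): the home-assistant-secure router declares no entryPoints or tls section, unlike the label-based routers; and an SSH tunnel opened with the default bind address listens only on the VPS's 127.0.0.1, which a containerized Traefik can reach through host.docker.internal only if the tunnel is bound to an address routable from the Docker bridge. A sketch of the router with an explicit entry point and certificate resolver, reusing the names from the static configuration:

```yaml
routers:
  home-assistant-secure:
    rule: "Host(`ha.mydomain.com`)"
    entryPoints:
      - web_secure
    tls:
      certResolver: letsencrypt
    service: home-assistant
```

For the tunnel, binding the forwarded port to all interfaces (e.g. ssh -L 0.0.0.0:8045:localhost:8123, where 8123 stands in for whatever port Home Assistant actually uses at home) lets the Traefik container reach it through the bridge gateway.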

How to configure fluent-bit, Fluentd, Loki and Grafana using docker-compose?

I am trying to run Fluent Bit in Docker and view logs in Grafana using Loki, but I can't see any labels in Grafana, even though the Loki data source reports that it works and found labels.
I need to figure out how to get Docker logs flowing from fluent-bit -> Loki -> Grafana. Any logs at all.
Here is my docker-compose.yaml
version: "3.3"
networks:
  loki:
    external: true
services:
  fluent-bit:
    image: grafana/fluent-bit-plugin-loki:latest
    container_name: fluent-bit
    environment:
      LOKI_URL: http://loki:3100/loki/api/v1/push
    networks:
      - loki
    volumes:
      - ./fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf
    logging:
      options:
        tag: infra.monitoring
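Note that the logging: block above only tags the fluent-bit container's own logs; other containers won't reach the forward input unless they log via Docker's fluentd logging driver. A minimal sketch (the web service and its image are illustrative, not from the post), which also assumes the fluent-bit service publishes port 24224 so localhost resolves to it:

```yaml
web:
  image: nginx:alpine
  logging:
    driver: fluentd
    options:
      # Host-published address of the forward input (port 24224 from the config below)
      fluentd-address: localhost:24224
      tag: infra.monitoring
```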
Here is my config file.
[INPUT]
    Name        forward
    Listen      0.0.0.0
    Port        24224

[OUTPUT]
    Name        loki
    Match       *
    Url         ${LOKI_URL}
    RemoveKeys  source
    Labels      {job="fluent-bit"}
    LabelKeys   container_name
    BatchWait   1
    BatchSize   1001024
    LineFormat  json
    LogLevel    info
Here are my Grafana and Loki setups
grafana:
  image: grafana/grafana
  depends_on:
    - prometheus
  container_name: grafana
  volumes:
    - grafana_data:/var/lib/grafana:rw
    - ./grafana/provisioning:/etc/grafana/provisioning
  environment:
    - GF_SECURITY_ADMIN_USER=admin
    - GF_SECURITY_ADMIN_PASSWORD=admin
    - GF_USERS_ALLOW_SIGN_UP=false
    - GF_INSTALL_PLUGINS=grafana-piechart-panel
    - GF_RENDERING_SERVER_URL=http://renderer:8081/render
    - GF_RENDERING_CALLBACK_URL=http://grafana:3000/
    - GF_LOG_FILTERS=rendering:debug
  restart: unless-stopped
  networks:
    - traefik
    - loki
  labels:
    - "traefik.enable=true"
    - "traefik.http.routers.grafana.rule=Host(`grafana-int.mydomain.com`)"
    - "traefik.http.services.grafana.loadbalancer.server.port=3000"
    - "traefik.docker.network=traefik"
loki:
  image: grafana/loki:latest
  container_name: loki
  expose:
    - "3100"
  networks:
    - loki
renderer:
  image: grafana/grafana-image-renderer:2.0.0
  container_name: grafana-image-renderer
  expose:
    - "8081"
  environment:
    ENABLE_METRICS: "true"
  networks:
    - loki
I have tried using the following config as described in the docs linked in a comment below but still no labels.
[SERVICE]
    Flush        1
    Log_Level    info
    Parsers_File parsers.conf

[INPUT]
    Name              syslog
    Path              /tmp/in_syslog
    Buffer_Chunk_Size 32000
    Buffer_Max_Size   64000

[OUTPUT]
    Name        loki
    Match       *
    Url         ${LOKI_URL}
    RemoveKeys  source
    Labels      {job="fluent-bit"}
    LabelKeys   container_name
    BatchWait   1
    BatchSize   1001024
    LineFormat  json
    LogLevel    info
I tried this config but still no labels.
[INPUT]
    @type          tail
    format         json
    read_from_head true
    path           /var/log/syslog
    pos_file       /tmp/container-logs.pos

[OUTPUT]
    Name        loki
    Match       *
    Url         ${LOKI_URL}
    RemoveKeys  source
    LabelKeys   container_name
    BatchWait   1
    BatchSize   1001024
    LineFormat  json
    LogLevel    info
After playing around with this for a while, I figured the best approach was to collect the logs with fluent-bit, forward them to Fluentd, output from there to Loki, and read the results in Grafana.
Here is a config that works locally.
docker-compose.yaml for Fluentd and Loki.
version: "3.8"
networks:
  appnet:
    external: true
volumes:
  host_logs:
services:
  fluentd:
    image: grafana/fluent-plugin-loki:master
    command:
      - "fluentd"
      - "-v"
      - "-p"
      - "/fluentd/plugins"
    environment:
      LOKI_URL: http://loki:3100
      LOKI_USERNAME:
      LOKI_PASSWORD:
    container_name: "fluentd"
    restart: always
    ports:
      - '24224:24224'
    networks:
      - appnet
    volumes:
      - host_logs:/var/log
      # Needed for journald log ingestion:
      - /etc/machine-id:/etc/machine-id
      - /dev/log:/dev/log
      - /var/run/systemd/journal/:/var/run/systemd/journal/
      - type: bind
        source: ./config/fluent.conf
        target: /fluentd/etc/fluent.conf
      - type: bind
        source: /var/lib/docker/containers
        target: /fluentd/log/containers
    logging:
      options:
        tag: docker.monitoring
  loki:
    image: grafana/loki:master
    container_name: "loki"
    restart: always
    networks:
      - appnet
    ports:
      - 3100
    volumes:
      - type: bind
        source: ./config/loki.conf
        target: /loki/etc/loki.conf
    depends_on:
      - fluentd
fluent.conf
<source>
  @type forward
  bind 0.0.0.0
  port 24224
</source>

<match **>
  @type loki
  url "http://loki:3100"
  flush_interval 1s
  flush_at_shutdown true
  buffer_chunk_limit 1m
  extra_labels {"job":"localhost_logs", "host":"localhost", "agent":"fluentd"}
  <label>
    fluentd_worker
  </label>
</match>
loki.conf
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2020-10-16
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h

storage_config:
  boltdb:
    directory: /tmp/loki/index
  filesystem:
    directory: /tmp/loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
docker-compose.yaml for fluent-bit
version: "3.8"
networks:
  appnet:
    external: true
services:
  fluent-bit:
    image: fluent/fluent-bit:latest
    container_name: "fluent-bit"
    restart: always
    ports:
      - '2020:2020'
    networks:
      - appnet
    volumes:
      - type: bind
        source: ./config/fluent-bit.conf
        target: /fluent-bit/etc/fluent-bit.conf
        read_only: true
      - type: bind
        source: ./config/parsers.conf
        target: /fluent-bit/etc/parsers.conf
        read_only: true
      - type: bind
        source: /var/log/
        target: /var/log/
      - type: bind
        source: /var/lib/docker/containers
        target: /fluent-bit/log/containers
fluent-bit.conf
[SERVICE]
    Flush        2
    Log_Level    info
    Parsers_File parsers.conf

[INPUT]
    Name   tail
    Path   /fluent-bit/log/containers/*/*-json.log
    Tag    docker.logs
    Parser docker

[OUTPUT]
    Name   forward
    Match  *
    Host   fluentd
parsers.conf
[PARSER]
    Name        docker
    Format      json
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L
    Time_Keep   On

Docker Containers up and running but cannot load the localhost

I was able to run the Wiki.js and MongoDB containers using docker-compose. Both are running without any errors.
This is my docker-compose file.
version: '3'
services:
  wikidb:
    image: mongo:3
    expose:
      - '27017'
    command: '--smallfiles --bind_ip ::,0.0.0.0'
    environment:
      - 'MONGO_LOG_DIR=/dev/null'
    volumes:
      - $HOME/mongo/db:/data/db
  wikijs:
    image: 'requarks/wiki:1.0'
    links:
      - wikidb
    depends_on:
      - wikidb
    ports:
      - '8000:3000'
    environment:
      WIKI_ADMIN_EMAIL: myemail@gmail.com
    volumes:
      - $HOME/wiki/config.yml:/var/wiki/config.yml
This is the config.yml file. I didn't set up git for this project.
title: Wiki
host: http://localhost
port: 8000
paths:
  repo: ./repo
  data: ./data
uploads:
  maxImageFileSize: 3
  maxOtherFileSize: 100
db: mongodb://wikidb:27017/wiki
git:
  url: https://github.com/Organization/Repo
  branch: master
  auth:
    type: ssh
    username: marty
    password: MartyMcFly88
    privateKey: /etc/wiki/keys/git.pem
    sslVerify: true
serverEmail: marty@example.com
showUserEmail: true
But I cannot load localhost on port 8000. Is there a specific reason for this?
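One mismatch worth checking (an assumption based only on the files shown): config.yml sets port: 8000, which is the port Wiki.js listens on inside the container, while the compose file maps host port 8000 to container port 3000. With the '8000:3000' mapping, the in-container port should stay 3000, e.g.:

```yaml
# config.yml (inside the container): listen on the port the compose mapping targets
port: 3000
```

The site would then be reached as http://localhost:8000 from the host, since Docker forwards host port 8000 to container port 3000.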

Postgres not starting on swarm server reboot

I'm trying to run an app using docker swarm. The app is designed to run entirely locally on a single computer using docker swarm.
If I SSH into the server and run docker stack deploy, everything works, as seen here in the docker service ls output:
When this deployment works, the services generally go live in this order:
1. Registry (a private registry)
2. Main (an Nginx service) and Postgres
3. All other services in random order (all Node apps)
The problem I am having is on reboot. When I reboot the server, I pretty consistently have the issue of the services failing with this result:
I am getting some errors that could be helpful.
In Postgres: docker service logs APP_NAME_postgres -f:
In Docker logs: sudo journalctl -fu docker.service
Update: June 5th, 2019
Also, by request from a GitHub issue, here is the docker version output:
Client:
 Version:           18.09.5
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        e8ff056
 Built:             Thu Apr 11 04:43:57 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.5
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       e8ff056
  Built:            Thu Apr 11 04:10:53 2019
  OS/Arch:          linux/amd64
  Experimental:     false
And docker info output:
Containers: 28
 Running: 9
 Paused: 0
 Stopped: 19
Images: 14
Server Version: 18.09.5
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
 NodeID: pbouae9n1qnezcq2y09m7yn43
 Is Manager: true
 ClusterID: nq9095ldyeq5ydbsqvwpgdw1z
 Managers: 1
 Nodes: 1
 Default Address Pool: 10.0.0.0/8
 SubnetSize: 24
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 1
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 192.168.0.47
 Manager Addresses:
  192.168.0.47:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.15.0-50-generic
Operating System: Ubuntu 18.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 3.68GiB
Name: oeemaster
ID: 76LH:BH65:CFLT:FJOZ:NCZT:VJBM:2T57:UMAL:3PVC:OOXO:EBSZ:OIVH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
WARNING: No swap limit support
And finally, my docker swarm stack/compose file:
version: '3.2'
secrets:
  jwt-secret:
    external: true
  pg-db:
    external: true
  pg-host:
    external: true
  pg-pass:
    external: true
  pg-user:
    external: true
  ssl_dhparam:
    external: true
services:
  accounts:
    depends_on:
      - postgres
      - registry
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      JWT_SECRET_FILE: /run/secrets/jwt-secret
      PG_DB_FILE: /run/secrets/pg-db
      PG_HOST_FILE: /run/secrets/pg-host
      PG_PASS_FILE: /run/secrets/pg-pass
      PG_USER_FILE: /run/secrets/pg-user
    image: 127.0.0.1:5000/local-oee-master-accounts:v0.8.0
    secrets:
      - source: jwt-secret
      - source: pg-db
      - source: pg-host
      - source: pg-pass
      - source: pg-user
  graphs:
    depends_on:
      - postgres
      - registry
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      PG_DB_FILE: /run/secrets/pg-db
      PG_HOST_FILE: /run/secrets/pg-host
      PG_PASS_FILE: /run/secrets/pg-pass
      PG_USER_FILE: /run/secrets/pg-user
    image: 127.0.0.1:5000/local-oee-master-graphs:v0.8.0
    secrets:
      - source: pg-db
      - source: pg-host
      - source: pg-pass
      - source: pg-user
  health:
    depends_on:
      - postgres
      - registry
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      PG_DB_FILE: /run/secrets/pg-db
      PG_HOST_FILE: /run/secrets/pg-host
      PG_PASS_FILE: /run/secrets/pg-pass
      PG_USER_FILE: /run/secrets/pg-user
    image: 127.0.0.1:5000/local-oee-master-health:v0.8.0
    secrets:
      - source: pg-db
      - source: pg-host
      - source: pg-pass
      - source: pg-user
  live-data:
    depends_on:
      - postgres
      - registry
    deploy:
      restart_policy:
        condition: on-failure
    image: 127.0.0.1:5000/local-oee-master-live-data:v0.8.0
    ports:
      - published: 32000
        target: 80
  main:
    depends_on:
      - accounts
      - graphs
      - health
      - live-data
      - point-logs
      - registry
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      MAIN_CONFIG_FILE: nginx.local.conf
    image: 127.0.0.1:5000/local-oee-master-nginx:v0.8.0
    ports:
      - published: 80
        target: 80
      - published: 443
        target: 443
  modbus-logger:
    depends_on:
      - point-logs
      - registry
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      CONTROLLER_ADDRESS: 192.168.2.100
      SERVER_ADDRESS: http://point-logs
    image: 127.0.0.1:5000/local-oee-master-modbus-logger:v0.8.0
  point-logs:
    depends_on:
      - postgres
      - registry
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      ENV_TYPE: local
      PG_DB_FILE: /run/secrets/pg-db
      PG_HOST_FILE: /run/secrets/pg-host
      PG_PASS_FILE: /run/secrets/pg-pass
      PG_USER_FILE: /run/secrets/pg-user
    image: 127.0.0.1:5000/local-oee-master-point-logs:v0.8.0
    secrets:
      - source: pg-db
      - source: pg-host
      - source: pg-pass
      - source: pg-user
  postgres:
    depends_on:
      - registry
    deploy:
      restart_policy:
        condition: on-failure
        window: 120s
    environment:
      POSTGRES_PASSWORD: password
    image: 127.0.0.1:5000/local-oee-master-postgres:v0.8.0
    ports:
      - published: 5432
        target: 5432
    volumes:
      - /media/db_main/postgres_oee_master:/var/lib/postgresql/data:rw
  registry:
    deploy:
      restart_policy:
        condition: on-failure
    image: registry:2
    ports:
      - mode: host
        published: 5000
        target: 5000
    volumes:
      - /mnt/registry:/var/lib/registry:rw
Things I've tried
- Added restart_policy > window: 120s. Result: no effect.
- Set the Postgres restart_policy to condition: none and redeployed with a crontab @reboot entry. Result: no effect.
- Set stop_grace_period: 2m on all containers. Result: no effect.
Current Workaround
Currently, I have hacked together a solution that works just so I can move on to the next thing. I wrote a shell script called recreate.sh that kills the failed first-boot version of the stack, waits for it to wind down, and then "manually" runs docker stack deploy again, and I set that script to run at boot with crontab @reboot. This works for shutdowns and reboots, but I don't accept it as the proper answer, so I won't add it as one.
It looks to me like you need to check who or what kills the postgres service. From the logs you posted, it seems that postgres receives a smart shutdown signal and then stops gracefully. Your stack file sets the restart policy to "on-failure", and since the postgres process stops gracefully (exit code 0), Docker does not consider this a failure and, as instructed, does not restart it.
In conclusion, I'd recommend changing the restart policy from "on-failure" to "any".
Also, keep in mind that the depends_on settings you use are ignored in swarm mode; your services/images need their own way of ensuring the proper startup order, or must be able to cope while dependent services are not yet up.
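Since depends_on is ignored under swarm, one common pattern (a sketch, not from the post; the helper name and usage are illustrative) is a small wait loop in each service's entrypoint:

```shell
# Hypothetical helper: block until a TCP port accepts connections, or give up.
wait_for_port() {
  host="$1"; port="$2"; tries="${3:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # bash's /dev/tcp gives a probe without needing nc/curl in the image
    if timeout 1 bash -c "</dev/tcp/${host}/${port}" 2>/dev/null; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1
}
```

An entrypoint could then run something like `wait_for_port postgres 5432 && exec node server.js` so the app only starts once the database is reachable.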
There's one more thing you could try: healthchecks. Perhaps your postgres base image has a healthcheck defined and it's terminating the container by sending it a kill signal. As written earlier, postgres then shuts down gracefully, there's no error exit code, and the restart policy doesn't trigger. Try disabling the healthcheck in the yaml, or look for a HEALTHCHECK directive in the Dockerfiles and figure out why it triggers.
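Expressed as stack-file changes, the two suggestions above look roughly like this (a sketch; disable: true assumes the base image's healthcheck really is the trigger):

```yaml
postgres:
  deploy:
    restart_policy:
      condition: any      # restart even after a clean exit (code 0)
      delay: 5s
      window: 120s
  healthcheck:
    disable: true         # rule out a healthcheck killing the container
```

Note that condition: any is also the swarm default when no restart policy is given.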