I am attempting to run a bash script using ACI (Azure Container Instances). Occasionally the container stalls or hangs: CPU activity drops to 0, memory flattens (see below), and network activity drops to ~50 bytes, but the script never completes. As far as I can tell, the container never terminates. A bash window can still be opened on the container. The logs suggest the hang occurs during wget.
Possible clue:
How can I verify my container is using SMB 3.0 to connect to my share, or is that handled at the host server level so I have to assume ACI uses SMB 3.0?
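My assumption is that I can at least see the negotiated SMB dialect from inside the container by looking at the options on the CIFS mount; a rough sketch (assumes the share is mounted under /aci/mnt as in the script below):
# inspect the CIFS mount options; they should include a "vers=" entry (e.g. vers=3.0 or vers=3.1.1)
grep cifs /proc/mounts
# or just:
mount | grep cifs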
This script:
Dequeues an item from ServiceBus.
Runs an exe to obtain a URL.
Performs a wget using the URL; writes the output to a StorageAccount Fileshare.
Exits, terminating the container.
wget is invoked with a 4-minute timeout. Data is written directly to the share so that if the run fails it can be retried and wget can resume. The timeout command should force wget to end if it hangs. The logs suggest the container hangs at wget.
timeout 4m wget -c -O "/aci/mnt/$ID/$ID.sra" "$URL"
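In case a plain timeout isn't enough, I'm considering hardening the call roughly like this (a sketch; assumes GNU coreutils timeout in the image, and the --kill-after escalation won't help if wget is stuck in uninterruptible I/O on the mount):
# escalate to SIGKILL if wget ignores SIGTERM after the 4-minute timeout, and log the exit code
timeout --kill-after=30s 4m wget -c -O "/aci/mnt/$ID/$ID.sra" "$URL"
RC=$?
echo "$(date -u): wget exit code=$RC"   # 124 = timed out, 137 = killed with SIGKILL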
I have 100 items in the queue. I have 5 ACI container groups running 5 containers each (25 containers total).
A Logic App checks the queue and, if items are present, runs the containers.
Approximately 95% of the download runs work as expected.
Many of the runs simply hang, as far as I can tell at around 104 GB of total downloads.
I am using a Premium Storage Account with a 300 GB file share and SMB Multichannel enabled.
It seems that on some of the large files (>3 GB) the Container Instance hangs.
A successful run looks something like this:
PeekLock Message (5min)...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 353 0 353 0 0 1481 0 --:--:-- --:--:-- --:--:-- 1476
Wed Dec 28 14:31:41 UTC 2022: prefetch-01-0: ---------- BEGIN RUN ----------
./pipeline.sh: line 80: can't create /aci/mnt/SRR10575111/SRR10575111.txt: nonexistent directory
Wed Dec 28 14:31:41 UTC 2022: prefetch-01-0: vdb-dump [SRR10575111]...
Wed Dec 28 14:31:44 UTC 2022: prefetch-01-0: wget [SRR10575111]...
Wed Dec 28 14:31:44 UTC 2022: prefetch-01-0: URL=https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR10575111/SRR10575111
Connecting to sra-pub-run-odp.s3.amazonaws.com (54.231.131.113:443)
saving to '/aci/mnt/SRR10575111/SRR10575111.sra'
SRR10575111.sra 0% | | 12.8M 0:03:39 ETA
SRR10575111.sra 1% | | 56.1M 0:01:39 ETA
...
SRR10575111.sra 99% |******************************* | 2830M 0:00:00 ETA
SRR10575111.sra 100% |********************************| 2833M 0:00:00 ETA
'/aci/mnt/SRR10575111/SRR10575111.sra' saved
Wed Dec 28 14:35:42 UTC 2022: prefetch-01-0: wget exit...
Wed Dec 28 14:35:43 UTC 2022: prefetch-01-0: wget Success!
Wed Dec 28 14:35:43 UTC 2022: prefetch-01-0: Delete Message...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
Wed Dec 28 14:35:43 UTC 2022: prefetch-01-0: POST to [orchestrator] queue...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 325 0 0 100 325 0 1105 --:--:-- --:--:-- --:--:-- 1109
Wed Dec 28 14:35:44 UTC 2022: prefetch-01-0: exit RESULTCODE=0
A hung run looks like this:
PeekLock Message (5min)...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 352 0 352 0 0 1252 0 --:--:-- --:--:-- --:--:-- 1252
Wed Dec 28 14:31:41 UTC 2022: prefetch-01-1: ---------- BEGIN RUN ----------
./pipeline.sh: line 80: can't create /aci/mnt/SRR9164212/SRR9164212.txt: nonexistent directory
Wed Dec 28 14:31:41 UTC 2022: prefetch-01-1: vdb-dump [SRR9164212]...
Wed Dec 28 14:31:44 UTC 2022: prefetch-01-1: wget [SRR9164212]...
Wed Dec 28 14:31:44 UTC 2022: prefetch-01-1: URL=https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR9164212/SRR9164212
Connecting to sra-pub-run-odp.s3.amazonaws.com (52.216.146.75:443)
saving to '/aci/mnt/SRR9164212/SRR9164212.sra'
SRR9164212.sra 0% | | 2278k 0:55:44 ETA
SRR9164212.sra 0% | | 53.7M 0:04:30 ETA
SRR9164212.sra 1% | | 83.9M 0:04:18 ETA
...
SRR9164212.sra 44% |************** | 3262M 0:04:55 ETA
SRR9164212.sra 44% |************** | 3292M 0:04:52 ETA
SRR9164212.sra 45% |************** | 3326M 0:04:47 ETA
The container is left in a Running state.
CPU goes to 0; network activity goes to ~50 B received/transmitted.
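Since I can still open a bash window on the hung container, my next step is to check whether wget is stuck in uninterruptible I/O (state D) on the SMB mount. A rough sketch of what I plan to run inside the container (assumes ps/awk support this; the fallback only needs /proc):
# look for processes stuck in uninterruptible sleep (state D), typically blocked on I/O
ps -eo pid,stat,wchan:30,args | awk 'NR==1 || $2 ~ /D/'
# fallback in case ps in the image doesn't support -eo
for p in /proc/[0-9]*; do
  grep -q '^State:.*D' "$p/status" 2>/dev/null && echo "$p $(cat "$p/comm")"
done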
I'm running a distributed Airflow setup using docker-compose. The main services run on one server and the Celery workers run on multiple servers. I have a few hundred tasks running every five minutes, and I started to run out of DB connections, which was indicated by this error message in the task logs.
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "SERVER" (IP), port XXXXX failed: FATAL: sorry, too many clients already
I'm using Postgres as the metastore, and max_connections is set to the default value of 100. I didn't want to raise the max_connections value since I thought there should be a better solution for this. At some point I'll be running thousands of tasks every 5 minutes, and the number of connections is guaranteed to run out again. So I added pgbouncer to my configuration.
Here's how I configured pgbouncer:
pgbouncer:
  image: "bitnami/pgbouncer:1.16.0"
  restart: always
  environment:
    POSTGRESQL_HOST: "postgres"
    POSTGRESQL_USERNAME: ${POSTGRES_USER}
    POSTGRESQL_PASSWORD: ${POSTGRES_PASSWORD}
    POSTGRESQL_PORT: ${PSQL_PORT}
    PGBOUNCER_DATABASE: ${POSTGRES_DB}
    PGBOUNCER_AUTH_TYPE: "trust"
    PGBOUNCER_IGNORE_STARTUP_PARAMETERS: "extra_float_digits"
  ports:
    - '1234:1234'
  depends_on:
    - postgres
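For what it's worth, one way to check that the bouncer itself is reachable is to connect through it manually; a minimal sketch, assuming psql is available and using the port 1234 published above (the host name is a placeholder):
# open a connection to the metastore *through* pgbouncer
psql -h <pgbouncer-host> -p 1234 -U ${POSTGRES_USER} -d ${POSTGRES_DB} -c 'SELECT 1;'
# a successful connection should show up as a "login attempt: db=... user=..." line in the pgbouncer log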
pgbouncer logs look like this:
pgbouncer 13:29:13.87
pgbouncer 13:29:13.87 Welcome to the Bitnami pgbouncer container
pgbouncer 13:29:13.87 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-pgbouncer
pgbouncer 13:29:13.87 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-pgbouncer/issues
pgbouncer 13:29:13.88
pgbouncer 13:29:13.89 INFO ==> ** Starting PgBouncer setup **
pgbouncer 13:29:13.91 INFO ==> Validating settings in PGBOUNCER_* env vars...
pgbouncer 13:29:13.91 WARN ==> You set the environment variable PGBOUNCER_AUTH_TYPE=trust. For safety reasons, do not use this flag in a production environment.
pgbouncer 13:29:13.91 INFO ==> Initializing PgBouncer...
pgbouncer 13:29:13.92 INFO ==> Waiting for PostgreSQL backend to be accessible
pgbouncer 13:29:13.92 INFO ==> Backend postgres:9876 accessible
pgbouncer 13:29:13.93 INFO ==> Configuring credentials
pgbouncer 13:29:13.93 INFO ==> Creating configuration file
pgbouncer 13:29:14.06 INFO ==> Loading custom scripts...
pgbouncer 13:29:14.06 INFO ==> ** PgBouncer setup finished! **
pgbouncer 13:29:14.08 INFO ==> ** Starting PgBouncer **
2022-10-25 13:29:14.089 UTC [1] LOG kernel file descriptor limit: 1048576 (hard: 1048576); max_client_conn: 100, max expected fd use: 152
2022-10-25 13:29:14.089 UTC [1] LOG listening on 0.0.0.0:1234
2022-10-25 13:29:14.089 UTC [1] LOG listening on unix:/tmp/.s.PGSQL.1234
2022-10-25 13:29:14.089 UTC [1] LOG process up: PgBouncer 1.16.0, libevent 2.1.8-stable (epoll), adns: c-ares 1.14.0, tls: OpenSSL 1.1.1d 10 Sep 2019
2022-10-25 13:30:14.090 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-25 13:31:14.090 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-25 13:32:14.090 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-25 13:33:14.090 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-25 13:34:14.089 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-25 13:35:14.090 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-25 13:36:14.090 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-25 13:37:14.090 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-25 13:38:14.090 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-25 13:39:14.089 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
The service seems to run OK, but as far as I can tell it isn't doing anything. There was very little information about this in the Airflow documentation, and I'm unsure what to change.
Should I change the pgbouncer setup in my docker-compose file?
Should I change the AIRFLOW__DATABASE__SQL_ALCHEMY_CONN variable?
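My current guess is that both variables have to point at the bouncer instead of Postgres, for every Airflow service (webserver, scheduler and workers alike), something like this (a sketch; "pgbouncer" and 1234 are the service name and port from the compose file above, and remote workers would need the host's address instead of the compose service name):
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@pgbouncer:1234/${POSTGRES_DB}
AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@pgbouncer:1234/${POSTGRES_DB}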
Update 1:
I edited the docker-compose.yml for the worker nodes and changed the DB port to the pgbouncer port. After this I got some traffic in the bouncer logs. With this configuration, Airflow tasks are queued but not processed, so there's still something wrong. I didn't edit the docker-compose YAML that launches the webserver, scheduler, etc., because I wasn't sure how.
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://<XXX>@${AIRFLOW_WEBSERVER_URL}:${PGBOUNCER_PORT}/airflow
AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://<XXX>@${AIRFLOW_WEBSERVER_URL}:${PGBOUNCER_PORT}/airflow
pgbouncer log after the change:
2022-10-26 11:46:22.517 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-26 11:47:22.517 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-26 11:48:22.517 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-26 11:49:22.519 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-26 11:50:22.518 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-26 11:51:22.516 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-10-26 11:51:52.356 UTC [1] LOG C-0x5602cf8ab180: <XXX>@<IP:PORT> login attempt: db=airflow user=airflow tls=no
2022-10-26 11:51:52.359 UTC [1] LOG S-0x5602cf8b1f20: <XXX>@<IP:PORT> new connection to server (from <IP:PORT>)
2022-10-26 11:51:52.410 UTC [1] LOG C-0x5602cf8ab180: <XXX>@<IP:PORT> closing because: client close request (age=0s)
2022-10-26 11:51:52.834 UTC [1] LOG C-0x5602cf8ab180: <XXX>@<IP:PORT> login attempt: db=airflow user=airflow tls=no
2022-10-26 11:51:52.845 UTC [1] LOG C-0x5602cf8ab180: <XXX>@<IP:PORT> closing because: client close request (age=0s)
2022-10-26 11:51:56.752 UTC [1] LOG C-0x5602cf8ab180: <XXX>@<IP:PORT> login attempt: db=airflow user=airflow tls=no
2022-10-26 11:51:57.393 UTC [1] LOG C-0x5602cf8ab3b0: <XXX>@<IP:PORT> login attempt: db=airflow user=airflow tls=no
2022-10-26 11:51:57.394 UTC [1] LOG S-0x5602cf8b2150: <XXX>@<IP:PORT> new connection to server (from <IP:PORT>)
2022-10-26 11:51:59.906 UTC [1] LOG C-0x5602cf8ab180: <XXX>@<IP:PORT> closing because: client close request (age=3s)
2022-10-26 11:52:00.642 UTC [1] LOG C-0x5602cf8ab3b0: <XXX>@<IP:PORT> closing because: client close request (age=3s)
I have a Go program running on CentOS that usually has around 5k TCP clients connected. Every once in a while this number goes up to around 15k for about an hour, and everything is still fine.
The program has a slow-shutdown mode where it stops accepting new clients and slowly kills all currently connected clients over the course of 20 minutes. During these slow-shutdown periods, if the machine has 15k clients, I sometimes get:
[Wed Oct 31 21:28:23 2018] net_ratelimit: 482 callbacks suppressed
[Wed Oct 31 21:28:23 2018] TCP: too many orphaned sockets
[Wed Oct 31 21:28:23 2018] TCP: too many orphaned sockets
[Wed Oct 31 21:28:23 2018] TCP: too many orphaned sockets
I have tried adding:
echo "net.ipv4.tcp_max_syn_backlog=5000" >> /etc/sysctl.conf
echo "net.ipv4.tcp_fin_timeout=10" >> /etc/sysctl.conf
echo "net.ipv4.tcp_tw_recycle=1" >> /etc/sysctl.conf
echo "net.ipv4.tcp_tw_reuse=1" >> /etc/sysctl.conf
sysctl -f /etc/sysctl.conf
These values are set; I can see them with their correct new values. A typical sockstat is:
cat /proc/net/sockstat
sockets: used 31682
TCP: inuse 17286 orphan 5 tw 3874 alloc 31453 mem 15731
UDP: inuse 8 mem 3
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
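One thing I noticed while reading up: none of the sysctls above touch the orphan limit itself; from what I understand, the "too many orphaned sockets" message is governed by net.ipv4.tcp_max_orphans, so I'm considering raising it, roughly like this (the value is a guess, well above the observed peak):
# check the current limit and the live orphan count
cat /proc/sys/net/ipv4/tcp_max_orphans
grep -o 'orphan [0-9]*' /proc/net/sockstat
# raise the limit
echo "net.ipv4.tcp_max_orphans=65536" >> /etc/sysctl.conf
sysctl -p /etc/sysctl.conf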
Any ideas how to stop the "too many orphaned sockets" error and the crash? Should I increase the 20-minute slow shutdown period to 40 minutes? Increase tcp_mem? Thanks!
backend geoserver
    balance roundrobin
    log global
    #option httpclose
    #option httplog
    #option forceclose
    mode http
    #option tcplog
    #monitor-uri /tiny
    #*************** health check ********************************
    option tcpka
    option external-check
    option log-health-checks
    external-check path "/usr/bin:/bin:/tmp"
    #external-check path "/usr/bin:/bin"
    external-check command /bin/true
    #external-check command /var/lib/haproxy/ping.sh
    timeout queue 60s
    timeout server 60s
    timeout connect 60s
    #************** cookie *******************************
    cookie NV_HAPROXY insert indirect nocache
    server web1_Controller_ASHISH 10.10.60.15:9002 check cookie web1_Controller_ASHISH
    server web2_controller_jagjeet 10.10.60.15:7488 check cookie web2_Controller_jagjeet
Previously, the following errors were encountered:
backend geoserver has no server available!
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Health check for server geoserver/web1_Controller_ASHISH failed, reason: External check error, code: 255, check duration: 0ms, status: 0/2 DOWN.
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Health check for server geoserver/web1_Controller_ASHISH failed, reason: External check error, code: 255, check duration: 0ms, status: 0/2 DOWN.
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Server geoserver/web1_Controller_ASHISH is DOWN. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Server geoserver/web1_Controller_ASHISH is DOWN. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Health check for server geoserver/web2_controller_jagjeet failed, reason: External check error, code: 255, check duration: 0ms, status: 0/2 DOWN.
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Health check for server geoserver/web2_controller_jagjeet failed, reason: External check error, code: 255, check duration: 0ms, status: 0/2 DOWN.
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Server geoserver/web2_controller_jagjeet is DOWN. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Server geoserver/web2_controller_jagjeet is DOWN. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
After that I changed my chroot directory (/var/lib/haproxy) to /etc/haproxy, where all my configuration files are stored, but the "No server is available to handle this request" error still occurs. Please let me know what the issue is.
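For reference, this is roughly the kind of external check script I want to end up with instead of /bin/true (a sketch; as far as I can tell from the docs, HAProxy runs the command with <proxy_addr> <proxy_port> <server_addr> <server_port> as arguments, and the script plus anything it calls, nc here, must be reachable from inside the chroot):
#!/bin/sh
# hypothetical /var/lib/haproxy/ping.sh - minimal external health check
# exit 0 marks the server UP, any other exit code marks it DOWN
SERVER_ADDR="$3"
SERVER_PORT="$4"
nc -z -w 2 "$SERVER_ADDR" "$SERVER_PORT"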
I have a ZooKeeper cluster with just 2 nodes; each zoo.conf has the following:
# Servers
server.1=10.138.0.8:2888:3888
server.2=10.138.0.9:2888:3888
The same two lines are present in both configs.
[root@zk1-prod supervisor.d]# echo mntr | nc 10.138.0.8 2181
zk_version 3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
zk_avg_latency 0
zk_max_latency 0
zk_min_latency 0
zk_packets_received 5
zk_packets_sent 4
zk_num_alive_connections 1
zk_outstanding_requests 0
zk_server_state follower
zk_znode_count 4
zk_watch_count 0
zk_ephemerals_count 0
zk_approximate_data_size 27
zk_open_file_descriptor_count 28
zk_max_file_descriptor_count 4096
[root@zk1-prod supervisor.d]# echo mntr | nc 10.138.0.9 2181
zk_version 3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
zk_avg_latency 0
zk_max_latency 0
zk_min_latency 0
zk_packets_received 3
zk_packets_sent 2
zk_num_alive_connections 1
zk_outstanding_requests 0
zk_server_state leader
zk_znode_count 4
zk_watch_count 0
zk_ephemerals_count 0
zk_approximate_data_size 27
zk_open_file_descriptor_count 29
zk_max_file_descriptor_count 4096
zk_followers 1
zk_synced_followers 1
zk_pending_syncs 0
So why is zk_znode_count == 4?
Znodes are not ZooKeeper servers.
From Hadoop: The Definitive Guide:
Zookeeper doesn’t have files and directories, but a unified concept of
a node, called a znode, which acts both as a container of data (like a
file) and a container of other znodes (like a directory).
zk_znode_count refers to the number of znodes stored on that ZooKeeper server. In your ZK ensemble, each server has four znodes.
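If you want to see exactly which znodes make up that count, you can list them with the CLI, for example (the install path is just an example):
# list the znodes on one of the servers
/opt/zookeeper/bin/zkCli.sh -server 10.138.0.8:2181 ls /
/opt/zookeeper/bin/zkCli.sh -server 10.138.0.8:2181 ls /zookeeper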