PostgreSQL 9.5 Replication Lag running on EC2 - postgresql

I have a series of PostgreSQL 9.5 servers running on r4.16xlarge instances with Amazon Linux 1 that started experiencing replication lag of several seconds this week. The configurations were changed, but the old configs weren't saved, so I'm not sure what the previous settings were. Here are the custom values:
max_connections = 1500
shared_buffers = 128GB
effective_cache_size = 132GB
maintenance_work_mem = 128MB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
#effective_io_concurrency = 10
work_mem = 128MB
min_wal_size = 1GB
max_wal_size = 2GB
max_worker_processes = 64
synchronous_commit = off
The drive layout is as follows: 4 disks for the xlog drive and 10 for the regular partition, all gp2 volumes.
Personalities : [raid0]
md126 : active raid0 xvdo[3] xvdn[2] xvdm[1] xvdl[0]
419428352 blocks super 1.2 512k chunks
md127 : active raid0 xvdk[9] xvdj[8] xvdi[7] xvdh[6] xvdg[5] xvdf[4] xvde[3] xvdd[2] xvdc[1] xvdb[0]
2097146880 blocks super 1.2 512k chunks
The master server is a smaller c4.8xlarge instance with this setup:
max_connections = 1500
shared_buffers = 15GB
effective_cache_size = 45GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 16
work_mem = 26MB
min_wal_size = 1GB
max_wal_size = 2GB
max_worker_processes = 36
With this drive layout:
Personalities : [raid0]
md126 : active raid0 xvdd[2] xvdc[1] xvdb[0] xvde[3]
419428352 blocks super 1.2 512k chunks
md127 : active raid0 xvdr[12] xvdg[1] xvdo[9] xvdl[6] xvdh[2] xvdf[0] xvdp[10] xvdu[15] xvdm[7] xvdj[4] xvdn[8] xvdk[5] xvdi[3] xvds[13] xvdt[14] xvdq[11]
3355435008 blocks super 1.2 512k chunks
I guess I'm looking for optimal settings for these two instance types so I can eliminate the replication lag. None of the servers are what I would call heavily loaded.
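Before tuning, it may help to quantify the lag itself. On 9.5 the WAL functions and pg_stat_replication columns still use the pre-10 "xlog"/"location" names, so a rough check (a sketch, not tailored to this setup) looks like:

```sql
-- On the standby: how far behind is replay, in wall-clock terms?
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;

-- On the master: byte lag per standby (9.5 uses *_location, not *_lsn)
SELECT client_addr, application_name,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS replay_lag_bytes
FROM pg_stat_replication;
```

Watching whether the byte lag grows steadily or in bursts helps distinguish I/O-bound replay from query conflicts on the standby.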

With further digging I found that the following setting fixed the replication lag:
hot_standby_feedback = on
This may cause some bloat on the master, since vacuum can't remove rows the standby still needs, but the backlog is now gone.
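If you want to keep an eye on the cost of hot_standby_feedback, the transaction horizon each standby is holding back is visible on the master. A minimal check (column names per the 9.5 catalog):

```sql
-- backend_xmin is the oldest transaction the standby still needs;
-- vacuum on the master cannot remove row versions newer than this horizon.
SELECT application_name, client_addr, backend_xmin
FROM pg_stat_replication;
```

If backend_xmin stays far behind the current transaction ID for long stretches, bloat on the master is the likely price of the fix.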

Related

Postgresql-12 too many wal files

We have a PostgreSQL 12 DB in production. Today we realized disk usage on the master server has grown compared to last month (last month: 4.4 TB out of 14 TB; now: 9.8 TB out of 14 TB). When I run ncdu, our actual PostgreSQL data is only 3.4 TB; the other 6.4 TB is used by WAL files alone. We have a standby server as well, WAL archiving is enabled, and we continuously copy WAL files to another backup server alongside the base backup of our DB. Are all these WAL files still necessary even after we back them up? If not, what should we do to free our disk space of unnecessary WAL files? Here is our postgresql.conf:
Our server's specs:
CentOS 7
48-core Intel Xeon Gold CPU
256 GB memory
4 × 7.6 TB SSD in RAID 1+0 (~14 TB total)
listen_addresses = '*'
max_connections = 500
superuser_reserved_connections = 10
password_encryption = md5
shared_buffers = 64GB
max_prepared_transactions = 100
work_mem = 83886kB
maintenance_work_mem = 2GB
max_stack_depth = 2MB
dynamic_shared_memory_type = posix
bgwriter_delay = 100ms
bgwriter_lru_maxpages = 10000
bgwriter_lru_multiplier = 10.0
effective_io_concurrency = 200
max_worker_processes = 48
max_parallel_maintenance_workers = 4
max_parallel_workers_per_gather = 4
max_parallel_workers = 48
wal_level = replica
wal_sync_method = fdatasync
wal_compression = on
wal_log_hints = on
wal_buffers = 32MB
commit_delay = 0
max_wal_size = 16GB
min_wal_size = 4GB
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'test ! -f /pgdata/wal_backup/%f && cp %p /pgdata/wal_backup/%f && /var/lib/pgsql/backup_wal.sh'
archive_timeout = 3600
random_page_cost = 1.1
effective_cache_size = 192GB
default_statistics_target = 100
log_destination = 'stderr'
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%a.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 0
log_min_duration_statement = 4000
log_checkpoints = on
log_line_prefix = '<user=%u db=%d host=%h pid=%p app=%a time=%m > '
log_lock_waits = on
log_temp_files = 0
log_timezone = 'Europe/Istanbul'
cluster_name = 'pg12/primary'
track_io_timing = on
track_functions = all
log_autovacuum_min_duration = 0
statement_timeout = 3600000
datestyle = 'iso, mdy'
timezone = 'Europe/Istanbul'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
shared_preload_libraries = 'pg_stat_statements'
max_locks_per_transaction = 128
pg_stat_statements.max = 10000
pg_stat_statements.track = all
pg_stat_statements.track_utility = on
pg_stat_statements.save = on
One possibility is an unused replication slot holding WAL back (a slot that no standby is currently consuming via primary_slot_name). Consult pg_replication_slots.
The other is that your archiver is failing, in which case PostgreSQL won't recycle unarchived segments. Consult pg_stat_archiver.
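Both checks can be done with a couple of queries on the primary; on PostgreSQL 12 the amount of WAL a slot pins can be computed from its restart_lsn (a sketch, not tailored to this deployment):

```sql
-- Inactive slots pin WAL indefinitely; a large retained_wal here explains the growth.
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- A nonzero failed_count with a recent last_failed_time means archiving is stuck,
-- and PostgreSQL will not remove segments that have not been archived.
SELECT archived_count, failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;
```

Dropping a stale slot with pg_drop_replication_slot() or repairing the archive_command lets the next checkpoint recycle the accumulated segments.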

Async slave node missing WAL files on Postgres11

I have 3 VM nodes running a Postgres 11 master-slave setup, managed by Pacemaker.
Node Attributes:
* Node node04:
+ master-pgsqlins : 1000
+ pgsqlins-data-status : LATEST
+ pgsqlins-master-baseline : 000000C0D8000098
+ pgsqlins-status : PRI
* Node node05:
+ master-pgsqlins : -INFINITY
+ pgsqlins-data-status : STREAMING|ASYNC
+ pgsqlins-status : HS:async
* Node node06:
+ master-pgsqlins : 100
+ pgsqlins-data-status : STREAMING|SYNC
+ pgsqlins-status : HS:sync
The async node occasionally throws an error that a required WAL file is missing. It then stops replication and starts it again.
On the master node, WAL archiving is enabled and the segments are synced to a folder named wal_archive. Another process keeps removing files from that wal_archive folder. So I understand why the slave node would throw that error; what I want to understand is how it is able to start back up again without that missing file.
The postgresql.conf
# Connection settings
# -------------------
listen_addresses = '*'
port = 5432
max_connections = 600
tcp_keepalives_idle = 0
tcp_keepalives_interval = 0
tcp_keepalives_count = 0
# Memory-related settings
# -----------------------
shared_buffers = 2GB # Physical memory 1/4
##DEBUG: mmap(1652555776) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
#huge_pages = try # on, off, or try
#temp_buffers = 16MB # depends on DB checklist
work_mem = 8MB # Need tuning
effective_cache_size = 4GB # Physical memory 1/2
maintenance_work_mem = 512MB
wal_buffers = 64MB
# WAL/Replication/HA settings
# --------------------
wal_level = logical
synchronous_commit = remote_write
archive_mode = on
archive_command = 'rsync -a %p /xxxxx/wal_archive/%f'
#archive_command = ':'
max_wal_senders=5
hot_standby = on
restart_after_crash = off
wal_sender_timeout = 60000
wal_receiver_status_interval = 2
max_standby_streaming_delay = -1
max_standby_archive_delay = -1
hot_standby_feedback = on
random_page_cost = 1.5
max_wal_size = 5GB
min_wal_size = 200MB
checkpoint_completion_target = 0.9
checkpoint_timeout = 30min
# Logging settings
# ----------------
log_destination = 'csvlog,syslog'
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql_%Y%m%d.log'
log_truncate_on_rotation = off
log_rotation_age = 1h
log_rotation_size = 0
log_timezone = 'Japan'
log_line_prefix = '%t [%p]: [%l-1] %h:%u#%d:[XXXPG]:CODE:%e '
log_statement = ddl
log_min_messages = info # DEBUG5
log_min_error_statement = info # DEBUG5
log_error_verbosity = default
log_checkpoints = on
log_lock_waits = on
log_temp_files = 0
log_connections = on
log_disconnections = on
log_duration = off
log_min_duration_statement = 1000
log_autovacuum_min_duration = 3000ms
track_functions = pl
track_activity_query_size = 8192
# Locale/display settings
# -----------------------
lc_messages = 'C'
lc_monetary = 'en_US.UTF-8' # ja_JP.eucJP
lc_numeric = 'en_US.UTF-8' # ja_JP.eucJP
lc_time = 'en_US.UTF-8' # ja_JP.eucJP
timezone = 'Asia/Tokyo'
bytea_output = 'escape'
# Auto vacuum settings
# -----------------------
autovacuum = on
autovacuum_max_workers = 3
autovacuum_vacuum_cost_limit = 200
#shared_preload_libraries = 'pg_stat_statements,auto_explain' <------------------check this
auto_explain.log_min_duration = 10000
auto_explain.log_analyze = on
include '/var/lib/pgsql/tmp/rep_mode.conf' # added by pgsql RA
On the async slave node, this is the recovery.conf
primary_conninfo = 'host=1xx.xx.xx.xx port=5432 user=replica application_name=node05 keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = 'rsync -a /xxxxx/wal_archive/%f %p'
recovery_target_timeline = 'latest'
standby_mode = 'on'
The logs about the error from master
2021-07-05 23:35:02.321 JST,,,28926,,60e16b42.70fe,122,,2021-07-04 17:03:14 JST,,0,LOG,00000,"checkpoint complete: wrote 2897 buffers (1.1%); 0 WAL file(s) added, 0 removed
, 2 recycled; write=106.770 s, sync=0.050 s, total=106.827 s; sync files=251, longest=0.017 s, average=0.001 s; distance=20262 kB, estimate=46658 kB",,,,,,,,,""
2021-07-05 23:35:02.322 JST,,,28926,,60e16b42.70fe,123,,2021-07-04 17:03:14 JST,,0,LOG,00000,"checkpoint starting: immediate force wait",,,,,,,,,""
2021-07-05 23:35:02.347 JST,,,28926,,60e16b42.70fe,124,,2021-07-04 17:03:14 JST,,0,LOG,00000,"checkpoint complete: wrote 173 buffers (0.1%); 0 WAL file(s) added, 0 removed,
1 recycled; write=0.007 s, sync=0.012 s, total=0.026 s; sync files=43, longest=0.005 s, average=0.001 s; distance=14410 kB, estimate=43434 kB",,,,,,,,,""
2021-07-05 23:35:02.348 JST,"replica","",3451,"1xx.xx.xx.xxx:45120",60e16bfc.d7b,3,"streaming C1/97C3E000",2021-07-04 17:06:20 JST,116/0,0,ERROR,XX000,"requested WAL segment 00000001000000C100000097 has already been removed",,,,,,,,,"node05"
2021-07-05 23:35:02.361 JST,"replica","",3451,"1xx.xx.xx.xxx:45120",60e16bfc.d7b,4,"idle",2021-07-04 17:06:20 JST,,0,LOG,00000,"disconnection: session time: 30:28:41.550 user=replica database= host=172.17.48.141 port=45120",,,,,,,,,"node05"
2021-07-05 23:35:02.399 JST,,,24896,"1xx.xx.xx.xxx:49278",60e31896.6140,1,"",2021-07-05 23:35:02 JST,,0,LOG,00000,"connection received: host=1xx.xx.xx.xxx port=49278",,,,,,,,,""
2021-07-05 23:35:02.401 JST,"postgres","postgres",24851,"[local]",60e31896.6113,3,"idle",2021-07-05 23:35:02 JST,,0,LOG,00000,"disconnection: session time: 0:00:00.251 user=postgres database=postgres host=[local]",,,,,,,,,"postgres#node04"
2021-07-05 23:35:02.403 JST,"replica","",24896,"1xx.xx.xx.xxx:49278",60e31896.6140,2,"authentication",2021-07-05 23:35:02 JST,116/72,0,LOG,00000,"replication connection authorized: user=replica",,,,,,,,,""
The logs about the error from async slave node
2021-07-05 23:35:02.359 JST,,,2541,,60e16bfc.9ed,2,,2021-07-04 17:06:20 JST,,0,FATAL,XX000,"could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000C100000097 has already been removed",,,,,,,,,""
2021-07-05 23:35:02.408 JST,,,4703,,60e31896.125f,1,,2021-07-05 23:35:02 JST,,0,LOG,00000,"started streaming WAL from primary at C1/98000000 on timeline 1",,,,,,,,,""
2021-07-05 23:35:03.318 JST,,,4835,"[local]",60e31897.12e3,1,"",2021-07-05 23:35:03 JST,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
Only the async slave node throws this error, never the sync node, and it recovers without any manual intervention. Is there a way to avoid this error other than not removing the archived WAL files from the wal_archive folder every 2 minutes?
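One common way on PostgreSQL 11 to avoid the "requested WAL segment … has already been removed" error is a physical replication slot for the async standby, so the primary retains exactly the WAL that standby still needs instead of relying on the archive. A sketch, using a hypothetical slot name; note that if the standby disappears, an unconsumed slot will retain WAL without bound:

```sql
-- On the primary: create a slot for the async standby
SELECT pg_create_physical_replication_slot('node05_slot');
```

Then add primary_slot_name = 'node05_slot' to that standby's recovery.conf. Alternatively, raising wal_keep_segments on the primary gives a fixed-size safety margin without the unbounded-retention risk.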

pg_wal is taking a lot of disk space

I have installed Postgres in a containerized environment using docker-compose, with the image crunchydata/crunchy-postgres-gis:centos7-11.5-2.4.2. Everything was running fine until I realized that PG_DIR/pg_wal is taking a lot of disk space. I don't want to run pg_archivecleanup by hand every time, nor in a cron job; I want to configure Postgres to do the cleanup automatically. What is the correct configuration for that?
This is my postgresql.conf file.
listen_addresses = '*' # what IP address(es) to listen on;
port = 5432 # (change requires restart)
unix_socket_directories = '/tmp' # comma-separated list of directories
unix_socket_permissions = 0777 # begin with 0 to use octal notation
temp_buffers = 8MB # min 800kB
max_connections = 400
shared_buffers = 1536MB
effective_cache_size = 4608MB
maintenance_work_mem = 384MB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 4MB
min_wal_size = 1GB
max_wal_size = 2GB
max_worker_processes = 4
max_parallel_workers_per_gather = 2
max_parallel_workers = 4
unix_socket_directories = '/tmp' # comma-separated list of directories
unix_socket_permissions = 0777 # begin with 0 to use octal notation
shared_preload_libraries = 'pg_stat_statements.so' # (change requires restart)
#------------------------------------------------------------------------------
# WRITE AHEAD LOG
#------------------------------------------------------------------------------
wal_level = hot_standby # minimal, archive, or hot_standby
max_wal_senders = 6 # max number of walsender processes
wal_keep_segments = 400 # in logfile segments, 16MB each; 0 disables
hot_standby = on # "on" allows queries during recovery
max_standby_archive_delay = 30s # max delay before canceling queries
max_standby_streaming_delay = 30s # max delay before canceling queries
wal_receiver_status_interval = 10s # send replies at least this often
archive_mode = on # enables archiving; off, on, or always
# (change requires restart)
archive_command = 'pgbackrest archive-push %p' # command to use to archive a logfile segment
# placeholders: %p = path of file to archive
# %f = file name only
# e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
archive_timeout = 60 # force a logfile segment switch after this
# number of seconds; 0 disables
#------------------------------------------------------------------------------
# ERROR REPORTING AND LOGGING
#------------------------------------------------------------------------------
log_destination = 'stderr' # Valid values are combinations of
logging_collector = on # Enable capturing of stderr and csvlog
log_directory = 'pg_log' # directory where log files are written,
log_filename = 'postgresql-%a.log' # log file name pattern,
log_truncate_on_rotation = on # If on, an existing log file with the
log_rotation_age = 1d # Automatic rotation of logfiles will
log_rotation_size = 0 # Automatic rotation of logfiles will
log_min_duration_statement = 0 # -1 is disabled, 0 logs all statements
log_checkpoints = on
log_connections = on
log_disconnections = on
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h'
log_lock_waits = on # log lock waits >= deadlock_timeout
log_timezone = 'US/Eastern'
log_autovacuum_min_duration = 0 # -1 disables, 0 logs all actions and
datestyle = 'iso, mdy'
timezone = 'US/Eastern'
lc_messages = 'C' # locale for system error message
lc_monetary = 'C' # locale for monetary formatting
lc_numeric = 'C' # locale for number formatting
lc_time = 'C' # locale for time formatting
default_text_search_config = 'pg_catalog.english'
Thanks
You haven't shown us any evidence that pgbackrest has anything to do with this. If it is failing, you should see messages about that in the server's log file. If it is succeeding, then it should be taking up space in the archive, wherever that is, not in pg_wal.
But wal_keep_segments = 400 will lead to over 6.25GB of pg_wal being retained. I don't know if that constitutes "a lot" or not.
pg_archivecleanup isn't for cleaning up pg_wal, it is for cleaning up the archive.
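To confirm where the space is actually going, on PostgreSQL 11 you can total pg_wal from SQL and check the archiver in the same session (wal_keep_segments = 400 at 16 MB per segment accounts for roughly 6.4 GB by itself); a sketch:

```sql
-- Total size of files currently in pg_wal (available in PostgreSQL 10+)
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();

-- If failed_count keeps climbing, pgbackrest is failing and WAL cannot be recycled.
SELECT archived_count, failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;
```

If pg_wal is much larger than wal_keep_segments plus max_wal_size would explain, the archiver output is the first place to look.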

Postgres performance degrading as cache gets consumed

I migrated a Postgres 9.1 database doing 300K transactions/hour from a server with Red Hat OS, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz / 16 cores, 64 GB RAM, and 4 × 240 GB Intel SSDs to one with an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz / 56 cores, 128 GB RAM, and a 2 TB NVMe PCI SSD (450,000 random read IOPS, 56,000 random write IOPS), running CentOS 6.9.
Over time the server slows down and the amount of data processed drops. If I clear the cache manually (sync; echo 3 > /proc/sys/vm/drop_caches), data processing resumes at the maximum level. After some time under load, performance deteriorates again, and the page cache shows as fully consumed.
pg configuration :
datestyle = 'redwood,show_time'
db_dialect = 'redwood'
default_text_search_config = 'pg_catalog.english'
edb_dynatune = 90
edb_redwood_date = on
edb_redwood_strings = on
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
shared_preload_libraries = '$libdir/dbms_pipe,$libdir/edb_gen,$libdir/plugins/plugin_debugger,$libdir/plugins/plugin_spl_debugger'
timed_statistics = off
archive_command = 'rsync -a %p slave:/opt/PostgresPlus/9.1AS/wals/%f'
archive_mode = on
listen_addresses = '*'
log_destination = 'syslog'
syslog_facility = 'LOCAL0'
logging_collector = on
log_line_prefix = '%t'
max_wal_senders = 4
port = 6432
wal_keep_segments = 128
wal_level = hot_standby
temp_buffers='50MB'
constraint_exclusion = on
autovacuum = on
enable_bitmapscan = off
max_connections = 200
shared_buffers = 32GB
effective_cache_size = 96GB
work_mem = 167772kB
maintenance_work_mem = 2GB
checkpoint_segments = 64
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100

orientdb 2.1.11 java process consuming too much memory

We're running OrientDB 2.1.11 (Community Edition) along with JDK 1.8.0.74.
We're noticing memory consumption by the 'orientdb' java process slowly creeping up, and in a few days the database becomes unresponsive (we have to stop/start OrientDB in order to release memory).
We also noticed this kind of behavior in a few hours when we index the database.
The total size of the database is only 60 GB and not more than 200 million records!
As you can see below, it already consumes VIRT(11.44 GB) RES(8.62 GB).
We're running CentOS 7.1.x.
We even changed the heap from 512 MB to 256 MB and set diskCache.bufferSize to 8 GB:
MAXHEAP=-Xmx256m
# ORIENTDB MAXIMUM DISKCACHE IN MB, EXAMPLE, ENTER -Dstorage.diskCache.bufferSize=8192 FOR 8GB
MAXDISKCACHE="-Dstorage.diskCache.bufferSize=8192"
top output:
Tasks: 155 total, 1 running, 154 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.2 us, 0.1 sy, 0.0 ni, 99.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 16269052 total, 229492 free, 9510740 used, 6528820 buff/cache
KiB Swap: 8257532 total, 8155244 free, 102288 used. 6463744 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2367 nmx 20 0 11.774g 8.620g 14648 S 0.3 55.6 81:26.82 java
ps aux output:
nmx 2367 4.3 55.5 12345680 9038260 ? Sl May02 81:28 /bin/java
-server -Xmx256m -Djna.nosys=true -XX:+HeapDumpOnOutOfMemoryError
-Djava.awt.headless=true -Dfile.encoding=UTF8 -Drhino.opt.level=9
-Dprofiler.enabled=true -Dstorage.diskCache.bufferSize=8192
How do I control memory usage?
Is there a DB memory leak?
Could you set the following JVM options: -XX:+UseLargePages -XX:LargePageSizeInBytes=2m.
This should solve your issue.
This page solved my issue.
In a nutshell, add this configuration to your application:
static {
    // Sync against the file system on every record operation: slows down
    // record updates but guarantees reliability on unreliable drives.
    OGlobalConfiguration.NON_TX_RECORD_UPDATE_SYNCH.setValue(true);
    OGlobalConfiguration.STORAGE_LOCK_TIMEOUT.setValue(300000);
    OGlobalConfiguration.RID_BAG_EMBEDDED_TO_SBTREEBONSAI_THRESHOLD.setValue(-1);
    OGlobalConfiguration.FILE_LOCK.setValue(false);
    OGlobalConfiguration.SBTREEBONSAI_LINKBAG_CACHE_SIZE.setValue(5000);
    OGlobalConfiguration.INDEX_IGNORE_NULL_VALUES_DEFAULT.setValue(true);
    OGlobalConfiguration.MEMORY_CHUNK_SIZE.setValue(32);

    long maxMemory = Runtime.getRuntime().maxMemory();
    long maxMemoryGB = maxMemory / 1024L / 1024L / 1024L;
    maxMemoryGB = maxMemoryGB < 1 ? 1 : maxMemoryGB;
    long cacheSizeMb = 2 * 65536 * maxMemoryGB / 1024;
    long maxDirectMemoryMb = VM.maxDirectMemory() / 1024L / 1024L; // sun.misc.VM
    String cacheProp = System.getProperty("storage.diskCache.bufferSize");
    if (cacheProp == null) {
        // Cap the disk cache at a third of direct memory, then round up to a power of two.
        long maxDirectMemoryOrientMb = maxDirectMemoryMb / 3L;
        long cachSizeMb = cacheSizeMb > maxDirectMemoryOrientMb ? maxDirectMemoryOrientMb : cacheSizeMb;
        cacheSizeMb = (long) Math.pow(2, Math.ceil(Math.log(cachSizeMb) / Math.log(2)));
        System.setProperty("storage.diskCache.bufferSize", Long.toString(cacheSizeMb));
        // The line below generates a NullPointerException in OrientDB 2.2.15-snapshot:
        // OGlobalConfiguration.DISK_CACHE_SIZE.setValue(cacheSizeMb);
    }
}