Async slave node missing WAL files on Postgres 11
I have 3 VM nodes running Master-Slave Postgres-11. They are being managed by Pacemaker.
Node Attributes:
* Node node04:
+ master-pgsqlins : 1000
+ pgsqlins-data-status : LATEST
+ pgsqlins-master-baseline : 000000C0D8000098
+ pgsqlins-status : PRI
* Node node05:
+ master-pgsqlins : -INFINITY
+ pgsqlins-data-status : STREAMING|ASYNC
+ pgsqlins-status : HS:async
* Node node06:
+ master-pgsqlins : 100
+ pgsqlins-data-status : STREAMING|SYNC
+ pgsqlins-status : HS:sync
The async node occasionally throws an error saying that a required WAL file is missing. It then stops replication and starts it again.
On the master node, WAL archiving is enabled and the WAL files are synced to another folder named wal_archive. Another process keeps removing files from that wal_archive folder. So I understand why the slave node would throw that error, but what I want to understand is how it is able to start again without the missing file.
The postgresql.conf
# Connection settings
# -------------------
listen_addresses = '*'
port = 5432
max_connections = 600
tcp_keepalives_idle = 0
tcp_keepalives_interval = 0
tcp_keepalives_count = 0
# Memory-related settings
# -----------------------
shared_buffers = 2GB # Physical memory 1/4
##DEBUG: mmap(1652555776) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
#huge_pages = try # on, off, or try
#temp_buffers = 16MB # depends on DB checklist
work_mem = 8MB # Need tuning
effective_cache_size = 4GB # Physical memory 1/2
maintenance_work_mem = 512MB
wal_buffers = 64MB
# WAL/Replication/HA settings
# --------------------
wal_level = logical
synchronous_commit = remote_write
archive_mode = on
archive_command = 'rsync -a %p /xxxxx/wal_archive/%f'
#archive_command = ':'
max_wal_senders=5
hot_standby = on
restart_after_crash = off
wal_sender_timeout = 60000
wal_receiver_status_interval = 2
max_standby_streaming_delay = -1
max_standby_archive_delay = -1
hot_standby_feedback = on
random_page_cost = 1.5
max_wal_size = 5GB
min_wal_size = 200MB
checkpoint_completion_target = 0.9
checkpoint_timeout = 30min
# Logging settings
# ----------------
log_destination = 'csvlog,syslog'
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql_%Y%m%d.log'
log_truncate_on_rotation = off
log_rotation_age = 1h
log_rotation_size = 0
log_timezone = 'Japan'
log_line_prefix = '%t [%p]: [%l-1] %h:%u#%d:[XXXPG]:CODE:%e '
log_statement = ddl
log_min_messages = info # DEBUG5
log_min_error_statement = info # DEBUG5
log_error_verbosity = default
log_checkpoints = on
log_lock_waits = on
log_temp_files = 0
log_connections = on
log_disconnections = on
log_duration = off
log_min_duration_statement = 1000
log_autovacuum_min_duration = 3000ms
track_functions = pl
track_activity_query_size = 8192
# Locale/display settings
# -----------------------
lc_messages = 'C'
lc_monetary = 'en_US.UTF-8' # ja_JP.eucJP
lc_numeric = 'en_US.UTF-8' # ja_JP.eucJP
lc_time = 'en_US.UTF-8' # ja_JP.eucJP
timezone = 'Asia/Tokyo'
bytea_output = 'escape'
# Auto vacuum settings
# -----------------------
autovacuum = on
autovacuum_max_workers = 3
autovacuum_vacuum_cost_limit = 200
#shared_preload_libraries = 'pg_stat_statements,auto_explain' <------------------check this
auto_explain.log_min_duration = 10000
auto_explain.log_analyze = on
include '/var/lib/pgsql/tmp/rep_mode.conf' # added by pgsql RA
On the async slave node, this is the recovery.conf
primary_conninfo = 'host=1xx.xx.xx.xx port=5432 user=replica application_name=node05 keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = 'rsync -a /xxxxx/wal_archive/%f %p'
recovery_target_timeline = 'latest'
standby_mode = 'on'
The logs about the error from master
2021-07-05 23:35:02.321 JST,,,28926,,60e16b42.70fe,122,,2021-07-04 17:03:14 JST,,0,LOG,00000,"checkpoint complete: wrote 2897 buffers (1.1%); 0 WAL file(s) added, 0 removed, 2 recycled; write=106.770 s, sync=0.050 s, total=106.827 s; sync files=251, longest=0.017 s, average=0.001 s; distance=20262 kB, estimate=46658 kB",,,,,,,,,""
2021-07-05 23:35:02.322 JST,,,28926,,60e16b42.70fe,123,,2021-07-04 17:03:14 JST,,0,LOG,00000,"checkpoint starting: immediate force wait",,,,,,,,,""
2021-07-05 23:35:02.347 JST,,,28926,,60e16b42.70fe,124,,2021-07-04 17:03:14 JST,,0,LOG,00000,"checkpoint complete: wrote 173 buffers (0.1%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.007 s, sync=0.012 s, total=0.026 s; sync files=43, longest=0.005 s, average=0.001 s; distance=14410 kB, estimate=43434 kB",,,,,,,,,""
2021-07-05 23:35:02.348 JST,"replica","",3451,"1xx.xx.xx.xxx:45120",60e16bfc.d7b,3,"streaming C1/97C3E000",2021-07-04 17:06:20 JST,116/0,0,ERROR,XX000,"requested WAL segment 00000001000000C100000097 has already been removed",,,,,,,,,"node05"
2021-07-05 23:35:02.361 JST,"replica","",3451,"1xx.xx.xx.xxx:45120",60e16bfc.d7b,4,"idle",2021-07-04 17:06:20 JST,,0,LOG,00000,"disconnection: session time: 30:28:41.550 user=replica database= host=172.17.48.141 port=45120",,,,,,,,,"node05"
2021-07-05 23:35:02.399 JST,,,24896,"1xx.xx.xx.xxx:49278",60e31896.6140,1,"",2021-07-05 23:35:02 JST,,0,LOG,00000,"connection received: host=1xx.xx.xx.xxx port=49278",,,,,,,,,""
2021-07-05 23:35:02.401 JST,"postgres","postgres",24851,"[local]",60e31896.6113,3,"idle",2021-07-05 23:35:02 JST,,0,LOG,00000,"disconnection: session time: 0:00:00.251 user=postgres database=postgres host=[local]",,,,,,,,,"postgres#node04"
2021-07-05 23:35:02.403 JST,"replica","",24896,"1xx.xx.xx.xxx:49278",60e31896.6140,2,"authentication",2021-07-05 23:35:02 JST,116/72,0,LOG,00000,"replication connection authorized: user=replica",,,,,,,,,""
The logs about the error from async slave node
2021-07-05 23:35:02.359 JST,,,2541,,60e16bfc.9ed,2,,2021-07-04 17:06:20 JST,,0,FATAL,XX000,"could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000C100000097 has already been removed",,,,,,,,,""
2021-07-05 23:35:02.408 JST,,,4703,,60e31896.125f,1,,2021-07-05 23:35:02 JST,,0,LOG,00000,"started streaming WAL from primary at C1/98000000 on timeline 1",,,,,,,,,""
2021-07-05 23:35:03.318 JST,,,4835,"[local]",60e31897.12e3,1,"",2021-07-05 23:35:03 JST,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
The sync slave node doesn't throw this error, only the async slave node does, and even that recovers without any manual intervention. Is there a way to avoid this error other than no longer removing the archived WAL files from the wal_archive folder every 2 minutes?
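The segment name in the error message can be derived from the LSN shown in the master's log ("streaming C1/97C3E000"). A minimal sketch of that mapping, assuming the default 16 MB WAL segment size (the question does not state the segment size) and timeline 1 as reported in the standby log; the helper name wal_segment_name is hypothetical:

```python
# Assumption: default WAL segment size of 16 MB.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def wal_segment_name(lsn: str, timeline: int = 1) -> str:
    """Map an LSN such as 'C1/97C3E000' to its WAL segment file name."""
    high_hex, low_hex = lsn.split("/")
    high = int(high_hex, 16)            # upper 32 bits of the LSN
    low = int(low_hex, 16)              # lower 32 bits of the LSN
    segment = low // WAL_SEGMENT_SIZE   # segment index within this 4 GB range
    return f"{timeline:08X}{high:08X}{segment:08X}"


# Position the walsender was at when the error was raised.
print(wal_segment_name("C1/97C3E000"))  # 00000001000000C100000097
# Position at which the standby resumed streaming.
print(wal_segment_name("C1/98000000"))  # 00000001000000C100000098
```

The standby resumed at the first byte of the next segment, so the rest of segment 00000001000000C100000097 must have reached it by some route other than the stream. With restore_command configured, the archive is one plausible source, although the log excerpts shown here do not confirm it.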
Related
Postgresql-12 too many wal files
We have a PostgreSQL 12 database in production. Today we realized that disk usage on the master server has increased since last month (last month: 4.4TB out of 14TB; now: 9.8TB out of 14TB). When I run the ncdu command, our actual PostgreSQL data is just 3.4TB; the other 6.4TB is used by WAL files alone. We have a standby server as well. WAL archiving is enabled on our deployment, and we continuously store WAL files on another backup server, next to the base backup of our DB. Are all these WAL files still necessary after we have backed them up? If not, what should we do to free our disk space from unnecessary WAL files?
Our server's specs:
CentOS 7
48 Core Intel Xeon Gold CPU
256 GB Memory
4*7,6TB SSD RAID1+0 (Total ~14TB)
Here is our postgresql.conf:
listen_addresses = '*'
max_connections = 500
superuser_reserved_connections = 10
password_encryption = md5
shared_buffers = 64GB
max_prepared_transactions = 100
work_mem = 83886kB
maintenance_work_mem = 2GB
max_stack_depth = 2MB
dynamic_shared_memory_type = posix
bgwriter_delay = 100ms
bgwriter_lru_maxpages = 10000
bgwriter_lru_multiplier = 10.0
effective_io_concurrency = 200
max_worker_processes = 48
max_parallel_maintenance_workers = 4
max_parallel_workers_per_gather = 4
max_parallel_workers = 48
wal_level = replica
wal_sync_method = fdatasync
wal_compression = on
wal_log_hints = on
wal_buffers = 32MB
commit_delay = 0
max_wal_size = 16GB
min_wal_size = 4GB
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'test ! -f /pgdata/wal_backup/%f && cp %p /pgdata/wal_backup/%f && /var/lib/pgsql/backup_wal.sh'
archive_timeout = 3600
random_page_cost = 1.1
effective_cache_size = 192GB
default_statistics_target = 100
log_destination = 'stderr'
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%a.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 0
log_min_duration_statement = 4000
log_checkpoints = on
log_line_prefix = '<user=%u db=%d host=%h pid=%p app=%a time=%m > '
log_lock_waits = on
log_temp_files = 0
log_timezone = 'Europe/Istanbul'
cluster_name = 'pg12/primary'
track_io_timing = on
track_functions = all
log_autovacuum_min_duration = 0
statement_timeout = 3600000
datestyle = 'iso, mdy'
timezone = 'Europe/Istanbul'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
shared_preload_libraries = 'pg_stat_statements'
max_locks_per_transaction = 128
pg_stat_statements.max = 10000
pg_stat_statements.track = all
pg_stat_statements.track_utility = on
pg_stat_statements.save = on
Regarding "We have a standby server as well. WAL archiving is enabled on our deployment and we continuously store WAL files on another backup server": one possibility is that there is an unused replication slot (to be in use, a slot has to be named in primary_slot_name on the standby). Consult pg_replication_slots. The other possibility is that your archiver is failing. Consult pg_stat_archiver.
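The two checks suggested in this answer can be sketched as follows. The column names come from the pg_replication_slots and pg_stat_archiver system views; the helper wal_retention_suspects and its decision rules are assumptions made for illustration, not an official diagnostic, and the rows are passed in as plain dicts so that the logic can be read without a database connection.

```python
# Queries one might run on the primary to collect the inputs.
SLOTS_SQL = "SELECT slot_name, active, restart_lsn FROM pg_replication_slots"
ARCHIVER_SQL = (
    "SELECT archived_count, failed_count, last_archived_time, last_failed_time "
    "FROM pg_stat_archiver"
)


def wal_retention_suspects(slots, archiver):
    """Return reasons why pg_wal may be growing (hypothetical heuristic).

    slots    -- list of dicts, one per row returned by SLOTS_SQL
    archiver -- dict holding the single row returned by ARCHIVER_SQL
    """
    reasons = []
    for slot in slots:
        # An inactive slot still holds back WAL from restart_lsn onward.
        if not slot["active"] and slot["restart_lsn"] is not None:
            reasons.append(f"inactive replication slot {slot['slot_name']!r}")
    failed = archiver["last_failed_time"]
    archived = archiver["last_archived_time"]
    # A failure more recent than the last success suggests a stuck archiver.
    if failed is not None and (archived is None or failed > archived):
        reasons.append("archiver is failing")
    return reasons
```

An empty result means neither of the two causes named in the answer shows up in the views; it does not rule out other causes.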
pg_wal is taking a lot of disk space
I have installed Postgres in a containerized environment using docker-compose, with the Docker image crunchydata/crunchy-postgres-gis:centos7-11.5-2.4.2. Everything was running fine until I realized that PG_DIR/pg_wal is taking a lot of disk space. I don't want to run pg_archivecleanup by hand every time, nor in a cron job; I want to configure Postgres to do that automatically. What is the correct configuration for that?
This is my postgresql.conf file.
listen_addresses = '*' # what IP address(es) to listen on;
port = 5432 # (change requires restart)
unix_socket_directories = '/tmp' # comma-separated list of directories
unix_socket_permissions = 0777 # begin with 0 to use octal notation
temp_buffers = 8MB # min 800kB
max_connections = 400
shared_buffers = 1536MB
effective_cache_size = 4608MB
maintenance_work_mem = 384MB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 4MB
min_wal_size = 1GB
max_wal_size = 2GB
max_worker_processes = 4
max_parallel_workers_per_gather = 2
max_parallel_workers = 4
unix_socket_directories = '/tmp' # comma-separated list of directories
unix_socket_permissions = 0777 # begin with 0 to use octal notation
shared_preload_libraries = 'pg_stat_statements.so' # (change requires restart)
#------------------------------------------------------------------------------
# WRITE AHEAD LOG
#------------------------------------------------------------------------------
wal_level = hot_standby # minimal, archive, or hot_standby
max_wal_senders = 6 # max number of walsender processes
wal_keep_segments = 400 # in logfile segments, 16MB each; 0 disables
hot_standby = on # "on" allows queries during recovery
max_standby_archive_delay = 30s # max delay before canceling queries
max_standby_streaming_delay = 30s # max delay before canceling queries
wal_receiver_status_interval = 10s # send replies at least this often
archive_mode = on # enables archiving; off, on, or always
# (change requires restart)
archive_command = 'pgbackrest archive-push %p' # command to use to archive a logfile segment
# placeholders: %p = path of file to archive
# %f = file name only
# e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
archive_timeout = 60 # force a logfile segment switch after this
# number of seconds; 0 disables
#------------------------------------------------------------------------------
# ERROR REPORTING AND LOGGING
#------------------------------------------------------------------------------
log_destination = 'stderr' # Valid values are combinations of
logging_collector = on # Enable capturing of stderr and csvlog
log_directory = 'pg_log' # directory where log files are written,
log_filename = 'postgresql-%a.log' # log file name pattern,
log_truncate_on_rotation = on # If on, an existing log file with the
log_rotation_age = 1d # Automatic rotation of logfiles will
log_rotation_size = 0 # Automatic rotation of logfiles will
log_min_duration_statement = 0 # -1 is disabled, 0 logs all statements
log_checkpoints = on
log_connections = on
log_disconnections = on
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h'
log_lock_waits = on # log lock waits >= deadlock_timeout
log_timezone = 'US/Eastern'
log_autovacuum_min_duration = 0 # -1 disables, 0 logs all actions and
datestyle = 'iso, mdy'
timezone = 'US/Eastern'
lc_messages = 'C' # locale for system error message
lc_monetary = 'C' # locale for monetary formatting
lc_numeric = 'C' # locale for number formatting
lc_time = 'C' # locale for time formatting
default_text_search_config = 'pg_catalog.english'
Thanks
You haven't shown us any evidence that pgbackrest has anything to do with this. If it is failing, you should see messages about that in the server's log file. If it is succeeding, then it should be taking up space in the archive, wherever that is, not in pg_wal. But wal_keep_segments = 400 will lead to over 6.25GB of pg_wal being retained. I don't know if that constitutes "a lot" or not. pg_archivecleanup isn't for cleaning up pg_wal, it is for cleaning up the archive.
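The 6.25GB figure follows directly from the setting and the 16MB segment size noted in the config comment. A small sketch of the arithmetic; the function name retained_wal_gb is hypothetical:

```python
SEGMENT_MB = 16  # size of one WAL segment, per the comment in the config


def retained_wal_gb(wal_keep_segments: int) -> float:
    """Lower bound on pg_wal size (GB, with 1 GB = 1024 MB) kept for standbys."""
    return wal_keep_segments * SEGMENT_MB / 1024


print(retained_wal_gb(400))  # 6.25
```

This is only a floor: segments still needed for crash recovery or not yet archived are kept on top of it, which is why the answer says "over" 6.25GB.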
PostgreSQL 9.5 Replication Lag running on EC2
I have a series of PostgreSQL 9.5 servers running on r4.16xlarge instances and Amazon Linux 1 that started experiencing replication lag of several seconds starting this week. The configurations were changed but the old configs weren't saved, so I'm not sure what the previous settings were. Here are the custom values:
max_connections = 1500
shared_buffers = 128GB
effective_cache_size = 132GB
maintenance_work_mem = 128MB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
#effective_io_concurrency = 10
work_mem = 128MB
min_wal_size = 1GB
max_wal_size = 2GB
max_worker_processes = 64
synchronous_commit = off
The drive layout is as follows: 4 disks for the xlog drive and 10 for the regular partition, all gp2 disk type.
Personalities : [raid0]
md126 : active raid0 xvdo[3] xvdn[2] xvdm[1] xvdl[0]
      419428352 blocks super 1.2 512k chunks
md127 : active raid0 xvdk[9] xvdj[8] xvdi[7] xvdh[6] xvdg[5] xvdf[4] xvde[3] xvdd[2] xvdc[1] xvdb[0]
      2097146880 blocks super 1.2 512k chunks
The master server is a smaller c4.8xlarge instance with this setup:
max_connections = 1500
shared_buffers = 15GB
effective_cache_size = 45GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 16
work_mem = 26MB
min_wal_size = 1GB
max_wal_size = 2GB
max_worker_processes = 36
With this drive layout:
Personalities : [raid0]
md126 : active raid0 xvdd[2] xvdc[1] xvdb[0] xvde[3]
      419428352 blocks super 1.2 512k chunks
md127 : active raid0 xvdr[12] xvdg[1] xvdo[9] xvdl[6] xvdh[2] xvdf[0] xvdp[10] xvdu[15] xvdm[7] xvdj[4] xvdn[8] xvdk[5] xvdi[3] xvds[13] xvdt[14] xvdq[11]
      3355435008 blocks super 1.2 512k chunks
I guess I'm looking for optimal settings for these two instance types so I can eliminate the replication lag. None of the servers are what I would call heavily loaded.
With further digging I found that the following setting fixed the replication lag:
hot_standby_feedback = on
This may cause some bloat on the master, because vacuum there is held back while standby queries still need old row versions, but now the backlog is gone.
Postgres performance degrading as cache gets consumed
I migrated a Postgres 9.1 database doing 300K transactions/hour
from a server with Red Hat OS, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz / 16 Core, 64 GB RAM, 240 GB x 4 Intel SSD
to Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz / 56 Core, 128 GB RAM, 2TB NVMe PCI SSD, RANDOM READ 450000 iops, RANDOM WRITE 56000 iops, CentOS 6.9.
Over a period of time the server slows down and the amount of data processed is reduced. If I clear the cache manually (sync; echo 3 > /proc/sys/vm/drop_caches), data processing resumes at the maximum level. After some time under load, the performance deteriorates again in terms of the amount of data processed. The cache memory shows that it has been fully consumed.
pg configuration:
datestyle = 'redwood,show_time'
db_dialect = 'redwood'
default_text_search_config = 'pg_catalog.english'
edb_dynatune = 90
edb_redwood_date = on
edb_redwood_strings = on
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
shared_preload_libraries = '$libdir/dbms_pipe,$libdir/edb_gen,$libdir/plugins/plugin_debugger,$libdir/plugins/plugin_spl_debugger'
timed_statistics = off
archive_command = 'rsync -a %p slave:/opt/PostgresPlus/9.1AS/wals/%f'
archive_mode = on
listen_addresses = '*'
log_destination = 'syslog'
syslog_facility = 'LOCAL0'
logging_collector = on
log_line_prefix = '%t'
max_wal_senders = 4
port = 6432
wal_keep_segments = 128
wal_level = hot_standby
temp_buffers='50MB'
constraint_exclusion = on
autovacuum = on
enable_bitmapscan = off
max_connections = 200
shared_buffers = 32GB
effective_cache_size = 96GB
work_mem = 167772kB
maintenance_work_mem = 2GB
checkpoint_segments = 64
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
Master postgres initdb failed while deploying HAWQ 2.0 on Hortonworks
I tried to deploy HAWQ 2.0 but could not get the HAWQ Master to run. Below is the error log:
[gpadmin@hdps31hwxworker2 hawqAdminLogs]$ cat ~/hawqAdminLogs/hawq_init_20160805.log
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Prepare to do 'hawq init'
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-You can find log in:
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-/home/gpadmin/hawqAdminLogs/hawq_init_20160805.log
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-GPHOME is set to:
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-/usr/local/hawq/.
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[DEBUG]:-Current user is 'gpadmin'
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[DEBUG]:-Parsing config file:
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[DEBUG]:-/usr/local/hawq/./etc/hawq-site.xml
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Init hawq with args: ['init', 'master']
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Check: hawq_master_address_host is set
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Check: hawq_master_address_port is set
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Check: hawq_master_directory is set
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Check: hawq_segment_directory is set
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Check: hawq_segment_address_port is set
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Check: hawq_dfs_url is set
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Check: hawq_master_temp_directory is set
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Check: hawq_segment_temp_directory is set
20160805:23:00:11:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Check if hdfs path is available
20160805:23:00:11:050348 hawq_init:hdps31hwxworker2:gpadmin-[DEBUG]:-Check hdfs: /usr/local/hawq/./bin/gpcheckhdfs hdfs hdpsm2demo4.demo.local:8020/hawq_default off
20160805:23:00:11:050348 hawq_init:hdps31hwxworker2:gpadmin-[WARNING]:-2016-08-05 23:00:11.338621, p50546, th139769637427168, WARNING the number of nodes in pipeline is 1 [172.17.15.31(172.17.15.31)], is less than the expected number of replica 3 for block [block pool ID: isi_hdfs_pool block ID 4341187780_1000] file /hawq_default/testFile
20160805:23:00:11:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-1 segment hosts defined
20160805:23:00:11:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Set default_hash_table_bucket_number as: 6
20160805:23:00:17:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Start to init master
The files belonging to this database system will be owned by user "gpadmin".
This user must also own the server process.
The database cluster will be initialized with locale en_US.utf8.
fixing permissions on existing directory /data/hawq/master ... ok
creating subdirectories ... ok
selecting default max_connections ... 1280
selecting default shared_buffers/max_fsm_pages ... 125MB/200000
creating configuration files ... ok
creating template1 database in /data/hawq/master/base/1 ... 2016-08-05 22:00:18.554441 GMT,,,p50803,th-1212598144,,,,0,,,seg-10000,,,,,"WARNING","01000","""fsync"": can not be set by the user and will be ignored.",,,,,,,,"set_config_option","guc.c",10023,
ok
loading file-system persistent tables for template1 ... 2016-08-05 22:00:20.023594 GMT,,,p50835,th38852736,,,,0,,,seg-10000,,,,,"WARNING","01000","""fsync"": can not be set by the user and will be ignored.",,,,,,,,"set_config_option","guc.c",10023,
2016-08-05 23:00:20.126221 BST,,,p50835,th38852736,,,,0,,,seg-10000,,,,,"FATAL","XX000","could not create shared memory segment: Invalid argument (pg_shmem.c:183)","Failed system call was shmget(key=1, size=506213024, 03600).","This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter. You can either reduce the request size or reconfigure the kernel with larger SHMMAX. To reduce the request size (currently 506213024 bytes), reduce PostgreSQL's shared_buffers parameter (currently 4000) and/or its max_connections parameter (currently 3000). If the request size is already small, it's possible that it is less than your kernel's SHMMIN parameter, in which case raising the request size or reconfiguring SHMMIN is called for. The PostgreSQL documentation contains more information about shared memory configuration.",,,,,,"InternalIpcMemoryCreate","pg_shmem.c",183,
1 0x87463a postgres errstart + 0x22a
2 0x74c5e6 postgres <symbol not found> + 0x74c5e6
3 0x74c7cd postgres PGSharedMemoryCreate + 0x3d
4 0x7976b6 postgres CreateSharedMemoryAndSemaphores + 0x336
5 0x880489 postgres BaseInit + 0x19
6 0x7b03bc postgres PostgresMain + 0xdbc
7 0x6c07d5 postgres main + 0x535
8 0x3c0861ed1d libc.so.6 __libc_start_main + 0xfd
9 0x4a14e9 postgres <symbol not found> + 0x4a14e9
child process exited with exit code 1
initdb: removing contents of data directory "/data/hawq/master"
Master postgres initdb failed
20160805:23:00:20:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Master postgres initdb failed
20160805:23:00:20:050348 hawq_init:hdps31hwxworker2:gpadmin-[ERROR]:-Master init failed, exit
This is in Advanced gpcheck:
[global]
configfile_version = 4
[linux.mount]
mount.points = /
[linux.sysctl]
sysctl.kernel.shmmax = 500000000
sysctl.kernel.shmmni = 4096
sysctl.kernel.shmall = 400000000
sysctl.kernel.sem = 250 512000 100 2048
sysctl.kernel.sysrq = 1
sysctl.kernel.core_uses_pid = 1
sysctl.kernel.msgmnb = 65536
sysctl.kernel.msgmax = 65536
sysctl.kernel.msgmni = 2048
sysctl.net.ipv4.tcp_syncookies = 0
sysctl.net.ipv4.ip_forward = 0
sysctl.net.ipv4.conf.default.accept_source_route = 0
sysctl.net.ipv4.tcp_tw_recycle = 1
sysctl.net.ipv4.tcp_max_syn_backlog = 200000
sysctl.net.ipv4.conf.all.arp_filter = 1
sysctl.net.ipv4.ip_local_port_range = 1281 65535
sysctl.net.core.netdev_max_backlog = 200000
sysctl.vm.overcommit_memory = 2
sysctl.fs.nr_open = 2000000
sysctl.kernel.threads-max = 798720
sysctl.kernel.pid_max = 798720
# increase network
sysctl.net.core.rmem_max = 2097152
sysctl.net.core.wmem_max = 2097152
[linux.limits]
soft.nofile = 2900000
hard.nofile = 2900000
soft.nproc = 131072
hard.nproc = 131072
[linux.diskusage]
diskusage.monitor.mounts = /
diskusage.monitor.usagemax = 90%
[hdfs]
dfs.mem.namenode.heap = 40960
dfs.mem.datanode.heap = 6144
# in hdfs-site.xml
dfs.support.append = true
dfs.client.enable.read.from.local = true
dfs.block.local-path-access.user = gpadmin
dfs.datanode.max.transfer.threads = 40960
dfs.client.socket-timeout = 300000000
dfs.datanode.socket.write.timeout = 7200000
dfs.namenode.handler.count = 60
ipc.server.handler.queue.size = 3300
dfs.datanode.handler.count = 60
ipc.client.connection.maxidletime = 3600000
dfs.namenode.accesstime.precision = -1
It looks like it is complaining about memory, but I can't seem to find the parameters to change. Where are shared_buffers and max_connections set? How do I fix this error in general? Thanks.
Your memory settings are too low to initialize the database. Don't bother with shared_buffers or max_connections. You have:
kernel.shmmax = 500000000
kernel.shmall = 400000000
and it should be:
kernel.shmmax = 1000000000
kernel.shmall = 4000000000
Reference: http://hdb.docs.pivotal.io/hdb/install/install-cli.html
I would also make sure you have enough swap configured on your nodes based on the amount of RAM you have.
Reference: http://hdb.docs.pivotal.io/20/requirements/system-requirements.html
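The numbers in the error log show why the current limit fails and the suggested one would not. A minimal check, using only values quoted in the question and in this answer; the helper name fits is hypothetical:

```python
requested = 506_213_024          # bytes, size passed to shmget() in the FATAL line
shmmax_current = 500_000_000     # kernel.shmmax from the gpcheck listing
shmmax_proposed = 1_000_000_000  # kernel.shmmax suggested in this answer


def fits(request_bytes: int, shmmax_bytes: int) -> bool:
    """A single shared memory segment may not be larger than SHMMAX."""
    return request_bytes <= shmmax_bytes


print(fits(requested, shmmax_current))   # False
print(fits(requested, shmmax_proposed))  # True
print(requested - shmmax_current)        # 6213024 bytes over the current limit
```

The request is only about 6 MB over the current limit, which is why the error message also offers reducing shared_buffers or max_connections as an alternative to raising SHMMAX.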
shared_buffers sets the amount of memory a HAWQ segment instance uses for shared memory buffers. This setting must be at least 128KB and at least 16KB times max_connections. When setting shared_buffers, the values of the operating system parameters SHMMAX or SHMALL might also need to be adjusted. The value of SHMMAX must be greater than this value:
shared_buffers + other_seg_shmem
You can set the parameter values using the "hawq config" utility:
hawq config -s shared_buffers (shows the current value)
hawq config -c shared_buffers -v value
Please let me know how that goes!