Barman geo-redundancy configuration error - postgresql

I am trying to configure geo-redundancy in my Barman setup, but I get an error when I try to copy from the primary to the secondary backup server. My configuration is:
SERVER 1
Ubuntu 18 on VirtualBox
Postgres 12
192.168.0.103
/etc/barman.conf
backup_method = rsync
archiver = on
compression = gzip
reuse_backup = link
backup_options = concurrent_backup
parallel_jobs = 2
network_compression = true
basebackup_retry_times = 20
basebackup_retry_sleep = 120
/etc/barman.d/comauto_20200921_95
[comauto_20200921_95]
description = "Comauto Local Postgres 9.5 - 21/09/2020"
conninfo = host=192.168.0.102 user=barman dbname=postgres
ssh_command = ssh postgres@192.168.0.102
retention_policy = RECOVERY WINDOW OF 2 WEEKS
backup_options = exclusive_backup
SERVER 2
Ubuntu 18 on VirtualBox
Postgres 9.5
ifconfig = 192.168.0.102
/etc/barman.conf
backup_method = rsync
archiver = on
compression = gzip
reuse_backup = link
backup_options = concurrent_backup
parallel_jobs = 2
network_compression = true
basebackup_retry_times = 20
basebackup_retry_sleep = 120
; the only difference
primary_ssh_command = barman@192.168.0.103
/etc/barman.d/comauto_20200921_95
[comauto_20200921_95]
description = "Comauto Local Postgres 9.5 - 21/09/2020"
conninfo = host=192.168.0.102 user=barman dbname=postgres
ssh_command = ssh postgres@192.168.0.102
retention_policy = RECOVERY WINDOW OF 2 WEEKS
backup_options = exclusive_backup
On server 1:
sudo su barman
ssh barman@192.168.0.102 -C true
# OK
barman check comauto_20200921_95
# All OK
barman backup comauto_20200921_95
# OK
barman list-backup comauto_20200921_95
# comauto_20200921_95 20201111T172643 - Wed Nov 11 17:26:50 2020 - Size: 6.2 GiB - WAL Size: 0 B
# comauto_20200921_95 20201111T114656 - Wed Nov 11 11:47:08 2020 - Size: 6.2 GiB - WAL Size: 79.9 KiB
# comauto_20200921_95 20201111T112906 - Wed Nov 11 11:33:10 2020 - Size: 6.2 GiB - WAL Size: 96.4 KiB
The error happens here
On server 2:
sudo su barman
ssh barman@192.168.0.103 -C true
# OK
barman check comauto_20200921_95
# WAL archive: FAILED
# ssh: FAILED (Connection failed using 'barman@192.168.0.103 -o BatchMode=yes -o StrictHostKeyChecking=no' return code 127)
barman cron
# ERROR: Failed to retrieve the primary node status: sync-info execution on remote primary server comauto_20200921_95 failed: /bin/sh: 1: barman@192.168.0.103: not found
barman list-backup comauto_20200921_95
#

One obvious issue is that primary_ssh_command is missing the actual ssh command; it should presumably be:
; the only difference
primary_ssh_command = ssh barman@192.168.0.103
See example in the documentation here: https://docs.pgbarman.org/#configuration-1
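With primary_ssh_command corrected on the passive server, re-running the checks should confirm the fix. A minimal sketch, run as the barman user on server 2 (exact output will vary):
barman check comauto_20200921_95         # the ssh check against the primary should now pass
barman cron                              # on a passive node this schedules the sync of completed backups and WALs from the primary
barman list-backup comauto_20200921_95   # backups taken on server 1 should start to appear here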


Why Does Podman Report "Not enough IDs available in namespace" with different UIDs?

Facts:
Rootless podman works perfectly for uid 1480
Rootless podman fails for uid 2088
CentOS 7
Kernel 3.10.0-1062.1.2.el7.x86_64
podman version 1.4.4
Almost the entire environment has been removed between the two
The filesystem for /tmp is xfs
The capsh output of the two users is identical but for uid / username
Both UIDs have identical entries in /etc/sub{u,g}id files
The $HOME/.config/containers/storage.conf is the default and is identical between the two with the exception of the uids. The storage.conf is below for reference.
I wrote the following shell script to demonstrate just how similar the two users' environments are:
#!/bin/sh
for i in 1480 2088; do
sudo chroot --userspec "$i":10 / env -i /bin/sh <<EOF
echo -------------- $i ----------------
/usr/sbin/capsh --print
grep "$i" /etc/subuid /etc/subgid
mkdir /tmp/"$i"
HOME=/tmp/"$i"
export HOME
podman --root=/tmp/"$i" info > /tmp/podman."$i"
podman run --rm --root=/tmp/"$i" docker.io/library/busybox printf "\tCOMPLETE\n"
echo -----------END $i END-------------
EOF
sudo rm -rf /tmp/"$i"
done
Here's the output of the script:
$ sh /tmp/podman-fail.sh
[sudo] password for functional:
-------------- 1480 ----------------
Current: =
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,35,36
Securebits: 00/0x0/1'b0
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
uid=1480(functional)
gid=10(wheel)
groups=0(root)
/etc/subuid:1480:100000:65536
/etc/subgid:1480:100000:65536
Trying to pull docker.io/library/busybox...Getting image source signatures
Copying blob 7c9d20b9b6cd done
Copying config 19485c79a9 done
Writing manifest to image destination
Storing signatures
ERRO[0003] could not find slirp4netns, the network namespace won't be configured: exec: "slirp4netns": executable file not found in $PATH
COMPLETE
-----------END 1480 END-------------
-------------- 2088 ----------------
Current: =
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,35,36
Securebits: 00/0x0/1'b0
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
uid=2088(broken)
gid=10(wheel)
groups=0(root)
/etc/subuid:2088:100000:65536
/etc/subgid:2088:100000:65536
Trying to pull docker.io/library/busybox...Getting image source signatures
Copying blob 7c9d20b9b6cd done
Copying config 19485c79a9 done
Writing manifest to image destination
Storing signatures
ERRO[0003] Error while applying layer: ApplyLayer exit status 1 stdout: stderr: there might not be enough IDs available in the namespace (requested 65534:65534 for /home): lchown /home: invalid argument
ERRO[0003] Error pulling image ref //busybox:latest: Error committing the finished image: error adding layer with blob "sha256:7c9d20b9b6cda1c58bc4f9d6c401386786f584437abbe87e58910f8a9a15386b": ApplyLayer exit status 1 stdout: stderr: there might not be enough IDs available in the namespace (requested 65534:65534 for /home): lchown /home: invalid argument
Failed
Error: unable to pull docker.io/library/busybox: unable to pull image: Error committing the finished image: error adding layer with blob "sha256:7c9d20b9b6cda1c58bc4f9d6c401386786f584437abbe87e58910f8a9a15386b": ApplyLayer exit status 1 stdout: stderr: there might not be enough IDs available in the namespace (requested 65534:65534 for /home): lchown /home: invalid argument
Here's the storage.conf for the 1480 uid. It's identical except s/1480/2088/:
[storage]
driver = "vfs"
runroot = "/run/user/1480"
graphroot = "/tmp/1480/.local/share/containers/storage"
[storage.options]
size = ""
remap-uids = ""
remap-gids = ""
remap-user = ""
remap-group = ""
ostree_repo = ""
skip_mount_home = ""
mount_program = ""
mountopt = ""
[storage.options.thinpool]
autoextend_percent = ""
autoextend_threshold = ""
basesize = ""
blocksize = ""
directlvm_device = ""
directlvm_device_force = ""
fs = ""
log_level = ""
min_free_space = ""
mkfsarg = ""
mountopt = ""
use_deferred_deletion = ""
use_deferred_removal = ""
xfs_nospace_max_retries = ""
You can see there's basically no difference between the two podman info outputs for the users:
$ diff -u /tmp/podman.1480 /tmp/podman.2088
--- /tmp/podman.1480 2019-10-17 22:41:21.991573733 -0400
+++ /tmp/podman.2088 2019-10-17 22:41:26.182584536 -0400
@@ -7,7 +7,7 @@
Distribution:
distribution: '"centos"'
version: "7"
- MemFree: 45654056960
+ MemFree: 45652697088
MemTotal: 67306323968
OCIRuntime:
package: containerd.io-1.2.6-3.3.el7.x86_64
@@ -24,7 +24,7 @@
kernel: 3.10.0-1062.1.2.el7.x86_64
os: linux
rootless: true
- uptime: 30h 17m 50.23s (Approximately 1.25 days)
+ uptime: 30h 17m 54.42s (Approximately 1.25 days)
registries:
blocked: null
insecure: null
@@ -35,14 +35,14 @@
- quay.io
- registry.centos.org
store:
- ConfigFile: /tmp/1480/.config/containers/storage.conf
+ ConfigFile: /tmp/2088/.config/containers/storage.conf
ContainerStore:
number: 0
GraphDriverName: vfs
GraphOptions: null
- GraphRoot: /tmp/1480
+ GraphRoot: /tmp/2088
GraphStatus: {}
ImageStore:
number: 0
- RunRoot: /run/user/1480
- VolumePath: /tmp/1480/volumes
+ RunRoot: /run/user/2088
+ VolumePath: /tmp/2088/volumes
I refuse to believe there's an if (2088 == uid) { abort(); } or similar nonsense somewhere in podman's source code. What am I missing?
Does podman system migrate fix "there might not be enough IDs available in the namespace" for you?
It did for me and others:
https://github.com/containers/libpod/issues/3421
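If you want to try it, a minimal sketch, run as the affected user (UID 2088, shown as "broken" in the output above):
podman system migrate
podman run --rm docker.io/library/busybox printf '\tCOMPLETE\n'
As far as I understand, podman system migrate stops the user's running containers and forces the rootless user namespace to be re-created with the current /etc/sub{u,g}id mappings, which is why the follow-up busybox pull shows whether the lchown error is gone.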
AFAICT, sub-UID and GID ranges should not overlap between users. For reference, here is what the useradd manpage has to say about the matter:
SUB_GID_MIN (number), SUB_GID_MAX (number), SUB_GID_COUNT
(number)
If /etc/subuid exists, the commands useradd and newusers
(unless the user already have subordinate group IDs)
allocate SUB_GID_COUNT unused group IDs from the range
SUB_GID_MIN to SUB_GID_MAX for each new user.
The default values for SUB_GID_MIN, SUB_GID_MAX,
SUB_GID_COUNT are respectively 100000, 600100000 and 65536.
SUB_UID_MIN (number), SUB_UID_MAX (number), SUB_UID_COUNT
(number)
If /etc/subuid exists, the commands useradd and newusers
(unless the user already have subordinate user IDs) allocate
SUB_UID_COUNT unused user IDs from the range SUB_UID_MIN to
SUB_UID_MAX for each new user.
The default values for SUB_UID_MIN, SUB_UID_MAX,
SUB_UID_COUNT are respectively 100000, 600100000 and 65536.
The key word is unused.
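A quick way to look for overlapping ranges is to print each entry's start and end and sort by the start offset; any two adjacent lines whose ranges intersect indicate users sharing subordinate IDs. A minimal sketch, not from the original post:
awk -F: '{ printf "%s %d %d\n", $1, $2, $2 + $3 - 1 }' /etc/subuid | sort -k2,2n
awk -F: '{ printf "%s %d %d\n", $1, $2, $2 + $3 - 1 }' /etc/subgid | sort -k2,2n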
CentOS 7.6 does not support rootless buildah by default - see https://github.com/containers/buildah/pull/1166 and https://www.redhat.com/en/blog/preview-running-containers-without-root-rhel-76

Audit daemon does not take rules from audit.rules

I am unable to add rules to the audit daemon using /etc/audit/audit.rules.
Every time I add rules using auditctl, they get removed on reboot or audit daemon restart. I have attached /etc/audit/audit.rules and /etc/audit/auditd.conf.
$ cat /etc/audit/auditd.conf
#
# This file controls the configuration of the audit daemon
#
local_events = yes
write_logs = yes
log_file = /NU_Application/audit.log
log_group = root
log_format = RAW
flush = INCREMENTAL_ASYNC
freq = 50
max_log_file = 8
num_logs = 5
priority_boost = 4
disp_qos = lossy
dispatcher = /sbin/audispd
name_format = NONE
##name = mydomain
max_log_file_action = ROTATE
space_left = 75
space_left_action = SYSLOG
verify_email = yes
action_mail_acct = root
admin_space_left = 50
admin_space_left_action = SUSPEND
disk_full_action = SUSPEND
disk_error_action = SUSPEND
use_libwrap = yes
##tcp_listen_port = 22
tcp_listen_queue = 5
tcp_max_per_addr = 1
##tcp_client_ports = 1024-65535
tcp_client_max_idle = 0
enable_krb5 = no
krb5_principal = auditd
##krb5_key_file = /etc/audit/audit.key
distribute_network = no
$ cat /etc/audit/audit.rules
## First rule - delete all
## Increase the buffers to survive stress events.
## Make this bigger for busy systems
-b 8192
## This determine how long to wait in burst of events
--backlog_wait_time 0
## Set failure mode to syslog
-f 1
-w /var/log/lastlog -p wa
root@iWave-G22M:~# auditctl
When I restart the audit daemon (i.e. /etc/init.d/auditd restart) and try to list the rules, I get the message "No rules".
$ /etc/init.d/auditd restart
Restarting audit daemon auditd
type=1305 audit(1558188111.980:3): audit_pid=0 old=1148 auid=4294967295 ses=4294967295
res=1
type=1305 audit(1558188112.010:4): audit_enabled=1 old=1 auid=4294967295 ses=4294967295
res=1
type=1305 audit(1558188112.020:5): audit_pid=30342 old=0 auid=4294967295 ses=4294967295
res=1
1
$ auditctl -l
No rules
OS INFO
$ uname -a
Linux iWave-G22M 3.10.31-ltsi-svn743 #5 SMP PREEMPT Mon May 27 18:28:01 IST 2019 armv7l GNU/Linux
The audit_2.8.4.bb recipe was used to install the auditd daemon via Yocto.
Path of audit_2.8.4.bb: http://git.yoctoproject.org/cgit/cgit.cgi/meta-selinux/tree/recipes-security/audit/audit_2.8.4.bb?h=master
Audit rules added via /etc/audit/audit.rules or the auditctl command are not permanent. To make them permanent across reboots, you have to add them to the /etc/audit/rules.d/audit.rules file.
After adding the rule, restart the auditd service and run auditctl -l; it will list all the rules, and they will also be reflected in /etc/audit/audit.rules.
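For example, a minimal sketch using the lastlog watch from the question (auditd 2.8.x typically rebuilds /etc/audit/audit.rules from the rules.d fragments via augenrules when the service restarts):
echo '-w /var/log/lastlog -p wa' >> /etc/audit/rules.d/audit.rules
/etc/init.d/auditd restart
auditctl -l     # should now print: -w /var/log/lastlog -p wa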

Simply distributed index: precached 0 indexes

I have two simple indexes:
First, 01.conf:
searchd
{
listen = 9301
listen = 9401:mysql41
pid_file = /var/run/sphinxsearch/searchd01.pid
log = /var/log/sphinxsearch/searchd01.log
query_log = /var/log/sphinxsearch/query01.log
binlog_path = /var/lib/sphinxsearch/data/test/01
}
source base
{
type = mysql
sql_host = localhost
sql_db = test
sql_user = root
sql_pass = toor
sql_query_pre = SET NAMES utf8
sql_attr_uint = group_id
}
source test : base
{
sql_query = \
SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \
FROM documents WHERE id % 2 = 0
}
index test
{
source = test
path = /var/lib/sphinxsearch/data/test/01
}
The second looks like the first, but with "02" instead of "01" in the filename and inside.
And distributed index in 00.conf:
searchd
{
listen = 9305
listen = 9405:mysql41
pid_file = /var/run/sphinxsearch/searchd00.pid
log = /var/log/sphinxsearch/searchd00.log
query_log = /var/log/sphinxsearch/query00.log
binlog_path = /var/lib/sphinxsearch/data/test
}
index test
{
type = distributed
agent = 127.0.0.1:9301:test
agent = 127.0.0.1:9302:test
}
And I try to use the distributed index:
sudo searchd --config /etc/sphinxsearch/d/00.conf --stop
sudo searchd --config /etc/sphinxsearch/d/01.conf --stop
sudo searchd --config /etc/sphinxsearch/d/02.conf --stop
sudo searchd --config /etc/sphinxsearch/d/01.conf
sudo searchd --config /etc/sphinxsearch/d/02.conf
sudo indexer --all --rotate --config /etc/sphinxsearch/d/01.conf
sudo indexer --all --rotate --config /etc/sphinxsearch/d/02.conf
sudo searchd --config /etc/sphinxsearch/d/00.conf
Unfortunately I get the following output:
...
using config file '/etc/sphinxsearch/d/00.conf'...
listening on all interfaces, port=9305
listening on all interfaces, port=9405
precached 0 indexes in 0.000 sec
Why?
And when I try to search something with the distributed index (9305):
no enabled local indexes to search.
The plain MySQL-sourced indexes work perfectly if I query them on ports 9301 and 9302 respectively, but searching the distributed index returns nothing.
UPDATE
# tail /var/log/sphinxsearch/searchd00.log
[Thu Sep 29 23:43:04.599 2016] [ 2353] binlog: finished replaying /var/lib/sphinxsearch/data/test/binlog.001; 0.0 MB in 0.000 sec
[Thu Sep 29 23:43:04.599 2016] [ 2353] binlog: finished replaying total 4 in 0.000 sec
[Thu Sep 29 23:43:04.599 2016] [ 2353] accepting connections
[Thu Sep 29 23:43:24.336 2016] [ 2353] caught SIGTERM, shutting down
[Thu Sep 29 23:43:24.472 2016] [ 2353] shutdown complete
[Thu Sep 29 23:43:24.473 2016] [ 2352] watchdog: main process 2353 exited cleanly (exit code 0), shutting down
[Thu Sep 29 23:43:24.634 2016] [ 2404] watchdog: main process 2405 forked ok
[Thu Sep 29 23:43:24.635 2016] [ 2405] listening on all interfaces, port=9305
[Thu Sep 29 23:43:24.635 2016] [ 2405] listening on all interfaces, port=9405
[Thu Sep 29 23:43:24.636 2016] [ 2405] accepting connections
UPDATE2
Hmm... It seems the problem is in how I query data from Sphinx. I also renamed the distributed index to test1. The following works well:
# mysql -h 127.0.0.1 -P 9405
mysql> select * from test1 where match ('one|two');
+------+----------+
| id | group_id |
+------+----------+
| 1 | 1 |
| 2 | 1 |
+------+----------+
2 rows in set (0,00 sec)
I think the problem was in the old version of sphinxapi.php that I was using.
precached 0 indexes in 0.000 sec
Well, that by itself is normal. There are no local indexes to 'precache'. A distributed index has no index files to 'load' or (pre)cache.
... but searchd should still be running at the end of that. I think searchd should start up OK.
Try also checking
/var/log/sphinxsearch/searchd00.log
which might have some more detail.
Although I suppose it's possible Sphinx will not start up without any real indexes (i.e. you can't have JUST a distributed index), so you could just add a fake local index to that config.
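If it does turn out that searchd needs at least one local index, a minimal sketch of what could be added to 00.conf is below; the index name dummy and its path are illustrative assumptions, not part of the original configuration:
index dummy
{
    type = rt
    path = /var/lib/sphinxsearch/data/test/dummy
    rt_field = title
    rt_attr_uint = group_id
}
An RT index needs no indexer run, so after restarting searchd with 00.conf it should report one precached index instead of zero.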

Master postgres initdb failed while deploying HAWQ 2.0 on Hortonworks

I tried to deploy HAWQ 2.0 but could not get the HAWQ Master to run. Below is the error log:
[gpadmin#hdps31hwxworker2 hawqAdminLogs]$ cat ~/hawqAdminLogs/hawq_init_20160805.log
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Prepare to do 'hawq init'
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-You can find log in:
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-/home/gpadmin/hawqAdminLogs/hawq_init_20160805.log
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-GPHOME is set to:
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-/usr/local/hawq/.
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[DEBUG]:-Current user is 'gpadmin'
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[DEBUG]:-Parsing config file:
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[DEBUG]:-/usr/local/hawq/./etc/hawq-site.xml
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Init hawq with args: ['init', 'master']
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Check: hawq_master_address_host is set
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Check: hawq_master_address_port is set
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Check: hawq_master_directory is set
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Check: hawq_segment_directory is set
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Check: hawq_segment_address_port is set
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Check: hawq_dfs_url is set
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Check: hawq_master_temp_directory is set
20160805:23:00:10:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Check: hawq_segment_temp_directory is set
20160805:23:00:11:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Check if hdfs path is available
20160805:23:00:11:050348 hawq_init:hdps31hwxworker2:gpadmin-[DEBUG]:-Check hdfs: /usr/local/hawq/./bin/gpcheckhdfs hdfs hdpsm2demo4.demo.local:8020/hawq_default off
20160805:23:00:11:050348 hawq_init:hdps31hwxworker2:gpadmin-[WARNING]:-2016-08-05 23:00:11.338621, p50546, th139769637427168, WARNING the number of nodes in pipeline is 1 [172.17.15.31(172.17.15.31)], is less than the expected number of replica 3 for block [block pool ID: isi_hdfs_pool block ID 4341187780_1000] file /hawq_default/testFile
20160805:23:00:11:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-1 segment hosts defined
20160805:23:00:11:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Set default_hash_table_bucket_number as: 6
20160805:23:00:17:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Start to init master
The files belonging to this database system will be owned by user "gpadmin".
This user must also own the server process.
The database cluster will be initialized with locale en_US.utf8.
fixing permissions on existing directory /data/hawq/master ... ok
creating subdirectories ... ok
selecting default max_connections ... 1280
selecting default shared_buffers/max_fsm_pages ... 125MB/200000
creating configuration files ... ok
creating template1 database in /data/hawq/master/base/1 ... 2016-08-05 22:00:18.554441 GMT,,,p50803,th-1212598144,,,,0,,,seg-10000,,,,,"WARNING","01000","""fsync"": can not be set by the user and will be ignored.",,,,,,,,"set_config_option","guc.c",10023,
ok
loading file-system persistent tables for template1 ...
2016-08-05 22:00:20.023594 GMT,,,p50835,th38852736,,,,0,,,seg-10000,,,,,"WARNING","01000","""fsync"": can not be set by the user and will be ignored.",,,,,,,,"set_config_option","guc.c",10023,
2016-08-05 23:00:20.126221 BST,,,p50835,th38852736,,,,0,,,seg-10000,,,,,"FATAL","XX000","could not create shared memory segment: Invalid argument (pg_shmem.c:183)","Failed system call was shmget(key=1, size=506213024, 03600).","This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter. You can either reduce the request size or reconfigure the kernel with larger SHMMAX. To reduce the request size (currently 506213024 bytes), reduce PostgreSQL's shared_buffers parameter (currently 4000) and/or its max_connections parameter (currently 3000).
If the request size is already small, it's possible that it is less than your kernel's SHMMIN parameter, in which case raising the request size or reconfiguring SHMMIN is called for.
The PostgreSQL documentation contains more information about shared memory configuration.",,,,,,"InternalIpcMemoryCreate","pg_shmem.c",183,1 0x87463a postgres errstart + 0x22a
2 0x74c5e6 postgres <symbol not found> + 0x74c5e6
3 0x74c7cd postgres PGSharedMemoryCreate + 0x3d
4 0x7976b6 postgres CreateSharedMemoryAndSemaphores + 0x336
5 0x880489 postgres BaseInit + 0x19
6 0x7b03bc postgres PostgresMain + 0xdbc
7 0x6c07d5 postgres main + 0x535
8 0x3c0861ed1d libc.so.6 __libc_start_main + 0xfd
9 0x4a14e9 postgres <symbol not found> + 0x4a14e9
child process exited with exit code 1
initdb: removing contents of data directory "/data/hawq/master"
Master postgres initdb failed
20160805:23:00:20:050348 hawq_init:hdps31hwxworker2:gpadmin-[INFO]:-Master postgres initdb failed
20160805:23:00:20:050348 hawq_init:hdps31hwxworker2:gpadmin-[ERROR]:-Master init failed, exit
This is in Advanced gpcheck
[global]
configfile_version = 4
[linux.mount]
mount.points = /
[linux.sysctl]
sysctl.kernel.shmmax = 500000000
sysctl.kernel.shmmni = 4096
sysctl.kernel.shmall = 400000000
sysctl.kernel.sem = 250 512000 100 2048
sysctl.kernel.sysrq = 1
sysctl.kernel.core_uses_pid = 1
sysctl.kernel.msgmnb = 65536
sysctl.kernel.msgmax = 65536
sysctl.kernel.msgmni = 2048
sysctl.net.ipv4.tcp_syncookies = 0
sysctl.net.ipv4.ip_forward = 0
sysctl.net.ipv4.conf.default.accept_source_route = 0
sysctl.net.ipv4.tcp_tw_recycle = 1
sysctl.net.ipv4.tcp_max_syn_backlog = 200000
sysctl.net.ipv4.conf.all.arp_filter = 1
sysctl.net.ipv4.ip_local_port_range = 1281 65535
sysctl.net.core.netdev_max_backlog = 200000
sysctl.vm.overcommit_memory = 2
sysctl.fs.nr_open = 2000000
sysctl.kernel.threads-max = 798720
sysctl.kernel.pid_max = 798720
# increase network
sysctl.net.core.rmem_max = 2097152
sysctl.net.core.wmem_max = 2097152
[linux.limits]
soft.nofile = 2900000
hard.nofile = 2900000
soft.nproc = 131072
hard.nproc = 131072
[linux.diskusage]
diskusage.monitor.mounts = /
diskusage.monitor.usagemax = 90%
[hdfs]
dfs.mem.namenode.heap = 40960
dfs.mem.datanode.heap = 6144
# in hdfs-site.xml
dfs.support.append = true
dfs.client.enable.read.from.local = true
dfs.block.local-path-access.user = gpadmin
dfs.datanode.max.transfer.threads = 40960
dfs.client.socket-timeout = 300000000
dfs.datanode.socket.write.timeout = 7200000
dfs.namenode.handler.count = 60
ipc.server.handler.queue.size = 3300
dfs.datanode.handler.count = 60
ipc.client.connection.maxidletime = 3600000
dfs.namenode.accesstime.precision = -1
It looks like it is complaining about memory, but I can't seem to find the parameters to change. Where are shared_buffers and max_connections?
How do I fix this error in general? Thanks.
Your memory settings are too low to initialize the database. Don't bother with shared_buffers or max_connections.
You have:
kernel.shmmax = 500000000
kernel.shmall = 400000000
and it should be:
kernel.shmmax = 1000000000
kernel.shmall = 4000000000
Reference: http://hdb.docs.pivotal.io/hdb/install/install-cli.html
I would also make sure you have enough swap configured on your nodes based on the amount of RAM you have.
Reference: http://hdb.docs.pivotal.io/20/requirements/system-requirements.html
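A minimal sketch of applying those values, run as root on every HAWQ node (appending to /etc/sysctl.conf is one common place; a file under /etc/sysctl.d/ works as well):
cat >> /etc/sysctl.conf <<'EOF'
kernel.shmmax = 1000000000
kernel.shmall = 4000000000
EOF
sysctl -p                            # reload the kernel parameters
sysctl kernel.shmmax kernel.shmall   # verify the new values
After that, re-run the hawq init step; the shmget failure should be gone.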
shared_buffers sets the amount of memory a HAWQ segment instance uses for shared memory buffers. This setting must be at least 128KB, and at least 16KB times max_connections.
When setting shared_buffers, the values of the operating system parameters SHMMAX or SHMALL might also need to be adjusted.
The value of SHMMAX must be greater than this value:
shared_buffers + other_seg_shmem
You can set the parameter values using the "hawq config" utility:
hawq config -s shared_buffers (shows the current value)
hawq config -c shared_buffers -v value (sets a new value)
Please let me know how that goes!

SNMP on Linux (Centos) not showing load average

I installed SNMP on my Linux server in order to monitor load average, disk space, etc.
yum install net-snmp net-snmp-utils
vim /etc/snmp/snmpd.conf # added "rocommunity public"
/etc/init.d/snmpd start
Walking the SNMP agent returns very few entries: no load average, no disk space, etc.:
snmpwalk -v 1 -c public localhost
SNMPv2-MIB::sysDescr.0 = STRING: Linux XXX 2.6.35.4-rscloud #8 SMP Mon Sep 20 15:54:33 UTC 2010 x86_64
SNMPv2-MIB::sysObjectID.0 = OID: NET-SNMP-MIB::netSnmpAgentOIDs.10
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (232884) 0:38:48.84
SNMPv2-MIB::sysContact.0 = STRING: Root <root@localhost> (configure /etc/snmp/snmp.local.conf)
SNMPv2-MIB::sysName.0 = STRING: XXX
SNMPv2-MIB::sysLocation.0 = STRING: Unknown (edit /etc/snmp/snmpd.conf)
SNMPv2-MIB::sysORLastChange.0 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORID.1 = OID: SNMPv2-MIB::snmpMIB
SNMPv2-MIB::sysORID.2 = OID: TCP-MIB::tcpMIB
SNMPv2-MIB::sysORID.3 = OID: IP-MIB::ip
SNMPv2-MIB::sysORID.4 = OID: UDP-MIB::udpMIB
SNMPv2-MIB::sysORID.5 = OID: SNMP-VIEW-BASED-ACM-MIB::vacmBasicGroup
SNMPv2-MIB::sysORID.6 = OID: SNMP-FRAMEWORK-MIB::snmpFrameworkMIBCompliance
SNMPv2-MIB::sysORID.7 = OID: SNMP-MPD-MIB::snmpMPDCompliance
SNMPv2-MIB::sysORID.8 = OID: SNMP-USER-BASED-SM-MIB::usmMIBCompliance
SNMPv2-MIB::sysORDescr.1 = STRING: The MIB module for SNMPv2 entities
SNMPv2-MIB::sysORDescr.2 = STRING: The MIB module for managing TCP implementations
SNMPv2-MIB::sysORDescr.3 = STRING: The MIB module for managing IP and ICMP implementations
SNMPv2-MIB::sysORDescr.4 = STRING: The MIB module for managing UDP implementations
SNMPv2-MIB::sysORDescr.5 = STRING: View-based Access Control Model for SNMP.
SNMPv2-MIB::sysORDescr.6 = STRING: The SNMP Management Architecture MIB.
SNMPv2-MIB::sysORDescr.7 = STRING: The MIB for Message Processing and Dispatching.
SNMPv2-MIB::sysORDescr.8 = STRING: The management information definitions for the SNMP User-based Security Model.
SNMPv2-MIB::sysORUpTime.1 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORUpTime.2 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORUpTime.3 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORUpTime.4 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORUpTime.5 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORUpTime.6 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORUpTime.7 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORUpTime.8 = Timeticks: (0) 0:00:00.00
Found the problem:
I needed to reset the .conf file completely:
echo "rocommunity public" > /etc/snmp/snmp.conf
and needed to specify the exact branch to walk:
snmpwalk -v 1 -c public localhost .1.3.6.1.4.1.2021.10
For others like me who might search for this:
It is due to the rocommunity being restricted to basic access in the default SNMPd config with the line:
rocommunity public default -V systemonly
To give full access to localhost, just uncomment the preceding line, which is:
rocommunity public localhost
Or you can give all computers full access, like the OP did, via:
rocommunity public
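Once snmpd has been restarted with one of those lines, a quick check that the load-average objects are exposed is to walk the UCD-SNMP-MIB load entries (the same .1.3.6.1.4.1.2021.10 subtree used above); the sample output is illustrative:
service snmpd restart
snmpwalk -v 1 -c public localhost UCD-SNMP-MIB::laLoad
# UCD-SNMP-MIB::laLoad.1 = STRING: 0.05
# UCD-SNMP-MIB::laLoad.2 = STRING: 0.12
# UCD-SNMP-MIB::laLoad.3 = STRING: 0.08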