Postgres Streaming Replication Disk Usage Discrepancy - postgresql

Just a quick question - apologies if its been asked before, I couldn't find it.
We are using asynchronous streaming replication with postgres and have noticed that the disk usage for the database can vary between the master and the replica, even though the databases appear to be synchronising correctly.
At the moment the discrepancy is quite small, but it has on occasion been in the region of several GB.
At present, this is the sync status:
Master:
master=# SELECT pg_current_xlog_location();
pg_current_xlog_location
--------------------------
35C/F142C98
(1 row)
Slave:
slave=# select pg_last_xlog_receive_location();
pg_last_xlog_receive_location
-------------------------------
35C/F142C98
(1 row)
The disk usage is as follows. Again, I realise the discrepancy is currently quite small (~1.5GiB), but yesterday it was several GB.
Master
-bash-4.1$ df -m /var/lib/pgsql/
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/mapper/lv_pgsql
401158 302898 98261 76% /var/lib/pgsql
Slave:
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/mapper/lv_pgsql
401158 301263 99895 76% /var/lib/pgsql
I should clarify that the archive command is set to archive to a different partition on the master. I guess what i'm asking is:
- Is the current disk usage discrepancy normal?
- How can it be explained?
- How much of a discrepancy should there be before I get worried?
Thanks in advance for any help.
edit - I'm specifically interested in the discrepancy within the "data/base/*" directories which house the actual db content, as follows:
Master:
7 /var/lib/pgsql/master/9.3/data/base/1
7 /var/lib/pgsql/master/9.3/data/base/12891
7 /var/lib/pgsql/master/9.3/data/base/12896
57904 /var/lib/pgsql/master/9.3/data/base/16385
180 /var/lib/pgsql/master/9.3/data/base/16387
11 /var/lib/pgsql/master/9.3/data/base/16389
203588 /var/lib/pgsql/master/9.3/data/base/48448446
7 /var/lib/pgsql/master/9.3/data/base/534138292
1 /var/lib/pgsql/master/9.3/data/base/pgsql_tmp
Slave:
7 /var/lib/pgsql/slave/9.3/data/base/1
7 /var/lib/pgsql/slave/9.3/data/base/12891
7 /var/lib/pgsql/slave/9.3/data/base/12896
57634 /var/lib/pgsql/slave/9.3/data/base/16385
180 /var/lib/pgsql/slave/9.3/data/base/16387
10 /var/lib/pgsql/slave/9.3/data/base/16389
202945 /var/lib/pgsql/slave/9.3/data/base/48448446
7 /var/lib/pgsql/slave/9.3/data/base/534138292
1 /var/lib/pgsql/slave/9.3/data/base/pgsql_tmp

Related

Mapping nested device mapper mounts back to their physical drive

Looking for a reliable (and hopefully simple) way to trace a directory in an lvm or other dm mounted fs back to the physical disk it resides on. Goal is to get the model and serial number of the drive no matter where the script wakes up.
Not a problem when the fs mount is on a physical partition, but gets messy when layers of lvm and/or loopbacks are in between. The lsblk tree shows the dm relationships back to /dev/sda in the following example, but wouldn't be very easy or desirable to parse:
# lsblk -po NAME,MODEL,SERIAL,MOUNTPOINT,MAJ:MIN
NAME MODEL SERIAL MOUNTPOINT MAJ:MIN
/dev/loop0 /mnt/test 7:0
/dev/sda AT1000MX500SSD1 21035FEA05B8 8:0
├─/dev/sda1 /boot 8:1
├─/dev/sda2 8:2
└─/dev/sda5 8:5
└─/dev/mapper/sda5_crypt 254:0
├─/dev/mapper/test5--vg-root / 254:1
└─/dev/mapper/test5--vg-swap_1 [SWAP] 254:2
Tried udevadm info, stat and a few other variations, but they all dead end at the device mapper without a way (that I can see) of connecting the dots to the backing disk and it's model/serial number.
Got enough solution by enumerating the base /dev/sd? devices, looping through each one and its partitions with lsblk -ln devpart and looking for the mountpoint in column 7. In the following example, the desired / shows up in the mappings to the /dev/sda5 partition. The serial number (and a lot of other data) for the base device can then be returned with udevadm info /dev/sda:
sda5 8:5 0 931G 0 part
sda5_crypt 254:0 0 931G 0 crypt
test5--vg-root 254:1 0 651G 0 lvm /
test5--vg-swap_1 254:2 0 976M 0 lvm [SWAP]

What's the best practice of improving IOPS of a CEPH cluster?

I am currently building a CEPH cluster for a KVM platform, which got catastrophic performance outcome right now. The figure is dreadful. I am not really familiar with physically distributed systems, is there any general advice of improving the overall performance (i.e. latency, bandwidth and IOPS)?
The hardware configuration is not optimal right now, but I am still would like to release the full potential of what I currently got:
1x 10Gbe Huawei switch
3x Rack server, with hardware configuration:
Intel(R) Xeon(R) CPU E5-2678 v3 # 2.50GHz x2, totally 48 logical cores,
128GB DDR3 RAM
Intel 1.84T NVMe SSD x6 as data drive, with 1 OSD per disk (totally 6 OSDs per server)
My current /etc/ceph/ceph.conf:
[global]
fsid = f2d6d3a7-0e61-4768-b3f5-b19dd2d8b657
mon initial members = ceph-node1, ceph-node2, ceph-node3
mon allow pool delete = true
mon host = 192.168.16.1, 192.168.16.2, 192.168.16.3
public network = 192.168.16.0/24
cluster network = 192.168.16.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 600
osd pool default pgp num = 600
osd memory target = 4294967296
max open files = 131072
[mon]
mon clock drift allowed = 1
mon osd min down reporters = 13
mon osd down out interval = 600
[OSD]
osd journal size = 20000
osd max write size = 512
osd client message size cap = 2147483648
osd deep scrub stride = 131072
osd op threads = 16
osd disk threads = 4
osd map cache size = 1024
osd map cache bl size = 128
osd recovery op priority = 2
osd recovery max active = 10
osd max backfills = 4
osd min pg log entries = 30000
osd max pg log entries = 100000
osd mon heartbeat interval = 40
ms dispatch throttle bytes = 1048576000
objecter inflight ops = 819200
osd op log threshold = 50
osd crush chooseleaf type = 0
journal max write bytes = 1073714824
journal max write entries = 10000
journal queue max ops = 50000
journal queue max bytes = 10485760000
[Client]
rbd cache = True
rbd cache size = 335544320
rbd cache max dirty = 134217728
rbd cache max dirty age = 30
rbd cache writethrough until flush = False
rbd cache max dirty object = 2
rbd cache target dirty = 235544320
IO benchmark is done by fio, with the configuration:
fio -ioengine=libaio -bs=4k -direct=1 -thread -rw=randread -size=100G -filename=/data/testfile -name="CEPH Test" -iodepth=8 -runtime=30
Benchmark result screenshot:
The bench mark result
The benchmark was done on a sperate machine, configured to connect the cluster via 10Gbe switch by installing MDS only. The benchmark machine is identical to other 3 which formed the cluster, apart from the absence of Intel NVMe SSD drives.
Any help is appreciated,
First, I must note that Ceph is not an acronym, it is short for Cephalopod, because tentacles.
That said, you have a number of settings in ceph.conf that surprise me, like the extreme number of osdmaps you're caching. The thread settings can be tricky, and vary in applicability between releases. If you're building pools with 600 PGs that isn't great, you generally want a power of 2, and to target a ratio per OSD that factors in drive type and other pools. Setting the mon clock skew to a full second (vs. the default 50ms) is downright alarming, with Chrony or even the legacy ntpd it's not hard to get sub-millisecond syncing.
Three nodes may be limiting in the degree of parallelism / overlap clients can support, especially since you only have 6 drives per server. That's only 18 OSDs.
You have Filestore settings in there too, you aren't really using Filestore are you? Or a Ceph release older than Nautilus?
Finally, as more of an actual answer to the question posed, one simple thing you can do is to split each NVMe drive into two OSDs -- with appropriate pgp_num and pg_num settings for the pool.
ceph-volume lvm batch –osds-per-device 2
I assume you 3 meant 3 servers blades and not 3 racks.
What was you rough estimate of performance ?
What is the performance profile of your disk hardware (outside ceph) at 4K and 2MB ?
How many disk do you have in this pool, what is replication factor/strategy and object size ?
On the client side you are performing small reads: 4K
On the server side, depending on your read-ahead settings and object size each of this 4K may grasp much more data in the background.
Did you check that one of your disk is really at its limits and there is no Network/cpu throttling?
You can partition your drive with lvm and use multiple OSDs per drive. Since you have so many cores per server, one osd per drive is not making use of them

PostgreSQL 11 Shared Memory Error: could not open shared memory segment "/PostgreSQL.XXXXXXXX": No such file or directory

Shared Memory files getting deleted some time (~15 hours) in Postgres 11
2019-07-09 08:46:41 CDT [] [6723]: [1-1] user=,db=,e=58P01 ERROR: could not open shared memory segment "/PostgreSQL.291691635": No such file or directory
2019-07-09 08:46:41 CDT [] [6722]: [1-1] user=,db=,e=58P01 ERROR: could not open shared memory segment "/PostgreSQL.291691635": No such file or directory
2019-07-09 08:46:41 CDT [10.40.0.204(60550)] [13880]: [1-1] user=user_name,db=db_name,e=58P01 ERROR: could not open shared memory segment "/PostgreSQL.291691635": No such file or directory
2019-07-09 08:46:41 CDT [10.40.0.204(60550)] [13880]: [2-1] user=user_name,db=db_name,e=58P01 CONTEXT: parallel worker
2019-07-09 08:46:41 CDT [10.40.0.204(60550)] [13880]: [3-1] user=user_name,db=db_name,e=58P01 STATEMENT: WITH overall_reviewed AS (SQL Query)
GCP VM Config
CPU: 4
RAM: 16 GB
OS: Ubuntu 18.04.1 LTS
kernel shared memory setting shared
kernel.shmmax=8589934592
kernel.shmall=2097152
postgresql.config
max_connections = 500
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 4194kB
min_wal_size = 1GB
max_wal_size = 2GB
max_worker_processes = 4
max_parallel_workers_per_gather = 2
max_parallel_workers = 4
During startup: no errors/warnings
After ~15 hours some of the shared memory files is getting deleted, I'm doubting is there any other process deleting files in "/dev/shm" ?
Not sure what is the root cause
making dynamic_shared_memory_type = none in postgresql.conf did solve the issue.
Got the same problem on Ubuntu 18.04 and PostgreSQL 11 and after some more research i have found a solution for us. The error occured when the backup user, which ist he same user as the PG service user, logs into the system. The following Link describes that the storeage under /dev/shm where deleted when a user logs in to the system (same user). So our solution was to change the following:
/etc/systemd/logind.conf
added the Line
RemoveIPC=no
and restart the service
systemctl restart systemd-logind.service
Sources:
https://www.postgresql-archive.org/systemd-deletes-shared-memory-segment-in-dev-shm-Postgresql-NNNNNN-td5883507.html
https://superuser.com/questions/1117764/why-are-the-contents-of-dev-shm-is-being-removed-automatically
We had the same issue and it turned out that someone set postgres user's UID bigger than 1000 (which means that postgres user was no longer a system account). And, as said here:
After hours of searching and reading, I found the culprit.
It's a setting for systemd. The /etc/systemd/logind.conf contains default configuration options, with each of them commented out.
The RemoveIPC option is set to yes by default. That option tells systemd to clean up interprocess communication (IPC) for "user accounts" who aren't logged in.
================================================
This does not affect "system accounts"
================================================
Met the same issue.....when I have opened two sqldeveloper with the same user account, one of them is my remote session which I completely forgot to close the session.
I was doing some aggregation operators like count(*) or max(...), and the error on both. And the error is similar:
ERROR: could not open shared memory segment "/PostgreSQL.798235387":
No such file or directory Where: parallel worker
Solution? I killed the remote session.... XD
And life is peaceful and happy again :D

MongoDB Out of Memory

MongoDB is crashing. When I open the mongodb.log file, I get:
$ tail /var/log/mongodb/mongodb.log
Sat Jan 25 03:06:56.153 [initandlisten] connection accepted from 127.0.0.1:58492 #63331 (263 connections now open)
Sat Jan 25 03:07:02.694 out of memory, printing stack and exiting:
0xde05e1 0x6cf37e 0x12129fd 0xc490c3 0xc4404e 0xc44196 0xda4913 0xda53e4 0xe28e69 0x7f5cbaa19e9a 0x7f5cb9d2c3fd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xde05e1]
/usr/bin/mongod(_ZN5mongo14my_new_handlerEv+0x3e) [0x6cf37e]
/usr/bin/mongod(_Znam+0x6d) [0x12129fd]
/usr/bin/mongod(_ZNK5mongo3Top8cloneMapERNS_9StringMapINS0_14CollectionDataEEE+0x83) [0xc490c3]
/usr/bin/mongod(_ZN5mongo9Snapshots12takeSnapshotEv+0x4e) [0xc4404e]
/usr/bin/mongod(_ZN5mongo14SnapshotThread3runEv+0x66) [0xc44196]
/usr/bin/mongod(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0xc3) [0xda4913]
/usr/bin/mongod(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS7_9JobStatusEEEEENS2_5list2INS2_5valueIPS7_EENSD_ISA_EEEEEEE3runEv+0x74) [0xda53e4]
/usr/bin/mongod() [0xe28e69]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f5cbaa19e9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f5cb9d2c3fd]
This question sounds similar: MongoDB: out of memory
But his problem was a ulimit issue. My memory settings are already unlimited.
Others had particular issues with .skip() or .limit() given unreasonably large values, but that's not happening here.
Anyone know what might be wrong?
The MongoDB docs recommend having enough swap space for MongoDB, despite it not being a requirement: http://docs.mongodb.org/manual/administration/production-notes/#ProductionNotes-Swap
I'm using Windows Azure hosting, and I discovered that their virtual servers don't have swap space by default:
$ sudo swapon -s
Filename Type Size Used Priority
(Azure defaults to no swap space: Part 1 & Part 2)
So I found a guide to creating a swap file: https://www.digitalocean.com/community/articles/how-to-add-swap-on-ubuntu-12-04
And it solved my problem!
Notes:
The guide says Ubuntu 12.04, but the same steps worked for me on 13.10.
You should use a swap file around half the size of your RAM, not the 512MB used in the guide.
I hope this helps others solve this problem.

Restore data from Postgres data files

Got system with Postgres broken (a RAID is the reason) , without any backups.
Trying to put data to another comptuter with Postgres (and make however backup).
But always when I set up data directory and run postgres I've got message
GET FATAL: database files are incompatible with server
2012-08-15 19:58:38 GET DETAIL: The database cluster was initialized with BLCKSZ 16777216, but the server was compiled with BLCKSZ 8192.
2012-08-15 19:58:38 GET HINT: It looks like you need to recompile or initdb.
It's very strange number 16777216(2 to power 24 - to big).
However I can't reset default value 8192 when compiling (playing with --with-blocksize= take no effect; BLCKSZ - I can't find it in headers files)
).
Any way to extract data ?
This is environment and circumstances:
harddrive: RAID 1 with 3 SAS disks in array
OS: ubuntu 10.04.04 amd64
Postgres: 9.1 (by apt-get (we change repository links to higher version of Ubuntu))
the system become broken - after some time got
AAC: Host Adapter BLINK LED 0x56
AACO: Adapter kernel panic'd 56
(filesystem or hardware error)
Somehow we got data directory. pg_conroldata shown:
pg_control version number: 903
Catalog version number: 201105231
Database system identifier: 5714530593695276911
Database cluster state: shut down
pg_control last modified: Tue 15 Aug 2012 11:50:50
Latest checkpoint location: 1B595668/2000020
Prior checkpoint location: 0/0
Latest checkpoint's REDO location: 1B595668/2000020
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 0/4057946
Latest checkpoint's NextOID: 40960
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 670
Latest checkpoint's oldestXID's DB: 1344846103
Latest checkpoint's oldestActiveXID: 0
Time of latest checkpoint: Tue 15 Aug 2012 11:50:50
Minimum recovery ending location: 0/0
Backup start location: 0/0
Current wal_level setting: minimal
Current max_connections setting: 100
Current max_prepared_xacts setting:0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 16777216
Blocks per segment of large relation:131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 2387576020
Maximum size of a TOAST chunk: 0
Date/time type storage: floating-point numbers
Float4 argument passing: by reference
Float8 argument passing: by reference
First I effort to up DB in Ubuntu servers (harddisk - simple serial 2, Ubuntu 10.04 i386, Postgres 9.1) and got the same exception above (with BLCKSZ).
That's why I deployed Ubuntu 10.04 amd64 with english Postgres 9.1 (because got '?' instead of russian symbols in error logs in previous step) in virtual machine
Got the same exception (with BLCKSZ).
Ather that have removed apt-get postgres version and compiled it as described at docs http://www.postgresql.org/docs/9.1/static/installation.html.
Playing withconfigure --with-blocksize=BLOCKSIZE had take no effect - got the same error
Sorry, for the post.
The pg_contol was broken by some manipulations with.
Sow, the cluster was succeful restored by pg_resetxlog with initial data.
A blocksize of 16Mb would be really weird, and since these two values also look completely bogus:
Maximum columns in an index: 2387576020
Maximum size of a TOAST chunk: 0
...you might want to question the integrity of this data before spending time on compiling postgres with a non-standard block size.
If you look at the sizes of files corresponding to relations, are they multiple of 16Mb or 8Kb?
If the database have some gigabytes tables, what appears to be the cut-off size on disk (the size above which postgres split the data into several files)? This should be equal to data block size*Blocks per segment of large relation. On a default install, it's 1Gb.
See here for details on configuring kernel resources. Perhaps the default/current settings for this new OS won't allow the postmaster to start.
Here are details on the meaning and context of the BLCKSZ parameter. Was the system that failed running a 64bit build of PostgreSQL and the new system is a 32bit build? If possible, attempting to obtain version information on the failed system's PostgreSQL could shed light on the problem. Let us know what version, build, and OS were used. Was is a custom build?