osm2pgsql import fails with "Failed to read from node cache: Input/output error" - postgresql

I'm attempting a whole-planet OSM data import on an AWS EC2 instance. During, or possibly after, the "Ways" processing I receive the following message:
"Failed to read from node cache: Input/output error"
The EC2 has the following specs:
type: i3.xlarge
memory: 30.5 GiB
vCPUs: 4
Postgresql: v9.5.6
PostGIS: 2.2
In addition to the root volume, I have mounted a 900 GB SSD and a 2 TB HDD (high throughput). The PostgreSQL data directory is on the HDD. I have told osm2pgsql to write the flat-nodes file to the SSD.
Here is my osm2pgsql command:
osm2pgsql -c -d gis --number-processes 4 --slim -C 20000 --flat-nodes /data-cache/flat-node-cache/flat.nodes /data-postgres/planet-latest.osm.pbf
I run the above command as user renderaccount, which is a member of the groups renderaccount, ubuntu, and postgres. The flat-nodes file appears to be created successfully at /data-cache/flat-node-cache/flat.nodes and looks like this:
ubuntu@ip-172-31-25-230:/data-cache/flat-node-cache$ ls -l
total 37281800
-rw------- 1 renderaccount renderaccount 38176555024 Apr 13 05:45 flat.nodes
Has anyone run into or resolved this? I suspect it may be a permissions issue. I notice that since this last osm2pgsql failure, the mounted SSD that holds the flat-nodes file has been remounted as a "read-only" file system, which I understand can happen when there are I/O errors on the mounted volume(?).
Also, does osm2pgsql write to a log where I could find additional information?
UPDATE: dmesg output:
[ 6206.884412] blk_update_request: I/O error, dev nvme0n1, sector 66250752
[ 6206.890813] EXT4-fs warning (device nvme0n1): ext4_end_bio:329: I/O error -5 writing to inode 14024706 (offset 10871640064 size 8388608 starting block 8281600)
[ 6206.890817] Buffer I/O error on device nvme0n1, logical block 8281344
After researching the above output, it appears it might be a bug in Ubuntu 16.04. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129?comments=all
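As a sanity check on the dmesg lines above, the sector number in the first line and the logical block in the third can be converted to byte offsets to see whether they describe the same spot on nvme0n1. This is only a sketch: the 4096-byte filesystem block size is an assumption (the ext4 default), not something shown in the output, and the helper names are made up for illustration.

```python
# Cross-check the dmesg lines: do the failing sector and the failing
# logical block refer to the same byte offset on nvme0n1?
SECTOR_SIZE = 512      # the kernel block layer reports sectors in 512-byte units
FS_BLOCK_SIZE = 4096   # assumption: ext4 created with the default 4 KiB block size


def sector_to_byte_offset(sector: int) -> int:
    """Byte offset of a block-layer sector (hypothetical helper)."""
    return sector * SECTOR_SIZE


def block_to_byte_offset(block: int) -> int:
    """Byte offset of a filesystem-sized logical block (hypothetical helper)."""
    return block * FS_BLOCK_SIZE


failed_sector = 66250752   # from "blk_update_request: I/O error ... sector"
failed_block = 8281344     # from "Buffer I/O error ... logical block"

sector_offset = sector_to_byte_offset(failed_sector)
block_offset = block_to_byte_offset(failed_block)
same_location = sector_offset == block_offset

print(sector_offset, block_offset, same_location)
```

If both offsets agree, the two messages are reporting one failed write at the device level rather than two unrelated problems, which points at the volume or its driver and not at file permissions.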

This was an error with Ubuntu 16.04 writing to the volume nvme0n1. It was solved by the fix described in this comment: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129/comments/29


Error during PostgreSQL upgrade from version 11 to 12 (old and new pg_controldata WAL segment sizes are invalid or do not match)

We are getting an error while upgrading PostgreSQL from version 11 to 12.
Please suggest what I can do to resolve this issue.
Error:
Performing Consistency Checks on Old Live Server
------------------------------------------------
Checking cluster versions ok
old and new pg_controldata WAL segment sizes are invalid or do not match
Failure, exiting
bash-4.1$
.....................
command:
/usr/pgsql-12/bin/pg_upgrade --old-datadir /var/lib/pgsql/11/data/ --new-datadir /var/lib/pgsql/12/data/ --old-bindir /usr/pgsql-11/bin/ --new-bindir /usr/pgsql-12/bin/ --check
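One way to narrow this down is to look at the value the message refers to in both clusters. The sketch below assumes you have captured the English-language output of each version's pg_controldata run against its own data directory; the function name and the two sample excerpts are hypothetical and only illustrate the comparison, they are not taken from the failing system.

```python
import re
from typing import Optional


def wal_segment_size(controldata_output: str) -> Optional[int]:
    """Return the 'Bytes per WAL segment' value, or None if the line is absent."""
    match = re.search(
        r"^Bytes per WAL segment:\s+(\d+)\s*$",
        controldata_output,
        re.MULTILINE,
    )
    return int(match.group(1)) if match else None


# Hypothetical excerpts for illustration only; real output has many more lines.
old_output = (
    "Database block size:                  8192\n"
    "Bytes per WAL segment:                16777216\n"
)
new_output = "Database block size:                  8192\n"  # line missing here

old_size = wal_segment_size(old_output)
new_size = wal_segment_size(new_output)

# Both values must be present and equal for the clusters to be compatible.
sizes_ok = old_size is not None and old_size == new_size
print(old_size, new_size, sizes_ok)
```

If one side comes back as None or the two numbers differ, that is the cluster to inspect before rerunning pg_upgrade with --check.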

Ceph luminous rbd map hangs forever

I'm running a one-node Ceph cluster and using the Ceph client from another node. QEMU works fine with RBD mounting. When I try to map an RBD block device on the Ceph client, it hangs indefinitely with no output. How can I diagnose what's wrong?
The system is Ubuntu 16.04 Server with Ceph Luminous.
sudo ceph tell osd.* version
{
"version": "ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)"
}
ceph -s
  cluster:
    id: 4bfcc109-e432-4ac0-ba9d-bf81243aea
    health: HEALTH_OK
  services:
    mon: 1 daemons, quorum gcmaster
    mgr: gcmaster(active)
    osd: 1 osds: 1 up, 1 in
  data:
    pools: 1 pools, 128 pgs
    objects: 1512 objects, 5879 MB
    usage: 7356 MB used, 216 GB / 223 GB avail
    pgs: 128 active+clean
rbd info gcbase
rbd image 'gcbase':
    size 512 MB in 128 objects
    order 22 (4096 kB objects)
    block_name_prefix: rbd_data.376974b0dc51
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    flags:
    create_timestamp: Fri Dec 29 17:58:02 2017
This hangs forever
rbd map gcbase --pool rbd
As does this
rbd map typo_gcbase --pool rbd
dmesg shows
Dec 29 13:27:32 cephclient1 kernel: [85798.195468] libceph: mon0 192.168.1.55:6789 feature set mismatch, my 106b84a842a42 < server's 40106b84a842a42, missing 400000000000000
Dec 29 13:27:32 cephclient1 kernel: [85798.222070] libceph: mon0 192.168.1.55:6789 missing required protocol features
The dmesg output tells what's going on: The cluster requires a feature bit that is not supported by the libceph kernel module.
The feature bit in question is either CEPH_FEATURE_CRUSH_TUNABLES5, CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING or CEPH_FEATURE_FS_FILE_LAYOUT_V2 (they are overlapping because they were introduced at the same time) which only became available on kernel 4.5, whereas Ubuntu 16.04 uses a 4.4 kernel.
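The missing feature can be read straight off the two masks in the dmesg line. A minimal sketch using only the values printed by the kernel (the variable names are mine):

```python
# Feature masks copied from the dmesg line ("my ... < server's ...").
client_features = 0x106b84a842a42
server_features = 0x40106b84a842a42

# Bits the server requires that the client kernel does not advertise.
missing = server_features & ~client_features

# Individual bit positions set in the missing mask.
missing_bits = [bit for bit in range(missing.bit_length()) if (missing >> bit) & 1]

print(hex(missing))
print(missing_bits)
```

The result matches the "missing 400000000000000" value in the log and comes down to a single bit, number 58, which is the one bit the three feature names above share.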
A similar question (although related to CephFS) came up on the mailing list with a possible solution:
Yes, you should be able to set your CRUSH tunables profile to hammer
with "ceph osd crush tunables hammer".
This will disable some features, but should make the older kernel compatible with the cluster.
Alternatively you could upgrade to a mainline kernel or to a newer OS release.

Ran out of Docker disk space

I have this Docker command:
docker run -d mongo
This should pull the image and start a MongoDB server in a Docker container.
However, I get an error:
no space left on device
I am on macOS, using one of the newer versions of Docker that run on a native hypervisor (HyperKit, as far as I know) instead of VirtualBox.
Here is the exact error message from the mongo container:
$ docker logs efee16702c5756659d563b98d4ae0f58ecf1f1bba8a54f63443c0ae4b520ab4e
about to fork child process, waiting until server is ready for connections.
forked process: 21
2017-05-04T20:23:51.412+0000 I CONTROL [main] ***** SERVER RESTARTED *****
2017-05-04T20:23:51.430+0000 I CONTROL [main] ERROR: Cannot write pid file to /tmp/tmp.Lo035QkbfL: No space left on device
ERROR: child process failed, exited with error number 1
Any idea how to fix this and prevent it from happening in future?
As suggested, the output of df -h is:
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
/dev/disk1 465Gi 116Gi 349Gi 25% 1963838 4293003441 0% /
devfs 183Ki 183Ki 0Bi 100% 634 0 100% /dev
map -hosts 0Bi 0Bi 0Bi 100% 0 0 100% /net
map auto_home 0Bi 0Bi 0Bi 100% 0 0 100% /home
Output of docker info is:
$ docker info
Containers: 5
Running: 0
Paused: 0
Stopped: 5
Images: 741
Server Version: 17.03.1-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: N/A (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.13-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 1.952 GiB
Name: moby
ID: OR4L:WYWW:FFAP:IDX3:B6UK:O2AN:UVTO:EPH6:GYSV:4GV4:L5WP:BQTH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 17
Goroutines: 30
System Time: 2017-05-04T20:45:27.056157913Z
EventsListeners: 1
No Proxy: *.local, 169.254/16
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
As you state in the comments to the question, ls -altrh ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2 returns the following:
-rw-r--r--@ 1 alexamil staff 53G
This is a known bug on macOS (and not only there), and an official developer comment can be found here. One caveat: different people report different size limits. In that comment it is 64 GB, but for another person it was 20 GB.
There are a couple of workarounds, but no definitive solution that I could find.
The manual one
Run docker ps -a and manually remove all unused containers. Then run docker images and manually remove all intermediate and unused images.
The simplest one
Delete the Docker.qcow2 file entirely. But you will lose all images and containers. Completely.
The less simple
Another way is to run docker volume prune, which will remove all unused volumes
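The cleanup approaches above could be scripted roughly as follows. This is a sketch, assuming a Docker version that has the prune subcommands (1.13 or newer, which includes the 17.03.1-ce shown in the question); cleanup_commands and run_cleanup are hypothetical helper names. It defaults to a dry run and only prints what it would execute. Note that --all on image prune removes every image not referenced by a container, so drop that flag to remove only dangling images.

```python
import subprocess
from typing import List


def cleanup_commands(include_volumes: bool = False) -> List[List[str]]:
    """Return docker CLI invocations that remove unused objects."""
    commands = [
        # Remove all stopped containers.
        ["docker", "container", "prune", "--force"],
        # Remove all images not used by any remaining container.
        ["docker", "image", "prune", "--all", "--force"],
    ]
    if include_volumes:
        # Remove volumes not used by any container.
        commands.append(["docker", "volume", "prune", "--force"])
    return commands


def run_cleanup(include_volumes: bool = False, dry_run: bool = True) -> List[str]:
    """Print each command; execute it only when dry_run is False."""
    printed = []
    for cmd in cleanup_commands(include_volumes):
        line = " ".join(cmd)
        printed.append(line)
        print(line)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return printed


planned = run_cleanup(include_volumes=True)  # dry run: nothing is deleted
```

Calling run_cleanup(include_volumes=True, dry_run=False) would actually delete the data, so review the printed commands first.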
The resizing one (keeps the data)
Another idea that comes to me is to expand the disk image size with QEMU or something like it:
$ brew install qemu
$ /Applications/Docker.app/Contents/MacOS/qemu-img resize ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2 +5G
After expanding the image, you need to boot a VM that runs GParted against Docker.qcow2 and grow the partition so that it uses the added space. You can use the GParted Live ISO for that:
$ qemu-system-x86_64 -drive file=Docker.qcow2 -m 512 -cdrom ~/Downloads/gparted-live.iso -boot d -device usb-mouse -usb
Some people report this either doesn't work or doesn't help.
Yet another resizing one (wipes the data)
Create a substitute image with desired size (120G):
$ qemu-img create -f qcow2 ~/data.qcow2 120G
$ cp ~/data.qcow2 /Applications/Docker.app/Contents/Resources/moby/data.qcow2
$ rm ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2
data.qcow2 is copied to ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2 when you restart docker.
This workaround comes from this comment.
Hope this helps. Good luck!

WAL contains references to invalid pages

CentOS 6.7
PostgreSQL 9.5.3
I have DB servers in a master-standby replication setup.
Suddenly, the standby server's PostgreSQL process stopped with the following log entries.
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]WARNING: page 1671400 of relation base/16400/559613 is uninitialized
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]CONTEXT: xlog redo Heap2/VISIBLE: cutoff xid 1902107520
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]PANIC: WAL contains references to invalid pages
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]CONTEXT: xlog redo Heap2/VISIBLE: cutoff xid 1902107520
2016-07-14 18:14:21.026 JST [][5783e038.3cd9][0][15577]LOG: startup process (PID 15579) was terminated by signal 6: Aborted
2016-07-14 18:14:21.026 JST [][5783e038.3cd9][0][15577]LOG: terminating any other active server processes
The master server's PostgreSQL logs showed nothing unusual.
However, the master server's /var/log/messages contained the following.
Jul 14 05:38:44 host kernel: sbridge: HANDLING MCE MEMORY ERROR
Jul 14 05:38:44 host kernel: CPU 8: Machine Check Exception: 0 Bank 9: 8c000040000800c0
Jul 14 05:38:44 host kernel: TSC 0 ADDR 1f7dad7000 MISC 90004000400008c PROCESSOR 0:306e4 TIME 1468442324 SOCKET 1 APIC 20
Jul 14 05:38:44 host kernel: EDAC MC1: CE row 1, channel 0, label "CPU_SrcID#1_Channel#0_DIMM#1": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=8 Err=0008:00c0 (ch=0), addr = 0x1f7dad7000 => socket=1, Channel=0(mask=1), rank=4
Jul 14 05:38:44 host kernel:
Jul 14 18:30:40 host kernel: sbridge: HANDLING MCE MEMORY ERROR
Jul 14 18:30:40 host kernel: CPU 8: Machine Check Exception: 0 Bank 9: 8c000040000800c0
Jul 14 18:30:40 host kernel: TSC 0 ADDR 1f7dad7000 MISC 90004000400008c PROCESSOR 0:306e4 TIME 1468488640 SOCKET 1 APIC 20
Jul 14 18:30:41 host kernel: EDAC MC1: CE row 1, channel 0, label "CPU_SrcID#1_Channel#0_DIMM#1": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=8 Err=0008:00c0 (ch=0), addr = 0x1f7dad7000 => socket=1, Channel=0(mask=1), rank=4
Jul 14 18:30:41 host kernel:
The memory errors started about a week ago, so I suspect they are causing the PostgreSQL error.
My questions are:
1) Can a kernel-reported memory error cause PostgreSQL's "WAL contains references to invalid pages" error?
2) Why are there no log entries on the master server's PostgreSQL?
Thanks.
Faulty memory can cause all kinds of data corruption, so that seems like a good enough explanation to me.
Perhaps there are no log entries at the master PostgreSQL server because all that was corrupted was the WAL stream.
You can run
oid2name
to find out which database has OID 16400 and then
oid2name -d <database with OID 16400> -f 559613
to find out which table belongs to file 559613.
Is that table larger than 12 GB? If not, that would mean that page 1671400 is indeed an invalid value.
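The 12 GB figure follows from the page number in the WARNING line. A quick sketch of the arithmetic, assuming the default 8 kB block size and 1 GB segment files (neither is confirmed in the question):

```python
BLOCK_SIZE = 8192          # assumption: default PostgreSQL block size (8 kB)
SEGMENT_SIZE = 1024 ** 3   # assumption: default 1 GB relation segment files

page = 1671400             # page number from the WARNING line

# Page numbers are zero-based, so the relation must hold page + 1 blocks.
min_relation_bytes = (page + 1) * BLOCK_SIZE
min_relation_gib = min_relation_bytes / 1024 ** 3

# Which on-disk segment file holds that page, and where inside it?
pages_per_segment = SEGMENT_SIZE // BLOCK_SIZE
segment_number, page_in_segment = divmod(page, pages_per_segment)

print(f"{min_relation_gib:.2f} GiB")
print(segment_number, page_in_segment)
```

Under those assumptions the table must be at least about 12.75 GiB for the page to exist, and the page would live in segment file base/16400/559613.12. If that file is missing or shorter than expected on the standby, the page reference in the WAL really is invalid there.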
You didn't tell which PostgreSQL version you are using, but maybe there are replication bugs fixed in later versions that could cause replication problems even without a hardware bug present; read the release notes.
I would perform a new pg_basebackup and reinitialize the slave system.
But what I'd really be worried about is possible data corruption on the master server. Block checksums are cool (turned on if pg_controldata <data directory> | grep checksum gives you 1), but possibly won't detect the effects of memory corruption.
Try something like
pg_dumpall -f /dev/null
on the master and see if there are errors.
Keep your old backups in case you need to repair something!

MongoDB FATAL ERROR: Out of memory trying to allocate internal tcmalloc data

I am running MongoDB 3.2 on Ubuntu 14.04 Server (64-bit). The MongoDB server keeps crashing. Whenever I restart the server I see this:
stop: Unknown instance:
mongod start/running, process 25687
Also, when I run the mongo shell after this, I get the following error:
src/third_party/gperftools-2.2/src/page_heap_allocator.h:74] FATAL ERROR: Out of memory trying to allocate internal tcmalloc data (bytes, object-size) 131072 48
This was not happening on my system before. How can I correct this error? It happens every 2-3 hours.
Complete MongoDB Log File : Download Here
EDIT1: Added mongodb log file.