Dataflow template job fails because of a GCS problem - google-cloud-storage

My job ID is
2019-02-01_06_50_27-10838491598599390366
This is a Dataflow batch job launched from a template. Here is the error output from the worker logs:
2019-02-01 23:51:02.647 JST
EXT4-fs (dm-0): couldn't mount as ext3 due to feature incompatibilities
2019-02-01 23:51:02.659 JST
EXT4-fs (dm-0): couldn't mount as ext3 due to feature incompatibilities
2019-02-01 23:51:02.699 JST
EXT4-fs (dm-0): couldn't mount as ext3 due to feature incompatibilities
2019-02-01 23:51:02.699 JST
EXT4-fs (dm-0): couldn't mount as ext3 due to feature incompatibilities
2019-02-01 23:51:02.700 JST
EXT4-fs (dm-0): couldn't mount as ext3 due to feature incompatibilities
2019-02-01 23:51:02.710 JST
EXT4-fs (dm-0): couldn't mount as ext3 due to feature incompatibilities
2019-02-01 23:51:02.937 JST
EXT4-fs (dm-0): couldn't mount as ext3 due to feature incompatibilities
2019-02-01 23:51:03.387 JST
EXT4-fs (dm-0): couldn't mount as ext3 due to feature incompatibilities
2019-02-01 23:51:10.509 JST
Error initializing dynamic plugin prober: Error (re-)creating driver directory: mkdir /usr/libexec/kubernetes: read-only file system
2019-02-01 23:51:10.511 JST
Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
{
insertId: "s=51b724ba020b4384acc382e634e62cbc;i=568;b=879cba75f5cd4eff82751e8f30ef312b;m=9a91b9;t=580d6461241e4;x=6549465094b7bc54"
jsonPayload: {…}
labels: {…}
logName: "projects/fluted-airline-109810/logs/dataflow.googleapis.com%2Fkubelet"
receiveTimestamp: "2019-02-01T14:51:18.883283433Z"
resource: {…}
severity: "ERROR"
timestamp: "2019-02-01T14:51:10.511494Z"
}
2019-02-01 23:51:10.560 JST
[ContainerManager]: Fail to get rootfs information unable to find data for container /
2019-02-01 23:51:10.577 JST
Error initializing dynamic plugin prober: Error (re-)creating driver directory: mkdir /usr/libexec/kubernetes: read-only file system
2019-02-01 23:51:10.580 JST
Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
2019-02-01 23:51:10.608 JST
[ContainerManager]: Fail to get rootfs information unable to find data for container /
2019-02-01 23:51:10.645 JST
Error initializing dynamic plugin prober: Error (re-)creating driver directory: mkdir /usr/libexec/kubernetes: read-only file system
2019-02-01 23:51:10.646 JST
Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
2019-02-01 23:51:10.694 JST
[ContainerManager]: Fail to get rootfs information unable to find data for container /
2019-02-01 23:51:10.749 JST
Error initializing dynamic plugin prober: Error (re-)creating driver directory: mkdir /usr/libexec/kubernetes: read-only file system
2019-02-01 23:51:10.751 JST
Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
2019-02-01 23:51:10.775 JST
Error initializing dynamic plugin prober: Error (re-)creating driver directory: mkdir /usr/libexec/kubernetes: read-only file system
2019-02-01 23:51:10.777 JST
Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
2019-02-01 23:51:10.785 JST
[ContainerManager]: Fail to get rootfs information unable to find data for container /
2019-02-01 23:51:10.809 JST
Error initializing dynamic plugin prober: Error (re-)creating driver directory: mkdir /usr/libexec/kubernetes: read-only file system
2019-02-01 23:51:10.811 JST
Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
2019-02-01 23:51:10.816 JST
[ContainerManager]: Fail to get rootfs information unable to find data for container /
2019-02-01 23:51:10.857 JST
[ContainerManager]: Fail to get rootfs information unable to find data for container /
2019-02-01 23:51:10.929 JST
Error initializing dynamic plugin prober: Error (re-)creating driver directory: mkdir /usr/libexec/kubernetes: read-only file system
2019-02-01 23:51:10.931 JST
Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
2019-02-01 23:51:10.966 JST
[ContainerManager]: Fail to get rootfs information unable to find data for container /
2019-02-01 23:51:11.214 JST
Error initializing dynamic plugin prober: Error (re-)creating driver directory: mkdir /usr/libexec/kubernetes: read-only file system
2019-02-01 23:51:11.216 JST
Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
2019-02-01 23:51:11.254 JST
[ContainerManager]: Fail to get rootfs information unable to find data for container /
2019-02-01 23:51:15.619 JST
PercpuUsage had 0 cpus, but the actual number is 2; ignoring extra CPUs
2019-02-01 23:51:15.793 JST
PercpuUsage had 0 cpus, but the actual number is 2; ignoring extra CPUs
2019-02-01 23:51:15.974 JST
PercpuUsage had 0 cpus, but the actual number is 2; ignoring extra CPUs
2019-02-01 23:51:16.264 JST
PercpuUsage had 0 cpus, but the actual number is 2; ignoring extra CPUs

Is the gs:// bucket accessible by the service account for this job?
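A quick way to check, as a sketch: the bucket name below is a placeholder, and the worker service account is assumed to be the Compute Engine default (PROJECT_NUMBER-compute@developer.gserviceaccount.com) unless you configured a custom one for the job.
gsutil iam get gs://YOUR_BUCKET
gsutil iam ch serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com:roles/storage.objectAdmin gs://YOUR_BUCKET
The first command lists the bucket-level IAM bindings so you can see whether the worker account has object access; the second grants it object read/write access if it is missing.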

Related

Could not access file "pglogical" while trying to install pglogical

I'm following the instructions from https://github.com/2ndQuadrant/pglogical to install pglogical on PostgreSQL 12 on CentOS 8. The install seems to be successful:
yum -y install postgresql12-pglogical
Last metadata expiration check: 0:21:30 ago on Wed 30 Sep 2020 09:32:13 PM CDT.
Dependencies resolved.
=====================================================================================================================================================================================================================================================
Package Architecture Version Repository Size
=====================================================================================================================================================================================================================================================
Installing:
postgresql12-pglogical x86_64 2.3.2-1.el8 2ndquadrant-dl-default-release-pg12 145 k
Installing dependencies:
postgresql12 x86_64 12.4-1PGDG.rhel8 pgdg12 1.6 M
postgresql12-server x86_64 12.4-1PGDG.rhel8 pgdg12 5.2 M
Transaction Summary
=====================================================================================================================================================================================================================================================
Install 3 Packages
Total download size: 7.0 M
Installed size: 29 M
Downloading Packages:
(1/3): postgresql12-12.4-1PGDG.rhel8.x86_64.rpm 1.5 MB/s | 1.6 MB 00:01
(2/3): postgresql12-pglogical-2.3.2-1.el8.x86_64.rpm 117 kB/s | 145 kB 00:01
(3/3): postgresql12-server-12.4-1PGDG.rhel8.x86_64.rpm 4.0 MB/s | 5.2 MB 00:01
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total 5.3 MB/s | 7.0 MB 00:01
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : postgresql12-12.4-1PGDG.rhel8.x86_64 1/3
Running scriptlet: postgresql12-12.4-1PGDG.rhel8.x86_64 1/3
failed to link /usr/bin/psql -> /etc/alternatives/pgsql-psql: /usr/bin/psql exists and it is not a symlink
failed to link /usr/bin/clusterdb -> /etc/alternatives/pgsql-clusterdb: /usr/bin/clusterdb exists and it is not a symlink
failed to link /usr/bin/createdb -> /etc/alternatives/pgsql-createdb: /usr/bin/createdb exists and it is not a symlink
failed to link /usr/bin/createuser -> /etc/alternatives/pgsql-createuser: /usr/bin/createuser exists and it is not a symlink
failed to link /usr/bin/dropdb -> /etc/alternatives/pgsql-dropdb: /usr/bin/dropdb exists and it is not a symlink
failed to link /usr/bin/dropuser -> /etc/alternatives/pgsql-dropuser: /usr/bin/dropuser exists and it is not a symlink
failed to link /usr/bin/pg_basebackup -> /etc/alternatives/pgsql-pg_basebackup: /usr/bin/pg_basebackup exists and it is not a symlink
failed to link /usr/bin/pg_dump -> /etc/alternatives/pgsql-pg_dump: /usr/bin/pg_dump exists and it is not a symlink
failed to link /usr/bin/pg_dumpall -> /etc/alternatives/pgsql-pg_dumpall: /usr/bin/pg_dumpall exists and it is not a symlink
failed to link /usr/bin/pg_restore -> /etc/alternatives/pgsql-pg_restore: /usr/bin/pg_restore exists and it is not a symlink
failed to link /usr/bin/reindexdb -> /etc/alternatives/pgsql-reindexdb: /usr/bin/reindexdb exists and it is not a symlink
failed to link /usr/bin/vacuumdb -> /etc/alternatives/pgsql-vacuumdb: /usr/bin/vacuumdb exists and it is not a symlink
Running scriptlet: postgresql12-server-12.4-1PGDG.rhel8.x86_64 2/3
Installing : postgresql12-server-12.4-1PGDG.rhel8.x86_64 2/3
Running scriptlet: postgresql12-server-12.4-1PGDG.rhel8.x86_64 2/3
Installing : postgresql12-pglogical-2.3.2-1.el8.x86_64 3/3
Running scriptlet: postgresql12-pglogical-2.3.2-1.el8.x86_64 3/3
Verifying : postgresql12-pglogical-2.3.2-1.el8.x86_64 1/3
Verifying : postgresql12-12.4-1PGDG.rhel8.x86_64 2/3
Verifying : postgresql12-server-12.4-1PGDG.rhel8.x86_64 3/3
Installed:
postgresql12-12.4-1PGDG.rhel8.x86_64 postgresql12-pglogical-2.3.2-1.el8.x86_64 postgresql12-server-12.4-1PGDG.rhel8.x86_64
Complete!
But when I try to restart Postgres, I get this error:
systemctl restart postgresql
Job for postgresql.service failed because the control process exited with error code.
See "systemctl status postgresql.service" and "journalctl -xe" for details.
Relevant portions of the journalctl -xe output:
-- Unit postgresql.service has begun starting up.
Sep 30 21:54:59 aba postmaster[305963]: 2020-10-01 02:54:59.825 UTC [305963] FATAL: could not access file "pglogical": No such file or directory
Sep 30 21:54:59 aba postmaster[305963]: 2020-10-01 02:54:59.825 UTC [305963] LOG: database system is shut down
Sep 30 21:54:59 aba systemd[1]: postgresql.service: Main process exited, code=exited, status=1/FAILURE
Sep 30 21:54:59 aba systemd[1]: postgresql.service: Failed with result 'exit-code'.
Sep 30 21:54:59 aba systemd[1]: Failed to start PostgreSQL database server.
-- Subject: Unit postgresql.service has failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- Unit postgresql.service has failed.
--
-- The result is failed.
I am lost!
Your session log shows that the server was installed as a prerequisite, but the "link" messages indicate that there was already an incompatible client version in place. Probably you had installed PostgreSQL from the CentOS packages, but the pglogical RPMs pulled in the PGDG packages.
The error message probably means that shared_preload_libraries contains pglogical, but pglogical.so could not be found in the lib directory.
Presumably the installation process edited the configuration in your old server installation, but installed the shared object in the new one.
Upshot: you cannot use those pglogical binaries with your installation. Either switch to the PGDG RPMs or build pglogical from source.
You see that there is a certain amount of conjecture in my deductions, but that should help you solve the problem.
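If you want to verify the diagnosis, a rough check (the paths assume the stock CentOS data directory and the standard PGDG 12 layout, so adjust them to your system):
grep shared_preload_libraries /var/lib/pgsql/data/postgresql.conf    # config of the CentOS-packaged server
ls /usr/pgsql-12/lib/pglogical.so                                    # where the PGDG RPM puts the library
pg_config --pkglibdir                                                # library directory of whichever installation is first on PATH
If the configuration file that the failing service reads lists pglogical, but that installation's library directory contains no pglogical.so, you are in exactly the mixed-installation situation described above.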

How can we mount the root file system on Android 10?

I am able to mount the system partition, but I am not able to mount the root partition. I get an error like this:
1|console:/ # mount -o rw,remount /
[ 3640.420613] EXT4-fs (dm-0): couldn't mount RDWR because of unsupported optional features (4000)
[ 3640.434479] EXT4-fs (dm-0): couldn't mount RDWR because of unsupported optional features (4000)
'/dev/block/dm-0' is read-only
console:/ # [ 3903.028999] WLDEV-ERROR)

osm2pgsql import fails with "Failed to read from node cache: Input/output error"

I'm attempting a whole-planet OSM data import on an AWS EC2 instance. During, or possibly after, the "Ways" processing I receive the following message:
"Failed to read from node cache: Input/output error"
The EC2 instance has the following specs:
type: i3.xlarge
memory: 30.5 GB
vCPUs: 4
PostgreSQL: v9.5.6
PostGIS: 2.2
In addition to the root volume, I have mounted a 900 GB SSD and a 2 TB HDD (high throughput). The PostgreSQL data directory is on the HDD, and I have told osm2pgsql to write the flat-nodes file to the SSD.
Here is my osm2pgsql command:
osm2pgsql -c -d gis --number-processes 4 --slim -C 20000 --flat-nodes /data-cache/flat-node-cache/flat.nodes /data-postgres/planet-latest.osm.pbf
I run the above command as user renderaccount, which is a member of the following groups: renderaccount, ubuntu, postgres. The flat-nodes file appears to be successfully created at /data-cache/flat-node-cache/flat.nodes and looks like this:
ubuntu@ip-172-31-25-230:/data-cache/flat-node-cache$ ls -l
total 37281800
-rw------- 1 renderaccount renderaccount 38176555024 Apr 13 05:45 flat.nodes
Has anyone run into and/or resolved this? I suspect it may be a permissions issue. I also notice that, since this last osm2pgsql failure, the mounted SSD that holds the flat-nodes file has been remounted as a "read-only" file system - which, as I understand it, can happen when there are I/O errors on the mounted volume.
Also, does osm2pgsql write to a log from which I could get additional information?
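Before digging further, a few quick checks may help narrow it down (the paths and user name are the ones from the question):
mount | grep /data-cache
sudo -u renderaccount touch /data-cache/flat-node-cache/write-test
dmesg | tail -n 50
The first shows whether the SSD now carries the ro flag, the second tests whether the import user can still write to the flat-nodes directory, and the third surfaces any recent block-device or filesystem errors.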
UPDATE: dmesg output:
[ 6206.884412] blk_update_request: I/O error, dev nvme0n1, sector 66250752
[ 6206.890813] EXT4-fs warning (device nvme0n1): ext4_end_bio:329: I/O error -5 writing to inode 14024706 (offset 10871640064 size 8388608 starting block 8281600)
[ 6206.890817] Buffer I/O error on device nvme0n1, logical block 8281344
After researching the above output, it appears it might be a bug in Ubuntu 16.04. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129?comments=all
This was an error with Ubuntu 16.04 writing to the nvme0n1 volume. It was solved by following this: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129/comments/29
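For anyone hitting the same symptoms: the mitigation usually suggested for this class of NVMe I/O timeouts is to raise the NVMe I/O timeout via a kernel parameter and reboot - treat the exact value as something to confirm against the bug thread and your kernel version. On Ubuntu that looks roughly like:
# in /etc/default/grub, append nvme_core.io_timeout=4294967295 to GRUB_CMDLINE_LINUX_DEFAULT, then:
sudo update-grub
sudo reboot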

Kubernetes node high disk IO and CPU usage

I run a Kubernetes cluster with version 1.5.2, set up with Kops on AWS. The setup has nothing exotic. My nodes run on m4.xlarge instances with 70 GB of disk storage at 1000 IOPS.
I have periods where some of my nodes go crazy with IOPS. What I see is that du takes all of my IOPS in the Docker overlay directory. Here is what the kubelet logs display:
fsHandler.go:131] du and find on following dirs took 4.22914425s: [/var/lib/docker/overlay/592c1d88d1fd115f21e8fe6f198a8a27cd44efefb9b5dc58940fbf6d7999eda3 /var/lib/docker/containers/2347d28886bc0e6b74fc326538e1483927ddeb89b38e035acd845d5db621cb79]
fsHandler.go:131] du and find on following dirs took 24.94283434s: [/var/lib/docker/overlay/81f24df3624ebf7b7e45edc38fafeb41958bc675ae57fd0126c44cb2c3a6d6d6 /var/lib/docker/containers/43d576931081500fd4cd316afe5bfc6ff2442ff20e8e8266c27e930a0a77dd34]
fsHandler.go:131] du and find on following dirs took 18.478782737s: [/var/lib/docker/overlay/422ef31413df4e76de51acaa7d6ff6f77edc65fabde88a7c70e7edad3b1e55e5 /var/lib/docker/containers/1519a33729c8fb13297358edc53fe22f0b4b684636884976dfcb67c47fbf320b]
helpers.go:101] Unable to get network stats from pid 13515: couldn't read network stats: failure opening /proc/13515/net/dev: open /proc/13515/net/dev: no such file or directory
fsHandler.go:131] du and find on following dirs took 7.971745844s: [/var/lib/docker/overlay/45b83939bd1b4ec7dfa627bb6a9eb8b89a380007f9e22a93fff2ba4054252271 /var/lib/docker/containers/f6d3387423398d7dd4fac6c19ee0a1446d0465b5f9cf90289fcd605ad28c0d6e]
fsHandler.go:131] du and find on following dirs took 5.886763577s: [/var/lib/docker/overlay/8c01a73671eedb2e62c58fa12fc2d25df58c506545b6ea048fa0db1756d19f2c /var/lib/docker/containers/1d9c0ebcc6dbbd7065923f7f81c05c0d9d710aed0d353a1bab90ce1c994dfb57]
fsHandler.go:131] du and find on following dirs took 5.714942029s: [/var/lib/docker/overlay/26213ba30a17f240a9b9756a0d23ab32550f921de533667c9ab91cfb7f10ed5b /var/lib/docker/containers/7c27c242a49d8d33cee8b2e8335dae450af13b26f010794dc83ef5750a212d0d]
fsHandler.go:131] du and find on following dirs took 6.111478835s: [/var/lib/docker/overlay/0fe2bd0feeda24699bd6d443ca126ac1a33071cdff039ae9fd9159bbef80867b /var/lib/docker/containers/ec6fb966139e9666ec0be5e13399773f1971ddd99841b84167a7463402e28d73]
fsHandler.go:131] du and find on following dirs took 2.661604836s: [/var/lib/docker/overlay/04f9d01a8863cfee26e678e938fced84f826dda6ed03626dda11b6aad6901465 /var/lib/docker/containers/a4e37aee69c7523c46c5252c1834fa3fcd5a804a7aee256a468e44b4d6bcbd64]
fsHandler.go:131] du and find on following dirs took 11.834409809s: [/var/lib/docker/overlay/4cb1476621b90e2c2ee2b1131c0e6ac62f62dc3ca418129812b487bffac1d827 /var/lib/docker/containers/5a01521cfdd3041aff128dce7353ab336ddafa60c8c0b2254fb6bae697cb1676]
I recommend upgrading to k8s version 1.6; there are many updates noted in the CHANGELOG that should help you debug this issue.
Also, EBS volumes restored from snapshots are not fully available in terms of I/O until you have fully "pre-warmed" (initialized) them by reading every block on the device.
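For the pre-warming part: reading every block once is enough to initialize a volume that was restored from a snapshot. A minimal sketch, where the device name is just an example:
sudo dd if=/dev/xvdf of=/dev/null bs=1M
Expect this to take a while on large volumes; it only needs to be done once per restored volume.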

WAL contains references to invalid pages

centos 6.7
postgresql 9.5.3
I have DB servers set up with master-standby replication. Suddenly, the standby server's PostgreSQL process stopped with these logs:
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]WARNING: page 1671400 of relation base/16400/559613 is uninitialized
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]CONTEXT: xlog redo Heap2/VISIBLE: cutoff xid 1902107520
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]PANIC: WAL contains references to invalid pages
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]CONTEXT: xlog redo Heap2/VISIBLE: cutoff xid 1902107520
2016-07-14 18:14:21.026 JST [][5783e038.3cd9][0][15577]LOG: startup process (PID 15579) was terminated by signal 6: Aborted
2016-07-14 18:14:21.026 JST [][5783e038.3cd9][0][15577]LOG: terminating any other active server processes
The master server's PostgreSQL logs showed nothing special, but the master server's /var/log/messages contained the entries below.
Jul 14 05:38:44 host kernel: sbridge: HANDLING MCE MEMORY ERROR
Jul 14 05:38:44 host kernel: CPU 8: Machine Check Exception: 0 Bank 9: 8c000040000800c0
Jul 14 05:38:44 host kernel: TSC 0 ADDR 1f7dad7000 MISC 90004000400008c PROCESSOR 0:306e4 TIME 1468442324 SOCKET 1 APIC 20
Jul 14 05:38:44 host kernel: EDAC MC1: CE row 1, channel 0, label "CPU_SrcID#1_Channel#0_DIMM#1": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=8 Err=0008:00c0 (ch=0), addr = 0x1f7dad7000 => socket=1, Channel=0(mask=1), rank=4
Jul 14 05:38:44 host kernel:
Jul 14 18:30:40 host kernel: sbridge: HANDLING MCE MEMORY ERROR
Jul 14 18:30:40 host kernel: CPU 8: Machine Check Exception: 0 Bank 9: 8c000040000800c0
Jul 14 18:30:40 host kernel: TSC 0 ADDR 1f7dad7000 MISC 90004000400008c PROCESSOR 0:306e4 TIME 1468488640 SOCKET 1 APIC 20
Jul 14 18:30:41 host kernel: EDAC MC1: CE row 1, channel 0, label "CPU_SrcID#1_Channel#0_DIMM#1": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=8 Err=0008:00c0 (ch=0), addr = 0x1f7dad7000 => socket=1, Channel=0(mask=1), rank=4
Jul 14 18:30:41 host kernel:
The memory errors started about a week ago, so I suspect they are the cause of PostgreSQL's error. My questions are:
1) Can a kernel-level memory error cause PostgreSQL's "WAL contains references to invalid pages" error?
2) Why are there no related log entries in the master server's PostgreSQL logs?
Thanks.
Faulty memory can cause all kinds of data corruption, so that seems like a good enough explanation to me.
Perhaps there are no log entries at the master PostgreSQL server because all that was corrupted was the WAL stream.
You can run
oid2name
to find out which database has OID 16400 and then
oid2name -d <database with OID 16400> -f 559613
to find out which table belongs to file 559613.
Is that table larger than 12 GB? (1671400 pages at the default 8 kB page size is about 12.7 GB.) If not, that would mean that page 1671400 is indeed an invalid value.
You are running PostgreSQL 9.5.3; there may be replication bugs fixed in later releases that could cause replication problems even without a hardware problem, so read the release notes.
I would perform a new pg_basebackup and reinitialize the slave system.
But what I'd really be worried about is possible data corruption on the master server. Block checksums are cool (turned on if pg_controldata <data directory> | grep checksum gives you 1), but possibly won't detect the effects of memory corruption.
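For example, with the PGDG packages on CentOS that check would look roughly like this (the data directory path is an assumption, adjust it to your cluster):
pg_controldata /var/lib/pgsql/9.5/data | grep checksum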
Try something like
pg_dumpall -f /dev/null
on the master and see if there are errors.
Keep your old backups in case you need to repair something!