How to use zdb -e poolname to recover data from a single ZFS device

I have the following situation:
1 × 10 TB drive, full of data, in a ZFS pool
I wanted to add a 100 GB NVMe partition as a cache device.
Instead of using zpool add poolname cache nvmepartition, I wrote zpool add poolname nvmepartition.
I did not notice my mistake and exported the pool.
The NVMe drive is no longer available, and the system has no information about this pool in the ZFS cache file (because of the export).
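For reference, this is the difference between the two commands; the device path below is just a placeholder, not my actual partition:

zpool add poolname cache /dev/nvme0n1p2   # intended: adds an L2ARC cache device, which can be removed again later
zpool add poolname /dev/nvme0n1p2         # what I actually ran: adds a new top-level data vdev, striping writes across it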
Current status:
zpool import shows the pool, but I cannot import it using any method I found on the internet.
zdb -e poolname shows me what I already know: the pool, its name, and that it (sadly) has 2 children, one of which is no longer available. The system has no information about the missing child, so all the tricks I found on the internet involving linking in a ghost device etc. won't work either.
As far as I know, the only way left is to use zdb to reconstruct the files from the pool's on-disk structures (the "journal") and pipe/save them to another path.
But how? I could not find any documentation on that anywhere.
Note: the 10 TB drive was 90% full when I added the NVMe partition as a sibling top-level vdev. Since ZFS striping is not a strict RAID 0, the two devices were extremely unequal in size, and I wrote very little data after the mistake happened, I am quite sure that most of my data is still there.
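For context, these are the kinds of zdb invocations I have been experimenting with, as far as I understand the man page; the device directory and the block address below are only placeholders:

zdb -e -p /dev/disk/by-id poolname                 # inspect the exported pool, searching the given device directory
zdb -e -d poolname                                 # list datasets and their objects (repeat -d for more detail)
zdb -e -R poolname 0:400000:20000:r > block.bin    # dump one raw block, addressed as vdev:offset:size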


Truncated file in XFS filesystem using dd - how to recover

Disk layout
My HDD RAID array is reaching its end of life, and I bought some new disks for it.
I used the old HDDs as storage for raw disk images for KVM/QEMU virtual machines.
The RAID array was built using mdadm. On the md device I have a physical volume for LVM. On the physical volume I have an XFS file system which stores the raw disk images.
Every raw disk image was made by qemu-img and contains a physical volume for LVM. One PV = one VG = one LV inside each raw disk image.
Action
When I tried to move the data with cp, I ran into bad blocks and I/O problems in my RAID array, so I switched from cp to dd with the noerror,sync flags.
I wrote dd if=/mnt/old/file.img of=/mnt/old/file.img bs=4k conv=noerror,sync (note the destination: old instead of new).
Problem
Now the file /mnt/old/file.img has zero size in the XFS file system.
Is there a simple solution to recover it?
My sense is your RAID array has failed. You can see the RAID state with...
cat /proc/mdstat
Since you are seeing I/O errors, that is likely the source of your problem. The best path forward would be to make sector-level copies of each RAID member (or at a minimum the member(s) that are throwing I/O errors). See GNU ddrescue; it is designed to copy failing hard drives. Then perform the recovery work from the copies.
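A minimal sketch of that approach; the device and file names are placeholders:

ddrescue -n /dev/sdX /mnt/backup/sdX.img /mnt/backup/sdX.map    # first pass: copy everything readable, skip the hard areas
ddrescue -r3 /dev/sdX /mnt/backup/sdX.img /mnt/backup/sdX.map   # second pass: retry the bad areas up to 3 times, using the same map file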
I finally found a solution, but it isn't very simple.
xfs_undelete did not fit my problem, because it does not support the B+tree extent storage format (V3) used for very big files.
The semi-manual procedure that worked for me consists of these main steps:
Unmount the filesystem immediately and make a full partition backup to a file using dd.
Investigate the XFS log entries about the truncated file.
Manually revert the inode core header using xfs_db in expert mode (see the sketch below).
NB: recovering the inode core does not un-free the extents, so when you try to copy data from the file with the recovered inode header in the usual way, you will get an I/O error. That is why I had to develop a Python script.
Use the script to extract the extent data from the inode's B+tree and write it to disk.
I have published the recovery script on GitHub under the LGPL license.
P.S. Some data was lost because of corrupted inode B+tree extent records, but it was nothing that mattered to me.
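To give an idea of the xfs_db step, here is a rough sketch; the inode number and size are placeholders, and the exact fields to restore depend on what the XFS log shows:

xfs_db -x /mnt/backup/xfs_partition.img    # -x enables expert (write) mode; work on the dd copy, never the original
xfs_db> inode 12345                        # select the inode of the truncated file
xfs_db> print core.size                    # inspect the current (zeroed) field
xfs_db> write core.size 10737418240        # write back the original file size in bytes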

Could not create shared memory segment. Failed system call was shmget. PostgreSQL MacOS Mojave. Symlinked postgres data directory

This is a common error message and there are many general answers that have not worked for me.
I think I have isolated this particular problem to the PostgreSQL data directory being symlinked to an external hard drive.
FATAL: could not create shared memory segment: No space left on device
DETAIL: Failed system call was shmget(key=5432001, size=56, 03600).
HINT: This error does *not* mean that you have run out of disk space. It occurs either if all available shared memory IDs have been taken, in which case you need to raise the SHMMNI parameter in your kernel, or because the system's overall limit for shared memory has been reached.
$ sysctl -a | grep sysv
kern.sysv.shmmax: 412316860416
kern.sysv.shmmin: 8
kern.sysv.shmmni: 64
kern.sysv.shmseg: 128
kern.sysv.shmall: 100663296
$ sudo cat /etc/sysctl.conf
kern.sysv.shmmax=412316860416
kern.sysv.shmmin=8
kern.sysv.shmmni=64
kern.sysv.shmseg=128
kern.sysv.shmall=100663296
PostgreSQL version 9.4.15. From my PostgreSQL config
shared_buffers = 128MB
Don't know what other settings would be relevant.
Other environment details:
The external hard drive with the data directory is at only 50% capacity. My RAM usage when this happens is ~60% capacity.
I have not been able to determine an exact set of steps that reproduces the bug. I have an external hard drive with a PostgreSQL data directory and a local folder with another data directory. In my project, I'll symlink to one or the other depending on which copy of data I want to use. As far as I have noticed, the problem only appears when I've been working off the symlinked hard drive and when I unplug it without stopping the server and then plug it back in. But it doesn't happen every time when I perform those steps.
I don't expect anyone to be able to point to the specific problem given the above description.
But how can I get more useful information next time I'm in a bugged state? Are there any system commands that would help identify the exact problem?
...It occurs either if all available shared memory IDs have been taken, in which case you need to raise the SHMMNI parameter in your kernel, or because the system's overall limit for shared memory has been reached.
How can I check whether all available shared memory IDs have been taken, or whether the system's overall limit for shared memory has been reached, and what do I do with the answer?
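From what I have read, the System V segments themselves can be listed with ipcs, which is the kind of check I have in mind, though I have not confirmed it pinpoints my problem (the shmid below is a placeholder):

ipcs -m              # list existing System V shared memory segments; compare the count with kern.sysv.shmmni
ipcrm -m <shmid>     # remove a stale segment left behind by a crashed backend (make sure nothing still uses it)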

Deleting files in Ceph does not free up space

I am using Ceph, uploading many files through radosgw. Afterwards I want to delete the files. I am trying to do that in Python, like this:
bucket = conn.get_bucket(BUCKET)
for key in bucket.list():
    bucket.delete_key(key)
Afterwards, I use bucket.list() to list files in the bucket, and this says that the bucket is now empty, as I intended.
However, when I run ceph df on the mon, it shows that the OSDs still have high utilization (e.g. %RAW USED 90.91). If I continue writing (thinking that the status data just hasn't caught up with the state yet), Ceph essentially locks up (100% utilization).
What's going on?
Note: I do have these standing out in ceph status:
health HEALTH_WARN
3 near full osd(s)
too many PGs per OSD (2168 > max 300)
pool default.rgw.buckets.data has many more objects per pg than average (too few pgs?)
From what I gather online, this wouldn't cause my particular issue. But I'm new to Ceph and could be wrong.
I have one mon and 3 OSDs. This is just for testing.
You can check whether the objects are really deleted with rados -p $pool ls.
I know that for CephFS, when you delete a file, the call returns OK as soon as the MDS marks it as deleted in its local memory; the real deletion happens later, by sending delete messages to the related OSDs.
Maybe radosgw uses the same design to speed up deletes.
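A rough sketch of what I mean; the pool name is taken from your ceph status output, and I am assuming radosgw's garbage collector handles the deferred deletes:

rados -p default.rgw.buckets.data ls | head   # check whether the objects are really gone from the data pool
radosgw-admin gc list                         # objects queued for deferred deletion by the radosgw garbage collector
radosgw-admin gc process                      # run a garbage collection pass now instead of waiting for it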

100GB free space on NFS server, but can't write even an empty file

On my production NFS server, more than 100 GB are free, but I can't write even an empty file to that drive. I have since worked around the issue by removing some folders on that drive.
Use both df and df -i after reading df(1); perhaps you have run out of free inodes in your file system. See also stat(1), and run stat -f.
Perhaps you have reached some disk quota. See also quota(1).
Consider using strace(1) to find the failing syscall and its errno.
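Concretely, something along these lines on the machine exporting the filesystem (the mount point is a placeholder):

df /srv/nfs                                       # free blocks
df -i /srv/nfs                                    # free inodes; 0 in IFree means no new file can be created
stat -f /srv/nfs                                  # filesystem-level block and inode counts
quota -v                                          # per-user quota usage, if quotas are enabled
strace -f -e trace=file touch /srv/nfs/testfile   # shows the failing syscall and its errno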

Linux Page Cache Replacement

I have two PostgreSQL databases named data-1 and data-2 that sit on the same machine. Both databases keep 40 GB of data, and the total memory available on the machine is 68GB.
I started data-1 and data-2, and ran several queries to go over all of their data. Then I shut down data-1 and kept issuing queries against data-2. For some reason, the OS still holds on to large parts of data-1's pages in its page cache, and reserves only about 35 GB of RAM for data-2's files. As a result, my queries on data-2 keep hitting disk.
I'm checking page cache usage with fincore. When I run a table scan query against data-2, I see that data-2's pages get evicted and put back into the cache in a round-robin manner. Nothing happens to data-1's pages, although they haven't been touched for days.
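For reference, this is roughly how I check it; the paths are examples rather than my real data directories, and the exact flags differ between fincore implementations (this is the plain util-linux syntax):

fincore /data/data-1/base/16384/*   # how much of each data-1 relation file is resident in the page cache
fincore /data/data-2/base/16385/*   # same for data-2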
Does anybody know why data-1's pages aren't evicted from the page cache? I'm open to any suggestion about what this might be related to.
This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no swap space. The kernel version is:
$ uname -r
3.2.28-45.62.amzn1.x86_64
Edit-1:
It seems that there is no NUMA configuration:
$ dmesg | grep -i numa
[ 0.000000] No NUMA configuration found
Edit-2:
I used the page-types tool from the Linux kernel source tree to monitor page cache status. From the results I conclude that:
data-1 pages are in state : referenced,uptodate,lru,active,private
data-2 pages are in state : referenced,uptodate,lru,mappedtodisk
Take a look at the cpusets you have configured in /dev/cpusets. If you have multiple directories in here then you have multiple cpusets, and potentially multiple memory nodes.
The cpusets mechanism is documented in detail here: http://www.kernel.org/doc/man-pages/online/pages/man7/cpuset.7.html
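To check that concretely (the cpuset mount point may be /dev/cpuset or elsewhere depending on the distribution, and numactl/numastat may need to be installed):

ls /dev/cpuset/                      # one subdirectory per configured cpuset, if any
cat /dev/cpuset/*/mems 2>/dev/null   # which memory nodes each cpuset is allowed to allocate from
numactl --hardware                   # NUMA node layout and per-node free memory
numastat                             # per-node allocation statistics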