Checksums with gsutil rsync - google-cloud-storage

I am downloading a large number of public data files from Google Cloud Storage using gsutil rsync. Occasionally the download fails for a few files. To ensure that I have all of the requested files, I run gsutil a second time with checksums turned on. During the second run, gsutil reports that it is computing checksums for fewer files than were downloaded. I have attached some sample output below. In this case it downloaded 29 files during the first rsync, but only reported that it was computing checksums for 16 files during the second rsync.
Is gsutil not computing the checksums and doing the rsync for some of the files, or is it simply not reporting that it is doing the checksums?
Ken
mix> gsutil -m rsync -R -P gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX /csrpc1/NEXRAD/level2/2017/201702/20170201/KHGX
Building synchronization state...
Starting synchronization
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170131210000_20170131215959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170131190000_20170131195959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170131230000_20170131235959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170131200000_20170131205959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170131220000_20170131225959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201040000_20170201045959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201090000_20170201095959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201020000_20170201025959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201010000_20170201015959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201030000_20170201035959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201180000_20170201185959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201130000_20170201135959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201060000_20170201065959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201100000_20170201105959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201110000_20170201115959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201120000_20170201125959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201170000_20170201175959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201150000_20170201155959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201160000_20170201165959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201050000_20170201055959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201000000_20170201005959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201080000_20170201085959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201070000_20170201075959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201140000_20170201145959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201200000_20170201205959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201190000_20170201195959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201210000_20170201215959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201220000_20170201225959.tar...
Copying gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201230000_20170201235959.tar...
- [29/29 files][387.3 MiB/387.3 MiB] 100% Done
Operation completed over 29 objects/387.3 MiB.
mix> gsutil -m rsync -R -P -c gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX /csrpc1/NEXRAD/level2/2017/201702/20170201/KHGX
Building synchronization state...
Starting synchronization
Computing CRC32C for gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170131190000_20170131195959.tar...
Computing CRC32C for gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170131200000_20170131205959.tar...
Computing CRC32C for gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170131210000_20170131215959.tar...
Computing CRC32C for gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170131220000_20170131225959.tar...
Computing CRC32C for gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201000000_20170201005959.tar...
Computing CRC32C for gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201010000_20170201015959.tar...
Computing CRC32C for gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201020000_20170201025959.tar...
Computing CRC32C for gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201030000_20170201035959.tar...
Computing CRC32C for gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201040000_20170201045959.tar...
Computing CRC32C for gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201050000_20170201055959.tar...
Computing CRC32C for gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201160000_20170201165959.tar...
Computing CRC32C for gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201170000_20170201175959.tar...
Computing CRC32C for gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201180000_20170201185959.tar...
Computing CRC32C for gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201190000_20170201195959.tar...
Computing CRC32C for gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201200000_20170201205959.tar...
Computing CRC32C for gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201210000_20170201215959.tar...
Computing CRC32C for gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201210000_20170201215959.tar...
mix>

gsutil rsync -c only computes checksums for files whose source and destination sizes match. This saves time: if the sizes differ, the file obviously needs to be copied again, so there is no need to compute a checksum to determine that. Checksums are only needed to catch size-matching files whose contents nevertheless differ.
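If you want to spot-check an individual file yourself, one option (a sketch, not from the original answer; the object below is just one of the files from the output above) is to compare the CRC32C that gsutil computes for the local copy against the checksum stored for the cloud object:
# CRC32C of the local copy (fast only if the compiled crcmod is installed)
gsutil hash -c /csrpc1/NEXRAD/level2/2017/201702/20170201/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201000000_20170201005959.tar
# CRC32C recorded for the cloud object (look for "Hash (crc32c)" in the listing)
gsutil ls -L gs://gcp-public-data-nexrad-l2/2017/02/01/KHGX/NWS_NEXRAD_NXL2DPBL_KHGX_20170201000000_20170201005959.tar
If the two values match, that file made it down intact.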

Related

Deleted a Mongodb data folder by accident on ext4, how to best recover data?

What's the best/fastest/safest way to recover deleted files from ext4?
Specs:
The disk is a 1TB SSHD (hybrid HDD + SSD), and the partition is encrypted with LUKS encryption (version 1).
MongoDB is using WiredTiger as its storage engine.
Also if I manage a partial recovery of files, could I do a partial recovery of mongo's collections?
Step 1: File recovery
Fast recovery of files using extundelete:
sudo umount /path/to/disk &&
sudo extundelete /path/to/disk --restore-directory /path/to/dir -o /restored/path/
/path/to/disk represents the disk path, e.g. /dev/sdd or /dev/mapper/label
/path/to/dir represents the path that you want recovered, relative to the disk's mount point, e.g. if /dev/sdd were mounted at /mnt/label/, the full path would be /mnt/label/path/to/dir and the relative path is /path/to/dir
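As a concrete illustration (hypothetical device and paths, not from the original post): if the deleted MongoDB data lived under var/lib/mongodb on a filesystem at /dev/sdb1, the invocation might look like
sudo umount /dev/sdb1
sudo extundelete /dev/sdb1 --restore-directory var/lib/mongodb -o /tmp/recovered/
# for a LUKS volume, point extundelete at the decrypted /dev/mapper device instead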
pros of recovery with extundelete:
it's lightweight
it can work even if the disk is mounted or encrypted
pretty fast: it tells you within seconds whether recovery is possible, and it writes the recovered files at over 100 MB/s
cons for data recovery in general
no guarantee of success
won't work if new data has been written over the deleted sectors, so unmount the disk as soon as possible and make an image of the broken disk before attempting any recovery (a sketch of imaging the disk follows below)
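A minimal sketch of that imaging step (hypothetical device and destination; the destination must be on a different disk with enough free space):
sudo dd if=/dev/sdb of=/mnt/external/disk-image.img bs=4M conv=noerror,sync status=progress
# ddrescue is a good alternative if the drive has read errors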
Step 2: repair MongoDB if data is missing
Back up before this step; mongod --repair can delete good data.
Untested, but to my understanding mongod --repair should help repair the database if it is incomplete; otherwise you can continue recovery for WiredTiger with:
Recovering a WiredTiger collection from a corrupt mongodb installation
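A minimal sketch of that repair invocation (assuming the recovered files were put back under /data/db; take a copy of that directory first, as noted above):
mongod --dbpath /data/db --repair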

Issue with crc32c verification using gsutil

crc32c signature computed for local file (Rgw3kA==) doesn't match cloud-supplied digest (5A+KjA==). Local file (/home/blah/pgdata.tar) will be deleted.
I did a bit of diagnosing and noticed that the cloud-supplied digest was always "5A+KjA==", but the failure usually happened at a different point in the file with a different local crc32c. This is using either:
gsutil -m rsync gs://bucket/ /
or
gsutil -m cp gs://bucket/pgdata.tar /
I seem to get this error almost every time when transferring a large 415GB tar database file. It always exits with an error at a different part, and it doesn't resume. Are there any workarounds for this? If it were legitimate file corruption, I would expect it to fail at the same point in the file.
The file seems fine, as I loaded it onto various instances and into PostgreSQL about a week ago.
I'm not sure of the version of gsutil, but it is the one natively installed on the GCE Ubuntu 14.04 image, and I followed the GCE-provided instructions for crcmod installation on Debian/Ubuntu.
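One thing worth checking (a suggestion, not part of the original post): whether gsutil is actually using the compiled C extension of crcmod, since the pure-Python fallback is very slow for files of this size. gsutil reports this itself:
gsutil version -l
# look for "compiled crcmod: True" in the output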

mongorestore takes a lot of time, how about I just copy-paste the '/data/db' directory?

In my case, I want to back up and restore all the databases. This might sound stupid, but:
Instead of doing
# backup
mongodump # takes time
# restore
mongorestore # takes a lot of time
Why can't I just
# backup
tar -cvzf /backup/mongo.tar.gz /data/db
# restore
tar -xzf /backup/mongo.tar.gz -C /data/db
Would this not work?
In principle, yes, that's possible, but there are several caveats. The strategies with their respective down- and upsides are discussed in detail in the backup documentation. Essentially, replica sets and sharding make the process more complex.
You'll have to shut down or lock the server so the files aren't being written to while you're copying them. Since copying still takes time, it makes sense to do that only on a secondary; otherwise your system will effectively be down.
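A minimal sketch of the lock-and-copy variant on a secondary (hypothetical paths; db.fsyncLock() flushes pending writes and blocks new ones until db.fsyncUnlock() is called):
mongo --eval "db.fsyncLock()"
tar -czvf /backup/mongo.tar.gz /data/db
mongo --eval "db.fsyncUnlock()"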
Consider using file system / LVM snapshots (also discussed in the documentation); they are generally faster because the file system does copy-on-write only when necessary afterwards, so taking the actual snapshot takes only milliseconds. However, make sure you understand how that works on whatever LVM, file system, or virtualization platform you're using; the performance characteristics can be peculiar, especially when keeping multiple snapshots.
Remember that any backup taken while the system is running is inconsistent - the only way to get a 'clean' backup is to gracefully shut down the application (so it finishes all pending writes but doesn't accept any further requests), then backup the database.

What are important mongo data files for backup

If I want to back up the database by copying raw files, what files do I need to copy? Only db-name.ns, db-name.0, db-name.1..., or the whole folder (local.ns..., journal)? I'm running a replica set. I understand the procedure for locking a hidden secondary node and then copying the files to a new location. But I'm wondering whether I need to copy the whole folder or just some files.
Thx
Simple answer: All of them. As obvious as it might sound. And here is why:
If you don't copy the namespaces file, your database will most likely not work.
If you don't copy all the data files, some of your data will be missing and your indices will point to invalid locations. The database in question might work (minus the data stored in the missing data file), but I would not bet on that – and since the data was important enough to create a backup in the first place, you don't want this to happen, do you?
Config, admin and local databases are vitally necessary for their respective features – and since you used the feature, you probably want to use it after a restore, too.
How do I backup all files?
The best solution, aside from MMS backup, that I have found so far is to create LVM snapshots of the filesystem the MongoDB data resides on. In order for this to work, the journal needs to be included. Usually, you don't need a dedicated backup node for this approach. It is a bit complicated to set up, though.
Preparing LVM backups
Let's assume you have your data in the default data directory /data/db and you have not changed any paths. Then you would mount a logical volume to /data/db and use this to hold the data. Assuming that you don't have anything like this, here is a step by step guide:
Create a logical volume big enough to hold your data. I will call that one /dev/VolGroup/LogVol1 from now on. Make sure that you only use about 80% of the available disk space in the volume group for creating the logical volume.
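For example, a sketch with hypothetical names and sizes (a volume group called VolGroup on a 1TB disk, leaving roughly 20% unallocated for later snapshots):
lvcreate -L 800G -n LogVol1 VolGroup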
Create a filesystem on the logical volume. I prefer XFS, so we create an xfs filesystem on /dev/VolGroup/LogVol1:
mkfs.xfs /dev/VolGroup/LogVol1
Mount the newly created filesystem on /mnt
mount /dev/VolGroup/LogVol1 /mnt
Shut down mongod:
killall mongod
(Note that the upstart scripts sometimes have problems shutting down mongod, and this command gracefully stops mongod anyway).
Copy the data files from /data/db to /mnt by issuing
cp -a /data/db/* /mnt
Adjust your /etc/fstab so that the logical volume gets mounted on reboot:
# The noatime parameter increases io speed of mongod significantly
/dev/VolGroup/LogVol1 /data/db xfs defaults,noatime 0 1
Unmount the logical volume from its current mount point and remount it at the correct one:
cd && umount /mnt/ && mount /data/db
Restart mongod
Creating a backup
Creating a backup now becomes as easy as
Create a snapshot:
lvcreate -l100%FREE -s -n mongo_backup /dev/VolGroup/LogVol1
Mount the snapshot:
mount /dev/VolGroup/mongo_backup /mnt
Copy it somewhere. The reason we need to do this is that the snapshot can only be kept as long as the changes to the data files do not exceed the space in the volume group that you did not allocate during preparation. For example, if you have a 100GB disk and you allocated 80GB for /dev/VolGroup/LogVol1, the snapshot size would be 20GB. As long as the changes to the filesystem since you took the snapshot are less than 20GB, everything runs fine. After that, the filesystem will refuse to take any changes. So you aren't in a hurry, but you should definitely move the data to an offsite location, an FTP server, or whatever you deem appropriate. Note that compressing the data files can take quite long and you might run out of "change space" before finishing. Personally, I like to have a slower HDD as a temporary place to store the backup, doing all further operations on that HDD. So my copy command looks like
cp -a /mnt/* /home/mongobackup/backups
when the HDD is mounted on /home/mongobackup.
Destroy the snapshot:
umount /mnt && lvremove /dev/VolGroup/mongo_backup
The space allocated for the snapshot is released and the restriction on the amount of changes to the filesystem is removed.
The whole db data folder, plus wherever you keep your logs and journaling.
The best solution to back up data on MongoDB would be to use MongoDB Monitoring Service (MMS). All other solutions, including copying files manually, mongodump, and mongoexport, are way behind MMS.

PostgreSQL backup with smallest output files

We have a PostgreSQL database that is over 732 GB when backed up as a file system backup. When we do a pg_dump we can get it down to 585 GB. If I combine pg_dump with the PITR method, will this give me the best backup with the smallest backup data file size? My plan was to run pg_start_backup, then pg_dump, then pg_stop_backup. I know the documentation says to run a file system backup, but I want a smaller backup data set. I would then copy off the WAL files and back them up at night.
To truly get the smallest file, you'll have to try compressing your pg_dump -Fc dump file with one of many compression tools and settings. Using gzip or xz with maximum possible compression would be a start. This will of course require an excellent CPU and lots of CPU time.
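As a rough sketch of that kind of experiment (hypothetical database name; the custom format is already compressed, so the second variant dumps plain SQL and pipes it through xz instead):
# custom-format dump at the highest built-in compression level
pg_dump -Fc -Z 9 mydb > mydb.dump
# plain-format dump piped through xz with maximum compression
pg_dump mydb | xz -9 > mydb.sql.xz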