Google Cloud Storage - rsync or cp for initial upload?

We're using GCS for our archive backup and I was curious what people thought was better for the initial upload - rsync or cp?
I've gotten hung up twice (once on a non-Unicode character and again on what seemed like a long path) and would like to be able to pick up where I left off.
Any advice would be appreciated!
(and if this is a bad question, can someone tell me exactly why it's bad or how to fix it? It seems I suck at asking questions here!)

rsync is better suited for doing archives/backups, for the reason you hinted at - if you started uploading data and then encountered a problem partway through, restarting a cp would cause you to re-upload files that were already successfully uploaded, while rsync would only upload files that weren't uploaded yet (or that changed since the last upload). Moreover, if some of the source files were deleted since you last started uploading, rsync (run with the -d option in gsutil's case) will remove them from the destination bucket, making the destination content match the source content.
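A minimal sketch of what that initial sync might look like with gsutil rsync (the local path and bucket name here are placeholders):
gsutil -m rsync -r -d /archive gs://my-archive-bucket/archive
-m parallelizes the transfer, -r recurses into subdirectories, and -d deletes destination objects that no longer exist locally - leave -d off if you only want to add or update files and never delete anything in the bucket.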

Related

SFTP file uploading and downloading at same time

A cronjob runs every 3 hours to download a file using SFTP. The scheduled program is written in Perl and the module used is Net::SFTP::Foreign.
Can Net::SFTP::Foreign download files that are only partially uploaded via SFTP?
If so, do we need to check the SFTP file's modification date to confirm that the copy has completed?
Suppose someone is uploading a new file via SFTP and the upload/copy is still in progress. If a download is attempted at the same time, do I need to code for the possibility of fetching only part of a file?
It's not a question of which SFTP client you use; that's irrelevant. It's how the SFTP server handles the situation.
Some SFTP servers may lock the file being uploaded, preventing you from accessing it while it is still being uploaded. But most SFTP servers, particularly the common OpenSSH SFTP server, won't lock the file.
There's no generic solution to this problem. Checking for timestamp or size changes may work for you, but it's hardly reliable.
There are some common workarounds to the problem:
Have the uploader upload a "done" file once the upload finishes. Make your program wait for the "done" file to appear.
You can have a dedicated "upload" folder and have the uploader (atomically) move the uploaded file to a "done" folder once the upload finishes. Make your program look at the "done" folder only.
Have a file naming convention for files being uploaded (e.g. a ".filepart" suffix) and have the uploader (atomically) rename the file to its final name once the upload finishes. Make your program ignore the ".filepart" files - see the sketch below.
See (my) article Locking files while uploading / Upload to temporary file name for an example of implementing this approach.
Also, some FTP servers have this functionality built in - for example ProFTPD with its HiddenStores directive.
A gross hack is to periodically check the file attributes (size and time) and consider the upload finished if the attributes have not changed for some time interval.
You can also make use of the fact that some file formats have a clear end-of-file marker (like XML or ZIP), so you can tell when you have downloaded an incomplete file.
For details, see my answer to SFTP file lock mechanism.
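A minimal sketch of the ".filepart" convention from the uploader's side, using OpenSSH's sftp in batch mode (the host, user, and file names are placeholders):
sftp -b - user@sftp-host <<'EOF'
put data.tgz /incoming/data.tgz.filepart
rename /incoming/data.tgz.filepart /incoming/data.tgz
EOF
The downloader then simply skips anything ending in ".filepart"; by the time the final name exists, the content is complete (the rename is atomic on typical servers and filesystems).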
The easiest way to do that, when the upload process is also under your control, is to upload files using temporary names (for instance, foo-20170809.tgz.temp) and, once the upload finishes, rename them (the Net::SFTP::Foreign put method supports the atomic option, which does just that). Then, on the download side, filter out the files whose names correspond to temporary files.
Anyway, the Net::SFTP::Foreign get and rget methods can be instructed to resume a transfer by passing the option resume => 1.
Also, if you have full SSH access to the SFTP server, you could check whether some other process is still writing to the file to be downloaded using fuser or some similar tool (though note that even then, the file may be incomplete if, for instance, there is a network issue and the uploader needs to reconnect before resuming the transfer).
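If you do have that level of access, the check could be as simple as this (host and path are placeholders):
ssh user@sftp-host fuser -v /incoming/data.tgz
fuser exits non-zero and prints nothing when no process has the file open, which suggests - but does not guarantee - that the upload has finished.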
You can check the size of the file:
1. Connect to SFTP.
2. Check the file size.
3. Sleep for 5-10 seconds.
4. Check the file size again.
5. If the size did not change, download the file; if the size changed, go back to step 3.
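A minimal sketch of that polling loop in shell, assuming you also have plain SSH access to the server and a GNU userland on it (host, user, path, and the 10-second interval are all placeholders):
REMOTE="user@sftp-host"
FILE="/incoming/data.tgz"
size1=$(ssh "$REMOTE" stat -c %s "$FILE")
while true; do
  sleep 10
  size2=$(ssh "$REMOTE" stat -c %s "$FILE")
  [ "$size1" = "$size2" ] && break   # size stable: assume the upload has finished
  size1=$size2
done
sftp "$REMOTE:$FILE"                 # fetch the (hopefully complete) file
As noted above, a stable size is only a heuristic; a stalled uploader looks exactly the same as a finished one.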

Why is gsutil rsync re-downloading all our files?

We've been using gsutil -m rsync -r to keep dev and deploy boxes in sync with a GCS bucket for nearly 2 years without any problem. There are about 85k objects in the bucket.
Until recently, this worked perfectly: we'd run a deploy-box -> GCS rsync every 15 mins or so to keep all newly uploaded resources backed up, and then a GCS -> dev box rsync whenever we wanted to refresh the local dev data (running on OS X El Capitan).
Within the last couple of months, though, the GCS->dev rsync has started to bloat, downloading more and more images.
Initially I just thought "great, we're getting more resources uploaded", but it's been growing way faster than the data, until today when it seems to be downloading the whole 85k images.
I've double-checked I'm in the right place, the command is correct, the paths are correct, etc. For all that the gsutil output is scrolling by with reams and reams of "Copying..." and "Downloading..." messages, making good parallel use of our 100 Mbps connection, when I go to another terminal and run find . -type f | wc -l on the destination directory every 10 seconds, it shows that barely 2 or 3 new files are being added per minute. I look at the modification times on files that gsutil says it's downloading right now and the large majority are old; plenty haven't changed in a year or more. Meaning: it's downloading all the data, using tons of time and bandwidth, all for the sake of a few hundred files.
Has something changed in recent OS X gsutil versions? Is there possibly a bug? How would I even start to go about tracking this down? Or reporting it? The newsgroups gsutil-discuss and gs-discussion have been archived, and the talk in gce-discussion is all about using gsutil from GCE instances.
Thanks!
I had a similar issue where the same files were synced over and over. I don't have that many files, so you might need to check the performance impact, but I decided to use the -c option to force checksum comparison instead of mtime, which was being modified locally by my build process.
I think (and hope) the documentation is slightly wrong in stating that it will "compare checksums for files if the size of source and destination as well as mtime match", as it seems to use the checksum even if mtime does not match.
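For reference, the checksum-forcing run looks like this (the bucket and local paths are placeholders):
gsutil -m rsync -r -c gs://my-bucket/resources ./local-resources
Computing checksums for every local file adds CPU time, so it is worth measuring on a tree as large as 85k objects.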
gsutil 4.20 (released 2016-07-20) modified the change detection algorithm for rsync. Instead of comparing only the size of the local file with its cloud counterpart, it now compares both the size and file modification time of local files. The file modification time is stored in the custom user metadata for the file when it is uploaded with rsync. If that doesn't exist the object creation time is used.
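If you want to check what mtime information rsync has stored for a given object, something like the following should show it (the object path is a placeholder, and the exact metadata key name may vary by gsutil version):
gsutil ls -L gs://my-bucket/path/to/file.png | grep -i mtime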

Robocopy not to copy files, which are deleted in the destination

We use Robocopy as part of our backup concept.
Now, when the destination computer crashes, we restore files from our backup/source computer using the backup taken one hour ago (one hour being the backup interval). However, this way the destination computer might receive files that users had deliberately deleted on the destination server (which is very dynamic) in the meantime.
This is something we would like to avoid. Is this possible with Robocopy? My understanding is that it is not...
If I understand the situation correctly it is something like this:
- One hour ago: your backup is created
- 30 mins ago: User deletes a file
- 10 mins ago: server disk crashes
- Now you need to restore from backup
The server hard disk is now gone. The only information you now have is on the backup from one hour ago. The backup was done before the user deleted the file. So you have no way of knowing that the user deleted the file.
It seems to me that robocopy will not help in this situation, as there is no way to know what files were deleted after the backup was taken.
Does this make sense?
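For what it's worth, Robocopy does have an /XL switch ("eXclude Lonely" files, i.e. files that exist only in the source), which skips copying anything the destination no longer has. A sketch with placeholder paths:
robocopy D:\Backup E:\Data /E /XL
But that only helps when the destination's directory tree still exists for comparison; after a full disk crash there is nothing left to compare against, which is exactly the problem described above.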

On resume gsutil seems to re-upload files

I'm trying to upload data to Google Cloud Storage from a disk with ~3000 files totalling 1TB. I'm using gsutil cp -R <disk-top-directory> <bucket>. My understanding is that, if gsutil is resumed/restarted, it uses checksums to determine when a file has already been uploaded and skips over it.
It doesn't appear to be doing this: it appears to be resuming the upload from the top and replacing the files all over again. When I run successive gsutil ls -Rl <bucket/disk-top-directory> ten minutes apart and compare them with diff, I see what appears to be the same files with the same sizes but a changed (newer) date. (i.e. consistent with the same file being re-uploaded.)
For example:
< 404104811 2014-04-08T14:13:44Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
---
> 404104811 2014-04-08T14:43:48Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
The machine I'm using to read the disk and transfer files is running Ubuntu 13.10. I installed gsutil using the pip instructions for Debian and Ubuntu.
Am I misunderstanding how gsutil's resumable transfers are supposed to work? If not, is there any diagnosis and fix to get the correct resume behavior? Thanks in advance!
You need to use the -n (No-clobber) switch to prevent the re-uploading of objects that already exist at the destination.
gsutil cp -Rn <disk-top-directory> <bucket>
From the help (gsutil help cp)
-n No-clobber. When specified, existing files or objects at the
destination will not be overwritten. Any items that are skipped
by this option will be reported as being skipped. This option
will perform an additional HEAD request to check if an item
exists before attempting to upload the data. This will save
retransmitting data, but the additional HTTP requests may make
small object transfers slower and more expensive.
Also, according to this, when transferring files over 2 MB, gsutil automatically uses a resumable transfer mode.
If you're open to working with the (still beta) gsutil v4, that version of gsutil has an rsync command. You can get this by running:
gsutil update gs://prerelease/gsutil_4.0beta2pre_minus_m_sugg.tar.gz
Please be sure to read the release notes before switching to this major new release, especially if you're using gsutil v3 in scripts.
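Once on v4, the rsync equivalent of the cp command above would look something like this (using the same placeholders as in the question):
gsutil -m rsync -r <disk-top-directory> <bucket>
Unlike cp -n, rsync compares against what is already in the bucket and only transfers files that are missing or have changed, so a restart picks up roughly where it left off.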

What would happen if I deleted all the files associated with vBulletin?

I would like to completely take down the vBulletin forum running out of a subfolder of a site. I have already removed access to the bulletin via .htaccess, but now I would like to get rid of the whole shebang.
Can I just go in via ftp and remove all of the vBulletin files or will that cause problems?
The reason I want to get rid of the bulletin now, other than for security and resource conservation, is because now, after a move to a new server, I am receiving emails of database errors (I am assuming this is because the bulletin didn't get hooked up to the database at the new server).
If it makes any difference, this is the error:
mysql_connect() [function.mysql-connect]: Unknown MySQL server host 'blah.blah.blah.some.url.associated.with.my.old.hosts.nameserver.com' (1)
/path/to/my/forum/includes/class_core.php on line 317
Thanks in advance for any advice/info you have.
To completely remove vBulletin you want to remove all the files via FTP and delete the database as well. The database is usually a lot larger than the forum files. But to just stop the errors you're getting, removing the files via FTP will work.
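If you do want to drop the database too, a minimal sketch would be something like the following, assuming the database is called vbulletin and you have MySQL admin credentials (both are assumptions - check the actual name in the forum's includes/config.php before dropping anything):
mysql -u admin -p -e "DROP DATABASE vbulletin;"
On shared hosting the same thing is usually done through the control panel (e.g. phpMyAdmin) rather than the command line.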