On resume gsutil seems to re-upload files

I'm trying to upload data to Google Cloud Storage from a disk with ~3000 files totalling 1TB. I'm using gsutil cp -R <disk-top-directory> <bucket>. My understanding is that, if gsutil is resumed/restarted, it uses checksums to determine when a file has already been uploaded and skips over it.
It doesn't appear to be doing this: it appears to be resuming the upload from the top and replacing the files all over again. When I run successive gsutil ls -Rl <bucket/disk-top-directory> ten minutes apart and compare them with diff, I see what appears to be the same files with the same sizes but a changed (newer) date. (i.e. consistent with the same file being re-uploaded.)
For example:
< 404104811 2014-04-08T14:13:44Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
---
> 404104811 2014-04-08T14:43:48Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
The machine I'm using to read the disk and transfer files is running Ubuntu 13.10. I installed gsutil using the pip instructions for Debian and Ubuntu.
Am I misunderstanding how gsutil's resumable transfers are supposed to work? If not, can anyone suggest a diagnosis and fix to get the correct resume behavior? Thanks in advance!

You need to use the -n (No-clobber) switch to prevent the re-uploading of objects that already exist at the destination.
gsutil cp -Rn <disk-top-directory> <bucket>
From the help (gsutil help cp)
-n No-clobber. When specified, existing files or objects at the
destination will not be overwritten. Any items that are skipped
by this option will be reported as being skipped. This option
will perform an additional HEAD request to check if an item
exists before attempting to upload the data. This will save
retransmitting data, but the additional HTTP requests may make
small object transfers slower and more expensive.
Also, according to this, when transferring files over 2MB, gsutil automatically uses a resumable transfer mode.
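If you ever need to check or change that threshold, it is controlled by the resumable_threshold setting in the [GSUtil] section of your .boto config file; a minimal sketch, assuming the 2MB figure above (2097152 bytes):
[GSUtil]
# Uploads larger than this many bytes use the resumable transfer mechanism.
resumable_threshold = 2097152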

If you're open to working with the (still beta) gsutil v4, that version of gsutil has an rsync command. You can get this by running:
gsutil update gs://prerelease/gsutil_4.0beta2pre_minus_m_sugg.tar.gz
Please be sure to read the release notes before switching to this major new release, especially if you're using gsutil v3 in scripts.
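As a rough sketch of what gsutil rsync would look like for the upload in the question (the paths are the question's placeholders, not verified output):
gsutil -m rsync -r <disk-top-directory> gs://<bucket>/<disk-top-directory>
rsync only copies files that are missing or changed at the destination, so re-running it after an interruption skips everything that has already been transferred.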

Related

Why is gsutil rsync re-downloading all our files?

We've been using gsutil -m rsync -r to keep dev and deploy boxes in sync with a GCS bucket for nearly 2 years without any problem. There are about 85k objects in the bucket.
Until recently, this worked perfectly: we'd run a deploy-box -> GCS rsync every 15 mins or so to keep all newly uploaded resources backed up, and then a GCS -> dev box rsync whenever we wanted to refresh the local dev data (running on OSX El Capitan).
Within the last couple of months, though, the GCS->dev rsync has started to bloat, downloading more and more images.
Initially I just thought "great, we're getting more resources uploaded", but it's been growing way faster than the data, until today when it seems to be downloading the whole 85k images.
I've double-checked that I'm in the right place, the command is correct, the paths are correct, and so on. Even though the gsutil output is scrolling by with reams and reams of "Copying..." and "Downloading..." messages, making good parallel use of our 100mbps connection, running find . -type f | wc -l on the destination directory in another terminal every 10 seconds shows that barely 2 or 3 new files are added per minute. Looking at the modification times of files gsutil says it's downloading right now, the large majority are old; plenty haven't changed in a year or more. Meaning: it's downloading all the data, using tons of time and bandwidth, all for the sake of a few hundred files.
Has something changed in recent OSX gsutil versions? Is there possibly a bug? How would I even start to go about tracking this down? Or reporting it? The newsgroups gsutil-discuss and gs-discussion have been archived, and the talk in gce-discussion is all about using gsutil from GCE instances.
Thanks!
I had a similar issue where the same files were synced over and over. I don't have that many files, so you might need to check the performance impact, but I decided to use the -c option to force comparison by checksum instead of mtime, since the mtime was being modified locally by my build process.
I think (and hope) the documentation is slightly wrong in stating that it will
compare checksums for files if the size of source and destination as well as mtime match
as it seems to use the checksum even if the mtime does not match.
gsutil 4.20 (released 2016-07-20) modified the change detection algorithm for rsync. Instead of comparing only the size of the local file with its cloud counterpart, it now compares both the size and the file modification time of local files. The file modification time is stored in the custom user metadata for the file when it is uploaded with rsync; if that doesn't exist, the object creation time is used.
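For reference, a minimal sketch of the checksum-based invocation described above (bucket and local paths are placeholders):
gsutil -m rsync -r -c gs://my-bucket/resources ./local-resources
With -c, rsync compares checksums instead of mtime for files whose sizes match, at the cost of computing a checksum for every local file, which can be slow on large trees.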

Data written with gsutil is not visible with gcsfuse

I have installed gcsfuse to support an app requiring a posix-like mount point.
Existing data written with gsutil is not visible, but data written via the browser (Cloud Storage > Storage Browser) is.
According to https://cloud.google.com/storage/docs/gcsfuse -
You can simultaneously read and write to Google Cloud Storage using the Fuse Adapter and tools like gsutil. For example, if you write an object using the Fuse Adapter, it will immediately be available to read with gsutil, or vice versa, without the need to re-mount the bucket or reboot the Compute Engine instance.
Has anyone been successful collaborating with gcsfuse and gsutil?
I feel like I'm missing something.
Thanks!
This is likely because gsutil doesn't create directory placeholder objects, and gcsfuse by default requires them in order for a directory to be visible. To confirm: when you write an object with gsutil in a directory that you can already see (e.g. the root), does it show up?
You can work around this in one of two ways:
Create the directory placeholders for the directories you're missing. The easiest way to do this for a missing object foo/bar/baz is using a gcsfuse mount:
mkdir -p foo/bar
Run gcsfuse with the --implicit-dirs flag. Make sure to read the documentation linked above for caveats, though.
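As a rough example of the second option (bucket name and mount point are placeholders, not taken from the question):
gcsfuse --implicit-dirs my-bucket /mnt/my-bucket
With --implicit-dirs, gcsfuse infers directories from object name prefixes instead of requiring placeholder objects, which is why objects uploaded with gsutil become visible; the linked documentation describes the extra requests and edge cases this implies.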

Google Cloud Storage - GSUtil Update fails because file used by another process

We use an ETL process to pull data from Google Cloud Storage, but annoyingly it hangs every time Google releases updates to GSUtil, because it sits at a prompt asking if you want to update the library. That's fine if you are doing this manually, but not cool when it's being run in an automated SSIS package, as jobs don't finish for days and you keep wasting time on the same stupid cause.
I thought I was going to be clever and add "python gsutil update -n" to the top of the bash script whose building/execution I'm automating in my SSIS package, in the hope of curbing this problem, but when I run this command from the prompt in either Windows Server 2008r2 or Windows 7 I get the following:
C:\gsutil>python gsutil update -f -n
Copying gs://pub/gsutil.tar.gz...
OSError: The process cannot access the file because it is being used by another process.
Any help?
P.S. - Also, Google engineers... can you PLEASE remove these prompts for all of us using these tools in automated processes? I have other things to work on, instead of constantly going back to things like this every few days/weeks.
What version of gsutil are you running?
Also, to be clear: Are you talking about the fact that gsutil checks for available software updates periodically, and if it finds them it then prompts you whether you want to update? Or are you talking about the fact that the gsutil update command asks if you want to perform the update?
If the former, gsutil shouldn't be performing this check/prompting if you are running gsutil from a script not connected to a TTY. If that's not working correctly we'd like to know.
And also, if that's the problem you're having, you can completely disable automated software update checks by setting software_update_check_period=0 in the [GSUtil] section of your .boto config file.
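For example, the relevant part of the .boto file would look roughly like this:
[GSUtil]
# Never check for gsutil software updates (0 disables the periodic check).
software_update_check_period = 0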

Compute Engine: using gsutil to download a tgz file causes a crcmod error

I find that if you create a Compute Engine (CentOS or Debian) machine and use gsutil to download (cp) a tgz file, it causes a crcmod error...
$ gsutil cp gs://mybucket/data.tgz .
Copying gs://mybucket/data.tgz...
CommandException:
Downloading this composite object requires integrity checking with CRC32c, but
your crcmod installation isn't using the module's C extension, so the hash
computation will likely throttle download performance. For help installing the
extension, please see:
$ gsutil help crcmod
To download regardless of crcmod performance or to skip slow integrity checks,
see the "check_hashes" option in your boto config file.
Currently I use "check_hashes = never" to bypass the check...
$ vi /etc/boto.cfg
[GSUtil]
default_project_id = 429100748693
default_api_version = 2
check_hashes = never
...
But what is the root cause, and is there a good solution to the problem?
The object you're trying to download is a composite object, which basically means it was uploaded in parallel chunks. gsutil automatically does this when uploading objects larger than 150M (a configurable threshold), to provide better performance.
Composite objects only have a crc32c checksum (no MD5), so in order to validate data integrity when downloading composite objects, gsutil needs to perform a crc32c checksum. Unfortunately, the libraries distributed with Python don't include a compiled crc32c implementation, so unless you install one, gsutil will use a non-compiled Python implementation of crc32c that's quite slow. That warning is printed to let you know there's a way to fix that performance problem. Please run:
gsutil help crcmod
and follow the instructions there for installing a compiled crc32c. It's pretty easy to do it, and worth the effort.
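For reference, on a Debian/Ubuntu machine with a pip-installed gsutil the steps come down to roughly the following (a sketch of what gsutil help crcmod walks through; on CentOS you would use yum with gcc and python-devel instead):
sudo apt-get install gcc python-dev python-setuptools
sudo pip uninstall crcmod
sudo pip install -U crcmod
Afterwards, gsutil version -l should report that a compiled crcmod is in use.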
One other note: I strongly recommend against setting check_hashes = never in your boto config file. That will disable integrity checking, which means it's possible your download could get corrupted and you wouldn't know it. You want data integrity checking enabled to ensure you're working with correct data.

Is it possible to keep a local mirror of an FTP site with aria2?

I have a website to which I have FTP access only (otherwise I'd use rsync for this) and I'd like to keep a local copy of it. At the moment I run the following wget command every once in a while
wget -m --ftp-user=me --ftp-password=secret ftp://my.server.com
When there are many updates it does get tedious, with wget only having one connection at a time. I read about aria2 but couldn't find anything that answers the question of whether it would be possible to use aria2 as a replacement for this purpose.
No. According to the aria2 docs, the option for downloading only newer files works only with HTTP(S).
--conditional-get[=true|false]
Download file only when the local file is older than remote file. This function only works with HTTP(S) downloads only. It does not work if file size is specified in Metalink.