We've been using gsutil -m rsync -r to keep dev and deploy boxes in sync with a GCS bucket for nearly 2 years without any problem. There are about 85k objects in the bucket.
Until recently, this worked perfectly: we'd run a deploy-box -> GCS rsync every 15 minutes or so, to keep all newly uploaded resources backed up, and then a GCS -> dev-box rsync whenever we wanted to refresh the local dev data (running on OSX El Capitan).
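The two commands are essentially the following (the bucket name and paths here are placeholders rather than our real ones):
gsutil -m rsync -r /var/www/uploads gs://our-bucket/uploads      # deploy box -> GCS, every ~15 min
gsutil -m rsync -r gs://our-bucket/uploads ~/dev-data/uploads    # GCS -> dev box, on demand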
Within the last couple of months, though, the GCS->dev rsync has started to bloat, downloading more and more images.
Initially I just thought "great, we're getting more resources uploaded", but it's been growing way faster than the data, until today, when it seems to be downloading all 85k images.
I've double-checked that I'm in the right place, the command is correct, the paths are correct, and so on. Even though the gsutil output is scrolling by with reams of "Copying..." and "Downloading..." messages and making good parallel use of our 100 Mbps connection, when I go to another terminal and run find . -type f | wc -l on the destination directory every 10 seconds, it shows that barely 2 or 3 new files are being added per minute. When I look at the modification times of files gsutil says it is downloading right now, the large majority are old; plenty haven't changed in a year or more. In other words: it's downloading all the data, using tons of time and bandwidth, all for the sake of a few hundred files.
Has something changed in recent OSX gsutil versions? Is there possibly a bug? How would I even start to go about tracking this down? Or reporting it? The newsgroups gsutil-discuss and gs-discussion have been archived, and the talk in gce-discussion is all about using gsutil from GCE instances.
Thanks!
I had a similar issue where the same files were synced over and over. I don't have that many files, so you may want to check the performance impact, but I decided to use the -c option to force checksum comparison instead of mtime, since my build process modifies the local mtimes.
I think (and hope) the documentation is slightly wrong when it states that it will
compare checksums for files if the size of source and destination as
well as mtime match
because it seems to use the checksum even if mtime does not match.
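For what it's worth, a minimal sketch of what I run now (the bucket name and local path are placeholders for my own):
gsutil -m rsync -c -r ./build-output gs://my-bucket/build-output
Since -c computes and compares checksums for every file whose size matches, it can be noticeably slower on large trees, which is why I mentioned checking performance first.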
gsutil 4.20 (released 2016-07-20) modified the change detection algorithm for rsync. Instead of comparing only the size of the local file with its cloud counterpart, it now compares both the size and the modification time of local files. The modification time is stored in the object's custom user metadata when the file is uploaded with rsync; if that doesn't exist, the object creation time is used.
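If you want to check whether that metadata exists on your objects, something like this should show it (the object path is a placeholder; as far as I can tell the entry is stored under a key named goog-reserved-file-mtime on recent gsutil versions):
gsutil ls -L gs://my-bucket/path/to/object | grep -i mtime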
Related
We're using GCS for our archive backup and I was curious what people thought was better for the initial update - rsync or cp?
I've gotten hung up twice (once on a non-unicode character and again on what seemed like a long path) and would like to be able to pick up where I left off.
Any advice would be appreciated!
(And if this is a bad question, can someone tell me exactly why it's bad or how to fix it? It seems I suck at asking questions here!)
rsync is better suited for archives/backups, for the reason you hinted at: if you started uploading data and then hit a problem partway through, restarting a cp would re-upload files that were already successfully uploaded, while rsync would only upload files that weren't uploaded yet (or that changed since the last upload). Moreover, if some of the source files were deleted since you last started uploading, rsync (with the -d option) will remove them from the destination bucket, making the destination content match the source content.
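For an archive job that would look roughly like this (the local path and bucket are placeholders; note that -d deletes destination objects that no longer exist at the source, so use it with care):
gsutil -m rsync -r -d /local/archive gs://my-bucket/archive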
I have two drives, A and B. A Python script creates files on drive A, and a PowerShell script copies all the files from drive A to drive B at an interval of 1 second.
I am getting this error in PowerShell:
2015/03/10 23:55:35 ERROR 32 (0x00000020) Time-Stamping Destination File \\x.x.x.x\share1\source\Dummy_100.txt
The process cannot access the file because it is being used by another process.
Waiting 30 seconds...
How can I overcome this error?
This happens because the file is locked by a running process. To fix it, download Process Explorer, then use Find > Find Handle or DLL to find out which process has locked the file. Use 'taskkill' to kill that process from the command line. You will be fine.
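For example, once Process Explorer has told you the process ID (1234 below is just a placeholder PID):
taskkill /PID 1234 /F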
If you want to skip these files, you can use /r:n, where n is the number of retries.
For example, /r:5 /w:3 will retry 5 times, waiting 3 seconds between attempts.
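Applied to the setup in the question, the command could look roughly like this (the source and destination paths are placeholders):
robocopy A:\source B:\dest /E /R:5 /W:3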
How can I overcome this error?
If backup is what you have in mind and you encounter in-use files frequently, you should look into the Volume Shadow Copy Service (VSS), which allows files to be copied even while they are ‘in use’. It's not a product, but a Windows technology used by various backup tools.
Sadly, it's not built into robocopy, but it can be used in conjunction with it. See the links below (and the example sketch after them):
➝ https://superuser.com/a/602833/75914
and especially:
➝ https://github.com/candera/shadowspawn
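The idea is that shadowspawn mounts a shadow copy of the source as a temporary drive letter and runs robocopy against it, roughly like this (the paths and drive letter are placeholders; check the shadowspawn README for the exact syntax):
shadowspawn C:\data Q: robocopy Q:\ \\backupserver\share /E /R:5 /W:3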
There could be many reasons.
In my case, I was running a CMD script to copy a heap of SQL Server backups and transaction logs from one server to another. I had the same problem because the script appeared to be blocked writing to a log file that was supposedly open in another process. It was not.
I ran so many IP checks and process ID checkers that I ran out of ideas about what was hogging the log file. Event Viewer said nothing.
I found out it was not even the log file that was being locked: I was able to delete it by logging into the server as a normal user with no admin privileges!
It was the backup files themselves that were locked by the SQL Server Agent. As @Oseack said, another tool may have been needed while the backup files were still being used or locked by the SQL Server Agent.
The way I got around it was to force ROBOCOPY to wait.
/W:5
did it.
I'm trying to upload data to Google Cloud Storage from a disk with ~3000 files totalling 1TB. I'm using gsutil cp -R <disk-top-directory> <bucket>. My understanding is that, if gsutil is resumed/restarted, it uses checksums to determine when a file has already been uploaded and skips over it.
It doesn't appear to be doing this: it appears to be resuming the upload from the top and replacing the files all over again. When I run successive gsutil ls -Rl <bucket/disk-top-directory> ten minutes apart and compare them with diff, I see what appears to be the same files with the same sizes but a changed (newer) date. (i.e. consistent with the same file being re-uploaded.)
For example:
< 404104811 2014-04-08T14:13:44Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
---
> 404104811 2014-04-08T14:43:48Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
The machine I'm using to read the disk and transfer files is running Ubuntu 13.10. I installed gsutil using the pip instructions for Debian and Ubuntu.
Am I misunderstanding how gsutil's resumable transfers are supposed to work? If not, is there any diagnosis and fix to get the correct resume behavior? Thanks in advance!
You need to use the -n (No-clobber) switch to prevent the re-uploading of objects that already exist at the destination.
gsutil cp -Rn <disk-top-directory> <bucket>
From the help (gsutil help cp)
-n No-clobber. When specified, existing files or objects at the
destination will not be overwritten. Any items that are skipped
by this option will be reported as being skipped. This option
will perform an additional HEAD request to check if an item
exists before attempting to upload the data. This will save
retransmitting data, but the additional HTTP requests may make
small object transfers slower and more expensive.
Also, according to this, when transferring files over 2MB, gsutil automatically uses a resumable transfer mode.
If you're open to working with the (still beta) gsutil v4, that version of gsutil has an rsync command. You can get this by running:
gsutil update gs://prerelease/gsutil_4.0beta2pre_minus_m_sugg.tar.gz
Please be sure to read the release notes before switching to this major new release, especially if you're using gsutil v3 in scripts.
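With gsutil v4 the upload in the question could then be expressed as an rsync, which only transfers files that differ between source and destination, for example (reusing the question's placeholders):
gsutil -m rsync -r <disk-top-directory> gs://<bucket>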
I have a website to which I have FTP access only (otherwise I'd use rsync for this) and I'd like to keep a local copy of it. At the moment I run the following wget command every once in a while
wget -m --ftp-user=me --ftp-password=secret ftp://my.server.com
When there are many updates it does get tedious, with wget only having one connection at a time. I read about aria2 but couldn't find any hints as to whether it would be possible to use aria2 as a replacement for this purpose.
No - according to the aria2 docs, the option for downloading only newer files works with HTTP(S) only.
--conditional-get[=true|false]
Download file only when the local file is older than remote file. This function only works with HTTP(S) downloads only. It does not work if file size is specified in Metalink.
I have been using the Google Cloud Storage Manager link on the Google APIs console in order to upload my files.
This works great for most files: 1KB, 10KB, 1MB, 10MB, 100MB. However yesterday I could not upload a 3GB file. Any idea what is wrong?
What is the best way to upload large files to Google Cloud Storage?
The web UI only supports uploads smaller than 2^32 bytes (4 gigabytes). I believe this is a JavaScript limitation.
If you need to transfer many or large files consider using gsutil:
GSUtil uploads and downloads any size file.
GSUtil resumes uploads and resumes downloads that fail part way through.
GSUtil calculates the MD5 checksum to verify the contents of each file transferred correctly.
GSUtil can upload and download many files at the same time.
gsutil -m cp /path/to/*thousands-of-files* gs://my-bucket/
In my experience, the accepted answer is not correct - maybe it was once, but something has changed.
I just uploaded a file of size 2.2GB to GCS using the web interface on Chrome 42 on Windows 8.1.
I would also point out that the question is about files larger than 2GB, and the answer mentions 2GB but derives it from 2^32, which is 4GB, not 2GB. So maybe the limit really is 2^32 (4GB) - I haven't tried anything that big.
(It is still a good idea to use gsutil for large files.)