gsutil streaming doesn't seem to actually stream

gsutil streaming doesn't seem to actually stream - gcloud

I'm following the instructions here:
https://cloud.google.com/storage/docs/streaming
As I start a process to upload to Google Cloud Storage via streaming, the process has to complete before anything shows up in google cloud. If the process gets interrupted in any way nothing ever shows up.
Is there a way to actually stream to gcloud storage (and perhaps even be able to start downloading before the upload is complete?

It actually does do streaming via resumable uploads, but won't send a chunk until you've hit the resumable chunk size:
https://github.com/GoogleCloudPlatform/gsutil/blob/88ab2023f0dcf3d7a5444832eb547b2cbc68d7bd/gslib/util.py#L833
You can write a fairly simple python script to generate more than that at a time, then pause, allowing you to see the uploads occurring:
import sys
import time
for i in range(3):
sys.stdout.write('a' * 1024 * 1024 * 101L)
time.sleep(10)
And pipe it to gsutil like so:
python my_data_generator.py | gsutil cp - gs://BUCKET/OBJECT
to watch the data transfer for a while, stop at the sleep, and continue.
To reduce this chunk size, you can tweak the GSUtil:json_resumable_chunk_size option in your .boto file.

Related

Upload data to Cloud Storage from external website

I need to upload data from a public source to one of my Cloud Storage buckets. Currently, I download the data to my machine and then upload it to GCS. Being huge data sources (60GB in all, this week), I began running into problems to do it.
Is there a way to do it coding straight into GCS, without needing all the local downloading process?
UPDATE: I have tried using curl http://originaladdress | gsutil cp - gs://bucket. The problem is it would take 21 hours to do the whole process with 100 MB chunks, which is longer than it takes for me to download and upload the file. Is that right? Did I miss some parameter?

Is there any advantage to using gsutil or the google cloud storage API in production transfers?

Which is better to use with production transfers, gsutil, or the google cloud storage API?

gsutil uses a Google Cloud Storage API to transfer data, specifically the JSON API (by default, you can change it). Its main advantage over using the API directly is that it has been tuned to transfer data quickly. For example, it can open up multiple simultaneous connections to GCS, each of which is uploading or downloading part of the file concurrently, which in many cases can provide a significant boost to total throughput.
There's no reason that programming against the API directly could not also provide the same or even better performance, but I would expect gsutil to be at least a little bit faster on average if you implement things in the simplest possible manner.

I'm not sure this is adding much over what Brandon has said. I'm very new to gcloud storage and Python, but I've quickly found that I prefer to use the gsutil command line over the python client library whereever possible. I create compute instances that copy a few GB of input data from cloud storage after they have booted. I found that its both neater and faster to do this using the gsutil command line where possible, so in my python code I use:
import subprocess
subprocess.call("gsutil -m cp gs://my-uberdata-archive/* /home/<username>/rawdata/", shell=True)
The main reasons being that I can do the command in a single line whereas it takes several lines using the client library, and as Brandon points out, gsutil supports multi-threading with the '-m' flag. I haven't found an equivalent way to do this with the Python Client library yet.

Why is gsutil rsync re-downloading all our files?

We've been using gsutil -m rsync -r to keep dev and deploy boxes in sync with a GCS bucket for nearly 2 years without any problem. There are about 85k objects in the bucket.
Until recently, this worked perfectly: we'd run a deploy-box -> GCS rsync every 15 mins or so, to keep all new uploaded resource backed up, and then a GCS -> dev box rsync whenever we wanted to refresh the local dev data (running on OSX El Capitan).
Within the last couple of months, though, the GCS->dev rsync has started to bloat, downloading more and more images.
Initially I just thought "great, we're getting more resources uploaded", but it's been growing way faster than the data, until today when it seems to be downloading the whole 85k images.
I've double-checked I'm in the right place, the command is correct, the paths are correct, etc. For all that the gsutil output is scrolling by with reams and reams of "Copying..." and "Downloading..." messages, making good parallel use of our 100mbps connection, when I go to another terminal and run find . -type f | wc -l on the destination directory every 10 seconds, it shows that barely 2 or 3 new files are being added a minute. I look at modification times on files that gsutil says it's downloading right now and in the large majority they're old, plenty haven't changed in a year or more. Meaning: it's downloading all the data, using tons of time and bandwidth, all for the sake of a few hundred files.
Has something changed in recent OSX gsutil versions? Is there possibly a bug? How would I even start to go about tracking this down? Or reporting it? The newsgroups gsutil-discuss and gs-discussion have been archived, and the talk in gce-discussion is all about using gsutil from GCE instances.
Thanks!

I had a similar issue where the same files were synced over and over. I don't have that many files so you might need to check for performance but I decided to use the -c option to force using the checksum instead of mtime which was modified locally in my build process.
I think (and hope) the documentation is slightly wrong stating that
compare checksums for files if the size of source and destination as
well as mtime match
as it seems to use checksum even if mtime does not match

gsutil 4.20 (released 2016-07-20) modified the change detection algorithm for rsync. Instead of comparing only the size of the local file with its cloud counterpart, it now compares both the size and file modification time of local files. The file modification time is stored in the custom user metadata for the file when it is uploaded with rsync. If that doesn't exist the object creation time is used.

On resume gsutil seems to re-upload files

I'm trying to upload data to Google Cloud Storage from a disk with ~3000 files totalling 1TB. I'm using gsutil cp -R <disk-top-directory> <bucket>. My understanding is that, if gsutil is resumed/restarted, it uses checksums to determine when a file has already been uploaded and skips over it.
It doesn't appear to be doing this: it appears to be resuming the upload from the top and replacing the files all over again. When I run successive gsutil ls -Rl <bucket/disk-top-directory> ten minutes apart and compare them with diff, I see what appears to be the same files with the same sizes but a changed (newer) date. (i.e. consistent with the same file being re-uploaded.)
For example:
< 404104811 2014-04-08T14:13:44Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
---
> 404104811 2014-04-08T14:43:48Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
The machine I'm using to read the disk and transfer files is running Ubuntu 13.10. I installed gsutil using the pip instructions for Debian and Ubuntu.
Am I misunderstanding how gsutil's resumable transfers is supposed to work? If not, any diagnosis and fix to get the correct resume behavior? Thanks in advance!

You need to use the -n (No-clobber) switch to prevent the re-uploading of objects that already exist at the destination.
gsutil cp -Rn <disk-top-directory> <bucket>
From the help (gsutil help cp)
-n No-clobber. When specified, existing files or objects at the
destination will not be overwritten. Any items that are skipped
by this option will be reported as being skipped. This option
will perform an additional HEAD request to check if an item
exists before attempting to upload the data. This will save
retransmitting data, but the additional HTTP requests may make
small object transfers slower and more expensive.
Also according to this, when transfering files over 2MB, gsutil automatically uses a resumable transfer mode.

If you're open to working with the (still beta) gsutil v4, that version of gsutil has an rsync command. You can get this by running:
gsutil update gs://prerelease/gsutil_4.0beta2pre_minus_m_sugg.tar.gz
Please be sure to read the release notes before switching to this major new release, especially if you're using gsutil v3 in scripts.

I cannot upload large (> 2GB) files to the Google Cloud Storage web UI

I have been using the Google Cloud Storage Manager link on the Google APIs console in order to upload my files.
This works great for most files: 1KB, 10KB, 1MB, 10MB, 100MB. However yesterday I could not upload a 3GB file. Any idea what is wrong?
What is the best way to upload large files to Google Cloud Storage?

The web UI only supports uploads smaller than 2^32 bytes (4 GigaBytes). I believe this is a javascript limitation.
If you need to transfer many or large files consider using gsutil:
GSUtil uploads and downloads any size file.
GSUtil resumes uploads and resumes downloads that fail part way through.
GSUtil calculates the MD5 checksum to verify the contents of each file transferred correctly.
GSUtil can upload and download many files at the same time.
gsutil -m cp /path/to/*thousands-of-files* gs://my-bucket/*

In my experience, the accepted answer is not correct - maybe it was but something has changed.
I just uploaded a file of size 2.2GB to GCS using the web interface on Chrome 42 on Windows 8.1.
I would also point out that the question is about files > 2GB, and the answer mentions 2GB, but gets that from 2^32, which is 4GB, not 2. So maybe the limit really is 2^32 (4GB) - I haven't tried anything that big.
(It is still a good idea to use gsutil for large files.)