I cannot upload large (> 2GB) files to the Google Cloud Storage web UI - google-cloud-storage

I have been using the Google Cloud Storage Manager link on the Google APIs console in order to upload my files.
This works great for most files: 1KB, 10KB, 1MB, 10MB, 100MB. However yesterday I could not upload a 3GB file. Any idea what is wrong?
What is the best way to upload large files to Google Cloud Storage?

The web UI only supports uploads smaller than 2^32 bytes (4 gigabytes). I believe this is a JavaScript limitation.
If you need to transfer many files or large files, consider using gsutil:
gsutil uploads and downloads files of any size.
gsutil resumes uploads and downloads that fail partway through.
gsutil calculates an MD5 checksum to verify that the contents of each file transferred correctly.
gsutil can upload and download many files at the same time:
gsutil -m cp /path/to/*thousands-of-files* gs://my-bucket/*

In my experience, the accepted answer is not correct - maybe it was but something has changed.
I just uploaded a file of size 2.2GB to GCS using the web interface on Chrome 42 on Windows 8.1.
I would also point out that the question is about files > 2GB, and the answer mentions 2GB, but gets that from 2^32, which is 4GB, not 2. So maybe the limit really is 2^32 (4GB) - I haven't tried anything that big.
(It is still a good idea to use gsutil for large files.)
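To make the 2GB versus 4GB confusion concrete, here is a small arithmetic sketch. It only compares the sizes mentioned in this thread against the two candidate limits; it says nothing about what the web UI actually enforces, and treating "GB" as 10^9 bytes is my assumption.

```python
# Candidate limits discussed in the answers above.
GIB = 2**30            # bytes in one binary gigabyte
limit_31 = 2**31       # 2 GiB, the "2GB" reading of the limit
limit_32 = 2**32       # 4 GiB, the figure actually quoted (2^32)

# Sizes mentioned in the thread, read as decimal gigabytes (an assumption).
failed_upload = 3 * 10**9        # the 3GB file that could not be uploaded
uploaded_ok = 2_200_000_000      # the 2.2GB file that uploaded from Chrome 42

print("2^31 bytes =", limit_31, "=", limit_31 // GIB, "GiB")
print("2^32 bytes =", limit_32, "=", limit_32 // GIB, "GiB")

# Both files sit between the two candidate limits.
print("3GB file is between the limits:", limit_31 < failed_upload < limit_32)
print("2.2GB file is between the limits:", limit_31 < uploaded_ok < limit_32)
```

Both files fall between 2^31 and 2^32 bytes, so the 3GB failure and the later 2.2GB success cannot both be explained by one fixed limit, which fits the suggestion that something changed in between.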

Related

Use s3fs only to upload new files, don't care about existing ones already on bucket

I was hoping to use s3fs to upload new files to S3. In the documentation I saw that it doesn't work well when multiple clients are uploading/syncing to the same bucket.
I really don't care about syncing files from the bucket to my local drive; I only want the opposite: upload new files to S3 as they are created.
Is there a way to achieve that with s3fs? It wasn't clear from the docs whether flags offer that functionality.
s3fs does not synchronize files. Instead it intercepts the open, read, write, etc. calls and relays them to the S3 server. Thus it will work for your upload-only use case. Note that s3fs does use some temporary storage to stage the upload.

Upload data to Cloud Storage from external website

I need to upload data from a public source to one of my Cloud Storage buckets. Currently, I download the data to my machine and then upload it to GCS. Because the data sources are huge (60GB in all this week), I have begun running into problems doing it this way.
Is there a way to write straight into GCS from code, without the local download step?
UPDATE: I have tried using curl http://originaladdress | gsutil cp - gs://bucket. The problem is it would take 21 hours to do the whole process with 100 MB chunks, which is longer than it takes for me to download and upload the file. Is that right? Did I miss some parameter?
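For readers unsure what the pipe in the update is doing, here is a minimal sketch of a chunked streaming copy: data is read from a source and written to a sink one chunk at a time, so the whole file never has to sit on local disk. The in-memory streams and the `stream_copy` name are stand-ins I made up for illustration; this is not how curl or gsutil are implemented.

```python
import io

def stream_copy(source, sink, chunk_size=1024 * 1024):
    """Copy source to sink one chunk at a time; return the bytes copied."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:          # an empty read means end of stream
            break
        sink.write(chunk)
        total += len(chunk)
    return total

# Stand-ins for the HTTP download and the bucket object (hypothetical data).
source = io.BytesIO(b"x" * (3 * 1024 * 1024 + 17))
sink = io.BytesIO()

copied = stream_copy(source, sink)
print("bytes copied:", copied)
```

Only one chunk is held in memory at a time, which is the point of streaming. The trade-off is that throughput is bounded by the slower of the two ends.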

Why is gsutil rsync re-downloading all our files?

We've been using gsutil -m rsync -r to keep dev and deploy boxes in sync with a GCS bucket for nearly 2 years without any problem. There are about 85k objects in the bucket.
Until recently, this worked perfectly: we'd run a deploy-box -> GCS rsync every 15 mins or so, to keep all new uploaded resource backed up, and then a GCS -> dev box rsync whenever we wanted to refresh the local dev data (running on OSX El Capitan).
Within the last couple of months, though, the GCS->dev rsync has started to bloat, downloading more and more images.
Initially I just thought "great, we're getting more resources uploaded", but it's been growing way faster than the data, until today when it seems to be downloading the whole 85k images.
I've double-checked that I'm in the right place, the command is correct, the paths are correct, etc. The gsutil output scrolls by with reams and reams of "Copying..." and "Downloading..." messages, making good parallel use of our 100 Mbps connection, yet when I go to another terminal and run find . -type f | wc -l on the destination directory every 10 seconds, it shows that barely 2 or 3 new files are being added per minute. When I look at the modification times of files that gsutil says it is downloading right now, the large majority are old; plenty haven't changed in a year or more. In other words, it's downloading all the data, using tons of time and bandwidth, all for the sake of a few hundred files.
Has something changed in recent OSX gsutil versions? Is there possibly a bug? How would I even start to go about tracking this down? Or reporting it? The newsgroups gsutil-discuss and gs-discussion have been archived, and the talk in gce-discussion is all about using gsutil from GCE instances.
Thanks!
I had a similar issue where the same files were synced over and over. I don't have that many files, so you might need to check the performance impact, but I decided to use the -c option to force a checksum comparison instead of mtime, which my build process was modifying locally.
I think (and hope) the documentation is slightly wrong in stating "compare checksums for files if the size of source and destination as well as mtime match", as it seems to use the checksum even if mtime does not match.
gsutil 4.20 (released 2016-07-20) modified the change detection algorithm for rsync. Instead of comparing only the size of the local file with its cloud counterpart, it now compares both the size and file modification time of local files. The file modification time is stored in the custom user metadata for the file when it is uploaded with rsync. If that doesn't exist the object creation time is used.
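To illustrate why a size-and-mtime comparison can flag files that a checksum comparison would skip, here is a minimal sketch. It is not gsutil's code; the helper names and the `remote` record are hypothetical, and the "remote" side is just values remembered from the local file before its mtime is bumped.

```python
import hashlib
import os
import tempfile

def md5_of(path):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def differs_by_size_and_mtime(path, remote_size, remote_mtime):
    """Change detection in the style described above: size plus mtime."""
    st = os.stat(path)
    return st.st_size != remote_size or int(st.st_mtime) != int(remote_mtime)

def differs_by_checksum(path, remote_md5):
    """Change detection by content only, in the spirit of the -c option."""
    return md5_of(path) != remote_md5

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "image.bin")
    with open(path, "wb") as f:
        f.write(b"unchanged content")

    st = os.stat(path)
    # What the "bucket" remembers about the object (hypothetical record).
    remote = {"size": st.st_size, "mtime": int(st.st_mtime), "md5": md5_of(path)}

    # A build step touches the file: mtime moves, the bytes do not.
    os.utime(path, (st.st_atime, st.st_mtime + 3600))

    mtime_says_changed = differs_by_size_and_mtime(path, remote["size"], remote["mtime"])
    checksum_says_changed = differs_by_checksum(path, remote["md5"])

print(mtime_says_changed, checksum_says_changed)  # True False
```

A touched but otherwise identical file looks changed to the size-and-mtime test and unchanged to the checksum test, at the cost of reading every file to hash it.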

Batch file uploading to cloud storage

Could anyone cut and paste a working request to upload several files to cloud storage in a batch? I am really struggling to get it working; there are no examples of file uploads and I'm really stuck. I could probably work it out if I had a working starting point. I'm starting to go crazy, so any help would be much appreciated.
You can find an example at [1] and consult this other answer at [2] as reference.
I would suggest using gsutil to copy the files, even as an external call from your application (PHP exec() or system()), since this tool is optimised for parallel file transfer (-m option) and recursive folder copy (-R option), making it very simple and efficient.
For more help on the gsutil copy command: gsutil help cp
Links:
[1] - https://cloud.google.com/storage/docs/json_api/v1/how-tos/batch#example
[2] - Batch upload requests to Google Cloud Storage using javascript
Regards
Paolo

On resume gsutil seems to re-upload files

I'm trying to upload data to Google Cloud Storage from a disk with ~3000 files totalling 1TB. I'm using gsutil cp -R <disk-top-directory> <bucket>. My understanding is that, if gsutil is resumed/restarted, it uses checksums to determine when a file has already been uploaded and skips over it.
It doesn't appear to be doing this: it appears to be resuming the upload from the top and replacing the files all over again. When I run successive gsutil ls -Rl <bucket/disk-top-directory> ten minutes apart and compare them with diff, I see what appears to be the same files with the same sizes but a changed (newer) date. (i.e. consistent with the same file being re-uploaded.)
For example:
< 404104811 2014-04-08T14:13:44Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
---
> 404104811 2014-04-08T14:43:48Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
The machine I'm using to read the disk and transfer files is running Ubuntu 13.10. I installed gsutil using the pip instructions for Debian and Ubuntu.
Am I misunderstanding how gsutil's resumable transfers are supposed to work? If not, is there any diagnosis and fix to get the correct resume behavior? Thanks in advance!
You need to use the -n (No-clobber) switch to prevent the re-uploading of objects that already exist at the destination.
gsutil cp -Rn <disk-top-directory> <bucket>
From the help (gsutil help cp)
-n No-clobber. When specified, existing files or objects at the
destination will not be overwritten. Any items that are skipped
by this option will be reported as being skipped. This option
will perform an additional HEAD request to check if an item
exists before attempting to upload the data. This will save
retransmitting data, but the additional HTTP requests may make
small object transfers slower and more expensive.
Also, according to this, when transferring files over 2MB, gsutil automatically uses a resumable transfer mode.
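As a rough picture of what no-clobber means, here is a minimal sketch that models the bucket as a dictionary. The `copy_no_clobber` function and the file contents are hypothetical; the real tool checks for existence with a request to the service, as the help text above says.

```python
def copy_no_clobber(local_files, bucket):
    """Copy entries into bucket, skipping any name that already exists there."""
    copied, skipped = [], []
    for name, data in local_files.items():
        if name in bucket:       # existence check only, contents are not compared
            skipped.append(name)
            continue
        bucket[name] = data
        copied.append(name)
    return copied, skipped

# Hypothetical state after an interrupted run: one object already uploaded.
bucket = {"dir1/file-20.tsv.bz2": b"uploaded before the restart"}
local_files = {
    "dir1/file-20.tsv.bz2": b"same file on the local disk",
    "dir1/file-21.tsv.bz2": b"not uploaded yet",
}

copied, skipped = copy_no_clobber(local_files, bucket)
print("copied:", copied)
print("skipped:", skipped)
```

In this sketch the skip decision depends only on whether the name exists at the destination, so an object already in the bucket is left untouched on the second run.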
If you're open to working with the (still beta) gsutil v4, that version of gsutil has an rsync command. You can get this by running:
gsutil update gs://prerelease/gsutil_4.0beta2pre_minus_m_sugg.tar.gz
Please be sure to read the release notes before switching to this major new release, especially if you're using gsutil v3 in scripts.