Batch file uploading to cloud storage - google-cloud-storage

Could anyone cut and paste a working request to upload several files to Cloud Storage in a batch? I am really struggling to get it working; there are no examples of file uploads and I'm really stuck. I could probably work it out if I had a working starting point. I'm starting to go crazy, so any help would be much appreciated.

You can find an example at [1] and consult this other answer at [2] for reference.
I would suggest using gsutil to copy the files, even as an external call from your application (PHP exec() or system()), since this tool is optimised for parallel file transfer (the -m option) and recursive folder copy (the -R option), which makes it very simple and efficient.
For more help on the gsutil copy command: gsutil help cp
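As a concrete starting point, a single command along these lines uploads a whole folder in parallel (the local path and bucket name here are just placeholders):
gsutil -m cp -R /path/to/local/folder gs://my-bucket/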
Links:
[1] - https://cloud.google.com/storage/docs/json_api/v1/how-tos/batch#example
[2] - Batch upload requests to Google Cloud Storage using javascript
Regards
Paolo

Related

Using PowerShell to upload to AWS S3

Hopefully this is a quick fix (most likely user error). I am using PowerShell to upload to AWS S3, attempting to copy a number of .mp4s from a folder to an S3 location. I'm able to copy individual files successfully using the command below:
aws s3 cp .\video1.mp4 s3://bucketname/root/source/
But when I try to copy all the files within that directory I get an error:
aws s3 cp F:\folder1\folder2\folder3\folder4\* s3://bucketname/root/source/
The user-provided path F:\folder1\folder2\folder3\folder4\* does not exist.
I've tried multiple variations on the above: no path, just *, *.mp4, .*.mp4 (coming from a Linux background), with quotation marks, etc., but I can't seem to get it working.
I was initially using this documentation: https://www.tutorialspoint.com/how-to-copy-folder-contents-in-powershell-with-recurse-parameter. I feel the answer is probably very simple, but I couldn't see what I was doing wrong.
Any help would be appreciated.
Thanks.
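For what it's worth, aws s3 cp does not expand wildcards itself; the usual workaround is to copy the folder recursively with an include filter, something along these lines (reusing the paths from the question):
aws s3 cp F:\folder1\folder2\folder3\folder4\ s3://bucketname/root/source/ --recursive --exclude "*" --include "*.mp4"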

Pull from and Push to S3 using Perl

Hello everyone! I have what I assume to be a simple problem, but I could use a hand digging in. I have a server that preprocesses data before translation. This is done by a series of Perl scripts developed over a decade ago (but they work!). This virtual server is being lifted into AWS. The change this makes for my scripts is that the locations they pull from and write to will now be S3 buckets.
The workflow is: copy all files in the source location to the local drive, preprocess the data file by file, and when complete move the preprocessed files to a final destination.
process_file ($workingDir, $dirEntry);
final_move;
move("$downloadDir/$dirEntry", "$archiveDir") or die "ERROR: Archive file $downloadDir/$dirEntry -> $archiveDir FAILED $!\n";
unlink("$workingDir/$dirEntry");
So, in this case $downloadDir and $archiveDir are S3 buckets.
Any advice on adapting this is appreciated.
TIA,
VtR
You have a few options.
Use a system like s3fs-fuse to mount your S3 bucket as a local drive. This would presumably require the smallest changes to your existing code.
Use the AWS Command Line Interface to copy your files to your S3 bucket (a rough sketch follows below).
Use the Amazon API (through something like Paws) to upload your files to S3.
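A minimal sketch of the AWS CLI option, with source-bucket, archive-bucket, and the local paths standing in for your real names:
aws s3 cp s3://source-bucket/ /local/workingdir/ --recursive
aws s3 mv /local/workingdir/preprocessed-file s3://archive-bucket/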

Is there any advantage to using gsutil or the google cloud storage API in production transfers?

Which is better to use with production transfers, gsutil, or the google cloud storage API?
gsutil uses a Google Cloud Storage API to transfer data, specifically the JSON API by default (you can change it). Its main advantage over using the API directly is that it has been tuned to transfer data quickly. For example, it can open up multiple simultaneous connections to GCS, each of which uploads or downloads part of the file concurrently, which in many cases can provide a significant boost to total throughput.
There's no reason that programming against the API directly could not also provide the same or even better performance, but I would expect gsutil to be at least a little bit faster on average if you implement things in the simplest possible manner.
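For a single large file, for example, that parallel behaviour can be enabled through gsutil's parallel composite uploads; the threshold value, file name, and bucket name below are just illustrative:
gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp bigfile gs://my-bucket/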
I'm not sure this is adding much over what Brandon has said. I'm very new to Google Cloud Storage and Python, but I've quickly found that I prefer to use the gsutil command line over the Python client library wherever possible. I create compute instances that copy a few GB of input data from Cloud Storage after they have booted. I found that it's both neater and faster to do this with the gsutil command line, so in my Python code I use:
import subprocess
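# -m parallelises the copy; the command string is handed to the shell exactly as it would be typed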
subprocess.call("gsutil -m cp gs://my-uberdata-archive/* /home/<username>/rawdata/", shell=True)
The main reasons are that I can do the copy in a single line, whereas it takes several lines using the client library, and, as Brandon points out, gsutil supports multi-threading with the '-m' flag. I haven't found an equivalent way to do this with the Python client library yet.

On resume gsutil seems to re-upload files

I'm trying to upload data to Google Cloud Storage from a disk with ~3000 files totalling 1TB. I'm using gsutil cp -R <disk-top-directory> <bucket>. My understanding is that, if gsutil is resumed/restarted, it uses checksums to determine when a file has already been uploaded and skips over it.
It doesn't appear to be doing this: it appears to be resuming the upload from the top and replacing the files all over again. When I run successive gsutil ls -Rl <bucket/disk-top-directory> ten minutes apart and compare them with diff, I see what appears to be the same files with the same sizes but a changed (newer) date. (i.e. consistent with the same file being re-uploaded.)
For example:
< 404104811 2014-04-08T14:13:44Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
---
> 404104811 2014-04-08T14:43:48Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
The machine I'm using to read the disk and transfer files is running Ubuntu 13.10. I installed gsutil using the pip instructions for Debian and Ubuntu.
Am I misunderstanding how gsutil's resumable transfers are supposed to work? If not, is there any diagnosis and fix to get the correct resume behavior? Thanks in advance!
You need to use the -n (No-clobber) switch to prevent the re-uploading of objects that already exist at the destination.
gsutil cp -Rn <disk-top-directory> <bucket>
From the help (gsutil help cp):
-n No-clobber. When specified, existing files or objects at the
destination will not be overwritten. Any items that are skipped
by this option will be reported as being skipped. This option
will perform an additional HEAD request to check if an item
exists before attempting to upload the data. This will save
retransmitting data, but the additional HTTP requests may make
small object transfers slower and more expensive.
Also according to this, when transferring files over 2MB, gsutil automatically uses a resumable transfer mode.
If you're open to working with the (still beta) gsutil v4, that version of gsutil has an rsync command. You can get this by running:
gsutil update gs://prerelease/gsutil_4.0beta2pre_minus_m_sugg.tar.gz
Please be sure to read the release notes before switching to this major new release, especially if you're using gsutil v3 in scripts.
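With v4 installed, a sketch of the rsync approach for this scenario (same placeholders as in the question) would be:
gsutil -m rsync -r <disk-top-directory> <bucket>
rsync only copies objects that are missing or have changed at the destination, which is the skip-already-uploaded behaviour being asked about.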

I cannot upload large (> 2GB) files to the Google Cloud Storage web UI

I have been using the Google Cloud Storage Manager link on the Google APIs console in order to upload my files.
This works great for most files: 1KB, 10KB, 1MB, 10MB, 100MB. However yesterday I could not upload a 3GB file. Any idea what is wrong?
What is the best way to upload large files to Google Cloud Storage?
The web UI only supports uploads smaller than 2^32 bytes (4 gigabytes). I believe this is a JavaScript limitation.
If you need to transfer many or large files, consider using gsutil:
gsutil uploads and downloads files of any size.
gsutil resumes uploads and downloads that fail part way through.
gsutil calculates the MD5 checksum to verify that the contents of each file transferred correctly.
gsutil can upload and download many files at the same time.
gsutil -m cp /path/to/*thousands-of-files* gs://my-bucket/
In my experience, the accepted answer is not correct; maybe it was once, but something has changed.
I just uploaded a file of size 2.2GB to GCS using the web interface on Chrome 42 on Windows 8.1.
I would also point out that the question is about files > 2GB, and the answer mentions 2GB but gets that figure from 2^32, which is 4GB, not 2GB. So maybe the limit really is 2^32 (4GB); I haven't tried anything that big.
(It is still a good idea to use gsutil for large files.)