Performance of gsutil cp command has declined - google-cloud-storage

We have observed that gsutil cp throughput for copying a single file to Google Cloud Storage was better when only a few such processes were running, each copying a different single file to a different location in Google Storage. The typical speed at that time was ~50 Mbps. But as the number of "gsutil cp" processes, each copying a single file, has increased, the average speed these days has dropped to ~10 Mbps.
I suppose the "gsutil -m cp" command will not improve performance, since there is only one file to be copied per process.
What can this drop in speed be attributed to as the number of gsutil cp processes increases, and what can we do to increase the speed of these processes?

gsutil can upload a single large file in parallel. It does so by uploading parts of the file as separate objects in GCS, asking GCS to compose them into the final object, and then deleting the individual sub-objects.
N.B. Because this involves uploading objects and then almost immediately deleting them, you shouldn't do this on Nearline buckets, since there's an extra charge for deleting objects that have been recently uploaded.
You can set a file size above which gsutil will use this behavior. Try this:
gsutil -o GSUtil:parallel_composite_upload_threshold=100M cp bigfile gs://your-bucket
More documentation on the feature is available here: https://cloud.google.com/storage/docs/gsutil/commands/cp#parallel-composite-uploads
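If you want the lower threshold applied to every gsutil invocation rather than passing -o each time, you can also set it in your .boto configuration file. A minimal sketch, assuming the default ~/.boto location and the same example threshold:
# ~/.boto
[GSUtil]
parallel_composite_upload_threshold = 100M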

Related

How to upload objects larger than 5 TB to Google Cloud Storage?

Trying to save a PostgreSQL backup (~20 TB) to Google Cloud Storage for the long term, I am currently piping the PostgreSQL pg_dump command into a streaming transfer through gsutil.
pg_dump -d $DB_NAME -b --format=t \
| gsutil cp - gs://$BUCKET_NAME/$BACKUP_FILE
However, I am worried that the process will crash because of GCS's 5 TB object size limit.
Is there any way to upload objects larger than 5 TB to Google Cloud Storage?
EDIT: using split?
I am considering piping pg_dump into Linux's split utility and then into gsutil cp.
pg_dump -d $DB -b --format=t \
| split -b 50G - \
| gsutil cp - gs://$BUCKET/$BACKUP
Would something like that work?
You generally don't want to upload a single object in the multi-terabyte range with a streaming transfer. Streaming transfers have two major downsides, and they're both very bad news for you:
Streaming Transfers don't use Cloud Storage's checksum support. You'll get regular HTTP data integrity checking, but that's it, and for periodic 5 TB uploads, there's a nonzero chance that this could eventually end up in a corrupt backup.
Streaming Transfers can't be resumed if they fail. Assuming you're uploading at 100 Mbps around the clock, a 5 TB upload would take at least 4 and a half days, and if your HTTP connection failed, you'd need to start over from scratch.
Instead, here's what I would suggest:
First, minimize the file size. pg_dump has a number of options for reducing the file size. It's possible something like "--format=c -Z9" might produce a much smaller file.
Second, if possible, store the dump as a file (or, preferably, a series of split-up files) before uploading. This is good because you'll be able to calculate their checksums, which gsutil can take advantage of, and you'll also be able to manually verify that they uploaded correctly if you want. Of course, this may not be practical because you'll need a spare 5 TB of hard drive space, but unless your database won't be changing for a few days, there may not be an easy alternative for retrying if you lose your connection partway through.
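For illustration, here is a minimal sketch combining those two suggestions, assuming placeholder paths, a 50 GB chunk size, and the $DB_NAME / $BUCKET_NAME variables from the question; adjust the pg_dump flags to your setup:
# dump to a compressed custom-format archive on disk instead of streaming
pg_dump -d "$DB_NAME" --format=c -Z9 -f /backups/db.dump
# split the archive into pieces well under the 5 TB object limit
split -b 50G /backups/db.dump /backups/db.dump.part-
# upload the pieces; gsutil can checksum and resume uploads of local files
gsutil -m cp /backups/db.dump.part-* gs://$BUCKET_NAME/backups/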
As mentioned by Ferregina Pelona, guillaume blaquiere, and John Hanley, there is no way to bypass the 5 TB limit implemented by Google, as stated in this document:
Cloud Storage 5 TB object size limit
Cloud Storage supports a maximum single-object size up to 5 terabytes. If you have objects larger than 5 TB, the object transfer fails for those objects for either Cloud Storage or Transfer for on-premises.
If the file surpasses the limit (5 TB), the transfer fails.
You can use Google's Issue Tracker to request this feature. Within the link provided, you can check which features have already been requested, or file a request that matches your needs.

How to avoid uploading a file that is being partially written - Google Storage rsync

I'm using Google Cloud Storage with the rsync option.
I created a cron job that syncs files every minute.
But there's a problem.
If a file is still being partially written when the cron job runs, it syncs the partial file even though the write isn't done.
Is there a way to solve this problem?
The gsutil rsync command doesn't have any way to check that a file is still being written. You will need to coordinate your writing and rsync'ing jobs so that they operate on disjoint parts of the file tree.
For example, you could arrange for your writing job to write to directory A while your rsync job syncs from directory B, and then switch pointers so your writing job writes to directory B while your rsync job syncs from directory A.
Another option would be to set up a staging area into which you copy all the files that have finished being written before running your rsync job. If you put it on the same file system as where they were written, you could use hard links so the link operation works quickly (without byte copying).
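A minimal sketch of the staging-area idea, assuming a flat /data/incoming directory on the same file system as the staging directory, and treating a file as finished once it hasn't been modified for a minute (both the paths and that heuristic are placeholders):
# hard-link files untouched for at least a minute into a staging directory
STAGING=/data/staging
mkdir -p "$STAGING"
find /data/incoming -maxdepth 1 -type f -mmin +1 -exec ln -f {} "$STAGING"/ \;
# rsync only the staging directory, so partially written files never upload
gsutil -m rsync -r "$STAGING" gs://my-bucket/incoming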

Bulk file restore from Google Cloud Storage

I accidentally ran a delete command on the wrong bucket. Object versioning is turned on, but I don't really understand what steps I should take to restore the files or, more importantly, how to do it in bulk, as I've deleted a few hundred of them.
I will appreciate any help.
To restore hundreds of objects you could do something as simple as:
gsutil cp -AR gs://my-bucket gs://my-bucket
This will copy all objects (including deleted ones) to the live generation, using metadata-only copying, i.e., without re-copying the actual bytes. Caveats:
It will leave the deleted generations in place, so costing you extra storage.
If your bucket isn't empty this command will re-copy any live objects on top of themselves (ending up with an extra archived version of each of those as well, also costing you for extra storage).
If you want to restore a large number of objects this simplistic script would run too slowly - you'd want to parallelize the individual gsutil cp operations. You can't use the gsutil -m option in this case, because gsutil prevents that, in order to preserve generation ordering (e.g., if there were several generations of objects with the same name, parallel copying them would end up with the live generation coming from an unpredictable generation). If you only have 1 generation of each you could parallelize the copying by doing something like:
gsutil ls -a gs://my-bucket/** | sed 's/\(.*\)\(#[0-9]*\)/gsutil cp \1\2 \1 \&/' > gsutil_script.sh
This generates a listing of all objects (including deleted ones) and transforms it into a sequence of gsutil cp commands that copy those objects (by generation-specific name) back to the live generation in parallel. If the list is long, you'll want to break it into parts so you don't (for example) try to fork 100k processes to do the parallel copying (which would overload your machine).
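As an alternative sketch that caps concurrency instead of forking one shell job per object (again, only safe when each object has a single generation, and assuming object names contain no whitespace; the 20-way parallelism is arbitrary):
# emit "versioned-source live-destination" pairs and copy 20 at a time
gsutil ls -a 'gs://my-bucket/**' \
  | sed 's/\(.*\)\(#[0-9]*\)/\1\2 \1/' \
  | xargs -n 2 -P 20 gsutil cp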

What is the fastest way to duplicate google storage bucket?

I have one 10 TB bucket and need to create a copy of it as quickly as possible. What is the fastest and most effective way of doing this?
Assuming you want to copy the bucket to another bucket in the same location and storage class, you could run gsutil rsync on a GCE instance:
gsutil -m rsync -r -d -p gs://source-bucket gs://dest-bucket
If you want to copy across locations or storage classes the above command will still work, but it will take longer because in that case the data (not just metadata) need to be copied.
Either way, you should check the result status and re-run the rsync command if any errors occurred. (The rsync command will avoid re-copying objects that have already been copied.) You should repeat the rsync command until the bucket has successfully been fully copied.
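For example, a minimal retry sketch (the bucket names and the five-attempt cap are placeholders):
# repeat until rsync exits cleanly; objects already copied are skipped
for attempt in 1 2 3 4 5; do
  gsutil -m rsync -r -d -p gs://source-bucket gs://dest-bucket && break
done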
One simple way is to use Google's Cloud Storage Transfer Service. It may also be the fastest, though I have not confirmed this.
You can achieve this easily with gsutil.
gsutil -m cp -r gs://source-bucket gs://duplicate-bucket
Are you copying within Google Cloud Storage to a bucket with the same location and storage class? If so, this operation should be very fast. If the buckets have different locations and/or storage classes, the operation will be slower (and more expensive), but this will still be the fastest way.

Composing objects with > 1024 parts without download/upload

Is there a way to either clear the compose count or copy an object inside cloud storage so as to remove the compose count without downloading and uploading again?
With a 5TB object size limit, I'd need 5GB pieces composed together with a 1024 compose limit -- are 5GB uploads even possible? They are certainly not easy to work with.
The compose count should be higher (1MM) or I should be able to copy an object within cloud storage to get rid of the existing compose count.
There is no longer a restriction on the component count. Composing > 1024 parts is allowed.
https://cloud.google.com/storage/docs/composite-objects
5 GB uploads are definitely possible. You can use a tool such as gsutil to perform them easily.
There's not an easy way to reduce the existing component count, but it is possible using the Rewrite API. Per the documentation: "When you rewrite a composite object where the source and destination are different locations and/or storage classes, the result will be a composite object containing a single component."
So you can create a bucket of a different storage class, rewrite the object into it, then rewrite it back to your original bucket and delete the copy. gsutil uses the rewrite API under the hood, so you could do all of this with gsutil cp:
$ gsutil mb -c DRA gs://dra-bucket
$ gsutil cp gs://original-bucket/composite-obj gs://dra-bucket/composite-obj
$ gsutil cp gs://dra-bucket/composite-obj gs://original-bucket/composite-obj
$ gsutil rm gs://dra-bucket/composite-obj