Is "in the cloud" gsutil cp an atomic operation? - google-cloud-storage

Assuming I have copied one object into a Google Cloud Storage bucket using the following command:
gsutil -h "Cache-Control:public,max-age=3600" cp -a public-read a.html gs://some-bucket/
I now want to copy this file "in the cloud" while keeping the public ACL and simultaneously updating the Cache-Control header:
gsutil -h "Cache-Control:no-store" cp -p gs://some-bucket/a.html gs://some-bucket/b.html
Is this operation atomic? That is, can I be sure that the object gs://some-bucket/b.html will be available with the modified Cache-Control:no-store header from the moment it first exists?
The reason for my question: I'm using a Google Cloud Storage bucket as a CDN backend. While I want most of the objects in the bucket to be cached by the CDN according to the max-age provided in the Cache-Control header, I want to make sure that a few specific files, which are in fact copies of cacheable versions, are never cached by a CDN. It is therefore crucial that these objects never appear with Cache-Control:public,max-age=XXX at any point during the copy, but immediately carry the Cache-Control:no-store header, so there is no window in which a request coming from a CDN could read the copied object while a max-age is still present and cache an object that is never supposed to be cached.

Yes, copying to the new object with Cache-Control set will be atomic. You can verify this by looking at the metageneration property of the object.
For example, upload an object:
$ BUCKET=mybucket
$ echo foo | ./gsutil cp - gs://$BUCKET/foo.txt
Copying from <STDIN>...
/ [1 files][ 0.0 B/ 0.0 B]
Operation completed over 1 objects.
and you'll see that its initial metageneration is 1:
$ ./gsutil ls -L gs://$BUCKET/foo.txt | grep Meta
Metageneration: 1
Whenever an object's metadata is modified, the metageneration is changed. For example, if the cache control is updated later like so:
$ ./gsutil setmeta -h "Cache-Control:no-store" gs://$BUCKET/foo.txt
Setting metadata on gs://mybucket/foo.txt...
/ [1 objects]
Operation completed over 1 objects.
The new metageneration is 2:
$ ./gsutil ls -L gs://$BUCKET/foo.txt | grep Meta
Metageneration: 2
Now, if we run the copy command:
$ ./gsutil -h "Cache-Control:no-store" cp -p gs://$BUCKET/foo.txt gs://$BUCKET/bar.txt
Copying gs://mybucket/foo.txt [Content-Type=application/octet-stream]...
- [1 files][ 4.0 B/ 4.0 B]
Operation completed over 1 objects/4.0 B.
The metageneration of the new object is 1:
$ ./gsutil ls -L gs://$BUCKET/bar.txt | grep Meta
Metageneration: 1
This means that the object was written once and has not been modified since.
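As an extra check, the copy also carries the new header from the moment it exists; gsutil ls -L prints it alongside the metageneration (exact output formatting varies by gsutil version):
$ ./gsutil ls -L gs://$BUCKET/bar.txt | grep Cache-Control
    Cache-Control:          no-store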

Related

What parameter(s) do I have to pass `gsutil` to access a Google Cloud local storage? (storage-testbench)

For test purposes, I want to run the storage-testbench simulator. It allows me to send REST commands to a local server which is supposed to work like a Google Cloud Storage facility.
In my tests, I want to copy 3 files from my local hard drive to that local GCS-like storage facility using gsutil cp .... I found out that in order to connect to that specific server, I need additional options on the command line, as follows:
gsutil \
-o "Credentials:gs_json_host=127.0.0.1" \
-o "Credentials:gs_json_port=9000" \
-o "Boto:https_validate_certificates=False" \
cp -p test my-file.ext gs://bucket-name/my-file.ext
See .boto for details on defining the credentials.
Unfortunately, I get this error:
CommandException: No URLs matched: test
The name at the end (test) is the project identifier (-p test). There is an example in the README.md of the storage-testbench project, although it's just a variable in a URI.
How do I make the cp command work?
Note:
The gunicorn process shows that the first GET from the cp command works as expected. It returns a 200. So the issue seems to be inside gsutil. Also, I'm able to create the bucket just fine:
gsutil \
-o "Credentials:gs_json_host=127.0.0.1" \
-o "Credentials:gs_json_port=9000" \
-o "Boto:https_validate_certificates=False" \
mb -p test gs://bucket-name
Trying the mb a second time gives me a 409 as expected.
More links:
gsutil global options
gsutil cp ...
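A likely culprit: for gsutil cp, -p is the preserve-ACL flag and takes no argument (unlike mb -p <project>), so test is parsed as an extra source URL, which would explain "No URLs matched: test". A sketch of the copy without it, reusing the same local-testbench options and hypothetical bucket name from above:
gsutil \
-o "Credentials:gs_json_host=127.0.0.1" \
-o "Credentials:gs_json_port=9000" \
-o "Boto:https_validate_certificates=False" \
cp my-file.ext gs://bucket-name/my-file.ext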

Why do some buckets not appear after a gsutil ls?

When I do gsutil ls -p myproject-id I get a list of buckets (in my case 2 buckets), which I expect to be the list of all my buckets in the project:
gs://bucket-one/
gs://bucket-two/
But, if I do gsutil ls -p myproject-id gs://asixtythreecharacterlongnamebucket I actually get the elements of that long-named bucket:
gs://asixtythreecharacterlongnamebucket/somefolder/
So my question is: why, when I do an ls on the project, don't I get the long-named bucket in the results?
The only explanation that made sense to me was this: https://stackoverflow.com/a/34738829/3457432
But I'm not sure. Is this the reason? Or could there be other ones?
Are you sure that asixtythreecharacterlongnamebucket belongs to myproject-id? It really sounds like asixtythreecharacterlongnamebucket was created in a different project.
You can verify this by checking the bucket ACLs for asixtythreecharacterlongnamebucket and bucket-one and seeing if the project numbers in the listed entities match:
$ gsutil ls -Lb gs://asixtythreecharacterlongnamebucket | grep projectNumber
$ gsutil ls -Lb gs://bucket-one | grep projectNumber
Also note that the -p argument to ls has no effect in your second command when you're listing objects in some bucket. The -p argument only affects which project should be used when you're listing buckets in some project, as in your first command. Think of ls as listing the children resources belonging to some parent -- the parent of a bucket is a project, while the parent of an object is a bucket.
You aren't performing the same request!
gsutil ls -p myproject-id
Here you ask for all the bucket resources that belong to the project.
gsutil ls -p myproject-id gs://asixtythreecharacterlongnamebucket
Here you ask for all the objects that belong to the bucket asixtythreecharacterlongnamebucket, using myproject-id as the quota project.
In both cases, you need permission to access the resources.

"cat urls.txt | gsutil -m cp -I gs://target-bucket-name/" consistently hangs after transferring ~10,000 files

I am trying to copy ~80,000 images from one google cloud storage bucket to another.
I am initiating this operation from a Mac with Google Cloud SDK 180.0.1, which contains gsutil 4.28.
The URL of each image to be transferred is in a text file, which I feed to gsutil cp like so ...
$ cat urls.txt | gsutil -m cp -I gs://target-bucket-name/
wherein urls.txt looks like ...
head -3 urls.txt
gs://source-bucket-name/1506567870546.jpg
gs://source-bucket-name/1506567930548.jpg
gs://source-bucket-name/1507853339446.jpg
The process consistently hangs after ~10,000 of the images have been transferred.
I have edited $HOME/.boto to uncomment:
parallel_composite_upload_threshold = 0
This has not prevented the operation from hanging.
I am uncertain what causes the hanging.
The underlying need is for a general utility to copy N items from one bucket to another. I need a workaround that will enable me to accomplish that mission.
UPDATE
Removing the -m option seems to work around the hanging problem, but the file transfer is now significantly slower. I would like to avoid the hanging problem while still gaining the speed that comes with concurrency, if possible.
gsutil should not be hanging. This is a bug. Could you record the output of gsutil -D and, when it hangs, create an issue in the gsutil GitHub repo with the output attached and comment here with a link to it? You can use the following command to log the output:
$ cat urls.txt | gsutil -D -m cp -I gs://target-bucket-name/ 2>&1 | tee output
In the meantime, you could try experimenting with reducing the number of threads and processes that the parallel mode (-m) uses by changing these defaults in your boto file:
parallel_process_count = 1 # Default - 12
parallel_thread_count = 10 # Default - 10
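If you'd rather not edit the boto file, the same settings can be passed per invocation with -o (they live in the [GSUtil] section of the boto config):
$ cat urls.txt | gsutil -o "GSUtil:parallel_process_count=1" -o "GSUtil:parallel_thread_count=10" -m cp -I gs://target-bucket-name/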
Note that gsutil has options to copy all the files in a bucket or subdirectory to a new bucket, as well as to copy only the files that have changed or don't exist in the target, with the following commands:
gsutil -m cp 'gs://source-bucket/*' gs://target-bucket
gsutil -m cp 'gs://source-bucket/dir/**' gs://target-bucket
gsutil -m rsync -r gs://source-bucket gs://target-bucket
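If you need to stay with an explicit URL list, one possible workaround (just a sketch, with a chunk size chosen to stay under the point where the hang appears) is to split urls.txt into pieces and feed them to gsutil one chunk at a time:
$ split -l 5000 urls.txt urls_chunk_
$ for f in urls_chunk_*; do cat "$f" | gsutil -m cp -I gs://target-bucket-name/; done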

wget --warc-file --recursive, prevent writing individual files

I run wget to create a warc archive as follows:
$ wget --warc-file=/tmp/epfl --recursive --level=1 http://www.epfl.ch/
$ ls -lh /tmp/epfl.warc.gz
-rw-r--r-- 1 david wheel 657K Sep 2 15:18 /tmp/epfl.warc.gz
$ find .
./www.epfl.ch/index.html
./www.epfl.ch/public/hp2013/css/homepage.70a623197f74.css
[...]
I only need the epfl.warc.gz file. How do I prevent wget from creating all the individual files?
I tried as follows:
$ wget --warc-file=/tmp/epfl --recursive --level=1 --output-document=/dev/null http://www.epfl.ch/
ERROR: -k or -r can be used together with -O only if outputting to a regular file.
tl;dr Add the options --delete-after and --no-directories.
Option --delete-after instructs wget to delete each downloaded file immediately after its download is complete. As a consequence, the maximum disk usage during execution will be the size of the WARC file plus the size of the single largest downloaded file.
Option --no-directories prevents wget from leaving behind a useless tree of empty directories. By default wget creates a directory tree that mirrors the one on the host, and downloads each file into the appropriate directory of the mirrored tree. wget does this even when the downloaded file is temporary due to --delete-after. To prevent that, use option --no-directories.
The below demonstrates the result, using your given example (slightly altered).
$ cd $(mktemp -d)
$ wget --delete-after --no-directories \
--warc-file=epfl --recursive --level=1 http://www.epfl.ch/
...
Total wall clock time: 12s
Downloaded: 22 files, 1.4M in 5.9s (239 KB/s)
$ ls -lhA
-rw-rw-r--. 1 chadv chadv 1.5M Aug 31 07:55 epfl.warc
If you forget to use --no-directories, you can easily clean up the tree of empty directories with find -type d -delete.
For individual files (without --recursive), the option -O /dev/null stops wget from creating a file for the output. For recursive fetches, /dev/null is not accepted (I don't know why). But why not just write all the output concatenated into one single file via -O tmpfile and delete that file afterwards?
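A sketch of that single-file approach (the scratch path is just an example):
$ wget --warc-file=/tmp/epfl --recursive --level=1 \
    --output-document=/tmp/wget-scratch http://www.epfl.ch/
$ rm /tmp/wget-scratch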

Fast way of deleting non-empty Google bucket?

Is this my only option or is there a faster way?
# Delete contents in bucket (takes a long time on large bucket)
gsutil -m rm -r gs://my-bucket/*
# Remove bucket
gsutil rb gs://my-bucket/
Buckets are required to be empty before they're deleted. So before you can delete a bucket, you have to delete all of the objects it contains.
You can do this with gsutil rm -r (documentation). Just don't pass the * wildcard and it will delete the bucket itself after it has deleted all of the objects.
gsutil -m rm -r gs://my-bucket
Google Cloud Storage bucket deletes can't succeed until the bucket listing returns 0 objects. If objects remain, you can get a Bucket Not Empty error (or in the UI's case 'Bucket Not Ready') when trying to delete the bucket.
gsutil has built-in retry logic to delete both buckets and objects.
Another option is to enable Lifecycle Management on the bucket. You could specify an Age of 0 days and then wait a couple days. All of your objects should be deleted.
Using the Python client, you can force a delete within your script with:
bucket.delete(force=True)
Try a similar approach in whichever language you are using.
There is a GitHub thread that discusses this; it deserves to be summarized and pointed out here.
Deleting with gsutil rm is slow if you have LOTS (terabytes) of data:
gsutil -m rm -r gs://my-bucket
However, you can specify an expiration for the bucket and let GCS do the work for you. Create a fast-delete.json policy:
{
  "rule": [
    {
      "action": {
        "type": "Delete"
      },
      "condition": {
        "age": 0
      }
    }
  ]
}
then apply it:
gsutil lifecycle set fast-delete.json gs://MY-BUCKET
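As a quick sanity check, gsutil lifecycle get will echo the rule back once it's applied:
gsutil lifecycle get gs://MY-BUCKET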
Thanks, @jterrace and @Janosch
Use this to set an appropriate lifecycle rule, e.g. wait for a day.
https://cloud.google.com/storage/docs/gsutil/commands/lifecycle
Example (read carefully before copy-pasting):
gsutil lifecycle set [LIFECYCLE_CONFIG_FILE] gs://[BUCKET_NAME]
Example (read carefully before copy-pasting):
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 1}
    }
  ]
}
Then delete the bucket.
This will delete the data asynchronously, so you don't have to keep some background job running on your end.
A shorter one-liner for the lifecycle change:
gsutil lifecycle set <(echo '{"rule":[{"action":{"type":"Delete"},"condition":{"age":0}}]}') gs://MY-BUCKET
I've also had good luck creating an empty bucket then starting a transfer to the bucket I want to empty out. Our largest bucket took about an hour to empty this way; the lifecycle method seems to take at least a day.
I benchmarked deletes using three techniques:
Storage Transfer Service: 1200 - 1500 / sec
gcloud alpha storage rm: 520 / sec
gsutil -m rm: 240 / sec
The big winner is the Storage Transfer Service. To delete files with it, you need a source bucket (or folder in a bucket) that is empty, and then you copy that to a destination bucket (or folder in that bucket) that you want to be empty.
If using the GUI, enable the advanced transfer option that deletes objects from the destination if they are not also in the source.
You can also create and run the job from the CLI. This example assumes you have access to gs://bucket1/empty/ (which has no objects in it) and you want to delete all objects from gs://bucket2/:
gcloud transfer jobs create \
gs://bucket1/empty/ gs://bucket2/ \
--delete-from=destination-if-unique \
--project my-project
If you want your deletes to happen even faster, you'll need to create multiple transfer jobs and have them target different sections of the bucket. Because each job has to do a bucket listing to find the files to delete, you'd want to make the destination paths non-overlapping (e.g. gs://bucket2/folder1/ and gs://bucket2/folder2/, etc.). The jobs will run in parallel, getting the work done in less total time.
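For example, the parallel jobs might look something like this (same hypothetical bucket names as above):
gcloud transfer jobs create \
gs://bucket1/empty/ gs://bucket2/folder1/ \
--delete-from=destination-if-unique \
--project my-project
gcloud transfer jobs create \
gs://bucket1/empty/ gs://bucket2/folder2/ \
--delete-from=destination-if-unique \
--project my-project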
Usually I like this better than using Object Lifecycle Management (OLM) because it starts right away (no waiting up to 24 hours for policy evaluation) but there may be times when OLM is the way to go.
Remove the bucket from the Developers Console. It will ask for confirmation before deleting a non-empty bucket. It works like a charm ;)
I've tried both ways (expiration time and the gsutil command directly on the bucket root), but I could not wait for the expiration time to propagate.
The gsutil rm was deleting 200 files per second, so I did this:
Open several terminals and execute gsutil rm using different "folder" name prefixes with *,
e.g.:
gsutil -m rm -r gs://my-bucket/a*
gsutil -m rm -r gs://my-bucket/b*
gsutil -m rm -r gs://my-bucket/c*
In this example, the command is able to delete 600 files per second.
So you just need to open more terminals and find the patterns to delete more files.
If one wildcard covers a huge number of objects, you can break it down further, like this:
gsutil -m rm -r gs://my-bucket/b1*
gsutil -m rm -r gs://my-bucket/b2*
gsutil -m rm -r gs://my-bucket/b3*
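If you'd rather not juggle several terminals, the same idea can be run from a single shell by backgrounding each command (a sketch, using the same example prefixes):
gsutil -m rm -r gs://my-bucket/a* &
gsutil -m rm -r gs://my-bucket/b* &
gsutil -m rm -r gs://my-bucket/c* &
wait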