Google Cloud Storage: What is the easiest way to update the timestamp of all files under all subfolders?

I have date-wise folders in the form root-dir/yyyy/mm/dd,
under which there are many files.
I want to update the timestamp of all files falling under a certain date range,
for example 2 weeks, i.e. 14 folders, so that these files can be picked up by my file-streaming data ingestion process.
What is the easiest way to achieve this?
Is there a way in the UI console, or is it through gsutil?
Please help.

GCS objects are immutable, so the only way to "update" the timestamp would be to copy each object on top of itself, e.g., using:
gsutil cp gs://your-bucket/object1 gs://your-bucket/object1
(and looping over all objects you want to do this to).
This is a fast (metadata-only) operation, which will create a new generation of each object, with a current timestamp.
Note that if you have versioning enabled on the bucket, doing this will create an extra version of each file you copy this way.
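For example, a rough loop over a two-week range of date folders might look like this (a sketch only; the bucket name, root-dir path, and start date are placeholders, and it assumes GNU date):
BKT=gs://your-bucket
for i in $(seq 0 13); do
  # Build the yyyy/mm/dd prefix for each of the 14 days (GNU date syntax).
  day=$(date -d "2016-01-01 + $i days" +%Y/%m/%d)
  # Copy every object under that prefix onto itself (fast, metadata-only copy).
  for obj in $(gsutil ls "$BKT/root-dir/$day/**" 2>/dev/null); do
    gsutil cp "$obj" "$obj"
  done
done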

When you say "folders in the form of root-dir/yyyy/mm/dd", do you mean that you're copying those objects into your bucket with names like gs://my-bucket/root-dir/2016/12/25/christmas.jpg? If not, see Mike's answer; but if they are named with that pattern and you just want to rename them, you could use gsutil's mv command to rename every object with that prefix:
$ export BKT=my-bucket
$ gsutil ls gs://$BKT/**
gs://my-bucket/2015/12/31/newyears.jpg
gs://my-bucket/2016/01/15/file1.txt
gs://my-bucket/2016/01/15/some/file.txt
gs://my-bucket/2016/01/15/yet/another-file.txt
$ gsutil -m mv gs://$BKT/2016/01/15 gs://$BKT/2016/06/20
[...]
Operation completed over 3 objects/12.0 B.
# We can see that the prefixes changed from 2016/01/15 to 2016/06/20
$ gsutil ls gs://$BKT/**
gs://my-bucket/2015/12/31/newyears.jpg
gs://my-bucket/2016/06/20/file1.txt
gs://my-bucket/2016/06/20/some/file.txt
gs://my-bucket/2016/06/20/yet/another-file.txt
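If you wanted to apply the same rename across a full two-week range of date prefixes, a rough loop might look like the following (a sketch only; the bucket name and start date are placeholders, and nesting each source day under today's prefix is just one way to avoid name collisions):
BKT=my-bucket
today=$(date +%Y/%m/%d)
for i in $(seq 0 13); do
  day=$(date -d "2016-01-15 + $i days" +%Y/%m/%d)   # GNU date syntax
  # Move each day's prefix under today's date; keep the original day as a
  # sub-prefix so files from different days cannot collide.
  gsutil -m mv "gs://$BKT/$day" "gs://$BKT/$today/$day" || true
done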

Related

Nearline - Backup Solution - Versioning

I've set up some Nearline buckets and enabled versioning and object lifecycle management. The use case is to replace my current backup solution, Crashplan.
Using gsutil I can see the different versions of a file using a command like gsutil ls -al gs://backup/test.txt.
First, is there any way of finding files that don't have a live version (e.g. deleted) but still have a version attached?
Second, is there any easier way of managing versions? For instance, if I delete a file from my PC, it will no longer have a live version in my bucket but will still have the older versions associated with it. If I didn't know the file name, would I just have to do a recursive ls on the entire bucket and sift through the output?
Would love a UI that supported versioning.
Thanks.
To check whether an object currently has no live version, use the x-goog-if-generation-match header with a value of 0. For example:
gsutil -h x-goog-if-generation-match:0 cp file.txt gs://bucket/file.txt
will fail (PreconditionException: 412 Precondition Failed) if the file has a live version, and will succeed if it has only archived versions.
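As a rough way to find names that have only archived versions across the whole bucket, you could also diff a full versioned listing against the live listing (a sketch only; the bucket name is a placeholder, and it relies on the #generation suffix that gsutil ls -a appends to each version):
BKT=gs://backup
# Names that appear in some version but are missing from the live listing,
# i.e. objects that only have archived versions left.
comm -23 \
  <(gsutil ls -a "$BKT/**" | sed 's/#[0-9]*$//' | sort -u) \
  <(gsutil ls "$BKT/**" | sort -u)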
To automatically synchronize your local folder with a folder in the bucket (or the other way around), use gsutil rsync:
gsutil rsync -r -d ./test gs://bucket/test/
Notice the trailing / in gs://bucket/test/; without it you will receive:
CommandException: arg (gs://graham-dest/test) does not name a directory, bucket, or bucket subdir.
-r synchronizes all the directories in ./test recursively to gs://bucket/test/
-d deletes all files from gs://bucket/test/ that are not found in ./test
Regarding a UI, there already exists a feature request. I don't know anything about third-party applications, however.

Google Cloud Storage: How to get list of new files in bucket/folder using gsutil

I have a bucket/folder into which a lot of files are coming in every minute.
How can I read only the new files, based on file timestamp?
eg: list all files with timestamp > my_timestamp
You could use some bash-fu:
gsutil ls -l gs://<your-bucket-name> | sort -k2n | tail -n1 | awk 'END {$1=$2=""; sub(/^[ \t]+/, ""); print }'
breaking that down:
# grab detailed list of objects in bucket
gsutil ls -l gs://your-bucket-name
# sort by number on the date field
sort -k2n
# grab the last row returned
tail -n1
# delete first two cols (size and date) and ltrim to remove whitespace
awk 'END {$1=$2=""; sub(/^[ \t]+/, ""); print }'
Tested with Google Cloud SDK v186.0.0, gsutil v4.28
This is not a feature that gsutil or the GCS API provides, as there is no way to list objects by timestamp.
Instead, you could subscribe to new objects using the GCS Cloud Pub/Sub feature.
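For example, a notification configuration can be attached to the bucket so that new objects generate Pub/Sub messages instead of being discovered by listing (a sketch only; the topic, subscription, and bucket names are placeholders, and the subscription has to be created separately):
# Publish an OBJECT_FINALIZE (new object) event to a Pub/Sub topic.
gsutil notification create -t new-objects-topic -f json -e OBJECT_FINALIZE gs://your-bucket
# Pull the resulting messages from a subscription on that topic.
gcloud pubsub subscriptions pull new-objects-sub --auto-ack --limit=10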
If you are interested in new files, in other words files which are not yet present in your destination bucket, then alternatively you can use the gsutil cp -n option, as it copies only those files which are not present in the destination bucket.
From the documentation:
https://cloud.google.com/storage/docs/gsutil/commands/cp?hl=ru
No-clobber. When specified, existing files or objects at the destination will not be overwritten. Any items that are skipped by this option will be reported as being skipped. This option will perform an additional GET request to check if an item exists before attempting to upload the data. This will save retransmitting data, but the additional HTTP requests may make small object transfers slower and more expensive.
The downside of this approach is that it makes a check request for every file present in your source bucket.
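For example (a sketch only; the bucket names and prefix are placeholders):
# Copy objects under the prefix, skipping any that already exist at the destination.
gsutil -m cp -n "gs://source-bucket/incoming/*" "gs://dest-bucket/incoming/"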

GCS CLI: using “gsutil rm” to remove files by creation date

Is there a way to remove files from GoogleCloudStorage, using the CLI by their creation date?
For example:
I would like to remove all files in a specific path whose creation date is earlier than 2016-12-01.
There's no built-in way in the CLI to delete by date. There are a couple of ways to accomplish something like this. One possibility is to use an object naming scheme that prefixes object names with their creation date. Then it is easy to remove them with wildcards, for example:
gsutil -m rm gs://your-bucket/2016-12-01/*
Another approach would be to write a short parser for gsutil ls -L gs://your-bucket that filters object names by their creation date, then call gsutil -m rm -I with the resulting object names.
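A rough version of that parsing approach, based on the simpler gsutil ls -l output (size, creation time, URL), might look like this (a sketch only; the bucket, prefix, and cutoff date are placeholders):
# ISO-8601 creation times sort correctly as strings, so a plain string
# comparison against the cutoff date works; the trailing TOTAL summary line
# has no gs:// URL in the third field and is filtered out.
gsutil ls -l gs://your-bucket/some/path/** \
  | awk '$3 ~ /^gs:\/\// && $2 < "2016-12-01" {print $3}' \
  | gsutil -m rm -I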
If you just want to automatically delete objects older than a certain age, then there is a much easier way than using the CLI: you can configure an Object Lifecycle Management policy on your bucket.
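For instance, a minimal lifecycle configuration that deletes objects once they are 30 days old might look like this (a sketch only; the age and bucket name are placeholders):
# Write the lifecycle rule and apply it to the bucket.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "Delete"}, "condition": {"age": 30}}
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://your-bucket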

GCS - Global Consistency with delete + rename

My issue may be a result of my misunderstanding of global consistency in Google Cloud Storage, but since I had not experienced this issue until just recently (mid-November) and it now seems easily reproducible, I wanted some clarification. The issue started happening in a piece of Spark code running on Compute Engine using bdutil, but I can reproduce it from the command line with gsutil.
My code deletes a destination path and then immediately renames a source path to the destination path. With global consistency I would expect that, since the destination path no longer exists, the src would be renamed to the destination; instead, the src is being nested inside the destination as if the destination still exists, which is not consistent.
The Hadoop code to reproduce looks like:
fs.delete(new Path(dest), true)
fs.rename(new Path(src), new Path(dest))
From command line I can reproduce with:
gsutil -m rm -r gs://mybucket/dest
gsutil -m cp -r gs://mybucket/src gs://mybucket/dest
If the reason is that list operations are eventually consistent and the FileSystem implementation uses list operations to determine whether the destination still exists, then I understand; in that case, is there a recommended way to ensure the destination no longer exists before renaming?
Thanks,
Luke
Travis's answer is a couple of years old and is no longer true. The object list operation is strongly consistent now; see Google's post on the change.
Read-after-write (including delete) operations are strongly consistent, so for example, if you did:
gsutil -m rm -r gs://mybucket/dest
# Command output shows it removed gs://mybucket/dest/file1
gsutil cp gs://mybucket/dest/file1 my_local_dir/file1
That would always fail.
However, to determine if a "directory" exists, gsutil must perform an eventually-consistent listing operation to find out if any object in Google Cloud Storage's flat name space has a prefix with the name of that "directory". This can lead to the problem you described, and I expect that the hadoop code behaves similarly.
There isn't a strongly consistent workaround for this problem because there's no way to check for the existence of a prefix in a strongly-consistent way.
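Given that object listing is now strongly consistent (see the other answer above), one pragmatic approach is to verify that nothing lists under the destination prefix before issuing the rename (a sketch only; the bucket and paths are placeholders):
gsutil -m rm -r gs://mybucket/dest
# gsutil ls exits non-zero when no objects match the prefix, so the rename
# only runs once the destination prefix is empty.
if ! gsutil ls "gs://mybucket/dest/**" > /dev/null 2>&1; then
  gsutil -m mv gs://mybucket/src gs://mybucket/dest
else
  echo "destination prefix still lists objects; skipping rename" >&2
fi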

gsutil - is it possible to list only folders?

Is it possible to list only the folders in a bucket using the gsutil tool?
I can't see anything listed here.
For example, I'd like to list only the folders in a given bucket.
Folders don't actually exist. gsutil and the Storage Browser do some magic under the covers to give the impression that folders exist.
You could filter your gsutil results to only show results that end with a forward slash, but this may not show all the "folders". It will only show "folders" that were manually created (i.e., not those that exist only implicitly because an object name contains slashes):
gsutil ls gs://bucket/ | grep -e "/$"
Just to add here: if you directly drag a folder tree into the Google Cloud Storage web GUI, you don't really get an entry for each parent folder; in fact each file name is a fully qualified path, e.g. "/blah/foo/bar.txt", instead of a folder hierarchy blah > foo > bar.txt.
The trick here is to first use the GUI to create a folder called blah, then create another folder called foo inside it (using the button in the GUI), and finally drag the files into it.
When you now list the files, you will get separate entries for
blah/
foo/
bar.txt
rather than only one
blah/foo/bar.txt