How to list all files in a Google Storage bucket in a short time?

I have a Google Storage bucket that contains more than 20k files. Is there any way to list all the filenames in the bucket in a short time?

It depends on what you mean by "short", but:
One thing you can do to speed up listing a bucket is to shard the listing operation. For example, if your bucket has objects that begin with English alphabetic characters, you could list each letter in parallel and combine the results. You could do this with gsutil in bash like this:
gsutil ls gs://your-bucket/a* > a.out &
gsutil ls gs://your-bucket/b* > b.out &
...
gsutil ls gs://your-bucket/z* > z.out &
wait
cat ?.out > listing.out
If your bucket's objects follow a different naming scheme, you'd have to adjust how you do the sharding.
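For instance, if your object names begin with digits rather than letters, the same trick can be sharded on the leading digit. A minimal sketch of that variant (your-bucket and the output file names are placeholders):
# shard the listing by leading digit, one background gsutil per shard
for p in 0 1 2 3 4 5 6 7 8 9; do
  gsutil ls "gs://your-bucket/${p}*" > "${p}.out" &
done
wait
cat ?.out > listing.out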

Related

GSUTIL CP using file size

I am trying to copy files from a directory on my Google Compute Engine instance to a Google Cloud Storage bucket. I have it working; however, there are ~35k files but only ~5k have any data in them.
Is there any way to copy only the files above a certain size?
I've not tried this but...
You should be able to do this using a resumable transfer and setting the threshold to 5 KB (it defaults to 8 MiB). See: https://cloud.google.com/storage/docs/gsutil/commands/cp#resumable-transfers
It may be advisable to set BOTO_CONFIG specifically for this copy: (a) to be intentional; (b) to remind yourself how it works. See: https://cloud.google.com/storage/docs/boto-gsutil
Resumable uploads have the added benefit, of course, of resuming if there are any failures.
Recommendation: try this on a small subset and confirm it works to your satisfaction.
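For example, the "set BOTO_CONFIG specifically for this copy" part could be sketched like this (untested; the config path, threshold value, source directory, and bucket name are all placeholders):
# one-off boto config that lowers the resumable threshold to 5 KB
cat > /tmp/boto_5k <<'EOF'
[GSUtil]
resumable_threshold = 5120
EOF
# use that config only for this invocation of gsutil
BOTO_CONFIG=/tmp/boto_5k gsutil -m cp -r /path/to/local/dir gs://your-bucket/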
While it's not possible to do this with gsutil alone, you can do it by filtering the names and using the -I flag on the cp command to process them. If you're using a Linux Compute Engine instance, you can do it with the du and awk commands:
du -b * | awk '{if ($1 > 1000) print $2 }' | gsutil -m cp -I gs://bucket2
The command gets the size in bytes of each file in the current directory on your Compute Engine instance with du -b and copies only the files whose size is larger than 1000 bytes to bucket2; you can change that value to fit your needs.
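A variant of the same idea that copes better with unusual filenames (an untested sketch; -size +1000c means "larger than 1000 bytes" and bucket2 is again a placeholder):
# select regular files in the current directory larger than 1000 bytes and pipe them to gsutil
find . -maxdepth 1 -type f -size +1000c | gsutil -m cp -I gs://bucket2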

gsutil / gcloud storage file listing sorted date descending?

Is there no way to get a file listing out of a Google Cloud Storage bucket sorted by date descending? This is very frustrating. I need to check the status of uploaded files, and the bucket has thousands of objects.
gsutil ls does not have the standard Linux -t option.
The Google Cloud console also lists the files but does not offer sorting options.
I use this as a workaround:
gsutil ls -l gs://[bucket-name]/ | sort -k 2
This outputs the full listing, including the date as the second field; sort -k 2 then sorts by that field.
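Since the question asks for date descending, adding sort's -r flag reverses the order:
gsutil ls -l gs://[bucket-name]/ | sort -k 2 -r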
The only ordering supported by GCS is lexicographic.
As a workaround, if it's possible for you to name your objects with a datestamp, that would give you a way to list objects by date.

Google Cloud Storage: What is the easiest way to update the timestamp of all files under all subfolders

I have date-wise folders in the form root-dir/yyyy/mm/dd,
under which many files are present.
I want to update the timestamp of all the files falling under a certain date range,
for example 2 weeks, i.e. 14 folders, so that these files can be picked up by my file-streaming data ingestion process.
What is the easiest way to achieve this?
Is there a way in the UI console, or is it through gsutil?
Please help.
GCS objects are immutable, so the only way to "update" the timestamp would be to copy each object on top of itself, e.g., using:
gsutil cp gs://your-bucket/object1 gs://your-bucket/object1
(and looping over all objects you want to do this to).
This is a fast (metadata-only) operation, which will create a new generation of each object, with a current timestamp.
Note that if you have versioning enabled on the bucket, doing this will create an extra version of each file you copy this way.
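To cover the question's "certain date range, for example 2 weeks, i.e. 14 folders", that loop could be sketched in bash as follows (untested; it assumes GNU date, your-bucket and root-dir are placeholders, and the files sit directly under each yyyy/mm/dd prefix):
# refresh the timestamp of every object in the last 14 dated folders
for i in $(seq 0 13); do
  d=$(date -d "-${i} days" +%Y/%m/%d)
  # copying each object onto itself is a metadata-only rewrite with a current timestamp
  gsutil -m cp "gs://your-bucket/root-dir/${d}/*" "gs://your-bucket/root-dir/${d}/"
done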
When you say "folders in the form of root-dir/yyyy/mm/dd", do you mean that you're copying those objects into your bucket with names like gs://my-bucket/root-dir/2016/12/25/christmas.jpg? If not, see Mike's answer; but if they are named with that pattern and you just want to rename them, you could use gsutil's mv command to rename every object with that prefix:
$ export BKT=my-bucket
$ gsutil ls gs://$BKT/**
gs://my-bucket/2015/12/31/newyears.jpg
gs://my-bucket/2016/01/15/file1.txt
gs://my-bucket/2016/01/15/some/file.txt
gs://my-bucket/2016/01/15/yet/another-file.txt
$ gsutil -m mv gs://$BKT/2016/01/15 gs://$BKT/2016/06/20
[...]
Operation completed over 3 objects/12.0 B.
# We can see that the prefixes changed from 2016/01/15 to 2016/06/20
$ gsutil ls gs://$BKT/**
gs://my-bucket/2015/12/31/newyears.jpg
gs://my-bucket/2016/06/20/file1.txt
gs://my-bucket/2016/06/20/some/file.txt
gs://my-bucket/2016/06/20/yet/another-file.txt

Google Cloud Storage: How to get list of new files in bucket/folder using gsutil

I have a bucket/folder into which a lot of files arrive every minute.
How can I read only the new files, based on the file timestamp?
e.g. list all files with timestamp > my_timestamp
You could use some bash-fu:
gsutil ls -l gs://<your-bucket-name> | sort -k2n | tail -n1 | awk 'END {$1=$2=""; sub(/^[ \t]+/, ""); print }'
breaking that down:
# grab detailed list of objects in bucket
gsutil ls -l gs://your-bucket-name
# sort by number on the date field
sort -k2n
# grab the last row returned
tail -n1
# delete first two cols (size and date) and ltrim to remove whitespace
awk 'END {$1=$2=""; sub(/^[ \t]+/, ""); print }'
Tested with Google Cloud SDK v186.0.0, gsutil v4.28
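If instead you want every object newer than a given timestamp, as the question asks, the same gsutil ls -l output can be filtered. A sketch, assuming the bucket name and timestamp are placeholders and relying on ISO 8601 timestamps comparing correctly as plain strings:
# print the URL of every object whose creation time is later than ts
gsutil ls -l gs://your-bucket-name | awk -v ts="2018-01-25T00:00:00Z" '$3 ~ /^gs:\/\// && $2 > ts {print $3}'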
This is not a feature that gsutil or the GCS API provides, as there is no way to list objects by timestamp.
Instead, you could subscribe to new objects using the GCS Cloud Pub/Sub feature.
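Creating the notification configuration is a single command (my-topic and your-bucket are placeholders); every new object then publishes a message to the topic, so you can react to uploads instead of repeatedly listing the bucket:
gsutil notification create -t my-topic -f json gs://your-bucket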
If you are interested in new files, in other words the files which are not yet present in your destination bucket, then alternatively you can use gsutil cp's -n option, as it copies only those files which are not present in the destination bucket.
From documentation
https://cloud.google.com/storage/docs/gsutil/commands/cp?hl=ru
No-clobber. When specified, existing files or objects at the destination will not be overwritten. Any items that are skipped by this option will be reported as being skipped. This option will perform an additional GET request to check if an item exists before attempting to upload the data. This will save retransmitting data, but the additional HTTP requests may make small object transfers slower and more expensive.
The downside of this approach is that it makes a check request for every file present in your source bucket.
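As a sketch (both bucket names are placeholders), a bucket-to-bucket copy that skips anything already present in the destination would look like:
gsutil -m cp -n -r "gs://source-bucket/*" "gs://destination-bucket/"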

Listing all public links for all objects in a bucket using gsutil

Is there a way to list all public links for all the objects stored into a Google Cloud Storage bucket (or a directory in a bucket) using Cloud SDK's gsutil or gcloud?
Something like:
$ gsutil ls --public-link gs://my-bucket/a-directory
Public links for publicly visible objects are predictable. They just match this pattern: https://storage.googleapis.com/BUCKET_NAME/OBJECT_NAME.
gsutil doesn't have a command to print URLs for objects in a bucket, but it can just list objects. You could pipe that to a program like sed to replace those listings with object names. For example:
gsutil ls gs://pub/** | sed 's|gs://|https://storage.googleapis.com/|'
The downside here is that this would produce links to all resources, not just those that are publicly visible. So you'd need to either know which resources are publicly visible, or you'd need to write a more elaborate filter based on gsutil ls -L.
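A rough version of that filter (a sketch only; it assumes per-object ACLs rather than uniform bucket-level access, requires permission to read those ACLs, and issues an extra request per object):
# print a public URL only for objects whose ACL grants access to allUsers
gsutil ls "gs://my-bucket/a-directory/**" | while read -r obj; do
  if gsutil ls -L "${obj}" | grep -q allUsers; then
    echo "${obj}" | sed 's|gs://|https://storage.googleapis.com/|'
  fi
done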
Even though the question asks about a flag passed to gsutil to achieve the desired result, and there isn't one at the moment, I'd like to post another programmatic approach using a Cloud Storage client library, which could be extended and/or adapted into a Python module.
It is as follows (the only third-party dependency is google-cloud-storage):
python3 -c """
from operator import attrgetter
from pathlib import Path
import sys
from google.cloud import storage
url = Path(sys.argv[1])  # the gs:// path containing the objects we want
bucket = storage.Client().bucket(url.parent.name)
# keep only real objects: skip 'folder' placeholder blobs whose names end with '/'
blobs = filter(lambda blob: not blob.name.endswith('/'), bucket.list_blobs(prefix=url.name))
urls = tuple(map(attrgetter('public_url'), blobs))
print('\n'.join(urls))
""" gs://my-bucket/a-directory