Google Cloud Storage: How to Delete a folder (recursively) in Python - google-cloud-storage

I am trying to delete a folder in GCS and all of its contents (including sub-directories) with its Python library. I also understand that GCS doesn't really have folders (only prefixes?), but I am wondering how I can do that.
I tested this code:
from google.cloud import storage

def delete_blob(bucket_name, blob_name):
    """Deletes a blob from the bucket."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.delete()

delete_blob('mybucket', 'top_folder/sub_folder/test.txt')
delete_blob('mybucket', 'top_folder/sub_folder/')
The first call to delete_blob worked, but not the second one. How can I delete a folder recursively?

To delete everything starting with a certain prefix (for example, a directory name), you can iterate over a list:
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
blobs = bucket.list_blobs(prefix='some/directory')
for blob in blobs:
    blob.delete()
Note that for very large buckets with millions or billions of objects, this may not be a very fast process. For that, you'll want to do something more complex, such as deleting in multiple threads or using lifecycle configuration rules to arrange for the objects to be deleted.
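For example, here is a minimal sketch of the threaded approach, assuming the google-cloud-storage client (the helper name is illustrative): listing stays on one thread while the deletes fan out to a small worker pool.

from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage

def delete_prefix_parallel(bucket_name, prefix, max_workers=8):
    """Delete every blob under `prefix`, issuing the deletes from several threads."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blobs = bucket.list_blobs(prefix=prefix)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each worker deletes one blob; consuming the map results surfaces any failures.
        list(pool.map(lambda blob: blob.delete(), blobs))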

Now it can be done by:
def delete_folder(cls, bucket_name, folder_name):
    """Delete every object under the given folder prefix."""
    bucket = cls.storage_client.get_bucket(bucket_name)
    blobs = list(bucket.list_blobs(prefix=folder_name))
    bucket.delete_blobs(blobs)
    print(f"Folder {folder_name} deleted.")

deleteFiles might be what you are looking for, or in Python, delete_blobs. Assuming they are the same, the Node docs do a better job of describing the behavior, namely:
This is not an atomic request. A delete attempt will be made for each file individually. Any one can fail, in which case only a portion of the files you intended to be deleted would have.
Operations are performed in parallel, up to 10 at once.
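In the Python client, a minimal sketch of tolerating those per-file failures uses the on_error callback (the bucket and prefix names here are placeholders):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('mybucket')  # placeholder bucket name
blobs = list(bucket.list_blobs(prefix='top_folder/sub_folder/'))

# on_error is invoked with each blob whose delete failed (for example, one that was
# already deleted), instead of the whole batch aborting with an exception.
bucket.delete_blobs(blobs, on_error=lambda blob: print(f"could not delete {blob.name}"))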

Related

Google Cloud Storage Python API: blob rename, where is copy_to

I am trying to rename a blob (which can be quite large) after having uploaded it to a temporary location in the bucket.
Reading the documentation it says:
Warning: This method will first duplicate the data and then delete the old blob. This means that with very large objects renaming could be a very (temporarily) costly or a very slow operation. If you need more control over the copy and deletion, instead use google.cloud.storage.blob.Blob.copy_to and google.cloud.storage.blob.Blob.delete directly.
But I can find absolutely no reference to copy_to anywhere in the SDK (or elsewhere really).
Is there any way to rename a blob from A to B without the SDK copying the file? In my case it would overwrite B, but I can remove B first if that's easier.
The reason is checksum validation: I'll upload it under A first to make sure it's successfully uploaded (and doesn't trigger DataCorruption), and only then replace B (the live object).
GCS itself does not support renaming objects. Renaming with a copy+delete is done in the client as a helper, and there is no better way to rename an object at the moment.
As you say your goal is checksum validation, there is a better solution. Upload directly to your destination and use GCS's built in checksum verification. How you do this depends on the API:
JSON objects.insert: Set crc32c or md5Hash header.
XML PUT object: Set x-goog-hash header.
Python SDK Blob.upload_from_* methods: Set checksum="crc32c" or checksum="md5" method parameter.
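For example, with the Python SDK, a minimal sketch (the bucket and file names are placeholders):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('mybucket')        # placeholder bucket name
blob = bucket.blob('live/object.dat')     # upload straight to the final key

# The client computes a CRC32C checksum locally and the service verifies it;
# a mismatch makes the upload fail instead of silently creating a corrupt object.
blob.upload_from_filename('local_copy.dat', checksum='crc32c')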

fetch all items in a GCS bucket python while it is still being written

I am using python GCS client.
I want to list all blobs that are part of a bucket at a given time, but the system keeps writing new data to this bucket all the time, and a lot faster than I can read it.
Is it possible that my
all_blobs = list(client.list_blobs(bucket))
will run forever?
Does it keep bringing new items?
Does it run on a snapshot and will eventually finish?
Thanks a lot
You can use the page token to iterate over the pages of the API:
blobs = bucket.list_blobs(max_results=1000)
for blob in blobs:
    print(blob.name)

print(blobs.next_page_token)

blobs = bucket.list_blobs(page_token=blobs.next_page_token, max_results=1000)
for blob in blobs:
    print(blob.name)
There are two interesting parts in this example:
You can set max_results to 1000 (the maximum value) to get the largest page possible.
The Next Page Token is the Base64-encoded name of the latest object/version returned, as described in the documentation.
The documentation also mentions that:
If a blob is created before the Next Page Token (that is, it sorts before it alphabetically), you won't list it by iterating over the next pages.
Conversely, if a blob is created after the Next Page Token, you will see it in later iterations.
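A sketch of what the full loop could look like if you want to keep paging until the listing is exhausted (reusing the bucket object from above):

page_token = None
while True:
    blobs = bucket.list_blobs(max_results=1000, page_token=page_token)
    for blob in blobs:
        print(blob.name)
    page_token = blobs.next_page_token  # only populated after the page is consumed
    if page_token is None:
        break  # no further pages existed when this page was served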

Can we preserve the storage class when copying an object to a new bucket?

We have two different buckets: short-term, that has lifecycle policies applied, and retain, where we put data that we intend to keep indefinitely. The way we get data into the retain bucket is usually by copying the original object from the short-term bucket using the JSON API.
The short-term bucket after 30 days moves data to nearline, after 60 days to coldline, and after 90 days deletes the data. The storage class for our retain bucket is standard. When we're copying data from short-term bucket to the retain bucket, we'd like to preserve the storage-class of the file that we're duplicating - is it possible for us to specify the storage class on the destination file using the JSON API?
If you want to preserve the storage class, it is recommended to perform a rewrite instead. The documentation says to:
Use the copy method to copy between objects in the same location and storage class
In the rewrite you should set the storage class explicitly. Copying would only be appropriate if you had separated the objects according to storage class, but as I understand it, this is not your case.
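A minimal sketch with the Python client, whose Blob.rewrite wraps the same JSON objects.rewrite call (the bucket names come from the question; the object name is a placeholder):

from google.cloud import storage

client = storage.Client()
src_bucket = client.bucket('short-term')
dst_bucket = client.bucket('retain')

src_blob = src_bucket.get_blob('path/to/object')   # placeholder object name
dst_blob = dst_bucket.blob(src_blob.name)
dst_blob.storage_class = src_blob.storage_class    # e.g. 'NEARLINE', preserved from the source

# Large objects may need several rewrite calls; keep passing the token back until it is None.
token, rewritten, total = dst_blob.rewrite(src_blob)
while token is not None:
    token, rewritten, total = dst_blob.rewrite(src_blob, token=token)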

Different S3 behavior using different endpoints?

I'm currently writing code to use Amazon's S3 REST API and I notice different behavior where the only difference seems to be the Amazon endpoint URI that I use, e.g., https://s3.amazonaws.com vs. https://s3-us-west-2.amazonaws.com.
Examples of different behavior for the GET Bucket (List Objects) call:
Using one endpoint, it includes the "folder" in the results, e.g.:
/path/subfolder/
/path/subfolder/file1.txt
/path/subfolder/file2.txt
and, using the other endpoint, it does not include the "folder" in the results:
/path/subfolder/file1.txt
/path/subfolder/file2.txt
Using one endpoint, it represents "folders" using a trailing / as shown above and, using the other endpoint, it uses a trailing _$folder$:
/path/subfolder_$folder$
/path/subfolder/file1.txt
/path/subfolder/file2.txt
Why the differences? How can I make it return results in a consistent manner regardless of endpoint?
Note that I get these same odd results even if I use Amazon's own command-line AWS S3 client, so it's not my code.
And the contents of the buckets should be irrelevant anyway.
Your assertion notwithstanding, your issue is exactly about the content of the buckets, and not something S3 is doing -- the S3 API has no concept of folders. None. The S3 console can display folders, but this is for convenience -- the folders are not really there -- or if there are folder-like entities, they're irrelevant and not needed.
In Amazon S3, buckets and objects are the primary resources, where objects are stored in buckets. Amazon S3 has a flat structure with no hierarchy like you would see in a typical file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using key name prefixes for objects.
http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
So why are you seeing this?
Either you've been using EMR/Hadoop, or some other code written by someone who took a bad example and ran with it, or code that has been doing things differently than it should for quite some time.
Amazon EMR is a web service that uses a managed Hadoop framework to process, distribute, and interact with data in AWS data stores, including Amazon S3. Because S3 uses a key-value pair storage system, the Hadoop file system implements directory support in S3 by creating empty files with the <directoryname>_$folder$ suffix.
https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/
This may have been something the S3 console did many years ago, and apparently (since you don't report seeing them in the console) it still supports displaying such objects as folders in the console... but the S3 console no longer creates them this way, if it ever did.
I've mirrored the bucket "folder" layout exactly
If you create a folder in the console, an empty object with the key "foldername/" is created. This in turn is used to display a folder that you can navigate into, and upload objects with keys beginning with that folder name as a prefix.
The Amazon S3 console treats all objects that have a forward slash "/" character as the last (trailing) character in the key name as a folder
http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
If you just create objects using the API, then "my/object.txt" appears in the console as "object.txt" inside folder "my" even though there is no "my/" object created... so if the objects are created with the API, you'd see neither style of "folder" in the object listing.
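A small boto3 sketch of that point (bucket and key names are placeholders): creating an object with a slash in its key does not create any separate folder object.

import boto3

s3 = boto3.client('s3')

# Creates only the object "my/object.txt"; no separate "my/" marker key is written.
s3.put_object(Bucket='my-bucket', Key='my/object.txt', Body=b'hello')

resp = s3.list_objects_v2(Bucket='my-bucket', Prefix='my/')
print([obj['Key'] for obj in resp.get('Contents', [])])   # -> ['my/object.txt']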
That is probably a bug in the API endpoint which includes the "folder" - S3 internally doesn't actually have a folder structure, but instead is just a set of keys associated with files, where keys (for convenience) can contain slash-separated paths which then show up as "folders" in the web interface. There is the option in the API to specify a prefix, which I believe can be any part of the key up to and including part of the filename.
EMR's s3 client is not the apache one, so I can't speak accurately about it.
In ASF hadoop releases (and HDP, CDH)
The older s3n:// client uses $folder$ as its folder delimiter.
The newer s3a:// client uses / as its folder marker, but will handle $folder$ if there. At least it used to; I can't see where in the code it does now.
The S3A clients strip out all folder markers when you list things; S3A uses them to simulate empty dirs and deletes all parent markers when you create child file/dir entries.
Whatever you have which processes GET should just ignore entries with "/" or "_$folder$" at the end.
As to why they are different, the local EMRFS is a different codepath, using dynamo for implementing consistency. At a guess, it doesn't need to mock empty dirs, as the DDB tables will host all directory entries.
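A hedged sketch of the filtering suggested above, assuming boto3 (bucket and prefix names are placeholders):

import boto3

s3 = boto3.client('s3')
resp = s3.list_objects_v2(Bucket='my-bucket', Prefix='path/subfolder/')

real_objects = [
    obj['Key']
    for obj in resp.get('Contents', [])
    # Skip both styles of folder marker: a trailing "/" and the EMR "_$folder$" suffix.
    if not obj['Key'].endswith('/') and not obj['Key'].endswith('_$folder$')
]
print(real_objects)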

Google storage api list storage bucket with "/" in the name

I am trying to list all objects in a bucket (Google Storage) with the Google Storage API. The bucket is nested like a folder, such as "my-bucket/sub-folder". I got the following error:
com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
If I use a bucket name without "/" it works fine. How can I list a bucket like a folder structure?
Google Cloud Storage buckets do not have slashes in their name. In the example above, the bucket is named "my-bucket" and the object is named something like "sub-folder/object.txt" or just "object.txt".
It's useful to remember that GCS does not have any real notion of folders. There are only buckets and objects in buckets. If you have a subdirectory named "dir" in bucket named "mybucket", and that subdirectory has 5 objects in it, what you really have is 5 objects named "dir/obj1", "dir/obj2", etc, all still within bucket "mybucket."
A number of tools (like gsutil and the GCS web-based storage browser) make it appear that there are folders, through use of markers and prefixes in the API -- even though as noted, there really are just objects that have slashes in the name.
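A sketch of that folder-style listing with the Python client, using a prefix and a delimiter (names are placeholders):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')

# delimiter='/' groups deeper keys into iterator.prefixes, much like the folder
# view in the web console; the first loop lists objects directly under the prefix.
iterator = bucket.list_blobs(prefix='sub-folder/', delimiter='/')
for blob in iterator:
    print('object:', blob.name)
for folder in iterator.prefixes:
    print('folder:', folder)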