Deleting all blobs inside a path prefix using the Google Cloud Storage API

I am using the Google Cloud Storage Python API. I need to delete a folder that might contain hundreds of files. Is there an efficient way to do this without making recursive or multiple delete calls?
One solution I have is to list all blob objects in the bucket with the given path prefix and delete them one by one.
The other solution is to use gsutil:
$ gsutil rm -R gs://bucket/path

Try something like this:
from google.cloud import storage

bucket = storage.Client().bucket(bucket_name)
# Object names do not start with '/', so the prefix is 'path/', not '/path'.
for blob in bucket.list_blobs(prefix='path/'):
    blob.delete()
If you want to delete the entire contents of a bucket instead of a folder within it, you can do it in a single method call (the library still sends one delete request per object):
bucket = storage.Client().bucket(bucket_name)
bucket.delete_blobs(list(bucket.list_blobs()))

from google.cloud import storage

def deleteStorageFolder(bucketName, folder):
    """
    Delete every object under a folder (name prefix) in a GCP Storage bucket.
    :param bucketName: Name of the bucket that contains the folder
    :param folder: Folder name (prefix) to be deleted
    :return: returns nothing
    """
    cloudStorageClient = storage.Client()
    bucket = cloudStorageClient.bucket(bucketName)
    try:
        # list() materializes the iterator, since delete_blobs expects a list of blobs.
        bucket.delete_blobs(blobs=list(bucket.list_blobs(prefix=folder)))
    except Exception as e:
        print(str(e))
In this case folder = "path/". The trailing slash keeps the prefix from also matching names such as "path2/file".
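As a minimal sketch of the list-then-delete approach discussed above, the helper below accepts any bucket object that exposes list_blobs(prefix=...), as google.cloud.storage.Bucket does, and deletes what it finds. The name delete_prefix and the slash normalization are illustrative assumptions, not part of the library.

```python
def delete_prefix(bucket, prefix):
    """Delete every object under `prefix` and return the deleted names.

    `bucket` is assumed to behave like google.cloud.storage.Bucket:
    list_blobs(prefix=...) yields objects that have .name and .delete().
    """
    # Normalize to "path/" so that "path" does not also match "path2/...".
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    deleted = []
    for blob in bucket.list_blobs(prefix=prefix):
        blob.delete()  # one delete request per object
        deleted.append(blob.name)
    return deleted
```

With real credentials this could be called as delete_prefix(storage.Client().bucket("my-bucket"), "path"), where "my-bucket" is a placeholder.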

Related

Google Storage Python ACL Update not Working

I have uploaded one image file to my Google Storage bucket.
#Block 1
#Storing the local file inside the bucket
blob_response = bucket.blob(cloud_path)
blob_response.upload_from_filename(local_path, content_type='image/png')
The file uploads fine, and I have verified that it is in the bucket.
After uploading the file, in the same method, I try to update the ACL so that the file is publicly accessible:
#Block 2
blob_file = storage.Blob(bucket=bucket20, name=path_in_bucket)
acl = blob_file.acl
acl.all().grant_read()
acl.save()
This does not make the file public.
The strange thing is that if I run the upload method above and then run the #Block 2 code separately in a Jupyter notebook, it works and the file becomes publicly available.
I have tried:
Checking that the blob exists in the bucket after the upload code.
Introducing a 5-second delay after the upload.
Any help is appreciated.
If you are making the file uploaded with upload_from_filename() public, you can reuse the blob from your upload. Also, reload the ACL before changing the permission. This was all done in one block in a Jupyter notebook on GCP AI Platform.
# Block 1
from google.cloud import storage

bucket_name = "your-bucket"
destination_blob_name = "test.txt"
source_file_name = "/home/jupyter/test.txt"
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(destination_blob_name)
blob.upload_from_filename(source_file_name)
print(blob)  # prints the bucket and the uploaded file
blob.acl.reload() # reload the ACL of the blob
acl = blob.acl
acl.all().grant_read()
acl.save()
for entry in acl:
print("{}: {}".format(entry["role"], entry["entity"]))
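The advice above (reuse the blob returned by the upload rather than constructing a second one) can be sketched as a small helper. It uses Blob.make_public(), which grants read access to all users and saves the ACL in one call; the function name upload_and_publish is hypothetical, and the sketch assumes the bucket uses fine-grained (per-object) ACLs.

```python
def upload_and_publish(bucket, blob_name, source_path):
    """Upload a local file and make the same blob publicly readable.

    `bucket` is assumed to behave like google.cloud.storage.Bucket.
    """
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(source_path)
    # Reuse the blob that was just uploaded instead of building a new one.
    blob.make_public()
    return blob.public_url
```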

How to change the metadata of all existing objects of a specific file type in Google Cloud Storage?

I have uploaded thousands of files to Google Storage, and I found that all of them are missing a content type, so my website cannot serve them correctly.
I wonder if I can set some kind of policy to change the content type of all the files at once. For example, I have a bunch of .html files in the bucket:
a/b/index.html
a/c/a.html
a/c/a/b.html
a/a.html
.
.
.
Is it possible to set the content type of all the .html files, in all these different locations, with one command?
You could do:
gsutil -m setmeta -h Content-Type:text/html gs://your-bucket/**.html
There is no single API call that edits the metadata of every object. However, gsutil has a setmeta command that you can use in a bash script that loops through all the objects in the bucket.
1.- Option (1) is to use the gsutil command "setmeta" in a bash script:
# List every object in the bucket and run the metadata edit command on each one.
# Note: this simple loop breaks on object names that contain whitespace.
for OBJECT in $(gsutil ls "gs://[BUCKET_NAME]/**")
do
    gsutil setmeta -h "[METADATA_KEY]:[METADATA_VALUE]" "$OBJECT"
done
2.- You could also write a C++ program to achieve the same thing:
#include "google/cloud/storage/client.h"
#include <stdexcept>
#include <string>

namespace gcs = google::cloud::storage;
using ::google::cloud::StatusOr;

[](gcs::Client client, std::string bucket_name, std::string key,
   std::string value) {
  // List all the objects and edit the metadata of each one inside the loop.
  for (auto&& object_metadata : client.ListObjects(bucket_name)) {
    if (!object_metadata) {
      throw std::runtime_error(object_metadata.status().message());
    }
    gcs::ObjectMetadata desired = *object_metadata;
    // emplace() adds a custom metadata key and leaves an existing key unchanged.
    desired.mutable_metadata().emplace(key, value);
    StatusOr<gcs::ObjectMetadata> updated =
        client.UpdateObject(bucket_name, object_metadata->name(), desired,
                            gcs::Generation(object_metadata->generation()));
    if (!updated) throw std::runtime_error(updated.status().message());
  }
}
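For readers using the Python client, the same loop-and-update idea might look like the sketch below. It sets the content_type property on each matching blob and calls patch() to send the change. The function name set_content_type and its parameters are hypothetical; `bucket` is assumed to be a google.cloud.storage.Bucket or something that behaves like one.

```python
def set_content_type(bucket, suffix, content_type, prefix=None):
    """Set `content_type` on every object whose name ends with `suffix`.

    Returns the names of the objects that were updated.
    """
    updated = []
    for blob in bucket.list_blobs(prefix=prefix):
        if blob.name.endswith(suffix):
            blob.content_type = content_type
            blob.patch()  # sends the changed properties to the service
            updated.append(blob.name)
    return updated
```

For the layout in the question, a call such as set_content_type(bucket, ".html", "text/html") would touch a/b/index.html, a/c/a.html, and the other .html objects, one request per object.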

Moving local data to a Google Cloud bucket using the Python API

I can copy local data to a Google Cloud Storage bucket using the following:
gsutil cp afile.txt gs://my-bucket
How do I do the same using the Python API library?
from google.cloud import storage
storage_client = storage.Client()
# Make an authenticated API request
buckets = list(storage_client.list_buckets())
print(buckets)
I can't find anything more than the above.
There is an API Client Library code sample here. My code typically looks like the following, which is a slight variant of the code they provide:
import os
from google.cloud import storage

client = storage.Client(project='<myprojectname>')
mybucket = client.bucket('mybucket')
mydatapath = r'C:\whatever\something'  # raw string, so backslashes are kept as-is
blob = mybucket.blob('afile.txt')
blob.upload_from_filename(os.path.join(mydatapath, 'afile.txt'))
In case it is of interest, another method is to run the "gsutil" command line from your original post using the subprocess module, e.g.:
import subprocess
subprocess.call("gsutil cp afile.txt gs://mybucket/", shell=True)
In my view, both methods have pros and cons depending on what you are trying to achieve. The latter lets you use gsutil's parallel transfers (the -m option) if you have many files to upload, whereas the former gives you finer control, such as specifying metadata for each file.
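When there is more than one file, the Python approach can be extended to walk a local directory. The following is a minimal sketch under stated assumptions: upload_directory and dest_prefix are hypothetical names, and `bucket` is assumed to behave like google.cloud.storage.Bucket (only blob() and upload_from_filename() are used).

```python
import os

def upload_directory(bucket, local_dir, dest_prefix=""):
    """Upload every file under `local_dir`, keeping relative paths as object names."""
    uploaded = []
    for root, _dirs, files in os.walk(local_dir):
        for filename in files:
            local_path = os.path.join(root, filename)
            rel = os.path.relpath(local_path, local_dir)
            # Object names use "/" regardless of the local OS separator.
            blob_name = dest_prefix + rel.replace(os.sep, "/")
            bucket.blob(blob_name).upload_from_filename(local_path)
            uploaded.append(blob_name)
    return sorted(uploaded)
```

Uploads here run one after another; this keeps the sketch simple at the cost of the parallelism that gsutil -m provides.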

Mount S3 to Databricks

I'm trying to understand how mount works. I have an S3 bucket named myB, and a folder in it called test. I did a mount using
var AwsBucketName = "myB"
val MountName = "myB"
My question is: does this create a link between the S3 bucket myB and Databricks, and can Databricks then access all the files, including the files under the test folder? (Or if I mount using var AwsBucketName = "myB/test", does it link Databricks only to the folder test and not to any other files outside that folder?)
If so, how do I list the files in the test folder, read a file, or count() a csv file in Scala? I ran display(dbutils.fs.ls("/mnt/myB")) and it only shows the test folder but not the files in it. I'm quite new here. Many thanks for your help!
From the Databricks documentation:
// Replace with your values
val AccessKey = "YOUR_ACCESS_KEY"
// Encode the Secret Key as that can contain "/"
val SecretKey = "YOUR_SECRET_KEY".replace("/", "%2F")
val AwsBucketName = "MY_BUCKET"
val MountName = "MOUNT_NAME"
dbutils.fs.mount(s"s3a://$AccessKey:$SecretKey@$AwsBucketName", s"/mnt/$MountName")
display(dbutils.fs.ls(s"/mnt/$MountName"))
If you are unable to see files in your mounted directory, it is possible that you have created a directory under /mnt that is not a link to the S3 bucket. If that is the case, try deleting the directory (dbutils.fs.rm) and remounting using the above code sample. Note that you will need your AWS credentials (AccessKey and SecretKey above). If you don't know them, you will need to ask your AWS account admin for them.
It only lists the folders and files directly under the bucket. To see the files inside a folder, list that folder's path, for example dbutils.fs.ls(s"/mnt/$MountName/test").
In S3
<bucket-name>/<Files & Folders>
In Databricks
/mnt/<MOUNT-NAME>/<Bucket-Data-List>
Just like below (Output for dbutils.fs.ls(s"/mnt/$MountName"))
dbfs:/mnt/<MOUNT-NAME>/Folder/
dbfs:/mnt/<MOUNT-NAME>/file1.csv
dbfs:/mnt/<MOUNT-NAME>/file2.csv

Create file in Google Cloud Storage with Python

This is the method that I used to save a new file in Google Cloud Storage:
cloud_storage_path = "/gs/[my_app_name].appspot.com/%s/%s" % (user_key.id(), img_title)
blobstore_key = blobstore.create_gs_key(cloud_storage_path)
cloud_storage_file = cloudstorage_api.open(
filename=cloud_storage_path, mode="w", content_type=img_type
)
cloud_storage_file.write(img_content)
cloud_storage_file.close()
But when I execute this method, the log prints:
Path should have format /bucket/filename but got /gs/[my_app_name].appspot.com/6473924464345088/background.jpg
PS: I replaced my app name with [my_app_name]; [my_app_name].appspot.com is my bucket name.
What should I do next in this case?
I cannot save the file to that path.
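Going by the error message, the open() call appears to expect a path of the form /bucket/filename, whereas the /gs/ prefix is the form that blobstore.create_gs_key takes. A minimal sketch of deriving one path from the other follows; the helper name cloudstorage_path is hypothetical, and the sketch only manipulates strings, so it makes no App Engine calls.

```python
def cloudstorage_path(blobstore_path):
    """Convert "/gs/<bucket>/<object>" to "/<bucket>/<object>"."""
    prefix = "/gs/"
    if not blobstore_path.startswith(prefix):
        raise ValueError("expected a path starting with /gs/")
    # Drop the leading "/gs" but keep the slash in front of the bucket name.
    return "/" + blobstore_path[len(prefix):]
```

Under that assumption, the /gs/... string would still be passed to create_gs_key, and the converted path would be passed as filename to open().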