moving local data to google cloud bucket using python api - google-cloud-storage

I can copy local data to a Google Cloud Storage bucket using the following:
gsutil cp afile.txt gs://my-bucket
How can I do the same using the Python API client library? So far I have:
from google.cloud import storage
storage_client = storage.Client()
# Make an authenticated API request
buckets = list(storage_client.list_buckets())
print(buckets)
I can't find anything beyond the above.

There is an API Client Library code sample here. My code typically looks like the following, which is a slight variant on the code they provide:
from google.cloud import storage
client = storage.Client(project='<myprojectname>')
mybucket = storage.bucket.Bucket(client=client, name='mybucket')
mydatapath = 'C:\\whatever\\something' + '\\'  # etc. (escape backslashes, or use a raw string, for Windows paths)
blob = mybucket.blob('afile.txt')
blob.upload_from_filename(mydatapath + 'afile.txt')
In case it is of interest, another method is to run the "gsutil" command line (as typed in your original post) via the subprocess module, e.g.:
import subprocess
subprocess.call("gsutil cp afile.txt gs://mybucket/", shell=True)
In my view, there are pros and cons to both methods depending on what you are trying to achieve: the latter allows multi-threaded uploads (via gsutil -m) when you have many files, whereas the former gives finer control, such as specifying metadata for each file. A sketch of the multi-threaded variant follows below.
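For example, a minimal sketch of the multi-threaded variant, assuming a hypothetical local directory and bucket name:
import subprocess

# -m tells gsutil to copy the files in parallel; the wildcard, path and bucket are placeholders
subprocess.call("gsutil -m cp C:\\whatever\\something\\*.txt gs://mybucket/", shell=True)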

Related

Why is "gs" is not defined in a python script accessing google cloud storage?

The purpose of the script is to import products into Google Vision API Product Search using a CSV file containing information about the products. In this example the Cloud Storage address is the default one given in the documentation (https://cloud.google.com/vision/product-search/docs/create-product-set-search-products#using_a_dataset, https://cloud.google.com/vision/product-search/docs/create-product-set#using_bulk_import_to_create_a_product_set_with_products). When I try to set the csv_file_uri variable, VS Code states that "gs" is not defined, despite google.cloud.storage being imported.
from google.cloud import vision
from google.protobuf import field_mask_pb2 as field_mask
from google.cloud import storage

# """Import images of different products in the product set.
# Args:
#     project_id: Id of the project.
#     location: A compute region name.
#     gcs_uri: Google Cloud Storage URI.
#         Target files must be in Product Search CSV format.
# """
client = vision.ProductSearchClient()

# A resource that represents Google Cloud Platform location.
location_path = f"projects/[project id]/locations/europe-west1"

# Set the input configuration along with Google Cloud Storage URI
############ error is on the 2 lines below
gcs_source = vision.ImportProductSetsGcsSource(
    csv_file_uri=gs://cloud-samples-data/vision/product_search/product_catalog.csv)
########################
input_config = vision.ImportProductSetsInputConfig(
    gcs_source=gcs_source)

# Import the product sets from the input URI.
response = client.import_product_sets(
    parent=location_path, input_config=input_config)
print('Processing operation name: {}'.format(response.operation.name))

# Synchronous check of operation status.
result = response.result()
print('Processing done.')

for i, status in enumerate(result.statuses):
    print('Status of processing line {} of the csv: {}'.format(i, status))
    # Check the status of reference image.
    # `0` is the code for OK in google.rpc.Code.
    if status.code == 0:
        reference_image = result.reference_images[i]
        print(reference_image)
    else:
        print('Status code not OK: {}'.format(status.message))
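For what it's worth, the error on the two marked lines comes from Python itself: the gs://... URI is not quoted, so the interpreter treats gs as code (an undefined name) rather than as part of a string. A minimal sketch of the fix is simply to pass the URI as a string literal:
gcs_source = vision.ImportProductSetsGcsSource(
    csv_file_uri='gs://cloud-samples-data/vision/product_search/product_catalog.csv')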

How can I run pytesseract / tesseract in Foundry Code Repositories?

I am trying to use the function image_to_string from the pytesseract library in a repository to perform OCR on PDFs, but the call fails with an error. From the checks, I would assume the library was loaded correctly. Does anyone have an idea how to troubleshoot this?
It seems like Foundry is not respecting / running the environment activation script
https://github.com/conda-forge/tesseract-feedstock/blob/main/recipe/activate.sh
that sets the TESSDATA_PREFIX environment variable automatically. However, we can infer the value manually and provide it to the pytesseract API calls.
Define the following helper function:
def _get_tessdata_directory_path():
    import sys
    from pathlib import Path
    # The environment root is two levels up from the Python executable
    # (<envroot>/bin/python); tesseract's language data lives in <envroot>/share/tessdata.
    env_root = Path(sys.executable).parent.parent
    share_dir = env_root / 'share' / 'tessdata'
    assert share_dir.exists(), 'tessdata directory does not exist in <envroot>/share/tessdata'
    return str(share_dir)
and use it as shown in the following snippet:
tessdata_dir_config = f'--tessdata-dir "{_get_tessdata_directory_path()}"'
pytesseract.image_to_string(image, ..., config=tessdata_dir_config)
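Putting it together, a minimal sketch (the image file name is a placeholder; in practice each image would typically be a page rendered from the PDF):
from PIL import Image
import pytesseract

tessdata_dir_config = f'--tessdata-dir "{_get_tessdata_directory_path()}"'
# 'page_1.png' stands in for one rasterized PDF page
image = Image.open('page_1.png')
print(pytesseract.image_to_string(image, config=tessdata_dir_config))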

How to change the metadata of all specific file of exist objects in Google Cloud Storage?

I have uploaded thousands of files to Google Cloud Storage, and I found out that they are all missing a Content-Type, so my website cannot serve them correctly.
I wonder if I can set some kind of policy, for example changing the Content-Type of all the files at once. Say I have a bunch of .html files inside the bucket:
a/b/index.html
a/c/a.html
a/c/a/b.html
a/a.html
.
.
.
Is it possible to set the Content-Type of all the .html files, in their different locations, with one command?
You could do:
gsutil -m setmeta -h Content-Type:text/html gs://your-bucket/**.html
There's no single command to achieve exactly the behaviour you are looking for (one command to edit every object's metadata); however, there is a gsutil command to edit metadata that you could use in a bash script to loop through all the objects inside the bucket.
1.- Option (1) is to use the gsutil "setmeta" command in a bash script:
# kinda pseudo code here.
# get the list of all your object names and run the metadata-edit command on each one.
for OUTPUT in $(get_list_of_objects_names)
do
    gsutil setmeta -h "[METADATA_KEY]:[METADATA_VALUE]" gs://[BUCKET_NAME]/[OBJECT_NAME]
    # "gs://[BUCKET_NAME]/[OBJECT_NAME]" would be your object name.
done
2.- You could also write a small C++ program to achieve the same thing:
namespace gcs = google::cloud::storage;
using ::google::cloud::StatusOr;
[](gcs::Client client, std::string bucket_name,
   std::string key, std::string value) {
  // List all the objects in the bucket and, inside the loop, edit each object's metadata.
  for (auto&& object_metadata : client.ListObjects(bucket_name)) {
    if (!object_metadata) continue;  // skip objects whose metadata could not be listed
    std::string object_name = object_metadata->name();
    gcs::ObjectMetadata desired = *object_metadata;
    desired.mutable_metadata().emplace(key, value);
    StatusOr<gcs::ObjectMetadata> updated = client.UpdateObject(
        bucket_name, object_name, desired,
        gcs::Generation(object_metadata->generation()));
  }
}
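If you prefer to stay in Python, a minimal sketch of the same loop with the google-cloud-storage client (bucket name is a placeholder) could be:
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('your-bucket')
# Walk every object and patch the Content-Type of the .html files.
for blob in bucket.list_blobs():
    if blob.name.endswith('.html'):
        blob.content_type = 'text/html'
        blob.patch()  # sends the metadata update to GCS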

FFmpeg transcoding on Lambda results in unusable (static) audio

I'd like to move towards serverless for audio transcoding routines in AWS. I've been trying to set up a Lambda function to do just that: execute a static FFmpeg binary and re-upload the resulting audio file. The static binary I'm using is here.
The Lambda function I'm using in Python looks like this:
import boto3
s3client = boto3.client('s3')
s3resource = boto3.client('s3')
import json
import subprocess
from io import BytesIO
import os

os.system("cp -ra ./bin/ffmpeg /tmp/")
os.system("chmod -R 775 /tmp")

def lambda_handler(event, context):
    bucketname = event["Records"][0]["s3"]["bucket"]["name"]
    filename = event["Records"][0]["s3"]["object"]["key"]
    audioData = grabFromS3(bucketname, filename)
    with open('/tmp/' + filename, 'wb') as f:
        f.write(audioData.read())
    os.chdir('/tmp/')
    try:
        process = subprocess.check_output(
            ['./ffmpeg -i /tmp/joe_and_bill.wav /tmp/joe_and_bill.aac'],
            shell=True, stderr=subprocess.STDOUT)
        pushToS3(bucketname, filename)
        return process.decode('utf-8')
    except subprocess.CalledProcessError as e:
        return e.output.decode('utf-8'), os.listdir()

def grabFromS3(bucket, file):
    obj = s3client.get_object(Bucket=bucket, Key=file)
    data = BytesIO(obj['Body'].read())
    return data

def pushToS3(bucket, file):
    s3client.upload_file('/tmp/' + file[:-4] + '.aac', bucket, file[:-4] + '.aac')
    return
You can listen to the output of this here. WARNING: Turn your volume down or your ears will bleed.
The original file can be heard here.
Does anyone have any idea what might be causing the encoding errors? It doesn't seem to be an issue with the file upload, since the MD5 on the Lambda filesystem matches the MD5 of the uploaded file.
I've also tried building the static binary on an Amazon Linux instance in EC2, then zipping and porting it into the Lambda project, but the same issue persists.
I'm stumped! :(
Alright this is a fun one.
So it turns out the Python subprocess inherits stdin from some Lambda processes running in the background. I was watching this AWS re:Invent keynote, where the speaker described similar stdin-related issues they had run into.
I added stdin=subprocess.DEVNULL to the subprocess call and the audio is now fixed (see the sketch below).
Very interesting bug if you ask me.
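For reference, a minimal sketch of the corrected call from the handler above, with only the stdin argument added:
process = subprocess.check_output(
    ['./ffmpeg -i /tmp/joe_and_bill.wav /tmp/joe_and_bill.aac'],
    shell=True,
    stderr=subprocess.STDOUT,
    stdin=subprocess.DEVNULL)  # stop ffmpeg from inheriting Lambda's stdin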

Deleting all blobs inside a path prefix using google cloud storage API

I am using the Google Cloud Storage Python API. I came across a situation where I need to delete a folder that might have hundreds of files, using the API. Is there an efficient way to do it without making recursive, one-at-a-time delete calls?
One solution I have is to list all blob objects in the bucket with the given path prefix and delete them one by one.
The other solution is to use gsutil:
$ gsutil rm -R gs://bucket/path
Try something like this:
bucket = storage.Client().bucket(bucket_name)
blobs = bucket.list_blobs()
for blob in blobs:
    if blob.name.startswith('/path'):
        blob.delete()
And if you want to delete the contents of an entire bucket rather than a folder within it, you can do it in a single method call:
bucket = storage.Client().bucket(bucket_name)
bucket.delete_blobs(bucket.list_blobs())
from google.cloud import storage

def deleteStorageFolder(bucketName, folder):
    """
    This function deletes a folder from GCP Storage.
    :param bucketName: The name of the bucket containing the folder
    :param folder: Folder name to be deleted
    :return: returns nothing
    """
    cloudStorageClient = storage.Client()
    bucket = cloudStorageClient.bucket(bucketName)
    try:
        bucket.delete_blobs(blobs=bucket.list_blobs(prefix=folder))
    except Exception as e:
        print(str(e))
In this case folder = "path"
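A hypothetical call (bucket name is a placeholder) would then look like:
deleteStorageFolder('mybucket', 'path/')
Adding a trailing slash to the prefix keeps it from also matching sibling objects whose names merely start with "path".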