Why is "gs" is not defined in a python script accessing google cloud storage? - google-cloud-storage

The purpose of the script is to import products into Google Vision API Product Search by using a CSV file containing information about the products. In this example the Cloud Storage address is the default one given in the documentation (https://cloud.google.com/vision/product-search/docs/create-product-set-search-products#using_a_dataset and https://cloud.google.com/vision/product-search/docs/create-product-set#using_bulk_import_to_create_a_product_set_with_products). When I try to set the csv_file_uri variable, VS Code states that "gs" is not defined, despite google.cloud.storage being imported.
from google.cloud import vision
from google.protobuf import field_mask_pb2 as field_mask
from google.cloud import storage

# """Import images of different products in the product set.
# Args:
#     project_id: Id of the project.
#     location: A compute region name.
#     gcs_uri: Google Cloud Storage URI.
#         Target files must be in Product Search CSV format.
# """

client = vision.ProductSearchClient()

# A resource that represents Google Cloud Platform location.
location_path = f"projects/[project id]/locations/europe-west1"

# Set the input configuration along with Google Cloud Storage URI
############ error is on the 2 lines below
gcs_source = vision.ImportProductSetsGcsSource(
    csv_file_uri=gs://cloud-samples-data/vision/product_search/product_catalog.csv)
########################
input_config = vision.ImportProductSetsInputConfig(
    gcs_source=gcs_source)

# Import the product sets from the input URI.
response = client.import_product_sets(
    parent=location_path, input_config=input_config)

print('Processing operation name: {}'.format(response.operation.name))

# synchronous check of operation status
result = response.result()
print('Processing done.')

for i, status in enumerate(result.statuses):
    print('Status of processing line {} of the csv: {}'.format(i, status))
    # Check the status of reference image
    # `0` is the code for OK in google.rpc.Code.
    if status.code == 0:
        reference_image = result.reference_images[i]
        print(reference_image)
    else:
        print('Status code not OK: {}'.format(status.message))
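For reference, the error on the two flagged lines is almost certainly because the gs:// address is written as a bare token, so Python (and VS Code's analyzer) treats gs as an undefined name. A minimal sketch of the quoted form, using the same sample CSV from the documentation:
gcs_source = vision.ImportProductSetsGcsSource(
    # The Cloud Storage URI must be an ordinary Python string literal.
    csv_file_uri="gs://cloud-samples-data/vision/product_search/product_catalog.csv")
input_config = vision.ImportProductSetsInputConfig(gcs_source=gcs_source)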

Related

GCP Dataflow runner error when deploying pipeline using beam-nuggets library - "Failed to read inputs in the data_plane."

I have been testing an Apache Beam pipeline within the Apache Beam notebooks provided by GCP, using a Kafka instance as an input and BigQuery as output. I have been able to successfully run the pipeline via the interactive runner, but when I deploy the same pipeline to the Dataflow runner it seems to never actually read from the Kafka topic that has been defined. Looking into the logs gives me the error:
Failed to read inputs in the data plane. Traceback (most recent call
last): File
/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py,
Implementation based on this post here
Any ideas? Code provided below:
from __future__ import print_function
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from beam_nuggets.io import kafkaio
kafka_config = {"topic": kafka_topic, "bootstrap_servers": ip_addr}
# p = beam.Pipeline(interactive_runner.InteractiveRunner(), options=options) # <- use for test
p = beam.Pipeline(DataflowRunner(), options=options) # <- use for dataflow implementation
notifications = p | "Reading messages from Kafka" >> kafkaio.KafkaConsume(kafka_config)
preprocess = notifications | "Pre-process for model" >> beam.ParDo(preprocess())
model = preprocess | "format & predict" >> beam.ParDo(model())
newWrite = model | beam.io.WriteToBigQuery(
    table_spec,
    schema=table_schema,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
Error message from logs:
Failed to read inputs in the data plane. Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 528, in _read_inputs for elements in elements_iterator: File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 416, in __next__ return self._next() File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 689, in _next raise self grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.UNAVAILABLE details = "DNS resolution failed" debug_error_string = "{"created":"#1595595923.509682344","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"#1595595923.509650517","description":"Resolver transient failure","file":"src/core/ext/filters/client_channel/resolving_lb_policy.cc","file_line":216,"referenced_errors":[{"created":"#1595595923.509649070","description":"DNS resolution failed","file":"src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc","file_line":375,"grpc_status":14,"referenced_errors":[{"created":"#1595595923.509645878","description":"unparseable host:port","file":"src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc","file_line":417,"target_address":""}]}]}]}" >
and also
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.UNAVAILABLE details = "DNS resolution failed" debug_error_string = "{"created":"#1594205651.745381243","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"#1594205651.745371624","description":"Resolver transient failure","file":"src/core/ext/filters/client_channel/resolving_lb_policy.cc","file_line":216,"referenced_errors":[{"created":"#1594205651.745370349","description":"DNS resolution failed","file":"src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc","file_line":375,"grpc_status":14,"referenced_errors":[{"created":"#1594205651.745367499","description":"unparseable host:port","file":"src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc","file_line":417,"target_address":""}]}]}]}" >
Pipeline settings:
Python sdk harness started with pipeline_options: {'streaming': True, 'project': 'example-project', 'job_name': 'beamapp-root-0727105627-001796', 'staging_location': 'example-staging-location', 'temp_location': 'example-staging-location', 'region': 'europe-west1', 'labels': ['goog-dataflow-notebook=2_23_0_dev'], 'subnetwork': 'example-subnetwork', 'experiments': ['use_fastavro', 'use_multiple_sdk_containers'], 'setup_file': '/root/notebook/workspace/setup.py', 'sdk_location': '/root/apache-beam-custom/packages/beam/sdks/python/dist/apache-beam-2.23.0.dev0.tar.gz', 'sdk_worker_parallelism': '1', 'environment_cache_millis': '0', 'job_port': '0', 'artifact_port': '0', 'expansion_port': '0'}
As far as I know, Failed to read inputs in the data plane ... status = StatusCode.UNAVAILABLE details = "DNS resolution failed" could be an issue in the Python Beam SDK, and the recommendation is to update to Python Beam SDK 2.23.0.
It turned out this wasn't feasible for my implementation, although with multi-language pipelines it appears to be more viable. I opened a ticket with Google support on this matter and got the following reply after some time investigating:
“… at this moment Python doesn't have any KafkaIO that works with
DataflowRunner. You can use Java as a workaround. In case you need
Python for something in particular (TensorFlow or similar), a
possibility is to send the message from Kafka to a PubSub topic (via
another pipeline that only reads from Kafka and publish to PS or an
external application).”
So feel free to take their advice, or you might be able to hack something together. I just revised my architecture to use Pub/Sub instead of Kafka.
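For illustration, a minimal sketch of what the Pub/Sub-based read could look like; the project, topic path, and streaming option below are placeholder assumptions, and the downstream transforms would stay as in the original pipeline:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder options; streaming must be enabled for unbounded Pub/Sub reads.
options = PipelineOptions(streaming=True)
p = beam.Pipeline(options=options)

notifications = (
    p
    # Swap kafkaio.KafkaConsume for the built-in Pub/Sub source.
    | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(topic="projects/example-project/topics/example-topic")
    # Pub/Sub messages arrive as bytes; decode before further processing.
    | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
)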

Reading file content from Cloud Storage

I have been trying to read the contents of a JSON file from Google Cloud Storage and I encounter an error. Here is my code:
from google.cloud import storage
import json

client = storage.Client()
bucket = client.get_bucket('bucket_name')
blob = bucket.get_blob('file.json')
u = blob.download_as_string()
print(u)
I see the following error:
TypeError: request() got an unexpected keyword argument 'data'
I'm pretty much lost. Help is appreciated.
You don't have to import the Client(); you just have to instantiate it like
client = storage.Client()
Use the code below to load the JSON file from a Google Cloud Storage bucket. I have tested it myself and it is working.
from google.cloud import storage
client = storage.Client()
BUCKET_NAME = '[BUCKET_NAME]'
FILE_PATH = '[path/to/file.json]'
bucket = client.get_bucket(BUCKET_NAME)
blob = bucket.get_blob(FILE_PATH)
print('The JSON file is: ')
print(blob.download_as_string())
Replace [BUCKET_NAME] with your bucket's name and [path/to/file.json] with the path where your JSON file is located inside the bucket.
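If you then need the contents as a Python object rather than raw bytes, a small follow-on step (assuming the blob really contains valid JSON):
import json

# download_as_string() returns bytes; json.loads accepts them directly.
data = json.loads(blob.download_as_string())
print(data)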

moving local data to google cloud bucket using python api

I can move local data to Google Cloud Storage buckets using the following:
gsutil cp afile.txt gs://my-bucket
How do I do the same using the Python API library? So far I have:
from google.cloud import storage
storage_client = storage.Client()
# Make an authenticated API request
buckets = list(storage_client.list_buckets())
print(buckets)
Can't find anything more than the above.
There is an API Client Library code sample here. My code typically looks like the below, which is a slight variant on the code they provide:
from google.cloud import storage
client = storage.Client(project='<myprojectname>')
mybucket = storage.bucket.Bucket(client=client, name='mybucket')
mydatapath = r'C:\whatever\something' + '\\'  # etc
blob = mybucket.blob('afile.txt')
blob.upload_from_filename(mydatapath + 'afile.txt')
In case it is of interest, another method is to run the "gsutil" command line as you typed in your original post via the subprocess module, e.g.:
import subprocess
subprocess.call("gsutil cp afile.txt gs://mybucket/", shell=True)
In my view, there are pros and cons to both methods depending on what you are trying to achieve: the latter allows multi-threading if you have many files to upload, whereas the former perhaps allows better control, specification of metadata for each file, etc.
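For example, if the multi-threaded route is what you want, gsutil's -m flag can be passed through the same kind of subprocess call (the local path and bucket name below are just placeholders):
import subprocess

# -m copies the files with parallel threads/processes, which helps with many files.
subprocess.call("gsutil -m cp C:\\mydata\\*.txt gs://mybucket/", shell=True)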

Deleting all blobs inside a path prefix using google cloud storage API

I am using the google cloud storage Python API. I came across a situation where I need to delete a folder that might have hundreds of files using the API. Is there an efficient way to do it without making recursive and multiple delete calls?
One solution that I have is to list all blob objects in the bucket with the given path prefix and delete them one by one.
The other solution is to use gsutil:
$ gsutil rm -R gs://bucket/path
Try something like this:
bucket = storage.Client().bucket(bucket_name)
for blob in bucket.list_blobs():
    if blob.name.startswith('path/'):
        blob.delete()
And if you want to delete the contents of a bucket instead of a folder within a bucket you can do it in a single method call as such:
bucket = storage.Client().bucket(bucket_name)
bucket.delete_blobs(bucket.list_blobs())
from google.cloud import storage

def deleteStorageFolder(bucketName, folder):
    """
    This function deletes from GCP Storage
    :param bucketName: The bucket name in which the file is to be placed
    :param folder: Folder name to be deleted
    :return: returns nothing
    """
    cloudStorageClient = storage.Client()
    bucket = cloudStorageClient.bucket(bucketName)
    try:
        bucket.delete_blobs(blobs=bucket.list_blobs(prefix=folder))
    except Exception as e:
        print(str(e))
In this case folder = "path"
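A quick usage sketch with a made-up bucket name; the trailing slash keeps the prefix from also matching names like "path2":
# Deletes every object whose name starts with "path/" in the given bucket.
deleteStorageFolder('my-example-bucket', 'path/')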

Create file in Google Cloud Storage with python

This is the method that I used to save a new file in Google Cloud Storage:
cloud_storage_path = "/gs/[my_app_name].appspot.com/%s/%s" % (user_key.id(), img_title)
blobstore_key = blobstore.create_gs_key(cloud_storage_path)
cloud_storage_file = cloudstorage_api.open(
    filename=cloud_storage_path, mode="w", content_type=img_type
)
cloud_storage_file.write(img_content)
cloud_storage_file.close()
But when I execute this method, the log prints:
Path should have format /bucket/filename but got /gs/[my_app_name].appspot.com/6473924464345088/background.jpg
PS: I changed [my_app_name], and [my_app_name].appspot.com is my bucket name.
So, what should I do next in this case?
I cannot save the file to that path.
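Not an authoritative answer, but going by the error text the App Engine cloudstorage client expects a "/bucket/filename" path, while the "/gs/" prefix belongs only to blobstore.create_gs_key. A sketch under that assumption, reusing the names from the question:
# cloudstorage_api.open() wants "/bucket/filename" with no "/gs" prefix.
bucket_path = "/[my_app_name].appspot.com/%s/%s" % (user_key.id(), img_title)

cloud_storage_file = cloudstorage_api.open(
    filename=bucket_path, mode="w", content_type=img_type
)
cloud_storage_file.write(img_content)
cloud_storage_file.close()

# blobstore.create_gs_key() is the call that takes the "/gs/..." form.
blobstore_key = blobstore.create_gs_key("/gs" + bucket_path)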