How do I get notified when an object is uploaded to my GCS bucket? - google-cloud-storage

I have an app that uploads photos regularly to a GCS bucket. When those photos are uploaded, I need to add thumbnails and do some analysis. How do I set up notifications for the bucket?

The way to do this is to create a Cloud Pub/Sub topic for new objects and to configure your GCS bucket to publish messages to that topic when new objects are created.
First, let's create a bucket PHOTOBUCKET:
$ gsutil mb gs://PHOTOBUCKET
Now, make sure you've activated the Cloud Pub/Sub API.
Next, let's create a Cloud Pub/Sub topic and wire it to our GCS bucket with gsutil:
$ gsutil notification create \
-t uploadedphotos -f json \
-e OBJECT_FINALIZE gs://PHOTOBUCKET
The -t specifies the Pub/Sub topic. If the topic doesn't already exist, gsutil will create it for you.
The -e specifies that you're only interested in OBJECT_FINALIZE messages (objects being created). Otherwise you'll get every kind of message in your topic.
The -f specifies that you want the payload of the messages to be the object metadata for the JSON API.
Note that this requires a recent version of gsutil, so be sure to update to the latest version of gcloud, or run gsutil update if you use a standalone gsutil.
Now we have notifications configured and pumping, but we'll want to see them. Let's create a Pub/Sub subscription:
$ gcloud beta pubsub subscriptions create processphotos --topic=uploadedphotos
Now we just need to read these messages. Here are the relevant bits of a Python example that does just that:
from google.cloud import pubsub


def poll_notifications(subscription_id):
    client = pubsub.Client()
    subscription = pubsub.subscription.Subscription(
        subscription_id, client=client)

    while True:
        pulled = subscription.pull(max_messages=100)
        for ack_id, message in pulled:
            print('Received message {0}:\n{1}'.format(
                message.message_id, summarize(message)))
            subscription.acknowledge([ack_id])


def summarize(message):
    data = message.data              # full notification payload (object metadata as JSON)
    attributes = message.attributes  # includes eventType, bucketId, objectId, etc.
    event_type = attributes['eventType']
    bucket_id = attributes['bucketId']
    object_id = attributes['objectId']
    return "A user uploaded %s, we should do something here." % object_id
Here is some more reading on how this system works:
https://cloud.google.com/storage/docs/reporting-changes
https://cloud.google.com/storage/docs/pubsub-notifications

GCP also offers an earlier mechanism for Cloud Storage change notifications called Object Change Notification. That feature POSTs directly to an endpoint of your choosing when an object in the bucket changes. Google recommends the Pub/Sub approach.
https://cloud.google.com/storage/docs/object-change-notification

Keep two things in mind while using this example:
1) The sample code has since been upgraded to Python 3 and the newer pubsub_v1 interface, so it may not run on Python 2.7.
2) When calling poll_notifications(project_id, subscription_name), pass your GCP project ID (e.g. bold-idad) and your subscription name (e.g. asrtopic).
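If you are on the newer library, here is a minimal sketch of the same polling loop using the google.cloud.pubsub_v1 subscriber; the project ID and subscription name below are just the placeholders used above, so substitute your own:

from google.cloud import pubsub_v1


def poll_notifications(project_id, subscription_name):
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(project_id, subscription_name)

    def callback(message):
        # The GCS notification details arrive as message attributes.
        print('Event {0} for object {1} in bucket {2}'.format(
            message.attributes.get('eventType'),
            message.attributes.get('objectId'),
            message.attributes.get('bucketId')))
        message.ack()

    # Streaming pull runs in the background; block the main thread on the future.
    future = subscriber.subscribe(subscription_path, callback=callback)
    future.result()


poll_notifications('bold-idad', 'processphotos')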

Related

Producing a CSV of Cloud Bucket files

What's the best way to create a CSV file listing images in a Google Cloud bucket to be imported into AutoML Vision?
If you want to listen for files that are saved to a bucket, you can use a Google Cloud Function to react to new files and create the CSV file in another bucket.
For example, you can use this Python code as a starting point; it logs the details of a newly uploaded file:
def hello_gcs_generic(data, context):
    """Background Cloud Function to be triggered by Cloud Storage.
    This generic function logs relevant data when a file is changed.

    Args:
        data (dict): The Cloud Functions event payload.
        context (google.cloud.functions.Context): Metadata of triggering event.
    Returns:
        None; the output is written to Stackdriver Logging
    """
    print('Event ID: {}'.format(context.event_id))
    print('Event type: {}'.format(context.event_type))
    print('Bucket: {}'.format(data['bucket']))
    print('File: {}'.format(data['name']))
    print('Metageneration: {}'.format(data['metageneration']))
    print('Created: {}'.format(data['timeCreated']))
    print('Updated: {}'.format(data['updated']))
Basically, the function listens for the storage event "google.storage.object.finalize" (which fires when a file is uploaded).
To deploy this function to the cloud, you can use this command:
gcloud functions deploy hello_gcs_generic --runtime python37 --trigger-resource [your bucket name] --trigger-event google.storage.object.finalize
or you can use the GCP Console (web UI) to deploy the function: select "Cloud Storage" in the trigger field, select "Finalize/create" as the event type, and specify your bucket.
You can even process the files directly with AutoML from within a Cloud Function, as mentioned in this example.
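To actually produce the CSV the question asks about, here is a minimal sketch of such a function, assuming the google-cloud-storage client is available and using hypothetical IMAGE_BUCKET and CSV_BUCKET names that you would replace; AutoML Vision accepts one gs:// URI per row, optionally followed by a label column:

from google.cloud import storage

IMAGE_BUCKET = 'my-image-bucket'  # hypothetical source bucket
CSV_BUCKET = 'my-csv-bucket'      # hypothetical destination bucket


def build_image_csv(data, context):
    """Rebuild the CSV listing every object in IMAGE_BUCKET when a file is finalized."""
    client = storage.Client()
    # One row per object, as a gs://bucket/object URI.
    rows = ['gs://{}/{}'.format(IMAGE_BUCKET, blob.name)
            for blob in client.list_blobs(IMAGE_BUCKET)]
    csv_blob = client.bucket(CSV_BUCKET).blob('images.csv')
    csv_blob.upload_from_string('\n'.join(rows), content_type='text/csv')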

Downloading public data directory from google cloud storage with command line utilities like wget

I would like to download publicly available data from Google Cloud Storage. However, because I need to be in a Python 3.x environment, it is not possible to use gsutil. I can download individual files with wget as
wget http://storage.googleapis.com/path-to-file/output_filename -O output_filename
However, commands like
wget -r --no-parent https://console.cloud.google.com/path_to_directory/output_directoryname -O output_directoryname
do not seem to work; they just download an index file for the directory. Based on some initial attempts, rsync and curl do not work either. Any idea how to download publicly available data on Google Cloud Storage as a directory?
The approach you mentioned above does not work because Google Cloud Storage doesn't have real "directories". As an example, "path/to/some/files/file.txt" is the entire name of that object. A similarly named object, "path/to/some/files/file2.txt", just happens to share the same naming prefix.
As for how you could fetch these files: The GCS APIs (both XML and JSON) allow you to do an object listing against the parent bucket, specifying a prefix; in this case, you'd want all objects starting with the prefix "path/to/some/files/". You could then make individual HTTP requests for each of the objects specified in the response body. That being said, you'd probably find this much easier to do via one of the GCS client libraries, such as the Python library.
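For public data specifically, here is a minimal sketch with the Python client library (google-cloud-storage), using an anonymous client; the bucket name and prefix are placeholders to replace with your own:

import os

from google.cloud import storage

# Anonymous access is enough for publicly readable buckets.
client = storage.Client.create_anonymous_client()
bucket = client.bucket('public-bucket-name')  # placeholder bucket name

for blob in bucket.list_blobs(prefix='path/to/some/files/'):
    destination = os.path.basename(blob.name)
    if destination:  # skip zero-byte "directory" placeholder objects
        blob.download_to_filename(destination)
        print('Downloaded', blob.name)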
Also, gsutil currently has a GitHub issue open to track adding support for Python 3.

Pyspark and BigQuery using two different project-ids in Google Dataproc

I want to run some pyspark jobs on Google Dataproc using different project IDs, without success so far. I'm a newbie with pyspark and Google Cloud, but I've followed this example and it runs well (if the BigQuery dataset is either public or belongs to my GCP project, which is ProjectA). The input parameters look like this:
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
projectA = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)

conf = {
    # Input Parameters
    'mapred.bq.project.id': projectA,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'projectA',
    'mapred.bq.input.dataset.id': 'my_dataset',
    'mapred.bq.input.table.id': 'my_table',
}

# Load data in from BigQuery.
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
But what I need is to run a job against a BQ dataset in ProjectB (I have credentials to query it), so I set the input parameters like this:
conf = {
    # Input Parameters
    'mapred.bq.project.id': projectA,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'projectB',
    'mapred.bq.input.dataset.id': 'the_datasetB',
    'mapred.bq.input.table.id': 'the_tableB',
}
When I try to load data in from BQ, my script keeps running indefinitely. How should I set it up properly?
FYI, after running the example I mentioned before, I can see that two folders (shard-0 and shard-1) are created in Google Storage and contain the corresponding BQ data, but with my job only shard-0 is created and it's empty.
I talked to my co-worker Dennis and here is his suggestion:
"Hmm, not sure, it should work. They might want to test with "bq" CLI inside the master node to manually try some "bq extract" job of the projectB table into their GCS bucket since that's all the connector is doing under the hood.
If I had to guess I'd suspect they only meant their personal username has the credentials to query projectB, but the default service account of projectA might not have the query permissions. Everything inside Dataproc VMs act on behalf of the compute service account assigned to the VMs, not the end-user.
They can
gcloud compute instances describe -m
and somewhere in there it lists the service-account email address."
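One quick way to test that theory from the master node is to try reading the projectB table with the cluster's own credentials. Here is a minimal sketch with the BigQuery Python client, reusing the placeholder project, dataset, and table names from the question:

from google.cloud import bigquery

# Runs as the VM's service account; projectA is the billing project.
client = bigquery.Client(project='projectA')

# A 403 here means the Dataproc service account lacks read access to projectB.
table = client.get_table('projectB.the_datasetB.the_tableB')
print('Can read {} ({} rows)'.format(table.table_id, table.num_rows))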

Change storage class of (existing) objects in Google Cloud Storage

I recently learnt of the new storage tiers and reduced prices announced on the Google Cloud Storage platform/service.
So I wanted to change the default storage class for one of my buckets from Durable Reduced Availability to Coldline, as that is what is appropriate for the files that I'm archiving in that bucket.
I got this note though:
Changing the default storage class only affects objects you add to this bucket going forward. It does not change the storage class of objects that are already in your bucket.
Any advice/tips on how I can change class of all existing objects in the bucket (using Google Cloud Console or gsutil)?
The easiest way to synchronously move the objects to a different storage class in the same bucket is to use rewrite. For example, to do this with gsutil, you can run:
gsutil -m rewrite -s coldline gs://your-bucket/**
Note: make sure gsutil is up to date (version 4.22 and above support the -s flag with rewrite).
Alternatively, you can use the new SetStorageClass action of the Lifecycle Management feature to asynchronously (usually takes about 1 day) modify storage classes of objects in place (e.g. by using a CreatedBefore condition set to some time after you change the bucket's default storage class).
To change the storage class from NEARLINE to COLDLINE, create a JSON file with the following content:
{
  "lifecycle": {
    "rule": [
      {
        "action": {
          "type": "SetStorageClass",
          "storageClass": "COLDLINE"
        },
        "condition": {
          "matchesStorageClass": [
            "NEARLINE"
          ]
        }
      }
    ]
  }
}
Name it lifecycle.json or something, then run this in your shell:
$ gsutil lifecycle set lifecycle.json gs://my-cool-bucket
The changes may take up to 24 hours to go through. As far as I know, this change will not cost anything extra.
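If you would rather set the same rule from Python instead of a JSON file, the storage client library has lifecycle helpers. A minimal sketch, assuming a reasonably recent google-cloud-storage version and a placeholder bucket name:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my-cool-bucket')  # placeholder bucket name

# Add a SetStorageClass -> COLDLINE rule for objects currently stored as NEARLINE.
bucket.add_lifecycle_set_storage_class_rule(
    'COLDLINE', matches_storage_class=['NEARLINE'])
bucket.patch()  # push the updated lifecycle configuration to the bucket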
I did this:
gsutil -m rewrite -r -s <storage-class> gs://my-bucket-name/
(-r for recursive, because I want all objects in my bucket to be affected).
You can now also use "Data Transfer" to change the storage class by moving your bucket's objects to a new bucket.
Access it from the left panel of the Storage section.
If you can't use gsutil, for example in a Google Cloud Functions environment (Cloud Functions server instances don't have gsutil installed; it works on your local machine because you have it installed and configured there), I suggest you evaluate the update_storage_class() blob method in Python. The method is called on an individual blob, in other words on a specific object inside your bucket. Here is an example:
from google.cloud import storage

storage_client = storage.Client()
blobs = storage_client.list_blobs(bucket_name)  # bucket_name is your bucket's name

# Valid storage class values accepted by the API.
all_classes = ['NEARLINE', 'COLDLINE', 'ARCHIVE', 'STANDARD',
               'MULTI_REGIONAL', 'REGIONAL']
new_class = all_classes[my_index]  # pick the class you want, e.g. 'COLDLINE'

for blob in blobs:
    print(blob.name)
    print(blob.storage_class)
    blob.update_storage_class(new_class)
References:
Blobs / Objects documentation: https://googleapis.dev/python/storage/latest/blobs.html#google.cloud.storage.blob.Blob.update_storage_class
Storage classes: https://cloud.google.com/storage/docs/storage-classes

How do you use storage service in Bluemix?

I'm trying to store some data on Bluemix. I searched many wiki pages but couldn't work out how to proceed. Can anyone tell me how to store images and files in Bluemix storage through code in any language (Java, Node.js)?
You have several options at your disposal for storing files in your app. None of them include doing it in the app container file system as the file space is ephemeral and will be recreated from the droplet each time a new instance of your app is created.
You can use services like MongoLab, Cloudant, Object Storage, and Redis to store all kinds of blob data.
Assuming that you're using Bluemix to write a Cloud Foundry application, another option is sshfs. At your app's startup time, you can use sshfs to create a connection to a remote server that is mounted as a local directory. For example, you could create a ./data directory that points to a remote SSH server and provides a persistent storage location for your app.
Here is a blog post explaining how this strategy works and a source repo showing it used to host a Wordpress blog in a Cloud Foundry app.
Note that as others have suggested, there are a number of services for storing object data. Go to the Bluemix Catalog [1] and select "Data Management" in the left hand margin. Each of those services should have sufficient documentation to get you started, including many sample applications and tutorials. Just click on a service tile, and then click on the "View Docs" button to find the relevant documentation.
[1] https://console.ng.bluemix.net/?ace_base=true/#/store/cloudOEPaneId=store
Check out https://www.ng.bluemix.net/docs/#services/ObjectStorageV2/index.html#gettingstarted. The storage service in Bluemix is OpenStack Swift running in Softlayer. Check out this page (http://docs.openstack.org/developer/swift/) for docs on Swift.
Here is a page that lists some clients for Swift.
https://wiki.openstack.org/wiki/SDKs
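If you go the Swift route, here is a minimal sketch with the python-swiftclient package. The auth URL, user, key, and os_options values are placeholders that would come from the Object Storage service credentials in Bluemix, and the exact auth parameters depend on the service version:

import swiftclient

# Placeholder credentials; copy the real values from the service credentials in Bluemix.
conn = swiftclient.Connection(
    authurl='https://identity.open.softlayer.com/v3',
    user='my-user',
    key='my-password',
    auth_version='3',
    os_options={'project_id': 'my-project-id', 'region_name': 'dallas'})

conn.put_container('my-container')
conn.put_object('my-container', 'hello.txt',
                contents=b"I'm a test file", content_type='text/plain')

# List what was just stored.
for obj in conn.get_container('my-container')[1]:
    print(obj['name'], obj['bytes'])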
As far as I can tell, there was a service named Object Storage that was created by IBM, but at the moment I can't see it in the Bluemix Catalog. I guess they pulled it and will publish a new service in the future.
Be aware that Object Storage in Bluemix is now S3 compatible, so you can for instance use boto or boto3 (for the Python folks); it is 100% API compatible.
See some examples here: https://ibm-public-cos.github.io/crs-docs/crs-python.html
This script lists all objects in all buckets recursively:
import boto3

endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
s3 = boto3.resource('s3', endpoint_url=endpoint)

for bucket in s3.buckets.all():
    print(bucket.name)
    for obj in bucket.objects.all():
        print(" - %s" % obj.key)
If you want to specify your credentials, this would be:

import boto3

endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
s3 = boto3.resource('s3', endpoint_url=endpoint, aws_access_key_id=YouRACCessKeyGeneratedOnYouBlueMixDAShBoard, aws_secret_access_key=TheSecretKeyThatCOmesWithYourAccessKey, use_ssl=True)

for bucket in s3.buckets.all():
    print(bucket.name)
    for obj in bucket.objects.all():
        print(" - %s" % obj.key)
If you want to create a "hello.txt" file in a new bucket:

import boto3

endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
s3 = boto3.resource('s3', endpoint_url=endpoint, aws_access_key_id=YouRACCessKeyGeneratedOnYouBlueMixDAShBoard, aws_secret_access_key=TheSecretKeyThatCOmesWithYourAccessKey, use_ssl=True)

s3.create_bucket(Bucket='my-new-bucket')
s3.Object('my-new-bucket', 'hello.txt').put(Body=b"I'm a test file")
If you want to upload a file to a new bucket:

import boto3
import time

endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
s3 = boto3.resource('s3', endpoint_url=endpoint, aws_access_key_id=YouRACCessKeyGeneratedOnYouBlueMixDAShBoard, aws_secret_access_key=TheSecretKeyThatCOmesWithYourAccessKey, use_ssl=True)

s3.create_bucket(Bucket='my-new-bucket')
timestampstr = str(time.time())
s3.Bucket('my-new-bucket').upload_file(<location of yourfile>, <your file name>, ExtraArgs={"ACL": "public-read", "Metadata": {"METADATA1": "resultat", "METADATA2": "1000", "gid": "blabala000", "timestamp": timestampstr}})
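And for completeness, downloading an object back from a bucket works the same way; a sketch using the same assumed endpoint and credential placeholders:

import boto3

endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
s3 = boto3.resource('s3', endpoint_url=endpoint, aws_access_key_id=YouRACCessKeyGeneratedOnYouBlueMixDAShBoard, aws_secret_access_key=TheSecretKeyThatCOmesWithYourAccessKey, use_ssl=True)

# Fetch hello.txt from the bucket created above and save it locally.
s3.Bucket('my-new-bucket').download_file('hello.txt', 'hello-local.txt')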