Setting "default" metadata for all+new objects in a GCS bucket? - google-cloud-storage

I run a static website (blog) on Google Cloud Storage.
I need to set a default metadata header for cache-control header for all existing and future objects.
However, the editing object metadata instructions show the gsutil setmeta -h "cache-control: ..." command, which neither applies to "future" objects in the bucket nor gives me a way to set a bucket-wide policy that is inherited by existing/future objects (since the command is executed per object).
This is surprising to me because there are features like gsutil defacl which let you set a policy for the bucket that is inherited by objects created in the future.
Q: Is there a metadata policy for the entire bucket that would apply to all existing and future objects?

There is no way to set default metadata on GCS objects. You have to set the metadata at write time, or you can update it later (e.g., using gsutil setmeta).
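For the existing objects, one option (a minimal sketch using the google-cloud-storage Python client; the bucket name and max-age value are placeholders, not from the question) is to loop over the bucket and patch each object's Cache-Control:
# Sketch: bulk-update Cache-Control on all existing objects in a bucket.
from google.cloud import storage

client = storage.Client()
for blob in client.list_blobs("your-bucket"):  # placeholder bucket name
    blob.cache_control = "public, max-age=86400"  # illustrative value
    blob.patch()  # sends a metadata update for this one object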

Extracted from this question
According to the documentation, if an object does not have a Cache-Control entry, the default value when serving that object is public, max-age=3600, provided the object is publicly readable.
If you still want to modify this metadata, you can do so using the JSON API inside a Cloud Function that is triggered every time a new object is created or an existing one is overwritten.
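As a rough sketch of that idea (not a drop-in solution; the function name, bucket, and max-age are illustrative), a background Cloud Function triggered on google.storage.object.finalize could patch each newly written object with the Python client:
from google.cloud import storage

client = storage.Client()

def set_cache_control(event, context):
    # The finalize event payload carries the bucket and object name.
    blob = client.bucket(event["bucket"]).blob(event["name"])
    blob.cache_control = "public, max-age=86400"  # illustrative value
    blob.patch()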

Related

How to store AWS S3 object data to a postgres DB

I'm working on a Golang application where users will be able to upload files: images & PDFs.
The files will be stored in an AWS S3 bucket, which I've implemented. However, I don't know how to go about retrieving identifiers for the stored items to save them in Postgres.
I was thinking of using an item.ID, but the AWS SDK for Go method does not provide an object ID:
for _, item := range response.Contents {
    log.Printf("Name : %s\n", *item.Key) // item.Key is a *string, so dereference it
    // log.Printf("ID : %s\n", *item.)   // there is no object ID field to print
}
What other options are available to retrieve stored object references from AWS S3?
A common approach is to event-source a Lambda with an S3 bucket event. This way, you can get more details about the object created within your bucket. Then you can have this Lambda function persist the object metadata into Postgres.
Another option would be simply to append the object key you are using in your SDK to the bucket name you're targeting; the final result would be a full URI that points to the stored object, something like this:
s3://{{BUCKET_NAME}}/{{OBJECT_KEY}}
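As an illustrative sketch of the Lambda approach (Python for brevity; the environment variable, table, and column names are made up), the handler reads the bucket and key from the S3 event and writes the reference to Postgres:
import os
import psycopg2  # assumes a Postgres driver is bundled with the Lambda

def handler(event, context):
    conn = psycopg2.connect(os.environ["DATABASE_URL"])  # hypothetical env var
    with conn, conn.cursor() as cur:
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            # "uploads" and "s3_uri" are hypothetical table/column names.
            cur.execute("INSERT INTO uploads (s3_uri) VALUES (%s)",
                        ("s3://{}/{}".format(bucket, key),))
    conn.close()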

Passing Cloud Storage custom metadata into Cloud Storage Notification

We have a Python script that copies/creates files in a GCS bucket.
# let me know if my setting of the custom-metadata is correct
blob.metadata = { "file_capture_time": some_timestamp_var }
blob.upload(...)
We want to configure the bucket such that it generates Cloud Storage notifications whenever an object is created. We also want the custom metadata above to be passed along with the Pub/Sub message to the topic and use that as an ordering key in the Subscription side. How can we do this?
The recommended way to receive a notification when an event occurs on the intended GCS bucket is to create a Cloud Pub/Sub topic for new objects and to configure your GCS bucket to publish messages to that topic when new objects are created.
First, make sure you've activated the Cloud Pub/Sub API, then use a gsutil command similar to the one below:
gsutil notification create -f json -e OBJECT_FINALIZE gs://example-bucket
The -e specifies that you're only interested in OBJECT_FINALIZE messages (objects being created).
The -f specifies that you want the payload of the messages to be the object metadata for the JSON API.
The -m specifies a key:value attribute that is appended to the set of attributes sent to Cloud Pub/Sub for all events associated with this notification config.
You may specify this parameter multiple times to set multiple attributes.
There is a full Firebase example which explains parsing the filename and other info from the event's context/data.
Here is a good example with a similar context.
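As a rough sketch of the receiving side (names are illustrative): with -f json the Pub/Sub message data is the object's JSON resource, so custom metadata set via blob.metadata appears under its metadata field and can be read in a Pub/Sub-triggered Cloud Function:
import base64
import json

def handle_notification(event, context):
    # The message data is the base64-encoded JSON object resource.
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    custom = payload.get("metadata", {})  # custom metadata set at upload time
    capture_time = custom.get("file_capture_time")  # key from the question's example
    print(payload["bucket"], payload["name"], capture_time)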

How to access the latest uploaded object in google cloud storage bucket using python in tensorflow model

I am working on a TensorFlow model where I want to make use of the latest uploaded object, in order to get output from that uploaded object. Is there a way to access the latest object uploaded to a Google Cloud Storage bucket using Python?
The below is what I use for grabbing the latest updated object.
Instantiate your client
from google.cloud import storage
# first establish your client
storage_client = storage.Client()
Define bucket_name and any additional paths via prefix
# get your blobs
bucket_name = 'your-glorious-bucket-name'
prefix = 'special-directory/within/your/bucket' # optional
Iterate the blobs returned by the client
Storing these as tuple records is quick and efficient.
blobs = [(blob, blob.updated) for blob in storage_client.list_blobs(
    bucket_name,
    prefix=prefix,
)]
Sort the list on the second tuple value
# sort and grab the latest value, based on the updated key
latest = sorted(blobs, key=lambda tup: tup[1])[-1][0]
string_data = latest.download_as_string()
Metadata key docs and Google Cloud Storage Python client docs.
One-liner
# assumes storage_client as above
# latest is a string formatted response of the blob's data
latest = sorted([(blob, blob.updated) for blob in storage_client.list_blobs(bucket_name, prefix=prefix)], key=lambda tup: tup[1])[-1][0].download_as_string()
There is no direct way to get the latest uploaded object from Google Cloud Storage. However, there is a workaround using the object's metadata.
Every object that is uploaded to Google Cloud Storage has metadata. For more information you can visit the Cloud Storage > Object Metadata documentation. One of the metadata fields is "Last updated". This value is a timestamp of the last time the object was updated, which can happen on only 3 occasions:
A) The object was uploaded for the first time.
B) The object was uploaded and replaced because it already existed.
C) The object's metadata changed.
If you are not updating the metadata of the object, then you can use this workaround:
Set a variable to a very old datetime (1900-01-01 00:00:00.000000). There is no chance an object will have this as its updated timestamp.
Set a variable to store the latest blob's name and set it to "NONE".
List all the blobs in the bucket (Google Cloud Storage documentation).
For each blob, load the updated metadata and convert it to a datetime object.
If the blob's updated timestamp is greater than the one you have stored, update the stored timestamp and save the current blob's name.
This process continues until you have gone through all the blobs, and only the latest one is left in the variables.
I have done a bit of coding myself, and this is my GitHub code example that worked for me. Take the logic and modify it based on your needs. I would also suggest testing it locally before using it in your code.
BUT, in case you update the blob's metadata manually, here is another workaround:
If you update any of the blob's metadata (see the Viewing and Editing Object Metadata documentation), then the "Last updated" timestamp of that blob also gets updated, so running the above method will NOT give you the last uploaded object but the last modified one, which is different. Instead, you can add a custom metadata entry to your object every time you upload it, with the upload timestamp as its value. No matter what happens to the other metadata later, the custom metadata will always keep the time the object was uploaded. Then use the same method as above, but instead of reading blob.updated, read blob.metadata and use that date with the same logic.
Additional notes:
To use custom metadata you need to use the prefix x-goog-meta-, as stated in the Editing object metadata section of the Viewing and Editing Object Metadata documentation.
So [CUSTOM_METADATA_KEY] should be something like x-goog-meta-uploaded, and [CUSTOM_METADATA_VALUE] should be [CURRENT_TIMESTAMP_DURING_UPLOAD].
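A minimal sketch of that second workaround with the Python client (which sets custom metadata through blob.metadata rather than raw x-goog-meta- headers; the bucket, object, and "uploaded" key names are just examples):
from datetime import datetime, timezone
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-bucket")  # placeholder bucket name

# At upload time, stamp the object with a custom "uploaded" timestamp.
blob = bucket.blob("some/object.txt")
blob.metadata = {"uploaded": datetime.now(timezone.utc).isoformat()}
blob.upload_from_filename("local-file.txt")

# Later, pick the blob with the greatest custom "uploaded" value.
stamped = [b for b in client.list_blobs("your-bucket")
           if b.metadata and "uploaded" in b.metadata]
latest = max(stamped, key=lambda b: b.metadata["uploaded"])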

Different S3 behavior using different endpoints?

I'm currently writing code to use Amazon's S3 REST API and I notice different behavior where the only difference seems to be the Amazon endpoint URI that I use, e.g., https://s3.amazonaws.com vs. https://s3-us-west-2.amazonaws.com.
Examples of different behavior for the GET Bucket (List Objects) call:
Using one endpoint, it includes the "folder" in the results, e.g.:
/path/subfolder/
/path/subfolder/file1.txt
/path/subfolder/file2.txt
and, using the other endpoint, it does not include the "folder" in the results:
/path/subfolder/file1.txt
/path/subfolder/file2.txt
Using one endpoint, it represents "folders" using a trailing / as shown above and, using the other endpoint, it uses a trailing _$folder$:
/path/subfolder_$folder$
/path/subfolder/file1.txt
/path/subfolder/file2.txt
Why the differences? How can I make it return results in a consistent manner regardless of endpoint?
Note that I get these same odd results even if I use Amazon's own command-line AWS S3 client, so it's not my code.
And the contents of the buckets should be irrelevant anyway.
Your assertion notwithstanding, your issue is exactly about the content of the buckets, and not something S3 is doing -- the S3 API has no concept of folders. None. The S3 console can display folders, but this is for convenience -- the folders are not really there -- or if there are folder-like entities, they're irrelevant and not needed.
In Amazon S3, buckets and objects are the primary resources, where objects are stored in buckets. Amazon S3 has a flat structure with no hierarchy like you would see in a typical file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using key name prefixes for objects.
http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
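For example (a sketch using boto3 with a made-up bucket name), listing with a Delimiter is how the console derives its "folders": keys sharing a prefix up to the delimiter come back as CommonPrefixes rather than as objects:
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="example-bucket", Prefix="path/", Delimiter="/")

for cp in resp.get("CommonPrefixes", []):
    print("folder-like prefix:", cp["Prefix"])  # e.g. path/subfolder/
for obj in resp.get("Contents", []):
    print("object:", obj["Key"])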
So why are you seeing this?
Either you've been using EMR/Hadoop, or some other code written by someone who took a bad example and ran with it... or code that has been doing things differently than it should for quite some time.
Amazon EMR is a web service that uses a managed Hadoop framework to process, distribute, and interact with data in AWS data stores, including Amazon S3. Because S3 uses a key-value pair storage system, the Hadoop file system implements directory support in S3 by creating empty files with the <directoryname>_$folder$ suffix.
https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/
This may have been something the S3 console did many years ago, and apparently (since you don't report seeing them in the console) it still supports displaying such objects as folders in the console... but the S3 console no longer creates them this way, if it ever did.
I've mirrored the bucket "folder" layout exactly
If you create a folder in the console, an empty object with the key "foldername/" is created. This in turn is used to display a folder that you can navigate into, and upload objects with keys beginning with that folder name as a prefix.
The Amazon S3 console treats all objects that have a forward slash "/" character as the last (trailing) character in the key name as a folder
http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
If you just create objects using the API, then "my/object.txt" appears in the console as "object.txt" inside folder "my" even though there is no "my/" object created... so if the objects are created with the API, you'd see neither style of "folder" in the object listing.
That is probably a bug in the API endpoint that includes the "folder" - S3 internally doesn't actually have a folder structure, but is instead just a set of keys associated with files, where keys (for convenience) can contain slash-separated paths which then show up as "folders" in the web interface. There is an option in the API to specify a prefix, which I believe can be any part of the key up to and including part of the filename.
EMR's s3 client is not the apache one, so I can't speak accurately about it.
In ASF hadoop releases (and HDP, CDH)
The older s3n:// client uses $folder$ as its folder delimiter.
The newer s3a:// client uses / as its folder marker, but will handle $folder$ if there. At least it used to; I can't see where in the code it does now.
The S3A clients strip out all folder markers when you list things; S3A uses them to simulate empty dirs and deletes all parent markers when you create child file/dir entries.
Whatever you have that processes the GET results should just ignore entries ending with "/" or "_$folder$".
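A minimal sketch of that filtering, assuming boto3 and a placeholder bucket (the two suffixes match the marker styles described above):
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="example-bucket", Prefix="path/")

# Drop the zero-byte folder markers; keep only real objects.
keys = [obj["Key"] for obj in resp.get("Contents", [])
        if not obj["Key"].endswith("/") and not obj["Key"].endswith("_$folder$")]
print(keys)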
As to why they are different, the local EMRFS is a different codepath, using dynamo for implementing consistency. At a guess, it doesn't need to mock empty dirs, as the DDB tables will host all directory entries.

How can I change key/name of Amazon S3 object using REST or SOAP?

How can I change key/name of Amazon S3 object using REST or SOAP?
The only way to rename an object is to copy the old object to a new object, and set the new name on the new copy.
The REST call you need is detailed here.
Syntax
PUT /destinationObject HTTP/1.1
Host: destinationBucket.s3.amazonaws.com
x-amz-copy-source: /source_bucket/sourceObject
x-amz-metadata-directive: metadata_directive
x-amz-copy-source-if-match: etag
x-amz-copy-source-if-none-match: etag
x-amz-copy-source-if-unmodified-since: time_stamp
x-amz-copy-source-if-modified-since: time_stamp
<request metadata>
Authorization: signatureValue
Date: date
This implementation of the PUT operation creates a copy of an object
that is already stored in Amazon S3. A PUT copy operation is the same
as performing a GET and then a PUT. Adding the request header,
x-amz-copy-source, makes the PUT operation copy the source object into
the destination bucket.
Keep in mind the existing ACLs, however:
When copying an object, you can preserve most of the metadata
(default) or specify new metadata. However, the ACL is not preserved
and is set to private for the user making the request.
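If you end up using an SDK instead of raw REST/SOAP, the same copy-then-delete rename maps onto two calls; a sketch with boto3 (bucket and key names are placeholders) that wraps the PUT copy operation above:
import boto3

s3 = boto3.client("s3")

# "Rename" = copy the object under the new key, then delete the old key.
s3.copy_object(
    Bucket="example-bucket",
    Key="new-name.txt",
    CopySource={"Bucket": "example-bucket", "Key": "old-name.txt"},
)
s3.delete_object(Bucket="example-bucket", Key="old-name.txt")
Keep the ACL caveat above in mind: the copy is created with a private ACL unless you set one explicitly.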