Google Cloud Storage Python API: blob rename, where is copy_to - google-cloud-storage

I am trying to rename a blob (which can be quite large) after having uploaded them to a temporary location in the bucket.
Reading the documentation it says:
Warning: This method will first duplicate the data and then delete the old blob. This means that with very large objects renaming could be a very (temporarily) costly or a very slow operation. If you need more control over the copy and deletion, instead use google.cloud.storage.blob.Blob.copy_to and google.cloud.storage.blob.Blob.delete directly.
But I can find absolutely no reference to copy_to anywhere in the SDK (or elsewhere really).
Is there any way to rename a blob from A to B without the SDK copying the file. In my case overwriting B, but I can remove B first if it's easier.
The reason is checksum validation, I'll upload it under A first to make sure it's successfully uploaded (and doesn't trigger DataCorruption) and only then replace B (the live object)

GCS itself does not support renaming objects. Renaming with a copy+delete is done in the client as a helper, and there is no better way to rename an object at the moment.
As you say your goal is checksum validation, there is a better solution. Upload directly to your destination and use GCS's built in checksum verification. How you do this depends on the API:
JSON objects.insert: Set crc32c or md5Hash header.
XML PUT object: Set x-goog-hash header.
Python SDK Blob.upload_from_* methods: Set checksum="crc32c" or checksum="md5" method parameter.

Related

ADF decode and uncompress data on the fly

I have a pipeline in ADF v2 that calls a SOAP endpoint which returns a base64 encoded string, which is actually a zip file containing 2 files. I am only interested in file[1] (the 2nd one). I want to take this file, and write it to a storage account.
What's the best way to do this in ADF without resorting to external things like Functions call etc.

Saving CKAsset from Core Data

I have some data like pictures, stored in Core Data as binary data and marked as "Allows External Storage". I'd like to write this data to the CloudKit. Is it possible to get URLs for this data and pass it to CKAsset, or transform somehow this data to CKAsset without double writing this data to some temporary files? Thank you.
Accessing external binary data directly is not supported and there's no API for it. Unofficially it's not hard to figure out what directory the files are stored in, but it's not useful because
Filenames are UUIDs, and there's no documented way to link a managed object to a UUID, so you don't know which file to use.
The option is to allow external storage, so there's no guarantee that an external file exists. Some instances may not use external storage.
I'm not sure what CKAsset requires but you'll have to look up the binary data via the managed object first.

Different S3 behavior using different endpoints?

I'm currently writing code to use Amazon's S3 REST API and I notice different behavior where the only difference seems to be the Amazon endpoint URI that I use, e.g., https://s3.amazonaws.com vs. https://s3-us-west-2.amazonaws.com.
Examples of different behavior for the the GET Bucket (List Objects) call:
Using one endpoint, it includes the "folder" in the results, e.g.:
/path/subfolder/
/path/subfolder/file1.txt
/path/subfolder/file2.txt
and, using the other endpoint, it does not include the "folder" in the results:
/path/subfolder/file1.txt
/path/subfolder/file2.txt
Using one endpoint, it represents "folders" using a trailing / as shown above and, using the other endpoint, it uses a trailing _$folder$:
/path/subfolder_$folder$
/path/subfolder/file1.txt
/path/subfolder/file2.txt
Why the differences? How can I make it return results in a consistent manner regardless of endpoint?
Note that I get these same odd results even if I use Amazon's own command-line AWS S3 client, so it's not my code.
And the contents of the buckets should be irrelevant anyway.
Your assertion notwithstanding, your issue is exactly about the content of the buckets, and not something S3 is doing -- the S3 API has no concept of folders. None. The S3 console can display folders, but this is for convenience -- the folders are not really there -- or if there are folder-like entities, they're irrelevant and not needed.
In Amazon S3, buckets and objects are the primary resources, where objects are stored in buckets. Amazon S3 has a flat structure with no hierarchy like you would see in a typical file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using key name prefixes for objects.
http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
So why are you seeing this?
Either you've been using EMR/Hadoop, or some other code written by someone who took a bad example and ran with it... or is doing something differently than it should have been done for quite some time.
Amazon EMR is a web service that uses a managed Hadoop framework to process, distribute, and interact with data in AWS data stores, including Amazon S3. Because S3 uses a key-value pair storage system, the Hadoop file system implements directory support in S3 by creating empty files with the <directoryname>_$folder$ suffix.
https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/
This may have been something the S3 console did many years ago, and apparently (since you don't report seeing them in the console) it still supports displaying such objects as folders in the console... but the S3 console no longer creates them this way, if it ever did.
I've mirrored the bucket "folder" layout exactly
If you create a folder in the console, an empty object with the key "foldername/" is created. This in turn is used to display a folder that you can navigate into, and upload objects with keys beginning with that folder name as a prefix.
The Amazon S3 console treats all objects that have a forward slash "/" character as the last (trailing) character in the key name as a folder
http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
If you just create objects using the API, then "my/object.txt" appears in the console as "object.txt" inside folder "my" even though there is no "my/" object created... so if the objects are created with the API, you'd see neither style of "folder" in the object listing.
That is probably a bug in the API endpoint which includes the "folder" - S3 internally doesn't actually have a folder structure, but instead is just a set of keys associated with files, where keys (for convenience) can contain slash-separated paths which then show up as "folders" in the web interface. There is the option in the API to specify a prefix, which I believe can be any part of the key up to and including part of the filename.
EMR's s3 client is not the apache one, so I can't speak accurately about it.
In ASF hadoop releases (and HDP, CDH)
The older s3n:// client uses $folder$ as its folder delimiter.
The newer s3a:// client uses / as its folder marker, but will handle $folder$ if there. At least it used to; I can't see where in the code it does now.
The S3A clients strip out all folder markers when you list things; S3A uses them to simulate empty dirs and deletes all parent markers when you create child file/dir entries.
Whatever you have which processes GET should just ignore entries with "/" or $folder at the end.
As to why they are different, the local EMRFS is a different codepath, using dynamo for implementing consistency. At a guess, it doesn't need to mock empty dirs, as the DDB tables will host all directory entries.

How do I do bulk file storage with IBM Object Storage?

I'm using IBM Object Storage to store huge amounts of very small files,
say more than 1500 small files in one hour. (Total size of the 1500 files is about 5 MB)
I'm using the object store api to post the files, one file at a time.
The problem is that for storing 1500 small files it takes about 15 minutes in total. This is with setting up and closing the connection with the object store.
Is there a way to do a sort of bulk post, to send more than one file in one post?
Regards,
Look at the archive-auto-extract feature available within Openstack Swift (Bluemix Object Storage). I assume that you are familiar with obtaining the X-Auth-Token and Storage_URL from Bluemix object storage. If not, my post about large file manifests explains the process. From the doc, the constraints include:
You must use the tar utility to create the tar archive file.
You can upload regular files but you cannot upload other items (for example, empty directories or symbolic links).
You must UTF-8-encode the member names.
Basic steps would be:
Confirm that IBM Bluemix supports this feature by viewing info details for the service # https://dal.objectstorage.open.softlayer.com/info . You'll see a JSON section within the response similar to:
"bulk_upload": {
"max_failed_extractions": 1000,
"max_containers_per_extraction": 10000
}
Create a tar archive of your desired file set. tar gzip is most common.
Upload this tar archive to object storage with a special parameter that tells swift to auto-extract the contents into the container for you.PUT /v1/AUTH_myaccount/my_backups/?extract-archive=tar.gz
From the docs: To upload an archive file, make a PUT request. Add the extract-archive=format query parameter to indicate that you are uploading a tar archive file instead of normal content. Include within the request body the contents of the local file backup.tar.gz.
Something like:
AUTH_myaccount/my_backups/etc/config1.conf
AUTH_myaccount/my_backups/etc/cool.jpg
AUTH_myaccount/my_backups/home/john/bluemix.conf
...
Inspect the results. Any top-level directory in the archive should create a new container in your Swift object-storage account.
Voila! Bulk upload. Hope this helps.

Determine S3 file last modified timestamp

I have a Scala Play 2 app and using AWS S3 API to read from S3 files. I have a need to determine when the last modified timestamp is for a file, what's the best way to do that? Is it using getObjectMetadata or perhaps listObjects or ? If possible, I would like to determine the timestamps for multiple files in one call. Are there other open source libraries built on top of AWS S3 APIs?
A representation of S3 Object in AWS Java SDK is S3ObjectSummary, which has method getLastModified. It returns the modified timestamp.
Ideally just list all of the files using listObjects and than call getObjectSummaries on a returned object.