How do you perform a real 'move' operation of a GCS storage object, where real 'move' is one that maintains the Last Modified date? - google-cloud-storage

When I move a Google Cloud Storage Object from one bucket to another bucket, or from one "folder" to another "folder" within the same bucket (i.e., a name identity change within the bucket), the Last Modified date is always changed to the date/time of the move.
I would like to move a storage object and have Google Cloud maintain the Last Modified time. If the storage object's contents have not changed, I do not want a move to change the metadata's Last Modified date/time.
I have tried the following tests, but none maintain Modified time:
gsutil mv .\File.txt gs://bucket-name1/
gsutil mv gs://bucket-name1/File.txt gs://bucket-name1/SameBucketNameChange/
gsutil mv gs://bucket-name1/File.txt gs://bucket-name2
Using the GCS portal/console to manually select File.txt, choose move, select a destination bucket that is different from the first bucket.
In all cases, both the Last Modified and Created times change. I would expect at least the Last Modified time to remain unchanged just as it does in both Windows/Linux when there is a move operation.
Especially with cloud storage objects, I would expect date/time integrity to be a real value add: at least one date/time field should be tied to storage object content changes, so that it does not change unless the content does (usually this would be the Last Modified date/time).
The best option I have found so far is a Transfer Job with metadata preservation specified for the Created time, but GCS still changes the destination object's Created/Modified times; it merely carries over (copies) the original object's Created time into a new "Custom time" field, which seems somewhat odd.

There is no Last Modified date in Cloud Storage. Objects cannot be modified, so there cannot be a date on which you modified the object. You can only change an object by creating a new object and copying the data. Even changing the name requires a copy operation.
Cloud Storage does not support Move. That is emulated with Copy and Delete.
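If you still want the original timestamps to travel with the object, one workaround is to record them yourself during the copy-and-delete. A minimal sketch using the google-cloud-storage Python client, reusing the bucket and object names from the question (the metadata key names are just illustrative):
from google.cloud import storage

# "Move" a GCS object while recording its original timestamps, since the
# real timeCreated/updated fields cannot be preserved across a copy.
client = storage.Client()
src_bucket = client.bucket("bucket-name1")
dst_bucket = client.bucket("bucket-name2")

src_blob = src_bucket.get_blob("File.txt")  # loads current metadata

# Copying is the only way GCS can "move" data between names or buckets.
new_blob = src_bucket.copy_blob(src_blob, dst_bucket, "File.txt")

# The new object gets fresh Created/Updated times, so stash the originals
# where they survive: the Custom-Time field and custom metadata.
new_blob.custom_time = src_blob.updated
new_blob.metadata = {
    "original-time-created": src_blob.time_created.isoformat(),
    "original-updated": src_blob.updated.isoformat(),
}
new_blob.patch()

# Finish the emulated move by deleting the source object.
src_blob.delete()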

Related

Logic App Blob Trigger for a group of blobs

I'm creating a Logic App that has to process all blobs that are in a certain container. I would like to periodically check whether there are any new blobs and, if yes, start a run. I tried using the "When a blob is added or modified" trigger. However, if at the time of checking there are several new blobs, several new runs are initiated. Is there a way to only initiate one run if one or more blobs are added/modified?
I experimented with the "Number of blobs to return from the trigger" and also with the split-on setting, but I haven't found a way yet.
If you want the trigger to handle multiple blob files, then yes, you have to use "When a blob is added or modified". From the connector description you can see:
This operation triggers a flow when one or more blobs are added or modified in a container.
You must also set maxFileCount (the "Number of blobs to return from the trigger"). As you already found, the result is split into separate runs; that is because splitOn is ON by default. If you want the result delivered as a whole, you need to set splitOn to OFF.
The result should then be what you want.

Azure Data factory, How to incrementally copy blob data to sql

I have an Azure blob container where some JSON files with data get put every 6 hours, and I want to use Azure Data Factory to copy them to an Azure SQL DB. The file pattern for the files is like this: "customer_year_month_day_hour_min_sec.json.data.json"
The blob container also has other JSON data files, so I have to filter for the files I want in the dataset.
First question: how can I set the file path on the blob dataset to only look for the JSON files that I want? I tried the wildcard *.data.json, but that doesn't work. The only filename wildcard I have gotten to work is *.json.
Second question: how can I copy data to Azure SQL only from the new files (with the specific file pattern) that land in the blob storage? I have no control over the process that puts the data in the blob container and cannot move the files to another location, which makes it harder.
Please help.
You could use ADF event trigger to achieve this.
Define your event trigger as 'blob created' and specify the blobPathBeginsWith and blobPathEndsWith property based on your filename pattern.
For the first question: when an event trigger fires for a specific blob, the event captures the folder path and file name of the blob in the properties @triggerBody().folderPath and @triggerBody().fileName. You need to map those properties to pipeline parameters and pass an @pipeline().parameters.parameterName expression to the fileName of your copy activity.
This also answers the second question: each time the trigger fires, you'll get the folder path and file name of the newly created blob in @triggerBody().folderPath and @triggerBody().fileName.
Thanks.
I understand your situation. Seems they've used a new platform to recreate a decades old problem. :)
The pattern I would set up first looks something like this:
Create a Storage Account Trigger that will fire on every new file in the source container.
In the triggered Pipeline, examine the blob name to see if it fits your parameters. If not, just end, taking no action. If so, binary copy the blob to an account/container your app owns, leaving the original in place (see the sketch below).
Create another Trigger on your container that runs the import Pipeline.
Run your import process.
A couple of caveats your management has to understand: you can be very, very reliable, but you cannot guarantee compliance because there is no transaction/contract between you and the source container. Also, there may be a sequence gap, since a small file can usually finish processing while a larger file is still being processed.
If for any reason you do miss a file, all you need to do is copy it to your container where your process will pick it up. You can load all previous blobs in the same way.
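A minimal sketch of the filter-and-copy step, assuming the azure-storage-blob Python SDK; the connection strings, container names, and the filename pattern here are placeholders:
import fnmatch
from azure.storage.blob import BlobServiceClient

SOURCE_CONN = "<source-connection-string>"
DEST_CONN = "<destination-connection-string>"

def copy_if_matching(blob_name: str) -> bool:
    """Copy the new blob into our own container only if its name matches."""
    if not fnmatch.fnmatch(blob_name, "customer_*.data.json"):
        return False  # not one of ours: take no action

    source = BlobServiceClient.from_connection_string(SOURCE_CONN)
    dest = BlobServiceClient.from_connection_string(DEST_CONN)

    src_blob = source.get_blob_client("source-container", blob_name)
    dst_blob = dest.get_blob_client("staging-container", blob_name)

    # Server-side copy; the original blob stays in the source container.
    # For a private source account, append a SAS token to src_blob.url.
    dst_blob.start_copy_from_url(src_blob.url)
    return True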

Deleting Folders From Azure Storage Containers or File Shares That are Older than X Days

I am using Azure Storage Accounts and trying to use PowerShell to delete folders that exist in a container (I know a container is just a two-layer hierarchy of blobs and that folders do not actually exist per se).
Apart from not being able to check a folder's date/time properties, the only property I could find on the blobs themselves is "Last Modified", which is generally OK for our purpose, although having a creation property would be better.
As I understand it, the only solution for this is to create a table listing each file and its creation date and time? That seems like a lot of work for this.
I can enumerate a file from that folder, since they are all copied together, and then delete all blobs sharing the root "folder", but I would prefer to know the actual last modified time of the folder itself rather than of the files in it. Is there any way to achieve this? I am not LOCKED on using Azure storage containers; file shares are also possible, but when I tried that, enumerating the folders was possible yet the modified date and time property is simply not populated for some reason, and that is the only property there aside from "ETag".
Thanks in advance.
As far as I know, allowing users to define expiration policies on blobs natively in storage is still only planned; you can find it in the Azure Storage feedback forum.
If you'd like to delete "expired" folders/files using a PowerShell script, you can try including path information with the date in blob names (such as 2017/10/test.txt); you can then list and traverse the blobs, compare the date part of each blob name with the current date, and delete the blob if it is older than x days.
Alternatively, if you do not want to include date information in blob names, you can store the creation datetime in blob properties or metadata; you can then retrieve it and compare it with the current datetime to decide whether to delete the blob.
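For comparison, here is a minimal sketch of the same age-based cleanup in Python with the azure-storage-blob SDK, treating the newest Last Modified time among a "folder's" blobs as that folder's timestamp; the connection string, container name, and retention period are placeholders:
from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlobServiceClient

RETENTION_DAYS = 30
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("my-container")
cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)

# Use the newest Last Modified time of the blobs under each top-level
# "folder" prefix as that folder's effective timestamp.
newest = {}
for blob in container.list_blobs():
    folder = blob.name.split("/", 1)[0]
    if folder not in newest or blob.last_modified > newest[folder]:
        newest[folder] = blob.last_modified

# Delete every blob belonging to a folder that is older than the cutoff.
for blob in container.list_blobs():
    if newest[blob.name.split("/", 1)[0]] < cutoff:
        container.delete_blob(blob.name)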

Is the age of an object in Google Cloud Storage affected by calls to set meta?

I'm trying to use Google Cloud Storage's lifecycle management features on a bucket, but I want to circumvent it for certain files (basically, auto-delete all files after 1 day, except for specific files that I want to keep). If I call the set metadata API endpoint, will that update the age of the object and prevent the delete from occurring?
Set metadata changes the last updated time, not the creation time. TTL is keyed off of creation time, so that will not prevent TTL cleanup.
However, you could do a copy operation, and just set the destination to be the same as the source. That would update the creation time, and would be a fast operation as it can copy in the cloud.
That being said, it would probably be safer to just use a different bucket for these files. If the job that keeps touching the files goes down, they may get deleted.
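A minimal sketch of that copy-in-place approach with the google-cloud-storage Python client (the bucket and object names are placeholders):
from google.cloud import storage

# Copying an object onto itself creates a new object with a fresh
# creation time, which restarts the lifecycle age clock.
client = storage.Client()
bucket = client.bucket("my-bucket")
blob = bucket.blob("keep/this-file.txt")

# Source and destination are the same bucket and name; the copy happens
# server-side, so the data is never downloaded.
bucket.copy_blob(blob, bucket, blob.name)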

Update/overwrite DNS record Google Cloud

Does anyone know the best practice for overwriting records in Google Cloud DNS using the API? https://cloud.google.com/dns/api/v1/changes/create does not help!
I could delete and create, but it is not nice ;) and could cause an outage.
Regards
The Cloud DNS API uses Changes objects to perform the update actions; you can create Changes but you don't ever delete them. In the Cloud DNS API, you never operate directly on the resource record sets. Instead, you create a Changes object with your desired additions and deletions and if that is created successfully, it applies those updates to the specified resource record sets in your managed DNS zone.
It's an unusual mental model, sort of like editing a file by specifying a diff to be applied, or appending to the commit history of a Git repository to change the contents of a file. Still, you can certainly achieve what you want to do using this API, and it is applied atomically at the authoritative servers (although the DNS system as a whole does not really do anything atomically, due to caching, so if you know you will be making changes, reduce your TTLs before you make the changes). The atomicity here is more about the updates themselves: if you have multiple applications making changes to your managed zones, and there are conflicts in changes to particular record sets, the create operation will fail and you will have to retry the change with modified deletions (rather than having changes be silently overwritten).
Anyhow, what you want to do is to create a Changes object with deletions that specifies the current resource record set, and additions that specifies your desired replacement. This can be rather verbose, especially if you have a domain name with a lot of records of the same type. For example, if you have four A records for mydomain.example (1.1.1.1, 2.2.2.2, 3.3.3.3, and 4.4.4.4) and want to change the 3.3.3.3 address to 5.5.5.5, you need to list all four original A records in deletions and then the new four (1.1.1.1, 2.2.2.2, 4.4.4.4, and 5.5.5.5) in additions.
The Cloud DNS documentation provides example code boilerplate that you can adapt to do what you want: https://cloud.google.com/dns/api/v1/changes/create#examples; you just need to set the deletions and additions for the Changes object you are creating.
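For instance, a minimal sketch of that A-record swap with the google-cloud-dns Python client; the managed zone name and TTL here are assumptions:
from google.cloud import dns

client = dns.Client()
zone = client.zone("my-managed-zone")  # an existing managed zone

# The deletions entry must match the current record set exactly.
old_rrset = zone.resource_record_set(
    "mydomain.example.", "A", 300,
    ["1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"])
# The additions entry is the desired replacement set.
new_rrset = zone.resource_record_set(
    "mydomain.example.", "A", 300,
    ["1.1.1.1", "2.2.2.2", "4.4.4.4", "5.5.5.5"])

changes = zone.changes()
changes.delete_record_set(old_rrset)
changes.add_record_set(new_rrset)
changes.create()  # deletions and additions are applied as one change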
I have never used the API for this purpose, but if you use the command line, i.e. gcloud, to update DNS records, it binds the change in a single transaction: both the deletion of the old record and the addition of the updated record are executed as one transaction. Since transactions are atomic in nature, it shouldn't cause any outage.
Personally, I never witnessed any outage while using gcloud for updating DNS settings for my domain.