ADF decode and uncompress data on the fly - azure-data-factory

I have a pipeline in ADF v2 that calls a SOAP endpoint which returns a base64-encoded string; the decoded payload is actually a zip file containing two files. I am only interested in file[1] (the second one). I want to take this file and write it to a storage account.
What's the best way to do this in ADF without resorting to external services such as an Azure Functions call?

Related

Google Cloud Storage Python API: blob rename, where is copy_to

I am trying to rename a blob (which can be quite large) after having uploaded it to a temporary location in the bucket.
The documentation says:
Warning: This method will first duplicate the data and then delete the old blob. This means that with very large objects renaming could be a very (temporarily) costly or a very slow operation. If you need more control over the copy and deletion, instead use google.cloud.storage.blob.Blob.copy_to and google.cloud.storage.blob.Blob.delete directly.
But I can find absolutely no reference to copy_to anywhere in the SDK (or elsewhere really).
Is there any way to rename a blob from A to B without the SDK copying the file? In my case it would overwrite B, but I can remove B first if that's easier.
The reason is checksum validation: I'll upload it under A first to make sure it's uploaded successfully (and doesn't trigger DataCorruption), and only then replace B (the live object).
GCS itself does not support renaming objects. Renaming with a copy+delete is done in the client as a helper, and there is no better way to rename an object at the moment.
Since you say your goal is checksum validation, there is a better solution: upload directly to your destination and use GCS's built-in checksum verification. How you do this depends on the API:
JSON objects.insert: Set crc32c or md5Hash header.
XML PUT object: Set x-goog-hash header.
Python SDK Blob.upload_from_* methods: Set checksum="crc32c" or checksum="md5" method parameter.
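For the Python SDK option, here is a minimal sketch (assuming a reasonably recent google-cloud-storage release; the bucket name, object name, and local file path are hypothetical) of uploading straight to the live object with client-side CRC32C verification:

    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-bucket")          # hypothetical bucket
    blob = bucket.blob("path/to/live-object")    # upload directly to B; no rename needed

    # checksum="crc32c" makes the client hash the data and send the checksum
    # along with the upload, so the request fails if the data is corrupted in transit.
    blob.upload_from_filename("local-file.bin", checksum="crc32c")

If the upload succeeds, the destination object is already verified, so the temporary object A and the rename step are no longer needed.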

How can you get XML out of a Data Factory?

How can you get XML out of a Data Factory?
Great, there is an XML format, but it is only available as a source ... not a sink.
So how can ADF write XML output?
I've looked around and there have been suggestions of using external services, but I'd like to keep it all "in Data Factory"
e.g. I could knock together an Azure Function, which could take JSON, and convert it to XML, using an example like so
But how can I then get ADF to, e.g., write this XML to a File System?
No, this is not possible.
If you just want to copy the data, then using the binary format is fine. But if you are trying to have ADF output XML, it is not possible (as the documentation you mentioned states).

How to extract a substring from a filename (which is the date) when reading a file in Azure Data Factory v2?

I have this Pipeline where I'm trying to process a CSV file with client data. This file is located in an Azure Data Lake Storage Gen1, and it consists of client data from a certain period of time (i.e. from January 2019 to July 2019). Therefore, the file name would be something like "Clients_20190101_20190731.csv".
From my Data Factory v2, I would like to read the file name and the file content to validate that the content (or a date column specifically) actually matches the range of dates of the file name.
So the question is: how can I read the file name, extract the dates from the name, and use them to validate the range of dates inside the file?
I haven't tested this, but you should be able to use the Get Metadata activity to get the filename. You can then access the outputs of the metadata activity and build an expression to split out the file name. If you want to validate the data in the file against the metadata output (the filename expression you built), your options are to use Mapping Data Flows or to pass the expression into a Databricks notebook; Mapping Data Flows uses Databricks under the hood. ADF natively does not have transformation tools with which you could accomplish this: you can't look at the data in a file except to move it (Copy activity), with the exception of the Lookup activity, which has a 5000-record limit.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
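As an untested sketch (assuming the Get Metadata activity is named 'Get Metadata1' with Item name in its field list, and that the file follows the Clients_<startdate>_<enddate>.csv pattern above), expressions along these lines would pull the two dates out of the filename:

    @split(activity('Get Metadata1').output.itemName, '_')[1]
    @substring(split(activity('Get Metadata1').output.itemName, '_')[2], 0, 8)

The first returns 20190101 and the second trims the .csv extension to return 20190731; those values can then be compared against the date column in a Mapping Data Flow or a Databricks notebook.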

How to preserve the timestamp of an io.Reader when copying a file by using a REST service in go?

I am writing some microservices in Go which handle different files.
I want to transfer files from one service, the client, to another, the server, via a PUT request. The service works, but there is a small point that is not elegant: the files I transfer get a new modification date when I write them to the file system of the server.
At the moment I handle the http.Request at the server like this:
ensure that there is a file at the server
copy the body from the request to that file: io.Copy(myfile, r.Body)
When I do that, the file's last modification date is now(). To solve this problem I could transfer a timestamp of the original file and set it via os.Chtimes(). But request.Body implements the io.ReadCloser interface, so I think there must be a more elegant way to write the file to the server. Is there a function which takes an io.Reader and preserves the timestamp of the file?
If not, is there a solution for REST services for this problem?
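For what it's worth, here is a minimal sketch of the os.Chtimes approach described above, assuming (hypothetically) that the client sends the original modification time in a custom X-Mtime header in RFC 3339 format; since an io.Reader only carries bytes, the timestamp has to travel out of band:

    package main

    import (
        "io"
        "net/http"
        "os"
        "time"
    )

    // putFile streams the PUT body to disk, then restores the original
    // modification time supplied by the client.
    func putFile(w http.ResponseWriter, r *http.Request) {
        const target = "/data/upload.bin" // hypothetical target path

        // ensure that there is a file at the server
        f, err := os.Create(target)
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        defer f.Close()

        // copy the body from the request to the file; io.Copy only sees bytes,
        // so the timestamp has to be transferred separately (here via a header)
        if _, err := io.Copy(f, r.Body); err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }

        // restore the original timestamp, if the client supplied one
        if ts := r.Header.Get("X-Mtime"); ts != "" { // X-Mtime is a made-up header name
            if mtime, err := time.Parse(time.RFC3339, ts); err == nil {
                os.Chtimes(target, mtime, mtime)
            }
        }
        w.WriteHeader(http.StatusNoContent)
    }

    func main() {
        http.HandleFunc("/files", putFile)
        http.ListenAndServe(":8080", nil)
    }

On the client side, the same timestamp would be read with os.Stat and written into the header before issuing the PUT.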

Determine S3 file last modified timestamp

I have a Scala Play 2 app and am using the AWS S3 API to read files from S3. I need to determine the last modified timestamp for a file; what's the best way to do that? Is it getObjectMetadata, or perhaps listObjects, or something else? If possible, I would like to determine the timestamps for multiple files in one call. Are there other open source libraries built on top of the AWS S3 APIs?
In the AWS Java SDK, an object in a listing is represented by S3ObjectSummary, which has a getLastModified method that returns the last modified timestamp.
Ideally, just list all of the files using listObjects and then call getObjectSummaries on the returned ObjectListing.
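A minimal sketch of that approach from Scala (the bucket name and prefix are hypothetical), using the v1 AWS Java SDK:

    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import scala.jdk.CollectionConverters._ // Scala 2.13; use scala.collection.JavaConverters on older versions

    object LastModifiedExample extends App {
      val s3 = AmazonS3ClientBuilder.defaultClient()

      // One listObjects call returns summaries for up to 1000 keys under the
      // prefix; each summary already carries the last-modified timestamp, so
      // no per-object getObjectMetadata call is needed.
      val listing = s3.listObjects("my-bucket", "some/prefix/")
      listing.getObjectSummaries.asScala.foreach { summary =>
        println(s"${summary.getKey} last modified at ${summary.getLastModified}")
      }
    }

If the listing is truncated, listNextBatchOfObjects can be used to page through the remaining keys.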