Copy image from REST to Blob using Data Factory - azure-data-factory

In my scenario I need to consume an external REST API. One of the fields in the response is a URL to an image. What I'm trying to achieve is to grab the image behind that URL and store it in blob storage. This would be easy with a Function or WebJob, but is there a way to do it with Data Factory on its own?

Based on my research, only the HTTP connector supports downloading files; it can be used as the source dataset in a copy activity.
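For readers who end up taking the Function/WebJob route anyway, the operation the copy activity performs is essentially a binary GET followed by a blob write. A minimal Python sketch of that equivalent, with a placeholder image URL, connection string, and container name:

```python
# Minimal sketch of the equivalent operation: fetch the image bytes over HTTP
# and write them to blob storage. All names below are placeholders.
import requests
from azure.storage.blob import BlobServiceClient

image_url = "https://example.com/picture.jpg"          # URL returned by the REST API
conn_str = "<storage-account-connection-string>"
blob_service = BlobServiceClient.from_connection_string(conn_str)
blob_client = blob_service.get_blob_client(container="images", blob="picture.jpg")

resp = requests.get(image_url, timeout=30)
resp.raise_for_status()
blob_client.upload_blob(resp.content, overwrite=True)   # plain binary copy, no transformation
```

Within Data Factory itself, the same thing is an HTTP-source dataset treated as binary, copied by a copy activity into a Blob Storage sink.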

Related

Posting large files using azure data factory

I am looking for some advice. I need to read a large file (>500 MB) from blob storage (it is encrypted at rest with keys from the key vault), and then I need to push the file (it is a zip) as-is to a REST service, with the file as the body using form-data.
I tried to use a web request but apparently the payload is limited to 1.5 MB.
Any advice?
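No answer was posted here, but for reference, one way around the payload limit is to do the transfer in code (for example from an Azure Function called by the pipeline). A hedged Python sketch, assuming a placeholder endpoint, container, and blob name, and assuming the blob uses transparent service-side encryption with the Key Vault keys:

```python
# Hedged sketch: stage the blob in a temp file, then stream it to the REST
# service as multipart/form-data (requests-toolbelt streams the body, so the
# 500 MB zip never has to sit in memory). Endpoint and names are placeholders;
# with service-side encryption the download is decrypted transparently.
import tempfile

import requests
from azure.storage.blob import BlobServiceClient
from requests_toolbelt.multipart.encoder import MultipartEncoder

conn_str = "<storage-account-connection-string>"
service = BlobServiceClient.from_connection_string(conn_str)
blob = service.get_blob_client(container="uploads", blob="payload.zip")

with tempfile.TemporaryFile() as tmp:
    blob.download_blob().readinto(tmp)   # stream blob contents into the local temp file
    tmp.seek(0)
    form = MultipartEncoder(fields={"file": ("payload.zip", tmp, "application/zip")})
    resp = requests.post(
        "https://target-service.example.com/upload",
        data=form,
        headers={"Content-Type": form.content_type},
        timeout=600,
    )
    resp.raise_for_status()
```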

Cannot create a batch pipeline to get data from ZohoCRM to BigQuery with HTTP plugin 1.2.1. Returns "Spark Program 'phase-1' failed"

My first post here; I'm new to Data Fusion and have little to no coding experience.
I want to get data from ZohoCRM to BigQuery, with each ZohoCRM module (e.g. accounts, contacts...) as a separate table in BigQuery.
To connect to Zoho CRM I obtained a code, token, refresh token and everything needed as described here: https://www.zoho.com/crm/developer/docs/api/v2/get-records.html. Then I ran a successful get-records request via Postman, as described there, and it returned the records from the Zoho CRM Accounts module as JSON.
I thought it would all be fine and set the parameters in Data Fusion
(DataFusion_settings_1 and DataFusion_settings_2), and it validated fine. Then I previewed and ran the pipeline without deploying it. It failed with the following info in the logs (logs_screenshot). I tried manually entering a few fields in the schema when the format was JSON, and I tried changing the format to CSV; neither worked. I tried switching Verify HTTPS Trust Certificates on and off. It did not help.
I'd be really thankful for some help. Thanks.
Update, 2020-12-03
I got in touch with a Google Cloud account manager, who took my question to their engineers, and here is the info:
The HTTP plugin can be used to "fetch Atom or RSS feeds regularly, or to fetch the status of an external system"; it does not seem to be designed for APIs.
At the moment a more suitable tool for data collected via APIs is Dataflow: https://cloud.google.com/dataflow
"Google Cloud Dataflow is used as the primary ETL mechanism, extracting the data from the API Endpoints specified by the customer, which is then transformed into the required format and pushed into BigQuery, Cloud Storage and Pub/Sub."
https://www.onixnet.com/insights/gcp-101-an-introduction-to-google-cloud-platform
So in the next few weeks I'll be looking at Dataflow.
Can you please attach the complete logs of the preview run? Make sure to redact any PII data. Also, what version of CDF are you using? Is the CDF instance private or public?
Thanks and Regards,
Sagar
Did you end up using Dataflow?
I am also experiencing the same issue with the HTTP plugin, but my temporary workaround was to use Cloud Scheduler to periodically trigger a Cloud Function that fetches my data from the API and exports it as JSON to GCS, which can then be accessed by Data Fusion.
My solution is of course non-ideal, so I am still looking for a way to use the Data Fusion HTTP plugin. I was able to make it work to get sample data from public API endpoints, but for a reason still unknown to me I can't get it to work for my actual API.
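For what it's worth, the Cloud Scheduler + Cloud Function workaround described above can be sketched roughly like this in Python; the endpoint, token handling, bucket, and object names are placeholders:

```python
# Hedged sketch of an HTTP-triggered Cloud Function: fetch records from the
# CRM API and drop the raw JSON into a GCS bucket for Data Fusion to pick up.
# URL, token, bucket, and object path are placeholders.
import json

import requests
from google.cloud import storage

API_URL = "https://www.zohoapis.com/crm/v2/Accounts"   # per the Zoho get-records docs
BUCKET = "my-crm-landing-bucket"

def fetch_to_gcs(request):
    headers = {"Authorization": "Zoho-oauthtoken <access-token>"}
    resp = requests.get(API_URL, headers=headers, timeout=60)
    resp.raise_for_status()

    client = storage.Client()
    blob = client.bucket(BUCKET).blob("zoho/accounts.json")
    blob.upload_from_string(json.dumps(resp.json()), content_type="application/json")
    return "ok", 200
```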

Image metadata in blob storage to cosmos

What is the simplest way to transfer the image metadata I have stored for each image in a container to Cosmos DB on the Azure portal? For clarification, I am trying to retrieve the image properties I have specified as key/value pairs under metadata, not the default image metadata. I have attached an image of what I am referring to.
I think using a Logic App is relatively simple; the design of the Logic App is as follows:
You can first use List blobs to list all your images, then use For each to traverse the blobs and Create or update document to save the image metadata in Cosmos DB.
I have tested it for you and found no problems.
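If you would rather do it in code than in the Logic App designer, the same transfer can be sketched with the Python SDKs; the connection string, container, database, and key names below are placeholders:

```python
# Hedged sketch: read the user-defined metadata of each blob in a container and
# upsert it as a document into Cosmos DB. All names and keys are placeholders.
from azure.storage.blob import ContainerClient
from azure.cosmos import CosmosClient

blobs = ContainerClient.from_connection_string(
    "<storage-connection-string>", container_name="images"
)
cosmos = CosmosClient("<cosmos-account-url>", credential="<cosmos-key>")
container = cosmos.get_database_client("imagedb").get_container_client("metadata")

for blob in blobs.list_blobs(include=["metadata"]):
    # Use the blob name as the document id (note: Cosmos ids cannot contain '/').
    doc = {"id": blob.name, **(blob.metadata or {})}
    container.upsert_item(doc)
```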

tus.io with Azure Functions

I want to process big files with Azure Functions over HTTP(S). I need something with resumable file upload, like tus.io. Is it possible to implement an Azure Function with tus.io, for example by augmenting "HTTP & webhooks"?
Many thanks in advance, X.
You are very likely to hit the HTTP timeout limit of 230 seconds. Instead of building and maintaining your own upload service, you could use Azure Blob Storage for this directly.
By generating and handing out a SAS key, you can let clients upload directly to blob storage, which is already designed for massive scale and supports resuming uploads/downloads.
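A minimal sketch of that approach, assuming a Python Azure Function that hands out a short-lived SAS URL for the client to upload against (account name, key, and container are placeholders):

```python
# Hedged sketch: HTTP-triggered Azure Function that returns a short-lived SAS URL
# so the client can upload directly against Blob Storage. Account name, key, and
# container are placeholders.
from datetime import datetime, timedelta

import azure.functions as func
from azure.storage.blob import BlobSasPermissions, generate_blob_sas

ACCOUNT = "mystorageaccount"
ACCOUNT_KEY = "<account-key>"
CONTAINER = "uploads"

def main(req: func.HttpRequest) -> func.HttpResponse:
    blob_name = req.params.get("name", "upload.bin")
    sas = generate_blob_sas(
        account_name=ACCOUNT,
        container_name=CONTAINER,
        blob_name=blob_name,
        account_key=ACCOUNT_KEY,
        permission=BlobSasPermissions(create=True, write=True),
        expiry=datetime.utcnow() + timedelta(hours=1),
    )
    url = f"https://{ACCOUNT}.blob.core.windows.net/{CONTAINER}/{blob_name}?{sas}"
    return func.HttpResponse(url, status_code=200)
```

The client then uploads straight to the returned URL; since block blobs are uploaded as individually committed blocks, the client can retry or resume by re-sending only the blocks that failed.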

Cleanup Assets and Blobs after media encoding job

I am using the following workflow, which leaves several copies of the original assets and blobs that should be cleaned up. I want to make sure I keep only the assets necessary to play back the videos that have been encoded. I am also wondering if there is a more efficient way of creating encoded assets; it seems the only improvement would be uploading the blob directly to a media service container instead of having to copy the blob.
I am using the following workflow:
From my website, a video file is uploaded to a non-media-service container
After the file is uploaded, a queue message is created for the blob
An Azure WebJob receives the queue message
The uploaded blob is copied to the media service container
Create a media service asset from the copied blob
Start a media encoder job from the new asset for H264 Adaptive Bitrate MP4 Set 720p
After the job is complete, delete the original blob, the first asset, and the queue message
As you already mentioned, one optimization step is to eliminate uploading the media file to storage not associated with the media service. Also, since you are already using Azure queues, you can use them to be notified when the job is done. With the proposed changes your workflow will be:
In the UI, call asset creation before the upload starts.
The user uploads directly to the storage associated with the media account; see https://stackoverflow.com/a/28951408/774068
Once the upload is finished, trigger creation of the media jobs, with Azure queues associated with them. See https://learn.microsoft.com/en-us/azure/media-services/media-services-dotnet-check-job-progress-with-queues
Listen for the Azure queue message about job completion and delete the source asset once the message is received; you can use Azure Functions for this (see the sketch below). https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage
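A rough sketch of that last step, assuming a queue-triggered Python Azure Function; the message shape, container, and blob names below are made up for illustration and are not the actual Media Services notification format:

```python
# Hedged sketch: queue-triggered Azure Function that deletes the originally
# uploaded blob once a job-completion message arrives. The message shape,
# connection string, and container name are assumptions for illustration.
import json

import azure.functions as func
from azure.storage.blob import BlobServiceClient

CONN_STR = "<storage-account-connection-string>"

def main(msg: func.QueueMessage) -> None:
    # Assumed message shape: {"state": "Finished", "sourceBlob": "original-upload.mp4"}
    event = json.loads(msg.get_body().decode("utf-8"))
    if event.get("state") != "Finished":
        return

    # Remove the original upload from the non-media container once encoding is done.
    service = BlobServiceClient.from_connection_string(CONN_STR)
    service.get_blob_client(container="raw-uploads", blob=event["sourceBlob"]).delete_blob()
```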