I am looking for some advice. I need to read a large file (>500 MB) from blob storage (it is encrypted at rest) using keys from Key Vault, and then push the file (it is a zip) as-is to a REST service, sending the file as the body using form-data.
I tried to use a web request but apparently the payload is limited to 1.5 MB.
Any advice?
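One option is to skip the web request activity entirely and stream the blob straight into a multipart request from a small piece of code, so the 500 MB never has to fit into a single in-memory payload. Below is a minimal sketch, assuming the Azure Blob Storage v12 Java SDK plus Apache HttpClient with the httpmime module; the connection string, container, blob name and target URL are placeholders. (If the encryption is service-side with customer-managed keys from Key Vault, the download already returns decrypted bytes; client-side encryption would need an extra decryption step.)

import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobClientBuilder;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.mime.MultipartEntityBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import java.io.InputStream;

public class BlobToRestUpload {
    public static void main(String[] args) throws Exception {
        // Placeholder storage account / container / blob names.
        BlobClient blob = new BlobClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
                .containerName("my-container")
                .blobName("archive.zip")
                .buildClient();

        try (InputStream blobStream = blob.openInputStream();
             CloseableHttpClient http = HttpClients.createDefault()) {
            // The zip is streamed chunk by chunk as the form-data body,
            // so the whole file is never held in memory.
            HttpPost post = new HttpPost("https://example.com/api/upload"); // placeholder URL
            post.setEntity(MultipartEntityBuilder.create()
                    .addBinaryBody("file", blobStream,
                            ContentType.create("application/zip"), "archive.zip")
                    .build());
            HttpResponse response = http.execute(post);
            System.out.println(response.getStatusLine());
        }
    }
}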
This is my first post here. I'm new to Data Fusion and have little to no coding experience.
I want to get data from Zoho CRM into BigQuery, with each Zoho CRM module (e.g. accounts, contacts...) landing in a separate BigQuery table.
To connect to Zoho CRM I obtained a code, token, refresh token and everything needed, as described here: https://www.zoho.com/crm/developer/docs/api/v2/get-records.html. Then I ran a successful get-records request via Postman, as described there, and it returned the records from the Zoho CRM Accounts module as JSON.
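For reference, the equivalent call outside Postman looks roughly like the sketch below; the data-center endpoint, the Accounts module and the environment variable holding the access token are placeholders taken from the linked docs.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ZohoGetRecords {
    public static void main(String[] args) throws Exception {
        // GET records from the Accounts module, as in the Zoho CRM v2 docs.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.zohoapis.com/crm/v2/Accounts"))
                .header("Authorization", "Zoho-oauthtoken " + System.getenv("ZOHO_ACCESS_TOKEN"))
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body()); // JSON with a "data" array of account records
    }
}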
I thought it would all be fine, so I set the parameters in Data Fusion (DataFusion_settings_1 and DataFusion_settings_2) and it validated fine. Then I previewed and ran the pipeline without deploying it. It failed with the following info from the logs (logs_screenshot). I tried manually entering a few fields in the schema with the format set to JSON, and I tried changing the format to CSV; neither worked. I also tried switching Verify HTTPS Trust Certificates on and off. It did not help.
I'd be really thankful for some help. Thanks.
Update, 2020-12-03
I got in touch with our Google Cloud account manager, who took my question to their engineers, and here is the info:
The HTTP plugin can be used to "fetch Atom or RSS feeds regularly, or to fetch the status of an external system"; it does not seem to be designed for APIs.
At the moment, a more suitable tool for data collected via APIs is Dataflow: https://cloud.google.com/dataflow
"Google Cloud Dataflow is used as the primary ETL mechanism, extracting the data from the API Endpoints specified by the customer, which is then transformed into the required format and pushed into BigQuery, Cloud Storage and Pub/Sub."
https://www.onixnet.com/insights/gcp-101-an-introduction-to-google-cloud-platform
So in the coming weeks I'll be looking at Dataflow.
Can you please attach the complete logs of the preview run? Make sure to redact any PII. Also, what version of CDF are you using? Is the CDF instance private or public?
Thanks and Regards,
Sagar
Did you end up using Dataflow?
I am also experiencing the same issue with the HTTP plugin, but my temporary workaround was to use Cloud Scheduler to periodically trigger a Cloud Function that fetches my data from the API and exports it as JSON to GCS, where it can then be accessed by Data Fusion.
My solution is of course not ideal, so I am still looking for a way to use the Data Fusion HTTP plugin. I was able to make it work for sample data from public API endpoints, but for a reason still unknown to me I can't get it to work with my actual API.
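In case it helps anyone, here is roughly what that workaround looks like. This is a minimal sketch assuming a Java HTTP-triggered Cloud Function; the Zoho endpoint, the access-token environment variable and the staging bucket name are all placeholders.

import com.google.cloud.functions.HttpFunction;
import com.google.cloud.functions.HttpRequest;
import com.google.cloud.functions.HttpResponse;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.net.URI;
import java.net.http.HttpClient;
import java.nio.charset.StandardCharsets;

public class ApiToGcsFunction implements HttpFunction {
    private static final Storage STORAGE = StorageOptions.getDefaultInstance().getService();

    @Override
    public void service(HttpRequest request, HttpResponse response) throws Exception {
        // Fetch records from the (placeholder) API endpoint.
        java.net.http.HttpRequest apiRequest = java.net.http.HttpRequest.newBuilder()
                .uri(URI.create("https://www.zohoapis.com/crm/v2/Accounts"))
                .header("Authorization", "Zoho-oauthtoken " + System.getenv("ZOHO_ACCESS_TOKEN"))
                .GET()
                .build();
        String json = HttpClient.newHttpClient()
                .send(apiRequest, java.net.http.HttpResponse.BodyHandlers.ofString())
                .body();

        // Write the raw JSON to a GCS staging bucket for Data Fusion (or a BigQuery load) to pick up.
        BlobInfo blobInfo = BlobInfo.newBuilder("my-staging-bucket", "zoho/accounts.json").build();
        STORAGE.create(blobInfo, json.getBytes(StandardCharsets.UTF_8));

        response.getWriter().write("Exported Zoho accounts to GCS\n");
    }
}

Cloud Scheduler then just hits the function's HTTP trigger on whatever cron schedule you need.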
In my scenario I need to consume an external REST API. One of the fields in the response is a URL to an image. What I'm trying to achieve is to grab the picture behind that URL and store it in blob storage. This would be easy with a Function or WebJob, but is there a way to do it with Data Factory on its own?
Based on my research, only the HTTP connector supports downloading a file; it can be used as the source dataset in a Copy activity.
I want to process big files with Azure Functions over HTTP(S). I need something with resumable file upload, like tus.io. Is it possible to implement an Azure Function with tus.io, for example by augmenting "HTTP & webhooks"?
Many thanks in advance, X.
You are very likely to hit the HTTP timeout limit of 230 seconds. Instead of building and maintaining your own upload service, you could use Azure Blob Storage directly.
By generating and using a SAS token, you can let clients upload directly to blob storage, which is already designed for massive scale and supports resuming uploads/downloads.
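For example, your function (or any small backend) can hand out a short-lived, write-only SAS URL per upload and the client then talks to blob storage directly. A minimal sketch with the Azure Blob Storage v12 Java SDK; the connection string, container and blob names are placeholders.

import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobClientBuilder;
import com.azure.storage.blob.sas.BlobSasPermission;
import com.azure.storage.blob.sas.BlobServiceSasSignatureValues;

import java.time.OffsetDateTime;

public class UploadSasExample {
    public static void main(String[] args) {
        // Placeholder names -- point this at the blob the client should upload to.
        BlobClient blob = new BlobClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
                .containerName("uploads")
                .blobName("incoming/big-file.bin")
                .buildClient();

        // Allow create + write for one hour; the client uploads blocks directly to this URL.
        BlobSasPermission permission = new BlobSasPermission()
                .setCreatePermission(true)
                .setWritePermission(true);
        BlobServiceSasSignatureValues sasValues =
                new BlobServiceSasSignatureValues(OffsetDateTime.now().plusHours(1), permission);

        String uploadUrl = blob.getBlobUrl() + "?" + blob.generateSas(sasValues);
        System.out.println(uploadUrl); // hand this URL to the client
    }
}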
I am trying to understand the general architecture and components needed to link metadata with blob objects stored in the cloud, such as Azure Blob Storage or AWS.
Consider an application which allows users to upload blob files to the cloud. With each file there would be a myriad of metadata describing the file, its cloud URL, and perhaps the email addresses of users the file is shared with.
In this case, the file gets saved to the cloud and the metadata into some type of database somewhere else. How would you go about doing this transactionally, so that it is guaranteed that both the file and the metadata were saved? If one of the two fails, the application would need to notify the user so that another attempt could be made.
There's no built-in mechanism to span transactions across two disparate systems, such as Neo4j/MongoDB and Azure/AWS blob storage as you mentioned. This is up to your app to manage, and how you go about it is really a matter of opinion/discussion.
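One common way to manage it in the app is a compensating write: save the blob first, then the metadata, and delete the blob (or queue it for cleanup) if the metadata write fails, surfacing the error so the user can retry. A minimal sketch, with hypothetical BlobStore and MetadataStore interfaces standing in for whatever storage SDK and database driver you actually use:

import java.io.InputStream;

// Hypothetical abstractions over your real clients (Azure/AWS SDK, Neo4j/MongoDB driver, ...).
interface BlobStore {
    String upload(String name, InputStream data) throws Exception; // returns the blob URL
    void delete(String name) throws Exception;
}

interface MetadataStore {
    void save(String name, String blobUrl, String[] sharedWith) throws Exception;
}

class FileUploadService {
    private final BlobStore blobs;
    private final MetadataStore metadata;

    FileUploadService(BlobStore blobs, MetadataStore metadata) {
        this.blobs = blobs;
        this.metadata = metadata;
    }

    // Upload the blob first, then the metadata; roll the blob back if the metadata write fails.
    void upload(String name, InputStream data, String[] sharedWith) throws Exception {
        String url = blobs.upload(name, data);
        try {
            metadata.save(name, url, sharedWith);
        } catch (Exception e) {
            // Compensating action: best-effort cleanup so no orphaned blob is left behind.
            try { blobs.delete(name); } catch (Exception ignored) { }
            throw e; // surface the failure so the user can try again
        }
    }
}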
I want to add user metadata that is calculated from the stream as it is uploaded.
I am using the Google Cloud Storage Client from inside a Servlet during a file upload.
The only solutions I have come up with and tried are not really satisfactory, for a couple of reasons:
1. Buffer the stream in memory, calculate the metadata as the stream is buffered, then write the stream out to Cloud Storage after it has been completely read.
2. Write the stream to a temporary bucket and calculate the metadata. Then read the object from the temporary bucket and write it to its final location with the calculated metadata.
3. Pre-calculate the metadata on the client and send it with the upload.
Why these aren't acceptable:
1. Doesn't work for large objects, which some of these will be.
2. Will cost a fortune if lots of objects are uploaded, which there will be.
3. Can't trust the clients, and some of the clients can't calculate some of what I need.
Is there any way to update the metadata of a Google Cloud Storage object after the fact?
You are likely using the Google Cloud Storage Java Client for App Engine library. This library is great for App Engine users, but it offers only a subset of the features of Google Cloud Storage. To my knowledge, it does not support updating the metadata of existing objects. However, Google Cloud Storage itself definitely supports this.
You can use the Google API Java client library, which exposes Google Cloud Storage's JSON API. With it, you'll be able to use the storage.objects.update method or the storage.objects.patch method, both of which can update metadata (the difference is that update replaces any properties already on the object, while patch just changes the specified fields). The code would look something like this:
// Custom metadata to set on the existing object.
StorageObject objectMetadata = new StorageObject()
    .setName("OBJECT_NAME")
    .setMetadata(ImmutableMap.of("key1", "value1", "key2", "value2"));

// objects.patch takes the bucket, the object name, and the fields to change.
Storage.Objects.Patch patchObject =
    storage.objects().patch("mybucket", "OBJECT_NAME", objectMetadata);
patchObject.execute();
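And for the "calculated from the stream as it is uploaded" part, one way (independent of which client library performs the upload) is to wrap the incoming servlet stream so the values are computed while the bytes are copied out, then patch them onto the object afterwards as shown above. A small sketch using an MD5 digest as an example piece of metadata:

import java.io.InputStream;
import java.io.OutputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.Base64;

public class StreamingMetadata {
    // Copies in -> out while computing an MD5 digest on the fly; returns the base64 digest.
    public static String copyWithDigest(InputStream in, OutputStream out) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (DigestInputStream digestIn = new DigestInputStream(in, md5)) {
            byte[] buffer = new byte[64 * 1024];
            int read;
            while ((read = digestIn.read(buffer)) != -1) {
                out.write(buffer, 0, read);   // bytes go to Cloud Storage as they arrive
            }
        }
        return Base64.getEncoder().encodeToString(md5.digest());
    }
}

This avoids both buffering the whole object in memory and the temporary-bucket double write, since the metadata is ready as soon as the upload finishes.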