I have a ForEach activity which uses a Copy activity with an HTTP source and a blob storage sink to download a json file for each item. The HTTP source is set to Binary Copy whereas the blob storage sink is not, since I want to both copy the complete json file to blob storage and also extract some data from each json file and store this data in a database table in the following activity.
In the blob storage sink, I've added a column definition which extracts some data from the json file. The json files are stored in blob storage successfully, but how can I access the extracted data in the subsequent stored procedure activity?
You can try using the Lookup activity to read the extracted data from the blob and use its output in a later activity. See https://learn.microsoft.com/en-us/azure/data-factory/control-flow-lookup-activity to check whether the Lookup activity meets your requirements.
I want to update a column in a source Excel file with a particular string.
My source contains n columns. I need to check whether the string apple exists in any of the columns. If it does, I need to replace apple with orange and output the Excel file. How can I do this in ADF?
Note: I cannot use data flows since we are using a self-hosted VM.
Excel files have a lot of limitations in ADF; for example, Excel is not supported as a sink in the copy activity or in data flows.
You can raise a feature request for that in ADF.
So, try the above operation with a CSV and copy the result to a CSV file in Blob storage, which you can later convert to Excel on your local machine.
For operations like this, a data flow is a better option than ordinary activities, since data flows are built for transformations.
But data flows don't support self-hosted linked services.
So, as a workaround, first copy the Excel file as CSV to Blob storage using a copy activity. Create a Blob linked service for it to use in the data flow.
Now follow the process below in the data flow.
Source CSV from Blob:
Derived column transformation:
Give the condition for each column: case(col1=="apple", "orange", col1)
Sink :
In the Sink settings, set the file name option to Output to single file.
After the pipeline runs, a CSV file will be generated in Blob storage. You can convert it to Excel on your local machine.
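If it helps to check the logic outside ADF first, here is a minimal Python sketch of the same rule the derived column applies: every cell equal to apple becomes orange, in every column. The function name and the sample data are made up for illustration, and this is only a local stand-in for the data flow, not how ADF runs it.

```python
import csv
import io


def replace_in_all_columns(csv_text, old="apple", new="orange"):
    """Return csv_text with every cell equal to `old` replaced by `new`."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Same rule as case(col == "apple", "orange", col), applied to each column.
        writer.writerow([new if cell == old else cell for cell in row])
    return out.getvalue()


# Hypothetical sample data standing in for the CSV copied to Blob storage.
sample = "col1,col2,col3\napple,1,pear\nkiwi,apple,2\n"
result = replace_in_all_columns(sample)
print(result)
```

Note that this matches whole cell values only, like the case() expression above; a cell such as "apple pie" is left unchanged.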
I have flat files in an ADLS source.
For the full load we are adding 2 columns, Insert and datatimestamp.
For the change load we need to look up against the full data: rows that exist in the full data should be marked as Updated, and rows that do not should be marked as Insert and copied.
Below is the approach I tried, but I'm unable to get it to work.
Can anyone help me with this?
Thank you; waiting for a quick response.
Currently, updating an existing flat file through the Azure Data Factory sink is not supported. You have to create a new flat file.
You can also use a data flow activity to read the full and incremental data and load them into a new file in the sink transformation.
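To make the Updated/Insert rule from the question concrete, here is a rough Python sketch of the comparison the data flow would need to do before writing the new file. The key column `id`, the output column names `flag` and `load_timestamp`, and the sample rows are all assumptions for illustration; the question doesn't say which column identifies a row.

```python
def tag_change_rows(full_rows, change_rows, load_timestamp, key="id"):
    """Mark each change row as Updated (key exists in full data) or Insert."""
    existing_keys = {row[key] for row in full_rows}
    tagged = []
    for row in change_rows:
        flag = "Updated" if row[key] in existing_keys else "Insert"
        # Build a new row rather than modifying the input, since the
        # result is written to a new file instead of updating the old one.
        tagged.append({**row, "flag": flag, "load_timestamp": load_timestamp})
    return tagged


# Hypothetical sample data.
full = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
change = [{"id": 2, "name": "b2"}, {"id": 3, "name": "c"}]
tagged = tag_change_rows(full, change, load_timestamp="2020-05-01T00:00:00")
for row in tagged:
    print(row)
```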
I have done the data flow tutorial. The sink currently creates 4 files in Azure Data Lake Gen2.
I suppose this is related to the HDFS file system.
Is it possible to save without the success, committed, and started files?
What is the best practice? Should they be removed after saving to Data Lake Gen2?
Are they needed in further data processing?
https://learn.microsoft.com/en-us/azure/data-factory/tutorial-data-flow
There are a couple of options available.
You can specify the output file name in the Sink transformation settings.
Select Output to single file from the file name option dropdown and enter the output file name.
You could also parameterize the output file name as required. Refer to this SO thread.
You can add a Delete activity after the data flow activity in the pipeline to delete the extra files from the folder.
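For the delete option, the main thing to get right is which files to target. Below is a small Python sketch that separates the marker files from the data files by name. It assumes the marker files start with `_SUCCESS`, `_committed` or `_started`; check this against what you actually see in your folder. The file names in the example are invented.

```python
# Assumed name prefixes of the marker files written next to the data file.
MARKER_PREFIXES = ("_SUCCESS", "_committed", "_started")


def split_marker_files(names):
    """Split file names into (data_files, marker_files) by name prefix."""
    markers = [n for n in names if n.startswith(MARKER_PREFIXES)]
    data = [n for n in names if not n.startswith(MARKER_PREFIXES)]
    return data, markers


# Hypothetical folder listing after a data flow run.
listing = ["_SUCCESS", "_committed_123", "_started_123", "part-00000.csv"]
data_files, marker_files = split_marker_files(listing)
print("keep:", data_files)
print("delete:", marker_files)
```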
I'm using ADF pipeline to copy data from data lake to blob storage and then from blob storage to table storage.
As you can see below, here are the column types in ADF Data Flow Sink - Blob Storage (integer, string, timestamp):
Here are the mapping settings in the Copy data activity:
On checking the output in table storage, I see all columns are of string type:
Why is table storage saving data in string values? How do I resolve this issue in table storage so that it will accept columns in the right type (integer, string, timestamp)? Please let me know. Thank you!
Usually, when loading data from Blob storage in Data Factory, the default data type of every column in the blob file is String, and Data Factory converts the data types automatically for the sink.
But this cannot cover every requirement.
I tested copying data from Blob to Table storage and found that if we don't specify the data types manually in the source, then after the pipeline runs, all the data types will be String in the sink (Table storage).
For example, this is my source blob file:
If I don't change the source data types, everything seems OK in the sink table:
But after the pipeline runs, the data types in Table storage are all String:
If we change the data types in the source blob manually, it works OK!
On another note, I'm a little confused: from your screenshot, that appears to be the UI of the Mapping Data Flow sink, but Mapping Data Flow doesn't support Table storage as a sink.
Hope this helps.
Finally figured out the issue - I was using DelimitedText format for Blob Storage. After converting to Parquet format, I can see data being written to table storage in correct type.
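A quick way to see why delimited text ends up as strings: a CSV file carries no type information, so every value is read as text until something converts it, whereas Parquet stores a schema alongside the data. The Python sketch below only illustrates that point with the standard library; the column names and values are made up, and it is not what ADF does internally.

```python
import csv
import io
from datetime import datetime

# Hypothetical delimited-text content with an integer, a string and a timestamp.
sample = "id,name,created\n1,alice,2020-05-01T10:00:00\n"

row = next(csv.DictReader(io.StringIO(sample)))
# Every value comes back as str, because CSV has no schema.
raw_types = {col: type(val).__name__ for col, val in row.items()}

# Types only appear once they are specified explicitly.
typed_row = {
    "id": int(row["id"]),
    "name": row["name"],
    "created": datetime.fromisoformat(row["created"]),
}
typed_types = {col: type(val).__name__ for col, val in typed_row.items()}

print(raw_types)
print(typed_types)
```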
I want to copy a file with a given path from one zone of Azure Data Lake to another zone of the Data Lake.
Example:
Source: /RawZone/Incremental/2020/05/01/file.parquet
Destination: /StdZone/Incremental/2020/05/01/file.parquet
Should I be using the Copy activity to read the source as a dataset and write to the destination? Or is there a way to just copy the file from source to destination in Azure Data Factory?
As far as I am aware the Copy Activity is the only way.
You will need a dataset to define where the file is coming from and going (though the path can be parameterised) and its format.
If you want to copy the file as is without alteration, set the dataset format to binary to avoid having to define the file structure and 'waste time' extracting and parsing the data within.
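Since the path can be parameterised, the only thing that changes between source and destination in the example is the zone folder. Here is a small Python sketch of that mapping, assuming the zone is always the first folder in the path; the function name is hypothetical, and in ADF itself you would express this with dataset parameters rather than Python.

```python
from pathlib import PurePosixPath


def to_target_zone(source_path, source_zone="RawZone", target_zone="StdZone"):
    """Swap the leading zone folder and keep the rest of the path as is."""
    parts = PurePosixPath(source_path).parts
    # For an absolute path, parts[0] is "/" and parts[1] is the zone folder.
    if len(parts) < 2 or parts[1] != source_zone:
        raise ValueError(f"path does not start with /{source_zone}: {source_path}")
    return str(PurePosixPath(parts[0], target_zone, *parts[2:]))


destination = to_target_zone("/RawZone/Incremental/2020/05/01/file.parquet")
print(destination)
```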