I would like to use Data Factory to regularly download 500,000 JSON files from a web API and store them in a blob storage container. Then I need to parse the JSON files to extract some values from each file and store these values, together with an ID (part of the filename), in a database. I can do this using a ForEach activity that runs a custom activity for each file, but this is very slow, so I would prefer some batch activity that could run the same parsing code on each file. Is there some way to do this?
If your source JSON files have the same schema, you can leverage the Copy Activity, which can parse those files in a single run. If possible, though, I would suggest splitting the files into subfolders (e.g. 1,000 files per folder), so that each copy run takes less time and the files are easier to manage.
Refer to this doc for more details: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview
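If the values you need can be expressed as a straightforward schema mapping, the Copy Activity handles the extraction for you. If the parsing is more involved, the same idea of one batch pass (instead of 500,000 ForEach iterations) can be sketched as a single script that loops over the blobs itself, e.g. run once from a custom activity. This is only a rough sketch; the connection strings, container, field name, and table below are hypothetical:

```
import json
import pyodbc
from azure.storage.blob import ContainerClient

# Hypothetical connection strings, container, and table names.
container = ContainerClient.from_connection_string(
    "<storage-connection-string>", container_name="json-files")
db = pyodbc.connect("<sql-connection-string>")
cursor = db.cursor()

rows = []
for blob in container.list_blobs():                        # one pass over all files
    doc = json.loads(container.download_blob(blob.name).readall())
    file_id = blob.name.rsplit("/", 1)[-1].split(".")[0]   # ID taken from the file name
    rows.append((file_id, doc.get("someValue")))           # hypothetical field to extract

cursor.fast_executemany = True
cursor.executemany("INSERT INTO ParsedValues (FileId, SomeValue) VALUES (?, ?)", rows)
db.commit()
```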
Related
I'm currently using Azure Data Factory to load flat file data from our Gen 2 data lake into Synapse database tables. Unfortunately, we receive (many) thousands of files into timestamped folders for each feed. I'm currently using Synapse external tables to copy this data into standard heap tables.
Since each folder contains so many files, I'd like to move (or Copy/Delete) the entire folder (after processing) somewhere else in the lake. Is there some practical way to do that with Azure Data Factory?
Yes, you can use a copy activity with a wildcard. I tried to reproduce this in my environment and got the results below.
First, add the source dataset and use a wildcard path with the folder name. In my scenario, the folder is named pool.
Then select the sink dataset with its file path.
The pipeline run succeeded: it transferred the files from one location to the other with the required names.
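If the goal is literally to move the folder rather than copy it and then delete the original, and calling the lake directly (outside ADF) is an option, the ADLS Gen2 SDK can rename a directory in place. A minimal sketch, assuming the azure-storage-file-datalake package and hypothetical filesystem and folder names:

```
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical connection string, filesystem, and paths.
service = DataLakeServiceClient.from_connection_string("<storage-connection-string>")
fs = service.get_file_system_client("datalake")

# Rename (move) the processed folder to an archive location in one call.
source_dir = fs.get_directory_client("raw/feed1/20230101T0000/pool")
source_dir.rename_directory("datalake/archive/feed1/20230101T0000/pool")
```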
I have flat files in an ADLS source.
For the full load we add two columns, an Insert flag and a datetimestamp.
For the change load we need to do a lookup against the full data: records already present in the full data should be treated as Update, and records not present should be treated as Insert, then copied.
I tried to work out an approach along these lines, but I'm unable to get it working.
Can anyone help me with this?
Thank you in advance.
Currently, updating an existing flat file through the Azure Data Factory sink is not supported; you have to create a new flat file.
You can also use a Data Flow activity to read the full and incremental data and load the result into a new file in the sink transformation.
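To make the Insert/Update classification concrete, here is a small pandas sketch of the same merge logic the data flow would implement (the file names and the key column are hypothetical):

```
import pandas as pd

# Hypothetical files and key column: "full_load.csv" is the full extract and
# "change_load.csv" is the delta, both keyed on "id".
full = pd.read_csv("full_load.csv")
incr = pd.read_csv("change_load.csv")

# Rows whose key already exists in the full data become Update,
# rows with a new key become Insert.
flagged = incr.merge(full[["id"]], on="id", how="left", indicator=True)
flagged["change_type"] = flagged["_merge"].map({"both": "Update", "left_only": "Insert"})
flagged = flagged.drop(columns="_merge")
flagged["load_timestamp"] = pd.Timestamp.now(tz="UTC")

# Write the result as a new file, mirroring what the data-flow sink would do.
flagged.to_csv("merged_output.csv", index=False)
```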
When using control flow, it's possible to use a Get Metadata activity to retrieve a list of files in a blob storage account and then pass that list to a ForEach activity with the Sequential flag set to false, which processes all files concurrently (in parallel), up to the maximum batch size, using the activities defined inside the ForEach loop.
However, when reading about data flows in the following article from Microsoft (https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-column-pattern), they indicate the following:
A mapping data flow will execute better when the Source transformation iterates over multiple files instead of looping via the For Each activity. We recommend using wildcards or file lists in your source transformation. The Data Flow process will execute faster by allowing the looping to occur inside the Spark cluster. For more information, see Wildcarding in Source Transformation.
For example, if you have a list of data files from July 2019 that you wish to process in a folder in Blob Storage, below is a wildcard you can use in your Source transformation.
DateFiles/*_201907.txt
By using wildcarding, your pipeline will only contain one Data Flow activity. This will perform better than a Lookup against the Blob Store that then iterates across all matched files using a ForEach with an Execute Data Flow activity inside.
Based on this, I have configured a data flow where the source is a blob directory of files, and it processes all files in that directory with no control loops. However, I don't see any option to process files concurrently within the data flow. I do see an Optimize tab where you can set the partitioning option.
Is this option only for splitting a single large file across multiple threads, or does it control how many files are processed concurrently within the directory the source points to?
Is the documentation assuming the ForEach control loop is set to Sequential? (I can see why the claim would hold if it were, but I have a hard time believing it if the data flow runs one file at a time.)
Inside data flow, each source transformation will read all of the files indicated in the folder or wildcard path and store those contents into data frames in memory for processing.
Setting the partitioning manually from the Optimize tab tells ADF which partitioning scheme you wish to use inside Spark.
To process each file individually, one by one, use the control flow capabilities in the pipeline.
Iterate over each file you wish to process and send the file name into the data flow via a parameter, inside a ForEach activity with execution set to Sequential.
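Data flows execute on Spark under the hood, so it may help to see conceptually (this is an illustration, not the code ADF generates) how a wildcard source and the Optimize tab's partitioning relate. The storage path below is hypothetical:

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

# A wildcard source reads every matched file into one distributed DataFrame;
# the "looping" over files happens inside the Spark cluster, not in the pipeline.
df = (spark.read
      .option("header", True)
      .csv("abfss://container@account.dfs.core.windows.net/DateFiles/*_201907.txt")
      .withColumn("source_file", input_file_name()))  # records which file each row came from

# Manually setting partitioning on the Optimize tab corresponds to choosing a
# partitioning scheme like this; it shapes parallelism, not a per-file loop.
df = df.repartition(64)
print(df.rdd.getNumPartitions())
```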
Is there a way to bulk import documents and their metadata into Alfresco? What I want is to upload a bunch of documents and inject their metadata from an XML file.
Thanks in advance.
The link that Abbas pointed you to is the best option. The Bulk File System Import Tool supports bulk importing content as well as metadata.
Write a script that exports your spreadsheet into the format the BFSIT expects. Then upload your content and each of the content's metadata descriptor files (generated from your spreadsheet) to the server. Finally, run the import.
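As a rough illustration of such an export script (the CSV layout, file names, and property QNames are hypothetical; check the exact descriptor format against the BFSIT documentation), it could look something like this:

```
import csv
from xml.sax.saxutils import escape

# Minimal sketch, assuming a CSV with a "filename" column plus one column per
# Alfresco property QName (e.g. cm:title, cm:description).
with open("metadata.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        filename = row.pop("filename")
        entries = ['  <entry key="type">cm:content</entry>']
        for qname, value in row.items():
            entries.append(f'  <entry key="{qname}">{escape(value)}</entry>')
        descriptor = (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">\n'
            '<properties>\n' + "\n".join(entries) + "\n</properties>\n"
        )
        # BFSIT expects the descriptor next to the content file,
        # named <content file>.metadata.properties.xml
        with open(f"{filename}.metadata.properties.xml", "w", encoding="utf-8") as out:
            out.write(descriptor)
```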
If instead you are not trying to import files and metadata, but just want to set metadata from your spreadsheet on a bunch of existing content that is already in the repository, then you can write a script that reads your spreadsheet and uses something like Python cmislib or OpenCMIS (both from Apache Chemistry) to set that metadata on objects in the repository in bulk.
You can also use CMIS to upload files, but the BFSIT is much more efficient.
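For that metadata-only case, a minimal cmislib sketch might look like the following (the URL, credentials, CSV layout, and property names are hypothetical, and properties that live on Alfresco aspects may need the aspect applied first):

```
import csv
from cmislib import CmisClient

# Hypothetical CMIS endpoint and credentials; use your repository's CMIS URL.
client = CmisClient(
    "https://alfresco.example.com/alfresco/api/-default-/public/cmis/versions/1.1/atom",
    "admin", "admin")
repo = client.defaultRepository

# Assumed CSV layout: a "path" column with the repository path of each document,
# plus one column per property to set (e.g. cm:title).
with open("metadata.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        doc = repo.getObjectByPath(row.pop("path"))
        doc.updateProperties(row)  # set the remaining columns as CMIS properties
```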
I have been looking for a way to automate my data loads into Vertica instead of manually exporting flat files each time, and stumbled upon the ETL tool Talend.
I have been working with a test folder containing multiple CSV files and am attempting to build a job so the files can be loaded into Vertica.
However, I see that in the (free) Open Studio version, if your files do not have the same schema, this becomes next to impossible without the dynamic schema option, which is only in the enterprise version.
I start with tFileList and attempt to iterate through tFileInputDelimited, but the schemas are not uniform, so of course it will stop the processing.
So, long story short, am I correct in assuming that there is no way to automate data loads in the free version of Talend if you have a folder consisting of files with different schemas?
If anyone has any suggestions for other open source ETLs to look at or a solution that would be great.
You can access the CURRENT_FILE variable from a tFileList component and then send each file down a different route depending on the file name. You'd then create a tFileInputDelimited for each file. For example, if you had two files named file1.csv and file2.csv, right-click the tFileList and choose Trigger > Run If. In the Run If condition, type ((String)globalMap.get("tFileList_1_CURRENT_FILE")).toLowerCase().matches("file1.csv") and drag it to the tFileInputDelimited set up to handle file1.csv. Do the same for file2.csv, changing the file name in the Run If condition.