Merge files in Azure Blob using powershell - powershell

I have a requirement to Merge multiple files with same keyword having different timestamp in Azure blobs and move them from one folder to another so that downstream service can consume them.
I was able to move the files from one folder to other but I don't find an option to concatenate them using powershell (With in Blob folders) into one single file. Is there any way to achieve this specifically using powershell ?
Note: All files in the folder are text/csv files with same layout.

Copy Blob operation cannot concatenate/join/combine blobs, it’s intended as a background copy operation and can only make "copies" the destination blob will be overwritten each time
There you may use case it will require the content of all the required blobs to be retrieved, a new blob constructed locally and then uploaded to destination
get-azurestorageblobcontent will bring down the blob content
Reference: https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-powershell#download-blobs Concatenate Locally
Upload blobs to the container: https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-powershell#upload-blobs-to-the-container

Related

Move Entire Azure Data Lake Folders using Data Factory?

I'm currently using Azure Data Factory to load flat file data from our Gen 2 data lake into Synapse database tables. Unfortunately, we receive (many) thousands of files into timestamped folders for each feed. I'm currently using Synapse external tables to copy this data into standard heap tables.
Since each folder contains so many files, I'd like to move (or Copy/Delete) the entire folder (after processing) somewhere else in the lake. Is there some practical way to do that with Azure Data Factory?
Yes, you can use copy activity with a wild card. I tried to reproduce the same in my environment and I got the below results:
First, add source dataset and select wildcard with folder name. In my scenario, I have a folder name pool.
Then select sink dataset with file path
The pipeline run is successful. It transferred the file from one location to another location with the required name. Look at the following image for reference.

Infer Schema from .csv file appear in azure blob and generaate DDL script automatically using Azure SQL

Every time the .csv file appearing in the blob storage, i have to create DDL from that manually on azure sql. The data type is based on the value specified for that field.
The file have 400 column, and manually it is taking lots of time.
May someone please suggest how to automate this using SP or script, so when we execute the script, it will create TABLE or DDL script, based on the file in the blob storage.
I am not sure if it is possible, or is there any better way to handle such scenario.
Appreciate yours valuable suggestion.
Many Thanks
This can be achieved in multiple ways. As you mentioned about automating it, you can use Azure function as well.
Firstly create a function that reads the csv file from blob storage:
Read a CSV Blob file in Azure
Then add the code to generate the DDL statement:
Uploading and Importing CSV File to SQL Server
Azure function can be scheduled or run when new files are added to blob storage.
If this is once a day kind of requirement and can manually be done as well, we can download the file from blob and use the 'Import Flat File' functionality available within SSMS where we can just specify the csv file and it creates the schema based on existing column values.

Load ForEach SQL into single CSV in Azure Data Factory

I have an ADF where I am executing a stored procedure in a ForEach and using Copy Data to load the output into a CSV file
On each iteration of the ForEach the CSV is being cleared down and loading that iteration's data
I require it to preserve the already loaded data and insert the output from the iteration
The CSV should have a full dataset of all iterations
How can I achieve this? I tried using the "Merge Files" option in the Sink Copy Behavior but doesn't work for SQL to CSV
As #All About BI mentioned, currently the append behavior which you are looking for is not supported.
You can raise a feature request from the ADF portal.
Alternatively, you can check the below process to append data in CSV.
In my repro, I am generating the loop items using Set Variable activity and passing it to ForEach activity.
Inside ForEach activity, using copy data activity, executing the stored procedure in Source, and copying data of Stored procedure to a CSV file.
In the Copy data activity sink, generate the file name using the current item of ForEach loop, to get data into different files for each iteration. Also adding a constant to identify the file name which can be deleted at the end after merging the files.
File_name: #concat('1_sp_sql_data_',string(item()),'.csv')
Add another copy data activity after the ForEach activity, to combine all the files data from the ForEach iteration to a single file. Here I am using the wildcard path (*) to get all files from the folder.
In Sink, add the destination filename with copy behavior as Merge files to copy all source data to a single sink file.
After merging the files data is copied to a single file, but the files will not be deleted. So when you run the pipeline next time, there is a chance the old files were also been merged with new files again.
• To avoid, this adding delete activity to delete the files generated in ForEach activity.
• As I have added a constant to generate these files, it will be easy to delete the files based on the filename (deleting all files which start with “1_”).
Destination file:
You could try this -
Load each iteration data to a separate csv file.
Later union them all or merge
As right now we don't have the ability to append rows in csv.

How to remove extra files when sinking CSV files to Azure Data Lake Gen2 with Azure Data Factory data flow?

I have done data flow tutorial. Sink currently created 4 files to Azure Data Lake Gen2.
I suppose this is related to HDFS file system.
Is it possible to save without success, committed, started files?
What is best practice? Should they be removed after saving to data lake gen2?
Are then needed in further data processing?
https://learn.microsoft.com/en-us/azure/data-factory/tutorial-data-flow
There are a couple of options available.
You can mention the output filename in Sink transformation settings.
Select Output to single file from the dropdown of file name option and give the output file name.
You could also parameterize the output file name as required. Refer to this SO thread.
You can add delete activity after the data flow activity in the pipeline and delete the files from the folder.

Parse multiple json files in one activity

I would like to use data factory to regularly download 500000 json files from a web API and store them in a blob storage container. Then I need to parse the json files to extract some values from each file and store these values together with an ID (part of filename) in a database. I can do this using a ForEach activity and run a custom activity for each file, but this is very slow, so I would prefer some batch activity which could run the same parsing code on each file. Is there some way to do this?
If your source json files have same schema, you can leverage the Copy Activity which can parse those files in a single run. But if possible, I would suggest to split those files into different sub folder (e.g. 1000 files per folder), so that each copy run needs less time and ease the management.
Refer to this doc for more details: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview