Azure Data Factory: file content replace in the Azure Blob - azure-data-factory

Good morning,
We have Azure Data Factory (ADF) and two files, currently in Azure Blob storage, that we want to merge into one another. Below are the contents of the files. We are trying to take the contents of File2.txt and replace the '***' in File1.txt with them. When finished, it should look like File3.txt.
File1.txt
OP01PAMTXXXX01997
***
CL9900161313
File2.txt
ZCBP04178 2017052520220525
NENTA2340 2015033020220330
NFF232174 2015052720220527
File3.txt
OP01PAMTXXXX01997
ZCBP04178 2017052520220525
NENTA2340 2015033020220330
NFF232174 2015052720220527
CL9900161313
Does anyone know how we can do this? I have been working with this for 2 days and it would seem that this should not be a difficult thing to do.
All the best,
George

You can merge two or more files using ADF, but I can't see a way to merge with a condition or to control the way the files are merged, so what I can recommend is to use an Azure Function and do the merge programmatically.
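As a rough sketch of what that programmatic merge could look like in Python with the azure-storage-blob SDK (the container name, the AzureWebJobsStorage setting, and the output file name here are my assumptions, not something given in the question):

# Sketch: replace the '***' marker in File1.txt with the contents of File2.txt.
# Container name, connection-string setting and output name are placeholders.
import os
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(os.environ["AzureWebJobsStorage"])
container = service.get_container_client("input-files")

file1 = container.download_blob("File1.txt").readall().decode("utf-8")
file2 = container.download_blob("File2.txt").readall().decode("utf-8")

# Swap the marker line for the full contents of File2.
merged = file1.replace("***", file2.rstrip("\n"))

container.upload_blob("File3.txt", merged, overwrite=True)

You would wrap this in whatever trigger suits you (HTTP, a blob trigger, or a call from an ADF pipeline).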
If you just want to merge files without preserving line order, use my approach:
Create a pipeline.
Add a Copy activity.
In the Copy activity, use these basic settings:
In the source, choose Wildcard file path (select the folder the files are located in) and write "*" as the wildcard file name; this guarantees that all files under that folder are picked up and merged.
In the sink, make sure to set Copy behavior to Merge files.
Output:

Related

Azure Data Factory - What is the fastest way to copy a lot of files from OnPrem to blob storage when they are deeply nested

I need to get two different Excel files that are nested within 360 parent directories (XXXX), each with a \ME (Month End) directory, then a year directory, and finally a yyyymm directory.
Example: Z500\ME\2022\202205\Z500_contributions_202205.xls.
I tried the Copy data activity and killed it while it was still spinning on the listing-source step. I thought about the Lookup and Get Metadata activities, but those have a limit of 5,000 rows. Any thoughts on what would be the fastest way to do this?
Here is the code for creating the file list; I'll clean the results up in Excel:
dir L:*.xls /s /b > "C:\Foo.txt"
Right now I am creating a list of files with the DOS dir command, hoping that if I give the Copy activity a file list it will run faster because it doesn't have to go through the "list sources" step and interrogate the filesystem.
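If a small script is an option, a rough Python equivalent of that file-list step (assuming the fixed-depth layout from the example above and the L: drive mapping) would be:

# Sketch: build the file list from the known XXXX\ME\yyyy\yyyymm layout
# instead of a full recursive scan. Drive letter and output path are from the example.
import glob

paths = glob.glob(r"L:\*\ME\*\*\*.xls")   # e.g. L:\Z500\ME\2022\202205\Z500_contributions_202205.xls

with open(r"C:\Foo.txt", "w") as f:
    f.write("\n".join(paths))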
Thoughts on an ADF option?
If you are facing issues with the Copy activity, you can instead try azcopy, which can also be used for copying from on-premises to Blob storage.
You can try the command below:
azcopy copy "<local path>/*" "https://<storage account name>.blob.core.windows.net/<container name>/<path to blob>" --recursive=true --include-pattern "*.xlsx"
Please go through the Microsoft documentation to learn how to use azcopy.
The above command copies all the Excel files from the nested folders recursively, but it copies the folder structure to Blob storage as well.
After copying to Blob storage, you can use Start-AzureStorageBlobCopy in PowerShell to copy all the Excel files from the nested folders into a single folder.
Start-AzureStorageBlobCopy -SrcFile $sourcefile -DestCloudBlob "Destination path"
Please refer to this SO thread on listing the files in the blob recursively.
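If you would rather script that flattening step, a rough sketch with the azure-storage-blob Python SDK (the connection string, container name, and "flat/" prefix below are placeholders) could look like this:

# Sketch: copy every .xls blob from the nested folders into a single "flat" folder.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection string>")
container = service.get_container_client("<container name>")

for blob in container.list_blobs():
    if blob.name.lower().endswith(".xls"):
        source = container.get_blob_client(blob.name)
        flat_name = "flat/" + blob.name.split("/")[-1]   # drop the folder prefix
        container.get_blob_client(flat_name).start_copy_from_url(source.url)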
If you are creating the list of files on-premises, then you can use either azcopy or the Copy activity, as you wish.
In my tests I used azcopy with a SAS token; you can use it either with SAS or with Azure Active Directory, as mentioned in the documentation above.

Copy activity with simultaneous renaming of a file. From blob to blob

I have a Copy data activity in Azure Data Factory. I want to copy .csv files from blob container X to blob container Y. I don't need to change the content of the files in any way, but I want to add a timestamp to the name, i.e. rename them. However, I get the following error: "Binary copy does not support copying from folder to file". Both the source and the sink are set up as binary.
If you want to copy the files and rename them, your pipeline should look like this:
Create a Get Metadata activity to get the file list (dataset Binary1).
Create a ForEach activity to copy each file, with items set to @activity('Get Metadata1').output.childItems.
Inside the ForEach, create a Copy activity with source dataset Binary2 (same as Binary1) that takes a dataset parameter to specify the source file.
In the Copy activity sink settings, create the sink dataset Binary3, also with a parameter, to rename the files:
@concat(split(item().name,'.')[0],utcnow(),'.',split(item().name,'.')[1])
Run the pipeline and check the output.
Note: the example I made just copies the files to the same container but with a new name.
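For comparison, if you ever need the same rename-on-copy outside ADF, a rough sketch with the azure-storage-blob Python SDK (connection string and container names below are placeholders) would be:

# Sketch: copy every .csv blob from container X to container Y,
# appending a UTC timestamp to the file name.
from datetime import datetime, timezone
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection string>")
source = service.get_container_client("x")
target = service.get_container_client("y")

stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")

for blob in source.list_blobs():
    if blob.name.lower().endswith(".csv"):
        base, ext = blob.name.rsplit(".", 1)
        target.get_blob_client(f"{base}_{stamp}.{ext}").start_copy_from_url(
            source.get_blob_client(blob.name).url)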

Youtube-dl how to get a direct download link to the merged file without creating a temp file on server

Is there any way to create a direct download link to the merged file without creating a temp file on the server in youtube-dl?
youtube-dl -f 255+160 https://youtu.be/p-flvm1szbI
The above command will merge the two formats and output the merged file.
I want to allow users to directly download the merged file to their computers -- without creating any temp file on my server. Is this possible?
(Creating a temp file and then letting the user download it is already possible.)

Zip files with encryption in a remote share, keeping original names and location

My team faces the need to encrypt all files in a repository with AES256. For this purpose, we decided we are going to zip all files with such encryption, using the same key for all of them.
The problem we have is that these files sit on a NAS, so from Windows boxes they are accessible via UNC paths (\\server\share).
The directory structure is something like this:
Original Structure:
Root
-1
|--folder1
|---file1.ext
|---file2.ext
|--folder2
|---filea.ext
|---fileb.ext
|--folder2.a
|---filec.ext
and so on...
Essentially, what we need is to have all the original files contained in a zip file, keeping their original names, which would be something like this:
Desired Outcome:
|-Root
|-1
|--folder1
|---file1.zip
|---file2.zip
|--folder2
|---filea.zip
|---fileb.zip
|--folder2a
|---filec.zip
and so on...
To accomplish this, we tried a batch script that calls 7-Zip, but it only works if it's run from the root directory, which is something we cannot do as the files are not on a local drive.
Here is the syntax of the batch script we came up with:
FOR /R %%i IN ("*.wmv") DO "C:\Program Files\7-Zip\7z.exe" a -mx0 -tzip -mem=AES256 -pPasswordHere "%%~dpni.zip" "%%i"
But, as written previously, it only works when run from the root folder, which is something we cannot do as the files sit on a network location.
Mapping the drive or making a symbolic link to it doesn't do the trick either.
I've also looked at getting 7-Zip to do this on its own, namely by using its "-r" switch, but I couldn't find a way to get the desired outcome (that is, recurse through all folders in the remote tree structure, and there are a lot of them, while keeping the original file names).
I'm open to any suggestions, as any kind of script, trick or gizmo that gets the job done will be more than welcome. =)
Thanks a million in advance!
Sebas.
----SOLUTION----
I actually found a solution here, mapping the drive in a different way (it's so simple it just made me feel stupid(er), but it's altogether beautiful).
You can map a drive using
net use X: \\server\directory
and then you can change to that directory using
pushd X:
With the share mapped and pushd run, the FOR /R script above works as intended.
(Post the answer was taken from: Batch File Iterating through files on a local network server)
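For anyone who would rather script this in one go, here is a rough Python sketch using the third-party pyzipper library (the library, UNC path, and password below are my assumptions, not part of the original script) that walks the share over its UNC path and zips each file next to the original with AES-256:

# Sketch: zip every file under the share with AES-256, keeping names and locations.
# Requires the third-party pyzipper package; UNC path and password are placeholders.
import os
import pyzipper

ROOT = r"\\server\share\Root"
PASSWORD = b"PasswordHere"

for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        if name.lower().endswith(".zip"):
            continue                      # skip archives already created
        src = os.path.join(dirpath, name)
        dst = os.path.splitext(src)[0] + ".zip"
        with pyzipper.AESZipFile(dst, "w",
                                 compression=pyzipper.ZIP_STORED,
                                 encryption=pyzipper.WZ_AES) as zf:
            zf.setpassword(PASSWORD)
            zf.write(src, arcname=name)

ZIP_STORED mirrors the -mx0 (no compression) switch from the 7-Zip command above.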

flexible merge command for unison to pick newer or older file?

I've been using unison as my file synchronizer of choice and life has been great.
Essentially I could modify any files on any side at any time without ever worrying about which is master and which is slave, etc. It's bidirectional.
However, with four roots failing over to each other when each one's primary partner cannot be reached, I'm starting to push the limits of this tool. Conflicts arise that halt automatic syncing for the files involved. Aspects of my business logic are distributed across the different hosts, which sometimes modify the same files when run.
The merge option in the configuration file comes into play. It lets you specify different merge commands for different file types.
For example for log files only I like to interpolate their lines with:
merge = Name *.log -> diff3 -m CURRENT1 CURRENTARCH CURRENT2 > NEW || echo "differences detected"
Question: for *.last files only, what merge command would always favor the older copy?
For *.rb *.sh and other source files, I'm not looking to merge but always pick the newer version in case of conflicts. I can do that by default with the prefer = newer global option though.
For *.png files I typically prefer to keep the smaller(optimized) size.
Regarding the .rb and .sh files, you could use preferpartial = Name *.rb -> newer and the same for .sh files. For .last files, you can use older instead.
Regarding .png files, you could write your own merge command that checks the size of both files. I would then set merge = Name *.png -> mycmp CURRENT1 CURRENT2 NEW, and have the mycmp command take three file paths, compare the sizes of the first two, and copy the smaller one to the third path.
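A minimal sketch of that mycmp helper in Python (the name mycmp is just the placeholder used above) could be:

#!/usr/bin/env python3
# Sketch of the mycmp merge helper: keep the smaller of the two current
# versions of a .png file. Unison calls it as: mycmp CURRENT1 CURRENT2 NEW
import os
import shutil
import sys

current1, current2, new = sys.argv[1:4]

# Copy whichever version is smaller to the path unison expects the result in.
smaller = current1 if os.path.getsize(current1) <= os.path.getsize(current2) else current2
shutil.copyfile(smaller, new)

With that script on the PATH (or referenced by full path), the merge line itself stays in the same Name *.png -> ... form as the example above.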