Export column content from MongoDB and replace it with a file path

I'm new to MongoDB. I'm trying to export the contents of a column from a MongoDB collection into flat files, store those files in Azure Blob Storage, and replace the column content with the path to the exported file.
Originally, PDF files were stored in a column of the collection, but the decision has now been made to export the column contents back into PDF files and reference the files by their location in that same column instead.
Hope this makes sense.
Thank you
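For orientation, a minimal sketch of the kind of migration described, assuming the Node.js MongoDB driver and the @azure/storage-blob SDK; the database, collection, and field names (mydb, documents, pdfContent) and the exported-pdfs container are made-up names for illustration, not anything prescribed by MongoDB or Azure.

import { MongoClient, Binary } from 'mongodb';
import { BlobServiceClient } from '@azure/storage-blob';

// Hypothetical connection strings -- adjust to your environment.
const mongoUri = process.env.MONGO_URI!;
const blobConn = process.env.AZURE_STORAGE_CONNECTION_STRING!;

async function exportPdfColumnToBlob(): Promise<void> {
  const mongo = await MongoClient.connect(mongoUri);
  const coll = mongo.db('mydb').collection('documents');

  const container = BlobServiceClient
    .fromConnectionString(blobConn)
    .getContainerClient('exported-pdfs');
  await container.createIfNotExists();

  // Only documents whose column still holds raw PDF bytes (BSON binary).
  const cursor = coll.find({ pdfContent: { $type: 'binData' } });
  for await (const doc of cursor) {
    const blobClient = container.getBlockBlobClient(`${doc._id}.pdf`);

    // Upload the inline PDF bytes as a block blob.
    const bytes = Buffer.from((doc.pdfContent as Binary).buffer);
    await blobClient.uploadData(bytes, {
      blobHTTPHeaders: { blobContentType: 'application/pdf' },
    });

    // Overwrite the same column with the blob path instead of the content.
    await coll.updateOne(
      { _id: doc._id },
      { $set: { pdfContent: blobClient.url } }
    );
  }

  await mongo.close();
}

exportPdfColumnToBlob().catch(console.error);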

Related

Pipeline filter modifies the date every time I run the pipeline, which prevents me from pulling only the last modified date to SQL

Once a week, a file is generated in an on-prem folder. My pipeline pulls that on-prem file to blob storage, then copies from blob to blob; during this part the pipeline filters my data before it goes to SQL. The problem is that when the data gets filtered, the modified date changes, and all the files in blob storage are pulled rather than only the one originally pulled for that week. I have attached images of my pipeline, the on-prem files, and what I filter for.
Instead of relying on the last modified date of the file, you can use the file name.
Since you have the date (yyyyddMM format) in the filename itself, you can build the expected filename dynamically and check whether this file is present in the filtered files list.
Look at the following demonstration. Let's say I have the following 2 files as my filtered files. I used a Get Metadata activity (child items) on the blob storage.
Since we know the format of the filename (SalesWeekly_yyyyddMM.csv), create the current filename value dynamically using the following dynamic content in a Set variable activity (the variable name is file_name_required).
@concat('SalesWeekly_',formatDateTime(utcnow(),'yyyyddMM'),'.csv')
Now, create an array containing all the filenames returned by the Get Metadata activity. The ForEach activity's items value is given as @activity('Get Metadata1').output.childItems.
Inside this, use an Append variable activity with the value @item().name.
Now you have the file name you actually need (dynamically built) and the array of filtered file names. You can check whether this filename is present in the array of filtered file names and take the necessary actions. I used an If Condition activity with the following dynamic content.
@contains(variables('files_names'),variables('file_name_required'))
The following are reference images of the output flow.
When current week file is not present in the filtered files.
When current week file is present in the filtered files.
I have used a Wait activity for the demo here. You can replace it with a Copy activity (from blob to SQL) in the True case. If you don't want to insert anything when the current week's file is missing, leave the False case empty.
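In case it helps to see the logic plainly outside ADF, here is a small sketch of the same check in TypeScript; the SalesWeekly_ prefix and yyyyddMM format come from the answer above, and the pipeline itself only needs the dynamic content already shown.

// Build the expected weekly file name the same way the Set variable activity does.
function expectedFileName(now: Date = new Date()): string {
  const yyyy = now.getUTCFullYear();
  const dd = String(now.getUTCDate()).padStart(2, '0');
  const MM = String(now.getUTCMonth() + 1).padStart(2, '0');
  return `SalesWeekly_${yyyy}${dd}${MM}.csv`; // yyyyddMM, as in the pipeline
}

// Mirror of the @contains(...) check in the If Condition activity.
function currentWeekFilePresent(filteredFileNames: string[]): boolean {
  return filteredFileNames.includes(expectedFileName());
}

// Example: only proceed with the copy to SQL when this prints true.
console.log(currentWeekFilePresent(['SalesWeekly_20230103.csv', 'SalesWeekly_20231002.csv']));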

Azure Data Factory V2 Copy Activity - Save List of All Copied Files

I have pipelines that copy files from on-premises to different sinks, such as on-premises and SFTP.
I would like to save a list of all files that were copied in each run for reporting.
I tried using Get Metadata and ForEach, but I'm not sure how to save the output to a flat file or even a database table.
Alternatively, is it possible to find the list of objects that were copied somewhere in the Data Factory logs?
Thank you
Update:
Items: @activity('Get Metadata1').output.childItems
If you want to record the source file names, yes, we can. As you said, we need to use the Get Metadata and ForEach activities.
I've created a test to save the source file names of the Copy activity into a SQL table.
As we all know, we can get the file list via the Child items of the Get Metadata activity.
The dataset of the Get Metadata1 activity specifies the container, which contains several files.
The list of files in the test container is as follows:
Inside the ForEach activity, we can traverse this array. I set up a Copy activity named Copy-Files to copy files from source to destination.
@item().name represents each file in the test container. I key in the dynamic content @item().name to specify the file name; it will then pass the file names in the test container sequentially. This executes the copy task in batches, and each batch passes in one file name to be copied, so that we can record each file name into the database table later.
Then I set up another Copy activity to save the file names into a SQL table. Here I'm using Azure SQL, and I've created a simple table.
create table dbo.File_Names(
    Copy_File_Name varchar(max)
);
As this post also says, we can use the syntax select '@{item().name}' as Copy_File_Name to access activity data in ADF. Note: the alias name should be the same as the column name in the SQL table.
Then we can sink the file names into the SQL table.
Select the table created previously.
After I run debug, I can see all the file names are saved into the table.
If you want to add more information, you can refer to the post I mentioned previously.
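If you ever need the same kind of record outside ADF (say from a small Node.js job), here is a rough sketch of the equivalent, assuming the @azure/storage-blob and mssql packages and the dbo.File_Names table above; the container name test matches the demo container mentioned earlier, everything else is illustrative.

import { BlobServiceClient } from '@azure/storage-blob';
import * as sql from 'mssql';

// Hypothetical connection strings -- adjust to your environment.
const blobConn = process.env.AZURE_STORAGE_CONNECTION_STRING!;
const sqlConn = process.env.AZURE_SQL_CONNECTION_STRING!;

async function recordCopiedFileNames(): Promise<void> {
  const container = BlobServiceClient
    .fromConnectionString(blobConn)
    .getContainerClient('test');

  const pool = await sql.connect(sqlConn);
  try {
    // One row per blob name, mirroring what the second Copy activity writes.
    for await (const blob of container.listBlobsFlat()) {
      await pool.request()
        .input('name', sql.VarChar(sql.MAX), blob.name)
        .query('insert into dbo.File_Names (Copy_File_Name) values (@name)');
    }
  } finally {
    await pool.close();
  }
}

recordCopiedFileNames().catch(console.error);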

How to index azure blob custom metadata fields to Azure search

I am trying to index blob content into Azure Search. I added the blob content to the search index through the blob indexer.
I am using MongoDB to store the uploaded file information along with the blob path. We have to add some tags to the file, which are stored in MongoDB. Now I want to add these tags into Azure Search for that file, along with the file content.
The problems I am facing are:
Problem 1: Maintaining uniqueness (the search key field) between the MongoDB record and the blob indexer. Initially, I wanted to use the metadata_storage_path from the blob indexer and the base64-encoded blob path that I stored in MongoDB. But the problem is that the metadata_storage_path never matches the base64-encoded blob path from my node.js code.
Problem 2: To solve Problem 1, I tried another approach: store my MongoDB file id (FID) as a custom metadata field on the blob to provide the uniqueness (search key field) between the search index and the MongoDB record. The problem here is: how can I map the custom metadata field to the key field? I am not able to index the blob's custom metadata fields.
In both scenarios I am not able to achieve the expected results. How can I achieve a search index key field shared between MongoDB and Azure blob?
You can use the base64-encoded blob path as the document key, which you can get in both indexers by using a base64 field mapping. Check https://learn.microsoft.com/en-us/azure/search/search-indexer-field-mappings#base64EncodeFunction for all the options to match your node.js encoding function.
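On the Problem 1 mismatch: the base64 used by the indexer's field mapping is, by default, assumed here to follow .NET's HttpServerUtility.UrlTokenEncode style (URL-safe alphabet, '=' padding stripped and its count appended as a trailing digit), which plain Buffer.toString('base64') in Node.js will not reproduce. Below is a sketch of that encoding to compare your values against; verify the default-behaviour assumption against the useHttpServerUtilityUrlTokenEncode option in the linked doc.

// Intended to mimic .NET's UrlTokenEncode-style base64 (an assumption to verify):
// standard base64, '+' -> '-', '/' -> '_', padding removed and its count appended.
function urlTokenEncode(value: string): string {
  const b64 = Buffer.from(value, 'utf8').toString('base64');
  const padCount = (b64.match(/=+$/) || [''])[0].length;
  return b64
    .replace(/\+/g, '-')
    .replace(/\//g, '_')
    .replace(/=+$/, '')
    + String(padCount);
}

// Example with a hypothetical blob URL: store this value in MongoDB so it can be
// compared with the search document key derived from metadata_storage_path.
const key = urlTokenEncode('https://myaccount.blob.core.windows.net/container/file.pdf');
console.log(key);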

Retrieve blob file name in Copy Data activity

I download JSON files from a web API and store them in blob storage using a Copy Data activity with binary copy. Next, I would like to use another Copy Data activity to extract a value from each JSON file in the blob container and store the value together with its ID in a database. The ID is part of the filename, but is there some way to extract the filename?
You can do the following set of activities:
1) A Get Metadata activity: configure a dataset pointing to the blob folder, and add Child Items in the Field list.
2) A ForEach activity that takes every item from the Get Metadata activity and iterates over them. To do this, configure the Items to be @activity('NameOfGetMetadataActivity').output.childItems
3) Inside the ForEach, you can extract the filename of each file using the following expression: @item().name
After this continue as you see fit, either adding functions to get the ID or copy the entire name.
Hope this helped!
After setting up the source dataset (file/file path with a wildcard) and the destination/sink (some table):
Add a Copy activity and set up the source and sink.
Add Additional Columns.
Provide a name for the additional column and the value "$$FILEPATH".
Import the mapping and voila - your additional column should be in the list of source columns, marked "Additional".

How to get column names of CSV files saved in documents directory?

I am importing CSV files saved in the documents directory. There are multiple CSV files stored in the documents directory.
I am getting the list of all CSV files, and I want a selected CSV file to be imported.
When I select any CSV file, I want to get the column names of that CSV file.
How do I fetch only the column names of that CSV file?
How can I do this?
Check this out: http://www.cocoawithlove.com/2009/11/writing-parser-using-nsscanner-csv.html
There is also a sample program at the end. It's a lot of work, but it will help.
Parsing a CSV file will take some effort.
Hopefully these links can help you:
Link 1
Link 2
Link 3
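For just the column names, the underlying idea is to read the first line of the selected file and split it on commas; the linked Cocoa articles do this properly with NSScanner, including quoted fields. Purely as an illustration of that idea (shown in TypeScript rather than Objective-C, and assuming a plain comma-separated header row with no embedded commas):

import { readFileSync } from 'fs';

// Read only the header row of a CSV file and return the column names.
function csvColumnNames(path: string): string[] {
  const firstLine = readFileSync(path, 'utf8').split(/\r?\n/, 1)[0];
  return firstLine.split(',').map(name => name.trim().replace(/^"|"$/g, ''));
}

// Example with a hypothetical file from the documents directory.
console.log(csvColumnNames('/path/to/Documents/sample.csv'));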