NiFi: How to move CSV content and its metadata to a single table in a Postgres database using NiFi - postgresql

I have CSV files and I want to load the content of each file, along with its metadata (file name, source (to be hard-coded), and control number (part of the file name, to be extracted from the file name itself)), into a single table using NiFi. Here is a sample file name and layout:
File name - 12345_user_data.csv (control_number_user_data.csv)
source - Newyork
CSV File Content/columns - 
Fields - abc1, abc2, abc3, abc4 
values - 1,2,3,4
Postgres Database table layout
Table name - User_Education
fields name  -
control_number, file_name, source, abc1, abc2, abc3, abc4
Values - 
12345, 12345_user_data.csv, Newyork, 1,2,3,4
I am planning to use the processors below:
ListFile
FetchFile
UpdateAttribute
PutDatabaseRecord
LogAttribute
But I am not sure how to combine the actual content with the metadata to load them into one single table. Please help.

You can use UpdateRecord before PutDatabaseRecord to add the control_number, file_name, and source fields to each record. Set the "Replacement Value Strategy" property to "Literal Value" and use Expression Language to set each field's value from the corresponding attribute.
For example, you could have a user-defined property /file_name set to ${filename}, that will add the file_name field to each record and set the value to whatever is in the "filename" attribute of the FlowFile.
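To make the effect of that UpdateRecord step concrete, here is a minimal Python sketch of the same transformation outside NiFi. It is not NiFi code: the function name enrich_records is made up for illustration, the source value is hard-coded as in the question, and the control number is assumed to be everything before the first underscore in the file name. Inside NiFi, the equivalent extraction could be done with Expression Language, for instance ${filename:substringBefore('_')}.

```python
import csv
import io


def enrich_records(filename, csv_text, source="Newyork"):
    """Add file-level metadata to every CSV record (illustration only)."""
    # Assumption: the control number is the part before the first underscore.
    control_number = filename.split("_", 1)[0]
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    rows = []
    for record in reader:
        # Metadata fields first, then the original CSV columns.
        row = {
            "control_number": control_number,
            "file_name": filename,
            "source": source,
        }
        row.update(record)
        rows.append(row)
    return rows


rows = enrich_records("12345_user_data.csv", "abc1,abc2,abc3,abc4\n1,2,3,4\n")
print(rows[0])
```

Each resulting row has the seven columns of the User_Education table, which is the shape PutDatabaseRecord needs to receive.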

Related

matching the columns in a source file with sink table columns to make sure they match using Azure Data Factory

I have an Azure Data Factory trigger that fires when a file is placed in blob storage. The trigger starts pipeline execution and passes the file name to the data flow activity. I would like to make sure that all the column names from the header row in the file are in the sink table. There is an identity column in the sink table that should not be part of the comparison. I'm not sure how to tackle this task. I've read about the 'derived column' activity; is that the route I should take?
You can select or filter which columns are written to the sink dataset or table by using "Field mapping". You can optionally use a "derived column" transformation, but the sink transformation already provides mapping, which is set to "Auto mapping" by default. There you can add or remove the columns that are written to the sink.
In the example below, the column "id" can be treated like the "Identity" column in your table, assuming all the files have the same columns:
Once you have adjusted the mapping to your needs, you can confirm the result in the "Inspect" tab before running.
Strategy:
Use two ADF pipelines, one to get a list of all files and another one to process each file copying its content to a specific SQL table.
Setup:
I’ve created 4 CSV files, following the pattern you need: “[CustomerID]_[TableName]_[FileID].csv” and 4 SQL tables, one for each type of file.
A_inventory_0001.csv: inventory records for customer A, to be inserted into the SQL table “A_Inventory”.
A_sales_0003.csv: sales records for customer A, to be inserted into the SQL table “A_Sales”.
B_inventory_0002.csv: inventory records for customer B, to be inserted into the SQL table “B_Inventory”.
B_sales_0004.csv: sales records for customer B, to be inserted into the SQL table “B_Sales”.
Linked Services
In Azure Data Factory, the following linked services were created using Key Vault (Key Vault is optional).
Datasets
The following datasets were created. Note we have created some parameters to allow the pipeline to specify the source file and the destination SQL table.
The dataset “AzureSQLTable” has a parameter to specify the name of the destination SQL table.
The dataset “DelimitedTextFile” has a parameter to specify the name of the source CSV file.
The dataset “DelimitedTextFiles” has no parameter because it will be used to list all files from source folder.
Pipelines
The first pipeline “Get Files” will get the list of CSV files from source folder (Get Metadata activity), and then, for each file, call the second pipeline passing the CSV file name as a parameter.
Inside the foreach loop, there is a call to the second pipeline “Process File” passing the file name as a parameter.
The second pipeline has a parameter “pFileName” to receive the name of the file to be processed and a variable to calculate the name of the destination table based on the file name.
The first activity uses a split on the file name to extract the parts needed to compose the destination table name.
In the expression below, we split the file name on the “_” separator and then use the first and second parts to compose the destination table name.
@concat(string(split(pipeline().parameters.pFileName, '_')[0]),'_',string(split(pipeline().parameters.pFileName, '_')[1]))
The second activity then copies the file from the source “pFileName” to the destination table “vTableName” using dynamic mapping, i.e. without specifying column names, since these vary per file.
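As a quick way to check what the split expression produces for a given file name, here is a small Python sketch of the same logic. It is only an illustration, not ADF code: the function name destination_table is hypothetical, and it assumes every file name has at least three underscore-separated parts, as in the setup above.

```python
def destination_table(file_name):
    """Compose '<CustomerID>_<TableName>' from '<CustomerID>_<TableName>_<FileID>.csv'."""
    parts = file_name.split("_")
    if len(parts) < 3:
        # Assumption: anything not matching the naming pattern is rejected.
        raise ValueError(f"unexpected file name: {file_name!r}")
    # Same as split(...)[0] + '_' + split(...)[1] in the pipeline expression.
    return f"{parts[0]}_{parts[1]}"


for name in ["A_inventory_0001.csv", "B_sales_0004.csv"]:
    print(name, "->", destination_table(name))
```

Note that the result keeps the casing of the file name (for example A_inventory), so this relies on the database resolving table names case-insensitively or on the files being named to match the tables.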
The files I used in this example and the ADF code are available here:
https://github.com/diegoeick/stack-overflow/tree/main/69340699
I hope this will resolve your issue.
In case you still need to save the CustomerID and FileID in the database tables, you can use dynamic mapping: take the available parameters (file name) and create a JSON with the dynamic mapping in the mapping tab of your copy activity. You can find more details here: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#parameterize-mapping

Merge different schema from blob container into a single sql table

I have to read 10 files with different schemas from a folder in a blob container (most of the schema matches across the files) and merge them into a single SQL table.
file 1: lets say there are 25 such columns
file 2: Some of the column in file2 matches with columns in file1
file 3:
output: a sql table
How do I set up a pipeline in Azure Data Factory to merge these columns into a single SQL table?
My approach:
Get Metadata activity ---> ForEach over childItems ---> Copy activity
For the mapping, I constructed a JSON that contains all the source/sink columns from these files.
You can create a JSON file that contains each source file name and its TabularTranslator. Then use a Lookup activity to get this file's content (don't check "First row only"). Loop over this array in a ForEach activity and pass the source file name to your dataset. Finally, create a Copy Data activity and use the TabularTranslator as your mapping.
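One possible way to produce such a JSON file is to generate it from a simple per-file column map, sketched here in Python. The TabularTranslator shape (a "type" plus a "mappings" list of source/sink names) follows the copy activity mapping format; everything else is an assumption for illustration: the keys fileName and translator, the helper name tabular_translator, and the file and column names.

```python
import json


def tabular_translator(column_map):
    """Build a TabularTranslator object from {source_column: sink_column}."""
    return {
        "type": "TabularTranslator",
        "mappings": [
            {"source": {"name": src}, "sink": {"name": dst}}
            for src, dst in column_map.items()
        ],
    }


# Hypothetical files and columns; replace with the real ones.
files = {
    "file1.csv": {"id": "id", "name": "name"},
    "file2.csv": {"id": "id", "city": "city"},
}

# One array element per source file, as consumed by Lookup + ForEach.
config = [
    {"fileName": file_name, "translator": tabular_translator(column_map)}
    for file_name, column_map in files.items()
]
config_json = json.dumps(config, indent=2)
print(config_json)
```

With a layout like this, the ForEach would read the file name from one property of the current item and hand the other property to the Copy activity as its mapping.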

How to keep unstructured file name as value and insert into database

I'm new to IBM DataStage. I need to keep the name of the unstructured file that I set in the file path as a value. Then I need to insert that value into the original_file column of my table for all rows automatically. Is there any way to do this?
Assuming the file name is a job parameter and will be provided on each job run, you could use a Transformer: add a new column "original_file" and use the parameter name as its derivation.
Note: a parameter, e.g. file_name, is referenced in DataStage as #file_name# (for example, in the file stage) but is referenced in the Transformer as file_name (without the #s).

Filename as Column using Data Factory V2

I have a lot of JSON files in Blob Storage, and I would like to load them via Data Factory V2 into SQL Data Warehouse. I would like the file name in a column for each JSON file. I know how to do this in SSIS, but I am not sure how to replicate it in Data Factory.
e.g. File name: CornerShop.csv as CornerShop in the filename column in SQL Data Warehouse
First, please see the limitations of the copy activity column mapping:
Source data store query result does not have a column name that is specified in the input dataset "structure" section.
Sink data store (if with pre-defined schema) does not have a column name that is specified in the output dataset "structure" section.
Either fewer columns or more columns in the "structure" of sink dataset than specified in the mapping.
Duplicate mapping.
So I don't think you can transfer the data and add the file name in a single step. My idea is:
1. First, use a GetMetadata activity to get the file paths of the files you want to copy. Use "Child Items" in the Field list.
2. On success of the GetMetadata activity, run a ForEach activity. For the ForEach activity's Items, pass the list of file paths.
3. Inside the ForEach activity's Activities, place the Copy activity. Reference the iterated item with @item() or @item().name as the blob storage source file name.
4. Meanwhile, pass the file name as a parameter to a stored procedure. In the stored procedure, merge the file name into the fileName column.
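The end result of these steps can be sketched in Python as follows. This is not Data Factory code, only an illustration of what each loaded row should look like: the function name add_filename_column is hypothetical, each file is assumed to hold a JSON array of objects, and the sample record is made up.

```python
import json
import os


def add_filename_column(file_name, json_text):
    """Return the file's records with the extension-less file name attached."""
    # "CornerShop.csv" -> "CornerShop", as requested in the question.
    base_name = os.path.splitext(os.path.basename(file_name))[0]
    # Assumption: the file content is a JSON array of objects.
    records = json.loads(json_text)
    return [{**record, "filename": base_name} for record in records]


loaded = add_filename_column("CornerShop.csv", '[{"item": "milk", "qty": 2}]')
print(loaded)
```

In the pipeline, the stored procedure in step 4 plays the role of this function, receiving the file name from the ForEach item.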

Check the details of columns which have attributes applied to them

In splayed tables, we can find the details/order of the columns in the .d file.
Is there any file that maintains the attribute information of the columns in a table?
How can we find the details of the attributes in the file system?
t:([] a:1 2 3; b:4 5 6; c:`a`b`c)
`:/home/st/ set .Q.en[`:/home/st;t]
get `:/home/st/.d / Output - `a`b`c
@[`:/home/st/;`a;`s#] / Is there any place in the file system where we can find the attribute applied to a column
meta get `:/home/st/ / Show that attribute s is applied on column a
Attribute details are stored in the column file itself. For example, in your case the file /home/st/a will contain the sorted attribute information.
But since these files are serialized data (binary format), and the structure of splayed binary data is not open, we cannot get the attribute information directly from the file.
You can actually read the attributes from columns on disk, it's just not recommended (and potentially subject to change):
q){(0x0001020304!``s`u`p`g)first read1(x;3;1)}`:st/a
`s
q){(0x0001020304!``s`u`p`g)first read1(x;3;1)}`:st/b
`
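For reading the same byte from outside q, a rough Python equivalent might look like the sketch below. It relies on exactly the same undocumented assumption as the q snippet, namely that the byte at offset 3 of a column file encodes the attribute as 0 to 4, so it is equally subject to change. The names ATTRIBUTES, attribute_from_header, and column_attribute are hypothetical.

```python
# Assumed mapping, taken from the q snippet above: 0x00..0x04 -> ``s`u`p`g
ATTRIBUTES = {0: "", 1: "s", 2: "u", 3: "p", 4: "g"}


def attribute_from_header(header):
    """Decode the attribute from the first bytes of a column file."""
    if len(header) < 4:
        raise ValueError("need at least 4 bytes of the column file")
    # Byte at offset 3, the same one read by read1(x;3;1).
    return ATTRIBUTES.get(header[3], "unknown")


def column_attribute(path):
    """Read the first 4 bytes of a splayed column file and decode its attribute."""
    with open(path, "rb") as column_file:
        return attribute_from_header(column_file.read(4))
```

Given the q session above, column_attribute("st/a") would be expected to return "s" and column_attribute("st/b") an empty string.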