NiFi passing date from ExecuteSQL to PutDatabaseRecord - PostgreSQL

I need to insert the result of an SQL query into a Postgres table.
For that I use the ExecuteSQL ("Use Avro Logical Types" set to true) and PutDatabaseRecord ("Statement Type" INSERT, "Record Reader" AvroReader) processors.
The insert doesn't work because NiFi converts the date to the number '1322683200000', while the column in the destination table is of type date.
I suppose I should either add an UpdateRecord processor between the ExecuteSQL and PutDatabaseRecord processors, or use the "Data Record Path" property of PutDatabaseRecord,
but I can't find an example of how to configure UpdateRecord or what to put in "Data Record Path".
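For reference, the number being written looks like epoch milliseconds (the Avro representation of a timestamp); a quick sanity check in plain Python, outside NiFi, decodes it:

    from datetime import datetime, timezone

    # '1322683200000' read as milliseconds since the Unix epoch
    millis = 1322683200000
    print(datetime.fromtimestamp(millis / 1000, tz=timezone.utc))
    # -> 2011-11-30 20:00:00+00:00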

I tried to do the same process on my side, with a dummy select current_date as value; in the ExecuteSQL processor, and then passed the result on to insert the data into the Postgres DB.
The flow was able to insert the date value into the sample table that I created for testing. Can you try the same and see if you have any issues, or provide some samples of your data and of how you are building the data flow?

The insert started to work when I put the columns in the SQL query in the same order as in the destination table.
I thought NiFi was supposed to match columns by name.

Related

Azure Data Factory Copy Pipeline with Geography Data Type

I am trying to get a geography data type from a production DB to another DB on a nightly basis. I really wanted to leverage upsert as the write activity, but it seems that geography is not supported with this method. I was reading a similar post about bringing the data through ADF as a well-known text (WKT) data type and then converting it back, but I keep getting confused about what to do with the data once it has been brought over as well-known text. I would appreciate any advice, thank you.
I tried to utilize ADF pipelines and data flows, and tried to convert the data type once it was in the destination, but then I was not able to run the pipeline again.
I tried to upsert data with the geography data type from one Azure SQL database to another using the Copy activity and got an error message.
Then I did the upsert using a Data flow activity. Below are the steps.
A source table, created as follows, is taken as the source in the data flow:
CREATE TABLE SpatialTable
(
    id int,
    GeogCol1 geography,
    GeogCol2 AS GeogCol1.STAsText()
);

INSERT INTO SpatialTable (id, GeogCol1)
VALUES (1, geography::STGeomFromText('LINESTRING(-122.360 46.656, -122.343 46.656)', 4326));

INSERT INTO SpatialTable (id, GeogCol1)
VALUES (2, geography::STGeomFromText('POLYGON((-122.357 47.653, -122.348 47.649, -122.348 47.658, -122.358 47.658, -122.358 47.653))', 4326));
Then an Alter Row transformation is added, and in the Alter Row conditions, "Upsert if" is set to isNull(id)==false(). (The sink table is upserted based on the column id.)
Then, in the sink, the dataset for the target table is given. In the sink settings, the update method is set to Allow upsert and the required key column is specified (here, the column id is selected).
When the pipeline is run for the first time, the data is inserted into the target table.
When the pipeline is run a second time, after updating existing data and inserting new records into the source, the data is upserted correctly.
The source data is changed for id=1 and a new row is inserted with id=3.
The sink data reflects the changes made in the source.
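If you do want to go the well-known-text route instead (copy the data over as WKT and convert it back on the destination), a minimal sketch of the conversion step is below, run from a small script with pyodbc. The connection string, staging table, and GeogWkt column are illustrative assumptions, not from the original post; the conversion itself reuses geography::STGeomFromText as shown above.

    import pyodbc

    # Hypothetical connection details; replace with your Azure SQL server and credentials.
    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=tcp:myserver.database.windows.net;Database=mydb;"
        "Uid=user;Pwd=pass;Encrypt=yes;"
    )
    cur = conn.cursor()

    # Convert the staged WKT strings back into geography values on the destination.
    cur.execute("""
        UPDATE dbo.SpatialTable_stage
        SET GeogCol1 = geography::STGeomFromText(GeogWkt, 4326)
        WHERE GeogCol1 IS NULL AND GeogWkt IS NOT NULL
    """)
    conn.commit()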

Custom logging in Azure Data Factory

I'm new to ADF and trying to build an Azure Data Flow Pipeline. I'm reading from a Snowflake data source and checking the data against several business rules. After each check, I'm writing the bad records to a csv file. Now, my requirement is that I need to create a log table which shows the business rule and the number of records that failed to pass that particular business rule. I've attached a screenshot of my ADF data flow as well as the structure of the table I'm trying to populate.
My idea was to create a stored procedure that is called at the end of each business rule, so that a record is created in the database. However, I'm unable to add a stored procedure from the data flow. I found that I can get the rows written to a sink from the pipeline, but I can't figure out how to tie the sink name and the rows written together, and how to run the stored procedure for each of the business rules.
Snapshot of how my data flow looks
The columns that I want to populate
In my data flow activity I have used sink1 and sink2 to store the data that violates business rule 1 and rule 2 respectively. I've created a stored procedure that records the business rule and the failed-row count in a log table. A Stored procedure activity in ADF is then used and the records are inserted into the log table. Below are the steps.
Table for the log file:
CREATE TABLE [dbo].[log_file](
    [BusinessRule] [varchar](50) NULL, -- business rule
    [count] [varchar](50) NULL         -- failed rows count
) ON [PRIMARY]
GO
Stored procedure for inserting records into the log file from Data Factory:
Create proc [dbo].[usp_insert_log_file] (@BusinessRule varchar(100), @count varchar(10))
as
begin
    insert into log_file values (@BusinessRule, @count)
end
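Before wiring the procedure into ADF, it can be sanity-checked from a short script; a minimal sketch with pyodbc (the connection string and the sample values are illustrative):

    import pyodbc

    # Hypothetical connection string; replace with your Azure SQL details.
    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=tcp:myserver.database.windows.net;Database=mydb;"
        "Uid=user;Pwd=pass;Encrypt=yes;"
    )
    cur = conn.cursor()

    # Call the procedure the same way ADF will: a rule name and a row count.
    cur.execute("EXEC dbo.usp_insert_log_file ?, ?", ("Business_Rule_1", "25"))
    conn.commit()

    # Verify the row landed in the log table.
    for row in cur.execute("SELECT BusinessRule, [count] FROM dbo.log_file"):
        print(row[0], row[1])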
The Data flow activity has two sinks and is chained with two Stored procedure activities.
The stored procedure has two parameters: the business rule and the count.
In the Stored procedure 1 activity, enter the business rule corresponding to sink1 in the BusinessRule parameter, and for the count parameter pass sink1's rowsWritten value from the output of the Data flow activity:
BusinessRule: 'Business_Rule_1'
Count:
@string(activity('Data flow1').output.runStatus.metrics.sink1.rowsWritten)
Similarly, in the Stored procedure 2 activity, enter the corresponding business rule and pass sink2's count value in the parameters:
BusinessRule: 'Business_Rule_2'
Count:
@string(activity('Data flow1').output.runStatus.metrics.sink2.rowsWritten)
In this way, we can insert data into the log file from the Data flow activity using the Stored procedure activity.

SCD2 Implementation in Redshift using AWS Glue PySpark

I have a requirement to move data from S3 to Redshift. Currently I am using Glue for the work.
Current requirement:
Compare the primary key of each record in the Redshift table with the incoming file; if a match is found, close the old record's end date (update it from the high date to the current date) and insert the new one.
If no primary key match is found, insert the new record.
Implementation:
I have implemented it in Glue using PySpark with the following steps:
Created DataFrames which cover three scenarios:
If a match is found, update the existing record's end date to the current date.
Insert the new record into the Redshift table where a PPK match is found.
Insert the new record into the Redshift table where no PPK match is found.
Finally, union all three DataFrames into one and write the result to the Redshift table.
With this approach, both the old record (which still has the high date value) and its closed copy (with the end date updated to the current date) end up in the table.
Is there a way to delete the old record with the high date value using PySpark? Please advise.
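A minimal sketch of the three-DataFrame flow described above (Spark only; the key, column, and path names are illustrative, and both inputs are assumed to share the same schema):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("scd2_sketch").getOrCreate()

    # Assumed inputs: the current Redshift rows (e.g. previously unloaded/read) and the incoming S3 file.
    existing = spark.read.parquet("s3://my-bucket/current/dim_customer/")
    incoming = spark.read.parquet("s3://my-bucket/incoming/dim_customer/")

    HIGH_DATE = F.lit("9999-12-31").cast("date")
    existing_keys = existing.select("customer_id").distinct()
    incoming_keys = incoming.select("customer_id").distinct()

    # 1. Matched keys: close the currently open record (high date -> current date).
    closed = (existing
              .join(incoming_keys, "customer_id")
              .where(F.col("end_date") == HIGH_DATE)
              .withColumn("end_date", F.current_date()))

    # 2. Matched keys: the new version of the record, left open-ended.
    new_versions = (incoming
                    .join(existing_keys, "customer_id")
                    .withColumn("end_date", HIGH_DATE))

    # 3. Unmatched keys: brand-new records, left open-ended.
    new_rows = (incoming
                .join(existing_keys, "customer_id", "left_anti")
                .withColumn("end_date", HIGH_DATE))

    # Union and append to Redshift; as noted above, this append does not remove
    # the original open record already sitting in the table, which is the problem being asked about.
    final_df = closed.unionByName(new_versions).unionByName(new_rows)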
We have successfully implemented the desired functionality using AWS RDS [PostgreSQL] as the database service and Glue as the ETL service. My suggestion would be: instead of computing the delta in Spark DataFrames, it is a far easier and more elegant solution to create stored procedures and call them from your Glue job
(for example: S3 bucket -> staging table -> target table).
In addition, if your logic executes in less than 10 minutes, I would suggest using a Glue Python shell job with external libraries such as psycopg2 / SQLAlchemy for the DB operations.
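A minimal sketch of that suggestion (staging table -> target table, driven from a Glue Python shell job with psycopg2); the connection details, table names, IAM role, and column layout are illustrative assumptions, not from the original answer, and the staging table is assumed to hold the target's columns except end_date:

    import psycopg2

    # Hypothetical Redshift (or RDS) connection details.
    conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                            port=5439, dbname="dev", user="etl_user", password="...")
    cur = conn.cursor()

    # 1. Refresh the staging table from the incoming S3 file.
    cur.execute("TRUNCATE stage.dim_customer;")
    cur.execute("""
        COPY stage.dim_customer
        FROM 's3://my-bucket/incoming/dim_customer/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS PARQUET;
    """)

    # 2. Close the currently open version of any matched key in place,
    #    so no duplicate high-date row is left behind.
    cur.execute("""
        UPDATE target.dim_customer t
        SET end_date = CURRENT_DATE
        FROM stage.dim_customer s
        WHERE t.customer_id = s.customer_id
          AND t.end_date = '9999-12-31';
    """)

    # 3. Insert every incoming row as the new open version.
    cur.execute("""
        INSERT INTO target.dim_customer
        SELECT s.*, DATE '9999-12-31' AS end_date
        FROM stage.dim_customer s;
    """)

    conn.commit()
    cur.close()
    conn.close()

Steps 2 and 3 could equally be wrapped in a stored procedure and invoked with a single CALL, which is essentially what the answer above suggests.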

Hive Partition Table with Date Datatype via Spark

I have a scenario and would like to get an expert opinion on it.
I have to load a Hive table in partitions from a relational DB via Spark (Python). I cannot create the Hive table up front because I am not sure how many columns there are in the source and they might change in the future, so I have to fetch the data using select * from tablename.
However, I am sure of the partition column and know that it will not change. This column is of "date" data type in the source DB.
I am using saveAsTable with the partitionBy option, and I am able to properly create folders per the partition column. The Hive table is also getting created.
The issue I am facing is that the partition column is of "date" data type, which is not supported for Hive partition columns. Because of this, I am unable to read the data via Hive or Impala queries, as they report that date is not supported as a partition column.
Please note that I cannot typecast the column at the time of issuing the select statement, since I have to do a select * from tablename and not select a, b, cast(c) as varchar from table.
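One possible workaround, sketched under the assumption that only the partition column's name is known and stable: keep reading the whole table, then cast just that one column on the DataFrame before writing, so the partition column type becomes string while the rest of the schema stays dynamic. The JDBC URL, table, and column names here are illustrative.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Read the whole source table, since the column list may change over time.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/sourcedb")  # illustrative source
          .option("dbtable", "tablename")
          .option("user", "user")
          .option("password", "pass")
          .load())

    # Only the partition column is known, so cast just that column to string
    # after the read instead of enumerating columns in the SQL.
    df = df.withColumn("load_date", F.col("load_date").cast("string"))

    (df.write
       .mode("overwrite")
       .partitionBy("load_date")
       .saveAsTable("mydb.my_partitioned_table"))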

"Invalid column name" when "Specify database fields" in table output is unchecked in pentaho PDI V7

I'm trying to insert data into a SQL database. I have all the columns in the same order as the data flow, but I'm getting an "Invalid column name name_of_the_actual_data_column" error.
Column order won't matter, but exact column names will. Your SQL implementation probably isn't picky enough to require case-sensitive matches, but spaces and punctuation will matter. With Specify database fields unchecked, all field names MUST exist as columns in the target table.
I've found a good way to troubleshoot SQL inserts is to put a Select step before the Table output and make sure that you're really only getting the columns you want to insert.
You can also right-click on the Table output step and choose Input fields... to see the column metadata being passed into the step.