XML file produced by the SqlPivotScriptProducer does not contain additional indexes - codefluent

We are using the following producers :
sqlServer producer
Template Producer
SqlPivotScriptProducer
With the template producer we create additional indexes.
The XML file produced by the SqlPivotScriptProducer does not contain these additional indexes.
Has anybody an idea how to fix this?

The Pivot Script Producer generates the pivot file with the information from the model and the SQL Server database. In short, it uses the model to get a list of the objects that should be in the pivot file and uses the database to get the real definition of each object. For instance, if your template replaces a stored procedure defined in the model, the pivot script will contain the definition of the stored procedure as in the template. So if your template creates new database objects (not in the model), they won't be in the pivot file.
You can customize the PivotRunner using the Action event:
PivotRunner pivotRunner = new PivotRunner("Pivot\\Model1.pivot.xml");
pivotRunner.ConnectionString = CodeFluentContext.Get(Constants.Model1StoreName).Configuration.ConnectionString;
pivotRunner.Action += OnAction;
pivotRunner.Run();

Related

Delta Lake Data Load Datatype mismatch

I am loading data from SQL Server to Delta Lake tables. Recently I had to repoint the source to another table (same columns), but the data types are different in the new table. This causes an error while loading data to the Delta table. I get the following error:
Failed to merge fields 'COLUMN1' and 'COLUMN1'. Failed to merge incompatible data types LongType and DecimalType(32,0)
The command I use to write data to the Delta table:
DF.write.mode("overwrite").format("delta").option("mergeSchema", "true").save("s3 path")
The only option I can think of right now is to set the overwriteSchema option to true.
But this will rewrite my target schema completely. I am concerned that any sudden change in the source schema would replace the existing target schema without any notification or alert.
Also, I can't explicitly convert these columns because the Databricks notebook I am using is a parametrized one used to load data from source to target (we read a CSV file that contains all the details about the target table, source table, partition key, etc.).
Is there any better way to tackle this issue?
Any help is much appreciated!
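One way to address the concern about silent schema replacement is to compare column types before writing and raise an alert when they differ. The following is only a minimal sketch of that idea in plain Python: the function name and the sample schemas are hypothetical, and it assumes the two schemas are available as name-to-type dictionaries (in PySpark, dict(df.dtypes) yields that shape).

```python
def find_type_changes(target_types, source_types):
    """Return {column: (target_type, source_type)} for columns whose type differs."""
    changes = {}
    for column, target_type in target_types.items():
        source_type = source_types.get(column)
        if source_type is not None and source_type != target_type:
            changes[column] = (target_type, source_type)
    return changes


# Hypothetical schemas; in PySpark these could come from dict(df.dtypes).
target = {"COLUMN1": "bigint", "COLUMN2": "string"}
source = {"COLUMN1": "decimal(32,0)", "COLUMN2": "string"}

drift = find_type_changes(target, source)
if drift:
    # Surface the change (log, alert, or fail the job) before deciding to overwrite.
    print(f"Type changes detected: {drift}")
```

Because the check works on column names taken from the data itself, it does not need per-table code and could sit in a parametrized notebook ahead of the write.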

matching the columns in a source file with sink table columns to make sure they match using Azure Data Factory

I have an Azure Data Factory trigger that is fired off when a file is placed in blob storage. This trigger starts pipeline execution and passes the file name to the data flow activity. I would like to make sure that all the column names from the header row in the file are in the sink table. There is an identity column in the sink table that should not be in the comparison. I'm not sure how to tackle this task. I've read about the 'derived column' activity; is that the route I should take?
You can select or filter which columns reside in the sink dataset or table by using "Field mapping". You can optionally use a "derived column" transformation; however, the "sink transformation" provides this mapping by default, set to "Auto mapping". There you can add or remove the columns that are written to the sink.
In the below example the column "id" can be assumed as similar to "Identity" column in your table. Assuming all the files have same columns:
Once you have modified as per your need, you can confirm the same from the "inspect" tab before run.
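The comparison the question asks for (every header column must exist in the sink table, ignoring the identity column) can be sketched outside ADF as follows. This is an illustration of the logic only, not an ADF feature: the function name and the sample columns are hypothetical, and the case-insensitive comparison is an assumption.

```python
import csv
import io


def missing_in_sink(header_line, sink_columns, ignore=()):
    """Return the header columns that have no matching sink column."""
    # Parse the header row the same way a CSV reader would.
    header = next(csv.reader(io.StringIO(header_line)))
    # Drop ignored columns (e.g. the identity column) from the sink side.
    sink = {c.lower() for c in sink_columns} - {c.lower() for c in ignore}
    return [c for c in header if c.strip().lower() not in sink]


# Hypothetical sink table with an identity column "id".
sink_table_columns = ["id", "name", "price"]
print(missing_in_sink("name,price,colour", sink_table_columns, ignore=["id"]))
```

An empty result means every column in the file has a counterpart in the sink table.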
Strategy:
Use two ADF pipelines, one to get a list of all files and another one to process each file copying its content to a specific SQL table.
Setup:
I’ve created 4 CSV files, following the pattern you need: “[CustomerID]_[TableName]_[FileID].csv” and 4 SQL tables, one for each type of file.
A_inventory_0001.csv: inventory records for customer A, to be inserted into the SQL table “A_Inventory”.
A_sales_0003.csv: sales records for customer A, to be inserted into the SQL table “A_Sales”.
B_inventory_0002.csv: inventory records for customer B, to be inserted into the SQL table “B_Inventory”.
B_sales_0004.csv: sales records for customer B, to be inserted into the SQL table “B_Sales”.
Linked Services
In Azure Data Factory, the following linked services were created using Key Vault (Key Vault is optional).
Datasets
The following datasets were created. Note we have created some parameters to allow the pipeline to specify the source file and the destination SQL table.
The dataset “AzureSQLTable” has a parameter to specify the name of the destination SQL table.
The dataset “DelimitedTextFile” has a parameter to specify the name of the source CSV file.
The dataset “DelimitedTextFiles” has no parameter because it will be used to list all files from source folder.
Pipelines
The first pipeline “Get Files” will get the list of CSV files from source folder (Get Metadata activity), and then, for each file, call the second pipeline passing the CSV file name as a parameter.
Inside the foreach loop, there is a call to the second pipeline “Process File” passing the file name as a parameter.
The second pipeline has a parameter “pFileName” to receive the name of the file to be processed and a variable to calculate the name of the destination table based on the file name.
The first activity is to use a split in the file name to extract the parts we need to compose the destination table name.
In the expression below we are splitting the file name using the “_” separator and then using the first and second parts to compose the destination table name.
@concat(string(split(pipeline().parameters.pFileName, '_')[0]),'_',string(split(pipeline().parameters.pFileName, '_')[1]))
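For readers who want to check the split-and-concatenate logic outside ADF, here is the same derivation as a small Python sketch. The function name is hypothetical; the sample file name comes from the setup above.

```python
def destination_table(file_name):
    """Build the destination table name from the first two '_'-separated parts."""
    parts = file_name.split("_")
    return f"{parts[0]}_{parts[1]}"


print(destination_table("A_inventory_0001.csv"))  # prints A_inventory
```

The file ID and extension live in the third part, so they never reach the table name.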
The second activity will then copy the file from the source “pFileName” to the destination table “vTableName” using dynamic mapping, i.e. without adding specific column names, as these will be dynamic.
The files I used in this example and the ADF code are available here:
https://github.com/diegoeick/stack-overflow/tree/main/69340699
I hope this will resolve your issue.
In case you still need to save the CustomerID and FileID in the database tables, you can use the dynamic mapping and use the available parameters (filename) and create a json with the dynamic mapping in the mapping tab of your copy activity. You can find more details here: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#parameterize-mapping

Azure Data Factory - Insert Sql Row for Each File Found

I need a data factory that will:
check an Azure blob container for csv files
for each csv file
insert a row into an Azure Sql table, giving filename as a column value
There's just a single csv file in the blob container and this file contains five rows.
So far I have the following actions:
Within the for-each action I have a copy action. I gave this a source of a dynamic dataset which had the filename set as a parameter from @item().name. However, as a result 5 rows were inserted into the target table, whereas I was expecting just one.
The for-each loop executes just once, but I don't know how to use a data source built from variables holding the filename and timestamp.
You are headed in the right direction, but within the For Each you just need a Stored Procedure Activity that will insert the FileName (and whatever other metadata you have available) into an Azure DB table.
Like this:
Here is an example of the stored procedure in the DB:
CREATE PROCEDURE Log.PopulateFileLog (@FileName varchar(100))
AS
-- Insert one log row per call: the file name plus the load timestamp
INSERT INTO Log.CvsRxFileLog
SELECT
@FileName AS FileName,
GETDATE() AS ETL_Timestamp
EDIT:
You could also execute the insert directly with a Lookup Activity within the For Each like so:
EDIT 2
This will show how to do it without a for each
NOTE: This is the most cost-effective method, especially when dealing with hundreds or thousands of files on a recurring basis.
1st, Copy the output Json Array from your lookup/get metadata activity using a Copy Data activity with a Source of Azure SQLDB and Sink of Blob Storage CSV file
-------SOURCE:
-------SINK:
2nd, Create another Copy Data Activity with a Source of Blob Storage Json file, and a Sink of Azure SQLDB
---------SOURCE:
---------SINK:
---------MAPPING:
In essence, you save the entire json Output to a file in Blob, you then copy that file using a json file type to azure db. This way you have 3 activities to run even if you are trying to insert from a dataset that has 500 items in it.
Of course there is always more than one way to do things, but I don't think you need a For Each activity for this task. Activities like Lookup, Get Metadata and Filter output their results as JSON which can be passed around. This JSON can contain one or many items and can be passed to a Stored Procedure. An example pattern:
This is the sort of ELT pattern common with early ADF gen 2 (prior to Mapping Data Flows) which makes use of resources already in use in your architecture. You should remember that you are charged by the activity executions in ADF (e.g. multiple iterations in an unnecessary For Each loop) and that generally compute in Azure is expensive and storage is cheap, so think about this when implementing patterns in ADF. If you build the pattern above you have two types of compute: the compute behind your Azure SQL DB and the Azure Integration Runtime. If you add a Data Flow to that, you will have a third type of compute operating concurrently with the other two, so personally I only add these under certain conditions.
An example implementation of the above pattern:
Note the expression I am passing into my example logging proc:
@string(activity('Filter1').output.Value)
Data Flows is perfectly fine if you want a low-code approach and do not have compute resource already available to do this processing. In your case you already have an Azure SQL DB which is quite capable with JSON processing, eg via the OPENJSON, JSON_VALUE and JSON_QUERY functions.
You mention not wanting to deploy additional code which I understand, but then where did your original SQL table come from? If you are absolutely against deploying additional code, you could simply call the sp_executesql stored proc via the Stored Proc activity, use a dynamic SQL statement which inserts your record, something like this:
@concat( 'INSERT INTO dbo.myLog ( logRecord ) SELECT ''', activity('Filter1').output, ''' ')
Shred the JSON either in your stored proc or later, eg
SELECT y.[key] AS name, y.[value] AS [fileName]
FROM dbo.myLog
CROSS APPLY OPENJSON( logRecord ) x
CROSS APPLY OPENJSON( x.[value] ) y
WHERE logId = 16
AND y.[key] = 'name';
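To make the shredding step concrete, the sketch below does the equivalent in Python: it walks a JSON array of items and pulls out each "name" value, which is what the two OPENJSON calls and the filter on 'name' achieve. The payload shape is an assumption (an array of objects with "name" and "type" keys, similar to what a file-listing activity returns), and the function name is hypothetical.

```python
import json


def file_names(log_record):
    """Extract the 'name' value from every object in a JSON array string."""
    items = json.loads(log_record)
    return [item["name"] for item in items if "name" in item]


# Assumed payload shape, for illustration only.
sample = '[{"name": "file1.csv", "type": "File"}, {"name": "file2.csv", "type": "File"}]'
print(file_names(sample))
```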

Copying data from a single spreadsheet into multiple tables in Azure Data Factory

The Copy Data activity in Azure Data Factory appears to be limited to copying to only a single destination table. I have a spreadsheet containing rows that should be expanded out to multiple tables which reference each other - what would be the most appropriate way to achieve that in Data Factory?
Would multiple copy tasks running sequentially be able to perform this task, or does it require calling a custom stored procedure that would perform the inserts? Are there other options in Data Factory for transforming the data as described above?
If the columnMappings in your source and sink datasets don't hit any of the error conditions mentioned in this link:
1. Source data store query result does not have a column name that is specified in the input dataset "structure" section.
2. Sink data store (if with pre-defined schema) does not have a column name that is specified in the output dataset "structure" section.
3. Either fewer columns or more columns in the "structure" of sink dataset than specified in the mapping.
4. Duplicate mapping.
then you could connect the copy activities in series and execute them sequentially.
Another solution is a Stored Procedure, which could meet your custom requirements. For the configuration, please refer to my previous detailed case: Azure Data Factory mapping 2 columns in one column

How to read AWS Glue Data Catalog table schemas programmatically

I have a set of daily CSV files of uniform structure which I will upload to S3. There is a downstream job which loads the CSV data into a Redshift database table. The number of columns in the CSV may increase and from that point onwards the new files will come with the new columns in them. When this happens, I would like to detect the change and add the column to the target Redshift table automatically.
My plan is to run a Glue Crawler on the source CSV files. Any change in schema would generate a new version of the table in the Glue Data Catalog. I would then like to programmatically read the table structure (columns and their datatypes) of the latest version of the Table in the Glue Data Catalog using Java, .NET or other languages and compare it with the schema of the Redshift table. In case new columns are found, I will generate a DDL statement to alter the Redshift table to add the columns.
Can someone point me to any examples of reading Glue Data Catalog tables using Java, .NET or other languages? Are there any better ideas to automatically add new columns to Redshift tables?
If you want to use Java, use the dependency:
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-glue</artifactId>
<version>{VERSION}</version>
</dependency>
And here's a code snippet to get your table versions and the list of columns:
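import com.amazonaws.services.glue.AWSGlue;
import com.amazonaws.services.glue.AWSGlueClientBuilder;
import com.amazonaws.services.glue.model.Column;
import com.amazonaws.services.glue.model.GetTableVersionsRequest;
import com.amazonaws.services.glue.model.GetTableVersionsResult;
import com.amazonaws.services.glue.model.TableVersion;
import java.util.List;
AWSGlue client = AWSGlueClientBuilder.defaultClient();
// Identify the table by its catalog database and table name
GetTableVersionsRequest tableVersionsRequest = new GetTableVersionsRequest()
.withDatabaseName("glue_catalog_database_name")
.withTableName("table_name_generated_by_crawler");
GetTableVersionsResult results = client.getTableVersions(tableVersionsRequest);
// Here you have all the table versions, at this point you can check for new ones
List<TableVersion> versions = results.getTableVersions();
// Here's how to get to the table columns
List<Column> tableColumns = versions.get(0).getTable().getStorageDescriptor().getColumns();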
See the AWS documentation for the TableVersion and StorageDescriptor objects.
You could also use the boto3 library for Python.
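A rough Python sketch of the whole idea from the question: read the columns of the newest table version, then generate ALTER TABLE statements for columns the Redshift table lacks. The helper names, the type mapping, and the VARCHAR(256) fallback are hypothetical. The sketch also assumes version ids are numeric strings and reads only the first page of results. The client is passed in, so any object exposing get_table_versions works (for example boto3.client("glue")).

```python
def latest_columns(glue_client, database, table):
    """Return (name, type) pairs for the highest-numbered table version."""
    response = glue_client.get_table_versions(DatabaseName=database, TableName=table)
    versions = response["TableVersions"]
    # Assumption: VersionId is a numeric string, so the max is the newest.
    latest = max(versions, key=lambda v: int(v["VersionId"]))
    columns = latest["Table"]["StorageDescriptor"]["Columns"]
    return [(c["Name"], c["Type"]) for c in columns]


def add_column_ddl(table, catalog_columns, existing_columns, type_map):
    """Build one ALTER TABLE statement per catalog column missing in Redshift."""
    existing = {c.lower() for c in existing_columns}
    return [
        f"ALTER TABLE {table} ADD COLUMN {name} {type_map.get(glue_type, 'VARCHAR(256)')};"
        for name, glue_type in catalog_columns
        if name.lower() not in existing
    ]


# Hypothetical Glue-to-Redshift type mapping; extend it for your own data.
TYPE_MAP = {"string": "VARCHAR(256)", "bigint": "BIGINT", "double": "DOUBLE PRECISION"}
```

The list of existing Redshift columns would come from a separate query against the target table, which is left out here.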
Hope this helps.