Issue in loading data with Azure Data Factory - azure-data-factory

I am trying to load a lot of CSV files from Blob Storage into Azure SQL Data Warehouse through Azure Data Factory. As I am dealing with a massive number of rows, the desired approach is to use PolyBase to bulk load the data. When I point the source to one single file, SQL DW PolyBase is displayed as true, but when I point it to all of the CSV files, SQL DW PolyBase is displayed as false. Has anyone experienced this issue?

You can always change Allow PolyBase to true in the UI, or specify the property in the JSON:
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true
}

Related

How to properly handle control records in Aurora-DMS-Kinesis-Redshift pipeline?

I am really stuck here and have spent the last two days researching this topic. I have the following data sync pipeline, with the full-load-and-cdc migration type configured:
[Aurora MySQL RDS]->[DMS]->[Kinesis Streams]->[Kinesis Firehose]->S3 (intermediate)->[Redshift]
When I start the DMS migration task, besides the JSON data files for the source table, the pipeline also delivers JSON control records to the intermediate S3 bucket, including the JSON for creating the awsdms_apply_exceptions control table. Redshift, in turn, tries to load these JSON files from S3 and fails with this error:
Error 1213: "Missing data for not-null field"
This happens, I believe, because Redshift tries to parse the JSON control records as source table data records. My questions are:
1. Is it correct that the JSON for control tables (and other tables' DDL) is delivered by Firehose to the intermediate S3 bucket? When I had an [Aurora MySQL RDS]->[DMS]->[Firehose] pipeline before, I didn't see any DDL delivered to S3, only data CSV files.
2. If #1 is not correct, how can I ensure that only the source table data files (which Redshift can successfully load) are pushed to the intermediate S3 bucket?
3. And if the JSON with DDL is not going through a Kinesis Stream, how would DMS communicate it to Redshift when a new column is added, for example?
I appreciate any input, as I have completely run out of ideas at this point. Thank you very much.
Here is an example of the control records I get in S3, delivered by Firehose, which make Redshift error out with the 1213 error above:
{
    "metadata": {
        "timestamp": "2023-01-09T18:59:13.214656Z",
        "record-type": "control",
        "operation": "create-table",
        "partition-key-type": "task-id",
        "schema-name": "public",
        "table-name": "awsdms_apply_exceptions"
    }
}{
    "metadata": {
        "timestamp": "2023-01-09T18:59:13.872312Z",
        "record-type": "control",
        "operation": "create-table",
        "partition-key-type": "task-id",
        "schema-name": "epulse",
        "table-name": "add_voter_preference_options"
    }
}
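The distinction Redshift is tripping over is visible in the records above: control records carry "record-type": "control" in their metadata, while table data records carry "record-type": "data". There is no answer attached to this related question in the thread, but purely as an illustration of one possible way to keep control records out of the intermediate bucket (an assumption on my part, not something proposed here), a Kinesis Data Firehose transformation Lambda could drop those records before they reach S3. The sketch below assumes each Firehose record holds a single DMS JSON document:

import base64
import json

def lambda_handler(event, context):
    """Firehose data-transformation handler: drop DMS control records,
    pass data records through unchanged."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        record_type = payload.get("metadata", {}).get("record-type")
        result = "Dropped" if record_type == "control" else "Ok"
        output.append({
            "recordId": record["recordId"],
            "result": result,
            "data": record["data"],  # original base64 payload, unchanged
        })
    return {"records": output}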

Issue while updating copy activity in ADF

I want to update a source Excel column with a particular string.
My source contains n columns. I need to check whether the string apple exists in any of the columns; if the value exists in any column, I need to replace apple with orange and output the Excel file. How can I do this in ADF?
Note: I cannot use Data Flows, since we are using a self-hosted VM.
Excel files have a lot of limitations in ADF; for example, Excel is not supported as a sink in the Copy activity, nor as a sink in Data Flows.
You can raise a feature request for that in ADF.
So, try the above operation with a CSV and copy the result to a CSV in Blob Storage, which you can later convert to Excel on your local machine.
For operations like this, a Data Flow is a better option than ordinary activities, since Data Flows are built for transformations.
However, Data Flows do not support self-hosted linked services.
So, as a workaround, first copy the Excel file to Blob Storage as a CSV using a Copy activity, and create a Blob Storage linked service for the Data Flow to use.
Then follow the process below in the Data Flow.
Source: the CSV from Blob Storage.
Derived column transformation: give the condition for each column, e.g. case(col1=="apple", "orange", col1) (an equivalent of this logic is sketched after these steps).
Sink: in the sink settings, specify Output to single file.
After the pipeline execution, a CSV will be generated in the blob container. You can convert it to Excel on your local machine.
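Outside of ADF, just to make the intended logic concrete (this is only an illustration, not something the thread proposes; the file names are placeholders), the same apple-to-orange replacement across all columns of the staged CSV looks like this in pandas:

import pandas as pd

# Read the staged CSV (placeholder path), keeping every column as text.
df = pd.read_csv("staged_source.csv", dtype=str)

# Equivalent of case(colX == "apple", "orange", colX) on every column:
# cells whose value is exactly "apple" become "orange"; everything else is kept.
df = df.replace("apple", "orange")

# Write a single output CSV, which can later be converted to Excel locally.
df.to_csv("output.csv", index=False)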

Infer schema from a .csv file appearing in Azure Blob and generate a DDL script automatically using Azure SQL

Every time a .csv file appears in Blob Storage, I have to manually create the DDL for it in Azure SQL. The data type is based on the values specified for each field.
The file has 400 columns, and doing this manually takes a lot of time.
Could someone please suggest how to automate this using a stored procedure or script, so that when we execute the script, it creates the TABLE or the DDL script based on the file in Blob Storage?
I am not sure if this is possible, or whether there is a better way to handle such a scenario.
I appreciate your valuable suggestions.
Many thanks
This can be achieved in multiple ways. Since you mentioned automating it, you can use an Azure Function as well.
First, create a function that reads the CSV file from Blob Storage:
Read a CSV Blob file in Azure
Then add the code to generate the DDL statement:
Uploading and Importing CSV File to SQL Server
The Azure Function can be scheduled, or triggered when new files are added to Blob Storage.
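As a rough sketch of the DDL-generation step (this is not the code behind the linked posts; the type rules, table name, and sampling size are simplifying assumptions of mine), the function could infer a SQL type per column from sample values and emit a CREATE TABLE statement:

import csv
import io

def infer_sql_type(values):
    """Very rough type inference from sample string values
    (assumption: only INT, FLOAT, and NVARCHAR are considered)."""
    non_empty = [v for v in values if v != ""]
    if non_empty and all(v.lstrip("-").isdigit() for v in non_empty):
        return "INT"
    try:
        for v in non_empty:
            float(v)
        return "FLOAT" if non_empty else "NVARCHAR(255)"
    except ValueError:
        return "NVARCHAR(255)"

def generate_ddl(csv_text, table_name="dbo.MyTable", sample_rows=100):
    """Build a CREATE TABLE statement from the CSV header and a sample of rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    sample = [row for _, row in zip(range(sample_rows), reader)]
    columns = []
    for i, col in enumerate(header):
        col_values = [row[i] for row in sample if i < len(row)]
        columns.append(f"    [{col}] {infer_sql_type(col_values)}")
    return f"CREATE TABLE {table_name} (\n" + ",\n".join(columns) + "\n);"

# In an Azure Function, csv_text would come from the blob trigger or the storage SDK.
print(generate_ddl("id,name,price\n1,apple,1.5\n2,orange,2.0"))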
If this is a once-a-day kind of requirement and can also be done manually, you can download the file from the blob and use the 'Import Flat File' functionality available in SSMS, where you just specify the CSV file and it creates the schema based on the existing column values.

Azure Table Storage Sink in ADF Data Flow

In my ADF pipeline's Data Flow, I read some data from a source, perform a filter and a join, and store the data to a sink. My plan was to use Azure Table Storage as the sink. However, according to https://github.com/MicrosoftDocs/azure-docs/issues/34981, ADF Data Flow does not support Azure Table Storage as a sink. Is there an alternative way to use Azure Table Storage as the sink in a Data Flow?
No, it is not possible. Azure Table Storage cannot be the sink of a Data Flow.
Only six dataset types are allowed as Data Flow sinks.
And the limits do not stop there: when used as the sink of a Data Flow, Azure Blob Storage and Azure Data Lake Storage Gen1/Gen2 only support four formats: JSON, Avro, Text, and Parquet.
At least for now, your idea is not a viable solution.
For more information, have a look at the official doc:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-sink#supported-sink-connectors-in-mapping-data-flow
Even today it isn't possible. One option (we are currently solving a similar case) could be to use Blob Storage as a temporary destination.
The Data Flow stores its result in Blob Storage. The source data is processed by all of the transformations in the Data Flow and prepared for Table Storage, i.e. the PartitionKey, RowKey, and all other columns are present.
A subsequent Copy activity then moves the data from Blob Storage into Table Storage easily.
The relevant part of the pipeline does exactly this:
The Full Orders activity runs the Data Flow.
The to Table Storage Copy activity moves the data into Table Storage.

Can I force flush a Databricks Delta table, so the disk copy has latest/consistent data?

I am accessing Databricks Delta tables from Azure Data Factory, which does not have a native connector to Databricks tables. So, as a workaround, I create the tables with the LOCATION keyword to store them in Azure Data Lake. Then, since I know the table file location, I just read the underlying Parquet files from Data Factory. This works fine.
But... what if there is cached information in the Delta transaction log that has not yet been written to disk? Say, an application updated a row in the table, and the disk does not yet reflect this fact. Then my read from Data Factory will be wrong.
So, two questions...
Could this happen? Are changes held in the log for a while before being written out?
Can I force a transaction log flush, so I know the disk copy is updated?
Azure Data Factory now has built-in Delta Lake support (this was not the case at the time the question was raised).
Delta is available as an inline dataset in an Azure Data Factory Data Flow activity. To get column metadata, click the Import schema button in the Projection tab. This will allow you to reference the column names and data types specified by the corpus (see also the docs here).
ADF supports Delta Lake format as of July 2020:
https://techcommunity.microsoft.com/t5/azure-data-factory/adf-adds-connectors-for-delta-lake-and-excel/ba-p/1515793
The Microsoft Azure Data Factory team is enabling .. and a data flow connector for data transformation using Delta Lake
Delta is currently available in ADF as a public preview in data flows as an inline dataset.
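On the original concern about reading the Parquet files directly: Delta commits changes through JSON files in the table's _delta_log folder, and the Parquet data files alone do not tell you which of them belong to the current table version. Purely as an illustration (a simplification that ignores checkpoints and assumes a small log, not part of the answers above), here is a sketch of how the committed file list could be derived from the log:

import json
import os

def current_data_files(table_path):
    """Replay the Delta transaction log (JSON commits only; checkpoints are ignored,
    which is a simplification) to find the Parquet files of the current version."""
    log_dir = os.path.join(table_path, "_delta_log")
    commits = sorted(f for f in os.listdir(log_dir) if f.endswith(".json"))
    files = set()
    for commit in commits:
        with open(os.path.join(log_dir, commit)) as fh:
            for line in fh:  # one action per line: add, remove, metaData, ...
                line = line.strip()
                if not line:
                    continue
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return sorted(files)

# Reading only these files avoids Parquet files that were written but never committed,
# as well as files that a later commit has logically removed.
print(current_data_files("/mnt/datalake/my_delta_table"))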