Azure Data Factory - Dataverse data ingestion and data type mapping - azure-data-factory

We are ingesting Dataverse (Common Data Service apps) entities into ADLS Gen2 using Azure Data Factory. A few columns from the Dataverse source are missing and are not copied into ADLS, specifically columns with the Dataverse data type Choice.
Are all Dataverse column data types supported by the ADF linked service? Please suggest a fix or a workaround.

Are all Dataverse column data types supported by ADF linked service?
Yes, Dataverse supports all column data types.
For the missing columns, consider the following points:
When you copy data from Dynamics, explicit column mapping from Dynamics to sink is optional. But we highly recommend the mapping to ensure a deterministic copy result.
When the service imports a schema in the authoring UI, it infers the schema. It does so by sampling the top rows from the Dynamics query result to initialize the source column list. In that case, columns with no values in the top rows are omitted. The same behavior also applies to data preview and copy executions if there is no explicit mapping. You can review and add more columns into the mapping, which are honored during copy runtime.
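The sampling behavior described above can be illustrated with a small, self-contained Python sketch. It is not ADF code: `infer_columns`, `build_translator`, and the sample rows are hypothetical names and data, and representing empty values as `None` is an assumption made for the illustration. The only ADF-specific part is the shape of the explicit mapping, which follows the copy activity's `TabularTranslator` format.

```python
import json

def infer_columns(rows, sample_size):
    """Collect the column names that have a value in the first `sample_size` rows."""
    seen = []
    for row in rows[:sample_size]:
        for name, value in row.items():
            if value is not None and name not in seen:
                seen.append(name)
    return seen

def build_translator(column_names):
    """Build an explicit copy-activity mapping that lists every column by name."""
    return {
        "type": "TabularTranslator",
        "mappings": [
            {"source": {"name": name}, "sink": {"name": name}}
            for name in column_names
        ],
    }

# Hypothetical rows: 'statuscode' has no value in the first two rows.
rows = [
    {"accountid": "a1", "name": "Contoso", "statuscode": None},
    {"accountid": "a2", "name": "Fabrikam", "statuscode": None},
    {"accountid": "a3", "name": "Northwind", "statuscode": 1},
]

# Sampling only the top rows misses 'statuscode'.
inferred = infer_columns(rows, sample_size=2)

# An explicit mapping keeps the column regardless of what the sample contains.
explicit = build_translator(["accountid", "name", "statuscode"])

print(inferred)
print(json.dumps(explicit, indent=2))
```

In this sketch the inferred list lacks `statuscode`, while the explicit mapping still carries it, which is why listing the columns yourself gives a deterministic result.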
To consume Dataverse choices in ADF, use a data flow activity with a derived column transformation, because choice values are written as integers rather than text labels to maintain consistency during edits. The integer-to-text label mapping is stored in the Microsoft.Athena.TrickleFeedService/table-EntityMetadata.json file.
Refer to the official Microsoft documentation to implement this.
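The integer-to-label lookup described above can be sketched in plain Python to show the logic that the data flow would perform. This is only an illustration under assumptions: the `OptionSetMetadata` layout and its field names (`EntityName`, `OptionSetName`, `Option`, `LocalizedLabel`) are assumed here and should be checked against your own metadata file, and the sample values and helper names are hypothetical.

```python
import json

# Hypothetical excerpt of an entity metadata file; the real layout may differ.
metadata_json = """
{
  "OptionSetMetadata": [
    {"EntityName": "account", "OptionSetName": "industrycode",
     "Option": 1, "LocalizedLabel": "Accounting"},
    {"EntityName": "account", "OptionSetName": "industrycode",
     "Option": 2, "LocalizedLabel": "Agriculture"}
  ]
}
"""

def build_label_lookup(metadata, entity):
    """Map (choice column, integer value) to its text label for one entity."""
    lookup = {}
    for item in metadata.get("OptionSetMetadata", []):
        if item["EntityName"] == entity:
            lookup[(item["OptionSetName"], item["Option"])] = item["LocalizedLabel"]
    return lookup

def add_labels(row, lookup, choice_columns):
    """Return a copy of the row with a '<column>_label' field per choice column."""
    out = dict(row)
    for col in choice_columns:
        # Unknown or missing values yield None instead of raising.
        out[col + "_label"] = lookup.get((col, row.get(col)))
    return out

lookup = build_label_lookup(json.loads(metadata_json), "account")
row = {"accountid": "a1", "industrycode": 2}
labeled = add_labels(row, lookup, ["industrycode"])
print(labeled)
```

The integer column is kept next to the derived label, so downstream consumers can still join on the stable integer value.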

Related

Mapping Synapse data flow with parameterized dynamic source need importing projection dynamically

I am trying to build a cloud data warehouse where I have staged the on-prem tables as parquet files in the data lake.
I implemented a metadata-driven incremental load.
In the above data flow I am trying to implement a merge query, passing the table name as a parameter so that the data flow dynamically locates the respective parquet files for the full data and the incremental data, and then goes through some ETL steps to implement the merge.
The merge query is working fine, but I found that the projection is not correct. As the source files are dynamic, I also want to "import projection" dynamically at run time, so that the same data flow can be used to implement the merge query for any table.
In the picture, you can see it is showing 104 columns (a static projection that was imported at development time). For this table it should actually be 38 columns.
Can I assign the projection dynamically (i.e., at run time)? If so, how?
Or does anyone have any suggestions regarding this?
Thanking
Muntasir Joarder
Enable schema drift in your source transformation when the metadata changes often. This allows columns to be added or removed at run time.
The source projection displays what was imported at design time, but it changes based on the source schema at run time.
Refer to this document for more details with examples.

How to read Schema from the config table and attach to Pyspark DataFrame?

I have 50 tables on my on-premises server. I want to migrate those 50 tables from on-premises to Delta tables in Databricks. Every table has a specific schema defined, but I need to design a single ADF pipeline to move all fifty tables from on-premises to Delta tables.
How can I attach the schema to the data frame at run time based on the table name?
I would use mapping data flows for this scenario:
Create a static table/file with the list of your tables.
Add a for each loop in ADF pipeline.
Within foreach, create a mapping data flow.
As a source provide your on-prem database (schema will be detected automatically).
Delta table as a destination (sink).
Mapping data flows are Spark-based, so you can see in the "Projection" tab that the schema has already been translated to Spark types.
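If you would rather attach the schema yourself in a Databricks notebook, as the question asks, one possible approach is to turn the config table rows into a Spark DDL schema string per table. The sketch below is plain Python so that it runs without a cluster; the config layout (`table_name`, `column_name`, `spark_type`, `ordinal`) and the function name `schema_ddl` are assumptions made for illustration, not an existing convention.

```python
def schema_ddl(config_rows, table_name):
    """Build a Spark DDL schema string such as 'id INT, name STRING' for one table."""
    cols = [r for r in config_rows if r["table_name"] == table_name]
    if not cols:
        raise KeyError(f"no schema configured for table {table_name!r}")
    # Keep the column order that the config table defines.
    cols = sorted(cols, key=lambda r: r["ordinal"])
    return ", ".join(f"{r['column_name']} {r['spark_type']}" for r in cols)

# Hypothetical config rows, e.g. as read from a config table or file.
config_rows = [
    {"table_name": "customers", "column_name": "name", "spark_type": "STRING", "ordinal": 2},
    {"table_name": "customers", "column_name": "id", "spark_type": "INT", "ordinal": 1},
    {"table_name": "customers", "column_name": "created_at", "spark_type": "TIMESTAMP", "ordinal": 3},
    {"table_name": "orders", "column_name": "order_id", "spark_type": "BIGINT", "ordinal": 1},
]

ddl = schema_ddl(config_rows, "customers")
print(ddl)

# In a notebook, the string could then be passed to the reader (not executed here;
# `spark` and `path` would come from the notebook and pipeline parameters):
#   df = spark.read.schema(ddl).parquet(path)
```

The table name would come in as a pipeline parameter from the foreach loop, so one notebook can serve all fifty tables.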

Azure Data Factory data flow - drops null columns

When using a data flow in Azure Data Factory to move data, I've noticed that the data at the sink is missing the columns that contain NULL values. When using the copy activity to copy the same data, the columns are present in the sink with their NULL values.
Record after a copy activity:
Record after a data flow:
The source is parquet and the sink is Azure Cosmos DB. My goal is to avoid defining any schemas, as I simply want to copy all of the data "as is". I've used the "allow schema drift" option on the source and sink.
I would just use the copy activity, but it doesn't appear to have the ability to define a maximum speed (RU consumption) like the data flow does, so the copy activity ends up consuming all of the cosmos db's RUs very quickly (as described here)
EDIT:
sink data preview shows all columns
sink inspect tab shows all columns
Dataflows always skip writing JSON tags with NULLs. There is no workaround currently other than copy activity.
This is really not a good design or behavior on Microsoft's part, because you can't standardize in Cosmos whether to "keep" or "remove" null fields in your JSON.
Querying Cosmos
Where field1 = NULL is completely different than where NOT IS_DEFINED (field1) and will yield an entirely different result set.
And if your users don't know whether the ADF developer used a data flow with a sink or a copy activity in a pipeline, they may get erroneous results from a query. The only way to ensure you get all the data is to always use:
Where field1 = NULL or where NOT IS_DEFINED (field1)
Users should not have to know which ADF functionality was chosen to write a specific JSON document into a Cosmos NoSQL collection in order to query it. You also can't standardize on keeping nulls across all Cosmos documents or removing them across all Cosmos documents, unless you force everyone to use only pipelines or only data flows. Depending on the complexity, using only pipelines is not always possible, and using only data flows is not always needed.
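The difference between the two predicates can be shown with a small Python emulation. It does not call Cosmos DB: the sample documents and the helper names `is_null` and `not_defined` are hypothetical, and the helpers only mimic how `field1 = null` and `NOT IS_DEFINED(field1)` treat a null value versus a missing field.

```python
# Three hypothetical documents for the same logical record shape.
docs = [
    {"id": "1", "field1": "x"},   # field present with a value
    {"id": "2", "field1": None},  # null kept, as written by the copy activity
    {"id": "3"},                  # field omitted, as written by a data flow
]

def is_null(doc, field):
    """Emulates `field = null`: the field must exist and hold null."""
    return field in doc and doc[field] is None

def not_defined(doc, field):
    """Emulates `NOT IS_DEFINED(field)`: the field must be absent."""
    return field not in doc

null_ids = [d["id"] for d in docs if is_null(d, "field1")]
undefined_ids = [d["id"] for d in docs if not_defined(d, "field1")]
either_ids = [d["id"] for d in docs
              if is_null(d, "field1") or not_defined(d, "field1")]

print(null_ids, undefined_ids, either_ids)
```

Each single predicate finds only one of the two "empty" documents; only the combined condition returns both, which is the point made above.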

Can I force flush a Databricks Delta table, so the disk copy has latest/consistent data?

I am accessing Databricks Delta tables from Azure Data Factory, which does not have a native connector to Databricks tables. So, as a workaround, I create the tables with the LOCATION keyword to store them in Azure Data Lake. Then, since I know the table file location, I just read the underlying Parquet files from Data Factory. This works fine.
But... what if there is cached information in the Delta transaction log that has not yet been written to disk? Say, an application updated a row in the table, and the disk does not yet reflect this fact. Then my read from Data Factory will be wrong.
So, two questions...
Could this happen? Are changes held in the log for a while before being written out?
Can I force a transaction log flush, so I know the disk copy is updated?
Azure Data Factory has built-in Delta Lake support (this was not the case at the time the question was raised).
Delta is available as an inline dataset in an Azure Data Factory data flow activity. To get column metadata, click the Import schema button in the Projection tab. This will allow you to reference the column names and data types specified by the corpus (see also the docs here).
ADF supports Delta Lake format as of July 2020:
https://techcommunity.microsoft.com/t5/azure-data-factory/adf-adds-connectors-for-delta-lake-and-excel/ba-p/1515793
The Microsoft Azure Data Factory team is enabling .. and a data flow connector for data transformation using Delta Lake
Delta is currently available in ADF as a public preview in data flows as an inline dataset.

Does IBM Dataworks support CSV delimiters?

Does the IBM DataWorks Data Load API support CSV files as input source?
The answer is yes. To accomplish this, you have to provide the structure of the file in the request payload. This is explained in the API documentation, Creating a Data Load Activity. This is an excerpt of the documentation:
Within the columns array, specify the columns to provision data
from. If Analytics for Hadoop, Amazon S3, or SoftLayer Object Storage
is the source, you must specify the columns. If you specify columns,
only the columns that you specify are provisioned to the target...
The Data Load application included in DataWorks is provided just as an example and assumes the input file has 2 columns, the first being an INTEGER and the second one a VARCHAR.
Note: This question was answered on dW Answers by user emalaga.