Can I force flush a Databricks Delta table, so the disk copy has latest/consistent data? - azure-data-factory

I am accessing Databricks Delta tables from Azure Data Factory, which does not have a native connector to Databricks tables. So, as a workaround, I create the tables with the LOCATION keyword to store them in Azure Data Lake. Then, since I know the table file location, I just read the underlying Parquet files from Data Factory. This works fine.
But... what if there is cached information in the Delta transaction log that has not yet been written to disk? Say, an application updated a row in the table, and the disk does not yet reflect this fact. Then my read from Data Factory will be wrong.
So, two questions...
Could this happen? Are changes held in the log for a while before being written out?
Can I force a transaction log flush, so I know the disk copy is updated?

Azure Data Factory has built-in Delta Lake support (this was not the case at the time the question was raised).
Delta is available as an inline dataset in an Azure Data Factory data flow activity. To get column metadata, click the Import schema button in the Projection tab. This will allow you to reference the column names and data types specified by the corpus (see also the docs here).

ADF supports Delta Lake format as of July 2020:
https://techcommunity.microsoft.com/t5/azure-data-factory/adf-adds-connectors-for-delta-lake-and-excel/ba-p/1515793
The Microsoft Azure Data Factory team is enabling .. and a data flow connector for data transformation using Delta Lake
Delta is currently available in ADF as a public preview in data flows as an inline dataset.
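
On the original workaround of reading the Parquet files directly: as far as the Delta format goes, a commit writes its data files first and then adds a JSON entry under _delta_log, so changes are not held back in the log waiting to be flushed. The catch is rather that files replaced by an update stay in the folder until they are vacuumed, so reading every Parquet file in the location can return stale rows. Below is a minimal sketch of how the log's add and remove actions determine the current file set. It assumes the commit files have already been read in version order and it ignores checkpoint files; the function name and the sample paths are made up for illustration.

```python
import json


def live_files(commits):
    """Replay Delta commit files (oldest first) and return the active data files.

    `commits` is a list of strings, each the text of one _delta_log/*.json file,
    where every non-empty line is a single JSON action.
    """
    active = set()
    for commit_text in commits:
        for line in commit_text.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                active.add(action["add"]["path"])
            elif "remove" in action:
                active.discard(action["remove"]["path"])
    return active


# Hypothetical log: version 0 writes one file, version 1 rewrites it after an update.
commit_0 = '{"add": {"path": "part-0000-a.parquet"}}'
commit_1 = (
    '{"remove": {"path": "part-0000-a.parquet"}}\n'
    '{"add": {"path": "part-0000-b.parquet"}}'
)

current = live_files([commit_0, commit_1])
print(sorted(current))
```

Only the files returned by such a replay belong to the latest table version; any other Parquet file still in the folder is left over from an earlier version.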

Related

Azure data factory: Implementing the SCD2 on txt files

I have flat files in an ADLS source.
For the full load we are adding 2 columns, Insert and datatimestamp.
For the change load we need to look up against the full data: rows already present in the full data should be marked as Updated, and rows not present should be marked as Insert and copied.
Below is the approach I tried to work out, but I'm unable to get it working.
Can anyone help me with this?
Thank you, and waiting for a quick response.
Currently, updating an existing flat file through an Azure Data Factory sink is not supported. You have to create a new flat file.
You can also use a data flow activity to read the full and incremental data and load the result to a new file in the sink transformation.
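
The flagging logic from the question, written to a new output rather than updating a file in place, can be sketched outside ADF as follows. This is only an illustration in plain Python: the key column id, the added column names Flag and LoadTimestamp, and the sample rows are all assumptions, since the question does not show the file layout.

```python
import csv
import io
from datetime import datetime, timezone


def flag_changes(full_rows, change_rows, key="id", now=None):
    """Label each change row 'Update' if its key exists in the full load, else 'Insert'."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    known_keys = {row[key] for row in full_rows}
    flagged = []
    for row in change_rows:
        flag = "Update" if row[key] in known_keys else "Insert"
        flagged.append({**row, "Flag": flag, "LoadTimestamp": stamp})
    return flagged


def to_csv(rows):
    """Serialise rows to CSV text, standing in for the new file written by the sink.

    Assumes at least one row, because the header is taken from the first row.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data: id 2 already exists in the full load, id 3 is new.
full = [{"id": "1", "name": "Ann"}, {"id": "2", "name": "Bob"}]
changes = [{"id": "2", "name": "Robert"}, {"id": "3", "name": "Cy"}]

result = flag_changes(full, changes)
print(to_csv(result))
```

In a data flow the same idea would typically be a lookup or join of the change data against the full data, followed by a derived column for the flag.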

Do I need a storage (of some sort) when pulling data in Azure Data factory

*Data newbie here *
Currently, to run analytics report on data pulled from Dynamics 365, I use Power BI.
Issue with this is, Power BI is quite slow processing large data. I carry out a number of transform steps (e.g. Merge, Join, deleting or renaming columns, etc). So, when I try to run a query in Power BI with said steps, it takes a long time to complete.
So, as a solution, I decided to make use of Azure Data Factory(ADF). The plan is to use ADF to pull the data from CRM (i.e. Dynamics 365), perform transformations and publish the data. Then I'll use Power BI for visual analytics.
My question is:
What azure service will I need in addition to Data Factory? Will I need to store the data I pulled from CRM somewhere - like Azure Data Lake or Blob storage? Or can I do the transformation on the fly, right after the data is ingested?
Initially, I thought I could use the 'copy' activity to ingest data from CRM and start playing with the data. But with the copy activity I needed to provide a sink, that is, a destination for the data, which has to be a storage of some sort.
I also thought I could make use of the 'lookup' activity. I tried to use it, but I am getting errors (no exception message is produced).
I have scoured the internet for a similar process (i.e. Dynamics 365 -> Data Factory -> Power BI), but I've not been able to find any.
Most of the processes I've seen, however, use some sort of data storage right after data ingestion.
All responses are welcome, even if you believe I am going about this the wrong way.
Thanks.
A few things here:
The copy activity just moves data from a source to a sink. It doesn't modify it on the fly.
The lookup activity is only for looking up some attributes to use later in the same pipeline.
ADF cannot publish a dataset to Power BI (although it may be able to push to a streaming dataset).
Your approach is correct, but you need that last step of transforming the data. You have a lot of options here, but since you are already familiar with Power BI you can use Wrangling Dataflows, which let you take a file from the data lake, apply some Power Query steps and save a new file in the lake. You can also use Mapping Dataflows, Databricks, or any other data transformation tool.
Lastly, you can pull files from a data lake with Power BI to build your report from the data in this new file.
Of course, as always in Azure, there are a lot of ways to solve problems or architect services; this is the one I consider simplest for you.
Hope this helped!

Azure Table Storage Sink in ADF Data Flow

Here is what my ADF pipeline looks like. In the data flow, I read some data from a source, perform a filter and a join, and store the data to a sink. My plan was to use Azure Table Storage as the sink. However, according to https://github.com/MicrosoftDocs/azure-docs/issues/34981, ADF Data Flow does not support Azure Table Storage as a sink. Is there an alternative way to use Azure Table Storage as the sink in Data Flow?
No, it is impossible. Azure Table Storage cannot be the sink of a data flow.
Only these six datasets are allowed:
Those are not the only limits. When used as the sink of a data flow, Azure Blob Storage and Azure Data Lake Storage Gen1 and Gen2 support only four formats: JSON, Avro, Text, and Parquet.
At least for now, your idea is not a viable solution.
For more information, have a look at this official doc:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-sink#supported-sink-connectors-in-mapping-data-flow
Even today it isn't possible. One option could be (we are solving a similar case like this currently) to use a Blob Storage as a temporary destination.
The data flow will store the result in the Blob Storage. The source data is processed by all these different transformations in the data flow and prepared well for table storage, e.g. PartitionKey, RowKey, and all other columns are there.
A subsequent Copy Activity will move the data from Blob Storage into Table Storage easily.
The marked part of the pipeline is doing exactly this:
"Full Orders" runs the data flow
"to Table Storage" is the copy activity that moves the data into Table Storage
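
Preparing the data for Table Storage mostly means that every staged row already carries string PartitionKey and RowKey values. Here is a rough sketch of that shaping in plain Python, assuming hypothetical order fields customer_id and order_id (neither appears in the original post). In the pipeline itself this would more likely be a derived column in the data flow.

```python
import csv
import io

# Characters Azure Table Storage does not allow in key values.
FORBIDDEN = set('/\\#?')


def clean_key(value):
    """Return the value as a string without forbidden or non-printable characters.

    Dropping non-printable characters also covers the disallowed control characters.
    """
    return "".join(ch for ch in str(value) if ch not in FORBIDDEN and ch.isprintable())


def to_table_rows(orders):
    """Add PartitionKey and RowKey columns in front of each order's own columns."""
    rows = []
    for order in orders:
        rows.append({
            "PartitionKey": clean_key(order["customer_id"]),
            "RowKey": clean_key(order["order_id"]),
            **order,
        })
    return rows


def stage_as_csv(rows):
    """Serialise rows to CSV text, standing in for the file staged in Blob Storage."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical order whose identifiers contain characters that keys may not hold.
orders = [{"customer_id": "C#42", "order_id": "2021/001", "total": "19.90"}]
staged = to_table_rows(orders)
print(stage_as_csv(staged))
```

Stripping characters can make two different source values collide on the same key, so a real pipeline may prefer to encode them instead.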

How to get max of a given column from ADF Copy Data activity

I have a copy data activity with an on-premises SQL Server as source and ADLS Gen2 as sink. There is a control table to pick up tableName, watermarkDateColumn and watermarkDatetime in order to pull incremental data from the source database.
After the data is pulled and loaded into the sink, I want to get the max of the watermarkDateColumn in my dataset. Can it be obtained from @activity('copyActivity1').output?
I'm not allowed to use an extra lookup activity in the pipeline to query the source table for max(watermarkDateColumn).
The copy activity can only be used for data movement, not for aggregation, so @activity('copyActivity1').output won't help. Since you said you can't use a lookup activity, I'm afraid your requirement cannot be met so far.
If you prefer not to use additional activities, I suggest using a Data Flow activity instead, which is more flexible. The Data Flow activity has a built-in aggregation feature.
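
What such an aggregate step computes is simply the maximum of the watermark column over the rows just copied. Here is a small sketch in plain Python, not ADF itself, assuming the copied rows are available as dictionaries and that the watermark column holds ISO 8601 timestamps. The column name ModifiedDate is a placeholder for whatever watermarkDateColumn the control table supplies.

```python
from datetime import datetime


def max_watermark(rows, column):
    """Return the latest timestamp in `column`, or None when there are no rows."""
    values = [datetime.fromisoformat(row[column]) for row in rows]
    return max(values, default=None)


# Hypothetical rows as they might look after the incremental copy.
copied = [
    {"id": 1, "ModifiedDate": "2020-03-01T10:15:00"},
    {"id": 2, "ModifiedDate": "2020-03-04T08:00:00"},
    {"id": 3, "ModifiedDate": "2020-03-02T23:59:59"},
]

new_watermark = max_watermark(copied, "ModifiedDate")
print(new_watermark.isoformat())
```

The resulting value is what would be written back to the control table as the next watermarkDatetime.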

How can we handle Data validations in snowpipe in Snowflake

My scenario is that I have data in flat files in AWS S3.
I am using SNS to trigger Snowpipe when a new file arrives in S3.
I am using Snowpipe to load the data from the flat files in S3 into a Snowflake table.
While Snowpipe loads the data from the flat files into the Snowflake table,
can I handle data validation and a couple of calculations on the source data?
Please help me if there is any way to do this...
Thanks in advance.
The VALIDATION_MODE copy option is not yet supported by Snowpipe. However, Snowpipe does support simple transformations such as column reordering and casts. The best way to perform calculations and transform your data would be to load the data into a staging table and process it downstream into the target tables.
Reference:
https://docs.snowflake.net/manuals/sql-reference/sql/create-pipe.html#usage-notes
https://docs.snowflake.net/manuals/user-guide/data-load-transform.html
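
The staging-table approach can be illustrated with a short sketch. This is plain Python standing in for the downstream step, not Snowflake SQL, and the column names amount and quantity as well as the validation rule are invented for the example. Rows land unvalidated in a staging list, and a later step validates them, computes a derived value, and separates the rejects.

```python
def process_staged(staged_rows):
    """Validate staged rows and compute a derived column; return (valid, rejected)."""
    valid, rejected = [], []
    for row in staged_rows:
        try:
            amount = float(row["amount"])
            quantity = int(row["quantity"])
            if quantity <= 0:
                raise ValueError("quantity must be positive")
        except (KeyError, ValueError) as err:
            # Keep the bad row together with the reason it was rejected.
            rejected.append({**row, "error": str(err)})
            continue
        valid.append({**row, "total": round(amount * quantity, 2)})
    return valid, rejected


# Hypothetical staged rows: one good, one with a bad amount, one with a bad quantity.
staged = [
    {"amount": "2.50", "quantity": "4"},
    {"amount": "abc", "quantity": "1"},
    {"amount": "1.00", "quantity": "0"},
]

valid, rejected = process_staged(staged)
print(len(valid), len(rejected))
```

In Snowflake the equivalent would be a query or task that reads the staging table and writes the valid rows to the target table.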