Insert Current Date in ADL Gen2 (DataBricks) - pyspark

I am new to Databricks, i have a requirement where in silver layer after transformation is happening i have to take the max(load_date) from my dataset and update that value in storage account (Transient folder). A .csv file is already available in the Transient folder where i have to overwrite the max(load_date) value every time my notebook runs.
for now i am doing it creating a empty Dataframe then assigning the max date and then loading it to the file but it seems not working that way.
Any idea to do it in a efficient way?

Create an empty SQL table then assign the max date into the data frame by using the merge operation.
I have stored my Max date in the kDF data frame.
kDF.write.option("mergeSchema","true").format("delta").mode("append").saveAsTable("test3")
After the merge operation, Max date will be stored in the test3 table and stored in this location /mnt/defaultDatalake/filedemo.
Now You can check the data:
df1 = spark.read.format("delta").option("header", "true").load("/mnt/defaultDatalake/filedemo")
display(df1)
Then perform the write operation with azure data lake gen2:
#Connect Gen2 with azure databricks
spark.conf.set(
"fs.azure.account.key.<storage_account>.dfs.core.windows.net",
"Access_key"
)
Overwrite the data into Gen2
df1.coalesce(1).write.format('csv').mode("overwrite").save("abfss://demo12#vamblob.dfs.core.windows.net/vam")

Related

Azure Data Factory DataFlow Sink to Delta fails when schema changes (column data types and similar)

We have an Azure Data Factory dataflow, it will sink into Delta. We have Owerwrite, Allow Insert options set and Vacuum = 1.
When we run the pipeline over and over with no change in the table structure pipeline is successfull.
But when the table structure being sinked changed, ex data types changed and such the pipeline fails with below error.
Error code: DFExecutorUserError
Failure type: User configuration issue
Details: Job failed due to reason: at Sink 'ConvertToDelta': Job aborted.
We tried setting Vacuum to 0 and back, Merge Schema set and now, instead of Overwrite Truncate and back and forth, pipeline still failed.
Can you try enabling Delta Lake's schema evolution (more information)? By default, Delta Lake has schema enforcement enabled which means that the change to the source table is not allowed which would result in an error.
Even with overwrite enabled, unless you specify schema evolution, overwrite will fail because by default the schema cannot be changed.
I created ADLS Gen2 storage account and created input and output folders and uploaded parquet file into input folder.
I created pipeline and created dataflow as below:
I have taken Parquet file as source.
Dataflow Source:
Dataset of Source:
Data preview of Source:
I created derived column to change the structure of the table.
Derived column:
I updated 'difficulty' column of parquet file. I changed the datatype of 'difficulty' column from long to double using below code:
difficulty : toDouble(difficulty)
Image for reference:
I updated 'transactions_len' column of parquet file. I changed the datatype of 'transactions_len' column from Integer to float using below code:
transactions_len : toFloat(transactions_len)
I updated 'number' column of parquet file. I changed the datatype of 'number' column from long to string using below code:
number : toString(number)
Image for reference:
Data preview of Derived column:
I have taken delta as sink.
Dataflow sink:
Sink settings:
Data preview of Sink:
I run the pipeline It executed successfully.
Image for reference:
I t successfully stored in my storage account output folder.
Image for reference:
The procedure worked in my machine please recheck from your end.
The source (Ingestion) was generated to azure blob with given a specific filename. Whenever we generated to source parquet files without specifying a specific filename but only a directory the sink worked

How to create an BQ external table on top of the delta table in GCS and show only latest snapshot

I am trying to create an external BQ external table on top of delta table which uses google storage as a storage layer. On the delta table, we perform DML which includes deletes.
I can create an BQ external table on top of the gs bucket where all the delta files exists. However, it is pulling even the delete records since BQ external table cannot read the transaction logs of delta where it says which parquet files to consider and which one to remove.
is there a way we can expose the latest snapshot of the delta table (gs location) in BQ as an external table apart from doing it programmatically copying data from delta to BQ?
So this question is asked like more than a year ago but I have tricky but powerful addition to Oliver's answer which eliminates data duplication and additional load logic.
Step 1 As Oliver suggested generate symlink_format_manifest files; you can either trigger it everytime you've updated or you can add a tblproperty to your file as stated here to
automatically create those files when delta table is updated;
ALTER TABLE delta.`<path-to-delta-table>` SET TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true)
Step 2 Create a external table that points to delta table location
> bq mkdef --source_format=PARQUET "gs://test-delta/*.parquet" > bq_external_delta_logs
> bq mk --external_table_definition=bq_external_delta_logs test.external_delta_logs
Step 3 Create another external table pointing to symlink_format_manifest/manifest file
> bq mkdef --autodetect --source_format=CSV gs://test-delta/_symlink_format_manifest/manifest > bq_external_delta_manifest
> bq mk --table --external_table_definition=bq_external_delta_manifest test.external_delta_manifest
Step 4 Create a view with following query
> bq mk \
--use_legacy_sql=false \
--view \
'SELECT
*
FROM
`project_id.test.external_delta_logs`
WHERE
_FILE_NAME in (select * from `project_id.test.external_delta_logs`)' \
test.external_delta_snapshot
Now you can get the latest snapshot whenever your delta table refreshed from test.external_delta_snapshot view without any additional loading or data duplication. A downside of this solution is, in case of schema changes you have to add new fields to you table definition either manually or from your spark pipeline using BQ client etc. For those who are curious about how this solution works, please continue reading.
How this works;
symlink manifest files contains list of parquet files in new line delimited format pointing to current delta version partitions;
gs://delta-test/......-part1.parquet
gs://delta-test/......-part2.parquet
....
In addition to our delta location, we are defining another external table by treating this manifest file as CSV file (its actually a single column CSV file). The view we've defined takes advantage of _FILE_NAME pseudo column mentioned here, which points to parquet file location of every row in table. As stated in docs, _FILE_NAME pseudo column is defined for every external table that points to data stored in Cloud Storage and Google Drive.
So at this point, we have the list of parquet files required for loading latest snapshot and ability to filter files we want to read using _FILE_NAME column. The view we have defined just defines this procedure to get the latest snapshot. Whenever our delta table gets updated, manifest and delta log table will look for the newest data therefore we will always get the latest snapshot without any additional loading or data duplication.
Last word, its a known fact that execution on external tables are more expensive(execution cost) than BQ managed tables, so its better to experiment with dual writes as Oliver suggested and external table solution like you asked. Storage is cheaper than execution so there may be some cases where keeping data in both GCS and BQ costs less than keeping an external table like this.
I'm also developing this kind of pipeline where we dump our delta lake files in GCS and present it on Bigquery. The generation of manifest file from your GCS delta files will give you the latest snapshot based on what version current set on your delta files. Then you need to create a custom script to parse that manifest file to get the list of files and then run a bq load mentioning those files.
val deltaTable = DeltaTable.forPath(<path-to-delta-table>)
deltaTable.generate("symlink_format_manifest")
Below workaround might work for small datasets.
Have a separate BQ table.
Read the delta lake files into a DataFrame and then df.overwrite into the BigQuery table.

Snowflake Copy component in Azure Data factory cannot default a column timestamp

i have been able to copy a file directly from Azure blob storage to Snowflake into a table. I have one column that i defaulted to current timestamp to store when the data was imported. This field however is ending up as null even though i have it defaulted. anyone have ideas on how to do this? there isn't a post script i can run with this component.

How do I dynamically map files in Copy Activity to load the data into destination

Azure Data factory V2 - Copy Activity - Copy data from Changing Column names and number of columns to Destination. I have to copy data from a Flat File where number of Columns will change in each file and even the column names. How do I dynamically map them in Copy Activity to load the data into destination in Azure Data factory V2.
Suppose my destination has 20 columns, but source will come sometimes as 10 columns or 15 or sometimes 20. If the source columns are less than destination then remaining column values in destination should be passed as Null.
Use data flows in ADF. Data Flow sinks can generate the table schema on the fly if you wish. Or you can just "auto-map" any changing schema to your target. If your source schema changes often, just use "schema drift" with no schema defined in your dataset.

use adf pipeline parameters as source to sink columns while mapping

I have an ADF pipeline with copy activity, I'm copying data from blob storage CSV file to SQL database, this is working as expected. I need to map Name of the CSV file (this coming from pipeline parameters) and save it in the destination table. I'm wondering if there is a way to map parameters to destination columns.
Column name can't directly use parameters. But you can use parameter for the whole structure property in dataset and columnMappings property in copy activity.
This might be a little tedious as you will need to write the whole structure array and columnMappings on your own and pass them as parameters into pipeline.
In DF v2 in Copy Data activity, it is possible to add a new column to the source with value $$FILEPATH, and then each record will have a name of the input file.
Azure DF v2, CopyData activity -> Source