How to create a BQ external table on top of a delta table in GCS and show only the latest snapshot - pyspark

I am trying to create a BQ external table on top of a delta table which uses Google Cloud Storage as its storage layer. On the delta table, we perform DML which includes deletes.
I can create a BQ external table on top of the gs bucket where all the delta files exist. However, it also pulls in the deleted records, since a BQ external table cannot read the delta transaction log, which says which parquet files to consider and which ones to ignore.
Is there a way to expose the latest snapshot of the delta table (gs location) in BQ as an external table, apart from programmatically copying the data from delta to BQ?

This question was asked more than a year ago, but I have a tricky but powerful addition to Oliver's answer which eliminates data duplication and additional load logic.
Step 1 As Oliver suggested, generate symlink_format_manifest files. You can either trigger this every time you update the table, or you can add a table property to your table, as stated here, to automatically create those files whenever the delta table is updated:
ALTER TABLE delta.`<path-to-delta-table>` SET TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true)
Step 2 Create an external table that points to the delta table location
> bq mkdef --source_format=PARQUET "gs://test-delta/*.parquet" > bq_external_delta_logs
> bq mk --external_table_definition=bq_external_delta_logs test.external_delta_logs
Step 3 Create another external table pointing to the _symlink_format_manifest/manifest file
> bq mkdef --autodetect --source_format=CSV gs://test-delta/_symlink_format_manifest/manifest > bq_external_delta_manifest
> bq mk --table --external_table_definition=bq_external_delta_manifest test.external_delta_manifest
Step 4 Create a view with the following query
> bq mk \
--use_legacy_sql=false \
--view \
'SELECT
*
FROM
`project_id.test.external_delta_logs`
WHERE
_FILE_NAME in (select * from `project_id.test.external_delta_manifest`)' \
test.external_delta_snapshot
Now you can get the latest snapshot from the test.external_delta_snapshot view whenever your delta table is refreshed, without any additional loading or data duplication. A downside of this solution is that, in case of schema changes, you have to add the new fields to your table definition either manually or from your Spark pipeline using the BQ client etc. For those who are curious about how this solution works, please continue reading.
How this works:
Symlink manifest files contain a newline-delimited list of the parquet files that make up the current version of the delta table:
gs://delta-test/......-part1.parquet
gs://delta-test/......-part2.parquet
....
In addition to our delta location, we define another external table by treating this manifest file as a CSV file (it's actually a single-column CSV file). The view we've defined takes advantage of the _FILE_NAME pseudo column mentioned here, which holds the parquet file location of every row in the table. As stated in the docs, the _FILE_NAME pseudo column is defined for every external table that points to data stored in Cloud Storage or Google Drive.
So at this point, we have the list of parquet files required for loading the latest snapshot and the ability to filter the files we want to read using the _FILE_NAME column. The view simply combines the two to return the latest snapshot. Whenever our delta table gets updated, both the manifest table and the delta log table read the newest files, so we always get the latest snapshot without any additional loading or data duplication.
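The filtering that the view performs can be illustrated outside BigQuery. The following is only a minimal Python sketch of the idea, not what BigQuery executes internally; the function name, the row layout and the file names are hypothetical.

```python
from typing import Dict, Iterable, List


def latest_snapshot(rows: Iterable[Dict], manifest_text: str) -> List[Dict]:
    """Keep only the rows whose source file is listed in the manifest."""
    # The manifest is newline-delimited, one parquet URI per line.
    live_files = {line.strip() for line in manifest_text.splitlines() if line.strip()}
    return [row for row in rows if row["_FILE_NAME"] in live_files]


# Hypothetical data: part-0 is no longer part of the current delta version,
# so it is still in the bucket but missing from the manifest.
rows = [
    {"id": 1, "_FILE_NAME": "gs://test-delta/part-0.parquet"},
    {"id": 2, "_FILE_NAME": "gs://test-delta/part-1.parquet"},
    {"id": 3, "_FILE_NAME": "gs://test-delta/part-2.parquet"},
]
manifest = "gs://test-delta/part-1.parquet\ngs://test-delta/part-2.parquet\n"
snapshot = latest_snapshot(rows, manifest)
```

Here the row that came from part-0 is dropped, which mirrors the `_FILE_NAME in (select ...)` condition in the view.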
One last word: it's a known fact that queries on external tables are more expensive (in execution cost) than queries on BQ managed tables, so it's better to experiment with both dual writes, as Oliver suggested, and the external table solution you asked about. Storage is cheaper than execution, so there may be some cases where keeping data in both GCS and BQ costs less than keeping an external table like this.

I'm also developing this kind of pipeline, where we dump our delta lake files in GCS and present them in BigQuery. Generating the manifest file from your GCS delta files gives you the latest snapshot, based on the version currently set on your delta table. You then need a custom script that parses the manifest file to get the list of files and runs a bq load on those files.
val deltaTable = DeltaTable.forPath(<path-to-delta-table>)
deltaTable.generate("symlink_format_manifest")
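The custom script could look roughly like the Python sketch below. It assumes the manifest content has already been read into a string and that the target table is replaced on every run; the function names and the table name are hypothetical, and only the command is built here, not executed.

```python
from typing import List


def parse_manifest(manifest_text: str) -> List[str]:
    """Return the parquet URIs listed in a symlink manifest, one per line."""
    return [line.strip() for line in manifest_text.splitlines() if line.strip()]


def build_bq_load_command(table: str, uris: List[str]) -> List[str]:
    """Build a bq load command that replaces the table with the listed files."""
    if not uris:
        raise ValueError("manifest lists no files")
    # bq load accepts several source URIs as one comma-separated argument.
    return ["bq", "load", "--source_format=PARQUET", "--replace", table, ",".join(uris)]


manifest = "gs://test-delta/part-1.parquet\ngs://test-delta/part-2.parquet\n"
command = build_bq_load_command("test.delta_snapshot", parse_manifest(manifest))
```

The resulting list can then be handed to something like subprocess.run(command, check=True) on a machine where the bq CLI is installed.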

The workaround below might work for small datasets.
Have a separate BQ table.
Read the delta lake files into a DataFrame and then write it to the BigQuery table in overwrite mode (df.write.mode("overwrite")).

Related

Delta Lake Data Load Datatype mismatch

I am loading data from SQL Server to Delta Lake tables. Recently I had to repoint the source to another table (same columns), but the data types are different in the new table. This causes an error while loading data into the delta table. I am getting the following error:
Failed to merge fields 'COLUMN1' and 'COLUMN1'. Failed to merge incompatible data types LongType and DecimalType(32,0)
Command I use to write data to the delta table:
DF.write.mode("overwrite").format("delta").option("mergeSchema", "true").save("s3 path")
The only option I can think of right now is to set overwriteSchema to true.
But this will rewrite my target schema completely. I am concerned that any sudden change in the source schema will replace the existing target schema without any notification or alert.
Also, I can't explicitly convert these columns, because the Databricks notebook I am using is a parametrized one used to load data from source to target (we read a CSV file that contains all the details about the target table, source table, partition key etc.).
Is there any better way to tackle this issue?
Any help is much appreciated!
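One way to get a notification before the target schema is replaced is to compare column types first. This is only a sketch under assumptions: the schemas are represented as plain name-to-type dictionaries, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def find_type_changes(target: Dict[str, str], source: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (column, target_type, source_type) for columns whose type differs."""
    shared = sorted(target.keys() & source.keys())
    return [(col, target[col], source[col]) for col in shared if target[col] != source[col]]


# Hypothetical schemas matching the error message in the question.
target_schema = {"COLUMN1": "bigint", "COLUMN2": "string"}
source_schema = {"COLUMN1": "decimal(32,0)", "COLUMN2": "string"}
changes = find_type_changes(target_schema, source_schema)
```

In PySpark the dictionaries could be built with dict(df.dtypes) for both the incoming DataFrame and the existing table, and a non-empty result could raise an alert instead of writing.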

Azure Data Factory - Insert Sql Row for Each File Found

I need a data factory that will:
check an Azure blob container for csv files
for each csv file
insert a row into an Azure Sql table, giving filename as a column value
There's just a single csv file in the blob container and this file contains five rows.
So far I have the following actions:
Within the for-each action I have a copy action. I gave this a source of a dynamic dataset which had the filename set as a parameter from @item().name. However, as a result 5 rows were inserted into the target table, whereas I was expecting just one.
The for-each loop executes just once, but I don't know how to use a data source that consists of variable(s) holding the filename and timestamp.
You are headed in the right direction, but within the For each you just need a Stored Procedure Activity that will insert the FileName (and whatever other metadata you have available) into Azure DB Table.
Like this:
Here is an example of the stored procedure in the DB:
CREATE PROCEDURE Log.PopulateFileLog (@FileName varchar(100))
AS
INSERT INTO Log.CvsRxFileLog
SELECT
@FileName AS FileName,
GETDATE() AS ETL_Timestamp
EDIT:
You could also execute the insert directly with a Lookup Activity within the For Each like so:
EDIT 2
This will show how to do it without a for each
NOTE: This is the most cost-effective method, especially when dealing with hundreds or thousands of files on a recurring basis!
1st, Copy the output Json Array from your lookup/get metadata activity using a Copy Data activity with a Source of Azure SQLDB and Sink of Blob Storage CSV file
-------SOURCE:
-------SINK:
2nd, Create another Copy Data Activity with a Source of Blob Storage Json file, and a Sink of Azure SQLDB
---------SOURCE:
---------SINK:
---------MAPPING:
In essence, you save the entire json Output to a file in Blob, you then copy that file using a json file type to azure db. This way you have 3 activities to run even if you are trying to insert from a dataset that has 500 items in it.
Of course there is always more than one way to do things, but I don't think you need a For Each activity for this task. Activities like Lookup, Get Metadata and Filter output their results as JSON which can be passed around. This JSON can contain one or many items and can be passed to a Stored Procedure. An example pattern:
This is the sort of ELT pattern common with early ADF gen 2 (prior to Mapping Data Flows), which makes use of resources already in use in your architecture. You should remember that you are charged by activity executions in ADF (e.g. multiple iterations in an unnecessary For Each loop), and that generally compute in Azure is expensive and storage is cheap, so think about this when implementing patterns in ADF. If you build the pattern above you have two types of compute: the compute behind your Azure SQL DB and the Azure Integration Runtime. If you add a Data Flow to that, you will have a third type of compute operating concurrently with the other two, so personally I only add these under certain conditions.
An example implementation of the above pattern:
Note the expression I am passing into my example logging proc:
@string(activity('Filter1').output.Value)
Data Flows is perfectly fine if you want a low-code approach and do not have compute resource already available to do this processing. In your case you already have an Azure SQL DB which is quite capable with JSON processing, eg via the OPENJSON, JSON_VALUE and JSON_QUERY functions.
You mention not wanting to deploy additional code, which I understand, but then where did your original SQL table come from? If you are absolutely against deploying additional code, you could simply call the sp_executesql stored proc via the Stored Proc activity and use a dynamic SQL statement which inserts your record, something like this:
@concat( 'INSERT INTO dbo.myLog ( logRecord ) SELECT ''', activity('Filter1').output, ''' ')
Shred the JSON either in your stored proc or later, eg
SELECT y.[key] AS name, y.[value] AS [fileName]
FROM dbo.myLog
CROSS APPLY OPENJSON( logRecord ) x
CROSS APPLY OPENJSON( x.[value] ) y
WHERE logId = 16
AND y.[key] = 'name';
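To make the two nested OPENJSON calls easier to follow, here is a plain Python sketch of the same shredding. It assumes logRecord holds a JSON array of objects with name and type properties, as a Filter or Get Metadata output would; the function name and sample values are hypothetical.

```python
import json
from typing import List, Tuple


def shred_names(log_record: str) -> List[Tuple[str, str]]:
    """Return (key, value) pairs for every 'name' property in a JSON array of objects."""
    result = []
    for item in json.loads(log_record):      # outer OPENJSON: one row per array element
        for key, value in item.items():      # inner OPENJSON: one row per property
            if key == "name":                # WHERE y.[key] = 'name'
                result.append((key, value))
    return result


log_record = '[{"name": "file1.csv", "type": "File"}, {"name": "file2.csv", "type": "File"}]'
names = shred_names(log_record)
```

Each tuple corresponds to one row of the name and fileName columns returned by the query.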

How do I dynamically map files in Copy Activity to load the data into destination

Azure Data factory V2 - Copy Activity - Copy data from Changing Column names and number of columns to Destination. I have to copy data from a Flat File where number of Columns will change in each file and even the column names. How do I dynamically map them in Copy Activity to load the data into destination in Azure Data factory V2.
Suppose my destination has 20 columns, but the source sometimes arrives with 10 columns, sometimes 15 and sometimes 20. If the source has fewer columns than the destination, the remaining destination columns should be set to NULL.
Use data flows in ADF. Data Flow sinks can generate the table schema on the fly if you wish. Or you can just "auto-map" any changing schema to your target. If your source schema changes often, just use "schema drift" with no schema defined in your dataset.
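The behavior asked for in the question (missing source columns become NULL in the destination) can be sketched in plain Python. This only illustrates the mapping rule and is not how ADF implements schema drift; the function and column names are hypothetical.

```python
from typing import Dict, List, Optional


def align_row(row: Dict[str, str], destination_columns: List[str]) -> Dict[str, Optional[str]]:
    """Map a source row onto the destination columns, using None for missing ones."""
    return {col: row.get(col) for col in destination_columns}


destination = ["id", "name", "city"]
aligned = align_row({"id": "1", "name": "Ann"}, destination)
```

Source columns that do not exist in the destination are dropped by this sketch, which may or may not be what you want.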

How to create partitioned table from Google Bucket?

Every weekend I add a few files to a google bucket and then run something from the command line to "update" a table with the new data.
By "update" I mean that I delete the table and then remake it by using all the files in the bucket, including the new files.
I do everything by using python to execute the following command in the Windows command line:
bq mk --table --project_id=hippo_fence-5412 mouse_data.partition_test gs://mybucket/mouse_data/* measurement_date:TIMESTAMP,symbol:STRING,height:FLOAT,weight:FLOAT,age:FLOAT,response_time:FLOAT
This table is getting massive (>200 GB) and it would be much cheaper for the lab to use partitioned tables.
I've tried to partition the table in a few ways, including what is recommended by the official docs, but I can't make it work.
The most recent command I tried was just inserting --time_partitioning_type=DAY like:
bq mk --table --project_id=hippo_fence-5412 --time_partitioning_type=DAY mouse_data.partition_test gs://mybucket/mouse_data/* measurement_date:TIMESTAMP,symbol:STRING,height:FLOAT,weight:FLOAT,age:FLOAT,response_time:FLOAT
but that didn't work, giving me the error:
FATAL Flags parsing error: Unknown command line flag 'time_partitioning_type'
How can I make this work?
For the old data, a possible solution would be to create an empty partitioned table and then import each bucket file in the desired day partition. Unfortunately it didn’t work with wildcards when I tested it.
1. Create the partitioned table
bq mk --table --time_partitioning_type=DAY [MY_PROJECT]:[MY_DATASET].[PROD_TABLE] measurement_date:TIMESTAMP,symbol:STRING,height:FLOAT,weight:FLOAT,age:FLOAT,response_time:FLOAT
2. Import each file in the desired partition day. Here is an example for a file from 22nd February 2018.
bq load [MY_PROJECT]:[MY_DATASET].[PROD_TABLE]$20180222 gs://MY-BUCKET/my_file.csv
3. Process the current uploads normally; they will automatically be placed in the partition for the day of the upload.
bq load [MY_PROJECT]:[MY_DATASET].[PROD_TABLE] gs://MY-BUCKET/files*
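Since the question drives bq from Python, step 2 could be scripted along these lines. This is a minimal sketch that assumes you already know which day each file belongs to; the function name and the table and file names are hypothetical, and the command is only built, not run.

```python
from datetime import date
from typing import List


def partition_load_command(table: str, day: date, uri: str) -> List[str]:
    """Build a bq load command that targets one day partition via the $YYYYMMDD decorator."""
    target = f"{table}${day:%Y%m%d}"
    return ["bq", "load", target, uri]


command = partition_load_command(
    "my_project:my_dataset.prod_table", date(2018, 2, 22), "gs://MY-BUCKET/my_file.csv"
)
```

Passing the command as a list rather than a single shell string keeps the $ in the partition decorator from being interpreted by a shell.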

How to read AWS Glue Data Catalog table schemas programmatically

I have a set of daily CSV files of uniform structure which I will upload to S3. There is a downstream job which loads the CSV data into a Redshift database table. The number of columns in the CSV may increase and from that point onwards the new files will come with the new columns in them. When this happens, I would like to detect the change and add the column to the target Redshift table automatically.
My plan is to run a Glue Crawler on the source CSV files. Any change in schema would generate a new version of the table in the Glue Data Catalog. I would then like to programmatically read the table structure (columns and their datatypes) of the latest version of the Table in the Glue Data Catalog using Java, .NET or other languages and compare it with the schema of the Redshift table. In case new columns are found, I will generate a DDL statement to alter the Redshift table to add the columns.
Can someone point me to any examples of reading Glue Data Catalog tables using Java, .NET or other languages? Are there any better ideas to automatically add new columns to Redshift tables?
If you want to use Java, use the dependency:
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-glue</artifactId>
<version>{VERSION}</version>
</dependency>
And here's a code snippet to get your table versions and the list of columns:
import java.util.List;
import com.amazonaws.services.glue.AWSGlue;
import com.amazonaws.services.glue.AWSGlueClientBuilder;
import com.amazonaws.services.glue.model.*;
AWSGlue client = AWSGlueClientBuilder.defaultClient();
GetTableVersionsRequest tableVersionsRequest = new GetTableVersionsRequest()
.withDatabaseName("glue_catalog_database_name")
.withTableName("table_name_generated_by_crawler");
GetTableVersionsResult results = client.getTableVersions(tableVersionsRequest);
// Here you have all the table versions, at this point you can check for new ones
List<TableVersion> versions = results.getTableVersions();
// Here's how to get to the table columns of the first returned version
List<Column> tableColumns = versions.get(0).getTable().getStorageDescriptor().getColumns();
Here you can see AWS Doc for the TableVersion and the StorageDescriptor objects.
You could also use the boto3 library for Python.
Hope this helps.
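The comparison step from the question could be sketched in Python as follows. It assumes the Glue columns arrive as dictionaries with Name and Type keys (the shape boto3 returns under StorageDescriptor.Columns) and that the existing Redshift column names have already been fetched; the type mapping, function name and table name are hypothetical and would need adjusting to your data.

```python
from typing import Dict, List, Set

# Hypothetical Glue-to-Redshift type mapping; extend it for the types you use.
GLUE_TO_REDSHIFT = {
    "string": "VARCHAR(256)",
    "bigint": "BIGINT",
    "int": "INTEGER",
    "double": "DOUBLE PRECISION",
    "boolean": "BOOLEAN",
    "timestamp": "TIMESTAMP",
    "date": "DATE",
}


def add_column_ddl(table: str, glue_columns: List[Dict[str, str]], redshift_columns: Set[str]) -> List[str]:
    """Generate ALTER TABLE statements for Glue columns missing from the Redshift table."""
    statements = []
    for col in glue_columns:
        name = col["Name"].lower()
        if name in redshift_columns:
            continue
        # Unknown Glue types fall back to VARCHAR(256) in this sketch.
        redshift_type = GLUE_TO_REDSHIFT.get(col["Type"], "VARCHAR(256)")
        statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {redshift_type};")
    return statements


glue_columns = [
    {"Name": "id", "Type": "bigint"},
    {"Name": "price", "Type": "double"},
    {"Name": "note", "Type": "string"},
]
ddl = add_column_ddl("public.daily_data", glue_columns, {"id"})
```

The generated statements can then be run against Redshift with whichever client you already use.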