Databricks Delta files: adding a new partition causes old ones to become unreadable - pyspark

I have a notebook with which I am doing a history load, loading 6 months of data each time, starting with 2018-10-01.
My delta file is partitioned by calendar_date.
After the initial load I am able to read the delta file and look at the data just fine.
But after the second load, for dates 2019-01-01 to 2019-06-30, the previous partitions no longer load normally using the delta format.
Reading my source delta file like this throws an error saying the file doesn't exist:
game_refined_start = (
    spark.read.format("delta")
    .load("s3://game_events/refined/game_session_start/calendar_date=2018-10-04/")
)
However, reading it like below works just fine. Any idea what could be wrong?
spark.conf.set("spark.databricks.delta.formatCheck.enabled", "false")
game_refined_start = (
    spark.read.format("parquet")
    .load("s3://game_events/refined/game_session_start/calendar_date=2018-10-04/")
)

If overwrite mode is used, then it completely replaces the previous data. You still see the old data via parquet because Delta doesn't remove old versions immediately (whereas doing the overwrite with plain parquet removes the old data immediately).
To fix your problem you need to use append mode. If you need to get the previous data back, you can read a specific version of the table and append it, something like this:
path = "s3://game_events/refined/game_session_start/"
# Find the version that still contained the previous data
v = spark.sql(f"DESCRIBE HISTORY delta.`{path}` LIMIT 2")
version = v.take(2)[1][0]
# Read that version and append it back to the current table
df = spark.read.format("delta").option("versionAsOf", version).load(path)
df.write.format("delta").mode("append").save(path)
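Going forward, the six-month loads themselves should also be written with append rather than overwrite; a minimal sketch (new_batch_df is a hypothetical DataFrame holding one load):

# Append the new batch instead of replacing the existing partitions
new_batch_df.write.format("delta") \
    .mode("append") \
    .partitionBy("calendar_date") \
    .save("s3://game_events/refined/game_session_start/")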

Related

How can I sample and show parquet files using fugue?

Say I have this file s3://some/path/some_partitioned_data.parquet.
I would like to sample a given count of rows and display them nicely, possibly in a Jupyter notebook.
some_partitioned_data.parquet could be very large, and I would like to do this without loading the data into memory, or even downloading the parquet files to disk.
Spark doesn't let you sample a given number of rows; you can only sample a given fraction. But with Fugue 0.8.0, here is a solution to get n rows:
import fugue.api as fa

# Load the parquet file on the Spark engine, sample a small fraction,
# and display the first 10 rows of the sample
df = fa.load("parquetfile", engine=spark)
fa.show(fa.sample(df, frac=0.0001), n=10)
You just need to make sure that, with that frac, the sample still contains more than 10 rows.
You can use fa.head to get the dataframe instead of printing it.
See the API reference at https://fugue.readthedocs.io/en/latest/top_api.html
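For example, a minimal sketch of the fa.head variant, assuming the same placeholder path and that fa.head takes the row count as its second argument:

import fugue.api as fa

# Sample a small fraction, then return the first 10 rows as a dataframe
# instead of printing them
df = fa.load("parquetfile", engine=spark)
sampled_head = fa.head(fa.sample(df, frac=0.0001), 10)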

Create table in spark taking a lot of time

We have a Databricks table-creation script like this:
finalDF.write.format('delta').option("mergeSchema", "true").mode('overwrite').save(table_path)
spark.sql("CREATE TABLE IF NOT EXISTS {}.{} USING DELTA LOCATION '{}' ".format('GOLDDB', table, table_path))
So in table_path, initially during the first load, we just have 1 file. This runs as an incremental load, and every day files accumulate. After 10 incremental loads, it takes around 10 hours to complete. Could you please help me with how to optimise the load? Is it possible to merge files?
I just tried removing some files for testing purposes, but it failed with an error that some files present in the log file are missing, which occurs when you manually delete the files.
Please suggest how to optimize this query.
Instead of write + create table you can just do everything in one step using the path option + saveAsTable:
finalDF.write.format('delta')\
.option("mergeSchema", "true")\
.option("path", table_path)\
.mode('overwrite')\
.saveAsTable(table_name) # like 'GOLDDB.name'
To clean up old data you need to use the VACUUM command (doc); you may also need to decrease the retention from the default of 30 days (see the doc on the delta.logRetentionDuration option).
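For example, a minimal sketch (the table name is illustrative; VACUUM removes data files that are no longer referenced by the table and are older than the retention threshold):

# Clean up with the default retention threshold (7 days for data files)
spark.sql("VACUUM GOLDDB.my_table")

# Or state the threshold explicitly (168 hours = 7 days)
spark.sql("VACUUM GOLDDB.my_table RETAIN 168 HOURS")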

Delta merge logic whenMatchedDelete case

I'm working on Delta merge logic and want to delete a row in the delta table when that row gets deleted in the latest DataFrame read.
My sample DF is shown below:
df = spark.createDataFrame(
    [
        ('Java', "20000"),   # create your data here, be consistent in the types.
        ('PHP', '40000'),
        ('Scala', '50000'),
        ('Python', '10000')
    ],
    ["language", "users_count"]  # add your column names here
)
Insert the data into a delta table:
df.write.format("delta").mode("append").saveAsTable("xx.delta_merge_check")
On the next read, I've removed the row ('Python', '10000'), and now I want to delete this row from the delta table using the Delta merge API.
df_latest = spark.createDataFrame(
    [
        ('Java', "20000"),   # create your data here, be consistent in the types.
        ('PHP', '40000'),
        ('Scala', '50000')
    ],
    ["language", "users_count"]  # add your column names here
)
I'm using the below code for the delta merge API
Read the existing delta table:
from delta.tables import *

test_delta = DeltaTable.forPath(
    spark,
    "wasbs://storageaccount@xx.blob.core.windows.net/hive/warehouse/xx/delta_merge_check"
)
Merge the changes:
test_delta.alias("t").merge(df_latest.alias("s"), "s.language = t.language") \
    .whenMatchedDelete(condition="s.language = true") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
But unfortunately this doesn't delete the row ('Python', '10000') from the delta table. Is there any other way to achieve this? Any help would be much appreciated.
This won't work the way you think it should - the basic problem is that your 2nd dataset doesn't have any information that data was deleted, so you somehow need to add this information to it. There are different approaches, based on the specific details:
Instead of just removing the row, keep it, but add another column that shows whether the row is deleted or not, something like this:
test_delta.alias("t").merge(df_latest.alias("s"), "s.language = t.language") \
    .whenMatchedDelete(condition="s.is_deleted = true") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
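For this to work, the input DataFrame has to carry the flag; a sketch (the is_deleted column name is an assumption and just has to match the condition above):

# The removed row is kept but flagged instead of being dropped
df_latest = spark.createDataFrame(
    [
        ('Java', '20000', False),
        ('PHP', '40000', False),
        ('Scala', '50000', False),
        ('Python', '10000', True)   # marked as deleted
    ],
    ["language", "users_count", "is_deleted"]
)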
Use some other method to find the diff between your destination table and the input data - but this really depends on your logic. If you are able to calculate the diff, then you can use the approach described in the previous item.
If your input data is always a full set of data, you can just overwrite everything using overwrite mode - this can be even more performant than merge, because you don't need to read and join the existing data.
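A minimal sketch of that last option, reusing the table name from the question:

# Full refresh: replace the table contents with the latest full dataset
df_latest.write.format("delta").mode("overwrite").saveAsTable("xx.delta_merge_check")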

How to create a BQ external table on top of a delta table in GCS and show only the latest snapshot

I am trying to create a BQ external table on top of a Delta table which uses Google Cloud Storage as the storage layer. On the Delta table, we perform DML which includes deletes.
I can create a BQ external table on top of the GCS bucket where all the Delta files exist. However, it pulls in even the deleted records, since a BQ external table cannot read Delta's transaction log, which says which parquet files to consider and which ones to ignore.
Is there a way we can expose the latest snapshot of the Delta table (GCS location) in BQ as an external table, apart from programmatically copying data from Delta to BQ?
This question was asked more than a year ago, but I have a tricky yet powerful addition to Oliver's answer which eliminates data duplication and additional load logic.
Step 1: As Oliver suggested, generate the symlink_format_manifest files; you can either trigger the generation every time you update the table, or add a table property as stated here to automatically create those files when the delta table is updated:
ALTER TABLE delta.`<path-to-delta-table>` SET TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true)
Step 2: Create an external table that points to the delta table location
> bq mkdef --source_format=PARQUET "gs://test-delta/*.parquet" > bq_external_delta_logs
> bq mk --external_table_definition=bq_external_delta_logs test.external_delta_logs
Step 3: Create another external table pointing to the symlink_format_manifest/manifest file
> bq mkdef --autodetect --source_format=CSV gs://test-delta/_symlink_format_manifest/manifest > bq_external_delta_manifest
> bq mk --table --external_table_definition=bq_external_delta_manifest test.external_delta_manifest
Step 4: Create a view with the following query
> bq mk \
--use_legacy_sql=false \
--view \
'SELECT
*
FROM
`project_id.test.external_delta_logs`
WHERE
_FILE_NAME in (select * from `project_id.test.external_delta_manifest`)' \
test.external_delta_snapshot
Now you can get the latest snapshot, whenever your delta table is refreshed, from the test.external_delta_snapshot view without any additional loading or data duplication. A downside of this solution is that, in case of schema changes, you have to add new fields to your table definition either manually or from your Spark pipeline using the BQ client, etc. For those who are curious about how this solution works, please continue reading.
How this works:
The symlink manifest file contains a newline-delimited list of parquet files pointing to the current delta version's partitions:
gs://delta-test/......-part1.parquet
gs://delta-test/......-part2.parquet
....
In addition to our delta location, we are defining another external table by treating this manifest file as a CSV file (it's actually a single-column CSV file). The view we've defined takes advantage of the _FILE_NAME pseudo column mentioned here, which gives the parquet file location of every row in the table. As stated in the docs, the _FILE_NAME pseudo column is defined for every external table that points to data stored in Cloud Storage or Google Drive.
So at this point, we have the list of parquet files required for loading the latest snapshot, and the ability to filter the files we want to read using the _FILE_NAME column. The view we have defined simply encodes this procedure to get the latest snapshot. Whenever our delta table gets updated, the manifest and the delta log table will pick up the newest data, therefore we will always get the latest snapshot without any additional loading or data duplication.
One last word: it's a known fact that queries on external tables are more expensive (execution cost) than on BQ managed tables, so it's better to experiment with both dual writes as Oliver suggested and the external table solution you asked about. Storage is cheaper than execution, so there may be cases where keeping data in both GCS and BQ costs less than keeping an external table like this.
I'm also developing this kind of pipeline, where we dump our Delta Lake files into GCS and present them in BigQuery. Generating the manifest file from your GCS delta files will give you the latest snapshot, based on the version currently set on your delta files. Then you need to create a custom script to parse that manifest file to get the list of files, and then run a bq load mentioning those files.
import io.delta.tables.DeltaTable

val deltaTable = DeltaTable.forPath("<path-to-delta-table>")
deltaTable.generate("symlink_format_manifest")
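A rough sketch of that custom step, using the Python BigQuery client instead of the bq CLI (the bucket, dataset and table names are placeholders):

from google.cloud import bigquery, storage

gcs = storage.Client()
bq = bigquery.Client()

# The manifest is a newline-delimited list of gs:// parquet URIs
manifest = gcs.bucket("test-delta").blob("_symlink_format_manifest/manifest")
uris = manifest.download_as_text().splitlines()

# Load only the files that belong to the latest delta snapshot
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition="WRITE_TRUNCATE",  # replace the previous snapshot
)
bq.load_table_from_uri(uris, "test.delta_snapshot", job_config=job_config).result()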
The workaround below might work for small datasets.
Have a separate BQ table.
Read the Delta Lake files into a DataFrame and then overwrite the BigQuery table with it.
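A minimal sketch of that workaround, assuming the spark-bigquery connector is available (the bucket and table names are placeholders):

# Read the latest snapshot through the Delta reader...
df = spark.read.format("delta").load("gs://my-bucket/path/to/delta-table")

# ...and overwrite a separate BigQuery-managed table with it
(df.write.format("bigquery")
    .option("table", "my_project.my_dataset.delta_snapshot")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("overwrite")
    .save())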

Databricks - failing to write from a DataFrame to a Delta location

I wanted to change a column name of a Databricks Delta table.
So I did the following:
// Read old table data
val old_data_DF = spark.read.format("delta")
  .load("dbfs:/mnt/main/sales")

// Created a new DF with a renamed column
val new_data_DF = old_data_DF
  .withColumnRenamed("column_a", "metric1")
  .select("*")

// Dropped and recreated the Delta files location
dbutils.fs.rm("dbfs:/mnt/main/sales", true)
dbutils.fs.mkdirs("dbfs:/mnt/main/sales")

// Trying to write the new DF to the location
new_data_DF.write
  .format("delta")
  .partitionBy("sale_date_partition")
  .save("dbfs:/mnt/main/sales")
Here I'm getting an error at the last step, when writing to Delta:
java.io.FileNotFoundException: dbfs:/mnt/main/sales/sale_date_partition=2019-04-29/part-00000-769.c000.snappy.parquet
A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement
Obviously the data was deleted and most likely I've missed something in the above logic. Now the only place that contains the data is the new_data_DF.
Writing to a location like dbfs:/mnt/main/sales_tmp also fails
What should I do to write data from new_data_DF to a Delta location?
In general, it is a good idea to avoid using rm on Delta tables. Delta's transaction log can prevent eventual consistency issues in most cases; however, when you delete and re-create a table in a very short time, different versions of the transaction log can flicker in and out of existence.
Instead, I'd recommend using the transactional primitives provided by Delta. For example, to overwrite the data in a table you can:
df.write.format("delta").mode("overwrite").save("/delta/events")
If you have a table that has already been corrupted, you can fix it using FSCK REPAIR TABLE.
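A hedged sketch of what that repair could look like for the path in the question, assuming FSCK REPAIR TABLE accepts the delta.`<path>` identifier like other Delta commands; DRY RUN only previews which missing file entries would be removed from the transaction log:

# Preview the transaction-log entries whose underlying files are missing
spark.sql("FSCK REPAIR TABLE delta.`dbfs:/mnt/main/sales` DRY RUN")

# Remove those entries so the table becomes readable again
spark.sql("FSCK REPAIR TABLE delta.`dbfs:/mnt/main/sales`")

Note that this only removes dangling references; the data files deleted by rm are not recovered.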
You could do that in the following way.
// Read old table data
val old_data_DF = spark.read.format("delta")
  .load("dbfs:/mnt/main/sales")

// Created a new DF with a renamed column
val new_data_DF = old_data_DF
  .withColumnRenamed("column_a", "metric1")
  .select("*")

// Trying to write the new DF to the location
new_data_DF.write
  .format("delta")
  .mode("overwrite")                   // this will overwrite the existing data files
  .option("overwriteSchema", "true")   // this is the key line
  .partitionBy("sale_date_partition")
  .save("dbfs:/mnt/main/sales")
The overwriteSchema option will create new physical files with the latest schema that we updated during the transformation.