We have a Databricks table creation script like this:
finalDF.write.format('delta').option("mergeSchema", "true").mode('overwrite').save(table_path)
spark.sql("CREATE TABLE IF NOT EXISTS {}.{} USING DELTA LOCATION '{}' ".format('GOLDDB', table, table_path))
Initially, during the first load, table_path contains just one file. This runs as an incremental load, and files accumulate every day, so after 10 incremental loads it takes around 10 hours to complete. Could you please help me optimise the load? Is it possible to merge files?
I just tried removing some files for testing purposes, but it failed with an error saying that some files referenced in the log file are missing, which happens when you delete the files manually.
Please suggest how to optimize this load.
Instead of write + create table you can just do everything in one step using the path option + saveAsTable:
finalDF.write.format('delta')\
.option("mergeSchema", "true")\
.option("path", table_path)\
.mode('overwrite')\
.saveAsTable(table_name) # like 'GOLDDB.name'
To clean up old data you need to use the VACUUM command (doc); you may also need to decrease the retention from the default of 30 days (see the doc on the delta.logRetentionDuration option).
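A minimal sketch of that cleanup step, reusing the table_name from the snippet above (the 7-day interval is only an example value):
# Remove data files that are no longer referenced by the Delta log
spark.sql("VACUUM " + table_name)

# Optionally shorten how long old log entries are kept (default is 30 days)
spark.sql("ALTER TABLE " + table_name
          + " SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 7 days')")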
I'm learning PySpark. I'm trying to build a DataFrame from SQL, for example:
DF=spark.sql("with a as (select ....) select ...")
My SQL is a little complex, so it takes 20 minutes to execute.
I feel like DF is just a reference to my SQL, which means that when I execute DF.head(10) it takes 20 minutes, the next step DF.count() also takes 20 minutes, etc.
I'd like to have a DataFrame like in pandas, with the values in RAM, where DF.head(10) and DF.count() take a few seconds.
The only way I can think of is to use "create table", for example:
xx=spark.sql("create table yyy as with a as (select ....) select ...")
DF=sqlContext.sql("select * from yyy")
It works but it looks strange to me.
What are the best practices for creating a DataFrame in PySpark from complex SQL? I would like to skip the "create table" step.
I'd like to have a DataFrame like in pandas, with the values in RAM, where DF.head(10) and DF.count() take a few seconds.
Pandas loads your data into memory the moment you read it; that's why it's lightning-fast. But remember that the amount of data you can load is limited by your computer's memory.
My SQL is a little complex, so it takes 20 minutes to execute. I feel like DF is just a reference to my SQL, which means that when I execute DF.head(10) it takes 20 minutes, the next step DF.count() also takes 20 minutes, etc.
Spark does not load the data when you read it. It only reads data when there is an "action" like cache, count, or head.
The only way I can think of is to use "create table"
Yes, creating a table is also an action, where your query is executed entirely. The next time you read it, Spark doesn't have to re-compute it. The alternative to creating a table is caching: you can do something like DF.cache().count(), Spark will load the entire dataset into memory, and all later actions will be much faster.
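A minimal sketch of the caching approach, using the same query as above:
# The SQL is lazy: nothing is executed yet
DF = spark.sql("with a as (select ....) select ...")
DF.cache()   # mark DF for in-memory caching
DF.count()   # first action: the query runs once and the result is cached
DF.head(10)  # later actions read from the cache and return quickly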
I am trying to create a BQ external table on top of a Delta table that uses Google Cloud Storage as its storage layer. On the Delta table, we perform DML that includes deletes.
I can create a BQ external table on top of the GCS bucket where all the Delta files exist. However, it pulls in even the deleted records, since a BQ external table cannot read the Delta transaction log, which says which parquet files to consider and which ones to skip.
Is there a way we can expose the latest snapshot of the Delta table (GCS location) in BQ as an external table, other than programmatically copying data from Delta to BQ?
This question was asked more than a year ago, but I have a tricky yet powerful addition to Oliver's answer which eliminates data duplication and additional load logic.
Step 1 As Oliver suggested, generate symlink_format_manifest files; you can either trigger this every time you update, or add a table property as stated here to automatically create those files whenever the Delta table is updated:
ALTER TABLE delta.`<path-to-delta-table>` SET TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true)
Step 2 Create an external table that points to the Delta table location
> bq mkdef --source_format=PARQUET "gs://test-delta/*.parquet" > bq_external_delta_logs
> bq mk --external_table_definition=bq_external_delta_logs test.external_delta_logs
Step 3 Create another external table pointing to the symlink_format_manifest/manifest file
> bq mkdef --autodetect --source_format=CSV gs://test-delta/_symlink_format_manifest/manifest > bq_external_delta_manifest
> bq mk --table --external_table_definition=bq_external_delta_manifest test.external_delta_manifest
Step 4 Create a view with the following query
> bq mk \
--use_legacy_sql=false \
--view \
'SELECT
*
FROM
`project_id.test.external_delta_logs`
WHERE
    _FILE_NAME in (select * from `project_id.test.external_delta_manifest`)' \
test.external_delta_snapshot
Now you can get the latest snapshot from the test.external_delta_snapshot view whenever your Delta table is refreshed, without any additional loading or data duplication. A downside of this solution is that, in case of schema changes, you have to add new fields to your table definition either manually or from your Spark pipeline using the BQ client, etc. (a sketch of that follows the explanation below). For those who are curious about how this solution works, please continue reading.
How this works:
The symlink manifest file contains a newline-delimited list of parquet files pointing to the partitions of the current Delta version:
gs://delta-test/......-part1.parquet
gs://delta-test/......-part2.parquet
....
In addition to our Delta location, we define another external table by treating this manifest file as a CSV file (it's actually a single-column CSV file). The view we've defined takes advantage of the _FILE_NAME pseudo column mentioned here, which gives the parquet file location of every row in the table. As stated in the docs, the _FILE_NAME pseudo column is defined for every external table that points to data stored in Cloud Storage and Google Drive.
So at this point, we have the list of parquet files required for loading the latest snapshot, and the ability to filter the files we want to read using the _FILE_NAME column. The view we have defined simply encodes this procedure to get the latest snapshot. Whenever our Delta table gets updated, the manifest and the Delta log table point to the newest data, so we always get the latest snapshot without any additional loading or data duplication.
One last word: it's a known fact that queries on external tables are more expensive (in execution cost) than on BQ-managed tables, so it's better to experiment with both dual writes, as Oliver suggested, and the external table solution you asked about. Storage is cheaper than execution, so there may be cases where keeping the data in both GCS and BQ costs less than keeping an external table like this.
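On the schema-change downside mentioned above, a hedged sketch of adding a new field to the table definition with the Python BigQuery client (the column name and type are hypothetical, and depending on how the external table was defined you may need to recreate its definition instead):
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("project_id.test.external_delta_logs")
# Append the new field that appeared in the Delta schema (example field)
table.schema = list(table.schema) + [bigquery.SchemaField("new_column", "STRING")]
client.update_table(table, ["schema"])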
I'm also developing this kind of pipeline, where we dump our Delta Lake files in GCS and present them in BigQuery. Generating the manifest file from your GCS Delta files will give you the latest snapshot based on the current version of your Delta table. You then need a custom script to parse that manifest file to get the list of files and run a bq load over those files (see the sketch after the code below).
import io.delta.tables._

// Generate the symlink manifest for the current table version
val deltaTable = DeltaTable.forPath("<path-to-delta-table>")
deltaTable.generate("symlink_format_manifest")
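A hedged sketch of the custom script part (the bucket and destination table names are placeholders; assumes the google-cloud-storage package and the bq CLI are available):
import subprocess
from google.cloud import storage

# Read the newline-delimited list of parquet files from the manifest
client = storage.Client()
blob = client.bucket("test-delta").blob("_symlink_format_manifest/manifest")
parquet_files = blob.download_as_text().splitlines()

# Replace the destination table with the files that make up the latest snapshot
subprocess.run(
    ["bq", "load", "--replace", "--source_format=PARQUET",
     "test.delta_snapshot", ",".join(parquet_files)],
    check=True,
)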
The workaround below might work for small datasets.
Have a separate BQ table.
Read the Delta Lake files into a DataFrame and then overwrite the BigQuery table with it, as sketched below.
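A hedged sketch of that workaround, assuming the spark-bigquery connector is available on the cluster (paths, table, and bucket names are placeholders):
# Read the current snapshot through the Delta reader so deletes are respected
df = spark.read.format("delta").load("gs://test-delta/")

# Overwrite a separate BQ-managed table with that snapshot
(df.write.format("bigquery")
    .option("table", "project_id.test.delta_snapshot")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("overwrite")
    .save())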
I have a notebook that I use to do a history load, loading 6 months of data each time, starting with 2018-10-01.
My Delta file is partitioned by calendar_date.
After the initial load I am able to read the Delta file and look at the data just fine.
But after the second load, for dates 2019-01-01 to 2019-06-30, the previous partitions no longer load normally using the delta format.
Reading my source Delta file like this throws an error saying the file doesn't exist:
game_refined_start = (
spark.read.format("delta").load("s3://game_events/refined/game_session_start/calendar_date=2018-10-04/")
)
However, reading it like below works just fine. Any idea what could be wrong?
spark.conf.set("spark.databricks.delta.formatCheck.enabled", "false")
game_refined_start = (
spark.read.format("parquet").load("s3://game_events/refined/game_session_start/calendar_date=2018-10-04/")
)
If overwrite mode is used, then it completely replaces the previous data. You still see the old data via parquet because Delta doesn't remove old versions immediately (whereas an overwrite with plain parquet removes the data immediately).
To fix your problem you need to use append mode. If you need to get the previous data back, you can read a specific version of the table and append it. Something like this:
path = "s3://game_events/refined/game_session_start/"
v = spark.sql(f"DESCRIBE HISTORY delta.`{path}` limit 2")
version = v.take(2)[1][0]
df = spark.read.format("delta").option("versionAsOf", version).load(path)
df.write.mode("append").save(path)
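Going forward, the incremental writes themselves should use append mode instead of overwrite; a minimal sketch, where new_batch_df is a hypothetical DataFrame holding the next six-month batch:
(new_batch_df.write
    .format("delta")
    .mode("append")                # append instead of overwrite
    .partitionBy("calendar_date")  # keep the existing partitioning
    .save("s3://game_events/refined/game_session_start/"))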
Every weekend I add a few files to a google bucket and then run something from the command line to "update" a table with the new data.
By "update" I mean that I delete the table and then remake it by using all the files in the bucket, including the new files.
I do everything by using python to execute the following command in the Windows command line:
bq mk --table --project_id=hippo_fence-5412 mouse_data.partition_test gs://mybucket/mouse_data/* measurement_date:TIMESTAMP,symbol:STRING,height:FLOAT,weight:FLOAT,age:FLOAT,response_time:FLOAT
This table is getting massive (>200 GB) and it would be much cheaper for the lab to use partitioned tables.
I've tried to partition the table in a few ways, including what is recommended by the official docs, but I can't make it work.
The most recent command I tried was just inserting --time_partitioning_type=DAY like:
bq mk --table --project_id=hippo_fence-5412 --time_partitioning_type=DAY mouse_data.partition_test gs://mybucket/mouse_data/* measurement_date:TIMESTAMP,symbol:STRING,height:FLOAT,weight:FLOAT,age:FLOAT,response_time:FLOAT
but that didn't work, giving me the error:
FATAL Flags parsing error: Unknown command line flag 'time_partitioning_type'
How can I make this work?
For the old data, a possible solution would be to create an empty partitioned table and then import each bucket file into the desired day partition. Unfortunately, it didn't work with wildcards when I tested it.
1. Create the partitioned table
bq mk --table --time_partitioning_type=DAY [MY_PROJECT]:[MY_DATASET].[PROD_TABLE] measurement_date:TIMESTAMP,symbol:STRING,height:FLOAT,weight:FLOAT,age:FLOAT,response_time:FLOAT
2. Import each file into the desired day partition. Here is an example for a file from 22 February 2018 (see the loop sketch after these steps for automating this).
bq load [MY_PROJECT]:[MY_DATASET].[PROD_TABLE]$20180222 gs://MY-BUCKET/my_file.csv
3. Process the current uploads normally, and they will automatically be counted in the partition for the day of the upload.
bq load [MY_PROJECT]:[MY_DATASET].[PROD_TABLE] gs://MY-BUCKET/files*
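If there are many historical files, step 2 can be automated with a small script. A hedged sketch, assuming the google-cloud-storage package is installed and that each file name contains its date as YYYYMMDD:
import re
import subprocess
from google.cloud import storage

client = storage.Client()
for blob in client.bucket("MY-BUCKET").list_blobs():
    # Assumes a date like 20180222 appears somewhere in the file name
    match = re.search(r"\d{8}", blob.name)
    if not match:
        continue
    partition = match.group(0)
    subprocess.run(
        ["bq", "load",
         "[MY_PROJECT]:[MY_DATASET].[PROD_TABLE]$" + partition,
         "gs://MY-BUCKET/" + blob.name],
        check=True,
    )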
I am trying to use the UnloadCopyUtility to migrate an instance to an encrypted instance, but some of the tables fail because it is trying to insert the values into the wrong columns. Is there a way I can ensure the columns are mapped to the values correctly? I can adjust the Python script locally if need be.
I feel this should be possible in the UnloadCopy utility as well.
But here I'm trying to give a more generic solution without the UnloadCopy utility, so that it may be helpful to others as an alternative.
In the UNLOAD command you can specify the columns, like C1,C2,C3,...
Use the same column sequence in the COPY command while loading the data into Redshift.
UNLOAD command example:
unload ('select C1,C2,C3,... from venue') to 's3://mybucket/tickit/unload/venue_' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' parallel off;
COPY command example with the same column sequence for the files unloaded above:
copy table (C1,C2,C3,...) from 's3://<your-bucket-name>/load/key_prefix' credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' options;