Every weekend I add a few files to a Google Cloud Storage bucket and then run a command to "update" a table with the new data.
By "update" I mean that I delete the table and then remake it using all the files in the bucket, including the new ones.
I do all of this from Python, which executes the following command in the Windows command line:
bq mk --table --project_id=hippo_fence-5412 mouse_data.partition_test gs://mybucket/mouse_data/* measurement_date:TIMESTAMP,symbol:STRING,height:FLOAT,weight:FLOAT,age:FLOAT,response_time:FLOAT
This table is getting massive (>200 GB) and it would be much cheaper for the lab to use partitioned tables.
I've tried to partition the table in a few ways, including what is recommended by the official docs, but I can't make it work.
The most recent thing I tried was simply inserting --time_partitioning_type=DAY, like:
bq mk --table --project_id=hippo_fence-5412 --time_partitioning_type=DAY mouse_data.partition_test gs://mybucket/mouse_data/* measurement_date:TIMESTAMP,symbol:STRING,height:FLOAT,weight:FLOAT,age:FLOAT,response_time:FLOAT
but that didn't work, giving me the error:
FATAL Flags parsing error: Unknown command line flag 'time_partitioning_type'
How can I make this work?
For the old data, a possible solution would be to create an empty partitioned table and then import each bucket file into the desired day partition. Unfortunately, wildcards didn't work when I tested this.
1. Create the partitioned table
bq mk --table --time_partitioning_type=DAY [MY_PROJECT]:[MY_DATASET].[PROD_TABLE] measurement_date:TIMESTAMP,symbol:STRING,height:FLOAT,weight:FLOAT,age:FLOAT,response_time:FLOAT
2. Import each file into the desired day partition. Here is an example for a file from 22 February 2018.
bq load [MY_PROJECT]:[MY_DATASET].[PROD_TABLE]$20180222 gs://MY-BUCKET/my_file.csv
3. Process the current uploads normally; they will automatically land in the partition for the day of the upload.
bq load [MY_PROJECT]:[MY_DATASET].[PROD_TABLE] gs://MY-BUCKET/files*
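Since the asker already drives bq from Python, step 2 can be scripted for the backfill. A minimal sketch, assuming each file carries its date in the filename (the naming pattern and helper name are illustrative, not from the original post): it derives the $YYYYMMDD partition decorator from the file name and builds the corresponding bq load command.

```python
# Build `bq load` commands targeting day partitions, assuming each backfill
# file has its date in the name, e.g. mouse_data_2018-02-22.csv.
import re

def partition_load_command(table, gcs_uri):
    """Return a bq load command string targeting the file's day partition."""
    match = re.search(r"(\d{4})-(\d{2})-(\d{2})", gcs_uri)
    if match is None:
        raise ValueError(f"no date found in {gcs_uri}")
    decorator = "".join(match.groups())  # YYYYMMDD partition decorator
    return f"bq load {table}${decorator} {gcs_uri}"

cmd = partition_load_command(
    "mouse_data.partition_test",
    "gs://mybucket/mouse_data/mouse_data_2018-02-22.csv",
)
print(cmd)
# bq load mouse_data.partition_test$20180222 gs://mybucket/mouse_data/mouse_data_2018-02-22.csv
```

Each generated string can then be passed to subprocess.run, just like the original single command.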
We have a Databricks table-creation script like this:
finalDF.write.format('delta').option("mergeSchema", "true").mode('overwrite').save(table_path)
spark.sql("CREATE TABLE IF NOT EXISTS {}.{} USING DELTA LOCATION '{}' ".format('GOLDDB', table, table_path))
Initially, during the first load, table_path contains just one file. The job runs incrementally and files accumulate every day, so after 10 incremental loads it takes around 10 hours to complete. Could you please help me optimise the load? Is it possible to merge files?
I tried removing some files for testing purposes, but it failed with an error saying that files present in the log file are missing, which happens when you delete the files manually.
Please suggest how to optimize this query.
Instead of write + create table, you can do everything in one step using the path option + saveAsTable:
finalDF.write.format('delta')\
.option("mergeSchema", "true")\
.option("path", table_path)\
.mode('overwrite')\
.saveAsTable(table_name) # like 'GOLDDB.name'
To clean up old data you need to use the VACUUM command (doc); you may also need to decrease the retention from the default 30 days (see the doc on the delta.logRetentionDuration option).
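The small-file accumulation itself is what Delta's OPTIMIZE command addresses (it compacts many small files into fewer large ones), and VACUUM then removes the unreferenced files safely, unlike deleting them by hand. A minimal sketch of the maintenance statements you would pass to spark.sql on Databricks; the table name is illustrative:

```python
# Build the Delta maintenance statements for a table: OPTIMIZE bin-packs
# small files into larger ones, VACUUM safely removes files no longer
# referenced by the transaction log (after the retention window).
def maintenance_statements(table, retain_hours=168):
    return [
        f"OPTIMIZE {table}",                            # compact small files
        f"VACUUM {table} RETAIN {retain_hours} HOURS",  # drop stale files safely
    ]

# On Databricks these would be executed as spark.sql(stmt) after each load.
for stmt in maintenance_statements("GOLDDB.my_table"):
    print(stmt)
```

Running OPTIMIZE periodically after the incremental loads should keep the file count, and hence the load time, under control without touching files manually.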
I am trying to create an external BQ table on top of a delta table which uses Google Cloud Storage as its storage layer. On the delta table, we perform DML which includes deletes.
I can create a BQ external table on top of the GCS bucket where all the delta files exist. However, it pulls in even the deleted records, since a BQ external table cannot read the delta transaction log, which says which parquet files to consider and which to remove.
Is there a way to expose the latest snapshot of the delta table (GCS location) in BQ as an external table, apart from programmatically copying the data from delta to BQ?
This question was asked more than a year ago, but I have a tricky yet powerful addition to Oliver's answer which eliminates data duplication and additional load logic.
Step 1 As Oliver suggested, generate symlink_format_manifest files; you can either trigger this every time you update the table, or add a table property as stated here to automatically create those files whenever the delta table is updated:
ALTER TABLE delta.`<path-to-delta-table>` SET TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true)
Step 2 Create an external table that points to the delta table location
> bq mkdef --source_format=PARQUET "gs://test-delta/*.parquet" > bq_external_delta_logs
> bq mk --external_table_definition=bq_external_delta_logs test.external_delta_logs
Step 3 Create another external table pointing to the symlink_format_manifest/manifest file
> bq mkdef --autodetect --source_format=CSV gs://test-delta/_symlink_format_manifest/manifest > bq_external_delta_manifest
> bq mk --table --external_table_definition=bq_external_delta_manifest test.external_delta_manifest
Step 4 Create a view with the following query
> bq mk \
--use_legacy_sql=false \
--view \
'SELECT
*
FROM
`project_id.test.external_delta_logs`
WHERE
_FILE_NAME in (select * from `project_id.test.external_delta_manifest`)' \
test.external_delta_snapshot
Now you can get the latest snapshot from the test.external_delta_snapshot view whenever your delta table is refreshed, without any additional loading or data duplication. A downside of this solution is that, in case of schema changes, you have to add the new fields to your table definition either manually or from your Spark pipeline using the BQ client, etc. For those curious about how this solution works, please continue reading.
How this works:
The symlink manifest file contains a newline-delimited list of the parquet files that make up the current delta version:
gs://delta-test/......-part1.parquet
gs://delta-test/......-part2.parquet
....
In addition to our delta location, we define another external table by treating this manifest file as a CSV file (it's actually a single-column CSV file). The view we've defined takes advantage of the _FILE_NAME pseudo column mentioned here, which gives the parquet file location of every row in the table. As stated in the docs, the _FILE_NAME pseudo column is defined for every external table that points to data stored in Cloud Storage or Google Drive.
So at this point, we have the list of parquet files required to load the latest snapshot, and the ability to filter the files we read using the _FILE_NAME column; the view simply encodes this procedure. Whenever the delta table is updated, the manifest and the delta log table pick up the newest data, so we always get the latest snapshot without any additional loading or data duplication.
One last word: it's a known fact that queries on external tables are more expensive (execution cost) than on BQ-managed tables, so it's better to experiment with both dual writes, as Oliver suggested, and the external table solution described here. Storage is cheaper than execution, so there may be cases where keeping the data in both GCS and BQ costs less than maintaining an external table like this.
I'm also developing this kind of pipeline, where we dump our delta lake files into GCS and present them in BigQuery. Generating the manifest file from your GCS delta files gives you the latest snapshot based on the current version of your delta files. You then need a custom script to parse that manifest file to get the list of files, and then run a bq load specifying those files.
val deltaTable = DeltaTable.forPath("<path-to-delta-table>")
deltaTable.generate("symlink_format_manifest")
The workaround below might work for small datasets:
1. Have a separate BQ table.
2. Read the delta lake files into a DataFrame, then overwrite the BigQuery table with df.write in overwrite mode.
I've looked at this resource, but it's not quite what I need. This question describes what I want to accomplish, but I want to run it from the bq command line.
For instance, in the past I've exported table information as a .json in bq command-line as so:
bq show --schema --format=prettyjson Dataset.TableView > /home/directory/Dataset.TableView.json
This gives a prettyjson of the table information for a specified dataset in a given project. I would just like a .csv (or any kind of list) of all the dataset names in the project, but I can't figure out how to change that command to output what I want.
To further contribute to the community, here is an alternative to @DanielZagales' answer using the bq command line. According to the documentation, you can use bq ls to list all the datasets in a project, as follows:
bq ls -a --format=pretty --project_id your-project-id
The -a flag is short for --all, which guarantees that all the datasets are included in the list. The --format=pretty flag outputs the list as a table; you can use other formats as described here. Furthermore, you can filter the datasets matching an expression with --filter labels.key:value, or set the maximum number of results with --max_results or -n.
Note: you can also list all the tables within a dataset, such as described here.
You should be able to query the information schema to get the results you want.
example:
select * from `project_id.INFORMATION_SCHEMA.SCHEMATA`;
You can then add that to the bq command like:
bq query --format=csv 'select * from `project_id.INFORMATION_SCHEMA.SCHEMATA`;'
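If the list is consumed from a script rather than read by eye, the machine-readable output is easier to work with. A small sketch, assuming `bq ls --format=json` has been run and its output captured; the sample payload below stands in for real command output and the project/dataset names are made up:

```python
# Parse the JSON output of `bq ls --format=json` into a plain list of
# dataset names. `sample_output` imitates what the command would print.
import json

sample_output = json.dumps([
    {"datasetReference": {"datasetId": "mouse_data", "projectId": "my-project"}},
    {"datasetReference": {"datasetId": "staging", "projectId": "my-project"}},
])

def dataset_names(bq_ls_json):
    """Extract datasetId from each entry of the bq ls JSON listing."""
    return [row["datasetReference"]["datasetId"] for row in json.loads(bq_ls_json)]

print(dataset_names(sample_output))  # ['mouse_data', 'staging']
```

In practice you would obtain the JSON with subprocess.run(["bq", "ls", "--format=json", ...], capture_output=True) and feed its stdout to the parser.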
I am trying to use the UnloadCopyUtility to migrate an instance to an encrypted instance, but some of the tables fail because it tries to insert the values into the wrong columns. Is there a way I can ensure the columns are mapped to the values correctly? I can adjust the Python script locally if need be.
I feel this should be possible in the UnloadCopy utility as well.
But here I'm trying to give a more generic solution, without the UnloadCopy utility, so that it may be helpful to others as an alternative.
In the unload command you can specify the columns like C1,C2,C3,...
Use the same column sequence in the copy command while loading the data into Redshift.
Unload command example.
unload ('select C1,C2,C3,... from venue') to 's3://mybucket/tickit/unload/venue_' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' parallel off;
Copy command example with the specific column sequence of the files unloaded above:
copy table (C1,C2,C3,...) from 's3://<your-bucket-name>/load/key_prefix' credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' options;
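Since the asker can adjust the Python script, the safest way to keep the two statements in sync is to generate both from a single column list, so the write order and the load order cannot drift apart. A sketch with illustrative placeholders (table, bucket prefix, IAM role):

```python
# Generate a matching UNLOAD/COPY pair from one column list, guaranteeing
# the column order written to S3 equals the order used when loading.
def unload_copy_pair(table, columns, bucket_prefix, iam_role):
    cols = ",".join(columns)
    unload = (
        f"unload ('select {cols} from {table}') "
        f"to '{bucket_prefix}' iam_role '{iam_role}' parallel off;"
    )
    copy = f"copy {table} ({cols}) from '{bucket_prefix}' iam_role '{iam_role}';"
    return unload, copy

unload_sql, copy_sql = unload_copy_pair(
    "venue", ["C1", "C2", "C3"], "s3://mybucket/unload/venue_",
    "arn:aws:iam::0123456789012:role/MyRedshiftRole",
)
print(unload_sql)
print(copy_sql)
```

Because both statements are rendered from the same `columns` list, reordering or adding a column only needs one change.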
I have a large CSV file which contains over 30 million rows. I need to load this file on a daily basis and identify which of the rows have changed. Unfortunately there is no unique key field, but it's possible to use four of the fields together to make one. Once I have identified the changed rows I will then want to export the data. I have tried a traditional SQL Server solution, but the performance is so slow it's not going to work. Therefore I have been looking at MongoDB, which managed to import the file in about 20 minutes (which is fine). Now, I don't have any experience with MongoDB, and more importantly I don't know best practices. So, my idea is the following:
1. As a one-off, import the data into a collection using mongoimport.
2. Copy all of the unique ids generated by Mongo and put them in a separate collection.
3. Import the new data into the existing collection using upserts, which should create a new id for each new or changed row.
4. Compare the 'copy' to the new collection to list out all the changed rows.
5. Export the changed data.
This to me will work but I am hoping there is a much better way to tackle this problem.
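The comparison step can also be sketched without MongoDB at all: build a composite key from the four fields and index yesterday's rows by it, then keep any of today's rows whose key is new or whose contents differ. The column positions and sample rows below are made up for illustration:

```python
# Identify changed rows between two CSV snapshots using four columns as a
# composite key. Column indices and sample data are illustrative.
import csv
import io

KEY_COLS = (0, 1, 2, 3)  # the four columns that together form a unique key

def index_rows(text):
    """Map composite key -> full row for each CSV line."""
    return {tuple(r[i] for i in KEY_COLS): r for r in csv.reader(io.StringIO(text))}

def changed_rows(old_text, new_text):
    """Return today's rows that are new or differ from yesterday's."""
    old, new = index_rows(old_text), index_rows(new_text)
    return [row for key, row in new.items() if old.get(key) != row]

yesterday = "a,b,c,d,10\ne,f,g,h,20\n"
today = "a,b,c,d,11\ne,f,g,h,20\ni,j,k,l,30\n"
print(changed_rows(yesterday, today))  # one changed row plus one new row
```

At 30 million rows the key index fits in memory on a reasonably sized machine, but this is a sketch of the idea rather than a tuned solution; the sort/diff approach below avoids holding anything in memory.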
Use unix sort and diff.
Sort the files on disk:
sort -o new_file.csv -t ',' big_file.csv
sort -o old_file.csv -t ',' yesterday.csv
diff new_file.csv old_file.csv
Commands may need some tweaking.
You can also use MySQL to import the file via LOAD DATA INFILE
(http://dev.mysql.com/doc/refman/5.1/en/load-data.html)
and then create a KEY (or primary key) on the four fields.
Then load yesterday's file into a different table and use two SQL statements to compare the files...
But, diff will work best!
-daniel