BigQuery Merge - Size of Query expands with DELETE clause

When attempting a MERGE statement, BigQuery scans only the requested partitions UNTIL the DELETE statement is added, at which point it reverts to scanning the whole dataset (ballooning from 1 GB to >1 TB in this case).
Is there a way to use the full features of MERGE, including DELETE, without incurring the extra cost?
Generic sample that matches my effort below:
MERGE target_table AS t ## All Dates, partitioned on activity_date
USING source_table AS s ## one date, only yesterday
ON t.field_a = s.field_a
AND t.activity_date >=
DATE_ADD(DATE(current_timestamp(),'America/Los_Angeles'), INTERVAL -1 DAY) ## use partition to limit to yesterday
WHEN MATCHED
THEN UPDATE SET
field_b = s.field_b
WHEN NOT MATCHED
THEN INSERT
(field_a, field_b)
VALUES
(field_a, field_b)
WHEN NOT MATCHED BY SOURCE
THEN DELETE

Based on the query you have provided, it is not expected behavior for the merge to be applied to the whole dataset. After the query has run, you should analyze your dataset and check its validity to ensure that the query only ran on the specific partitions.
If, after further inspection, no unexpected changes were made to your dataset, the 1 TB of data noted may simply be explained by BigQuery ingesting that data into memory as an intermediate step in running the query.
However, to confirm, it is recommended to submit a ticket in the issue tracker with your BigQuery JobID so that BigQuery engineering can properly inspect the issue.
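As a quick sanity check before filing, you can also look up how many bytes the job actually billed. A minimal sketch against INFORMATION_SCHEMA, assuming your jobs run in the US region (the job ID is a placeholder):
SELECT job_id, total_bytes_processed, total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_id = 'your_job_id';  -- placeholder: the JobID of the MERGE run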

Delta Live CDC for Aggregate State Tables

As far as I can tell from the documentation, I can not accomplish a specific migration from Delta to Delta Live that I would love to do... but I want to see if I might be missing a solution.
Currently, I have a number of aggregate batch Delta tables that upsert new records on a daily basis, keeping only a very basic set of information:
key, first_seen, last_seen
With normal Delta upserts, I do this by grabbing the new data, creating a data frame with the last-seen information, and conditionally updating the last_seen value or inserting all. So it looks like:
existing.alias("existing").merge(
    summary.alias("updates"), "existing.key = updates.key") \
  .whenMatchedUpdate(condition="updates.last_seen > existing.last_seen",
                     set={"last_seen": "updates.last_seen"}) \
  .whenNotMatchedInsertAll() \
  .execute()
I really would like to bring this into Delta Live pipelines and change it to an incremental update. Life would be so great. I only want to keep current information, so this should match SCD Type 1.
From the Databricks documentation on CDC, I can not see how to do an upsert on a subset of columns which also performs a whenNotMatchedInsertAll().
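The closest I can find is APPLY CHANGES INTO with STORED AS SCD TYPE 1. A rough sketch is below (table names are placeholders), but as far as I can tell it upserts whole rows rather than updating only last_seen, which would clobber first_seen:
CREATE OR REFRESH STREAMING LIVE TABLE aggregate_state;

APPLY CHANGES INTO live.aggregate_state
FROM stream(live.daily_summary)  -- placeholder incremental source
KEYS (key)
SEQUENCE BY last_seen  -- newer last_seen wins
STORED AS SCD TYPE 1;  -- keep only the current row per key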
Any thoughts?

Issue - BigQuery Merge statement won't allow date partitioning - is now reading full table

We are trying to perform a merge on 1 day of a date-partitioned table. We've followed Google's recommended merge setup:
MERGE dataset.target T
USING (SELECT * FROM dataset.source WHERE _PARTITIONTIME = '2018-01-01') S
ON T.c1 = S.c1
WHEN MATCHED THEN
DELETE
We even added an extra predicate to the WHEN clause:
WHEN MATCHED AND T._PARTITIONTIME = '2018-01-01' THEN
Running this query results in BQ scanning our entire table instead of hitting only a single day.
Running the same query without the MERGE statement only hits 1 day of data as expected. We are wondering if anything changed on BQ's side, since we made no changes to our data structure whatsoever.
Cheers,

Spark Delta Table Updates

I am working in Microsoft Azure Databricks environment using sparksql and pyspark.
So I have a delta table on a lake where data is partitioned by, say, file_date. Every partition contains files storing millions of records per day with no primary/unique key. All these records have a "status" column which can either be NULL (if everything looks good on that specific record) or not null (say, if a particular lookup mapping for a particular column is not found). Additionally, my process contains another folder called "mapping" which gets refreshed on a periodic basis, let's say nightly to make it simple, from which mappings are found.
On a daily basis, there is a good chance that about 100-200 rows get errored out (status column containing not-null values). From these files, on a daily basis (hence the partition by file_date), a downstream job pulls all the valid records and sends them for further processing, ignoring those 100-200 errored records while waiting for the correct mapping file to be received. The downstream job, in addition to the valid-status records, should also try and see if a mapping is found for the errored records and, if present, take them down further as well (after, of course, updating the data lake with the appropriate mapping and status).
What is the best way to go? The ideal way would be to first update the delta table/lake directly with the correct mapping and update the status column to say "available_for_reprocessing", and have my downstream job pull the valid data for the day plus the "available_for_reprocessing" data and, after processing, update back with the status as "processed". But this seems to be super difficult using delta.
I was looking at "https://docs.databricks.com/delta/delta-update.html" and the update example there just shows a simple update with constants, not updates from multiple tables.
The other, but most inefficient, option is to pull ALL the data (both processed and errored) for, say, the last 30 days, get the mapping for the errored records, and write the dataframe back into the delta lake using the replaceWhere option. This is super inefficient, as we are reading everything (hundreds of millions of records) and writing everything back just to process, say, 1000 records at most. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at "https://docs.databricks.com/delta/delta-update.html", the example provided is for very simple updates. Without a unique key, it is impossible to update specific records as well. Can someone please help?
I use pyspark, or can use sparksql, but I am lost.
If you want to update one column ('status') on the condition that all lookups are now correct for rows where they weren't correct before (where 'status' is currently incorrect), I think the UPDATE command along with EXISTS can help you solve this. It isn't mentioned in the update documentation, but it works for both delete and update operations, effectively allowing you to update/delete records on joins.
For your scenario I believe the sql command would look something like this:
UPDATE your_db.table_name AS a
SET status = 'correct'
WHERE EXISTS
(
    SELECT *
    FROM your_db.table_name AS b
    JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
    JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
    -- ... add further lookups if needed
    WHERE
        b.status = 'incorrect' AND
        a.lookup_column_a = b.lookup_column_a AND
        a.lookup_column_b = b.lookup_column_b
)
Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN
  UPDATE SET maindept.dname = upddept.updated_name,
             maindept.location = upddept.updated_location
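For the original status/mapping scenario, the same pattern might look roughly like this sketch; the table names, lookup column, and mapped column below are placeholders, not from your setup:
-- hypothetical names throughout
MERGE INTO your_db.events AS e
USING your_db.mapping AS m
ON e.lookup_column_a = m.lookup_column_a
AND e.status IS NOT NULL  -- only previously errored rows
WHEN MATCHED THEN
  UPDATE SET e.mapped_value = m.mapped_value,
             e.status = 'available_for_reprocessing'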

Are idx_scan statistics reset automatically (default)?

I was looking at the tables (pg_stat_user_indexes and pg_stat_user_tables) and discovered many indices that are not being used.
But before I think about doing any operations to remove these indices, I need to understand what period this data (idx_scan) covers - has it been accumulating since the database was created?
In the pg_stat_database table (column stats_reset) there is a date that is normally today or up to 15 days ago, but does this process interfere with the tables I mentioned above?
No pg_stat_reset() command was executed.
Does the pg_stat_reset() command clear the tables (pg_stat_user_indexes and pg_stat_user_tables)?
My goal is to understand the period of data collected so that I can make a decision.
Statistics are cumulative and are kept from the time of cluster creation on.
So if you see the pg_stat_database.stats_reset change regularly, there must be somebody or something doing that explicitly with the pg_stat_reset() function.
Doing so is somewhat problematic, because this resets all statistics, including those in pg_stat_user_tables which govern when autovacuum and autoanalyze take place. So after a reset these will be a little out of whack until autoanalyze has collected new statistics.
The better way is to take regular snapshots and calculate the difference.
You are right that you should collect data over a longer time before you determine if an index can be canned or not. For example, some activity may only take place once a month, but require certain indexes.
Before dropping indexes, consider that indexes also serve other purposes besides being scanned:
They can be UNIQUE or back a constraint, in which case they serve a purpose even when they are never scanned.
Indexes on expressions make PostgreSQL collect statistics on the distribution of the indexed expression, which can have a notable effect on query planning and the quality of your execution plans.
You could use the query in this blog to find all the indexes that serve no purpose at all.
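As a rough illustration of that kind of check (my own sketch, not the query from the blog):
-- unused indexes that are not unique, do not back a constraint,
-- and are not expression indexes
SELECT s.schemaname, s.relname, s.indexrelname
FROM pg_stat_user_indexes AS s
JOIN pg_index AS i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique
  AND i.indexprs IS NULL
  AND NOT EXISTS (SELECT 1 FROM pg_constraint AS c
                  WHERE c.conindid = s.indexrelid);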
Only a superuser is allowed to reset statistics. The query planner depends on statistics.
Use snapshots:
CREATE TABLE stat_idx_snap_m10_d29_16_12 AS SELECT * FROM pg_stat_user_indexes;
CREATE TABLE stat_idx_snap_m10_d29_16_20 AS SELECT * FROM pg_stat_user_indexes;
Analyze the difference any time later:
SELECT
s2.relid, s2.indexrelid, s2.schemaname, s2.relname, s2.indexrelname,
s2.idx_scan - s1.idx_scan as idx_scan,
s2.idx_tup_read - s1.idx_tup_read as idx_tup_read,
s2.idx_tup_fetch - s1.idx_tup_fetch as idx_tup_fetch
FROM stat_idx_snap_m10_d29_16_20 s2
FULL OUTER JOIN stat_idx_snap_m10_d29_16_12 s1
ON s2.relid = s1.relid AND s2.indexrelid = s1.indexrelid
ORDER BY s2.idx_scan - s1.idx_scan ASC;

Data Lake Analytics - Large vertex query

I have a simple query which does a GROUP BY on two fields:
#facturas =
SELECT a.CodFactura,
Convert.ToInt32(a.Fecha.ToString("yyyyMMdd")) AS DateKey,
SUM(a.Consumo) AS Consumo
FROM #table_facturas AS a
GROUP BY a.CodFactura, a.DateKey;
#table_facturas has 4100 rows but the query takes several minutes to finish. Looking at the graph explorer, I see it uses 2500 vertices because I have 2500 unique CodFactura+DateKey rows. I don't know if this is normal ADLA behaviour. Is there any way to reduce the number of vertices and execute this query faster?
First: I am not sure your query will actually compile. You would need the Convert expression in your GROUP BY or do it in a previous SELECT statement (see the sketch at the end of this answer).
Secondly: In order to answer your question, we would need to know how the full query is defined. Where does #table_facturas come from? How was it produced?
Without this information, I can only give some wild speculative guesses:
If #table_facturas is coming from an actual U-SQL table, your table is over-partitioned/fragmented. This could be because:
you inserted a lot of data originally with a distribution on the grouping columns, and you either have a predicate that reduces the number of rows per partition and/or you do not have up-to-date statistics (run CREATE STATISTICS on the columns).
you did a lot of INSERT statements, each inserting a small number of rows into the table, thus creating a big number of individual files. This will "scale out" the processing as well. Use ALTER TABLE REBUILD to recompact.
If it is coming from a fileset, you may have too many small files in the input. See if you can merge them into fewer, larger files.
If the above does not help, you can also try to hint a small number of rows in the query that creates #table_facturas by adding OPTION(ROWCOUNT=4000).
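For the compile issue mentioned at the start, a minimal sketch of the "previous SELECT statement" fix, keeping the question's rowset naming (the rest of the script is assumed unchanged):
// compute DateKey first so the GROUP BY can reference a real column
#with_datekey =
    SELECT a.CodFactura,
           Convert.ToInt32(a.Fecha.ToString("yyyyMMdd")) AS DateKey,
           a.Consumo
    FROM #table_facturas AS a;

#facturas =
    SELECT b.CodFactura,
           b.DateKey,
           SUM(b.Consumo) AS Consumo
    FROM #with_datekey AS b
    GROUP BY b.CodFactura, b.DateKey;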