Issue - BigQuery Merge statement won't allow date partitioning - is now reading full table

Issue - BigQuery Merge statement won't allow date partitioning - is now reading full table - merge

We are trying to perform a merge 1 day of a date-partitioned table. We've followed google's recommended merge setup:
MERGE dataset.target T
USING (SELECT * FROM dataset.source WHERE _PARTITIONTIME = '2018-01-01') S
ON T.c1 = S.c1
WHEN MATCHED THEN
DELETE
We even put an extra piece of query into the when statement:
WHEN MATCHED and tgt._PARTITIONTIME = '2018-01-01' THEN
Running this query results in BQ querying our entire table instead of hitting only a single day.
Running the same query without the MERGE statement only hits 1 day of data as expected. We are wondering if anything changed on BQs side, since we made no changes to our datastructure whatsoever.
Cheers,

Related

Spark Delta Table Updates

I am working in Microsoft Azure Databricks environment using sparksql and pyspark.
So I have a delta table on a lake where data is partitioned by say, file_date. Every partition contains files storing millions of records per day with no primary/unique key. All these records have a "status" column which can either contain values NULL (if everything looks good on that specific record) or Not null (say if a particular lookup mapping for a particular column is not found). Additionally, my process contains another folder called "mapping" which gets refreshed on a periodic basis, lets say nightly to make it simple, from where mappings are found.
On a daily basis, there is a good chance that about 100~200 rows get errored out (status column containing not null values). From these files, on a daily basis, (hence is the partition by file_date) , a downstream job pulls all the valid records and sends it for further processing ignoring those 100-200 errored records, waiting for the correct mapping file to be received. The downstream job, in addition to the valid status records, should also try and see if a mapping is found for the errored records and if present, take it down further as well (after of course, updating the data lake with the appropriate mapping and status).
What is the best way to go? The best way is to directly first update the delta table/lake with the correct mapping and update the status column to say "available_for_reprocessing" and my downstream job, pull the valid data for the day + pull the "available_for_reprocessing" data and after processing, update back with the status as "processed". But this seems to be super difficult using delta.
I was looking at "https://docs.databricks.com/delta/delta-update.html" and the update example there is just giving an example for a simple update with constants to update, not for updates from multiple tables.
The other but the most inefficient is, say pull ALL the data (both processed and errored) for the last say 30 days , get the mapping for the errored records and write the dataframe back into the delta lake using the replaceWhere option. This is super inefficient as we are reading everything (hunderds of millions of records) and writing everything back just to process say a 1000 records at the most. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at "https://docs.databricks.com/delta/delta-update.html", the example provided is for very simple updates. Without a unique key, it is impossible to update specific records as well. Can someone please help?
I use pyspark or can use sparksql but I am lost

If you want to update 1 column ('status') on the condition that all lookups are now correct for rows where they weren't correct before (where 'status' is currently incorrect), I think UPDATE command along with EXISTS can help you solve this. It isn't mentioned in the update documentation, but it works both for delete and update operations, effectively allowing you to update/delete records on joins.
For your scenario I believe the sql command would look something like this:
UPDATE your_db.table_name AS a
SET staus = 'correct'
WHERE EXISTS
(
SELECT *
FROM your_db.table_name AS b
JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
-- ... add further lookups if needed
WHERE
b.staus = 'incorrect' AND
a.lookup_column_a = b.lookup_column_a AND
a.lookup_column_b = b.lookup_column_b
)

Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN UPDATE SET maindept.dname = upddept.updated_name, maindept.location = upddept.updated_location

BigQuery Merge - Size of Query expands with DELETE clause

When attempting a MERGE statement, BigQuery is only scanning the requested partitions UNTIL the DELETE statment is added, at which point it reverts to scanning the whole dataset (blossoming from 1GB to >1TB in this case).
Is there a way to use the full features of MERGE, including DELETE, without incurring the extra cost?
Generic sample that matches my effort below:
MERGE target_table AS t *## All Dates, partitioned on
activity_date*
USING source_table AS s ## one date, only yesterday
ON t.field_a = s.field_a
AND t.activity_date >=
DATE_ADD(DATE(current_timestamp(),'America/Los_Angeles'), INTERVAL -1 DAY) ## use partition to limit to yesterday
WHEN MATCHED
THEN UPDATE SET
field_b = s.field_b
WHEN NOT MATCHED
THEN INSERT
(field_a, field_b)
VALUES
(field_a, field_b)
WHEN NOT MATCHED BY SOURCE
THEN DELETE

Based on the query you have provided, it is not expected behavior for it to apply the merge on the whole dataset. After the query has run, you should analyze your dataset and check its validity to ensure that the query only ran on the specific partitions.
If, after further inspection, no unexpected changes were made to your dataset, the 1 TB of data noted may be simply explained as BigQuery ingesting that data into memory as a side step to be able to run the query.
However, to confirm it is recommended to submit a ticket in the issue tracker with your BigQuery JobID so that BigQuery engineering can properly inspect the issue.

SalesForce ADO.NET Source in SSDT 2013 SOQL Statement Using Date Literals Not Returning Results

I am having an issue with a date literal not returning any results when I run an SOQL query. The query is as follows:
Select * From dbo.Case WHERE CreatedDate = YESTERDAY
With the query, I would like to obtain case data from the previous day. I know there is data available that was created the previous day. The results of the query when I preview it, though, are an empty set with no error message.
A different wrinkle that makes this not quite a strict SOQL issue is that I am trying to use this query as an SQL command on an ADO.NET connection using the CData ADO.NET driver to connect to a SalesForce.com instance. My goal is to be able to build SSDT packages that will allow me to stage the data from SalesForce into our SQL Server for processing there.
I have similar issues using the LAST_N_DAYS date literal as well.
I believe I should be using SOQL to query in the SQL command text field for the ADO.NET source connection but I am not 100% sure about that. I know for certain that I cannot use T-SQL because it does not recognize the GETDATE().
Any guidance on how to pull the records from Case for the previous day or where the query I am using might be wrong would be greatly appreciated.

Found an answer. The following SQL command will pull the data from the previous day:
Select * From dbo.Case Where CreatedDate = 'YESTERDAY'
The single quotes evaluate the yesterday date literal as expected.
Likewise, the following SQL statement will get the last 30 days of data.
Select * From dbo.Case Where CreatedDate = 'LAST_N_DAYS:30'
Thanks to anyone who researched and attempted the question! :)

Converting complex query with inner join to tableau

I have a query like this, which we use to generate data for our custom dashboard (A Rails app) -
SELECT AVG(wait_time) FROM (
SELECT TIMESTAMPDIFF(MINUTE,a.finished_time,b.start_time) wait_time
FROM (
SELECT max(start_time + INTERVAL avg_time_spent SECOND) finished_time, branch
FROM mytable
WHERE name IN ('test_name')
AND status = 'SUCCESS'
GROUP by branch) a
INNER JOIN
(
SELECT MIN(start_time) start_time, branch
FROM mytable
WHERE name IN ('test_name_specific')
GROUP by branch) b
ON a.branch = b.branch
HAVING avg_time_spent between 0 and 1000)t
GROUP BY week
Now I am trying to port this to tableau, and I am not being able to find a way to represent this data in tableau. I am stuck at how to represent the inner group by in a calculated field. I can also try to just use a custom sql data source, but I am already using another data source.
columns in mytable -
start_time
avg_time_spent
name
branch
status
I think this could be achieved new Level Of Details formulas, but unfortunately I am stuck at version 8.3

Save custom SQL for rare cases. This doesn't look like a rare case. Let Tableau generate the SQL for you.
If you simply connect to your table, then you can usually write calculated fields to get the information you want. I'm not exactly sure why you have test_name in one part of your query but test_name_specific in another, so ignoring that, here is a simplified example to a similar query.
If you define a calculated field called worst_case_test_time
datediff(min(start_time), dateadd('second', max(start_time), avg_time_spent)), which seems close to what your original query says.
It would help if you explained what exactly you are trying to compute. It appears to be some sort of worst case bound for avg test time. There may be an even simpler formula, but its hard to know without a little context.
You could filter on status = "Success" and avg_time_spent < 1000, and place branch and WEEK(start_time) on say the row and column shelves.
P.S. Your query seems a little off. Don't you need an aggregation function like MAX or AVG after the HAVING keyword?

Does DataReader.NextResult retrieves the result is always the same order

I have a SELECT query that yields multiple results and do not have any ORDER BY clause.
If I execute this query multiple times and then iterate through results using DataReader.NextResult(), would I be guaranteed to get the results in the same order?
For e.g. if I execute the following query that return 199 rows:
SELECT * FROM products WHERE productid < 200
would I always get the first result with productid = 1 and so on?
As far as I have observed it always return the results in same order, but I cannot find any documentation for this behavior.
======================================
As per my research:
Check out this blog Conor vs. SQL. I actually wanted to ask if the query-result changes even if the data in table remains the same (i.e no update or delete). But it seems like in case of large table, when SQL server employees parallelism, the order can be different

First of all, to iterate the rows in a DataReader, you should call Read, not NextResult.
Calling NextResult will move to the next result set if your query has multiple SELECT statements.
To answer your question, you must not rely on this.
A query without an ORDER BY clause will return rows in SQL Server's default iteration order.
For small tables, this will usually be the order in which the rows were added, but this is not guaranteed and is liable to change at any time. For example, if the table is indexed or partitioned, the order will be different.

No, DataReader will return the results in the order they come back from SQL. If you don't specify an ORDER BY clause, that will be the order that they exist in the table.

It is possible, perhaps even likely that they will always return in the same order, but this isn't guaranteed. The order is determined by the queryplan (at least in SQL Server) on the database server. If something changes that queryplan, the order could change. You should always use ORDER BY if the order of results is in anyway important to your processing of the data.