Why is this q/kdb+ join more efficient? Replacing a 2-column left-join with each-xgroup 1-column left joins

I was playing around with left-joining two in-memory tables. Both tables include `date and `sym columns, have a couple of million rows each, and are sorted by date/sym with the sorted attribute (`s#) set on date.
Initially I did a naive left-join using:
t: t1 lj 2!t2;
and it was taking ~15s, then I played around and wrote it like:
t: ungroup {flip flip[x] lj 1!flip y}'[`date xgroup t1;`date xgroup t2];
and this dropped runtime to 0.5s while yielding the same result (at least on my tables).
I don't understand why this would be more efficient. Is this expected? Or am I doing something wrong that prevents lj from doing this more efficiently on its own?

Related

PostgreSQL - 100 million records transfer from archive to a new table

I have a requirement to transfer data from 2 tables (Table A and Table B) into a new table.
I am using a query to join both A and B tables using an ID column.
Tables A and B are archive tables without any indexes (millions of records).
Tables X and Y are replicas of A and B with good indexes (some thousands of records).
Below is the code for my project.
with data as
(
SELECT a.*, b.* FROM A_archive a
JOIN B_archive b ON a.transaction_id = b.transaction_id
UNION
SELECT x.*, y.* FROM X x
JOIN Y y ON x.transaction_id = y.transaction_id
)
INSERT INTO
Another_Table
(
columns
)
SELECT * FROM data
ON CONFLICT (transaction_id)
DO UPDATE ...
The whole thing above runs in a production environment and involves nearly 140 million records.
Because of this, the production database takes almost 10 hours to process the data and fails.
I also have a distributed job scheduler in AWS that schedules this query inside a function and retrieves the latest records every 5 hours. The archive tables store closed invoice data. The Pega UI will use this table to retrieve data about closed invoices and show it to the customer.
Please suggest something that is a bit more performant.
UNION removes duplicate rows. On big unindexed tables that is an expensive operation. Try UNION ALL if you don't need deduplication; it will save the enormous amount of data shuffling and comparisons required for deduplication.
Without indexes on your archival tables your JOIN operation will be grossly inefficient. Index, at a minimum, the transaction_id columns you use in your ON clause.
You don't say what you want to do with the resulting table. In many cases you'll be able to use a VIEW rather than a table for your purposes. A VIEW removes the work of creating the derived table. Actually it defers the work to the time of SELECT operations using the derived structure. If your SELECT operations have highly selective WHERE clauses the savings can be astonishing. For this to work well you may need to put appropriate indexes on your archival tables.
You use SELECT * when you could enumerate the columns you need. That certainly puts one redundant column into your result: it generates two copies of transaction_id. It also may generate other redundant or unused data. Always avoid SELECT * in production software unless you know you need it.
Keep this in mind: SQL is declarative, not procedural. You declare (describe) the result you require, and you let the server work out the best way to get it. VIEWs let the server do this work for you in cases like your table combination. It will use the indexes you provide as best it can.
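A hedged sketch of how those suggestions could fit together: transaction_id and the table names come from the question, while invoice_total, closed_date and the view name closed_invoices are hypothetical placeholders for whatever columns are actually needed.
-- indexes on the join keys of the unindexed archive tables
CREATE INDEX IF NOT EXISTS a_archive_txn_idx ON A_archive (transaction_id);
CREATE INDEX IF NOT EXISTS b_archive_txn_idx ON B_archive (transaction_id);
-- a view with UNION ALL and an explicit column list instead of a materialised copy
CREATE VIEW closed_invoices AS
SELECT a.transaction_id, a.invoice_total, b.closed_date
FROM A_archive a
JOIN B_archive b ON a.transaction_id = b.transaction_id
UNION ALL
SELECT x.transaction_id, x.invoice_total, y.closed_date
FROM X x
JOIN Y y ON x.transaction_id = y.transaction_id;
-- a selective query against the view then only touches the matching rows
SELECT transaction_id, invoice_total, closed_date
FROM closed_invoices
WHERE transaction_id = 12345;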
That UNION must be costly: it pretty much builds a temp table in the background containing all the A-B and X-Y records, sorts it (over all fields) and then removes any duplicates. If you say 100 million records are involved, then that's a LOT of sorting going on that will most likely involve swapping out to disk.
Keep in mind that you only need to do this if duplicates are expected:
in the result from the JOIN between A and B,
in the result from the JOIN between X and Y, or
in the combined result of the two above.
If none of those is expected, just use UNION ALL.
In fact, in that case, why not have one INSERT operation for A-B and another one for X-Y (see the sketch below)? Going by the description, I'd say that whatever is in X-Y should overrule whatever is in A-B anyway, right?
Also, as mentioned by O.Jones, archive tables or not, they should at least have a (preferably clustered) index on the transaction_id fields you're JOINing on (the same goes for Another_Table, by the way).
All that said, processing 100M records in 1 transaction IS going to take some time, it's just a lot of data that's being moved around. But 10h does sound excessive indeed.
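A hedged sketch of that split into two UNION-free INSERTs; as before, invoice_total and closed_date are hypothetical placeholder columns standing in for whatever the real tables hold.
-- first pass: archive data
INSERT INTO Another_Table (transaction_id, invoice_total, closed_date)
SELECT a.transaction_id, a.invoice_total, b.closed_date
FROM A_archive a
JOIN B_archive b ON a.transaction_id = b.transaction_id
ON CONFLICT (transaction_id) DO UPDATE
SET invoice_total = EXCLUDED.invoice_total,
    closed_date = EXCLUDED.closed_date;
-- second pass: X-Y data, run last so its rows overrule the archive rows on conflict
INSERT INTO Another_Table (transaction_id, invoice_total, closed_date)
SELECT x.transaction_id, x.invoice_total, y.closed_date
FROM X x
JOIN Y y ON x.transaction_id = y.transaction_id
ON CONFLICT (transaction_id) DO UPDATE
SET invoice_total = EXCLUDED.invoice_total,
    closed_date = EXCLUDED.closed_date;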

Is it possible to use one field in a cte to limit the data in another cte

I am new to using CTEs, but I work with a humongous database and think they would cause less stress to the system than subqueries. I'm not sure if what I want to do is possible.
I have 2 CTEs with different columns from different tables, but each CTE has the same sample_num column (both of type int) that could be used to join them if possible. I use the first CTE to limit the data for samples. I want the second CTE to look into the first and, if the sample numbers match, include that sample number's data in the second CTE. The reason I have the second CTE is that I use its data to create a pivot table.
Ultimately what I want to do in my outer query is to use the fields from the first CTE and add the pivot table columns from the second CTE to the left. Basically marry the two CTEs side by side in the final outer query.
Is this possible, or am I making this a lot harder than it needs to be? Remember, I work on a huge database with thousands of users.
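For what it's worth, a later CTE in the same WITH list can reference an earlier one, which is the "look into the first CTE" behaviour described above. A minimal sketch under that assumption: apart from sample_num, every table and column name here is a hypothetical placeholder, and the pivot step is omitted.
WITH limited_samples AS (
    -- first CTE: restricts which samples are of interest
    SELECT sample_num, sample_date, sample_status
    FROM samples
    WHERE sample_date >= '2024-01-01'
),
results_for_pivot AS (
    -- second CTE: only rows whose sample_num appears in the first CTE
    SELECT r.sample_num, r.analyte, r.result_value
    FROM results r
    JOIN limited_samples s ON s.sample_num = r.sample_num
)
SELECT s.sample_num, s.sample_date, s.sample_status, p.analyte, p.result_value
FROM limited_samples s
JOIN results_for_pivot p ON p.sample_num = s.sample_num;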

T-SQL different JOIN approaches, same results, which one would you prefer?

These are three approaches to writing a join. I would like to hear a few words on the performance of these three queries.
Thank you
SELECT * FROM
tableA A LEFT JOIN tableB B
INNER JOIN tableC C
ON C.ColumnC = B.ColumnB
ON B.ColumnB = A.ColumnB
WHERE ColumnX = 'XY'
Versus
SELECT * FROM
tableA A LEFT JOIN tableB B
ON B.ColumnB = A.ColumnB
INNER JOIN tableC C
ON C.ColumnC = B.ColumnB
WHERE ColumnX = 'XY'
Versus Common Table Expression
WITH T...
It does not matter.
SQL Server has a cost-based optimizer (as opposed to a rule-based optimizer). That means that the engine is able to figure out that both of your first two options are identical. Run your estimated and actual execution plans and you will see that this is the case.
The only reason you would choose one option over the other is for readability's sake. I go with your second option, because it's a lot easier to read when there are a great many joins involved. ON clauses in reverse order become quite difficult to track.
In my experience, any of the above could be quicker depending on your tables.
As you're setting up joins, you want to start with the most restrictive join possible (without negatively affecting your end result, obviously). The same logic also applies to the WHERE clause, for the same reason. By starting with the most restrictive, you're limiting the number of rows that are being joined and thus evaluated by the WHERE clause and then returned/manipulated in the SELECT clause. For my answers below regarding the three specific scenarios, I'm assuming a sufficiently complicated query that is doing more than just combining data from multiple tables (i.e., queries answering specific questions).
If Table A is huge and Tables B & C are smaller and more directly related to the data you're trying to isolate, then the first option would likely be fastest.
If Table B or C are huge and Table A is more related to your desired data, the second option would likely be fastest.
As far as option 3 goes, I love CTEs but I try to only use them when I need to do so. Using a CTE will speed up your overall query if the data joined, manipulated, and returned by the CTE is only related to the rest of the query in a limited fashion. Including tables that are only partially related to your end result in your primary string of joins is going to needlessly slow down your query. If you can parse out that data into a CTE, it can run quickly by itself and then be incorporated back into the main query at the end.
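A hedged illustration of that last point, reusing the table and column names from the question where possible; tableD, region_id, amount and RegionID are hypothetical placeholders for the loosely related data.
WITH regional_totals AS (
    -- only loosely related data: aggregate it once here...
    SELECT region_id, SUM(amount) AS total_amount
    FROM tableD
    GROUP BY region_id
)
SELECT A.*, r.total_amount
FROM tableA A
JOIN tableB B ON B.ColumnB = A.ColumnB
LEFT JOIN regional_totals r ON r.region_id = A.RegionID   -- ...then bolt it onto the main joins at the end
WHERE A.ColumnX = 'XY';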

Using OR in WHERE produces unexpected results

In a query today I wanted to select rows that had a particular company name in either of two columns from two different tables. I did it like
WHERE (N.COMPANY LIKE '%ACME%' OR C.COMPANY LIKE '%ACME%')
First, it took 20 minutes to run, which was odd because if I only filtered on one of the columns it took about 10 seconds. Second, when it finished, values for some columns in some rows were NULL when I know for a fact there are values in the database for those columns in those records. So what is going on here? Why can't SQL do the OR on two columns from two tables?
(I worked around it with a UNION - I ran it for C.COMPANY LIKE '%ACME%' and UNIONED it with a SELECT... WHERE N.COMPANY LIKE '%ACME%')
It is bad practice to mix up JOIN conditions and WHERE filters in a single query, because a WHERE filter on a column from an outer-joined table effectively turns that outer join into an inner join behind the scenes. In my experience this often results in joins behaving in ways you don't want.
Try placing your LIKEs directly in the join conditions of the corresponding tables, like this:
...
JOIN N ON <Your existing relationship> AND N.COMPANY LIKE '%ACME%'
...
JOIN C ON <Your existing relationship> AND C.COMPANY LIKE '%ACME%'
...
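A hedged sketch of that shape, assuming outer joins and a hypothetical driving table base_table with key id (only N, C and COMPANY come from the question); the final WHERE keeps the either-column semantics the asker wanted, though whether this matches the original query's intent depends on the actual schema.
SELECT T.*, N.COMPANY AS n_company, C.COMPANY AS c_company
FROM base_table T
LEFT JOIN N ON N.base_id = T.id AND N.COMPANY LIKE '%ACME%'
LEFT JOIN C ON C.base_id = T.id AND C.COMPANY LIKE '%ACME%'
-- keep a row if either side produced a match
WHERE N.COMPANY IS NOT NULL OR C.COMPANY IS NOT NULL;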

Questionable performance using IF EXISTS with inner existence checks

This is in a stored procedure. If this IF statement is true, then I do a little work. @AsOfDate is a passed-in variable of date datatype. The question I have is: why do I get better performance by removing the innermost EXISTS, but ONLY when the entire statement is in an IF EXISTS?
The two tables:
dbo.TXXX_InventoryDetail -- 1.3 billion records..stats up to date
dbo.TXXX_InventoryFull -- 9.8 million records..stats up to date
Statement:
if exists (select 1
from dbo.TXXX_InventoryDetail o
where exists (select 1
from dbo.TXXX_InventoryFull i
where i.C001_AsOfDate= o.C001_AsOfDate
and i.C001_ProductID=o.C001_ProductID
and i.C001_StoreNumber=o.C001_StoreNumber
and i.C001_AsOfDate=@AsOfDate
and (i.C001_LastModelDate!=o.C001_LastModelDate
or o.C001_InventoryQty!=o.C001_InventoryQty
or i.C001_OnOrderQty!=o.C001_OnOrderQty
or i.C001_TBOQty!=o.C001_TBOQty
or i.C001_ModelQty!=o.C001_ModelQty
or i.C001_TBOAdjustQty!=o.C001_TBOAdjustQty
or i.C001_ReturnQtyPending!=o.C001_ReturnQtyPending
or i.C001_ReturnQtyInProcess!=o.C001_ReturnQtyInProcess
or i.C001_ReturnQtyDueOut!=o.C001_ReturnQtyDueOut))
and o.C001_AsOfDate=@AsOfDate)
IO output:
Table 'TXXX_InventoryFull'. Scan count 9240262, logical reads 29548864
Table 'T001_InventoryDetail'. Scan count 1, logical reads 17259
If I remove the second where exists and do a join:
if exists (select 1
from dbo.TXXX_InventoryDetail o,
dbo.TXXX_InventoryFull i
where i.C001_AsOfDate= o.C001_AsOfDate
and i.C001_ProductID=o.C001_ProductID
and i.C001_StoreNumber=o.C001_StoreNumber
and i.C001_AsOfDate=@AsOfDate
and (i.C001_LastModelDate!=o.C001_LastModelDate
or o.C001_InventoryQty!=o.C001_InventoryQty
or i.C001_OnOrderQty!=o.C001_OnOrderQty
or i.C001_TBOQty!=o.C001_TBOQty
or i.C001_ModelQty!=o.C001_ModelQty
or i.C001_TBOAdjustQty!=o.C001_TBOAdjustQty
or i.C001_ReturnQtyPending!=o.C001_ReturnQtyPending
or i.C001_ReturnQtyInProcess!=o.C001_ReturnQtyInProcess
or i.C001_ReturnQtyDueOut!=o.C001_ReturnQtyDueOut)
and o.C001_AsOfDate=@AsOfDate)
IO output:
Table 'TXXX_InventoryDetail'. Scan count 0, logical reads 333952
Table 'TXXX_InventoryFull'. Scan count 1, logical reads 630
Now, the reason I think it is the IF EXISTS is that if I remove it and do a SELECT COUNT(*) like this:
select COUNT(*)
from dbo.T001_InventoryDetail o
where exists (select 1
from dbo.TXXX_InventoryFull i
where i.C001_AsOfDate= o.C001_AsOfDate
and i.C001_ProductID=o.C001_ProductID
and i.C001_StoreNumber=o.C001_StoreNumber
and i.C001_AsOfDate=@AsOfDate
and (i.C001_LastModelDate!=o.C001_LastModelDate
or o.C001_InventoryQty!=o.C001_InventoryQty
or i.C001_OnOrderQty!=o.C001_OnOrderQty
or i.C001_TBOQty!=o.C001_TBOQty
or i.C001_ModelQty!=o.C001_ModelQty
or i.C001_TBOAdjustQty!=o.C001_TBOAdjustQty
or i.C001_ReturnQtyPending!=o.C001_ReturnQtyPending
or i.C001_ReturnQtyInProcess!=o.C001_ReturnQtyInProcess
or i.C001_ReturnQtyDueOut!=o.C001_ReturnQtyDueOut))
and o.C001_AsOfDate=@AsOfDate
Table 'TXXX_InventoryFull'. Scan count 41, logical reads 692
Table 'T001_InventoryDetail'. Scan count 65, logical reads 17477
Table 'Worktable'. Scan count 0, logical reads 0
It is generally said that one should avoid correlated subqueries in the predicate, as these tend to force nested loop joins. When querying large datasets, especially when you're trying to discover a difference between the sets, it's important to allow the query optimizer to choose dynamically between hash, merge and nested loop algorithms, which may not be possible if the query is structured using a correlated subquery. Better to create these as derived tables in the FROM clause.
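A hedged sketch of that restructuring, using the table and column names from the question; most of the quantity comparisons are trimmed for brevity, the IF body is a placeholder, and whether this actually produces a better plan would need to be checked against the real data.
if exists (select 1
           from dbo.TXXX_InventoryDetail o
           join (select C001_AsOfDate, C001_ProductID, C001_StoreNumber,
                        C001_LastModelDate, C001_OnOrderQty
                 from dbo.TXXX_InventoryFull
                 where C001_AsOfDate = @AsOfDate) i   -- derived table in the FROM clause
             on i.C001_AsOfDate = o.C001_AsOfDate
            and i.C001_ProductID = o.C001_ProductID
            and i.C001_StoreNumber = o.C001_StoreNumber
           where o.C001_AsOfDate = @AsOfDate
             and (i.C001_LastModelDate != o.C001_LastModelDate
                  or i.C001_OnOrderQty != o.C001_OnOrderQty))
    print 'differences found'   -- placeholder for the "little work" done in the procedure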
I have found similar issues using the EXISTS statement on a SQL Server 2008 R2 server, where the exact same statement runs fine on SQL Server 2008 and SQL Server 2005.
I found that something like
WHILE EXISTS(SELECT * FROM X)
would be super slow, but
WHILE ISNULL((SELECT TOP 1 ID FROM X), 0) <> 0
runs perfectly fast again.
To me, it seems like an R2 issue...
I would guess that the plan you get is quite different when you use the join. Perhaps the imbalance in the number of rows (very large outer table, smaller inner table) is giving the optimizer fits, but it can probably eliminate rows much more easily with the join (you'll probably see additional loop operators in the worse query). It's tough to guess without seeing the plans or being able to reproduce the issue, but you should always aim to eliminate as many rows as possible as early in the plan as possible. Pulling millions of rows through several operators/subqueries only to eliminate most of them later in the plan is almost certainly going to yield worse performance.