Fully matching sets of records of two many-to-many tables - tsql

I have Users, Positions and Licenses.
Relations are:
users may have many licenses
positions may require many licenses
So I can easily get license requirements per position(s) as well as effective licenses per user(s).
But I wonder what would be the best way to match the two sets? The logic is that a user needs at least the licenses required by a given position. The user may hold more, but the extra ones are not relevant.
I would like to get results with users and eligible positions.
PersonID PositionID
1 1 -> user 1 is eligible to work on position 1
1 2 -> user 1 is eligible to work on position 2
2 1 -> user 2 is eligible to work on position 1
3 2 -> user 3 is eligible to work on position 2
4 ...
As you can see I need a result for all users, not a single one per call, which would make things much much easier.
There are actually 5 tables here:
create table Person ( PersonID, ...)
create table Position (PositionID, ...)
create table License (LicenseID, ...)
and relations
create table PersonLicense (PersonID, LicenseID, ...)
create table PositionLicense (PositionID, LicenseID, ...)
So basically I need to find positions that a particular person is licensed to work on. There's of course a much more complex problem here, because there are other factors, but the main objective is the same:
How do I match multiple records of one relational table to multiple records of the other? This could also be described as an inner join per set of records rather than per single record, as is usually done in TSQL.
I'm thinking of TSQL language constructs:
rowsets but I've never used them before and don't know how to use them anyway
intersect statements maybe although these probably only work over whole sets and not groups

Final solution (for future reference)
In the meantime while you fellow developers answered my question, this is something I came up with and uses CTEs and partitioning which can of course be used on SQL Server 2008 R2. I've never used result partitioning before so I had to learn something new (which is a plus altogether). Here's the code:
with CTEPositionLicense as (
select
PositionID,
LicenseID,
checksum_agg(LicenseID) over (partition by PositionID) as RequiredHash
from PositionLicense
)
select per.PersonID, pos.PositionID
from CTEPositionLicense pos
join PersonLicense per
on (per.LicenseID = pos.LicenseID)
group by pos.PositionID, pos.RequiredHash, per.PersonID
having pos.RequiredHash = checksum_agg(per.LicenseID)
order by per.PersonID, pos.PositionID;
So I made a comparison between these three techniques that I named as:
Cross join (by Andriy M)
Table variable (by Petar Ivanov)
Checksum - this one here (by Robert Koritnik, me)
Mine already orders results per person and position, so I added the same ordering to the other two so that all three return identical results.
Resulting estimated execution plan
Checksum: 7%
Table variable: 2% (table creation) + 9% (execution) = 11%
Cross join: 82%
I also changed Table variable version into a CTE version (instead of table variable a CTE was used) and removed order by at the end and compared their estimated execution plans. Just for reference CTE version 43% while original version had 53% (10% + 43%).

One way to write this efficiently is to join PositionLicense with PersonLicense on LicenseID. Then count the matched rows grouped by position and person, and compare that with the count of all licenses for the position. If the two are equal, then that person qualifies:
-- Table variables are declared with @, not #
DECLARE @tmp TABLE (PositionId INT, LicenseCount INT);
-- Number of licenses each position requires
INSERT INTO @tmp
SELECT PositionId,
COUNT(1) AS LicenseCount
FROM PositionLicense
GROUP BY PositionId;
-- A person qualifies when the matched count equals the required count
SELECT per.PersonID, pos.PositionId
FROM PositionLicense AS pos
INNER JOIN PersonLicense AS per ON (pos.LicenseId = per.LicenseId)
GROUP BY pos.PositionId, per.PersonID
HAVING COUNT(1) = (
SELECT LicenseCount FROM @tmp WHERE PositionId = pos.PositionId
);
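To try the count-matching idea without a SQL Server instance, here is a minimal sketch that runs the same logic against an in-memory SQLite database from Python. The sample rows are invented for illustration, a temporary table (RequiredCount, a hypothetical name) stands in for the table variable, and the sketch assumes each (PersonID, LicenseID) and (PositionID, LicenseID) pair is unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PersonLicense   (PersonID INT, LicenseID INT, PRIMARY KEY (PersonID, LicenseID));
CREATE TABLE PositionLicense (PositionID INT, LicenseID INT, PRIMARY KEY (PositionID, LicenseID));

-- Hypothetical sample data: position 1 needs licenses 10 and 20, position 2 needs 30.
INSERT INTO PositionLicense VALUES (1, 10), (1, 20), (2, 30);
INSERT INTO PersonLicense VALUES
    (1, 10), (1, 20), (1, 30),
    (2, 10), (2, 20),
    (3, 30), (3, 40),
    (4, 10);

-- Stand-in for the table variable: required license count per position.
CREATE TEMP TABLE RequiredCount AS
SELECT PositionID, COUNT(*) AS LicenseCount
FROM PositionLicense
GROUP BY PositionID;
""")

# Keep a (person, position) pair only when every required license was matched.
eligible = conn.execute("""
SELECT per.PersonID, pos.PositionID
FROM PositionLicense AS pos
JOIN PersonLicense AS per ON pos.LicenseID = per.LicenseID
GROUP BY pos.PositionID, per.PersonID
HAVING COUNT(*) = (SELECT rc.LicenseCount
                   FROM RequiredCount AS rc
                   WHERE rc.PositionID = pos.PositionID)
ORDER BY per.PersonID, pos.PositionID
""").fetchall()

print(eligible)
```

Person 4 holds only one of the two licenses position 1 requires, so that pair is filtered out, and person 3's extra license 40 is simply ignored by the join.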

I would approach the problem like this:
Get all the (distinct) users from PersonLicense.
Cross join them with PositionLicense.
Left join the resulting set with PersonLicense using PersonID and LicenseID.
Group the results by PersonID and PositionID.
Filter out those (PersonID, PositionID) pairs where the number of licenses in PositionLicense does not match the number of those in PersonLicense.
And here's my implementation:
SELECT
u.PersonID,
pl.PositionID
FROM (SELECT DISTINCT PersonID FROM PersonLicense) u
CROSS JOIN PositionLicense pl
LEFT JOIN PersonLicense ul ON u.PersonID = ul.PersonID
AND pl.LicenseID = ul.LicenseID
GROUP BY
u.PersonID,
pl.PositionID
HAVING COUNT(pl.LicenseID) = COUNT(ul.LicenseID)
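The counts compared in the HAVING clause are easier to see when they are selected explicitly. The following is a small sketch against in-memory SQLite from Python, with invented sample data; it returns the required and held counts for every (person, position) pair and then applies the same equality filter in Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PersonLicense   (PersonID INT, LicenseID INT, PRIMARY KEY (PersonID, LicenseID));
CREATE TABLE PositionLicense (PositionID INT, LicenseID INT, PRIMARY KEY (PositionID, LicenseID));

-- Hypothetical sample data: position 1 needs licenses 10 and 20, position 2 needs 30.
INSERT INTO PositionLicense VALUES (1, 10), (1, 20), (2, 30);
INSERT INTO PersonLicense VALUES
    (1, 10), (1, 20), (1, 30),
    (2, 10), (2, 20),
    (3, 30), (3, 40),
    (4, 10);
""")

# COUNT(ul.LicenseID) skips the NULLs produced by the LEFT JOIN,
# so it counts only the required licenses the person actually holds.
rows = conn.execute("""
SELECT u.PersonID, pl.PositionID,
       COUNT(pl.LicenseID) AS required,
       COUNT(ul.LicenseID) AS held
FROM (SELECT DISTINCT PersonID FROM PersonLicense) AS u
CROSS JOIN PositionLicense AS pl
LEFT JOIN PersonLicense AS ul
       ON u.PersonID = ul.PersonID AND pl.LicenseID = ul.LicenseID
GROUP BY u.PersonID, pl.PositionID
ORDER BY u.PersonID, pl.PositionID
""").fetchall()

counts = {(person, position): (required, held)
          for person, position, required, held in rows}

# Same filter as HAVING COUNT(pl.LicenseID) = COUNT(ul.LicenseID).
eligible = [pair for pair, (required, held) in counts.items() if required == held]
print(eligible)
```

Unlike the inner-join version, this one produces a row for every pair, including pairs with no license in common, which is one plausible reason for its higher estimated cost.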


Postgresql finding max transaction_id for each type giving duplicates (when it's not supposed to for PK)

As the title says, I have the code shown below to find the ID with the highest amount transacted for each type of card.
SELECT tr.identifier, cc.type, tr.amount as max_amount
FROM credit_cards cc, transactions tr
WHERE (tr.amount, cc.type) IN (SELECT MAX(tr.amount), cc.type
FROM credit_cards cc, transactions tr
WHERE cc.number = tr.number
GROUP BY cc.type)
GROUP BY tr.identifier, cc.type;
When I run the code, I get duplicate transaction_identifier which shouldn't happen since it's the PK of the transactions table; output when I run above code is shown below
ID --------Card type--------------- Max amount
2196 "diners-club-carte-blanche" 1000.62
2196 "visa" 1000.62
11141 "mastercard" 1000.54
2378 "mastercard" 1000.54
e.g. 2196 in above exists for diners carte-blanche not visa;
'mastercard' is correct since 2 different IDs can have same max transaction.
However, the query still has to allow for ties, because two different IDs can have the same max amount for a given type.
Does anyone know how to prevent the duplicates from occurring?
Is this due to the WHERE ... IN clause matching either the max amount or the card type? (The duplicated types are Visa and Diners-Carte-Blanche, which both have the same max value of 1000.62, so I think that's where the matching goes wrong.)
TL/DR: add WHERE cc.number = tr.number to the outer query.
Long version
When you query FROM table_1, table_2 in the outer query and don't connect the tables (via a join or where clause) the result is a cartesian product, meaning EVERY row from table_1 is joined to EVERY row from table_2. This is the same as a CROSS JOIN.
So while your inner query has a where clause and (correctly) returns the max for each credit card type... your outer query does not, and so all possible combinations of credit card and transaction are being compared to the maximums, not just the valid ones.
For example, if cc has three rows (mastercard, visa, amex) and tr has three rows (1,2,3), selecting "from cc, tr" results in nine rows:
mastercard,1
mastercard,2
mastercard,3
visa,1
visa,2
visa,3
amex,1
amex,2
amex,3
where what you want is:
mastercard,1
visa,3
amex,2
Each row in the first table will be repeated for each row in the second. Then the WHERE (...) IN (...) restricts this set of rows to only those that match a row in the inner query. As you can imagine, this can easily lead to duplicate results. Some of those duplicates are being removed by the outer GROUP BY, which should not be necessary once this issue is fixed.
As a general rule, I never use the comma syntax FROM [table_1], [table_2] and prefer to ALWAYS be explicit about doing an inner or outer join (or, in some situations, a cross join) to help avoid this kind of issue and make it clearer to the reader.
SELECT tr.identifier, cc.type, tr.amount as max_amount
FROM credit_cards cc INNER JOIN transactions tr ON (cc.number = tr.number)
WHERE (tr.amount, cc.type) IN (
SELECT MAX(tr.amount), cc.type
FROM credit_cards cc
INNER JOIN transactions tr ON (cc.number = tr.number)
GROUP BY cc.type
)
NOTE: In the case of a tie, this will give you every transaction for each credit card type that is tied for the maximum amount.
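The effect of the missing join condition can be reproduced on a tiny data set. This is a sketch using Python's sqlite3 module rather than PostgreSQL, with made-up rows in which visa and diners share the same maximum amount; it assumes a SQLite build with row-value support (3.15 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE credit_cards (number INTEGER PRIMARY KEY, type TEXT);
CREATE TABLE transactions (identifier INTEGER PRIMARY KEY, number INTEGER, amount INTEGER);

-- Hypothetical data: visa and diners both peak at 100.
INSERT INTO credit_cards VALUES (1, 'visa'), (2, 'diners'), (3, 'mastercard');
INSERT INTO transactions VALUES (10, 1, 100), (11, 2, 100), (12, 3, 50), (13, 1, 20);
""")

# Inner query shared by both variants: the max amount per card type.
MAX_PER_TYPE = """
    SELECT MAX(tr.amount), cc.type
    FROM credit_cards cc
    INNER JOIN transactions tr ON cc.number = tr.number
    GROUP BY cc.type
"""

# Outer query without a join condition: every card is paired with every transaction.
comma_join = conn.execute(f"""
    SELECT tr.identifier, cc.type, tr.amount
    FROM credit_cards cc, transactions tr
    WHERE (tr.amount, cc.type) IN ({MAX_PER_TYPE})
    ORDER BY tr.identifier, cc.type
""").fetchall()

# Outer query with an explicit join: only real card/transaction pairs are checked.
inner_join = conn.execute(f"""
    SELECT tr.identifier, cc.type, tr.amount
    FROM credit_cards cc
    INNER JOIN transactions tr ON cc.number = tr.number
    WHERE (tr.amount, cc.type) IN ({MAX_PER_TYPE})
    ORDER BY tr.identifier, cc.type
""").fetchall()

print(comma_join)
print(inner_join)
```

With the comma join, transaction 10 (a visa transaction) also shows up under diners and transaction 11 under visa, which mirrors the duplicate rows in the question.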

Loop a result set and feed two tables

I have a select query that returns a huge result set (500k records). But for this example let's say it has only two records:
SELECT * FROM INVENTORY I
INNER JOIN PARTS P
ON I.partcode = P.partcode
ORDER BY I.partcode
The result will look more or less like this:
pk partcode genericname partname stock
1 001 mouse logitech 10
2 002 keyboard genius 8
I have to loop the result above and feed two tables (product and variant).
I first have to insert two of the columns into 'product' table, like this:
INSERT INTO PRODUCT
(p_code,product_name) values (partcode,genericname)
pk p_code product_name
5 001 mouse
6 002 keyboard
Then I have to grab the pk that was automatically generated into the table above (say ppk) and then insert it together with the other two columns into the 'variant' table, like this:
INSERT INTO VARIANT
(product_pk,variant_name,in_stock) values (ppk,partname,stock)
pk product_pk variant_name in_stock
10 5 logitech 10
11 6 genius 8
At the end I should have the product and the variant tables with 2 records each.
I could write VB code to do that, but I think it can be done in pure SQL, and I am just not sure of the best approach.
Could someone give me some help with this?
Thank you!
You could use a SQL cursor to loop through and insert a row at a time into PRODUCT and then use SCOPE_IDENTITY() to get the newly assigned identity value to insert a corresponding row into VARIANT, but best practice is to avoid cursors if there's another way. (There usually is, but not always.)
If the partcode/genericname combination will uniquely identify 1 record in PRODUCT, you could do this:
INSERT INTO PRODUCT (p_code,product_name)
SELECT I.partcode, genericname
FROM INVENTORY I INNER JOIN PARTS P ON I.partcode = P.partcode
(I would eliminate the ORDER BY from your query unless you care about the order the identity values are assigned.)
Then, run this:
INSERT INTO VARIANT
(product_pk,variant_name,in_stock)
SELECT pr.pk, i.partname, i.stock
FROM inventory i INNER JOIN parts p ON i.partcode = p.partcode
INNER JOIN product pr on i.partcode = pr.p_code and i.genericname = pr.product_name
You may have to clean up the aliases between i and p in the 2nd query. I can't tell which table (inventory or parts) the partname and stock columns (the sources for variant_name and in_stock) come from, so I just used i.
Again - this assumes that partcode/genericname combination is unique in the PRODUCT table.
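Here is a minimal end-to-end sketch of the two set-based inserts, run against in-memory SQLite from Python. The column placement is an assumption made for illustration only: genericname is taken to live in parts, and partname and stock in inventory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed layout: genericname lives in parts; partname and stock live in inventory.
CREATE TABLE parts     (partcode TEXT PRIMARY KEY, genericname TEXT);
CREATE TABLE inventory (pk INTEGER PRIMARY KEY, partcode TEXT, partname TEXT, stock INTEGER);
CREATE TABLE product   (pk INTEGER PRIMARY KEY AUTOINCREMENT, p_code TEXT, product_name TEXT);
CREATE TABLE variant   (pk INTEGER PRIMARY KEY AUTOINCREMENT, product_pk INTEGER,
                        variant_name TEXT, in_stock INTEGER);

INSERT INTO parts VALUES ('001', 'mouse'), ('002', 'keyboard');
INSERT INTO inventory VALUES (1, '001', 'logitech', 10), (2, '002', 'genius', 8);

-- Step 1: create the products in one statement; the keys are generated automatically.
INSERT INTO product (p_code, product_name)
SELECT i.partcode, p.genericname
FROM inventory i INNER JOIN parts p ON i.partcode = p.partcode;

-- Step 2: look the generated keys up again through the (p_code, product_name) pair.
INSERT INTO variant (product_pk, variant_name, in_stock)
SELECT pr.pk, i.partname, i.stock
FROM inventory i
INNER JOIN parts p    ON i.partcode = p.partcode
INNER JOIN product pr ON pr.p_code = i.partcode AND pr.product_name = p.genericname;
""")

rows = conn.execute("""
SELECT pr.p_code, pr.product_name, v.variant_name, v.in_stock
FROM variant v JOIN product pr ON pr.pk = v.product_pk
ORDER BY pr.p_code
""").fetchall()
print(rows)
```

If the (p_code, product_name) pair were not unique in product, step 2 would attach a variant to more than one product row, which is why the uniqueness assumption matters.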

Order picking in warehouse

In implementing the warehouse management system for an ecommerce store, I'm trying to create a picking list for warehouse workers, who will walk around a warehouse picking products in orders from different shelves.
One type of product can be on different shelves, and on each shelf there can be many of the same type of product.
If there are many of the same product in one order, sometimes the picker has to pick from multiple shelves to get all the items in an order.
To further make things trickier, sometimes the product will run out of stock as well.
My data model looks like this (simplified):
CREATE TABLE order_product (
id SERIAL PRIMARY KEY,
product_id integer,
order_id text
);
INSERT INTO "public"."order_product"("id","product_id","order_id")
VALUES
(1,1,'order1'),
(2,1,'order1'),
(3,1,'order1'),
(4,2,'order1'),
(5,2,'order2'),
(6,2,'order2');
CREATE TABLE warehouse_placement (
id SERIAL PRIMARY KEY,
product_id integer,
shelf text,
quantity integer
);
INSERT INTO "public"."warehouse_placement"("id","product_id","shelf","quantity")
VALUES
(1,1,E'A',2),
(2,2,E'B',2),
(3,1,E'C',2);
Is it possible, in postgres, to generate a picking list of instructions like the following:
order_id product_id shelf quantity_left_on_shelf
order1 1 A 1
order1 1 A 0
order1 2 B 1
order1 1 C 1
order2 2 B 0
order2 2 NONE null
I currently do this in the application code, but that feels quite clunky and I suspect there should be a way to do this directly in SQL.
Thanks for any help!
Here we go:
WITH product_on_shelf AS (
SELECT warehouse_placement.*,
generate_series(1, quantity) AS order_on_shelf,
quantity - generate_series(1, quantity) AS quantity_left_on_shelf
FROM warehouse_placement
)
, product_on_shelf_with_product_order AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY product_id
ORDER BY quantity, shelf, order_on_shelf
) AS order_among_product
FROM product_on_shelf
)
, order_product_with_order_among_product AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY product_id
ORDER BY id
) AS order_among_product
FROM order_product
)
SELECT order_product_with_order_among_product.id,
order_product_with_order_among_product.order_id,
order_product_with_order_among_product.product_id,
product_on_shelf_with_product_order.shelf,
product_on_shelf_with_product_order.quantity_left_on_shelf
FROM order_product_with_order_among_product
LEFT JOIN product_on_shelf_with_product_order
ON order_product_with_order_among_product.product_id = product_on_shelf_with_product_order.product_id
AND order_product_with_order_among_product.order_among_product = product_on_shelf_with_product_order.order_among_product
ORDER BY order_product_with_order_among_product.id
;
Here's the idea:
We create a CTE product_on_shelf, which is the same as warehouse_placement, except the rows are duplicated n times, n being the quantity of the product on the shelf.
We assign a number order_among_product to each row in product_on_shelf, so that each object on shelf knows its order among the same products.
We assign a symmetric number order_among_product to each row in order_product.
For each row in order_product, we try to find the product on shelf with the same order_among_product. If we can't find any, it means we've exhausted the products on any shelf.
Side note #1: Picking products off shelves is a concurrent action. You should make sure, either on the application side or on the DB side via smart locks, that any product on a shelf can be attributed to one single order. Processing each row of order_product on the application side might be the best option for dealing with concurrency.
Side note #2: I've written this query using CTEs for clarity. To boost performance, consider using subqueries instead. Make sure to run EXPLAIN ANALYZE.
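The matching idea behind the query (expand each shelf into one slot per unit, number the slots per product, then pair the n-th ordered unit with the n-th slot) can also be sketched in plain Python using the sample rows from the question. This is only an illustration of the logic, not a replacement for the SQL, and picking_list is a hypothetical helper name.

```python
from collections import defaultdict

# Sample rows from the question: (id, product_id, order_id) and (id, product_id, shelf, quantity).
order_product = [
    (1, 1, "order1"), (2, 1, "order1"), (3, 1, "order1"),
    (4, 2, "order1"), (5, 2, "order2"), (6, 2, "order2"),
]
warehouse_placement = [(1, 1, "A", 2), (2, 2, "B", 2), (3, 1, "C", 2)]


def picking_list(order_rows, placements):
    """Pair the n-th ordered unit of a product with the n-th unit slot on the shelves."""
    # One slot per physical unit, in the same order as the query: quantity, then shelf.
    slots = defaultdict(list)
    for _id, product_id, shelf, quantity in sorted(placements, key=lambda p: (p[3], p[2])):
        for taken in range(1, quantity + 1):
            slots[product_id].append((shelf, quantity - taken))

    next_slot = defaultdict(int)  # how many units of each product are already assigned
    result = []
    for _id, product_id, order_id in sorted(order_rows):
        n = next_slot[product_id]
        next_slot[product_id] += 1
        available = slots[product_id]
        # No slot left means the product is out of stock (the LEFT JOIN's NULL row).
        shelf, left = available[n] if n < len(available) else (None, None)
        result.append((order_id, product_id, shelf, left))
    return result


for row in picking_list(order_product, warehouse_placement):
    print(row)
```

The rows come out ordered by order_product id, like the query's final ORDER BY, so the shelf C pick appears before the shelf B pick for order1.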

number of points within a radius of another set of points

I have two tables. One is a list of stores (with lat/long). The other is a list of customer addresses (with lat/long). What I want is a query that will return the number of customers within a certain radius for each store in my table. This gives me the total number of customers within 10,000 meters of ANY store, but I'm not sure how to loop it to return one row for each store with a count.
Note that I'm running these queries using cartoDB, where the_geom is basically long/lat.
SELECT COUNT(*) as customer_count FROM customer_table
WHERE EXISTS(
SELECT 1 FROM store_table
WHERE ST_Distance_Sphere(store_table.the_geom, customer_table.the_geom) < 10000
)
This results in a single row :
customer_count
4009
Suggestions on how to make this work against my problem? I'm open to doing this other ways that might be more efficient (faster).
For reference, the column with store names is store_table.store_identifier.
I'll assume that you use the_geom to represent the coordinate (lat/lon) of store and customer. I will also assume that the_geom is of geography type. Your query will be something like this
select s.id, count(*) as customer_count
from customers c
inner join stores s
on st_dwithin(c.the_geom, s.the_geom, 10000)
group by s.id
This should give you a neat table with a store id and the count of customers within 10,000 meters of the store.
If the_geom is of type geometry, your query will be very similar, but you should use st_distance_sphere() instead in order to express the distance in meters (not degrees).
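To sanity-check the per-store counts outside the database, the same grouping can be sketched in plain Python with a haversine distance. This is an approximation on a spherical Earth, not what PostGIS computes internally, and the coordinates and store names below are invented.

```python
from collections import Counter
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in meters between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def customers_per_store(stores, customers, radius_m=10_000):
    """Count customers within radius_m of each store (one counter entry per store with hits)."""
    counts = Counter()
    for store_id, s_lat, s_lon in stores:
        for c_lat, c_lon in customers:
            if distance_m(s_lat, s_lon, c_lat, c_lon) <= radius_m:
                counts[store_id] += 1
    return counts


# Hypothetical stores (id, lat, lon) and customers (lat, lon).
stores = [("north", 0.0, 0.0), ("south", -1.0, 0.0), ("far", 45.0, 45.0)]
customers = [(0.0, 0.05), (0.0, 0.2), (-1.0, 0.01), (-1.0, -0.02)]

counts = customers_per_store(stores, customers)
print(dict(counts))
```

As with the inner join in the query, a store with no customers in range (here "far") gets no entry at all; a left join from stores would be needed to report it with a count of zero.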

Why is performance of CTE worse than temporary table in this example

I recently asked a question regarding CTEs and using data with no true root records (i.e. instead of the root record having a NULL parent_Id, it is parented to itself).
The question link is here; Creating a recursive CTE with no rootrecord
The answer has been provided to that question and I now have the data I require however I am interested in the difference between the two approaches that I THINK are available to me.
The approach that yielded the data I required was to create a temp table with cleaned-up parenting data and then run a recursive CTE against it. It looked like this:
Select CASE
WHEN Parent_Id = Party_Id THEN NULL
ELSE Parent_Id
END AS Act_Parent_Id
, Party_Id
, PARTY_CODE
, PARTY_NAME
INTO #Parties
FROM DIMENSION_PARTIES
WHERE CURRENT_RECORD = 1;
WITH linkedParties
AS
(
Select Act_Parent_Id, Party_Id, PARTY_CODE, PARTY_NAME, 0 AS LEVEL
FROM #Parties
WHERE Act_Parent_Id IS NULL
UNION ALL
Select p.Act_Parent_Id, p.Party_Id, p.PARTY_CODE, p.PARTY_NAME, Level + 1
FROM #Parties p
inner join
linkedParties t on p.Act_Parent_Id = t.Party_Id
)
Select *
FROM linkedParties
Order By Level
I also attempted to retrieve the same data by defining two CTEs: one to emulate the creation of the temp table above, and the other to do the same recursive work but referencing the initial CTE rather than a temp table:
WITH Parties
AS
(Select CASE
WHEN Parent_Id = Party_Id THEN NULL
ELSE Parent_Id
END AS Act_Parent_Id
, Party_Id
, PARTY_CODE
, PARTY_NAME
FROM DIMENSION_PARTIES
WHERE CURRENT_RECORD = 1),
linkedParties
AS
(
Select Act_Parent_Id, Party_Id, PARTY_CODE, PARTY_NAME, 0 AS LEVEL
FROM Parties
WHERE Act_Parent_Id IS NULL
UNION ALL
Select p.Act_Parent_Id, p.Party_Id, p.PARTY_CODE, p.PARTY_NAME, Level + 1
FROM Parties p
inner join
linkedParties t on p.Act_Parent_Id = t.Party_Id
)
Select *
FROM linkedParties
Order By Level
Now these two scripts are run on the same server however the temp table approach yields the results in approximately 15 seconds.
The multiple CTE approach takes upwards of 5 minutes (so long in fact that I have never waited for the results to return).
Is there a reason why the temp table approach would be so much quicker?
For what it is worth I believe it is to do with the record counts. The base table has 200k records in it and from memory CTE performance is severely degraded when dealing with large data sets but I cannot seem to prove that so thought I'd check with the experts.
Many Thanks
Well as there appears to be no clear answer for this some further research into the generics of the subject threw up a number of other threads with similar problems.
This one seems to cover many of the variations between temp table and CTEs so is most useful for people looking to read around their issues;
Which are more performant, CTE or temporary tables?
In my case it would appear that the large amount of data in my CTEs caused the problem: a CTE's result is not cached anywhere, so recreating it each time it is referenced later has a large impact.
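For readers who want to experiment with the temp-table pattern from the question, here is a minimal sketch in Python with in-memory SQLite. It only demonstrates the shape of the approach (materialise the cleaned-up parent column once, then recurse over the stored rows); it says nothing about SQL Server's timings, and the four sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DIMENSION_PARTIES (Party_Id INT, Parent_Id INT, PARTY_NAME TEXT, CURRENT_RECORD INT);

-- Hypothetical rows: party 1 is its own parent (the self-parented root), party 4 is not current.
INSERT INTO DIMENSION_PARTIES VALUES
    (1, 1, 'root', 1),
    (2, 1, 'child', 1),
    (3, 2, 'grandchild', 1),
    (4, 1, 'superseded', 0);

-- Materialise the cleaned-up parenting data once, as the temp table approach does.
CREATE TEMP TABLE Parties AS
SELECT CASE WHEN Parent_Id = Party_Id THEN NULL ELSE Parent_Id END AS Act_Parent_Id,
       Party_Id,
       PARTY_NAME
FROM DIMENSION_PARTIES
WHERE CURRENT_RECORD = 1;
""")

# The recursive part reads the stored rows instead of re-deriving them at each step.
rows = conn.execute("""
WITH RECURSIVE linkedParties (Party_Id, PARTY_NAME, Lvl) AS (
    SELECT Party_Id, PARTY_NAME, 0
    FROM Parties
    WHERE Act_Parent_Id IS NULL
    UNION ALL
    SELECT p.Party_Id, p.PARTY_NAME, t.Lvl + 1
    FROM Parties p
    INNER JOIN linkedParties t ON p.Act_Parent_Id = t.Party_Id
)
SELECT Party_Id, PARTY_NAME, Lvl
FROM linkedParties
ORDER BY Lvl
""").fetchall()
print(rows)
```

Note that SQLite requires the RECURSIVE keyword here, whereas SQL Server uses a plain WITH for recursive CTEs.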
This might not be exactly the same issue you experienced, but a few days ago I came across a similar one, and the queries did not even process that many records (a few thousand).
And yesterday my colleague had a similar problem.
Just to be clear we are using SQL Server 2008 R2.
The pattern that I identified and seems to throw the sql server optimizer off the rails is using temporary tables in CTEs that are joined with other temporary tables in the main select statement.
In my case I ended up creating an extra temporary table.
Here is a sample.
I ended up doing this:
SELECT DISTINCT st.field1, st.field2
into #Temp1
FROM SomeTable st
WHERE st.field3 <> 0
select x.field1, x.field2
FROM #Temp1 x inner join #Temp2 o
on x.field1 = o.field1
order by 1, 2
I tried the following query but it was a lot slower, if you can believe it.
with temp1 as (
SELECT DISTINCT st.field1, st.field2
FROM SomeTable st
WHERE st.field3 <> 0
)
select x.field1, x.field2
FROM temp1 x inner join #Temp2 o
on x.field1 = o.field1
order by 1, 2
I also tried to inline the first query in the second one and the performance was the same, i.e. VERY BAD.
SQL Server never ceases to amaze me. Once in a while I come across issues like this one that remind me it is a Microsoft product after all, but in the end other database systems have their own quirks too.