Postgresql finding max transaction_id for each type giving duplicates (when it's not supposed to for PK) - postgresql

Question as title; So I have a code as shown below to find the ID with highest amount transacted by type of card
SELECT tr.identifier, cc.type, tr.amount as max_amount
FROM credit_cards cc, transactions tr
WHERE (tr.amount, cc.type) IN (SELECT MAX(tr.amount), cc.type
FROM credit_cards cc, transactions tr
WHERE cc.number = tr.number
GROUP BY cc.type)
GROUP BY tr.identifier, cc.type;
When I run the code, I get duplicate transaction_identifier which shouldn't happen since it's the PK of the transactions table; output when I run above code is shown below
ID --------Card type--------------- Max amount
2196 "diners-club-carte-blanche" 1000.62
2196 "visa" 1000.62
11141 "mastercard" 1000.54
2378 "mastercard" 1000.54
e.g. 2196 in above exists for diners carte-blanche not visa;
'mastercard' is correct since 2 different IDs can have same max transaction.
However, this code should run because it is possible for 2 different id to have the same max amount for each type.
Does anyone know how to prevent the duplicates from occurring?
is this due to the WHERE ... IN clause which matches either the max amount or the card type? (the ones with duplicate is Visa and Diners-Carte-Blanche which both have same max value of 1000.62 so I think that's where they're matching wrong)

TL/DR: add WHERE cc.number = tr.number to the outer query.
Long version
When you query FROM table_1, table_2 in the outer query and don't connect the tables (via a join or where clause) the result is a cartesian product, meaning EVERY row from table_1 is joined to EVERY row from table_2. This is the same as a CROSS JOIN.
So while your inner query has a where clause and (correctly) returns the max for each credit card type... your outer query does not, and so all possible combinations of credit card and transaction are being compared to the maximums, not just the valid ones.
For example, if cc has rows three rows (mastercard, visa, amex) and tr has three rows (1,2,3) selecting "from cc, tr" is resulting in nine rows:
mastercard,1
mastercard,2
mastercard,3
visa,1
visa,2
visa,3
amex,1
amex,2
amex,3
where what you want is:
mastercard,1
visa,3
amex,2
Each row in the first table will be repeated for each row in the second. Then the WHERE (...) IN (...) restrict this set of rows to only those that match a row in the inner query. As you can imagine, this can easily lead to duplicate results. Some of those duplicates are being removed by the outer GROUP BY, which should not be necessary once this issue is fixed.
As a general rule, I never use join [table_1], [table_2] and prefer to ALWAYS be explicit about doing an inner or outer join (or, in some situations, a cross join) to help avoid this kind of issue and make it clearer to the reader.
SELECT tr.identifier, cc.type, tr.amount as max_amount
FROM credit_cards cc INNER JOIN transactions tr ON (cc.number = tr.number)
WHERE (tr.amount, cc.type) IN (
SELECT MAX(tr.amount), cc.type
FROM credit_cards cc
INNER JOIN transactions tr ON (cc.number = tr.number)
GROUP BY cc.type
)
NOTE: In the case of a tie, this will give you every transaction for each credit card type that is tied for the maximum amount.

Related

INNER JOIN, LEFT/RIGHT OUTER JOIN

Apology in advance for a long question, but doing this just for the sake of learning:
i'm new to SQL and researching on JOIN for now. I'm getting two different behaviors when using INNER and OUTER JOIN. What I know is, INNER JOIN gives an intersection kind of result while returning only common rows among tables, and (LEFT/RIGHT) OUTER JOIN is outputting what is common and remaining rows in LEFT or RIGHT tables, depending upon LEFT/RIGHT clause respectively.
While working with MS Training Kit and trying to solve this practice: "Practice 2: In this practice, you identify rows that appear in one table but have no matches in another. You are given a task to return the IDs of employees from the HR.Employees table who did not handle orders (in the Sales.Orders table) on February 12, 2008. Write three different solutions using the following: joins, subqueries, and set
operators. To verify the validity of your solution, you are supposed to return employeeIDs: 1, 2, 3, 5, 7, and 9."
I'm successful doing this with subqueries and set operators but with JOIN is returning something not expected. I've written the following query:
USE TSQL2012;
SELECT
E.empid
FROM
HR.Employees AS H
JOIN Sales.Orders AS O
ON H.empid = O.empid
AND O.orderdate = '20080212'
JOIN HR.Employees AS E
ON E.empid <> H.empid
ORDER BY
E.empid
;
I'm expecting results as: 1, 2, 3, 5, 7, and 9 (6 rows)
But what i'm getting is: 1,1,1,2,2,2,3,3,3,4,4,5,5,5,6,6,7,7,7,8,8,9,9,9 (24 rows)
I tried some videos but could not understand this side of INNER/OUTER JOIN. I'll be grateful if someone could help this side of JOIN, why is it so and what should I try to understand while working with JOIN.
you can also use left outer join to get not matching
*** The LEFT JOIN keyword returns all rows from the left table (table1), with the matching rows in the right table (table2). The result is NULL in the right side when there is no match.
SELECT
H.empid
FROM
HR.Employees AS H
LEFT OUTER JOIN Sales.Orders AS O
ON H.empid = O.empid
AND O.orderdate = '20080212'
WHERE O.empid IS NULL
Above script will return emp id who did not handle orders on specify date
here you can see all kind of join
Diagram taken from: http://dsin.wordpress.com/2013/03/16/sql-join-cheat-sheet/
adjust your query to be like this
USE TSQL2012;
SELECT
E.empid
FROM
HR.Employees AS H
JOIN Sales.Orders AS O
ON H.empid = O.empid
where O.orderdate = '2008-02-12' AND O.empid IN null
ORDER BY
E.empid
;
USE TSQL2012;
SELECT
distinct E.empid
FROM
HR.Employees AS H
JOIN Sales.Orders AS O
ON H.empid = O.empid
AND O.orderdate = '20080212'
JOIN HR.Employees AS E
ON E.empid <> H.empid
ORDER BY
E.empid
;
Primary things to always remind yourself when working with SQL JOINs:
INNER JOINs require a match in the join in order for result set rows produced prior to the INNER JOIN to remain in the result set. When no match is found for a row, the row is discarded from the result set.
For a row fed to an INNER JOIN that matches to ONLY one row, only one copy of that row fed to the result set is delivered.
For a row fed to an INNER JOIN that matches to multiple rows, the row will be delivered multiple times, once for each row match from the INNER JOIN table.
OUTER JOINs will not discard rows fed to them in the result set, whether or not the OUTER JOIN results in a match or not.
Just like INNER JOINs, if an OUTER JOIN matches to more than one row, it will increase the number of rows in the result set by duplicating rows equal to the number of rows matched from the OUTER JOIN table.
Ask yourself "if I get NO match on the JOIN, do I want the row discarded or not?" If the answer is NO, use an OUTER JOIN. If the answer is YES, use an INNER JOIN.
If you don't need to reference any of the columns from a JOIN table, don't perform a JOIN at all. Instead, use a WHERE EXISTS, WHERE NOT EXISTS, WHERE IN, WHERE NOT IN, etc. or similar, depending on your database engine in use. Don't rely on the database engine to be smart enough to discard unreferenced columns resulting from JOINs from the result set. Some databases may be smart enough to do that, some not. There's no reason to pull columns into a result set only to not reference them. Doing so increases chance of reduced performance.
Your JOIN of:
JOIN HR.Employees AS E
ON E.empid <> H.empid
...is matching to all Employees rows with a DIFFERENT EMPID to all rows fed to that join. Use of NOT EQUAL on an INNER JOIN is a very rare thing to do or need, especially if the JOIN predicate is testing only ONE condition. That is why your getting duplicate rows in the result set.
On DB2, we could perform an EXCEPTION JOIN to accomplish that using a JOIN alone. Normally, on DB2, I would use a WHERE NOT EXISTS for that. On SQL Server you could do a JOIN to a query where the query set is all employees without orders in SALES.ORDERS on the specified date, but I don't know if that violates the rules of your tutorial.
Naveen posted the solution it appears your tutorial is looking for!

SQLPLUS Table Trouble

I've been using SQLPLUS lately and one of my tasks was to display a set of values from two tables (stocks, orderitems). I have done this part, but I am stuck on the last part of the question which states: "including the stocks that no order has been placed on them so far".
Here is the statement:
`select Stocks.StockNo, Stocks.Description, OrderItems.QtyOrd
from Stocks INNER JOIN OrderItems
ON Stocks.StockNo = OrderItems.StockNo;`
and I have gotten the correct results for this part, but the second part is eluding me, as the curernt statement doesn't display the 0 values for QtyOrd.
Any help would be appreciated.
You likely want to use a LEFT OUTER JOIN otherwise the INNER JOIN will exclude Stocks which don't have any Orders. You might also consider grouping by Stock, in order to SUM the overall quantities for each stock?
SELECT Stocks.StockNo, Stocks.Description, SUM(OrderItems.QtyOrd) AS QtyOrd
FROM Stocks
LEFT OUTER JOIN OrderItems
ON Stocks.StockNo = OrderItems.StockNo
GROUP BY Stocks.StockNo, Stocks.Description;

Fully matching sets of records of two many-to-many tables

I have Users, Positions and Licenses.
Relations are:
users may have many licenses
positions may require many licenses
So I can easily get license requirements per position(s) as well as effective licenses per user(s).
But I wonder what would be the best way to match the two sets? As logic goes user needs at least those licenses that are required by a certain position. May have more, but remaining are not relevant.
I would like to get results with users and eligible positions.
PersonID PositionID
1 1 -> user 1 is eligible to work on position 1
1 2 -> user 1 is eligible to work on position 2
2 1 -> user 2 is eligible to work on position 1
3 2 -> user 3 is eligible to work on position 2
4 ...
As you can see I need a result for all users, not a single one per call, which would make things much much easier.
There are actually 5 tables here:
create table Person ( PersonID, ...)
create table Position (PositionID, ...)
create table License (LicenseID, ...)
and relations
create table PersonLicense (PersonID, LicenseID, ...)
create table PositionLicense (PositionID, LicenseID, ...)
So basically I need to find positions that a particular person is licensed to work on. There's of course a much more complex problem here, because there are other factors, but the main objective is the same:
How do I match multiple records of one relational table to multiple records of the other. This could as well be described as an inner join per set of records and not per single record as it's usually done in TSQL.
I'm thinking of TSQL language constructs:
rowsets but I've never used them before and don't know how to use them anyway
intersect statements maybe although these probably only work over whole sets and not groups
Final solution (for future reference)
In the meantime while you fellow developers answered my question, this is something I came up with and uses CTEs and partitioning which can of course be used on SQL Server 2008 R2. I've never used result partitioning before so I had to learn something new (which is a plus altogether). Here's the code:
with CTEPositionLicense as (
select
PositionID,
LicenseID,
checksum_agg(LicenseID) over (partition by PositionID) as RequiredHash
from PositionLicense
)
select per.PersonID, pos.PositionID
from CTEPositionLicense pos
join PersonLicense per
on (per.LicenseID = pos.LicenseID)
group by pos.PositionID, pos.RequiredHash, per.PersonID
having pos.RequiredHash = checksum_agg(per.LicenseID)
order by per.PersonID, pos.PositionID;
So I made a comparison between these three techniques that I named as:
Cross join (by Andriy M)
Table variable (by Petar Ivanov)
Checksum - this one here (by Robert Koritnik, me)
Mine already orders results per person and position, so I also added the same to the other two to make return identical results.
Resulting estimated execution plan
Checksum: 7%
Table variable: 2% (table creation) + 9% (execution) = 11%
Cross join: 82%
I also changed Table variable version into a CTE version (instead of table variable a CTE was used) and removed order by at the end and compared their estimated execution plans. Just for reference CTE version 43% while original version had 53% (10% + 43%).
One way to write this efficiently is to do a join of PositionLicences with PersonLicences on the licenceId. Then count the non nulls grouped by position and person and compare with the count of all licences for position - if equal than that person qualifies:
DECLARE #tmp TABLE(PositionId INT, LicenseCount INT)
INSERT INTO #tmp
SELECT PositionId as PositionId
COUNT(1) as LicenseCount
FROM PositionLicense
GROUP BY PositionId
SELECT per.PersonID, pos.PositionId
FROM PositionLicense as pos
INNER JOIN PersonLicense as per ON (pos.LicenseId = per.LicenseId)
GROUP BY t.PositionID, t.PersonId
HAVING COUNT(1) = (
SELECT LicenceCount FROM #tmp WHERE PositionId = t.PositionID
)
I would approach the problem like this:
Get all the (distinct) users from PersonLicense.
Cross join them with PositionLicense.
Left join the resulting set with PersonLicense using PersonID and LicenseID.
Group the results by PersonID and PositionID.
Filter out those (PersonID, PositionID) pairs where the number of licenses in PositionLicense does not match the number of those in PersonLicense.
And here's my implementation:
SELECT
u.PersonID,
pl.PositionID
FROM (SELECT DISTINCT PersonID FROM PersonLicense) u
CROSS JOIN PositionLicense pl
LEFT JOIN PersonLicense ul ON u.PersonID = ul.PersonID
AND pl.LicenseID = ul.LicenseID
GROUP BY
u.PersonID,
pl.PositionID
HAVING COUNT(pl.LicenseID) = COUNT(ul.LicenseID)

PostgreSQL UPDATE - query with left join problem

UPDATE user
SET balance = balance + p.amount
FROM payments p WHERE user.id = p.user_id AND p.id IN (36,38,40)
But it adds to the balance, only the value amount of the first payment 1936.
Please help me how to fix it, i do not want to make cycle in the code to run a lot of requests.
In a multiple-table UPDATE, each row in the target table is updated only once, even it's returned more than once by the join.
From the docs:
When a FROM clause is present, what essentially happens is that the target table is joined to the tables mentioned in the fromlist, and each output row of the join represents an update operation for the target table. When using FROM you should ensure that the join produces at most one output row for each row to be modified. In other words, a target row shouldn't join to more than one row from the other table(s). If it does, then only one of the join rows will be used to update the target row, but which one will be used is not readily predictable.
Use this instead:
UPDATE user u
SET balance = balance + p.amount
FROM (
SELECT user_id, SUM(amount) AS amount
FROM payment
WHERE id IN (36, 38, 40)
GROUP BY
user_id
) p
WHERE u.id = p.user_id

Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094

SELECT DISTINCT tblJobReq.JobReqId
, tblJobReq.JobStatusId
, tblJobClass.JobClassId
, tblJobClass.Title
, tblJobReq.JobClassSubTitle
, tblJobAnnouncement.JobClassDesc
, tblJobAnnouncement.EndDate
, blJobAnnouncement.AgencyMktgVerbage
, tblJobAnnouncement.SpecInfo
, tblJobAnnouncement.Benefits
, tblSalary.MinRateSal
, tblSalary.MaxRateSal
, tblSalary.MinRateHour
, tblSalary.MaxRateHour
, tblJobClass.StatementEval
, tblJobReq.ApprovalDate
, tblJobReq.RecruiterId
, tblJobReq.AgencyId
FROM ((tblJobReq
LEFT JOIN tblJobAnnouncement ON tblJobReq.JobReqId = tblJobAnnouncement.JobReqId)
INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary ON tblJobClass.SalaryCode = tblSalary.SalaryCode
WHERE (tblJobReq.JobClassId in (SELECT JobClassId
from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
When i try to execute the query it results in the following error.
Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094
I checked and didn't find any solution. The only way is to truncate (substring())the "tblJobAnnouncement.JobClassDesc" in the query which has column size of around 8000.
Do we have any work around so that i need not truncate the values. Or Can this query be optimised? Any setting in SQL Server 2000?
The [non obvious] reason why SQL needs to SORT is the DISTINCT keyword.
Depending on the data and underlying table structures, you may be able to do away with this DISTINCT, and hence not trigger this error.
You readily found the alternative solution which is to truncate some of the fields in the SELECT list.
Edit: Answering "Can you please explain how DISTINCT would be the reason here?"
Generally, the fashion in which the DISTINCT requirement is satisfied varies with
the data context (expected number of rows, presence/absence of index, size of row...)
the version/make of the SQL implementation (the query optimizer in particular receives new or modified heuristics with each new version, sometimes resulting in alternate query plans for various constructs in various contexts)
Yet, all the possible plans associated with a "DISTINCT query" involve *some form* of sorting of the qualifying records. In its simplest form, the plan "fist" produces the list of qualifying rows (records) (the list of records which satisfy the WHERE/JOINs/etc. parts of the query) and then sorts this list (which possibly includes some duplicates), only retaining the very first occurrence of each distinct row. In other cases, for example when only a few columns are selected and when some index(es) covering these columns is(are) available, no explicit sorting step is used in the query plan but the reliance on an index implicitly implies the "sortability" of the underlying columns. In other cases yet, steps involving various forms of merging or hashing are selected by the query optimizer, and these too, eventually, imply the ability of comparing two rows.
Bottom line: DISTINCT implies some sorting.
In the specific case of the question, the error reported by SQL Server and preventing the completion of the query is that "Sorting is not possible on rows bigger than..." AND, the DISTINCT keyword is the only apparent reason for the query to require any sorting (BTW many other SQL constructs imply sorting: for example UNION) hence the idea of removing the DISTINCT (if it is logically possible).
In fact you should remove it, for test purposes, to assert that, without DISTINCT, the query completes OK (if only including some duplicates). Once this fact is confirmed, and if effectively the query could produce duplicate rows, look into ways of producing a duplicate-free query without the DISTINCT keyword; constructs involving subqueries can sometimes be used for this purpose.
An unrelated hint, is to use table aliases, using a short string to avoid repeating these long table names. For example (only did a few tables, but you get the idea...)
SELECT DISTINCT JR.JobReqId, JR.JobStatusId,
tblJobClass.JobClassId, tblJobClass.Title,
JR.JobClassSubTitle, JA.JobClassDesc, JA.EndDate, JA.AgencyMktgVerbage,
JA.SpecInfo, JA.Benefits,
S.MinRateSal, S.MaxRateSal, S.MinRateHour, S.MaxRateHour,
tblJobClass.StatementEval,
JR.ApprovalDate, JR.RecruiterId, JR.AgencyId
FROM (
(tblJobReq AS JR
LEFT JOIN tblJobAnnouncement AS JA ON JR.JobReqId = JA.JobReqId)
INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary AS S ON tblJobClass.SalaryCode = S.SalaryCode
WHERE (JR.JobClassId in
(SELECT JobClassId from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
FYI, running this SQL command on your DB can fix the problem if it is caused by space that needs to be reclaimed after dropping variable length columns:
DBCC CLEANTABLE (0,[dbo.TableName])
See: http://msdn.microsoft.com/en-us/library/ms174418.aspx
This is a limitation of SQL Server 2000. You can:
Split it into two queries and combine elsewhere
SELECT ID, ColumnA, ColumnB FROM TableA JOIN TableB
SELECT ID, ColumnC, ColumnD FROM TableA JOIN TableB
Truncate the columns appropriately
SELECT LEFT(LongColumn,2000)...
Remove any redundant columns from the SELECT
SELECT ColumnA, ColumnB, --IDColumnNotUsedInOutput
FROM TableA
Migrate off of SQL Server 2000