How to query the data in a join table by two sets of joined records? - postgresql

I've got three tables: users, courses, and grades, the latter of which joins users and courses with some metadata like the user's score for the course. I've created a SQLFiddle, though the site doesn't appear to be working at the moment. The schema looks like this:
CREATE TABLE users (
  id INT,
  name VARCHAR,
  PRIMARY KEY (id)
);

INSERT INTO users VALUES
  (1, 'Beth'),
  (2, 'Alice'),
  (3, 'Charles'),
  (4, 'Dave');

CREATE TABLE courses (
  id INT,
  title VARCHAR,
  PRIMARY KEY (id)
);

INSERT INTO courses VALUES
  (1, 'Biology'),
  (2, 'Algebra'),
  (3, 'Chemistry'),
  (4, 'Data Science');

CREATE TABLE grades (
  id INT,
  user_id INT,
  course_id INT,
  score INT,
  PRIMARY KEY (id)
);

INSERT INTO grades VALUES
  (1, 2, 2, 89),
  (2, 2, 1, 92),
  (3, 1, 1, 93),
  (4, 1, 3, 88);
I'd like to know how (if possible) to construct a query which specifies some users.id values (1, 2, 3) and courses.id values (1, 2, 3) and returns those users' grades.score values for those courses:
| name    | Algebra | Biology | Chemistry |
|---------|---------|---------|-----------|
| Alice   | 89      | 92      |           |
| Beth    |         | 93      | 88        |
| Charles |         |         |           |
In my application logic, I'll be receiving an array of user_ids and course_ids, so the query needs to select those users and courses dynamically by primary key. (The actual data set contains millions of users and tens of thousands of courses—the examples above are just a sample to work with.)
Ideally, the query would:
use the course titles as dynamic attributes/column headers for the users' score data
sort the row and column headers alphabetically
include empty/NULL cells if the user-course pair has no grades relationship
I suspect I may need some combination of JOINs and Postgresql's crosstab, but I can't quite wrap my head around it.
Update: learning that the terminology for this is "dynamic pivot", I found this SO answer which appears to be trying to solve a related problem in Postgres with crosstab()
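For reference, here is roughly what that crosstab() approach would look like against this schema (my untested sketch; note that the output column list still has to be written out by hand, which is exactly the part that isn't dynamic):
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT *
FROM crosstab(
    -- source query: one row per (row_name, category, value);
    -- the CROSS JOIN + LEFT JOIN keeps users with no grades (Charles)
    $$ SELECT u.name, c.title, g.score
       FROM users u
       CROSS JOIN courses c
       LEFT JOIN grades g ON g.user_id = u.id AND g.course_id = c.id
       WHERE u.id IN (1, 2, 3) AND c.id IN (1, 2, 3)
       ORDER BY u.name, c.title $$,
    -- category query: the output columns, alphabetically
    $$ SELECT title FROM courses WHERE id IN (1, 2, 3) ORDER BY title $$
) AS ct(name varchar, "Algebra" int, "Biology" int, "Chemistry" int);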

I think a simple pivot query should work here, since you only have 4 courses in your data set to pivot.
SELECT t1.name,
       MAX(CASE WHEN t3.title = 'Biology' THEN t2.score ELSE NULL END) AS Biology,
       MAX(CASE WHEN t3.title = 'Algebra' THEN t2.score ELSE NULL END) AS Algebra,
       MAX(CASE WHEN t3.title = 'Chemistry' THEN t2.score ELSE NULL END) AS Chemistry,
       MAX(CASE WHEN t3.title = 'Data Science' THEN t2.score ELSE NULL END) AS Data_Science
FROM users t1
LEFT JOIN grades t2
       ON t1.id = t2.user_id
LEFT JOIN courses t3
       ON t2.course_id = t3.id
GROUP BY t1.name
Follow the link below for a running demo. I used MySQL because, as you have noticed, SQLFiddle seems to be perpetually busted for the other databases.
SQLFiddle
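Since the actual application is on Postgres and receives the ids as arrays, you could bolt the user and course filters onto the same shape of query with = ANY (a sketch; the array literals stand in for your application's parameters, and the course filter sits in the join condition so that users without any grades still get a row):
SELECT t1.name,
       MAX(CASE WHEN t3.title = 'Algebra' THEN t2.score END) AS "Algebra",
       MAX(CASE WHEN t3.title = 'Biology' THEN t2.score END) AS "Biology",
       MAX(CASE WHEN t3.title = 'Chemistry' THEN t2.score END) AS "Chemistry"
FROM users t1
LEFT JOIN grades t2
       ON t1.id = t2.user_id
      AND t2.course_id = ANY(ARRAY[1, 2, 3])  -- course filter kept in the join
LEFT JOIN courses t3
       ON t2.course_id = t3.id
WHERE t1.id = ANY(ARRAY[1, 2, 3])             -- user filter
GROUP BY t1.name
ORDER BY t1.name;                             -- alphabetical row order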

Related

Grouping user id columns together with string_agg on PostgreSQL 13

This is my emails table
create table emails (
    id bigint not null primary key generated by default as identity,
    name text not null
);
And contacts table:
create table contacts (
    id bigint not null primary key generated by default as identity,
    email_id bigint not null,
    user_id bigint not null,
    full_name text not null,
    ordering int not null
);
As you can see, I have a user_id field here. There can be multiple rows with the same user ID in my result, so I want to join them with a comma (,).
Insert some data into the tables:
insert into emails (name)
values
('dennis1'),
('dennis2');
insert into contacts (id, email_id, user_id, full_name, ordering)
values
(5, 1, 1, 'dennis1', 9),
(6, 2, 1, 'dennis1', 5),
(7, 2, 1, 'dennis1', 1),
(8, 1, 3, 'john', 2),
(9, 2, 4, 'dennis7', 1),
(10, 2, 4, 'dennis7', 1);
My query is:
select em.name,
c.user_ids
from emails em
join (
select email_id, string_agg(user_id::text, ',' order by ordering desc) as user_ids
from contacts
group by email_id
) c on c.email_id = em.id
order by em.name;
Actual Result
name     user_ids
dennis1  1,3
dennis2  1,1,4,4
Expected Result
name     user_ids
dennis1  1,3
dennis2  1,4
On my real-world data, the same user ID can appear around 50 times when it should appear only once. In the example above, you can see that users 1 and 4 each appear twice for dennis2.
How can I make them unique?
Demo: https://dbfiddle.uk/?rdbms=postgres_13&fiddle=2e957b52eb46742f3ddea27ec36effb1
P.S.: I tried adding user_id to the GROUP BY, but then I get duplicate rows...
demo:db<>fiddle
SELECT
name,
string_agg(user_id::text, ',' order by ordering desc)
FROM (
SELECT DISTINCT ON (em.id, c.user_id)
*
FROM emails em
JOIN contacts c ON c.email_id = em.id
) s
GROUP BY name
Join the tables.
Apply DISTINCT ON (em.id, c.user_id), so that for every email record each user appears only once.
Aggregate.
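As an aside: if the comma-separated IDs don't have to follow the ordering column, DISTINCT inside the aggregate is enough on its own. (string_agg(DISTINCT user_id::text, ',' ORDER BY ordering) would fail, because with DISTINCT the ORDER BY expression must appear in the argument list.) A sketch:
SELECT em.name,
       string_agg(DISTINCT c.user_id::text, ',') AS user_ids
FROM emails em
JOIN contacts c ON c.email_id = em.id
GROUP BY em.name
ORDER BY em.name;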

Unpivot Columns with Most Recent Record

Student records are updated with a status and an update date per subject. A student can be enrolled in one or multiple subjects. I would like to get each student record with the most recent update date and status for each subject.
CREATE TABLE Student
(
    StudentID int,
    FirstName varchar(100),
    LastName varchar(100),
    FullAddress varchar(100),
    CityState varchar(100),
    MathStatus varchar(100),
    MUpdateDate datetime2,
    ScienceStatus varchar(100),
    SUpdateDate datetime2,
    EnglishStatus varchar(100),
    EUpdateDate datetime2
);
Desired query output is below. I am using the CTE method, but I am trying to find an alternative and better way.
SELECT StudentID, FirstName, LastName, FullAddress, CityState, [SubjectStatus], UpdateDate
FROM Student
;WITH original AS
(
    SELECT * FROM Student
)
,Math as
(
    SELECT DISTINCT o.StudentID, o.FirstName, o.LastName, o.FullAddress, o.CityState,
        ROW_NUMBER() OVER (PARTITION BY o.StudentID, o.MathStatus ORDER BY o.MUpdateDate DESC) as rn,
        _o.MathStatus as SubjectStatus, _o.MUpdateDate as UpdateDate
    FROM original as o
    left join original as _o on o.StudentID = _o.StudentID
    where _o.MathStatus is not null and _o.MUpdateDate is not null
)
,Science AS
(
    ... --Same as Math
)
,English AS
(
    ... --Same As Math
)
SELECT * FROM Math WHERE rn = 1
UNION
SELECT * FROM Science WHERE rn = 1
UNION
SELECT * FROM English WHERE rn = 1
First: storing data in a denormalized form is not recommended. Some data model redesign might be in order. There are multiple resources about data normalization available on the web, like this one.
Now then, I made some guesses about how your source table is populated based on the query you wrote, and generated some sample data to show how the source data might be created. I also reduced the number of columns to save some typing; the general approach remains valid.
Sample data
create table Student
(
StudentId int,
StudentName varchar(15),
MathStat varchar(5),
MathDate date,
ScienceStat varchar(5),
ScienceDate date
);
insert into Student (StudentID, StudentName, MathStat, MathDate, ScienceStat, ScienceDate) values
(1, 'John Smith', 'A', '2020-01-01', 'B', '2020-05-01'),
(1, 'John Smith', 'A', '2020-01-01', 'B+', '2020-06-01'), -- B for Science was updated to B+ month later
(2, 'Peter Parker', 'F', '2020-01-01', 'A', '2020-05-01'),
(2, 'Peter Parker', 'A+', '2020-03-01', 'A', '2020-05-01'), -- Spider-Man would never fail Math, fixed...
(3, 'Tom Holland', null, null, 'A', '2020-05-01'),
(3, 'Tom Holland', 'A-', '2020-07-01', 'A', '2020-05-01'); -- Tom was sick for Math, but got a second chance
Solution
Your question title already contains the word unpivot. That word actually exists in T-SQL as a keyword; you can learn about the unpivot keyword in the documentation. Your own solution already uses common table expressions, so these constructs should look familiar.
Steps:
cte_unpivot = unpivot all rows, create a Subject column and place the corresponding values (SubjectStat, Date) next to it with a case expression.
cte_recent = number the rows to find the most recent row per student and subject.
Select only those most recent rows.
This gives:
with cte_unpivot as
(
    select up.StudentId,
           up.StudentName,
           case up.[Subject]
               when 'MathStat' then 'Math'
               when 'ScienceStat' then 'Science'
           end as [Subject],
           up.SubjectStat,
           case up.[Subject]
               when 'MathStat' then up.MathDate
               when 'ScienceStat' then up.ScienceDate
           end as [Date]
    from Student s
    unpivot ([SubjectStat] for [Subject] in ([MathStat], [ScienceStat])) up
),
cte_recent as
(
    select cu.StudentId, cu.StudentName, cu.[Subject], cu.SubjectStat, cu.[Date],
           row_number() over (partition by cu.StudentId, cu.[Subject] order by cu.[Date] desc) as [RowNum]
    from cte_unpivot cu
)
select cr.StudentId, cr.StudentName, cr.[Subject], cr.SubjectStat, cr.[Date]
from cte_recent cr
where cr.RowNum = 1;
Result
StudentId  StudentName   Subject  SubjectStat  Date
---------  ------------  -------  -----------  ----------
1          John Smith    Math     A            2020-01-01
1          John Smith    Science  B+           2020-06-01
2          Peter Parker  Math     A+           2020-03-01
2          Peter Parker  Science  A            2020-05-01
3          Tom Holland   Math     A-           2020-07-01
3          Tom Holland   Science  A            2020-05-01
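For completeness: the unpivot step can also be written with CROSS APPLY (VALUES ...), which some find easier to extend than the unpivot keyword; cte_recent stays exactly the same. A sketch against the sample table above (the WHERE clause makes explicit the NULL-dropping that unpivot does implicitly):
select s.StudentId,
       s.StudentName,
       v.[Subject],
       v.SubjectStat,
       v.[Date]
from Student s
cross apply (values
    ('Math',    s.MathStat,    s.MathDate),
    ('Science', s.ScienceStat, s.ScienceDate)
) v([Subject], SubjectStat, [Date])
where v.SubjectStat is not null;  -- unpivot drops these rows implicitly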

How to filter a query based on jsonb data?

Not even sure if it's possible to do this kind of query in Postgres. At least I'm stuck.
I have two tables: a product recommendation list, containing multiple products to be recommended to a particular customer; and a transaction table indicating the product bought by customer and transaction details.
I'm trying to track the performance of my recommendation by plotting all the transaction that match the recommendations (both customer and product).
Below is my test case.
Kindly help
create table if not exists productRec( --Product Recommendation list
    task_id int,
    customer_id int,
    detail jsonb);
truncate productRec;
insert into productRec values
(1, 2, '{"1":{"score":5, "name":"KitKat"},
         "4":{"score":2, "name":"Yuppi"}}'),
(1, 3, '{"1":{"score":3, "name":"Yuppi"},
         "4":{"score":2, "name":"GoldenSnack"}}'),
(1, 4, '{"1":{"score":3, "name":"Chickies"},
         "4":{"score":2, "name":"Kitkat"}}');
drop table if exists txn;
create table if not exists txn( --Transaction table
    customer_id int,
    item_id text,
    txn_value numeric,
    txn_date date);
truncate txn;
insert into txn values
(1, 'Yuppi', 500, DATE '2001-01-01'),
(2, 'Kitkat', 2000, DATE '2001-01-01'),
(3, 'Kitkat', 2000, DATE '2001-02-01'),
(4, 'Chickies', 200, DATE '2001-09-01');
--> Query must plot:
--Transaction value vs date where the item_id is inside the recommendation for that customer
--ex: (2000, 2001-01-01), (200, 2001-09-01)
We can get each recommendation as its own row with jsonb_each. I don't know what to do with the keys so I just take the value (still jsonb) and then the name inside it (the ->> outputs text).
select
    customer_id,
    (jsonb_each(detail)).value->>'name' as name
from productrec
So now we have a list of customer_ids and item_ids they were recommended. Now we can just join this with the transactions.
select
    txn.txn_value,
    txn.txn_date
from txn
join (
    select
        customer_id,
        (jsonb_each(detail)).value->>'name' as name
    from productrec
) p ON (
    txn.customer_id = p.customer_id AND
    lower(txn.item_id) = lower(p.name)
);
In your example data you spelled Kitkat differently in the recommendation table for customer 2. I added lowercasing in the join condition to counter that but it might not be the right solution.
 txn_value |  txn_date
-----------+------------
      2000 | 2001-01-01
       200 | 2001-09-01
(2 rows)
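For what it's worth, the same unpacking can be written with a LATERAL join instead of a set-returning function in the select list; many consider this the more idiomatic form (a sketch, same assumptions as above):
SELECT t.txn_value,
       t.txn_date
FROM productrec p
CROSS JOIN LATERAL jsonb_each(p.detail) AS d(key, value)
JOIN txn t
  ON t.customer_id = p.customer_id
 AND lower(t.item_id) = lower(d.value->>'name');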

Copy value from one row to another row in PostgreSQL

I have a table like this:
id  product  amount
1   A        6
1   A        8
1   A
1   B        1
1   B
2   C        2
2   C
2   C        4
2   C
2   C
and I need to make it like this:
id  product  amount
1   A        6
1   A        8
1   A        8
1   B        1
1   B        1
2   C        2
2   C        2
2   C        4
2   C        4
2   C        4
That is, fill each missing amount with the previous non-missing value.
I tried to use the lag() function; however, the window function lag() is not allowed in UPDATE:
update tableA set amount = lag(amount);
What can I do using PostgreSQL?
You can SELECT what you want to UPDATE, but there is no (easy) way to actually do the UPDATE, because the table fox does not have a primary key (yet).
CREATE TABLE fox (
    id integer NOT NULL,
    product text NOT NULL,
    amount integer
);
To populate the fox with some data.
INSERT INTO fox VALUES
(1, 'A', 6),
(1, 'A', 8),
(1, 'A', NULL),
(1, 'B', 1),
(1, 'B', NULL),
(2, 'C', 2),
(2, 'C', NULL),
(2, 'C', 4),
(2, 'C', NULL),
(2, 'C', NULL),
(3, 'What does the fox say?', 5);
The query.
WITH ranks (rank, id, product, amount) AS (
    SELECT ROW_NUMBER() OVER (), id, product, amount FROM fox
)
SELECT r.id, r.product,
       (SELECT amount FROM ranks
        WHERE id = r.id AND product = r.product
          AND rank < r.rank AND amount IS NOT NULL
        ORDER BY rank DESC LIMIT 1
       )
FROM ranks r WHERE r.amount IS NULL ORDER BY 1, 2, 3;
Yields the rows which previously had a NULL and now have the appropriate amount.
 id | product | amount
----+---------+--------
  1 | A       |      8
  1 | B       |      1
  2 | C       |      2
  2 | C       |      4
  2 | C       |      4
But you cannot use this data to update, because the rows are still not uniquely identified by (id, product): there is no WHERE condition that identifies your rows uniquely. How would the WHERE clause know whether to change the amount to 2 or 4 in the UPDATE? The multiple rows with (id, product) = (2, 'C') are indistinguishable in the WHERE of the UPDATE.
Let's give the fox a primary key.
ALTER TABLE fox ADD COLUMN IF NOT EXISTS pkey serial ;
ALTER TABLE fox ADD PRIMARY KEY (pkey) ;
Now we can identify the rows by the PRIMARY KEY pkey.
WITH nulls AS (
    SELECT pkey, id, product
    FROM fox
    WHERE amount IS NULL
)
SELECT pkey,
       id, product, -- you can leave these out in your UPDATE: pkey is UNIQUE
       (SELECT amount FROM fox
        WHERE id = n.id AND product = n.product
          AND n.pkey > pkey AND amount IS NOT NULL
        ORDER BY pkey DESC LIMIT 1)
FROM nulls n ORDER BY 1, 2, 3, 4;
This displays the changes to be made:
 pkey | id | product | amount
------+----+---------+--------
    3 |  1 | A       |      8
    5 |  1 | B       |      1
    7 |  2 | C       |      2
    9 |  2 | C       |      4
   10 |  2 | C       |      4
And we can use pkey in the UPDATE.
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE ;
WITH nulls AS (
    SELECT pkey, id, product
    FROM fox
    WHERE amount IS NULL
), changes AS (
    SELECT pkey,
           (SELECT amount FROM fox
            WHERE id = n.id AND product = n.product
              AND n.pkey > pkey AND amount IS NOT NULL
            ORDER BY pkey DESC LIMIT 1)
    FROM nulls n
)
UPDATE fox f SET amount = c.amount FROM changes c WHERE f.pkey = c.pkey ;
Check the result is okay:
SELECT * FROM fox ORDER BY 1, 2, 3, 4;
And accept using COMMIT or ROLLBACK accordingly.
Alternative to adding a PRIMARY KEY
Every table should always have a primary key.
If you insist on not having one, you could instead compute the rows with their then-not-NULL amount and, rather than UPDATEing them, INSERT them into your table and then remove the old rows with DELETE FROM fox WHERE amount IS NULL. This way you get around adding a primary key, which is unique. Of course, the INSERT and DELETE must be packaged into a TRANSACTION so as not to interfere with other transactions running concurrently: for example, another transaction could add rows with a NULL amount after you have calculated the data to be INSERTed using SELECT and before you DELETE all NULL amounts. You'd miss the concurrently added row with NULL amount in that case (data loss due to concurrency; think ACID).
But a missing primary key will probably bite you later on, anyway.
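For the record, a sketch of that INSERT-then-DELETE variant, reusing the ranks idea from above (still fragile: ROW_NUMBER() OVER () without an ORDER BY gives no guaranteed row order, and a leading NULL row with no previous value would simply be deleted):
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE ;

WITH ranks (rank, id, product, amount) AS (
    SELECT ROW_NUMBER() OVER (), id, product, amount FROM fox
)
INSERT INTO fox (id, product, amount)
SELECT r.id, r.product,
       (SELECT amount FROM ranks
        WHERE id = r.id AND product = r.product
          AND rank < r.rank AND amount IS NOT NULL
        ORDER BY rank DESC LIMIT 1)   -- nearest previous non-NULL amount
FROM ranks r
WHERE r.amount IS NULL;

DELETE FROM fox WHERE amount IS NULL; -- drop the rows that had no amount

COMMIT;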
Without knowing what defines "previous rows", all of this is guesswork. But you can use an anonymous block to do what you want; just adjust it to your needs:
CREATE TEMPORARY TABLE test_lag AS
SELECT column1 AS id, column2 AS product, column3 AS amount FROM (
    VALUES (1, 'A', 6),
           (1, 'A', 8),
           (1, 'A', NULL),
           (1, 'B', 1),
           (1, 'B', NULL),
           (2, 'C', 2),
           (2, 'C', NULL),
           (2, 'C', 4),
           (2, 'C', NULL),
           (2, 'C', NULL)) AS tmp;
DO $$
BEGIN
    --Loop until all NULL amounts are updated.
    --Why do we need this? Because PostgreSQL doesn't support the IGNORE NULLS clause on lag().
    LOOP
        WITH tmp AS (
            SELECT ctid, lag(amount) OVER() AS last_amount
            FROM test_lag
            ORDER BY id, product -- You MUST change this ORDER BY to the right columns (what is the previous row?)
        )
        UPDATE test_lag
        SET amount = tmp.last_amount
        FROM tmp
        WHERE test_lag.ctid = tmp.ctid AND amount IS NULL;

        IF NOT FOUND THEN
            EXIT;
        END IF;
    END LOOP;
END $$;
SELECT * FROM test_lag ORDER BY id, product, amount;
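If you do have a key (like the pkey column from the previous answer), the loop becomes unnecessary: the classic count()/max() window trick fills every gap in one statement. A sketch against the fox table from above; count(amount) ignores NULLs, so it increments at each non-NULL row and thereby labels each run of NULLs with the group of the last known amount:
WITH grouped AS (
    SELECT pkey, id, product, amount,
           count(amount) OVER (PARTITION BY id, product ORDER BY pkey) AS grp
    FROM fox
), filled AS (
    SELECT pkey,
           max(amount) OVER (PARTITION BY id, product, grp) AS new_amount
    FROM grouped
)
UPDATE fox f
SET amount = filled.new_amount
FROM filled
WHERE f.pkey = filled.pkey
  AND f.amount IS NULL;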

Modifying Duplicates

I'm trying to figure out the means to do two things:
Locate duplicate records in a table. These are typically duplicate names in the 'Name' column, but specifically those where the ParentID is the same. It's fine if I have identical names where the ParentID is different, because these names (or children) belong to different parents.
Modify these duplicates. Preferably, I would modify these duplicates by appending the 'ID' to the name.
I came up with a query to locate duplicates and then dump them into a temp table:
CREATE TABLE #Dup(
    Name varchar(50),
    CustNo varchar(7))

insert into #Dup (Name, CustNo)
SELECT [Name], [CustNo]
FROM [02Kids]
GROUP BY [Name], [CustNo]
HAVING Count(*) > 1
This seems to work. When I view the data in the table, I see the name and the ParentID, confirming that this is indeed a name that appears twice for that parent ID. It's worth noting that the name only appears once in the temp table; it doesn't show two rows with the same name and ID (perhaps this is part of my problem).
Here's the query I came up with attempting to perform the modification:
select [#Dup].[Name] + ' ' + [02Kids].[ID] as iName, [02Kids].ParentID
from #Dup
inner join [02Kids]
        on #Dup.CustNo = [02Kids].ParentID
order by iName asc
Well, this sort of works, except I end up with massive amounts of duplicates. For example, one "Name" that I can confirm has only two duplicates ends up appearing close to 13 times in total from that select query.
I may be way off here with that query (this is practice stuff I'm using to teach myself), but I'm having trouble conceiving a correct means to do this. I am still learning syntax, keywords, functions, etc., so maybe there's something I should use that I just don't know of yet.
Well, to get only the matches you want in your "modification" query, you'll need to add a match on Name to your join clause. Right now you are matching your duplicate record to every kid for that parent, not just the duplicates. So if one parent has 13 kids, only one of which is a duplicate, you'll get 13 records for that duplicate instead of one.
inner join [02Kids]
        on #Dup.CustNo = [02Kids].ParentID AND
           #Dup.Name = [02Kids].Name
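Putting it together, the corrected select might look like this (a sketch; the CAST is an assumption on my part, in case ID is a numeric column):
select [#Dup].[Name] + ' ' + CAST([02Kids].[ID] AS varchar(10)) as iName,
       [02Kids].ParentID
from #Dup
inner join [02Kids]
        on #Dup.CustNo = [02Kids].ParentID
       and #Dup.Name = [02Kids].Name
order by iName asc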
Does this answer your question?
USE tempdb
GO
CREATE TABLE Person (PersonID INT, FName VARCHAR(25), LName VARCHAR(25))
INSERT INTO Person VALUES
(1, 'Jim', 'Jones'),
(2, 'Rob', 'Smith'),
(3, 'Matt', 'Bridges'),
(4, 'Jim', 'Jones'),
(5, 'Jim', 'Jones'),
(6, 'Alex', 'Door'),
(7, 'Wilhelm', 'Kay')
GO
;WITH DupDetect AS
(
    SELECT *
          ,Occ = ROW_NUMBER() OVER (PARTITION BY FName, LName ORDER BY PersonID)
    FROM Person
)
UPDATE DupDetect
SET FName = LTRIM(STR(PersonID)) + FName
WHERE Occ > 1

SELECT *
FROM Person
Resulting in:
PersonID | FName   | LName
---------+---------+---------
1        | Jim     | Jones
2        | Rob     | Smith
3        | Matt    | Bridges
4        | 4Jim    | Jones
5        | 5Jim    | Jones
6        | Alex    | Door
7        | Wilhelm | Kay
I'm unaware of any cleaner or more efficient pattern for the modification or removal of duplicates.