How to filter a query based on a jsonb data? - postgresql

Not even sure if it's possible to do this kind of query in postgres. At least i'm stuck.
I have two tables: a product recommendation list, containing multiple products to be recommended to a particular customer; and a transaction table indicating the product bought by customer and transaction details.
I'm trying to track the performance of my recommendation by plotting all the transaction that match the recommendations (both customer and product).
Below is my test case.
Kindly help
create table if not exists productRec( --Product Recommendation list
task_id int,
customer_id int,
detail jsonb);
truncate productRec;
insert into productRec values (1, 2, '{"1":{"score":5, "name":"KitKat"},
"4":{"score":2, "name":"Yuppi"}
}'),
(1, 3, '{"1":{"score":3, "name":"Yuppi"},
"4":{"score":2, "name":"GoldenSnack"}
}'),
(1, 4, '{"1":{"score":3, "name":"Chickies"},
"4":{"score":2, "name":"Kitkat"}
}');
drop table txn;
create table if not exists txn( --Transaction table
customer_id int,
item_id text,
txn_value numeric,
txn_date date);
truncate txn;
insert into txn values (1, 'Yuppi', 500, DATE '2001-01-01'), (2, 'Kitkat', 2000, DATE '2001-01-01'),
(3, 'Kitkat', 2000, DATE '2001-02-01'), (4, 'Chickies', 200, DATE '2001-09-01');
--> Query must plot:
--Transaction value vs date where the item_id is inside the recommendation for that customer
--ex: (2000, 2001-01-01), (200, 2001-09-01)

We can get each recommendation as its own row with jsonb_each. I don't know what to do with the keys so I just take the value (still jsonb) and then the name inside it (the ->> outputs text).
select
customer_id,
(jsonb_each(detail)).value->>'name' as name
from productrec
So now we have a list of customer_ids and item_ids they were recommended. Now we can just join this with the transactions.
select
txn.txn_value,
txn.txn_date
from txn
join (
select
customer_id,
(jsonb_each(detail)).value->>'name' as name
from productrec
) p ON (
txn.customer_id = p.customer_id AND
lower(txn.item_id) = lower(p.name)
);
In your example data you spelled Kitkat differently in the recommendation table for customer 2. I added lowercasing in the join condition to counter that but it might not be the right solution.
txn_value | txn_date
-----------+------------
2000 | 2001-01-01
200 | 2001-09-01
(2 rows)

Related

Concatenate mutliple contiguous rows to single row

i have a huge table with iot-datas from a lot of iot-devices. Every device is sending data one time per minute but only if counter-input got some singals. If not, no data will be sended. So in my database the datas looks like
Today I'm loading all this data in my application and aggregate them by iterating and checking row by row to 3 rows based on contiguous rows. Contiguous rows are all rows where next row is one minute later. It is working but it feels not smart and nice.
Does it make sense to generate this aggregation on sql server - espacialy increase performance?
How would you start?
This is a classic Islands and Gaps problem. I'm still mastering Islands and Gaps so I'd love any feedback on my solution from others in the know (please be gentle). There are at least a couple different ways to solve Islands and Gaps but this is the one that is easiest on my brain. Here's how I got it to work:
DDL to set up data:
IF OBJECT_ID('tempdb..#tmp') IS NOT NULL
DROP TABLE #tmp;
CREATE TABLE #tmp
(IoT_Device INT,
Count INT,
TimeStamp DATETIME);
INSERT INTO #tmp
VALUES
(1, 5, '2021-10-27 14:03'),
(1, 4, '2021-10-27 14:04'),
(1, 7, '2021-10-27 14:05'),
(1, 8, '2021-10-27 14:06'),
(1, 5, '2021-10-27 14:07'),
(1, 4, '2021-10-27 14:08'),
(1, 7, '2021-10-27 14:12'),
(1, 8, '2021-10-27 14:13'),
(1, 5, '2021-10-27 14:14'),
(1, 4, '2021-10-27 14:15'),
(1, 5, '2021-10-27 14:21'),
(1, 4, '2021-10-27 14:22'),
(1, 7, '2021-10-27 14:23');
Islands and Gaps Solution:
;WITH CTE_TIMESTAMP_DATA AS (
SELECT
IoT_Device,
Count,
TimeStamp,
LAG(TimeStamp) OVER
(PARTITION BY IoT_Device ORDER BY TimeStamp) AS previous_timestamp,
LEAD(TimeStamp) OVER
(PARTITION BY IoT_Device ORDER BY TimeStamp) AS next_timestamp,
ROW_NUMBER() OVER
(PARTITION BY IoT_Device ORDER BY TimeStamp) AS island_location
FROM #tmp
)
,CTE_ISLAND_START AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY IoT_Device ORDER BY TimeStamp) AS island_number,
IoT_Device,
TimeStamp AS island_start_timestamp,
island_location AS island_start_location
FROM CTE_TIMESTAMP_DATA
WHERE DATEDIFF(MINUTE, previous_timestamp, TimeStamp) > 1
OR previous_timestamp IS NULL
)
,CTE_ISLAND_END AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY IoT_Device ORDER BY TimeStamp) AS island_number,
IoT_Device,
TimeStamp AS island_end_timestamp,
island_location AS island_end_location
FROM CTE_TIMESTAMP_DATA
WHERE DATEDIFF(MINUTE, TimeStamp, next_timestamp) > 1
OR next_timestamp IS NULL
)
SELECT
S.IoT_Device,
(SELECT SUM(Count)
FROM CTE_TIMESTAMP_DATA
WHERE IoT_Device = S.IoT_Device
AND TimeStamp BETWEEN S.island_start_timestamp AND E.island_end_timestamp) AS Count,
S.island_start_timestamp,
E.island_end_timestamp
FROM CTE_ISLAND_START AS S
INNER JOIN CTE_ISLAND_END AS E
ON E.IoT_Device = S.IoT_Device
AND E.island_number = S.island_number;
The CTE_TIMESTAMP_DATA query pulls the IoT_Device, Count, and TimeStamp along with the TimeStamp before and after each record using LAG and LEAD, and assigns a row number to each record ordered by TimeStamp.
The CTE_ISLAND_START query gets the start of each island.
The CTE_ISLAND_END query gets the end of each island.
The main SELECT at the bottom then uses this data to sum the Count within each island.
This will work with multiple IoT_Devices.
You can read more about Islands and Gaps here or numerous other places online.

Grouping user id columns together with string_agg on PostgreSQL 13

This is my emails table
create table emails (
id bigint not null primary key generated by default as identity,
name text not null
);
And contacts table:
create table contacts (
id bigint not null primary key generated by default as identity,
email_id bigint not null,
user_id bigint not null,
full_name text not null,
ordering int not null
);
As you can see I have user_id field here. There can be multiple same user ID's on my result so i want to join them using comma ,
Insert some data to the tables:
insert into emails (name)
values
('dennis1'),
('dennis2');
insert into contacts (id, email_id, user_id, full_name, ordering)
values
(5, 1, 1, 'dennis1', 9),
(6, 2, 1, 'dennis1', 5),
(7, 2, 1, 'dennis1', 1),
(8, 1, 3, 'john', 2),
(9, 2, 4, 'dennis7', 1),
(10, 2, 4, 'dennis7', 1);
My query is:
select em.name,
c.user_ids
from emails em
join (
select email_id, string_agg(user_id::text, ',' order by ordering desc) as user_ids
from contacts
group by email_id
) c on c.email_id = em.id
order by em.name;
Actual Result
name user_ids
dennis1 1,3
dennis2 1,1,4,4
Expected Result
name user_ids
dennis1 1,3
dennis2 1,4
On my real-world data, I get same user id like 50 times. Instead it should appear 1 time only. In example above, you see user 1 and 4 appears 2 times for dennis2 user.
How can I unique them?
Demo: https://dbfiddle.uk/?rdbms=postgres_13&fiddle=2e957b52eb46742f3ddea27ec36effb1
P.S: I tried to add user_id it to group by but this time I get duplicate rows...
demo:db<>fiddle
SELECT
name,
string_agg(user_id::text, ',' order by ordering desc)
FROM (
SELECT DISTINCT ON (em.id, c.user_id)
*
FROM emails em
JOIN contacts c ON c.email_id = em.id
) s
GROUP BY name
Join the tables
DISTINCT ON email and the user_id, so for every email record, there is no equal users
Aggregate

Multiple UPDATE ... FROM same row is not working

I'm trying to do multiple update, but it works only for the first row.
I have table "users" with 2 records:
create table users
(
uid serial not null
constraint users_pkey
primary key,
balance numeric default 0 not null
);
INSERT INTO public.users (uid, balance) VALUES (2, 100);
INSERT INTO public.users (uid, balance) VALUES (1, 100);
I try to UPDATE user "1" twice with the query, but it update only one time:
balance for user "1" become "105", not "115"
update users as u
set balance = balance + c.bal
from (values (1, 5),
(1, 10)
) as c(uid, bal)
where c.uid = u.uid;
Why it not updated for all rows from subquery?
The postgresql documentation gives no reason for this behaviour but does specify it.
Relevant quote
When a FROM clause is present, what essentially happens is that the
target table is joined to the tables mentioned in the from_list, and
each output row of the join represents an update operation for the
target table. When using FROM you should ensure that the join produces
at most one output row for each row to be modified. In other words, a
target row shouldn't join to more than one row from the other
table(s). If it does, then only one of the join rows will be used to
update the target row, but which one will be used is not readily
predictable.
Use a SELECT with a GROUP BY to combine the rows before performing the update.
You need to aggregate in the inner query before joining:
update users as u
set balance = balance + d.bal
from (
select uid, sum(bal) bal
from ( values (1, 5), (1, 10) ) as c(uid, bal)
group by uid
) d
where d.uid = u.uid;
Demo on DB Fiddle:
| uid | balance |
| --- | ------- |
| 2 | 100 |
| 1 | 115 |

How to query the data in a join table by two sets of joined records?

I've got three tables: users, courses, and grades, the latter of which joins users and courses with some metadata like the user's score for the course. I've created a SQLFiddle, though the site doesn't appear to be working at the moment. The schema looks like this:
CREATE TABLE users(
id INT,
name VARCHAR,
PRIMARY KEY (ID)
);
INSERT INTO users VALUES
(1, 'Beth'),
(2, 'Alice'),
(3, 'Charles'),
(4, 'Dave');
CREATE TABLE courses(
id INT,
title VARCHAR,
PRIMARY KEY (ID)
);
INSERT INTO courses VALUES
(1, 'Biology'),
(2, 'Algebra'),
(3, 'Chemistry'),
(4, 'Data Science');
CREATE TABLE grades(
id INT,
user_id INT,
course_id INT,
score INT,
PRIMARY KEY (ID)
);
INSERT INTO grades VALUES
(1, 2, 2, 89),
(2, 2, 1, 92),
(3, 1, 1, 93),
(4, 1, 3, 88);
I'd like to know how (if possible) to construct a query which specifies some users.id values (1, 2, 3) and courses.id values (1, 2, 3) and returns those users' grades.score values for those courses
| name | Algebra | Biology | Chemistry |
|---------|---------|---------|-----------|
| Alice | 89 | 92 | |
| Beth | | 93 | 88 |
| Charles | | | |
In my application logic, I'll be receiving an array of user_ids and course_ids, so the query needs to select those users and courses dynamically by primary key. (The actual data set contains millions of users and tens of thousands of courses—the examples above are just a sample to work with.)
Ideally, the query would:
use the course titles as dynamic attributes/column headers for the users' score data
sort the row and column headers alphabetically
include empty/NULL cells if the user-course pair has no grades relationship
I suspect I may need some combination of JOINs and Postgresql's crosstab, but I can't quite wrap my head around it.
Update: learning that the terminology for this is "dynamic pivot", I found this SO answer which appears to be trying to solve a related problem in Postgres with crosstab()
I think a simple pivot query should work here, since you only have 4 courses in your data set to pivot.
SELECT t1.name,
MAX(CASE WHEN t3.title = 'Biology' THEN t2.score ELSE NULL END) AS Biology,
MAX(CASE WHEN t3.title = 'Algebra' THEN t2.score ELSE NULL END) AS Algebra,
MAX(CASE WHEN t3.title = 'Chemistry' THEN t2.score ELSE NULL END) AS Chemistry,
MAX(CASE WHEN t3.title = 'Data Science' THEN t2.score ELSE NULL END) AS Data_Science
FROM users t1
LEFT JOIN grades t2
ON t1.id = t2.user_id
LEFT JOIN courses t3
ON t2.course_id = t3.id
GROUP BY t1.name
Follow the link below for a running demo. I used MySQL because, as you have noticed, SQLFiddle seems to be perpetually busted the other databases.
SQLFiddle

GROUP BY clause sees all VARCHAR fields as different

I have witnessed a strange behaviour while trying to GROUP BY a VARCHAR field.
Let the following example, where I try to spot customers that have changed name at least once in the past.
CREATE TABLE #CustomersHistory
(
Id INT IDENTITY(1,1),
CustomerId INT,
Name VARCHAR(200)
)
INSERT INTO #CustomersHistory VALUES (12, 'AAA')
INSERT INTO #CustomersHistory VALUES (12, 'AAA')
INSERT INTO #CustomersHistory VALUES (12, 'BBB')
INSERT INTO #CustomersHistory VALUES (44, '444')
SELECT ch.CustomerId, count(ch.Name) AS cnt
FROM #CustomersHistory ch
GROUP BY ch.CustomerId HAVING count(ch.Name) != 1
Which oddly yields (as if 'AAA' from first INSERT was different from the second one)
CustomerId cnt // (I was expecting)
12 3 // 2
44 1 // 1
Is this behaviour specific to T-SQL?
Why does it behave in this rather counter-intuitive way?
How is it customary to overcome this limitation?
Note: This question is very similar to GROUP BY problem with varchar, where I didn't find the answer to Why
Side Note: Is it good practice to use HAVING count(ch.Name) != 1 instead of HAVING count(ch.Name) > 1 ?
The COUNT() operator will count all rows regardless of value. I think you might want to use a COUNT(DISTINCT ch.Name) which will only count unique names.
SELECT ch.CustomerId, count(DISTINCT ch.Name) AS cnt
FROM #CustomersHistory ch
GROUP BY ch.CustomerId HAVING count(DISTINCT ch.Name) > 1
For more information, take a look at the COUNT() article on book online