How to count the number of joins between a user and several tables? - postgresql

I'm using postgresql 9.3.9 and have a table users that looks like this:
user_id | email
----------------------------
1001 | hello@world.com
1030 | mel@hotmail.com
2333 | jess@gmail.com
2502 | peter@gmail.com
3000 | olivia@hotmail.com
4000 | sharon@gmail.com
4900 | lisa@gmail.com
I then have several tables that list which users are connected on various platforms and when they connected, i.e. platform_a, platform_b, platform_c, etc.
platform_a may look like this:
user_id | created_at
----------------------------
1001 | 2015-04-30
2333 | 2015-05-15
3000 | 2014-02-15
platform_b may look like this:
user_id | created_at
----------------------------
1001 | 2015-06-30
2333 | 2015-07-02
4900 | 2015-07-03
platform_c may look like this:
user_id | created_at
----------------------------
1001 | 2015-08-16
1030 | 2015-07-03
3000 | 2015-09-01
4000 | 2015-09-01
I want the end result to look like this:
user_id | # of connections | latest created_at | connected to a | connected to b | connected to c
--------------------------------------------------------------------------------------------------
1001 | 3 | 2015-08-16 | yes | yes | yes
1030 | 1 | 2015-07-03 | no | no | yes
2333 | 2 | 2015-07-02 | yes | yes | no
2502 | 0 | | no | no | no
3000 | 2 | 2015-09-01 | yes | no | yes
4000 | 1 | 2015-09-01 | no | no | yes
4900 | 1 | 2015-07-03 | no | yes | no
How would I do this?

First, make a union of all your tables:
SELECT user_id, created_at, 1 AS a, 0 AS b, 0 AS c FROM platform_a
UNION
SELECT user_id, created_at, 0 AS a, 1 AS b, 0 AS c FROM platform_b
UNION
SELECT user_id, created_at, 0 AS a, 0 AS b, 1 AS c FROM platform_c
Then group the result of this subquery:
SELECT user_id, COUNT(user_id), MAX(created_at), MAX(a), MAX(b), MAX(c)
FROM subquery_above
GROUP BY user_id
This won't give you the zero results, but you can achieve that with a LEFT JOIN on the user list.
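Folding the two steps above together with that LEFT JOIN gives a complete sketch (UNION ALL is used here so that duplicate links would still be counted):

```sql
SELECT u.user_id,
       COUNT(p.user_id)      AS connections,   -- counts 0 when the user has no links
       MAX(p.created_at)     AS latest_created_at,
       COALESCE(MAX(p.a), 0) AS connected_to_a,
       COALESCE(MAX(p.b), 0) AS connected_to_b,
       COALESCE(MAX(p.c), 0) AS connected_to_c
FROM users u
LEFT JOIN (
    SELECT user_id, created_at, 1 AS a, 0 AS b, 0 AS c FROM platform_a
    UNION ALL
    SELECT user_id, created_at, 0 AS a, 1 AS b, 0 AS c FROM platform_b
    UNION ALL
    SELECT user_id, created_at, 0 AS a, 0 AS b, 1 AS c FROM platform_c
) p ON p.user_id = u.user_id
GROUP BY u.user_id
ORDER BY u.user_id;
```

For user 2502, who appears on no platform, this yields 0 connections and a NULL latest_created_at, matching the desired output.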

select
    user_id,
    count(p),
    max(created_at),
    coalesce(sum((pl = 'a')::int), 0) connected_to_a,
    coalesce(sum((pl = 'b')::int), 0) connected_to_b,
    coalesce(sum((pl = 'c')::int), 0) connected_to_c
from users u
left join (
    select *, 'a' pl from platform_a
    union all
    select *, 'b' pl from platform_b
    union all
    select *, 'c' pl from platform_c
) p
using (user_id)
group by 1;
user_id | count | max | connected_to_a | connected_to_b | connected_to_c
---------+-------+------------+----------------+----------------+----------------
1001 | 3 | 2015-08-16 | 1 | 1 | 1
1030 | 1 | 2015-07-03 | 0 | 0 | 1
2333 | 2 | 2015-07-02 | 1 | 1 | 0
2502 | 0 | | 0 | 0 | 0
3000 | 2 | 2015-09-01 | 1 | 0 | 1
4000 | 1 | 2015-09-01 | 0 | 0 | 1
4900 | 1 | 2015-07-03 | 0 | 1 | 0
(7 rows)

Since you check for all users anyway, it's typically fastest to aggregate before you join:
SELECT *
FROM (SELECT user_id FROM users) u  -- subquery to clip other columns
LEFT JOIN (
    SELECT user_id
         , count(*)          AS connections
         , max(created_at)   AS latest_created_at
         , bool_or(pl = 'a') AS connected_to_a
         , bool_or(pl = 'b') AS connected_to_b
         , bool_or(pl = 'c') AS connected_to_c
    FROM (          SELECT user_id, created_at, 'a'::"char" AS pl FROM platform_a
          UNION ALL SELECT user_id, created_at, 'b' FROM platform_b
          UNION ALL SELECT user_id, created_at, 'c' FROM platform_c
         ) p1
    GROUP BY user_id
    ) p2 USING (user_id)
ORDER BY user_id;
Result is exactly as desired - except that connections is NULL instead of '0' in your example. Use COALESCE() in the outer SELECT if you need to convert that. I didn't, because SELECT * is so convenient.
If you are going to list all columns in the outer SELECT anyway, you can just as well use users directly instead of the subquery u.
bool_or() is the perfect aggregate function for the job.
There might be multiple links to one platform. This query still returns a single row per user.
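If you do want connections to read 0 instead of NULL for users without any link, spell out the outer column list with COALESCE(); a sketch along the lines of the query above:

```sql
SELECT user_id
     , COALESCE(p2.connections, 0)        AS connections
     , p2.latest_created_at
     , COALESCE(p2.connected_to_a, false) AS connected_to_a
     , COALESCE(p2.connected_to_b, false) AS connected_to_b
     , COALESCE(p2.connected_to_c, false) AS connected_to_c
FROM users u
LEFT JOIN (
    SELECT user_id
         , count(*)          AS connections
         , max(created_at)   AS latest_created_at
         , bool_or(pl = 'a') AS connected_to_a
         , bool_or(pl = 'b') AS connected_to_b
         , bool_or(pl = 'c') AS connected_to_c
    FROM (          SELECT user_id, created_at, 'a'::"char" AS pl FROM platform_a
          UNION ALL SELECT user_id, created_at, 'b' FROM platform_b
          UNION ALL SELECT user_id, created_at, 'c' FROM platform_c
         ) p1
    GROUP BY user_id
) p2 USING (user_id)
ORDER BY user_id;
```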

Related

PostgreSQL - Check if column value exists in any previous row

I'm working on a problem where I need to check if an ID exists in any previous records within another ID set, and create a tag if it does.
Suppose I have the following table
| client_id | order_date | supplier_id |
|-----------|------------|-------------|
| 1 | 2022-01-01 | 1 |
| 1 | 2022-02-01 | 2 |
| 1 | 2022-03-01 | 1 |
| 1 | 2022-04-01 | 3 |
| 2 | 2022-05-01 | 1 |
| 2 | 2022-06-01 | 1 |
| 2 | 2022-07-01 | 2 |
And I want to create a column with a "is new supplier" tag (for each client):
| client_id | order_date | supplier_id | is_new_supplier |
|-----------|------------|-------------|-----------------|
| 1 | 2022-01-01 | 1 | True |
| 1 | 2022-02-01 | 2 | True |
| 1 | 2022-03-01 | 1 | False |
| 1 | 2022-04-01 | 3 | True |
| 2 | 2022-05-01 | 1 | True |
| 2 | 2022-06-01 | 1 | False |
| 2 | 2022-07-01 | 2 | True |
First I tried doing this by creating a dense_rank and filtering out repeated ranks, but it didn't work:
with aux as (SELECT client_id,
order_date,
supplier_id
FROM table)
SELECT *, dense_rank() over (
partition by client_id
order by supplier_id
) as _dense_rank
FROM aux
Another way I thought about doing this, is by creating an auxiliary id with client_id + supplier_id, ordering by date and checking if the aux id exists in any previous row, but I don't know how to do this in SQL.
You are on the right track.
Instead of dense_rank, you can just use row_number and add supplier_id to your partition by.
Don't forget to order by order_date:
with aux as (
    SELECT client_id,
           order_date,
           supplier_id,
           row_number() over (
               partition by client_id, supplier_id
               order by order_date
           ) as rank
    FROM table
)
SELECT client_id,
       order_date,
       supplier_id,
       rank,
       (rank = 1) as is_new_supplier
FROM aux
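An equivalent formulation without the CTE, assuming order_date is unique within each (client_id, supplier_id) pair, compares each row's date to the first date that client used that supplier. The table name orders is assumed here, since the question's placeholder name is a reserved word:

```sql
SELECT client_id,
       order_date,
       supplier_id,
       order_date = min(order_date) OVER (
           PARTITION BY client_id, supplier_id
       ) AS is_new_supplier
FROM orders
ORDER BY client_id, order_date;
```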

Get dummy columns from different tables

I have three different tables that look like that:
Table 1
| id | city|
|----|-----|
| 1 | A |
| 1 | B |
| 2 | C |
Table 2
| id | city|
|----|-----|
| 2 | B |
| 1 | B |
| 3 | C |
Table 3
| id | city|
|----|-----|
| 1 | A |
| 1 | B |
| 2 | A |
I need to create one column for each table, and the dummies values if it's present.
| id | city| is_tbl_1 | is_tbl_2 | is_tbl_3 |
|----|-----|-----------|-------------|------------|
| 1 | A | 1 | 0 | 1 |
| 1 | B | 1 | 1 | 1 |
| 2 | A | 0 | 0 | 1 |
| 2 | C | 1 | 0 | 0 |
| 2 | B | 0 | 1 | 0 |
| 3 | C | 0 | 1 | 0 |
I have tried to add the columns is_tbl# myself on three different selects, UNION all the three tables and group, but it looks ugly, is there a better way to do it?
You can outer-join the 3 tables on id and city, then group by the id and city, and finally count the number of non-null values of the city columns :
SELECT
    COALESCE(t1.id, t2.id, t3.id)       AS id
  , COALESCE(t1.city, t2.city, t3.city) AS city
  , count(*) FILTER (WHERE t1.city IS NOT NULL) AS is_tbl_1
  , count(*) FILTER (WHERE t2.city IS NOT NULL) AS is_tbl_2
  , count(*) FILTER (WHERE t3.city IS NOT NULL) AS is_tbl_3
FROM t1
FULL OUTER JOIN t2
    ON t1.id = t2.id AND t1.city = t2.city
FULL OUTER JOIN t3
    -- compare against whichever side of the first join is present
    ON COALESCE(t1.id, t2.id) = t3.id
   AND COALESCE(t1.city, t2.city) = t3.city
GROUP BY 1, 2
ORDER BY 1, 2
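The UNION ALL approach from the question can also be kept fairly compact: tag each source table with a number, append the three tables, and aggregate once. A sketch, assuming the tables are named t1, t2 and t3:

```sql
SELECT id, city
     , count(*) FILTER (WHERE src = 1) AS is_tbl_1
     , count(*) FILTER (WHERE src = 2) AS is_tbl_2
     , count(*) FILTER (WHERE src = 3) AS is_tbl_3
FROM (          SELECT id, city, 1 AS src FROM t1
      UNION ALL SELECT id, city, 2 FROM t2
      UNION ALL SELECT id, city, 3 FROM t3
     ) u
GROUP BY id, city
ORDER BY id, city;
```

This avoids the chained FULL OUTER JOIN entirely, at the cost of first materializing all three tables into one set.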

How to fill Null with the previous value in PostgreSQL?

I have a table which contains Null values. I need to replace them with a previous non-Null value.
This is an example of data which I have:
date | category | start_period | period_number |
------------------------------------------------------
2018-01-01 | A | 1 | 1 |
2018-01-02 | A | 0 | Null |
2018-01-03 | A | 0 | Null |
2018-01-04 | A | 0 | Null |
2018-01-05 | B | 1 | 2 |
2018-01-06 | B | 0 | Null |
2018-01-07 | B | 0 | Null |
2018-01-08 | A | 1 | 3 |
2018-01-09 | A | 0 | Null |
2018-01-10 | A | 0 | Null |
The result should look like this:
date | category | start_period | period_number |
------------------------------------------------------
2018-01-01 | A | 1 | 1 |
2018-01-02 | A | 0 | 1 |
2018-01-03 | A | 0 | 1 |
2018-01-04 | A | 0 | 1 |
2018-01-05 | B | 1 | 2 |
2018-01-06 | B | 0 | 2 |
2018-01-07 | B | 0 | 2 |
2018-01-08 | A | 1 | 3 |
2018-01-09 | A | 0 | 3 |
2018-01-10 | A | 0 | 3 |
I tried the following query, but in this case, only the first Null value will be replaced.
select
date,
category,
start_period,
case
when period_number isnull then lag(period_number) over()
else period_number
end as period_number
from period_table;
Also, I tried to use first_value() window function, but I don't know how to set up the correct window.
Any help is highly appreciated.
You can join the table with itself to get the desired value, assuming your date column is the primary key or unique:
update your_table upd
set period_number = tbl.period_number
from (
    select b.date, max(b2.date) as d2
    from your_table b
    inner join your_table b2
        on b2.date < b.date and b2.period_number is not null
    where b.period_number is null
    group by b.date
) t
inner join your_table tbl on tbl.date = t.d2
where t.date = upd.date
If you don't need to update the table but only need a select statement:
select yt.date, yt.category, yt.start_period,
       coalesce(yt.period_number, tbl.period_number) as period_number
from your_table yt
left join (
    select b.date, max(b2.date) as d2
    from your_table b
    inner join your_table b2
        on b2.date < b.date and b2.period_number is not null
    where b.period_number is null
    group by b.date
) t on yt.date = t.date
left join your_table tbl on tbl.date = t.d2
If you replace your case statement with:
(
select
_.period_number
from
period_table as _
where
_.period_number is not null
and _.category = period_table.category
and _.date <= period_table.date
order by
_.date desc
limit 1
) as period_number
Then it should have the intended effect. It's nowhere near as elegant as a window function, but I don't think window functions are quite flexible enough for this specific use case (or at least, if they are, I don't know how to flex them that much).
Examples of window functions and frame clauses:
select date, category, score,
       first_value(score) over (
           partition by category
           order by date
           range between unbounded preceding and current row
       ) as first_score
from testing.rec_test
order by date, category;

select date, category, score,
       last_value(score) over (
           partition by category
           order by date
           range between current row and unbounded following
       ) as last_score
from testing.rec_test
order by date, category;
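For this particular data set there is also a shortcut: start_period is 1 exactly on the row that opens each period, so the period number is simply a running sum of that flag. A sketch, assuming one row per date as in the sample:

```sql
SELECT date,
       category,
       start_period,
       sum(start_period) OVER (ORDER BY date) AS period_number
FROM period_table
ORDER BY date;
```

For the sample rows this reproduces 1, 1, 1, 1, 2, 2, 2, 3, 3, 3 without any self-join.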

SUM values from two tables with GROUP BY and WHERE

I have two tables below named sent_table and received_table. I am attempting to mash them together in a query to achieve output_table. All my attempts so far result in a huge amount of duplicates and totally bogus sum values.
I am assuming I would need to use GROUP BY and WHERE to achieve this goal. I want to be able to filter based on the users name.
sent_table
+----+------+-------+----------+
| id | name | value | order_id |
+----+------+-------+----------+
| 1 | dave | 100 | 1 |
| 2 | dave | 200 | 1 |
| 3 | dave | 300 | 2 |
+----+------+-------+----------+
received_table
+----+------+-------+----------+
| id | name | value | order_id |
+----+------+-------+----------+
| 1 | dave | 400 | 1 |
| 2 | dave | 500 | 2 |
| 3 | dave | 600 | 2 |
+----+------+-------+----------+
output table
+------+----------+----------+
| sent | received | order_id |
+------+----------+----------+
| 300 | 400 | 1 |
| 300 | 1100 | 2 |
+------+----------+----------+
I tried the following with no joy. This does not impose any restrictions on how I would desire to solve this problem. It is just how I attempted to do it.
SELECT *
FROM
( select SUM(value) as sent, order_id FROM sent_table WHERE name='dave' GROUP BY order_id) A
CROSS JOIN
( select SUM(value) as received, order_id FROM received_table WHERE name='dave' GROUP BY order_id) B
Any help would be greatly appreciated.
Do the sums on each table, grouping by order_id, then join the results. To get the rows even if one side is missing, do a FULL OUTER JOIN:
SELECT COALESCE(s.order_id, r.order_id) AS order_id, s.sent, r.received
FROM (
    SELECT order_id, SUM(value) AS sent
    FROM sent_table
    WHERE name = 'dave'
    GROUP BY order_id
) s
FULL OUTER JOIN (
    SELECT order_id, SUM(value) AS received
    FROM received_table
    WHERE name = 'dave'
    GROUP BY order_id
) r
USING (order_id)
ORDER BY 1
Result:
| order_id | sent | received |
| -------- | ---- | -------- |
| 1 | 300 | 400 |
| 2 | 300 | 1100 |
Note the COALESCE on the order_id: if an order_id is missing from sent_table it will be taken from received_table, so that value will never be NULL.
If you want to have 0 in place of NULL (e.g. when there is no record for that order_id in either sent_table or received_table), you would do COALESCE(s.sent, 0) AS sent, COALESCE(r.received, 0) AS received.
https://www.db-fiddle.com/f/nq3xYrcys16eUrBRHT6xLL/2

PostgreSQL Query for grouping column values into columns

I have the PostgreSQL table shown below. ordered is a boolean column and created_at is a timestamp. I'm trying to fetch rows which tell me the total number of successful orders (count of t) vs failed orders (count of f), as well as the total number of orders (t + f), grouped by day (extracted from created_at).
ordered | created_at
t | 2018-10-10 20:13:10
t | 2018-10-10 21:23:11
t | 2018-10-11 06:33:52
f | 2018-10-11 13:13:33
f | 2018-10-11 19:17:11
f | 2018-10-12 00:53:01
f | 2018-10-12 05:14:41
f | 2018-10-12 16:33:09
f | 2018-10-13 17:14:14
I want the following result
created_at | ordered_true | ordered_false | total_orders
2018-10-10 | 2 | 0 | 2
2018-10-11 | 1 | 2 | 3
2018-10-12 | 0 | 3 | 3
2018-10-13 | 0 | 1 | 1
Use the aggregate functions sum() and count():
select
created_at::date,
sum(ordered::int) as ordered_true,
sum((not ordered)::int) as ordered_false,
count(*) as total_orders
from my_table
group by 1
order by 1
created_at | ordered_true | ordered_false | total_orders
------------+--------------+---------------+--------------
2018-10-10 | 2 | 0 | 2
2018-10-11 | 1 | 2 | 3
2018-10-12 | 0 | 3 | 3
2018-10-13 | 0 | 1 | 1
(4 rows)
Try:
SELECT created_at,
COUNT(ordered) filter (where ordered = 't') AS ordered_true,
COUNT(ordered) filter (where ordered = 'f') AS ordered_false,
COUNT(*) AS total_orders
FROM my_table
GROUP BY created_at
EDIT: use klint's answer above. As the OP pointed out in the comments, grouping by the raw created_at timestamp produces a separate group for every distinct timestamp rather than one group per day.