Join tables and count instances of different values - postgresql

user
---------------------------
| ID | Name |
---------------------------
| 1 | Jim Rice |
| 2 | Wade Boggs |
| 3 | Bill Buckner |
---------------------------
at_bats
----------------------
| ID | User | Bases |
----------------------
| 1 | 1 | 2 |
| 2 | 2 | 1 |
| 3 | 1 | 2 |
| 4 | 3 | 0 |
| 5 | 1 | 3 |
----------------------
What I want my query to do is get the count of the different base values in a join table like:
count_of_hits
---------------------
| ID | 1B | 2B | 3B |
---------------------
| 1 | 0 | 2 | 1 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 |
---------------------
I had a query where I was able to get the bases individually, but not them all unless I did some complicated Joins and I'd imagine there is a better way. This was the foundational query though:
SELECT id, COUNT(ab.*)
FROM user
LEFT OUTER JOIN (SELECT * FROM at_bats WHERE at_bats.bases=2) ab ON ab.user=user.id

PostgreSQL 9.4+ provides a much cleaner way to do this:
SELECT
users,
count(*) FILTER (WHERE bases=1) As B1,
count(*) FILTER (WHERE bases=2) As B2,
count(*) FILTER (WHERE bases=3) As B3
FROM at_bats
GROUP BY users
ORDER BY users;

I think the following query would solve your problem. However, I am not sure if it is the best approach:
select distinct a.users, coalesce(b.B1, 0) As B1, coalesce(c.B2, 0) As B2 ,coalesce(d.B3, 0) As B3
FROM at_bats a
LEFT JOIN (SELECT users, count(bases) As B1 FROM at_bats WHERE bases = 1 GROUP BY users) as b ON a.users=b.users
LEFT JOIN (SELECT users, count(bases) As B2 FROM at_bats WHERE bases = 2 GROUP BY users) as c ON a.users=c.users
LEFT JOIN (SELECT users, count(bases) As B3 FROM at_bats WHERE bases = 3 GROUP BY users) as d ON a.users=d.users
Order by users
The coalesce() function just replaces the nulls with zeros. I hope this query helps you :D
UPDATE 1
I found a better way to do it; look at the following:
SELECT users,
count(case bases when 1 then 1 else null end) As B1,
count(case bases when 2 then 1 else null end) As B2,
count(case bases when 3 then 1 else null end) As B3
FROM at_bats
GROUP BY users
ORDER BY users;
It is more efficient than my first query, which reads at_bats once per subquery. You can check the performance by putting EXPLAIN ANALYSE before the query.
Thanks to Guffa from this post: https://stackoverflow.com/a/1400115/4453190
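To try the conditional count against the sample data from the question, here is a minimal sketch. It runs on Python's built-in sqlite3 rather than PostgreSQL, and the table and column names (users, at_bats.user_id) are assumptions chosen to avoid the reserved word user. It also starts from users with a LEFT JOIN, so a player with no at-bats at all would still get a row of zeros.

```python
import sqlite3


def count_hits():
    """Return (user_id, singles, doubles, triples) for every user."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE at_bats (id INTEGER PRIMARY KEY, user_id INTEGER, bases INTEGER);
        INSERT INTO users VALUES (1, 'Jim Rice'), (2, 'Wade Boggs'), (3, 'Bill Buckner');
        INSERT INTO at_bats VALUES (1, 1, 2), (2, 2, 1), (3, 1, 2), (4, 3, 0), (5, 1, 3);
    """)
    # COUNT ignores NULLs, so each CASE counts only the matching at-bats.
    rows = con.execute("""
        SELECT u.id,
               COUNT(CASE WHEN ab.bases = 1 THEN 1 END) AS b1,
               COUNT(CASE WHEN ab.bases = 2 THEN 1 END) AS b2,
               COUNT(CASE WHEN ab.bases = 3 THEN 1 END) AS b3
        FROM users u
        LEFT JOIN at_bats ab ON ab.user_id = u.id
        GROUP BY u.id
        ORDER BY u.id
    """).fetchall()
    con.close()
    return rows


print(count_hits())  # [(1, 0, 2, 1), (2, 1, 0, 0), (3, 0, 0, 0)]
```

The same SELECT should carry over to PostgreSQL unchanged, since it uses only standard CASE and COUNT.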

Related

How to compute frequency/count of concurrent events by combination in postgresql?

I am looking for a way to identify event names that co-occur, i.e., to correlate event names that have the same start (startts) and end (endts) times. The events are exactly concurrent (partial overlap does not occur in this database, which makes the condition a bit simpler to satisfy).
toy dataframe
+------------------+
|name startts endts|
| A 02:20 02:23 |
| A 02:23 02:25 |
| A 02:27 02:28 |
| B 02:20 02:23 |
| B 02:23 02:25 |
| B 02:25 02:27 |
| C 02:27 02:28 |
| D 02:27 02:28 |
| D 02:28 02:31 |
| E 02:27 02:28 |
| E 02:29 02:31 |
+------------------+
Ideal output:
+---------------------------+
|combination| count |
+---------------------------+
| AB | 2 |
| AC | 1 |
| AE | 1 |
| AD | 1 |
| BC | 0 |
| BD | 0 |
| BE | 0 |
| CE | 0 |
+-----------+---------------+
Naturally, I would have tried a loop but I recognize PostgreSQL is not optimal for this.
What I've tried is generating a temporary table by selecting for distinct name and startts and endts combinations and then doing a left join on the table itself (selecting name).
User @GMB provided the following (modified) solution; however, the performance is not satisfactory given the size of the database (even running the query on a time window of 10 minutes never completes). For context, there are about 300-400 unique names; so about 80200 combinations (if my math checks out). Order is not important for the permutations.
@GMB's attempt:
I understand this as a self-join, aggregation, and a conditional count of matching intervals:
select t1.name name1, t2.name name2,
sum(case when t1.startts = t2.startts and t1.endts = t2.endts then 1 else 0 end) cnt
from mytable t1
inner join mytable t2 on t2.name > t1.name
group by t1.name, t2.name
order by t1.name, t2.name
Demo on DB Fiddle:
name1 | name2 | cnt
:---- | :---- | --:
A | B | 2
A | C | 1
A | D | 1
A | E | 1
B | C | 0
B | D | 0
B | E | 0
C | D | 1
C | E | 1
D | E | 1
@GMB notes that, if you are looking for a count of overlapping intervals, all you have to do is change the sum() to:
sum(t1.startts <= t2.endts and t1.endts >= t2.startts) cnt
Version = PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.19097
Thank you.
Consider the following in MySQL (where your DBFiddle points to):
SELECT name, COUNT(*)
FROM (
SELECT group_concat(name ORDER BY name) name
FROM mytable
GROUP BY startts, endts
ORDER BY name
) as names
GROUP BY name
ORDER BY name
Equivalent in PostgreSQL:
SELECT name, COUNT(*)
FROM (
SELECT string_agg(name, ',' ORDER BY name) name
FROM mytable
GROUP BY startts, endts
ORDER BY name
) as names
GROUP BY name
ORDER BY name
First, you create a list of concurrent events (in the subquery), and then you count them.
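If per-pair counts are needed instead of one row per group of concurrent names, a variant is to self-join on equal start and end times, so that only concurrent rows are ever paired. The sketch below is an assumption-laden illustration: it runs on Python's built-in sqlite3 with the toy data from the question, and it omits pairs that never co-occur instead of reporting them with a count of 0.

```python
import sqlite3

# Toy data from the question: (name, startts, endts).
EVENTS = [
    ("A", "02:20", "02:23"), ("A", "02:23", "02:25"), ("A", "02:27", "02:28"),
    ("B", "02:20", "02:23"), ("B", "02:23", "02:25"), ("B", "02:25", "02:27"),
    ("C", "02:27", "02:28"),
    ("D", "02:27", "02:28"), ("D", "02:28", "02:31"),
    ("E", "02:27", "02:28"), ("E", "02:29", "02:31"),
]


def concurrent_pair_counts(events):
    """Count, per pair of names, how many identical (startts, endts) intervals they share."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE mytable (name TEXT, startts TEXT, endts TEXT)")
    con.executemany("INSERT INTO mytable VALUES (?, ?, ?)", events)
    # The equality conditions restrict the join to exactly concurrent rows;
    # t2.name > t1.name keeps each unordered pair once.
    rows = con.execute("""
        SELECT t1.name, t2.name, COUNT(*) AS cnt
        FROM mytable t1
        JOIN mytable t2
          ON t2.startts = t1.startts
         AND t2.endts = t1.endts
         AND t2.name > t1.name
        GROUP BY t1.name, t2.name
        ORDER BY t1.name, t2.name
    """).fetchall()
    con.close()
    return rows


for name1, name2, cnt in concurrent_pair_counts(EVENTS):
    print(name1, name2, cnt)
```

Because the join condition is an equality on (startts, endts), an index on those two columns may help on a larger table; whether it does on Redshift is untested here.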

Accomplishing what I need without a CROSS JOIN

I have a query that pulls from a table. With this table, I would like to build a query that allows me to make projections into the future.
SELECT
b.date,
a.id,
SUM(CASE WHEN a.date = b.date THEN a.sales ELSE 0 END) sales,
SUM(CASE WHEN a.date = b.date THEN a.revenue ELSE 0 END) revenue
FROM
table_a a
CROSS JOIN table_b b
WHERE a.date BETWEEN '2018-10-31' AND '2018-11-04'
GROUP BY 1,2
table_b is a table with literally only one column that contains dates going deep into the future. This returns results like this:
+----------+--------+-------+---------+
| date | id | sales | revenue |
+----------+--------+-------+---------+
| 11/4/18 | 113972 | 0 | 0 |
| 11/4/18 | 111218 | 0 | 0 |
| 11/3/18 | 111218 | 0 | 0 |
| 11/3/18 | 113972 | 0 | 0 |
| 11/2/18 | 111218 | 0 | 0 |
| 11/2/18 | 113972 | 0 | 0 |
| 11/1/18 | 111218 | 89 | 2405.77 |
| 11/1/18 | 113972 | 265 | 3000.39 |
| 10/31/18 | 111218 | 64 | 2957.71 |
| 10/31/18 | 113972 | 120 | 5650.91 |
+----------+--------+-------+---------+
Now there's more to the query after this where I get into the projections and what not, but for the purposes of this question, this is all you need, as it's where the CROSS JOIN exists.
How can I recreate these results without using a CROSS JOIN? In reality, this query is a much larger date range with way more data and takes hours and so much power to run and I know CROSS JOIN's should be avoided if possible.
Use the table of all dates as the "from table" and left join the data to it. This still returns every date.
SELECT
d.date
, t.id
, COALESCE(SUM(t.sales),0) sales
, COALESCE(SUM(t.revenue),0) revenue
FROM all_dates d
LEFT JOIN table_data t
ON d.date = t.date
WHERE d.date BETWEEN '2018-10-31' AND '2018-11-04'
GROUP BY
d.date
, t.id
Another alternative (to avoid the cross join) could be generate_series(), but for that, in Redshift, I suggest this former answer. I'm a fan of generate_series(), but if you already have a table I would probably stay with that (though this is based on what little I know about your query).
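A small runnable sketch of the date-table LEFT JOIN, using Python's built-in sqlite3 and the figures from the question. The table names all_dates and table_data follow the answer's query; the rest is assumed for illustration. One difference from the CROSS JOIN is visible in the output: dates with no data come back as a single row with a NULL id, not one zero row per id.

```python
import sqlite3


def sales_by_date():
    """LEFT JOIN the sales data onto a table of dates so empty dates still appear."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE all_dates (date TEXT PRIMARY KEY);
        CREATE TABLE table_data (date TEXT, id INTEGER, sales INTEGER, revenue REAL);
        INSERT INTO all_dates VALUES
            ('2018-10-31'), ('2018-11-01'), ('2018-11-02'), ('2018-11-03'), ('2018-11-04');
        INSERT INTO table_data VALUES
            ('2018-10-31', 111218, 64, 2957.71),
            ('2018-10-31', 113972, 120, 5650.91),
            ('2018-11-01', 111218, 89, 2405.77),
            ('2018-11-01', 113972, 265, 3000.39);
    """)
    rows = con.execute("""
        SELECT d.date,
               t.id,
               COALESCE(SUM(t.sales), 0)   AS sales,
               COALESCE(SUM(t.revenue), 0) AS revenue
        FROM all_dates d
        LEFT JOIN table_data t ON d.date = t.date
        WHERE d.date BETWEEN '2018-10-31' AND '2018-11-04'
        GROUP BY d.date, t.id
        ORDER BY d.date, t.id
    """).fetchall()
    con.close()
    return rows


for row in sales_by_date():
    print(row)
```

If a zero row per id and date is really required for the later projection step, some form of id-by-date expansion is still needed, though it can be limited to the distinct ids instead of every row of the data table.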

Postgresql : Filtering duplicate pair

I am asking this from mobile, so apologies for bad formatting. For the following table.
Table players
| ID | name |matches_won|
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
| 1 | bob | 3 |
| 2 | Paul | 2 |
| 3 | John | 4 |
| 4 | Jim | 1 |
| 5 | hal | 0 |
| 6 | fin | 0 |
I want to pair players together in a query, matching those who have the same or nearly the same number of matches won. So the query should display the following result.
| ID | NAME | ID | NAME |
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
| 3 | John | 1 | bob |
| 2 | paul | 4 | Jim |
| 5 | hal | 6 | fin |
Until now I have tried this query. But it gives repeat pairs.
Select player1.ID,player1.name,player2.ID,player2.name
From player as player1,
player as player2
Where
player1.matches_won >= player2.matches_won
And player1.ID != player2.ID;
The query pairs the player with the most won matches with every one of the other players, whereas I want each player to appear only once in the result, paired with the player whose number of wins is nearest to his own.
I have tried sub queries. But I don't know how to go about it, since it only returns one result. Also aggregates don't work in the where clause. So I am not sure how to achieve this.
An easier way, IMHO, to achieve this would be to order the players by their number of wins, divide these ranks by two to create matches and self join. CTEs (with expressions) allow you to do this relatively elegantly:
WITH wins AS (
SELECT id, name, ROW_NUMBER() OVER (ORDER BY matches_won DESC) AS rn
FROM players
)
SELECT w1.id, w1.name, w2.id, w2.name
FROM (SELECT id, name, rn / 2 AS rn
FROM wins
WHERE rn % 2 = 1) w1
LEFT JOIN (SELECT id, name, (rn - 1) / 2 AS rn
FROM wins
WHERE rn % 2 = 0) w2 ON w1.rn = w2.rn
Add row numbers in descending order by won matches to the table and join odd row numbers with adjacent even row numbers:
with players as (
select *, row_number() over (order by matches_won desc) rn
from player)
select a.id, a.name, b.id, b.name
from players a
join players b
on a.rn = b.rn - 1
where a.rn % 2 = 1
id | name | id | name
----+------+----+------
3 | John | 1 | bob
2 | Paul | 4 | Jim
5 | hal | 6 | fin
(3 rows)
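The pairing idea in both answers (rank by wins, then join rank 1 with 2, 3 with 4, and so on) can be checked outside the database. This is only a sketch: it uses Python's built-in sqlite3 for the ordering and does the pairing in Python, and the tie-break on id is an assumption that the answers' ROW_NUMBER() queries do not specify.

```python
import sqlite3
from itertools import zip_longest


def pair_players():
    """Pair players with adjacent ranks when ordered by matches won."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT, matches_won INTEGER);
        INSERT INTO players VALUES
            (1, 'bob', 3), (2, 'Paul', 2), (3, 'John', 4),
            (4, 'Jim', 1), (5, 'hal', 0), (6, 'fin', 0);
    """)
    # Ordering by id as a tie-break is an assumption; it makes the result deterministic.
    ranked = con.execute(
        "SELECT id, name FROM players ORDER BY matches_won DESC, id"
    ).fetchall()
    con.close()
    # Odd ranks (1st, 3rd, ...) on the left, even ranks (2nd, 4th, ...) on the right.
    # With an odd number of players the last one is paired with (None, None),
    # which mirrors the LEFT JOIN in the first answer.
    return [
        first + (second or (None, None))
        for first, second in zip_longest(ranked[0::2], ranked[1::2])
    ]


print(pair_players())  # [(3, 'John', 1, 'bob'), (2, 'Paul', 4, 'Jim'), (5, 'hal', 6, 'fin')]
```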

How to count the number of joins between a user and several tables?

I'm using postgresql 9.3.9 and have a table users that looks like this:
user_id | email
----------------------------
1001 | hello@world.com
1030 | mel@hotmail.com
2333 | jess@gmail.com
2502 | peter@gmail.com
3000 | olivia@hotmail.com
4000 | sharon@gmail.com
4900 | lisa@gmail.com
I then have several tables that list what users are connected on various platforms and when they connected. Ie platform_a, platform_b, platform_c, etc.
platform_a may look like this:
user_id | created_at
----------------------------
1001 | 2015-04-30
2333 | 2015-05-15
3000 | 2014-02-15
platform_b may look like this:
user_id | created_at
----------------------------
1001 | 2015-06-30
2333 | 2015-07-02
4900 | 2015-07-03
platform_c may look like this:
user_id | created_at
----------------------------
1001 | 2015-08-16
1030 | 2015-07-03
3000 | 2015-09-01
4000 | 2015-09-01
I want the end result to look like this:
user_id | # of connections | latest created_at | connected to a | connected to b | connected to c
--------------------------------------------------------------------------------------------------
1001 | 3 | 2015-08-16 | yes | yes | yes
1030 | 1 | 2015-07-03 | no | no | yes
2333 | 2 | 2015-07-02 | yes | yes | no
2502 | 0 | | no | no | no
3000 | 2 | 2015-09-01 | yes | no | yes
4000 | 1 | 2015-09-01 | no | no | yes
4900 | 1 | 2015-07-03 | no | yes | no
How would I do this?
First, make a union of all your tables:
SELECT user_id, created_at, 1 AS a, 0 AS b, 0 AS c FROM tableA
UNION
SELECT user_id, created_at, 0 AS a, 1 AS b, 0 AS c FROM tableB
UNION
SELECT user_id, created_at, 0 AS a, 0 AS b, 1 AS c FROM tableC
then group the result from this subquery
SELECT user_id, COUNT(user_id), MAX(created_at), MAX(a), MAX(b), MAX(c)
FROM subquery_above
GROUP BY user_id
This won't give you the zero results, but you can achieve that with a LEFT JOIN on the user list.
select
user_id,
count(p),
max(created_at),
coalesce(sum((pl = 'a')::int), 0) connected_to_a,
coalesce(sum((pl = 'b')::int), 0) connected_to_b,
coalesce(sum((pl = 'c')::int), 0) connected_to_c
from users u
left join (
select *, 'a' pl from platform_a
union all
select *, 'b' pl from platform_b
union all
select *, 'c' pl from platform_c
) p
using (user_id)
group by 1;
user_id | count | max | connected_to_a | connected_to_b | connected_to_c
---------+-------+------------+----------------+----------------+----------------
1001 | 3 | 2015-08-16 | 1 | 1 | 1
1030 | 1 | 2015-07-03 | 0 | 0 | 1
2333 | 2 | 2015-07-02 | 1 | 1 | 0
2502 | 0 | | 0 | 0 | 0
3000 | 2 | 2015-09-01 | 1 | 0 | 1
4000 | 1 | 2015-09-01 | 0 | 0 | 1
4900 | 1 | 2015-07-03 | 0 | 1 | 0
(7 rows)
While you check for all users, it's typically fastest to aggregate before you join:
SELECT *
FROM (SELECT user_id FROM users) u -- subquery to clip other columns
LEFT JOIN (
SELECT user_id, count(*) AS connections, max(created_at) AS latest_created_at
, bool_or(pl = 'a') AS connected_to_a
, bool_or(pl = 'b') AS connected_to_b
, bool_or(pl = 'c') AS connected_to_c
FROM ( SELECT user_id, created_at, 'a'::"char" AS pl FROM platform_a
UNION ALL SELECT user_id, created_at, 'b' FROM platform_b
UNION ALL SELECT user_id, created_at, 'c' FROM platform_c
) p1
GROUP BY user_id
) p2 USING (user_id)
ORDER BY user_id;
Result is exactly as desired - except that connections is NULL instead of '0' in your example. Use COALESCE() in the outer SELECT if you need to convert that. I didn't, because SELECT * is so convenient.
If you are going to list all columns in the outer SELECT you can as well just use users instead of the subquery u to clip other columns.
bool_or() is the perfect aggregate function for the job.
There might be multiple links to one platform. This query still returns a single row per user.
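For anyone who wants to run the aggregate-before-join shape against the sample data, here is a rough sketch on Python's built-in sqlite3. SQLite has no bool_or(), so MAX(pl = 'a') stands in for it and yields 1/0 instead of true/false; that substitution and the COALESCE calls in the outer SELECT are assumptions of this sketch, not part of the answers above.

```python
import sqlite3


def connection_summary():
    """Return one row per user: connection count, latest date, and a/b/c flags."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE users (user_id INTEGER PRIMARY KEY);
        CREATE TABLE platform_a (user_id INTEGER, created_at TEXT);
        CREATE TABLE platform_b (user_id INTEGER, created_at TEXT);
        CREATE TABLE platform_c (user_id INTEGER, created_at TEXT);
        INSERT INTO users VALUES (1001), (1030), (2333), (2502), (3000), (4000), (4900);
        INSERT INTO platform_a VALUES (1001, '2015-04-30'), (2333, '2015-05-15'), (3000, '2014-02-15');
        INSERT INTO platform_b VALUES (1001, '2015-06-30'), (2333, '2015-07-02'), (4900, '2015-07-03');
        INSERT INTO platform_c VALUES (1001, '2015-08-16'), (1030, '2015-07-03'),
                                      (3000, '2015-09-01'), (4000, '2015-09-01');
    """)
    rows = con.execute("""
        SELECT u.user_id,
               COALESCE(p.connections, 0),
               p.latest_created_at,
               COALESCE(p.connected_to_a, 0),
               COALESCE(p.connected_to_b, 0),
               COALESCE(p.connected_to_c, 0)
        FROM users u
        LEFT JOIN (
            -- Aggregate the platform rows first, then join the small result to users.
            SELECT user_id,
                   COUNT(*)        AS connections,
                   MAX(created_at) AS latest_created_at,
                   MAX(pl = 'a')   AS connected_to_a,
                   MAX(pl = 'b')   AS connected_to_b,
                   MAX(pl = 'c')   AS connected_to_c
            FROM (SELECT user_id, created_at, 'a' AS pl FROM platform_a
                  UNION ALL SELECT user_id, created_at, 'b' FROM platform_b
                  UNION ALL SELECT user_id, created_at, 'c' FROM platform_c) p1
            GROUP BY user_id
        ) p ON p.user_id = u.user_id
        ORDER BY u.user_id
    """).fetchall()
    con.close()
    return rows


for row in connection_summary():
    print(row)
```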

How to list the train operators that use the second oldest trains (PostgreSQL)

train_operators:
| train_operator_id | name |
------------------------------
| 1 | Virgin |
| 2 | First |
journeys:
| journey_id | train_operator | train_type |
--------------------------------------------
| 1 | 2 | 2 |
| 2 | 2 | 1 |
| 3 | 1 | 3 |
| 4 | 1 | 2 |
train_types:
| train_type_id | date_made |
------------------------------
| 1 | 1999-02-15 |
| 2 | 2001-03-11 |
| 3 | 2000-12-05 |
How would you write a query to find all the train operators that use the second oldest type of train?
With the given schema, the query should return just Virgin, since it is the only train operator that uses the second oldest train type.
Try this:
select distinct train_operator from journeys
inner join (Select * from train_types order by date_made LIMIT 1 OFFSET 1) sectrain
on sectrain.train_type_id = journeys.train_type
You're into the UK Rail Network are you? I used to work for Funkwerk IT, who in turn used to provide the timetable planning software for Network Rail...
It can be pretty easy using the power of window functions in pg
SELECT DISTINCT train_operator_id,
name
FROM (SELECT t.train_operator_id,
t.name,
Dense_rank() OVER (ORDER BY tt.date_made) AS rank -- dense_rank: several journeys on the oldest type must not skip rank 2
FROM train_operators AS t
JOIN journeys AS j
ON j.train_operator = t.train_operator_id
JOIN train_types AS tt
ON tt.train_type_id = j.train_type) AS q
WHERE rank = 2;
http://sqlfiddle.com/#!12/98816/8
select op.name
from
train_operators op -- "to" is a reserved word in PostgreSQL, so use another alias
inner join
journeys j on op.train_operator_id = j.train_operator
where
j.train_type = (
select train_type_id
from train_types
order by date_made
limit 1 offset 1
)
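The LIMIT 1 OFFSET 1 approach can be checked against the sample schema with a short sketch on Python's built-in sqlite3. The alias op and the added DISTINCT are choices made for this illustration; DISTINCT only matters when an operator runs several journeys on the second oldest train type.

```python
import sqlite3


def operators_using_second_oldest():
    """Return the names of operators with a journey on the second oldest train type."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE train_operators (train_operator_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE journeys (journey_id INTEGER PRIMARY KEY, train_operator INTEGER, train_type INTEGER);
        CREATE TABLE train_types (train_type_id INTEGER PRIMARY KEY, date_made TEXT);
        INSERT INTO train_operators VALUES (1, 'Virgin'), (2, 'First');
        INSERT INTO journeys VALUES (1, 2, 2), (2, 2, 1), (3, 1, 3), (4, 1, 2);
        INSERT INTO train_types VALUES (1, '1999-02-15'), (2, '2001-03-11'), (3, '2000-12-05');
    """)
    rows = con.execute("""
        SELECT DISTINCT op.name
        FROM train_operators op
        JOIN journeys j ON op.train_operator_id = j.train_operator
        WHERE j.train_type = (
            -- Skip the oldest type and take the next one.
            SELECT train_type_id
            FROM train_types
            ORDER BY date_made
            LIMIT 1 OFFSET 1
        )
        ORDER BY op.name
    """).fetchall()
    con.close()
    return [name for (name,) in rows]


print(operators_using_second_oldest())  # ['Virgin']
```

Note that OFFSET 1 picks the second row, so two train types sharing the oldest date_made would make this return one of those two rather than the next distinct date.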