How to compute frequency/count of concurrent events by combination in postgresql? - postgresql

I am looking for a way to identify event names that co-occur, i.e., to correlate event names with the same start (startts) and end (endts) times. The events are exactly concurrent (partial overlap does not occur in this database, which makes the conditional criterion a bit simpler to satisfy).
Toy table:
+------+---------+-------+
| name | startts | endts |
+------+---------+-------+
| A    | 02:20   | 02:23 |
| A    | 02:23   | 02:25 |
| A    | 02:27   | 02:28 |
| B    | 02:20   | 02:23 |
| B    | 02:23   | 02:25 |
| B    | 02:25   | 02:27 |
| C    | 02:27   | 02:28 |
| D    | 02:27   | 02:28 |
| D    | 02:28   | 02:31 |
| E    | 02:27   | 02:28 |
| E    | 02:29   | 02:31 |
+------+---------+-------+
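For reference, a minimal setup that reproduces the toy data (a sketch; the column types are assumed, and the real table may well use full timestamps):
CREATE TABLE mytable (
    name    text,
    startts time,   -- assumed type; adjust to the real schema
    endts   time    -- assumed type; adjust to the real schema
);
INSERT INTO mytable VALUES
    ('A', '02:20', '02:23'), ('A', '02:23', '02:25'), ('A', '02:27', '02:28'),
    ('B', '02:20', '02:23'), ('B', '02:23', '02:25'), ('B', '02:25', '02:27'),
    ('C', '02:27', '02:28'),
    ('D', '02:27', '02:28'), ('D', '02:28', '02:31'),
    ('E', '02:27', '02:28'), ('E', '02:29', '02:31');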
Ideal output:
+-------------+-------+
| combination | count |
+-------------+-------+
| AB          | 2     |
| AC          | 1     |
| AE          | 1     |
| AD          | 1     |
| BC          | 0     |
| BD          | 0     |
| BE          | 0     |
| CE          | 0     |
+-------------+-------+
Naturally, I would have tried a loop but I recognize PostgreSQL is not optimal for this.
What I've tried is generating a temporary table of distinct (name, startts, endts) combinations and then left joining the table to itself (selecting name).
User @GMB provided the following (modified) solution; however, the performance is not satisfactory given the size of the database (even running the query on a ten-minute time window never completes). For context, there are about 300-400 unique names, so roughly 80,200 combinations (if my math checks out). Order within a pair is not important.
@GMB's attempt:
I understand this as a self-join, aggregation, and a conditional count of matching intervals:
select t1.name as name1, t2.name as name2,
       sum(case when t1.startts = t2.startts and t1.endts = t2.endts then 1 else 0 end) as cnt
from mytable t1
inner join mytable t2 on t2.name > t1.name
group by t1.name, t2.name
order by t1.name, t2.name;
Demo on DB Fiddle:
name1 | name2 | cnt
:---- | :---- | --:
A | B | 2
A | C | 1
A | D | 1
A | E | 1
B | C | 0
B | D | 0
B | E | 0
C | D | 1
C | E | 1
D | E | 1
@GMB notes that, if you are looking for a count of overlapping intervals instead, all you have to do is change the sum() to:
sum(t1.startts <= t2.endts and t1.endts >= t2.startts) cnt
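Note that summing a bare boolean only works on MySQL (which is what the fiddle runs); in PostgreSQL, sum() does not accept booleans, so a cast or CASE expression is needed, e.g.:
sum((t1.startts <= t2.endts and t1.endts >= t2.startts)::int) cnt  -- PostgreSQL: cast boolean to int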
Version = PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.19097
Thank you.

Consider the following in MySQL (which is what your DB Fiddle actually runs):
SELECT name, COUNT(*)
FROM (
    SELECT group_concat(name ORDER BY name) AS name
    FROM mytable
    GROUP BY startts, endts
) AS names
GROUP BY name
ORDER BY name;
The equivalent in PostgreSQL (note that string_agg, unlike group_concat, requires an explicit delimiter):
SELECT name, COUNT(*)
FROM (
    SELECT string_agg(name, ',' ORDER BY name) AS name
    FROM mytable
    GROUP BY startts, endts
) AS names
GROUP BY name
ORDER BY name;
First, you create a list of concurrent events (in the subquery), and then you count them.
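Given the performance problem described in the question, a possible refinement (a sketch, assuming stock PostgreSQL; Redshift does not support arrays/unnest) is to group by the interval first and only then expand each group into name pairs, so no work is spent on pairs that never co-occur:
-- Aggregate names per identical (startts, endts), then expand each
-- group's name list into ordered pairs; non-co-occurring pairs are
-- simply absent from the output (no zero-count rows are emitted).
WITH grouped AS (
    SELECT startts, endts, array_agg(name ORDER BY name) AS names
    FROM mytable
    GROUP BY startts, endts
)
SELECT n1 AS name1, n2 AS name2, count(*) AS cnt
FROM grouped g
CROSS JOIN LATERAL unnest(g.names) AS n1
CROSS JOIN LATERAL unnest(g.names) AS n2
WHERE n1 < n2
GROUP BY n1, n2
ORDER BY n1, n2;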

Related

Make sure every distinct value of Column1 has a row with every distinct value of Column2, by populating a table with 0s - postgresql

Here's a crude example I've made up to illustrate what I want to achieve:
table1:
| Shop | Product | QuantityInStock |
| a    | Prod1   | 13              |
| a    | Prod3   | 13              |
| b    | Prod2   | 13              |
| b    | Prod3   | 13              |
| b    | Prod4   | 13              |
table1 becomes:
| Shop | Product | QuantityInStock |
| a    | Prod1   | 13              |
| a    | Prod2   | 0               | -- new
| a    | Prod3   | 13              |
| a    | Prod4   | 0               | -- new
| b    | Prod1   | 0               | -- new
| b    | Prod2   | 13              |
| b    | Prod3   | 13              |
| b    | Prod4   | 13              |
In this example, I want every Shop/Product combination represented: every Shop {a,b} should have a row for every Product {Prod1, Prod2, Prod3, Prod4}.
QuantityInStock=13 has no significance; I just wanted a placeholder number :)
Use a cross join approach (the same pattern as a calendar table):
SELECT s.Shop, p.Product, COALESCE(t1.QuantityInStock, 0) AS QuantityInStock
FROM (SELECT DISTINCT Shop FROM table1) s
CROSS JOIN (SELECT DISTINCT Product FROM table1) p
LEFT JOIN table1 t1
ON t1.Shop = s.Shop AND
t1.Product = p.Product
ORDER BY
s.Shop,
p.Product;
The idea here is to generate an intermediate table containing all shop/product combinations via a cross join. Then, we left join this to table1. Any shop/product combination which does not have a match in the actual table is assigned a zero stock quantity.
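If you want to actually populate table1 with the 0 rows (as the title suggests) rather than just select them, a sketch along the same lines:
-- Insert only the shop/product combinations missing from table1.
INSERT INTO table1 (Shop, Product, QuantityInStock)
SELECT s.Shop, p.Product, 0
FROM (SELECT DISTINCT Shop FROM table1) s
CROSS JOIN (SELECT DISTINCT Product FROM table1) p
WHERE NOT EXISTS (
    SELECT 1
    FROM table1 t1
    WHERE t1.Shop = s.Shop
      AND t1.Product = p.Product
);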

Accomplishing what I need without a CROSS JOIN

I have a query that pulls from a table. With this table, I would like to build a query that allows me to make projections into the future.
SELECT
b.date,
a.id,
SUM(CASE WHEN a.date = b.date THEN a.sales ELSE 0 END) sales,
SUM(CASE WHEN a.date = b.date THEN a.revenue ELSE 0 END) revenue
FROM
table_a a
CROSS JOIN table_b b
WHERE a.date BETWEEN '2018-10-31' AND '2018-11-04'
GROUP BY 1,2
table_b is a table with literally only one column that contains dates going deep into the future. This returns results like this:
+----------+--------+-------+---------+
| date | id | sales | revenue |
+----------+--------+-------+---------+
| 11/4/18 | 113972 | 0 | 0 |
| 11/4/18 | 111218 | 0 | 0 |
| 11/3/18 | 111218 | 0 | 0 |
| 11/3/18 | 113972 | 0 | 0 |
| 11/2/18 | 111218 | 0 | 0 |
| 11/2/18 | 113972 | 0 | 0 |
| 11/1/18 | 111218 | 89 | 2405.77 |
| 11/1/18 | 113972 | 265 | 3000.39 |
| 10/31/18 | 111218 | 64 | 2957.71 |
| 10/31/18 | 113972 | 120 | 5650.91 |
+----------+--------+-------+---------+
Now there's more to the query after this, where I get into the projections and whatnot, but for the purposes of this question this is all you need, as it's where the CROSS JOIN lives.
How can I recreate these results without using a CROSS JOIN? In reality, this query runs over a much larger date range with far more data, takes hours and a great deal of compute, and I know CROSS JOINs should be avoided if possible.
Use the table of all dates as the "from" table and left join the data; this still returns each date.
SELECT
d.date
, t.id
, COALESCE(SUM(t.sales),0) sales
, COALESCE(SUM(t.revenue),0) revenue
FROM all_dates d
LEFT JOIN table_data t
ON d.date = t.date
WHERE d.date BETWEEN '2018-10-31' AND '2018-11-04'
GROUP BY
d.date
, t.id
Another alternative (to avoid the cross join) could be to use generate_series, but for this - in Redshift - I suggest this former answer. I'm a fan of generate_series, but if you already have a date table I would probably stay with that (though this is based on what little I know about your query).
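For reference, a sketch of the generate_series variant (plain PostgreSQL, using the same table and column names as the answer above; it will not run as-is on Redshift):
SELECT d.dt::date AS date,
       t.id,
       COALESCE(SUM(t.sales), 0)   AS sales,
       COALESCE(SUM(t.revenue), 0) AS revenue
FROM generate_series('2018-10-31'::date, '2018-11-04'::date, interval '1 day') AS d(dt)
LEFT JOIN table_data t ON t.date = d.dt
GROUP BY d.dt, t.id
ORDER BY d.dt, t.id;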

Join tables and count instances of different values

user
---------------------------
| ID | Name |
---------------------------
| 1 | Jim Rice |
| 2 | Wade Boggs |
| 3 | Bill Buckner |
---------------------------
at_bats
----------------------
| ID | User | Bases |
----------------------
| 1 | 1 | 2 |
| 2 | 2 | 1 |
| 3 | 1 | 2 |
| 4 | 3 | 0 |
| 5 | 1 | 3 |
----------------------
What I want my query to do is get the count of the different base values in a join table like:
count_of_hits
---------------------
| ID | 1B | 2B | 3B |
---------------------
| 1 | 0 | 2 | 1 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 |
---------------------
I had a query where I was able to get the bases individually, but not all of them unless I did some complicated joins, and I'd imagine there is a better way. This was the foundational query, though (note that user is a reserved word in PostgreSQL and has to be quoted):
SELECT id, COUNT(ab.*)
FROM "user"
LEFT OUTER JOIN (SELECT * FROM at_bats WHERE at_bats.bases = 2) ab
    ON ab."user" = "user".id
GROUP BY id
PostgreSQL 9.4+ provides a much cleaner way to do this:
SELECT
    users,
    count(*) FILTER (WHERE bases = 1) AS B1,
    count(*) FILTER (WHERE bases = 2) AS B2,
    count(*) FILTER (WHERE bases = 3) AS B3
FROM at_bats
GROUP BY users
ORDER BY users;
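Against the sample at_bats rows (user 1 has bases {2,2,3}, user 2 has {1}, user 3 has {0}), this should produce:
| users | B1 | B2 | B3 |
| 1     | 0  | 2  | 1  |
| 2     | 1  | 0  | 0  |
| 3     | 0  | 0  | 0  |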
I think the following query would solve your problem. However, I am not sure if it is the best approach:
select distinct a.users, coalesce(b.B1, 0) As B1, coalesce(c.B2, 0) As B2 ,coalesce(d.B3, 0) As B3
FROM at_bats a
LEFT JOIN (SELECT users, count(bases) As B1 FROM at_bats WHERE bases = 1 GROUP BY users) as b ON a.users=b.users
LEFT JOIN (SELECT users, count(bases) As B2 FROM at_bats WHERE bases = 2 GROUP BY users) as c ON a.users=c.users
LEFT JOIN (SELECT users, count(bases) As B3 FROM at_bats WHERE bases = 3 GROUP BY users) as d ON a.users=d.users
Order by users
the coalesce() function is just to replace the nulls with zeros. I hope this query helps you :D
UPDATE 1
I found a better way to do it; look at the following:
SELECT users,
count(case bases when 1 then 1 else null end) As B1,
count(case bases when 2 then 1 else null end) As B2,
count(case bases when 3 then 1 else null end) As B3
FROM at_bats
GROUP BY users
ORDER BY users;
It is more efficient than my first query. You can check the performance by using EXPLAIN ANALYZE before each query.
Thanks to Guffa from this post: https://stackoverflow.com/a/1400115/4453190
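One caveat with all of the queries that read only at_bats: a player with no at_bats rows at all would not appear in the result. A sketch that keeps such players by driving the query from the user table instead (the FILTER conditions are never true for the NULLs a LEFT JOIN produces, so those players count as 0):
-- "user" must be quoted: it is a reserved word in PostgreSQL.
SELECT u.id,
       count(*) FILTER (WHERE ab.bases = 1) AS B1,
       count(*) FILTER (WHERE ab.bases = 2) AS B2,
       count(*) FILTER (WHERE ab.bases = 3) AS B3
FROM "user" u
LEFT JOIN at_bats ab ON ab."user" = u.id
GROUP BY u.id
ORDER BY u.id;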

SQL - group by - limit clause - postgresql

I have a table which has two columns C1 and C2.
C1 has an integer data type and C2 has text.
Table looks like this.
---C1--- ---C2---
   1    |   a   |
   1    |   b   |
   1    |   c   |
   1    |   d   |
   1    |   e   |
   1    |   f   |
   1    |   g   |
   2    |   h   |
   2    |   i   |
   2    |   j   |
   2    |   k   |
   2    |   l   |
   2    |   m   |
   2    |   n   |
------------------
My question: I want an SQL query which does a GROUP BY on column C1, but in chunks of at most 3 values, so the result looks like this:
------------------
 1 | a,b,c |
 1 | d,e,f |
 1 | g     |
 2 | h,i,j |
 2 | k,l,m |
 2 | n     |
------------------
Is it possible with plain SQL?
Note: I do not want to write a stored procedure or function...
You can use a common table expression to number the rows into buckets of three, and then use STRING_AGG to join each bucket into a comma-separated list:
WITH cte AS (
    SELECT *, (ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY C2) - 1) / 3 AS rn
    FROM mytable
)
SELECT C1, STRING_AGG(C2, ',') AS ALL_C2
FROM cte
GROUP BY C1, rn
ORDER BY C1
An SQLfiddle to test with.
A short explanation of the common table expression:
ROW_NUMBER() OVER (...) numbers the results from 1 to n for each value of C1. We then subtract 1 and divide by 3 (integer division) to get the sequence 0,0,0,1,1,1,2,2,2,... and group by that value in the outer query to get at most 3 results per group.
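To see the bucketing expression at work on the sample data, you can run just the numbering step:
SELECT C1, C2,
       (ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY C2) - 1) / 3 AS rn
FROM mytable;
-- For C1 = 1: a,b,c get rn 0; d,e,f get rn 1; g gets rn 2.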
Apart from Joachim Isaksson's answer, you can also try this method:
SELECT C1, string_agg(C2, ',') AS c2
FROM (
    SELECT *, (ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY C2) - 1) / 3 AS row_num
    FROM atable
) t
GROUP BY C1, row_num
ORDER BY c2

How to list the train operators that use the second oldest trains (PostgreSQL)

train_operators:
| train_operator_id | name |
------------------------------
| 1 | Virgin |
| 2 | First |
journeys:
| journey_id | train_operator | train_type |
--------------------------------------------
| 1 | 2 | 2 |
| 2 | 2 | 1 |
| 3 | 1 | 3 |
| 4 | 1 | 2 |
train_types:
| train_type_id | date_made |
------------------------------
| 1 | 1999-02-15 |
| 2 | 2001-03-11 |
| 3 | 2000-12-05 |
How would you write a query to find all the train operators that use the second oldest type of train?
With the given schema, the query should return just Virgin, since it is the only train operator that uses the second oldest train type.
Try this:
select distinct train_operator from journeys
inner join (Select * from train_types order by date_made LIMIT 1 OFFSET 1) sectrain
on sectrain.train_type_id = journeys.train_type
You're into the UK Rail Network are you? I used to work for Funkwerk IT, who in turn used to provide the timetable planning software for Network Rail...
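One caveat on the LIMIT 1 OFFSET 1 trick: if two train types shared the oldest date_made, it would return one of the tied oldest types rather than the second oldest. A variant (a sketch) that offsets over distinct dates instead, assuming "second oldest" means the second distinct date_made:
select distinct j.train_operator
from journeys j
inner join train_types tt on tt.train_type_id = j.train_type
where tt.date_made = (
    select distinct date_made
    from train_types
    order by date_made
    limit 1 offset 1
);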
It can be pretty easy using the power of window functions in pg:
SELECT DISTINCT train_operator_id,
name
FROM (SELECT t.train_operator_id,
t.name,
Rank() OVER (ORDER BY tt.date_made) AS rank
FROM train_operators AS t
JOIN journeys AS j
ON j.train_operator = t.train_operator_id
JOIN train_types AS tt
ON tt.train_type_id = j.train_type) AS q
WHERE rank = 2;
http://sqlfiddle.com/#!12/98816/8
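A related caveat: rank() here numbers the joined journey rows, so if several journeys used the oldest train type, rank 2 would never occur. A variant (a sketch) that ranks the train types themselves with dense_rank() avoids that gap:
SELECT DISTINCT t.train_operator_id, t.name
FROM train_operators t
JOIN journeys j ON j.train_operator = t.train_operator_id
JOIN (
    SELECT train_type_id,
           DENSE_RANK() OVER (ORDER BY date_made) AS rnk
    FROM train_types
) tt ON tt.train_type_id = j.train_type
WHERE tt.rnk = 2;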
select tro.name
from
    train_operators tro
inner join
    journeys j on tro.train_operator_id = j.train_operator
where
    j.train_type = (
        select train_type_id
        from train_types
        order by date_made
        limit 1 offset 1
    )