PostgreSQL - GROUP BY - complex query

Here's what my table TheTable looks like:
 ColA | ColB
------+------
 abc  | 2005
 abc  | 2010
 def  | 2009
 def  | 2010
 def  | 2011
 abc  | 2012
And I want to write a query to return this result:
 ColA | ColB | ColC
------+------+------
 abc  | 2005 | 2010
 def  | 2009 | 2011
 abc  | 2012 | -

I believe you can get the results you want using window functions and a nested subquery:
select "ColA",
       max(case when parity = 0 then "ColB" end) as "ColB",
       max(case when parity = 1 then "ColB" end) as "ColC"
from (
    select *,
           (rank() over (partition by "ColA" order by "ColB") - 1) / 2 as result_row,
           (rank() over (partition by "ColA" order by "ColB") - 1) % 2 as parity
    from TheTable
) t
group by "ColA", result_row
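As a sanity check, here is the pairing logic run end-to-end (a sketch using Python's sqlite3, whose window functions behave like PostgreSQL's here; the table and sample data are taken from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('create table TheTable ("ColA" text, "ColB" integer)')
conn.executemany("insert into TheTable values (?, ?)",
                 [("abc", 2005), ("abc", 2010), ("def", 2009),
                  ("def", 2010), ("def", 2011), ("abc", 2012)])

# Number the rows per ColA (0-based), then fold each consecutive pair
# into one output row: even positions become ColB, odd positions ColC.
rows = conn.execute("""
    select "ColA",
           max(case when parity = 0 then "ColB" end) as "ColB",
           max(case when parity = 1 then "ColB" end) as "ColC"
    from (
        select *,
               (rank() over (partition by "ColA" order by "ColB") - 1) / 2 as result_row,
               (rank() over (partition by "ColA" order by "ColB") - 1) % 2 as parity
        from TheTable
    ) t
    group by "ColA", result_row
""").fetchall()
for row in rows:
    print(row)
```

Note that on this data the query pairs def's 2009 with 2010 and leaves 2011 unpaired, which is what strict pairing-by-position gives; the def row in the asker's expected output reads more like a min/max range, so check which behavior you actually want.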


Distinct Count Dates by timeframe

I am trying to find the daily count of frequent visitors in a very large dataset. Frequent visitors in this case are visitor IDs seen on 2 distinct days in a rolling 3-day period.
My data set looks like the below:
ID | Date       | Location | State | Brand
---+------------+----------+-------+------
 1 | 2020-01-02 | A        | CA    | XYZ
 1 | 2020-01-03 | A        | CA    | BCA
 1 | 2020-01-04 | A        | CA    | XYZ
 1 | 2020-01-06 | A        | CA    | YQR
 1 | 2020-01-06 | A        | WA    | XYZ
 2 | 2020-01-02 | A        | CA    | XYZ
 2 | 2020-01-05 | A        | CA    | XYZ
This is the result I am going for. The count in the Visits column is the number of distinct days from the Date column for each ID, within the current day and the 2 days before it. So for ID 1 on 2020-01-05, there were visits on the 3rd and 4th, so the count is 2.
Date | ID | Visits | Frequent Prior 3 Days
2020-01-01 |Null| Null | Null
2020-01-02 | 1 | 1 | No
2020-01-02 | 2 | 1 | No
2020-01-03 | 1 | 2 | Yes
2020-01-03 | 2 | 1 | No
2020-01-04 | 1 | 3 | Yes
2020-01-04 | 2 | 1 | No
2020-01-05 | 1 | 2 | Yes
2020-01-05 | 2 | 1 | No
2020-01-06 | 1 | 2 | Yes
2020-01-06 | 2 | 1 | No
2020-01-07 | 1 | 1 | No
2020-01-07 | 2 | 1 | No
2020-01-08 | 1 | 1 | No
2020-01-09 | 1 | null | Null
I originally tried to use the following expression to get the result for the visits column, but ended up with 3 in every subsequent row once the count first reached 3 for a given ID:
count(ID) over (partition by ID order by Date asc rows between 3 preceding and current row) as visits
I've scoured the forum, but every somewhat similar question seems to involve counting the values rather than the dates, and I haven't been able to figure out how to tweak those answers to get what I need. Any help is much appreciated.
You can aggregate the dataset by user and date, then use window functions with a range frame to look back over the three preceding days.
You did not tell us which database you are running, and not all databases support range frames in window functions or share the same syntax for interval literals. In standard SQL, you would go:
select
    id,
    date,
    count(*) cnt_visits,
    case
        when sum(count(*)) over(
            partition by id
            order by date
            range between interval '3' day preceding and current row
        ) >= 2
        then 'Yes'
        else 'No'
    end is_frequent_visitor
from mytable
group by id, date
On the other hand, if you want a record for every user and every day (even when there is no visit), then it is a bit different. You can generate the full grid of users and dates first, then bring in the table with a left join:
select
    i.id,
    d.date,
    count(t.id) cnt_visits,
    case
        when sum(count(t.id)) over(
            partition by i.id
            order by d.date
            range between interval '3' day preceding and current row
        ) >= 2
        then 'Yes'
        else 'No'
    end is_frequent_visitor
from (select distinct id from mytable) i
cross join (select distinct date from mytable) d
left join mytable t
    on t.date = d.date
    and t.id = i.id
group by i.id, d.date
I would be inclined to approach this by expanding out the days and visitors using a cross join and then using window functions. Assuming you have all dates in the data:
select i.id, d.date,
count(t.id) over (partition by i.id
order by d.date
rows between 2 preceding and current row
) as cnt_visits,
(case when count(t.id) over (partition by i.id
order by d.date
rows between 2 preceding and current row
) >= 2
then 'Yes' else 'No'
end) as is_frequent_visitor
from (select distinct id from t) i cross join
(select distinct date from t) d left join
(select distinct id, date from t) t
on t.date = d.date and
t.id = i.id;
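Here is the cross-join approach run end-to-end (a sketch using Python's sqlite3; the joined derived table is renamed to v to avoid shadowing the base table, and dates are kept as ISO-formatted text so they sort correctly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer, date text)")
conn.executemany("insert into t values (?, ?)",
                 [(1, "2020-01-02"), (1, "2020-01-03"), (1, "2020-01-04"),
                  (1, "2020-01-06"), (1, "2020-01-06"),
                  (2, "2020-01-02"), (2, "2020-01-05")])

# Expand every (id, date) combination, left join the deduplicated visits,
# and count non-null hits over the current day plus the 2 preceding days.
rows = conn.execute("""
    select i.id, d.date,
           count(v.id) over (partition by i.id order by d.date
                             rows between 2 preceding and current row) as cnt_visits
    from (select distinct id from t) i
    cross join (select distinct date from t) d
    left join (select distinct id, date from t) v
           on v.date = d.date and v.id = i.id
    order by d.date, i.id
""").fetchall()
for id_, day, cnt in rows:
    print(id_, day, cnt, "Yes" if cnt >= 2 else "No")
```

Since only dates that appear somewhere in the table are generated, a gap in the calendar would still throw the ROWS frame off; generating the dates from a calendar table (or generate_series in PostgreSQL) avoids that.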

Aggregate data per week

I'd like to aggregate data weekly according to a date and a value.
I have a table like this:
create table test (t_val integer, t_date date);
insert into test values (1,'2017-02-09'), (2,'2017-02-10'), (4,'2017-02-16');
This is the query:
WITH date_range AS (
SELECT MIN(t_date) as start_date,
MAX(t_date) as end_date
FROM test
)
SELECT
date_part('year', f.date) as date_year,
date_part('week', f.date) as date_week,
f.val
FROM generate_series( (SELECT start_date FROM date_range), (SELECT end_date FROM date_range), '7 day') d(date)
LEFT JOIN
(
SELECT t_val as val, t_date as date
FROM test
WHERE t_date >= (SELECT start_date FROM date_range)
AND t_date <= (SELECT end_date FROM date_range)
GROUP BY t_val, t_date
) f
ON f.date BETWEEN d.date AND (d.date + interval '7 day')
GROUP BY date_part('year', f.date),date_part('week', f.date), f.val;
I expect a result like this:
| Year | Week | Val |
| 2017 |    6 |   3 |
| 2017 |    7 |   4 |
But the query returns:
| Year | Week | Val |
| 2017 |    6 |   1 |
| 2017 |    6 |   2 |
| 2017 |    7 |   4 |
What is missing?
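What appears to be missing is the final aggregation: the outer query selects f.val raw (and groups by it), so each distinct value becomes its own row instead of being summed per week. The expected numbers do check out when the values are summed per (year, week); a minimal verification in plain Python, using isocalendar() in place of date_part('week'):

```python
from collections import defaultdict
from datetime import date

# Sample rows from the test table: (t_val, t_date)
rows = [(1, date(2017, 2, 9)), (2, date(2017, 2, 10)), (4, date(2017, 2, 16))]

# Sum t_val per ISO (year, week), i.e. the effect of
# SUM(f.val) ... GROUP BY date_part('year', ...), date_part('week', ...)
totals = defaultdict(int)
for val, d in rows:
    year, week, _ = d.isocalendar()
    totals[(year, week)] += val

print(sorted(totals.items()))  # -> [((2017, 6), 3), ((2017, 7), 4)]
```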

How can I get the sum(value) on the latest gather_time per group(name,col1) in PostgreSQL?

I got a good answer to a similar issue in the thread below, but I need one more solution for a different data set.
How to get the latest 2 rows ( PostgreSQL )
The data set contains historical data, and I just want to get sum(value) for each group on its latest gather_time.
The final result should be as following:
name | col1 | gather_time | sum
-------+------+---------------------+-----
first | 100 | 2016-01-01 23:12:49 | 6
first | 200 | 2016-01-01 23:11:13 | 4
However, with the query below I can only see the data for one group (first-100), meaning there is no row for the second group (first-200).
The thing is that I need one row per group, and the number of groups can vary.
select name,col1,gather_time,sum(value)
from testtable
group by name,col1,gather_time
order by gather_time desc
limit 2;
name | col1 | gather_time | sum
-------+------+---------------------+-----
first | 100 | 2016-01-01 23:12:49 | 6
first | 100 | 2016-01-01 23:11:19 | 6
(2 rows)
Can you advise me on how to accomplish this requirement?
Data set
create table testtable
(
name varchar(30),
col1 varchar(30),
col2 varchar(30),
gather_time timestamp,
value integer
);
insert into testtable values('first','100','q1','2016-01-01 23:11:19',2);
insert into testtable values('first','100','q2','2016-01-01 23:11:19',2);
insert into testtable values('first','100','q3','2016-01-01 23:11:19',2);
insert into testtable values('first','200','t1','2016-01-01 23:11:13',2);
insert into testtable values('first','200','t2','2016-01-01 23:11:13',2);
insert into testtable values('first','100','q1','2016-01-01 23:11:11',2);
insert into testtable values('first','100','q1','2016-01-01 23:12:49',2);
insert into testtable values('first','100','q2','2016-01-01 23:12:49',2);
insert into testtable values('first','100','q3','2016-01-01 23:12:49',2);
select *
from testtable
order by name,col1,gather_time;
name | col1 | col2 | gather_time | value
-------+------+------+---------------------+-------
first | 100 | q1 | 2016-01-01 23:11:11 | 2
first | 100 | q2 | 2016-01-01 23:11:19 | 2
first | 100 | q3 | 2016-01-01 23:11:19 | 2
first | 100 | q1 | 2016-01-01 23:11:19 | 2
first | 100 | q3 | 2016-01-01 23:12:49 | 2
first | 100 | q1 | 2016-01-01 23:12:49 | 2
first | 100 | q2 | 2016-01-01 23:12:49 | 2
first | 200 | t2 | 2016-01-01 23:11:13 | 2
first | 200 | t1 | 2016-01-01 23:11:13 | 2
One option is to join your original table to a table containing only the records with the latest gather_time for each name, col1 group. Then you can take the sum of the value column for each group to get the result set you want.
SELECT t1.name, t1.col1, t2.maxTime AS gather_time, SUM(t1.value) AS sum
FROM testtable t1
INNER JOIN
(
    SELECT name, col1, MAX(gather_time) AS maxTime
    FROM testtable
    GROUP BY name, col1
) t2
ON t1.name = t2.name AND t1.col1 = t2.col1 AND t1.gather_time = t2.maxTime
GROUP BY t1.name, t1.col1, t2.maxTime
If you wanted to use a subquery in the WHERE clause, as you attempted in your OP, to restrict to only records with the latest gather_time then you could try the following:
SELECT name, col1, gather_time, SUM(value) AS sum
FROM testtable t1
WHERE gather_time =
(
    SELECT MAX(gather_time)
    FROM testtable t2
    WHERE t1.name = t2.name AND t1.col1 = t2.col1
)
GROUP BY name, col1, gather_time
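A runnable check of the correlated-subquery variant (a sketch using Python's sqlite3, with the data set from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create table testtable
               (name text, col1 text, col2 text, gather_time text, value integer)""")
conn.executemany("insert into testtable values (?, ?, ?, ?, ?)", [
    ("first", "100", "q1", "2016-01-01 23:11:19", 2),
    ("first", "100", "q2", "2016-01-01 23:11:19", 2),
    ("first", "100", "q3", "2016-01-01 23:11:19", 2),
    ("first", "200", "t1", "2016-01-01 23:11:13", 2),
    ("first", "200", "t2", "2016-01-01 23:11:13", 2),
    ("first", "100", "q1", "2016-01-01 23:11:11", 2),
    ("first", "100", "q1", "2016-01-01 23:12:49", 2),
    ("first", "100", "q2", "2016-01-01 23:12:49", 2),
    ("first", "100", "q3", "2016-01-01 23:12:49", 2),
])

# Keep only the rows at each (name, col1) group's latest gather_time, then sum.
rows = conn.execute("""
    select name, col1, gather_time, sum(value)
    from testtable t1
    where gather_time = (select max(gather_time)
                         from testtable t2
                         where t2.name = t1.name and t2.col1 = t1.col1)
    group by name, col1, gather_time
    order by col1
""").fetchall()
print(rows)
```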

How can I get the position of an element in the table?

I have this query:
Select * from mytable order by "date"
And result:
date | item_id | user_id | some_data
------------------------------------------
2015-01-01 | 1 | 1 | null
2015-01-01 | 1 | 1 | null
2015-01-02 | 1 | 1 | null
2015-01-03 | 1 | 1 | null
2015-01-03 | 1 | 2 | null
2015-01-04 | 1 | 1 | null
2015-01-05 | 1 | 2 | null
And I want to get the position of the first row where user_id = 2. In this example it would be 5. How can I do that?
select pos_overall
from (
select user_id,
row_number() over (order by "date") as pos_overall,
row_number() over (partition by user_id order by "date") as user_pos
from mytable
) t
where user_id = 2
and user_pos = 1
You can use the row_number() function to number the rows in order of date, user_id and then select the minimum value:
select min(rn)
from (
select
user_id, row_number() over (order by date, user_id) as rn
from mytable
) x
where user_id = 2;
If the item_id can change you might want to include that in the order by clause for the row_number function in the derived table.
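The second approach is easy to verify end-to-end (a sketch using Python's sqlite3, with the rows from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('create table mytable ("date" text, item_id integer, user_id integer, some_data text)')
conn.executemany("insert into mytable values (?, ?, ?, ?)", [
    ("2015-01-01", 1, 1, None), ("2015-01-01", 1, 1, None),
    ("2015-01-02", 1, 1, None), ("2015-01-03", 1, 1, None),
    ("2015-01-03", 1, 2, None), ("2015-01-04", 1, 1, None),
    ("2015-01-05", 1, 2, None),
])

# Number every row by date (user_id as a tie-breaker), then take the
# smallest overall position among user 2's rows.
pos = conn.execute("""
    select min(rn)
    from (
        select user_id,
               row_number() over (order by "date", user_id) as rn
        from mytable
    ) x
    where user_id = 2
""").fetchone()[0]
print(pos)  # -> 5
```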

Crosstab using PostgreSQL

I have a table table1 which consists of the following details.
Example:
create table table1
(
slno varchar(10),
joiningdate date,
joiningtime time
);
Inserting some rows:
insert into table1 values('a1','09-08-2011','10:00:00');
insert into table1 values('a1','09-08-2011','10:00:00');
insert into table1 values('a2','19-08-2011','11:00:00');
insert into table1 values('a2','20-08-2011','12:00:00');
Now I need to display it in the following format:
slno joiningdate 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23
--------------------------------------------------------------------------------------------------------------------------------------
a1 09-08-2011 2
a2 19-08-2011 1
a2 20-08-2011 1
For which I have tried the following script:
select *
from crosstab('
select slno,joiningdate , to_char(joiningtime, ''HH24'') as tc, count(tc)::int
from table1
group by 1,2,3,4
order by 1,2,3,4'
,$$VALUES('01'),('02'),('03'),('04'),('05'),('06'),('07'),('08'),('09'),('10'),
('11'),('12'),('13'),('14'),('15'),('16'),('17'),('18'),('19'),('20'),
('21'),('22'),('23')$$)
as ct(slno varchar,joiningdate date,"01" int,"02" int,"03" int,"04" int,"05" int,"06" int,"07" int,"08" int,"09" int,"10" int,
"11" int,"12" int,"13" int,"14" int,"15" int,"16" int,"17" int,"18" int,"19" int,"20" int,
"21" int,"22" int, "23" int);
But I got stuck on how to count tc (the joiningtime hours) and add it to the appropriate column.
First, produce a series of rows with the hourly counts.
select
slno, joiningdate,
hour,
sum(case when extract(hour from joiningtime) = hour then 1 end)
from table1
cross join generate_series(0,23) h(hour)
group by slno, joiningdate, hour;
Then, because crosstab can't deal with multi-column row keys, consolidate the row key using a composite type:
CREATE TYPE ctrowid as ( slno text, joiningdate date );
select
ROW(slno, joiningdate) :: ctrowid,
hour,
sum(case when extract(hour from joiningtime) = hour then 1 end)
from table1
cross join generate_series(0,23) h(hour)
group by slno, joiningdate, hour
order by 1,2;
So the query produces tuples of (rowid, category, value) as required by crosstab. Then wrap it in a crosstab call, e.g.:
SELECT
*
FROM
crosstab('
select
ROW(slno, joiningdate)::ctrowid,
hour::text,
sum(case when extract(hour from joiningtime) = hour then 1 end)::integer
from table1
cross join generate_series(0,23) h(hour)
group by slno, joiningdate, hour
order by 1, 2
')
ct(rowid ctrowid, h0 integer, h1 integer, h2 integer, h3 integer, h4 integer, h5 integer, h6 integer, h7 integer, h8 integer, h9 integer, h10 integer, h11 integer, h12 integer, h13 integer, h14 integer, h15 integer, h16 integer, h17 integer, h18 integer, h19 integer, h20 integer, h21 integer, h22 integer, h23 integer);
producing:
rowid | h0 | h1 | h2 | h3 | h4 | h5 | h6 | h7 | h8 | h9 | h10 | h11 | h12 | h13 | h14 | h15 | h16 | h17 | h18 | h19 | h20 | h21 | h22 | h23
-----------------+----+----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
(a1,2011-08-09) | | | | | | | | | | | 2 | | | | | | | | | | | | |
(a2,2011-08-19) | | | | | | | | | | | | 1 | | | | | | | | | | | |
(a2,2011-08-20) | | | | | | | | | | | | | 1 | | | | | | | | | | |
(3 rows)
You can then unpack the rowid into separate fields in an outer query if you want.
Yes, the need to specify all the columns is ugly, and makes crosstab much less useful than it should be.
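If it helps to see the pivot step outside of tablefunc, the same (row key, category, value) triples can be folded into rows with a few lines of ordinary code (a sketch in Python using sqlite3, with the sample rows from the question; substr() stands in for to_char(joiningtime, 'HH24')):

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("create table table1 (slno text, joiningdate text, joiningtime text)")
conn.executemany("insert into table1 values (?, ?, ?)", [
    ("a1", "09-08-2011", "10:00:00"),
    ("a1", "09-08-2011", "10:00:00"),
    ("a2", "19-08-2011", "11:00:00"),
    ("a2", "20-08-2011", "12:00:00"),
])

# One (row key, hour, count) triple per group, like the crosstab source query.
triples = conn.execute("""
    select slno, joiningdate,
           cast(substr(joiningtime, 1, 2) as integer) as hour,
           count(*)
    from table1
    group by slno, joiningdate, hour
""").fetchall()

# Pivot: one dict of {hour: count} per (slno, joiningdate) row key.
pivot = defaultdict(dict)
for slno, jdate, hour, cnt in triples:
    pivot[(slno, jdate)][hour] = cnt
for key in sorted(pivot):
    print(key, pivot[key])
```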
I'm sure there is a more efficient way of doing this, but this query should give you what you are looking for. I only provided a portion of your table, but I'm sure you can write out the rest.
SELECT DISTINCT ON (slno, joiningdate) slno, joiningdate,
    CASE WHEN joiningtime = '01:00:00' THEN count(*) OVER (PARTITION BY slno, joiningdate, joiningtime) ELSE NULL END AS "01",
    CASE WHEN joiningtime = '02:00:00' THEN count(*) OVER (PARTITION BY slno, joiningdate, joiningtime) ELSE NULL END AS "02",
    CASE WHEN joiningtime = '03:00:00' THEN count(*) OVER (PARTITION BY slno, joiningdate, joiningtime) ELSE NULL END AS "03",
    CASE WHEN joiningtime = '10:00:00' THEN count(*) OVER (PARTITION BY slno, joiningdate, joiningtime) ELSE NULL END AS "10",
    CASE WHEN joiningtime = '11:00:00' THEN count(*) OVER (PARTITION BY slno, joiningdate, joiningtime) ELSE NULL END AS "11",
    CASE WHEN joiningtime = '12:00:00' THEN count(*) OVER (PARTITION BY slno, joiningdate, joiningtime) ELSE NULL END AS "12"
FROM table1;