Count rows within a group, but also from the global result set: performance issue - PostgreSQL

I have a table with log records. Each log record is represented by a status (created or closed) and a date:
CREATE TABLE logs (
    id BIGSERIAL PRIMARY KEY,
    status VARCHAR NOT NULL,
    inserted_at DATE NOT NULL
);
I need to get a daily report with the following information:
how many log records with status = created were inserted that day,
how many log records with status = closed were inserted that day,
how many log records remain open up to and including that day (cumulative created minus cumulative closed).
Here's a sample report output:
    day     | created | closed | total_open
------------+---------+--------+------------
 2017-01-01 |       2 |      0 |          2
 2017-01-02 |       2 |      1 |          3
 2017-01-03 |       1 |      1 |          3
 2017-01-04 |       1 |      0 |          4
 2017-01-05 |       1 |      0 |          5
 2017-01-06 |       1 |      0 |          6
 2017-01-07 |       1 |      0 |          7
 2017-01-08 |       0 |      1 |          6
 2017-01-09 |       0 |      0 |          6
 2017-01-10 |       0 |      0 |          6
(10 rows)
I solved this in a very "dirty" way:
INSERT INTO logs (status, inserted_at) VALUES
    ('created', '2017-01-01'),
    ('created', '2017-01-01'),
    ('closed',  '2017-01-02'),
    ('created', '2017-01-02'),
    ('created', '2017-01-02'),
    ('created', '2017-01-03'),
    ('closed',  '2017-01-03'),
    ('created', '2017-01-04'),
    ('created', '2017-01-05'),
    ('created', '2017-01-06'),
    ('created', '2017-01-07'),
    ('closed',  '2017-01-08');
SELECT days.day,
       count(case when logs.inserted_at = days.day AND logs.status = 'created' then 1 end) as created,
       count(case when logs.inserted_at = days.day AND logs.status = 'closed' then 1 end) as closed,
       count(case when logs.inserted_at <= days.day AND logs.status = 'created' then 1 end) -
       count(case when logs.inserted_at <= days.day AND logs.status = 'closed' then 1 end) as total
FROM (SELECT day::date
      FROM generate_series('2017-01-01'::date, '2017-01-10'::date, '1 day'::interval) day) days,
     logs
GROUP BY days.day
ORDER BY days.day;
(I also posted it as a gist, for brevity.) I would like to improve the solution.
Right now EXPLAIN for my query reveals some ridiculous cost numbers that I would like to minimize (I don't have indexes yet).
What would an efficient query to achieve the report above look like?

A possible solution is to use window functions:
select s.*, sum(created - closed) over (order by inserted_at) as total_open
from (
    select inserted_at,
           count(status) filter (where status = 'created') created,
           count(status) filter (where status = 'closed') closed
    from (select d::date inserted_at
          from generate_series('2017-01-01'::date, '2017-01-10'::date, '1 day'::interval) d) d
    left join logs using (inserted_at)
    group by inserted_at
) s
order by inserted_at;
http://rextester.com/GFRRP71067
Also, an index on (inserted_at, status) could help you a lot with this query.
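For example, a minimal sketch (the index name here is arbitrary):
CREATE INDEX logs_inserted_at_status_idx ON logs (inserted_at, status);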
Note: count(...) filter (where ...) is really just a fancy way to write count(case when ... then ... [else null] end).
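For illustration, assuming the logs table above, these two aggregates count exactly the same rows:
select count(*) filter (where status = 'created') as with_filter,       -- PostgreSQL 9.4+
       count(case when status = 'created' then 1 end) as with_case      -- portable equivalent
from logs;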

Related

How to drop rows if a variable is less than x, in SQL

I have the following query:
query = """
with double_entry_book as (
SELECT to_address as address, value as value
FROM `bigquery-public-data.crypto_ethereum.traces`
WHERE to_address is not null
AND block_timestamp < '2022-01-01 00:00:00'
AND status = 1
AND (call_type not in ('delegatecall', 'callcode', 'staticcall') or call_type is null)
union all
-- credits
SELECT from_address as address, -value as value
FROM `bigquery-public-data.crypto_ethereum.traces`
WHERE from_address is not null
AND block_timestamp < '2022-01-01 00:00:00'
AND status = 1
AND (call_type not in ('delegatecall', 'callcode', 'staticcall') or call_type is null)
union all
)
SELECT address,
sum(value) / 1000000000000000000 as balance
from double_entry_book
group by address
order by balance desc
LIMIT 15000000
"""
In the last part, I want to drop rows where "balance" is less than, let's say, 0.02 and then group, order, etc. I imagine this should be a simple code. Any help will be appreciated!
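If the goal is just to keep rows with a balance of at least 0.02, one option (a sketch, assuming the rest of the query stays as above) is a HAVING clause, since the filter has to apply after grouping:
SELECT address,
       sum(value) / 1000000000000000000 as balance
from double_entry_book
group by address
-- keep only addresses at or above the 0.02 threshold from the question
having sum(value) / 1000000000000000000 >= 0.02
order by balance desc
LIMIT 15000000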
We can delete in a CTE and use RETURNING to get the ids of the rows being deleted, but they still exist until the transaction is committed.
CREATE TABLE t (
    id serial,
    variale int
);
insert into t (variale) values
    (1), (2), (3), (4), (5);
5 rows affected
with del as (
    delete from t
    where variale < 3
    returning id
)
select t.id,
       t.variale,
       del.id ids_being_deleted
from t
left join del on t.id = del.id;
 id | variale | ids_being_deleted
----+---------+-------------------
  1 |       1 |                 1
  2 |       2 |                 2
  3 |       3 |              null
  4 |       4 |              null
  5 |       5 |              null
select * from t;
 id | variale
----+---------
  3 |       3
  4 |       4
  5 |       5
db<>fiddle here

Need help to merge overlapping time intervals

I need some help with merging overlapping time intervals when the gap between them is not more than 4 minutes (for example, only where id = 1).
I have the following table:
 id | action    | date
----+-----------+---------------------
  1 | started   | 2020-08-18 13:51:02
  1 | suspended | 2020-08-18 13:51:04
  2 | started   | 2020-08-18 13:52:14
  2 | suspended | 2020-08-18 13:52:17
  3 | started   | 2020-08-18 13:52:21
  3 | suspended | 2020-08-18 13:52:24
  1 | started   | 2020-08-18 13:57:21
  1 | suspended | 2020-08-18 13:57:22
  1 | started   | 2020-08-18 15:07:56
  1 | suspended | 2020-08-18 15:08:56
  1 | started   | 2020-08-18 15:09:11
  1 | suspended | 2020-08-18 15:09:11
  1 | started   | 2020-08-18 15:09:11
  1 | suspended | 2020-08-18 15:09:13
Expected result:
 id | action    | date
----+-----------+---------------------
  1 | started   | 2020-08-18 13:51:02
  1 | suspended | 2020-08-18 13:51:04
  1 | started   | 2020-08-18 13:57:21
  1 | suspended | 2020-08-18 13:57:22
  1 | started   | 2020-08-18 15:07:56
  1 | suspended | 2020-08-18 15:09:13
How it can be done? I will be very grateful for your help!
You want to eliminate suspended/started pairs that are for the same id and within 4 minutes. Use lag() and lead():
select t.*
from (select t.*,
             lag(date) over (partition by id order by date) as prev_date,
             lead(date) over (partition by id order by date) as next_date
      from t
     ) t
where (action = 'started' and
       prev_date > date - interval '4 minute'
      ) or
      (action = 'suspended' and
       next_date < date + interval '4 minute'
      );
Date/time functions are notoriously database dependent. This is just adding or subtracting 4 minutes, which any database can do but the syntax might vary.
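For instance, assuming a timestamp column named date, the same 4-minute shift is spelled slightly differently across engines:
date - interval '4 minute'      -- PostgreSQL (as in the query above)
dateadd(minute, -4, date)       -- SQL Server
date - interval 4 minute        -- MySQL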
You're wanting to filter out certain rows; what is common to the rows you are removing?
It seems you want the first 'started' and last 'suspended' rows. Can you just ignore 'started' rows if there is another 'started' row in the previous 4 minutes, and ignore 'suspended' rows if there is another 'suspended' row in the next 4 minutes?
select a.*
from my_table a
where a.action = 'started' and not exists (
    select 1 from my_table b
    where b.id = a.id and b.action = 'started'
    and b.date < a.date -- only look at earlier rows
    and datediff(minute, b.date, a.date) <= 4 -- row exists in the previous 4 min
)
Ditto for 'suspended' but the other way. That doesn't work if the difference between the last 'started' and 'suspended' is > 4 minutes though, but that can be overcome with another condition to check for no start within the last 4 minutes.
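A sketch of the symmetric check for 'suspended' rows, under the same assumptions (a table named my_table and SQL Server-style datediff):
select a.*
from my_table a
where a.action = 'suspended' and not exists (
    select 1 from my_table b
    where b.id = a.id and b.action = 'suspended'
    and b.date > a.date -- only look at later rows
    and datediff(minute, a.date, b.date) <= 4 -- row exists in the next 4 min
)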
If you need to merge overlapping intervals whose duration is not more than 4 minutes, you can use this query:
-- CTE creating groups with time intervals
with base_cte as
(
    select Tab.id, Tab.NumGr, Tab.date, Tab.action
    from
    (
        select * from
        (
            -- selecting only values where time difference <= 4 min
            select *,
                   sum(TimeDiff) over (partition by id, NumGr order by date rows unbounded preceding) SumTimeInterval
            from
            (
                -- creating a group
                select sum(Num) over (partition by id order by date rows unbounded preceding) NumGr, *
                from
                (
                    select date,
                           lead(date) over (partition by id order by date) lead_date,
                           id, action,
                           lead(action) over (partition by id order by date) lead_action,
                           -- split intervals between overlaps (240 seconds)
                           iif(TimeDiff > 240, 1, 0) Num,
                           TimeDiff
                    from
                    (
                        -- find time difference in seconds between current and next date (partition by id)
                        select datediff(second, date, lead(date) over (partition by id order by date)) TimeDiff, *
                        from Table
                    ) A
                ) B
            ) C
            -- selecting only pairs within time intervals
            where TimeDiff <= 240
        ) D
        -- checking duration interval: <= 4 min
        where SumTimeInterval <= 240
    ) E
    CROSS JOIN LATERAL
    (values (id, NumGr, date, action),
            (id, NumGr, lead_date, lead_action)
    ) Tab(id, NumGr, date, action)
)
-- selecting data at the start/end of each overlapping time interval
select id, date, action
from base_cte base
where date in (select max(date) from base_cte where NumGr = base.NumGr)
   or date in (select min(date) from base_cte where NumGr = base.NumGr)

Should I use GROUPING SETS, CUBE, or ROLLUP in Postgres

We just upgraded last month to Postgres 10, so I'm new to a few of its features.
This query needs to display the days each student is taken care of, plus a total of how many students are taken care of on each weekday:
select distinct s.studentnr,
       (CASE when lower(cd.weekday) like lower('MONDAY') then 1 else 0 end) as MONDAY,
       (CASE when lower(cd.weekday) like lower('TUESDAY') then 1 else 0 end) as TUESDAY,
       (CASE when lower(cd.weekday) like lower('WEDNESDAY') then 1 else 0 end) as WEDNESDAY,
       (CASE when lower(cd.weekday) like lower('THURSDAY') then 1 else 0 end) as THURSDAY,
       (CASE when lower(cd.weekday) like lower('FRIDAY') then 1 else 0 end) as FRIDAY,
       scp.durationid
from student s
full join studentcarepreference scp on s.id = scp.studentid
full join careday cd on cd.studentcarepreferenceid = scp.id
join pupil per on per.id = s.personid
join studentschool ss ON ss.studentid = s.id
join duration d on d.id = scp.durationid
    AND d.id BETWEEN ss.validfrom AND ss.validuntil
where scp.durationid = 1507
  and cd.weekday is not null
order by s.studentnr
(s.studentnr and cd.weekday are both of type varchar)
resulting in: (screenshot of the current result)
However, I need the data as follows: (screenshot of the required result)
Which approach is best to use in this kind of query?
New results after changing the code:
select case grouping(studentnr)
           when 0 then studentnr
           else count(distinct studentnr) || ' students'
       end studentnr
     , count(case lower(weekday) when 'monday' then 1 end) monday
     , count(case lower(weekday) when 'tuesday' then 1 end) tuesday
     , count(case lower(weekday) when 'wednesday' then 1 end) wednesday
     , count(case lower(weekday) when 'thursday' then 1 end) thursday
     , count(case lower(weekday) when 'friday' then 1 end) friday
from mydata
group by rollup ((studentnr))
order by studentnr
Nearly there, I guess; just the results/values are wrong. What would you suggest I look into to correct the results?
It looks like you want to ROLLUP yourdata using a GROUPING SET:
select case grouping(studentnr)
           when 0 then studentnr
           else count(distinct studentnr) || ' students'
       end studentnr
     , count(distinct case careday when 'monday' then studentnr end) monday
     , count(distinct case careday when 'tuesday' then studentnr end) tuesday
     , count(distinct case careday when 'wednesday' then studentnr end) wednesday
     , count(distinct case careday when 'thursday' then studentnr end) thursday
     , count(distinct case careday when 'friday' then studentnr end) friday
     , durationid
from yourdata
group by rollup ((studentnr, durationid))
Which yields the desired results:
| studentnr  | monday | tuesday | wednesday | thursday | friday | durationid |
|------------|--------|---------|-----------|----------|--------|------------|
| 10177      |      1 |       1 |         1 |        1 |      1 |       1507 |
| 717208     |      1 |       1 |         1 |        1 |      1 |       1507 |
| 722301     |      1 |       1 |         1 |        1 |      0 |       1507 |
| 3 students |      3 |       3 |         3 |        3 |      2 |     (null) |
The second set of parentheses in the ROLLUP indicates that studentnr and durationid should be summarized at the same level when doing the roll-up.
With just one level of summarization, there's not much difference between ROLLUP and CUBE, however to use GROUPING SETS would require a slight change to the GROUP BY clause in order to get the lowest desired level of detail. All three of the following GROUP BY statements produce equivalent results:
group by rollup ((studentnr, durationid))
group by cube ((studentnr, durationid))
group by grouping sets ((),(studentnr, durationid))
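For contrast, dropping the inner parentheses changes which levels are produced (names as in the answer above):
-- rollup ((studentnr, durationid)) expands to grouping sets ((studentnr, durationid), ())
-- rollup (studentnr, durationid)   expands to grouping sets ((studentnr, durationid), (studentnr), ())
-- so the un-nested form would add an extra subtotal row per studentnr:
group by rollup (studentnr, durationid)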

I am computing a percentage in PostgreSQL and I get the following unexpected behavior when dividing a number by the same number

I am new to PostgreSQL and am having trouble wrapping my mind around why I am getting the results that I see.
I run the following query:
SELECT name AS region_name,
       COUNT(tripsq1.id) AS trips,
       COUNT(DISTINCT user_id) AS unique_users,
       COUNT(case when consumed_at = start_at then tripsq1.id end) AS first_day,
       (SUM(case when consumed_at = start_at then tripsq1.id end)::NUMERIC(6,4))
           / COUNT(tripsq1.id)::NUMERIC(6,4) AS percent_on_first_day
FROM promotionsq1
INNER JOIN couponsq1 ON promotion_id = promotionsq1.id
INNER JOIN tripsq1 ON couponsq1.id = coupon_id
INNER JOIN regionsq1 ON regionsq1.id = region_id
WHERE promotion_name = 'TestPromo'
GROUP BY region_name;
and get the following result
region_name | trips | unique_users | first_day | percent_on_first_day
-------------------+-------+--------------+-----------+-----------------------
A | 3 | 2 | 1 | 33.3333333333333333
B | 1 | 1 | 0 |
C | 1 | 1 | 1 | 2000.0000000000000000
The first row's percentage is calculated correctly, while the third row's percentage is 20 times what it should be. percent_on_first_day should be 100.00, since it is 100.0 * 1/1.
Any help would be greatly appreciated
I suspect the issue is because of this code:
SUM(case when consumed_at = start_at then tripsq1.id end)
This tells me you are summing the ids, which is meaningless. You probably want:
SUM(case when consumed_at = start_at then 1 end)
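With that change, a sketch of the corrected percentage column (multiplying by 100.0 both to scale it to a percent and to force numeric rather than integer division):
100.0 * COUNT(case when consumed_at = start_at then tripsq1.id end)
      / COUNT(tripsq1.id) AS percent_on_first_day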

Iterate through rows, compare them against each other and store results in another table

I have a table that contains the following rows:
product_id | order_date
-----------+------------
A          | 12/04/12
A          | 01/11/13
A          | 01/21/13
A          | 03/05/13
B          | 02/14/13
B          | 03/09/13
What I now need is an overview for each month: how many products were bought for the first time (= not bought the month before), how many are existing products (= bought the month before), and how many were not purchased within a given month. Taking the sample above as input, the script should deliver the following result, regardless of what period of time is in the data:
month   | new | existing | nopurchase
--------+-----+----------+-----------
12/2012 |   1 |        0 |          0
01/2013 |   0 |        1 |          0
02/2013 |   1 |        0 |          1
03/2013 |   1 |        1 |          0
It would be great to get a first hint at how this could be solved, so I'm able to continue.
Thanks!
SQL Fiddle
with t as (
    select product_id pid, date_trunc('month', order_date)::date od
    from t
    group by 1, 2
)
select od,
       sum(is_new::integer) "new",
       sum(is_existing::integer) existing,
       sum(not_purchased::integer) nopurchase
from (
    select od,
           lag(t_pid) over (partition by s_pid order by od) is null and t_pid is not null is_new,
           lag(t_pid) over (partition by s_pid order by od) is not null and t_pid is not null is_existing,
           lag(t_pid) over (partition by s_pid order by od) is not null and t_pid is null not_purchased
    from (
        select t.pid t_pid, s.pid s_pid, s.od
        from t
        right join (
            select pid, s.od
            from t
            cross join (
                select date_trunc('month', d)::date od
                from generate_series(
                    (select min(od) from t),
                    (select max(od) from t),
                    '1 month'
                ) s(d)
            ) s
            group by pid, s.od
        ) s on t.od = s.od and t.pid = s.pid
    ) s
) s
group by 1
order by 1
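To try this locally, the sample data from the question can be loaded as follows (a sketch: the base table name t and its column names are assumed from the CTE above, and the question's dates are read as MM/DD/YY):
CREATE TABLE t (
    product_id text NOT NULL,
    order_date date NOT NULL
);

INSERT INTO t (product_id, order_date) VALUES
    ('A', '2012-12-04'),
    ('A', '2013-01-11'),
    ('A', '2013-01-21'),
    ('A', '2013-03-05'),
    ('B', '2013-02-14'),
    ('B', '2013-03-09');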