Optimise query: count latest win/lose streak for all teams - postgresql

I'm not an expert in data warehousing or analytics, so I gave birth to a monster query that I'd like to optimise (if possible).
The problem: I need to display the standings table for a given tournament. This table should show team id, total score, team position (i.e. a bucket for same-score teams), order and the latest (!) win/lose streak.
I.e. for the following sequence (most recent first) WWLLLWLD I should get 2W
W = win, L = lose, D = draw
Schema
create table matches (
id integer primary key,
stage_id integer not null,
scheduled_at timestamp not null,
winner_id integer null,
status text not null default 'finished' -- just to give extra context
);
create table teams (
id integer primary key
);
create table match_teams (
match_id integer,
team_id integer,
constraint fk_mt_m foreign key (match_id) references matches(id),
constraint fk_mt_t foreign key (team_id) references teams(id)
);
insert into teams(id) values(1),(2);
insert into matches(id, stage_id, scheduled_at, winner_id) values
(1, 1, now() - interval '1 day', 1),
(2, 1, now() - interval '2 days', 1),
(3, 1, now() - interval '3 days', 2),
(4, 1, now() - interval '4 days', 1),
(5, 1, now() - interval '5 days', null);
insert into match_teams(match_id, team_id) values
(1, 1),
(1, 2),
(2, 1),
(2, 2),
(3, 1),
(3, 2),
(4, 1),
(4, 2),
(5, 1),
(5, 2);
Query itself:
with v_mto as (
SELECT
m.id,
m."stage_id",
mt."team_id",
m."scheduled_at",
(
case
when m."winner_id" IS NULL then 0
when (m."winner_id" = mt."team_id") then 1
else -1
end
) win
FROM matches m
INNER JOIN match_teams mt ON m.id = mt."match_id"
WHERE m.status = 'finished'
ORDER BY "stage_id", "team_id", "scheduled_at" desc
),
v_lag as (
select
"stage_id",
"team_id",
win,
lag(win, 1, win) over (partition by "stage_id", "team_id" order by "scheduled_at" desc ) lag_win,
first_value(win) over (partition by "stage_id", "team_id" order by "scheduled_at" desc ) first_win
from v_mto
)
select
"stage_id",
"team_id",
v_lag.win,
count(1)
from v_lag
where v_lag.win = v_lag.lag_win and v_lag.win = v_lag.first_win
group by 1, 2, 3
-- This is the query for the final table (on a screenshot)
-- with team_scores as (
-- select
-- m."tournamentStageId",
-- "teamId",
-- sum(
-- -- each win gives 3 points, each draw gives 1 point
-- coalesce((m."winner_id" = mt."team_id")::integer, 0) * 3
-- +
-- (m."winner_id" IS NULL)::int
-- ) as score
-- from matches m
-- inner join match_teams mt on m.id = mt."matchId"
-- where m.status = 1
-- group by m."tournamentStageId", "teamId")
-- select
-- "tournamentStageId",
-- "teamId",
-- t.name,
-- score,
-- dense_rank() over (partition by "tournamentStageId" order by score desc) rank,
-- row_number() over (partition by "tournamentStageId" order by t.name) position
-- -- total number of wins/losses/draws to be added (the "score" column from the screenshot)
-- from team_scores ts
-- inner join teams t on t.id = ts."teamId"
-- order by "tournamentStageId", rank, position
I've created a sandbox for those who are brave enough to take a deep dive into the task: https://www.db-fiddle.com/f/6jsFFnxQMKwNQWznR3VXHC/2
Also, I've already crafted the part that creates the list of teams together with scores and points, so the attached query will be used as a join or a sub-select.
Query plan on the real database and the real query (some indexes are probably missing, but that's OK for now):
GroupAggregate (cost=24862.28..29423.68 rows=3 width=24)
  Group Key: v_lag."computerGameId", v_lag."tournamentStageId", v_lag."teamId", v_lag.win
  ->  Incremental Sort (cost=24862.28..29423.61 rows=3 width=16)
        Sort Key: v_lag."computerGameId", v_lag."tournamentStageId", v_lag."teamId", v_lag.win
        Presorted Key: v_lag."computerGameId", v_lag."tournamentStageId", v_lag."teamId"
        ->  Subquery Scan on v_lag (cost=22581.67..29423.47 rows=3 width=16)
              Filter: ((v_lag.win = v_lag.lag_win) AND (v_lag.lag_win = v_lag.first_win))
              ->  WindowAgg (cost=22581.67..27468.67 rows=130320 width=32)
                    ->  Subquery Scan on v_mto (cost=22581.67..24210.67 rows=130320 width=24)
                          ->  Sort (cost=22581.67..22907.47 rows=130320 width=28)
                                Sort Key: m."computerGameId", m."tournamentStageId", mt."teamId", m."scheduledAt" DESC
                                ->  Hash Join (cost=3863.39..8391.38 rows=130320 width=28)
                                      Hash Cond: (mt."matchId" = m.id)
                                      ->  Seq Scan on match_teams mt (cost=0.00..2382.81 rows=137281 width=8)
                                      ->  Hash (cost=2658.10..2658.10 rows=65623 width=24)
                                            ->  Seq Scan on matches m (cost=0.00..2658.10 rows=65623 width=24)
                                                  Filter: (status = 1)
Thanks everyone for help and suggestions!
The final result:
P.S. It is possible to convert the first query (v_mto) into a materialised view or to de-normalise win into the match_teams table, as this piece will be used in different queries to build match/game stats.
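For reference, here is a minimal sketch of that materialised-view idea against the simplified sample schema above (the view and index names are made up; the production tables use different column names):

create materialized view mv_match_team_outcomes as
select m.id as match_id,
       m.stage_id,
       mt.team_id,
       m.scheduled_at,
       case
           when m.winner_id is null then 'D'
           when m.winner_id = mt.team_id then 'W'
           else 'L'
       end as outcome
from matches m
inner join match_teams mt on mt.match_id = m.id
where m.status = 'finished';

-- supports "latest N results per team within a stage" lookups
create index on mv_match_team_outcomes (stage_id, team_id, scheduled_at desc);

-- refresh after new results are recorded; REFRESH ... CONCURRENTLY would
-- additionally require a unique index, e.g. on (match_id, team_id)
-- refresh materialized view mv_match_team_outcomes;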

So, the original query was wrong: it gave incorrect results for the standings.
I've moved to row_number() arithmetic (the difference of two row numbers) to solve this task.
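For context, the core of the row_number() trick on the simplified sample schema looks roughly like this (a minimal sketch; the production query below adds the extra partition columns and the scores). The latest streak is the group of rows where the overall row number and the per-result row number, both taken in descending date order, differ by zero:

with outcomes as (
    select mt.team_id,
           m.scheduled_at,
           case when m.winner_id is null then 'D'
                when m.winner_id = mt.team_id then 'W'
                else 'L'
           end as result
    from matches m
    inner join match_teams mt on mt.match_id = m.id
    where m.status = 'finished'
)
select team_id, count(*) || result as latest_streak
from (
    select team_id, result,
           row_number() over (partition by team_id order by scheduled_at desc)
         - row_number() over (partition by team_id, result order by scheduled_at desc) as streak_index
    from outcomes
) s
where streak_index = 0  -- rows belonging to the most recent run of identical results
group by team_id, result;

With the sample data above this returns 2W for team 1 and 2L for team 2.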
The final query (with scores) looks like this:
create materialized view vm_tournament_stage_standings as
with v_mto as (SELECT m.id,
m."computerGameId",
m."tournamentStageId",
mt."teamId",
m."scheduledAt",
(
case
when m."winnerId" IS NULL then 'D'
when m."winnerId" = mt."teamId" then 'W'
else 'L'
end
) win
FROM matches m
INNER JOIN match_teams mt ON
m.id = mt."matchId"
WHERE m.status = 1),
v_streaks as (select "computerGameId",
"tournamentStageId",
"teamId",
row_number()
over grp_ord_matches
- row_number()
over (partition by "computerGameId", "tournamentStageId", "teamId", win order by "scheduledAt" desc ) streak_index,
win
from v_mto
window grp_ord_matches as (partition by "computerGameId", "tournamentStageId", "teamId" order by "scheduledAt" desc)),
v_streak as (select "computerGameId",
"tournamentStageId",
"teamId",
count(1) || win as streak
from v_streaks
where streak_index = 0
group by "computerGameId", "tournamentStageId", "teamId", "win"),
team_scores as (select m."tournamentStageId",
"teamId",
sum((m."winnerId" = mt."teamId")::int) as wins,
sum((m."winnerId" is null)::int) draws,
sum((m."winnerId" <> mt."teamId")::int) loses,
sum(
coalesce((m."winnerId" = mt."teamId")::integer, 0) * 3
+
(m."winnerId" IS NULL)::int
) as score
from matches m
inner join match_teams mt on m.id = mt."matchId"
where m.status = 1
group by m."tournamentStageId", "teamId")
select ts."tournamentStageId",
ts."teamId",
score,
wins,
draws,
loses,
vs.streak as streak,
dense_rank() over (partition by ts."tournamentStageId" order by score desc) rank,
row_number() over (partition by ts."tournamentStageId" order by t.name) position
from team_scores ts
inner join teams t on t.id = ts."teamId"
inner join v_streak vs on vs."teamId" = t.id and vs."tournamentStageId" = ts."tournamentStageId"
order by "tournamentStageId", rank, position

Related

Postgres Query Optimization without adding an extra index

I was trying to optimize this query differently, but before that: can we make any slight change to this query to reduce the time without adding any index?
Postgres version: 13.5
Query:
SELECT
orders.id as order_id,
orders.*, u1.name as user_name,
u2.name as driver_name,
u3.name as payment_by_name, referrals.name as ref_name,
array_to_string(array_agg(orders_payments.payment_type_name), ',') as payment_type_name,
array_to_string(array_agg(orders_payments.amount), ',') as payment_type_amount,
array_to_string(array_agg(orders_payments.reference_code), ',') as reference_code,
array_to_string(array_agg(orders_payments.tips), ',') as tips,
array_to_string(array_agg(locations.name), ',') as location_name,
(select
SUM(order_items.tax) as tax from order_items
where order_items.order_id = orders.id and order_items.deleted = 'f'
) as tax,
(select
SUM(orders_surcharges.surcharge_tax) as surcharge_tax from orders_surcharges
where orders_surcharges.order_id = orders.id
)
FROM "orders"
LEFT JOIN
users as u1 ON u1.id = orders.user_id
LEFT JOIN
users as u2 ON u2.id = orders.driver_id
LEFT JOIN
users as u3 ON u3.id = orders.payment_received_by
LEFT JOIN
referrals ON referrals.id = orders.referral_id
INNER JOIN
locations ON locations.id = orders.location_id
LEFT JOIN
orders_payments ON orders_payments.order_id = orders.id
WHERE
(orders.company_id = '626')
AND
(orders.created_at BETWEEN '2021-04-23 20:00:00' AND '2021-07-24 20:00:00')
AND
orders.order_status_id NOT IN (10, 5, 50)
GROUP BY
orders.id, u1.name, u2.name, u3.name, referrals.name
ORDER BY
created_at ASC LIMIT 300 OFFSET 0
Current indexes:
"orders_pkey" PRIMARY KEY, btree (id)
"idx_orders_company_and_location" btree (company_id, location_id)
"idx_orders_created_at" btree (created_at)
"idx_orders_customer_id" btree (customer_id)
"idx_orders_location_id" btree (location_id)
"idx_orders_order_status_id" btree (order_status_id)
Execution Plan
Seems this takes more time on the parallel heap scan.
You're looking for 300 orders and trying to get some additional information about those records. I would first get these 300 records, instead of getting all the data and then limiting it to 300. Something like this:
WITH orders_300 AS (
SELECT orders.*, locations.name AS location_name -- just get the columns that you really need, never use * in production
FROM orders
INNER JOIN locations ON locations.id = orders.location_id
WHERE orders.company_id = '626'
AND orders.created_at BETWEEN '2021-04-23 20:00:00' AND '2021-07-24 20:00:00'
AND orders.order_status_id NOT IN (10, 5, 50)
ORDER BY
created_at ASC LIMIT 300 -- LIMIT
OFFSET 0
)
SELECT
orders.id as order_id,
orders.*, -- just get the columns that you really need, never use * in production
u1.name as user_name,
u2.name as driver_name,
u3.name as payment_by_name, referrals.name as ref_name,
array_to_string(array_agg(orders_payments.payment_type_name), ',') as payment_type_name,
array_to_string(array_agg(orders_payments.amount), ',') as payment_type_amount,
array_to_string(array_agg(orders_payments.reference_code), ',') as reference_code,
array_to_string(array_agg(orders_payments.tips), ',') as tips,
array_to_string(array_agg(orders.location_name), ',') as location_name,
(SELECT SUM(order_items.tax) as tax
FROM order_items
WHERE order_items.order_id = orders.id
AND order_items.deleted = 'f'
) as tax,
( SELECT SUM(orders_surcharges.surcharge_tax) as surcharge_tax
FROM orders_surcharges
WHERE orders_surcharges.order_id = orders.id
)
FROM "orders_300" AS orders
LEFT JOIN users as u1 ON u1.id = orders.user_id
LEFT JOIN users as u2 ON u2.id = orders.driver_id
LEFT JOIN users as u3 ON u3.id = orders.payment_received_by
LEFT JOIN referrals ON referrals.id = orders.referral_id
LEFT JOIN orders_payments ON orders_payments.order_id = orders.id
GROUP BY
orders.id, u1.name, u2.name, u3.name, referrals.name
ORDER BY
created_at;
This will at least have a huge impact on the slowest part of your query, all these index scans on orders_payments. Every single scan is fast, but the query is doing 165000 of them... Limit this to just 300 and it will be much faster.
Another issue is that none of your indexes covers the entire WHERE condition on the table "orders". But if you can't create a new index, you're out of luck.
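For reference, if adding an index ever does become an option, something along these lines (hypothetical, not part of the original question) would match the whole WHERE clause far better than the existing single-column indexes:

-- company_id is the equality column, created_at the range column;
-- the order_status_id NOT IN (...) part is cheap to filter afterwards
CREATE INDEX idx_orders_company_created_at ON orders (company_id, created_at);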

Postgres: Optimisation for query "WHERE id IN (...)"

I have a table (2M+ records) which keeps track of a ledger.
Some entries add points, while others subtract points (there are only two kinds of entries). The entries which subtract points always reference the (adding) entries they were subtracted from via referenceentryid. The adding entries always have NULL in referenceentryid.
This table has a dead column which is set to true by a worker when an addition is depleted or expired, or when a subtraction points at a "dead" addition. Since the table has a partial index on dead=false, SELECTs on live rows work pretty fast.
My problem is with the performance of the worker that sets dead to true.
The flow would be:
1. Get an entry for each addition which indicates the amount added, subtracted and whether or not it's expired.
2. Filter away entries which are both not expired and have more addition than subtraction.
3. Update dead=true on every row where either the id or the referenceentryid is in the filtered set of entries.
WITH entries AS
(
SELECT
additions.id AS id,
SUM(subtractions.amount) AS subtraction,
additions.amount AS addition,
additions.expirydate <= now() AS expired
FROM
loyalty_ledger AS subtractions
INNER JOIN
loyalty_ledger AS additions
ON
additions.id = subtractions.referenceentryid
WHERE
subtractions.dead = FALSE
AND subtractions.referenceentryid IS NOT NULL
GROUP BY
subtractions.referenceentryid, additions.id
), dead_entries AS (
SELECT
id
FROM
entries
WHERE
subtraction >= addition OR expired = TRUE
)
-- THE SLOW BIT:
SELECT
*
FROM
loyalty_ledger AS ledger
WHERE
ledger.dead = FALSE AND
(ledger.id IN (SELECT id FROM dead_entries) OR ledger.referenceentryid IN (SELECT id FROM dead_entries));
In the query above the inner part runs pretty fast (a few seconds), while the last part runs forever.
I have the following indexes on the table:
CREATE TABLE IF NOT EXISTS loyalty_ledger (
id SERIAL PRIMARY KEY,
programid bigint NOT NULL,
FOREIGN KEY (programid) REFERENCES loyalty_programs(id) ON DELETE CASCADE,
referenceentryid bigint,
FOREIGN KEY (referenceentryid) REFERENCES loyalty_ledger(id) ON DELETE CASCADE,
customerprofileid bigint NOT NULL,
FOREIGN KEY (customerprofileid) REFERENCES customer_profiles(id) ON DELETE CASCADE,
amount int NOT NULL,
expirydate TIMESTAMPTZ,
dead boolean DEFAULT false,
expired boolean DEFAULT false
);
CREATE index loyalty_ledger_referenceentryid_idx ON loyalty_ledger (referenceentryid) WHERE dead = false;
CREATE index loyalty_ledger_customer_program_idx ON loyalty_ledger (customerprofileid, programid) WHERE dead = false;
I'm trying to optimise the last part of the query.
EXPLAIN gives me the following:
"Index Scan using loyalty_ledger_referenceentryid_idx on loyalty_ledger ledger (cost=103412.24..4976040812.22 rows=986583 width=67)"
" Filter: ((SubPlan 3) OR (SubPlan 4))"
" CTE entries"
" -> GroupAggregate (cost=1.47..97737.83 rows=252177 width=25)"
" Group Key: subtractions.referenceentryid, additions.id"
" -> Merge Join (cost=1.47..91390.72 rows=341928 width=28)"
" Merge Cond: (subtractions.referenceentryid = additions.id)"
" -> Index Scan using loyalty_ledger_referenceentryid_idx on loyalty_ledger subtractions (cost=0.43..22392.56 rows=341928 width=12)"
" Index Cond: (referenceentryid IS NOT NULL)"
" -> Index Scan using loyalty_ledger_pkey on loyalty_ledger additions (cost=0.43..80251.72 rows=1683086 width=16)"
" CTE dead_entries"
" -> CTE Scan on entries (cost=0.00..5673.98 rows=168118 width=4)"
" Filter: ((subtraction >= addition) OR expired)"
" SubPlan 3"
" -> CTE Scan on dead_entries (cost=0.00..3362.36 rows=168118 width=4)"
" SubPlan 4"
" -> CTE Scan on dead_entries dead_entries_1 (cost=0.00..3362.36 rows=168118 width=4)"
Seems like the last part of my query is very inefficient. Any ideas on how to speed it up?
For large datasets, I have found semi-joins to have much better performance than IN lists:
from
loyalty_ledger as ledger
WHERE
ledger.dead = FALSE AND (
exists (
select null
from dead_entries d
where d.id = ledger.id
) or
exists (
select null
from dead_entries d
where d.id = ledger.referenceentryid
)
)
I honestly don't know, but I think each of these would also be worth a try. It's less code and more intuitive, but there is no guarantee they will work better:
ledger.dead = FALSE AND
exists (
select null
from dead_entries d
where d.id = ledger.id or d.id = ledger.referenceentryid
)
or
ledger.dead = FALSE AND
exists (
select null
from dead_entries d
where d.id in (ledger.id, ledger.referenceentryid)
)
What helped me in the end was to do the id IN filtering part in the second WITH step, replacing IN with ANY syntax:
WITH entries AS
(
SELECT
additions.id AS id,
additions.amount - coalesce(SUM(subtractions.amount),0) AS balance,
additions.expirydate <= now() AS passed_expiration
FROM
loyalty_ledger AS additions
LEFT JOIN
loyalty_ledger AS subtractions
ON
subtractions.dead = FALSE AND
additions.id = subtractions.referenceentryid
WHERE
additions.dead = FALSE AND additions.referenceentryid IS NULL
GROUP BY
subtractions.referenceentryid, additions.id
), dead_rows AS (
SELECT
l.id AS id,
-- only additions that still have usable points can expire
l.referenceentryid IS NULL AND e.balance > 0 AND e.passed_expiration AS expired
FROM
loyalty_ledger AS l
INNER JOIN
entries AS e
ON
(l.id = e.id OR l.referenceentryid = e.id)
WHERE
l.dead = FALSE AND
(e.balance <= 0 OR e.passed_expiration)
ORDER BY e.balance DESC
)
UPDATE
loyalty_ledger AS l
SET
(dead, expired) = (TRUE, d.expired)
FROM
dead_rows AS d
WHERE
l.id = d.id AND
l.dead = FALSE;
I also believe
-- THE SLOW BIT:
SELECT
*
FROM
loyalty_ledger AS ledger
WHERE
ledger.dead = FALSE AND
(ledger.id IN (SELECT id FROM dead_entries) OR ledger.referenceentryid IN (SELECT id FROM dead_entries));
Can be rewritten into a JOIN and UNION ALL, which will most likely also generate another execution plan and might be faster.
But that is hard to verify for sure without the other table structures.
SELECT
*
FROM
loyalty_ledger AS ledger
INNER JOIN (SELECT id FROM dead_entries) AS dead_entries
ON ledger.id = dead_entries.id AND ledger.dead = FALSE
UNION ALL
SELECT
*
FROM
loyalty_ledger AS ledger
INNER JOIN (SELECT id FROM dead_entries) AS dead_entries
ON ledger.referenceentryid = dead_entries.id AND ledger.dead = FALSE
And because CTEs in PostgreSQL are materialized and not indexed, you are most likely better off removing the dead_entries CTE and repeating its definition outside the CTE.
SELECT
*
FROM
loyalty_ledger AS ledger
INNER JOIN (SELECT
id
FROM
entries
WHERE
subtraction >= addition OR expired = TRUE) AS dead_entries
ON ledger.id = dead_entries.id AND ledger.dead = FALSE
UNION ALL
SELECT
*
FROM
loyalty_ledger AS ledger
INNER JOIN (SELECT
id
FROM
entries
WHERE
subtraction >= addition OR expired = TRUE) AS dead_entries
ON ledger.referenceentryid = dead_entries.id AND ledger.dead = FALSE

CTE query performance improvement (postgres 9.6)

I have a query below that does what is needed, but it is super slow. For a given period of time (in the example below between 2017-05-01 00:00:00 and 2017-05-01 01:00:00, but it could be anything from 1 second to several days) it returns the first and last records of each interval (15 seconds in the example, but it could be anything from 1 second to several days).
It is unbearably slow. For example, with a 1-second interval over the period 2017-05-01 00:00:00 to 2017-05-01 00:00:01 it runs for 6 seconds on my i7 7700HQ; with a 1-second interval over the period 2017-05-01 00:00:00 to 2017-05-01 00:00:05 I never saw the result! The database has about 60 million rows now; in production it will be 1 billion, with approximately 50 million added every month.
QUERY:
WITH ranges as (
SELECT dd as start_range,
dd + '15 seconds'::interval as end_range,
ROW_NUMBER() over () as grp
FROM generate_series
( '2017-05-01 00:00:00'::timestamp
, '2017-05-01 01:00:00'::timestamp
, '15 seconds'::interval) dd
), create_grp as (
SELECT r.grp, r.start_range, r.end_range, p.*
FROM prices p
JOIN ranges r
ON p.dt >= r.start_range
AND p.dt < r.end_range
WHERE instrument='EURGBP'
), minmax as (
SELECT row_number() over (partition by grp
order by dt asc) as rn1,
row_number() over (partition by grp
order by dt desc) as rn2,
create_grp.*
FROM create_grp
)
SELECT *,
CASE WHEN rn1 = 1 and rn2 = 1 THEN 'first and last'
WHEN rn1 = 1 THEN 'first'
WHEN rn2 = 1 THEN 'last'
END as row_position
FROM minmax
WHERE
1 IN (rn1, rn2)
ORDER BY dt
;
SCHEMA:
CREATE TABLE public.prices
(
uid uuid NOT NULL DEFAULT uuid_generate_v4(),
instrument character varying COLLATE pg_catalog."default" NOT NULL,
bid double precision NOT NULL,
ask double precision NOT NULL,
dt timestamp without time zone NOT NULL DEFAULT now(),
CONSTRAINT prices_pkey PRIMARY KEY (uid)
)
WITH (
OIDS = FALSE
)
TABLESPACE pg_default;
INDEXES:
CREATE INDEX idx_dt_instrument
ON public.prices USING btree
(dt, instrument COLLATE pg_catalog."default")
TABLESPACE pg_default;
CREATE INDEX idx_dt_instrument_bid_ask
ON public.prices USING btree
(dt, instrument COLLATE pg_catalog."default", bid, ask)
TABLESPACE pg_default;
CREATE INDEX idx_instrument
ON public.prices USING btree
(instrument COLLATE pg_catalog."default")
TABLESPACE pg_default;
EXAMPLE dataset for 5 seconds starting from 2017-05-01 00:00:00:
"uid","instrument","bid","ask","dt"
"4ecaa607-3733-4aba-9093-abc8f59e1638","EURGBP","0.84331","0.8434","2017-05-01 00:00:00.031"
"d1a41847-4945-4cf4-a45f-781db977ae07","GBPJPY","143.949005","143.970993","2017-05-01 00:00:00.031"
"34972c12-899b-404a-b0de-bae0fd3f6733","GBPJPY","143.947998","143.970993","2017-05-01 00:00:00.056"
"384b8246-3eac-4826-b6d2-d6caaa364f81","GBPUSD","1.29311","1.29323","2017-05-01 00:00:00.066"
"d879b04d-a4ed-452e-9208-7dfff672e860","GBPJPY","143.947006","143.970993","2017-05-01 00:00:00.067"
"a9e735ec-30d9-4c9a-a28e-5e5553273372","GBPJPY","143.945999","143.970993","2017-05-01 00:00:00.079"
"ee40ee5f-d8ac-41ce-9f39-ae50ef15d02d","GBPJPY","143.947006","143.964005","2017-05-01 00:00:00.091"
"7b605c3b-121f-46c2-a3e4-297d187f0a28","GBPJPY","143.947006","143.968994","2017-05-01 00:00:00.115"
"ccb307b0-7fa3-4354-8707-1426eded49e8","GBPJPY","143.942001","143.968994","2017-05-01 00:00:00.205"
"206c339d-bc36-469d-82d1-c7ae74002f44","EURGBP","0.84332","0.8434","2017-05-01 00:00:00.206"
"bc581318-91c7-4c80-85e0-e06f7236b277","GBPJPY","143.944","143.968994","2017-05-01 00:00:00.206"
"1fabf850-9045-4beb-81ae-bfa3adada62e","GBPUSD","1.29311","1.29324","2017-05-01 00:00:00.208"
"06d40f9a-a47e-466d-aebf-97a0154bdc74","GBPJPY","143.942001","143.968994","2017-05-01 00:00:00.209"
"b3b7fac9-340f-4e3b-8946-7bdecc383191","GBPUSD","1.29311","1.29327","2017-05-01 00:00:00.211"
"b5c28955-b40f-446f-9f43-9d5e9f145c1b","EURGBP","0.84331","0.8434","2017-05-01 00:00:00.212"
"192a40d6-8001-42ea-9430-96e800f2d4e8","GBPJPY","143.942993","143.968994","2017-05-01 00:00:00.212"
"e98dbba7-8231-4fa3-926b-291eb22b0f87","GBPJPY","143.944","143.968994","2017-05-01 00:00:00.215"
"7c6e952d-d01f-4dac-a1df-6b32168246fd","GBPUSD","1.29311","1.29326","2017-05-01 00:00:00.216"
"c86ba29f-3edb-4147-ba99-d9dfe594b0ff","GBPUSD","1.29312","1.29327","2017-05-01 00:00:00.243"
"35ca131e-b714-462d-827f-b5bc593aa3e6","GBPJPY","143.942993","143.968994","2017-05-01 00:00:00.262"
"91fc6fc0-7af9-4036-8e0e-29d4a3bb3e43","EURGBP","0.8433","0.8434","2017-05-01 00:00:00.283"
"e71946e0-1859-461a-b3eb-0539584ac4dc","GBPJPY","143.944","143.968994","2017-05-01 00:00:00.296"
"7321eea8-2610-408b-8dbf-4087f01e8c6e","GBPJPY","143.947998","143.968994","2017-05-01 00:00:00.377"
"f146716d-cadf-4e2f-9f17-6e7e8c5f2175","GBPUSD","1.29312","1.29327","2017-05-01 00:00:00.38"
"b3d81295-8cd5-44e7-879c-f15476ffac21","GBPUSD","1.29312","1.29326","2017-05-01 00:00:00.391"
"c037e27b-a8f4-4ec3-8472-b0f72a58fd33","EURGBP","0.8433","0.8434","2017-05-01 00:00:00.413"
"dba0f8f5-f218-49ea-8ebf-132f3ecf8910","GBPJPY","143.947998","143.968994","2017-05-01 00:00:00.443"
"a08449e3-44aa-4fed-b8e9-bf1a6bfc35c5","EURGBP","0.8433","0.84339","2017-05-01 00:00:00.585"
"a0b7ba20-653f-46db-93e9-d1edd8972dba","GBPUSD","1.29312","1.29326","2017-05-01 00:00:00.588"
"c2855ce8-8c5b-4de7-a92b-186d928e8f31","GBPJPY","143.947998","143.968994","2017-05-01 00:00:00.591"
"01b19a70-3ce7-44c7-9abd-9945321fdfd0","EURGBP","0.8433","0.84339","2017-05-01 00:00:00.621"
"4518aa9d-1f76-428e-ace4-7dcffeb6aa22","GBPJPY","143.947998","143.968994","2017-05-01 00:00:00.796"
"e4d4bac4-dd02-4da3-b231-20bfa6424412","EURGBP","0.8433","0.84339","2017-05-01 00:00:00.907"
"e48d3721-3157-4033-bd4f-09baae0f989c","GBPUSD","1.29312","1.29326","2017-05-01 00:00:00.909"
"64a33b2d-c756-4a0a-823e-075143ae7263","GBPJPY","143.947998","143.968994","2017-05-01 00:00:00.913"
"e477c47d-efd1-44dd-8058-cac17e08bc5d","GBPUSD","1.29314","1.29327","2017-05-01 00:00:00.914"
"cf6d5341-f7fd-47bc-89f6-a5448f78fb99","EURGBP","0.84329","0.84339","2017-05-01 00:00:00.943"
"4caa8bb9-094e-48ca-8a9a-7b2dd23a4fb4","GBPJPY","143.947006","143.968994","2017-05-01 00:00:00.967"
"274b7f51-3b07-430d-bfc5-a0b1ce62a750","GBPUSD","1.29312","1.29327","2017-05-01 00:00:00.975"
"2e3c2cd6-2525-46b3-86c5-b88e48bb0138","GBPJPY","143.947998","143.968994","2017-05-01 00:00:01.076"
"bcb12b90-2795-4789-bfa2-8e9a494eaeb1","GBPUSD","1.29312","1.29326","2017-05-01 00:00:01.076"
"d63d6037-fa81-47cc-bd62-4655850c0f80","GBPUSD","1.29312","1.29327","2017-05-01 00:00:01.077"
"6dbf8d8e-37c8-4537-80b5-c9219f4356b1","EURGBP","0.8433","0.84339","2017-05-01 00:00:01.079"
"7e3b6eaf-22e1-4f87-a7c1-226d3ee76146","EURGBP","0.84329","0.84339","2017-05-01 00:00:01.08"
"63e451c5-b8c6-4b57-ac9e-2171bc1dfbfa","GBPJPY","143.947998","143.968994","2017-05-01 00:00:01.121"
"866316e7-90a2-4c80-9a38-b3a062837415","GBPJPY","143.949005","143.968994","2017-05-01 00:00:01.143"
"fb11e963-cd36-4cfc-89b3-1bba3b8595b5","GBPUSD","1.29312","1.29327","2017-05-01 00:00:01.156"
"ad57e34f-5cbe-4b79-8579-b2c77c83b50b","EURGBP","0.84329","0.84339","2017-05-01 00:00:01.249"
"46b1840e-e424-41c2-8c71-691b201183ab","GBPJPY","143.949005","143.968994","2017-05-01 00:00:01.259"
"c5fa5e09-46df-4ea4-8640-9cf18e1f8fcc","GBPUSD","1.29312","1.29327","2017-05-01 00:00:01.265"
"926d092d-601e-43ec-b398-3300b0345cfd","GBPJPY","143.949997","143.968994","2017-05-01 00:00:01.267"
"4ddfbc84-20a1-4281-86c2-f0a4e77d0152","EURGBP","0.84329","0.84339","2017-05-01 00:00:01.305"
"e75af139-51a2-4a0a-acc8-e0adfbe5472a","GBPUSD","1.29313","1.29327","2017-05-01 00:00:01.346"
"408bca2d-2d57-471b-b6a7-819a3257741f","GBPJPY","143.951004","143.968994","2017-05-01 00:00:01.348"
"53c6f444-bd76-4a28-a370-7af98d5fa9ec","GBPUSD","1.29313","1.29326","2017-05-01 00:00:01.359"
"cd002ef4-925e-469b-8f2f-848cadd5943f","EURGBP","0.84329","0.84339","2017-05-01 00:00:01.443"
"13587e44-b694-4ce9-a626-e592d28c507d","EURGBP","0.8433","0.84339","2017-05-01 00:00:01.45"
"aabbacaf-3f09-4313-b992-e8b8d91df7ad","GBPJPY","143.951004","143.968994","2017-05-01 00:00:01.461"
"6b2111ef-c285-4482-b93c-238f27522ca3","GBPUSD","1.29313","1.29326","2017-05-01 00:00:01.477"
"19507f29-149c-4fdf-a1fa-aac312ab8479","EURGBP","0.8433","0.84338","2017-05-01 00:00:01.506"
"916cc759-a536-449d-b825-8203ebea9bf8","GBPJPY","143.949997","143.968994","2017-05-01 00:00:01.649"
"1fdd2b35-fd44-4dbb-81df-b98514af7004","GBPUSD","1.29313","1.29326","2017-05-01 00:00:01.649"
"3e9fc214-3cc6-4991-9eb3-2f5a5e557e96","GBPUSD","1.29312","1.29326","2017-05-01 00:00:01.65"
"5da9a29b-0d8a-42b1-98b6-f89dd2893c77","EURGBP","0.8433","0.84338","2017-05-01 00:00:01.651"
"b87f85e3-fba5-4556-8ca5-52625d978d53","EURGBP","0.84329","0.84338","2017-05-01 00:00:01.652"
"75b7624c-b90c-40d4-a7fc-09df7da9f659","GBPJPY","143.949997","143.968994","2017-05-01 00:00:01.715"
"b076534f-b893-4b55-9d9a-e3bac6e77ab2","GBPUSD","1.29312","1.29325","2017-05-01 00:00:01.732"
"6464da85-9eb5-4548-bd5e-3331a69f2121","EURGBP","0.84329","0.84338","2017-05-01 00:00:01.817"
"f9937464-e36a-4c57-a212-2f32943307d3","EURGBP","0.8433","0.84338","2017-05-01 00:00:01.83"
"1c7848e5-c101-4a50-87fe-1f5e980dbb95","GBPJPY","143.949005","143.968994","2017-05-01 00:00:01.83"
"30ad5bfc-f968-4b49-9b22-80c12e801f37","GBPUSD","1.29312","1.29325","2017-05-01 00:00:01.847"
"94eb4b51-4901-4e56-873a-05f93ef35e64","EURGBP","0.8433","0.84338","2017-05-01 00:00:02.007"
"97ef7e05-0a2f-4d70-80d1-1d3c748ea70f","GBPJPY","143.949005","143.968994","2017-05-01 00:00:02.007"
"af9274d9-02d6-43fc-9336-c3f512e4bd78","GBPUSD","1.29312","1.29325","2017-05-01 00:00:02.008"
"78c12ba8-992a-469e-bdda-bef004596aa1","GBPUSD","1.29311","1.29325","2017-05-01 00:00:02.062"
"4e1e31a9-bfdc-4f56-b8e1-0c7e83852fe0","GBPUSD","1.29311","1.29324","2017-05-01 00:00:02.071"
"e4f1c459-f75b-4005-8dd1-00c572ded4f9","GBPUSD","1.29311","1.29324","2017-05-01 00:00:02.203"
"2c8b38ce-50de-43e7-ba63-b6d1d7a7f76c","GBPJPY","143.947998","143.968994","2017-05-01 00:00:02.244"
"438f4439-a259-421f-8f45-a0dc88d80e1d","GBPJPY","143.949997","143.968994","2017-05-01 00:00:02.257"
"f12b7e18-84f0-4686-9588-0ecfab8f8bf8","GBPUSD","1.29311","1.29324","2017-05-01 00:00:02.301"
"f335ee32-cd5e-4cd2-aabe-06c14985778f","GBPJPY","143.949997","143.968994","2017-05-01 00:00:02.394"
"3c55786e-f9b8-424b-9c32-0b4b432a553f","EURGBP","0.8433","0.84338","2017-05-01 00:00:02.404"
"53c3e013-b016-41bb-a3ed-48158640831a","GBPUSD","1.29311","1.29324","2017-05-01 00:00:02.417"
"a944e2b6-3700-4ecd-994d-cb5ab97cf5ad","EURGBP","0.8433","0.84339","2017-05-01 00:00:02.508"
"ab32781d-1502-4047-b320-c3f4cda389fd","GBPJPY","143.949997","143.968994","2017-05-01 00:00:02.605"
"24759950-e095-4bb7-be3d-fda725423589","EURGBP","0.84329","0.84339","2017-05-01 00:00:02.625"
"cd7940d9-8a30-4a12-9dc9-a2c710cbb982","GBPUSD","1.29311","1.29325","2017-05-01 00:00:02.649"
"233e0a3d-eeae-403f-b512-2add055d6735","GBPJPY","143.949997","143.968994","2017-05-01 00:00:02.761"
"58c7c33c-75b2-45bd-a50c-615d9eaaec07","EURGBP","0.84329","0.84339","2017-05-01 00:00:02.763"
"675b2847-b0f9-44f1-ab75-d6ddafbddfd5","GBPUSD","1.29311","1.29325","2017-05-01 00:00:02.766"
"d2946714-e84c-4a36-9819-0cca4f2dd197","EURGBP","0.8433","0.84339","2017-05-01 00:00:02.81"
"5322418b-4c45-4771-b591-cc8fea167424","GBPJPY","143.949997","143.968994","2017-05-01 00:00:02.832"
"9a1c572e-28b9-4d16-b196-ccef3e48f602","EURGBP","0.84331","0.84339","2017-05-01 00:00:02.929"
"a4377169-5400-4dd6-81f0-8b455d178b65","EURGBP","0.8433","0.84339","2017-05-01 00:00:02.951"
"37cdddb0-db5e-4574-bf4d-45fb8d52c6be","GBPUSD","1.29311","1.29325","2017-05-01 00:00:02.951"
"1258c886-b797-43ed-9886-323daac83dde","GBPUSD","1.29311","1.29325","2017-05-01 00:00:02.999"
"453406c3-5902-4dab-b39c-2e0b85739767","EURGBP","0.8433","0.84339","2017-05-01 00:00:03.005"
"95699a6d-8412-4f2f-9dd4-86a1d3eed57d","GBPUSD","1.29312","1.29325","2017-05-01 00:00:03.129"
"2268fcc4-104e-434b-8237-bce765bda084","EURGBP","0.8433","0.84339","2017-05-01 00:00:03.131"
"ba771b59-b04c-4683-8ceb-8e9b8451bcc7","GBPUSD","1.29312","1.29325","2017-05-01 00:00:03.17"
"00541afa-79a6-4d09-8c6e-4d59e8d15bb2","GBPJPY","143.947006","143.968994","2017-05-01 00:00:03.216"
"b4dd6e15-0e25-4da1-b210-021b91e4245a","GBPUSD","1.29312","1.29325","2017-05-01 00:00:03.323"
"79b11513-97c5-4146-a4c8-7daf7d8ffe6f","GBPJPY","143.947006","143.966995","2017-05-01 00:00:03.324"
"c5283502-eb09-45d3-b92b-d5625de9e9fc","GBPJPY","143.947006","143.966003","2017-05-01 00:00:03.411"
"67715855-9171-4ae7-b205-5ba017d328f2","EURGBP","0.8433","0.84339","2017-05-01 00:00:03.501"
"00b7647e-7d53-407d-9704-78a8712ac580","GBPJPY","143.947006","143.966003","2017-05-01 00:00:03.525"
"b41b5483-5fb2-4c57-9892-0d02e5e6a823","GBPJPY","143.947998","143.966003","2017-05-01 00:00:03.549"
"881712b2-ff9a-4065-ad6a-9caee03283e3","GBPJPY","143.947998","143.964996","2017-05-01 00:00:03.56"
"7248bc13-5a9f-4fba-97a1-fe041e975c38","EURGBP","0.8433","0.84339","2017-05-01 00:00:03.581"
"0379b4bb-7691-48e8-a668-2388f6e5d510","GBPJPY","143.947998","143.964996","2017-05-01 00:00:03.672"
"2715fe9a-eb6f-445d-b7d0-850293bb5b2e","GBPJPY","143.947006","143.964996","2017-05-01 00:00:03.698"
"52ce7685-0964-46e4-ab8f-282e68bdb73d","GBPJPY","143.947006","143.964005","2017-05-01 00:00:03.711"
"6b31dee4-1aec-4d10-9b37-0c58bae7e449","GBPJPY","143.947006","143.964005","2017-05-01 00:00:03.848"
"7e61571a-ea8b-44cb-9481-ae31ba8b86c4","GBPUSD","1.29312","1.29325","2017-05-01 00:00:03.868"
"89e316cb-e01d-4f01-9dcd-a380e94ed4e8","GBPJPY","143.947006","143.964005","2017-05-01 00:00:03.943"
"966318e4-a6b5-4c1b-82e1-fbbf0029ce02","GBPUSD","1.29312","1.29325","2017-05-01 00:00:04.096"
"8fc7b075-7f5e-40f5-9879-55604f77a3c5","EURGBP","0.8433","0.84339","2017-05-01 00:00:04.227"
"23ab8b66-337e-4c4e-b73a-f31101716565","GBPJPY","143.947006","143.964005","2017-05-01 00:00:04.227"
"eb7469be-8a40-4498-af85-7b2c42630149","GBPJPY","143.947006","143.964005","2017-05-01 00:00:04.288"
"f945a821-9790-4760-beea-1b54bb1c2a85","EURGBP","0.8433","0.84339","2017-05-01 00:00:04.406"
"978c6d4c-22f9-4218-b447-940a3d3c436e","EURGBP","0.8433","0.84339","2017-05-01 00:00:04.505"
"2ace7d77-b228-491c-9cf0-610b8e48cdf7","GBPJPY","143.947006","143.966003","2017-05-01 00:00:04.786"
"0f89d5dd-52f2-44e9-aa54-cbb423cd7416","GBPJPY","143.947006","143.964996","2017-05-01 00:00:04.787"
"db7748a8-317b-4c62-a1f2-d6679430b343","GBPUSD","1.29312","1.29325","2017-05-01 00:00:04.895"
"ae29728e-773b-42a9-9f1e-199ba996731a","EURGBP","0.8433","0.84339","2017-05-01 00:00:04.986"
"19ffa8e4-63c5-4f47-90d5-771dd5a839ba","GBPJPY","143.947006","143.962997","2017-05-01 00:00:04.986"
The slowdown is caused by "WHERE instrument='EURGBP'". As soon as I remove it, the query flies. However, I must be able to filter by instrument.
EXPLAIN OUTPUT:
Sort (cost=2356660289.95..2356711257.16 rows=20386886 width=144)
  Sort Key: minmax.dt
  CTE ranges
    ->  WindowAgg (cost=0.00..25.00 rows=1000 width=24)
          ->  Function Scan on generate_series dd (cost=0.00..10.00 rows=1000 width=8)
  CTE create_grp
    ->  Nested Loop (cost=344299.48..460976348.77 rows=2043798111 width=71)
          Join Filter: ((p.dt >= r.start_range) AND (p.dt < r.end_range))
          ->  Bitmap Heap Scan on prices p (cost=344299.48..1121763.77 rows=18394183 width=47)
                Recheck Cond: ((instrument)::text = 'EURGBP'::text)
                ->  Bitmap Index Scan on idx_instrument (cost=0.00..339700.94 rows=18394183 width=0)
                      Index Cond: ((instrument)::text = 'EURGBP'::text)
          ->  CTE Scan on ranges r (cost=0.00..20.00 rows=1000 width=24)
  CTE minmax
    ->  WindowAgg (cost=1796644092.85..1837520055.07 rows=2043798111 width=112)
          ->  Sort (cost=1796644092.85..1801753588.13 rows=2043798111 width=104)
                Sort Key: create_grp.grp, create_grp.dt DESC
                ->  WindowAgg (cost=880857947.67..921733909.89 rows=2043798111 width=104)
                      ->  Sort (cost=880857947.67..885967442.95 rows=2043798111 width=96)
                            Sort Key: create_grp.grp, create_grp.dt
                            ->  CTE Scan on create_grp (cost=0.00..40875962.22 rows=2043798111 width=96)
  ->  CTE Scan on minmax (cost=0.00..51298821.64 rows=20386886 width=144)
        Filter: ((1 = rn1) OR (1 = rn2))
Any suggestions to optimize it are very very welcome.
CREATE TEMP TABLE ranges
( start_range timestamp NOT NULL
, end_range timestamp NOT NULL
, grp INTEGER NOT NULL UNIQUE
, PRIMARY KEY (start_range)
);
INSERT INTO ranges(start_range, end_range, grp)
SELECT dd as start_range,
dd + '1 seconds'::interval as end_range,
ROW_NUMBER() over () as grp
FROM generate_series
( '2017-05-01 00:00:00'::timestamp
, '2017-05-01 01:00:00'::timestamp
--, '15 seconds'::interval) dd
, '1 seconds'::interval) dd
;
VACUUM ANALYZE ranges;
-- EXPLAIN ANALYZE
SELECT *,
CASE WHEN rn1 = 1 and rn2 = 1 THEN 'first and last'
WHEN rn1 = 1 THEN 'first'
WHEN rn2 = 1 THEN 'last'
END as row_position
FROM (
SELECT r.grp, p.uid, p.instrument, r.start_range , p.dt AS dt
, p.bid, p.ask
, row_number() over (partition by r.grp order by p.dt asc) as rn1
, row_number() over (partition by r.grp order by p.dt desc) as rn2
FROM prices p
JOIN ranges r ON p.dt >= r.start_range AND p.dt < r.end_range
WHERE p.instrument='EURGBP'
) zzzz
WHERE (rn1=1 OR rn2=1)
-- ORDER BY instrument,grp, dt
;
And, you don't actually need the row number, since you only want the beginning/end of a window:
-- EXPLAIN ANALYZE
SELECT *,
CASE WHEN (prev IS NULL AND next IS NULL) THEN 'first and last'
WHEN prev IS NULL THEN 'first'
WHEN next IS NULL THEN 'last'
END as row_position
FROM (
SELECT r.grp, p.uid, p.instrument, r.start_range , p.dt AS dt, p.bid, p.ask
, lag(uid) over (www) as prev
, lead(uid) over (www) as next
FROM prices p
JOIN ranges r ON p.dt >= r.start_range AND p.dt < r.end_range
WHERE p.instrument='EURGBP'
WINDOW www AS (partition by r.grp order by p.dt )
-- ORDER BY p.instrument,r.grp, p.dt
) qqqq
WHERE (prev IS NULL OR next IS NULL)
-- ORDER BY instrument,grp, dt
;
Thanks to all for the help and hints! Apparently, if I move the WHERE clause from create_grp to minmax, the query becomes extremely fast. I have no explanation for it whatsoever. But it works. Here it is:
WITH ranges as (
SELECT dd as start_range,
dd + '1 seconds'::interval as end_range,
ROW_NUMBER() over () as grp
FROM generate_series
( '2017-05-01 00:00:00'::timestamp
, '2017-05-01 00:01:00'::timestamp
, '1 seconds'::interval) dd
), create_grp as (
SELECT r.grp, r.start_range, r.end_range, p.*
FROM prices p
JOIN ranges r
ON p.dt >= r.start_range
AND p.dt < r.end_range
-- The WHERE needs to be moved out of here (this intermediate result has no indexes, being temporary)
-- WHERE instrument='EURGBP'
), minmax as (
SELECT row_number() over (partition by grp
order by dt asc) as rn1,
row_number() over (partition by grp
order by dt desc) as rn2,
create_grp.*
FROM create_grp
-- Here is where the WHERE goes! It does use the index here for some reason, and the query flies.
WHERE instrument='EURGBP'
)
SELECT *,
CASE WHEN rn1 = 1 and rn2 = 1 THEN 'first and last'
WHEN rn1 = 1 THEN 'first'
WHEN rn2 = 1 THEN 'last'
END as row_position
FROM minmax
WHERE
1 IN (rn1, rn2)
ORDER BY dt
;
This query (for a 1-minute range) runs in 84 ms; the old one never finished (I gave up after 10 minutes). The old query took 10 seconds to run for a range of 1 second.
Thanks to everyone for help!

Create Postgres View involving multiple tables

Postgres Version 9.4
I already have this query involving two tables, namely surigao_stats_2yr and evidensapp_surigaobldg.
SELECT
province,
municipali AS municipality,
brgy_locat AS barangay,
total_bldg,
(total_bldg - count(id)) as total_not_affected,
count(id) AS total_affected,
count(id) FILTER (WHERE gridcode = 1) AS low,
count(id) FILTER (WHERE gridcode = 2) AS medium,
count(id) FILTER (WHERE gridcode = 3) AS high
FROM (
SELECT
pss.province,ps.id, ps.brgy_locat, ps.municipali,
gridcode, count(pss.id) as total_bldg
FROM surigao_stats_2yr ps
INNER JOIN
evidensapp_surigaobldg pss
ON ps.brgy_locat = pss.brgy_locat
AND ps.municipali = pss.municipali
GROUP BY 1, 2,3, 4, 5
)
AS ps_fh GROUP BY 1,2,3, 4
ORDER BY 1 ASC, 2 ASC
The result of the query above is this:
I want a view involving another two tables with the same result shape, which can be filtered by municipality and barangay, and which sums total_bldg, total_not_affected, total_affected, low, medium and high when the municipality and barangay are the same.
So far, I have tried this:
CREATE VIEW surigao_del_norte_2yr_affected AS
SELECT
province,
municipality,
barangay,
SUM(total_bldg) as tot_bldg,
SUM(total_not_affected) as tot_not_affect,
SUM(total_affected) as tot_affect,
SUM(low) as lo,
SUM(medium) as med,
SUM(high)as hi
FROM(
WITH surigao AS(
SELECT
province,
municipali AS municipality,
brgy_locat AS barangay,
total_bldg,
(total_bldg - count(id)) as total_not_affected,
count(id) AS total_affected,
count(id) FILTER (WHERE gridcode = 1) AS low,
count(id) FILTER (WHERE gridcode = 2) AS medium,
count(id) FILTER (WHERE gridcode = 3) AS high
FROM (
SELECT
pss.province,ps.id, ps.brgy_locat, ps.municipali,
gridcode, count(pss.id) as total_bldg
FROM surigao_stats_2yr ps
INNER JOIN
evidensapp_surigaobldg pss
ON ps.brgy_locat = pss.brgy_locat
AND ps.municipali = pss.municipali
GROUP BY 1, 2,3, 4, 5
)
AS ps_fh GROUP BY 1,2,3, 4
ORDER BY 1 ASC, 2 ASC
), magallanes AS(
SELECT
province,
municipali AS municipality,
brgy_locat AS barangay,
total_bldg,
(total_bldg - count(id)) as total_not_affected,
count(id) AS total_affected,
count(id) FILTER (WHERE gridcode = 1) AS low,
count(id) FILTER (WHERE gridcode = 2) AS medium,
count(id) FILTER (WHERE gridcode = 3) AS high
FROM (
SELECT
pss.province,ps.id, ps.brgy_locat, ps.municipali,
gridcode, count(pss.id) as total_bldg
FROM magallanes_stats_2yr ps
INNER JOIN
evidensapp_magallanesbldg pss
ON ps.brgy_locat = pss.brgy_locat
AND ps.municipali = pss.municipali
GROUP BY 1, 2,3, 4, 5
)
AS ps_fh GROUP BY 1,2,3, 4
ORDER BY 1 ASC, 2 ASC
), mainit AS(
SELECT
province,
municipali AS municipality,
brgy_locat AS barangay,
total_bldg,
(total_bldg - count(id)) as total_not_affected,
count(id) AS total_affected,
count(id) FILTER (WHERE gridcode = 1) AS low,
count(id) FILTER (WHERE gridcode = 2) AS medium,
count(id) FILTER (WHERE gridcode = 3) AS high
FROM (
SELECT
pss.province,ps.id, ps.brgy_locat, ps.municipali,
gridcode, count(pss.id) as total_bldg
FROM mainit_tubay_stats_2yr ps
INNER JOIN
evidensapp_mainittubaybldg pss
ON ps.brgy_locat = pss.brgy_locat
AND ps.municipali = pss.municipali
WHERE province='Surigao del Norte'
GROUP BY 1, 2,3, 4, 5
)
AS ps_fh GROUP BY 1,2,3, 4
ORDER BY 1 ASC, 2 ASC
)
TABLE surigao
UNION ALL
TABLE magallanes
UNION ALL
TABLE mainit
) surigao_2yr
--WHERE municipality ='Placer'
--AND barangay = 'Santa Cruz'
GROUP BY 1,2,3
ORDER BY 2
SELECT * FROM surigao_del_norte_2yr_affected WHERE municipality = 'Placer' AND barangay = 'Santa Cruz'
The performance is quite slow, and I want to know whether the values mentioned above get summed correctly by the query I have.
Thanks.

How to calculate longest streak in SQL?

I have
TABLE EMPLOYEE - ID,DATE,IsPresent
I want to calculate the longest streak of employee presence. The IsPresent bit will be false for days he didn't come, so I want to calculate the longest number of consecutive days he came to the office. The Date column is unique. So I tried this:
Select Id,Count(*) from Employee where IsPresent=1
But the above doesn't work. Can anyone guide me towards how I can calculate the streak for this? I am sure people have come across this; I tried searching online but didn't understand it well. Please help me out.
EDIT Here's a SQL Server version of the query:
with LowerBound as (select second_day.EmployeeId
, second_day."DATE" as LowerDate
, row_number() over (partition by second_day.EmployeeId
order by second_day."DATE") as RN
from T second_day
left outer join T first_day
on first_day.EmployeeId = second_day.EmployeeId
and first_day."DATE" = dateadd(day, -1, second_day."DATE")
and first_day.IsPresent = 1
where first_day.EmployeeId is null
and second_day.IsPresent = 1)
, UpperBound as (select first_day.EmployeeId
, first_day."DATE" as UpperDate
, row_number() over (partition by first_day.EmployeeId
order by first_day."DATE") as RN
from T first_day
left outer join T second_day
on first_day.EmployeeId = second_day.EmployeeId
and first_day."DATE" = dateadd(day, -1, second_day."DATE")
and second_day.IsPresent = 1
where second_day.EmployeeId is null
and first_day.IsPresent = 1)
select LB.EmployeeID, max(datediff(day, LowerDate, UpperDate) + 1) as LongestStreak
from LowerBound LB
inner join UpperBound UB
on LB.EmployeeId = UB.EmployeeId
and LB.RN = UB.RN
group by LB.EmployeeId
SQL Server Version of the test data:
create table T (EmployeeId int
, "DATE" date not null
, IsPresent bit not null
, constraint T_PK primary key (EmployeeId, "DATE")
)
insert into T values (1, '2000-01-01', 1);
insert into T values (2, '2000-01-01', 0);
insert into T values (3, '2000-01-01', 0);
insert into T values (3, '2000-01-02', 1);
insert into T values (3, '2000-01-03', 1);
insert into T values (3, '2000-01-04', 0);
insert into T values (3, '2000-01-05', 1);
insert into T values (3, '2000-01-06', 1);
insert into T values (3, '2000-01-07', 0);
insert into T values (4, '2000-01-01', 0);
insert into T values (4, '2000-01-02', 1);
insert into T values (4, '2000-01-03', 1);
insert into T values (4, '2000-01-04', 1);
insert into T values (4, '2000-01-05', 1);
insert into T values (4, '2000-01-06', 1);
insert into T values (4, '2000-01-07', 0);
insert into T values (5, '2000-01-01', 0);
insert into T values (5, '2000-01-02', 1);
insert into T values (5, '2000-01-03', 0);
insert into T values (5, '2000-01-04', 1);
insert into T values (5, '2000-01-05', 1);
insert into T values (5, '2000-01-06', 1);
insert into T values (5, '2000-01-07', 0);
Sorry, this is written in Oracle, so substitute the appropriate SQL Server date arithmetic.
Assumptions:
Date is either a Date value or a DateTime with a time component of 00:00:00.
The primary key is (EmployeeId, Date).
All fields are not null.
If a date is missing for the employee, they were not present. (This is used to handle the beginning and ending of the data series, but it also means that missing dates in the middle will break streaks. Could be a problem depending on requirements.)
with LowerBound as (select second_day.EmployeeId
, second_day."DATE" as LowerDate
, row_number() over (partition by second_day.EmployeeId
order by second_day."DATE") as RN
from T second_day
left outer join T first_day
on first_day.EmployeeId = second_day.EmployeeId
and first_day."DATE" = second_day."DATE" - 1
and first_day.IsPresent = 1
where first_day.EmployeeId is null
and second_day.IsPresent = 1)
, UpperBound as (select first_day.EmployeeId
, first_day."DATE" as UpperDate
, row_number() over (partition by first_day.EmployeeId
order by first_day."DATE") as RN
from T first_day
left outer join T second_day
on first_day.EmployeeId = second_day.EmployeeId
and first_day."DATE" = second_day."DATE" - 1
and second_day.IsPresent = 1
where second_day.EmployeeId is null
and first_day.IsPresent = 1)
select LB.EmployeeID, max(UpperDate - LowerDate + 1) as LongestStreak
from LowerBound LB
inner join UpperBound UB
on LB.EmployeeId = UB.EmployeeId
and LB.RN = UB.RN
group by LB.EmployeeId
Test Data:
create table T (EmployeeId number(38)
, "DATE" date not null check ("DATE" = trunc("DATE"))
, IsPresent number not null check (IsPresent in (0, 1))
, constraint T_PK primary key (EmployeeId, "DATE")
)
/
insert into T values (1, to_date('2000-01-01', 'YYYY-MM-DD'), 1);
insert into T values (2, to_date('2000-01-01', 'YYYY-MM-DD'), 0);
insert into T values (3, to_date('2000-01-01', 'YYYY-MM-DD'), 0);
insert into T values (3, to_date('2000-01-02', 'YYYY-MM-DD'), 1);
insert into T values (3, to_date('2000-01-03', 'YYYY-MM-DD'), 1);
insert into T values (3, to_date('2000-01-04', 'YYYY-MM-DD'), 0);
insert into T values (3, to_date('2000-01-05', 'YYYY-MM-DD'), 1);
insert into T values (3, to_date('2000-01-06', 'YYYY-MM-DD'), 1);
insert into T values (3, to_date('2000-01-07', 'YYYY-MM-DD'), 0);
insert into T values (4, to_date('2000-01-01', 'YYYY-MM-DD'), 0);
insert into T values (4, to_date('2000-01-02', 'YYYY-MM-DD'), 1);
insert into T values (4, to_date('2000-01-03', 'YYYY-MM-DD'), 1);
insert into T values (4, to_date('2000-01-04', 'YYYY-MM-DD'), 1);
insert into T values (4, to_date('2000-01-05', 'YYYY-MM-DD'), 1);
insert into T values (4, to_date('2000-01-06', 'YYYY-MM-DD'), 1);
insert into T values (4, to_date('2000-01-07', 'YYYY-MM-DD'), 0);
insert into T values (5, to_date('2000-01-01', 'YYYY-MM-DD'), 0);
insert into T values (5, to_date('2000-01-02', 'YYYY-MM-DD'), 1);
insert into T values (5, to_date('2000-01-03', 'YYYY-MM-DD'), 0);
insert into T values (5, to_date('2000-01-04', 'YYYY-MM-DD'), 1);
insert into T values (5, to_date('2000-01-05', 'YYYY-MM-DD'), 1);
insert into T values (5, to_date('2000-01-06', 'YYYY-MM-DD'), 1);
insert into T values (5, to_date('2000-01-07', 'YYYY-MM-DD'), 0);
A GROUP BY is missing.
To select total man-days of attendance for the whole office (everyone):
Select Id,Count(*) from Employee where IsPresent=1
To select man-days of attendance per employee:
Select Id,Count(*)
from Employee
where IsPresent=1
group by id;
But that is still not good because it counts the total days of attendance and NOT the length of continuous attendance.
What you need to do is construct a temp table with another date column date2. date2 is set to today. The table is the list of all days an employee is absent.
create tmpdb.absentdates as
Select id, date, today as date2
from EMPLOYEE
where IsPresent=0
order by id, date;
So the trick is to calculate the date difference between two absent days to find the length of continuously present days.
Now, fill in date2 with the next absent date per employee. The most recent record per employee will not be updated but left with the value of today, because there is no record with a greater date than today in the database.
update tmpdb.absentdates
set date2 =
select min(a2.date)
from
tmpdb.absentdates a1,
tmpdb.absentdates a2
where a1.id = a2.id
and a1.date < a2.date
The above query updates the table by joining it to itself and may deadlock, so it is better to create two copies of the temp table.
create tmpdb.absentdatesX as
Select id, date
from EMPLOYEE
where IsPresent=0
order by id, date;
create tmpdb.absentdates as
select *, today as date2
from tmpdb.absentdatesX;
You need to insert the hiring date, presuming the earliest date per employee in the database is the hiring date.
insert into tmpdb.absentdates a
select a.id, min(e.date), today
from EMPLOYEE e
where a.id = e.id
Now update date2 with the next later absent date to be able to perform date2 - date.
update tmpdb.absentdates
set date2 =
select min(x.date)
from
tmpdb.absentdates a,
tmpdb.absentdatesX x
where a.id = x.id
and a.date < x.date
This will list the number of days an employee is continuously present:
select id, datediff(date2, date) as continuousPresence
from tmpdb.absentdates
group by id, continuousPresence
order by id, continuousPresence
But you only want the longest streak:
select id, max(datediff(date2, date)) as continuousPresence
from tmpdb.absentdates
group by id
order by id
However, the above is still problematic because datediff does not take into account holidays and weekends.
So we depend on the count of records as the legitimate working days.
create tmpdb.absentCount as
Select a.id, a.date, a.date2, count(*) as continuousPresence
from EMPLOYEE e, tmpdb.absentdates a
where e.id = a.id
and e.date >= a.date
and e.date < a.date2
group by a.id, a.date
order by a.id, a.date;
Remember, every time you use an aggregate like count or avg,
you need to GROUP BY the selected item list, because you have to aggregate by those columns.
Now select the max streak
select id, max(continuousPresence)
from tmpdb.absentCount
group by id
To list the dates of streak:
select id, date, date2, continuousPresence
from tmpdb.absentCount
group by id
having continuousPresence = max(continuousPresence);
There may be some mistakes in the above (SQL Server T-SQL), but this is the general idea.
Try this:
select
e.Id,
e.date,
(select
max(e1.date)
from
employee e1
where
e1.Id = e.Id and
e1.date < e.date and
e1.IsPresent = 0) StreakStartDate,
(select
min(e2.date)
from
employee e2
where
e2.Id = e.Id and
e2.date > e.date and
e2.IsPresent = 0) StreakEndDate
from
employee e
where
e.IsPresent = 1
Then find the longest streak for each employee:
select id, max(datediff(streakStartDate, streakEndDate))
from (<use subquery above>)
group by id
I'm not fully sure this query has correct syntax, because I don't have a database at hand just now.
Also notice that the streak start and streak end columns contain not the first and last days when the employee was present, but the nearest dates when he was absent. If the dates in the table are spaced approximately evenly, this does not matter much; otherwise the query becomes a little more complex, because we would need to find the nearest presence dates instead. That improvement would also handle the situation where the longest streak is the first or the last streak.
The main idea is, for each date when the employee was present, to find the streak start and streak end. For each present row, the streak start is the maximum date that is less than the date of the current row and on which the employee was absent.
Here is an alternate version, to handle missing days differently. Say that you only record a record for work days, and being at work Monday-Friday one week and Monday-Friday of the next week counts as ten consecutive days. This query assumes that missing dates in the middle of a series of rows are non-work days.
with LowerBound as (select second_day.EmployeeId
, second_day."DATE" as LowerDate
, row_number() over (partition by second_day.EmployeeId
order by second_day."DATE") as RN
from T second_day
left outer join T first_day
on first_day.EmployeeId = second_day.EmployeeId
and first_day."DATE" = dateadd(day, -1, second_day."DATE")
and first_day.IsPresent = 1
where first_day.EmployeeId is null
and second_day.IsPresent = 1)
, UpperBound as (select first_day.EmployeeId
, first_day."DATE" as UpperDate
, row_number() over (partition by first_day.EmployeeId
order by first_day."DATE") as RN
from T first_day
left outer join T second_day
on first_day.EmployeeId = second_day.EmployeeId
and first_day."DATE" = dateadd(day, -1, second_day."DATE")
and second_day.IsPresent = 1
where second_day.EmployeeId is null
and first_day.IsPresent = 1)
select LB.EmployeeID, max(datediff(day, LowerDate, UpperDate) + 1) as LongestStreak
from LowerBound LB
inner join UpperBound UB
on LB.EmployeeId = UB.EmployeeId
and LB.RN = UB.RN
group by LB.EmployeeId
go
with NumberedRows as (select EmployeeId
, "DATE"
, IsPresent
, row_number() over (partition by EmployeeId
order by "DATE") as RN
-- , min("DATE") over (partition by EmployeeId, IsPresent) as MinDate
-- , max("DATE") over (partition by EmployeeId, IsPresent) as MaxDate
from T)
, LowerBound as (select SecondRow.EmployeeId
, SecondRow.RN
, row_number() over (partition by SecondRow.EmployeeId
order by SecondRow.RN) as LowerBoundRN
from NumberedRows SecondRow
left outer join NumberedRows FirstRow
on FirstRow.IsPresent = 1
and FirstRow.EmployeeId = SecondRow.EmployeeId
and FirstRow.RN + 1 = SecondRow.RN
where FirstRow.EmployeeId is null
and SecondRow.IsPresent = 1)
, UpperBound as (select FirstRow.EmployeeId
, FirstRow.RN
, row_number() over (partition by FirstRow.EmployeeId
order by FirstRow.RN) as UpperBoundRN
from NumberedRows FirstRow
left outer join NumberedRows SecondRow
on SecondRow.IsPresent = 1
and FirstRow.EmployeeId = SecondRow.EmployeeId
and FirstRow.RN + 1 = SecondRow.RN
where SecondRow.EmployeeId is null
and FirstRow.IsPresent = 1)
select LB.EmployeeId, max(UB.RN - LB.RN + 1)
from LowerBound LB
inner join UpperBound UB
on LB.EmployeeId = UB.EmployeeId
and LB.LowerBoundRN = UB.UpperBoundRN
group by LB.EmployeeId
I did this once to determine consecutive days that a fire fighter had been on shift at least 15 minutes.
Your case is a bit simpler.
If you wanted to assume that no employee came more than 32 consecutive times, you could just use a Common Table Expression. But a better approach would be to use a temp table and a while loop.
You will need a column called StartingRowID. Keep joining from your temp table to the employeeWorkDay table for the next consecutive employee work day and insert them back into the temp table. When @@ROWCOUNT = 0, you have captured the longest streak.
Now aggregate by StartingRowID to get the first day of the longest streak. I'm running short on time, or I would include some sample code.
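For what it's worth, here is a rough sketch of that loop-based idea (my interpretation, untested; it reuses the SQL Server test table T from the earlier answer and tracks a start date rather than a StartingRowID):

-- Seed one row per present day, then keep extending each row by one
-- consecutive calendar day per pass until no row can be extended further.
CREATE TABLE #streak (EmployeeId int, StartDate date, CurrentDate date, StreakLen int);

INSERT INTO #streak (EmployeeId, StartDate, CurrentDate, StreakLen)
SELECT EmployeeId, "DATE", "DATE", 1
FROM T
WHERE IsPresent = 1;

WHILE 1 = 1
BEGIN
    UPDATE s
    SET CurrentDate = DATEADD(day, 1, s.CurrentDate),
        StreakLen = s.StreakLen + 1
    FROM #streak AS s
    INNER JOIN T AS nxt
            ON nxt.EmployeeId = s.EmployeeId
           AND nxt."DATE" = DATEADD(day, 1, s.CurrentDate)
           AND nxt.IsPresent = 1;

    IF @@ROWCOUNT = 0 BREAK;  -- nothing was extended: every streak has been captured
END;

-- Longest streak per employee; StartDate on the widest row is its first day.
SELECT EmployeeId, MAX(StreakLen) AS LongestStreak
FROM #streak
GROUP BY EmployeeId;

Like the earlier answer, this treats a missing date as an absence, so streaks are broken by gaps in the data.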