Efficient way of calculating overlap area in PostGIS - postgresql

I am working on calculating the overlap of two tables/layers in a PostGIS database. One set is a grid of hexagons, and for each hexagon I want to calculate the fraction of its area that overlaps another set of polygons. The multipolygon set also has a few polygons that overlap each other, so I need to dissolve/union those first. Previously I did this in FME, but it kept running out of memory for some of the larger polygons. That's why I want to do this in the database (and PostGIS should be very much capable of it).
Here is what I have so far; it works, and memory is no longer running out:
EXPLAIN ANALYZE
WITH rh_union AS (
    SELECT (ST_Dump(ST_Union(rh.geometry))).geom AS geometry
    FROM relevant_habitats rh
    WHERE rh.assessment_area_id = 1
)
SELECT h.receptor_id,
       SUM(ROUND((ST_Area(ST_Intersection(rhu.geometry, h.geometry)) / ST_Area(h.geometry))::numeric, 3)) AS frac_overlap
FROM rh_union rhu, hexagons h
WHERE h.zoom_level = 1
  AND ST_Intersects(rhu.geometry, h.geometry)
GROUP BY h.receptor_id
So I first union the multipolygons and dump the result back into single polygons. Then I overlay the hexagons with those polygons and sum the areas of all the small intersection pieces as a fraction of each hexagon's area.
Now, my question is:
is this an efficient way of doing this? Or would there be a better way?
(And a side question: is it correct to first ST_Union and then ST_Dump?)
--Update with EXPLAIN ANALYZE
Output for a single area:
"QUERY PLAN"
"GroupAggregate (cost=1996736.77..15410052.20 rows=390140 width=36) (actual time=571.303..1137.657 rows=685 loops=1)"
" Group Key: h.receptor_id"
" -> Sort (cost=1996736.77..1998063.55 rows=530712 width=188) (actual time=571.090..620.379 rows=806 loops=1)"
" Sort Key: h.receptor_id"
" Sort Method: external merge Disk: 42152kB"
" -> Nested Loop (cost=55.53..1848314.51 rows=530712 width=188) (actual time=382.769..424.643 rows=806 loops=1)"
" -> Result (cost=55.25..1321.51 rows=1000 width=32) (actual time=382.550..383.696 rows=65 loops=1)"
" -> ProjectSet (cost=55.25..61.51 rows=1000 width=32) (actual time=382.544..383.652 rows=65 loops=1)"
" -> Aggregate (cost=55.25..55.26 rows=1 width=32) (actual time=381.323..381.325 rows=1 loops=1)"
" -> Index Scan using relevant_habitats_pkey on relevant_habitats rh (cost=0.28..28.75 rows=12 width=130244) (actual time=0.026..0.048 rows=12 loops=1)"
" Index Cond: (assessment_area_id = 94)"
" -> Index Scan using idx_hexagons_geometry_gist on hexagons h (cost=0.29..1846.45 rows=53 width=156) (actual time=0.315..0.624 rows=12 loops=65)"
" Index Cond: (geometry && (((st_dump((st_union(rh.geometry))))).geom))"
" Filter: (((zoom_level)::integer = 1) AND st_intersects((((st_dump((st_union(rh.geometry))))).geom), geometry))"
" Rows Removed by Filter: 19"
"Planning Time: 0.390 ms"
"Execution Time: 1372.443 ms"
Update 2: here is the output of EXPLAIN ANALYZE for the second SELECT (SELECT h.receptor_id ...), with the CTE replaced by a (temp) table:
"QUERY PLAN"
"GroupAggregate (cost=2691484.47..20931829.74 rows=390140 width=36) (actual time=29.455..927.945 rows=685 loops=1)"
" Group Key: h.receptor_id"
" -> Sort (cost=2691484.47..2693288.89 rows=721768 width=188) (actual time=28.382..31.514 rows=806 loops=1)"
" Sort Key: h.receptor_id"
" Sort Method: quicksort Memory: 336kB"
" -> Nested Loop (cost=0.29..2488035.20 rows=721768 width=188) (actual time=0.189..27.852 rows=806 loops=1)"
" -> Seq Scan on rh_union rhu (cost=0.00..23.60 rows=1360 width=32) (actual time=0.016..0.050 rows=65 loops=1)"
" -> Index Scan using idx_hexagons_geometry_gist on hexagons h (cost=0.29..1828.89 rows=53 width=156) (actual time=0.258..0.398 rows=12 loops=65)"
" Index Cond: (geometry && rhu.geometry)"
" Filter: (((zoom_level)::integer = 1) AND st_intersects(rhu.geometry, geometry))"
" Rows Removed by Filter: 19"
"Planning Time: 0.481 ms"
"Execution Time: 928.583 ms"

You want a metric describing the extent to which polygons in one table overlap polygons in another table.
This query returns the ids of overlapping polygons within the same table, with the overlap expressed as a percentage.
SELECT
CONCAT(a.polyid,' ',b.polyid) AS intersecting_polys,
CONCAT(a.attribute_x,' ',b.attribute_x) AS attribute,
ROUND((100 * (ST_Area(ST_Intersection(a.geom, b.geom))) / ST_Area(a.geom))::NUMERIC,0) AS pc_overlap
FROM your_schema.your_table AS a
LEFT JOIN your_schema.your_table AS b
ON (a.geom && b.geom AND ST_Relate(a.geom, b.geom, '2********'))
WHERE a.polyid != b.polyid
;
Note that the join condition
ON (a.geom && b.geom AND ST_Relate(a.geom, b.geom, '2********'))
is used instead of ST_Covers. You might want to experiment to see which is correct for your use case.
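For comparison, a minimal sketch of the same query with ST_Covers in place of the ST_Relate pattern; ST_Covers(a.geom, b.geom) only matches pairs where b lies entirely within a, while the '2********' pattern matches any pair whose interiors share an area:
SELECT
CONCAT(a.polyid,' ',b.polyid) AS intersecting_polys,
-- Same percentage indicator as above; NULLs from non-matching rows are filtered by the WHERE clause.
ROUND((100 * (ST_Area(ST_Intersection(a.geom, b.geom))) / ST_Area(a.geom))::NUMERIC,0) AS pc_overlap
FROM your_schema.your_table AS a
LEFT JOIN your_schema.your_table AS b
ON (a.geom && b.geom AND ST_Covers(a.geom, b.geom))
WHERE a.polyid != b.polyid
;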

Related

Postgres query performance improvement

I am a newbie to database optimisation.
The table I have holds around 29 million rows.
I am running the query below in pgAdmin and it takes 9 seconds.
What can I do to improve performance?
SELECT
F."Id",
F."Name",
F."Url",
F."CountryModel",
F."RegionModel",
F."CityModel",
F."Street",
F."Phone",
F."PostCode",
F."Images",
F."Rank",
F."CommentCount",
F."PageRank",
F."Description",
F."Properties",
F."IsVerify",
count(*) AS Counter
FROM
public."Firms" F,
LATERAL unnest(F."Properties") AS P
WHERE
F."CountryId" = 1
AND F."RegionId" = 7
AND F."CityId" = 4365
AND P = ANY (ARRAY[126, 128])
AND F."Deleted" = FALSE
GROUP BY
F."Id"
ORDER BY
Counter DESC,
F."IsVerify" DESC,
F."PageRank" DESC OFFSET 10 ROWS FETCH FIRST 20 ROW ONLY
That's my query plan:
" -> Sort (cost=11945.20..11948.15 rows=1178 width=369) (actual time=8981.514..8981.515 rows=30 loops=1)"
" Sort Key: (count(*)) DESC, f.""IsVerify"" DESC, f.""PageRank"" DESC"
" Sort Method: top-N heapsort Memory: 58kB"
" -> HashAggregate (cost=11898.63..11910.41 rows=1178 width=369) (actual time=8981.234..8981.305 rows=309 loops=1)"
" Group Key: f.""Id"""
" Batches: 1 Memory Usage: 577kB"
" -> Nested Loop (cost=7050.07..11886.85 rows=2356 width=360) (actual time=79.408..8980.167 rows=322 loops=1)"
" -> Bitmap Heap Scan on ""Firms"" f (cost=7050.06..11716.04 rows=1178 width=360) (actual time=78.414..8909.649 rows=56071 loops=1)"
" Recheck Cond: ((""CityId"" = 4365) AND (""RegionId"" = 7))"
" Filter: ((NOT ""Deleted"") AND (""CountryId"" = 1))"
" Heap Blocks: exact=55330"
" -> BitmapAnd (cost=7050.06..7050.06 rows=1178 width=0) (actual time=70.947..70.947 rows=0 loops=1)"
" -> Bitmap Index Scan on ""IX_Firms_CityId"" (cost=0.00..635.62 rows=58025 width=0) (actual time=11.563..11.563 rows=56072 loops=1)"
" Index Cond: (""CityId"" = 4365)"
" -> Bitmap Index Scan on ""IX_Firms_RegionId"" (cost=0.00..6413.60 rows=588955 width=0) (actual time=57.795..57.795 rows=598278 loops=1)"
" Index Cond: (""RegionId"" = 7)"
" -> Function Scan on unnest p (cost=0.00..0.13 rows=2 width=0) (actual time=0.001..0.001 rows=0 loops=56071)"
" Filter: (p = ANY ('{126,128}'::integer[]))"
" Rows Removed by Filter: 2"
"Planning Time: 0.351 ms"
"Execution Time: 8981.725 ms"```
Create a GIN index on F."Properties":
create index on "Firms" using gin ("Properties");
Then add a clause to your WHERE:
...
AND P = ANY (ARRAY[126, 128])
AND "Properties" && ARRAY[126, 128]
....
That added clause is redundant with the one preceding it, but the planner is not smart enough to reason that out, so you need to make it explicit.
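Assuming "Id" is the primary key of "Firms" (as the original GROUP BY implies), the revised query could look like this sketch (SELECT list abbreviated):
-- The GIN index from above:
create index on "Firms" using gin ("Properties");

-- Revised query: the && clause is the one the GIN index can serve,
-- while the unnest/ANY clause keeps the original per-row counting semantics.
SELECT
F."Id",
F."IsVerify",
F."PageRank",
count(*) AS Counter
FROM
public."Firms" F,
LATERAL unnest(F."Properties") AS P
WHERE
F."CountryId" = 1
AND F."RegionId" = 7
AND F."CityId" = 4365
AND P = ANY (ARRAY[126, 128])
AND F."Properties" && ARRAY[126, 128]
AND F."Deleted" = FALSE
GROUP BY
F."Id"
ORDER BY
Counter DESC,
F."IsVerify" DESC,
F."PageRank" DESC OFFSET 10 ROWS FETCH FIRST 20 ROW ONLY;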

PostgreSQL Calls All Data For Group By Limit Operation

I have a query like below:
SELECT
MAX(m.org_id) as orgId,
MAX(m.org_name) as orgName,
MAX(m.app_id) as appId,
MAX(r.country_or_region) as country,
MAX(r.local_spend_currency) as currency,
SUM(r.local_spend_amount) as spend,
SUM(r.impressions) as impressions
...
FROM report r
LEFT JOIN metadata m
ON m.org_id = r.org_id
AND m.campaign_id = r.campaign_id
AND m.ad_group_id = r.ad_group_id
WHERE (r.report_date BETWEEN '2019-01-01' AND '2019-10-10')
AND r.org_id = 1
GROUP BY r.country_or_region, r.ad_group_id, r.keyword_id, r.keyword, r.text
OFFSET 0
LIMIT 20
Explain Analyze:
"Limit (cost=1308.04..1308.14 rows=1 width=562) (actual time=267486.538..267487.067 rows=20 loops=1)"
" -> GroupAggregate (cost=1308.04..1308.14 rows=1 width=562) (actual time=267486.537..267487.061 rows=20 loops=1)"
" Group Key: r.country_or_region, r.ad_group_id, r.keyword_id, r.keyword, r.text"
" -> Sort (cost=1308.04..1308.05 rows=1 width=221) (actual time=267486.429..267486.536 rows=567 loops=1)"
" Sort Key: r.country_or_region, r.ad_group_id, r.keyword_id, r.keyword, r.text"
" Sort Method: external merge Disk: 667552kB"
" -> Nested Loop (cost=1.13..1308.03 rows=1 width=221) (actual time=0.029..235158.692 rows=2742789 loops=1)"
" -> Nested Loop Semi Join (cost=0.44..89.76 rows=1 width=127) (actual time=0.016..8.967 rows=1506 loops=1)"
" Join Filter: (m.org_id = (479360))"
" -> Nested Loop (cost=0.44..89.05 rows=46 width=123) (actual time=0.013..4.491 rows=1506 loops=1)"
" -> HashAggregate (cost=0.02..0.03 rows=1 width=4) (actual time=0.003..0.003 rows=1 loops=1)"
" Group Key: 479360"
" -> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1)"
" -> Index Scan using pmx_org_cmp_adg on metadata m (cost=0.41..88.55 rows=46 width=119) (actual time=0.008..1.947 rows=1506 loops=1)"
" Index Cond: (org_id = (479360))"
" -> Materialize (cost=0.00..0.03 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1506)"
" -> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=1)"
" -> Index Scan using report_unx on search_term_report r (cost=0.69..1218.26 rows=1 width=118) (actual time=51.983..155.421 rows=1821 loops=1506)"
" Index Cond: ((org_id = m.org_id) AND (report_date >= '2019-07-01'::date) AND (report_date <= '2019-10-10'::date) AND (campaign_id = m.campaign_id) AND (ad_group_id = m.ad_group_id))"
"Planning Time: 0.988 ms"
"Execution Time: 267937.889 ms"
I have indexes on the metadata and report tables: metadata(org_id, campaign_id, ad_group_id); report(org_id, report_date, campaign_id, ad_group_id).
I just want to fetch 20 random items with a limit, but PostgreSQL takes a very long time to do it. How can I improve this?
You want to have 20 groups. But to build these groups (and be sure nothing is missing from any group), the database needs to fetch all of the raw data.
When you say "random items", I assume you mean "random reports", as you have no item table.
with r as (select * from report where report_date between '2019-01-01' and '2019-10-10' and org_id = 1 order by random() limit 20)
select <whatever> from r left join <whatever>
You might need to tweak your aggregates a bit. Does every record in "metadata" belong to exactly one record in "report"?
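A sketch that combines this CTE with the join and aggregates from the original query (aggregate list abbreviated, column names taken from the question):
-- Pick 20 random report rows first, then join and group only those rows.
WITH r AS (
    SELECT *
    FROM report
    WHERE report_date BETWEEN '2019-01-01' AND '2019-10-10'
      AND org_id = 1
    ORDER BY random()
    LIMIT 20
)
SELECT
    MAX(m.org_id)             AS orgId,
    MAX(m.org_name)           AS orgName,
    SUM(r.local_spend_amount) AS spend,
    SUM(r.impressions)        AS impressions
FROM r
LEFT JOIN metadata m
       ON m.org_id = r.org_id
      AND m.campaign_id = r.campaign_id
      AND m.ad_group_id = r.ad_group_id
GROUP BY r.country_or_region, r.ad_group_id, r.keyword_id, r.keyword, r.text;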

Optimisation on postgres query

I am looking for optimization suggestions for the query below on Postgres. I am not a DBA, so I am looking for some expert advice here.
The devices table holds device_id values, which are hexadecimal.
To achieve high throughput we run 6 instances of this query in parallel, pattern matching on device_id
beginning with [0-2], [3-5], [6-9], [a-c], [d-f].
When we run just one instance of the query it works fine, but with 6 instances we get this error:
[6669]:FATAL: connection to client lost
explain analyze select notifications.id, notifications.status, events.alert_type,
events.id as event_id, events.payload, notifications.device_id as device_id,
device_endpoints.region, device_endpoints.device_endpoint as endpoint
from notifications
inner join events
on notifications.event_id = events.id
inner join devices
on notifications.device_id = devices.id
inner join device_endpoints
on devices.id = device_endpoints.device_id
where notifications.status = 'pending' AND notifications.region = 'ap-southeast-2'
AND devices.device_id ~ '[0-9a-f].*'
limit 10000;
Output of explain analyse
"Limit (cost=25.62..1349.23 rows=206 width=202) (actual time=0.359..0.359 rows=0 loops=1)"
" -> Nested Loop (cost=25.62..1349.23 rows=206 width=202) (actual time=0.357..0.357 rows=0 loops=1)"
" Join Filter: (notifications.device_id = devices.id)"
" -> Nested Loop (cost=25.33..1258.73 rows=206 width=206) (actual time=0.357..0.357 rows=0 loops=1)"
" -> Hash Join (cost=25.04..61.32 rows=206 width=52) (actual time=0.043..0.172 rows=193 loops=1)"
" Hash Cond: (notifications.event_id = events.id)"
" -> Index Scan using idx_notifications_status on notifications (cost=0.42..33.87 rows=206 width=16) (actual time=0.013..0.100 rows=193 loops=1)"
" Index Cond: (status = 'pending'::notification_status)"
" Filter: (region = 'ap-southeast-2'::text)"
" -> Hash (cost=16.50..16.50 rows=650 width=40) (actual time=0.022..0.022 rows=34 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 14kB"
" -> Seq Scan on events (cost=0.00..16.50 rows=650 width=40) (actual time=0.005..0.014 rows=34 loops=1)"
" -> Index Scan using idx_device_endpoints_device_id on device_endpoints (cost=0.29..5.80 rows=1 width=154) (actual time=0.001..0.001 rows=0 loops=193)"
" Index Cond: (device_id = notifications.device_id)"
" -> Index Scan using devices_pkey on devices (cost=0.29..0.43 rows=1 width=4) (never executed)"
" Index Cond: (id = device_endpoints.device_id)"
" Filter: (device_id ~ '[0-9a-f].*'::text)"
"Planning time: 0.693 ms"
"Execution time: 0.404 ms"

Postgres Explain Plans Are Different for the Same Query with Different Values

I have databases running Postgres 9.56 on Heroku.
I'm running the following SQL with different parameter values, but getting very different performance.
Query 1
SELECT COUNT(s), DATE_TRUNC('MONTH', t.departure)
FROM tk_seat s
LEFT JOIN tk_trip t ON t.trip_id = s.trip_id
WHERE DATE_PART('year', t.departure)= '2017'
AND t.trip_status = 'BOOKABLE'
AND t.route_id = '278'
AND s.seat_status_type != 'NONE'
AND s.operator_id = '15'
GROUP BY DATE_TRUNC('MONTH', t.departure)
ORDER BY DATE_TRUNC('MONTH', t.departure)
Query 2
SELECT COUNT(s), DATE_TRUNC('MONTH', t.departure)
FROM tk_seat s
LEFT JOIN tk_trip t ON t.trip_id = s.trip_id
WHERE DATE_PART('year', t.departure)= '2017'
AND t.trip_status = 'BOOKABLE'
AND t.route_id = '150'
AND s.seat_status_type != 'NONE'
AND s.operator_id = '15'
GROUP BY DATE_TRUNC('MONTH', t.departure)
ORDER BY DATE_TRUNC('MONTH', t.departure)
The only difference is the t.route_id value.
So I ran EXPLAIN and got very different results.
For Query 1
"GroupAggregate (cost=279335.17..279335.19 rows=1 width=298)"
" Group Key: (date_trunc('MONTH'::text, t.departure))"
" -> Sort (cost=279335.17..279335.17 rows=1 width=298)"
" Sort Key: (date_trunc('MONTH'::text, t.departure))"
" -> Nested Loop (cost=0.00..279335.16 rows=1 width=298)"
" Join Filter: (s.trip_id = t.trip_id)"
" -> Seq Scan on tk_trip t (cost=0.00..5951.88 rows=1 width=12)"
" Filter: (((trip_status)::text = 'BOOKABLE'::text) AND (route_id = '278'::bigint) AND (date_part('year'::text, departure) = '2017'::double precision))"
" -> Seq Scan on tk_seat s (cost=0.00..271738.35 rows=131594 width=298)"
" Filter: (((seat_status_type)::text <> 'NONE'::text) AND (operator_id = '15'::bigint))"
For Query 2
"Sort (cost=278183.94..278183.95 rows=1 width=298)"
" Sort Key: (date_trunc('MONTH'::text, t.departure))"
" -> HashAggregate (cost=278183.92..278183.93 rows=1 width=298)"
" Group Key: date_trunc('MONTH'::text, t.departure)"
" -> Hash Join (cost=5951.97..278183.88 rows=7 width=298)"
" Hash Cond: (s.trip_id = t.trip_id)"
" -> Seq Scan on tk_seat s (cost=0.00..271738.35 rows=131594 width=298)"
" Filter: (((seat_status_type)::text <> 'NONE'::text) AND (operator_id = '15'::bigint))"
" -> Hash (cost=5951.88..5951.88 rows=7 width=12)"
" -> Seq Scan on tk_trip t (cost=0.00..5951.88 rows=7 width=12)"
" Filter: (((trip_status)::text = 'BOOKABLE'::text) AND (route_id = '150'::bigint) AND (date_part('year'::text, departure) = '2017'::double precision))"
My question is: why, and how can I make them the same? The first query gives me very bad performance.
Query 1 Analyze
"GroupAggregate (cost=274051.28..274051.31 rows=1 width=8) (actual time=904682.606..904684.283 rows=7 loops=1)"
" Group Key: (date_trunc('MONTH'::text, t.departure))"
" -> Sort (cost=274051.28..274051.29 rows=1 width=8) (actual time=904682.432..904682.917 rows=13520 loops=1)"
" Sort Key: (date_trunc('MONTH'::text, t.departure))"
" Sort Method: quicksort Memory: 1018kB"
" -> Nested Loop (cost=0.42..274051.27 rows=1 width=8) (actual time=1133.925..904676.254 rows=13520 loops=1)"
" Join Filter: (s.trip_id = t.trip_id)"
" Rows Removed by Join Filter: 42505528"
" -> Index Scan using tk_trip_route_id_idx on tk_trip t (cost=0.42..651.34 rows=1 width=12) (actual time=0.020..2.720 rows=338 loops=1)"
" Index Cond: (route_id = '278'::bigint)"
" Filter: (((trip_status)::text = 'BOOKABLE'::text) AND (date_part('year'::text, departure) = '2017'::double precision))"
" Rows Removed by Filter: 28"
" -> Seq Scan on tk_seat s (cost=0.00..271715.83 rows=134728 width=8) (actual time=0.071..2662.102 rows=125796 loops=338)"
" Filter: (((seat_status_type)::text <> 'NONE'::text) AND (operator_id = '15'::bigint))"
" Rows Removed by Filter: 6782294"
"Planning time: 1.172 ms"
"Execution time: 904684.570 ms"
Query 2 Analyze
"Sort (cost=275018.88..275018.89 rows=1 width=8) (actual time=2153.843..2153.843 rows=9 loops=1)"
" Sort Key: (date_trunc('MONTH'::text, t.departure))"
" Sort Method: quicksort Memory: 25kB"
" -> HashAggregate (cost=275018.86..275018.87 rows=1 width=8) (actual time=2153.833..2153.834 rows=9 loops=1)"
" Group Key: date_trunc('MONTH'::text, t.departure)"
" -> Hash Join (cost=2797.67..275018.82 rows=7 width=8) (actual time=2.472..2147.093 rows=36565 loops=1)"
" Hash Cond: (s.trip_id = t.trip_id)"
" -> Seq Scan on tk_seat s (cost=0.00..271715.83 rows=134728 width=8) (actual time=0.127..2116.153 rows=125796 loops=1)"
" Filter: (((seat_status_type)::text <> 'NONE'::text) AND (operator_id = '15'::bigint))"
" Rows Removed by Filter: 6782294"
" -> Hash (cost=2797.58..2797.58 rows=7 width=12) (actual time=1.853..1.853 rows=1430 loops=1)"
" Buckets: 2048 (originally 1024) Batches: 1 (originally 1) Memory Usage: 78kB"
" -> Bitmap Heap Scan on tk_trip t (cost=32.21..2797.58 rows=7 width=12) (actual time=0.176..1.559 rows=1430 loops=1)"
" Recheck Cond: (route_id = '150'::bigint)"
" Filter: (((trip_status)::text = 'BOOKABLE'::text) AND (date_part('year'::text, departure) = '2017'::double precision))"
" Rows Removed by Filter: 33"
" Heap Blocks: exact=333"
" -> Bitmap Index Scan on tk_trip_route_id_idx (cost=0.00..32.21 rows=1572 width=0) (actual time=0.131..0.131 rows=1463 loops=1)"
" Index Cond: (route_id = '150'::bigint)"
"Planning time: 0.211 ms"
"Execution time: 2153.972 ms"
You can - possibly - make them the same if you hint Postgres not to use nested loops:
SET enable_nestloop = 'off';
You can make it permanent by setting it at the database level, for a role, inside a function definition, or in the server configuration:
ALTER DATABASE postgres
SET enable_nestloop = 'off';
ALTER ROLE lkaminski
SET enable_nestloop = 'off';
CREATE FUNCTION add(integer, integer) RETURNS integer
AS 'select $1 + $2;'
LANGUAGE SQL
SET enable_nestloop = 'off'
IMMUTABLE
RETURNS NULL ON NULL INPUT;
As for why: you changed the search condition, and the planner estimates that it will get 1 row from tk_trip instead of 7, so it switches plans because a nested loop looks better. Sometimes it is wrong about those estimates and you get a slower execution time. But if you "force" it not to use nested loops, then for a different parameter value it could be slower to use the second plan instead of the first one (with the nested loop).
You can make the planner's estimates more accurate by increasing how many statistics it gathers for the column. It might help.
ALTER TABLE tk_trip ALTER COLUMN route_id SET STATISTICS 1000;
As a side note: your LEFT JOIN is actually an INNER JOIN, because you have put the conditions for that table in WHERE instead of ON. You should get a different plan (and result) if you move them over to ON, as sketched below - assuming you wanted a LEFT JOIN.
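A sketch of the first query with the tk_trip conditions moved into ON (only relevant if a true LEFT JOIN was intended):
-- Seats without a matching trip are now kept, with NULL departure.
SELECT COUNT(s), DATE_TRUNC('MONTH', t.departure)
FROM tk_seat s
LEFT JOIN tk_trip t
       ON t.trip_id = s.trip_id
      AND DATE_PART('year', t.departure) = '2017'
      AND t.trip_status = 'BOOKABLE'
      AND t.route_id = '278'
WHERE s.seat_status_type != 'NONE'
  AND s.operator_id = '15'
GROUP BY DATE_TRUNC('MONTH', t.departure)
ORDER BY DATE_TRUNC('MONTH', t.departure)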

Index on boolean columns in time dimension

In the time dimension of our data warehouse, we have a lot of columns with boolean flags, for example:
is_ytd (is year to date)
is_mtd (is month to date)
is_current_date
is_current_month
is_current_year
Would it be a good indexing strategy to create a partial index on each such column? Something like:
CREATE INDEX tdim_is_current_month
ON calendar (is_current_month)
WHERE is_current_month;
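For example, a query that filters on the same predicate should be able to use such a partial index; this mirrors the test query shown further below:
-- The WHERE clause matches the partial index predicate, so the planner can
-- use tdim_is_current_month to find the qualifying calendar rows.
SELECT count(*)
FROM fact_table
JOIN calendar USING (time_key)
WHERE calendar.is_current_month;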
Our time dimension has 136 columns and 7,000 rows; 53 of the columns are boolean indicators.
Why do we use flags instead of deriving the desired date range from current_date?
Make life easier
Enforce consistency
Speed up queries
Provide not-so-easy-to-derive indicators
Make using other tools easier
Ad 1)
Once you join the time dimension (and this is almost always the case when analyzing any fact table in a data warehouse), it is much easier to write where is_current_year than where extract(year from time_date) = extract(year from current_date).
Ad 2)
Example: it sounds simple to figure out what year to date (YTD) means. We can start with: time_date between date_trunc('year', current_date) and current_date. But some people would actually exclude current_date (this makes sense, because today is not finished yet). In that case we would use: time_date between date_trunc('year', current_date) and (current_date - 1). And going further: what would happen if, for some reason, the DW is not updated for a couple of days? Maybe then you would like YTD linked to the last day for which you have complete data from all source systems. When you have a common definition of what YTD means, you reduce the risk of different interpretations.
Ad 3)
I think that it should be faster to filter data on an indexed boolean column than on an expression calculated on the fly.
Ad 4)
Some flags are not so easy to create - for example, we have the flags is_first_workday_in_month and is_last_workday_in_month.
Ad 5)
In some tools it is easier to use existing columns than SQL expressions. For example, when creating dimensions for an OLAP cube it is much easier to add a table column as a level of a hierarchy than to construct such a level with a SQL expression.
Testing indexes for boolean flags
I tested all indexed flags and ran EXPLAIN ANALYZE for a simple query with one fact table and the time dimension (named calendar):
select count(*) from fact_table join calendar using(time_key)
For most flags I get Index scan:
"Aggregate (cost=4022.80..4022.81 rows=1 width=0) (actual time=38.642..38.642 rows=1 loops=1)"
" -> Hash Join (cost=13.12..4019.73 rows=1230 width=0) (actual time=38.640..38.640 rows=0 loops=1)"
" Hash Cond: (fact_table.time_key = calendar.time_key)"
" -> Seq Scan on fact_table (cost=0.00..3249.95 rows=198495 width=2) (actual time=0.006..17.769 rows=198495 loops=1)"
" -> Hash (cost=12.58..12.58 rows=43 width=2) (actual time=0.054..0.054 rows=43 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 2kB"
" -> Index Scan using cal_is_qtd on calendar (cost=0.00..12.58 rows=43 width=2) (actual time=0.014..0.049 rows=43 loops=1)"
" Index Cond: (is_qtd = true)"
"Total runtime: 38.679 ms"
For some flags I get bitmap heap scan combined with bitmap index scan:
"Aggregate (cost=13341.07..13341.08 rows=1 width=0) (actual time=100.972..100.973 rows=1 loops=1)"
" -> Hash Join (cost=6656.54..13001.52 rows=135820 width=0) (actual time=5.729..86.972 rows=198495 loops=1)"
" Hash Cond: (fact_table.time_key = calendar.time_key)"
" -> Seq Scan on fact_table (cost=0.00..3249.95 rows=198495 width=2) (actual time=0.012..22.667 rows=198495 loops=1)"
" -> Hash (cost=6597.19..6597.19 rows=4748 width=2) (actual time=5.706..5.706 rows=4748 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 158kB"
" -> Bitmap Heap Scan on calendar (cost=97.05..6597.19 rows=4748 width=2) (actual time=0.440..4.971 rows=4748 loops=1)"
" Filter: is_past_quarter"
" -> Bitmap Index Scan on cal_is_past_quarter (cost=0.00..95.86 rows=3249 width=0) (actual time=0.395..0.395 rows=4748 loops=1)"
" Index Cond: (is_past_quarter = true)"
"Total runtime: 101.013 ms"
Only for two flags do I get a seq scan:
"Aggregate (cost=17195.33..17195.34 rows=1 width=0) (actual time=122.108..122.108 rows=1 loops=1)"
" -> Hash Join (cost=9231.13..16699.10 rows=198495 width=0) (actual time=23.960..108.018 rows=198495 loops=1)"
" Hash Cond: (fact_table.time_key = calendar.time_key)"
" -> Seq Scan on fact_table (cost=0.00..3249.95 rows=198495 width=2) (actual time=0.012..22.153 rows=198495 loops=1)"
" -> Hash (cost=9144.39..9144.39 rows=6939 width=2) (actual time=23.935..23.935 rows=6939 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 231kB"
" -> Seq Scan on calendar (cost=0.00..9144.39 rows=6939 width=2) (actual time=17.427..22.908 rows=6939 loops=1)"
" Filter: is_eoq"
"Total runtime: 122.138 ms"
If is_current_month = true represents more than a few percent of the rows then the index will not be used. 7,000 rows is too few to even bother.
Maybe this is more of a comment than an answer....
Given that the query planner/optimizer gets the cardinalities and join type correct, the execution time of any query involving a join between your fact table and your time dimension will be determined by the size of the fact table.
Your time dimension will either be in cache all the time or fully read in a few ms. You will have bigger variations than that depending on the current load! The rest of the execution time does not have to do with the time dimension.
Having said that, I'm all for using every trick in the bag to help the query planner/optimizer come up with good enough estimates. Sometimes this means creating or disabling constraints and creating unnecessary indexes.