Postgres slow performance when order is added

Postgres slow performance when order is added - postgresql

I have this simple setup with a single table where I import some users having a short description (bio) that is max 255 characters. Next to the description field I have another field of type tsvector which is the bio but tokenized.
My goal is to find user bios that contain certain keywords. A full text search on a single table so I'm not sure if the tsvector field is even needed (I'm new to postgres, used only Mysql in the past) since it seems to me it gets really powerful when querying in different fields/tables.
My actual problem is the fact that this query runs ok when no ordering is involved (0.5s) but extremely slow when a single order by clause is added on an indexed field (8s). I only have about 1 million records in there.
Table setup:
CREATE TABLE public.django_user
(
id integer NOT NULL DEFAULT nextval('django_user_id_seq'::regclass),
username character varying(32) COLLATE pg_catalog."default" NOT NULL,
description character varying(255) COLLATE pg_catalog."default" NOT NULL,
description_tokens tsvector,
streams_count integer NOT NULL,
CONSTRAINT django_user_pkey PRIMARY KEY (id),
CONSTRAINT django_user_streams_count_check CHECK (streams_count >= 0)
)
WITH (
OIDS = FALSE
)
TABLESPACE pg_default;
CREATE INDEX django_user_description_tokens_07422d46
ON public.django_user USING btree
(description_tokens)
TABLESPACE pg_default;
CREATE INDEX django_user_streams_count_66aa1edc
ON public.django_user USING btree
(streams_count)
TABLESPACE pg_default;
The slow query:
SELECT
streams_count, username, description
FROM
"django_user"
WHERE
to_tsvector('english'::regconfig, COALESCE(("django_user"."description_tokens")::text, '')) ## (plainto_tsquery('english'::regconfig, 'react redux')) = true
ORDER BY streams_count ASC LIMIT 20
If I remove the ORDER BY streams_count ASC everything works fine. Here's an explain for the query:
"Limit (cost=174377.42..174379.75 rows=20 width=106) (actual time=7363.660..7368.257 rows=20 loops=1)"
" -> Gather Merge (cost=174377.42..174379.99 rows=22 width=106) (actual time=7363.658..7368.245 rows=20 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Sort (cost=173377.40..173377.42 rows=11 width=106) (actual time=7359.708..7359.710 rows=15 loops=3)"
" Sort Key: streams_count"
" Sort Method: top-N heapsort Memory: 31kB"
" Worker 0: Sort Method: top-N heapsort Memory: 32kB"
" Worker 1: Sort Method: top-N heapsort Memory: 32kB"
" -> Parallel Seq Scan on django_user (cost=0.00..173377.21 rows=11 width=106) (actual time=24.870..7359.379 rows=109 loops=3)"
" Filter: (to_tsvector('english'::regconfig, COALESCE((description_tokens)::text, ''::text)) ## '''react'' & ''redux'''::tsquery)"
" Rows Removed by Filter: 347231"
"Planning Time: 0.298 ms"
"Execution Time: 7368.293 ms"
Any idea what I'm missing?

A btree index on a tsvector is pretty useless. Make it to a GIN index, so that the query can make use of it with the ## operator.
As for why the ORDER BY makes is slower, consider how much harder it would be find the 20 tallest people in Chicago wearing red ties, versus 20 random people in Chicago wearing red ties.

Related

SQL Performance problem with like query after migration from MySQL to PostgreSQL

I migrated my database from MySQL to PostgreSQL with pgloader, it's globally much more efficient but a query with like condition is more slower on PostgreSQL.
MySQL : ~1ms
PostgreSQL : ~110 ms
Table info:
105 columns
23 indexes
1.6M records
Columns info:
name character varying(30) COLLATE pg_catalog."default" NOT NULL,
ratemax3v3 integer NOT NULL DEFAULT 0,
Query is :
SELECT name, realm, region, class, id
FROM personnages
WHERE blacklisted = 0 AND name LIKE 'Krok%' AND region = 'eu'
ORDER BY ratemax3v3 DESC LIMIT 5;
EXPLAIN ANALYSE (PostgreSQL)
Limit (cost=629.10..629.12 rows=5 width=34) (actual time=111.128..111.130 rows=5 loops=1)
-> Sort (cost=629.10..629.40 rows=117 width=34) (actual time=111.126..111.128 rows=5 loops=1)
Sort Key: ratemax3v3 DESC
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on personnages (cost=9.63..627.16 rows=117 width=34) (actual time=110.619..111.093 rows=75 loops=1)
Recheck Cond: ((name)::text ~~ 'Krok%'::text)
Rows Removed by Index Recheck: 1
Filter: ((blacklisted = 0) AND ((region)::text = 'eu'::text))
Rows Removed by Filter: 13
Heap Blocks: exact=88
-> Bitmap Index Scan on trgm_idx_name (cost=0.00..9.60 rows=158 width=0) (actual time=110.581..110.582 rows=89 loops=1)
Index Cond: ((name)::text ~~ 'Krok%'::text)
Planning Time: 0.268 ms
Execution Time: 111.174 ms
Pgloader have been created indexes on ratemax3v3 and name like:
CREATE INDEX idx_24683_ratemax3v3
ON wow.personnages USING btree
(ratemax3v3 ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX idx_24683_name
ON wow.personnages USING btree
(name COLLATE pg_catalog."default" ASC NULLS LAST)
TABLESPACE pg_default;
I created a new index on name :
CREATE INDEX trgm_idx_name ON wow.personnages USING GIST (name gist_trgm_ops);
I'm quite a beginner with postgresql at the moment.
Do you see anything I could do?
Don't hesitate to ask me if you need more information!
Antoine

To support a LIKE query like that (left anchored) you need to use a special "operator class":
CREATE INDEX ON wow.personnages(name varchar_pattern_ops);
But for your given query, an index on multiple columns would probably be more efficient:
CREATE INDEX ON wow.personnages(region, blacklisted, name varchar_pattern_ops);
Of maybe even a filtered index if e.g. the blacklisted = 0 is a static condition and there are relatively few rows matching that condition.
CREATE INDEX ON wow.personnages(region, name varchar_pattern_ops)
WHERE blacklisted = 0;
If the majority of the rows has blacklisted = 0 that won't really help (and adding the column to the index wouldn't help either). In that case just an index with (region, name varchar_pattern_ops) is probably more efficient.

If your pattern is anchored at the beginning, the following index would perform better:
CREATE INDEX ON personnages (name text_pattern_ops);
Besides, GIN indexes usually perform better than GiST indexes in a case like this. Try with a GIN index.
Finally, it is possible that the trigrams k, kr, kro, rok and ok occur very frequently, which would also make the index perform bad.

Postgres index not using correct plan

My postgresql version is 10.6. I have created an index but that is not used for all where clause condition check. Below are more details :
Create index concurrently ticket_created_at_portal_id_created_by_id_assigned_group_id_idx on ticket(created_at, portal_id, created_by_id, assigned_group);
EXPLAIN (analyze true, verbose true, costs true, buffers true, timing true ) select * from ticket where status is not null
and (assigned_group in ('447') or created_by_id in ('39731566'))
and portal_id=8
and created_at>='2020-12-07T03:00:10.973'
and created_at<='2021-02-05T03:00:10.973'
order by updated_at DESC limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=18975.23..18975.25 rows=10 width=638) (actual time=278.340..278.345 rows=10 loops=1)
Output: id, action, assigned_agent, assigned_at, assigned_group, attachments, closed_at, created_at, created_by_email, created_by_id, description, first_response_time, parent_id, portal_id, priority, reopened_at, resolution_id, resolved_at, resolved_by_id, resource_id, resource_type, source, status, subject, tags, ticket_category, ticket_id, ticket_sub_category, ticket_sub_sub_category, type, updated_at, custom_fields, updated_by_id, first_assigned_at, mode, sla_breached, agent_assist_tags, comm_vendor, from_email
Buffers: shared hit=2280 read=3105
-> Sort (cost=18975.23..18975.45 rows=87 width=638) (actual time=278.338..278.339 rows=10 loops=1)
Output: id, action, assigned_agent, assigned_at, assigned_group, attachments, closed_at, created_at, created_by_email, created_by_id, description, first_response_time, parent_id, portal_id, priority, reopened_at, resolution_id, resolved_at, resolved_by_id, resource_id, resource_type, source, status, subject, tags, ticket_category, ticket_id, ticket_sub_category, ticket_sub_sub_category, type, updated_at, custom_fields, updated_by_id, first_assigned_at, mode, sla_breached, agent_assist_tags, comm_vendor, from_email
Sort Key: ticket.updated_at DESC
Sort Method: top-N heapsort Memory: 33kB
Buffers: shared hit=2280 read=3105
-> Bitmap Heap Scan on public.ticket (cost=17855.76..18973.35 rows=87 width=638) (actual time=111.871..275.835 rows=1256 loops=1)
Output: id, action, assigned_agent, assigned_at, assigned_group, attachments, closed_at, created_at, created_by_email, created_by_id, description, first_response_time, parent_id, portal_id, priority, reopened_at, resolution_id, resolved_at, resolved_by_id, resource_id, resource_type, source, status, subject, tags, ticket_category, ticket_id, ticket_sub_category, ticket_sub_sub_category, type, updated_at, custom_fields, updated_by_id, first_assigned_at, mode, sla_breached, agent_assist_tags, comm_vendor, from_email
Recheck Cond: (((ticket.assigned_group = '447'::bigint) AND (ticket.portal_id = 8)) OR ((ticket.created_at >= '2020-12-07 03:00:10.973'::timestamp without time zone) AND (ticket.created_at <= '2021-02-05 03:00:10.973'::timestamp without time zone) AND (ticket.portal_id = 8) AND (ticket.created_by_id = '39731566'::bigint)))
Filter: ((ticket.status IS NOT NULL) AND (ticket.created_at >= '2020-12-07 03:00:10.973'::timestamp without time zone) AND (ticket.created_at <= '2021-02-05 03:00:10.973'::timestamp without time zone))
Rows Removed by Filter: 1517
Heap Blocks: exact=2638
Buffers: shared hit=2277 read=3105
-> BitmapOr (cost=17855.76..17855.76 rows=291 width=0) (actual time=106.215..106.216 rows=0 loops=1)
Buffers: shared hit=336 read=2408
-> Bitmap Index Scan on ticket_assigned_group_portal_id_assigned_agent_idx (cost=0.00..11.25 rows=282 width=0) (actual time=10.661..10.661 rows=2776 loops=1)
Index Cond: ((ticket.assigned_group = '447'::bigint) AND (ticket.portal_id = 8))
Buffers: shared hit=4 read=15
-> Bitmap Index Scan on ticket_created_at_portal_id_created_by_id_assigned_group_id_idx (cost=0.00..17844.47 rows=9 width=0) (actual time=95.551..95.551 rows=2 loops=1)
Index Cond: ((ticket.created_at >= '2020-12-07 03:00:10.973'::timestamp without time zone) AND (ticket.created_at <= '2021-02-05 03:00:10.973'::timestamp without time zone) AND (ticket.portal_id = 8) AND (ticket.created_by_id = '39731566'::bigint))
Buffers: shared hit=332 read=2393
Planning time: 43.083 ms
Execution time: 278.556 ms
(25 rows)
ticket_created_at_portal_id_created_by_id_assigned_group_id_idx is having all the columns of where clause except status is not null, but still query is using separate index for Index Cond: ((ticket.assigned_group = '447'::bigint) AND (ticket.portal_id = 8)) which is already present in 2nd index ticket_created_at_portal_id_created_by_id_assigned_group_id_idx.
Why it is so? Even when I include status column as well in the index, still query was using 2 index and hen doing a heavy filter on index heap scan.
How we can optimize it?
Also I did experiment with indexes of table but still unable to remove indexes. It seems that same columns are repeated in multiple indexes for diff queries, Please help if we can reduce these no of indexes.
All indexes for table are:
"ticket_pkey" PRIMARY KEY, btree (id)
"ticket_ticket_id_idx" UNIQUE, btree (ticket_id)
"uk2uors84i0m8sjxc6oaocuy6oj" UNIQUE CONSTRAINT, btree (ticket_id)
"idx_resource_id" btree (resource_id)
"idx_ticket_created_at" btree (created_at)
"ticket_assigned_agent_idx" btree (assigned_agent)
"ticket_assigned_group_idx" btree (assigned_group)
"ticket_assigned_group_portal_id_assigned_agent_idx" btree (assigned_group, portal_id, assigned_agent)
"ticket_created_at_portal_id_created_by_id_assigned_group_id_idx" btree (created_at, portal_id, created_by_id, assigned_group)
"ticket_created_at_portal_id_status_idx" btree (created_at, portal_id, status)
"ticket_id_resolved_at_assigned_group_status_idx" btree (id, resolved_at, assigned_group, status)

How easy would it be to use a phone book (sorted by last-name then first-name) to find every one with a first name of "Francis" whose last name starts with a letter between K and T? Not very easy, because it is not sorted by first name. You would have to go through the entire middle half of the phone book, reading everyone's first name.
Same here. When the first column in your index is used in a range/inequality query rather than equality, it makes all the columns after that one much less efficient. You would want to put the columns used for equality and not in an OR first. Unfortunately, that is only portal_id. The best thing to put next would depend on how selective each of those other conditions are which we can't guess from the info provided.
In deciding this, status IS NULL would be the same thing as equality, but status IS NOT NULL is not as there are any number of values it could be while still being not null, so it is effectively the same as an inequality. If this condition is highly selective, the best way to incorporate it would be in the WHERE of a partial index.
Because of the OR, you might still be best off with 2 indexes which could be combined in a bitmap or.
...(portal_id, assigned_group, created_at) WHERE status IS NOT NULL;
...(portal_id, created_by_id, created_at) WHERE status IS NOT NULL;
Another approach would be to avoid fetching and sorting all of the matching rows, by walking rows in order by updated_at using an index and stopping after 10 of them are found. An index can be used for walking a column in order as long as only things tested for equality (and without ORs) occur before the ORDER BY column, so:
...(portal_id, updated_at) WHERE status IS NOT NULL;

Why is index not being used when 'Check X Min' flag is set to true?

I have a simple query for a simple table in postgres. I have a simple index on that table.
In some environments it is using the index when performing the query, in other environments (on the same RDS instance, different database) it isn't. (checked using EXPLAIN ANALYSE)
One thing I've noticed is that if the 'Check X Min' flag on the index is TRUE then index is not used. (pg_catalog.pg_index.indcheckxmin)
How do I ensure the index is used and, presumably, have the 'Check X Min' flag set to false?
Table contains 100K+ rows.
Things I have tried:
The index is valid and is always used in environments where the 'Check X Min' is set to false.
set enable_seqscan to off; still does not use the index.
Creating/recreating an index in these environments always seems to have 'Check X Min' set to true.
Vacuuming does not seem to help.
Setup of table and index:
CREATE TABLE schema_1.table_1 (
field_1 varchar(20) NOT NULL,
field_2 int4 NULL,
field_3 timestamptz NULL,
field_4 numeric(10,2) NULL
);
CREATE INDEX table_1_field_1_field_3_idx ON schema_1.table_1 USING btree (field_1, field_3 DESC);
Query:
select field_1, field_2, field_3, field_4
from schema_1.table_1
where field_1 = ’abcdef’
order by field_3 desc limit 1;
When not using index:
QUERY PLAN |
---------------------------------------------------------------------------------------------------------------------|
Limit (cost=4.41..4.41 rows=1 width=51) (actual time=3.174..3.176 rows=1 loops=1) |
-> Sort (cost=4.41..4.42 rows=6 width=51) (actual time=3.174..3.174 rows=1 loops=1) |
Sort Key: field_3 DESC |
Sort Method: top-N heapsort Memory: 25kB |
-> Seq Scan on table_1 (cost=0.00..4.38 rows=6 width=51) (actual time=3.119..3.150 rows=3 loops=1)|
Filter: ((field_1)::text = 'abcdef'::text) |
Rows Removed by Filter: 96 |
Planning time: 2.895 ms |
Execution time: 3.197 ms |
When using index:
QUERY PLAN |
--------------------------------------------------------------------------------------------------------------------------------------------------------|
Limit (cost=0.28..6.30 rows=1 width=51) (actual time=0.070..0.144 rows=1 loops=1) |
-> Index Scan using table_1_field_1_field_3_idx on field_1 (cost=0.28..12.31 rows=2 width=51) (actual time=0.049..0.066 rows=1 loops=1)|
Index Cond: ((field_1)::text = 'abcdef'::text) |
Planning time: 0.184 ms |
Execution time: 0.303 ms |
Have renamed fields, schema, and table to avoid sharing business context

You seem to be using CREATE INDEX CONCURRENTLY, and have long-open transactions. From the docs:
Even then, however, the index may not be immediately usable for queries: in the worst case, it cannot be used as long as transactions exist that predate the start of the index build.
You don't have a lot of options here. Hunt down and fix your long-open transactions, don't use CONCURRENTLY, or put up with the limitation.

PostgreSQL is not using a straight forward index

I have a PostgreSQL 10.6 database on Amazon RDS. My table is like this:
CREATE TABLE dfo_by_quarter (
release_key int4 NOT NULL,
country varchar(100) NOT NULL,
product_group varchar(100) NOT NULL,
distribution_type varchar(100) NOT NULL,
"year" int2 NOT NULL,
"date" date NULL,
quarter int2 NOT NULL,
category varchar(100) NOT NULL,
units numeric(38,6) NOT NULL,
sales_value_eur numeric(38,6) NOT NULL,
sales_value_usd numeric(38,6) NOT NULL,
sales_value_local numeric(38,6) NOT NULL,
data_status bpchar(1) NOT NULL,
panel_market_units numeric(38,6) NOT NULL,
panel_market_sales_value_eur numeric(38,6) NOT NULL,
panel_market_sales_value_usd numeric(38,6) NOT NULL,
panel_market_sales_value_local numeric(38,6) NOT NULL,
CONSTRAINT pk_dpretailer_dfo_by_quarter PRIMARY KEY (release_key, country, category, product_group, distribution_type, year, quarter),
CONSTRAINT fk_dpretailer_dfo_by_quarter_release FOREIGN KEY (release_key) REFERENCES dpretailer.dfo_release(release_id)
);
I understand Primary Key implies a unique index
If I simply ask how many rows I have when filtering on non existing data (release_key = 1 returns nothing), I can see it uses the index
EXPLAIN
SELECT COUNT(*)
FROM dpretailer.dfo_by_quarter
WHERE release_key = 1
Aggregate (cost=6.32..6.33 rows=1 width=8)
-> Index Only Scan using pk_dpretailer_dfo_by_quarter on dfo_by_quarter (cost=0.55..6.32 rows=1 width=0)
Index Cond: (release_key = 1)
But if I run the same query on a value that returns data, it scans the table, which is bound to be more expensive...
EXPLAIN
SELECT COUNT(*)
FROM dpretailer.dfo_by_quarter
WHERE release_key = 2
Finalize Aggregate (cost=47611.07..47611.08 rows=1 width=8)
-> Gather (cost=47610.86..47611.07 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=46610.86..46610.87 rows=1 width=8)
-> Parallel Seq Scan on dfo_by_quarter (cost=0.00..46307.29 rows=121428 width=0)
Filter: (release_key = 2)
I get it that using the index when there is no data makes sense and is driven by the stats on the table (I ran ANALYSE before the tests)
But why not using my index if there is data?
Surely, it must be quicker to scan part of an index (because release_key is the first column) rather than scanning an entire table???
I must be missing something...?
Update 2019-03-07
Thank You for your comments, which are very useful.
This simple query was just me trying to understand why the index was not used...
But I should have known better (I am new to postgresql but have MANY years experience with SQL Server) and it makes sense that it is not, as you commented about.
bad selectivity because my criteria only filters about 20% of the rows
bad table design (too fat, which we knew and are now addressing)
index not "covering" the query, etc...
So let me change "slightly" my question if I may...
Our table will be normalised in facts/dimensions (no more varchars in the wrong place).
We do only inserts, never updates and so few deletes that we can ignore it.
The table size will not be huge (tens of million of rows order).
Our queries will ALWAYS specify an exact release_key value.
Our new version of the table would look like this
CREATE TABLE dfo_by_quarter (
release_key int4 NOT NULL,
country_key int2 NOT NULL,
product_group_key int2 NOT NULL,
distribution_type_key int2 NOT NULL,
category_key int2 NOT NULL,
"year" int2 NOT NULL,
"date" date NULL,
quarter int2 NOT NULL,
units numeric(38,6) NOT NULL,
sales_value_eur numeric(38,6) NOT NULL,
sales_value_usd numeric(38,6) NOT NULL,
sales_value_local numeric(38,6) NOT NULL,
CONSTRAINT pk_milly_dfo_by_quarter PRIMARY KEY (release_key, country_key, category_key, product_group_key, distribution_type_key, year, quarter),
CONSTRAINT fk_milly_dfo_by_quarter_release FOREIGN KEY (release_key) REFERENCES dpretailer.dfo_release(release_id),
CONSTRAINT fk_milly_dim_dfo_category FOREIGN KEY (category_key) REFERENCES milly.dim_dfo_category(category_key),
CONSTRAINT fk_milly_dim_dfo_country FOREIGN KEY (country_key) REFERENCES milly.dim_dfo_country(country_key),
CONSTRAINT fk_milly_dim_dfo_distribution_type FOREIGN KEY (distribution_type_key) REFERENCES milly.dim_dfo_distribution_type(distribution_type_key),
CONSTRAINT fk_milly_dim_dfo_product_group FOREIGN KEY (product_group_key) REFERENCES milly.dim_dfo_product_group(product_group_key)
);
With that in mind, in a SQL Server environment, I could solve this by having a "Clustered" primary key (the entire table being sorted), or having an index on the primary key with INCLUDE option for the other columns required to cover the queries (Units, Values, etc).
Question 1)
In postgresql, is there an equivalent to the SQL Server Clustered index? A way to actually sort the entire table? I suppose it might be difficult because postgresql does not do updates "in place", hence it might make sorting expensive...
Or, is there a way to create something like a SQL Server Index WITH INCLUDE(units, values)?
update: I came across the SQL CLUSTER command, which is the closest thing I suppose.
It would be suitable for us
Question 2
With the query below
EXPLAIN (ANALYZE, BUFFERS)
WITH "rank_query" AS
(
SELECT
ROW_NUMBER() OVER(PARTITION BY "year" ORDER BY SUM("main"."units") DESC) AS "rank_by",
"year",
"main"."product_group_key" AS "productgroupkey",
SUM("main"."units") AS "salesunits",
SUM("main"."sales_value_eur") AS "salesvalue",
SUM("sales_value_eur")/SUM("units") AS "asp"
FROM "milly"."dfo_by_quarter" AS "main"
WHERE
"release_key" = 17 AND
"main"."year" >= 2010
GROUP BY
"year",
"main"."product_group_key"
)
,BeforeLookup
AS (
SELECT
"year" AS date,
SUM("salesunits") AS "salesunits",
SUM("salesvalue") AS "salesvalue",
SUM("salesvalue")/SUM("salesunits") AS "asp",
CASE WHEN "rank_by" <= 50 THEN "productgroupkey" ELSE -1 END AS "productgroupkey"
FROM
"rank_query"
GROUP BY
"year",
CASE WHEN "rank_by" <= 50 THEN "productgroupkey" ELSE -1 END
)
SELECT BL.date, BL.salesunits, BL.salesvalue, BL.asp
FROM BeforeLookup AS BL
INNER JOIN milly.dim_dfo_product_group PG ON PG.product_group_key = BL.productgroupkey;
I get this
Hash Join (cost=40883.82..40896.46 rows=558 width=98) (actual time=676.565..678.308 rows=663 loops=1)
Hash Cond: (bl.productgroupkey = pg.product_group_key)
Buffers: shared hit=483 read=22719
CTE rank_query
-> WindowAgg (cost=40507.15..40632.63 rows=5577 width=108) (actual time=660.076..668.272 rows=5418 loops=1)
Buffers: shared hit=480 read=22719
-> Sort (cost=40507.15..40521.09 rows=5577 width=68) (actual time=660.062..661.226 rows=5418 loops=1)
Sort Key: main.year, (sum(main.units)) DESC
Sort Method: quicksort Memory: 616kB
Buffers: shared hit=480 read=22719
-> Finalize HashAggregate (cost=40076.46..40160.11 rows=5577 width=68) (actual time=648.762..653.227 rows=5418 loops=1)
Group Key: main.year, main.product_group_key
Buffers: shared hit=480 read=22719
-> Gather (cost=38710.09..39909.15 rows=11154 width=68) (actual time=597.878..622.379 rows=11938 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=480 read=22719
-> Partial HashAggregate (cost=37710.09..37793.75 rows=5577 width=68) (actual time=594.044..600.494 rows=3979 loops=3)
Group Key: main.year, main.product_group_key
Buffers: shared hit=480 read=22719
-> Parallel Seq Scan on dfo_by_quarter main (cost=0.00..36019.74 rows=169035 width=22) (actual time=106.916..357.071 rows=137171 loops=3)
Filter: ((year >= 2010) AND (release_key = 17))
Rows Removed by Filter: 546602
Buffers: shared hit=480 read=22719
CTE beforelookup
-> HashAggregate (cost=223.08..238.43 rows=558 width=102) (actual time=676.293..677.167 rows=663 loops=1)
Group Key: rank_query.year, CASE WHEN (rank_query.rank_by <= 50) THEN (rank_query.productgroupkey)::integer ELSE '-1'::integer END
Buffers: shared hit=480 read=22719
-> CTE Scan on rank_query (cost=0.00..139.43 rows=5577 width=70) (actual time=660.079..672.978 rows=5418 loops=1)
Buffers: shared hit=480 read=22719
-> CTE Scan on beforelookup bl (cost=0.00..11.16 rows=558 width=102) (actual time=676.296..677.665 rows=663 loops=1)
Buffers: shared hit=480 read=22719
-> Hash (cost=7.34..7.34 rows=434 width=4) (actual time=0.253..0.253 rows=435 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 24kB
Buffers: shared hit=3
-> Seq Scan on dim_dfo_product_group pg (cost=0.00..7.34 rows=434 width=4) (actual time=0.017..0.121 rows=435 loops=1)
Buffers: shared hit=3
Planning time: 0.319 ms
Execution time: 678.714 ms
Does anything spring to mind?
If I read it properly, it means my biggest cost by far is the initial scanof the table... but I don't manage to make it use an index...
I had created an index I hoped would help but it got ignored...
CREATE INDEX eric_silly_index ON milly.dfo_by_quarter(release_key, YEAR, date, product_group_key, units, sales_value_eur);
ANALYZE milly.dfo_by_quarter;
I also tried to cluster the table but no visible effect either
CLUSTER milly.dfo_by_quarter USING pk_milly_dfo_by_quarter; -- took 30 seconds (uidev)
ANALYZE milly.dfo_by_quarter;
Many thanks
Eric

Because release_key isn't actually a unique column, it's not possible from the information you've provided to know whether or not the index should be used. If a high percentage of rows have release_key = 2 or even a smaller percentage of rows match on a large table, it may not be efficient to use the index.
In part this is because Postgres indexes are indirect -- that is the index actually contains a pointer to the location on disk in the heap where the real tuple lives. So looping through an index requires reading an entry from the index, reading the tuple from the heap, and repeating. For a large number of tuples it's often more valuable to scan the heap directly and avoid the indirect disk access penalty.
Edit:
You generally don't want to be using CLUSTER in PostgreSQL; it's not how indexes are maintained, and it's rare to see that in the wild for that reason.
Your updated query with no data gives this plan:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on beforelookup bl (cost=8.33..8.35 rows=1 width=98) (actual time=0.143..0.143 rows=0 loops=1)
Buffers: shared hit=4
CTE rank_query
-> WindowAgg (cost=8.24..8.26 rows=1 width=108) (actual time=0.126..0.126 rows=0 loops=1)
Buffers: shared hit=4
-> Sort (cost=8.24..8.24 rows=1 width=68) (actual time=0.060..0.061 rows=0 loops=1)
Sort Key: main.year, (sum(main.units)) DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=4
-> GroupAggregate (cost=8.19..8.23 rows=1 width=68) (actual time=0.011..0.011 rows=0 loops=1)
Group Key: main.year, main.product_group_key
Buffers: shared hit=1
-> Sort (cost=8.19..8.19 rows=1 width=64) (actual time=0.011..0.011 rows=0 loops=1)
Sort Key: main.year, main.product_group_key
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=1
-> Index Scan using pk_milly_dfo_by_quarter on dfo_by_quarter main (cost=0.15..8.18 rows=1 width=64) (actual time=0.003..0.003 rows=0 loops=1)
Index Cond: ((release_key = 17) AND (year >= 2010))
Buffers: shared hit=1
CTE beforelookup
-> HashAggregate (cost=0.04..0.07 rows=1 width=102) (actual time=0.128..0.128 rows=0 loops=1)
Group Key: rank_query.year, CASE WHEN (rank_query.rank_by <= 50) THEN (rank_query.productgroupkey)::integer ELSE '-1'::integer END
Buffers: shared hit=4
-> CTE Scan on rank_query (cost=0.00..0.03 rows=1 width=70) (actual time=0.127..0.127 rows=0 loops=1)
Buffers: shared hit=4
Planning Time: 0.723 ms
Execution Time: 0.485 ms
(27 rows)
So PostgreSQL is entirely capable of using the index for your query, but the planner is deciding that it's not worth it (i.e., the costing for using the index directly is higher than the costing for using the parallel sequence scan).
If you set enable_indexscan = off; with no data, you get a bitmap index scan (as I'd expect). If you set enable_bitmapscan = off; with no data you get an (non-parallel) sequence scan.
You should see the plan change back (with large amounts of data) if you set max_parallel_workers = 0;.
But looking at your query's explain results, I'd very much expect using the index to be more expensive and take longer than using the parallel sequence scan. In your updated query you're still scanning a very high percentage of the table and a large number of rows, and you're also forcing accessing the heap by accessing fields not in the index. Postgres 11 (I believe) adds covering indexes which would theoretically allow you to make this query be driven by the index alone, but I'm not at all convinced in this example it would actually be worth it.

Generally, while possible, a PK spanning 7 columns, several of which being varchar(100) is not optimized for performance, to say the least.
Such an index is large to begin with and tends to bloat quickly, if you have updates on involved columns.
I would operate with a surrogate PK, a serial (or bigserial if you have that many rows). Or IDENTITY. See:
Auto increment table column
And a UNIQUE constraint on all 7 to enforce uniqueness (all are NOT NULL anyway).
If you have lots of counting queries with the only predicate on release_key consider an additional plain btree index on just that column.
The data type varchar(100) for so many columns may not be optimal. Some normalization might help.
More advise depends on missing information ...

The answer to my initial question: why is postgresql not using my index on something like SELECT (*)... can be found in the documentation...
Introduction to VACUUM, ANALYZE, EXPLAIN, and COUNT
In particular: This means that every time a row is read from an index, the engine has to also read the actual row in the table to ensure that the row hasn't been deleted.
This explains a lot why I don't manage to get postgresql to use my indexes when, from a SQL Server perspective, it obviously "should".

patial Index basing on data of second table

I have a master/detail Table situation. For every entry at the master table I have dozens at the detail table.
Lets say these are my tables:
+-----------------------
| Master
+-----------------------
| master_key integer,
| insert_date timestamp
+-----------------------
+-----------------------
| Detail
+-----------------------
| detail_key integer,
| master_key integer
| quantity numeric
| amount numeric
+-----------------------
And my most used Query is something like
SELECT extract(year from insert_date) AS Insert_Year, extract(month from insert_date) AS Insert_Month, sum(quantity) AS Quantity, sum(amount) AS Amount
FROM Master, Detail
WHERE (amount not null) and (insert_date <= '2016-12-31') and (insert_date >= '2015-01-01') and (Detail.master_key=Master.master_key)
GROUP BY Insert_Year, Insert_Month
ORDER BY Insert_Year ASC, Insert_Month ASC;
This query becomes to slow because there are tons of data for many years in both tables.
Of cause I have Indexes at both tables and EXPLAIN ANALYZE tells me that the INDEX scan takes mode than 80% of the hole Execution Time.
"Sort (cost=44013.52..44013.53 rows=1 width=19) (actual time=17073.129..17073.129 rows=16 loops=1)"
" Sort Key: (date_part('year'::text, master.insert_date)), (date_part('month'::text, master.insert_date))"
" Sort Method: quicksort Memory: 26kB"
" -> HashAggregate (cost=44013.49..44013.51 rows=1 width=19) (actual time=17073.046..17073.053 rows=16 loops=1)"
" Group Key: date_part('year'::text, master.insert_date), date_part('month'::text, master.insert_date)"
" -> Nested Loop (cost=0.43..43860.32 rows=15317 width=19) (actual time=0.056..15951.178 rows=843647 loops=1)"
" -> Seq Scan on master (cost=0.00..18881.38 rows=3127 width=12) (actual time=0.027..636.202 rows=182338 loops=1)"
" Filter: ((date(insert_date) >= '2015-01-01'::date) AND (date(insert_date) <= '2016-12-31'::date))"
" Rows Removed by Filter: 443031"
" -> Index Scan using idx_detail_master_key on detail (cost=0.43..7.89 rows=7 width=15) (actual time=0.055..0.077 rows=5 loops=182338)"
" Index Cond: (master_key = master.master_key)"
" Filter: (amount IS NOT NULL)"
" Rows Removed by Filter: 2"
"Planning time: 105.317 ms"
"Execution time: 17073.396 ms"
So my idea was to reduce the index sizes by defining them partial. In most cases only data of the last 2 years are queried.
So I tried something like that:
CREATE INDEX idx_detail_table_master_keys
ON detail (master_key)
WHERE master_key in (SELECT master_key FROM master WHERE (extract( year from insert_date) = 2016) or (extract( year from insert_date) = 2015))
Of cause this is not the final version it should just be a proof of concept and it failed. PGAdmin Tells me that I'm not allowed to use subselects on Index creation.
So my question is: Is it posiple to create a partial Index, basing on the data of an other table?
And of cause I would be thankful for any Tips speeding constellations like this up.
regards

It is not possible to create a partial index, basing on the data of another table because relational databases like postgresql have three ways of joining: nested loops, hash-join and sort-merge. All of these methods load the tables of a join seperately. Since the database optimizer decides which of those methods will be used and also in which direction the join will be executed, it doesn't make sense to create a table index that covers data of another table. This is why you can't define such indexes. A more detailed description about this topics can be found here: http://use-the-index-luke.com/sql/join (and the following sections of the online book)
For further optimization see the comment of Gabriel's Messanger

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse