What indexes would improve the speed of this query? - postgresql

SELECT link_id, date_trunc('day', inserted_at), count(*)
FROM clicks
WHERE previous_click_id IS NULL
  AND link_workspace_id = 2
GROUP BY link_id, date_trunc('day', inserted_at)
Output of EXPLAIN ANALYZE:
"GroupAggregate (cost=153356.84..163584.13 rows=129848 width=24) (actual time=1315.303..1783.331 rows=25234 loops=1)"
" Group Key: link_id, (date_trunc('day'::text, inserted_at))"
" -> Sort (cost=153356.84..155507.88 rows=860419 width=16) (actual time=1315.280..1645.578 rows=879836 loops=1)"
" Sort Key: link_id, (date_trunc('day'::text, inserted_at))"
" Sort Method: external merge Disk: 22296kB"
" -> Seq Scan on clicks (cost=0.00..53835.41 rows=860419 width=16) (actual time=0.054..741.964 rows=879836 loops=1)"
" Filter: ((previous_click_id IS NULL) AND (link_workspace_id = 2))"
" Rows Removed by Filter: 418485"
"Planning time: 0.204 ms"
"Execution time: 1794.119 ms"

A regular index scan won't help, since you are retrieving most of the table.
But you should increase work_mem so that the sort can happen in memory rather than spilling some 22 MB to disk, as it does in the plan above.
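A minimal sketch (the value is an assumption; it just needs to comfortably exceed what the sort spilled to disk):
SET work_mem = '64MB';  -- per session; use ALTER SYSTEM or postgresql.conf to change it globally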
You could experiment with a covering index:
CREATE INDEX ON clicks (link_workspace_id)
INCLUDE (link_id, inserted_at)
WHERE previous_click_id IS NULL;
VACUUM clicks;
You need to keep the table well vacuumed to get an index-only scan.
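Whether you actually get one is easy to check; re-running the query under EXPLAIN should then show an Index Only Scan instead of the Seq Scan, provided the visibility map covers most of the table:
EXPLAIN (ANALYZE, BUFFERS)
SELECT link_id, date_trunc('day', inserted_at), count(*)
FROM clicks
WHERE previous_click_id IS NULL
  AND link_workspace_id = 2
GROUP BY link_id, date_trunc('day', inserted_at);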

Related

Postgres is using UNIQUE index instead of FTS index

I have a table with more than 10,000,000 rows. The table has a column OfficialEnterprise_vatNumber that should be unique and can be part of a full text search.
Here are the indexes:
"uq_officialenterprise_vatnumber" "CREATE UNIQUE INDEX uq_officialenterprise_vatnumber ON commonservices.""OfficialEnterprise"" USING btree (""OfficialEnterprise_vatNumber"")"
"ix_officialenterprise_vatnumber" "CREATE INDEX ix_officialenterprise_vatnumber ON commonservices.""OfficialEnterprise"" USING gin (to_tsvector('commonservices.unaccent_dictionary'::regconfig, (""OfficialEnterprise_vatNumber"")::text))"
But if I EXPLAIN a query that should be using the FTS index like this
SELECT * FROM commonservices."OfficialEnterprise"
WHERE
to_tsvector('commonservices.unaccent_dictionary', "OfficialEnterprise_vatNumber") @@ to_tsquery('FR:* | IE:*')
ORDER BY "OfficialEnterprise_vatNumber" ASC
LIMIT 100
It shows that the used index is uq_officialenterprise_vatnumber and not ix_officialenterprise_vatnumber.
Is there something I'm missing?
EDIT:
Here is the EXPLAIN ANALYZE statement of the original query.
"Limit (cost=0.43..1460.27 rows=100 width=238) (actual time=6996.976..6997.057 rows=15 loops=1)"
" -> Index Scan using uq_officialenterprise_vatnumber on ""OfficialEnterprise"" (cost=0.43..1067861.32 rows=73149 width=238) (actual time=6996.975..6997.054 rows=15 loops=1)"
" Filter: (to_tsvector('commonservices.unaccent_dictionary'::regconfig, (""OfficialEnterprise_vatNumber"")::text) ## to_tsquery('FR:* | IE:*'::text))"
" Rows Removed by Filter: 1847197"
"Planning Time: 0.185 ms"
"Execution Time: 6997.081 ms"
Here is the EXPLAIN ANALYZE of the query if I add || '0' to the order by.
"Limit (cost=55558.82..55570.49 rows=100 width=270) (actual time=7.069..9.827 rows=15 loops=1)"
" -> Gather Merge (cost=55558.82..62671.09 rows=60958 width=270) (actual time=7.068..9.823 rows=15 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Sort (cost=54558.80..54635.00 rows=30479 width=270) (actual time=0.235..0.238 rows=5 loops=3)"
" Sort Key: (((""OfficialEnterprise_vatNumber"")::text || '0'::text))"
" Sort Method: quicksort Memory: 28kB"
" Worker 0: Sort Method: quicksort Memory: 25kB"
" Worker 1: Sort Method: quicksort Memory: 25kB"
" -> Parallel Bitmap Heap Scan on ""OfficialEnterprise"" (cost=719.16..53393.91 rows=30479 width=270) (actual time=0.157..0.166 rows=5 loops=3)"
" Recheck Cond: (to_tsvector('commonservices.unaccent_dictionary'::regconfig, (""OfficialEnterprise_vatNumber"")::text) ## to_tsquery('FR:* | IE:*'::text))"
" Heap Blocks: exact=6"
" -> Bitmap Index Scan on ix_officialenterprise_vatnumber (cost=0.00..700.87 rows=73149 width=0) (actual time=0.356..0.358 rows=15 loops=1)"
" Index Cond: (to_tsvector('commonservices.unaccent_dictionary'::regconfig, (""OfficialEnterprise_vatNumber"")::text) ## to_tsquery('FR:* | IE:*'::text))"
"Planning Time: 0.108 ms"
"Execution Time: 9.886 ms"
rows=73149...(actual...rows=15)
So it vastly misestimates the number of rows it will find. If it actually found 73149 rows, using the ordering index probably would be faster.
Have you ANALYZED your table since creating the functional index on it? Doing that might fix the estimation problem. Or at least improve it enough to fix the planning decision.
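A minimal sketch, using the table from the question:
ANALYZE commonservices."OfficialEnterprise";  -- refreshes planner statistics, including those collected for the expression index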
Yes, doing dummy operations like +0 or ||'' is hackish. They work by forcing the planner to think it can't use the index to fulfill the ORDER BY.
Doing full text search on a VAT number seems rather misguided in the first place. Maybe this would be best addressed with a LIKE query, or by creating an explicit column to hold the country of origin flag.
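A sketch of the LIKE approach (the index name is made up, and the column is assumed to be varchar; use text_pattern_ops instead of varchar_pattern_ops if it is text):
CREATE INDEX ix_officialenterprise_vatnumber_prefix
ON commonservices."OfficialEnterprise" ("OfficialEnterprise_vatNumber" varchar_pattern_ops);

SELECT * FROM commonservices."OfficialEnterprise"
WHERE "OfficialEnterprise_vatNumber" LIKE 'FR%'
   OR "OfficialEnterprise_vatNumber" LIKE 'IE%'
ORDER BY "OfficialEnterprise_vatNumber"
LIMIT 100;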

PostgreSQL Slow DISTINCT WHERE

Imagine the following table:
CREATE TABLE drops(
id BIGSERIAL PRIMARY KEY,
loc VARCHAR(5) NOT NULL,
tag INT NOT NULL
);
What I want to do is perform a query where I can find all unique locations where a value matches the tag.
SELECT DISTINCT loc
FROM drops
WHERE tag = '1'
GROUP BY loc;
I am not sure whether it is due to the size of the table (it's 9 million rows!) or my query being inefficient, but it takes far too long for users to work with. At the time of writing, the above query took 1:14 minutes.
Are there any tricks or methods I can use to shorten this to a few seconds?
Much appreciated!
The execution plan:
"Unique (cost=1967352.72..1967407.22 rows=41 width=4) (actual time=40890.768..40894.984 rows=30 loops=1)"
" -> Group (cost=1967352.72..1967407.12 rows=41 width=4) (actual time=40890.767..40894.972 rows=30 loops=1)"
" Group Key: loc"
" -> Gather Merge (cost=1967352.72..1967406.92 rows=82 width=4) (actual time=40890.765..40895.031 rows=88 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Group (cost=1966352.70..1966397.43 rows=41 width=4) (actual time=40879.910..40883.362 rows=29 loops=3)"
" Group Key: loc"
" -> Sort (cost=1966352.70..1966375.06 rows=8946 width=4) (actual time=40879.907..40881.154 rows=19129 loops=3)"
" Sort Key: loc"
" Sort Method: quicksort Memory: 1660kB"
" -> Parallel Seq Scan on drops (cost=0.00..1965765.53 rows=8946 width=4) (actual time=1.341..40858.553 rows=19129 loops=3)"
" Filter: (tag = 1)"
" Rows Removed by Filter: 3113338"
"Planning time: 0.146 ms"
"Execution time: 40895.280 ms"
The table is indexed on loc and tag.
Your 40 seconds are spent sequentially reading the whole table, throwing away 3113338 rows to keep only 19129.
The remedy is simple:
CREATE INDEX ON drops(tag);
But you say you have already done that, which I find hard to believe. What is the command you used?
Change the condition in the query from
WHERE tag = '1'
to
WHERE tag = 1
It happens to work because '1' is a literal, but don't try to compare strings and numbers.
And, as has been mentioned, keep either the DISTINCT or the GROUP BY, but not both.
If you use a GROUP BY clause, there is no need for the DISTINCT keyword. Omitting it should speed up the query.
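Putting that together, a sketch of the cleaned-up query (assuming the index on tag suggested above has been created):
SELECT DISTINCT loc
FROM drops
WHERE tag = 1;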

Postgresql Query Too Slow

I use PostgreSQL to store my data, and I also created an index to speed up the query time.
After I created the index on the table, the query ran very fast, about 1.5 s per query.
But a few days later, the query became very slow, about 20-28 s per query.
I tried dropping the index and re-creating it, and the query runs fast again.
Could you help me resolve this issue, or do you have any idea what causes it?
P.S.: here is the query:
SELECT category,
video_title AS title,
event_count AS contentView,
VOD_GROUPBY_ANDSORT.rank
FROM
(SELECT VOD_SORTBY_VIEW.category,
VOD_SORTBY_VIEW.video_title,
VOD_SORTBY_VIEW.event_count,
ROW_NUMBER() OVER(PARTITION BY VOD_SORTBY_VIEW.category
ORDER BY VOD_SORTBY_VIEW.event_count DESC) AS RN,
DENSE_RANK() OVER(
ORDER BY VOD_SORTBY_VIEW.category ASC) AS rank
FROM
(SELECT VOD.category AS category,
VOD.video_title,
SUM(INV.event_count) AS event_count
FROM
(SELECT content_hash.hash_value,
VODCT.category,
VODCT.video_title
FROM
(SELECT vod_content.content_id,
vod_content.category,
vod_content.video_title
FROM vod_content
WHERE vod_content.category IS NOT NULL) VODCT
LEFT JOIN content_hash ON content_hash.content_value = VODCT.content_id) VOD
LEFT JOIN inventory_stats INV ON INV.hash_value = VOD.hash_value
WHERE TIME BETWEEN '2017-02-06 08:00:00'::TIMESTAMP AND '2017-03-06 08:00:00'::TIMESTAMP
GROUP BY VOD.category,
VOD.video_title ) VOD_SORTBY_VIEW ) VOD_GROUPBY_ANDSORT
WHERE RN <= 3
AND event_count > 100
ORDER BY category
And here is the Analyze:
"QUERY PLAN"
"Subquery Scan on vod_groupby_andsort (cost=368586.86..371458.16 rows=6381 width=63) (actual time=19638.213..19647.468 rows=177 loops=1)"
" Filter: ((vod_groupby_andsort.rn <= 3) AND (vod_groupby_andsort.event_count > 100))"
" Rows Removed by Filter: 10246"
" -> WindowAgg (cost=368586.86..370596.77 rows=57426 width=71) (actual time=19638.199..19646.856 rows=10423 loops=1)"
" -> WindowAgg (cost=368586.86..369735.38 rows=57426 width=63) (actual time=19638.194..19642.030 rows=10423 loops=1)"
" -> Sort (cost=368586.86..368730.43 rows=57426 width=55) (actual time=19638.185..19638.984 rows=10423 loops=1)"
" Sort Key: vod_sortby_view.category, vod_sortby_view.event_count DESC"
" Sort Method: quicksort Memory: 1679kB"
" -> Subquery Scan on vod_sortby_view (cost=350535.62..362084.01 rows=57426 width=55) (actual time=16478.589..19629.400 rows=10423 loops=1)"
" -> GroupAggregate (cost=350535.62..361509.75 rows=57426 width=55) (actual time=16478.589..19628.381 rows=10423 loops=1)"
" Group Key: vod_content.category, vod_content.video_title"
" -> Sort (cost=350535.62..353135.58 rows=1039987 width=51) (actual time=16478.570..19436.741 rows=1275817 loops=1)"
" Sort Key: vod_content.category, vod_content.video_title"
" Sort Method: external merge Disk: 76176kB"
" -> Hash Join (cost=95179.29..175499.62 rows=1039987 width=51) (actual time=299.040..807.418 rows=1275817 loops=1)"
" Hash Cond: (inv.hash_value = content_hash.hash_value)"
" -> Bitmap Heap Scan on inventory_stats inv (cost=48708.84..114604.81 rows=1073198 width=23) (actual time=133.873..269.249 rows=1397466 loops=1)"
" Recheck Cond: ((""time"" >= '2017-02-06 08:00:00'::timestamp without time zone) AND (""time"" <= '2017-03-06 08:00:00'::timestamp without time zone))"
" Heap Blocks: exact=11647"
" -> Bitmap Index Scan on inventory_stats_pkey (cost=0.00..48440.54 rows=1073198 width=0) (actual time=132.113..132.113 rows=1397466 loops=1)"
" Index Cond: ((""time"" >= '2017-02-06 08:00:00'::timestamp without time zone) AND (""time"" <= '2017-03-06 08:00:00'::timestamp without time zone))"
" -> Hash (cost=46373.37..46373.37 rows=7766 width=66) (actual time=165.125..165.125 rows=13916 loops=1)"
" Buckets: 16384 (originally 8192) Batches: 1 (originally 1) Memory Usage: 1505kB"
" -> Nested Loop (cost=1.72..46373.37 rows=7766 width=66) (actual time=0.045..159.441 rows=13916 loops=1)"
" -> Seq Scan on content_hash (cost=0.00..389.14 rows=8014 width=43) (actual time=0.007..2.185 rows=16365 loops=1)"
" -> Bitmap Heap Scan on vod_content (cost=1.72..5.73 rows=1 width=72) (actual time=0.009..0.009 rows=1 loops=16365)"
" Recheck Cond: (content_id = content_hash.content_value)"
" Filter: (category IS NOT NULL)"
" Rows Removed by Filter: 0"
" Heap Blocks: exact=15243"
" -> Bitmap Index Scan on vod_content_pkey (cost=0.00..1.72 rows=1 width=0) (actual time=0.007..0.007 rows=1 loops=16365)"
" Index Cond: (content_id = content_hash.content_value)"
"Planning time: 1.665 ms"
"Execution time: 19655.693 ms"
You probably need to vacuum and analyze your tables more aggressively, especially if you're doing a lot of deletes and updates.
When a row is deleted or updated, it isn't removed, it's just marked obsolete. vacuum cleans up these dead rows.
analyze updates the statistics about your data used by the query planner.
Normally these are run by the autovacuum daemon. It's possible this has been disabled, or it's not running frequently enough.
See this blog about Slow PostgreSQL Performance and the PostgreSQL docs about Routine Vacuuming for more details.
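A rough sketch of how to check and fix that by hand (table names taken from the plan above):
-- see when autovacuum last processed the tables involved
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('inventory_stats', 'vod_content', 'content_hash');

-- or run it manually
VACUUM ANALYZE inventory_stats;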
Here is an attempt at a condensed version of the query. I'm not saying it's any faster. Also since I can't run it, there might be some issues.
The left joins were converted to inner joins, since the time value from the second join is required.
Also, I'm curious what the purpose of the dense_rank function is. Looks like you are getting the top 3 titles for each category and then giving the ones in the same category all the same number based on alphanumeric sort? The category already gives them a unique common identifier.
SELECT category, video_title AS title, event_count AS contentView,
DENSE_RANK() OVER(ORDER BY v.category ASC) AS rank
FROM (
SELECT c.category, c.video_title,
SUM(i.event_count) AS event_count,
ROW_NUMBER() OVER(PARTITION BY category ORDER BY sum(i.event_count) DESC) AS rn
FROM vod_content c
JOIN content_hash h ON h.content_value = c.content_id
JOIN inventory_stats i ON i.hash_value = h.hash_value
where c.category is not null
and i.time BETWEEN '2017-02-06 08:00:00'::TIMESTAMP AND '2017-03-06 08:00:00'::TIMESTAMP
GROUP BY c.category, c.video_title
) v
where rn <= 3 and event_count > 100
ORDER BY category

Query too slow in Postgresql in table with > 12M rows

I have a simple table in my web app with more than 12 million rows, and it keeps growing.
+----+-----+-----+-------+--------+
| id | dtt | cus | event | server |
+----+-----+-----+-------+--------+
I'm getting the count of today's events by customer using this query:
SELECT COUNT(*) FROM events
WHERE dtt AT TIME ZONE 'America/Santiago' >=date(now() AT TIME ZONE 'America/Santiago') + interval '1s'
AND cus=2
And the performance is very bad for my web app: 22702 ms.
"Aggregate (cost=685814.54..685814.55 rows=1 width=0) (actual time=21773.451..21773.452 rows=1 loops=1)"
" -> Seq Scan on events (cost=0.00..675644.52 rows=4068008 width=0) (actual time=10277.508..21732.548 rows=409808 loops=1)"
" Filter: ((cus = 2) AND (timezone('America/Santiago'::text, dtt) >= (date(timezone('America/Santiago'::text, now())) + '00:00:01'::interval)))"
" Rows Removed by Filter: 12077798"
"Planning time: 0.127 ms"
"Execution time: 21773.509 ms"
I have the following indexes created:
CREATE INDEX events_dtt_idx
ON events
USING btree
(dtt);
CREATE INDEX events_id_desc
ON events
USING btree
(id DESC NULLS LAST);
CREATE INDEX events_cus_idx
ON events
USING btree
(cus);
CREATE INDEX events_id_idx
ON events
USING btree
(id);
Using Postgresql 9.4, Linux x64
How can I improve that? Thanks in advance.
Create an expression index that matches the expression in your WHERE clause, something like:
CREATE INDEX dtt_tz_idx ON events (DATE(dtt AT TIME ZONE 'America/Santiago'));
then query
SELECT COUNT(*) FROM events
WHERE DATE(TIMEZONE('America/Santiago'::text, dtt)) >=date(now() AT TIME ZONE 'America/Santiago') + interval '1s'
AND cus=2
If it doesn't work, run "\d dtt_tz_idx" in psql and check that the data types in your query match those of the index.
Finally, I was able to fix the problem with this index:
CREATE INDEX dtt_tz_idx ON events (TIMEZONE('America/Santiago'::text, dtt));
Thanks sivan & vyegorov for your guidance; now the plan is:
"Aggregate (cost=567240.43..567240.44 rows=1 width=0) (actual time=238.440..238.440 rows=1 loops=1)"
" -> Bitmap Heap Scan on events (cost=82620.28..556463.97 rows=4310584 width=0) (actual time=41.445..208.870 rows=344453 loops=1)"
" Recheck Cond: (timezone('America/Santiago'::text, dtt) >= (date(timezone('America/Santiago'::text, now())) + '00:00:01'::interval))"
" Filter: (cus = 2)"
" Rows Removed by Filter: 9433"
" Heap Blocks: exact=9426"
" -> Bitmap Index Scan on dtt_tz_idx (cost=0.00..81542.63 rows=4415225 width=0) (actual time=38.866..38.866 rows=353886 loops=1)"
" Index Cond: (timezone('America/Santiago'::text, dtt) >= (date(timezone('America/Santiago'::text, now())) + '00:00:01'::interval))"
"Planning time: 0.221 ms"
"Execution time: 238.509 ms"

Postgres chooses wrong query plan

I have a problem with a query that uses a bad query plan. Because of the suboptimal plan, the query takes almost 20 s.
The problem occurs only for a small number of owner_ids. The distribution of the owner_ids is not uniform. The owner_id in the example has 7948 routes. The total number of routes is 2903096.
The database is hosted on Amazon RDS on a server with 34.2 GiB memory, 4vCPU and provisioned IOPS (instance type db.m2.2xlarge). The Postgres version is 9.3.5.
EXPLAIN ANALYZE SELECT
route.id, route_meta.name
FROM
route
INNER JOIN
route_meta
USING (id)
WHERE
route.owner_id = 128905
ORDER BY
route_meta.name
LIMIT
61
Query plan:
"Limit (cost=0.86..58637.88 rows=61 width=24) (actual time=49.731..18828.052 rows=61 loops=1)"
" -> Nested Loop (cost=0.86..7934263.10 rows=8254 width=24) (actual time=49.728..18827.887 rows=61 loops=1)"
" -> Index Scan using route_meta_i_name on route_meta (cost=0.43..289911.22 rows=2902910 width=24) (actual time=0.016..2825.932 rows=1411126 loops=1)"
" -> Index Scan using route_pkey on route (cost=0.43..2.62 rows=1 width=4) (actual time=0.009..0.009 rows=0 loops=1411126)"
" Index Cond: (id = route_meta.id)"
" Filter: (owner_id = 128905)"
" Rows Removed by Filter: 1"
"Total runtime: 18828.214 ms"
If I increase the limit to 100, a better query plan is used. It now takes less than 100 ms.
EXPLAIN ANALYZE SELECT
route.id, route_meta.name
FROM
route
INNER JOIN
route_meta
USING (id)
WHERE
route.owner_id = 128905
ORDER BY
route_meta.name
LIMIT
100
Query plan:
"Limit (cost=79964.98..79965.23 rows=100 width=24) (actual time=93.037..93.294 rows=100 loops=1)"
" -> Sort (cost=79964.98..79985.61 rows=8254 width=24) (actual time=93.033..93.120 rows=100 loops=1)"
" Sort Key: route_meta.name"
" Sort Method: top-N heapsort Memory: 31kB"
" -> Nested Loop (cost=0.86..79649.52 rows=8254 width=24) (actual time=0.039..77.955 rows=7948 loops=1)"
" -> Index Scan using route_i_owner_id on route (cost=0.43..22765.84 rows=8408 width=4) (actual time=0.023..13.839 rows=7948 loops=1)"
" Index Cond: (owner_id = 128905)"
" -> Index Scan using route_meta_pkey on route_meta (cost=0.43..6.76 rows=1 width=24) (actual time=0.003..0.004 rows=1 loops=7948)"
" Index Cond: (id = route.id)"
"Total runtime: 93.444 ms"
I already tried the following things:
increasing statistics for owner_id (The owner_id in the example is included in the pg_stats)
ALTER TABLE route ALTER COLUMN owner_id SET STATISTICS 1000;
reindex owner_id and name
vacuum analyse
increased work_mem from 1MB to 16MB
rewriting the query to use row_number() OVER (ORDER BY xxx) AS rn ... WHERE rn <= yyy in a subquery, which solves this specific case but introduces performance problems with other owner_ids (see the sketch below)
A similar problem was solved with a combined index, but that seems impossible here because of the different tables.
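A sketch of that row_number() rewrite, applied to the query from the question (since it numbers every row for the owner before filtering, it can regress for owners with many routes, as noted above):
SELECT id, name
FROM (
    SELECT route.id, route_meta.name,
           row_number() OVER (ORDER BY route_meta.name) AS rn
    FROM route
    INNER JOIN route_meta USING (id)
    WHERE route.owner_id = 128905
) sub
WHERE rn <= 61
ORDER BY name;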
Postgres uses wrong index in query plan