Query too slow in PostgreSQL on a table with > 12M rows

I have a simple table in my web app with more than 12 million rows, and it is growing all the time.
+----+-----+-----+-------+--------+
| id | dtt | cus | event | server |
+----+-----+-----+-------+--------+
I'm getting the count of today's events for a customer using this query:
SELECT COUNT(*) FROM events
WHERE dtt AT TIME ZONE 'America/Santiago' >=date(now() AT TIME ZONE 'America/Santiago') + interval '1s'
AND cus=2
The performance is very bad for my web app: 22702 ms.
"Aggregate (cost=685814.54..685814.55 rows=1 width=0) (actual time=21773.451..21773.452 rows=1 loops=1)"
" -> Seq Scan on events (cost=0.00..675644.52 rows=4068008 width=0) (actual time=10277.508..21732.548 rows=409808 loops=1)"
" Filter: ((cus = 2) AND (timezone('America/Santiago'::text, dtt) >= (date(timezone('America/Santiago'::text, now())) + '00:00:01'::interval)))"
" Rows Removed by Filter: 12077798"
"Planning time: 0.127 ms"
"Execution time: 21773.509 ms"
I have the following indexes created:
CREATE INDEX events_dtt_idx
ON events
USING btree
(dtt);
CREATE INDEX events_id_desc
ON events
USING btree
(id DESC NULLS LAST);
CREATE INDEX events_cus_idx
ON events
USING btree
(cus);
CREATE INDEX events_id_idx
ON events
USING btree
(id);
Using PostgreSQL 9.4 on Linux x64.
How can I improve that? Thanks in advance.

Your WHERE clause applies an expression to dtt (dtt AT TIME ZONE ...), so a plain index on dtt cannot serve that filter. Create an expression index instead, something like:
CREATE INDEX dtt_tz_idx ON events (DATE(dtt AT TIME ZONE 'America/Santiago'));
then query
SELECT COUNT(*) FROM events
WHERE DATE(TIMEZONE('America/Santiago'::text, dtt)) >=date(now() AT TIME ZONE 'America/Santiago') + interval '1s'
AND cus=2
If it doesn't work, run "\d dtt_tz_idx" in psql and check that the data types in your query match those of the index.

I was finally able to fix the problem with this index:
CREATE INDEX dtt_tz_idx ON events (TIMEZONE('America/Santiago'::text, dtt));
Thanks sivan & vyegorov for your guidance; now the plan is:
"Aggregate (cost=567240.43..567240.44 rows=1 width=0) (actual time=238.440..238.440 rows=1 loops=1)"
" -> Bitmap Heap Scan on events (cost=82620.28..556463.97 rows=4310584 width=0) (actual time=41.445..208.870 rows=344453 loops=1)"
" Recheck Cond: (timezone('America/Santiago'::text, dtt) >= (date(timezone('America/Santiago'::text, now())) + '00:00:01'::interval))"
" Filter: (cus = 2)"
" Rows Removed by Filter: 9433"
" Heap Blocks: exact=9426"
" -> Bitmap Index Scan on dtt_tz_idx (cost=0.00..81542.63 rows=4415225 width=0) (actual time=38.866..38.866 rows=353886 loops=1)"
" Index Cond: (timezone('America/Santiago'::text, dtt) >= (date(timezone('America/Santiago'::text, now())) + '00:00:01'::interval))"
"Planning time: 0.221 ms"
"Execution time: 238.509 ms"

Related

Postgres is using UNIQUE index instead of FTS index

I have a table with more than 10,000,000 rows. The table has a column OfficialEnterprise_vatNumber that should be unique and can also be part of a full-text search.
Here are the indexes:
"uq_officialenterprise_vatnumber" "CREATE UNIQUE INDEX uq_officialenterprise_vatnumber ON commonservices.""OfficialEnterprise"" USING btree (""OfficialEnterprise_vatNumber"")"
"ix_officialenterprise_vatnumber" "CREATE INDEX ix_officialenterprise_vatnumber ON commonservices.""OfficialEnterprise"" USING gin (to_tsvector('commonservices.unaccent_dictionary'::regconfig, (""OfficialEnterprise_vatNumber"")::text))"
But if I EXPLAIN a query that should be using the FTS index, like this one:
SELECT * FROM commonservices."OfficialEnterprise"
WHERE
to_tsvector('commonservices.unaccent_dictionary', "OfficialEnterprise_vatNumber") @@ to_tsquery('FR:* | IE:*')
ORDER BY "OfficialEnterprise_vatNumber" ASC
LIMIT 100
It shows that the used index is uq_officialenterprise_vatnumber and not ix_officialenterprise_vatnumber.
Is there something I'm missing?
EDIT:
Here is the EXPLAIN ANALYZE output of the original query.
"Limit (cost=0.43..1460.27 rows=100 width=238) (actual time=6996.976..6997.057 rows=15 loops=1)"
" -> Index Scan using uq_officialenterprise_vatnumber on ""OfficialEnterprise"" (cost=0.43..1067861.32 rows=73149 width=238) (actual time=6996.975..6997.054 rows=15 loops=1)"
" Filter: (to_tsvector('commonservices.unaccent_dictionary'::regconfig, (""OfficialEnterprise_vatNumber"")::text) ## to_tsquery('FR:* | IE:*'::text))"
" Rows Removed by Filter: 1847197"
"Planning Time: 0.185 ms"
"Execution Time: 6997.081 ms"
Here is the EXPLAIN ANALYZE of the query if I add || '0' to the order by.
"Limit (cost=55558.82..55570.49 rows=100 width=270) (actual time=7.069..9.827 rows=15 loops=1)"
" -> Gather Merge (cost=55558.82..62671.09 rows=60958 width=270) (actual time=7.068..9.823 rows=15 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Sort (cost=54558.80..54635.00 rows=30479 width=270) (actual time=0.235..0.238 rows=5 loops=3)"
" Sort Key: (((""OfficialEnterprise_vatNumber"")::text || '0'::text))"
" Sort Method: quicksort Memory: 28kB"
" Worker 0: Sort Method: quicksort Memory: 25kB"
" Worker 1: Sort Method: quicksort Memory: 25kB"
" -> Parallel Bitmap Heap Scan on ""OfficialEnterprise"" (cost=719.16..53393.91 rows=30479 width=270) (actual time=0.157..0.166 rows=5 loops=3)"
" Recheck Cond: (to_tsvector('commonservices.unaccent_dictionary'::regconfig, (""OfficialEnterprise_vatNumber"")::text) ## to_tsquery('FR:* | IE:*'::text))"
" Heap Blocks: exact=6"
" -> Bitmap Index Scan on ix_officialenterprise_vatnumber (cost=0.00..700.87 rows=73149 width=0) (actual time=0.356..0.358 rows=15 loops=1)"
" Index Cond: (to_tsvector('commonservices.unaccent_dictionary'::regconfig, (""OfficialEnterprise_vatNumber"")::text) ## to_tsquery('FR:* | IE:*'::text))"
"Planning Time: 0.108 ms"
"Execution Time: 9.886 ms"
rows=73149...(actual...rows=15)
So it vastly misestimates the number of rows it will find. If it actually found 73149 rows, using the ordering index probably would be faster.
Have you run ANALYZE on the table since creating the functional index? Doing that might fix the estimation problem, or at least improve it enough to change the planning decision.
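For example (a minimal sketch, using the table name from the question):
-- Refresh statistics, including those gathered for the functional (expression) index.
ANALYZE commonservices."OfficialEnterprise";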
Yes, doing dummy operations like +0 or ||'' is hackish. They work by forcing the planner to think it can't use the index to fulfill the ORDER BY.
Doing full text search on a VAT number seems rather misguided in the first place. Maybe this would be best addressed with a LIKE query, or by creating an explicit column to hold the country of origin flag.
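For reference, a hedged sketch of that workaround (appending an empty string, equivalent to the || '0' used in the question), which stops the planner from using uq_officialenterprise_vatnumber to satisfy the ORDER BY:
-- The dummy expression turns the sort key into a computed value, so the btree index
-- no longer provides the required order and the GIN index is used for the filter.
SELECT *
FROM commonservices."OfficialEnterprise"
WHERE to_tsvector('commonservices.unaccent_dictionary', "OfficialEnterprise_vatNumber") @@ to_tsquery('FR:* | IE:*')
ORDER BY "OfficialEnterprise_vatNumber" || '' ASC
LIMIT 100;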

What indexes would improve the speed of this query?

SELECT link_id, date_trunc('day', inserted_at), count(*)
FROM clicks
WHERE previous_click_id IS NULL
  AND link_workspace_id = 2
GROUP BY link_id, date_trunc('day', inserted_at)
Output of EXPLAIN ANALYZE:
"GroupAggregate (cost=153356.84..163584.13 rows=129848 width=24) (actual time=1315.303..1783.331 rows=25234 loops=1)"
" Group Key: link_id, (date_trunc('day'::text, inserted_at))"
" -> Sort (cost=153356.84..155507.88 rows=860419 width=16) (actual time=1315.280..1645.578 rows=879836 loops=1)"
" Sort Key: link_id, (date_trunc('day'::text, inserted_at))"
" Sort Method: external merge Disk: 22296kB"
" -> Seq Scan on clicks (cost=0.00..53835.41 rows=860419 width=16) (actual time=0.054..741.964 rows=879836 loops=1)"
" Filter: ((previous_click_id IS NULL) AND (link_workspace_id = 2))"
" Rows Removed by Filter: 418485"
"Planning time: 0.204 ms"
"Execution time: 1794.119 ms"
A regular index scan won't help, since you are retrieving most of the table.
But you should increase work_mem to get a faster in-memory sort.
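For example (a hedged sketch; the 64MB value is an assumption to be tuned against your available memory, and is best set per session rather than globally):
-- The plan shows "external merge Disk: 22296kB", so a work_mem comfortably above
-- that should keep the sort in memory for this session.
SET work_mem = '64MB';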
You could experiment with a covering index:
CREATE INDEX ON clicks (link_workspace_id)
INCLUDE (link_id, inserted_at)
WHERE previous_click_id IS NULL;
VACUUM clicks;
You need to keep the table well vacuumed to get an index only scan.
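If INCLUDE is not available (it requires PostgreSQL 11 or later), a plain multicolumn partial index is a hedged alternative that can also yield an index-only scan here:
-- All columns the query reads are index keys; previous_click_id is handled by the predicate.
CREATE INDEX ON clicks (link_workspace_id, link_id, inserted_at)
WHERE previous_click_id IS NULL;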

What is the difference between Index-Only and Bitmap Index Scan in PostgreSQL?

In my query, I just want to fetch data with exact WHERE conditions. Those WHERE conditions are covered by an index, but EXPLAIN shows a Bitmap Index Scan. I couldn't understand why.
My query looks like below:
Select
r.spend,
r.date,
...
from metadata m
inner join
report r
on m.org_id = r.org_id and m.country_or_region = r.country_or_region and m.campaign_id = r.campaign_id and m.keyword_id = r.keyword_id
where r.org_id = 1 and m.keyword_type = 'KEYWORD'
offset 0 limit 20
Indexes:
Metadata(org_id, keyword_type, country_or_region, campaign_id, keyword_id);
Report(org_id, country_or_region, campaign_id, keyword_id, date);
Explain Analyze:
"Limit (cost=811883.21..910327.87 rows=20 width=8) (actual time=18120.268..18235.831 rows=20 loops=1)"
" -> Gather (cost=811883.21..2702020.67 rows=384 width=8) (actual time=18120.267..18235.791 rows=20 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Parallel Hash Join (cost=810883.21..2700982.27 rows=160 width=8) (actual time=18103.440..18103.496 rows=14 loops=3)"
" Hash Cond: (((r.country_or_region)::text = (m.country_or_region)::text) AND (r.campaign_id = m.campaign_id) AND (r.keyword_id = m.keyword_id))"
" -> Parallel Bitmap Heap Scan on report r (cost=260773.11..2051875.83 rows=3939599 width=35) (actual time=552.601..8532.962 rows=3162553 loops=3)"
" Recheck Cond: (org_id = 479360)"
" Rows Removed by Index Recheck: 21"
" Heap Blocks: exact=20484 lossy=84350"
" -> Bitmap Index Scan on idx_kr_org_date_camp (cost=0.00..258409.35 rows=9455038 width=0) (actual time=539.329..539.329 rows=9487660 loops=1)"
" Index Cond: (org_id = 479360)"
" -> Parallel Hash (cost=527278.08..527278.08 rows=938173 width=26) (actual time=7425.062..7425.062 rows=727133 loops=3)"
" Buckets: 65536 Batches: 64 Memory Usage: 2656kB"
" -> Parallel Bitmap Heap Scan on metadata m (cost=88007.61..527278.08 rows=938173 width=26) (actual time=1007.028..7119.233 rows=727133 loops=3)"
" Recheck Cond: ((org_id = 479360) AND ((keyword_type)::text = 'KEYWORD'::text))"
" Rows Removed by Index Recheck: 3"
" Heap Blocks: exact=14585 lossy=11054"
" -> Bitmap Index Scan on idx_primary (cost=0.00..87444.71 rows=2251615 width=0) (actual time=1014.631..1014.631 rows=2181399 loops=1)"
" Index Cond: ((org_id = 479360) AND ((keyword_type)::text = 'KEYWORD'::text))"
"Planning Time: 0.492 ms"
"Execution Time: 18235.879 ms"
Here I just want to fetch 20 rows. Shouldn't that be more efficient?
A Bitmap Index Scan happens when the planner expects a large fraction of rows to satisfy the search conditions. In that case, it scans the entire index and builds a bitmap of which pages on disk contain matching rows; the subsequent Bitmap Heap Scan step then fetches those pages. This is better than a Sequential Scan because it reads only the relevant pages, skipping pages known to contain no matching data. Depending on the statistics available to the optimizer, a plain Index Scan or Index-Only Scan may not be advantageous, but the bitmap scan is still better than a Sequential Scan.
To complete the answer to the question, an Index-Only Scan is a scan of the index that will pull the relevant data without having to visit the actual table. This is because the relevant data is already in the index. Take, for example, this table:
postgres=# create table foo (id int primary key, name text);
CREATE TABLE
postgres=# insert into foo values (generate_series(1,1000000),'foo');
INSERT 0 1000000
There is an index on the id column of this table, and suppose we call the following query:
postgres=# EXPLAIN ANALYZE SELECT * FROM foo WHERE id < 100;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Index Scan using foo_pkey on foo (cost=0.42..10.25 rows=104 width=8) (actual time=0.012..1.027 rows=99 loops=1)
Index Cond: (id < 100)
Planning Time: 0.190 ms
Execution Time: 2.067 ms
(4 rows)
This query results in an Index scan because it scans the index for the rows that have id < 100, and then visits the actual table on disk to pull the other columns included in the * portion of the SELECT query.
However, suppose we call the following query (notice SELECT id instead of SELECT *):
postgres=# EXPLAIN ANALYZE SELECT id FROM foo WHERE id < 100;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Index Only Scan using foo_pkey on foo (cost=0.42..10.25 rows=104 width=4) (actual time=0.019..0.996 rows=99 loops=1)
Index Cond: (id < 100)
Heap Fetches: 99
Planning Time: 0.098 ms
Execution Time: 1.980 ms
(5 rows)
This results in an Index-only scan because only the id column is requested, and that is included (naturally) in the index, so there's no need to visit the actual table on disk to retrieve anything else. This saves time, but its occurrence is very limited.
To answer your question about limiting to 20 results: the limiting occurs after the Bitmap Index Scan has completed, so the runtime will be roughly the same whether you limit to 20, 40, or some other value. With an Index Scan or Index-Only Scan, the executor stops scanning once it has produced the number of rows specified by the LIMIT clause; with a Bitmap Heap Scan, that early exit isn't possible for the index scan that builds the bitmap.
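As a quick illustration on the foo table from above (a sketch; the exact plan and timings will vary), a LIMIT under an Index-Only Scan lets the executor stop as soon as enough rows are produced:
-- The executor can stop after returning 20 rows, unlike a bitmap scan,
-- whose index scan must finish building the bitmap first.
EXPLAIN ANALYZE SELECT id FROM foo WHERE id < 100 LIMIT 20;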

Boolean column in multicolumn index

Test table and indexes:
CREATE TABLE public.t (id serial, cb boolean, ci integer, co integer)
INSERT INTO t(cb, ci, co)
SELECT ((round(random()*1))::int)::boolean, round(random()*100), round(random()*100)
FROM generate_series(1, 1000000)
CREATE INDEX "right" ON public.t USING btree (ci, cb, co);
CREATE INDEX wrong ON public.t USING btree (ci, co);
CREATE INDEX right_hack ON public.t USING btree (ci, (cb::integer), co);
The problem is that I can't force PostgreSQL to use the "right" index. The following query uses the "wrong" index. That is not optimal, because it applies a Filter (condition: cb = TRUE) and therefore reads more data from memory, making execution longer:
explain (analyze, buffers)
SELECT * FROM t WHERE cb = TRUE AND ci = 46 ORDER BY co LIMIT 1000
"Limit (cost=0.42..4063.87 rows=1000 width=13) (actual time=0.057..4.405 rows=1000 loops=1)"
" Buffers: shared hit=1960"
" -> Index Scan using wrong on t (cost=0.42..21784.57 rows=5361 width=13) (actual time=0.055..4.256 rows=1000 loops=1)"
" Index Cond: (ci = 46)"
" Filter: cb"
" Rows Removed by Filter: 967"
" Buffers: shared hit=1960"
"Planning time: 0.318 ms"
"Execution time: 4.530 ms"
But when I cast the boolean column to int, it works fine. This is unclear to me, because the selectivity of both indexes (right and right_hack) is the same.
explain (analyze, buffers)
SELECT * FROM t WHERE cb::int = 1 AND ci = 46 ORDER BY co LIMIT 1000
"Limit (cost=0.42..2709.91 rows=1000 width=13) (actual time=0.027..1.484 rows=1000 loops=1)"
" Buffers: shared hit=1003"
" -> Index Scan using right_hack on t (cost=0.42..14525.95 rows=5361 width=13) (actual time=0.025..1.391 rows=1000 loops=1)"
" Index Cond: ((ci = 46) AND ((cb)::integer = 1))"
" Buffers: shared hit=1003"
"Planning time: 0.202 ms"
"Execution time: 1.565 ms"
Are there any limitations on using a boolean column inside a multicolumn index?
A conditional index (or two) does seem to work:
CREATE INDEX true_bits ON ttt (ci, co)
WHERE cb = True ;
CREATE INDEX false_bits ON ttt (ci, co)
WHERE cb = False ;
VACUUM ANALYZE ttt;
EXPLAIN (ANALYZE, buffers)
SELECT * FROM ttt
WHERE cb = TRUE AND ci = 46 ORDER BY co LIMIT 1000
;
Plan
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.25..779.19 rows=1000 width=13) (actual time=0.024..1.804 rows=1000 loops=1)
Buffers: shared hit=1001
-> Index Scan using true_bits on ttt (cost=0.25..3653.46 rows=4690 width=13) (actual time=0.020..1.570 rows=1000 loops=1)
Index Cond: (ci = 46)
Buffers: shared hit=1001
Planning time: 0.468 ms
Execution time: 1.949 ms
(7 rows)
Still, there is very little gain from indexes on low-cardinality columns. The chance that an index entry can avoid a page read is very small. With a page size of 8 kB and a row size of ~20 bytes, there are ~400 records per page, so (almost) every page will contain both a true record and a false record and will have to be read anyway.

PostgreSQL date field query performance

I have two tables: the news table has 7M records and the news_publish table has 70M records.
When I execute this query it takes an enormous amount of time and is very slow.
I added three indexes for tuning, but the query is still slow.
When I googled the problem I found a suggestion to raise the statistics target to 1000, and I did, but the problem remains:
alter table khb_news alter submitteddate set statistics 1000;
SELECT n.id as newsid ,n.title,p.submitteddate as publishdate,
n.summary ,n.smallImageid ,
n.classification ,n.submitteddate as newsdate,
p.toorganizationid
from khb_news n
join khb_news_publish p
on n.id=p.newsid
left join dataitem b on b.id=n.classification
where
n.classification in (1) and n.newstype=60
AND n.submitteddate >= '2014/06/01'::timestamp AND n.submitteddate <'2014/08/01'::timestamp and p.toorganizationid=123123
order by p.id desc
limit 10 offset 0
The indexes are:
CREATE INDEX "p.id"
ON khb_news_publish
USING btree
(id DESC);
CREATE INDEX idx_toorganization
ON khb_news_publish
USING btree
(toorganizationid);
CREATE INDEX "idx_n.classification_n.newstype_n.submitteddate"
ON khb_news
USING btree
(classification, newstype, submitteddate);
After adding these indexes and running EXPLAIN ANALYZE, I get this plan:
"Limit (cost=0.99..10100.13 rows=10 width=284) (actual time=24711.831..24712.849 rows=10 loops=1)"
" -> Nested Loop (cost=0.99..5946373.12 rows=5888 width=284) (actual time=24711.827..24712.837 rows=10 loops=1)"
" -> Index Scan using "p.id" on khb_news_publish p (cost=0.56..4748906.31 rows=380294 width=32) (actual time=2.068..23338.731 rows=194209 loops=1)"
" Filter: (toorganizationid = 95607)"
" Rows Removed by Filter: 36333074"
" -> Index Scan using khb_news_pkey on khb_news n (cost=0.43..3.14 rows=1 width=260) (actual time=0.006..0.006 rows=0 loops=194209)"
" Index Cond: (id = p.newsid)"
" Filter: ((submitteddate >= '2014-06-01 00:00:00'::timestamp without time zone) AND (submitteddate < '2014-08-01 00:00:00'::timestamp without time zone) AND (newstype = 60) AND (classification = ANY ('{19,20,21}'::bigint[])))"
" Rows Removed by Filter: 1"
"Planning time: 3.871 ms"
"Execution time: 24712.982 ms"
I also posted the plan at https://explain.depesz.com/s/Gym
How can I change the query to make it faster?
You should start by creating an index on khb_news_publish (toorganizationid, id). The plan shows the scan of the "p.id" index discarding more than 36 million rows on the toorganizationid filter; with toorganizationid as the leading column, the ORDER BY p.id DESC LIMIT 10 can walk the index in order and stop early:
CREATE INDEX idx_toorganization_id
ON khb_news_publish
USING btree
(toorganizationid, id);
This should fix the problem, but you might also need this index:
CREATE INDEX idx_id_classification_newstype_submitteddate
ON khb_news
USING btree
(classification, newstype, submitteddate, id);
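After creating the indexes, it is worth refreshing statistics and re-running EXPLAIN ANALYZE (a minimal sketch):
-- Refresh planner statistics so the new indexes are costed against current data.
ANALYZE khb_news_publish;
ANALYZE khb_news;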