I have a table learners with around 3.2 million rows. It contains user-related information such as name and email. I need to optimize some queries that use ORDER BY on certain columns. For testing I created a temp_learners table with 0.8 million rows and two indexes on it:
CREATE UNIQUE INDEX "temp_learners_companyId_userId_idx"
ON temp_learners ("companyId" ASC, "userId" ASC, "learnerUserName" ASC, "learnerEmailId" ASC);
and
CREATE INDEX temp_learners_company_name_email_index
ON temp_learners ("companyId", "learnerUserName", "learnerEmailId");
The second index is just for testing.
Now, when I run this query:
SELECT *
FROM temp_learners
WHERE "companyId" = 909666665757230431 AND "userId" IN (
4990609084216745771,
4990610022492247987,
4990609742667096366,
4990609476136523663,
5451985767018841230,
5451985767078553638,
5270390122102920730,
4763688819142650938,
5056979692501246449,
5279569274741647114,
5031660827132289520,
4862889373349389098,
5299864070077160421,
4740222596778406913,
5320170488686569878,
5270367618320474818,
5320170488587895729,
5228888485293847415,
4778050469432720821,
5270392314970177842,
4849087862439244546,
5270392117430427860,
5270351184072717902,
5330263074228870897,
4763688829301614114,
4763684609695916489,
5270390232949727716
) ORDER BY "learnerUserName","learnerEmailId";
The query plan used by db is this:
Sort (cost=138.75..138.76 rows=4 width=1581) (actual time=0.169..0.171 rows=27 loops=1)
Sort Key: "learnerUserName", "learnerEmailId"
Sort Method: quicksort Memory: 73kB
-> Index Scan using "temp_learners_companyId_userId_idx" on temp_learners (cost=0.55..138.71 rows=4 width=1581) (actual time=0.018..0.112 rows=27 loops=1)
Index Cond: (("companyId" = '909666665757230431'::bigint) AND ("userId" = ANY ('{4990609084216745771,4990610022492247987,4990609742667096366,4990609476136523663,5451985767018841230,5451985767078553638,5270390122102920730,4763688819142650938,5056979692501246449,5279569274741647114,5031660827132289520,4862889373349389098,5299864070077160421,4740222596778406913,5320170488686569878,5270367618320474818,5320170488587895729,5228888485293847415,4778050469432720821,5270392314970177842,4849087862439244546,5270392117430427860,5270351184072717902,5330263074228870897,4763688829301614114,4763684609695916489,5270390232949727716}'::bigint[])))
Planning time: 0.116 ms
Execution time: 0.191 ms
Here the index is not used for sorting; the plan adds an explicit Sort node.
But when I run this query:
SELECT *
FROM temp_learners
WHERE "companyId" = 909666665757230431
ORDER BY "learnerUserName","learnerEmailId" limit 500;
This query uses the index for sorting:
Limit (cost=0.42..1360.05 rows=500 width=1581) (actual time=0.018..0.477 rows=500 loops=1)
-> Index Scan using temp_learners_company_name_email_index on temp_learners (cost=0.42..332639.30 rows=122327 width=1581) (actual time=0.018..0.442 rows=500 loops=1)
Index Cond: ("companyId" = '909666665757230431'::bigint)
Planning time: 0.093 ms
Execution time: 0.513 ms
What I am not able to understand is why Postgres does not use an index for sorting in the first query. I should also clarify that the normal use case for the learners table is joining it with other tables, so the first query is written to resemble the join condition. For example:
SELECT *
FROM temp_learners AS l
INNER JOIN entity_learners_basic AS elb
ON l."companyId" = elb."companyId" AND l."userId" = elb."userId"
WHERE l."companyId" = 909666665757230431 AND elb."gameId" = 1050403501267716928
ORDER BY "learnerUserName", "learnerEmailId" limit 5000;
Even after correcting the indexes, the query plan does not use an index for sorting:
QUERY PLAN
Limit (cost=3785.11..3785.22 rows=44 width=1767) (actual time=163.554..173.135 rows=5000 loops=1)
-> Sort (cost=3785.11..3785.22 rows=44 width=1767) (actual time=163.553..172.791 rows=5000 loops=1)
Sort Key: l."learnerUserName", l."learnerEmailId"
Sort Method: external merge Disk: 35416kB
-> Nested Loop (cost=1.12..3783.91 rows=44 width=1767) (actual time=0.019..63.743 rows=21195 loops=1)
-> Index Scan using primary_index__entity_learners_basic on entity_learners_basic elb (cost=0.57..1109.79 rows=314 width=186) (actual time=0.010..6.221 rows=21195 loops=1)
Index Cond: (("companyId" = '909666665757230431'::bigint) AND ("gameId" = '1050403501267716928'::bigint))
-> Index Scan using "temp_learners_companyId_userId_idx" on temp_learners l (cost=0.55..8.51 rows=1 width=1581) (actual time=0.002..0.002 rows=1 loops=21195)
Index Cond: (("companyId" = '909666665757230431'::bigint) AND ("userId" = elb."userId"))
Planning time: 0.309 ms
Execution time: 178.422 ms
Does Postgres not use indexes when joining and ordering data?
PostgreSQL chooses the plan it thinks will be faster. Using the index that provides rows in the correct order means using a much less selective index, so it doesn't think that will be faster overall.
If you want to force PostgreSQL into believing that sorting is the worst thing in the world, you could set enable_sort=off. If it still sorts after that, then you know PostgreSQL doesn't have the right indexes to avoid sorting, as opposed to just thinking they will not actually be faster.
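A minimal sketch of that experiment, assuming the temp_learners table from the question. The IN list is shortened here only for illustration, and the setting applies to the current session only:

```sql
-- Diagnostic only: makes the planner treat explicit sorts as very expensive.
SET enable_sort = off;

EXPLAIN ANALYZE
SELECT *
FROM temp_learners
WHERE "companyId" = 909666665757230431
  AND "userId" IN (4990609084216745771, 4990610022492247987)  -- shortened list
ORDER BY "learnerUserName", "learnerEmailId";

-- Restore the default so later queries in this session are planned normally.
RESET enable_sort;
```

Note that enable_sort = off does not forbid sorting; it only penalizes it heavily in the cost estimate, which is why a Sort node that survives tells you no sort-free plan was available.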
PostgreSQL could use an index on ("companyId", "learnerUserName", "learnerEmailId") for your first query, but the additional IN condition reduces the number of result rows to an estimated 4 rows, which means that the sort won't cost anything at all. So it chooses to use the index that can support the IN condition.
Rows returned with that index won't be in the correct order automatically, because
you specified DESC for the last index column, but ASC to the preceding one
you have more than one element in the IN list.
Without the IN condition, enough rows are returned, so that PostgreSQL thinks that it is cheaper to order by the index and filter out rows that don't match the condition.
With your first query, it is impossible to have an index that supports both the IN list in the WHERE condition and the ORDER BY clause, so PostgreSQL has to make a choice.
Edited: added Explain Analyze
I've got the following table (simplified for example):
CREATE TABLE public.streamscombined
(
eventtype text COLLATE pg_catalog."default",
payload jsonb,
clienttime bigint -- as millis from epoch
);
And a b-tree compound index on clienttime + eventtype
Correct use of index when index prunes a lot of rows
A query of the following form correctly uses the index when the clienttime condition filters out a lot of rows, e.g.:
explain SELECT * FROM streamscombined WHERE eventtype='typeA' AND clienttime <= 1522540900000 order by clienttime;
=>
Index Scan using "clienttime/type" on streamscombined (cost=0.56..1781593.82 rows=1135725 width=583)
Index Cond: ((clienttime <= '1522540900000'::bigint) AND (eventtype = 'typeA'::text))
Explain Analyze
Index Scan using "clienttime/type" on streamscombined (cost=0.56..1711616.01 rows=1079021 width=592) (actual time=1.369..13069.861 rows=1074896 loops=1)
Index Cond: ((clienttime <= '1522540900000'::bigint) AND (eventtype = 'typeA'::text))
Planning time: 0.208 ms
Execution time: 13369.330 ms
RESULT: when streaming the results, I see data coming in within 100 ms.
Ignoring index when index prunes fewer rows
However, it completely falls flat when I relax the clienttime condition, e.g. by adding 3 hours:
explain SELECT * FROM streamscombined WHERE eventtype='typeA' AND clienttime <= (1522540900000 + (3*3600*1000)) order by clienttime;
=>
Gather Merge (cost=2897003.10..3192254.78 rows=2530552 width=583)
Workers Planned: 2
-> Sort (cost=2896003.07..2899166.26 rows=1265276 width=583)
Sort Key: clienttime
-> Parallel Seq Scan on streamscombined (cost=0.00..2110404.89 rows=1265276 width=583)
Filter: ((clienttime <= '1522551700000'::bigint) AND (eventtype = 'typeA'::text))
Explain analyze
Gather Merge (cost=2918263.39..3193771.83 rows=2361336 width=592) (actual time=72505.138..75142.127 rows=2852704 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=2917263.37..2920215.04 rows=1180668 width=592) (actual time=70764.052..71430.200 rows=950901 loops=3)
Sort Key: clienttime
Sort Method: external merge Disk: 722336kB
-> Parallel Seq Scan on streamscombined (cost=0.00..2176719.08 rows=1180668 width=592) (actual time=0.451..57458.888 rows=950901 loops=3)
Filter: ((clienttime <= '1522551700000'::bigint) AND (eventtype = 'typeA'::text))
Rows Removed by Filter: 7736119
Planning time: 0.109 ms
Execution time: 76164.816 ms
RESULT: when streaming the results, I waited more than 5 minutes without getting any rows.
This is likely because PG believes the index will not prune the result set much, so it uses a different strategy.
However, and this is key, it seems to completely ignore the fact that I want to order by clienttime and that the index gives me that order for free.
Is there any way to force PG to use the index regardless of the actual value in the clienttime condition?
In the planner's estimate, sorting the result is cheap, while an index scan is expensive because it does many disk seeks.
A lower setting of random_page_cost reduces the cost estimate for the index scan, so index scans get used for larger result sets.
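A minimal sketch of trying this for one session, using the query from the question. The value 1.1 is an assumption chosen for illustration, not a recommendation for your hardware:

```sql
-- Show the current value; the built-in default is 4.0.
SHOW random_page_cost;

-- Session-local change: random page reads are now costed close to sequential ones.
SET random_page_cost = 1.1;  -- illustrative value, tune for your storage

EXPLAIN
SELECT *
FROM streamscombined
WHERE eventtype = 'typeA'
  AND clienttime <= (1522540900000 + (3*3600*1000))
ORDER BY clienttime;

-- Go back to the configured value.
RESET random_page_cost;
```

If the plan switches back to the index scan, the setting can then be made permanent at the database or server level.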
Say you have a table with some indices:
create table mail
(
identifier serial primary key,
member text,
read boolean
);
create index on mail(member);
create index on mail(read);
If you now query on multiple columns which have separate indices, will it ever use both indices?
select * from mail where member = 'Jess' and read = false;
That is, can PostgreSQL decide to first use the index on member to fetch all mails for Jess and then use the index on read to fetch all unread mails and then intersect both results to construct the output set?
I know you can have an index with multiple columns (on (member, read) in this case), but what happens if you have two separate indices? Will PostgreSQL pick just one or can it use both in some cases?
This is not a question about a specific query. It is a generic question to understand the internals.
The PostgreSQL documentation covers this under combining multiple indexes.
It says the system builds a bitmap of matching row locations for each index and then combines the bitmaps:
To combine multiple indexes, the system scans each needed index and
prepares a bitmap in memory giving the locations of table rows that
are reported as matching that index's conditions. The bitmaps are then
ANDed and ORed together as needed by the query. Finally, the actual
table rows are visited and returned.
CREATE TABLE fifteen
(one serial PRIMARY KEY
, three integer not NULL
, five integer not NULL
);
INSERT INTO fifteen(three,five)
SELECT gs%33+5,gs%55+11
FROM generate_series(1,60000) gs
;
CREATE INDEX ON fifteen(three);
CREATE INDEX ON fifteen(five);
ANALYZE fifteen;
EXPLAIN ANALYZE
SELECT*
FROM fifteen
WHERE three= 7
AND five =13
;
Result:
CREATE TABLE
INSERT 0 60000
CREATE INDEX
CREATE INDEX
ANALYZE
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on fifteen (cost=19.24..51.58 rows=31 width=12) (actual time=0.391..0.761 rows=364 loops=1)
Recheck Cond: ((five = 13) AND (three = 7))
Heap Blocks: exact=324
-> BitmapAnd (cost=19.24..19.24 rows=31 width=0) (actual time=0.355..0.355 rows=0 loops=1)
-> Bitmap Index Scan on fifteen_five_idx (cost=0.00..7.15 rows=1050 width=0) (actual time=0.136..0.136 rows=1091 loops=1)
Index Cond: (five = 13)
-> Bitmap Index Scan on fifteen_three_idx (cost=0.00..11.93 rows=1788 width=0) (actual time=0.194..0.194 rows=1819 loops=1)
Index Cond: (three = 7)
Planning time: 0.259 ms
Execution time: 0.796 ms
(10 rows)
Changing {33,55} to {3,5} will yield an index scan over only one index, plus an additional filter condition
(probably because the cost savings would be too small).
I have a very simple SQL:
select * from email.email_task where acquire_time < now() and state IN ('CREATED', 'RELEASED') order by creation_time asc limit 1;
I have 2 indexes created:
Index of state
Index of state, acquire_time, creation_time
Ideally, I think Postgres should pick the 2nd one, since it covers every column used in this SQL.
However, the execution plan shows otherwise; it uses neither of the indexes:
Limit (cost=187404.36..187404.36 rows=1 width=743)
-> Sort (cost=187404.36..190753.58 rows=1339690 width=743)
Sort Key: creation_time
-> Seq Scan on email_task (cost=0.00..180705.91 rows=1339690 width=743)
Filter: (((state)::text = 'CREATED'::text) AND (acquire_time < now()))
I understand that if the number of rows returned reaches something like 10% of the total, Postgres will pick a Seq Scan over an Index Scan (as explained in "Why does PostgreSQL perform sequential scan on indexed column?").
So that's why index 1 is not picked.
What I don't understand is why index 2 is not picked, since it matches all the columns.
Then I tried a 3rd index:
Index of creation_time, acquire_time, state
And this time it uses index 3 (I added the index using another smaller database perf_1, because the original one has 2 million rows and it takes too much time):
Limit (cost=0.29..0.36 rows=1 width=75) (actual time=0.043..0.043 rows=1 loops=1)
-> Index Scan using perf_1 on email_task (cost=0.29..763.76 rows=9998 width=75) (actual time=0.042..0.042 rows=1 loops=1)
Index Cond: (acquire_time < now())
Filter: ((state)::text = ANY ('{CREATED,RELEASED}'::text[]))
It seems that the Postgres planner gives priority to the ORDER BY clause over the WHERE clause, which is a little counter-intuitive.
Is my understanding correct, or are there other factors that influence the Postgres planner?
Thanks in advance.
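One way to read the two plans above: in index 2 the range condition on acquire_time sits in front of creation_time, so the matching entries do not come out in creation_time order and a Sort is still needed. Index 3 leads with creation_time, so the scan can stop at the first row that passes the filter. A hedged sketch of an index built on that idea follows; the index name is hypothetical, and whether the planner actually picks it depends on the data and statistics:

```sql
-- Hypothetical index name. A partial index that keeps only rows in the two
-- states the query asks for, ordered by creation_time so LIMIT 1 can stop early.
CREATE INDEX email_task_pending_creation_idx
    ON email.email_task (creation_time)
    WHERE state IN ('CREATED', 'RELEASED');

-- The query repeats the index predicate verbatim so the planner can match it.
EXPLAIN
SELECT *
FROM email.email_task
WHERE acquire_time < now()
  AND state IN ('CREATED', 'RELEASED')
ORDER BY creation_time ASC
LIMIT 1;
```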
I have the following situation:
Data = around 400 million (string1, string2, score) tuples.
Data size ~ 20 GB; it doesn't fit in memory.
Data is stored in a file in CSV format and is not sorted by any field.
I need to efficiently retrieve all tuples with a particular string, e.g. all tuples such that string1 = 'google'.
How do I design a system such that I can do this efficiently?
I have already tried PostgreSQL with a B-tree index and a GIN index, but they aren't fast enough (> 20-30 seconds per query).
Ideally, I need a solution that sorts the tuples by string1, stores them in sorted order, and then runs a binary search followed by a sequential scan for retrieval. But I don't know which database or system implements such functionality.
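For reference, the closest thing to "store sorted, then search and scan sequentially" that I know of in PostgreSQL is a B-tree index combined with CLUSTER. A rough sketch under stated assumptions: the table and index names are hypothetical, and the file path is a placeholder.

```sql
-- Hypothetical table mirroring the (string1, string2, score) tuples.
CREATE TABLE tuples (
    string1 text NOT NULL,
    string2 text,
    score   real
);

-- Placeholder path; server-side COPY needs the file to be readable by the server.
COPY tuples FROM '/path/to/data.csv' WITH (FORMAT csv);

-- The B-tree provides the search step over string1.
CREATE INDEX tuples_string1_idx ON tuples (string1);

-- Rewrite the table in index order so rows with equal string1 sit on adjacent pages.
-- One-time operation: it takes an exclusive lock and is not maintained for later inserts.
CLUSTER tuples USING tuples_string1_idx;
ANALYZE tuples;

SELECT * FROM tuples WHERE string1 = 'google';
```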
UPDATE:
Here are the Postgres details:
I bulk-loaded the data into Postgres using the COPY command. Then I created two indices on string1, one B-tree and one GIN. However, Postgres is not using either of them.
Create tables:
CREATE TABLE mytable(
string1 varchar primary key, string2 varchar, source_id integer REFERENCES sources(id), score real);
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX string1_gin_index ON mytable USING gin (string1 gin_trgm_ops);
CREATE INDEX string1_index ON mytable(lower(string1));
Query plan:
isa=# EXPLAIN ANALYZE VERBOSE select * from mytable where string1 ilike 'google';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.mytable (cost=235.88..41872.81 rows=11340 width=29) (actual time=20234.765..25566.128 rows=30971 loops=1)
Output: hyponym, string2, source_id, score
Recheck Cond: ((mytable.string1)::text ~~* 'google'::text)
Rows Removed by Index Recheck: 34573
-> Bitmap Index Scan on string1_gin_index (cost=0.00..233.05 rows=11340 width=0) (actual time=20218.263..20218.263 rows=65544 loops=1)
Index Cond: ((mytable.string1)::text ~~* 'google'::text)
Total runtime: 25568.209 ms
(7 rows)
isa=# EXPLAIN ANALYZE VERBOSE select * from mytable where string1 = 'google';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.mytable (cost=0.00..2546373.30 rows=3425 width=29) (actual time=11692.606..139401.099 rows=30511 loops=1)
Output: string1, string2, source_id, score
Filter: ((mytable.string1)::text = 'google'::text)
Rows Removed by Filter: 124417194
Total runtime: 139403.950 ms
(5 rows)
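One thing visible in the definitions above: string1_index is built on lower(string1), so a predicate on bare string1 cannot use it. A minimal sketch of a query that matches the indexed expression, assuming a case-insensitive exact match is what the ILIKE without wildcards is meant to express:

```sql
-- The WHERE clause uses the same expression as string1_index (lower(string1)),
-- which makes that B-tree a candidate for the planner.
EXPLAIN ANALYZE
SELECT *
FROM mytable
WHERE lower(string1) = lower('google');
```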