Say you have a table with some indices:
create table mail
(
identifier serial primary key,
member text,
read boolean
);
create index on mail(member);
create index on mail(read);
If you now query on multiple columns which have separate indices, will it ever use both indices?
select * from mail where member = 'Jess' and read = false;
That is, can PostgreSQL decide to first use the index on member to fetch all mails for Jess, then use the index on read to fetch all unread mails, and then intersect both results to construct the output set?
I know you can have an index with multiple columns (on (member, read) in this case), but what happens if you have two separate indices? Will PostgreSQL pick just one or can it use both in some cases?
This is not a question about a specific query. It is a generic question to understand the internals.
See the Postgres documentation on combining multiple indexes.
The docs say it will create an abstract representation of where both indexes apply and then combine the results:
To combine multiple indexes, the system scans each needed index and
prepares a bitmap in memory giving the locations of table rows that
are reported as matching that index's conditions. The bitmaps are then
ANDed and ORed together as needed by the query. Finally, the actual
table rows are visited and returned.
CREATE TABLE fifteen
(one serial PRIMARY KEY
, three integer not NULL
, five integer not NULL
);
INSERT INTO fifteen(three,five)
SELECT gs%33+5,gs%55+11
FROM generate_series(1,60000) gs
;
CREATE INDEX ON fifteen(three);
CREATE INDEX ON fifteen(five);
ANALYZE fifteen;
EXPLAIN ANALYZE
SELECT *
FROM fifteen
WHERE three = 7
AND five = 13
;
Result:
CREATE TABLE
INSERT 0 60000
CREATE INDEX
CREATE INDEX
ANALYZE
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on fifteen (cost=19.24..51.58 rows=31 width=12) (actual time=0.391..0.761 rows=364 loops=1)
Recheck Cond: ((five = 13) AND (three = 7))
Heap Blocks: exact=324
-> BitmapAnd (cost=19.24..19.24 rows=31 width=0) (actual time=0.355..0.355 rows=0 loops=1)
-> Bitmap Index Scan on fifteen_five_idx (cost=0.00..7.15 rows=1050 width=0) (actual time=0.136..0.136 rows=1091 loops=1)
Index Cond: (five = 13)
-> Bitmap Index Scan on fifteen_three_idx (cost=0.00..11.93 rows=1788 width=0) (actual time=0.194..0.194 rows=1819 loops=1)
Index Cond: (three = 7)
Planning time: 0.259 ms
Execution time: 0.796 ms
(10 rows)
Changing {33,55} to {3,5} will yield an index scan over only one index, plus an additional filter condition (probably because the cost savings from a second index would be too small).
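For reference, that variation would look like this (a sketch; same schema as above, rebuilt so that three has only 3 distinct values and five only 5):

TRUNCATE fifteen;
INSERT INTO fifteen(three, five)
SELECT gs % 3 + 5, gs % 5 + 11
FROM generate_series(1, 60000) gs;
ANALYZE fifteen;
-- EXPLAIN should now show a scan over a single index with a Filter on the
-- other column, instead of a BitmapAnd of the two indexes.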
I have a table learners which has around 3.2 million rows. This table contains user-related information like name and email. I need to optimize some queries that use ORDER BY on some columns. So for testing I have created a temp_learners table with 0.8 million rows. I have created two indexes on this table:
CREATE UNIQUE INDEX "temp_learners_companyId_userId_idx"
ON temp_learners ("companyId" ASC, "userId" ASC, "learnerUserName" ASC, "learnerEmailId" ASC);
and
CREATE INDEX temp_learners_company_name_email_index
ON temp_learners ("companyId", "learnerUserName", "learnerEmailId");
The second index is just for testing.
Now when I run this query:
SELECT *
FROM temp_learners
WHERE "companyId" = 909666665757230431 AND "userId" IN (
4990609084216745771,
4990610022492247987,
4990609742667096366,
4990609476136523663,
5451985767018841230,
5451985767078553638,
5270390122102920730,
4763688819142650938,
5056979692501246449,
5279569274741647114,
5031660827132289520,
4862889373349389098,
5299864070077160421,
4740222596778406913,
5320170488686569878,
5270367618320474818,
5320170488587895729,
5228888485293847415,
4778050469432720821,
5270392314970177842,
4849087862439244546,
5270392117430427860,
5270351184072717902,
5330263074228870897,
4763688829301614114,
4763684609695916489,
5270390232949727716
) ORDER BY "learnerUserName","learnerEmailId";
The query plan used by the database is this:
Sort (cost=138.75..138.76 rows=4 width=1581) (actual time=0.169..0.171 rows=27 loops=1)
" Sort Key: ""learnerUserName"", ""learnerEmailId"""
Sort Method: quicksort Memory: 73kB
-> Index Scan using "temp_learners_companyId_userId_idx" on temp_learners (cost=0.55..138.71 rows=4 width=1581) (actual time=0.018..0.112 rows=27 loops=1)
" Index Cond: ((""companyId"" = '909666665757230431'::bigint) AND (""userId"" = ANY ('{4990609084216745771,4990610022492247987,4990609742667096366,4990609476136523663,5451985767018841230,5451985767078553638,5270390122102920730,4763688819142650938,5056979692501246449,5279569274741647114,5031660827132289520,4862889373349389098,5299864070077160421,4740222596778406913,5320170488686569878,5270367618320474818,5320170488587895729,5228888485293847415,4778050469432720821,5270392314970177842,4849087862439244546,5270392117430427860,5270351184072717902,5330263074228870897,4763688829301614114,4763684609695916489,5270390232949727716}'::bigint[])))"
Planning time: 0.116 ms
Execution time: 0.191 ms
Here it does not use an index for the sort.
But when I run this query:
SELECT *
FROM temp_learners
WHERE "companyId" = 909666665757230431
ORDER BY "learnerUserName","learnerEmailId" limit 500;
This query uses an index for the sort:
Limit (cost=0.42..1360.05 rows=500 width=1581) (actual time=0.018..0.477 rows=500 loops=1)
-> Index Scan using temp_learners_company_name_email_index on temp_learners (cost=0.42..332639.30 rows=122327 width=1581) (actual time=0.018..0.442 rows=500 loops=1)
Index Cond: ("companyId" = '909666665757230431'::bigint)
Planning time: 0.093 ms
Execution time: 0.513 ms
What I am not able to understand is why Postgres does not use the index in the first query. Also, I want to point out that the normal use case of this learners table is to join it with other tables, so the first query I wrote is closer to the join scenario. For example:
SELECT *
FROM temp_learners AS l
INNER JOIN entity_learners_basic AS elb
ON l."companyId" = elb."companyId" AND l."userId" = elb."userId"
WHERE l."companyId" = 909666665757230431 AND elb."gameId" = 1050403501267716928
ORDER BY "learnerUserName", "learnerEmailId" limit 5000;
Even after correcting the indexes, the query plan does not use an index for the sort.
QUERY PLAN
Limit (cost=3785.11..3785.22 rows=44 width=1767) (actual time=163.554..173.135 rows=5000 loops=1)
-> Sort (cost=3785.11..3785.22 rows=44 width=1767) (actual time=163.553..172.791 rows=5000 loops=1)
" Sort Key: l.""learnerUserName"", l.""learnerEmailId"""
Sort Method: external merge Disk: 35416kB
-> Nested Loop (cost=1.12..3783.91 rows=44 width=1767) (actual time=0.019..63.743 rows=21195 loops=1)
-> Index Scan using primary_index__entity_learners_basic on entity_learners_basic elb (cost=0.57..1109.79 rows=314 width=186) (actual time=0.010..6.221 rows=21195 loops=1)
Index Cond: (("companyId" = '909666665757230431'::bigint) AND ("gameId" = '1050403501267716928'::bigint))
-> Index Scan using "temp_learners_companyId_userId_idx" on temp_learners l (cost=0.55..8.51 rows=1 width=1581) (actual time=0.002..0.002 rows=1 loops=21195)
Index Cond: (("companyId" = '909666665757230431'::bigint) AND ("userId" = elb."userId"))
Planning time: 0.309 ms
Execution time: 178.422 ms
Does Postgres not use indexes when joining and ordering data?
PostgreSQL chooses the plan it thinks will be faster. Using the index that provides rows in the correct order means using a much less selective index, so it doesn't think that will be faster overall.
If you want to force PostgreSQL into believing that sorting is the worst thing in the world, you could set enable_sort=off. If it still sorts after that, then you know PostgreSQL doesn't have the right indexes to avoid sorting, as opposed to just thinking they will not actually be faster.
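For example (a sketch; plug in the full IN list from the question):

SET enable_sort = off;  -- makes the planner treat explicit sorts as prohibitively expensive
EXPLAIN (ANALYZE)
SELECT *
FROM temp_learners
WHERE "companyId" = 909666665757230431
  AND "userId" IN (4990609084216745771 /* ..., rest of the list from above */)
ORDER BY "learnerUserName", "learnerEmailId";
RESET enable_sort;      -- restore the default for this session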
PostgreSQL could use an index on ("companyId", "learnerUserName", "learnerEmailId") for your first query, but the additional IN condition reduces the number of result rows to an estimated 4 rows, which means that the sort won't cost anything at all. So it chooses to use the index that can support the IN condition.
Rows returned with that index won't be in the correct order automatically, because:
you specified DESC for the last index column, but ASC for the preceding ones, and
you have more than one element in the IN list.
Without the IN condition, enough rows are returned that PostgreSQL thinks it is cheaper to read the rows in index order and filter out those that don't match the condition.
With your first query, it is impossible to have an index that supports both the IN list in the WHERE condition and the ORDER BY clause, so PostgreSQL has to make a choice.
We have an images table containing around 25 million records, and when I query the table based on the values from several joins, the planner's row-count estimates are quite different from the actual results. We have other queries that are roughly the same without all of the joins, and they are much faster. I would like to know what steps I can take to debug and optimize the query. Also, is it better to have one index covering all columns included in the join and the WHERE clause, or multiple indexes, one for each join column and then another with all of the fields in the WHERE clause?
The query:
EXPLAIN ANALYZE
SELECT "images".* FROM "images"
INNER JOIN "locations" ON "locations"."id" = "images"."location_id"
INNER JOIN "users" ON "images"."creator_id" = "users"."id"
INNER JOIN "user_groups" ON "users"."id" = "user_groups"."user_id"
WHERE "images"."deleted_at" IS NULL
AND "user_groups"."group_id" = 7
AND "images"."creator_type" = 'User'
AND "images"."status" = 2
AND "locations"."active" = TRUE
ORDER BY date_uploaded DESC
LIMIT 50
OFFSET 0;
The explain:
Limit (cost=25670.61..25670.74 rows=50 width=585) (actual time=1556.250..1556.278 rows=50 loops=1)
-> Sort (cost=25670.61..25674.90 rows=1714 width=585) (actual time=1556.250..1556.264 rows=50 loops=1)
Sort Key: images.date_uploaded
Sort Method: top-N heapsort Memory: 75kB
-> Nested Loop (cost=1.28..25613.68 rows=1714 width=585) (actual time=0.097..1445.777 rows=160886 loops=1)
-> Nested Loop (cost=0.85..13724.04 rows=1753 width=585) (actual time=0.069..976.326 rows=161036 loops=1)
-> Nested Loop (cost=0.29..214.87 rows=22 width=8) (actual time=0.023..0.786 rows=22 loops=1)
-> Seq Scan on user_groups (cost=0.00..95.83 rows=22 width=4) (actual time=0.008..0.570 rows=22 loops=1)
Filter: (group_id = 7)
Rows Removed by Filter: 5319
-> Index Only Scan using users_pkey on users (cost=0.29..5.40 rows=1 width=4) (actual time=0.006..0.008 rows=1 loops=22)
Index Cond: (id = user_groups.user_id)
Heap Fetches: 18
-> Index Scan using creator_date_uploaded_Where_pub_not_del on images (cost=0.56..612.08 rows=197 width=585) (actual time=0.062..40.992 rows=7320 loops=22)
Index Cond: ((creator_id = users.id) AND ((creator_type)::text = 'User'::text) AND (status = 2))
-> Index Scan using locations_pkey on locations (cost=0.43..6.77 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=161036)
Index Cond: (id = images.location_id)
Filter: active
Rows Removed by Filter: 0
Planning time: 1.694 ms
Execution time: 1556.352 ms
We are running Postgres 9.4 on an RDS db.m4.large instance.
As for the query itself, the only thing you can do is skip the users table. From the EXPLAIN you can see that it only does an Index Only Scan without actually touching the table. So, technically, your query could look like this:
SELECT images.* FROM images
INNER JOIN locations ON locations.id = images.location_id
INNER JOIN user_groups ON images.creator_id = user_groups.user_id
WHERE images.deleted_at IS NULL
AND user_groups.group_id = 7
AND images.creator_type = 'User'
AND images.status = 2
AND locations.active = TRUE
ORDER BY date_uploaded DESC
OFFSET 0 LIMIT 50
The rest is about indexes. locations seems to have very little data, so optimization there will gain you nothing. user_groups, on the other hand, could benefit from an index ON (user_id) WHERE group_id = 7 or ON (group_id, user_id). This should remove some extra filtering on table content.
-- Option 1
CREATE INDEX ix_usergroups_userid_groupid7
ON user_groups (user_id)
WHERE group_id = 7;
-- Option 2
CREATE INDEX ix_usergroups_groupid_userid
ON user_groups (group_id, user_id);
Of course, the biggest thing here is images. Currently, the planner does an index scan on creator_date_uploaded_Where_pub_not_del, which I suspect does not fully match the requirements. Here, multiple options come to mind depending on your usage pattern, from one where the search parameters are rather common:
-- Option 1
CREATE INDEX ix_images_creatorid_typeuser_status2_notdel
ON images (creator_id)
WHERE creator_type = 'User' AND status = 2 AND deleted_at IS NULL;
to one with completely dynamic parameters:
-- Option 2
CREATE INDEX ix_images_status_creatortype_creatorid_notdel
ON images (status, creator_type, creator_id)
WHERE deleted_at IS NULL;
The first index is preferable as it is smaller (values are filtered out rather than indexed).
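If in doubt, you can check how much smaller with pg_relation_size, which reports the on-disk size of an index (using the Option 1 name from above):

SELECT pg_size_pretty(pg_relation_size('ix_images_creatorid_typeuser_status2_notdel'));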
To summarize, unless you are limited by memory (or other factors), I would add indexes on user_groups and images. The correct choice of indexes must be confirmed empirically, as multiple options are usually available and the outcome depends on the statistical distribution of the data.
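A sketch of one way to confirm empirically without committing to anything: build one candidate (here Option 2) inside a transaction, re-run the query, and roll back so nothing persists. Note that CREATE INDEX blocks writes to the table while it builds, so do this on a test copy if that matters:

BEGIN;
CREATE INDEX ix_usergroups_groupid_userid
ON user_groups (group_id, user_id);
EXPLAIN (ANALYZE, BUFFERS)
SELECT "images".* FROM "images"
INNER JOIN "locations" ON "locations"."id" = "images"."location_id"
INNER JOIN "users" ON "images"."creator_id" = "users"."id"
INNER JOIN "user_groups" ON "users"."id" = "user_groups"."user_id"
WHERE "images"."deleted_at" IS NULL
AND "user_groups"."group_id" = 7
AND "images"."creator_type" = 'User'
AND "images"."status" = 2
AND "locations"."active" = TRUE
ORDER BY date_uploaded DESC
LIMIT 50;
ROLLBACK;  -- discards the experimental index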
Here's a different approach:
I think one of the problems is that the plan joins an estimated 1714 rows (over 160,000 in reality) and then just returns the first 50 results. We'll probably want to avoid the extra joins as early as possible.
For this, we'll try an index with date_uploaded first and then filter by the rest of the columns. We also add creator_id to get an index-only scan:
CREATE INDEX ix_images_sort_test
ON images (date_uploaded desc, creator_id)
WHERE creator_type = 'User' AND status = 2 AND deleted_at IS NULL;
You may also use the generic (unfiltered) version, but it should perform somewhat worse: since the first column will be date_uploaded, we will need to read the whole index while filtering on the rest of the columns.
CREATE INDEX ix_images_sort_test
ON images (date_uploaded desc, status, creator_type, creator_id)
WHERE deleted_at IS NULL;
The pity here is that you are also filtering by group_id, which is in another table. Even so, it may be worth trying this approach.
Also, verify that all joined tables have an index on the foreign key.
So, add an index for user_groups as (user_id, group_id)
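A sketch of that index (the name is illustrative):

CREATE INDEX ix_usergroups_userid_groupid
ON user_groups (user_id, group_id);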
Also, as Boris noticed, you may remove the "Users" join.
I've got the following table (simplified for example):
CREATE TABLE public.streamscombined
(
    eventtype text COLLATE pg_catalog."default",
    payload jsonb,
    clienttime bigint  -- as millis since epoch
);
And a b-tree compound index on clienttime + eventtype
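The exact definition isn't shown; judging from the index name "clienttime/type" in the plans below, presumably something like:

CREATE INDEX "clienttime/type"
ON streamscombined (clienttime, eventtype);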
Correct use of the index when it prunes a lot of rows
Doing a query of the following format correctly uses the index when the clienttime condition filters out a lot of rows, e.g.:
explain SELECT * FROM streamscombined WHERE eventtype='typeA' AND clienttime <= 1522550900000 order by clienttime;
=>
Index Scan using "clienttime/type" on streamscombined (cost=0.56..1781593.82 rows=1135725 width=583)
Index Cond: ((clienttime <= '1522540900000'::bigint) AND (eventtype = 'typeA'::text))
Explain Analyze
Index Scan using "clienttime/type" on streamscombined (cost=0.56..1711616.01 rows=1079021 width=592) (actual time=1.369..13069.861 rows=1074896 loops=1)
Index Cond: ((clienttime <= '1522540900000'::bigint) AND (eventtype = 'typeA'::text))
Planning time: 0.208 ms
Execution time: 13369.330 ms
RESULT: streaming the results, I see data coming in within 100 ms.
Ignoring the index when it prunes fewer rows
However, it completely falls flat when relaxing the clienttime condition, e.g. adding 3 hours:
explain SELECT * FROM streamscombined WHERE eventtype='typeA' AND clienttime <= (1522540900000 + (3*3600*1000)) order by clienttime;
=>
Gather Merge (cost=2897003.10..3192254.78 rows=2530552 width=583)
Workers Planned: 2
-> Sort (cost=2896003.07..2899166.26 rows=1265276 width=583)
Sort Key: clienttime
-> Parallel Seq Scan on streamscombined (cost=0.00..2110404.89 rows=1265276 width=583)
Filter: ((clienttime <= '1522551700000'::bigint) AND (eventtype = 'typeA'::text))
Explain analyze
Gather Merge (cost=2918263.39..3193771.83 rows=2361336 width=592) (actual time=72505.138..75142.127 rows=2852704 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=2917263.37..2920215.04 rows=1180668 width=592) (actual time=70764.052..71430.200 rows=950901 loops=3)
Sort Key: clienttime
Sort Method: external merge Disk: 722336kB
-> Parallel Seq Scan on streamscombined (cost=0.00..2176719.08 rows=1180668 width=592) (actual time=0.451..57458.888 rows=950901 loops=3)
Filter: ((clienttime <= '1522551700000'::bigint) AND (eventtype = 'typeA'::text))
Rows Removed by Filter: 7736119
Planning time: 0.109 ms
Execution time: 76164.816 ms
RESULT: streaming the results, I've waited for more than 5 minutes without anything arriving.
This is likely because PG believes the index will not prune the result set that much, so it uses a different strategy.
However, and this is key, it seems to completely ignore the fact that I want to order by clienttime, which the index gives me for free.
Is there any way to force PG to use the index independent on the actual value for the clienttime-condition?
Sorting the result is cheap; an index scan is expensive because it does many random disk seeks.
A lower setting of random_page_cost reduces the cost estimate for the index scan, resulting in index scans being used for larger result sets.
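A minimal sketch (4.0 is the default; values near 1.1 are common for SSD-backed storage, but tune for your hardware):

SET random_page_cost = 1.1;               -- current session only
ALTER SYSTEM SET random_page_cost = 1.1;  -- cluster-wide, survives restarts
SELECT pg_reload_conf();                  -- apply the ALTER SYSTEM change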
I have a simple schema and query, but am experiencing consistent awful performance with certain parameters.
Schema:
CREATE TABLE locations (
id integer NOT NULL,
barcode_id integer NOT NULL
);
CREATE TABLE barcodes (
id integer NOT NULL,
value citext NOT NULL
);
ALTER TABLE ONLY locations ADD CONSTRAINT locations_pkey PRIMARY KEY (id);
ALTER TABLE ONLY barcodes ADD CONSTRAINT barcodes_pkey PRIMARY KEY (id);
ALTER TABLE ONLY locations ADD CONSTRAINT fk_locations_barcodes FOREIGN KEY (barcode_id) REFERENCES barcodes(id);
CREATE INDEX index_barcodes_on_value ON barcodes (value);
CREATE INDEX index_locations_on_barcode_id ON locations (barcode_id);
Query:
EXPLAIN ANALYZE
SELECT *
FROM locations
JOIN barcodes ON locations.barcode_id = barcodes.id
ORDER BY barcodes.value ASC
LIMIT 50;
Analysis:
Limit (cost=0.71..3564.01 rows=50 width=34) (actual time=0.043..683.025 rows=50 loops=1)
-> Nested Loop (cost=0.71..4090955.00 rows=57404 width=34) (actual time=0.043..683.017 rows=50 loops=1)
-> Index Scan using index_barcodes_on_value on barcodes (cost=0.42..26865.99 rows=496422 width=15) (actual time=0.023..218.775 rows=372138 loops=1)
-> Index Scan using index_locations_on_barcode_id on locations (cost=0.29..5.32 rows=287 width=8) (actual time=0.001..0.001 rows=0 loops=372138)
Index Cond: (barcode_id = barcodes.id)
Planning time: 0.167 ms
Execution time: 683.078 ms
500+ ms for the number of entries in my schema (500,000 barcodes and 60,000 locations) doesn't make sense. Can I do anything to improve the performance?
Note:
Even stranger, the execution time depends on the data. In drafting this question I attempted to include seeded random data, but the seeded data turns out to be performant:
Seed:
CREATE EXTENSION IF NOT EXISTS pgcrypto;  -- gen_random_uuid() needs this before PostgreSQL 13
INSERT INTO barcodes (id, value) SELECT seed.id, gen_random_uuid() FROM generate_series(1,500000) AS seed(id);
-- floor(...)+1 keeps barcode_id in 1..500000, so the foreign key is never violated
INSERT INTO locations (id, barcode_id) SELECT seed.id, floor(RANDOM() * 500000)::int + 1 FROM generate_series(1,60000) AS seed(id);
Analysis:
Limit (cost=0.71..3602.63 rows=50 width=86) (actual time=0.089..1.123 rows=50 loops=1)
-> Nested Loop (cost=0.71..4330662.42 rows=60116 width=86) (actual time=0.088..1.115 rows=50 loops=1)
-> Index Scan using index_barcodes_on_value on barcodes (cost=0.42..44972.42 rows=500000 width=41) (actual time=0.006..0.319 rows=376 loops=1)
-> Index Scan using index_locations_on_barcode_id on locations (cost=0.29..5.56 rows=301 width=8) (actual time=0.002..0.002 rows=0 loops=376)
Index Cond: (barcode_id = barcodes.id)
Planning time: 0.213 ms
Execution time: 1.152 ms
Edit:
Analysis of the tables:
ANALYZE VERBOSE barcodes;
INFO: analyzing "public.barcodes"
INFO: "barcodes": scanned 2760 of 2760 pages, containing 496157 live
rows and 0 dead rows; 30000 rows in sample, 496157 estimated total rows
ANALYZE
Time: 62.937 ms
ANALYZE VERBOSE locations;
INFO: analyzing "public.locations"
INFO: "locations": scanned 254 of 254 pages, containing 57394 live rows
and 0 dead rows; 30000 rows in sample, 57394 estimated total rows
ANALYZE
Time: 21.447 ms
The problem is that the barcodes with low values don't have matches in locations, which PostgreSQL cannot know. So its plan (fetch the barcodes in the correct output order via the index, then join with locations until 50 matches are found) turns out much worse than it expected.
I would ANALYZE the tables and
DROP INDEX index_barcodes_on_value;
That should keep PostgreSQL from choosing that plan.
I don't know what plan PostgreSQL will choose then.
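If you'd rather experiment than drop the index for good, DDL is transactional in PostgreSQL, so the same test can be rolled back (a sketch; note DROP INDEX holds an exclusive lock on the table until the rollback):

BEGIN;
DROP INDEX index_barcodes_on_value;
EXPLAIN ANALYZE
SELECT *
FROM locations
JOIN barcodes ON locations.barcode_id = barcodes.id
ORDER BY barcodes.value ASC
LIMIT 50;
ROLLBACK;  -- the index is restored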
For a nested loop join, the following index might help:
CREATE INDEX ON locations(id);
I have the following situation:
Data: around 400 million (string1, string2, score) tuples.
Data size: ~20 GB; doesn't fit in memory.
The data is stored in a CSV file and is not sorted by any field.
I need to efficiently retrieve all tuples with a particular string, e.g. all tuples s.t. string1 = 'google'.
How do I design a system such that I can do this efficiently?
I have already tried PostgreSQL with a B-tree index and a GIN index, but they aren't fast enough (more than 20-30 seconds per query).
Ideally, I need a solution that sorts the tuples by string1, stores them in sorted fashion, and then runs a binary search followed by a sequential scan for retrieval. But I don't know which database or system implements such functionality.
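For what it's worth, PostgreSQL itself can approximate that layout: CLUSTER rewrites the table in the physical order of a b-tree index (a one-time operation that locks the table; the order is not maintained for rows added later). A sketch, assuming the primary-key index gets the default name mytable_pkey:

CLUSTER mytable USING mytable_pkey;
ANALYZE mytable;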
UPDATE:
Here are the Postgres details:
I bulk-loaded the data into Postgres using the COPY command. Then I created two indexes on string1, one B-tree and one GIN. However, Postgres is not using either of them.
Create tables:
CREATE TABLE mytable(
    string1 varchar primary key,
    string2 varchar,
    source_id integer REFERENCES sources(id),
    score real
);
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX string1_gin_index ON mytable USING gin (string1 gin_trgm_ops);
CREATE INDEX string1_index ON mytable(lower(string1));
Query plan:
isa=# EXPLAIN ANALYZE VERBOSE select * from mytable where string1 ilike 'google';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.mytable (cost=235.88..41872.81 rows=11340 width=29) (actual time=20234.765..25566.128 rows=30971 loops=1)
Output: string1, string2, source_id, score
Recheck Cond: ((mytable.string1)::text ~~* 'google'::text)
Rows Removed by Index Recheck: 34573
-> Bitmap Index Scan on string1_gin_index (cost=0.00..233.05 rows=11340 width=0) (actual time=20218.263..20218.263 rows=65544 loops=1)
Index Cond: ((mytable.string1)::text ~~* 'google'::text)
Total runtime: 25568.209 ms
(7 rows)
isa=# EXPLAIN ANALYZE VERBOSE select * from mytable where string1 = 'google';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.mytable (cost=0.00..2546373.30 rows=3425 width=29) (actual time=11692.606..139401.099 rows=30511 loops=1)
Output: string1, string2, source_id, score
Filter: ((mytable.string1)::text = 'google'::text)
Rows Removed by Filter: 124417194
Total runtime: 139403.950 ms
(5 rows)
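One detail worth noting: string1_index above was built on the expression lower(string1), so neither of these queries can use it; an expression index only matches queries that repeat the same expression. A sketch of a query that could use it:

EXPLAIN ANALYZE
SELECT * FROM mytable
WHERE lower(string1) = lower('google');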