I migrated my database from MySQL to PostgreSQL with pgloader. Overall it is much more efficient, but one query with a LIKE condition is much slower on PostgreSQL:
MySQL: ~1 ms
PostgreSQL: ~110 ms
Table info:
105 columns
23 indexes
1.6M records
Columns info:
name character varying(30) COLLATE pg_catalog."default" NOT NULL,
ratemax3v3 integer NOT NULL DEFAULT 0,
The query is:
SELECT name, realm, region, class, id
FROM personnages
WHERE blacklisted = 0 AND name LIKE 'Krok%' AND region = 'eu'
ORDER BY ratemax3v3 DESC LIMIT 5;
EXPLAIN ANALYSE (PostgreSQL):
Limit (cost=629.10..629.12 rows=5 width=34) (actual time=111.128..111.130 rows=5 loops=1)
-> Sort (cost=629.10..629.40 rows=117 width=34) (actual time=111.126..111.128 rows=5 loops=1)
Sort Key: ratemax3v3 DESC
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on personnages (cost=9.63..627.16 rows=117 width=34) (actual time=110.619..111.093 rows=75 loops=1)
Recheck Cond: ((name)::text ~~ 'Krok%'::text)
Rows Removed by Index Recheck: 1
Filter: ((blacklisted = 0) AND ((region)::text = 'eu'::text))
Rows Removed by Filter: 13
Heap Blocks: exact=88
-> Bitmap Index Scan on trgm_idx_name (cost=0.00..9.60 rows=158 width=0) (actual time=110.581..110.582 rows=89 loops=1)
Index Cond: ((name)::text ~~ 'Krok%'::text)
Planning Time: 0.268 ms
Execution Time: 111.174 ms
pgloader created indexes on ratemax3v3 and name like this:
CREATE INDEX idx_24683_ratemax3v3
ON wow.personnages USING btree
(ratemax3v3 ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX idx_24683_name
ON wow.personnages USING btree
(name COLLATE pg_catalog."default" ASC NULLS LAST)
TABLESPACE pg_default;
I created a new index on name :
CREATE INDEX trgm_idx_name ON wow.personnages USING GIST (name gist_trgm_ops);
I'm quite a beginner with PostgreSQL at the moment.
Do you see anything I could do?
Don't hesitate to ask me if you need more information!
Antoine
To support a left-anchored LIKE query like that, you need to use a special "operator class":
CREATE INDEX ON wow.personnages(name varchar_pattern_ops);
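Once created, you can verify with EXPLAIN that the left-anchored LIKE actually uses it, e.g.:
EXPLAIN SELECT name FROM wow.personnages WHERE name LIKE 'Krok%';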
But for your given query, an index on multiple columns would probably be more efficient:
CREATE INDEX ON wow.personnages(region, blacklisted, name varchar_pattern_ops);
Or maybe even a filtered (partial) index, if e.g. blacklisted = 0 is a static condition and relatively few rows match that condition:
CREATE INDEX ON wow.personnages(region, name varchar_pattern_ops)
WHERE blacklisted = 0;
If the majority of the rows have blacklisted = 0, that won't really help (and adding the column to the index wouldn't help either). In that case, just an index with (region, name varchar_pattern_ops) is probably more efficient.
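For completeness, that simpler variant would be:
CREATE INDEX ON wow.personnages(region, name varchar_pattern_ops);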
If your pattern is anchored at the beginning, the following index would perform better:
CREATE INDEX ON personnages (name text_pattern_ops);
Besides, GIN indexes usually perform better than GiST indexes in a case like this. Try with a GIN index.
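For example (the index name is arbitrary; pg_trgm is already installed here, since the existing GiST index uses it):
CREATE INDEX trgm_gin_idx_name ON personnages USING gin (name gin_trgm_ops);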
Finally, it is possible that the trigrams k, kr, kro, rok and ok occur very frequently, which would also make the index perform badly.
The table looks something like this:
CREATE TABLE "audit_log" (
"id" int4 NOT NULL DEFAULT nextval('audit_log_id_seq'::regclass),
"entity" varchar(50) COLLATE "public"."ci",
"updated" timestamp(6) NOT NULL,
"transaction_id" uuid,
CONSTRAINT "PK_audit_log" PRIMARY KEY ("id")
);
It contains millions of rows.
I tried adding an index on one column like this:
CREATE INDEX "testing" ON "audit_log" USING btree (
"entity" COLLATE "public"."ci" "pg_catalog"."text_ops" ASC NULLS LAST
);
Then I ran the following query over both the indexed column and the primary key:
EXPLAIN ANALYZE SELECT entity, id FROM audit_log WHERE entity = '1234'
As I expected, the query plan uses both a Bitmap Index Scan (to find the 'entity' column, presumably) and a Bitmap Heap Scan (to retrieve the 'id' column, I assume):
Gather (cost=2640.10..260915.23 rows=87166 width=122) (actual time=2.828..3.764 rows=0 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Bitmap Heap Scan on audit_log (cost=1640.10..251198.63 rows=36319 width=122) (actual time=0.061..0.062 rows=0 loops=3)
Recheck Cond: ((entity)::text = '1234'::text)
-> Bitmap Index Scan on testing (cost=0.00..1618.31 rows=87166 width=0) (actual time=0.036..0.036 rows=0 loops=1)
Index Cond: ((entity)::text = '1234'::text)
Next I added an INCLUDE column to the index in order to make it cover the above query:
DROP INDEX testing;
CREATE INDEX testing ON audit_log USING btree (
    "entity" COLLATE "public"."ci" "pg_catalog"."text_ops" ASC NULLS LAST
) INCLUDE ("id");
Then I reran my query, but it still does the Bitmap Heap Scan:
Gather (cost=2964.10..261239.23 rows=87166 width=122) (actual time=2.711..3.570 rows=0 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Bitmap Heap Scan on audit_log (cost=1964.10..251522.63 rows=36319 width=122) (actual time=0.062..0.062 rows=0 loops=3)
Recheck Cond: ((entity)::text = '1234'::text)
-> Bitmap Index Scan on testing (cost=0.00..1942.31 rows=87166 width=0) (actual time=0.029..0.029 rows=0 loops=1)
Index Cond: ((entity)::text = '1234'::text)
Why is that?
PostgreSQL implements row versioning using a concept called visibility. Each query knows which version of a row it can see.
Now, that visibility information is stored in the table row but not in the index entry, so the table has to be visited just to test whether the row is visible or not.
Because of that, every bitmap index scan needs a bitmap heap scan.
To overcome this unfortunate property, PostgreSQL introduced the visibility map, a data structure that stores, for each 8kB block of the table, whether all rows in that block are visible to everybody. If that is the case, looking up the table row can be skipped. This is only possible for a regular index scan, not a bitmap index scan.
That visibility map is maintained by VACUUM. So run VACUUM on the table; then you may get an index-only scan on the table.
If that alone is not enough, you could try CLUSTER to rewrite the table in index order.
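A minimal sketch of that sequence, assuming the table from the question (a plain VACUUM is enough; it updates the visibility map without rewriting the table):
VACUUM audit_log;
-- With the visibility map up to date, the covering index created above
-- may now yield an Index Only Scan instead of a bitmap scan:
EXPLAIN ANALYZE SELECT entity, id FROM audit_log WHERE entity = '1234';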
Some detailed information on how PostgreSQL estimates the cost of an index scan. The following code is from cost_index in src/backend/optimizer/path/costsize.c:
/*----------
[...]
* If it's an index-only scan, then we will not need to fetch any heap
* pages for which the visibility map shows all tuples are visible.
* Hence, reduce the estimated number of heap fetches accordingly.
* We use the measured fraction of the entire heap that is all-visible,
* which might not be particularly relevant to the subset of the heap
* that this query will fetch; but it's not clear how to do better.
*----------
*/
[...]
if (indexonly)
pages_fetched = ceil(pages_fetched * (1.0 - baserel->allvisfrac));
allvisfrac is calculated using pg_class.relallvisible, which holds an estimate for the number of all-visible pages in the table, and pg_class.relpages.
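You can inspect those numbers yourself; this is roughly how allvisfrac is derived (both values are estimates maintained by VACUUM and ANALYZE):
SELECT relallvisible,
       relpages,
       relallvisible::float8 / GREATEST(relpages, 1) AS approx_allvisfrac
FROM pg_class
WHERE relname = 'audit_log';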
I have two tables:
table_1 with ~1 million lines, with columns id_t1: integer, c1_t1: varchar, etc.
table_2 with ~50 million lines, with columns id_t2: integer, ref_id_t1: integer, c1_t2: varchar, etc.
ref_id_t1 is filled with id_t1 values; however, they are not linked by a foreign key, as table_2 doesn't know about table_1.
I need to run a query on both tables, like the following:
SELECT * FROM table_1 t1 WHERE t1.c1_t1= 'A' AND t1.id_t1 IN
(SELECT t2.ref_id_t1 FROM table_2 t2 WHERE t2.c1_t2 LIKE '%abc%');
Without any changes, or with basic indexes, the query takes about a minute to complete, as a sequential scan is performed on table_2. To prevent this I created a GIN index with the gin_trgm_ops option:
CREATE EXTENSION pg_trgm;
CREATE INDEX c1_t2_gin_index ON table_2 USING gin (c1_t2 gin_trgm_ops);
However, this does not solve the problem, as the inner query still takes a very long time.
EXPLAIN ANALYSE SELECT t2.ref_id_t1 FROM table_2 t2 WHERE t2.c1_t2 LIKE '%1.1%'
Gives the following
Bitmap Heap Scan on table_2 t2 (cost=664.20..189671.00 rows=65058 width=4) (actual time=5101.286..22854.838 rows=69631 loops=1)
Recheck Cond: ((c1_t2 )::text ~~ '%1.1%'::text)
Rows Removed by Index Recheck: 49069703
Heap Blocks: exact=611548
-> Bitmap Index Scan on gin_trg (cost=0.00..647.94 rows=65058 width=0) (actual time=4911.125..4911.125 rows=49139334 loops=1)
Index Cond: ((c1_t2)::text ~~ '%1.1%'::text)
Planning time: 0.529 ms
Execution time: 22863.017 ms
The Bitmap Index Scan is fast, but since we need t2.ref_id_t1, PostgreSQL has to perform a Bitmap Heap Scan, which is not quick on 65,000 rows of data.
The way to avoid the Bitmap Heap Scan would be to perform an Index Only Scan. This is possible with multi-column btree indexes; see https://www.postgresql.org/docs/9.6/static/indexes-index-only-scans.html
If I change the query to search the beginning of c1_t2 and create a btree index on c1_t2 and ref_id_t1, the query takes just over a second, even with the inner query returning 90,000 rows.
CREATE INDEX c1_t2_ref_id_t1_index
ON table_2 USING btree
(c1_t2 varchar_pattern_ops ASC NULLS LAST, ref_id_t1 ASC NULLS LAST);
EXPLAIN ANALYSE SELECT * FROM table_1 t1 WHERE t1.c1_t1= 'A' AND t1.id_t1 IN
(SELECT t2.ref_id_t1 FROM table_2 t2 WHERE t2.c1_t2 LIKE 'aaa%');
Hash Join (cost=56561.99..105233.96 rows=1 width=2522) (actual time=953.647..1068.488 rows=36 loops=1)
Hash Cond: (t1.id_t1 = t2.ref_id_t1)
-> Seq Scan on table_1 t1 (cost=0.00..48669.65 rows=615 width=2522) (actual time=0.088..667.576 rows=790 loops=1)
Filter: (c1_t1 = 'A')
Rows Removed by Filter: 1083798
-> Hash (cost=56553.74..56553.74 rows=660 width=4) (actual time=400.657..400.657 rows=69632 loops=1)
Buckets: 131072 (originally 1024) Batches: 1 (originally 1) Memory Usage: 3472kB
-> HashAggregate (cost=56547.14..56553.74 rows=660 width=4) (actual time=380.280..391.871 rows=69632 loops=1)
Group Key: t2.ref_id_t1
-> Index Only Scan using c1_t2_ref_id_t1_index on table_2 t2 (cost=0.56..53907.28 rows=1055943 width=4) (actual time=0.014..202.034 rows=974737 loops=1)
Index Cond: ((c1_t2 ~>=~ 'aaa'::text) AND (c1_t2 ~<~ 'chb'::text))
Filter: ((c1_t2 )::text ~~ 'aaa%'::text)
Heap Fetches: 0
Planning time: 1.512 ms
Execution time: 1069.712 ms
However, this is not possible with GIN indexes, as these indexes don't store all the data in the key.
Is there a way to use a pg_trgm-like extension with a btree index so we can have an index-only scan for LIKE '%abc%' queries?
I have a table
CREATE TABLE timedevent
(
id bigint NOT NULL,
eventdate timestamp with time zone NOT NULL,
newstateids character varying(255) NOT NULL,
sourceid character varying(255) NOT NULL,
CONSTRAINT timedevent_pkey PRIMARY KEY (id)
) WITH (OIDS=FALSE);
with PK id.
I have to query rows between two dates with a certain newstateids value and a sourceid from a set of possible sources.
I created btree indexes on eventdate and newstateids, and one more (a hash index) on sourceid. Only the index on the date made the queries faster; it seems the other two are not used. Why is that so? How could I make my queries faster?
CREATE INDEX eventdate_index ON timedevent USING btree (eventdate);
CREATE INDEX newstateids_index ON timedevent USING btree (newstateids COLLATE pg_catalog."default");
CREATE INDEX sourceid_index_hash ON timedevent USING hash (sourceid COLLATE pg_catalog."default");
Here is the query as Hibernate generates it:
select this_.id as id1_0_0_, this_.description as descript2_0_0_, this_.eventDate as eventDat3_0_0_, this_.locationId as location4_0_0_, this_.newStateIds as newState5_0_0_, this_.oldStateIds as oldState6_0_0_, this_.sourceId as sourceId7_0_0_
from TimedEvent this_
where ((this_.newStateIds=? and this_.sourceId in (?, ?, ?, ?, ?, ?)))
and this_.eventDate between ? and ?
limit ?
EDIT:
Sorry for the misleading title, but it seems Postgres uses all indexes. The problem is that my query time still remains the same. Here is the query plan I got:
Limit (cost=25130.29..33155.77 rows=321 width=161) (actual time=705.330..706.744 rows=279 loops=1)
Buffers: shared hit=6 read=8167 written=61
-> Bitmap Heap Scan on timedevent this_ (cost=25130.29..33155.77 rows=321 width=161) (actual time=705.330..706.728 rows=279 loops=1)
Recheck Cond: (((sourceid)::text = ANY ('{"root,kus-chemnitz,ize-159,Anwesend Bad","root,kus-chemnitz,ize-159,Alarmruf","root,kus-chemnitz,ize-159,Bett Alarm 1","root,kus-chemnitz,ize-159,Bett Alarm 2","root,kus-chemnitz,ize-159,Anwesend Zimmer" (...)
Filter: ((eventdate >= '2017-11-01 15:41:00+01'::timestamp with time zone) AND (eventdate <= '2018-03-20 14:58:16.724+01'::timestamp with time zone))
Buffers: shared hit=6 read=8167 written=61
-> BitmapAnd (cost=25130.29..25130.29 rows=2122 width=0) (actual time=232.990..232.990 rows=0 loops=1)
Buffers: shared hit=6 read=2152
-> Bitmap Index Scan on sourceid_index_hash (cost=0.00..1403.36 rows=39182 width=0) (actual time=1.195..1.195 rows=9308 loops=1)
Index Cond: ((sourceid)::text = ANY ('{"root,kus-chemnitz,ize-159,Anwesend Bad","root,kus-chemnitz,ize-159,Alarmruf","root,kus-chemnitz,ize-159,Bett Alarm 1","root,kus-chemnitz,ize-159,Bett Alarm 2","root,kus-chemnitz,ize-159,Anwesend Z (...)
Buffers: shared hit=6 read=26
-> Bitmap Index Scan on state_index (cost=0.00..23726.53 rows=777463 width=0) (actual time=231.160..231.160 rows=776520 loops=1)
Index Cond: ((newstateids)::text = 'ACTIV'::text)
Buffers: shared read=2126
Total runtime: 706.804 ms
After creating a btree index on (sourceid, newstateids), as a_horse_with_no_name suggested, the cost dropped:
Limit (cost=125.03..8150.52 rows=321 width=161) (actual time=13.611..14.454 rows=279 loops=1)
Buffers: shared hit=18 read=4336
-> Bitmap Heap Scan on timedevent this_ (cost=125.03..8150.52 rows=321 width=161) (actual time=13.609..14.432 rows=279 loops=1)
Recheck Cond: (((sourceid)::text = ANY ('{"root,kus-chemnitz,ize-159,Anwesend Bad","root,kus-chemnitz,ize-159,Alarmruf","root,kus-chemnitz,ize-159,Bett Alarm 1","root,kus-chemnitz,ize-159,Bett Alarm 2","root,kus-chemnitz,ize-159,Anwesend Zimmer","r (...)
Filter: ((eventdate >= '2017-11-01 15:41:00+01'::timestamp with time zone) AND (eventdate <= '2018-03-20 14:58:16.724+01'::timestamp with time zone))
Buffers: shared hit=18 read=4336
-> Bitmap Index Scan on src_state_index (cost=0.00..124.95 rows=2122 width=0) (actual time=0.864..0.864 rows=4526 loops=1)
Index Cond: (((sourceid)::text = ANY ('{"root,kus-chemnitz,ize-159,Anwesend Bad","root,kus-chemnitz,ize-159,Alarmruf","root,kus-chemnitz,ize-159,Bett Alarm 1","root,kus-chemnitz,ize-159,Bett Alarm 2","root,kus-chemnitz,ize-159,Anwesend Zimmer (...)
Buffers: shared hit=18 read=44
Total runtime: 14.497 ms"
Basically, only one index is used because the database would have to combine your indexes into one for them to be useful (or combine the results of searches over several indexes), and doing that is so expensive that in this case it chooses not to: it uses just the index relevant to one predicate and checks the other predicates directly against the values in the rows it finds.
One B-tree index with several columns would work better, just as a_horse_with_no_name suggests in the comments. Also note that the order of the columns matters a lot: columns used for single-value searches should come first, the one used for a range search later. You want to limit range searches as much as possible.
The database will then go through the index looking for rows that satisfy one predicate using the first column of the index (hopefully narrowing down the number of rows a lot), then the second column and the second predicate come into play, and so on.
Using separate B-tree indexes for predicates combined with the AND operator does not make sense to the database: it would have to use one index to find all rows satisfying one predicate, and then read another index's blocks from disk as well, only to reach rows that satisfy the condition relevant to that second index but possibly not the first. Once the first index has been used, it is probably cheaper to just fetch each row and check the remaining predicates directly, without another index.
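A sketch of the kind of combined index described above, with the equality columns first and the range column last (the index name is made up):
-- newstateids and sourceid are searched by equality, eventdate by range,
-- so the range column goes last:
CREATE INDEX timedevent_state_src_date_idx
    ON timedevent (newstateids, sourceid, eventdate);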
I have the following situation:
Data: around 400 million (string1, string2, score) tuples.
Data size: ~20 GB, doesn't fit in memory.
Data is stored in a CSV file and not sorted by any field.
I need to efficiently retrieve all tuples with a particular string, e.g. all tuples s.t. string1 = 'google'.
How do I design a system such that I can do this efficiently ?
I have already tried PostgreSQL with a B-tree index and a GIN index, but they aren't fast enough (more than 20-30 seconds per query).
Ideally, I need a solution that sorts the tuples by string1, stores them in sorted fashion, and then runs a binary search followed by a sequential scan for retrieval. But I don't know which database or system implements such functionality.
UPDATE:
Here are the Postgres details:
I bulk-loaded the data into Postgres using the COPY command. Then I created two indexes on string1, one btree and one GIN. However, Postgres is not using either of them.
Create tables:
CREATE TABLE mytable (
    string1 varchar PRIMARY KEY,
    string2 varchar,
    source_id integer REFERENCES sources(id),
    score real
);
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX string1_gin_index ON mytable USING gin (string1 gin_trgm_ops);
CREATE INDEX string1_index ON mytable(lower(string1));
Query plan:
isa=# EXPLAIN ANALYZE VERBOSE select * from mytable where string1 ilike 'google';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.mytable (cost=235.88..41872.81 rows=11340 width=29) (actual time=20234.765..25566.128 rows=30971 loops=1)
Output: string1, string2, source_id, score
Recheck Cond: ((mytable.string1)::text ~~* 'google'::text)
Rows Removed by Index Recheck: 34573
-> Bitmap Index Scan on string1_gin_index (cost=0.00..233.05 rows=11340 width=0) (actual time=20218.263..20218.263 rows=65544 loops=1)
Index Cond: ((mytable.string1)::text ~~* 'google'::text)
Total runtime: 25568.209 ms
(7 rows)
isa=# EXPLAIN ANALYZE VERBOSE select * from mytable where string1 = 'google';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.mytable (cost=0.00..2546373.30 rows=3425 width=29) (actual time=11692.606..139401.099 rows=30511 loops=1)
Output: string1, string2, source_id, score
Filter: ((mytable.string1)::text = 'google'::text)
Rows Removed by Filter: 124417194
Total runtime: 139403.950 ms
(5 rows)