PostgreSQL daterange not using index correctly - postgresql

I have a simple table with a user_birthday field of type date (which can be NULL):
CREATE TABLE users
(
  user_id bigserial NOT NULL,
  user_email text NOT NULL,
  user_password text,
  user_first_name text NOT NULL,
  user_middle_name text,
  user_last_name text NOT NULL,
  user_birthday date,
  CONSTRAINT pk_users PRIMARY KEY (user_id)
)
There's a partial index (btree) defined on that field, with the predicate NOT user_birthday IS NULL:
CREATE INDEX ix_users_birthday
ON users
USING btree
(user_birthday)
WHERE NOT user_birthday IS NULL;
Trying to follow up on another idea, I've added the extension btree_gist and created the following index:
CREATE INDEX ix_users_birthday_gist
ON glances.users
USING gist
(user_birthday)
WHERE NOT user_birthday IS NULL;
But it had no effect either; from what I could read, it is not used for range checking.
The PostgreSQL version is 9.3.4.0 (22) Postgres.app, and the issue also exists in 9.3.3.0 (21) Postgres.app.
I've been intrigued by the following queries:
Query #1:
EXPLAIN ANALYZE SELECT *
FROM users
WHERE user_birthday <# daterange('[1978-07-15,1983-03-01)')
Query #2:
EXPLAIN ANALYZE SELECT *
FROM users
WHERE user_birthday BETWEEN '1978-07-15'::date AND '1983-03-01'::date
which, at first glance, should both have the same execution plan, but for some reason, here are the results:
Query #1:
"Seq Scan on users (cost=0.00..52314.25 rows=11101 width=241) (actual
time=0.014..478.983 rows=208886 loops=1)"
" Filter: (user_birthday <# '[1978-07-15,1983-03-01)'::daterange)"
" Rows Removed by Filter: 901214"
"Total runtime: 489.584 ms"
Query #2:
"Bitmap Heap Scan on users (cost=4468.01..46060.53 rows=210301 width=241)
(actual time=57.104..489.785 rows=209019 loops=1)"
" Recheck Cond: ((user_birthday >= '1978-07-15'::date) AND (user_birthday
<= '1983-03-01'::date))"
" Rows Removed by Index Recheck: 611375"
" -> Bitmap Index Scan on ix_users_birthday (cost=0.00..4415.44
rows=210301 width=0) (actual time=54.621..54.621 rows=209019 loops=1)"
" Index Cond: ((user_birthday >= '1978-07-15'::date) AND
(user_birthday <= '1983-03-01'::date))"
"Total runtime: 500.983 ms"
As you can see, the <# daterange condition does not utilize the existing index, while BETWEEN does.
It's important to note that the actual use case for this condition is a more complex query, which doesn't result in the Recheck Cond and Bitmap Heap Scan. In the application's complex query, the difference between the two methods (with 1.2 million records) is massive:
Query #1 at 415ms
Query #2 at 84ms.
Is this a bug with daterange?
Am I doing something wrong, or is daterange <# performing as designed?
There's also a discussion in the pgsql-bugs mailing list

BETWEEN includes both the lower and upper bounds. Your condition
WHERE user_birthday BETWEEN '1978-07-15'::date AND '1983-03-01'::date
matches
WHERE user_birthday <# daterange('[1978-07-15,1983-03-01]')
I see you mention a btree index. For that, use simple comparison operators.
The manual has a detailed page on which index type is good for which operators.
The range type operators <# or #> would work with GiST indexes.
Example:
Perform this hours of operation query in PostgreSQL
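To spell out the comparison-operator rewrite as a sketch: the half-open range [1978-07-15,1983-03-01) translates into an inclusive lower and exclusive upper bound, which the btree index ix_users_birthday can serve:

SELECT *
FROM users
WHERE user_birthday >= '1978-07-15'::date
  AND user_birthday <  '1983-03-01'::date;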

Related

Is this the right way to create a partial index in Postgres?

We have a table with 4 million records, and for a particular, frequently used use case we are only interested in records with a salesforce userType of 'Standard', which are only about 10,000 out of 4 million. The other user types that could exist are 'PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess' and 'CsnOnly'.
So for this use case I thought creating a partial index would be better, as per the documentation.
So I am planning to create this partial index to speed up queries for records with a userType of 'Standard' and prevent the web request from timing out:
CREATE INDEX user_type_idx ON user_table(userType)
WHERE userType NOT IN
('PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess', 'CsnOnly');
The lookup query will be
select * from user_table where userType='Standard';
Could you please confirm if this is the right way to create the partial index? It would be of great help.
Postgres can use that index, but it does so in a way that is (slightly) less efficient than an index specifying WHERE user_type = 'Standard'.
I created a small test table with 4 million rows, 10,000 of them having the user_type 'Standard'. The other values were randomly distributed using the following script:
create table user_table
(
  id serial primary key,
  some_date date not null,
  user_type text not null,
  some_ts timestamp not null,
  some_number integer not null,
  some_data text,
  some_flag boolean
);
insert into user_table (some_date, user_type, some_ts, some_number, some_data, some_flag)
select current_date,
       case (random() * 4 + 1)::int
         when 1 then 'PowerPartner'
         when 2 then 'CSPLitePortal'
         when 3 then 'CustomerSuccess'
         when 4 then 'PowerCustomerSuccess'
         when 5 then 'CsnOnly'
       end,
       clock_timestamp(),
       42,
       rpad(md5(random()::text), (random() * 200 + 1)::int, md5(random()::text)),
       (random() + 1)::int = 1
from generate_series(1, 4e6 - 10000) as t(i)
union all
select current_date,
       'Standard',
       clock_timestamp(),
       42,
       rpad(md5(random()::text), (random() * 200 + 1)::int, md5(random()::text)),
       (random() + 1)::int = 1
from generate_series(1, 10000) as t(i);
(I create tables that have more than just a few columns as the planner's choices are also driven by the size and width of the tables)
The first test using the index with NOT IN:
create index ix_not_in on user_table(user_type)
where user_type not in ('PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess', 'CsnOnly');
explain (analyze true, verbose true, buffers true)
select *
from user_table
where user_type = 'Standard'
Results in:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on stuff.user_table  (cost=139.68..14631.83 rows=11598 width=139) (actual time=1.035..2.171 rows=10000 loops=1)
  Output: id, some_date, user_type, some_ts, some_number, some_data, some_flag
  Recheck Cond: (user_table.user_type = 'Standard'::text)
  Buffers: shared hit=262
  ->  Bitmap Index Scan on ix_not_in  (cost=0.00..136.79 rows=11598 width=0) (actual time=1.007..1.007 rows=10000 loops=1)
        Index Cond: (user_table.user_type = 'Standard'::text)
        Buffers: shared hit=40
Total runtime: 2.506 ms
(The above is a typical execution time after I ran the statement about 10 times to eliminate caching issues)
As you can see, the planner uses a Bitmap Index Scan, which is a "lossy" scan that needs an extra step to filter out false positives.
When using the following index:
create index ix_standard on user_table(id)
where user_type = 'Standard';
This results in the following plan:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_standard on stuff.user_table (cost=0.29..443.16 rows=10267 width=139) (actual time=0.011..1.498 rows=10000 loops=1)
Output: id, some_date, user_type, some_ts, some_number, some_data, some_flag
Buffers: shared hit=313
Total runtime: 1.815 ms
Conclusion:
Your index is used, but an index on only the type that you are interested in is a bit more efficient.
The runtime is not that much different. I executed each explain about 10 times; the average for the ix_standard index was slightly below 2ms and the average for the ix_not_in index slightly above 2ms - so not a real performance difference.
But in general, the Index Scan will scale better with increasing table size than the Bitmap Index Scan will. This is basically due to the "Recheck Condition", especially if not enough work_mem is available to keep the bitmap in memory (for larger tables).
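As a side note (a sketch, with an arbitrary value): work_mem can be inspected and raised per session; an undersized bitmap shows up in the plan as lossy heap blocks and "Rows Removed by Index Recheck".

SHOW work_mem;
SET work_mem = '64MB';  -- session-level; the value here is only an example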
For the index to be used, the WHERE condition must be used in the query as you wrote it.
PostgreSQL has some ability to make deductions, but it won't be able to infer that userType = 'Standard' is equivalent to the condition in the index.
Use EXPLAIN to find out if your index can be used.
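For instance, a minimal check against the test table above (assuming the ix_standard index exists):

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM user_table
WHERE user_type = 'Standard';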

Postgres similarity function not appropriately using trigram index

I have a simple person table with a last_name column, to which I've added a GiST index with
CREATE INDEX last_name_idx ON person USING gist (last_name gist_trgm_ops);
According to the docs at https://www.postgresql.org/docs/10/pgtrgm.html, the <-> operator should utilize this index. However, when I actually try to use this difference operator using this query:
explain verbose select * from person where last_name <-> 'foobar' < 0.5;
I get this back:
Seq Scan on public.person (cost=0.00..290.82 rows=4485 width=233)
Output: person_id, first_name, last_name
Filter: ((person.last_name <-> 'foobar'::text) < '0.5'::double precision)
And it doesn't look like the index is being used. However, if I use the % operator with this command:
explain verbose select * from person where last_name % 'foobar';
It seems to use the index:
Bitmap Heap Scan on public.person  (cost=4.25..41.51 rows=13 width=233)
  Output: person_id, first_name, last_name
  Recheck Cond: (person.last_name % 'foobar'::text)
  ->  Bitmap Index Scan on last_name_idx  (cost=0.00..4.25 rows=13 width=0)
        Index Cond: (person.last_name % 'foobar'::text)
I also noticed that if I move the operator to the select portion of the query, the index gets ignored again:
explain verbose select last_name % 'foobar' from person;
Seq Scan on public.person (cost=0.00..257.19 rows=13455 width=1)
Output: (last_name % 'foobar'::text)
Am I missing something obvious about how the similarity function uses the trigram index?
I am using Postgres 10.5 on OSX.
EDIT 1
As per Laurenz's suggestion, I tried setting enable_seqscan = off but unfortunately, the query with the <-> operator still seems to ignore the index.
show enable_seqscan;
enable_seqscan
----------------
off
explain verbose select * from person where last_name <-> 'foobar' < 0.5;
-----------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.person (cost=10000000000.00..10000000290.83 rows=4485 width=233)
Output: person_id, first_name, last_name
Filter: ((person.last_name <-> 'foobar'::text) < '0.5'::double precision)
This behavior is normal for all kinds of indexes.
The first query is not in a form that can use the index. For that, a condition would have to be of the form
<indexed expression> <operator supported by the index> <quasi-constant>
where the last expression remains constant for the duration of the index scan and the operator returns a boolean value. Your expression last_name <-> 'foobar' < 0.5 is not of that form.
The <-> operator has to be used in an ORDER BY clause to be able to use the index.
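For example, a nearest-neighbor query of this shape can use the GiST trigram index (a sketch; the LIMIT value is arbitrary):

SELECT *
FROM person
ORDER BY last_name <-> 'foobar'
LIMIT 10;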
The third query doesn't use the index because the query affects all rows of the table. An index does not speed up the evaluation of an expression, it is only useful to quickly identify a subset of the table (or to get rows in a certain sort order).

Slow query in postgresql geonames db despite indexes

I imported all tables from http://www.geonames.org/ into my local postgresql 9.5.3.0 database and peppered it with indexes like so:
create extension pg_trgm;
CREATE INDEX name_trgm_idx ON geoname USING GIN (name gin_trgm_ops);
CREATE INDEX fcode_trgm_idx ON geoname USING GIN (fcode gin_trgm_ops);
CREATE INDEX fclass_trgm_idx ON geoname USING GIN (fclass gin_trgm_ops);
CREATE INDEX alternatename_trgm_idx ON alternatename USING GIN (alternatename gin_trgm_ops);
CREATE INDEX isolanguage_trgm_idx ON alternatename USING GIN (isolanguage gin_trgm_ops);
CREATE INDEX alt_geoname_id_idx ON alternatename (geonameid);
And now I would like to query the country names in different languages and cross-reference the geonames attributes with these alternative names, like so:
select g.geonameid as geonameid ,a.alternatename as name,g.country as country, g.fcode as fcode
from geoname g,alternatename a
where
a.isolanguage=LOWER('de')
and a.alternatename ilike '%Sa%'
and (a.ishistoric = FALSE OR a.ishistoric IS NULL)
and (a.isshortname = TRUE OR a.isshortname IS NULL)
and a.geonameid = g.geonameid
and g.fclass='A'
and g.fcode ='PCLI';
Unfortunately, though, this query takes as long as 13 to 15 seconds on an octacore machine with a fast SSD. EXPLAIN (ANALYZE, VERBOSE) shows this:
Nested Loop  (cost=0.43..237138.04 rows=1 width=25) (actual time=1408.443..10878.115 rows=15 loops=1)
  Output: g.geonameid, a.alternatename, g.country, g.fcode
  ->  Seq Scan on public.alternatename a  (cost=0.00..233077.17 rows=481 width=18) (actual time=0.750..10862.089 rows=2179 loops=1)
        Output: a.alternatenameid, a.geonameid, a.isolanguage, a.alternatename, a.ispreferredname, a.isshortname, a.iscolloquial, a.ishistoric
        Filter: (((a.alternatename)::text ~~* '%Sa%'::text) AND ((a.isolanguage)::text = 'de'::text))
        Rows Removed by Filter: 10675099
  ->  Index Scan using pk_geonameid on public.geoname g  (cost=0.43..8.43 rows=1 width=11) (actual time=0.006..0.006 rows=0 loops=2179)
        Output: g.geonameid, g.name, g.asciiname, g.alternatenames, g.latitude, g.longitude, g.fclass, g.fcode, g.country, g.cc2, g.admin1, g.admin2, g.admin3, g.admin4, g.population, g.elevation, g.gtopo30, g.timezone, g.moddate
        Index Cond: (g.geonameid = a.geonameid)
        Filter: ((g.fclass = 'A'::bpchar) AND ((g.fcode)::text = 'PCLI'::text))
        Rows Removed by Filter: 1
This seems to indicate that a sequential scan is performed, estimated at 481 rows (which I deem to be fairly low), but it nevertheless takes very long. I currently can't make sense of this. Any ideas?
Trigrams only work if you have a minimum of 3 characters to search for: %Sa% won't work, but %foo% will. Your indexes are still not good enough, however. Depending on which parameters are dynamic, use multicolumn or filtered indexes:
CREATE INDEX jkb1 ON geoname(fclass, fcode, geonameid, country);
CREATE INDEX jkb2 ON geoname(geonameid, country) WHERE fclass = 'A' AND fcode = 'PCLI';
Same for the other table:
CREATE INDEX jkb3 ON alternatename(geonameid, alternatename)
WHERE (ishistoric = FALSE OR ishistoric IS NULL)
  AND (isshortname = TRUE OR isshortname IS NULL)
  AND isolanguage = LOWER('de');
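As a hedged illustration of the length constraint, the same query shape with a pattern of three or more characters can use the GIN trigram index ('%San%' is a hypothetical stand-in):

SELECT a.geonameid, a.alternatename
FROM alternatename a
WHERE a.isolanguage = 'de'
  AND a.alternatename ILIKE '%San%';  -- three consecutive characters yield trigrams to probe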

Limits of Postgres Query Optimization (Already using Index-Only Scans)

I have a Postgres query that has already been optimized, but we're hitting 100% CPU usage under peak load, so I wanted to see if there's more that can be done in optimizing the database interactions. It already uses two index-only scans in the join, so my question is whether there's much more to be done on the Postgres side of things.
The database is an Amazon-hosted Postgres RDS db.m3.2xlarge instance (8 vCPUs and 30 GB of memory) running 9.4.1, and the results below are from a period with low CPU usage and minimal connections (around 15). Peak usage is around 300 simultaneous connections, and that's when we're maxing our CPU (which kills performance on everything).
Here's the query and the EXPLAIN:
Query:
EXPLAIN (ANALYZE, BUFFERS)
SELECT m.valdate, p.index_name, m.market_data_closing, m.available_date
FROM md.market_data_closing m
JOIN md.primitive p on (m.primitive_id = p.index_id)
where p.index_name = ?
order by valdate desc
;
Output:
Sort  (cost=183.80..186.22 rows=967 width=44) (actual time=44.590..54.788 rows=11133 loops=1)
  Sort Key: m.valdate
  Sort Method: quicksort  Memory: 1254kB
  Buffers: shared hit=181
  ->  Nested Loop  (cost=0.85..135.85 rows=967 width=44) (actual time=0.041..32.853 rows=11133 loops=1)
        Buffers: shared hit=181
        ->  Index Only Scan using primitive_index_name_index_id_idx on primitive p  (cost=0.29..4.30 rows=1 width=25) (actual time=0.018..0.019 rows=1 loops=1)
              Index Cond: (index_name = '?'::text)
              Heap Fetches: 0
              Buffers: shared hit=3
        ->  Index Only Scan using market_data_closing_primitive_id_valdate_available_date_mar_idx on market_data_closing m  (cost=0.56..109.22 rows=2233 width=27) (actual time=0.016..12.059 rows=11133 loops=1)
              Index Cond: (primitive_id = p.index_id)
              Heap Fetches: 42
              Buffers: shared hit=178
Planning time: 0.261 ms
Execution time: 64.957 ms
Here are the table sizes:
md.primitive: 14283 rows
md.market_data_closing: 13544087 rows
For reference, here is the underlying spec for the tables and indices:
CREATE TABLE md.primitive(
index_id serial NOT NULL,
index_name text NOT NULL UNIQUE,
index_description text not NULL,
index_source_code text NOT NULL DEFAULT 'MAN',
index_source_spec json NOT NULL DEFAULT '{}',
frequency text NULL,
primitive_type text NULL,
is_maintained boolean NOT NULL default true,
create_dt timestamp NOT NULL,
create_user text NOT NULL,
update_dt timestamp not NULL,
update_user text not NULL,
PRIMARY KEY
(
index_id
)
) ;
CREATE INDEX ON md.primitive
(
index_name ASC,
index_id ASC
);
CREATE TABLE md.market_data_closing(
valdate timestamp NOT NULL,
primitive_id int references md.primitive,
market_data_closing decimal(28, 10) not NULL,
available_date timestamp NULL,
pricing_source text not NULL,
create_dt timestamp NOT NULL,
create_user text NOT NULL,
update_dt timestamp not NULL,
update_user text not NULL,
PRIMARY KEY
(
valdate,
primitive_id
)
) ;
CREATE INDEX ON md.market_data_closing
(
primitive_id ASC,
valdate DESC,
available_date DESC,
market_data_closing ASC
);
What else can be done?
It seems the nested loop is taking an absurd amount of time even though the primitive table returns only one row. You can try eliminating the nested loop by doing something like this:
SELECT m.valdate, m.market_data_closing, m.available_date
FROM md.market_data_closing m
WHERE m.primitive_id = (SELECT p.index_id
                        FROM md.primitive p
                        WHERE p.index_name = ?
                        OFFSET 0) -- the OFFSET 0 is probably not needed; try it
ORDER BY valdate DESC;
This does not return p.index_name, but that can easily be fixed by selecting it as a constant.
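A sketch of that fix (both ? placeholders bind the same index_name value):

SELECT m.valdate, ? AS index_name, m.market_data_closing, m.available_date
FROM md.market_data_closing m
WHERE m.primitive_id = (SELECT p.index_id
                        FROM md.primitive p
                        WHERE p.index_name = ?)
ORDER BY m.valdate DESC;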
For future generations reading this: the problem seems to be with the index
md.market_data_closing(
...
PRIMARY KEY
(
valdate,
primitive_id
)
This seems to be an incorrect index. Should be:
md.market_data_closing(
...
PRIMARY KEY
(
primitive_id,
valdate
)
An explanation of why. This kind of query:
...
JOIN md.primitive p on (m.primitive_id = p.index_id)
...
will only be effective if primitive_id is the first field. Also,
order by valdate
will be more effective if valdate is second.
Why?
Because an index is a B-tree structure.
(
valdate,
primitive_id
)
results in
valdate1
  primitive_id1
  primitive_id2
  primitive_id3
valdate2
  primitive_id1
  primitive_id2
Using this tree you can't search by primitive_id effectively.
But
(
primitive_id,
valdate
)
results in
primitive_id1
  valdate1
  valdate2
  valdate3
primitive_id2
  valdate1
  valdate2
which is effective for looking up by primitive_id.
There is another solution to this problem: if you don't want to change the index, you can add a strict equality condition on valdate, say valdate = some_date. This will make your index effective.
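A hedged sketch of that workaround against the original query (some_date stands in for a real value):

SELECT m.valdate, p.index_name, m.market_data_closing, m.available_date
FROM md.market_data_closing m
JOIN md.primitive p ON (m.primitive_id = p.index_id)
WHERE p.index_name = ?
  AND m.valdate = ? -- the strict equality ('some_date') that lets the (valdate, primitive_id) key be used
ORDER BY m.valdate DESC;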

Postgres: Sorting by an immutable function index doesn't use index

I have a simple table.
CREATE TABLE posts
(
  id uuid NOT NULL,
  date_created timestamp,
  vote_up_count integer,
  vote_down_count integer,
  CONSTRAINT post_pkey PRIMARY KEY(id)
);
I have an IMMUTABLE function that does simple (but could be complex) arithmetic.
CREATE OR REPLACE FUNCTION score(
ups integer,
downs integer)
RETURNS integer AS
$BODY$
select $1 - $2
$BODY$
LANGUAGE sql IMMUTABLE
COST 100;
ALTER FUNCTION score(integer, integer)
OWNER TO postgres;
I create an index on the posts table that uses my function.
CREATE INDEX posts_score_index ON posts(score(vote_up_count, vote_down_count), date_created);
When I EXPLAIN the following query, it doesn't seem to be using the index.
SELECT * FROM posts ORDER BY score(vote_up_count, vote_down_count), date_created
Sort (cost=1.02..1.03 rows=1 width=310)
Output: id, date_created, last_edit_date, slug, sub_id, user_id, user_ip, type, title, content, url, domain, send_replies, vote_up_count, vote_down_count, verdict, approved_by, removed_by, verdict_message, number_of_reports, ignore_reports, number_of_com (...)"
Sort Key: ((posts.vote_up_count - posts.vote_down_count)), posts.date_created
-> Seq Scan on public.posts (cost=0.00..1.01 rows=1 width=310)
Output: id, date_created, last_edit_date, slug, sub_id, user_id, user_ip, type, title, content, url, domain, send_replies, vote_up_count, vote_down_count, verdict, approved_by, removed_by, verdict_message, number_of_reports, ignore_reports, number_ (...)
How do I get my ORDER BY to use an index from an IMMUTABLE function that could have some very complex arithmetic?
EDIT: Based on Егор-Рогов's suggestions, I changed the query a bit to see if I could get it to use an index. Still no luck.
set enable_seqscan=off;
EXPLAIN VERBOSE select date_created from posts ORDER BY (hot(vote_up_count, vote_down_count, date_created),date_created);
Here is the output.
Sort (cost=10000000001.06..10000000001.06 rows=1 width=16)
Output: date_created, (ROW(round((((log((GREATEST(abs((vote_up_count - vote_down_count)), 1))::double precision) * sign(((vote_up_count - vote_down_count))::double precision)) + ((date_part('epoch'::text, date_created) - 1134028003::double precision) / 4 (...)
Sort Key: (ROW(round((((log((GREATEST(abs((posts.vote_up_count - posts.vote_down_count)), 1))::double precision) * sign(((posts.vote_up_count - posts.vote_down_count))::double precision)) + ((date_part('epoch'::text, posts.date_created) - 1134028003::dou (...)
-> Seq Scan on public.posts (cost=10000000000.00..10000000001.05 rows=1 width=16)
Output: date_created, ROW(round((((log((GREATEST(abs((vote_up_count - vote_down_count)), 1))::double precision) * sign(((vote_up_count - vote_down_count))::double precision)) + ((date_part('epoch'::text, date_created) - 1134028003::double precision (...)
EDIT2: It seems that I was not using the index because of a second order by with date_created.
I can see a couple of points that discourage the planner from using the index.
1.
Look at this line in the explain output:
Seq Scan on public.posts (cost=0.00..1.01 rows=1 width=310)
It says that the planner believes there is only one row in the table, in which case it makes no sense to use an index scan; a sequential scan is faster.
Try adding more rows to the table, run analyze, and try again. You can also test it by temporarily disabling sequential scans with set enable_seqscan=off;.
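A minimal sketch of that test (the row count is arbitrary; gen_random_uuid() needs the pgcrypto extension before PostgreSQL 13):

INSERT INTO posts (id, date_created, vote_up_count, vote_down_count)
SELECT gen_random_uuid(),
       clock_timestamp(),
       (random() * 100)::int,
       (random() * 100)::int
FROM generate_series(1, 100000);
ANALYZE posts;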
2.
You use the function to sort the results, so the planner may decide to use the index in order to get tuple ids in the correct order. But then it needs to fetch each tuple from the table to get the values of all columns (because of select *).
You can make the index more attractive to the planner by adding all necessary columns to it, which makes it possible to avoid the table scan. This is called an index-only scan.
CREATE INDEX posts_score_index ON posts(
score(vote_up_count, vote_down_count),
date_created,
id, -- do you actually need it in result set?
vote_up_count, -- do you actually need it in result set?
vote_down_count -- do you actually need it in result set?
);
And make sure you run vacuum after inserting/updating/deleting rows to update the visibility map.
The downside is the increased index size, of course.
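Afterwards, a quick check (selecting only the indexed columns rather than *) that the plan now reports an Index Only Scan:

EXPLAIN SELECT id, vote_up_count, vote_down_count, date_created
FROM posts
ORDER BY score(vote_up_count, vote_down_count), date_created;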