I have a Postgres query that has already been optimized, but we're hitting 100% CPU usage under peak load, so I wanted to see if there's more that can yet be done in optimizing the database interactions. It already is using two index-only scans in the join, so my question is if there's much more to be done on the Postgres side of things.
The database is an Amazon-hosted Postgres RDS db.m3.2xlarge instance (8 vCPUs and 30 GB of memory) running 9.4.1, and the results below are from a period with low CPU usage and minimal connections (around 15). Peak usage is around 300 simultaneous connections, and that's when we're maxing our CPU (which kills performance on everything).
Here's the query and the EXPLAIN:
Query:
EXPLAIN (ANALYZE, BUFFERS)
SELECT m.valdate, p.index_name, m.market_data_closing, m.available_date
FROM md.market_data_closing m
JOIN md.primitive p on (m.primitive_id = p.index_id)
where p.index_name = ?
order by valdate desc
;
Output:
Sort (cost=183.80..186.22 rows=967 width=44) (actual time=44.590..54.788 rows=11133 loops=1)
Sort Key: m.valdate
Sort Method: quicksort Memory: 1254kB
Buffers: shared hit=181
-> Nested Loop (cost=0.85..135.85 rows=967 width=44) (actual time=0.041..32.853 rows=11133 loops=1)
Buffers: shared hit=181
-> Index Only Scan using primitive_index_name_index_id_idx on primitive p (cost=0.29..4.30 rows=1 width=25) (actual time=0.018..0.019 rows=1 loops=1)
Index Cond: (index_name = '?'::text)
Heap Fetches: 0
Buffers: shared hit=3
-> Index Only Scan using market_data_closing_primitive_id_valdate_available_date_mar_idx on market_data_closing m (cost=0.56..109.22 rows=2233 width=27) (actual time=0.016..12.059 rows=11133 loops=1)
Index Cond: (primitive_id = p.index_id)
Heap Fetches: 42
Buffers: shared hit=178
Planning time: 0.261 ms
Execution time: 64.957 ms
Here are the table sizes:
md.primitive: 14283 rows
md.market_data_closing: 13544087 rows
For reference, here is the underlying spec for the tables and indices:
CREATE TABLE md.primitive(
index_id serial NOT NULL,
index_name text NOT NULL UNIQUE,
index_description text not NULL,
index_source_code text NOT NULL DEFAULT 'MAN',
index_source_spec json NOT NULL DEFAULT '{}',
frequency text NULL,
primitive_type text NULL,
is_maintained boolean NOT NULL default true,
create_dt timestamp NOT NULL,
create_user text NOT NULL,
update_dt timestamp not NULL,
update_user text not NULL,
PRIMARY KEY
(
index_id
)
) ;
CREATE INDEX ON md.primitive
(
index_name ASC,
index_id ASC
);
CREATE TABLE md.market_data_closing(
valdate timestamp NOT NULL,
primitive_id int references md.primitive,
market_data_closing decimal(28, 10) not NULL,
available_date timestamp NULL,
pricing_source text not NULL,
create_dt timestamp NOT NULL,
create_user text NOT NULL,
update_dt timestamp not NULL,
update_user text not NULL,
PRIMARY KEY
(
valdate,
primitive_id
)
) ;
CREATE INDEX ON md.market_data_closing
(
primitive_id ASC,
valdate DESC,
available_date DESC,
market_data_closing ASC
);
What else can be done?
The nested loop seems to take a disproportionate amount of time even though the primitive table returns only one row. You can try eliminating the nested loop with something like this:
SELECT m.valdate, m.market_data_closing, m.available_date
FROM md.market_data_closing m
WHERE m.primitive_id = (SELECT p.index_id
FROM md.primitive p
WHERE p.index_name = ?
OFFSET 0) -- OFFSET 0 is probably not needed, try it
ORDER BY valdate DESC;
This does not return p.index_name, but that is easily fixed by selecting it as a constant.
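A minimal sketch of that constant-column variant, assuming the driver lets the same value be bound to both ? placeholders (the cast to text is only there so the parameter has a known type):

```sql
-- Bind the same index name to both placeholders.
SELECT m.valdate,
       ?::text AS index_name,          -- the constant standing in for p.index_name
       m.market_data_closing,
       m.available_date
FROM md.market_data_closing m
WHERE m.primitive_id = (SELECT p.index_id
                        FROM md.primitive p
                        WHERE p.index_name = ?)
ORDER BY m.valdate DESC;
```

Run it under EXPLAIN (ANALYZE, BUFFERS) to compare against the original plan before adopting it.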
For future readers: the problem seems to be with the primary key index
md.market_data_closing(
...
PRIMARY KEY
(
valdate,
primitive_id
)
This seems to be an incorrect index. Should be:
md.market_data_closing(
...
PRIMARY KEY
(
primitive_id,
valdate
)
Explanation why. This kind of query:
...
JOIN md.primitive p on (m.primitive_id = p.index_id)
...
will only be efficient if primitive_id is the first column of the index.
Also,
order by valdate
will be more efficient if valdate is the second column.
Why?
Because the index is a B-tree structure.
(
valdate,
primitive_id
)
results in
valdate1
primitive_id1
primitive_id2
primitive_id3
valdate2
primitive_id1
primitive_id2
Using this tree, you can't search by primitive_id efficiently, because the entries for one primitive_id are scattered across every valdate.
But
(
primitive_id,
valdate
)
results in
primitive_id1
valdate1
valdate2
valdate3
primitive_id2
valdate1
valdate2
which is efficient for lookups by primitive_id.
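As a rough sketch, the column order of the primary key could be swapped as follows. The constraint name market_data_closing_pkey is an assumption (it is the name PostgreSQL generates by default; check with \d md.market_data_closing), and rebuilding a primary key on a 13-million-row table locks the table while the index is built, so this would need a maintenance window.

```sql
BEGIN;

-- Assumed default constraint name; verify before running.
ALTER TABLE md.market_data_closing
    DROP CONSTRAINT market_data_closing_pkey;

-- Same two columns, but primitive_id now leads the B-tree.
ALTER TABLE md.market_data_closing
    ADD PRIMARY KEY (primitive_id, valdate);

COMMIT;
```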
There is another solution to this problem:
If you don't want to change the index, you can add a strict equality condition on valdate.
Say 'valdate = some_date'; this lets the existing (valdate, primitive_id) index be used efficiently.
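For illustration only, a lookup with an equality condition on the leading column might look like this; the date and the id are made-up values, not taken from the question.

```sql
SELECT m.valdate, m.market_data_closing, m.available_date
FROM md.market_data_closing m
WHERE m.valdate = TIMESTAMP '2015-01-30 00:00:00'  -- hypothetical date
  AND m.primitive_id = 42;                         -- hypothetical id
```

This only helps when the application actually wants a single date; it does not replace the full history query from the question.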
I need to increase performance of the following query, which filters on column status_classification and aggregates on classification -> 'flags' (a jsonb field in the form: '{"flags": ["NO_CLASS_FOUND"]}'::jsonb):
SELECT SUM(CASE WHEN ("result_materials"."classification" -> 'flags') @> '["NO_CLASS_FOUND"]' THEN 1 ELSE 0 END) AS "no_class_found",
SUM(CASE WHEN ("result_materials"."classification" -> 'flags') @> '["RULE"]' THEN 1 ELSE 0 END) AS "rule",
SUM(CASE WHEN ("result_materials"."classification" -> 'flags') @> '["NO_MAPPING"]' THEN 1 ELSE 0 END) AS "no_mapping"
FROM "result_materials"
WHERE "result_materials"."status_classification" = 'PROCESSED';
To improve performance I created an index on status_classification, but the query plan shows that the index was never hit, and a Seq Scan was performed:
Aggregate (cost=1010.15..1010.16 rows=1 width=24) (actual time=19.942..19.946 rows=1 loops=1)
-> Seq Scan on result_materials (cost=0.00..869.95 rows=6231 width=202) (actual time=0.024..4.660 rows=6231 loops=1)
Filter: ((status_classification)::text = 'PROCESSED'::text)
Rows Removed by Filter: 5
Planning Time: 1.212 ms
Execution Time: 20.187 ms
I've tried (all sql at the end of question):
adding an index to status_classification
adding a GIN index to classification -> 'flags'
adding a multi field GIN index, with classification -> 'flags' and status_classification (see here)
The index is still not hit, and performance suffers as the table grows. Cardinality is low in the status_classification field, but the entries in classification -> 'flags' are quite rare, so I would have thought an index very practical here.
Why is the index not used? What am I doing wrong?
SQL to recreate my db:
create table result_materials (
uuid int,
status_classification varchar(30),
classification jsonb
);
insert into result_materials(uuid, classification, status_classification)
select seq
, case(random() *2)::int
when 0 then '{"flags": ["NO_CLASS_FOUND"]}'::jsonb
when 1 then '{"flags": ["RULE"]}'::jsonb
when 2 then '{"flags": ["NO_MAPPING"]}'::jsonb end
as dummy
, case(random() *2)::int
when 0 then 'NOT_PROCESSABLE'
when 1 then 'PROCESSABLE' end
as sta
from generate_series(1, 150000) seq;
Indexes attempted:
-- status_classification
create index other_testes on result_materials (status_classification);
-- classification -> 'flags'
CREATE INDEX idx_testes ON result_materials USING gin ((classification -> 'flags'));
-- multi field gin
-- REQUIRES you to run: CREATE EXTENSION btree_gin;
CREATE INDEX idx_testes ON result_materials USING gin ((classification -> 'flags'), status_classification);
The query takes 20 ms and the filter removes only 5 rows out of about 6k, so yes, a sequential scan is a good choice here.
Try adding more rows to the table, and check the cardinality of the column in your WHERE clause.
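One way to check that cardinality is to count the rows per status value. This is a small sketch against the result_materials table from the question; the pct column is only there for convenience.

```sql
SELECT status_classification,
       count(*) AS n,
       -- share of the whole table held by each status value
       round(100.0 * count(*) / sum(count(*)) OVER (), 2) AS pct
FROM result_materials
GROUP BY status_classification
ORDER BY n DESC;
```

If the value being filtered on covers most of the table, the planner will generally prefer a sequential scan regardless of which indexes exist.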
We have a table with 4 million records, and for one frequently used use case we are only interested in records with the Salesforce userType 'Standard', which account for only about 10,000 of the 4 million. The other user types that could exist are 'PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess' and 'CsnOnly'.
So for this use case I thought creating a partial index would be better, as per the documentation.
So I am planning to create this partial index to speed up queries for records with a usertype of 'Standard' and prevent the request from the web from getting timed out:
CREATE INDEX user_type_idx ON user_table(userType)
WHERE userType NOT IN
('PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess', 'CsnOnly');
The lookup query will be
select * from user_table where userType='Standard';
Could you please confirm whether this is the right way to create the partial index? It would be of great help.
Postgres can use that but it does so in a way that is (slightly) less efficient than an index specifying where user_type = 'Standard'.
I created a small test table with 4 million rows, 10,000 of them having the user_type 'Standard'. The other values were randomly distributed using the following script:
create table user_table
(
id serial primary key,
some_date date not null,
user_type text not null,
some_ts timestamp not null,
some_number integer not null,
some_data text,
some_flag boolean
);
insert into user_table (some_date, user_type, some_ts, some_number, some_data, some_flag)
select current_date,
case (random() * 4 + 1)::int
when 1 then 'PowerPartner'
when 2 then 'CSPLitePortal'
when 3 then 'CustomerSuccess'
when 4 then 'PowerCustomerSuccess'
when 5 then 'CsnOnly'
end,
clock_timestamp(),
42,
rpad(md5(random()::text), (random() * 200 + 1)::int, md5(random()::text)),
(random() + 1)::int = 1
from generate_series(1,4e6 - 10000) as t(i)
union all
select current_date,
'Standard',
clock_timestamp(),
42,
rpad(md5(random()::text), (random() * 200 + 1)::int, md5(random()::text)),
(random() + 1)::int = 1
from generate_series(1,10000) as t(i);
(I create tables that have more than just a few columns as the planner's choices are also driven by the size and width of the tables)
The first test using the index with NOT IN:
create index ix_not_in on user_table(user_type)
where user_type not in ('PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess', 'CsnOnly');
explain (analyze true, verbose true, buffers true)
select *
from user_table
where user_type = 'Standard'
Results in:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on stuff.user_table (cost=139.68..14631.83 rows=11598 width=139) (actual time=1.035..2.171 rows=10000 loops=1)
Output: id, some_date, user_type, some_ts, some_number, some_data, some_flag
Recheck Cond: (user_table.user_type = 'Standard'::text)
Buffers: shared hit=262
-> Bitmap Index Scan on ix_not_in (cost=0.00..136.79 rows=11598 width=0) (actual time=1.007..1.007 rows=10000 loops=1)
Index Cond: (user_table.user_type = 'Standard'::text)
Buffers: shared hit=40
Total runtime: 2.506 ms
(The above is a typical execution time after I ran the statement about 10 times to eliminate caching issues)
As you can see the planner uses a Bitmap Index Scan which is a "lossy" scan that needs an extra step to filter out false positives.
When using the following index:
create index ix_standard on user_table(id)
where user_type = 'Standard';
This results in the following plan:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_standard on stuff.user_table (cost=0.29..443.16 rows=10267 width=139) (actual time=0.011..1.498 rows=10000 loops=1)
Output: id, some_date, user_type, some_ts, some_number, some_data, some_flag
Buffers: shared hit=313
Total runtime: 1.815 ms
Conclusion:
Your index is used but an index on only the type that you are interested in is a bit more efficient.
The runtime is not that much different. I executed each explain about 10 times, and the average for the ix_standard index was slightly below 2ms and the average of the ix_not_in index was slightly above 2ms - so not a real performance difference.
But in general the Index Scan will scale better with increasing table sizes than the Bitmap Index Scan will do. This is basically due to the "Recheck Condition" - especially if not enough work_mem is available to keep the bitmap in memory (for larger tables).
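To see whether the bitmap actually turned lossy, one option is to rerun the test query with a different work_mem. This sketch assumes PostgreSQL 9.4 or later, where the Bitmap Heap Scan node reports a "Heap Blocks: exact=... lossy=..." line; the 64MB value is purely illustrative.

```sql
SET work_mem = '64MB';   -- session-level setting, illustrative value

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM user_table
WHERE user_type = 'Standard';

RESET work_mem;          -- back to the configured default
```

A non-zero lossy count means the bitmap did not fit in work_mem and whole pages had to be rechecked.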
For the index to be used, the WHERE condition must be used in the query as you wrote it.
PostgreSQL has some ability to make deductions, but it won't be able to infer that userType = 'Standard' is equivalent to the condition in the index.
Use EXPLAIN to find out if your index can be used.
I imported all tables from http://www.geonames.org/ into my local postgresql 9.5.3.0 database and peppered it with indexes like so:
create extension pg_trgm;
CREATE INDEX name_trgm_idx ON geoname USING GIN (name gin_trgm_ops);
CREATE INDEX fcode_trgm_idx ON geoname USING GIN (fcode gin_trgm_ops);
CREATE INDEX fclass_trgm_idx ON geoname USING GIN (fclass gin_trgm_ops);
CREATE INDEX alternatename_trgm_idx ON alternatename USING GIN (alternatename gin_trgm_ops);
CREATE INDEX isolanguage_trgm_idx ON alternatename USING GIN (isolanguage gin_trgm_ops);
CREATE INDEX alt_geoname_id_idx ON alternatename (geonameid);
And now I would like to query the country names in different languages and cross reference the geonames attributes with these alternative names like so:
select g.geonameid as geonameid ,a.alternatename as name,g.country as country, g.fcode as fcode
from geoname g,alternatename a
where
a.isolanguage=LOWER('de')
and a.alternatename ilike '%Sa%'
and (a.ishistoric = FALSE OR a.ishistoric IS NULL)
and (a.isshortname = TRUE OR a.isshortname IS NULL)
and a.geonameid = g.geonameid
and g.fclass='A'
and g.fcode ='PCLI';
Unfortunately though this query takes as long as 13 to 15 seconds on an octacore machine with a fast SSD. 'Explain analyze verbose' shows this:
Nested Loop (cost=0.43..237138.04 rows=1 width=25) (actual time=1408.443..10878.115 rows=15 loops=1)
Output: g.geonameid, a.alternatename, g.country, g.fcode
-> Seq Scan on public.alternatename a (cost=0.00..233077.17 rows=481 width=18) (actual time=0.750..10862.089 rows=2179 loops=1)
Output: a.alternatenameid, a.geonameid, a.isolanguage, a.alternatename, a.ispreferredname, a.isshortname, a.iscolloquial, a.ishistoric
Filter: (((a.alternatename)::text ~~* '%Sa%'::text) AND ((a.isolanguage)::text = 'de'::text))
Rows Removed by Filter: 10675099
-> Index Scan using pk_geonameid on public.geoname g (cost=0.43..8.43 rows=1 width=11) (actual time=0.006..0.006 rows=0 loops=2179)
Output: g.geonameid, g.name, g.asciiname, g.alternatenames, g.latitude, g.longitude, g.fclass, g.fcode, g.country, g.cc2, g.admin1, g.admin2, g.admin3, g.admin4, g.population, g.elevation, g.gtopo30, g.timezone, g.moddate
Index Cond: (g.geonameid = a.geonameid)
Filter: ((g.fclass = 'A'::bpchar) AND ((g.fcode)::text = 'PCLI'::text))
Rows Removed by Filter: 1
Which to me seems to indicate that somehow a sequential scan is performed on 481 rows (which I deem to be fairly low), but it nevertheless takes very long. I currently can't make sense of this. Any ideas?
Trigram indexes only work if the pattern you're searching for has at least 3 characters: %Sa% won't work, %foo% will. However, your indexes are still not good enough. Depending on which parameters are dynamic, use multicolumn or filtered (partial) indexes:
CREATE INDEX jkb1 ON geoname(fclass, fcode, geonameid, country);
CREATE INDEX jkb2 ON geoname(geonameid, country) WHERE fclass = 'A' AND fcode = 'PCLI';
Same for the other table:
CREATE INDEX jkb3 ON alternatename(geonameid, alternatename) WHERE (ishistoric = FALSE OR ishistoric IS NULL)
AND (isshortname = TRUE OR isshortname IS NULL) AND isolanguage = LOWER('de');
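To check the three-character point about trigrams on your own data, you could compare the plans for a two-character and a three-character pattern. '%Saa%' is just an illustrative pattern, and whether the planner picks alternatename_trgm_idx also depends on table statistics, so treat this as a sketch rather than a guaranteed outcome.

```sql
-- Two characters: no complete trigram can be extracted from the pattern.
EXPLAIN
SELECT alternatename FROM alternatename WHERE alternatename ILIKE '%Sa%';

-- Three characters: the trigram index becomes a candidate.
EXPLAIN
SELECT alternatename FROM alternatename WHERE alternatename ILIKE '%Saa%';
```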
I am confused about what Redshift is doing when I run 2 seemingly similar queries. Neither should return a result (querying a profile that doesn't exist). Specifically:
SELECT * FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
Execution time: 36.75s
versus
SELECT COUNT(*) FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
Execution time: 0.2s
Given that the table is sorted by project_id then id I would have thought this is just a key lookup. The SELECT COUNT(*) ... returns 0 results in 0.2sec which is about what I would expect. The SELECT * ... returns 0 results in 37.75sec. That's a huge difference for the same result and I don't understand why?
If it helps schema as follows:
CREATE TABLE profile (
project_id integer not null,
id varchar(256) not null,
created timestamp not null,
/* ... approx 50 other columns here */
)
DISTKEY(id)
SORTKEY(project_id, id);
Explain from SELECT COUNT(*) ...
XN Aggregate (cost=435.70..435.70 rows=1 width=0)
-> XN Seq Scan on profile (cost=0.00..435.70 rows=1 width=0)
Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))
Explain from SELECT * ...
XN Seq Scan on profile (cost=0.00..435.70 rows=1 width=7356)
Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))
Why is the non count much slower? Surely Redshift knows the row doesn't exist?
The reason is that in many RDBMSs the answer to a count(*) question can often be produced without scanning the actual row data, using only an index or table statistics. Redshift stores the minimum and maximum value of each block, which it uses to give exists/does-not-exist answers in cases like the one described. If the requested value lies inside a block's min/max boundaries, the scan is performed only on the data of the filtering columns. If the requested value lies outside the block boundaries, the answer is given much faster on the basis of the stored statistics. For a "select *" query, Redshift actually scans the data of all the columns requested by "*", but filters only on the columns in the "where" clause.
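If that explanation holds, listing only the columns you actually need should reduce the amount of column data Redshift has to touch. A sketch using the three columns visible in the posted schema:

```sql
SELECT project_id, id, created   -- instead of all ~50 columns via *
FROM profile
WHERE id = 'id_that_doesnt_exist'
  AND project_id = 1;
```

Comparing the timing of this against the SELECT * version would show how much of the 36 seconds comes from the wide column list.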
I have a simple table which has a user_birthday field with a type of date (which can be NULL)
CREATE TABLE users
(
user_id bigserial NOT NULL,
user_email text NOT NULL,
user_password text,
user_first_name text NOT NULL,
user_middle_name text,
user_last_name text NOT NULL,
user_birthday date,
CONSTRAINT pk_users PRIMARY KEY (user_id)
)
There's an index (btree) defined on that field, with the rule of NOT user_birthday IS NULL.
CREATE INDEX ix_users_birthday
ON users
USING btree
(user_birthday)
WHERE NOT user_birthday IS NULL;
Trying to follow up on another idea, I've added the extension btree_gist and created the following index:
CREATE INDEX ix_users_birthday_gist
ON glances.users
USING gist
(user_birthday)
WHERE NOT user_birthday IS NULL;
But it had no effect either; from what I could read, it is not used for range checking.
The PostgreSQL version is 9.3.4.0 (22) Postgres.app
and issue also exists in 9.3.3.0 (21) Postgres.app
I've been intrigued by the following queries:
Query #1:
EXPLAIN ANALYZE SELECT *
FROM users
WHERE user_birthday <@ daterange('[1978-07-15,1983-03-01)')
Query #2:
EXPLAIN ANALYZE SELECT *
FROM users
WHERE user_birthday BETWEEN '1978-07-15'::date AND '1983-03-01'::date
which, at first glance, should both have the same execution plan, but for some reason, here are the results:
Query #1:
"Seq Scan on users (cost=0.00..52314.25 rows=11101 width=241) (actual time=0.014..478.983 rows=208886 loops=1)"
" Filter: (user_birthday <@ '[1978-07-15,1983-03-01)'::daterange)"
" Rows Removed by Filter: 901214"
"Total runtime: 489.584 ms"
Query #2:
"Bitmap Heap Scan on users (cost=4468.01..46060.53 rows=210301 width=241) (actual time=57.104..489.785 rows=209019 loops=1)"
" Recheck Cond: ((user_birthday >= '1978-07-15'::date) AND (user_birthday <= '1983-03-01'::date))"
" Rows Removed by Index Recheck: 611375"
" -> Bitmap Index Scan on ix_users_birthday (cost=0.00..4415.44 rows=210301 width=0) (actual time=54.621..54.621 rows=209019 loops=1)"
" Index Cond: ((user_birthday >= '1978-07-15'::date) AND (user_birthday <= '1983-03-01'::date))"
"Total runtime: 500.983 ms"
As you can see, the <@ daterange condition is not utilizing the existing index, while BETWEEN does.
Important to note that the actual use case for this rule is in a more complex query, which doesn't result in the Recheck Cond and Bitmap Heap scan.
In the application complex query, the difference between the two methods (with 1.2 million records) is massive:
Query #1 at 415ms
Query #2 at 84ms.
Is this a bug with daterange?
Am I doing something wrong, or is daterange <@ performing as designed?
There's also a discussion in the pgsql-bugs mailing list
BETWEEN includes both the upper and the lower bound. Your condition
WHERE user_birthday BETWEEN '1978-07-15'::date AND '1983-03-01'::date
matches
WHERE user_birthday <@ daterange('[1978-07-15,1983-03-01]')
I see you mention a btree index. For that use simple comparison operators.
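As a sketch of what "simple comparison operators" could look like for the half-open range from Query #1, the upper bound becomes a strict less-than so that 1983-03-01 itself is excluded, matching '[1978-07-15,1983-03-01)':

```sql
SELECT *
FROM users
WHERE user_birthday >= DATE '1978-07-15'
  AND user_birthday <  DATE '1983-03-01';  -- exclusive upper bound
```

Written this way the condition is something the partial btree index ix_users_birthday can serve, since a non-NULL comparison implies the index predicate.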
Detailed manual page on which index is good for which operators.
The range type operators <@ or @> would work with GiST indexes.
Example:
Perform this hours of operation query in PostgreSQL