High RDS CPU Utilization By Select Query - postgresql

I am using a query to find my results. My table has almost 2,000,000 records.
Each time the query executes, it causes high CPU usage.
Can anybody help me get out of this?
The query I am using is:
select * from demo.document_details
where udid ilike '06AAACT2727Q1Z0:HR1000249801:%'
and active='Y'
and ewb_no=''
The query plan is:
Seq Scan on document_details (cost=0.00..469135.94 rows=86 width=1132) (actual time=1711.248..2116.794 rows=1 loops=1)
Filter: (((udid)::text ~~ '06AAACT2727Q1Z0:HR1000249801:%'::text) AND ((active)::text = 'Y'::text) AND ((ewb_no)::text = ''::text))
Rows Removed by Filter: 2047478
Planning time: 0.348 ms
Execution time: 2116.870 ms

You need an index. If your collation is not 'C', then the index should specify "text_pattern_ops" so that it can support the prefix matching.
create index on demo.document_details (udid text_pattern_ops);
You might also include "active" and "ewb_no" in the index, depending on how selective those columns are and whether most of your queries that search on "udid" also include them. If you do include them, put them before "udid" in the index, because using LIKE for a prefix is basically a range criterion, and once you have a range criterion on an indexed column, it greatly impairs the utility of columns occurring after it in the index definition. (Imagine the difference between looking in a phone book for "Joh%, James" and "Johnson, J%".)
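For example, a sketch of such a composite index, assuming most of your queries filter on all three columns:
-- hypothetical composite index: equality columns first, prefix column last
create index on demo.document_details (active, ewb_no, udid text_pattern_ops);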

Related

Faster Postgres Counting with where condition

I need to count the total number of rows in a table with a WHERE clause. My application can tolerate some level of inaccuracy.
SELECT count(*) AS "count" FROM "Orders" AS "Order" WHERE "Order"."orderType" = 'online' AND "Order"."status" = 'paid';
But clearly, this is a very slow query. I came across this answer, but that returns the count of all rows in the table.
What's a faster method of counting when I have a WHERE clause? I'm using the Sequelize ORM, so any relevant Sequelize method would also help.
So, running EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) AS "count" FROM "Orders" AS "Order" WHERE "Order"."orderType" = 'online' AND "Order"."status" != 'paid'; returns the following:
Aggregate (cost=47268.10..47268.11 rows=1 width=8) (actual time=719.722..719.723 rows=1 loops=1)
Buffers: shared hit=32043
-> Seq Scan on "Orders" "Order" (cost=0.00..47044.35 rows=89501 width=0) (actual time=0.011..674.316 rows=194239 loops=1)
Filter: (((status)::text <> 'paid'::text) AND (("orderType")::text = 'online'::text))
Rows Removed by Filter: 830133
Buffers: shared hit=32043
Planning time: 0.069 ms
Execution time: 719.755 ms
My application can tolerate some level of inaccuracy.
This is pretty hard to capitalize on in PostgreSQL. It is fairly easy to get an approximate answer, but hard to put a limit on how approximate that answer is. There will always be cases where the estimate can be very wrong, and it is hard to know when that is occurring without doing the work needed to get an exact answer.
In your query plan, it is off by a factor of 2.17. Is that good enough?
(cost=0.00..47044.35 rows=89501 width=0) (actual time=0.011..674.316 rows=194239 loops=1)
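If the planner's estimate is close enough for you, you can read it out of EXPLAIN instead of counting; below is a minimal sketch along the lines of the well-known count-estimate function from the PostgreSQL wiki (the function name and usage are illustrative):
CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE
    rec record;
    n   integer;
BEGIN
    -- read the planner's "rows=" estimate out of the EXPLAIN output
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        n := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN n IS NOT NULL;
    END LOOP;
    RETURN n;
END;
$$ LANGUAGE plpgsql;
-- usage: pass the un-aggregated query so the top plan node carries the estimate
-- SELECT count_estimate('SELECT 1 FROM "Orders" WHERE "orderType" = ''online'' AND "status" = ''paid''');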
Or, can you put bounds on tolerable inaccuracy in some other dimension? Like "accurate as of some point in the last hour"? With that kind of tolerance, you could make a materialized view to partially summarize the data, like:
create materialized view order_counts as
SELECT "orderType", "status", count(*) AS "count" FROM "Orders"
group by 1,2;
and then pull the counts out of that with your WHERE clause (and possibly resummarize them). The effectiveness of this depends on the number of combinations of "orderType" and "status" being much less than the total number of rows in the main table. You would have to set up a scheduled job to refresh the matview periodically. PostgreSQL does not rewrite your original query to use the matview automatically; you have to rewrite it yourself.
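A sketch of that usage (sum() covers the case where your WHERE clause matches several groups):
-- query the matview instead of the base table
SELECT sum("count") FROM order_counts
WHERE "orderType" = 'online' AND "status" = 'paid';
-- refresh periodically from a scheduled job, e.g. cron
REFRESH MATERIALIZED VIEW order_counts;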
You have shown us two different queries, status = 'paid' and status != 'paid'. Is one of those a mistake, or do they reflect variation in the queries you actually want to run? What other things might vary in this pool of similar queries? You should be able to get some speedup using indexes, but which index in particular will depend on your queries. For the equality query, you can include "status" in the index. For the inequality query, that wouldn't do much good, so instead you could use a partial index with WHERE status <> 'paid'. (But then that index wouldn't be useful if 'paid' were changed to 'delinquent', for example.)
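A sketch of both options (the index names are illustrative):
-- covers the equality query
CREATE INDEX orders_type_status_idx ON "Orders" ("orderType", "status");
-- partial index for the inequality query
CREATE INDEX orders_type_unpaid_idx ON "Orders" ("orderType") WHERE "status" <> 'paid';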

Why doesn't postgresql use all columns in a multi-column index?

I am using the extension
CREATE EXTENSION btree_gin;
I have an index that looks like this...
create index boundaries2 on rets USING GIN(source, isonlastsync, status, (geoinfo::jsonb->'boundaries'), ctcvalidto, searchablePrice, ctcSortOrder);
before I started messing with it, the index looked like this, with the same results that I'm about to share, so it seems minor variations in the index definition don't make a difference:
create index boundaries on rets USING GIN((geoinfo::jsonb->'boundaries'), source, status, isonlastsync, ctcvalidto, searchablePrice, ctcSortOrder);
I give pgsql 11 this query:
explain analyze select id from rets where ((geoinfo::jsonb->'boundaries' ?| array['High School: Torrey Pines']) AND source='SDMLS'
AND searchablePrice>=800000 AND searchablePrice<=1200000 AND YrBlt>=2000 AND EstSF>=2300
AND Beds>=3 AND FB>=2 AND ctcSortOrder>'2019-07-05 16:02:54 UTC' AND Status IN ('ACTIVE','BACK ON MARKET')
AND ctcvalidto='9999-12-31 23:59:59 UTC' AND isonlastsync='true') order by LstDate desc, ctcSortOrder desc LIMIT 3000;
with result...
Limit (cost=120.06..120.06 rows=1 width=23) (actual time=472.849..472.850 rows=1 loops=1)
-> Sort (cost=120.06..120.06 rows=1 width=23) (actual time=472.847..472.848 rows=1 loops=1)
Sort Key: lstdate DESC, ctcsortorder DESC
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on rets (cost=116.00..120.05 rows=1 width=23) (actual time=472.748..472.841 rows=1 loops=1)
Recheck Cond: ((source = 'SDMLS'::text) AND (((geoinfo)::jsonb -> 'boundaries'::text) ?| '{"High School: Torrey Pines"}'::text[]) AND (ctcvalidto = '9999-12-31 23:59:59+00'::timestamp with time zone) AND (searchableprice >= 800000) AND (searchableprice <= 1200000) AND (ctcsortorder > '2019-07-05 16:02:54+00'::timestamp with time zone))
Rows Removed by Index Recheck: 93
Filter: (isonlastsync AND (yrblt >= 2000) AND (estsf >= 2300) AND (beds >= 3) AND (fb >= 2) AND (status = ANY ('{ACTIVE,"BACK ON MARKET"}'::text[])))
Rows Removed by Filter: 10
Heap Blocks: exact=102
-> Bitmap Index Scan on boundaries2 (cost=0.00..116.00 rows=1 width=0) (actual time=471.762..471.762 rows=104 loops=1)
Index Cond: ((source = 'SDMLS'::text) AND (((geoinfo)::jsonb -> 'boundaries'::text) ?| '{"High School: Torrey Pines"}'::text[]) AND (ctcvalidto = '9999-12-31 23:59:59+00'::timestamp with time zone) AND (searchableprice >= 800000) AND (searchableprice <= 1200000) AND (ctcsortorder > '2019-07-05 16:02:54+00'::timestamp with time zone))
Planning Time: 0.333 ms
Execution Time: 474.311 ms
(14 rows)
The Question
Why are the columns status and isonlastsync not used by the Bitmap Index Scan on boundaries2?
It can do that if it predicts that filtering out those columns afterwards will be faster. This is usually the case when the cardinality of the columns is very low and you will fetch a large enough portion of all rows; this is true for a boolean like isonlastsync and usually true for status columns with just a few distinct values.
Rows Removed by Filter: 10 is very little to filter out, either because your table does not hold a large number of rows or because most of them satisfy the condition you specified for those two columns. You might try generating more data in that table or selecting rows with a rare status.
I suggest partial indexes (with a WHERE condition), at least for the boolean column, and removing those two columns to make this index a bit more lightweight.
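A sketch of that idea, assuming isonlastsync is a boolean and your queries always filter on it being true (the index name is illustrative):
-- drop the two low-cardinality columns and turn the boolean into a partial-index predicate
create index boundaries3 on rets USING GIN(source, (geoinfo::jsonb->'boundaries'), ctcvalidto, searchablePrice, ctcSortOrder)
WHERE isonlastsync;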
I cannot tell you why, but I can help you optimize the query.
You should not use a multi-column GIN index, but a GIN index on only the jsonb expression and a b-tree index on the other columns.
The order of columns matters: put the ones used in an equality condition first, with the most selective at the beginning. Next put the column with the most selective inequality or IN condition. For the following columns, the order doesn't matter, as they will only act as filters in the index scan.
Make sure that the indexes are cached in RAM.
I'd expect the query to be faster that way.
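A sketch of that split (the index names and the exact b-tree column order are assumptions based on the advice above):
-- GIN index on just the jsonb expression
create index rets_boundaries_gin on rets USING GIN((geoinfo::jsonb->'boundaries'));
-- b-tree with the equality columns first, then the range columns
create index rets_filters_btree on rets (source, ctcvalidto, ctcSortOrder, searchablePrice);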
I think you're asking yourself the wrong question. As Lukasz already answered, PostgreSQL may find it inefficient to check all columns in the index. The problem here is that your index is too big on disk.
Probably, in trying to make this SQL faster, you added as many columns as possible to the index, but it is backfiring.
The trick is to realize how much data PostgreSQL has to read to find your records. If your index contains too much data, it will have to read a lot. Also, be aware that low cardinality columns don't play well with BTree and common index types; generally you want to avoid indexing them.
To keep the index as small and as quick to look up as possible, you have to find the column with the highest cardinality, or better, the column that will return the fewest rows for your query. My guess is "ctcSortOrder". This will be the first column of your index.
Then look at which column, after filtering by the first one, has the most cardinality or will filter out the most rows. I have no idea about your data, but "source" looks like a good candidate.
Try to avoid jsonb searches unless they are the primary source of the cardinality, and keep the index as a Btree. BTree is several times faster.
And like Lukasz suggested, look at partial indexes. For example, add "WHERE Status IN ('ACTIVE','BACK ON MARKET') AND isonlastsync='true'", as these two may be common to all your searches.
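A sketch of that kind of partial index, following the column-order guess above (the name and exact columns are assumptions):
create index rets_common_search on rets (ctcSortOrder, source, searchablePrice)
WHERE status IN ('ACTIVE', 'BACK ON MARKET') AND isonlastsync;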
Bottom line is, having a simpler, smaller index is faster than having all columns indexed. And the order of the columns does matter a lot. Stick with BTree unless there is a good reason (lots of cardinality in non-btree compatible types).
If your table is huge (>10M rows) consider table partitioning, for example by ctcSortOrder. But I don't think this is your case.

postgres trigram index is too slow with ilike search

I'm doing a pattern-matching search with ILIKE in our system, but it gets too slow with some tables due to the number of records in the table. So I'm implementing a trigram index following the instructions in this post https://www.depesz.com/2011/02/19/waiting-for-9-1-faster-likeilike/. I'm not using full text search because I need searches like '%xxx%', and full text search does not work well with that. The test table has 16000 records and I have created a new column in the table for the search, concatenating some other columns.
I've run some tests and these are the results:
SELECT * FROM "table" WHERE "searchField" ILIKE '%ose%'
NO INDEX
1639 rows 30.3 sec. avg.
GIN INDEX
1639 rows 26.4 sec. avg.
SELECT * FROM "table" WHERE "searchField" ILIKE '%ose%' OR "searchField" ILIKE '%ria%'
NO INDEX
1639 rows 3:02 min. avg.
GIN INDEX
1639 rows 2.56 min. avg.
As you can see it's not a great improvement; the post said that query times dropped to milliseconds. The EXPLAIN ANALYZE shows this:
Bitmap Heap Scan on "table" (cost=22.31..1827.93 rows=1331 width=511)
(actual time=0.354..4.644 rows=1639 loops=1)
Recheck Cond: (("searchField")::text ~~* '%ose%'::text)
Heap Blocks: exact=585
-> Bitmap Index Scan on "table_trgm_gin" (cost=0.00..21.98 rows=1331 width=0)
(actual time=0.276..0.276 rows=1639 loops=1)
Index Cond: (("searchField")::text ~~* '%ose%'::text)
The index scan is fast but the condition recheck is too slow. I have read that rechecking is unavoidable due to the possibility of false positives. But then I don't know how to get better results.
Can anyone explain why the index does not make much of a difference and how to get better query times?
The EXPLAIN (ANALYZE) you show must be from a different table, because there the duration is under 5 milliseconds.
I notice that the patterns you search for are very short (3 characters).
Trigram indexes don't perform well for short patterns, because many rows will match during the index scan and all of these rows have to be rechecked.
Two things to check to see if my analysis is correct:
Test with longer patterns and see if the performance improves.
Look at the EXPLAIN (ANALYZE) output of the query that takes three minutes and see if a lot of rows are found during the index scan.
If I am right, there is not much you can do. Looking for short patterns just doesn't perform very well. You could try to enforce a minimum pattern length to avoid the problem.
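For example, both checks could look like this (the longer pattern is just an illustration):
-- 1) try a longer pattern and compare the timing
EXPLAIN (ANALYZE) SELECT * FROM "table" WHERE "searchField" ILIKE '%joseph%';
-- 2) run the slow OR query and see how many rows the bitmap index scan returns
EXPLAIN (ANALYZE) SELECT * FROM "table" WHERE "searchField" ILIKE '%ose%' OR "searchField" ILIKE '%ria%';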

PostgreSQL index not used for query on IP ranges

I'm using PostgreSQL 9.2 and have a table of IP ranges. Here's the SQL:
CREATE TABLE ips (
id serial NOT NULL,
begin_ip_num bigint,
end_ip_num bigint,
country_name character varying(255),
CONSTRAINT ips_pkey PRIMARY KEY (id )
)
I've added plain B-tree indices on both begin_ip_num and end_ip_num:
CREATE INDEX index_ips_on_begin_ip_num ON ips (begin_ip_num);
CREATE INDEX index_ips_on_end_ip_num ON ips (end_ip_num );
The query being used is:
SELECT ips.* FROM ips
WHERE 3065106743 BETWEEN begin_ip_num AND end_ip_num;
The problem is that my BETWEEN query is only using the index on begin_ip_num. After using the index, it filters the result using end_ip_num. Here's the EXPLAIN ANALYZE result:
Index Scan using index_ips_on_begin_ip_num on ips (cost=0.00..2173.83 rows=27136 width=76) (actual time=16.349..16.350 rows=1 loops=1)
Index Cond: (3065106743::bigint >= begin_ip_num)
Filter: (3065106743::bigint <= end_ip_num)
Rows Removed by Filter: 47596
Total runtime: 16.425 ms
I've already tried various combinations of indices including adding a composite index on both begin_ip_num and end_ip_num.
Try a multicolumn index, but with reversed order on the second column:
CREATE INDEX index_ips_begin_end_ip_num ON ips (begin_ip_num, end_ip_num DESC);
Ordering is mostly irrelevant for a single-column index, since it can be scanned backwards almost as fast. But it is important for multicolumn indexes.
With the index I propose, Postgres can scan the first column and find the address, where the rest of the index fulfills the first condition. Then it can, for each value of the first column, return all rows that fulfill the second condition, until the first one fails. Then jump to the next value of the first column, etc.
This is still not very efficient and Postgres may be faster just scanning the first index column and filtering for the second. Very much depends on your data distribution.
Either way, CLUSTER using the multicolumn index from above can help performance:
CLUSTER ips USING index_ips_begin_end_ip_num
This way, candidates fulfilling your first condition are packed onto the same or adjacent data pages. It can help performance a lot if you have many rows per value of the first column; otherwise it is hardly effective.
(There are also non-blocking external tools for the purpose: pg_repack or pg_squeeze.)
Also, is autovacuum running and configured properly or have you run ANALYZE on the table? You need current statistics for Postgres to pick appropriate query plans.
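A quick check along those lines (the table name is from the question):
ANALYZE ips;
-- see when autovacuum / autoanalyze last touched the table
SELECT relname, last_autovacuum, last_autoanalyze FROM pg_stat_user_tables WHERE relname = 'ips';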
What would really help here is a GiST index for an int8range column, available since PostgreSQL 9.2. See:
Optimizing queries on a range of timestamps (two columns)
If your IP ranges can be covered with one of the built-in network types inet or cidr, consider replacing your two bigint columns. Or, better yet, look at the additional module ip4r by Andrew Gierth (not in the standard distribution). The indexing strategy changes accordingly.
Barring that, you can check out this related answer on dba.SE using a sophisticated regime of partial indexes. Advanced stuff, but it delivers great performance:
Can spatial index help a “range - order by - limit” query
I had exactly this same problem on a nearly identical dataset from maxmind.com's free geoip table. I solved it using Erwin's tip about range types and GiST indexes. The GiST index was key. Without it I was querying at best about 3 rows per second. With it I queried nearly 500000 rows in under 10 seconds! Since Erwin didn't post detailed instructions on how to do this, I thought I'd add them here...
First of all, you must add a new column of the range type; note that int8range is required for bigint types. Next, set its values appropriately; note that the '[]' parameter makes the range inclusive at the lower and upper bounds (rtfm). Finally, add the index; note that the GiST index is where all the performance advantage comes from.
alter table ips add column iprange int8range;
update ips set iprange=int8range(begin_ip_num, end_ip_num, '[]');
create index index_ips_on_iprange on ips using gist (iprange);
Having laid the groundwork, you can now use the '<@' contained-by operator to search for specific addresses against the table. See http://www.postgresql.org/docs/9.2/static/functions-range.html
SELECT "ips".* FROM "ips" WHERE (3065106743::bigint <@ iprange);
I'm a bit late to this party, but this is what works really well for me.
Consider installing the ip4r extension. It basically allows you to define a column that can hold IP ranges. The name of the extension implies it is just for IPv4, but currently it also supports IPv6.
After you populate the table with ranges in that column, all you need is to create a GiST index:
CREATE INDEX ip_zip_ip4_range ON ip_zip USING gist (ip4_range);
I have almost 10 million ranges in my database, but queries take a fraction of a millisecond:
region=> select count(*) from ip_zip ;
count
---------
9566133
region=> explain analyze select * from ip_zip where '8.8.8.8'::ip4 <<= ip4_range;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on ip_zip (cost=234.55..25681.29 rows=9566 width=22) (actual time=0.085..0.086 rows=1 loops=1)
Recheck Cond: ('8.8.8.8'::ip4r <<= ip4_range)
Heap Blocks: exact=1
-> Bitmap Index Scan on ip_zip_ip4_range (cost=0.00..232.16 rows=9566 width=0) (actual time=0.055..0.055 rows=1 loops=1)
Index Cond: ('8.8.8.8'::ip4r <<= ip4_range)
Planning time: 0.106 ms
Execution time: 0.118 ms
(7 rows)
region=> explain analyze select * from ip_zip where '254.50.22.54'::ip4 <<= ip4_range;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on ip_zip (cost=234.55..25681.29 rows=9566 width=22) (actual time=0.059..0.059 rows=1 loops=1)
Recheck Cond: ('254.50.22.54'::ip4r <<= ip4_range)
Heap Blocks: exact=1
-> Bitmap Index Scan on ip_zip_ip4_range (cost=0.00..232.16 rows=9566 width=0) (actual time=0.048..0.048 rows=1 loops=1)
Index Cond: ('254.50.22.54'::ip4r <<= ip4_range)
Planning time: 0.102 ms
Execution time: 0.145 ms
(7 rows)
I believe your query looks like WHERE [constant] BETWEEN begin_ip_num AND end_ip_num.
As far as I know Postgres doesn't have an "AND-EQUAL" access plan, so you need to add a composite index on the 2 columns as suggested by Erwin Brandstetter.

PostgreSQL query doesn't use index

I have a very simple db schema, which has a multi column b-tree index on following columns:
PersonId, Amount, Commission
Now, if I try to select the table with following query:
explain select * from "Order" where "PersonId" = 2 AND "Commission" > 3
Pg is scanning the index and the query is very fast, but if I try the following query:
explain select * from "Order" where "PersonId" > 2 AND "Commission" > 3
It does a sequential scan, even when the index is present. Even this query
explain select * from "Order" where "Commission" > 3
does a sequential scan.
Anyone care to explain why? :-)
Thank you very much.
UPDATE
The table contains 100 million rows. I created it just to test PostgreSQL performance against MS SQL. The table has already been VACUUMed. I'm running a Core i5 2500K quad-core CPU and 8 GB of RAM.
Here's the result of explain analyze for this query:
explain ANALYZE select * from "Order" where "Commission" BETWEEN 3000000 AND 3000010 LIMIT 20
Limit (cost=0.00..2218328.00 rows=1 width=24) (actual time=28043.249..28043.249 rows=0 loops=1)
-> Seq Scan on "Order" (cost=0.00..2218328.00 rows=1 width=24) (actual time=28043.247..28043.247 rows=0 loops=1)
Filter: (("Commission" >= 3000000::numeric) AND ("Commission" <= 3000010::numeric))
Total runtime: 28043.278 ms
The short answer is that when comparing the various available plans, the sequential scan is expected to be the fastest, based on the costing factors you have configured and the latest statistics available. From what little information you've provided, it seems quite likely that the planner has made the right choice. If you had three single-column indexes, it might be able to use bitmap index scans, particularly if the rows to be selected are less than about 10% of the rows in the table.
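For illustration, the three single-column indexes mentioned would look something like this (the names are illustrative):
CREATE INDEX order_personid_idx   ON "Order" ("PersonId");
CREATE INDEX order_amount_idx     ON "Order" ("Amount");
CREATE INDEX order_commission_idx ON "Order" ("Commission");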
Note that with the index you describe, the entire index would need to be scanned for all rows where "PersonId" > 2; unless you have a lot of negative values for "PersonId", that is very likely to be most of the table.
Also note that if you have a tiny table -- say a few thousand rows or less, accessing the rows through an index will rarely be faster than just scanning those few rows. Plans are sensitive to data volume, and the plan you get with a small number of rows is very unlikely to be the same plan you get with a lot of rows.
If it is, in fact, not picking the fastest plan, the odds are good that you need to adjust your cost factors to better model the costs on your machine. Another possibility is that you need to be more aggressive in your autovacuum settings, to make sure up-to-date statistics are available, or you may need to configure collection of finer-grained statistics.
People will be able to provide more specific advice if you show the table descriptions (including indexes), the EXPLAIN ANALYZE output for the query, and a description of the hardware.