Faster Postgres counting with a WHERE condition - postgresql

I need to count the total number of rows in a table with a WHERE clause. My application can tolerate some level of inaccuracy.
SELECT count(*) AS "count" FROM "Orders" AS "Order" WHERE "Order"."orderType" = 'online' AND "Order"."status" = 'paid';
But clearly, this is a very slow query. I came across this answer but that returns the count of all rows in the table.
What's a faster method of counting when I have a WHERE clause? I'm using the Sequelize ORM, so any relevant Sequelize method would also help.
So, doing EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) AS "count" FROM "Orders" AS "Order" WHERE "Order"."orderType" = 'online' AND "Order"."status" != 'paid'; returns me the following:
Aggregate (cost=47268.10..47268.11 rows=1 width=8) (actual time=719.722..719.723 rows=1 loops=1)
Buffers: shared hit=32043
-> Seq Scan on "Orders" "Order" (cost=0.00..47044.35 rows=89501 width=0) (actual time=0.011..674.316 rows=194239 loops=1)
Filter: (((status)::text <> 'paid'::text) AND (("orderType")::text = 'online'::text))
Rows Removed by Filter: 830133
Buffers: shared hit=32043
Planning time: 0.069 ms
Execution time: 719.755 ms

My application can tolerate some level of inaccuracy.
This is pretty hard to capitalize on in PostgreSQL. It is fairly easy to get an approximate answer, but hard to put a limit on how approximate that answer is. There will always be cases where the estimate can be very wrong, and it is hard to know when that is occurring without doing the work needed to get an exact answer.
In your query plan, it is off by a factor of 2.17. Is that good enough?
(cost=0.00..47044.35 rows=89501 width=0) (actual time=0.011..674.316 rows=194239 loops=1)
Or, can you put bounds on tolerable inaccuracy in some other dimension? Like "accurate as of some point in the last hour"? With that kind of tolerance, you could make a materialized view to partially summarize the data, like:
create materialized view order_counts as
SELECT "orderType", "status", count(*) AS "count" FROM "Orders"
group by 1,2;
and then pull the counts out of that with your WHERE clause (and possibly resummarize them). The effectiveness of this depends on the number of combinations of "orderType" and "status" being much less than the total number of rows in the main table. You would have to set up a scheduled job to refresh the matview periodically. PostgreSQL will not rewrite your original query to use the matview automatically; you have to rewrite it yourself.
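As a rough sketch (re-using the order_counts matview defined above; the scheduling itself happens outside PostgreSQL, e.g. via cron), pulling the counts back out and refreshing them could look like this:
-- resummarize from the matview instead of scanning "Orders"
SELECT sum("count") AS "count"
FROM order_counts
WHERE "orderType" = 'online' AND "status" = 'paid';
-- run this from a scheduled job to keep the counts reasonably fresh
REFRESH MATERIALIZED VIEW order_counts;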
You have shown us two different queries, status = 'paid' and status != 'paid'. Is one of those a mistake, or do they reflect variation in the queries you actually want to run? What other things might vary in this pool of similar queries? You should be able to get some speedup using indexes, but which index in particular will depend on your queries. For the equality query, you can include "status" in the index. For the inequality query, that wouldn't do much good, so instead you could use a partial index WHERE status <> 'paid'. (But then that index wouldn't be useful if 'paid' was changed to 'delinquent', for example.)
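A sketch of the two index shapes described above (the index names are made up):
-- for the equality query (status = 'paid'); on a well-vacuumed table count(*) can then be answered by an index-only scan
CREATE INDEX orders_type_status_idx ON "Orders" ("orderType", "status");
-- for the inequality query (status != 'paid'); a partial index
CREATE INDEX orders_type_unpaid_idx ON "Orders" ("orderType") WHERE "status" <> 'paid';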

Related

High RDS CPU Utilization By Select Query

I am using a query to find my results. My table has almost 2,000,000 records.
Every time the query executes, it takes a lot of CPU.
Can anybody help me fix this?
The query I am using is:
select * from demo.document_details
where udid ilike '06AAACT2727Q1Z0:HR1000249801:%'
and active='Y'
and ewb_no=''
Query Plan is :
Seq Scan on document_details (cost=0.00..469135.94 rows=86 width=1132) (actual time=1711.248..2116.794 rows=1 loops=1)
Filter: (((udid)::text ~~ '06AAACT2727Q1Z0:HR1000249801:%'::text) AND ((active)::text = 'Y'::text) AND ((ewb_no)::text = ''::text))
Rows Removed by Filter: 2047478
Planning time: 0.348 ms
Execution time: 2116.870 ms
You need an index. If your collation is not 'C', then the index should specify "text_pattern_ops" so that it can support the prefix matching.
create index on demo.document_details (udid text_pattern_ops);
You might also include "active" and "ewb_no" in the index, depending on how selective those columns are and whether most of your queries which search on "udid" also include them. If you do include them, you should put them in the index before "udid", because using LIKE for a prefix is basically a range criterion, and once you have a range criterion on an indexed column, it greatly impairs the utility of columns occurring after it in the index definition. (Imagine the difference between looking in a phone book for "Joh%, James" and "Johnson, J%".)
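A hedged sketch of that wider index (whether it actually pays off depends on how selective "active" and "ewb_no" are in your data):
create index on demo.document_details (active, ewb_no, udid text_pattern_ops);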

Slow distinct PostgreSQL query on nested jsonb field won't use index

I'm trying to get distinct values from a nested field on a JSONB column, but it takes about 2 minutes on a 400K-row table.
The original query used DISTINCT, but then I read that GROUP BY works better, so I tried that too - no luck, still extremely slow.
Adding an index did not help either:
create index "orders_financial_status_index" on orders ((data ->'data'->> 'financial_status'));
EXPLAIN ANALYZE gave this result:
HashAggregate (cost=13431.16..13431.22 rows=4 width=32) (actual time=123074.941..123074.943 rows=4 loops=1)
Group Key: ((data -> 'data'::text) ->> 'financial_status'::text)
-> Seq Scan on orders (cost=0.00..12354.14 rows=430809 width=32) (actual time=2.993..122780.325 rows=434080 loops=1)
Planning time: 0.119 ms
Execution time: 123074.979 ms
It's worth mentioning that there are no null values on this column, and currently there are 4 unique values.
What should I do in order to query the distinct values faster?
No index will make this faster, because the query has to scan the whole table.
As you can see, the sequential scan uses almost all the time; the hash aggregate is fast.
Still I would not drop the index, because it allows PostgreSQL to estimate the number of groups accurately and decide on the more efficient hash aggregate rather than sorting the rows. You can try without the index to be sure.
However, two minutes for half a million rows is not very fast. Do you have slow storage? Is the table bloated? If the latter, VACUUM (FULL) should improve things.
You can speed up the query by reducing I/O. Load the table into RAM with pg_prewarm, then processing should be considerably faster.
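For example (pg_prewarm is a contrib extension, and 'orders' is the table from the question), a minimal sketch:
create extension if not exists pg_prewarm;
-- load the table's blocks into shared buffers so the next scan is served from RAM
select pg_prewarm('orders');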

Slow index scan

I have table with index:
Table:
Participates (player_id integer, other...)
Indexes:
"index_participates_on_player_id" btree (player_id)
The table contains about 400 million rows.
I execute the same query two times:
Query: explain analyze select * from participates where player_id=149294217;
First time:
Index Scan using index_participates_on_player_id on participates (cost=0.57..19452.86 rows=6304 width=58) (actual time=261.061..2025.559 rows=332 loops=1)
Index Cond: (player_id = 149294217)
Total runtime: 2025.644 ms
(3 rows)
Second time:
Index Scan using index_participates_on_player_id on participates (cost=0.57..19452.86 rows=6304 width=58) (actual time=0.030..0.479 rows=332 loops=1)
Index Cond: (player_id = 149294217)
Total runtime: 0.527 ms
(3 rows)
So, the first query has a large actual time. How can I speed up the first execution?
UPDATE
Sorry, I mean: how can I accelerate the first query?
Why is the index scan so slow the first time?
The difference in execution time is probably because the second time through, the table/index data from the first run of the query is in the shared buffers cache, so the subsequent run takes less time, since it doesn't have to go all the way to disk for that information.
Edit:
Regarding the slowness of the original query, does the table have a lot of dead tuples? Those can slow things down considerably. If so, VACUUM ANALYZE the table.
Another factor can be if there are long-ish idle transactions on the server (i.e. several minutes or more). Due to the nature of MVCC this can also slow even index-based queries down quite a bit.
Also, what the query planner is expecting for the results vs. the actual results is quite different, so you may want to run ANALYZE on the table beforehand to update the stats.
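For instance (table name taken from the question), a sketch of checking for dead tuples and refreshing the statistics:
-- how much of the table is dead tuples, and when it was last vacuumed/analyzed
SELECT n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'participates';
-- reclaim dead tuples and update planner statistics
VACUUM ANALYZE participates;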
1.) Take a look at http://www.postgresql.org/docs/9.3/static/runtime-config-resource.html and check out the options for tuning PostgreSQL to use more memory. This can speed up your search, but there is no guarantee (it depends on the answer above)!
2.) Move part of your tables/indexes to a faster tablespace, for example one based on SSDs.
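A sketch of option 2 (the directory is only an example; it must exist, be empty, and be owned by the postgres OS user, and moving a relation rewrites it under an exclusive lock):
CREATE TABLESPACE fast_ssd LOCATION '/mnt/ssd/pg_tblspc';
ALTER INDEX index_participates_on_player_id SET TABLESPACE fast_ssd;
ALTER TABLE participates SET TABLESPACE fast_ssd;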

PostgreSQL query doesn't use index

I have a very simple db schema, which has a multi column b-tree index on following columns:
PersonId, Amount, Commission
Now, if I try to select the table with following query:
explain select * from "Order" where "PersonId" = 2 AND "Commission" > 3
Pg is scanning the index and the query is very fast, but if I try the following query:
explain select * from "Order" where "PersonId" > 2 AND "Commission" > 3
It does a sequential scan, even when the index is present. Even this query
explain select * from "Order" where "Commission" > 3
does a sequential scan.
Anyone care to explain why? :-)
Thank you very much.
UPDATE
The table contains 100 million rows. I have created it just to test PostgreSQL performance against MS SQL. The table has already been VACUUMed. I'm running a Core i5 2500K quad-core CPU and 8 GB of RAM.
Here's the result of explain analyze for this query:
explain ANALYZE select * from "Order" where "Commission" BETWEEN 3000000 AND 3000010 LIMIT 20
Limit (cost=0.00..2218328.00 rows=1 width=24) (actual time=28043.249..28043.249 rows=0 loops=1)
-> Seq Scan on "Order" (cost=0.00..2218328.00 rows=1 width=24) (actual time=28043.247..28043.247 rows=0 loops=1)
Filter: (("Commission" >= 3000000::numeric) AND ("Commission" <= 3000010::numeric))
Total runtime: 28043.278 ms
The short answer is that when comparing the various available plans, the sequential scan is expected to be the fastest, based on the costing factors you have configured and the latest statistics available. From what little information you've provided, it seems quite likely that the planner has made the right choice. If you had three single-column indexes, it might be able to use bitmap index scans, particularly if the rows to be selected are less than about 10% of the rows in the table.
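If you want to experiment with the three-single-column-index approach mentioned above, a sketch would be (the index names are made up, and the planner will only combine them with bitmap scans when the predicates are selective enough):
CREATE INDEX order_personid_idx ON "Order" ("PersonId");
CREATE INDEX order_amount_idx ON "Order" ("Amount");
CREATE INDEX order_commission_idx ON "Order" ("Commission");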
Note that with the index you describe, the entire index would need to be scanned for all rows where "PersonId" > 2; unless you have a lot of negative values for "PersonId", that is very likely to be most of the table.
Also note that if you have a tiny table -- say a few thousand rows or less, accessing the rows through an index will rarely be faster than just scanning those few rows. Plans are sensitive to data volume, and the plan you get with a small number of rows is very unlikely to be the same plan you get with a lot of rows.
If it is, in fact, not picking the fastest plan, the odds are good that you need to adjust your cost factors to better model the costs on your machine. Another possibility is that you need to be more aggressive in your autovacuum settings, to make sure up-to-date statistics are available, or you may need to configure collection of finer-grained statistics.
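As a rough sketch of those knobs (the numbers are illustrative, not recommendations):
-- make random I/O look cheaper relative to sequential I/O
-- (session-level here; set it in postgresql.conf to make it permanent)
SET random_page_cost = 2.0;
-- collect finer-grained statistics on the filtered column, then refresh them
ALTER TABLE "Order" ALTER COLUMN "Commission" SET STATISTICS 500;
ANALYZE "Order";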
People will be able to provide more specific advice if you show the table descriptions (including indexes), the EXPLAIN ANALYZE output for the query, and a description of the hardware.

How can I "think better" when reading a PostgreSQL query plan?

I spent over an hour today puzzling myself over a query plan that I couldn't understand. The query was an UPDATE and it just wouldn't run at all. Totally deadlocked: pg_locks showed it wasn't waiting for anything either. Now, I don't consider myself the best or worst query plan reader, but I find this one exceptionally difficult. I'm wondering how does one read these? Is there a methodology that the Pg aces follow in order to pinpoint the error?
I plan on asking another question as to how to work around this issue, but right now I'm speaking specifically about how to read these types of plans.
QUERY PLAN
--------------------------------------------------------------------------------------------
Nested Loop Anti Join (cost=47680.88..169413.12 rows=1 width=77)
Join Filter: ((co.fkey_style = v.chrome_styleid) AND (co.name = o.name))
-> Nested Loop (cost=5301.58..31738.10 rows=1 width=81)
-> Hash Join (cost=5301.58..29722.32 rows=229 width=40)
Hash Cond: ((io.lot_id = iv.lot_id) AND ((io.vin)::text = (iv.vin)::text))
-> Seq Scan on options io (cost=0.00..20223.32 rows=23004 width=36)
Filter: (name IS NULL)
-> Hash (cost=4547.33..4547.33 rows=36150 width=24)
-> Seq Scan on vehicles iv (cost=0.00..4547.33 rows=36150 width=24)
Filter: (date_sold IS NULL)
-> Index Scan using options_pkey on options co (cost=0.00..8.79 rows=1 width=49)
Index Cond: ((co.fkey_style = iv.chrome_styleid) AND (co.code = io.code))
-> Hash Join (cost=42379.30..137424.09 rows=16729 width=26)
Hash Cond: ((v.lot_id = o.lot_id) AND ((v.vin)::text = (o.vin)::text))
-> Seq Scan on vehicles v (cost=0.00..4547.33 rows=65233 width=24)
-> Hash (cost=20223.32..20223.32 rows=931332 width=44)
-> Seq Scan on options o (cost=0.00..20223.32 rows=931332 width=44)
(17 rows)
The issue with this query plan - I believe I understand - is probably best said by RhodiumToad (he is definitely better at this, so I'll bet on his explanation being better) of irc://irc.freenode.net/#postgresql:
oh, that plan is potentially disastrous
the problem with that plan is that it's running a hugely expensive hashjoin for each row
the problem is the rows=1 estimate from the other join and
the planner thinks it's ok to put a hugely expensive query in the inner path of a nestloop where the outer path is estimated to return only one row.
since, obviously, by the planner's estimate the expensive part will only be run once
but this has an obvious tendency to really mess up in practice
the problem is that the planner believes its own estimates
ideally, the planner needs to know the difference between "estimated to return 1 row" and "not possible to return more than 1 row"
but it's not at all clear how to incorporate that into the existing code
He goes on to say:
it can affect any join, but usually joins against subqueries are the most likely
Now when I read this plan the first thing I noticed was the Nested Loop Anti Join, this had a cost of 169,413 (I'll stick to upper bounds). This Anti-Join breaks down to the result of a Nested Loop at cost of 31,738, and the result of a Hash Join at a cost of 137,424. Now, the 137,424, is much greater than 31,738 so I knew the problem was the Hash Join.
Then I proceeded to EXPLAIN ANALYZE the Hash Join segment outside of the query. It executed in 7 seconds. I made sure there were indexes on (lot_id, vin) and (co.code, v.code) -- there were. I disabled seq_scan and hashjoin individually and noticed a speed increase of less than 2 seconds. Not nearly enough to account for why it wasn't progressing after an hour.
But after all this, I was totally wrong! Yes, it was the slower part of the query, but the real culprit was the rows="1" bit (I presume it was on the Nested Loop Anti Join). Is this a bug (or a limitation) in the planner, mis-estimating the number of rows? How am I supposed to read into this to come to the same conclusion RhodiumToad did?
Is it simply rows="1" that is supposed to trigger me figuring this out?
I did run VACUUM FULL ANALYZE on all of the tables involved, and this is Postgresql 8.4.
Seeing through issues like this requires some experience with where things can go wrong. But to find issues in query plans, try to validate the produced plan from the inside out: check whether the row-count estimates are sane and whether the cost estimates match the time actually spent. By the way, the two cost estimates aren't lower and upper bounds: the first is the estimated cost to produce the first row of output, and the second is the estimated total cost; see the EXPLAIN documentation for details. There is also some planner documentation available. It also helps to know how the different access methods work; as a starting point, Wikipedia has information on nested loop, hash and merge joins.
In your example, you'd start with:
-> Seq Scan on options io (cost=0.00..20223.32 rows=23004 width=36)
Filter: (name IS NULL)
Run EXPLAIN ANALYZE SELECT * FROM options WHERE name IS NULL; and see if the number of returned rows matches the estimate. Being off by a factor of 2 usually isn't a problem; you're trying to spot order-of-magnitude differences.
Then see whether EXPLAIN ANALYZE SELECT * FROM vehicles WHERE date_sold IS NULL; returns the expected number of rows.
Then go up one level to the hash join:
-> Hash Join (cost=5301.58..29722.32 rows=229 width=40)
Hash Cond: ((io.lot_id = iv.lot_id) AND ((io.vin)::text = (iv.vin)::text))
See if EXPLAIN ANALYZE SELECT * FROM vehicles AS iv INNER JOIN options io ON (io.lot_id = iv.lot_id) AND ((io.vin)::text = (iv.vin)::text) WHERE iv.date_sold IS NULL AND io.name IS NULL; results in 229 rows.
Up one more level adds INNER JOIN options co ON (co.fkey_style = iv.chrome_styleid) AND (co.code = io.code) and is expected to return only one row. This is probably where the issue is, because if the actual number of rows goes from 1 to 100, the total cost estimate of traversing the inner loop of the containing nested loop is off by a factor of 100.
The underlying mistake the planner is making is probably that it expects the two predicates for joining in co to be independent of each other, and so it multiplies their selectivities. In reality they may be heavily correlated, and the combined selectivity is then closer to MIN(s1, s2) than to s1*s2.
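One hedged way to test that hypothesis, re-using the aliases from the plan above: compare estimated vs. actual row counts for the join with one clause against the join with both clauses. If the actual count barely drops when the second clause is added but the estimate collapses, the independence assumption is what is hurting you (count(*) keeps the output small, but the single-clause version can still be expensive to run):
EXPLAIN ANALYZE
SELECT count(*) FROM vehicles iv
JOIN options io ON io.lot_id = iv.lot_id AND io.vin = iv.vin
JOIN options co ON co.fkey_style = iv.chrome_styleid          -- first join clause only
WHERE iv.date_sold IS NULL AND io.name IS NULL;
EXPLAIN ANALYZE
SELECT count(*) FROM vehicles iv
JOIN options io ON io.lot_id = iv.lot_id AND io.vin = iv.vin
JOIN options co ON co.fkey_style = iv.chrome_styleid AND co.code = io.code   -- both clauses
WHERE iv.date_sold IS NULL AND io.name IS NULL;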
Did you ANALYZE the tables? And what does pg_stats have to say about them? The query plan is based on the stats, so these have to be OK. And what version do you use? 8.4?
The costs can be calculated using the stats, the number of relpages, the number of rows, and the settings in postgresql.conf for the Planner Cost Constants.
work_mem is also involved; it might be too low and force the planner to do a seqscan, which kills performance...