I currently have a large table mivehdetailedtrajectory (25B rows) and a small table cell_data_tower (400 rows) that I need to join using PostGIS. Specifically, I need to run this query:
SELECT COUNT(traj.*), tower.id
FROM cell_data_tower tower LEFT OUTER JOIN mivehdetailedtrajectory traj
ON ST_Contains(tower.geom, traj.location)
GROUP BY tower.id
ORDER BY tower.id;
It errors out angry that it can't write to disk. This seemed weird for a SELECT so I ran EXPLAIN:
NOTICE: gserialized_gist_joinsel: jointype 1 not supported
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Sort (cost=28905094882.25..28905094883.25 rows=400 width=120)
Sort Key: tower.id
-> HashAggregate (cost=28905094860.96..28905094864.96 rows=400 width=120)
-> Nested Loop Left Join (cost=0.00..28904927894.80 rows=33393232 width=120)
Join Filter: ((tower.geom && traj.location) AND _st_contains(tower.geom, traj.location))
-> Seq Scan on cell_data_tower tower (cost=0.00..52.00 rows=400 width=153)
-> Materialize (cost=0.00..15839886.96 rows=250449264 width=164)
-> Seq Scan on mivehdetailedtrajectory traj (cost=0.00..8717735.64 rows=250449264 width=164)
I don't understand why postgres thinks it should materialize the inner table. Also, I don't understand the plan in general to be honest. Seems like it should keep the cell_data_tower table in memory and iterate over the mivehdetailedtrajectory table. Any thoughts on how I can optimize this to (a) run, (b) do so in a reasonable amount of time. Specifically, it seems like this should be do-able in less than 1 day.
Edit: Postgres version 9.3
Queries that need a lot of memory are those rare places where correlated subqueries perform better (LATERAL JOIN should work too but those are beyond me). Also please note you didn't select tower.id so your result wouldn't be too useful.
SELECT tower.id, (SELECT COUNT(traj.*)
FROM mivehdetailedtrajectory traj
WHERE ST_Contains(tower.geom, traj.location))
FROM cell_data_tower tower
ORDER BY tower.id;
Try running it with LIMIT 1 first. The total runtime should be the runtime for one tower * number of towers.
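The LATERAL form mentioned above might look roughly like the sketch below (LATERAL is available from PostgreSQL 9.3). Either form only avoids one full pass over the trajectory table per tower if there is a spatial index on traj.location. The plan in the question shows only sequential scans, so the CREATE INDEX statement is an assumption about what is missing, and the index name is made up. Building it on a table this size will itself take a long time.

```sql
-- Assumption: no GiST index exists yet on the point column.
-- The index name is hypothetical.
CREATE INDEX mivehdetailedtrajectory_location_gist
    ON mivehdetailedtrajectory USING gist (location);

-- LATERAL equivalent of the correlated subquery: one count per tower.
-- The aggregate always returns exactly one row, so towers that contain
-- no points still appear, with a count of 0.
SELECT tower.id, c.n
FROM cell_data_tower AS tower
CROSS JOIN LATERAL (
    SELECT COUNT(*) AS n
    FROM mivehdetailedtrajectory AS traj
    WHERE ST_Contains(tower.geom, traj.location)
) AS c
ORDER BY tower.id;
```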
My database isn't as big as yours, only 80M rows. In my case I created a LinkID field that records where each geom belongs, and I calculate the closest LinkID when I insert a new record.
When I found out that a single LinkID lookup takes 30 ms, and that doing it 80M times would take 27 days, I switched to precalculating those values.
Also, I don't keep all the records; I only keep one month at any time.
PostgreSQL version 10
Windows 10
16GB RAM
SSD
I'm ashamed to admit that, despite searching the hundred years of PG support archives, I cannot figure out this most basic problem. But here it is...
I have big_table with 45 million rows and little_table with 12,000 rows. I need to do a left join to include all big_table rows, along with the ids of the little_table rows where big_table's timestamp falls between the two timestamps in little_table.
This doesn't seem like it should be an extreme operation for PG, but it is taking 2 1/2 hours!
Any ideas on what I can do here? Or do you think I have unwittingly come up against the limitations of my software/hardware combo given the table size?
Thanks!
little_table with 12,000 rows
CREATE TABLE public.little_table
(
id bigint,
start_time timestamp without time zone,
stop_time timestamp without time zone
);
CREATE INDEX idx_little_table
ON public.little_table USING btree
(start_time, stop_time DESC);
big_table with 45 million rows
CREATE TABLE public.big_table
(
id bigint,
datetime timestamp without time zone
) ;
CREATE INDEX idx_big_table
ON public.big_table USING btree
(datetime);
Query
explain analyze
select
bt.id as bt_id,
lt.id as lt_id
from
big_table bt
left join
little_table lt
on
(bt.datetime between lt.start_time and lt.stop_time)
Explain Results
Nested Loop Left Join (cost=0.29..3260589190.64 rows=64945831346 width=16) (actual time=0.672..9163998.367 rows=1374445323 loops=1)
-> Seq Scan on big_table bt (cost=0.00..694755.92 rows=45097792 width=16) (actual time=0.014..10085.746 rows=45098790 loops=1)
-> Index Scan using idx_little_table on little_table lt (cost=0.29..57.89 rows=1440 width=24) (actual time=0.188..0.199 rows=30 loops=45098790)
Index Cond: ((bt.datetime >= start_time) AND (bt.datetime <= stop_time))
Planning time: 0.165 ms
Execution time: 9199473.052 ms
NOTE: My actual query criteria is a bit more complex, but this seems to be the root of the problem. If I can fix this part, I think I can fix the rest.
This query cannot perform any faster.
Since there is no equality operator (=) in your join condition, the only strategy left to PostgreSQL is a nested loop join. 45 million repetitions of an index scan on the small table just take a while.
I would suggest trying to change the start_time and stop_time columns in the little table to a single tsrange column. According to the docs, this data type supports a GiST index, which can speed up the "range contains element" operator @>. Maybe this will do better than the index scan on your current btree.
Generating 1.3 billion rows seems pretty extreme to me. How often do you need to do this, and how fast do you need it to be?
To explain a bit about your current plan:
Index Cond: ((bt.datetime >= start_time) AND (bt.datetime <= stop_time))
While it is not obvious from what is displayed above, this always scans about half the index. It starts at the beginning of the index and stops once start_time > bt.datetime, using bt.datetime <= stop_time as an in-index filter that needs to examine each row before rejecting it.
To flesh out Bergi's answer, you could do this:
alter table little_table add range tsrange;
update little_table set range =tsrange(start_time,stop_time,'[]');
create index on little_table using gist(range);
select
bt.id as bt_id,
lt.id as lt_id
from
big_table bt
left join
little_table lt
on
(bt.datetime <@ lt.range)
In my hands, that is about 4 times faster than your current method.
If your join did not need to be a left join, you could get some more efficient operations by joining the tables in the opposite order. Perhaps you could get better performance by separating this into two operations, an inner join and then a probe for missing values, and combining the results.
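A minimal sketch of that two-step idea, using the original start_time/stop_time columns. Whether it is actually faster depends on the plans PostgreSQL picks for each half, so treat it as something to test rather than a guaranteed improvement.

```sql
-- Step 1: inner join. The planner is now free to drive the join from
-- little_table and probe the btree index on big_table(datetime).
SELECT bt.id AS bt_id, lt.id AS lt_id
FROM little_table AS lt
JOIN big_table AS bt
  ON bt.datetime BETWEEN lt.start_time AND lt.stop_time

UNION ALL

-- Step 2: probe for big_table rows that matched nothing, which the
-- left join would have returned with a NULL lt_id.
SELECT bt.id AS bt_id, NULL::bigint AS lt_id
FROM big_table AS bt
WHERE NOT EXISTS (
    SELECT 1
    FROM little_table AS lt
    WHERE bt.datetime BETWEEN lt.start_time AND lt.stop_time
);
```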
I have two tables in my database (address and person_address). Address has a PK on address_id. person_address has a PK on (address_id, person_id, usage_code).
When joining these two tables through address_id, my expectation is that the PK index is used in both cases. However, Postgres is adding sort and materialize steps to the plan, which slows down the execution of the query. I have tried dropping indexes (person_address had an index on address_id) and analyzing stats, without success.
I will appreciate any help on how to isolate this situation, since those queries run slower than expected in our production environment.
This is the query:
select *
from person_addresses pa
join address a
on pa.address_id = a.address_id
This is the plan :
Merge Join (cost=1506935.96..2416648.39 rows=16033774 width=338)
Merge Cond: (pa.address_id = ((a.address_id)::numeric))
-> Index Scan using person_addresses_pkey on person_addresses pa (cost=0.43..592822.76 rows=5256374 width=104)
-> Materialize (cost=1506935.53..1526969.90 rows=4006874 width=234)
-> Sort (cost=1506935.53..1516952.71 rows=4006874 width=234)
Sort Key: ((a.address_id)::numeric)
-> Seq Scan on address a (cost=0.00..163604.74 rows=4006874 width=234)
Thanks.
Edit 1. After the comment, I checked the data types and found a discrepancy. Fixing the data type changed the plan to the following:
Hash Join (cost=343467.18..881125.47 rows=5256374 width=348)
Hash Cond: (pa.address_id = a.address_id)
-> Seq Scan on person_addresses pa (cost=0.00..147477.74 rows=5256374 width=104)
-> Hash (cost=159113.97..159113.97 rows=4033697 width=244)
-> Seq Scan on address_normalization a (cost=0.00..159113.97 rows=4033697 width=244)
The performance improvement is evident in the plan, but I am wondering if the sequential scans are expected without any filters.
So there are two questions here:
why did Postgres choose the (expensive) "Merge Join" in the first query?
The reason for this is that it could not use the more efficient "Hash Join" because the hash values of integer and numeric values would be different. But the Merge join requires that the values are sorted, and that's where the "Sort" step comes from in the first execution plan. Given the number of rows a "Nested Loop" would have been even more expensive.
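One way to spot such a mismatch is to compare the declared types of the join columns. The cast in the first plan, ((a.address_id)::numeric), suggests that person_addresses.address_id is numeric while address.address_id is an integer type. The ALTER TABLE below is only a sketch under that assumption; the question does not say which column was actually changed.

```sql
-- List the declared type of address_id in both tables.
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE column_name = 'address_id'
  AND table_name IN ('address', 'person_addresses');

-- Hypothetical fix, assuming person_addresses.address_id is numeric,
-- address.address_id is integer, and every stored value is a whole
-- number that fits in an integer. This rewrites the table and its
-- indexes and holds an exclusive lock while it runs.
ALTER TABLE person_addresses
    ALTER COLUMN address_id TYPE integer;
```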
The second question is:
I am wondering if the sequential scans are expected without any filters
Yes they are expected. The query retrieves all matching rows from both tables and that is done most efficiently by scanning all rows. An index scan requires about 2-3 I/O operations per row that has to be retrieved. A sequential scan usually requires less than one I/O operation as one block (which is the smallest unit the database reads from the disk) contains multiple rows.
You can run explain (analyze, buffers) to see how many "logical reads" each step takes.
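Applied to the query from the question, that would be something like the following. Note that ANALYZE actually executes the query, so it takes as long as the real thing.

```sql
-- Runs the join for real and reports, for every plan node, the actual
-- timings plus shared-buffer hits and blocks read.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM person_addresses AS pa
JOIN address AS a
  ON pa.address_id = a.address_id;
```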
I am running Postgres 9.4.4 on an Amazon RDS db.r3.4xlarge instance
- 16CPUs, 122GB Memory.
I recently came across a query that needed a fairly straightforward aggregation on a large table (~270 million records). The query takes over 5 hours to execute.
The joining column and the grouping column on the large table have indexes defined. I have tried experimenting with work_mem and temp_buffers by setting each to 1GB, but it didn't help much.
Here's the query and the execution plan. Any leads will be highly appreciated.
explain SELECT
largetable.column_group,
MAX(largetable.event_captured_dt) AS last_open_date,
.....
FROM largetable
LEFT JOIN smalltable
ON smalltable.column_b = largetable.column_a
WHERE largetable.column_group IS NOT NULL
GROUP BY largetable.column_group
Here is the execution plan -
GroupAggregate (cost=699299968.28..954348399.96 rows=685311 width=38)
Group Key: largetable.column_group
-> Sort (cost=699299968.28..707801354.23 rows=3400554381 width=38)
Sort Key: largetable.column_group
-> Merge Left Join (cost=25512.78..67955201.22 rows=3400554381 width=38)
Merge Cond: (largetable.column_a = smalltable.column_b)
-> Index Scan using xcrmstg_largetable_launch_id on largetable (cost=0.57..16241746.24 rows=271850823 width=34)
Filter: (column_a IS NOT NULL)
-> Sort (cost=25512.21..26127.21 rows=246000 width=4)
Sort Key: smalltable.column_b
-> Seq Scan on smalltable (cost=0.00..3485.00 rows=246000 width=4)
You say the joining key and the grouping key on the large table are indexed, but you don't mention the joining key on the small table.
The merges and sorts are a big source of slowness. However, I'm also worried that you're returning ~700,000 rows of data. Is that really useful to you? What's the situation where you need to return that much data, but a 5 hour wait is too long? If you don't need all that data coming out, then filtering as early as possible is by far and away the largest speed gain you'll realize.
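To check the first point, you could list the indexes that exist on the small table and add one on the join key if it is missing. This is only a sketch: the index name is made up, and an index on smalltable.column_b would at best remove the small Sort node under the merge join, not the large Sort above it.

```sql
-- Which indexes does the small table have at the moment?
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'smalltable';

-- Hypothetical index on the join key; skip it if the query above
-- already shows an index on column_b.
CREATE INDEX smalltable_column_b_idx ON smalltable (column_b);

-- Refresh planner statistics for the table.
ANALYZE smalltable;
```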
I've been looking for a straight, clean answer to this question. Let's say I have a photo table.
Now this table has 1,000,000 rows. Let's do the following query:
SELECT * FROM photos ORDER BY creation_time LIMIT 10;
Will this query grab all 1,000,000 rows and then give me 10, or does it just grab the first 10 in creation_time order? I'm quite curious as to how this works, because if it does grab 1,000,000 (mind you, this table is constantly growing) then it's a wasteful query. You're basically throwing 999,990 rows away. Is there a more efficient way to do this?
Whether the database has to scan the whole table or not depends on a number of
factors - in the case you describe the main factors are whether there is an ORDER BY
clause and whether there is an index on the sort field(s).
All is revealed by looking at the query plan, and the cost approximations on each
of the operations. Consider the case where there is no ordering clause:
testdb=> explain select * from bigtable limit 10;
QUERY PLAN
---------------------------------------------------------------------------
Limit (cost=0.00..0.22 rows=10 width=39)
-> Seq Scan on bigtable (cost=0.00..6943.06 rows=314406 width=39)
(2 rows)
The planner has decided that a sequential scan is the way to go. The expected cost
already gives us a clue. It is expressed as a range, 0.00..6943.06. The first number
(0.00) is the amount of work the database expects to have to do before it can deliver
any rows, while the second number is an estimate of the work required to deliver
the whole scan.
Thus, the input to the 'Limit' clause is going to start straight away, and it will
not have to process the full output of the sequential scan (since the total cost
is only 0.22, not 6943.06). So it definitely will not have to read the whole table
and discard most of it.
Now let's see what happens if you add an ORDER BY clause, using a column that is not
indexed.
testdb=> explain select * from bigtable ORDER BY title limit 10;
QUERY PLAN
---------------------------------------------------------------------------------
Limit (cost=13737.26..13737.29 rows=10 width=39)
-> Sort (cost=13737.26..14523.28 rows=314406 width=39)
Sort Key: title
-> Seq Scan on bigtable (cost=0.00..6943.06 rows=314406 width=39)
(4 rows)
We have a similar plan, but there is a 'Sort' operation in between the seq scan
and the limit. It has to scan the complete table, sort the full content of it,
and only then can it start delivering rows to the Limit clause. It makes sense
when you think about it - LIMIT is supposed to apply after ORDER BY; so it would
have to be sure to have found the top 10 rows in the whole table.
Now what happens when an index is used? Suppose we have a 'time' column which is
indexed:
testdb=> explain select * from bigtable ORDER BY time limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.35 rows=10 width=39)
-> Index Scan using bigtable_time_idx on bigtable (cost=0.00..10854.96 rows=314406 width=39)
(2 rows)
An index scan, using the time index, is able to start delivering rows in already
sorted order (cost starts at 0.00). The LIMIT can cut the query short after
only 10 rows, so the overall cost is very small.
The moral to the story is to carefully choose which columns or combinations of
columns you will index. You can't add them indiscriminately because adding an
index has a cost of its own - it makes it more expensive to insert, update or
delete records.
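For the photos table in the question, that could look like the sketch below. The index name is hypothetical, and DESC is used on the assumption that the newest photos are the ones wanted; a btree index can be scanned in either direction, so the same index also serves the ascending query.

```sql
-- Hypothetical index on the sort column.
CREATE INDEX photos_creation_time_idx ON photos (creation_time);

-- With the index in place the plan should show a Limit over an index
-- scan (backward, because of DESC) instead of a Sort over a Seq Scan.
EXPLAIN
SELECT *
FROM photos
ORDER BY creation_time DESC
LIMIT 10;
```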
I spent over an hour today puzzling myself over a query plan that I couldn't understand. The query was an UPDATE and it just wouldn't run at all. Totally deadlocked: pg_locks showed it wasn't waiting for anything either. Now, I don't consider myself the best or worst query plan reader, but I find this one exceptionally difficult. I'm wondering how does one read these? Is there a methodology that the Pg aces follow in order to pinpoint the error?
I plan on asking another question as to how to work around this issue, but right now I'm speaking specifically about how to read these types of plans.
QUERY PLAN
--------------------------------------------------------------------------------------------
Nested Loop Anti Join (cost=47680.88..169413.12 rows=1 width=77)
Join Filter: ((co.fkey_style = v.chrome_styleid) AND (co.name = o.name))
-> Nested Loop (cost=5301.58..31738.10 rows=1 width=81)
-> Hash Join (cost=5301.58..29722.32 rows=229 width=40)
Hash Cond: ((io.lot_id = iv.lot_id) AND ((io.vin)::text = (iv.vin)::text))
-> Seq Scan on options io (cost=0.00..20223.32 rows=23004 width=36)
Filter: (name IS NULL)
-> Hash (cost=4547.33..4547.33 rows=36150 width=24)
-> Seq Scan on vehicles iv (cost=0.00..4547.33 rows=36150 width=24)
Filter: (date_sold IS NULL)
-> Index Scan using options_pkey on options co (cost=0.00..8.79 rows=1 width=49)
Index Cond: ((co.fkey_style = iv.chrome_styleid) AND (co.code = io.code))
-> Hash Join (cost=42379.30..137424.09 rows=16729 width=26)
Hash Cond: ((v.lot_id = o.lot_id) AND ((v.vin)::text = (o.vin)::text))
-> Seq Scan on vehicles v (cost=0.00..4547.33 rows=65233 width=24)
-> Hash (cost=20223.32..20223.32 rows=931332 width=44)
-> Seq Scan on options o (cost=0.00..20223.32 rows=931332 width=44)
(17 rows)
The issue with this query plan - I believe I understand - is probably best said by RhodiumToad (he is definitely better at this, so I'll bet on his explanation being better) of irc://irc.freenode.net/#postgresql:
oh, that plan is potentially disastrous
the problem with that plan is that it's running a hugely expensive hashjoin for each row
the problem is the rows=1 estimate from the other join and
the planner thinks it's ok to put a hugely expensive query in the inner path of a nestloop where the outer path is estimated to return only one row.
since, obviously, by the planner's estimate the expensive part will only be run once
but this has an obvious tendency to really mess up in practice
the problem is that the planner believes its own estimates
ideally, the planner needs to know the difference between "estimated to return 1 row" and "not possible to return more than 1 row"
but it's not at all clear how to incorporate that into the existing code
He goes on to say:
it can affect any join, but usually joins against subqueries are the most likely
Now when I read this plan the first thing I noticed was the Nested Loop Anti Join, this had a cost of 169,413 (I'll stick to upper bounds). This Anti-Join breaks down to the result of a Nested Loop at cost of 31,738, and the result of a Hash Join at a cost of 137,424. Now, the 137,424, is much greater than 31,738 so I knew the problem was the Hash Join.
Then I proceeded to EXPLAIN ANALYZE the Hash Join segment outside of the query. It executed in 7 secs. I made sure there were indexes on (lot_id, vin), and (co.code, and v.code) -- there were. I disabled seq_scan and hashjoin individually and noticed a speed increase of less than 2 seconds. Not nearly enough to account for why it wasn't progressing after an hour.
But after all this, I'm totally wrong! Yes, it was the slower part of the query, but the real problem was the rows=1 bit (I presume it was on the Nested Loop Anti Join). Is this a bug (lack of ability) in the planner, mis-estimating the number of rows? How am I supposed to read into this to come to the same conclusion RhodiumToad did?
Is it simply rows="1" that is supposed to trigger me figuring this out?
I did run VACUUM FULL ANALYZE on all of the tables involved, and this is Postgresql 8.4.
Seeing through issues like this requires some experience of where things can go wrong. To find issues in a query plan, validate it from the inside out: check whether the row estimates are sane and whether the cost estimates match the time actually spent. By the way, the two cost numbers aren't lower and upper bounds. The first is the estimated cost to produce the first row of output and the second is the estimated total cost; see the EXPLAIN documentation for details (there is also some planner documentation available). It also helps to know how the different access methods work. As a starting point, Wikipedia has information on nested loop, hash and merge joins.
In your example, you'd start with:
-> Seq Scan on options io (cost=0.00..20223.32 rows=23004 width=36)
Filter: (name IS NULL)
Run EXPLAIN ANALYZE SELECT * FROM options WHERE name IS NULL; and see if the number of returned rows matches the estimate. Being off by a factor of 2 isn't usually a problem; you're trying to spot order-of-magnitude differences.
Then see whether EXPLAIN ANALYZE SELECT * FROM vehicles WHERE date_sold IS NULL; returns the expected number of rows.
Then go up one level to the hash join:
-> Hash Join (cost=5301.58..29722.32 rows=229 width=40)
Hash Cond: ((io.lot_id = iv.lot_id) AND ((io.vin)::text = (iv.vin)::text))
See if EXPLAIN ANALYZE SELECT * FROM vehicles AS iv INNER JOIN options io ON (io.lot_id = iv.lot_id) AND ((io.vin)::text = (iv.vin)::text) WHERE iv.date_sold IS NULL AND io.name IS NULL; results in 229 rows.
Up one more level adds INNER JOIN options co ON (co.fkey_style = iv.chrome_styleid) AND (co.code = io.code) and is expected to return only one row. This is probably where the issue is, because if the actual number of rows goes from 1 to 100, the total cost estimate of traversing the inner loop of the containing nested loop is off by a factor of 100.
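Written out in full, the check for that level would be roughly the query below, assembled from the join conditions shown in the plan. The number to compare is the actual row count of the top node against the planner's rows=1.

```sql
-- Third level of the plan: the hash join of vehicles and options,
-- joined again to options through the primary key index.
-- The estimate is rows=1; a much larger actual count confirms
-- the misestimate.
EXPLAIN ANALYZE
SELECT *
FROM vehicles AS iv
INNER JOIN options AS io
        ON io.lot_id = iv.lot_id
       AND io.vin::text = iv.vin::text
INNER JOIN options AS co
        ON co.fkey_style = iv.chrome_styleid
       AND co.code = io.code
WHERE iv.date_sold IS NULL
  AND io.name IS NULL;
```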
The underlying mistake that the planner is making is probably that it expects that the two predicates for joining in co are independent of each other and multiplies their selectivities. While in reality they may be heavily correlated and the selectivity is closer to MIN(s1, s2) not s1*s2.
Did you ANALYZE the tables? And what does pg_stats have to say about these tables? The query plan is based on the stats, so these have to be OK. And what version do you use? 8.4?
The costs can be calculated by using the stats, the amount of relpages, amount of rows and the settings in postgresql.conf for the Planner Cost Constants.
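Those inputs can be inspected directly from the system catalogs. The following is a sketch that uses the table and column names from the plan above; which columns matter most is a guess on my part.

```sql
-- Page and row counts the planner works from.
SELECT relname, relpages, reltuples
FROM pg_class
WHERE relname IN ('options', 'vehicles');

-- Per-column statistics for the columns used in the filters and joins.
SELECT tablename, attname, null_frac, n_distinct, correlation
FROM pg_stats
WHERE tablename IN ('options', 'vehicles')
  AND attname IN ('name', 'date_sold', 'lot_id', 'vin', 'code', 'fkey_style');

-- Current planner cost constants and work_mem.
SELECT name, setting
FROM pg_settings
WHERE name IN ('seq_page_cost', 'random_page_cost', 'cpu_tuple_cost',
               'cpu_index_tuple_cost', 'cpu_operator_cost', 'work_mem');
```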
work_mem is also involved, it might be too low and force the planner to do a seqscan, to kill performance...