PostgreSQL: implementation of a hierarchical tree

I have been struggling with a marker-clustering problem with 1000+ markers (that should be put on a Google map). I am not very keen on rendering large JSON structures with all the markers, nor am I fond of complex server-side "geo" computations with PostGIS.
The solution I came up with is to divide the world map into a hierarchical spatial tree, say a quad tree, where each point in my db is assigned "coordinates" in that tree. These coordinates are strings in which the character at position x is the index of the tile at tier x, e.g. '031232320012'. The length of the string depends on the number of zoom levels that will be enabled for the front-end map. Basically, if a user moves or zooms the map, I'll launch an Ajax GET request with the current zoom level and viewport coordinates as parameters. Then in the back end I plan to build a string that points to the "viewport at the given zoom level", e.g. '02113', and I want to find all points whose tree-coordinates column has this prefix ('02113').
EDIT: I will also need a fast GROUP BY, e.g. SELECT count(*) FROM points GROUP BY left(coordinates, 5);
My question is how to perform these operations as fast as possible? My database is PostgreSQL.
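For illustration, a minimal sketch of how such a tile string could be computed per point; the function name, the 0-3 quadrant numbering, and the plain lat/lng halving (as opposed to a Web-Mercator tile scheme) are all assumptions:
-- Hypothetical helper: encode a lat/lng into a quad-tree tile string,
-- one digit (0-3) per zoom level, by halving the bounding box at each level.
CREATE OR REPLACE FUNCTION quad_key(lat double precision, lng double precision, levels int)
RETURNS text AS $$
DECLARE
    min_lat double precision := -90;
    max_lat double precision :=  90;
    min_lng double precision := -180;
    max_lng double precision :=  180;
    mid_lat double precision;
    mid_lng double precision;
    digit   int;
    key     text := '';
BEGIN
    FOR i IN 1..levels LOOP
        mid_lat := (min_lat + max_lat) / 2;
        mid_lng := (min_lng + max_lng) / 2;
        digit := 0;
        IF lat >= mid_lat THEN      -- northern half of the current tile
            digit := digit + 2;
            min_lat := mid_lat;
        ELSE
            max_lat := mid_lat;
        END IF;
        IF lng >= mid_lng THEN      -- eastern half of the current tile
            digit := digit + 1;
            min_lng := mid_lng;
        ELSE
            max_lng := mid_lng;
        END IF;
        key := key || digit::text;
    END LOOP;
    RETURN key;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
-- e.g., assuming hypothetical lat/lng columns on the points table:
UPDATE points SET coordinates = quad_key(lat, lng, 12);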

Then in the back end I plan to build a string that points to the "viewport at the given zoom level", e.g. '02113', and I want to find all points whose tree-coordinates column has this prefix ('02113').
An ordinary index should perform well on any modern DBMS as long as you're looking at the leftmost five (or six, or seven) characters of a string in an indexed column.
SELECT ...
...
WHERE column_name LIKE '02113%';
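One caveat: if the column's collation is not C, a plain btree index will not be used for LIKE prefix searches. Declaring the operator class explicitly avoids that (the index name here is a placeholder); the expression-index-plus-equality approach below works regardless of collation.
CREATE INDEX your_prefix_idx ON your_table (column_name text_pattern_ops);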
In PostgreSQL, you can also build an index on an expression. So you could create an index on the first five characters.
CREATE INDEX your_index_name ON your_table (left(column_name, 5));
I'd expect PostgreSQL's query optimizer to pick the right index if there were three or four like that. (One for 5 characters, one for 6 characters, etc.)
I built a table and populated it with a million rows of random data.
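For reference, here is a rough reconstruction of that test setup. The table and column names come from the plans below; the data distribution (12 random decimal digits per row) and the mapping of index names to prefix lengths are assumptions:
CREATE TABLE coords (s text);

-- a million rows of 12 random digits each
INSERT INTO coords (s)
SELECT string_agg((floor(random() * 10))::int::text, '')
FROM generate_series(1, 1000000) AS r, generate_series(1, 12) AS d
GROUP BY r;

-- one expression index per prefix length of interest
CREATE INDEX coords_left_idx1 ON coords (left(s, 2));
CREATE INDEX coords_left_idx2 ON coords (left(s, 5));
ANALYZE coords;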
In the following query, PostgreSQL's query optimizer did pick the right index.
explain analyze
select s
from coords
where left(s, 5) = '12345';
It returned in 0.1 ms.
I also tested using GROUP BY. Again, PostgreSQL's query optimizer picked the right index.
"GroupAggregate (cost=0.00..62783.15 rows=899423 width=8) (actual time=91.300..3096.788 rows=90 loops=1)"
" -> Index Scan using coords_left_idx1 on coords (cost=0.00..46540.36 rows=1000000 width=8) (actual time=0.051..2915.265 rows=1000000 loops=1)"
"Total runtime: 3096.914 ms"
An expression like left(s, 2) in the GROUP BY clause will require PostgreSQL to touch every row in the index, if not every row in the table. That's why my query took 3096 ms: it had to touch a million rows in the index. But you can see from the EXPLAIN plan that it used the index.
Ordinarily, I'd expect a geographic application to use a bounding box against a PostGIS table to reduce the number of rows you access. If your quad tree implementation can't do better than that, I'd stick with PostGIS long enough to become an expert with it. (You won't know for sure that it can't do the job until you've spent some time in it.)
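For comparison, a typical PostGIS bounding-box query looks roughly like this (the table name, geometry column, and coordinates are placeholders); the && overlap operator is what a GiST index on the geometry column accelerates:
CREATE INDEX points_geom_gist ON points USING gist (geom);

SELECT id, ST_X(geom) AS lng, ST_Y(geom) AS lat
FROM points
WHERE geom && ST_MakeEnvelope(-0.51, 51.28, 0.33, 51.69, 4326);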

Related

Why does the PostgreSQL planner choose a bad plan, for a few specific values only?

I have a biggish table in postgresql 15.1 — maybe 50 million rows and growing. A column mmsi has about 30k distinct values, so 1000+ rows per mmsi.
My problem is that I have a query that I need to execute repeatedly during DB load, and for certain values of mmsi it takes hundreds of seconds instead of milliseconds. The model query is simply
SELECT max(to_timestamp) FROM track WHERE mmsi = :mmsi
The anomaly is visible in the EXPLAIN output. The bad case (which only happens for a small fraction of mmsi values):
trackdb=# EXPLAIN SELECT max(to_timestamp) FROM track WHERE mmsi = 354710000;
QUERY PLAN
----------
 Result  (cost=413.16..413.17 rows=1 width=8)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.56..413.16 rows=1 width=8)
           ->  Index Scan Backward using ix_track_to_timestamp on track  (cost=0.56..3894939.14 rows=9440 width=8)
                 Index Cond: (to_timestamp IS NOT NULL)
                 Filter: (mmsi = 354710000)
(6 rows)
Good case (the vast majority):
trackdb=# EXPLAIN SELECT max(to_timestamp) FROM track WHERE mmsi = 354710001;
QUERY PLAN
----------
 Aggregate  (cost=1637.99..1638.00 rows=1 width=8)
   ->  Index Scan using ix_track_mmsi on track  (cost=0.44..1635.28 rows=1082 width=8)
         Index Cond: (mmsi = 354710001)
(3 rows)
Now, I notice that the estimated number of rows is larger in the bad case. I cannot see anything in the PostgreSQL statistics (pg_stats.histogram_bounds) to explain this.
The problem seems to change when I ANALYZE the table, in that the specific values that trigger the problem become different. But anyhow, since this is needed during DB load, ANALYZE is not a solution.
I'm stumped. Does anyone have an idea what could be happening?
[Edit: To clarify, I know ways to work around it, for example by materializing the rows before applying max. But not understanding makes me very unhappy.]
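For illustration only (not necessarily the exact workaround used), one way to materialize the rows first is a materialized CTE, which acts as an optimization fence in PostgreSQL 12 and later:
WITH candidates AS MATERIALIZED (
    SELECT to_timestamp FROM track WHERE mmsi = :mmsi
)
SELECT max(to_timestamp) FROM candidates;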
As Laurenz has explained, the problem is that PostgreSQL thinks the approximately 10,000 rows where mmsi = 354710000 are scattered randomly over the values of to_timestamp, and so thinks that by scanning the index over to_timestamp in order, it can stop as soon as it finds the first row meeting mmsi = 354710000, and that this will happen quickly. But if all the mmsi = 354710000 rows are at the wrong end of the index, it does not in fact happen quickly. There is nothing you can do about this in the stats, as there are no "handles" it can grab onto to better inform its thinking. Maybe some future extension to the custom stats feature will do it.
Edit: To clarify, I know ways to work around it, for example by materializing the rows before applying max.
A better workaround would probably be an index on (mmsi, to_timestamp). This would not only fix the cases where the planner currently chooses a very bad plan, it would also substantially improve the cases currently using a tolerable plan, by giving them an even better option. You don't need to rewrite the query, and you can drop the existing index on just mmsi, as there is no reason to have both.
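A sketch of that index (the index name is made up; CONCURRENTLY avoids blocking writes while it is built):
CREATE INDEX CONCURRENTLY track_mmsi_to_timestamp_idx ON track (mmsi, to_timestamp);
DROP INDEX ix_track_mmsi;  -- the single-column index is now redundant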

PostgreSQL: optimal use of a multicolumn index when part of the index is missing from the WHERE clause

I will be having queries on my database with where clauses similar to this:
SELECT * FROM table WHERE a = 'string_value' AND b = 'other_string_value' AND t > <timestamp>
and less often to this:
SELECT * FROM table WHERE a = 'string_value' AND t > <timestamp>
I have created a multicolumn index on a, b and t, in that order. However, I am not sure whether it will be optimal for my second, less frequent, query.
Will this index do an index scan on b, or skip it and move to the t part of the index immediately? (To be honest, I'm not sure how index scans work exactly.) Should I create a second multicolumn index on a and t only, for the second query?
The docs state that
'the index is most efficient when there are constraints on the leading (leftmost) columns'
But the example there doesn't cover my case, where the b equality column is missing from the WHERE clause.
The 2nd query will be much less effective with the btree index on (a,b,t) because the absence of b means t cannot be used efficiently (it can still be used as an in-index filter, but that is not nearly as good as being used as a start/stop point). An index on (a,t) will be able to support the 2nd query much more efficiently.
But that doesn't mean you have to create that index as well. Indexes take space and must be maintained, so are far from free. It might be better to just live with less-than-optimal plans for the 2nd query, since that query is used "less often". On the other hand, you did bother to post about it, so maybe "less often" is still pretty often. So you might be better off just to build the extra index and spend your time worrying about something else.
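For concreteness, the two indexes being weighed would look something like this (names are placeholders):
CREATE INDEX my_table_a_b_t_idx ON my_table (a, b, t);  -- serves the first query fully
CREATE INDEX my_table_a_t_idx ON my_table (a, t);       -- lets the second query use t as a start/stop point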
A btree index can be thought of like a phonebook, which is sorted on last name, then first name, then middle name. Your first query is like searching for "people named Mary Smith with a middle name less than Cathy". You can use binary search to efficiently find the first "Mary Smith", then scan through those entries until the middle name is > 'Cathy', and you are done. Compare that to "people surnamed Smith with a middle name less than Cathy". Now you have to scan all the Smiths: you can't stop at the first middle name > 'Cathy', because any change in first name resets the order of the middle names.
Given that b only has 10 distinct values, you could conceivably use the (a,b,t) index in a skip scan quite efficiently. But PostgreSQL doesn't yet implement skip scans natively. You can emulate them, but that is fragile, ugly, a lot of work, and easy to screw up. Nothing you have said here makes me think it would be worthwhile.

Get all ltree nodes at depth

I have an ltree column containing a tree with a depth of 3. I'm trying to write a query that can select all children at a specific depth (level 1 = get all parents, 2 = get all children, 3 = get all grandchildren). I know this is pretty straightforward with nlevel:
SELECT path FROM hierarchies
WHERE
nlevel(path) = 1
LIMIT 1000;
I have 200,000 dummy records and it's pretty fast (~170 ms). However, this query uses a sequential scan. I think it'd be better to write it in a way that takes advantage of the ltree operators supported by the GiST index. Frustratingly, I can't seem to wrap my brain around them, and I haven't found a similar question on SO or DBA (besides this one on finding leaves)
Any advice is appreciated!
The only index that could support your query is a simple b-tree index on an expression.
create index on hierarchies((nlevel(path)))
Note, however, that it is quite possible for the planner to choose a sequential scan anyway, for example when the number of rows at level 1 is much larger than at the other levels.

Why are so many pages getting read during an index scan (Postgres 11.2)?

We have a Postgres 11.2 database which stores time-series of values against a composite key. Given 1 or a number of keys, the query tries to find the latest value(s) in each time-series given a time constraint.
We suffer query timeouts when the data is not cached, because it seems to have to walk a huge number of pages in order to find the data.
Here is the relevant section in the explain. We are getting the data for a single time-series (with 367 values in this example):
-> Index Scan using quotes_idx on quotes q (cost=0.58..8.61 rows=1 width=74) (actual time=0.011..0.283 rows=367 loops=1)
Index Cond: ((client_id = c.id) AND (quote_detail_id = qd.id) AND (effective_at <= '2019-09-26 00:59:59+01'::timestamp with time zone) AND (effective_at >= '0001-12-31 23:58:45-00:01:15 BC'::timestamp with time zone))
Buffers: shared hit=374
This is the definition of the index in question:
CREATE UNIQUE INDEX quotes_idx ON quotes.quotes USING btree (client_id, quote_detail_id, effective_at);
Where the columns are 2x int4 and a timestamptz, respectively.
Assuming I'm reading the output correctly, why is the engine walking 374 pages (~3 MB, given our 8 kB page size) in order to return ~26 kB of data (367 rows of width 74 bytes)?
When we scale up the number of keys (say, 500) the engine ends up walking over 150k pages (over 1GB), which when not cached, takes a significant time.
Note, the average row size in the underlying table is 82 bytes (over 11 columns), and the table contains around 700 million rows.
Thanks in advance for any insights!
The 367 rows found in your index scan are probably stored in more than 300 table blocks (that is not surprising in a large table). So PostgreSQL has to access all these blocks to come up with a result.
This would perform much better if the rows were all concentrated in a few table blocks; in other words, if the logical ordering of the index corresponded to the physical order of the rows in the table. In PostgreSQL terms, a high correlation would be beneficial.
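You can see how scattered the rows are by checking the correlation statistic for the leading index columns; values near 1 or -1 mean the table order closely matches the column order, values near 0 mean the rows are scattered:
SELECT attname, correlation
FROM pg_stats
WHERE schemaname = 'quotes' AND tablename = 'quotes'
  AND attname IN ('client_id', 'quote_detail_id', 'effective_at');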
You can force PostgreSQL to rewrite the entire table in the correct order with
CLUSTER quotes.quotes USING quotes_idx;
Then your query should become much faster.
There are some disadvantages though:
While CLUSTER is running, the table is not accessible. This usually means down time.
Right after CLUSTER, performance will be good, but PostgreSQL does not maintain the ordering. Subsequent data modifications will reduce the correlation.
To keep the query performing well, you'll have to schedule CLUSTER regularly.
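Once CLUSTER ... USING has been run, PostgreSQL remembers the clustering index for the table, so the scheduled job only needs the short form:
CLUSTER quotes.quotes;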
Reading 374 blocks to obtain 367 rows is not unexpected. CLUSTERing the data is one way to address that, as already mentioned. Another possibility is to add some more columns to the index column list (by creating a new index and dropping the old one), so that the query can be satisfied with an index-only scan.
This requires no down time if the index is created concurrently. You do have to keep the table well vacuumed, which can be tricky, as the autovacuum parameters were really not designed with index-only scans in mind. Other than the vacuuming it requires no maintenance, so I would prefer this method if the list (and size) of columns you need to add to the index is small.
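As a sketch, assuming (hypothetically) that the query only selects a price column besides the key columns, a covering index built concurrently might look like this (INCLUDE is available from PostgreSQL 11):
CREATE UNIQUE INDEX CONCURRENTLY quotes_covering_idx
ON quotes.quotes USING btree (client_id, quote_detail_id, effective_at)
INCLUDE (price);
-- once the new index is valid, the old quotes_idx can be dropped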

Will it make a noticeable difference in performance/index size if my index is made partial?

Say I have a table with 20 million rows I want to index like so:
CREATE INDEX fruit_color
ON fruits
USING btree
(color);
Now let's say that only 2% of the fruits have a color and the rest will be NULL. My queries will NEVER look for fruits whose color is NULL (no color), so the question is: will it make a difference for PostgreSQL if I change the index to:
CREATE INDEX fruit_color
ON fruits
USING btree
(color)
WHERE color IS NOT NULL;
I don't know much about PostgreSQL's internal handling of indexes, which is why I ask.
PS: the PostgreSQL version is 9.2.
Yes, that will make a difference. How much of a difference depends on how the index is used.
If there is only one fruit of a certain color, and you search for that fruit by color, it won't make much of a difference; maybe one page fewer will be accessed (because the smaller index may have one level less depth).
If there are many fruits of a certain color, the improvement will be great, because it will be much cheaper to scan the whole index (for a bitmap index scan) or a bigger part of it (for a regular or index-only scan).
If the index is big, PostgreSQL will be more reluctant to scan the complete index and will probably choose a sequential table scan instead.
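One caveat with the partial index: the planner can only use it when the query provably implies the index predicate. An equality test on color does; a search for NULL, which you say you never run, would not use it anyway:
SELECT * FROM fruits WHERE color = 'red';   -- can use the partial index (color = 'red' implies color IS NOT NULL)
SELECT * FROM fruits WHERE color IS NULL;   -- cannot use it; NULLs are not in this index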