Given the table
create table a (x int, y int);
create index a_x_y on a(x, y);
I would expect a query like select distinct x from a where y = 1 to use only the index; instead, it uses the index to filter by y and then does a Bitmap Heap Scan to get all values of x.
---------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=15.03..15.05 rows=2 width=4) (actual time=0.131..0.131 rows=0 loops=1)
   ->  Bitmap Heap Scan on a  (cost=4.34..15.01 rows=11 width=4) (actual time=0.129..0.129 rows=0 loops=1)
         Recheck Cond: (y = 1)
         ->  Bitmap Index Scan on a_x_y  (cost=0.00..4.33 rows=11 width=0) (actual time=0.125..0.125 rows=0 loops=1)
               Index Cond: (y = 1)
What kind of index would be needed for this type of query?
Since you're filtering on the second column of the index, it won't be used for a direct index scan. If you change the index to be on y,x instead of x,y, it might give you the scan you're looking for.
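For example, a minimal sketch against the table from the question (the index name a_y_x is just illustrative):
create index a_y_x on a (y, x);
explain select distinct x from a where y = 1;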
Also, you may very well get a different query plan if you put actual data in your table, so you should do your testing with realistic data.
Finally, I think you are misunderstanding the bitmap scan nodes. A Bitmap Heap Scan doesn't mean it's doing a full scan of the heap. It uses the index to find out which pages contain interesting rows, and then, in a second step, scans only those pages of the table.
The bitmap heap scan takes 0.129 milliseconds, isn't that fast enough?
If you are thinking about an "index only scan", PostgreSQL cannot yet do that.
I am trying to define a custom function, and I want to find out how to calculate its estimated cost.
https://www.postgresql.org/docs/current/sql-createfunction.html
I tried giving different values for the COST parameter, but I couldn't figure out how to estimate that cost.
If I cared enough to bother, I would do it experimentally.
For example, if your function takes double precision, you could compare:
explain analyze select sqrt(x::double precision) from generate_series(1,1000000) f(x);
to
explain analyze select your_func(x::double precision) from generate_series(1,1000000) f(x);
And then find the COST setting that makes the ratio of the cost estimates roughly match the ratio of the actual times.
You could try to subtract the baseline costs of the generate_series and the cast, but if the added time of your function is so small that it warrants such precision, then it is probably small enough to just make the cost 1 and not worry about it.
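Once you have settled on a value (the number below is purely hypothetical), you can attach it with ALTER FUNCTION and re-run the comparison above to check that the estimate ratios line up:
-- 25 is a made-up calibration result; your_func is the placeholder name from the example above.
ALTER FUNCTION your_func(double precision) COST 25;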
The formula PostgreSQL uses to calculate the EXPLAIN cost is shown below, together with some EXPLAIN examples; note that you have to sum it over everything belonging to the table (indexes, foreign keys, sequences).
Reference: https://postgrespro.com/blog/pgsql/5969403
SELECT relname,
       relpages * current_setting('seq_page_cost')::numeric +
       reltuples * current_setting('cpu_tuple_cost')::numeric AS cost
FROM pg_class
--WHERE relname = 'tablename';
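As a rough sanity check (with a hypothetical table name), the Seq Scan cost reported by EXPLAIN should come out close to that formula for the table itself:
-- relpages * seq_page_cost + reltuples * cpu_tuple_cost should roughly match
-- the Seq Scan total cost shown by EXPLAIN for the same table.
EXPLAIN SELECT * FROM tablename;
SELECT relpages * current_setting('seq_page_cost')::numeric +
       reltuples * current_setting('cpu_tuple_cost')::numeric AS estimated_cost
FROM pg_class
WHERE relname = 'tablename';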
You can use EXPLAIN to see the estimated cost of each query; note that this value is static and based on the objects used.
CREATE OR REPLACE FUNCTION a() RETURNS SETOF INTEGER AS $$
SELECT 1;
$$
LANGUAGE SQL;
EXPLAIN SELECT * FROM a() CROSS JOIN (Values(1),(2),(3)) as foo;
 Nested Loop  (cost=0.25..47.80 rows=3000 width=8)
   ->  Function Scan on a  (cost=0.25..10.25 rows=1000 width=4)
   ->  Materialize  (cost=0.00..0.05 rows=3 width=4)
         ->  Values Scan on "*VALUES*"  (cost=0.00..0.04 rows=3 width=4)
(4 rows)
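To see how the declared COST feeds into that estimate, you can raise it and re-run EXPLAIN; the Function Scan cost should grow accordingly (the value 1000 is arbitrary, chosen only for illustration):
-- The default COST for a non-C function is 100; raising it increases the
-- Function Scan estimate (exact numbers depend on your cost settings).
ALTER FUNCTION a() COST 1000;
EXPLAIN SELECT * FROM a() CROSS JOIN (Values(1),(2),(3)) as foo;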
If two functions with COST 0.0001 and COST 10000 both appear in the predicate of the same SELECT statement, the query planner will evaluate the 0.0001-cost function first and the 10000-cost condition only later, as you can see in the example below.
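For reference, a minimal, hypothetical way such a pair of functions could be defined (the bodies are trivial placeholders; only the declared COST differs):
CREATE FUNCTION fast_function(text) RETURNS boolean
LANGUAGE plpgsql COST 0.0001 AS $$ BEGIN RETURN true; END; $$;
CREATE FUNCTION slow_function(text) RETURNS boolean
LANGUAGE plpgsql COST 10000 AS $$ BEGIN RETURN true; END; $$;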
EXPLAIN SELECT * FROM pg_language WHERE lanname ILIKE '%sql%' AND slow_function(lanname) AND fast_function(lanname);
                               QUERY PLAN
-------------------------------------------------------------------------
 Seq Scan on pg_language  (cost=0.00..101.05 rows=1 width=114)
   Filter: (fast_function(lanname) AND (lanname ~~* '%sql%'::text) AND slow_function(lanname))
(2 rows)
I am trying to run a spatial query on a large table (50k rows). The table contains the data for shortest-path determination using pgr_astar(), following https://docs.pgrouting.org/latest/en/pgr_aStar.html.
I prepared the query (below) and everything works fine, but I have to run it several hundred times for different parameters, so my question is: how can I optimize this?
The query takes about 40-50 s.
SELECT * FROM pgr_astar('SELECT id, source, target, cost, reverse_co, x1, y1, x2, y2 FROM public."500m_g"', 679009, 631529, directed := false, heuristic := 5)
The EXPLAIN output shows:
Function Scan on pgr_astar (cost=0.25..10.25 rows=1000 width=40)
Is unordered_window guaranteed to provide elements in the same order when used repeatedly in a SELECT? I was hoping to avoid the cost of ordering, since the order is irrelevant as long as it is the same.
I.e., will xs[i] and ys[i] always be elements from the same row in xyz?
select
array_agg(x) over unordered_window as xs,
array_agg(y) over unordered_window as ys
from
xyz
window
unordered_window as (partition by z);
Use EXPLAIN (VERBOSE, COSTS OFF) to see what happens:
                         QUERY PLAN
═══════════════════════════════════════════════════════════
 WindowAgg
   Output: array_agg(x) OVER (?), array_agg(y) OVER (?), z
   ->  Sort
         Output: z, x, y
         Sort Key: xyz.z
         ->  Seq Scan on laurenz.xyz
               Output: z, x, y
There is only a single sort, so we can deduce that the order will be the same.
But that is not guaranteed, so it is possible (albeit unlikely) that the implementation will change.
You can see, though, that a sort is performed anyway. Adding an ORDER BY to the window definition merely adds another sort key, which won't slow down execution much, so you might as well add it and be safe.
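For example, a sketch of the same query with an explicit order inside the window (ordering by x merely as an example key; the explicit frame keeps the whole partition in each array):
select
    array_agg(x) over w as xs,
    array_agg(y) over w as ys
from
    xyz
window
    w as (partition by z order by x
          rows between unbounded preceding and unbounded following);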
What performs better?
I have a table containing:
id BIGSERIAL,
geog GEOGRAPHY,
longitude DOUBLE PRECISION,
latitude DOUBLE PRECISION,
area GEOGRAPHY
geog is generated by ST_MakePoint(longitude, latitude). I can extract the longitude and latitude from geog with ST_X and ST_Y when I need them, but I don't know whether simply storing the longitude and latitude for reuse in queries is better for performance.
geog and area are used for calculating results like items within, nearest, overlaps, etc. Clients typically just need a filtered list of items and the lon/lat.
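For reference, a minimal sketch of how geog might be kept in sync with the stored columns (assuming SRID 4326; the table and column names are the ones used in the tests below):
UPDATE location
SET geog = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326)::geography;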
After running some tests, it looks like storing lon/lat is slightly faster (for my limited dataset). I ran them a few times and the results favor storing.
Storing lon/lat
explain analyze
select
longitude,
latitude
from location;
-- 'Seq Scan on location (cost=0.00..3.04 rows=104 width=16) (actual time=0.013..0.032 rows=104 loops=1)'
-- 'Planning time: 0.062 ms'
-- 'Execution time: 0.047 ms'
Using ST_X & ST_Y
explain analyze
select
st_y(geog::geometry),
st_x(geog::geometry)
from location;
-- 'Seq Scan on location (cost=0.00..4.08 rows=104 width=16) (actual time=0.026..0.192 rows=104 loops=1)'
-- 'Planning time: 0.064 ms'
-- 'Execution time: 0.224 ms'
Is there a faster alternative to ST_X and ST_Y? I think it would be rather common for people to need to grab longitude and latitude from a geography type.
I have a problem with a GiST index.
I have a table 'country' with a 'geog' (geography, multipolygon) column.
I also have a GiST index on this column.
This simple query with ST_CoveredBy() against a table with 2 rows (each 'geog' about 5 MB) takes 13 s (the query result is correct):
select c."ID" from gis.country c where ST_CoveredBy(ST_GeogFromText('SRID=4326;POINT(8.4375 58.5791015625)'), c."geog") =true
When I dropped the index, the query also took 13 s.
What I've already done:
VACUUM ANALYZE gis.country ("geog")
Maybe this is the problem: "Do not call with a GEOMETRYCOLLECTION as an argument". I have read (http://www.mail-archive.com/postgis-users#postgis.refractions.net/msg17780.html) that this can be caused by overlapping polygons, but there are no overlapping polygons in the country map.
EDIT
Query plan -
Index Scan using sindx_country_geography_col1 on country c  (cost=0.00..8.52 rows=1 width=4)
  Index Cond: ('0101000020E61000000000000000E0204000000000204A4D40'::geography && "geog")
  Filter: _st_covers("geog", '0101000020E61000000000000000E0204000000000204A4D40'::geography)
You won't see any benefit from an index when querying a table with only two rows. An index only pays off when there are hundreds of rows or more to query.
I'm going to guess that you have two very detailed country multipolygons. There are strategies to divide these into grids to improve performance. How you break up your countries into grids should be based on (1) the density of your areas of interest (where you are most likely to query), and (2) multipolygon complexity or density of vertices.
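One common way to implement the gridding idea is ST_Subdivide (available in PostGIS 2.2+); a rough sketch, with a hypothetical table name and an arbitrary vertex limit:
-- Split each country into pieces of at most ~256 vertices and index the pieces;
-- small pieces make the GiST index far more selective than two huge multipolygons.
CREATE TABLE gis.country_subdivided AS
SELECT c."ID",
       ST_Subdivide(c."geog"::geometry, 256)::geography AS geog
FROM gis.country c;
CREATE INDEX ON gis.country_subdivided USING gist (geog);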