I have a database of 10 million points in PostgreSQL 12 with the PostGIS extension. I have created a GiST index on the points column, and I need to filter the points by distance from a given point, but it seems the index is not used. I run the following EXPLAIN:
EXPLAIN
SELECT actual_location
FROM geometries
WHERE ST_DWithin(ST_SetSRID(ST_MakePoint(30,30), 4326), actual_location, 100000, true);
which yields:
so it seems it only does a parallel sequential scan. Am I getting something wrong here? Should I be using a different index type? When the database was populated with 1 million points, the query returned results in about 1.3 seconds. With 10 million it takes about 11-13 seconds, which is far too long for a user of my application to wait.
Turns out I should have created the index like so:
CREATE INDEX example_idx ON geometries USING GIST (geography(actual_location));
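For reference, here is a sketch of how the query lines up with that expression index; the explicit ::geography casts only make visible the implicit conversion that the four-argument form of ST_DWithin already performs, and the table and column names are the ones from the question:
-- same query as above, with the geometry -> geography conversion spelled out
EXPLAIN
SELECT actual_location
FROM geometries
WHERE ST_DWithin(
        ST_SetSRID(ST_MakePoint(30, 30), 4326)::geography,
        actual_location::geography,
        100000);
With the expression index in place, re-running the EXPLAIN should show an index scan on example_idx instead of the parallel sequential scan.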
I have a query that has several filter conditions in the WHERE clause.
Also, most of the columns involved have indexes on them.
When I run the EXPLAIN command, I see:
-> Bitmap Index Scan on feature_expr_idx (cost=0.00..8.10 rows=14 width=0)
feature_expr_idx is an index on one of the columns in the WHERE clause.
But the indexes for the other columns are not shown; those columns appear in the Filter row instead:
Filter: ((NOT is_deleted) AND (vehicle_type = 'car'::text) AND (source_type = 'NONE'::text))
Why is only a single index shown in the plan, while the other columns, which also have indexes, end up in the Filter instead?
PostgreSQL has a clever planner which tries to find the best way to run your query. Often this involves reading as little as possible from disk, because disk operations are slow. One of the reasons indexes are so helpful is that by reading from the index we can identify a small number of table rows that need to be read in order to satisfy the query, and thus avoid reading through the entire table. Note, however, that the index is also on disk, so reading the index also takes some time.
Now, imagine your query has two filters, one over column A and one over column B, both of which are indexed. According to the statistics PostgreSQL has collected, about 5 rows satisfy the filter on column A, and about 1000 rows satisfy the filter on column B. In that case it makes sense to read only the index on column A, fetch the matching 5 (or so) rows, and filter out any of them that don't match the filter on column B. Reading the index on column B would probably be more expensive than just reading the 5 rows!
The actual reason may be different from my example, but the point is that PostgreSQL is simply trying to be as efficient as possible.
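As a rough illustration of that trade-off (the table, column, and index names below are made up for the example):
-- hypothetical table with two independently indexed columns
CREATE TABLE items (a int, b int, payload text);
CREATE INDEX items_a_idx ON items (a);
CREATE INDEX items_b_idx ON items (b);
-- after loading data and running ANALYZE:
EXPLAIN SELECT * FROM items WHERE a = 5 AND b = 1000;
If the planner estimates that a = 5 matches only a handful of rows, the plan will typically show an index (or bitmap index) scan on items_a_idx with Filter: (b = 1000) applied to the fetched rows, rather than a second index scan on items_b_idx.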
I could not reach a conclusive answer from the existing posts on this topic.
I have certain data at 100 locations for the past 10 years. The table has about 800 million rows. I primarily need to generate yearly statistics for each location. Sometimes I need to generate monthly and hourly variation statistics as well. I'm wondering if I should create two indexes - one for location and another for year - or one index on both location and year. My primary key is currently a serial number (probably I could use location and timestamp as the primary key instead).
Thanks.
Regardless of how many indexes you have created on a relation, only one of them will typically be used by a given query (which one depends on the query, statistics, etc.). So in your case you wouldn't get a cumulative advantage from creating two single-column indexes. To get the most performance from an index, I would suggest using a composite index on (location, timestamp).
Note that queries like ... WHERE timestamp BETWEEN smth AND smth will not use the index above, while queries like ... WHERE location = 'smth' or ... WHERE location = 'smth' AND timestamp BETWEEN smth AND smth will. That is because the first attribute of the index is what drives searching and sorting.
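A minimal sketch of that (the table name readings and the columns location and ts are made-up names for illustration):
-- 'readings', 'location' and 'ts' are hypothetical names
CREATE INDEX readings_location_ts_idx ON readings (location, ts);
-- these can use the index above (the leading column is constrained):
SELECT count(*) FROM readings WHERE location = 'site_42';
SELECT count(*) FROM readings
WHERE location = 'site_42' AND ts BETWEEN '2015-01-01' AND '2016-01-01';
-- this generally cannot (the leading column is not constrained):
SELECT count(*) FROM readings WHERE ts BETWEEN '2015-01-01' AND '2016-01-01';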
Don't forget to perform
ANALYZE;
after index creation in order to collect statistics.
Update:
As @MondKin mentioned in the comments, certain queries can actually use several indexes on the same relation. For example, a query with OR clauses like a = 123 OR b = 456 (assuming there are indexes on both columns). In this case PostgreSQL performs bitmap index scans for both indexes, builds a union of the resulting bitmaps and uses it for a bitmap heap scan. Under certain conditions the same scheme may be used for AND queries, but instead of a union there would be an intersection.
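A rough sketch of what that looks like (the table and index names are made up):
-- hypothetical table with single-column indexes on a and b
CREATE INDEX t_a_idx ON t (a);
CREATE INDEX t_b_idx ON t (b);
EXPLAIN SELECT * FROM t WHERE a = 123 OR b = 456;
With enough data, such a plan typically shows a BitmapOr combining Bitmap Index Scans on t_a_idx and t_b_idx, feeding a single Bitmap Heap Scan.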
There is no rule of thumb for situations like these; I suggest you experiment on a copy of your production DB to see what works best for you: a single multi-column index or two single-column indexes.
One nice feature of Postgres is you can have multiple indexes and use them in the same query. Check this chapter of the docs:
... PostgreSQL has the ability to combine multiple indexes ... to handle cases that cannot be implemented by single index scans ....
... Sometimes multicolumn indexes are best, but sometimes it's better to create separate indexes and rely on the index-combination feature ...
You can even experiment with creating both the individual and the combined indexes, check how big each one is, and decide whether it's worth keeping them all at the same time.
Some things that you can also experiment with:
If your table is too large, consider partitioning it. It looks like you could partition either by location or by date. Partitioning splits your table's data into smaller tables, reducing the number of places a query needs to look.
If your data is laid out according to a date (like a transaction date), check BRIN indexes (see the sketch after this list).
If multiple queries will be processing your data in a similar fashion (like aggregating all transactions over the same period), check materialized views so you only need to do those costly aggregations once.
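As a hedged sketch of the BRIN idea, assuming a hypothetical readings table with a ts timestamp column, and that rows are appended roughly in time order:
-- BRIN indexes are tiny and work well when the column's values
-- correlate with the physical row order (e.g. append-only time series)
CREATE INDEX readings_ts_brin ON readings USING BRIN (ts);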
Regarding the column order in a multi-column index: put the column on which you have an equality condition first, and the column on which you have a range (>= or <=) condition after it.
An index on (location, timestamp) should work better than 2 separate indexes for your case. Note that the order of the columns is important.
What is the best way to improve the performance of the following distance query?
SELECT count(*) FROM place WHERE DISTANCE(lat, lng, 42.0697, -87.7878) < 10
The query always warns with the following message on a large data set of around 80k records:
"fetched more than 50000 records: to speed up the execution, create an index or change the query to use an existent index"
I created the following index, but it's not involved in that query:
place.distance NOTUNIQUE ["lat","lng"] SBTREE
You can use a spatial index.
You can look at the documentation: http://orientdb.com/docs/2.1/Spatial-Index.html
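A rough sketch based on that documentation, using the place class and the lat/lng fields from the question; the index name is made up, and the NEAR syntax (with maxDistance in kilometers) is as described in the linked 2.1 docs:
CREATE INDEX place.lat_lng ON place (lat, lng) SPATIAL ENGINE LUCENE
SELECT count(*) FROM place
WHERE [lat, lng, $spatial] NEAR [42.0697, -87.7878, {"maxDistance": 10}]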
I would like to use the GEO2D functions in Sphinx (SphinxQL), but without searching an index.
I already have the polygon and the latitude and longitude, and I just want to calculate whether the point is inside the polygon. So this is what I want:
SELECT CONTAINS(GEOPOLY2D($polygon),$lat_deg, $long_deg)
In MySQL you could run a query without specifying a database:
SELECT 1+1
How could I do this in Sphinx? I already tried running it against an index (FROM index LIMIT 1), but that seems to do a full table scan and thus takes a long time.
I could make a new index with just one record, but maybe there is another way?
We have a table with millions of rows containing PostGIS geometries. The query we want to perform is: what are the most recent entries that fall within a bounding geometry? The problem with this query is that we'll often have a large number of items matching the bounding box (which is around 5 km in radius), and Postgres will then have to re-check all the items inside the bounding box to get their timestamps, then sort and return the latest N.
It feels like what we need is a (compound?) index that combines the GiST spatial index and the timestamp. Is such a thing possible? I've tried several combinations in the CREATE INDEX step and nothing has worked so far.
I'd rather create two indexes, one spatial and the second on the timestamp column. PostgreSQL can combine indexes quite nicely and doesn't need to 're-check' the found rows: it can use one index to get the rows within the geometry and the other index to sort them.
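A minimal sketch of that two-index approach; the table name entries and the columns geom and created_at are assumptions, and the bounding box and LIMIT are just placeholders:
-- hypothetical table: entries(geom geometry, created_at timestamptz, ...)
CREATE INDEX entries_geom_gist ON entries USING GIST (geom);
CREATE INDEX entries_created_at_idx ON entries (created_at DESC);
-- most recent 10 entries that intersect a bounding box
SELECT *
FROM entries
WHERE geom && ST_MakeEnvelope(-87.9, 41.9, -87.6, 42.1, 4326)
ORDER BY created_at DESC
LIMIT 10;
Depending on how many rows fall inside the box, the planner may scan the GiST index and sort the matches, or walk the created_at index and filter them; EXPLAIN ANALYZE will show which it picks.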