I am comparing queries on PostgreSQL 8.3.14 which return the same result set.
I have used EXPLAIN on my queries to track the estimated total cost. I have also run the queries a few times and recorded the total time it took to run. I understand consecutive runs will cause more data to be cached and skew the actual no-cache runtime.
Still, I would expect the EXPLAIN cost to be somewhat proportional to the total runtime (cache skew aside).
My data contradicts this. I compared 4 queries.
Query A
Total Cost: 119 500
Average Runtime: 28.101 seconds
Query B
Total Cost: 115 700
Average Runtime: 28.291 seconds
Query C
Total Cost: 116 200
Average Runtime: 32.409 seconds
Query D
Total Cost: 93 200
Average Runtime: 37.503 seconds
I ran Query D last, so if anything it should have been the fastest because of the caching issue. Running the queries entirely without cache seems to be difficult, based on this Q&A:
[SO]: See and clear Postgres caches/buffers?
How can I measure which query is the most efficient?
The query cost shown by the planner is a function of the structure of your indexes and also the relative frequencies of certain values in the relevant tables. PostgreSQL keeps track of the most common values seen in all of the columns of all of your tables so that it can get an idea of how many rows each stage of each plan is likely to operate on.
This information can become out of date. If you are really trying to get an accurate idea of how costly a query is, make sure that the statistics Postgres is using are up to date by executing a VACUUM ANALYZE statement.
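For example (a minimal sketch; your_table stands in for whatever tables your queries touch):

-- refresh planner statistics (and reclaim dead rows) for one table
VACUUM ANALYZE your_table;
-- or just refresh statistics for every table in the database
ANALYZE;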
Beyond that, the planner is forced to do some apples-to-oranges comparisons: somehow weighing the time it takes to seek on disk against the time it takes to run a tight loop over an in-memory relation. Since different hardware can do these things at different relative speeds, sometimes, especially for near ties, Postgres may guess wrong. These relative costs can be tuned in your server's configuration file.
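For example, if your data is mostly cached or your storage handles random I/O well, lowering random_page_cost can nudge the planner toward index scans. A hedged sketch of the relevant settings (the page and CPU costs below are the stock defaults; effective_cache_size is just an example value; all of them can be set per session or in postgresql.conf):

SET seq_page_cost = 1.0;            -- cost of a sequential page fetch
SET random_page_cost = 4.0;         -- cost of a non-sequential page fetch; lower it if random I/O is cheap
SET cpu_tuple_cost = 0.01;          -- cost of processing one row
SET effective_cache_size = '4GB';   -- how much data the planner assumes is kept cached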
Edit:
The statistics collected by PostgreSQL do not relate to "query performance" and are not updated by successive queries. They only describe the frequency and distribution of values in each column of each table (except where disabled). Having accurate statistics is important for accurate query planning, but it's on you, the operator, to tell PostgreSQL how often and to what level of detail those statistics should be gathered. The discrepancy you are observing is a sign that the statistics are out of date, or that you could benefit from tuning other planner parameters.
Try running them through EXPLAIN ANALYZE and posting the output to http://explain.depesz.com/
I need to plot trend charts on the react app based on user inputs such as timestamps, devices, etc. I have related time series data in DynamoDB and S3 (which I can query using Athena).
Returning all those millions of data points for a graph seems unreasonable and is super laggy.
I guess one option is "binning", where I decide the number of bins based on how big the time range is and take the average of the readings in each bin. However, I'm concerned about how well that will capture the drops and highs, which we need to show accurately.
Both Athena queries and DynamoDB queries (due to the 1 MB limit) seem fairly slow so far.
Of course, the size of the response payload is another concern, as API Gateway and Lambda limit it to 10 MB and 6 MB respectively.
Any ideas?
I can't suggest anything smarter than "binning", but if you are concerned that the bucket interval might become too wide and performance might suffer, you can fix the interval. Then create more than one table: for example, the interval can be 1 hour and you can have a new table for each week.
This is what we did when we had to deal with time series in Dynamo. At some point, we decided to switch to Amazon Timestream.
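As a hedged sketch of what a fixed-interval bin query could look like in Athena (the table and column names here are hypothetical), keeping min and max next to the average so the drops and highs are not lost:

-- hypothetical schema: sensor_readings(device_id, reading_time, reading)
SELECT device_id,
       date_trunc('hour', reading_time) AS bucket,       -- fixed 1-hour bin
       avg(reading)                     AS avg_reading,
       min(reading)                     AS min_reading,   -- keep the drops
       max(reading)                     AS max_reading    -- keep the highs
FROM sensor_readings
WHERE device_id IN ('device-1', 'device-2')
  AND reading_time BETWEEN timestamp '2023-01-01 00:00:00'
                       AND timestamp '2023-01-08 00:00:00'
GROUP BY device_id, date_trunc('hour', reading_time)
ORDER BY bucket;

For a one-week range this returns at most 24 * 7 rows per device, which stays well under the API Gateway and Lambda payload limits.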
I am trying to optimize our queries on Postgres, which sometimes take minutes on huge tables. I started looking at the query plan and noticed close to a 1000x difference between the estimated number of rows and the actual rows when running with EXPLAIN ANALYZE.
This led me to the parameter default_statistics_target, which controls the number of rows sampled by the ANALYZE command to collect the stats used by the query planner. As a few blogs suggested, I experimented by increasing the value, setting it to 1000 and even to the maximum allowed value of 10000.
I ran ANALYZE every time to ensure the stats were updated. But surprisingly, this did not improve the row estimation at all. In fact, it reduced the estimate a bit further, which is hard to understand.
I also tested reducing the value to 10, which seems to have improved the estimate a bit. So I am confused about whether the parameter actually does what I thought it does, or whether there is some other way to improve row estimation. Any help would be much appreciated.
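For reference, this is roughly what that experiment looks like (your_table and your_column are placeholders; the per-column form targets only the badly estimated column):

-- raise the sample size for this session, then re-gather stats
SET default_statistics_target = 1000;
ANALYZE your_table;

-- or raise it just for one column and re-analyze
ALTER TABLE your_table ALTER COLUMN your_column SET STATISTICS 1000;
ANALYZE your_table;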
Postgres version: 9.6
Query plan: at the last index scan step, the estimate is 462 rows but the actual count is 1.9M.
https://explain.depesz.com/s/GZY
After changing default_statistics_target = 1000, the rows at the index scan step were:
-> (cost=0.57..120.42 rows=114 width=32) (actual time=248.999..157947.395 rows=1930518 loops=1)
And on setting it to default_statistics_target = 10, counts were:
-> (cost=0.57..2610.79 rows=2527 width=32) (actual time=390.437..62668.837 rows=1930518 loops=1)
P.S. Table under consideration has more than 100M rows.
This looks like a correlation problem. The planner assumes that the conditions on project_id, event_name_id, and "timestamp" are independent and so multiplies their estimated selectivities. If they are not independent, then no amount of traditional statistics is going to help. Maybe you need extended statistics.
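Extended statistics arrived in PostgreSQL 10, so this would mean upgrading from 9.6; a minimal sketch, with the table name as a placeholder:

-- PostgreSQL 10+: teach the planner about the functional dependency between the columns
CREATE STATISTICS events_corr_stats (dependencies)
  ON project_id, event_name_id FROM your_events_table;
ANALYZE your_events_table;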
Also, at the time it makes the plan it doesn't even know what value event_name_id will be compared to, as $0 is not determined until run time, so it can't use the value-specific statistics for that. You could execute the subquery manually, then hard code the resulting value into that spot in the query, so the planner knows what the value is while it is planning.
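A hypothetical illustration of that idea (the real query isn't shown here, so the table and column names below are invented around the columns mentioned above):

-- step 1: run the subquery (the source of $0 in the plan) on its own
SELECT id FROM event_names WHERE name = 'some_event';   -- suppose it returns 42

-- step 2: plan the outer query with the literal in place of the subquery
SELECT count(*)
FROM events
WHERE project_id = 123
  AND event_name_id = 42    -- was: = (SELECT id FROM event_names WHERE name = 'some_event')
  AND "timestamp" >= '2021-01-01';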
If I use postgres's EXPLAIN command, there's a top-level "cost". Assuming that the explain is accurate (ie despite the cost being in reality quite unreliable and/or inconsistent), what is the very approximate conversion from cost to minutes/seconds query duration (for a "large" cost)?
In my case, the query cost is 60 million
For context, my hardware is a regular laptop and the data is 12M rows joining to 250K rows on an indexed column, grouped on several columns to produce 1K rows of output.
This question is not about the query itself per se - there could be better ways to code the query.
This question is also not about how inaccurate, unreliable or inconsistent the explain output is.
This question is about estimating the run time a query would take if it executed given its EXPLAIN cost and given that the EXPLAIN output is in fact an accurate analysis of the query.
I have a collection with 100 million documents of geometry.
I have a second collection with time data associated to each of the other geometries. This will be 365 * 96 * 100 million or 3.5 trillion documents.
Rather than store the 100 million entries (365*96) times more than needed, I want to keep them in separate collections and do a type of JOIN/DBRef/Whatever I can in MongoDB.
First and foremost, I want to get a list of GUIDs from the geometry collection by using a geo-intersection. This will filter the 100 million down to roughly 5,000. Then, using those 5,000 geometry GUIDs, I want to filter the 3.5 trillion documents by those 5,000 geometries plus additional date criteria I specify, aggregate the data, and find the average. You are left with 5,000 geometries and 5,000 averages for the date criteria you specified.
This is basically a JOIN as I know it in SQL. Is this possible in MongoDB, and can it be done optimally, in say less than 10 seconds?
To clarify: as I understand it, this is what DBRefs are used for, but I have read that they are not efficient at all, and that with this much data they wouldn't be a good fit.
If you're going to be dealing with a geometry and its time-series data together, it makes sense to store them in the same doc. A year's worth of data in 15-minute increments isn't a killer, and you definitely don't want a document for every time-series entry! Since you can retrieve everything you want to operate on as a single geometry document, it's a big win. Note that this also lets you handle missing data: you can encode the data differently if it's sparse, rather than indexing into a 35,040-slot array.
A $geoIntersects on a big pile of geometry data will be a performance issue, though. Make sure you have an appropriate index (like a 2dsphere index) to speed things up.
If there is any way you can build additional qualifiers into the query that could cheaply eliminate members from the more expensive search, you may make things zippier. Like, say the search will hit states in the US. You could first intersect the search with state boundaries to find the states containing the geodata and use something like a postal code to qualify the documents. That would be a really quick pre-search against 50 documents. If a search boundary was first determined to hit 2 states, and the geo-data records included a state field, you just winnowed away 96 million records (all things being equal) before the more expensive geo part of the query. If you intersect against smallish grid coordinates, you may be able to winnow it further before the geo data is considered.
Of course, going too far adds overhead. If you can correctly tune the system to the density of the 100 million geometries, you may be able to get the times down pretty low. But without actually working with the specifics of the problem, it's hard to know. That much data probably requires some specific experimentation rather than relying on a general solution.
Background
I have a table that contains POLYGONS/MULTIPOLYGONS which represent customer territories:
The table contains roughly 8,000 rows
Approximately 90% of the polygons are circles
The remainder of the polygons represent one or more states, provinces, or other geographic regions. The raw polygon data for these shapes was imported from US census data.
The table has a spatial index and a clustered index on the primary key. No changes to the default SQL Server 2008 R2 settings were made. 16 cells per object, all levels medium.
Here's a simplified query that will reproduce the issue that I'm experiencing:
DECLARE @point GEOGRAPHY = GEOGRAPHY::STGeomFromText('POINT (-76.992188 39.639538)', 4326)
SELECT terr_offc_id
FROM tbl_office_territories
WHERE terr_territory.STIntersects(@point) = 1
What seems like a simple, straightforward query takes 12 or 13 seconds to execute, and has what seems like a very complex execution plan for such a simple query.
In my research, several sources have suggested adding an index hint to the query, to ensure that the query optimizer is properly using the spatial index. Adding WITH(INDEX(idx_terr_territory)) has no effect, and it's clear from the execution plan that it is referencing my index regardless of the hint.
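For reference, the hint was applied to the query above roughly like this:

DECLARE @point GEOGRAPHY = GEOGRAPHY::STGeomFromText('POINT (-76.992188 39.639538)', 4326)
SELECT terr_offc_id
FROM tbl_office_territories WITH (INDEX(idx_terr_territory))   -- force the spatial index
WHERE terr_territory.STIntersects(@point) = 1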
Reducing polygons
It seemed possible that the territory polygons imported from the US Census data are unnecessarily complex, so I created a second column, and tested reduced polygons (w/ Reduce() method) with varying degrees of tolerance. Running the same query as above against the new column produced the following results:
No reduction: 12649ms
Reduced by 10: 7194ms
Reduced by 20: 6077ms
Reduced by 30: 4793ms
Reduced by 40: 4397ms
Reduced by 50: 4290ms
Clearly headed in the right direction, but dropping precision seems like an inelegant solution. Isn't this what indexes are supposed to be for? And the execution plan still seems strangely complex for such a basic query.
Spatial Index
Out of curiosity, I removed the spatial index, and was stunned by the results:
Queries were faster WITHOUT an index (sub 3 sec w/ no reduction, sub 1 sec with reduction tolerance >= 30)
The execution plan looked far, far simpler.
My questions
Why is my spatial index slowing things down?
Is reducing my polygon complexity really necessary in order to speed up my query? Dropping precision could cause problems down the road, and doesn't seem like it will scale very well.
Other Notes
SQL Server 2008 R2 Service Pack 1 has been applied
Further research suggested running the query inside a stored procedure. I tried this and nothing appeared to change.
My first thought is to check the bounding coordinates of the index and see if they cover the entirety of your geometries. Second, spatial indexes left at the default 16 MMMM (16 cells per object, all four levels medium) perform very poorly in my experience. I'm not sure why that is the default. I have written something about spatial index tuning in this answer.
First, make sure the index covers all of the geometries. Then try reducing cells per object to 8. If neither of those two things offers any improvement, it might be worth your time to run the spatial index tuning proc in the answer I linked above.
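A hedged sketch of rebuilding the index with fewer cells per object (index, table, and column names are taken from the question; adjust the grid levels to suit your data):

DROP INDEX idx_terr_territory ON tbl_office_territories;

CREATE SPATIAL INDEX idx_terr_territory
ON tbl_office_territories (terr_territory)
USING GEOGRAPHY_GRID
WITH (GRIDS = (LEVEL_1 = MEDIUM, LEVEL_2 = MEDIUM, LEVEL_3 = MEDIUM, LEVEL_4 = MEDIUM),
      CELLS_PER_OBJECT = 8);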
My final thought is that state boundaries have so many vertices, and you are testing intersection against so many state-boundary polygons, that it very well could take that long without reducing them.
Oh, and since it has been two years, starting in SQL Server 2012, there is now a GEOMETRY_AUTO_GRID tessellation that does the index tuning for you and does a great job most of the time.
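On a geography column like this one, the equivalent is GEOGRAPHY_AUTO_GRID; a minimal sketch, again using the names from the question:

-- SQL Server 2012+: let the engine choose the grid densities
CREATE SPATIAL INDEX idx_terr_territory_auto
ON tbl_office_territories (terr_territory)
USING GEOGRAPHY_AUTO_GRID;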
This might just be due to the simpler execution plan being executed in parallel, whereas the other one is not. However, there is a warning on the first execution plan that might be worth investigating.