I have many read-only tables in a Postgres database. All of these tables can be queried using any combination of columns.
What can I do to optimize queries? Is it a good idea to add indexes on all columns of all tables?
Columns that are used for filtering or joining (or, to a lesser degree, sorting) are of interest for indexing. Columns that are just selected are barely relevant!
For the following query only indexes on a and e may be useful:
SELECT a,b,c,d
FROM tbl_a
WHERE a = $some_value
AND e < $other_value;
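A minimal sketch of what that might look like (index names are just examples; whether one multicolumn index or two single-column indexes works better depends on your data):
CREATE INDEX tbl_a_a_e_idx ON tbl_a (a, e);   -- equality column first, range column second
-- or, alternatively, two separate indexes:
-- CREATE INDEX tbl_a_a_idx ON tbl_a (a);
-- CREATE INDEX tbl_a_e_idx ON tbl_a (e);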
Here, f and possibly c are candidates, too:
SELECT a,b,c,d
FROM tbl_a
JOIN tbl_b USING (f)
WHERE a = $some_value
AND e < $other_value
ORDER BY c;
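Possible candidates here might be (again only a sketch; it assumes c belongs to tbl_a):
CREATE INDEX tbl_b_f_idx ON tbl_b (f);   -- supports the join
CREATE INDEX tbl_a_c_idx ON tbl_a (c);   -- may support the ORDER BY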
After creating indexes, test whether they are actually useful with EXPLAIN ANALYZE, and compare execution times with and without them. Deleting and recreating indexes is fast and easy. There are also planner parameters to experiment with while comparing EXPLAIN ANALYZE output. The difference may be staggering or nonexistent.
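A sketch of that workflow, using the first example query and the hypothetical index from above:
EXPLAIN ANALYZE
SELECT a, b, c, d FROM tbl_a WHERE a = $some_value AND e < $other_value;  -- with the index in place

DROP INDEX tbl_a_a_e_idx;                                                 -- drop it ...

EXPLAIN ANALYZE
SELECT a, b, c, d FROM tbl_a WHERE a = $some_value AND e < $other_value;  -- ... and compare plan and timing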
As your tables are read only, index maintenance is cheap. It's merely a question of disk space.
If you really want to know what you are doing, start by reading the docs.
If you don't know what queries to expect ...
Try logging enough queries to find typical use cases. Log queries with the parameter log_statement = all for that. Or just log slow queries using log_min_duration_statement.
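For example, in postgresql.conf (values are only illustrative; log_statement = 'all' can produce a lot of output):
log_min_duration_statement = 500   # log every statement that runs longer than 500 ms
# or, to capture every statement:
# log_statement = 'all'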
Create indexes that might be useful and check the statistics after some time to see what actually gets used. PostgreSQL has a whole infrastructure in place for monitoring statistics. One convenient way to study them (among many other tasks) is pgAdmin, where you can choose a table / function / index and see all the data on the "Statistics" tab in the object browser (main window).
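If you prefer plain SQL over pgAdmin, the same information is available from the statistics views, e.g. (table name from the examples above):
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM   pg_stat_user_indexes
WHERE  relname = 'tbl_a';   -- idx_scan still 0 after a while suggests an unused index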
Proceed as described above to see if indexes in use actually speed things up.
If the query planner chooses to use one or more of your indexes but with no effect, or even an adverse one, then something is probably wrong with your setup and you need to study the basics of performance optimization: vacuum, analyze, cost parameters, memory usage, ...
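A few of the knobs mentioned above, as a hedged sketch (the values are only examples and depend entirely on your hardware):
VACUUM ANALYZE tbl_a;               -- refresh statistics and the visibility map
SET random_page_cost = 1.1;         -- e.g. when the data is mostly cached or on SSD
SET effective_cache_size = '4GB';   -- roughly the memory available for caching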
If you filter by more columns, additional indexes may help, but not by much. Also, indexes may not help at all for small tables.
First search for "postgresql tuning" - you will find useful information.
If the database can fit in memory - buy enough RAM.
If the database cannot fit in memory - an SSD will help.
If this is not enough and the database is read-only - run 2, 3 or more servers, or partition the database (in the best case so that each partition fits in the memory of its server).
Even if the queries are generated, I think they will not be random. Monitor the database for slow queries and improve only those.
Recently I have been facing a situation where Postgres doesn't use an index for a simple query with a single WHERE filter, like: select * from book where obj_id=465789. Sometimes we have a lot of writes to this table and selects running at the same time. I read the article Postgres not using index when index scan is much better option, and Erwin gave an excellent answer. But one thing I didn't understand: how does heavy concurrent write load affect whether the index is used or not?
The planner does not consider how much concurrent writing there is when making its decisions, so there is no direct effect.
I can think of three possible indirect effects. Concurrent writers might keep the data distribution changing faster than auto-analyze can keep up, so the planner is working with bad row estimates (how many rows have obj_id=465789 right now). Intense writing can clear the visibility map faster than autovacuum can reset it, which will penalize index-only scan cost estimates. And intense writing can bloat the index, and the index size plays a (minor) role in estimating index cost.
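If you suspect one of these effects, the statistics views give some clues; a sketch (table name taken from the question):
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
FROM   pg_stat_user_tables
WHERE  relname = 'book';   -- a stale last_autoanalyze or many dead tuples point at the first two effects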
As the PostgreSQL documentation points out, one way to improve query performance is to increase the statistics target for some columns.
It is known that the default_statistics_target value is not enough for large tables (a few million rows) with an irregular value distribution, and must be increased.
It seems practical to create a script for auto-tuning the statistics target for each column. I would like to know what the possible obstacles are in writing such a script, and why I can't find one online.
That's because it is not that simple. It does not primarily depend on the size of the table, but on the data in the table and their distribution, the way in which data are modified, and most of all on the queries.
So it is pretty much impossible to make that decision from a look on the persisted state, and even with more information it would require quite a bit of artificial intelligence.
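For reference, raising the target for a single column by hand is the easy part (the table and column names here are hypothetical); deciding which columns need it and what value to use is where the difficulty lies:
ALTER TABLE some_table ALTER COLUMN some_column SET STATISTICS 1000;  -- the default target is 100 on recent versions
ANALYZE some_table;                                                   -- recollect statistics with the new target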
One problem with the PG planner statistics is that there is no way to compute statistics over all the rows in a table. PG always uses a small part of the table (a sample) to compute statistics. This has a huge disadvantage for big tables: it can miss important values that would make the difference when estimating the cardinality of some operations in the execution plan, which may cause an inappropriate algorithm to be chosen.
Explanation: http://mssqlserver.fr/postgresql-vs-sql-server-mssql-part-3-very-extremely-detailed-comparison/
Especially § 12 – Planner statistics
The reason that PG does not support a "full scan" statistics mode is that it would take too much time to compute! In fact, PostgreSQL is very slow at many maintenance tasks such as recomputing statistics, as I show here:
http://mssqlserver.fr/postgresql-vs-microsoft-part-1-dba-queries-performances/
In some other RDBMSs it is possible to do UPDATE STATISTICS ... WITH FULLSCAN (Microsoft SQL Server, for example), and this does not take so much time, because MS SQL Server does it in parallel with multiple threads, which PostgreSQL is unable to do...
Conclusion: PostgreSQL was never designed for huge tables. Consider using another RDBMS if you want to deal with big tables and get good performance...
Just take a look at the COUNT performance of PostgreSQL compared to MS SQL Server:
http://mssqlserver.fr/postgresql-vs-microsoft-sql-server-comparison-part-2-count-performances/
Hi, I have a problem while executing SQL in PostgreSQL.
I have a query similar to this:
select A, B , lower(C) from myTable ORDER BY A, B ;
Without the ORDER BY clause, I get the result in 11 ms, but with ORDER BY it takes more than 4 minutes to retrieve the same results.
These columns contain lots of data (1,000,000 rows or more) with a lot of duplicates.
Can anyone suggest a solution?
Thank you
but with ORDER BY it takes more than 4 minutes to retrieve the same results.
udo already explained how indexes can be used to speed up sorting; this is probably the way you want to go.
But another solution (probably) is increasing the work_mem variable. This is almost always beneficial, unless you have many queries running at the same time.
When sorting large result sets, which don't fit in your work_mem setting, PostgreSQL resorts to a slow disk-based sort. If you allow it to use more memory, it will do fast in-memory sorts instead.
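A sketch of that, scoped to the current session only (the value is just an example and should fit your available RAM and concurrency):
SET work_mem = '256MB';                            -- session-level only; does not touch postgresql.conf
SELECT A, B, lower(C) FROM myTable ORDER BY A, B;  -- re-check with EXPLAIN ANALYZE whether the sort now happens in memory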
NB! Whenever you ask questions about PostgreSQL performance, you should post the EXPLAIN ANALYZE output for your query, and also the version of Postgres.
Have you tried putting an index on A,B?
That should speed things up.
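For example (table and column names taken from the question, index name is just a suggestion):
CREATE INDEX mytable_a_b_idx ON myTable (A, B);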
Did you try using a DISTINCT for eliminating duplicates? This should be more efficient than an order by statement.
I have a question about the fine line between the cost of an index on a table that is growing steadily in size every month and the gain that the index brings to queries.
The situation is, that I’ve two tables, Table1 and Table2. Each table grows slowly but regularly each month (with about 100 new rows for Table1 and a couple of rows for Table2).
My concrete question is whether to have an index or to drop it. I've measured that a covering index on Table2 improves some of my SELECT queries, some of them rather a lot, but I still have to weigh the pros and cons and am having a really hard time deciding.
For Table1 it might not be necessary to have an index, because SELECT queries against it are not that common.
I would appreciate any suggestion, tips or just good advice to what is a good solution.
By the way, I’m using IBM DB2 version 9.7 as my Database system
Sincerely
Mestika
Any additional index will make your inserts slower and your queries faster.
To take a smart decision, you will have to measure exactly by how much, with the amount of data that you expect to see. If you have multiple clients accessing the database at the same time, it may make sense to write a small multithreaded application that simulates the maximum load, both for inserts and for queries.
Your results will depend on the nature of your data and on the hardware that you are running. If you want to know the best answer for your use case, there is no way around testing accurately yourself with your data and your hardware.
Then you will have to ask yourself:
Which query performance do I need?
If the query performance is good enough without the index anyway, easy: Don't add the index!
Which insert performance do I need?
Can it drop below the needed limit with the additional index? If not, easy: Add the index!
If you discover that you absolutely need the index for query performance and you can't get the required insert performance with the index, you may need to buy better hardware. Solid state discs can do wonders for database servers and they are getting affordable.
If your system is running fine for everyone anyway, worry less, let it run as is.
I have a query in my application that runs very fast when there are large number of rows in my tables. But when the number of rows is a moderate size (neither large nor small) - the same query runs as much as 15 times slower.
The explain plan shows that the query on a medium-sized data set is using nested loops for its join algorithm. The large data set uses hash joins.
I can discourage the query planner from using nested loops either at the database level (postgresql.conf) or per session (SET enable_nestloop TO off).
What are the potential pitfalls of set enable_nestloop to off?
Other info: PostgreSQL 8.2.6, running on Windows.
What are the potential pitfalls of setting enable_nestloop to off?
This means that you will never be able to use indexes efficiently.
And it seems that you don't use them now.
The query like this:
SELECT u.name, p.name
FROM users u
JOIN profiles p ON p.id = u.profile_id
WHERE u.id = :id
will most probably use NESTED LOOPS with an INDEX SCAN on users.id and an INDEX SCAN on profiles.id, provided that you have built indices on these fields.
Queries with low selectivity filters (that is, queries that need more than 10% of data from tables they use) will benefit from MERGE JOINS and HASH JOINS.
But the queries like one given above require NESTED LOOPS to run efficiently.
If you post your queries and table definitions here, probably much may be done about the indexes and queries performance.
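If it turns out that you really do need to suppress nested loops for one particular query, a less drastic option than changing postgresql.conf is to scope the setting to a single transaction, roughly like this:
BEGIN;
SET LOCAL enable_nestloop = off;   -- reverts automatically at COMMIT or ROLLBACK
-- run the problematic query here
COMMIT;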
A few things to consider before taking such drastic measures:
upgrade your installation to the latest 8.2.x (which right now is 8.2.12). Even better - consider upgrading to the next stable version which is 8.3 (8.3.6).
consider changing your production platform to something other than Windows. The Windows port of PostgreSQL, although very useful for development purpose, is still not on a par with the Un*x ones.
read the first paragraph of "Planner Method Configuration". This wiki page probably will help too.
I have the exact same experience. Some queries on a large database were executed using nested loops and took 12 hours, while they run in 30 seconds when nested loops are turned off or the indexes are removed.
Having hints would be really nice here, but I tried
...
SET ENABLE_NESTLOOP TO FALSE;
... critical query
SET ENABLE_NESTLOOP TO TRUE;
...
to deal with this matter. So you can definitely disable and re-enable nested loop use, and you can't argue with a 9000-fold speed increase :)
One problem I have is making the ENABLE_NESTLOOP change inside a PL/pgSQL procedure. I can run an SQL script in Aqua Data Studio that does everything right, but when I put it in a PL/pgSQL procedure, it still takes 12 hours. Apparently the change was being ignored.
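One thing that may be worth trying (assuming a PostgreSQL version that supports per-function settings, i.e. 8.3 or later, and a hypothetical function name) is attaching the setting to the function itself instead of issuing SET inside it:
ALTER FUNCTION my_slow_report() SET enable_nestloop = off;  -- hypothetical function name; the setting applies whenever the function runs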