What are the pitfalls of setting enable_nestloop to OFF - postgresql

I have a query in my application that runs very fast when there is a large number of rows in my tables. But when the number of rows is moderate (neither large nor small), the same query runs as much as 15 times slower.
The explain plan shows that the query on a medium-sized data set uses nested loops for its join algorithm, while the large data set uses hash joins.
I can discourage the query planner from using nested loops either at the database level (postgresql.conf) or per session (SET enable_nestloop TO off).
What are the potential pitfalls of setting enable_nestloop to off?
Other info: PostgreSQL 8.2.6, running on Windows.

What are the potential pitfalls of setting enable_nestloop to off?
This means that you will never be able to use indexes efficiently.
And it seems that you don't use them now.
A query like this:
SELECT u.name, p.name
FROM users u
JOIN profiles p ON p.id = u.profile_id
WHERE u.id = :id
will most probably use NESTED LOOPS with an INDEX SCAN on users.id and an INDEX SCAN on profiles.id, provided that you have built indexes on these fields.
Queries with low selectivity filters (that is, queries that need more than 10% of data from tables they use) will benefit from MERGE JOINS and HASH JOINS.
But queries like the one given above require NESTED LOOPS to run efficiently.
If you post your queries and table definitions here, much can probably be done to improve your indexes and query performance.
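As a rough sketch, these are the kinds of indexes such a plan relies on (the names are made up; in practice the primary keys on users.id and profiles.id usually provide them already):
-- supporting indexes for the example query; redundant if these columns
-- already carry primary key constraints
CREATE INDEX users_id_idx    ON users (id);
CREATE INDEX profiles_id_idx ON profiles (id);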

A few things to consider before taking such drastic measures:
upgrade your installation to the latest 8.2.x (which right now is 8.2.12). Even better, consider upgrading to the next stable version, which is 8.3 (currently 8.3.6).
consider changing your production platform to something other than Windows. The Windows port of PostgreSQL, although very useful for development purposes, is still not on a par with the Un*x ones.
read the first paragraph of "Planner Method Configuration". This wiki page probably will help too.

I have the exact same experience. Some queries on a large database were executed using nested loops and took 12 hours, when the same work runs in 30 seconds after turning off nested loops or removing the indexes.
Having hints would be really nice here, but I tried
...
SET ENABLE_NESTLOOP TO FALSE;
... critical query
SET ENABLE_NESTLOOP TO TRUE;
...
to deal with this matter. So you can definitely disable and re-enable nested loop use, and you can't argue with a roughly 1,400-fold speed increase :)
One problem I have is making the ENABLE_NESTLOOP change from within a PL/pgSQL procedure. I can run an SQL script in Aqua Data Studio that does everything right, but when I put the same logic into a PL/pgSQL procedure, it still takes 12 hours. Apparently it was ignoring the change.
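If anyone hits the same problem: one way to make the setting stick for a function is to attach it to the function itself (a sketch; this needs PostgreSQL 8.3 or later, and run_critical_report is a made-up name standing in for the real procedure):
-- the setting applies while the function runs and is reverted automatically on exit
ALTER FUNCTION run_critical_report(integer) SET enable_nestloop = off;

-- alternatively, inside the PL/pgSQL body, change it for the current transaction only
PERFORM set_config('enable_nestloop', 'off', true);  -- true = local to this transaction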

Related

Postgres not using index when too much concurrent write load

Recently I have been facing a situation where Postgres doesn't use an index for a simple query with one WHERE filter, something like select * from book where obj_id=465789. Sometimes we have a lot of writes to this table and selects running simultaneously. I read the article "Postgres not using index when index scan is much better option" and Erwin gave an excellent answer. But one thing I didn't understand: how does too much concurrent write load affect whether the index is used or not?
The planner does not consider how much concurrent writing there is when making its decisions, so there is no direct effect.
I can think of three possible indirect effects. Concurrent writers might keep the data distribution changing faster than auto-analyze can keep up, so the planner is working with bad row estimates (how many rows have obj_id=465789 right now). Intense writing can clear the visibility map faster than autovacuum can reset it, which will penalize index-only scan cost estimates. And intense writing can bloat the index, and the index size plays a (minor) role in estimating index cost.
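If stale statistics are the suspect, you can check how current they are and make autoanalyze more aggressive for that one table (a sketch; the table name book is taken from the question, the scale factors are only illustrative):
-- when was the table last analyzed, and how many dead tuples has it accumulated?
SELECT relname, last_analyze, last_autoanalyze, n_dead_tup
FROM   pg_stat_user_tables
WHERE  relname = 'book';

-- let autovacuum/autoanalyze kick in after a smaller fraction of the table changes
ALTER TABLE book SET (autovacuum_analyze_scale_factor = 0.02,
                      autovacuum_vacuum_scale_factor  = 0.05);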

Implementing a high-scale scheduler on a database

We have a Postgres DB with a table of tens of millions of rows.
We also have a scheduler (app code) that runs over those rows and queries for specific assets. Usually what we need there is items that are 30 days old.
We started to scale, and the scheduler is very slow.
What is the best approach to scale with maintaining a good performance? Using a different DB? Redis? ES? Partitioning the Postgres?
Thanks!
Usually what we need there is items that are 30 days old.
That's the part of your question that's actually relevant. PostgreSQL, when used appropriately, should have absolutely no trouble performing a simple WHERE query over tens of millions of rows. The cost of index lookups grows logarithmically.
To take a stab in the dark: If you are performing date calculations for every row in your WHERE statement, performance will indeed be abysmal. For example:
SELECT * FROM my_data WHERE AGE(CREATED_AT) > INTERVAL '30 days';
...is a rather bad idea. Instead, calculate the date cutoff once, and statically use it in the comparison.
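A minimal sketch of the static variant, assuming a created_at timestamp column as in the example above and an index on it:
-- the cutoff is computed once per query, so the plain comparison can use the index
CREATE INDEX my_data_created_at_idx ON my_data (created_at);

SELECT *
FROM   my_data
WHERE  created_at < now() - interval '30 days';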
If your query is really more complicated, you could also look into expression indexes. It's overkill for the example above, and it adds some overhead to all data-modifying operations, but it would make a query like the one above perform as well as the static variant.
In any case: EXPLAIN SELECT ... is your friend, and posting its output will make you even more friends here.

Optimize PostgreSQL read-only tables

I have many read-only tables in a Postgres database. All of these tables can be queried using any combination of columns.
What can I do to optimize queries? Is it a good idea to add indexes to all columns to all tables?
Columns that are used for filtering or joining (or, to a lesser degree, sorting) are of interest for indexing. Columns that are just selected are barely relevant!
For the following query only indexes on a and e may be useful:
SELECT a,b,c,d
FROM tbl_a
WHERE a = $some_value
AND e < $other_value;
Here, f and possibly c are candidates, too:
SELECT a,b,c,d
FROM tbl_a
JOIN tbl_b USING (f)
WHERE a = $some_value
AND e < $other_value
ORDER BY c;
After creating indexes, test to see if they are actually useful with EXPLAIN ANALYZE. Also compare execution times with and without the indexes. Deleting and recreating indexes is fast and easy. You can also experiment with planner parameters (such as the enable_* settings mentioned above) while comparing EXPLAIN ANALYZE output. The difference may be staggering or nonexistent.
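For example, a test round could look like this (a sketch; the index name and the literal values are placeholders, the table and columns come from the first query above):
-- create a candidate index, compare plan and timing, drop it again if it doesn't pay off
CREATE INDEX tbl_a_a_idx ON tbl_a (a);

EXPLAIN ANALYZE
SELECT a, b, c, d
FROM   tbl_a
WHERE  a = 42
AND    e < 100;

DROP INDEX tbl_a_a_idx;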
As your tables are read only, index maintenance is cheap. It's merely a question of disk space.
If you really want to know what you are doing, start by reading the docs.
If you don't know what queries to expect ...
Try logging enough queries to find typical use cases. Log queries with the parameter log_statement = all for that. Or just log slow queries using log_min_duration_statement.
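In postgresql.conf that could look like this (a sketch; the 500 ms threshold is only an example value):
# log every statement (very verbose; fine for a short sampling period)
log_statement = 'all'

# or only log statements that take longer than 500 ms
log_min_duration_statement = 500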
Create indexes that might be useful and check the statistics after some time to see what actually gets used. PostgreSQL has a whole infrastructure in place for monitoring statistics. One convenient way to study statistics (and many other tasks) is pgAdmin, where you can choose your table / function / index and get all the data on the "statistics" tab in the object browser (main window).
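For example, to see which indexes actually get used (a sketch; tbl_a is taken from the queries above):
-- number of scans per index since the statistics were last reset
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM   pg_stat_user_indexes
WHERE  relname = 'tbl_a';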
Proceed as described above to see if indexes in use actually speed things up.
If the query planner does choose to use one or more of your indexes but to no effect, or even an adverse one, then something is probably wrong with your setup and you need to study the basics of performance optimization: vacuum, analyze, cost parameters, memory usage, ...
If you are filtering by more columns, indexes may help, but only so much. Also, indexes may not help for small tables.
First, search for "postgresql tuning" - you will find useful information.
If the database can fit in memory - buy enough RAM.
If the database cannot fit in memory - an SSD will help.
If this is not enough and the database is read-only - run 2, 3 or more servers, or partition the database (ideally so each part fits in the memory of one server).
Even if the queries are generated, I think they will not be random. Monitor the database for slow queries and improve only those.

"Order by" degraded performance in sql

Hi, I have a problem when executing SQL in PostgreSQL.
I have a similar query like this:
select A, B, lower(C) from myTable ORDER BY A, B;
Without the ORDER BY clause I get the result in 11 ms, but with ORDER BY it takes more than 4 minutes to retrieve the same results.
These columns contain a lot of data (1,000,000 rows or more) and have many duplicate values.
Can anyone suggest a solution?
Thank you
but with ORDER BY it takes more than 4 minutes to retrieve the same results.
udo already explained how indexes can be used to speed up sorting; this is probably the way you want to go.
But another solution (probably) is increasing the work_mem variable. This is almost always beneficial, unless you have many queries running at the same time.
When sorting large result sets, which don't fit in your work_mem setting, PostgreSQL resorts to a slow disk-based sort. If you allow it to use more memory, it will do fast in-memory sorts instead.
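As a sketch, for a one-off session (the 256MB value is arbitrary; size it to your RAM and the number of concurrent queries):
-- affects only the current session; EXPLAIN ANALYZE will then show
-- "Sort Method: quicksort  Memory: ..." instead of "external merge  Disk: ..."
SET work_mem = '256MB';

SELECT A, B, lower(C) FROM myTable ORDER BY A, B;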
NB! Whenever you ask questions about PostgreSQL performance, you should post the EXPLAIN ANALYZE output for your query, and also the version of Postgres.
Have you tried putting an index on A,B?
That should speed things up.
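For example (a sketch, using the table and columns from the question):
-- a composite index matching the ORDER BY lets PostgreSQL return rows already sorted
CREATE INDEX mytable_a_b_idx ON myTable (A, B);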
Did you try using DISTINCT to eliminate the duplicates? This should be more efficient than an ORDER BY statement.

A question about indexes regarding the gain for queries versus the cost to inserts & updates in a database

I have a question about the fine line between the cost of maintaining an index on a table that is growing steadily in size every month and the gain that the index brings to queries.
The situation is that I have two tables, Table1 and Table2. Each table grows slowly but regularly each month (about 100 new rows for Table1 and a couple of rows for Table2).
My concrete question is whether to keep an index or to drop it. I have measured that a covering index on Table2 improves some of my SELECT queries rather a lot, but again, I have to weigh the pros and cons and I'm having a really hard time deciding.
For Table1 it might not be necessary to have an index, because the SELECT queries there are not that common.
I would appreciate any suggestions, tips or just good advice on what a good solution would be.
By the way, I’m using IBM DB2 version 9.7 as my Database system
Sincerely
Mestika
Any additional index will make your inserts slower and your queries faster.
To take a smart decision, you will have to measure exactly by how much, with the amount of data that you expect to see. If you have multiple clients accessing the database at the same time, it may make sense to write a small multithreaded application that simulates the maximum load, both for inserts and for queries.
Your results will depend on the nature of your data and on the hardware that you are running. If you want to know the best answer for your use case, there is no way around testing accurately yourself, with your data and your hardware.
Then you will have to ask yourself:
Which query performance do I need?
If the query performance is good enough without the index anyway, easy: Don't add the index!
Which insert performance do I need?
Can it drop below the needed limit with the additional index? If not, easy: Add the index!
If you discover that you absolutely need the index for query performance and you can't get the required insert performance with the index, you may need to buy better hardware. Solid state discs can do wonders for database servers and they are getting affordable.
If your system is running fine for everyone anyway, worry less, let it run as is.