"Order by" degraded performance in sql - postgresql

Hi, I have a problem while executing SQL in PostgreSQL.
I have a query similar to this:
select A, B, lower(C) from myTable ORDER BY A, B;
Without the ORDER BY clause I get the result in 11 ms, but with ORDER BY it takes more than 4 minutes to retrieve the same results.
These columns contain a lot of data (1,000,000 rows or more) with many duplicates.
Can anyone suggest a solution?
Thank you

but with ORDER BY it takes more than 4 minutes to retrieve the same results.
udo already explained how indexes can be used to speed up sorting; that is probably the way you want to go.
But another possible solution is increasing the work_mem setting. This is almost always beneficial, unless you have many queries running at the same time (each sort can use up to work_mem, so memory use multiplies).
When sorting large result sets that don't fit in your work_mem setting, PostgreSQL resorts to a slow disk-based sort. If you allow it to use more memory, it will do fast in-memory sorts instead.
NB! Whenever you ask questions about PostgreSQL performance, you should post the EXPLAIN ANALYZE output for your query, and also the version of Postgres.
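For illustration, a session-level work_mem bump plus the EXPLAIN ANALYZE check could look like this (the 256MB value is just an example, not a recommendation):
-- raise the sort memory for this session only
SET work_mem = '256MB';
-- the plan shows whether the sort ran in memory ("quicksort") or spilled to disk ("external merge")
EXPLAIN ANALYZE
select A, B, lower(C) from myTable ORDER BY A, B;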

Have you tried putting an index on A,B?
That should speed things up.
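For example (a sketch; the index name is made up, the columns are taken from the question):
CREATE INDEX mytable_a_b_idx ON myTable (A, B);
-- then check with EXPLAIN ANALYZE that the planner actually uses it for the ORDER BY
EXPLAIN ANALYZE select A, B, lower(C) from myTable ORDER BY A, B;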

Did you try using DISTINCT to eliminate the duplicates? That may be cheaper than ordering the full result with ORDER BY.
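A sketch, in case the duplicates really aren't needed in the output (note that DISTINCT itself needs a sort or hash step, so compare both variants with EXPLAIN ANALYZE):
SELECT DISTINCT A, B, lower(C) FROM myTable;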

Related

Implementing a high-scale scheduler on a database

We have a Postgres DB with a table of tens of millions of rows.
We also have a scheduler (app code) that runs over those rows, querying for specific assets. Usually what we need are the items that are 30 days old.
As we started to scale, the scheduler became very slow.
What is the best approach to scaling while maintaining good performance? Using a different DB? Redis? ES? Partitioning the Postgres table?
Thanks!
Usually what we need are the items that are 30 days old.
That's the part of your question that's actually relevant. PostgreSQL, when used appropriately, should have absolutely no trouble performing a simple WHERE query with tens of millions of rows. The cost of index lookups grows logarithmically.
To take a stab in the dark: If you are performing date calculations for every row in your WHERE statement, performance will indeed be abysmal. For example:
SELECT * FROM my_data WHERE AGE(CREATED_AT) > INTERVAL '30 days';
...is a rather bad idea. Instead, calculate the date cutoff once, and statically use it in the comparison.
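A sketch of that static variant (here the cutoff is computed once per query by the database; computing it in application code and passing a literal works the same way). With a plain index on created_at the comparison becomes an index range scan instead of a per-row calculation:
-- compute the cutoff once and compare against the raw column
SELECT * FROM my_data WHERE created_at < now() - interval '30 days';
-- a plain index on the column makes the comparison indexable
CREATE INDEX my_data_created_at_idx ON my_data (created_at);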
If your query is really more complicated, you could also look into expression indexes. That is overkill for the example above, and it adds some overhead to all data-modifying operations, but it would make a query like the one above perform as well as the static variant.
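For reference, an expression index has this general shape (column name hypothetical; note that PostgreSQL only allows immutable expressions in an index, so something like AGE(), which depends on the current date, cannot be indexed directly):
CREATE INDEX my_data_lower_name_idx ON my_data (lower(name));
-- a query using the same expression can then use the index
SELECT * FROM my_data WHERE lower(name) = 'some asset';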
In any case: EXPLAIN SELECT ... is your friend, and posting its output will make you even more friends here.

Query Optimization - PostgreSQL

I have a table of 3M rows.
I wanted to retrieve all those rows and do a visualization using dc.js.
The problem I have is that retrieving just a single column takes about 70 secs.
And when I run my full query it takes about 240 secs to retrieve those rows.
I'm using a plain SELECT on the columns, like this:
SELECT COL1, COL2 FROM TABLE
That's it. No grouping, nothing.
But it takes a hell of a lot of time.
I heard about indexing and created an index on the columns I use, but it brought no fruitful results.
You should not retrieve 3M rows in any single query. Sending 3M records will always take a lot of time (this has little to do with the database; it is transfer speed), and it will kill your IO. The bulk of the time is spent on the transfer (IO) between the request originator and the Postgres database.
Consider breaking that request into batches of async requests that get streamed down to the client. That means restructuring your front-end code (JavaScript) for an improved user experience.
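One common way to batch on the database side is keyset pagination; a sketch with hypothetical table and column names, where each request fetches the next slice keyed on the last id already received:
-- first batch
SELECT id, col1, col2 FROM my_table ORDER BY id LIMIT 10000;
-- every following batch: pass in the largest id from the previous batch
SELECT id, col1, col2 FROM my_table WHERE id > :last_seen_id ORDER BY id LIMIT 10000;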
You didn't specify the environment in which you are using PostgreSQL.
As an example, in Node.js you can solve this problem by streaming the data with the help of pg-query-stream and rendering it on the client side at the same time, so the client doesn't have to wait for the query to finish and can see intermediate results.
This may not be the best solution though. A better solution would be to implement data aggregation within a database function to provide a smaller data subset.
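For instance, if the visualization only needs per-day counts and sums rather than the raw rows, an aggregate query like this (hypothetical column names) reduces 3M rows to a few hundred:
SELECT date_trunc('day', created_at) AS day,
       count(*) AS n_rows,
       sum(col2) AS total
FROM my_table
GROUP BY 1
ORDER BY 1;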

Optimize PostgreSQL read-only tables

I have many read-only tables in a Postgres database. All of these tables can be queried using any combination of columns.
What can I do to optimize queries? Is it a good idea to add indexes on all columns of all tables?
Columns that are used for filtering or joining (or, to a lesser degree, sorting) are of interest for indexing. Columns that are just selected are barely relevant!
For the following query only indexes on a and e may be useful:
SELECT a,b,c,d
FROM tbl_a
WHERE a = $some_value
AND e < $other_value;
Here, f and possibly c are candidates, too:
SELECT a,b,c,d
FROM tbl_a
JOIN tbl_b USING (f)
WHERE a = $some_value
AND e < $other_value
ORDER BY c;
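Purely as a sketch, the candidate indexes for the two example queries above could be created like this (index names made up; I assume c lives in tbl_a and the join probes tbl_b; keep only what EXPLAIN ANALYZE proves useful):
CREATE INDEX tbl_a_a_idx ON tbl_a (a);
CREATE INDEX tbl_a_e_idx ON tbl_a (e);
CREATE INDEX tbl_b_f_idx ON tbl_b (f);
CREATE INDEX tbl_a_c_idx ON tbl_a (c);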
After creating indexes, test to see if they are actually useful with EXPLAIN ANALYZE. Also compare execution times with and without the indexes. Deleting and recreating indexes is fast and easy. There are also options for EXPLAIN ANALYZE to experiment with. The difference may be staggering or nonexistent.
As your tables are read only, index maintenance is cheap. It's merely a question of disk space.
If you really want to know what you are doing, start by reading the docs.
If you don't know what queries to expect ...
Try logging enough queries to find typical use cases. Log queries with the parameter log_statement = all for that. Or just log slow queries using log_min_duration_statement.
Create indexes that might be useful and check the statistics after some time to see what actually gets used. PostgreSQL has a whole infrastructure in place for monitoring statistics. One convenient way to study statistics (and for many other tasks) is pgAdmin, where you can choose your table / function / index and get all the data on the "statistics" tab in the object browser (main window).
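If you prefer plain SQL over pgAdmin, the same usage numbers can be read from the statistics views, for example:
-- how often each index has been scanned since the statistics were last reset
SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
ORDER BY idx_scan;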
Proceed as described above to see if indexes in use actually speed things up.
If the query planner chooses to use one or more of your indexes but to no or even adverse effect, then something is probably wrong with your setup and you need to study the basics of performance optimization: vacuum, analyze, cost parameters, memory usage, ...
If you filter by many different column combinations, indexes may help, but not that much. Also, indexes may not help at all for small tables.
First search for "postgresql tuning" - you will find useful information.
If the database can fit in memory - buy enough RAM.
If the database cannot fit in memory - an SSD will help.
If this is not enough and the database is read-only - run 2, 3 or more servers, or partition the database (ideally so that each partition fits in the memory of its server).
Even if the queries are generated, I think they will not be random. Monitor the database for slow queries and improve only those.

Using a UNION or UNION ALL on two select statements makes them incredibly slower

I have two queries, let's call them Query A and Query B.
Both of these queries run in under a second for the scenario I'm testing; Query A returns 1 result and Query B returns 0 results.
If I union (or union all) these two queries, it takes over a minute to return the (expected) 1 result.
Both queries select the same columns from the same tables. I could potentially rewrite this entire thing without a union by having a highly conditional where clause but I was trying to get away from doing that.
Any ideas? I'm not sure how much of the exact query and schema I can get away with sharing, but I'm happy to provide what I can.
This is on MSSQL 2008 if it matters to anyone's response.
I would try looking at the execution plans within Management Studio for the individual queries, and then compare that to the execution plan for the query containing the UNION.
If there's that drastic of a difference in the execution times, I would imagine that there's something wrong with the execution plan for the UNION'd query. Identifying what's different will help point you (and maybe us) in the right direction on what the underlying problem is.
Separate clauses in a UNION that are very similar and on the same tables can be merged into one query by the optimiser. You can see this by the lack of a UNION operator in the query plan. I've seen similar things before, but rarely.
What you can do is a SELECT.. INTO #temp... for the first query, followed by an INSERT #temp... for the second.
Now, where did I read this...
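A minimal sketch of that temp-table workaround (table and column names are made up; add DISTINCT to the final SELECT if you need plain UNION semantics rather than UNION ALL):
-- Query A gets planned and materialised on its own
SELECT Col1, Col2
INTO #results
FROM SomeTable
WHERE Col1 = 1;   -- Query A conditions
-- Query B gets its own plan as well
INSERT INTO #results (Col1, Col2)
SELECT Col1, Col2
FROM SomeTable
WHERE Col2 = 2;   -- Query B conditions
SELECT Col1, Col2 FROM #results;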
Are they both doing table scans? It sounds like it might be exceeding cache capacity and you're caching to disk.
Even if they are from the same table the records would probably lock independently.

What are the pitfalls of setting enable_nestloop to OFF

I have a query in my application that runs very fast when there is a large number of rows in my tables. But when the number of rows is moderate (neither large nor small), the same query runs as much as 15 times slower.
The explain plan shows that the query on a medium-sized data set is using nested loops for its join algorithm. The large data set uses hash joins.
I can discourage the query planner from using nested loops either at the database level (postgresql.conf) or per session (SET enable_nestloop TO off).
What are the potential pitfalls of setting enable_nestloop to off?
Other info: PostgreSQL 8.2.6, running on Windows.
What are the potential pitfalls of setting enable_nestloop to off?
This means that you will never be able to use indexes efficiently.
And it seems that you don't use them now.
A query like this:
SELECT u.name, p.name
FROM users u
JOIN profiles p ON p.id = u.profile_id
WHERE u.id = :id
will most probably use NESTED LOOPS with an INDEX SCAN on users.id and an INDEX SCAN on profiles.id, provided that you have built indexes on these fields.
Queries with low selectivity filters (that is, queries that need more than 10% of data from tables they use) will benefit from MERGE JOINS and HASH JOINS.
But queries like the one given above require NESTED LOOPS to run efficiently.
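As a sketch, the indexes assumed by that plan would normally come from the primary keys (definitions hypothetical):
-- both lookups in the nested loop are primary-key probes
ALTER TABLE users ADD PRIMARY KEY (id);
ALTER TABLE profiles ADD PRIMARY KEY (id);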
If you post your queries and table definitions here, probably much may be done about the indexes and queries performance.
A few things to consider before taking such drastic measures:
upgrade your installation to the latest 8.2.x (which right now is 8.2.12). Even better - consider upgrading to the next stable version which is 8.3 (8.3.6).
consider changing your production platform to something other than Windows. The Windows port of PostgreSQL, although very useful for development purpose, is still not on a par with the Un*x ones.
read the first paragraph of "Planner Method Configuration". This wiki page probably will help too.
I have had the exact same experience. Some queries on a large database were executed using nested loops and took 12 hours, while they run in 30 seconds when nested loops are turned off or the indexes are removed.
Having hints would be really nice here, but I tried
...
SET ENABLE_NESTLOOP TO FALSE;
... critical query
SET ENABLE_NESTLOOP TO TRUE;
...
to deal with this matter. So you can definitely disable and re-enable nested loop use, and you can't argue with a 9000-fold speed increase :)
One problem I had was making the ENABLE_NESTLOOP change inside a PL/pgSQL function. I can run an SQL script in Aqua Data Studio that does everything right, but when I put the same thing into a PL/pgSQL function, it still takes 12 hours. Apparently the change was being ignored.
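One way around that (a sketch; per-function SET clauses require PostgreSQL 8.3 or later, not the 8.2 mentioned in the question, and the function name and body here are made up) is to attach the setting to the function itself, so it applies only while the function runs and is reset automatically afterwards:
CREATE OR REPLACE FUNCTION run_critical_query()
RETURNS void
LANGUAGE plpgsql
SET enable_nestloop = off   -- in effect only for the duration of each call
AS $$
BEGIN
    -- ... the critical query goes here ...
    PERFORM 1;
END;
$$;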