Efficient retrieval of the latest value in a large table (PostgreSQL)

After working on finding an efficient way to query a table in the format below, I am currently using this query...
select distinct on (symbol, date) date, symbol, value, created_time
from "test_table"
where symbol in ('symbol15', 'symbol19', 'symbol36', 'symbol54', 'symbol13', 'symbol90', 'symbol115', 'symbol145', 'symbol165', 'symbol12')
order by symbol, date, created_time desc
With this index...
test_table(symbol, date, created_time)
Below is a sample of the data to show which columns I am working with. The real table has about 13 million rows.
date        symbol   value   created_time
2010-01-09  symbol1  101     3847474847
2010-01-10  symbol1  102     3847474847
2010-01-10  symbol1  102.5   3847475500
2010-01-10  symbol2  204     3847474847
2010-01-11  symbol1  109     3847474847
2010-01-12  symbol1  105     3847474847
2010-01-12  symbol2  206     3847474847
Based on the EXPLAIN ANALYZE output, it currently looks like 80+% of the query time is spent sorting. Any idea how to improve the speed of this query? I need to get the latest created_time for each date and symbol combination.

Since your WHERE clause filters only on the symbol column, a smaller index on just that column may be all this query needs.
I advise you to try an index on symbol:
CREATE INDEX ON test_table(symbol);
Also, this is probably a better way to write your query:
SELECT date, symbol, MAX(created_time)
FROM "test_table"
WHERE symbol in ('symbol15', 'symbol19', 'symbol36', 'symbol54', 'symbol13', 'symbol90', 'symbol115', 'symbol145', 'symbol165', 'symbol12')
GROUP BY date, symbol
ORDER BY symbol, date
LIMIT 10;
Adding a limit will greatly improve the performance if that is an option.
You should run EXPLAIN ANALYZE SELECT... to get a better understanding of which indexes are used or not and how PostgreSQL is running your query.
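For example, a minimal sketch against the query from the question (the IN list is shortened here for brevity; the BUFFERS option simply adds I/O detail):
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (symbol, date) date, symbol, value, created_time
FROM test_table
WHERE symbol IN ('symbol15', 'symbol19', 'symbol36')
ORDER BY symbol, date, created_time DESC;
Look for whether the plan uses an Index Scan on your index or a Seq Scan followed by a Sort node.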

You might consider creating a partial (filtered) index for this purpose - but be aware it will not help if your IN clause changes to include values that are not covered by the index's filter. It can also slow down INSERTs to your table, since the index has to evaluate whether each inserted row matches the filter - so if you're doing lots of inserts and can't afford any additional penalty, keep that in mind. You should also specify that you want date and created_time descending in the index.
E.g.
CREATE INDEX test_table_ix ON test_table (symbol, date DESC, created_time DESC)
WHERE (symbol in ('symbol15', 'symbol19', 'symbol36', 'symbol54', 'symbol13', 'symbol90', 'symbol115', 'symbol145', 'symbol165', 'symbol12'));
see: https://www.postgresql.org/docs/8.0/static/indexes-partial.html and https://www.postgresql.org/docs/9.6/static/indexes-ordering.html
Your query would then be able to use this index and should see some benefit - just keep in mind this index has some cost associated with it, and consider whether your query is run frequently enough to justify it. You might see a benefit just by applying the ordering to your existing index as well.

Without the ability to properly test this over 13 million rows, the problem is always going to be the sorting needed to establish "latest". Although I am a little reluctant to propose it here, row_number() over() is often a good technique for arriving at "latest".
An index that mimics the way you need to sort to establish "latest" is the most likely to assist, so I expect an index on symbol, date, created_time desc would be useful (a sample definition is shown after the query below).
select date, symbol, value, created_time
from (select date, symbol, value, created_time
, row_number() over(partition by symbol, date order by created_time DESC) rn
from test_table
where symbol in ('symbol15', 'symbol19', 'symbol36', 'symbol54', 'symbol13', 'symbol90', 'symbol115', 'symbol145', 'symbol165', 'symbol12')
) d
where rn = 1
order by symbol, date, created_time desc
;
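The index described above might look something like this (just a sketch; the index name is made up):
CREATE INDEX test_table_symbol_date_ctime_idx
ON test_table (symbol, date, created_time DESC);
With a matching index the planner may be able to read each (symbol, date) group in already-sorted order and avoid an explicit sort over all 13 million rows.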

The index you are using is already the best one for this. Since you did not show the EXPLAIN ANALYZE output, I suggest you try the VALUES syntax:
select distinct on (symbol, date) date, symbol, value, created_time
from test_table
where symbol in (values ('symbol15'), ('symbol19'), ('symbol36'), ('symbol54'), ('symbol13'), ('symbol90'), ('symbol115'), ('symbol145'), ('symbol165'), ('symbol12'))
order by symbol, date, created_time desc
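An equivalent formulation joins against the VALUES list explicitly, which sometimes produces a different (and better) plan for long lists; this is just a sketch of the same query:
select distinct on (t.symbol, t.date) t.date, t.symbol, t.value, t.created_time
from test_table t
join (values ('symbol15'), ('symbol19'), ('symbol36'), ('symbol54'), ('symbol13'), ('symbol90'), ('symbol115'), ('symbol145'), ('symbol165'), ('symbol12')) v(symbol)
on t.symbol = v.symbol
order by t.symbol, t.date, t.created_time desc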

Related

How to limit to just one result per condition when looking through multiple OR/IN conditions in the WHERE clause (Postgresql)

For Example:
SELECT * FROM Customers
WHERE Country IN ('Germany', 'France', 'UK')
I want to LIMIT 1 for each of the countries in my IN clause so I only see a total of 3 rows: one customer per country (1 German, 1 French, 1 UK). Is there a simple way to do that?
Normally a simple GROUP BY would suffice for this type of problem; however, since you have specified that you want to include ALL of the columns in the result, we can use the ROW_NUMBER() window function to provide a value to filter on.
As a general rule it is important to specify the column to sort on (ORDER BY) for all windowing or paged queries to make the result repeatable.
As no schema has been supplied, I have used Name as the field to sort on for the window; please update that (or the question) with whichever field you prefer. The PK is a good candidate if you have nothing else to go on.
SELECT * FROM
(
SELECT *
, ROW_NUMBER() OVER(PARTITION BY Country ORDER BY Name) AS _rn
FROM Customers
WHERE Country IN ('Germany', 'France', 'UK')
) AS ranked -- PostgreSQL requires an alias on a derived table
WHERE _rn = 1
The PARTITION BY forces the ROW_NUMBER to be counted across all records with the same Country value, starting at 1, so in this case we only select the rows that get a row number (aliased as _rn) of 1.
The WHERE clause could have been in the outer query if you really wanted, but ROW_NUMBER() can only appear in the SELECT or ORDER BY clauses of a query, so to use it as a filter criterion we are forced to wrap the results in some way.
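Since the question is about PostgreSQL specifically, DISTINCT ON is another option that avoids the wrapper query entirely. A sketch, still assuming Name is the tie-breaking sort column:
SELECT DISTINCT ON (Country) *
FROM Customers
WHERE Country IN ('Germany', 'France', 'UK')
ORDER BY Country, Name;
DISTINCT ON keeps the first row for each Country according to the ORDER BY, which is the same effect as filtering on _rn = 1.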

Most efficient way to retrieve rows of related data: subquery, or separate query with GROUP BY?

I have a very simple PostgreSQL query to retrieve the latest 50 news articles:
SELECT id, headline, author_name, body
FROM news
ORDER BY publish_date DESC
LIMIT 50
Now I also want to retrieve the latest 10 comments for each article as well. I can think of two ways to accomplish retrieving them and I'm not sure which one is best in the context of PostgreSQL:
Option 1:
Do a subquery directly for the comments in the original query and cast the result to an array:
SELECT headline, author_name, body,
ARRAY(
SELECT (id, message, author_name) -- row constructor: ARRAY(...) needs a single output column
FROM news_comments
WHERE news_id = n.id
ORDER BY date DESC
LIMIT 10
) AS comments
FROM news n
ORDER BY publish_date DESC
LIMIT 50
Obviously, in this case, application logic would need to be aware of which index in the array is which column, that's no problem.
The one problem I see with the method is not knowing how the query planner would execute it. Would this effectively turn into 51 queries?
Option 2:
Use the original very simple query:
SELECT id, headline, author_name, body
FROM news
ORDER BY publish_date DESC
LIMIT 50
Then, via application logic, gather all of the news ids and use them in a separate query; row_number() would have to be used here in order to limit the number of results per news article:
SELECT *
FROM (
SELECT *,
row_number() OVER(
PARTITION BY news_id
ORDER BY date DESC
) AS rn
FROM (
SELECT *
FROM news_comments
WHERE news_id IN(123, 456, 789)
) s
) s
where rn <= 10
This approach is obviously more complicated, and I'm not sure whether it would have to retrieve all comments for the scoped news articles first and then chop off the ones where the row number is greater than 10.
Which option is best? Or is there an even better solution I have overlooked?
For context, this is a news aggregator site I've developed myself, I currently have about 40,000 news articles across several categories, with about 500,000 comments, so I'm looking for the best solution to help me keep growing.
You should investigate the execution plans for your statements using at least EXPLAIN ANALYZE. This executes the statement and shows you the plan chosen by the optimizer, along with actual run times and other statistics.
Another solution would be to use a LATERAL subquery to retrieve the 10 comments for each news article as separate rows - but then again, you need to investigate and compare plans to choose the approach that works best for you:
SELECT
n.id, n.headline, n.author_name, n.body,
c.id, c.message, c.author_name
FROM news n
LEFT JOIN LATERAL (
SELECT id, message, author_name
FROM news_comments nc
WHERE n.id = nc.news_id
ORDER BY nc.date DESC
LIMIT 10
) c ON TRUE
ORDER BY publish_date DESC
LIMIT 50
Because the query contains a LATERAL cross-reference, the subquery is evaluated once for each row retrieved from news, using the correlation in its WHERE clause, and the rows it returns are joined to that row of the source table.
This approach saves your application logic from having to deal with the arrays coming out of option 1, while not having to issue separate queries as in option 2, saving you (in this case) the time needed to open separate transactions, establish connections, retrieve rows, etc.
It would also be good to look for performance improvements by creating indexes, and by looking into the planner cost constants and planner method configuration parameters that you can experiment with to understand the choices the planner has made. More on the subject here.
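For the LATERAL query above, an index that matches both the correlation and the sort is the obvious candidate. A sketch, assuming news_comments has the news_id and date columns used in the question (adjust names as needed):
CREATE INDEX news_comments_news_id_date_idx
ON news_comments (news_id, date DESC);
Each per-article lookup can then walk straight to the 10 most recent comments instead of sorting all of that article's comments.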

Postgresql: order by two boolean and 1 timestamp columns

I'm having trouble with a query that becomes ghastly slow as the database grows.
The problem seems to be the sorting, which depends on three conditions - importance, urgency and timestamp.
The query currently in use is plain old
ORDER BY urgent DESC, important DESC, date_published DESC
Fields are boolean for urgent and important, and date_published is an integer (UNIX timestamp).
Create indexes for the columns you sort by regularly. You can even create a compound index:
CREATE INDEX foo ON table_name (urgent DESC, important DESC, date_published DESC);
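For the index to help, the query's ORDER BY has to match the index definition; a sketch using the columns from the question (the LIMIT is illustrative):
SELECT *
FROM table_name
ORDER BY urgent DESC, important DESC, date_published DESC
LIMIT 50;
With a matching index the planner can return the first rows straight from the index instead of sorting the whole table, and the LIMIT is what makes the saving dramatic.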

PostgreSQL DISTINCT problem: works locally but not on server

I've come across a vexing problem with a PostgreSQL query. This works in my local development environment:
SELECT distinct (user_id) user_id, created_at, is_goodday
FROM table
WHERE ((created_at >= '2011-07-01 00:00:00') AND user_id = 95
AND (created_at < '2011-08-01 00:00:00'))
ORDER BY user_id, created_at ASC;
...but gives the following error on my QA server (which is on Heroku):
PGError: ERROR: syntax error at or near "user_id"
LINE 1: SELECT distinct (user_id) user_id, created_at,
^
Why could this be?
Other possibly relevant info:
I have tried single-quoting and double-quoting the field names
It's a Rails 3 app, but I'm using this SQL raw, i.e. no ActiveRecord magic
My local version of Postgres is 9.0.4 on Mac, but I have no idea what version Heroku is using
As per your comment, the standard PostgreSQL version of that query would be:
SELECT user_id, created_at, is_goodday
FROM table
WHERE created_at >= '2011-07-01 00:00:00'
AND created_at < '2011-08-01 00:00:00'
AND user_id = 95
ORDER BY created_at DESC, id DESC
LIMIT 1
You don't need user_id in the ORDER BY because you have user_id = 95; you want created_at DESC in the ORDER BY to put the most recent created_at at the top; then you LIMIT 1 to slice off just the first row in the result set. GROUP BY can be used to enforce uniqueness, or when you need to group things for an aggregate function, but you don't need it for either of those here: you get uniqueness through ORDER BY and LIMIT, and you can hide the aggregation inside the ORDER BY (i.e. you don't need MAX because ORDER BY does that for you).
Since you have user_id = 95 in your WHERE, you don't need user_id in the SELECT but you can leave it in if that makes it easier for you in Ruby-land.
It is possible that you could have multiple entries with the same created_at so I added an id DESC to the ORDER BY to force PostgreSQL to choose the one with the highest id. There's nothing wrong with being paranoid when they really are out to get you and bugs definitely are out to get you.
Also, you want DESC in your ORDER BY to get the highest values at the top, ASC puts the lowest values at the top. The more recent timestamps will be the higher ones.
In general, the GROUP BY and SELECT have to match up because:
When GROUP BY is present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions, since there would be more than one possible value to return for an ungrouped column.
But that doesn't matter here because you don't need a GROUP BY at all. I linked to the 8.3 version of the documentation to match the PostgreSQL version you're using.
There are probably various other ways to do this, but this one is probably as straightforward and clear as you're going to get.
Put quotes around the value, like user_id = '95'. Your query should be:
SELECT distinct (user_id) as uid, created_at, is_goodday FROM table WHERE
((created_at >= '2011-07-01 00:00:00') AND user_id = '95' AND (created_at < '2011-08-01 00:00:00')) ORDER BY user_id, created_at ASC;
You're using DISTINCT ON (without writing the ON). Perhaps you should write the ON. Perhaps your postgres server dates from before the feature was implemented (which is pretty old by now).
If all else fails, you can always do that with some GROUP BY...
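A sketch of what the DISTINCT ON form would look like here (one row per user_id, newest first, using the placeholder table name from the question):
SELECT DISTINCT ON (user_id) user_id, created_at, is_goodday
FROM table
WHERE created_at >= '2011-07-01 00:00:00'
AND created_at < '2011-08-01 00:00:00'
AND user_id = 95
ORDER BY user_id, created_at DESC;
Note that the leading ORDER BY expressions must match the DISTINCT ON expressions.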

Optimizing select query with limit and order by

Following is my query:
select * from table order by timestamp desc limit 10
this takes too much time compared to
select * from table limit 10
How can I optimize the first query to get to near performance of second query.
UPDATE: I don't have control over the db server, so can not index columns to gain performance.
Create an index on timestamp.
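A sketch of what that would look like, using my_table as a stand-in for the question's placeholder table name:
CREATE INDEX my_table_timestamp_idx ON my_table (timestamp);
The planner can scan that index backwards to satisfy ORDER BY timestamp DESC LIMIT 10, reading only the ten newest entries instead of sorting the whole table.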
Quassnoi is correct -- you need an index on timestamp.
That said, if your timestamp field roughly follows your primary key order (e.g. a date_created or an invoice_date field), you can try this workaround:
select *
from (select * from table order by id desc limit 1000) as t
order by timestamp desc limit 10;
#Nishan is right. There is little you can do. If you do not need every column in the table you may gain a few milliseconds by explicitly asking for just the columns you need
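For example, with made-up column names (the question only shows select *):
select id, status, timestamp
from my_table
order by timestamp desc
limit 10;
Narrower rows mean less data read and sent to the client, although without an index on timestamp the sort itself will still dominate.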