Optimize large offset in select - postgresql

I have a table:
Users (user_id integer, user_name string, scores integer)
The table will contain 1-6 million records. It has indexes on user_name and scores.
The user enters his name and I should show him one page of that table, ordered by scores, with him among the other users.
I do it in 2 queries:
First:
select user_id from (
    select row_number() over (order by scores desc),
           user_id
    from users
    where user_name = 'name' limit 1
) t
Second:
select * from users limit 20 offset The_User_Id/20+1
Then I get the page that contains my user among the others.
But when the user is in the middle of a table with millions of records, the offset is around 500000 and the query is slow, about 1-2 seconds. How can I improve it?

The OFFSET itself is what makes your query slow.
If you don't need pure SQL and can use a programming language to build the query, why not page through the results instead? Order the second query by a key, remember the last value you fetched, and use LIMIT 20 with a WHERE condition on that value rather than OFFSET.
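A minimal sketch of what that could look like, assuming the page is ordered by scores with user_id as a tie-breaker, and that :last_scores and :last_user_id are taken from the last row of the previous page:
select user_id, user_name, scores
from users
where (scores, user_id) < (:last_scores, :last_user_id)  -- keyset condition instead of OFFSET
order by scores desc, user_id desc
limit 20;
With an index on (scores, user_id) this can run as a short backward index scan no matter how deep into the table the page is.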

Does the index column order matter on row insert in Postgresql?

I have a table similar to this one:
create table request_journal
(
    id bigint,
    request_body text,
    request_date timestamp,
    user_id bigint
);
It is used for request logging purposes, so frequent inserts in it are expected (2k+ rps).
I want to create composite index on columns request_date and user_id to speed up execution of select queries like this:
select *
from request_journal
where request_date between '2021-07-08 10:00:00' and '2021-07-08 16:00:00'
and user_id = 123
order by request_date desc;
I tested the select queries with a (request_date desc, user_id) btree index and with a (user_id, request_date desc) btree index. With request_date as the leading column the select queries run about 10% faster, but the performance of either index is acceptable.
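For reference, the two indexes being compared would be created roughly like this (the index names are made up):
create index request_journal_date_user_idx on request_journal (request_date desc, user_id);
create index request_journal_user_date_idx on request_journal (user_id, request_date desc);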
So my question is: does the index column order affect insertion time? I have not spotted any difference using EXPLAIN/EXPLAIN ANALYZE on the insert query. Which index will be more efficient to maintain under high load?
It is hard to believe your tests were done on any vaguely realistic data size.
At the rate you indicate, a 6-hour range would include over 43 million records. If the user_ids are spread evenly over 1e6 distinct values, I find the index leading with user_id to be about a thousand times faster for that query than the one leading with request_date.
But anyway, for loading new data, assuming the new data all has recent timestamps, the index leading with request_date should be faster, as the part of the index that needs maintenance during loading will be concentrated in one region of the index and therefore better cached. How much this matters depends on how much RAM you have, what your disk system is like, and how many distinct user_ids you are loading data for.

keyset pagination with full text search on postgresql

I have a table 'users' with +100,000 records. I want to start making use of keyset pagination to speed up the process of fetching records.
The following query works. This query fetches the second page of the recordset (starting at user_id: 1001, and fetching until user_id: 2000).
SELECT user_id, username
FROM users
WHERE user_id > 1000
ORDER BY user_id ASC
LIMIT 1000
The problem is: I don't want to order the records on user_id. I have a column named "tokens", which is a tsvector column (populated with to_tsvector). I want to perform a full text search on the recordset and order the users by rank. The new query:
SELECT user_id,
       username,
       ts_rank(tokens, plainto_tsquery('search query')) AS rank
FROM users
WHERE tokens @@ plainto_tsquery('search query')
How can I apply keyset pagination to this second query, so the results are ordered by rank instead of user_id?
Important:
I tried this one, but this does not work!
SELECT user_id,
       username,
       ts_rank(tokens, plainto_tsquery('search query')) AS rank
FROM users
WHERE tokens @@ plainto_tsquery('search query')
  AND ts_rank(tokens, plainto_tsquery('search query')) < $1 -- $1 = last fetched rank
ORDER BY rank DESC
LIMIT 1000
Let's say that, when the results are ordered by rank, the 1,000th result has a rank of 0.5. $1 (the last fetched rank) would be 0.5, so I would select all results with rank < 0.5. The problem is that some results may have the same rank. If the 1,001st record also has rank = 0.5, it would not be fetched, because my query says rank < 0.5. I also cannot say rank <= 0.5, because that would fetch the previous results with rank = 0.5 again.
Does anyone know the solution to this problem?
You have to provide a fully deterministic ORDER BY. Assuming user_id is unique:
ORDER BY rank desc, user_id
Then your WHERE would include:
AND (rank < :last_rank or (rank = :last_rank and user_id > :last_user_id))
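Put together, the query from the question would look roughly like this (:last_rank and :last_user_id are the values from the last row of the previous page; the rank alias cannot be referenced in WHERE, so the expression is repeated):
SELECT user_id,
       username,
       ts_rank(tokens, plainto_tsquery('search query')) AS rank
FROM users
WHERE tokens @@ plainto_tsquery('search query')
  AND (ts_rank(tokens, plainto_tsquery('search query')) < :last_rank
       OR (ts_rank(tokens, plainto_tsquery('search query')) = :last_rank
           AND user_id > :last_user_id))
ORDER BY rank DESC, user_id
LIMIT 1000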
But this will not be efficient, so you might as well just do OFFSET.
Better yet, don't do it at all. No human is realistically going to read through 1000 results, and think "You know, I'd like to do this a few more times". The only one who will do that is the web scraper, and the only reason the web scraper will do it is because it is the only method you offer. Just let them set a LIMIT which is as high as they want, and offer no pagination.

Postgres optimizing in an "A OR exists(B)" query

I'm having a lot of trouble with a particular case in my Postgres optimization.
Essentially, I have three tables, which I will simplify as:
Place: id, name (String), select (Boolean)
Bookmark: id, user (Integer), place (Integer)
User: id, name (String)
The Place table has several million rows (and growing), but only a relatively small number of them have select set to true.
I have several indexes on these tables, obviously on all the id columns, plus a partial one on place where "select" = true, and a unique one on the bookmark (user, place) combination. There are more, but I think they're not relevant here.
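The exact definitions aren't shown, but those two indexes would look roughly like this (the index names, and the column used for the partial index, are guesses):
CREATE INDEX place_selected_idx ON place (id) WHERE "select";
CREATE UNIQUE INDEX bookmark_user_place_key ON bookmark ("user", place);
-- "select" and "user" are quoted in the DDL because both are reserved words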
When I do a query of the type:
SELECT *
FROM place
WHERE "select"
LIMIT 10;
it takes 3ms.
When I do a query of the type:
SELECT *
FROM place
WHERE exists (SELECT id
FROM bookmark
WHERE user IN (1,2,3,4)
AND bookmark.place = place.id)
LIMIT 10;
it's also blazing fast.
However, if I do an OR on both conditions, like so:
SELECT *
FROM place
WHERE "select"
OR exists (SELECT id
FROM bookmark
WHERE user IN (1,2,3,4)
AND bookmark.place = place.id)
LIMIT 10;
it slows down to over 1s.
Besides doing two queries in my code and combining the results, is there any way I can optimize this?
The old problem: OR is a performance killer.
Use UNION:
(SELECT * FROM place
 WHERE "select"
 LIMIT 10)
UNION
(SELECT * FROM place
 WHERE exists (SELECT 1 FROM bookmark
               WHERE user IN (1,2,3,4)
                 AND bookmark.place = place.id)
 LIMIT 10)
LIMIT 10;

Postgres pagination with non-unique keys?

Suppose I have a table of events with (indexed) columns id : uuid and created : timestamp.
The id column is unique, but the created column is not. I would like to walk the table in chronological order using the created column.
Something like this:
SELECT * FROM events WHERE created >= $<after> ORDER BY created ASC LIMIT 10
Here $<after> is a template parameter that is taken from the previous query.
Now, I can see two issues with this:
Since created is not unique, the order will not be fully defined. Perhaps the sort should be id, created?
Each row should only be on one page, but with this query the last row is always included on the next page.
How should I go about this in Postgres?
SELECT * FROM events
WHERE created >= $<after> and (id >= $<id> OR created > $<after>)
ORDER BY created ASC, id ASC LIMIT 10
That way the events within each timestamp value are ordered by id, and you can split pages anywhere.
you can say the same thing this way:
SELECT * FROM events
WHERE (created,id) >= ($<after>,$<id>)
ORDER BY created ASC, id ASC LIMIT 10
and for me this produces a slightly better plan.
An index on (created, id) will help performance most, but for many circumstances an index on created alone may suffice.
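For example (the index name is made up):
CREATE INDEX events_created_id_idx ON events (created, id);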
First, as you said, you should enforce a total ordering. Since the main thing you care about is created, you should start with that. id could be the secondary ordering, a tie breaker invisible to the user that just ensures the ordering is consistent. Secondly, instead of messing around with conditions on created, you could just use an offset clause to return later results:
SELECT * FROM events ORDER BY created ASC, id ASC LIMIT 10 OFFSET <10 * page number>
-- Note that page number is zero based

Looping through unique dates in PostgreSQL

In Python (pandas) I read from my database and then use a pivot table to aggregate the data for each day. The raw data I am working with is about 2 million rows per day, one row per person per 30 minutes. I am aggregating it to daily granularity instead, so it is much smaller for visualization.
So in pandas, I would read each date into memory and aggregate it and then load it into a fresh table in postgres.
How can I do this directly in Postgres? Can I loop through each unique report_date in my table, group by it, and then append the result to another table? I assume doing it in Postgres would be fast compared to reading the data over the network in Python, writing a temporary .csv file, and then writing it back over the network.
Here's an example: Suppose that you have a table
CREATE TABLE post (
posted_at timestamptz not null,
user_id integer not null,
score integer not null
);
representing the scores various users have earned from posts they made on an SO-like forum. Then the following query
SELECT user_id, posted_at::date AS day, sum(score) AS score
FROM post
GROUP BY user_id, posted_at::date;
will aggregate the scores per user per day.
Note that this will consider that the day changes at 00:00 UTC (like SO does). If you want a different time, say midnight Paris time, then you can do it like so:
SELECT user_id, (posted_at AT TIME ZONE 'Europe/Paris')::date AS day, sum(score) AS score
FROM post
GROUP BY user_id, (posted_at AT TIME ZONE 'Europe/Paris')::date;
To get good performance for the above queries, you might want to create an expression index on (user_id, posted_at::date), or similarly for the second case.
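To append the aggregated rows to another table, as the question asks, the aggregation can be wrapped in an INSERT ... SELECT; a sketch, assuming a hypothetical daily_score target table:
CREATE TABLE IF NOT EXISTS daily_score (
    user_id integer not null,
    day date not null,
    score bigint not null   -- sum(integer) returns bigint
);

INSERT INTO daily_score (user_id, day, score)
SELECT user_id, posted_at::date AS day, sum(score) AS score
FROM post
GROUP BY user_id, posted_at::date;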