keyset pagination with full text search on postgresql

I have a table 'users' with over 100,000 records. I want to start using keyset pagination to speed up fetching records.
The following query works. It fetches the second page of the recordset (starting at user_id 1001 and fetching up to user_id 2000).
SELECT
  user_id,
  username
FROM users
WHERE user_id > 1000
ORDER BY user_id ASC
LIMIT 1000
The problem is: I don't want to order the records by user_id. I have a column named "tokens", which is a tsvector column. I want to perform full text search on the recordset and order the users by rank. The new query:
SELECT
  user_id,
  username,
  ts_rank(tokens, plainto_tsquery('search query')) AS rank
FROM users
WHERE tokens @@ plainto_tsquery('search query')
How can I apply keyset pagination to this second query, so that the results are ordered by rank instead of user_id?
Important:
I tried this one, but this does not work!
SELECT
  user_id,
  username,
  ts_rank(tokens, plainto_tsquery('search query')) AS rank
FROM users
WHERE tokens @@ plainto_tsquery('search query')
  AND ts_rank(tokens, plainto_tsquery('search query')) < $1  -- $1 = last fetched rank
ORDER BY rank DESC
LIMIT 1000
Let's say that, when the results are ordered by 'rank', the 1,000th result has a rank of 0.5. $1 (the last fetched rank) would be 0.5, so I would select all results with rank < 0.5. The problem is that some results may have the same rank. If the 1,001st record also has rank = 0.5, it wouldn't be fetched, because my query says rank < 0.5. I also cannot say rank <= 0.5, because that would fetch the previous results with rank = 0.5 again.
Does anyone know the solution to this problem?

You have to provide a fully deterministic ORDER BY. Assuming user_id is unique:
ORDER BY rank DESC, user_id
Then your WHERE would include:
AND (rank < :last_rank or (rank = :last_rank and user_id > :last_user_id))
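Putting that together with the query from the question gives the following sketch (note that PostgreSQL does not allow referencing the rank output alias in WHERE, so the ts_rank() expression has to be repeated there; :last_rank and :last_user_id come from the last row of the previous page):
SELECT
  user_id,
  username,
  ts_rank(tokens, plainto_tsquery('search query')) AS rank
FROM users
WHERE tokens @@ plainto_tsquery('search query')
  -- keyset condition: strictly lower rank, or the same rank with a higher user_id
  AND (ts_rank(tokens, plainto_tsquery('search query')) < :last_rank
       OR (ts_rank(tokens, plainto_tsquery('search query')) = :last_rank
           AND user_id > :last_user_id))
ORDER BY rank DESC, user_id
LIMIT 1000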
But this will not be efficient, so you might as well just do OFFSET.
Better yet, don't do it at all. No human is realistically going to read through 1000 results, and think "You know, I'd like to do this a few more times". The only one who will do that is the web scraper, and the only reason the web scraper will do it is because it is the only method you offer. Just let them set a LIMIT which is as high as they want, and offer no pagination.

Related

How to limit to just one result per condition when looking through multiple OR/IN conditions in the WHERE clause (Postgresql)

For Example:
SELECT * FROM Customers
WHERE Country IN ('Germany', 'France', 'UK')
I want to LIMIT 1 for each of the countries in my IN clause so I only see a total of 3 rows: one customer per country (one from Germany, one from France, one from the UK). Is there a simple way to do that?
Normally a simple GROUP BY would suffice for this type of problem; however, since you have specified that you want to include ALL of the columns in the result, we can use the ROW_NUMBER() window function to provide a value to filter on.
As a general rule it is important to specify the column to sort on (ORDER BY) for all windowing or paged queries to make the result repeatable.
As no schema has been supplied, I have used Name as the field to sort on for the window; please update that (or the question) with any other field you would like. The PK is a good candidate if you have nothing else to go on.
SELECT * FROM
(
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY Country ORDER BY Name) AS _rn
  FROM Customers
  WHERE Country IN ('Germany', 'France', 'UK')
) t
WHERE _rn = 1
The PARTITION BY makes the ROW_NUMBER restart at 1 for each distinct Country value, so in this case we only select the rows that get a row number (aliased as _rn) of 1.
The WHERE clause could have been in the outer query if you really wanted, but ROW_NUMBER() can only be specified in the SELECT or ORDER BY clauses of a query, so to use it as a filter criterion we are forced to wrap the results in some way.
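Since the question is tagged PostgreSQL, the same "one row per country" result can also be written with Postgres's DISTINCT ON, which keeps the first row of each group according to the ORDER BY. A minimal sketch, again assuming Name as the sort column:
SELECT DISTINCT ON (Country) *
FROM Customers
WHERE Country IN ('Germany', 'France', 'UK')
ORDER BY Country, Name;  -- keeps the alphabetically first customer per country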

The last updated data shows first in the postgres select query?

I have a simple query that fetches some results from the User model.
Query 1:
SELECT users.id, users.name, company_id, updated_at
FROM "users"
WHERE (TRIM(telephone) = '8973847' AND company_id = 90)
LIMIT 20 OFFSET 0;
Result:
Then I updated customer 341683 and ran the same query again; this time the result order was different, with the last-updated row showing first. Is Postgres returning the last updated row first by default, or is something else happening here?
Without an order by clause, the database is free to return rows in any order, and will usually just return them in whichever way is fastest. It stands to reason the row you recently updated will be in some cache, and thus returned first.
If you need to rely on the order of the returned rows, you need to explicitly state it, e.g.:
SELECT users.id, users.name, company_id, updated_at
FROM "users"
WHERE (TRIM(telephone) = '8973847' AND company_id = 90)
ORDER BY id -- Here!
LIMIT 20 OFFSET 0

count max values in postgresql

I am having trouble formulating an SQL query in PostgreSQL and am hoping to get some help here.
I have three tables: employee, visitor, and visit. I want to find out which employee (fk_employee_id) has been responsible for the most visits that haven't been checked out.
I want to write an SQL query that returns just the number one result (with the max function, maybe?) instead of my current one, which returns a ranked list (and this ranked list doesn't work either if the number one position is shared by two people).
This is my current SQL query:
SELECT visitor.fk_employee_id, count(visitor.fk_employee_id)
FROM visit
INNER JOIN visitor ON visit.fk_visitor_id = visitor.visitor_id
WHERE check_out_time IS NULL
GROUP BY visitor.fk_employee_id
LIMIT 1
Anyone know how to do this?
To avoid confusion, I will change the column names to:
visitor table, the FK to employee id : employee_in_charge_id
visit table, the FK to employee id : employee_to_meet_id
From your explanation in the comments, you are looking for the employee who has been responsible for the most visits that have not been checked out.
In the case where more than one employee has the same maximum number of not-checked-out visits, this query lists all of them:
SELECT * FROM
(
SELECT
r.employee_in_charge_id,
count(*) cnt,
rank() over (ORDER BY count(*) DESC)
FROM visit v
JOIN visitor r ON v.visitor_id = r.id
WHERE v.check_out_time IS NULL
GROUP BY r.employee_in_charge_id
) a
WHERE rank = 1;
Refer to the SQLFiddle link: http://sqlfiddle.com/#!17/423d9/2
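If you are on PostgreSQL 13 or newer, a shorter way to keep every employee tied for the top spot is FETCH FIRST ... WITH TIES; a sketch using the same renamed columns:
SELECT r.employee_in_charge_id, count(*) AS cnt
FROM visit v
JOIN visitor r ON v.visitor_id = r.id
WHERE v.check_out_time IS NULL
GROUP BY r.employee_in_charge_id
ORDER BY count(*) DESC
FETCH FIRST 1 ROW WITH TIES;  -- returns all employees sharing the highest count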
Side Note:
To me, it sounds more correct if employee_in_charge_id were part of the visit table rather than the visitor table. My assumption is that for each visit there is one employee (A) who is responsible for handling the visit, and the visitor is meeting one employee (B). So one visitor can make multiple visits, which are handled by different employees.
Anyway, my answer above is based on your original schema design.
Assuming a standard n:m implementation as detailed here, this would be one way to do it:
SELECT fk_employee_id
FROM visit
WHERE check_out_time IS NULL
GROUP BY fk_employee_id
ORDER BY count(*) DESC
LIMIT 1;
Assuming referential integrity, you do not need to include the table visitor in the query at all.
count(*) is a bit faster than count(fk_employee_id) and does the same thing in this case (assuming fk_employee_id is NOT NULL). See:
PostgreSQL: running count of rows for a query 'by minute'

Most efficient way to retrieve rows of related data: subquery, or separate query with GROUP BY?

I have a very simple PostgreSQL query to retrieve the latest 50 news articles:
SELECT id, headline, author_name, body
FROM news
ORDER BY publish_date DESC
LIMIT 50
Now I also want to retrieve the latest 10 comments for each article. I can think of two ways to retrieve them, and I'm not sure which one is best in the context of PostgreSQL:
Option 1:
Do a subquery directly for the comments in the original query and cast the result to an array:
SELECT headline, author_name, body,
ARRAY(
SELECT id, message, author_name
FROM news_comments
WHERE news_id = n.id
ORDER BY date DESC
LIMIT 10
) AS comments
FROM news n
ORDER BY publish_date DESC
LIMIT 50
Obviously, in this case, application logic would need to be aware of which index in the array is which column; that's no problem.
The one problem I see with the method is not knowing how the query planner would execute it. Would this effectively turn into 51 queries?
Option 2:
Use the original very simple query:
SELECT id, headline, author_name, body
FROM news
ORDER BY publish_date DESC
LIMIT 50
Then, via application logic, gather all of the news ids and use those in a separate query; row_number() would have to be used here in order to limit the number of results per news article:
SELECT *
FROM (
  SELECT *,
    row_number() OVER (
      PARTITION BY news_id
      ORDER BY date DESC
    ) AS rn
  FROM (
    SELECT *
    FROM news_comments
    WHERE news_id IN (123, 456, 789)
  ) s
) s
WHERE rn <= 10
This approach is obviously more complicated, and I'm not sure if this would have to retrieve all comments for the scoped news articles first, then chop off the ones where the row count is greater than 10.
Which option is best? Or is there an even better solution I have overlooked?
For context: this is a news aggregator site I've developed myself. I currently have about 40,000 news articles across several categories, with about 500,000 comments, so I'm looking for the best solution to help me keep growing.
You should investigate the execution plan for your statements using at least EXPLAIN ANALYZE. This shows you the plan chosen by the optimizer while executing the statement itself, giving you back actual run times and other statistics as well.
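For example, running the base query from the question through it would look like this (a minimal sketch):
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, headline, author_name, body
FROM news
ORDER BY publish_date DESC
LIMIT 50;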
Another solution would be to use a LATERAL subquery to retrieve the 10 comments for each news article as separate rows, but then again, you need to investigate and compare plans to choose the approach that works best for you:
SELECT
n.id, n.headline, n.author_name, n.body,
c.id, c.message, c.author_name
FROM news n
LEFT JOIN LATERAL (
SELECT id, message, author_name
FROM news_comments nc
WHERE n.id = nc.news_id
ORDER BY nc.date DESC
LIMIT 10
) c ON TRUE
ORDER BY publish_date DESC
LIMIT 50
When your query contains LATERAL, the subquery is evaluated for each row retrieved from news, using the cross-reference in its WHERE clause. This makes it a repeated execution, joining the information retrieved from it to each row from your source table news.
This approach would save the time your application logic needs to deal with the arrays coming out of option 1, while also not having to issue the separate queries of option 2, saving you (in this case) the time needed to open separate transactions, establish connections, retrieve rows, etc.
It would be good to look for performance improvements by creating indexes and by looking into the planner cost constants and planner method configuration parameters, which you can experiment with to understand the choices the planner has made. More on the subject here.
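As a starting point (a sketch, assuming the table and column names used in the question), indexes like these would typically support the queries discussed here:
CREATE INDEX ON news (publish_date DESC);             -- serves ORDER BY publish_date DESC LIMIT 50
CREATE INDEX ON news_comments (news_id, date DESC);   -- serves the per-article latest-comments lookup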

Optimize large offset in select

I have a table
Users (user_id integer, user_name string, scores integer)
That table will contain 1-6 million records. It has indexes on user_name and scores.
The user will input his name, and I should show him one page from that table, ordered by scores, that contains him among the other users.
I do it in 2 queries:
First:
select user_id from (
  select row_number() over (order by scores desc), user_id
  from users
  where user_name = 'name'
  limit 1
) t
Second:
select * from users order by scores desc limit 20 offset The_User_Id/20+1
Then I get the page that contains my user among the others.
But when the user is in the middle of a table with millions of records, the offset can be 500,000, which is slow (about 1-2 seconds). How can I improve this?
Offset itself makes your query slow.
If you don't need pure SQL and can use a programming language to build the query, why not consider paging through results: order the second query by user_id with LIMIT 20 and a WHERE condition on the last seen value for pagination, instead of using OFFSET.
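A minimal sketch of that idea, applied to the question's requirement of ordering by scores (user_id is used as a tie-breaker; :last_scores and :last_user_id are taken from the last row of the previous page):
select user_id, user_name, scores
from users
where (scores, user_id) < (:last_scores, :last_user_id)  -- row-value comparison: strictly after the last seen row
order by scores desc, user_id desc
limit 20;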