How can I query postgres by page? - postgresql

Assume a table that has many pages and I only want some of it, for example
SELECT * FROM t WHERE...
From pg_class.relpages I know how many pages are there, how can I access some page like
SELECT * FROM t WHERE PAGE_NUMBER=1

Since you mention relpages I assume you are talking about "physical" pages where the data is stored on disk. In general you should mind your own business and let the database server mind its own business. But if you don't want to do that, the you can reference tuples by the hidden system column "ctid", which is a composite consisting of the page number and the slot within the page.
select * from pgbench_accounts where ctid between '(18,0)' and '(18,65535)';
This is the 19th page, as page numbers start at 0. Slot numbers start at 1, but for convenience you are allowed to reference slot 0 as if it were a valid (but empty) slot.
Note that until very recent versions, this will not be fast as it would scan the whole table and filter out the disqualified rows one by one.

For that, you need a sort order. For example
SELECT * FROM t WHERE ...
ORDER BY customer, created_ts;
Make sure that the sort condition is unique. If it isn't, add the primary key at the end.
Then you create an index for the ORDER BY condition:
CREATE INDEX ON t (customer, created_ts);
Then you select the first page of 50 items like this:
SELECT * FROM t WHERE ...
ORDER BY customer, created_ts
LIMIT 50;
You display the page and remember customer and created_ts from the last result row in <latest_cust> and <latest_ts>.
The next page is fetched with
SELECT * FROM t WHERE ...
AND (customer, created_ts) > (<latest_cust>, <latest_ts>)
ORDER BY customer, created_ts
LIMIT 50;
and so on.
This method (it is called “keyset pagination”) is efficient and quite stable in the face of concurrent data modifications.

Related

Sqlalchemy way to handle select based on outdated data

Assuming I have a table sets with a field filters, containing array of key - value mappings, table items to which select query must be applied to extract rows based on these filters, and associated table for M:M relations to link each set with each item. I am seeking for a method or mechanism to cancel select query if sets.filters were updated, otherwise M:M relation will be built invalid as based on yet not refreshed filters.
The concrete scenario when a problem takes place is:
Receive file with items data, to parse, and insert into items returning new relevant ids(primary keys here);
After insertion, select from relevant sets for filters;
Take items ids and select from items using filters;
Update M:M association table for all the items returned at step 3.
So, unfortunately between step 3 and 4 or even earlier, API call makes an update on one of the sets rows, changing its filters. As the result - M:M table is invalid, because one filter was changed(lets say the filters contained kind of weight <= 100 kilos expression, however after the mentioned update it has become weight <= 50 kilos, so if there are some new items with weight greater than 50, those items ids should not be in M:M table, obviously).
Is there some efficient way to cancel select query from items during transaction? Or maybe there is a strong query to use. My idea is to rollback changes post-factum, checking sets.modified_at column. But it seems as doing additional job by wasting disk and cpu time.

How to implement GraphQL Relay style cursor based pagination in postgres (with sqlalchemy and fastAPI)

I am trying to implement Relay-Style pagination on a postgres. I have gone through the relay specification, and it recommends having a globally unique cursor for each entity. The specification for pagination is that the client provides first and after for Forward pagination and last and before for Backward pagination.
Assume I have a table called Products, with an auto incrementing primary key integer ID. In the most basic case, pagination is very straight forward. For forward pagination it would be something like SELECT * FROM products WHERE id > cursor LIMIT 10 for forward pagination, and SELECT * FROM products WHERE id < cursor ORDER BY id DESC LIMIT 10
The problem arises when we have additional sorting requirements. Let us say, our product table has a name field on which we want to sort.
Name Id (cursor field)
Trendy 1
Autumn 2
Winter 3
Crazy 4
Antman 5
Marvel 6
And we want to paginate, with page size of 3, sorted on name ASC. The query would be:
SELECT * FROM products ORDER BY name ASC LIMIT 3, And we would get the following output:
(Antman, 5), (Autumn 1), (Crazy 4)
But now, how will I get the next three? This query SELECT * FROM products WHERE id >= 4 ORDER BY name ASC LIMIT 3; would not work, as Trendy has ID < 1. Do I create a different cursor for Name and use that for paginating with Name? Okay, what if we wanted to sort on multiple fields together (let us say we have a price field, we want to sort from decreasing to increasing price, and then sort based on created time, and then on name. The price might not be unique and any other additional constraints), how would I create a cursor then?
What is the standard approach to solve this problem? I have tried to research as much as I can but I couldn't find any information on how to accomplish this in fastapi and sqlalchemy. Django has FilterableConnectionField, but I couldn't understand the implementation. Any help is very appreciated, thank you.
edit: I figured out how to do this. It is called queryset pagination, and sqlakeyset repo on github had a good implementation of this pattern. https://github.com/djrobstep/sqlakeyset

Filter and display database audit / changelog (activity stream)

I'm developing an application with SQLAlchemy and PostgreSQL. Users of the system modify data in 8 or so tables. Consider this contrived example schema:
I want to add visible logging to the system to record what has changed, but not necessarily how it has changed. For example: "User A modified product Foo", "User A added user B" or "User C purchased product Bar". So basically I want to store:
Who made the change
A message describing the change
Enough information to reference the object that changed, e.g. the product_id and customer_id when an order is placed, so the user can click through to that entity
I want to show each user a list of recent and relevant changes when they log in to the application (a bit like the main timeline in Facebook etc). And I want to store subscriptions, so that users can subscribe to changes, e.g. "tell me when product X is modified", or "tell me when any products in store S are modified".
I have seen the audit trigger recipe, but I'm not sure it's what I want. That audit trigger might do a good job of recording changes, but how can I quickly filter it to show recent, relevant changes to the user? Options that I'm considering:
Have one column per ID type in the log and subscription tables, with an index on each column
Use full text search, combining the ID types as a tsvector
Use an hstore or json column for the IDs, and index the contents somehow
Store references as URIs (strings) without an index, and walk over the logs in reverse date order, using application logic to filter by URI
Any insights appreciated :)
Edit It seems what I'm talking about it an activity stream. The suggestion in this answer to filter by time first is sounding pretty good.
Since the objects all use uuid for the id field, I think I'll create the activity table like this:
Have a generic reference to the target object, with a uuid column with no foreign key, and an enum column specifying the type of object it refers to.
Have an array column that stores generic uuids (maybe as text[]) of the target object and its parents (e.g. parent categories, store and organisation), and search the array for marching subscriptions. That way a subscription for a parent category can match a child in one step (denormalised).
Put a btree index on the date column, and (maybe) a GIN index on the array UUID column.
I'll probably filter by time first to reduce the amount of searching required. Later, if needed, I'll look at using GIN to index the array column (this partially answers my question "Is there a trick for indexing an hstore in a flexible way?")
Update this is working well. The SQL to fetch a timeline looks something like this:
SELECT *
FROM (
SELECT DISTINCT ON (activity.created, activity.id)
*
FROM activity
LEFT OUTER JOIN unnest(activity.object_ref) WITH ORDINALITY AS act_ref
ON true
LEFT OUTER JOIN subscription
ON subscription.object_id = act_ref.act_ref
WHERE activity.created BETWEEN :lower_date AND :upper_date
AND subscription.user_id = :user_id
ORDER BY activity.created DESC,
activity.id,
act_ref.ordinality DESC
) AS sub
WHERE sub.subscribed = true;
Joining with unnest(...) WITH ORDINALITY, ordering by ordinality, and selecting distinct on the activity ID filters out activities that have been unsubscribed from at a deeper level. If you don't need to do that, then you could avoid the unnest and just use the array containment #> operator, and no subquery:
SELECT *
FROM activity
JOIN subscription ON activity.object_ref #> subscription.object_id
WHERE subscription.user_id = :user_id
AND activity.created BETWEEN :lower_date AND :upper_date
ORDER BY activity.created DESC;
You could also join with the other object tables to get the object titles - but instead, I decided to add a title column to the activity table. This is denormalised, but it doesn't require a complex join with many tables, and it tolerates objects being deleted (which might be the action that triggered the activity logging).

PostgreSQL: Returning ordered rows after a specific ID

Scenario:
I am displaying a table of records. It initially displays the first 500 with "show more" at the bottom, which returns the next 500.
Issue:
If between initial display and clicking "show more" 1 record is added, that will cause "order by date, offset 500, limit 500" to overlap by 1 row.
I'd like to "order by date, offset until 'id of last row shown', limit 500"
My row IDs are UUIDs. I am open to alternative approaches that achieve the same result.
If you can order by ID, you can paginate using
where id > $last_seen_id limit 500
but that's not going to be useful where you're sorting by date.
Sort stability!
I really hope that "date" actually means "timestamp" though, otherwise your ordering will be unstable and you can miss rows in pagination; you'll have to order by date, id to get stable ordering if it's really a date, and should probably do so even for timestamp.
State on client
One option is to push the state out to the client. Have the client remember the last-seen (date,id) tuple, and use:
where date > $last_seen_date and id > $last_seen_id limit 500
Cursors
Do you care about scalability? If not, you can use a server-side cursor. Declare the cursor for the full query, without the LIMIT. Then FETCH chunks of rows as requested. To do this your app must have a way to consistently bind a connection to a specific user's requests, though, and not to reset that connection or return it to the pool between requests. This might not be practical with your pool/framework, but is probably the best solution if you can do it.
Temp tables
Another even less scalable option is to CREATE TABLE sessiondata.myuser_myrequest_blah AS SELECT .... then paginate that table. It's guaranteed not to change. This avoids the difficulty of needing to keep a consistent connection across requests, but will have a very slow first-request response time and is completely impractical for large user counts or large amounts of data.
Related questions
Handling paging with changing sort orders
Using "Cursors" for paging in PostgreSQL
How to provide an API client with 1,000,000 database results?
i think you can use a subquery in the where to accomplish this.
e.g. given you're paginating through a users table, and you want the records after a given user:
SELECT *
FROM users
WHERE created_at > (
SELECT created_at
FROM users
WHERE users.id = '00000000-1111-2222-3333-444444444444'
LIMIT 1
)
ORDER BY created_at DESC limit 5;

Pagination logic in Mainframe CICS

Here is my requirement.
Front (Client) end will do a search based on predefined conditions (for instance: customer id, account number, first name, last name, etc). I need to get the data corresponding to this request from a db2 database and send it back to them (Server). We use CICS channels and containers to pass requests and responses between the Client and Server.
Front end needs the data ordered by: Receive date descending, Customer id Ascending, Account number Ascending. Data are fetched in pages of 500 records. For example, if for a search request from front end would retrieve 50,000 records from the db2 database, we need to return this data in 500 record "pages". For pagination concept, we use the field security deposit number which is primary key to our database but the sorting order is not based on this field.
I would like to know whether we can use scrollable cursor logic in CICS to implement pagination.
Please note that I do not prefer to go for internal array bubble sort to send the data in response as it would degrade performance. I like to do it via query logic.any thoughts?
Example (Initial Front end input request):
Customer id : A
First time request (To identify whether it is first time or next or previous request for pagination)
First security deposit number : 0
Last security deposit number : 0
Since this is first time request, both this field will be having zero from front end and we need to retrieve records from database based on condition of security deposit > 0
Db2 database:
There are 700 records for this criteria
Mainframe response for first time:
We will send the first 500 records
Front end will then send request for getting next set of records which will contain:
Customer id: A
Next request
First security deposit numbr: 0
Last security deposit number : 17980
So for this detail, if I query my datbase based on security deposit number > 17980, it may result in duplicate records listing in the screen once again since our sorting order in database is not based on security deposit number
How to impelement this logic??
Many Client/Server applications in an IBM Mainframe environment involve psuedo conversational CICS transactions.
If you are using CICS in psueudo conversational mode it
is not possible for the Server to hold cusors when it RETURNs to the Client. Therefore scrollable cusors
are of little use in this environment. So to answer your basic question: No scrollable cursors cannot be used here.
The "trick" here is to create an SQL predicate in the Server that is restartable. It will then pick up rows in the correct order from any given
stating point. When the Client calls your Server it must pass all of the positioning information to your Server.
Typically, on a first call from a Client all of the positioning values are set to cause the cursor to
position itself starting with what must be the the first row. The Server then pulls in a "page" worth of data
and returns it to the Client. On the next page forward request the Client sets these positioning values to
the last row it displayed and calls the Server for the next "page" of data.
In your situation I would assume that the page forward cursor would look something like this, all the
variables prefixed with RESTART... are what the Client must provide to the Server to start the cursor
in the correct position.
DECLARE CURSOR Page-forward FOR
SELECT Receive_Date, Customer_id, Account_Nbr, Security_Dep_Id
FROM Table_Name
WHERE ( (Receive_Date < :RESTART-RCV-DT)
OR (Receive_Date = :RESTART-RCV-DT AND
Customer_Id > :RESTART-CUSTOMER-ID)
OR (Receive_Date = :RESTART-RCV-DT AND
Customer_Id = :RESTART-CUSTOMER-ID AND
Account_Nbr > :RESTART-ACCT-NBR)
OR (Receive_Date = :RESTART-RCV-DT AND
Customer_Id = :RESTART-CUSTOMER-ID AND
Account_Nbr = :RESTART-ACCT-NBR AND
Security_Dep_Id > :RESTART-SEC-DEP-ID))
ORDER BY 1 DESC, 2 ASC , 3 ASC, 4 ASC
For the initial call the Client would have passed something like '9999-12-31' as the RESTART-RCV-DT, zero
for the RESTART-CUSTOMER-ID, RESTART-ACCT-NBR and SEC-DEP-ID (assuming these are all numeric). If you look at
the cursor predicate carefully you can verify that there cannot be any rows prior to these values - therefore this
will return the first page of data. If the Client needs to page forward after this, it must tell the Server to start
with the next row after the last one it received. To do this it would populate the RESTART... variables with
the values from the last row on the page it just
displayed. This process will drive the cursor selects forward one page at a time.
When paging up, the process is reversed (you will need a second cursor to support this, and the Client needs to tell you which direction to page: Forward or Back). The Client
will need to populate the RESTART variables with the first row it recieved from the Server. The trick
for the Server on a page up request is to return the data
to the Client in reverse order. You may have to populate the data page passed back
to the Client in reverse order (ie. put the first row retrieved into the last row of the paging area shared between
the Client and the Server). The page backward cursor would look something like:
DECLARE CURSOR Page-backward FOR
SELECT Receive_Date, Customer_id, Account_Nbr, Security_Dep_Id
FROM Table_Name
WHERE ( (Receive_Date > :RESTART-RCV-DT)
OR (Receive_Date = :RESTART-RCV-DT AND
Customer_Id < :RESTART-CUSTOMER-ID)
OR (Receive_Date = :RESTART-RCV-DT AND
Customer_Id = :RESTART-CUSTOMER-ID AND
Account_Nbr < :RESTART-ACCT-NBR)
OR (Receive_Date = :RESTART-RCV-DT AND
Customer_Id = :RESTART-CUSTOMER-ID AND
Account_Nbr = :RESTART-ACCT-NBR AND
Security_Dep_Id < :RESTART-SEC-DEP-ID))
ORDER BY 1 ASC, 2 DESC , 3 DESC, 4 DESC
As has been pointed out in other answers, this type of paging process does not manage or detect concurrent
updates to the database that may occur duing paging transactions. That is another topic for another day...
Developing Restartable Cursors
The the key to building a paging Server is to develop a cursor that is restartable from a set of values received
from a Client transaction. This leaves control of cursor positioning and direction with the Client.
It also means the Client must receive all critical positioning data from the Server even though
the Client might not actually
use these data for any other purpose (e.g. From your question I got the impression that the Client may not require
the Security Deposit Id except to supply as a positioning parameter for your Server)
To build a paging Server you need to know
what the required sorting order of the data are (e.g. Receive Date Descending then Customer Id Ascending then
Account Number Ascending).
You also need know the set of data that uniquely identify a row
returned by the cursor. In your case that would be the Security Deposit Id (this is the primary key for the
table you are selecting from so it must be unique for each and every row in that table). Knowing this you then build a
cursor predicate (the stuff in the WHERE clause) that will return data needed by the Client in the required sort order that
also includes
the full positioning key (i.e. Security Deposit Id). In the event that two or more returned rows may contain identical data if
the final positioning key were elimiminated makes it important that the positioning key be included as a sort condition.
It doesn't matter if it is ascending or descending, but it needs to be included on the sort to ensure consistent
order of data retrieval.
A fairly simple formula may be followed to build the predicate for a restartable cusor needed to
support paging Servers. Basically this is a cascade of "OR" clauses connecting a series of "AND" clauses
that become progressively more selective following the sort order required by the Client and end up with the positioning
key.
To see how this works consider how the query for your Server might be developed...
Start with the column from the sort order that changes least often...
SELECT ...
FROM ...
WHERE Receive_Date < restart value
This will retrieve all rows prior to the specified restart Receieve date regardless of what the other
column restart values are (e.g. Customer ID's can range from minimum to maximum values, as long as the Receive Date
is less than any Receive Date "seen" so far). Since this column only changes value after all subortinate sort columns values
have been exausted you can be sure that this does not pick up any rows prior to the full restart key.
But what about those rows that occur on the same date as the restart request but have a
larger Customer Id? These can be picked up with....
SELECT ...
FROM ...
WHERE Receive_Date = restart value AND
Customer_id > restart value
What about those where the Receive Date and Customer Id are the same as the restart key but have
a larger Account Number? These can be picked up with...
SELECT ...
FROM ...
WHERE Receive_Date = restart value AND
Customer_Id = restart value AND
Account_Nbr > restart value
Continue this pattern until the full restart key has been processed. Notice that the inequality
signs are determined by the sort order. Use < when the column is sorted Descending and > when Ascending.
Also notice that the SELECT and FROM clauses
are exactly the same for each query - which means you can put them all together using OR conjuctions...
SELECT Receive_Date, Customer_id, Account_Nbr, Security_Dep_Id
FROM Table_Name
WHERE ( (Receive_Date < :RESTART-RCV-DT)
OR (Receive_Date = :RESTART-RCV-DT AND
Customer_Id > :RESTART-CUSTOMER-ID)
OR (Receive_Date = :RESTART-RCV-DT AND
Customer_Id = :RESTART-CUSTOMER-ID AND
Account_Nbr > :RESTART-ACCT-NBR)
OR (Receive_Date = :RESTART-RCV-DT AND
Customer_Id = :RESTART-CUSTOMER-ID AND
Account_Nbr = :RESTART-ACCT-NBR AND
Security_Dep_Id > :RESTART-SEC-DEP-ID))
ORDER BY 1 DESC, 2 ASC , 3 ASC, 4 ASC
There you go... a restartable cursor for forward paging. Construction of the cursor for backward paging follows a similar pattern, just flip the
sort orders and repeat.
A simplistic approach: Write your SQL to retrieve data according to your criteria, in the sort order you specify. Then only retrieve the keys to the rows you want. Save the keys somewhere you will have access to upon subsequent invocations of your transaction. Look into multi-row select in DB2. Also understand pseudo-conversational programming techniques in CICS.
And now we get to the design implications Bill Woodger mentions, that you do not specify in your question, and which are the reason I'm just hitting the high points of a simplistic approach.
If changes to your result set occur between one invocation and the next, your results will not reflect those changes. You must decide if this is important.
You mention a "front end" but do not specify what it is. If it is a BMS application, you may be able to save the keys in your commarea or in a container. If your front end is a distributed application invoking your transactions via CICS Web Services or CICS Web Support or MQ or raw sockets or whatever, you must design a mechanism to store those keys such that you can uniquely retrieve them — perhaps by sending a contrived key back to the distributed application which it must supply upon subsequent invocations. Then you must have some process to clean up your key store.
Creating a solution to your problem that is unique in your IT shop is not something to be done in isolation. You must involve others who will be tasked with maintaining your application, there may be a group external to your project tasked with making such decisions, there may be infrastructure issues with your solution.
So this isn't so much as an answer to your question as it is an elaboration upon why you may not get an answer, or at least the answer you seem to desire.