cursors: atomic MOVE then FETCH? - postgresql

the intent
Basically, pagination: a CURSOR (created using DECLARE outside any function here, but that can be changed if need be) is accessed concurrently to retrieve batches of rows, which implies moving the cursor position before fetching more than one row (FETCH count seems to be the only way to fetch several rows at once).
the context
During a more global transaction (i.e. using one connection), I want to retrieve a range of rows through a cursor. To do so, I:
MOVE the cursor to the desired position (e.g. MOVE 42 FROM "mycursor")
then FETCH the amount of rows (e.g. FETCH FORWARD 10 FROM "mycursor")
However, this transaction is used by many workers (horizontally scaled), each receiving a set of "coordinates" for the cursor, like LIMIT and OFFSET: the index to MOVE to, and the number of rows to FETCH. These workers use the DB connection through HTTP calls to a single DB API which handles the pool of connections and keeps the transactions alive.
Because of this concurrent access to the transaction/connection, I need to ensure atomic execution of the couple "MOVE then FETCH".
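For concreteness, a minimal sketch of the flow described above, with a hypothetical big_table as the underlying relation (note that MOVE 42 advances the cursor 42 rows from its current position, whereas MOVE ABSOLUTE 42 positions it on row 42 regardless of where it currently stands):

    BEGIN;

    DECLARE "mycursor" SCROLL CURSOR FOR
        SELECT * FROM big_table;

    -- one worker's "coordinates": position the cursor, then read a batch
    MOVE ABSOLUTE 42 FROM "mycursor";     -- land on row 42
    FETCH FORWARD 10 FROM "mycursor";     -- returns rows 43..52

    -- ...other workers issue their own MOVE/FETCH pairs on the same connection,
    -- which is exactly where the interleaving problem appears...

    CLOSE "mycursor";
    COMMIT;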
the setup
NodeJS workers consuming ranges of rows through a DB API
NodeJS DB API based on pg (latest)
PostgreSQL v10 (can be upgraded if required, all documentation links here are from v12 - latest)
the tries
WITH (MOVE 42 FROM "mycursor") FETCH 10 FROM "mycursor" produces a syntax error, apparently WITH doesn't handle MOVE
MOVE 42 FROM "mycursor" ; FETCH 10 FROM "mycursor": since I'm inside a transaction I suppose this could work, but I'm using Node's pg, which apparently doesn't handle several statements in the same call to query() (no error, but no result yielded; I didn't dig into this too much as it looks like a hack anyway)
I'm not confident a function would guarantee atomicity; that doesn't seem to be what PARALLEL UNSAFE is for, and as I'm going to have high concurrency, I'd really love some explicitly written assurance about atomicity...
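For reference, an untested sketch of what such a function could look like: a set-returning PL/pgSQL function that performs the MOVE and the FETCH in a single call, so that the pair always reaches the server as one statement (the cursor is assumed to be DECLAREd as "mycursor" over a hypothetical big_table in the same transaction; fetch_page is an illustrative name):

    CREATE FUNCTION fetch_page(start_at int, cnt int)
    RETURNS SETOF big_table
    LANGUAGE plpgsql AS
    $$
    DECLARE
        cur refcursor := 'mycursor';   -- binds to the portal DECLAREd at SQL level
        rec big_table;
    BEGIN
        -- reposition, then return the requested batch, all within one statement
        MOVE ABSOLUTE start_at FROM cur;
        FOR i IN 1 .. cnt LOOP
            FETCH cur INTO rec;
            EXIT WHEN NOT FOUND;
            RETURN NEXT rec;
        END LOOP;
    END;
    $$;

    -- one round trip per page, e.g.:
    -- SELECT * FROM fetch_page(42, 10);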
the reason
I'd prefer not to rely on LIMIT/OFFSET, as it would require an ORDER BY clause to ensure pagination consistency (as per the docs, ctrl-f for "unpredictable"), unless (scrollable, without hold) cursors prove to be far more resource-consuming. "Far more" because that cost has to be weighed against the INSENSITIVE behavior of cursors, which would allow me not to acquire a lock on the underlying table during the whole process. If pagination in this context proves not to be feasible using cursors, I'll fall back to this solution, unless you have something better to suggest!
the human side
Hello, and thanks in advance for the help! :)

You can share connections, but you cannot share transactions, so what you are asking for is not possible in this context. Well-behaved tools do not share a connection while a transaction is in progress on it.

Related

Stable pagination using Postgres

I want to implement stable pagination using Postgres database as a backend. By stable, I mean if I re-read a page using some pagination token, the results should be identical.
Using insertion timestamps will not work, because clock synchronization errors can make pagination unstable.
I was considering using pg_export_snapshot() as a pagination token. That way, I can reuse it on every read, and the database would guarantee me the same results since I am always using the same snapshot. But the documentation says that
"The snapshot is available for import only until the end of the transaction that exported it."
(https://www.postgresql.org/docs/9.4/functions-admin.html)
Is there any workaround for this? Is there an alternate way to export the snapshot even after the transaction is closed?
You wouldn't need to export snapshots; all you need is a REPEATABLE READ READ ONLY transaction so that the same snapshot is used for the whole transaction. But, as you say, that is a bad idea, because long transactions are quite problematic.
Using insert timestamps I see no real problem for insert-only tables, but rows that get deleted or updated will certainly vanish or move unless you use “soft delete and update” and leave the old values in the table (which gives you the problem of how to get rid of the values eventually). That would be re-implementing PostgreSQL's multiversioning on the application level and doesn't look very appealing.
Perhaps you could use a scrollable WITH HOLD cursor. Then the database server will materialize the result set when the selecting transaction is committed, and you can fetch forward and backward at your leisure. Sure, that will hog server resources, but you will have to pay somewhere. Just don't forget to close the cursor when you are done.
If you prefer to conserve server resources, the obvious alternative would be to fetch the whole result set to the client and implement pagination on the client side alone.
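A minimal sketch of the scrollable WITH HOLD approach described above, with illustrative table and cursor names:

    BEGIN;
    DECLARE page_cur SCROLL CURSOR WITH HOLD FOR
        SELECT * FROM items ORDER BY id;
    COMMIT;                            -- the result set is materialized here and outlives the transaction

    FETCH FORWARD 100 FROM page_cur;   -- first page
    MOVE ABSOLUTE 200 FROM page_cur;   -- jump around freely thanks to SCROLL
    FETCH FORWARD 100 FROM page_cur;   -- rows 201..300

    CLOSE page_cur;                    -- don't forget, the cursor survives COMMIT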

Is eval that evil?

I understand that eval locks the whole database, which can't be good for throughput; however, I have a scenario where a very specific transaction involving several documents must be isolated.
Because that transaction does not happen very often and is fairly quick (a few updates on indexed queries), I was thinking of using eval to execute it.
Are there any pitfalls that I should be aware of (I have seen several eval=evil posts but without much explanation)?
Does it make a difference if the database is part of a replica set?
Many developers would suggest that using eval is "evil", as there are obvious security concerns with potentially unsanitized JavaScript code executing within the context of the MongoDB instance. Normally MongoDB is immune to those types of injection attacks.
Some of the performance issues of using JavaScript in MongoDB via the eval command are mitigated in version 2.4, as multiple JavaScript operations can execute at the same time (depending on the setting of the nolock option). By default, though, it takes a global lock (which is apparently what you specifically want).
When eval is used to try to perform an (ACID-like) transactional update to several documents, there's one primary concern: if all operations must succeed for the data to be in a consistent state, the developer runs the risk that a failure mid-way through the operation (a hardware failure, for example) leaves a partially complete update in the database. Depending on the nature of the work being performed, replication settings, etc., the data may be OK, or it may not.
For situations where database corruption could occur as a result of a partially complete eval operation, I would suggest considering an alternative schema design and avoiding eval. That's not to say that it wouldn't work 99.9999% of the time, it's really up to you to decide ultimately whether it's worth the risk.
In the case you describe, there are a few options:
{ version: 7, isCurrent: true}
When a version 8 document becomes current, you could for example:
Create a second document that contains the current version; this would be an atomic set operation. It would mean that all reads would potentially need to read the "find the current version" document first, followed by the read of the full document.
Use a timestamp in place of a boolean value. Find the most current document based on timestamp (and your code could clear out the fields of older documents if desired once the now current document has been set)

using postgres server-side cursor for caching

To speed page generation for pages based on large postgres collections, we cache query results in memcache. However, for immutable collections that are very large, or that are rarely accessed, I'm wondering if saving server side cursors in postgres would be a viable alternate caching strategy.
The idea is that after having served a page in the middle of a collection "next" and "prev" links are much more likely to be used than a random query somewhere else in the collection. Could I have a cursor "WITH HOLD" in the neighborhood to avoid the (seemingly unavoidable) large startup costs of the query?
I wonder about resource consumption on the server. If the collection is immutable, saving the cursor shouldn't need very many resources, but I wonder how optimized postgres is in this respect. Any links to further documentation would be appreciated.
You're going to run into a lot of issues.
You'd have to ensure the same user gets the same SQL connection
You have to create a cleanup strategy
The cursors will be holding up vacuum operations.
You have to convince the connection pool to not clear the cursors
Probably other issues I have not mentioned.
In short: don't do it. How about precalculating the next/previous page in background, and storing it in memcached?
A good answer to this has previously been given in Best way to fetch the continuous list with PostgreSQL in web
The questions are similar; essentially, you store a list of PKs on the server together with a pagination token and an expiration.
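A rough sketch of that idea, with illustrative names: the PK list is stored under a token when the first page is requested, pages are served by joining the stored PKs back to the data, and a periodic job deletes expired tokens.

    CREATE TABLE pagination_snapshot (
        token      uuid        NOT NULL,
        position   int         NOT NULL,
        item_id    bigint      NOT NULL,     -- PK of the row in the real table
        expires_at timestamptz NOT NULL,
        PRIMARY KEY (token, position)
    );

    -- serving rows 201..300 for a given token
    SELECT i.*
    FROM pagination_snapshot s
    JOIN items i ON i.id = s.item_id
    WHERE s.token = $1
      AND s.position BETWEEN 201 AND 300
    ORDER BY s.position;

    -- cleanup, run periodically
    DELETE FROM pagination_snapshot WHERE expires_at < now();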

Using "Cursors" for paging in PostgreSQL [duplicate]

Possible Duplicate:
How to provide an API client with 1,000,000 database results?
Wondering if the use of cursors is a good way to implement "paging" using PostgreSQL.
The use case is that we have upwards of 100,000 rows that we'd like to make available to our API clients. We thought a good way to make this happen would be to allow the client to request the information in batches (pages). The client could request 100 rows at a time. We would return the 100 rows as well as a cursor, and when the client was ready, they could request the next 100 rows using the cursor that we sent to them.
However, I'm a little hazy on how cursors work and exactly how and when cursors should be used:
Do the cursors require that a database connection be left open?
Do the cursors run inside a transaction, locking resources until they are "closed"?
Are there any other "gotchas" that I'm not aware of?
Is there another, better way that this situation should be handled?
Thanks so much!
Cursors are a reasonable choice for paging in smaller intranet applications that work with large data sets, but you need to be prepared to discard them after a timeout. Users like to wander off, go to lunch, go on holiday for two weeks, etc, and leave their applications running. If it's a web-based app there's even the question of what "running" is and how to tell if the user is still around.
They are not suitable for large-scale applications with high client counts and clients that come and go near-randomly like in web-based apps or web APIs. I would not recommend using cursors in your application unless you have a fairly small client count and very high request rates ... in which case sending tiny batches of rows will be very inefficient and you should think about allowing range-requests etc instead.
Cursors have several costs. If the cursor is not WITH HOLD you must keep a transaction open. The open transaction can prevent autovacuum from doing its work properly, causing table bloat and other issues. If the cursor is declared WITH HOLD and the transaction isn't held open you have to pay the cost of materializing and storing a potentially large result set - at least, I think that's how hold cursors work. The alternative is just as bad, keeping the transaction implicitly open until the cursor is destroyed and preventing rows from being cleaned up.
Additionally, if you're using cursors you can't hand connections back to a connection pool. You'll need one connection per client. That means more backend resources are used just maintaining session state, and sets a very real upper limit on the number of clients you can handle with a cursor-based approach.
There's also the complexity and overhead of managing a stateful, cursor-based setup as compared to a stateless connection-pooling approach with limit and offset. You need to have your application expire cursors after a timeout or you face potentially unbounded resource use on the server, and you need to keep track of which connections have which cursors for which result sets for which users....
In general, despite the fact that it can be quite inefficient, LIMIT and OFFSET can be the better solution. It can often be better to seek on the primary key (keyset pagination) rather than using OFFSET, though.
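For illustration, a minimal sketch of that primary-key seek, with hypothetical names: instead of an OFFSET, the client passes back the last id it saw, and the index on the primary key keeps every page equally cheap.

    -- $1 is the last id of the previous page (0, or omit the WHERE, for the first page)
    SELECT *
    FROM items
    WHERE id > $1
    ORDER BY id
    LIMIT 100;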
By the way, you were looking at the documentation for cursors in PL/pgSQL. You want normal SQL-level cursors for this job.
Do the cursors require that a database connection be left open?
Yes.
Do the cursors run inside a transaction, locking resources until they are "closed"?
Yes unless they are WITH HOLD, in which case they consume other database resources.
Are there any other "gotchas" that I'm not aware of?
Yes, as the above should explain.
For HTTP clients, don't use cursors to implement paging. For scalability, you don't want server resources tied up between requests.
Instead, use LIMIT and OFFSET on your queries; see LIMIT and OFFSET in the Pg docs.
But make sure that the indexing on your tables will support efficient queries of this form.
Design a RESTful API, so that the client can invoke the "next_url" (also passed in the response) to get the next set of rows.
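A sketch of what that can look like in SQL, with illustrative names; the ORDER BY should be stable and backed by an index so that pages deep into the result stay reasonably cheap:

    CREATE INDEX IF NOT EXISTS items_created_idx ON items (created_at, id);

    -- page 3 of 100-row pages; next_url would carry offset=300
    SELECT *
    FROM items
    ORDER BY created_at, id
    LIMIT 100 OFFSET 200;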

Some basic questions on RDBMSes

I've skimmed thru Date and Silberschatz but can't seem to find answers to these specific questions of mine.
1. If 2 database users issue a query -- say, 'select * from AVERYBIGTABLE;' -- where would the results of the query get stored, in general, i.e. independently of the size of the result set?
a. In the OS-managed physical/virtual memory of the DBMS server?
b. In a DBMS-managed temporary file?
2. Is the query result set maintained per connection?
3. If the query result set is indeed maintained per connection, then what if there's connection pooling in effect (by a layer of code sitting above the DBMS)? Won't the result set then be maintained per query (instead of per connection)?
4. If the database is changing in real time while its users concurrently issue select queries, what happens to the queries that have already been executed but not yet (fully) 'consumed' by the query issuers? For example, assume the result set has 50,000 rows and the user is currently iterating at the 100th, when in parallel another user executes an insert/delete such that re-issuing the earlier query would yield more/fewer than 50,000 rows.
5. On the other hand, in the case of a database that does not change in real time, if 2 users issue identical queries, each with an identical but VERY LARGE result set, would the DBMS maintain 2 identical copies of the result set, or would it have a single shared copy?
Many thanks in advance.
Some of this may be specific to Oracle.
The full results of the query do not need to be copied for each user; each user gets a cursor (like a pointer) that keeps track of which rows have been retrieved and which rows still need to be fetched. The database will cache as much of the data as it can as it reads it out of the tables. It's the same principle as two users each having a read-only file handle on the same file.
The cursors are maintained per connection; the data for the next row may or may not already be in memory.
Connections are for the most part single-threaded; only 1 client can use a connection at a time. If the same query is executed twice on the same connection, the cursor position is reset.
If a cursor is open on a table that is being updated, the old rows are copied into a separate space (undo, in Oracle) and maintained for the life of the cursor, or at least until the database runs out of space to maintain them (Oracle will then give a "snapshot too old" error).
The database will never duplicate the data stored in the cache; in Oracle's case, with cursor sharing, there would be a single cached cursor and each client cursor would only have to maintain its position within it.
Oracle Database Concepts:
See chapter 8, "Memory", for questions 1, 2 and 5.
See chapter 13, "Data Concurrency and Consistency", for questions 3 and 4.
The reason you don't find this in Date etc. is that these details can vary between DBMS products; there is nothing in relational model theory about pooling connections to the database or how to maintain the result sets from a query (caching etc.). The only point that is partially covered is question 4, where the read (isolation) level would come into play (e.g. read uncommitted), but this only applies until the result set has been produced.