Using "Cursors" for paging in PostgreSQL [duplicate] - postgresql

Possible Duplicate:
How to provide an API client with 1,000,000 database results?
Wondering if the use of cursors is a good way to implement "paging" using PostgreSQL.
The use case is that we have upwards of 100,000 rows that we'd like to make available to our API clients. We thought a good way to make this happen would be to allow the client to request the information in batches (pages). The client could request 100 rows at a time. We would return the 100 rows as well as a cursor, then when the client was ready, they could request the next 100 rows using the cursor that we sent to them.
However, I'm a little hazy on how cursors work and exactly how and when cursors should be used:
Do the cursors require that a database connection be left open?
Do the cursors run inside a transaction, locking resources until they are "closed"?
Are there any other "gotchas" that I'm not aware of?
Is there another, better way that this situation should be handled?
Thanks so much!

Cursors are a reasonable choice for paging in smaller intranet applications that work with large data sets, but you need to be prepared to discard them after a timeout. Users like to wander off, go to lunch, go on holiday for two weeks, etc, and leave their applications running. If it's a web-based app there's even the question of what "running" is and how to tell if the user is still around.
They are not suitable for large-scale applications with high client counts and clients that come and go near-randomly like in web-based apps or web APIs. I would not recommend using cursors in your application unless you have a fairly small client count and very high request rates ... in which case sending tiny batches of rows will be very inefficient and you should think about allowing range-requests etc instead.
Cursors have several costs. If the cursor is not WITH HOLD you must keep a transaction open. The open transaction can prevent autovacuum from doing its work properly, causing table bloat and other issues. If the cursor is declared WITH HOLD and the transaction isn't held open you have to pay the cost of materializing and storing a potentially large result set - at least, I think that's how hold cursors work. The alternative is just as bad, keeping the transaction implicitly open until the cursor is destroyed and preventing rows from being cleaned up.
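To make the two flavours concrete, here's a rough SQL-level sketch, assuming a hypothetical table named items with a primary key id:

-- plain cursor: only exists inside an open transaction
BEGIN;
DECLARE items_cur CURSOR FOR SELECT * FROM items ORDER BY id;
FETCH FORWARD 100 FROM items_cur;   -- first page
-- ...the transaction must stay open between pages...
FETCH FORWARD 100 FROM items_cur;   -- next page
COMMIT;                             -- the cursor disappears here

-- WITH HOLD cursor: survives COMMIT, but the server materializes the
-- remaining result set when the transaction commits
BEGIN;
DECLARE items_hold_cur CURSOR WITH HOLD FOR SELECT * FROM items ORDER BY id;
COMMIT;
FETCH FORWARD 100 FROM items_hold_cur;
CLOSE items_hold_cur;               -- must be closed explicitly, or it lives until the session ends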
Additionally, if you're using cursors you can't hand connections back to a connection pool. You'll need one connection per client. That means more backend resources are used just maintaining session state, and sets a very real upper limit on the number of clients you can handle with a cursor-based approach.
There's also the complexity and overhead of managing a stateful, cursor-based setup as compared to a stateless connection-pooling approach with limit and offset. You need to have your application expire cursors after a timeout or you face potentially unbounded resource use on the server, and you need to keep track of which connections have which cursors for which result sets for which users....
In general, despite the fact that it can be quite inefficient, LIMIT and OFFSET can be the better solution. It can often be better to search the primary key rather than using OFFSET, though.
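A rough sketch of that primary-key approach (keyset pagination), assuming a hypothetical table items with primary key id, where the client sends back the last id it saw rather than an offset:

-- "items" and "id" are placeholder names for illustration
SELECT * FROM items ORDER BY id LIMIT 100;                 -- first page
SELECT * FROM items WHERE id > $1 ORDER BY id LIMIT 100;   -- later pages; $1 = last id the client saw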
By the way, you were looking at the documentation for cursors in PL/pgSQL. You want normal SQL-level cursors for this job.
Do the cursors require that a database connection be left open?
Yes.
Do the cursors run inside a transaction, locking resources until they are "closed"?
Yes unless they are WITH HOLD, in which case they consume other database resources.
Are there any other "gotchas" that I'm not aware of?
Yes, as the above should explain.

For HTTP clients, don't use cursors to implement paging. For scalability, you don't want server resources tied up between requests.
Instead, use LIMIT and OFFSET on your queries; see LIMIT and OFFSET in the Pg docs.
But make sure that the indexing on your tables will support efficient queries of this form.
Design a RESTful API, so that the client can invoke the "next_url" (also passed in the response) to get the next set of rows.
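As a sketch (hypothetical items table paged in creation order, page size 100):

-- make sure an index supports the paging order
CREATE INDEX IF NOT EXISTS items_created_at_idx ON items (created_at, id);

-- a request like GET /items?limit=100&offset=300 would run this and return
-- next_url = /items?limit=100&offset=400 in the response
SELECT * FROM items ORDER BY created_at, id LIMIT 100 OFFSET 300;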

Related

cursors: atomic MOVE then FETCH?

the intent
Basically, pagination: have a CURSOR (created using DECLARE outside any function here, but that can be changed if need be) accessed concurrently to retrieve batches of rows, which implies moving the cursor position in order to fetch more than one row (FETCH count seems to be the only way to fetch more than one row).
the context
During a more global transaction (i.e. using one connection), I want to retrieve a range of rows through a cursor. To do so, I:
MOVE the cursor to the desired position (e.g. MOVE 42 FROM "mycursor")
then FETCH the amount of rows (e.g. FETCH FORWARD 10 FROM "mycursor")
However, this transaction is used by many workers (horizontally scaled), each receiving a set of "coordinates" for the cursor, like LIMIT and OFFSET: the index to MOVE to, and the number of rows to FETCH. These workers use the DB connection through HTTP calls to a single DB API which handles the pool of connections and keeps the transactions alive.
Because of this concurrent access to the transaction/connection, I need to ensure atomic execution of the couple "MOVE then FETCH".
the setup
NodeJS workers consuming ranges of rows through a DB API
NodeJS DB API based on pg (latest)
PostgreSQL v10 (can be upgraded if required, all documentation links here are from v12 - latest)
the tries
WITH (MOVE 42 FROM "mycursor") FETCH 10 FROM "mycursor" produces a syntax error, apparently WITH doesn't handle MOVE
MOVE 42 FROM "mycursor"; FETCH 10 FROM "mycursor": as I'm inside a transaction I suppose this could work, but I'm using Node's pg, which apparently doesn't handle several statements in the same call to query() (no error, but no result yielded; I didn't dig into this too much as it looks like a hack)
I'm not confident a function would guarantee atomicity (that doesn't seem to be what PARALLEL UNSAFE is about), and as I'm going to have high concurrency, I'd really love some explicitly written assurances about atomicity...
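For reference, the kind of function I have in mind is roughly the following sketch, assuming a hypothetical table items and a cursor opened earlier in the same session with DECLARE "mycursor" SCROLL CURSOR FOR SELECT * FROM items ORDER BY id. The point would be that the whole MOVE-then-FETCH pair reaches the server as a single statement, though I don't know whether that actually gives the guarantee I need:

-- "items" is a placeholder table name for illustration
CREATE FUNCTION fetch_page(curs refcursor, start_pos int, page_size int)
RETURNS SETOF items
LANGUAGE plpgsql AS $$
DECLARE
    rec items%ROWTYPE;
BEGIN
    -- reposition relative to the start of the result set (backward moves need SCROLL)
    MOVE ABSOLUTE start_pos FROM curs;
    FOR i IN 1..page_size LOOP
        FETCH curs INTO rec;   -- PL/pgSQL FETCH returns one row at a time, hence the loop
        EXIT WHEN NOT FOUND;
        RETURN NEXT rec;
    END LOOP;
END;
$$;

-- each worker request would then be one statement:
-- SELECT * FROM fetch_page('mycursor', 42, 10);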
the reason
I'd prefer not to rely on LIMIT/OFFSET as it would require an ORDER BY clause to ensure pagination consistency (as per the docs, ctrl-f for "unpredictable"), unless (scrollable, without hold) cursors prove to be way more resource-consuming. "Way more" because it has to be weighed with the INSENSITIVE behavior of cursors that would allow me not to acquire a lock on the underlying table during the whole process. If it's proven that pagination in this context is not feasible using cursors, I'll fall back to this solution, unless you have something better to suggest!
the human side
Hello, and thanks in advance for the help! :)
You can share connections, but you cannot share transactions, so what you are asking for is not possible in this context. Well-behaved tools don't share a connection while a transaction is open on it.

Do Firebase/Firestore Transactions create internal queues?

I'm wondering if transactions (https://firebase.google.com/docs/firestore/manage-data/transactions) are viable tools to use in something like a ticketing system where users may be attempting to read/write to the same collection/document, and whoever made the request first will be handled first, whoever was second will be handled second, etc.
If not, what would be a good structure for such a need with Firestore?
Transactions just guarantee atomic, consistent updates among the documents involved in the transaction. They don't guarantee the order in which those transactions complete, as the transaction handler might get retried in the face of contention.
Since you tagged this question with google-cloud-functions (but didn't mention it in your question), it sounds like you might be considering writing a database trigger to handle incoming writes. Cloud Functions triggers also do not guarantee any ordering when under load.
Ordering of any kind at the scale on which Firestore and other Google Cloud products operate is a really difficult problem to solve (please read that link to get a sense of that). There is not a simple database structure that will impose an order where changes are made. I suggest you think carefully about your need for ordering, and come up with a different solution.
The best indication of order you can get is probably by adding a server timestamp to individual documents, but you will still have to figure out how to process them. The easiest thing might be to have a backend periodically query the collection, ordered by that timestamp, and process things in that order, in batch.

using postgres server-side cursor for caching

To speed page generation for pages based on large postgres collections, we cache query results in memcache. However, for immutable collections that are very large, or that are rarely accessed, I'm wondering if saving server side cursors in postgres would be a viable alternate caching strategy.
The idea is that after having served a page in the middle of a collection "next" and "prev" links are much more likely to be used than a random query somewhere else in the collection. Could I have a cursor "WITH HOLD" in the neighborhood to avoid the (seemingly unavoidable) large startup costs of the query?
I wonder about resource consumption on the server. If the collection is immutable, saving the cursor shouldn't need very many resources, but I wonder how optimized postgres is in this respect. Any links to further documentation would be appreciated.
You're going to run into a lot of issues.
You'd have to ensure the same user gets the same SQL connection.
You have to create a cleanup strategy.
The cursors will be holding up vacuum operations.
You have to convince the connection pool not to clear the cursors.
Probably other issues I have not mentioned.
In short: don't do it. How about precalculating the next/previous page in the background, and storing it in memcached?
A good answer to this was previously given in Best way to fetch the continuous list with PostgreSQL in web.
The questions are similar; essentially you store a list of PKs on the server with a pagination token and an expiration.
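A rough sketch of that idea in PostgreSQL, with hypothetical table and column names (gen_random_uuid() assumes PostgreSQL 13+ or the pgcrypto extension): the matching PKs are stored once under a token, and each page joins a slice of that list back to the data.

-- one snapshot row per pagination token ("items" is a placeholder table)
CREATE TABLE page_snapshots (
    token      uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    item_ids   bigint[] NOT NULL,
    expires_at timestamptz NOT NULL DEFAULT now() + interval '15 minutes'
);

-- take the snapshot once and hand the token back to the client
INSERT INTO page_snapshots (item_ids)
SELECT array_agg(id ORDER BY created_at) FROM items
RETURNING token;

-- serve rows 401-500 (the fifth page of 100) for a given token ($1)
SELECT i.*
FROM page_snapshots s
CROSS JOIN unnest(s.item_ids[401:500]) WITH ORDINALITY AS u(id, ord)
JOIN items i ON i.id = u.id
WHERE s.token = $1
  AND s.expires_at > now()
ORDER BY u.ord;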

will I typically get better performance if I run an update/calc loop via javascript?

I have a script that loops over a set of records, performs some statistical calculations and updates the records. It's a big cursor: get record, calculate statistics from embedded documents, set fields on record, save record. There are <5k records being looped over and each one embeds 90 history entries.
Question: would I get substantially better performance if I did this via javascript? The alternative being writing it in Ruby. My opinion (unfounded) is that since this can be done entirely in the database, I will get better performance if I send a chunk of js to MongoDB instead of adding Ruby into the mix.
Related: is map/reduce appropriate for finding the median and mode of a set of values for many records?
The answer is really "it depends" - if the fields you need to do the calculations are very large, doing the calculation on the server side with JS might be a lot faster simply by cutting down on network traffic.
But, executing JS on the server side also holds a write lock, so depending on how complicated the calculations are, it might be more efficient to just do your calculations on the client side and then simply update the document.
Your best bet is to do a simple benchmark with Ruby vs. server-side JS. If you need to serve other database traffic at the same time, that should be considered as well, because your lock % could be different in the two scenarios (you can monitor this with mongostat).
Also, keep in mind that using db.eval will not work with sharding, so avoid it if you are using a sharded environment or plan to in the future.

One big call vs. multiple smaller TSQL calls

I have an ADO.NET/TSQL performance question. We have two options in our application:
1) One big database call with multiple result sets, then in code step through each result set and populate my objects. This results in one round trip to the database.
2) Multiple small database calls.
There is much more code reuse with Option 2 which is an advantage of that option. But I would like to get some input on what the performance cost is. Are two small round trips twice as slow as one big round trip to the database, or is it just a small, say 10% performance loss? We are using C# 3.5 and Sql Server 2008 with stored procedures and ADO.NET.
I would think it in part would depend on when you need the data. For instance if you return ten datasets in one large process, and see all ten on the screen at once, then go for it. But if you return ten datasets and the user may only click through the pages to see three of them then sending the others was a waste of server and network resources. If you return ten datasets but the user really needs to see sets seven and eight only after making changes to sets 5 and 6, then the user would see the wrong info if you returned it too soon.
If you use separate stored procs for each data set called in one master stored proc, there is no reason at all why you can't reuse the code elsewhere, so code reuse is not really an issue in my mind.
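A minimal T-SQL sketch of that master-proc pattern, with hypothetical procedure names; each inner proc stays reusable on its own, and the client reads the result sets in order from the single round trip:

-- all procedure and parameter names below are placeholders for illustration
CREATE PROCEDURE dbo.GetOrderScreenData
    @CustomerId int
AS
BEGIN
    SET NOCOUNT ON;
    EXEC dbo.GetCustomer       @CustomerId = @CustomerId;  -- result set 1
    EXEC dbo.GetOpenOrders     @CustomerId = @CustomerId;  -- result set 2
    EXEC dbo.GetRecentActivity @CustomerId = @CustomerId;  -- result set 3
END;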
It sounds a wee bit obvious, but only send what you need in one call.
For example, we have a "getStuff" stored proc for presentation. The "updateStuff" proc calls "getStuff" proc and the client wrapper method for "updateStuff" expects type "Thing". So one round trip.
Chatty servers are one thing you prevent up front with minimal effort. Then, you can tune the DB or client code as needed... but it's hard to factor out the roundtrips later no matter how fast your code runs. In the extreme, what if your web server is in a different country to your DB server...?
Edit: it's interesting to note the SQL guys (HLGEM, astander, me) saying "one trip" and the client guys saying "multiple, code reuse"...
I am struggling with this problem myself. And I don't have an answer yet, but I do have some thoughts.
Having reviewed the answers given by others to this point, there is still a third option.
In my application, around ten or twelve calls are made to the server to get the data I need. Some of the data fields are varchar max and varbinary max fields (pictures, large documents, videos and sound files). All of my calls are synchronous - i.e., while the data is being requested, the user (and the client side program) has no choice but to wait. He may only want to read or view the data, which only makes total sense when it is ALL there, not just partially there. The process, I believe, is slower this way, and I am in the process of developing an alternative approach based on asynchronous calls to the server from a DLL library which raises events to the client to announce progress. The client is programmed to handle the DLL events and set a variable on the client side indicating which calls have been completed. The client program can then do what it must do to prepare the data received in call #1 while the DLL is proceeding asynchronously to get the data of call #2. When the client is ready to process the data of call #2, it must check the status and wait to proceed if necessary (I am hoping this will be a short wait or no wait at all). In this manner, both server and client side software are getting the job done in a more efficient manner.
If you're that concerned with performance, try a test of both and see which performs better.
Personally, I prefer the second method. It makes life easier for the developers, makes code more re-usable, and modularizes things so changes down the road are easier.
I personally like option two for the reason you stated: code reuse
But consider this: for small requests, the round-trip latency might be longer than the time spent doing the actual work. You have to find the right balance.
As the ADO.Net developer, your job is to make the code as correct, clear, and maintainable as possible. This means that you must separate your concerns.
It's the job of the SQL Server connection technology to make it fast.
If you implement a correct, clear, maintainable application that solves the business problems, and it turns out that the database access is the major bottleneck that prevents the system from operating within acceptable limits, then, and only then, should you start pursuing ways to fix the problem. This may or may not include consolidating database queries.
Don't optimize for performance until a need arises to do so. This means that you should analyze your anticipated use patterns and determine what the typical frequency of use for this process will be, and what user interface latency will result from the present design. If the user will receive feedback from the app in less than a few (2-3) seconds, and the application load from this process is not an inordinate load on server capacity, then don't worry about it. If, on the other hand, the user is waiting an unacceptable amount of time for a response (subjective but definitely measurable), or the server is being overloaded, then it's time to begin optimization. And then, which optimization techniques will make the most sense, or be the most cost effective, depends on what your analysis of the issue tells you.
So, in the meantime, focus on maintainability. That means, in your case, code reuse.
Personally I would go with 1 larger round trip.
This will definitely be influenced by the exact reusability of the calling code, and how it might be refactored.
But as mentioned, this will depend on your exact situation, where maintainability vs performance could be a factor.