the intent
Basically, pagination: I have a CURSOR (created using DECLARE outside any function for now, but that can be changed if need be) that is accessed concurrently to retrieve batches of rows. This means moving the cursor position and then fetching more than one row (FETCH count seems to be the only way to fetch more than one row).
the context
During a more global transaction (i.e. using one connection), I want to retrieve a range of rows through a cursor. To do so, I:
MOVE the cursor to the desired position (e.g. MOVE 42 FROM "mycursor")
then FETCH the number of rows I need (e.g. FETCH FORWARD 10 FROM "mycursor")
However, this transaction is used by many workers (horizontally scaled), each receiving a set of "coordinates" for the cursor, like LIMIT and OFFSET: the index to MOVE to, and the number of rows to FETCH. These workers use the DB connection through HTTP calls to a single DB API, which handles the pool of connections and keeps the transactions alive.
Because of this concurrent access to the transaction/connection, I need to ensure atomic execution of the pair "MOVE then FETCH".
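To make the requirement concrete, here is a minimal sketch of what "atomic MOVE then FETCH" would mean if it were enforced inside the single DB API process rather than in PostgreSQL. Everything in it is an assumption for illustration: FakeCursor is an in-memory stand-in for the server-side cursor, CursorGateway and fetch_range are hypothetical names, and the positioning is absolute (like MOVE ABSOLUTE n), since a plain MOVE 42 is relative to wherever the previous caller left the cursor.

```python
import threading


class FakeCursor:
    """In-memory stand-in for a server-side cursor (hypothetical, illustration only)."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = 0  # number of rows the cursor has already passed

    def move_absolute(self, position):
        # Analogous to MOVE ABSOLUTE position: clamp to the valid range.
        self._pos = max(0, min(position, len(self._rows)))

    def fetch_forward(self, count):
        # Analogous to FETCH FORWARD count: return rows and advance the position.
        batch = self._rows[self._pos:self._pos + count]
        self._pos += len(batch)
        return batch


class CursorGateway:
    """Serializes the MOVE + FETCH pair so concurrent callers cannot interleave."""

    def __init__(self, cursor):
        self._cursor = cursor
        self._lock = threading.Lock()

    def fetch_range(self, offset, count):
        # Both steps run under one lock, so no other caller can move the
        # cursor between the MOVE and the FETCH.
        with self._lock:
            self._cursor.move_absolute(offset)
            return self._cursor.fetch_forward(count)
```

With this in place, a worker asking for fetch_range(42, 10) gets the ten rows after position 42 regardless of what other workers requested just before. The lock only protects callers going through the same process; it says nothing about what PostgreSQL itself guarantees.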
the setup
NodeJS workers consuming ranges of rows through a DB API
NodeJS DB API based on pg (latest)
PostgreSQL v10 (can be upgraded if required, all documentation links here are from v12 - latest)
the tries
WITH (MOVE 42 FROM "mycursor") FETCH 10 FROM "mycursor" produces a syntax error; apparently WITH doesn't handle MOVE
MOVE 42 FROM "mycursor" ; FETCH 10 FROM "mycursor": as I'm inside a transaction I suppose this could work, but I'm using Node's pg, which apparently doesn't handle several statements in the same call to query() (no error, but no result yielded; I didn't dig into this much, as it looks like a hack)
I'm not confident a function would guarantee atomicity; it doesn't seem to be what PARALLEL UNSAFE does, and as I'm going to have high concurrency, I'd really love some explicitly written assurance about atomicity...
the reason
I'd prefer not to rely on LIMIT/OFFSET, as it would require an ORDER BY clause to ensure pagination consistency (as per the docs, ctrl-f for "unpredictable"), unless (scrollable, without hold) cursors prove to be way more resource-consuming. "Way more" because the cost has to be weighed against the INSENSITIVE behavior of cursors, which would allow me not to acquire a lock on the underlying table during the whole process. If it turns out that pagination in this context is not feasible using cursors, I'll fall back to LIMIT/OFFSET, unless you have something better to suggest!
the human side
Hello, and thanks in advance for the help! :)
You can share connections, but you cannot share transactions. What you are asking for is impossible in this context. Good tools don't share a connection while a transaction is open on it.
I'm wondering if transactions (https://firebase.google.com/docs/firestore/manage-data/transactions) are a viable tool for something like a ticketing system, where users may be attempting to read/write to the same collection/document, and whoever made the request first should be handled first, the second handled second, etc.
If not, what would be a good structure for such a need with Firestore?
Transactions just guarantee an atomic, consistent update among the documents involved in the transaction. They don't guarantee the order in which those transactions complete, as the transaction handler might get retried in the face of contention.
Since you tagged this question with google-cloud-functions (but didn't mention it in your question), it sounds like you might be considering writing a database trigger to handle incoming writes. Cloud Functions triggers also do not guarantee any ordering when under load.
Ordering of any kind at the scale on which Firestore and other Google Cloud products operate is a really difficult problem to solve (please read that link to get a sense of that). There is not a simple database structure that will impose an order where changes are made. I suggest you think carefully about your need for ordering, and come up with a different solution.
The best indication of order you can get is probably by adding a server timestamp to individual documents, but you will still have to figure out how to process them. The easiest thing might be to have a backend periodically query the collection, ordered by that timestamp, and process things in that order, in batch.
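As a rough illustration of that last suggestion, here is a sketch of the batch step using plain in-memory objects instead of the Firestore SDK. The Ticket class, its field names and process_pending are all hypothetical; created_at stands in for the server timestamp, and the document ID is used as a tie-breaker so that equal timestamps still sort deterministically.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    doc_id: str
    created_at: float  # stand-in for a server-assigned timestamp
    processed: bool = False


def process_pending(tickets, batch_size):
    """Handle up to batch_size unprocessed tickets, oldest timestamp first."""
    pending = sorted(
        (t for t in tickets if not t.processed),
        key=lambda t: (t.created_at, t.doc_id),  # doc_id breaks timestamp ties
    )
    handled = []
    for ticket in pending[:batch_size]:
        ticket.processed = True  # stand-in for the real per-ticket work
        handled.append(ticket.doc_id)
    return handled
```

A backend would call something like process_pending on a schedule. Documents written between two runs are simply picked up by the next run, in timestamp order.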
To speed page generation for pages based on large postgres collections, we cache query results in memcache. However, for immutable collections that are very large, or that are rarely accessed, I'm wondering if saving server side cursors in postgres would be a viable alternate caching strategy.
The idea is that after having served a page in the middle of a collection, the "next" and "prev" links are much more likely to be used than a random query somewhere else in the collection. Could I keep a cursor "WITH HOLD" in the neighborhood to avoid the (seemingly unavoidable) large startup costs of the query?
I wonder about resource consumption on the server. If the collection is immutable, saving the cursor shouldn't need very many resources, but I wonder how optimized postgres is in this respect. Any links to further documentation would be appreciated.
You're going to run into a lot of issues.
You'd have to ensure the same user gets the same SQL connection
You have to create a cleanup strategy
The cursors will be holding up vacuum operations.
You have to convince the connection pool to not clear the cursors
Probably other issues I have not mentioned.
In short: don't do it. How about precalculating the next/previous page in background, and storing it in memcached?
A good answer to this has been given before: "Best way to fetch the continuous list with PostgreSQL in web".
The questions are similar; essentially, you store a list of PKs on the server with a pagination token and an expiration.
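A minimal sketch of that idea, assuming an in-process dictionary as the store (in practice this could live in memcached or similar). PageTokenStore, its method names and the default TTL are made up for illustration and are not taken from the linked answer.

```python
import secrets
import time


class PageTokenStore:
    """Maps a pagination token to a frozen list of primary keys with an expiry."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # token -> (expires_at, list of PKs)

    def create(self, pks):
        # Run the expensive query once, keep only the PKs, hand back a token.
        token = secrets.token_urlsafe(16)
        self._entries[token] = (self._clock() + self._ttl, list(pks))
        return token

    def page(self, token, page_number, page_size):
        # Returns the PKs for one page, or None if the token is unknown or expired.
        entry = self._entries.get(token)
        if entry is None:
            return None
        expires_at, pks = entry
        if self._clock() >= expires_at:
            del self._entries[token]
            return None
        start = page_number * page_size
        return pks[start:start + page_size]
```

Each "next"/"prev" request then only needs a cheap lookup of the rows by primary key for the returned slice, and no connection or cursor has to stay tied to a user.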
I have a script that loops over a set of records, performs some statistical calculations and updates the records. It's a big cursor: get record, calculate statistics from embedded documents, set fields on record, save record. There are fewer than 5k records being looped over, and each one embeds 90 history entries.
Question: would I get substantially better performance if I did this via JavaScript? The alternative is writing it in Ruby. My (unfounded) opinion is that since this can be done entirely in the database, I will get better performance if I send a chunk of JS to MongoDB instead of adding Ruby into the mix.
Related: is map/reduce appropriate for finding the median and mode of a set of values for many records?
The answer is really "it depends" - if the fields you need to do the calculations are very large, doing the calculation on the server side with JS might be a lot faster simply by cutting down on network traffic.
But, executing JS on the server side also holds a write lock, so depending on how complicated the calculations are, it might be more efficient to just do your calculations on the client side and then simply update the document.
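For a sense of how small the client-side calculation is, here is a sketch of the per-record step using only the standard library. The record layout (a "history" list whose entries carry a "value" field) is an assumption, since the question doesn't show the schema, and the example is in Python rather than Ruby purely for illustration.

```python
from statistics import median, multimode


def summarize(record):
    """Compute median and mode(s) from a record's embedded history entries."""
    values = [entry["value"] for entry in record["history"]]
    if not values:
        # Nothing to summarize; leave the fields empty instead of raising.
        return {"median": None, "modes": []}
    # multimode returns every most-common value, so ties are not hidden.
    return {"median": median(values), "modes": multimode(values)}
```

The result would then be written back with a normal update of the two fields. With 90 entries per record the computation itself should be cheap, so the cost is mostly in moving the documents over the network.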
Your best bet is to run a simple benchmark of Ruby vs. server-side JS. If you need to serve other database traffic at the same time, take that into account as well, because your lock % could be different in the two scenarios (you can monitor this with mongostat).
Also, keep in mind that using db.eval will not work with sharding, so avoid it if you are using a sharded environment or plan to in the future.
I have an ADO.NET/TSQL performance question. We have two options in our application:
1) One big database call with multiple result sets, then in code step through each result set and populate my objects. This results in one round trip to the database.
2) Multiple small database calls.
There is much more code reuse with Option 2, which is an advantage of that option. But I would like to get some input on what the performance cost is. Are two small round trips twice as slow as one big round trip to the database, or is it just a small, say 10%, performance loss? We are using C# 3.5 and SQL Server 2008 with stored procedures and ADO.NET.
I would think it depends in part on when you need the data. For instance, if you return ten datasets in one large process and show all ten on the screen at once, then go for it. But if you return ten datasets and the user may only click through the pages to see three of them, then sending the others was a waste of server and network resources. If you return ten datasets but the user really needs to see sets seven and eight only after making changes to sets five and six, then the user would see the wrong info if you returned it too soon.
If you use separate stored procs for each data set called in one master stored proc, there is no reason at all why you can't reuse the code elsewhere, so code reuse is not really an issue in my mind.
It sounds a wee bit obvious, but only send what you need in one call.
For example, we have a "getStuff" stored proc for presentation. The "updateStuff" proc calls "getStuff" proc and the client wrapper method for "updateStuff" expects type "Thing". So one round trip.
Chatty servers are one thing you prevent up front with minimal effort. Then, you can tune the DB or client code as needed... but it's hard to factor out the roundtrips later no matter how fast your code runs. In the extreme, what if your web server is in a different country to your DB server...?
Edit: it's interesting to note the SQL guys (HLGEM, astander, me) saying "one trip" and the client guys saying "multiple, code reuse"...
I am struggling with this problem myself. And I don't have an answer yet, but I do have some thoughts.
Having reviewed the answers given by others to this point, there is still a third option.
In my application, around ten or twelve calls are made to the server to get the data I need. Some of the data fields are varchar max and varbinary max fields (pictures, large documents, videos and sound files). All of my calls are synchronous, i.e. while the data is being requested, the user (and the client-side program) has no choice but to wait. The user may only want to read or view the data, which only makes sense when it is ALL there, not just partially there. The process, I believe, is slower this way, and I am developing an alternative approach based on asynchronous calls to the server from a DLL library, which raises events to announce progress to the client. The client is programmed to handle the DLL events and set a variable on the client side indicating which calls have been completed. The client program can then do what it must do to prepare the data received in call #1 while the DLL is proceeding asynchronously to get the data of call #2. When the client is ready to process the data of call #2, it must check the status and wait to proceed if necessary (I am hoping this will be a short wait or no wait at all). In this manner, both server-side and client-side software get the job done more efficiently.
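The overlap described above can be sketched roughly as follows. This is not the DLL/event design itself, just the same idea expressed with Python's asyncio: every call is started up front, and the client prepares each result as soon as it is available. The fetch function, its sleep-based delay and the upper-casing "preparation" are placeholders, not real database code.

```python
import asyncio


async def fetch(name, delay):
    # Placeholder for one database call; the sleep simulates the round trip.
    await asyncio.sleep(delay)
    return f"data for {name}"


async def load_all(calls):
    # Start every call immediately so they run concurrently.
    tasks = [asyncio.create_task(fetch(name, delay)) for name, delay in calls]
    prepared = []
    for task in tasks:
        # Waits only if this particular call has not finished yet; later
        # calls keep running in the background while earlier data is prepared.
        data = await task
        prepared.append(data.upper())  # placeholder for client-side preparation
    return prepared
```

Running asyncio.run(load_all([("call1", 0.2), ("call2", 0.2)])) takes roughly the time of the slowest call rather than the sum of both, which is the gain the asynchronous approach is after.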
If you're that concerned with performance, try a test of both and see which performs better.
Personally, I prefer the second method. It makes life easier for the developers, makes code more re-usable, and modularizes things so changes down the road are easier.
I personally like option two for the reason you stated: code reuse.
But consider this: for small requests, the network latency might be longer than the time the request itself takes to run. You have to find the right balance.
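A back-of-the-envelope model makes that balance easier to reason about. It assumes the total query work is the same whether it is split or not, and that every round trip adds a fixed latency; both are simplifications, and the numbers in the note below are made-up inputs, not measurements.

```python
def total_time_ms(round_trips, latency_ms, work_ms):
    """Total time when a fixed amount of work is spread over N round trips."""
    return round_trips * latency_ms + work_ms


def slowdown(many_trips, few_trips, latency_ms, work_ms):
    """How many times slower the chattier option is under this model."""
    return (total_time_ms(many_trips, latency_ms, work_ms)
            / total_time_ms(few_trips, latency_ms, work_ms))
```

With 1 ms of latency and 20 ms of query work, two trips cost 22 ms against 21 ms for one, so the difference is small. With 50 ms of latency and 5 ms of work, it is 105 ms against 55 ms, close to twice as slow. Which case applies depends entirely on the real latency and query times, so measuring them is the first step.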
As the ADO.Net developer, your job is to make the code as correct, clear, and maintainable as possible. This means that you must separate your concerns.
It's the job of the SQL Server connection technology to make it fast.
If you implement a correct, clear, maintainable application that solves the business problems, and it turns out that the database access is the major bottleneck that prevents the system from operating within acceptable limits, then, and only then, should you start pursuing ways to fix the problem. This may or may not include consolidating database queries.
Don't optimize for performance until a need arises to do so. This means that you should analyze your anticipated use patterns and determine what the typical frequency of use for this process will be, and what user interface latency will result from the present design. If the user will receive feedback from the app in less than a few (2-3) seconds, and this process does not put an inordinate load on server capacity, then don't worry about it. If, on the other hand, the user is waiting an unacceptable amount of time for a response (subjective but definitely measurable), or if the server is being overloaded, then it's time to begin optimization. And then, which optimization techniques make the most sense, or are the most cost-effective, depends on what your analysis of the issue tells you.
So, in the meantime, focus on maintainability. That means, in your case, code reuse.
Personally I would go with 1 larger round trip.
This will definitely be influenced by the exact reusability of the calling code, and how it might be refactored.
But as mentioned, this will depend on your exact situation, where maintainability vs performance could be a factor.