Efficient row reading with libpq (PostgreSQL)

This is a scalability-related question.
We want to read some rows from a table and, after processing some of them, stop the query. The stop criterion is data dependent (we do not know in advance how many rows, or which ones, we are interested in).
This matters for scalability when the number of rows in the table grows far beyond the number of rows we are really interested in.
If we use the standard PQexec, all rows are returned and we are forced to consume them (we have to call PQgetResult until it returns null). So this does not scale.
We are now trying "row by row" reading.
We first used PQsendQuery and PQsetSingleRowMode. However, we still have to call PQgetResult until it returns null.
Our latest approach is PQsendQuery and PQsetSingleRowMode, and when we are done we cancel the query as follows:
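void CloseRowByRow() {
    /* Ask the server to abandon the running query. */
    PGcancel *c = PQgetCancel(conn);
    if (c != NULL) {
        char errbuf[256];
        PQcancel(c, errbuf, sizeof errbuf);
        PQfreeCancel(c);
    }
    /* Drain whatever is still pending; PQgetResult must return NULL
       before the connection can run another query. */
    while (res) {
        PQclear(res);
        res = PQgetResult(conn);
    }
}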
This produces some performance benefits but we are wondering if this is the best we can do.
So here comes the question: Is there any other way?

Use DECLARE and FETCH to define and read from a server-side cursor; this is exactly what they are meant for. You keep using the standard APIs: FETCH simply lets you retrieve the results in batches of a controlled size, and you can CLOSE the cursor as soon as you have seen enough. See the examples in the docs for more details.

Related

How to tell PostgreSQL via JDBC that I'm not going to fetch every row of the query result (i.e. how to stream the head of a result set efficiently)?

Quite frequently, I'd like to retrieve only the first N rows from a query, but I don't know in advance what N will be. For example:
try (var stream = sql.selectFrom(JOB)
        .where(JOB.EXECUTE_AFTER.le(clock.instant()))
        .orderBy(JOB.PRIORITY.asc())
        .forUpdate()
        .skipLocked()
        .fetchSize(100)
        .fetchLazy()) {
    // Now run as many jobs as we can in 10s
    ...
}
Now, without adding an arbitrary LIMIT clause, the PG query planner sometimes decides to do a painfully slow sequential table scan for such queries, AFAIK because it thinks I'm going to fetch every row in the result set. An arbitrary LIMIT kind of works for simple cases like this one, but I don't like it at all because:
the limit's only there to "trick" the query planner into doing the right thing, it's not there because my intent is to fetch at most N rows.
when it gets a little more sophisticated and you have multiple such queries that somehow depend on each other, choosing an N large enough to not break your code can be hard. You don't want to be the next person who has to understand that code.
finding out that the query is unexpectedly slow usually happens in production where your tables contain a few million/billion rows. Totally avoidable if only the DB didn't insist on being smarter than the developer.
I'm getting tired of writing detailed comments that explain why the queries have to look like this-n-that (i.e. explain why the query planner screws up if I don't add this arbitrary limit)
So, how do I tell the query planner that I'm only going to fetch a few rows and getting the first row quickly is the priority here? Can this be achieved using the JDBC API/driver?
(Note: I'm not looking for server configuration tweaks that indirectly influence the query planner, like tinkering with random page costs, nor can I accept a workaround like set enable_seqscan=off)
(Note 2: I'm using jOOQ in the example code for clarity, but under the hood this is just another PreparedStatement using ResultSet.TYPE_FORWARD_ONLY and ResultSet.CONCUR_READ_ONLY, so AFAIK we can rule out wrong statement modes)
(Note 3: I'm not in autocommit mode)
PostgreSQL is smarter than you think. All you need to do is to set the fetch size of the java.sql.Statement to a value different from 0 using the setFetchSize method. Then the JDBC driver will create a cursor and fetch the result set in chunks. Any query planned that way will be optimized for fast fetching of the first 10% of the data (this is governed by the PostgreSQL parameter cursor_tuple_fraction). Even if the query performs a sequential scan of a table, not all the rows have to be read: reading will stop as soon as no more result rows are fetched.
I have no idea how to use JDBC methods with your ORM, but there should be a way.
In case the JDBC fetchSize() method doesn't suffice as a hint to get the behaviour you want, you could make use of explicit server side cursors like this:
ctx.transaction(c -> {
    // Declare the cursor through the transaction's own configuration
    c.dsl().execute("declare c cursor for {0}", dsl
        .selectFrom(JOB)
        .where(JOB.EXECUTE_AFTER.le(clock.instant()))
        .orderBy(JOB.PRIORITY.asc())
        .forUpdate()
        .skipLocked()
    );
    try {
        JobRecord r;
        // Fetch through c.dsl() as well, so the FETCH runs in the same transaction as the DECLARE
        while ((r = c.dsl().resultQuery("fetch forward 1 from c")
                .coerce(JOB)
                .fetchOne()) != null) {
            System.out.println(r);
        }
    }
    finally {
        c.dsl().execute("close c");
    }
});
There's a pending feature request to support the above in the DSL API as well (see #11407), but the example shows that this can already be done in a type-safe way using plain SQL templating and the ResultQuery::coerce method.

What is a more efficient way to compare and filter Sequences (multiple calls to a single call)

I have two sequences of Data objects and I want to establish what has been added, removed and is common between DataSeq1 and DataSeq2 based upon the id in the Data objects within each sequence.
I can achieve this using the following:
val dataRemoved = DataSeq1.filterNot(c => DataSeq2.exists(_.id == c.id))
val dataAdded = DataSeq2.filterNot(c => DataSeq1.exists(_.id == c.id))
val dataCommon = DataSeq1.filter(c => DataSeq2.exists(_.id == c.id))
//Based upon what is common I want to filter DataSeq2
var incomingDataToCompare = List[Data]()
dataCommon.foreach(data => {incomingDataToCompare = DataSeq2.find(_.id == data.id).get :: incomingDataToCompare})
However, as the sequences of Data objects get larger, calling filter three separate times may have a performance impact. Is there a more efficient way to achieve the same output (i.e. what has been removed, added and is in common) in a single call?
The short answer is: quite possibly not, unless you add some additional features to the system. I would guess that you need to keep a log of operations in order to improve the time complexity. It is even better if that log is indexed both by the order in which each operation occurred and by the id of the item that was added or removed. I will leave it to you to discover how such a log can be used.
You might also be able to improve the time complexity if you keep the original sequences sorted by id (or keep a separate seq of sorted ids, which you should be able to maintain at an O(log N) penalty per operation). This seq should be a Vector or a similar type that allows fast random access. Then you can iterate over both with two pointers. But this algorithm's efficiency will depend greatly on whether the unique ids are bounded, and on whether this "establish added/removed/same" operation is called much more frequently than the operations that mutate the sequences.
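As a rough illustration of the two-pointer idea, here is a minimal sketch, written in Java rather than Scala and working on ids only. It assumes both id lists are already sorted in ascending order and contain no duplicates; the names SortedIdDiff, Diff and diff are made up for this example and do not come from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public final class SortedIdDiff {

    /** Ids found only in the old list, only in the new list, and in both. */
    public static final class Diff {
        public final List<Integer> removed = new ArrayList<>();
        public final List<Integer> added = new ArrayList<>();
        public final List<Integer> common = new ArrayList<>();
    }

    /**
     * Single pass over two id lists.
     * Assumption: both lists are sorted ascending and have no duplicates.
     */
    public static Diff diff(List<Integer> oldIds, List<Integer> newIds) {
        Diff d = new Diff();
        int i = 0;
        int j = 0;
        while (i < oldIds.size() && j < newIds.size()) {
            int a = oldIds.get(i);
            int b = newIds.get(j);
            if (a == b) {            // present in both lists
                d.common.add(a);
                i++;
                j++;
            } else if (a < b) {      // a cannot appear later in newIds
                d.removed.add(a);
                i++;
            } else {                 // b cannot appear later in oldIds
                d.added.add(b);
                j++;
            }
        }
        // Whatever is left on either side has no counterpart.
        while (i < oldIds.size()) {
            d.removed.add(oldIds.get(i++));
        }
        while (j < newIds.size()) {
            d.added.add(newIds.get(j++));
        }
        return d;
    }

    public static void main(String[] args) {
        Diff d = diff(Arrays.asList(1, 2, 4, 7), Arrays.asList(2, 3, 7, 9));
        // prints: removed=[1, 4] added=[3, 9] common=[2, 7]
        System.out.println("removed=" + d.removed + " added=" + d.added + " common=" + d.common);
    }
}
```

Each element is visited once, so the comparison is linear in the combined length of the two lists, but only if they are kept sorted. If they are not, the cost of sorting (or of maintaining the sorted order on every mutation) has to be weighed against the three filter calls in the question.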

SOQL Query in loop

The following Apex code:
for (List<MyObject__c> list : [select id,some_fields from MyObject__c]) {
for (MyObject__c item : list) {
//do stuff
}
}
is a standard pattern for processing large quantities of data: it automatically breaks up a large result set into chunks of 200 records, so each List will contain 200 objects.
Is there any difference between the above approach, and below:
for (List<MyObject__c> list : Database.query('select...')) {
for (MyObject__c item : list) {
//do stuff
}
}
used when the SOQL needs to be dynamic?
I have been told that the 2nd approach is losing data, but I'm not sure of the details, and the challenge is to work out why, and to prevent data loss.
Where did you read that using Dynamic SOQL will result in data loss in this scenario? This is untrue AFAIK. The static Database.query() method behaves the same as static SOQL, apart from a few small differences.
To answer your first question, the main difference between your approaches is that the first is static (the query is fixed), and the second allows you to dynamically define the query, as the name suggests. Using your second approach with Dynamic SOQL will still chunk the result records for you.
This difference does lead to some other subtle considerations: Dynamic SOQL supports only simple variable bindings. See Daniel Ballinger's idea for this feature here. I also need to add the necessary security caveat: if you're using Dynamic SOQL, do not construct the query from unsanitized user input, as that would render your application vulnerable to SOQL injection.
Also, I'm just curious, what is your use case/how large of quantities of data are you talking about here? It might be better to use batch Apex, depending on your requirement.

Find First and First Difference in Progress 4GL

I'm not clear about the queries below and am curious to know what the difference between them is, even though both retrieve the same results. (Database used: sports2000.)
FOR EACH Customer WHERE State = "NH",
FIRST Order OF Customer:
DISPLAY Customer.Cust-Num NAME Order-Num Order-Date.
END.
FOR EACH Customer WHERE State = "NH":
FIND FIRST Order OF Customer NO-ERROR.
IF AVAILABLE Order THEN
DISPLAY Customer.Cust-Num NAME Order-Num Order-Date.
END.
Please explain.
Regards
Suga
As AquaAlex says, your first snippet is a join (the "," part of the syntax makes it a join) and has all of the pros and cons he mentions. There is, however, a significant additional "con": the join is being made with FIRST, and FOR ... FIRST should never be used.
FOR LAST - Query, giving wrong result
It will eventually bite you in the butt.
FIND FIRST is not much better.
The fundamental problem with both statements is that they imply that there is an order which your desired record is the FIRST instance of. But no part of the statement specifies that order. So in the event that there is more than one record that satisfies the query you have no idea which record you will actually get. That might be ok if the only reason that you are doing this is to probe to see if there is one or more records and you have no intention of actually using the record buffer. But if that is the case then CAN-FIND() would be a better statement to be using.
There is a myth that FIND FIRST is supposedly faster. If you believe this, or know someone who does, I urge you to test it. It is not true. It is true that in the case where FIND returns a large set of records adding FIRST is faster -- but that is not apples to apples. That is throwing away the bushel after randomly grabbing an apple. And if you code like that your apple now has magical properties which will lead to impossible to cure bugs.
OF is also problematic. OF implies a WHERE clause based on the compiler guessing that fields with the same name in both tables and which are part of a unique index can be used to join the tables. That may seem reasonable, and perhaps it is, but it obscures the code and makes the maintenance programmer's job much more difficult. It makes a good demo but should never be used in real life.
Your first statement is a join statement, which means less network traffic. And you will only receive records where both the customer and order record exist so do not need to do any further checks. (MORE EFFICIENT)
The second statement will retrieve each customer and then for each customer found it will do a find on order. Because there may not be an order you need to do an additional statement (If Available) as well. This is a less efficient way to retrieve the records and will result in much more unwanted network traffic and more statements being executed.

RavenDB - querying issue - Stale results/indexes

While querying RavenDB I am noticing that it does not return the expected results immediately. Maybe it has to do with indexing; I don't know.
For example :
int ACount = session.Query<Patron>()
.Count();
int BCount = session.Query<Theaters>()
.Count();
int CCount = session.Query<Movies>()
.Where(x => x.Status == "Released")
.Count();
int DCount = session.Query<Promotions>()
.Count();
When I execute this, ACount and BCount get their values immediately (on the first run). However, CCount and DCount do not get their values until after three or four runs; they show 0 (zero) in the first few runs.
Why does this happen for the bottom two queries and not the top two? If it's because of stale results (or indexes), how can I modify my queries to get accurate results every time, even on the first run? Thank you for your help.
If you haven't defined an index for the Movies query, Raven will create a Dynamic Index. If you use the query repeatedly the index will be automatically persisted. Otherwise Raven will discard it and that may explain why you're getting 0 results during the first few runs.
You can also instruct Raven to wait for the indexing process to ensure that you'll always get the most accurate results (even though this might not be a good idea as it will slow your queries) by using the WaitForNonStaleResults instruction:
session.Query<Movies>()
.Customize(x => x.WaitForNonStaleResults())
.Where(x => x.Status == "Released")
.Count();
Needing to put WaitForNonStaleResults in each query feels like a massive "code smell" (as much as I normally hate the term, it seems completely appropriate here).
The only real solution I've found so far is:
var store = new DocumentStore(); // do whatever
store.DatabaseCommands.DisableAllCaching();
Performance suffers accordingly, but I think slower performance is far less of a sin than unreliable if not outright inaccurate results.
You have the following options according to the official documentation (most preferable first):
1. Setting a cut-off point:
WaitForNonStaleResultsAsOfLastWrite(TimeSpan.FromSeconds(10))
or
WaitForNonStaleResultsAsOfNow()
This makes sure that you get the latest results up to that point in time (or up to the last write). You can put a cap on the wait (e.g. 10s) if you are willing to sacrifice freshness of the results for a faster response.
2. Explicitly waiting for non-stale results:
WaitForNonStaleResultsAsOfNow(TimeSpan.FromSeconds(10))
Again, specifying a time-out is good practice.
3. Setting querying conventions to apply the same rule to all requests:
store.Conventions.DefaultQueryingConsistency = ConsistencyOptions.AlwaysWaitForNonStaleResultsAsOfLastWrite;