RavenDB - querying issue - Stale results/indexes - nosql

While querying RavenDB, I notice that I don't get the expected results immediately. Maybe it has to do with indexing; I don't know.
For example:
int ACount = session.Query<Patron>()
    .Count();
int BCount = session.Query<Theaters>()
    .Count();
int CCount = session.Query<Movies>()
    .Where(x => x.Status == "Released")
    .Count();
int DCount = session.Query<Promotions>()
    .Count();
When I execute this, ACount and BCount get their values immediately (on the first run). However, CCount and DCount do not get their values until after three or four runs; they show 0 (zero) on the first few runs.
Why does this happen for the bottom two queries and not the top two? If it's because of stale results (or indexes), how can I modify my queries to get accurate results the first time I run them? Thank you for your help.

If you haven't defined an index for the Movies query, Raven will create a dynamic index. If you use the query repeatedly, the index will be persisted automatically; otherwise Raven will discard it, which may explain why you're getting 0 results during the first few runs.
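If you want to avoid relying on a dynamic index at all, you can define and deploy a static index up front. Below is a minimal sketch, assuming the older Raven.Client.Indexes API that matches the customizations used elsewhere in this thread; the index name is illustrative:

using System.Linq;
using Raven.Client.Indexes;

// Hypothetical static index covering the Movies query above.
public class Movies_ByStatus : AbstractIndexCreationTask<Movies>
{
    public Movies_ByStatus()
    {
        // Index only the field the query filters on.
        Map = movies => from movie in movies
                        select new { movie.Status };
    }
}

// Register all indexes in the assembly once at application startup, so the index
// already exists before the first query runs:
// IndexCreation.CreateIndexes(typeof(Movies_ByStatus).Assembly, store);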
You can also instruct Raven to wait for the indexing process, to ensure that you always get the most accurate results (even though this might not be a good idea, as it will slow down your queries), by using the WaitForNonStaleResults customization:
session.Query<Movies>()
    .Customize(x => x.WaitForNonStaleResults())
    .Where(x => x.Status == "Released")
    .Count();

Needing to put WaitForNonStaleResults in each query feels like a massive "code smell" (as much as I normally hate the term, it seems completely appropriate here).
The only real solution I've found so far is:
var store = new DocumentStore(); // do whatever
store.DatabaseCommands.DisableAllCaching();
Performance suffers accordingly, but I think slower performance is far less of a sin than unreliable, if not outright inaccurate, results.

You have the following options, according to the official documentation (the most preferable first):
1. Setting a cut-off point:
WaitForNonStaleResultsAsOfLastWrite(TimeSpan.FromSeconds(10))
or
WaitForNonStaleResultsAsOfNow()
This makes sure that you get the latest results up to that point in time (or up to the last write). You can also put a cap on it (e.g. 10 seconds) if you want to sacrifice freshness of the results for a faster response.
2. Explicitly waiting for non-stale results:
WaitForNonStaleResultsAsOfNow(TimeSpan.FromSeconds(10))
Again, specifying a timeout is good practice.
3. Setting querying conventions to apply the same rule to all requests:
store.Conventions.DefaultQueryingConsistency = ConsistencyOptions.AlwaysWaitForNonStaleResultsAsOfLastWrite;
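For the third option, here is a minimal sketch of wiring the convention up once at store initialization (the URL and database name are placeholders; namespaces as in the 2.x-era Raven client):

using Raven.Client.Document;

var store = new DocumentStore
{
    Url = "http://localhost:8080",  // assumption: your RavenDB server URL
    DefaultDatabase = "Cinema"      // assumption: your database name
};

// Every query made through this store now waits for indexes to catch up to the last write.
store.Conventions.DefaultQueryingConsistency =
    ConsistencyOptions.AlwaysWaitForNonStaleResultsAsOfLastWrite;

store.Initialize();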

Related

How to tell PostgreSQL via JDBC that I'm not going to fetch every row of the query result (i.e. how to stream the head of a result set efficiently)?

Quite frequently, I'd like to retrieve only the first N rows from a query, but I don't know in advance what N will be. For example:
try (var stream = sql.selectFrom(JOB)
        .where(JOB.EXECUTE_AFTER.le(clock.instant()))
        .orderBy(JOB.PRIORITY.asc())
        .forUpdate()
        .skipLocked()
        .fetchSize(100)
        .fetchLazy()) {
    // Now run as many jobs as we can in 10s
    ...
}
Now, without adding an arbitrary LIMIT clause, the PG query planner sometimes decides to do a painfully slow sequential table scan for such queries, AFAIK because it thinks I'm going to fetch every row in the result set. An arbitrary LIMIT kind of works for simple cases like this one, but I don't like it at all because:
the limit's only there to "trick" the query planner into doing the right thing, it's not there because my intent is to fetch at most N rows.
when it gets a little more sophisticated and you have multiple such queries that somehow depend on each other, choosing an N large enough to not break your code can be hard. You don't want to be the next person who has to understand that code.
finding out that the query is unexpectedly slow usually happens in production where your tables contain a few million/billion rows. Totally avoidable if only the DB didn't insist on being smarter than the developer.
I'm getting tired of writing detailed comments that explain why the queries have to look like this-n-that (i.e. explain why the query planner screws up if I don't add this arbitrary limit)
So, how do I tell the query planner that I'm only going to fetch a few rows and getting the first row quickly is the priority here? Can this be achieved using the JDBC API/driver?
(Note: I'm not looking for server configuration tweaks that indirectly influence the query planner, like tinkering with random page costs, nor can I accept a workaround like set seq_scan=off)
(Note 2: I'm using jOOQ in the example code for clarity, but under the hood this is just another PreparedStatement using ResultSet.TYPE_FORWARD_ONLY and ResultSet.CONCUR_READ_ONLY, so AFAIK we can rule out wrong statement modes)
(Note 3: I'm not in autocommit mode)
PostgreSQL is smarter than you think. All you need to do is to set the fetch size of the java.sql.Statement to a value different from 0 using the setFetchSize method. Then the JDBC driver will create a cursor and fetch the result set in chunks. Any query planned that way will be optimized for fast fetching of the first 10% of the data (this is governed by the PostgreSQL parameter cursor_tuple_fraction). Even if the query performs a sequential scan of a table, not all the rows have to be read: reading will stop as soon as no more result rows are fetched.
I have no idea how to use JDBC methods with your ORM, but there should be a way.
In case the JDBC fetchSize() method doesn't suffice as a hint to get the behaviour you want, you could make use of explicit server-side cursors like this:
ctx.transaction(c -> {
    c.dsl().execute("declare c cursor for {0}", dsl
        .selectFrom(JOB)
        .where(JOB.EXECUTE_AFTER.le(clock.instant()))
        .orderBy(JOB.PRIORITY.asc())
        .forUpdate()
        .skipLocked()
    );

    try {
        JobRecord r;
        while ((r = dsl.resultQuery("fetch forward 1 from c")
                .coerce(JOB)
                .fetchOne()) != null) {
            System.out.println(r);
        }
    }
    finally {
        c.dsl().execute("close c");
    }
});
There's a pending feature request to support the above in the DSL API as well (see #11407), but the example above shows that this can still be done in a type-safe way using plain SQL templating and the ResultQuery::coerce method.

Efficient row reading with libpq (postgresql)

This is a scalability-related question.
We want to read some rows from a table and, after processing some of them, stop the query. The stop criterion is data-dependent (we do not know in advance how many rows, or which ones, we are interested in).
This is scalability-sensitive when the number of rows in the table grows far beyond the number of rows we are actually interested in.
If we use the standard PQexec, all rows are returned and we are forced to consume them (we have to call PQgetResult until it returns null), so this does not scale.
We are now trying "row by row" reading.
We first used PQsendQuery and PQsetSingleRowMode. However, we still have to call PQgetResult until it returns null.
Our last approach is PQsendQuery and PQsetSingleRowMode, and when we are done we cancel the query as follows:
void CloseRowByRow() {
    PGcancel *c = PQgetCancel(conn);
    char errbuf[256];
    PQcancel(c, errbuf, 256);
    PQfreeCancel(c);
    while (res) {
        PQclear(res);
        res = PQgetResult(conn);
    }
}
This produces some performance benefits, but we are wondering if this is the best we can do.
So here comes the question: is there any other way?
Use DECLARE and FETCH to define and read from a server-side cursor; this is exactly what they are meant for. You would use the standard APIs; FETCH just lets you retrieve the results in batches of a controlled size. See the examples in the documentation for more details.

Why does response time go up when the number of concurrent requests per second to my ASP.NET Core API goes up?

I'm testing an endpoint under load. For 1 request per second, the average response time is around 200ms. The endpoint does a few DB lookups (all read) that are pretty fast and it's async throughout.
However, when doing a few hundred requests per second (req/sec), the average response time goes up to over a second.
I've had a look at the best practices guide at:
https://learn.microsoft.com/en-us/aspnet/core/performance/performance-best-practices?view=aspnetcore-2.2
Some suggestions like "Avoid blocking calls" and "Minimize large object allocations" seem like they don't apply since I'm already using async throughout and my response size for a single request is less than 50 KB.
There are a couple though that seem like they might be useful, for example:
https://learn.microsoft.com/en-us/ef/core/what-is-new/ef-core-2.0#high-performance
https://learn.microsoft.com/en-us/aspnet/core/performance/performance-best-practices?view=aspnetcore-2.2#pool-http-connections-with-httpclientfactory
Questions:
Why would the average response time go up with an increased req/sec?
Are the suggestions above that I've marked as 'might be useful' likely to help? I ask because, while I would like to try them all, I unfortunately have limited time, so I'd like to start with the options that are most likely to help.
Are there any other options worth considering?
I've had a look at these two existing threads, but neither answers my question:
Correlation between requests per second and response time?
ASP.NET Web API - Handle more requests per second
It will be hard to answer your specific issue without access to the code, but the main thing to consider is the size and complexity of the database queries being generated by EF. Using async/await will increase the responsiveness of your web server to start requests, but the request handling time under load will depend largely on the queries being run, as the database becomes the contention point. You will want to ensure that all queries are as minimal as possible. For example, there is a huge difference between the following three statements:
var someData = context.SomeTable.Include(x => x.SomeOtherTable)
    .ToList()
    .Where(x => x.SomeCriteriaMethod())
    .ToList();

var someData = context.SomeTable.Include(x => x.SomeOtherTable)
    .Where(x => x.SomeField == someField && x.SomeOtherTable.SomeOtherField == someOtherField)
    .ToList();

var someData = context.SomeTable
    .Where(x => x.SomeField == someField && x.SomeOtherTable.SomeOtherField == someOtherField)
    .Select(x => new SomeViewModel
    {
        SomeTableId = x.SomeTableId,
        SomeField = x.SomeField,
        SomeOtherField = x.SomeOtherTable.SomeOtherField
    }).ToList();
Examples like the first above are extremely inefficient, as they end up loading all data from the related tables before filtering rows. Even though your web server may only pass back a few rows, it has requested everything from the database. These kinds of queries creep into apps when developers want to filter on a value that EF cannot translate to SQL (such as a method call), so they work around it with a ToList() call, or as a by-product of poor separation, such as a repository pattern that returns an IEnumerable (sketched below).
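To illustrate that last point, here is a minimal sketch (repository, context, and entity names are hypothetical) of why the return type of a repository matters:

using System.Collections.Generic;
using System.Linq;

public class SomeTableRepository
{
    private readonly SomeDbContext _context; // assumption: your EF DbContext type

    public SomeTableRepository(SomeDbContext context)
    {
        _context = context;
    }

    // IEnumerable version: ToList() has already pulled every row into memory,
    // so any Where()/Select() the caller adds runs as LINQ to Objects.
    public IEnumerable<SomeTable> GetAll()
    {
        return _context.SomeTable.ToList();
    }

    // IQueryable version: the caller can keep composing, and EF translates the
    // final Where()/Select() into a single SQL statement.
    public IQueryable<SomeTable> Query()
    {
        return _context.SomeTable;
    }
}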
The second example is a little better: it avoids the read-everything ToList() call, but it still loads entire rows containing data that isn't necessary. This ties up resources on the database and web servers.
The third example demonstrates refining queries to return just the minimum data the consumer needs. This lets the database server make better use of indexes and execution plans.
Another performance pitfall you can face under load is lazy loading. A database will only execute a finite number of concurrent requests, so if some queries are kicking off additional lazy-load requests, those extra requests run immediately when there is no load; under load, though, they are queued up along with all the other queries and lazy-load requests, which can drag down data retrieval.
Ultimately you should run an SQL profiler against your database to capture the kinds and numbers of SQL queries being executed. When executing in a test environment, pay close attention to the read count and CPU cost rather than the total execution time. As a general rule of thumb, queries with higher read and CPU costs will be far more susceptible to execution-time blow-out under load: they require more resources to run, and they "touch" more rows, meaning more waiting on row/table locks.
Another thing to watch out for is "heavy" queries in very large data systems that need to touch a lot of rows, such as reports and, in some cases, highly customizable search queries. If these are needed, consider planning your database design to include a read-only replica to run reports or large search expressions against, to avoid row-lock scenarios in your primary database that can degrade responsiveness for the typical read and write queries.
Edit: Identifying lazy load queries.
These show up in a profiler when you query against a top-level table but then see a number of additional queries for related tables following it.
For example, say you have a table called Order, with a related table called Product, another called Customer and another called Address for a delivery address. To read all orders for a date range you'd expect to see a query such as:
SELECT [OrderId], [Quantity], [OrderDate], [ProductId], [CustomerId], [DeliveryAddressId] FROM [dbo].[Orders] WHERE [OrderDate] >= '2019-01-01' AND [OrderDate] < '2020-01-01'
Where you just wanted to load Orders and return them.
When the serializer iterates over the fields, it finds a Product, Customer, and Address referenced, and by attempting to read those fields it will trip lazy loading, resulting in queries like:
SELECT [CustomerId], [Name] FROM [dbo].[Customers] WHERE [CustomerId] = 22
SELECT [ProductId], [Name], [Price] FROM [dbo].[Products] WHERE [ProductId] = 1023
SELECT [AddressId], [StreetNumber], [City], [State], [PostCode] FROM [dbo].[Addresses] WHERE [AddressId] = 1211
If your original query returned 100 Orders, you could see 100 times the above set of queries, one set for each order: a lazy-load hit on one order row attempts to look up the related customer by Customer ID, the related product by Product ID, and the related address by Delivery Address ID. This can, and does, get costly. It may not be visible when run in a test environment, but it adds up to a lot of queries.
If you eager load the related entities using .Include(), EF will compose JOIN statements to get the related rows in one hit, which is considerably faster than fetching each related entity individually. Still, that can result in pulling back a lot of data you don't need. The best way to avoid this extra cost is to leverage projection through Select to retrieve just the columns you need, as sketched below.
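Using the Order example above, a sketch of that projection might look like the following (the view model and navigation property names are assumptions); EF composes one SQL statement with the necessary JOINs and returns only the listed columns:

var start = new DateTime(2019, 1, 1);
var end = new DateTime(2020, 1, 1);

var orders = context.Orders
    .Where(o => o.OrderDate >= start && o.OrderDate < end)
    .Select(o => new OrderSummaryViewModel
    {
        OrderId = o.OrderId,
        Quantity = o.Quantity,
        OrderDate = o.OrderDate,
        // Navigation properties used inside Select become JOINs, not lazy loads.
        CustomerName = o.Customer.Name,
        ProductName = o.Product.Name,
        DeliveryCity = o.DeliveryAddress.City
    })
    .ToList();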

Find First and First Difference in Progress 4GL

I'm not clear about the queries below and am curious to know what the difference is between them, even though both retrieve the same results. (The database used is sports2000.)
FOR EACH Customer WHERE State = "NH",
    FIRST Order OF Customer:
    DISPLAY Customer.Cust-Num NAME Order-Num Order-Date.
END.

FOR EACH Customer WHERE State = "NH":
    FIND FIRST Order OF Customer NO-ERROR.
    IF AVAILABLE Order THEN
        DISPLAY Customer.Cust-Num NAME Order-Num Order-Date.
END.
Please explain this to me.
Regards
Suga
As AquaAlex says, your first snippet is a join (the "," part of the syntax makes it a join) and has all of the pros and cons he mentions. There is, however, a significant additional "con": the join is being made with FIRST, and FOR ... FIRST should never be used.
FOR LAST - Query, giving wrong result
It will eventually bite you in the butt.
FIND FIRST is not much better.
The fundamental problem with both statements is that they imply there is an ordering of which your desired record is the FIRST instance. But no part of the statement specifies that order, so if more than one record satisfies the query, you have no idea which record you will actually get. That might be OK if the only reason you are doing this is to probe whether one or more records exist and you have no intention of actually using the record buffer, but in that case CAN-FIND() would be a better statement to use.
There is a myth that FIND FIRST is faster. If you believe this, or know someone who does, I urge you to test it. It is not true. It is true that when FIND would return a large set of records, adding FIRST is faster, but that is not apples to apples: that is throwing away the bushel after randomly grabbing an apple. And if you code like that, your apple now has magical properties which will lead to impossible-to-cure bugs.
OF is also problematic. OF implies a WHERE clause based on the compiler guessing that fields with the same name in both tables, which are part of a unique index, can be used to join the tables. That may seem reasonable, and perhaps it is, but it obscures the code and makes the maintenance programmer's job much more difficult. It makes a good demo but should never be used in real life.
Your first statement is a join, which means less network traffic, and you will only receive records where both the customer and the order exist, so you do not need to do any further checks (more efficient).
The second statement retrieves each customer and then, for each customer found, does a FIND on Order. Because there may not be an order, you also need an additional check (IF AVAILABLE). This is a less efficient way to retrieve the records and results in much more network traffic and more statements being executed.

How to get total number of potential results in Lucene

I'm using lucene on a site of mine and I want to show the total result count from a query, for example:
Showing results x to y of z
But I can't find any method that will return the total number of potential results. I can only seem to find methods for which you have to specify the number of results you want, and since I only want 10 per page, it seems logical to pass 10 as the number of results.
Or am I doing this wrong? Should I be passing in, say, 1000 and then just taking the 10 in the range I require?
BTW, since I know you personally, I should point out for others that I already knew you were referring to Lucene.Net and not Java Lucene :) although the API would be the same.
In versions prior to 2.9.x you could call IndexSearcher.Search(Query query, Filter filter), which returned a Hits object, one of whose properties (methods, technically, due to the Java port) was Length().
This is now marked Obsolete, since it will be removed in 3.0; the remaining Search overloads return TopDocs or TopFieldDocs objects.
Your alternatives are:
a) Use IndexSearcher.Search(Query query, int count), which will return a TopDocs object; TopDocs.TotalHits will show you the total possible hits, but at the expense of actually creating <count> results.
b) A faster way is to implement your own Collector object (inherit from Lucene.Net.Search.Collector) and call IndexSearcher.Search(Query query, Collector collector). The Search method will call Collect(int docId) on your collector for every match, so if you keep track of that internally you have a way of counting all the results.
It should be noted that Lucene is not a total-result-set query environment; it is designed to stream the most relevant results to you (the developer) as quickly as possible. Any method which gives you a "total results" count is just a wrapper enumerating over all the matches (as with the Collector method).
The trick is to keep this enumeration as fast as possible. The most expensive part is deserialisation of Documents from the index, populating each field, and so on. At least with the newer API design, which requires you to write your own Collector, the principle is made clear: the developer should avoid deserialising each result from the index, since only matching document ids and a score are provided by default.
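As a rough illustration of option (b), a counting collector for Lucene.Net might look like this (a sketch only; member shapes such as AcceptsDocsOutOfOrder vary slightly between the 2.9 and 3.0 ports):

using Lucene.Net.Index;
using Lucene.Net.Search;

public class HitCountCollector : Collector
{
    public int TotalHits { get; private set; }

    public override void SetScorer(Scorer scorer)
    {
        // Scores aren't needed just to count hits.
    }

    public override void Collect(int doc)
    {
        // Called once per matching document id; the Document itself is never loaded.
        TotalHits++;
    }

    public override void SetNextReader(IndexReader reader, int docBase)
    {
        // No per-segment state is needed for a simple count.
    }

    public override bool AcceptsDocsOutOfOrder
    {
        get { return true; } // order is irrelevant when only counting
    }
}

// Usage:
//   var collector = new HitCountCollector();
//   searcher.Search(query, collector);
//   int z = collector.TotalHits;  // the "z" in "Showing results x to y of z"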
The top docs collector does this for you, for example:
TopDocs topDocs = searcher.Search(qry, 10);
int totalHits = topDocs.TotalHits;
The above query will count all hits, but return only 10.