How to get total number of potential results in Lucene - lucene.net

I'm using lucene on a site of mine and I want to show the total result count from a query, for example:
Showing results x to y of z
But I can't find any method which will return the total number of potential results. I can only seem to find methods where you have to specify the number of results you want, and since I only want 10 per page it seems logical to pass in 10 as the number of results.
Or am I doing this wrong, should I be passing in say 1000 and then just taking the 10 in the range that I require?

BTW, since I know you personally I should point out for others I already knew you were referring to Lucene.net and not Lucene :) although the API would be the same
In versions prior to 2.9.x you could call IndexSearcher.Search(Query query, Filter filter), which returned a Hits object, one of whose properties [methods, technically, due to the Java port] was Length()
This is now marked Obsolete since it will be removed in 3.0; the remaining Search overloads return TopDocs or TopFieldDocs objects.
Your alternatives are
a) Use IndexSearcher.Search(Query query, int count), which will return a TopDocs object, so TopDocs.TotalHits will show you the total possible hits, but at the expense of actually creating <count> results
b) A faster way is to implement your own Collector object (inherit from Lucene.Net.Search.Collector) and call IndexSearcher.Search(Query query, Collector collector). The search method will call Collect(int docId) on your collector for every match, so if you keep track of that internally you have a way of counting all the results (see the sketch below).
It should be noted Lucene is not a total-resultset query environment and is designed to stream the most relevant results to you (the developer) as fast as possible. Any method which gives you a "total results" count is just a wrapper enumerating over all the matches (as with the Collector method).
The trick is to keep this enumeration as fast as possible. The most expensive part is deserialisation of Documents from the index, populating each field, etc. The newer API design, which requires you to write your own Collector, makes this principle clear: avoid deserialising each result from the index, since only matching document Ids and a score are provided by default.
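For illustration, a minimal counting collector along these lines might look like the following in Lucene.Net (a sketch against the 2.9-era Collector API; CountingCollector is a made-up name, and AcceptsDocsOutOfOrder is a method here but became a property in later versions):
using Lucene.Net.Index;
using Lucene.Net.Search;

// Counts matching documents without ever loading a Document from the index.
public class CountingCollector : Collector
{
    public int Count { get; private set; }

    // Scores are not needed when we only count matches.
    public override void SetScorer(Scorer scorer) { }

    // Called once per matching document id.
    public override void Collect(int doc) { Count++; }

    // No per-segment state is required for a simple count.
    public override void SetNextReader(IndexReader reader, int docBase) { }

    // Order is irrelevant when counting, so allow out-of-order collection.
    public override bool AcceptsDocsOutOfOrder() { return true; }
}

Usage would then be along these lines:
var collector = new CountingCollector();
searcher.Search(query, collector);
int totalMatches = collector.Count;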

The top docs collector does this for you, for example:
TopDocs topDocs = searcher.search(qry, 10);
int totalHits = topDocs.totalHits;
The above query will count all hits, but return only 10.

Related

How does the page size work in nested OData query?

Let's say we want to execute the following OData query:
api/accounts?$expand=contacts
Suppose that we have 3 accounts (for ex. a1, a2, a3) and 3 contacts per account.
So if we define "odata.maxpagesize=2" and execute the above query, what will the result be according to the OData standard?
Option-1
a1
- c11
- c12
- (odata.nextlink for c13)
a2
- c21
- c22
- (odata.nextlink for c23)
(odata.nextlink for a3)
Option-2
a1
- c11
- c12
- (odata.nextlink for c13)
(odata.nextlink for a2, a3)
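To make the wire format concrete, an Option-1 style response for odata.maxpagesize=2 could look roughly like the following in OData v4 JSON (the id property and the $skiptoken values are placeholders, not real data):
GET api/accounts?$expand=contacts
Prefer: odata.maxpagesize=2

{
  "value": [
    {
      "id": "a1",
      "contacts": [ { "id": "c11" }, { "id": "c12" } ],
      "contacts@odata.nextLink": "api/accounts('a1')/contacts?$skiptoken=..."
    },
    {
      "id": "a2",
      "contacts": [ { "id": "c21" }, { "id": "c22" } ],
      "contacts@odata.nextLink": "api/accounts('a2')/contacts?$skiptoken=..."
    }
  ],
  "@odata.nextLink": "api/accounts?$expand=contacts&$skiptoken=..."
}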
For pagesize=2 it might look easy, but assume that pagesize=5000; then will it return:
Option-1
5000 accounts with 5000 nested contacts for each account. So it will have 25,000,000 records from that viewpoint.
Option-2
1 account with 5000 nested contacts for that account. So it will have 5000 records from that viewpoint.
-------- UPDATE-2 -------------------------
We were just slightly hesitant about Option-1, as a user can query multiple expands and that can lead to a very large result size.
For ex, if user queries:
accounts?$expand=contacts($expand=callHistory)
So, considering Option-1 with a maxPageSize of 100, if we return records up to maxPageSize at all nested levels, then it will return
100 (accounts) * 100 (contacts per account) * 100 (call logs per contact) = 1 million entities.
And the number of records will grow exponentially if the user uses $expand at further nested levels. Please let me know if my analysis is correct.
On the other hand, Option-2 can be close to what you suggest. Here we'll count nested results as well and check whether the entity count exceeds the page size. After that we can return a nextLink wherever applicable.
It would be great if you can re-validate our approach. :)
Either option is technically compliant with the spec. MaxPageSize is a preference (i.e., a hint to the service) and the service is allowed to return more/less, as long as it correctly returns nextLinks for incomplete collections.
So, for example, a service could also look at maxpageSize of 5000 and decide to return the first 1000 parents with up to 5 nested results each. Or, it could ignore maxpagesize entirely and return 200 parents with only next links for nested resources. Or...
I think the best consumer experience is something closer to option 1, where the service returns some a's (less than or equal to maxpagesize, perhaps based on the number/level of nested results), each with some b's (again, perhaps based on the number/level of nested results, up to maxpagesize).
Not sure if that helps?
---Response to Update 2---
Yes; your analysis is correct. And it's more complicated than that -- the user could $expand multiple properties at each level, so you would either end up multiplying pagesize by the number of $expanded properties at each level, or you would have to decide how to divide the pagesize requested results across all of the $expanded collections at each level.
As I say, option-2 is valid, and probably easy to implement (just read up to the first pagesize records and then stop); it just might not be as friendly to a consumer that is trying to get a sense of the data (i.e., in a visual display) and then drill in where appropriate.
It kinda depends on the consumption scenario. Option 2 optimizes for only doing paging if there are more total records requested than maxpagesize, but (in the extreme) the first page of results is not very representative.
On the other hand, if the goal is for someone to view/browse through the data, drilling in where appropriate, then limiting nested collections based on some static value (say, the first 5 records of each nested collection) and then using maxpagesize to limit either the top-level records or the total count of records would probably be more user-friendly. The only disadvantage would be that you might introduce paging for nested collections even when the full result was less than maxpagesize.
You also might want to consider which is more efficient to implement. For example, if you are building a query to get data from an underlying store, it may be more efficient to request a fixed maximum number of records for each nested collection, rather than requesting all of the data for nested collections and then throwing the rest away once you've read as much as you need.
Again, keep in mind that the calculation doesn't have to be exact. maxpagesize is just a hint to the service. The service is not required to return exactly that count, so don't get too bogged down on trying to exactly calculate how many records will be returned.
Personal preference: If I had potentially large nested results I would probably lean towards limiting them based on some static value. It makes the results more predictable and uniform, and provides a better representation of the data.

getting a doc read count of a sub collection in firebase?

I am trying to change the structure of part of my DB. It's a part that is getting called pretty frequently, so I am afraid it is going to cost more than one read per call due to the new nesting. I want to nest it under a sub-document so I can protect the whole sub-collection with a premiumUntil value. Will this nesting increase the price, since I am querying one level deep plus checking the premiumUntil value in the rules? Would my single call now be considered a double call, one for the week and one for the weeks?
// Call
let collectionRef = collection(db, 'Weeks', id, 'week');
// get a specific doc in that sub-collection.
// Db Model
The nesting of a document doesn't affect how much it costs to read it. One document read is always going to cost the same, no matter what collection it comes from.

Entity Framework 5 Get All method

In our EF implementation for a brand new project, we have a GetAll method in the repository. But as our application grows, we are going to have, let's say, 5000 products. Getting all of them the way it is now would be a performance problem, wouldn't it?
If so, what is a good implementation to tackle this future problem?
Any help will be highly appreciated.
It could become a performance problem if GetAll is enumerating the collection, thus causing the entire data set to be returned in an IEnumerable. This all depends on the number of joins, the size of the data set, whether you have lazy loading enabled, how SQL Server performs, just to name a few factors.
Some people will say this is not desirable, but you could have GetAll() return an IQueryable, which would defer the query until something caused the collection to be filled by going to SQL. That way you could filter the results with other Where(), Take(), Skip(), etc. statements to allow for paging without having to retrieve all 5000+ products from the database.
It depends on how your repository class is set up. If you're performing the query immediately, i.e. if your GetAll() method returns something like IEnumerable<T> or IList<T>, then yes, that could easily be a performance problem, and you should generally avoid that sort of thing unless you really want to load all records at once.
On the other hand, if your GetAll() method returns an IQueryable<T>, then there may not be a problem at all, depending on whether you trust the people writing queries. Returning an IQueryable<T> would allow callers to further refine the search criteria before the SQL code is actually generated. Performance-wise, it would only be a problem if developers using your code didn't apply any filters before executing the query. If you trust them enough to give them enough rope to hang themselves (and potentially take your database performance down with them), then just returning IQueryable<T> might be good enough.
If you don't trust them, then, as others have pointed out, you could limit the number of records returned by your query by using the Skip() and Take() extension methods to implement simple pagination, but note that it's possible for records to slip through the cracks if people make changes to the database before you move on to the next page. Making pagination work seamlessly with an ever-changing database is much harder than a lot of people think.
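For example, a simple paging helper built on Skip() and Take() might look something like this (just a sketch; Products comes from the question above, the ordering by Id is an assumption, and LINQ to Entities requires an explicit OrderBy before Skip can be used):
public IList<Product> GetPage(int page, int pageSize)
{
    // An explicit ordering keeps page boundaries stable between requests
    // and is required before Skip/Take when the query is translated to SQL.
    return context.Products
        .OrderBy(p => p.Id)
        .Skip((page - 1) * pageSize)
        .Take(pageSize)
        .ToList();
}
// e.g. the third page of 10 products:
// var page3 = GetPage(3, 10);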
Another approach would be to replace your GetAll() method with one that requires the caller to apply a filter before returning results:
public IQueryable<T> GetMatching<T>(Expression<Func<T, bool>> filter) where T : class
{
    // Context here is your DbContext; Context.Set<T>() could be replaced
    // with whatever you're using for data access
    return Context.Set<T>().Where(filter);
}
and then use it like var results = GetMatching<Product>(x => x.Name == "foo");, or whatever you want to do. Note that this could be easily bypassed by calling GetMatching<Product>(x => true), but at least it makes the intention clear. You could also combine this with the first method to put a firm cap on the number of records returned.
My personal feeling, though, is that all of these ways of limiting queries are just insurance against bad developers getting their hands on your application, and if you have bad developers working on your project, they'll find a way to cause problems no matter what you try to do. So my vote is to just return an IQueryable<T> and trust that it will be used responsibly. If people abuse it, take away the GetAll() method and give them training-wheels methods like GetRecentPosts(int count) or GetPostsWithTag(string tag, int count) or something like that, where the query logic is out of their hands.
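A training-wheels method of that sort might look something like this (again just a sketch; the Post entity and its CreatedDate property are made up for illustration):
public IList<Post> GetRecentPosts(int count)
{
    // The caller never sees the IQueryable, so the ordering, filtering
    // and row cap stay under the repository's control.
    return context.Posts
        .OrderByDescending(p => p.CreatedDate)
        .Take(count)
        .ToList();
}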
One way to improve this is by using pagination
context.Products.Skip(n).Take(m);
What you're referring to is known as paging, and it's pretty trivial to do using LINQ via the Skip/Take methods.
EF queries are lazily evaluated, which means they won't actually hit the database until they are enumerated, so the following would only pull 10 rows after skipping the first 10 (note that with LINQ to Entities you'll also need an OrderBy before Skip):
context.Table.Skip(10).Take(10);

Fql Limit Returns Less Results Than Expected

I am trying to get data from facebook by fql query.
One of the things I want to do is to get a limited number of records on each call. I am trying to do this with the 'LIMIT [start], [end]' command, which is supposed to return the records between those numbers. Instead of getting [end]-[start] records, which is the total number of records that are supposed to be returned, I get a random number of records. I have checked and I can be sure that I am not trying to get more records than there are.
LIMIT example:
http://graph.facebook.com/fql?q=SELECT actor_id, message,description FROM stream WHERE source_id =5878435678 Limit 2,10
This is supposed to return 7 records (the count starts from 0), but I get only 3 records.
The funny thing is, when I wrote 50 instead of 10, I got 26 records.
Can someone help me find a way to get the exact amount of records I asked for?
Thanks ahead!!
This blog post by Facebook engineers explains this phenomenon.
http://developers.facebook.com/blog/post/478/
Here's the part that addresses your question...
You might notice that the number of results returned is not always equal to the “limit” specified. This is expected behavior. Query parameters are applied on our end before checking to see if the results returned are visible to the viewer. Because of this, it is possible that you might get fewer results than expected.
The below is the best part of the blog entry...
This also means when you are manually constructing your own queries, you should be aware that with some tables and connections, if you are specifying an “offset” parameter, the Nth result you are pointing to may not get returned in your results (like the 3rd result in step 2 of the image above). This can make paging difficult and confusing.
Lol, you're killing me, Facebook!!! Why not make it straightforward and consistent, rather than "difficult and confusing"?!?!

RavenDB - querying issue - Stale results/indexes

While querying RavenDB I am noticing that it does not get the expected results immediately. Maybe it has to do with indexing, I don't know.
For example :
int ACount = session.Query<Patron>().Count();
int BCount = session.Query<Theaters>().Count();
int CCount = session.Query<Movies>()
    .Where(x => x.Status == "Released")
    .Count();
int DCount = session.Query<Promotions>().Count();
When I execute this, ACount and BCount get their values immediately (on the first run). However, CCount and DCount do not get their values until after three or four runs; they show 0 (zero) in the first few runs.
Why does this happen for the bottom two and not the top two queries? If it's because of stale results (or indexes), then how can I modify my queries to get accurate results the first time I run them? Thank you for the help.
If you haven't defined an index for the Movies query, Raven will create a Dynamic Index. If you use the query repeatedly the index will be automatically persisted. Otherwise Raven will discard it and that may explain why you're getting 0 results during the first few runs.
You can also instruct Raven to wait for the indexing process to ensure that you'll always get the most accurate results (even though this might not be a good idea as it will slow your queries) by using the WaitForNonStaleResults instruction:
session.Query<Movies>()
    .Customize(x => x.WaitForNonStaleResults())
    .Where(x => x.Status == "Released")
    .Count();
Needing to put WaitForNonStaleResults in each query feels like a massive "code smell" (as much as I normally hate the term, it seems completely appropriate here).
The only real solution I've found so far is:
var store = new DocumentStore(); // do whatever
store.DatabaseCommands.DisableAllCaching();
Performance suffers accordingly, but I think slower performance is far less of a sin than unreliable if not outright inaccurate results.
You have the following options according to the official documentation (the most preferable first):
1. Setting a cut-off point:
WaitForNonStaleResultsAsOfLastWrite(TimeSpan.FromSeconds(10))
or
WaitForNonStaleResultsAsOfNow()
This will make sure that you get the latest results up to that point in time (or up to the last write). You can put a cap on it (e.g. 10s) if you want to sacrifice freshness of the results for a faster response.
2. Explicitly waiting for non-stale results:
WaitForNonStaleResultsAsOfNow(TimeSpan.FromSeconds(10))
Again, specifying a time-out would be good practice.
3. Setting querying conventions to apply the same rule to all requests:
store.Conventions.DefaultQueryingConsistency = ConsistencyOptions.AlwaysWaitForNonStaleResultsAsOfLastWrite;
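Putting that convention in place when the document store is created might look roughly like this (a sketch against the older RavenDB client API; the Url value is an assumption):
var store = new DocumentStore { Url = "http://localhost:8080" };
// Every query issued through sessions opened from this store will now wait
// for indexing to catch up with the last write before returning results.
store.Conventions.DefaultQueryingConsistency =
    ConsistencyOptions.AlwaysWaitForNonStaleResultsAsOfLastWrite;
store.Initialize();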