How does page size work in a nested OData query?

Let's say we want to execute the following OData query:
api/accounts?$expand=contacts
Suppose we have 3 accounts (for example a1, a2, a3) and 3 contacts per account.
So if we set "odata.maxpagesize=2" and execute the above query, what will the result be according to the OData standard?
Option-1
a1
- c11
- c12
- (odata.nextlink for c13)
a2
- c21
- c22
- (odata.nextlink for c23)
(odata.nextlink for a3)
Option-2
a1
- c11
- c12
- (odata.nextlink for c13)
(odata.nextlink for a2, a3)
For pagesize=2 it might look easy, but assume pagesize=5000. Will it then return:
Option-1
5000 accounts with 5000 nested contacts each, i.e. 25,000,000 records from that viewpoint.
Option-2
1 account with 5000 nested contacts, i.e. about 5000 records from that viewpoint.
-------- UPDATE-2 -------------------------
We were slightly hesitant about Option-1 because a user can query multiple expands, which can lead to a very large result size.
For example, if a user queries:
accounts?$expand=contacts($expand=callHistory)
So, considering Option-1 with a maxPageSize of 100, if we return records up to maxPageSize at all nested levels, then it will return
100(accounts) * 100(contacts per account) * 100(call logs per contact) = 1 million entities.
And the number of records will grow exponentially if the user uses $expand at deeper nesting levels. Please let me know if my analysis is correct.
On the other hand, Option-2 may be close to what you suggest. Here we would count even the nested results, check whether the entity count exceeds the page size, and then return a nextLink wherever applicable.
It would be great if you could re-validate our approach. :)

Either option is technically compliant with the spec. MaxPageSize is a preference (i.e., a hint to the service) and the service is allowed to return more/less, as long as it correctly returns nextLinks for incomplete collections.
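For reference, the preference is sent on the request as a Prefer header; a quick sketch using the URL from your example:
GET /api/accounts?$expand=contacts
Prefer: odata.maxpagesize=2
A service that honours the preference may echo it back in the response via Preference-Applied: odata.maxpagesize=2.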
So, for example, a service could also look at maxpageSize of 5000 and decide to return the first 1000 parents with up to 5 nested results each. Or, it could ignore maxpagesize entirely and return 200 parents with only next links for nested resources. Or...
I think the best consumer experience is something closer to Option 1, where the service returns some parents (less than or equal to maxpagesize, perhaps based on the number/level of nested results), each with some nested results (again, perhaps based on the number/level of nested results, up to maxpagesize).
Not sure if that helps?
---Response to Update 2---
Yes, your analysis is correct. And it's more complicated than that -- the user could $expand multiple properties at each level, so you would either end up multiplying pagesize by the number of $expanded properties at each level, or you would have to decide how to divide the requested pagesize across all of the $expanded collections at each level.
As I say, Option-2 is valid, and probably easy to implement (just read up to the first pagesize records and then stop); it just might not be as friendly to a consumer that is trying to get a sense of the data (e.g., in a visual display) and then drill in where appropriate.
It kinda depends on the consumption scenario. Option 2 optimizes for only paging when the total number of records requested exceeds maxpagesize, but (in the extreme) the first page of results is not very representative.
On the other hand, if the goal is for someone to view/browse through the data, drilling in where appropriate, then limiting nested collections to some static value (say, the first 5 records of each nested collection) and using maxpagesize to limit either the top-level records or the total count of records would probably be more user-friendly. The only disadvantage is that you might introduce paging for nested collections even when the full result is smaller than maxpagesize.
You also might want to consider which is more efficient to implement. For example, if you are building a query to get data from an underlying store, it may be more efficient to request a fixed maximum number of records for each nested collection, rather than requesting all of the data for nested collections and then throwing the rest away once you've read as much as you need.
Again, keep in mind that the calculation doesn't have to be exact. maxpagesize is just a hint to the service. The service is not required to return exactly that count, so don't get too bogged down on trying to exactly calculate how many records will be returned.
Personal preference: If I had potentially large nested results I would probably lean towards limiting them based on some static value. It makes the results more predictable and uniform, and provides a better representation of the data.
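To make that concrete, here is a rough sketch of the kind of shaping I mean, written as EF-style LINQ on the service side (all type and member names are invented for illustration):
// Cap parents at maxPageSize and nested children at a small fixed limit;
// emit an @odata.nextLink for any collection that was cut short.
const int NestedLimit = 5;
var page = db.Accounts
    .OrderBy(a => a.Id)
    .Take(maxPageSize)
    .Select(a => new
    {
        a.Id,
        a.Name,
        Contacts = a.Contacts.OrderBy(c => c.Id).Take(NestedLimit).ToList(),
        ContactCount = a.Contacts.Count()
    })
    .ToList();
// For each account where ContactCount > NestedLimit, the serializer would add a
// nested nextLink such as accounts(<id>)/contacts?$skiptoken=..., plus a top-level
// nextLink if there were more accounts than maxPageSize.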

Related

Why does response time go up when the number of concurrent requests per second to my ASP.NET Core API goes up?

I'm testing an endpoint under load. For 1 request per second, the average response time is around 200ms. The endpoint does a few DB lookups (all read) that are pretty fast and it's async throughout.
However when doing a few hundred requests per second (req/sec), the average response time goes up to over a second.
I've had a look at the best practices guide at:
https://learn.microsoft.com/en-us/aspnet/core/performance/performance-best-practices?view=aspnetcore-2.2
Some suggestions like "Avoid blocking calls" and "Minimize large object allocations" seem like they don't apply since I'm already using async throughout and my response size for a single request is less than 50 KB.
There are a couple though that seem like they might be useful, for example:
https://learn.microsoft.com/en-us/ef/core/what-is-new/ef-core-2.0#high-performance
https://learn.microsoft.com/en-us/aspnet/core/performance/performance-best-practices?view=aspnetcore-2.2#pool-http-connections-with-httpclientfactory
Questions:
Why would the average response time go up with an increased req/sec?
Are the suggestions above that I've marked as 'might be useful' likely to help? I ask because, while I would like to try them all, I unfortunately have limited time, so I'd like to start with the options most likely to help.
Are there any other options worth considering?
I've had a look at these two existing threads, but neither answer my question:
Correlation between requests per second and response time?
ASP.NET Web API - Handle more requests per second
It will be hard to answer your specific issue without access to the code, but the main thing to consider is the size and complexity of the database queries being generated by EF. Using async/await will increase the responsiveness of your web server to start requests, but the request handling time under load will depend largely on the queries being run, as the database becomes the contention point. You will want to ensure that all queries are as minimalist as possible. For example, there is a huge difference between the following 3 statements:
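// 1) Loads the entire table (plus its Include data) into memory, then filters client-side: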
var someData = context.SomeTable.Include(x => x.SomeOtherTable)
    .ToList()
    .Where(x => x.SomeCriteriaMethod())
    .ToList();
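// 2) Filters in the database, but still pulls back whole entity rows (and the Include data):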
var someData = context.SomeTable.Include(x => x.SomeOtherTable)
    .Where(x => x.SomeField == someField && x.SomeOtherTable.SomeOtherField == someOtherField)
    .ToList();
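// 3) Filters in the database and selects only the columns actually needed: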
var someData = context.SomeTable
    .Where(x => x.SomeField == someField && x.SomeOtherTable.SomeOtherField == someOtherField)
    .Select(x => new SomeViewModel
    {
        SomeTableId = x.SomeTableId,
        SomeField = x.SomeField,
        SomeOtherField = x.SomeOtherTable.SomeOtherField
    })
    .ToList();
Examples like the first above are extremely inefficient as they end up loading all data from the related tables out of the database before filtering rows. Even though your web server may only pass back a few rows, it has requested everything from the database. These kinds of scenarios creep into apps when developers want to filter on a value that EF cannot translate to SQL (such as a method call), so they work around it by inserting a ToList() call; it can also be introduced as a by-product of poor separation, such as a repository pattern that returns IEnumerable.
The second example is a little better because it avoids the read-everything ToList() call, but the query still loads entire entity rows containing data that isn't necessary. This ties up resources on both the database and web servers.
The third example demonstrates refining queries to just return the absolute minimum of data that the consumer needs. This can make better use of indexes and execution plans on the database server.
Another performance pitfall you can face under load is lazy loading. Databases can execute only a finite number of concurrent requests, so if some queries kick off additional lazy-load requests, those extra queries execute immediately when there is no load; under load, however, they are queued up along with every other query and lazy-load request, which can drag down data retrieval.
Ultimately you should run an SQL profiler against your database to capture the kinds and numbers of SQL queries being executed. When executing in a test environment, pay close attention to the Read count and CPU cost rather than the total execution time. As a general rule of thumb higher read and CPU cost queries will be far more susceptible to execution time blow-out under load. They require more resources to run, and "touch" more rows meaning more waiting for row/table locks.
Another thing to watch out for is "heavy" queries in very large data systems that need to touch a lot of rows, such as reports and, in some cases, highly customizable search queries. If these are needed, consider planning your database design to include a read-only replica to run reports or large search expressions against, to avoid row-lock scenarios in your primary database that can degrade responsiveness for the typical read and write queries.
Edit: Identifying lazy load queries.
These show up in a profiler where you query against a top level table, but then see a number of additional queries for related tables following it.
For example, say you have a table called Order, with a related table called Product, another called Customer and another called Address for a delivery address. To read all orders for a date range you'd expect to see a query such as:
SELECT [OrderId], [Quantity], [OrderDate], [ProductId], [CustomerId], [DeliveryAddressId] FROM [dbo].[Orders] WHERE [OrderDate] >= '2019-01-01' AND [OrderDate] < '2020-01-01'
Where you just wanted to load Orders and return them.
When the serializer iterates over the fields, it finds a Product, Customer, and Address referenced, and by attempting to read those properties it trips lazy loading, resulting in:
SELECT [CustomerId], [Name] FROM [dbo].[Customers] WHERE [CustomerId] = 22
SELECT [ProductId], [Name], [Price] FROM [dbo].[Products] WHERE [ProductId] = 1023
SELECT [AddressId], [StreetNumber], [City], [State], [PostCode] FROM [dbo].[Addresses] WHERE [AddressId] = 1211
If your original query returned 100 orders, you would potentially see 100 of the above sets of queries: one set per order, as the lazy-load hit on each order row looks up the related customer by customer ID, the related product by product ID, and the related address by delivery address ID. This can, and does, get costly. It may not be noticeable in a test environment, but it adds up to a lot of extra queries.
If eager loaded using .Include() for the related entities, EF will compose JOIN statements to get the related rows all in one hit which is considerably faster than fetching each individual related entity. Still, that can result in pulling a lot of data you don't need. The best way to avoid this extra cost is to leverage projection through Select to retrieve just the columns you need.
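Tying that back to the Orders example, a rough sketch of such a projection might look like the following (entity and property names are assumed from the example above; rangeStart/rangeEnd are hypothetical):
var orderSummaries = context.Orders
    .Where(o => o.OrderDate >= rangeStart && o.OrderDate < rangeEnd)
    .Select(o => new
    {
        o.OrderId,
        o.Quantity,
        o.OrderDate,
        ProductName = o.Product.Name,
        CustomerName = o.Customer.Name,
        DeliveryCity = o.DeliveryAddress.City
    })
    .ToList();
// EF composes this into a single SELECT with JOINs that returns only these
// columns; no lazy loads fire because no related entities are materialized.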

REST - should endpoints include summary data?

Simple model:
Products, which have weights (can be mixed - ounces, grams, kilograms etc)
Categories, which have many products in them
REST endpoints:
/products - get all products and post to add a new one
/products/id - delete,update,patch a single product
/categories/id - delete,update,patch a single category
/categories - get all categories and post to add a new one
The question is that the frontend client wants to display a chart of the total weight of products in ALL categories. Imagine a bar chart or a Pie chart showing all categories, and the total weight of products in each.
Obviously the product model has a 'weight_value' and a 'weight_unit' so you know the weight of a product and its measure (oz, grams etc).
I see 2 ways of solving this:
In the category model, add a computed field that totals the weight of all the products in a category. The total is calculated on the fly for that category (not stored) and so is always up to date. Note the client always needs all categories (eg. to populate drop downs when creating a product, you have to choose the category it belongs to) so it now will automatically always have the total weight of each category. So constructing the chart is easy - no need to get the data for the chart from the server, it's already on the client. But first time loading that data might be slow. Only when a product is added will the totals be stale (insignificant change though to the overall number and anyway stale totals are fine).
Create a separate endpoint, say categories/totals, that returns for all categories: a category id, name and total weight. This endpoint would loop through all the categories, calculating the weight for each and returning a category dataset with weights for each.
The problem I see with (1) is that it is not that performant. I know it's not very scalable, especially when a category ends up with a lot of products (millions!) but this is a hobby project so not worried about that.
The advantage of (1) is that you always have the data on the client and don't have to make a separate request to get the data for the chart.
However, the advantage of (2) is that every request to category/id does not incur a potentially expensive totalling operation (because now it is inside its own dedicated endpoint). Of course that endpoint will have to do quite a complex database query to calculate the weights of all products for all categories (although handing that off to the database should be the way as that is what db's are good for!)
I'm really stumped on which is the better way to do this. Which way is more true to the RESTful way (HATEOAS, basically)?
I would go with 2, favouring scalability and best practice. It does not really make sense to perform the weight calculation every time a category is requested, and even though you may not anticipate this being a problem since it is a 'hobby' project, it's always best to avoid shortcuts where the benefits are minimal (or so experience has taught me!).
The only advantage of choosing 1 is that you avoid setting up one extra endpoint and making one extra call to get the weight totals; but the extra call shouldn't add much overhead, and setting up the extra endpoint shouldn't take much effort.
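For illustration, a rough sketch of what that extra endpoint could look like with EF-style LINQ (names are invented, and it assumes product weights are normalized to a single unit such as grams before summing):
[HttpGet("categories/totals")]
public async Task<IActionResult> GetCategoryTotals()
{
    var totals = await _db.Categories
        .Select(c => new
        {
            c.Id,
            c.Name,
            // Hypothetical column holding the weight already converted to grams.
            TotalWeightGrams = c.Products.Sum(p => (double?)p.WeightInGrams) ?? 0
        })
        .ToListAsync();
    return Ok(totals);
}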
Regardless of whether you choose 1 or 2, I would consider caching the weight totals (for a reasonable amount of time, depending on the accuracy required) to increase performance even further.
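If the backend happens to be ASP.NET Core, a minimal sketch of that caching with IMemoryCache (the cache key and duration are arbitrary):
var totals = await _cache.GetOrCreateAsync("category-totals", entry =>
{
    entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(5);
    // Reuse the same aggregation as above; it only runs on a cache miss.
    return _db.Categories
        .Select(c => new { c.Id, c.Name, TotalWeightGrams = c.Products.Sum(p => (double?)p.WeightInGrams) ?? 0 })
        .ToListAsync();
});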

Nosql database design for complex querying

A nosql question:
Let's say I have this scenario:
A user has a status that changes often (let's say every second); it also has various characteristics such as country (up to 10k characteristics per user).
A user can post messages which have different types.
The issue:
The scenario is, in my opinion, very RDBMS-oriented, where JOINs would be used a lot for querying. However, that is not an option (for the sake of the exercise). Therefore, I am not looking for a pseudo-relational solution like Hive or other solutions with pseudo-joins; I am looking for something like MongoDB, which can use map-reduce or the aggregation framework.
My solution (using mongodb):
Let's say you have 3 collections:
user => user characteristics (a large number of different characteristics, such as age/sex...)
message => message specific field
status => status specific field
The possible way to tackle this problem (as far as I know) are:
Denormalize the data by duplicating the user fields and embedding them in message and status (or putting everything in one collection) => Does not seem optimal, as you have a lot of characteristics per user and you will hit the document size limit (16 MB in MongoDB); you could use GridFS, but I am worried about the performance of something like that, and you also duplicate tons of storage for data that isn't useful.
Use a SQL-like solution by adding a user_id reference to message and status => Seems like a reasonable solution. However, you are then trapped (in terms of query performance) if you actually want to make specific queries such as: count the number of messages of type X for users whose last status equals Z and whose characteristic Y equals E, grouped by characteristic W. In SQL it would be:
SELECT user.characteristicW, COUNT(*)
FROM message
INNER JOIN status ON status.user_id = message.user_id
INNER JOIN user ON user.id = message.user_id
WHERE status.type = Z AND user.characteristicY = E
GROUP BY user.characteristicW
(this query is not totally exact, as you want to know whether the last status equals Z, not whether the user ever had a status equal to Z; that would require a nested select, but that is not the point of the exercise). It quickly becomes very demanding when you have to make several queries: in this example, one to get the user ids whose last status equals Z, another to filter that list down to users whose characteristic Y equals E, then one to get all messages from those users, and finally a map-reduce job to group them by characteristic W.
Go with double references: user referencing message and message referencing user, and likewise status referencing user and user referencing status. => Might seem fine, but with the user document already being big you come back to potentially the same issue as solution 1, assuming there will be tons of messages and statuses.
I went with option 2, but I am unhappy with it, as the query processing time does not seem optimal.
Question:
In a scenario like the above, what are the best practices for implementing a scalable solution that allows complex querying like the example I gave?

How to fetch a continuous list with PostgreSQL in a web API

I am making an API over HTTP that fetches many rows from PostgreSQL with pagination. In ordinary cases, I would usually implement such pagination with a naive OFFSET/LIMIT clause. However, there are some special requirements in this case:
There are so many rows that I believe users cannot reach the end (imagine a Twitter timeline).
Pages do not have to be randomly accessible, only sequentially.
The API would return a URL containing a cursor token that points to the page of continuous chunks.
Cursor tokens do not have to exist permanently, only for some time.
The ordering fluctuates frequently (like Reddit rankings); however, continued cursors should keep a consistent ordering.
How can I achieve this? I am ready to change my whole database schema for it!
Assuming it's only the ordering of the results that fluctuates and not the data in the rows, Fredrik's answer makes sense. However, I'd suggest the following additions:
Store the id list in a PostgreSQL table using the array type rather than in memory. Doing it in memory, unless you carefully use something like Redis with auto-expiry and memory limits, is setting yourself up for a memory-consumption DoS attack. I imagine it would look something like this:
create table foo_paging_cursor (
    cursor_token ...,       -- probably a uuid is best, or a timestamp (see below)
    result_ids integer[],   -- or text[] if you have non-integer ids
    expiry_time TIMESTAMP
);
You need to decide whether the cursor_token and result_ids can be shared between users to reduce your storage needs and the time needed to run the initial query per user. If they can be shared, choose a cache window, say 1 or 5 minutes, and then upon a new request create the cursor token for that time period and check whether the result ids have already been calculated for that token. If not, add a new row for that token. You should probably add a lock around the check/insert code to handle concurrent requests for a new token.
Have a scheduled background job that purges old tokens/results and make sure your client code can handle any errors related to expired/invalid tokens.
Don't even consider using real db cursors for this.
Keeping the result ids in Redis lists is another way to handle this (see the LRANGE command), but be careful with expiry and memory usage if you go down that path. Your Redis key would be the cursor_token and the ids would be the members of the list.
I know absolutely nothing about PostgreSQL, but I'm a pretty decent SQL Server developer, so I'd like to take a shot at this anyway :)
How many rows/pages do you expect a user would maximally browse through per session? For instance, if you expect a user to page through a maximum of 10 pages per session [each page containing 50 rows], you could take that maximum and set up the web service so that when the user requests the first page, you cache 10*50 rows (or just the ids for the rows, depending on how much memory and how many simultaneous users you've got).
This would certainly help speed up your web service, in more ways than one. And it's quite easy to implement, too. So:
When a user requests data from page #1: run a query (complete with ORDER BY, joins, etc.), store all the ids in an array (up to a maximum of 500 ids), and return the data rows that correspond to the ids at positions 0-49 in the array.
When the user requests pages #2-10: return the data rows that correspond to the ids at positions (page-1)*50 through (page*50)-1 in the array.
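In code, the slicing itself is trivial; a sketch in C# (assuming cachedIds holds the ids stored when page #1 was requested):
const int PageSize = 50;
// cachedIds: the up-to-500 ids cached when the user requested page #1.
IEnumerable<int> IdsForPage(int[] cachedIds, int page) =>
    cachedIds.Skip((page - 1) * PageSize).Take(PageSize);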
You could also bump up the numbers; an array of 500 ints would only occupy about 2 KB of memory, but it also depends on how fast you want your initial query/response to be.
I've used a similar technique on a live website, and when the user continued past page 10, I just switched to regular queries. I guess another solution would be to continue expanding/filling the array (running the query again, but excluding the ids already included).
Anyway, hope this helps!

How to get total number of potential results in Lucene

I'm using lucene on a site of mine and I want to show the total result count from a query, for example:
Showing results x to y of z
But I can't find any method that will return the total number of potential results. I can only seem to find methods where you have to specify the number of results you want, and since I only want 10 per page it seems logical to pass in 10 as the number of results.
Or am I doing this wrong, should I be passing in say 1000 and then just taking the 10 in the range that I require?
BTW, since I know you personally I should point out for others I already knew you were referring to Lucene.net and not Lucene :) although the API would be the same
In versions prior to 2.9.x you could call IndexSearcher.Search(Query query, Filter filter), which returned a Hits object, one of whose properties [methods, technically, due to the Java port] was Length().
This is now marked Obsolete since it will be removed in 3.0; the remaining Search overloads return TopDocs or TopFieldDocs objects.
Your alternatives are
a) Use IndexSearcher.Search(Query query, int count), which will return a TopDocs object, so TopDocs.TotalHits will show you the total possible hits, but at the expense of actually creating <count> results
b) A faster way is to implement your own Collector object (inherit from Lucene.Net.Search.Collector) and call IndexSearcher.Search(Query query, Collector collector). The Search method will call Collect(int docId) on your collector for every match, so if you keep track of that internally you have a way of counting all the results.
It should be noted Lucene is not a total-resultset query environment and is designed to stream the most relevant results to you (the developer) as fast as possible. Any method which gives you a "total results" count is just a wrapper enumerating over all the matches (as with the Collector method).
The trick is to keep this enumeration as fast as possible. The most expensive part is deserialisation of Documents from the index, populating each field, etc. At least the newer API design, which requires you to write your own Collector, makes the principle clear: avoid deserialising each result from the index, since only matching document ids and a score are provided by default.
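As a rough sketch (member names follow the 2.9/3.x Collector contract and may differ slightly between Lucene.Net versions), a hit-counting collector could look like this:
public class CountingCollector : Lucene.Net.Search.Collector
{
    public int Count { get; private set; }
    public override void SetScorer(Scorer scorer) { }                        // scores aren't needed just to count
    public override void Collect(int doc) { Count++; }                       // called once per matching doc id
    public override void SetNextReader(IndexReader reader, int docBase) { }
    public override bool AcceptsDocsOutOfOrder { get { return true; } }
}
// Usage: searcher.Search(query, collector); then read collector.Count.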
The top docs collector does this for you, for example
TopDocs topDocs = searcher.search(qry, 10);
int totalHits = topDocs.totalHits;
The above query will count all hits, but return only 10.