I've built a .NET Core 3.1 API. The API does a lot of work, and the bulk of it is executing SQL Queries. Most of the API uses EF Core for its data access. However, for one particularly critical query, we've discovered that Dapper offered a significant performance advantage.
One test scenario we're testing involves calling the API with many requests per second. When tested In a linear fashion, the requests take a little less than a second. However, if we bombard the API with 10, 15, (up to 120) calls within a second, the query performance drops dramatically. As an example, linear queries take around 500ms. At 15 rapid calls (within a second or so), an average query time is 3600ms.
It's true that with "Wide" scenarios effective throughput increases. However, I can't see the database being the bottleneck here, and I wonder why requests take so much longer when many go on in a small period of time.
Another critical point to note is that if I alternate calls to our API hosted in an environment and my local box for instance, performance and throughput increases by around 30%. This lends credence to the theory that the DB itself isn't the primary issue. I've also found only this question being something close to what I'm asking, and it's focused on EF.
Dapper's used in an unobtrusive way. The EF Context has a typical lifetime (scoped) though I've experimented with singleton and transient. Here's what it looks like:
using (var conn = new SqlConnection(_connectionString)) //from EF properties
{
using (var tx = await conn.BeginTransactionAsync(System.Data.IsolationLevel.ReadUncommitted))
{
var details = conn.QueryAsync<OurObject>(
#"SELECT columns
FROM table1
INNER JOIN table2
WHERE (someConditional = #paramValue)"
, new
{
paramValue
},tx);
var res1 = Success((await details).ToList());
return res1;
}
}
It should be noted that we need to use this isolation level and transactions for a specific reason. We're reading from tables and then writing to them for each request. Dirty Reads are much less of a concern than locks.
We have detailed logging throughout the request lifecycle showing effectively that reads (and to a lesser extent, writes) take longer. The processing of data does not change and is very fast throughout each request.
Finally, I know this is a bigger question than many on S.O. I appreciate any information or things to try you can provide!
As an example, linear queries take around 500ms. At 15 rapid calls (within a second or so), an average query time is 3600ms.
...
I can't see the database being the bottleneck here,
I can.
Assuming that 500ms is database server CPU time, 15 requests would require 7,500ms of CPU time. If the database server has 2 CPU cores it can process 4 queries/second. 15 queries would require 4.5sec. But the server is quite capable of accepting all 15 requests at the same time (after 15 separate client sessions have been esatablished), and time-slicing the CPUs among the running requests. So it may well be that all of the requests take an average of 3600ms.
For perspective 5ms is a "cheap" database query. EG a single-row lookup by key will take well under 1ms of CPU time. For a commonly-run query 500ms is expensive-enough to spend some time on analyzing the query plan and optimizing the execution.
Related
I am asking a question that I assume does not have a simple black and white question but the principal of which I'm asking is clear.
Sample situation:
Lets say I have a collection of 1 million books, and I consistently want to always pull the top 100 rated.
Let's assume that I need to perform an aggregate function every time I perform this query which makes it a little expensive.
It is reasonable, that instead of running the query for every request (100-1000 a second), I would create a dedicated collection that only stores the top 100 books that gets updated every minute or so, thus instead of running a difficult query a 100 times every second, I only run it once a minute, and instead pull from a small collection of books that only holds the 100 books and that requires no query (just get everything).
That is the principal I am questioning.
Should I create a dedicated collection for EVERY query that is often
used?
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough
to leave as is?
Is there any guidelines for best practice in those types of
situations?
Is there a point where if a query runs so often and the data doesn't
change very often that I should keep the data in the server's memory
for direct access? Even if it's a lot of data? How much is too much?
Lastly,
Is there a way in MongoDB to cache results?
If so, how can I tell it to fetch the cached result, and when to regenerate the cache?
Thank you all.
Before getting to collection specifics, one does have to differentiate between "real-time data" vis-a-vis data which does not require immediate and real-time presenting of information. The rules for "real-time" systems are obviously much different.
Now to your example starting from the end. The cache of query results. The answer is not only for MongoDB. Data architects often use Redis, or memcached (or other cache systems) to hold all types of information. This though, obviously, is a function of how much memory is available to your system and the DB. You do not want to cripple the DB by giving your cache too much of available memory, and you do not want your cache to be useless by giving it too little.
In the book case, of 100 top ones, since it is certainly not a real time endeavor, it would make sense to cache the query and feed that cache out to requests. You could update the cache based upon a cron job or based upon an update flag (which you create to inform your program that the 100 have been updated) and then the system will run an $aggregate in the background.
Now to the first few points:
Should I create a dedicated collection for EVERY query that is often used?
Yes and no. It depends on the amount of data which has to be searched to $aggregate your response. And again, it also depends upon your memory limitations and btw let me add the whole server setup in terms of speed, cores and memory. MHO - cache is much better, as it avoids reading from the data all the time.
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough to leave as is?
I dont think anyone can really black and white answer to that question for your system. Is a complicated query just an $aggregate? Or is it $unwind and then a whole slew of $group etc. options following? this is really up to the dataset and how much information must actually be read and sifted and manipulated. It will effect your IO and, yes, again, the memory.
Is there a point where if a query runs so often and the data doesn't change very often that I should keep the data in the server's memory for direct access? Even if it's a lot of data? How much is too much?
See answers above this is directly connected to your other questions.
Finally:
Is there any guidelines for best practice in those types of situations?
The best you can do here is to time the procedures in your code, monitor memory usage and limits, look at the IO, study actual reads and writes on the collections.
Hope this helps.
Use a cache to store objects. For example in Redis use Redis Lists
Redis Lists are simply lists of strings, sorted by insertion order
Then set expiry to either a timeout or a specific time
Now whenever you have a miss in Redis, run the query in MongoDB and re-populate your cache. Also since cache resids in memory therefore your fetches will be extremely fast as compared to dedicated collections in MongoDB.
In addition to that, you don't have to keep have a dedicated machine, just deploy it within your application machine.
I build a tool for data extraction and transformation. Typical use case - transactionally processing lots of data.
Numbers are - about 10sec - 5min duration, 200-10000 row updated (long duration caused not by the database itself but by outside services that used during transaction).
There are two types of agents that access database - multiple read agents, and only one write agent (so, there are never multiple concurrent write).
During the transaction:
Read agents should be able to read database and see it in the current state.
Write agent should be able to read database (it does both - read and write during transaction) and see it in the new (not yet committed) state.
Is PostgreSQL a good choice for that type of load? I know it uses MVCC - so it should be ok in general, but is it ok to use long and big transactions extensively?
What other open-source transactional databases may be a good choice (I am not limited to SQL)?
P.S.
I do not know if the sharding may affect the performance. The database will be sharded. For every shard there will be multiple readers and only one writer, but multiple different shards can be written to at the same time.
I know that it's better not to use outside services during transaction, but in that case - it's the goal. The database used as a reliable and consistent index for some heavy, huge, slow and eventually-consistent data processing tool.
Huge disclaimer: as always, only real life test can tell you the truth.
But, I think PostgreSQL will not let you down, if you use most recent version (at least 9.1, better 9.2) and tune it properly.
I have somewhat similar load in my server, but with slightly worse R/W ratio: about 10:1. Transactions range from few milliseconds up to 1 hour (and sometimes even more), and one transaction can insert or update up to 100k rows. Total number of concurrent writers with long transactions can reach 10 and more.
So far so good - I don't really have any serious issues, performance is great (certainly not worse than I expected).
What really helps is that my hot working data set almost fits into available memory.
So, give it a try, it should work great for your load.
Have a look at this link. Maximum transaction size in PostgreSQL
Basically there can be some technical limits on the software side to how large your transaction can be.
I'd like to know when you should consider using multiple table in your query store.
For example, consider the problem where a product has it's description changed. This change could potentially have a massive impact on the synchronisation of the read only query store if you had many aggregates that included the product description.
At which point should you consider a slight normalization of the data to avoid lengthy synchronisation issues? Is this a no-no or an acceptable compromise?
Thanks,
CQRS is not about using table-per-view, rather table-per-view is an aspect of a system that CQRS makes easier.
It's up to you and depends on your specific context and needs. I would look at it this way, what is the cost of the eventual consistency of that query vs. the need for high query performance. You may want to consider the following two characteristics of your system:
1) The avg. consistency of that command, i.e., how long it takes to update all of the read models affected by the command (also consider whether an optimized stored-proc for the change would outperform say using an ORM or other abstraction to update your database in this way).
My guess is unless you are talking millions, upon millions of records the consistency here is sufficient to meet your requirements and user expectations for consistency, maybe a few seconds.
2) The importance of query performance. How many queries are you getting per second? Can you handle doing a SQL join every time?
In most practical scenarios the optimization of either of these things is moot. You can probably do the update, regardless of records, using a good SP in seconds which is more than enough consistency for a UI refresh (keep in mind the UI that issued the command can be consistent as soon as they know the command succeeded).
And you usually don't need so much query scaling in a system that a single join will hurt you. What you may not want is the added internal complexity of performing these joins in your code and stored procs.
As with all things in CQRS, you don't need to use and optimize every aspect of it from day one. You can optimize these things incrementally. Use joins today, and fully denormalize tomorrow, or vice-versa.
I'm trying to use the Task-Parallel-Library to offload expensive ADO.NET database access from the UI thread (formerly the program I'm re-writing would simply freeze, occasionally updating a VB6 text box with its progress, until the data in the database was fully loaded). I have an complex dependency structure (26 individual tasks), and I'm trying to figure out how much of it is worth parallelizing.
I'd like to know whether or not IO access like this can be parallelized at all with performance bonuses. If not I'll just sequentially load the data and update the UI whenever enough information is loaded to perform that task, but It'd be nice to get an extra boost by loading maybe two things at a time instead of just one (even if I don't get double speedup).
It's possible that parallelizing this will increase performance, but not guaranteed. It all depends on where your bottleneck is.
For example, if a request is expensive because it loads lots of data, then it probably consumes much of your clients network bandwith. Parallelizing in this case wouldn't help much, if at all.
If, on the other hand, the bottleneck is the SQL processing and your SQL request leaves the SQL Server with spare capacity in its own bottleneck, then you can profit from SQL Servers (very good) parallelizing capabilities.
It is also possible that parallelizing slows you down. If for example the SQl Server has not much RAM and access only to a single disk, forcing it to do multiple queries in parallel may lead to more seek activity on the harddisk, which can dramatically slow down the overall read rate.
So, as it often is, the answer isn't a simple yes or no, but "it depends".
I have a ADO.NET/TSQL performance question. We have two options in our application:
1) One big database call with multiple result sets, then in code step through each result set and populate my objects. This results in one round trip to the database.
2) Multiple small database calls.
There is much more code reuse with Option 2 which is an advantage of that option. But I would like to get some input on what the performance cost is. Are two small round trips twice as slow as one big round trip to the database, or is it just a small, say 10% performance loss? We are using C# 3.5 and Sql Server 2008 with stored procedures and ADO.NET.
I would think it in part would depend on when you need the data. For instance if you return ten datasets in one large process, and see all ten on the screen at once, then go for it. But if you return ten datasets and the user may only click through the pages to see three of them then sending the others was a waste of server and network resources. If you return ten datasets but the user really needs to see sets seven and eight only after making changes to sets 5 and 6, then the user would see the wrong info if you returned it too soon.
If you use separate stored procs for each data set called in one master stored proc, there is no reason at all why you can't reuse the code elsewhere, so code reuse is not really an issue in my mind.
It sounds a wee bit obvious, but only send what you need in one call.
For example, we have a "getStuff" stored proc for presentation. The "updateStuff" proc calls "getStuff" proc and the client wrapper method for "updateStuff" expects type "Thing". So one round trip.
Chatty servers are one thing you prevent up front with minimal effort. Then, you can tune the DB or client code as needed... but it's hard to factor out the roundtrips later no matter how fast your code runs. In the extreme, what if your web server is in a different country to your DB server...?
Edit: it's interesting to note the SQL guys (HLGEM, astander, me) saying "one trip" and the client guys saying "multiple, code reuse"...
I am struggling with this problem myself. And I don't have an answer yet, but I do have some thoughts.
Having reviewed the answers given by others to this point, there is still a third option.
In my appllication, around ten or twelve calls are made to the server to get the data I need. Some of the datafields are varchar max and varbinary max fields (pictures, large documents, videos and sound files). All of my calls are synchronous - i.e., while the data is being requested, the user (and the client side program) has no choice but to wait. He may only want to read or view the data which only makes total sense when it is ALL there, not just partially there. The process, I believe, is slower this way and I am in the process of developing an alternative approach which is based on asynchronous calls to the server from a DLL libaray which raises events to the client to announce the progress to the client. The client is programmed to handle the DLL events and set a variable on the client side indicating chich calls have been completed. The client program can then do what it must do to prepare the data received in call #1 while the DLL is proceeding asynchronously to get the data of call #2. When the client is ready to process the data of call #2, it must check the status and wait to proceed if necessary (I am hoping this will be a short or no wait at all). In this manner, both server and client side software are getting the job done in a more efficient manner.
If you're that concerned with performance, try a test of both and see which performs better.
Personally, I prefer the second method. It makes life easier for the developers, makes code more re-usable, and modularizes things so changes down the road are easier.
I personally like option two for the reason you stated: code reuse
But consider this: for small requests the latency might be longer than what you do with the request. You have to find that right balance.
As the ADO.Net developer, your job is to make the code as correct, clear, and maintainable as possible. This means that you must separate your concerns.
It's the job of the SQL Server connection technology to make it fast.
If you implement a correct, clear, maintainable application that solves the business problems, and it turns out that the database access is the major bottleneck that prevents the system from operating within acceptable limits, then, and only then, should you start persuing ways to fix the problem. This may or may not include consolidating database queries.
Don't optimize for performance until a need arisess to do so. This means that you should analyze your anticipated use patterns and determine what the typical frequency of use for this process will be, and what user interface latency will result from the present design. If the user will receive feedback from the app is less than a few (2-3) seconds, and the application load from this process is not an inordinate load on server capacity, then don't worry about it. If otoh the user is waiting an unacceptable amount of time for a response (subjectve but definitiely measurable) or if the server is being overloaded, then it's time to begin optimization. And then, which optimization techniques will make the most sense, or be the most cost effective, depend on what your analysis of the issue tells you.
So, in the meantime, focus on maintainability. That means, in your case, code reuse
Personally I would go with 1 larger round trip.
This will definately be influenced by the exact reusability of the calling code, and how it might be refactored.
But as mentioned, this will depend on your exact situation, where maintainability vs performance could be a factor.