I'm using EF Core 3.0 code first with an MSSQL database. I have a big table with ~5 million records, and indexes on ProfileId, EventId and UnitId. This query takes ~25-30 seconds to execute. Is that normal, or is there a way to optimize it?
await (from x in _dbContext.EventTable
       where x.EventId == request.EventId
       group x by new { x.ProfileId, x.UnitId } into grouped
       select new
       {
           ProfileId = grouped.Key.ProfileId,
           UnitId = grouped.Key.UnitId,
           Sum = grouped.Sum(a => a.Count * a.Price)
       }).AsNoTracking().ToListAsync();
I tried to loop through profileIds, adding another WHERE clause and removing ProfileId from the grouping key, but it was slower.
Capture the SQL being executed with a profiling tool (SSMS has one, or Express Profiler), then run it within SSMS with the execution plan enabled. This may highlight an indexing improvement. If the execution time in SSMS roughly correlates with what you're seeing from EF, then the only real avenue of improvement will be hardware on the SQL box. You are running a query that will touch 5M rows any way you look at it.
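If you don't have a profiler handy, EF Core 3.0 can also log the SQL it generates. A minimal sketch, assuming a console logger is acceptable (the context name and connection string here are placeholders, not from the question):

using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Logging;

public class AppDbContext : DbContext // placeholder name
{
    // Log only the generated SQL (the command category) to the console.
    private static readonly ILoggerFactory SqlLoggerFactory =
        LoggerFactory.Create(builder => builder
            .AddFilter(DbLoggerCategory.Database.Command.Name, LogLevel.Information)
            .AddConsole());

    protected override void OnConfiguring(DbContextOptionsBuilder options)
        => options
            .UseLoggerFactory(SqlLoggerFactory)
            .UseSqlServer("<connection string>");
}

If the plan then shows an EventId seek followed by key lookups, the kind of improvement it points to is typically a covering index, e.g. CREATE INDEX IX_EventTable_EventId ON EventTable (EventId) INCLUDE (ProfileId, UnitId, Count, Price), with the column names inferred from the LINQ query above.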
Operations like this are not that uncommon, just not something a user would expect to sit and wait for. This is more of a reporting-type request, so when faced with requirements like this I would look at letting users queue up a request and receive a notification when the operation completes so they can fetch the results. This should be set up to prevent users from repeatedly re-requesting ("not sure if I clicked" spam), with some consideration for ensuring too many requests from multiple users aren't kicked off simultaneously. Ideally this would run against a read-only reporting replica rather than the read-write production DB, so locks don't slow down or interfere with regular operations.
Try removing ToListAsync(), or keep the query as an IQueryable (e.g. via AsQueryable()) so that execution is deferred until the results are actually needed. Calling ToList materializes the entire result set up front, which slows things down.
await (from x in _dbContext.EventTable
       where x.EventId == request.EventId
       group x by new { x.ProfileId, x.UnitId } into grouped
       select new
       {
           ProfileId = grouped.Key.ProfileId,
           UnitId = grouped.Key.UnitId,
           Sum = grouped.Sum(a => a.Count * a.Price)
       });
Excuse what seems like it could be a duplicate. I'm familiar with multiple updates in Postgres... but I can't seem to figure out a way around this one...
I have a photos table with the following columns: id (primary key), url, sort_order, and owner_user_id.
We would like our interface to let the user reorder their existing photos in a collection view. When a drag-reorder interaction is complete, I am able to send a POST body to our API with the following:
req.body.photos = [{id: 345, order: 1}, {id: 911, order: 2}, ...<etc>]
I can then turn around and run the following query in a loop, once for each item in the array.
photos.forEach(function (item) {
  db.runQuery('update photos set sort_order=$1 where id=$2 and owner_user_id=$3', [item.order, item.id, currentUserId])
})
It's generally frowned upon to run database queries inside loops, so if there's any way this can be done with one query, that would be fantastic.
Much thanks in advance.
Running a select query inside a loop is definitely questionable, but I don't think multiple updates are necessarily frowned upon if the data you are updating doesn't natively reside in the database. Doing them as separate transactions, however, might be.
My recommendation would be to wrap all known updates in a single transaction. This is not only kinder to the database (compile once, execute many, commit once), but this is an ACID approach to what I believe you are trying to do. If, for some reason, one of your updates fails, they will all fail. This prevents you from having two photos with an order of "1."
I didn't recognize your language, but here is an example of what this might look like in C#:
NpgsqlConnection conn = new NpgsqlConnection(connectionString);
conn.Open();
NpgsqlTransaction trans = conn.BeginTransaction();
NpgsqlCommand cmd = new NpgsqlCommand("update photos set sort_order=:SORT where id=:ID",
    conn, trans);
cmd.Parameters.Add(new NpgsqlParameter("SORT", DbType.Int32));
cmd.Parameters.Add(new NpgsqlParameter("ID", DbType.Int32));
foreach (var photo in photos)
{
    cmd.Parameters[0].Value = photo.SortOrder;
    cmd.Parameters[1].Value = photo.Id;
    cmd.ExecuteNonQuery();
}
trans.Commit();
I think in Perl, for example, it would be even simpler -- turn off DBI AutoCommit and commit after the updates.
CAVEAT: Of course, add error trapping -- I was just illustrating what it might look like.
Also, I changed your update SQL: if "id" is the primary key, I don't think you need the additional owner_user_id=$3 clause to make it work.
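Since the original question asked whether this can be done in one query: PostgreSQL can drive a single UPDATE from parallel arrays via unnest(). A sketch of that alternative using Npgsql (the photos collection, its property names, and currentUserId are assumptions carried over from the question):

var conn = new NpgsqlConnection(connectionString);
conn.Open();
var cmd = new NpgsqlCommand(
    @"update photos as p
      set sort_order = v.sort_order
      from unnest(@ids, @orders) as v(id, sort_order)
      where p.id = v.id and p.owner_user_id = @owner", conn);
// Npgsql maps int[] parameters to integer[] on the PostgreSQL side.
cmd.Parameters.AddWithValue("ids", photos.Select(p => p.Id).ToArray());
cmd.Parameters.AddWithValue("orders", photos.Select(p => p.SortOrder).ToArray());
cmd.Parameters.AddWithValue("owner", currentUserId);
cmd.ExecuteNonQuery();

One statement also means one implicit transaction, so the all-or-nothing behavior described above comes for free.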
I'm trying to sort out a very weird behavior.
I'm working with:
JBoss AS 7.1.1
EJB 3.0
JPA
XA DataSource
Oracle 11g
In one of the system's functionalities, the user can see the status of each Store. For each Store I fire a query to sum up all the files that have been processed. The query is something like this:
SELECT
    SUM(CASE file.type WHEN 'TYPE_1' THEN 1 ELSE 0 END),
    SUM(CASE file.type WHEN 'TYPE_2' THEN 1 ELSE 0 END),
    SUM(CASE file.type WHEN 'TYPE_3' THEN 1 ELSE 0 END)
FROM
    File file
WHERE
    file.type IN ('TYPE_1', 'TYPE_2', 'TYPE_3')
    AND file.status = 'RECEIVED'
    AND file.store.id = :storeId
The thing is, the user can select which of the stores he wants to check, and that's where things get weird.
When I check the first store, the result comes blazing fast, but all subsequent queries take significantly more time. Let me exemplify:
User checks store 15 (Blazing fast result) - About 200 ms
User checks store 2 (Very slow result) - About 8000 ms
Now pay attention to this part, it's very important.
User logs out, and logs in again.
User checks store 2 (the one that took 8000ms), and now the result is blazing fast.
This is very odd, the same store that took a while before, is now loading pretty fast.
Whenever I try the queries on SQLDeveloper the results come pretty fast as well.
I annotated my EJB with @TransactionAttribute(TransactionAttributeType.NOT_SUPPORTED) but I didn't see any difference in execution time.
I created a standalone project to run the queries using JDBC and the result was fast again, which leaves me thinking it may be some configuration in my DataSource, persistence.xml or something like that.
Does anyone have any clues why this happens?
A few things:
When the user checks store 2 the second time, Oracle is probably serving the result from its cache, and therefore it is blazing fast.
How many File records does store 2 have? Try a GROUP BY query to see whether this File table needs special statistics. If, for example, store 2 has dramatically more File records than the other stores, then try this:
begin
    dbms_stats.gather_table_stats(user, 'file', estimate_percent => 100);
end;
This will ensure the table's statistics are accurate.
You can also optimize the query itself: you don't have to perform three separate SUMs; you can do something like this instead:
select f.type, count(*)
from File f
where f.store.id = :storeId
  and f.type IN ('TYPE_1', 'TYPE_2', 'TYPE_3')
group by f.type
You may also be running into a cardinality feedback problem; have a look at this blog post:
http://orcasoracle.squarespace.com/oracle-rdbms/2012/12/18/when-a-query-runs-slower-on-second-execution-a-possible-side.html
/KR
I have two SQL Server databases, hosted on two different servers. I need to extract data from the first database, which is going to be a list of integers. Then I need to compare this list against data in multiple tables in the second database. Depending on some conditions, I need to update or insert some records in the second database.
My solution:
(WCF Service/Entity Framework using LINQ to Entities)
Step 1: Get the list of integers from the first DB. This takes less than a second and returns 20,942 records.
Step 2: I use the list of integers to compare against a table in the second DB using the following query:
List<int> pastDueAccts; // Assuming this is the list from step 1
var matchedAccts = from acct in context.AmAccounts
                   where pastDueAccts.Contains(acct.ARNumber)
                   select acct;
The query above takes so long that it gives a timeout error, even though the AmAccount table only has ~400 records.
After I get these matchedAccts, I need to update or insert records in a separate table in the second db.
Can someone help me do step 2 more efficiently? I think the Contains function makes it slow. I tried brute force too, putting a foreach loop around it and extracting one record at a time for the comparison; it still takes too long and gives a timeout error. The database server shows only 30% of its memory in use.
Profile the SQL query being sent to the database using SQL Profiler. Capture the SQL statement sent to the database and run it in SSMS. You should be able to see the overhead imposed by Entity Framework at this point. Can you paste the SQL statement emitted in step 2 into your question?
The query itself is going to have all 20,942 integers inlined into it, since EF translates the Contains call into one giant IN clause.
If your AmAccount table will always have a low number of records like that, you could just return the entire list of ARNumbers, compare them to the list, then be specific about which records to return:
List<int> pastDueAccts; // Assuming this is the list from step 1
List<int> amAcctNumbers = (from acct in context.AmAccounts
                           select acct.ARNumber).ToList();

// Get a list of the integers that appear in both lists
var pastDueAmAcctNumbers = pastDueAccts.Intersect(amAcctNumbers).ToList();

var pastDueAmAccts = from acct in context.AmAccounts
                     where pastDueAmAcctNumbers.Contains(acct.ARNumber)
                     select acct;
You'll still have to worry about how many ids you are supplying to that query, and you might end up needing to retrieve them in batches.
UPDATE
Hopefully somebody has a better answer than this, but with so many records and doing this purely in EF, you could try batching it like I stated earlier:
// Suggest disabling auto-detect changes,
// otherwise you will probably have some serious memory issues
// with 2MM+ records
context.Configuration.AutoDetectChangesEnabled = false;

List<int> pastDueAccts; // Assuming this is the list from step 1
const int batchSize = 100;
var matched = new List<AmAccount>();
for (int i = 0; i < pastDueAccts.Count; i += batchSize)
{
    // GetRange must not run past the end of the list on the final batch
    var batch = pastDueAccts.GetRange(i, Math.Min(batchSize, pastDueAccts.Count - i));
    matched.AddRange(from acct in context.AmAccounts
                     where batch.Contains(acct.ARNumber)
                     select acct);
}
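If the batching is still too slow, another option is to push the whole comparison to the server: bulk-load the ids into a temp table and join against it. A rough sketch, assuming SQL Server with EF6-style APIs (System.Data, System.Data.SqlClient), and guessing the table name AmAccounts for the AmAccount entity:

// Build an in-memory table of the 20,942 ids.
var table = new DataTable();
table.Columns.Add("ARNumber", typeof(int));
foreach (var id in pastDueAccts)
    table.Rows.Add(id);

var conn = (SqlConnection)context.Database.Connection;
conn.Open();

// The temp table lives for the lifetime of this open connection.
using (var create = new SqlCommand("CREATE TABLE #PastDue (ARNumber int PRIMARY KEY);", conn))
    create.ExecuteNonQuery();

using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#PastDue" })
    bulk.WriteToServer(table);

// One set-based join instead of a 20k-value IN list.
var matched = context.Database.SqlQuery<AmAccount>(
    "SELECT a.* FROM AmAccounts a JOIN #PastDue p ON a.ARNumber = p.ARNumber").ToList();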
I have a project that pulls data from a service (returning XML), which is deserialized into objects/entities.
I'm using EF code first, and testing was working fine until it came to a big chunk of data; not too big, only 150K records. I used SQL Profiler to check the SQL statements and they are really fast, but there is a huge slowness issue around generating the insert statements.
Simply put, the data model is simple: class Client has five child object sets and one many-to-many relationship.
The IDs for the model are provided by the service, so I cleaned up duplicate instances of the same entity (same ID).
var clientList = service.GetAllClients(); // returns IEnumerable<Client>, about 10K clients
var filteredList = Client.RemoveDuplicateInstancesSameEntity(clientList); // returns IEnumerable<Client>

int cur = 0;
int batch = 100;
while (true)
{
    logger.Trace("POINT A: get next batch");
    var importSegment = filteredList.Skip(cur).Take(batch).OrderBy(x => x.Id).ToList();
    if (!importSegment.Any())
        break;

    logger.Trace("POINT B: saving to DB");
    importSegment.ForEach(c => repository.addClient(c));

    logger.Trace("POINT C: calling persist");
    repository.persist();
    cur = cur + batch;
}
The logic is simple: break it up into batches to speed up the process. Each 100 Clients create about 1,000 insert statements (for the child records and one many-to-many table).
I'm using the profiler and logging to analyze this. The log shows POINT B as the last step every time, but I don't see any insert statements in the profiler yet. Then, two minutes later, I see all the insert statements, followed by POINT B for the next batch, and then two minutes again.
Did I do anything wrong, or is there a setting or anything I can do to improve this?
Inserting 1K records seems fast, and the database is wiped when the process starts, so there are no existing records. It doesn't seem to be an issue with SQL slowness, but with EF generating the insert statements?
The project works, but it is slow. I want to speed it up and understand more about EF when it comes to big chunks of data. Or is this normal?
The first 100 are fast, and then it gets slower and slower and slower. It seems like the issue is at POINT B. Is the problem that the repo/dbcontext is holding too much data to handle it in a timely manner?
The repo inherits from DbContext, and addClient is simply:
dbcontext.Client.Add(client)
Thank you very much.
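For what it's worth, the slowdown described at POINT B is the classic symptom of EF's change tracker growing: every tracked Add makes the next change-detection pass more expensive, so each batch is slower than the last. A common mitigation is a fresh context per batch; a minimal sketch, assuming the repository can be constructed and disposed per batch (ClientRepository here is hypothetical, named after the question's code):

var allClients = filteredList.ToList(); // enumerate the service data once
int cur = 0;
int batch = 100;
while (cur < allClients.Count)
{
    var importSegment = allClients.Skip(cur).Take(batch).ToList();

    // A fresh context keeps the change tracker small, so persisting
    // only has to scan ~100 Client graphs instead of everything so far.
    using (var repository = new ClientRepository()) // hypothetical ctor
    {
        importSegment.ForEach(c => repository.addClient(c));
        repository.persist();
    }

    cur += batch;
}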
Suppose table X has 100 tuples.
Will the following approach to scanning X generate all the tuples of table X in MySQL?
for start in [0, 10, 20, ..., 90]:
    print results of "select * from X LIMIT start, 10;"
I ask, because I've been using PostgreSQL, which clearly says that this approach need not work, but there seems to be no such info for MySQL. If it won't, is there a way to return results in a fixed ordering without knowing any other info about the table (like what the primary key fields are)?
I need to scan each tuple in a table in an application, and I want a way to do it without using too much memory in the application (so simply doing a "select * from X" is out).
No, that isn't a safe assumption. Without an ORDER BY clause, there is no guarantee that your query will return the rows in the same order each time, so consecutive LIMIT windows may overlap or skip tuples. If the table is properly indexed, adding an ORDER BY on the indexed column (e.g. select * from X ORDER BY id LIMIT start, 10) shouldn't be too expensive.
Edit: Non-ORDER BYed results will sometimes be in the order of the clustered index, but I wouldn't put any money on that!
If you are using InnoDB or MyISAM table types, a better approach is to use the HANDLER interface. Only MySQL supports this, but it does what you want:
http://dev.mysql.com/doc/refman/5.0/en/handler.html
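For illustration, here is a rough sketch of driving HANDLER from client code. C# with the MySql.Data provider is assumed purely as an example; HANDLER statements are plain SQL, so any client library works:

using MySql.Data.MySqlClient;

using (var conn = new MySqlConnection(connectionString))
{
    conn.Open();
    new MySqlCommand("HANDLER X OPEN", conn).ExecuteNonQuery();

    // READ FIRST starts a natural-order scan; READ NEXT continues it.
    var readSql = "HANDLER X READ FIRST LIMIT 10";
    while (true)
    {
        int rows = 0;
        using (var cmd = new MySqlCommand(readSql, conn))
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                rows++;
                // process the current tuple here
            }
        }
        if (rows == 0)
            break; // scan exhausted
        readSql = "HANDLER X READ NEXT LIMIT 10";
    }

    new MySqlCommand("HANDLER X CLOSE", conn).ExecuteNonQuery();
}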
Also, the MySQL API supports two modes of retrieving data from the server:
store result: in this mode, as soon as a query is executed, the API retrieves the entire result set before returning to the user code. This can use up a lot of client memory buffering results, but minimises the use of resources on the server.
use result: in this mode, the API pulls results row-by-row and returns control to the user code more frequently. This minimises the use of memory on the client, but can hold locks on the server for longer.
Most of the MySQL APIs for various languages support this in one form or another. It is usually an argument that can be supplied when creating the connection, and/or a separate call that can be used against an existing connection to switch it to that mode.
So, in answer to your question, I would do the following:
set the connection to "use result" mode;
run select * from X;
fetch and process rows one at a time, so the full table is never buffered in application memory.