CreateOrUpdate Operation Over Many Records Using Entity Framework 6

CreateOrUpdate Operation Over Many Records Using Entity Framework 6 - entity-framework

I am writing a web crawler for fun.
I have a remote SQL database that I want to save information about each page I visit and I am using Entity Framework 6 to persist data. For the sake of illustration, let's assume that the only data I want to save about each page is the last time I visited it.
Updating this database is very slow. Here is the operation that I want make fast:
For the current page, check if it exists in the database already
If it exists already, update the "lastVisited" timestamp field on the record and save.
If it doesn't exist, create it.
Currently I can only do 300 updates per minute. My SQL server instance shows almost no activity, so I assume I am client-bound.
My code is naive:
public static void AddOrUpdatePage(long id, DataContext db)
{
Page p = db.Pages.SingleOrDefault(f => f.id == id);
if (p == null)
{
// create
p = new Page();
p.id = id;
db.Pages.Add(p);
}
p.lastSeen = DateTime.Now;
db.SaveChanges();
}
I crawl a bunch of pages (1000s), and then call AddOrUpdatePage in a loop for each page.
It seems like the way to get more speed is batching? What is the best way to get 1000 records from my database at a time, given a set of page ids? In SQL I would use a table variable for this and join, or use a lengthy IN clause.

Related

JPQL can not prevent cache for JPQL query

Two (JSF + JPA + EclipseLink + MySQL) applications share the same database. One application runs a scheduled task where the other one creates tasks for schedules. The tasks created by the first application is collected by queries in the second one without any issue. The second application updates fields in the task, but the changes done by the second application is not refreshed when queried by JPQL.
I have added QueryHints.CACHE_USAGE as CacheUsage.DoNotCheckCache, still, the latest updates are not reflected in the query results.
The code is given below.
How can I get the latest updates done to the database from a JPQL query?
public List<T> findByJpql(String jpql, Map<String, Object> parameters, boolean withoutCache) {
TypedQuery<T> qry = getEntityManager().createQuery(jpql, entityClass);
Set s = parameters.entrySet();
Iterator it = s.iterator();
while (it.hasNext()) {
Map.Entry m = (Map.Entry) it.next();
String pPara = (String) m.getKey();
if (m.getValue() instanceof Date) {
Date pVal = (Date) m.getValue();
qry.setParameter(pPara, pVal, TemporalType.DATE);
} else {
Object pVal = (Object) m.getValue();
qry.setParameter(pPara, pVal);
}
}
if(withoutCache){
qry.setHint(QueryHints.CACHE_USAGE, CacheUsage.DoNotCheckCache);
}
return qry.getResultList();
}

The CacheUsage settings affect what EclipseLink can query using what is in memory, but not what happens after it goes to the database for results.
It seems you don't want to out right avoid the cache, but refresh it I assume so the latest changes can be visible. This is a very common situation when multiple apps and levels of caching are involved, so there are many different solutions you might want to look into such as manual invalidation or even if both apps are JPA based, cache coordination (so one app can send an invalidation even to the other). Or you can control this on specific queries with the "eclipselink.refresh" query hint, which will force the query to reload the data within the cached object with what is returned from the database. Please take care with it, as if used in a local EntityManager, any modified entities that would be returned by the query will also be refreshed and changes lost
References for caching:
https://wiki.eclipse.org/EclipseLink/UserGuide/JPA/Basic_JPA_Development/Caching
https://www.eclipse.org/eclipselink/documentation/2.6/concepts/cache010.htm

Make the Entity not to depend on cache by adding the following lines.
#Cache(
type=CacheType.NONE, // Cache nothing
expiry=0,
alwaysRefresh=true
)

ExecuteStoreCommand "Delete" returning different record count to "DeleteObject" post delete, why?

Got a strange one here. I am using EF 6 over SQL Server 2012 and C#.
If I delete a record, using DeleteObject, I get:
//order.orderitem count = 11
db.OrderItem.DeleteObject(orderitem);
db.SaveChanges();
var order = db.order.First(r => r.Id == order.id);
//Order.OrderItem count = 10, CORRECT
If I delete an Order Item, using ExecuteStoreCmd inline DML, I get:
//order.orderitem count = 11
db.ExecuteStoreCommand("DELETE FROM ORDERITEM WHERE ID ={0}", orderitem.Id);
var order = db.Order.First(r => r.Id == order.id);
//order.orderitem count = 11, INCORRECT, should be 10
So the ExecuteStoreCommand version reports 11, however the OrderItem is definitely deleted from the DB, so it should report 10. Also I would have thought First() does an Eager search thus repopulating the "order.orderitem" collection.
Any ideas why this is happening? Thanks.
EDIT: I am using ObjectContext
EDIT2: This is the closest working solution I have using "detach". Interestingly the "detach" actually takes about 2 secs ! Not sure what it is doing, but it works.
db.ExecuteStoreCommand("DELETE FROM ORDERITEM WHERE ID ={0}", orderitem.Id);
db.detach(orderitem);
It would be quicker to requery and repopulate the dataset. How can I force a requery? I thought the following would do it:
var order = db.order.First(r => r.Id == order.id);
EDIT3: This seems to work to force a refresh post delete, but still take about 2 secs:
db.Refresh(RefreshMode.StoreWins,Order.OrderItem);
I am still not really understanding why one cannot just requery as a Order.First(r=>r.id==id) type query oftens take much less than 2 secs.

This would likely be because the Order and it's order items are already known to the context when you perform the ExecuteStoredCommand. EF doesn't know that the command relates to any cached copy of Order, so the command will be sent to the database, but not update any loaded entity state. WHere-as the first one would look for any loaded OrderItem, and when told to remove it from the DbSet, it would look for any loaded entities that reference that order item.
If you don't want to ensure the entity(ies) are loaded prior to deleting, then you will need to check if any are loaded and refresh or detach their associated references.
If orderitem represents an entity should just be able to use:
db.OrderItems.Remove(orderitem);
If the order is loaded, the order item should be removed automatically. If the order isn't loaded, no loss, it will be loaded from the database when requested later on and load the set of order items from the DB.
However, if you want to use the SQL execute approach, detaching any local instance should remove it from the local cache.
db.ExecuteStoreCommand("DELETE FROM ORDERITEM WHERE ID ={0}", orderitem.Id);
var existingOrderItem = db.OrderItems.Local.SingleOrDefault(x => x.Id == orderItem.Id);
if(existingOrderItem != null)
db.Entity(existingOrderItem).State = EntityState.Detached;
I don't believe you will need to check for the orderItem's Order to refresh anything beyond this, but I'm not 100% sure on that. Generally though when it comes to modifying data state I opt to load the applicable top-level entity and remove it's child.
So if I had a command to remove an order item from an order:
public void RemoveOrderItem(int orderId, int orderItemId)
{
using (var context = new MyDbContext())
{
// TODO: Validate that the current user session has access to this order ID
var order = context.Orders.Include(x => x.OrderItems).Single(x => x.OrderId == orderId);
var orderItem = order.OrderItems.SingleOrDefault(x => x.OrderItemId == orderItemId);
if (orderItem != null)
order.OrderItems.Remove(orderItem);
context.SaveChanges();
}
}
The key points to this approach.
While it does mean loading the data state again for the operation, this load is by ID so it's fast.
We can/should validate that the data requested is applicable for the user. Any command for an order they should not access should be logged and the session ended.
We know we will be dealing with the current data state, not basing decisions on values/data from the point in time that data was first read.

Long Running Query in Entity Framework with multiple table joins

I have a query that joins about 10 tables some that are self referencing tables. I use an "IN" statement for the conditional on the ID column (indexed) of the top most table.
var aryOrderId = DetermineOrdersToGet(); //Logic to determine what orderids to get
var result = dbContext.Orders.Where(o=>aryOrderId.Contains(o.id)
.Include(o=>o.Customer)
.Include(o=>o.Items.Select(oi=>oi.ItemAttributes))
.Include(o=>o.Items.Select(oi=>oi.Dimensions))
.Include(o=>o.CustomOptions.Select(oc => oc.CustomOptions1))
.....A Bunch more.....
.ToList();
I would like to figure out a way to speed this up without redesigning my tables and flattening out the structure. Currently 50-200 records take 10-20 seconds.
This data can be read only. I don't need to update these records.
Can I convert this to a stored procedure?
How hard is this to do?
Will I be able to get noticeable performance gains?

One of the slower parts of the database query is the transport of the selected data from the DBMS to your local process. Hence it is wise to select only the properties you actually plan to use.
For example, it seems that an Order has zero or more ItemAttributes. Evey ItemAttribute belongs to exactly one Order, using a foreign key OrderId.
If you fetch all Orders with Id in ArryOrderId, each order with its thousand ItemAttributes, you know that every ItemAttribute will have a foreign key OrderId with the same value as the Id of the Order that it belongs to. It is a waste to send 1000 times the same value.
When querying data using entity framework, always use Select. Select only the properties yo actually plan to use. Only use Include if you intend to change the fetched objects.
var result = dbContext.Orders
.Where(order=>aryOrderId.Contains(order.id)
.Select(order => new
{ // select only the properties you plan to use:
Id = order.Id,
...
Customer = order.Customer.Select(customer => new
{ // again: only the properties you plan to use
Id = order.Customer.Id,
Name = order.Customer.Name,
...
},
ItemAttributes = order.ItemAttributes.Select(itemAttribute => new
{
...
})
.ToList(),
Dimensions = order.Dimensions.Select(dimension => new
{
...
})
.ToList(),
....A Bunch more.....
})
.ToList();
If after selecting only the properties that you actually plan to use, the query still takes too long, think again: do I really need all these properties.
Another solution to limit the execution time is fetching the date 'per page', using Skip / Take. The danger is of course that when you are viewing page 10, the data of page 1 might be changed in a way that page 10 should be interpreted differently.

As jtate mentions, if you don't need everything from the joined tables, don't include them. Instead, utilize .Select() to retrieve just the data you want from the entity and it's associated relationships.
I.e.
var query = dbContext.Orders
.Where(x => aryOrderId.Contains(x => x.OrderId))
.Select(x => new
{
x.OrderId,
x.OrderNumber,
OrderItems = x.Items.Select(i => new
{
i.ItemId,
Attributes = i.Attributes.Select(a => a.AttributeName).ToList(),
Dimensions = i.Dimensions.Select(d => new {d.DimensionId, d.Name}).ToList(),
}).ToList(),
// ...
}).ToList();
You can structure the query, or queries however you like to find an optimal result.
Alternatively you can consider utilizing a view on the database and binding an entity to the view. This option works well for read-only views of data. Provided you include the relevant IDs you can always retrieve the applicable "real" entities at any time to load a details page or perform an action/update against the entity.

Answering your 3 questions. Yes, you can use a stored procedure and that's what I would do in this situation. It is not hard at all; EF makes it quite simple. You can either have it return a new complex type or you can map it to an entity. Since you're saying the data can be readonly, you probably are okay with a basic function import returning a complex type (EF's default behavior). Either way, you will have noticeable performance gains.
For db-first, see http://www.entityframeworktutorial.net/stored-procedure-in-entity-framework.aspx
Basically, you'll follow these steps.
Create the stored procedure on your database
Update the model from the database. When it asks which objects to include, you should be able to select your stored procedure.
Click Finish. EF will generate a complex type that has all the properties returned by your stored procedure, and it will add a signature to your context for executing the stored procedure, so it can be called like this: var results = myContext.myProcedure(param1, param2); There are screenshots of this at the link above.
You can also go in and modify the model to customize the details, such as the name of the complex type and the name of the function (by default the function will match the name of the SP and will return an ObjectResult<T> where T is your complex type, which will be the name of the procedure with "_Result" as a suffix).

Is it possible to improve EF6 WarmUp time?

I have an application in which I verify the following behavior: the first requests after a long period of inactivity take a long time, and timeout sometimes.
Is it possible to control how the entity framework manages dispose of the objects? Is it possible mark some Entities to never be disposed?
...in order to avoid/improve the warmup time?
Regards,

The reasons that similar queries will have an improved response time are manifold.
Most Database Management Systems cache parts of the fetched data, so that similar queries in the near future will be faster. If you do query Teachers with their Students, then the Teachers table will be joined with the Students table. This join result is quite often cached for a while. The next query for Teachers with their Students will reuse this join result and thus become faster
DbContext caches queried object. If you select a Single teacher, or Find one, it is kept in local memory. This is to be able to detect which items are changed when you call SaveChanges. If you Find the same Teacher again, this query will be faster. I'm not sure if the same happens if you query 1000 Teachers.
When you create a DbContext object, the initializer is checked to see if the model has been changed or not.
So it might seem wise not to Dispose() a created DbContext, yet you see that most people keep the DbContext alive for a fairly short time:
using (var dbContext = new MyDbContext(...))
{
var fetchedTeacher = dbContext.Teachers
.Where(teacher => teacher.Id = ...)
.Select(teacher => new
{
Id = teacher.Id,
Name = teacher.Name,
Students = teacher.Students.ToList(),
})
.FirstOrDefault();
return fetchedTeacher;
}
// DbContext is Disposed()
At first glance it would seem that it would be better to keep the DbContext alive. If someone asks for the same Teacher, the DbContext wouldn't have to ask the database for it, it could return the local Teacher..
However, keeping a DbContext alive might cause that you get the wrong data. If someone else changes the Teacher between your first and second query for this Teacher, you would get the old Teacher data.
Hence it is wise to keep the life time of a DbContext as short as possible.
Is there nothing I can do to improve the speed of the first query?
Yes you can!
One of the first things you could do is to set the initialize of your database such that it doesn't check the existence and model of the database. Of course you can only do this when you are fairly sure that your database exists and hasn't changed.
// constructor; disables initializer
public SchoolDBContext() : base(...)
{
//Disable initializer
Database.SetInitializer<SchoolDBContext>(null);
}
Another thing could be, if you already have fetched your object to update the database, and you are sure that no one else changed the object, you can Attach it, instead of fetching it again, as is shown in this question
Normal usage:
// update the name of the teacher with teacherId
void ChangeTeacherName(int teacherId, string name)
{
using (var dbContext = new SchoolContext(...))
{
// fetch the teacher, change the name and save
Teacher fetchedTeacher = dbContext.Teachers.Find(teacherId);
fetchedTeader.Name = name;
dbContext.SaveChanges();
}
}
Using Attach to update an earlier fetched Teacher:
void ChangeTeacherName (Teacher teacher, string name)
{
using (var dbContext = new SchoolContext(...))
{
dbContext.Teachers.Attach(teacher);
dbContext.Entry(teacher).Property(t => t.Name).IsModified = true;
dbContext.SaveChanges();
}
}
Using this method doesn't require to fetch the Teacher again. During SaveChanges the value of IsModified of all properties of all Attached items is checked. If needed they will be updated.

Understanding many to many relationships and Entity Framework

I'm trying to understand the Entity Framework, and I have a table "Users" and a table "Pages". These are related in a many-to-many relationship with a junction table "UserPages". First of all I'd like to know if I'm designing this relationship correctly using many-to-many: One user can visit multiple pages, and each page can be visited by multiple users..., so am I right in using many2many?
Secondly, and more importantly, as I have understood m2m relationships, the User and Page tables should not repeat information. I.e. there should be only one record for each user and each page. But then in the entity framework, how am I able to add new visits to the same page for the same user? That is, I was thinking I could simply use the Count() method on the IEnumerable returned by a LINQ query to get the number of times a user has visited a certain page.
But I see no way of doing that. In Linq to Sql I could access the junction table and add records there to reflect added visits to a certain page by a certain user, as many times as necessary. But in the EF I can't access the junction table. I can only go from User to a Pages collection and vice versa.
I'm sure I'm misunderstanding relationships or something, but I just can't figure out how to model this. I could always have a Count column in the Page table, but as far as I have understood you're not supposed to design database tables like that, those values should be collected by queries...
Please help me understand what I'm doing wrong...

You are doing it right.
In the Entity Data Model (EDM) Many-To-Many relationships can be represented with or without a join table, depending on whether it contains some additional fields. See the article below for more details.
In your case, the User entity will directly reference a collection of Page entities and vice versa, since your model doesn't include a mapping for the User_Page join table.
In order to add visits to a certain page on a user you could for example do something like:
using (var context = new YourEntityModelObjectContext())
{
var page = context.Pages.FirstOrDefault(p => p.Url == "http://someurl");
var user = context.Users.FirstOrDefault(u => u.Username == "someuser");
user.Pages.Add(page);
context.SaveChanges();
}
Or you could do it from the other side of the relation:
using (var context = new YourEntityModelObjectContext())
{
var page = context.Pages.FirstOrDefault(p => p.Url == "http://someurl");
var user = context.Users.FirstOrDefault(u => u.Username == "someuser");
page.Users.Add(user);
context.SaveChanges();
}
In both cases a new record will be added to the User_Page join table.
If you need to retrieve the number of pages visited by a particular user you could simply do:
using (var context = new YourEntityModelObjectContext())
{
var user = context.Users.FirstOrDefault(u => u.Username == "someuser");
var visitCount = user.Pages.Count;
}
Related resources:
Many to Many Relationships in the Entity Model