EF Core Bulk Delete on PostgreSQL

I’m trying to do a potentially large-scale delete operation on a single table (think 100,000 rows on a 1M-row table).
I’m using PostgreSQL and EntityFrameworkCore.
Details: The application code has a predicate to match and knows nothing about how many rows potentially match the predicate. It could be 0 rows or a very large number.
Research indicates EF Core is incapable of handling this efficiently; i.e., the following code produces a DELETE statement for each row!
using (var db = new MyDbContext())
{
    var queryable = db.Table.AsQueryable()
        .Where(o => o.ForeignKey == fKey)
        .Where(o => o.OtherColumn == false);
    db.Table.RemoveRange(queryable);
    await db.SaveChangesAsync();
}
So here is the SQL I would prefer to run in a sort of batched operation:
delete from Table
where ForeignKey = 1234
  and OtherColumn = false
  and PK in (
      select PK
      from Table
      where ForeignKey = 1234
        and OtherColumn = false
      limit 500
  )
There are extension libraries out there, but I’ve yet to find an actively maintained one that supports Postgres. I’m currently executing the raw SQL above through EF Core.
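For illustration, here is a minimal sketch of running that batched SQL through EF Core, looping until no rows remain (the loop, quoting, and NpgsqlParameter wiring are assumptions for illustration, not the exact application code):
int affected;
do
{
    affected = await db.Database.ExecuteSqlRawAsync(
        @"delete from ""Table""
          where ""PK"" in (
              select ""PK"" from ""Table""
              where ""ForeignKey"" = @fKey
                and ""OtherColumn"" = false
              limit 500)",
        new NpgsqlParameter("fKey", fKey)); // returns the number of rows deleted
} while (affected > 0);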
This leads to a couple of questions:
Is there any way to get EF Core to delete these rows more efficiently on Postgres using LINQ, etc.?
(It seems to me that handing the context a queryable should give it everything it needs to make the proper decision here.)
If not, what are your opinions on deleting in batches vs. handing the DB just the predicate?

I think you are trying to do something you should not use EntityFrameworkCore for. The point of EntityFrameworkCore is to provide a convenient way to move data between a .NET Core application and a database. The typical use case is a single object or a small number of objects. For bulk operations there are some NuGet packages. There is this package for inserting and updating with Postgres. This article by the creator explains how it uses temporary tables and the Postgres COPY command to do bulk operations. It shows us a way to delete rows in bulk by id:
var toDelete = GetIdsToDelete();
using (var conn = new NpgsqlConnection(connectionString))
{
    conn.Open();
    // ON COMMIT DROP needs an explicit transaction; in autocommit mode the
    // temp table would be dropped as soon as the CREATE statement completes.
    using (var tx = conn.BeginTransaction())
    {
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText = "CREATE TEMP TABLE temp_ids_to_delete (id int NOT NULL) ON COMMIT DROP";
            cmd.ExecuteNonQuery();
        }
        using (var writer = conn.BeginBinaryImport("COPY temp_ids_to_delete (id) FROM STDIN (FORMAT BINARY)"))
        {
            foreach (var id in toDelete)
            {
                writer.StartRow();
                writer.Write(id);
            }
            writer.Complete();
        }
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText = "DELETE FROM myTable WHERE id IN (SELECT id FROM temp_ids_to_delete)";
            cmd.ExecuteNonQuery();
        }
        tx.Commit();
    }
}
With some small changes this can be generalized; see the sketch below.
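For instance, a generalized helper might look like this (a sketch assuming an already-open NpgsqlConnection; note that table and column identifiers cannot be parameterized, so they must come from trusted code, never from user input):
using System.Collections.Generic;
using Npgsql;

static void BulkDeleteByIds(NpgsqlConnection conn, string table, string idColumn, IEnumerable<int> ids)
{
    // Stage the ids in a temp table via the binary COPY protocol...
    using (var cmd = new NpgsqlCommand("CREATE TEMP TABLE temp_ids (id int NOT NULL)", conn))
        cmd.ExecuteNonQuery();

    using (var writer = conn.BeginBinaryImport("COPY temp_ids (id) FROM STDIN (FORMAT BINARY)"))
    {
        foreach (var id in ids)
        {
            writer.StartRow();
            writer.Write(id);
        }
        writer.Complete();
    }

    // ...then delete the matching rows using the staged ids.
    using (var cmd = new NpgsqlCommand($"DELETE FROM {table} WHERE {idColumn} IN (SELECT id FROM temp_ids)", conn))
        cmd.ExecuteNonQuery();

    using (var cmd = new NpgsqlCommand("DROP TABLE temp_ids", conn))
        cmd.ExecuteNonQuery();
}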
But you want to do something different. You don’t want to move data or information between the application and the database; you want to use EF Core to create a SQL statement on the fly and run it on the server. The problem is that EF Core is not really built to do that, but maybe there are ways around it. One way I can think of is to use EF Core to build a query, get the query string, and then embed that string in another SQL string to run on the server.
Getting the query string is currently not easy, but apparently it will be with EF Core 5.0. Then you could do this:
var queryable = db.Table.AsQueryable()
    .Where(o => o.ForeignKey == fKey)
    .Where(o => o.OtherColumn == false);
var queryString = queryable.ToQueryString();
db.Database.ExecuteSqlRaw("delete from Table where PK in (" + queryString + ")");
And yes, that is terribly hacky and I would not recommend it. I would recommend writing procedures and functions on the database server, because this is not something EF Core should be used for. You can still run those functions from EF Core and pass parameters, as sketched below.
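For example (a sketch; delete_matching is a hypothetical function you would first create on the PostgreSQL server):
// delete_matching is assumed to exist on the server, e.g.:
//   CREATE FUNCTION delete_matching(fkey int, batch_size int) RETURNS int ...
await db.Database.ExecuteSqlRawAsync(
    "SELECT delete_matching(@fkey, @batchSize)",
    new NpgsqlParameter("fkey", fKey),
    new NpgsqlParameter("batchSize", 500));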

I would suggest using temp tables to do an operation like this. You would create a mirror temp table, bulk-add the records to keep or delete into the temp table, and then execute a delete operation that looks for records in (or not in) that temp table. Try using a library such as PgPartner to accomplish bulk additions and temp table creation very easily.
Check out PgPartner: https://www.nuget.org/packages/PgPartner/
https://github.com/SourceKor/PgPartner

Disclaimer: I'm the owner of the project Entity Framework Plus
Your scenario looks to be something that our Batch Delete feature could handle: https://entityframework-plus.net/batch-delete
using (var db = new MyDbContext())
{
    var queryable = db.Table.AsQueryable()
        .Where(o => o.ForeignKey == fKey)
        .Where(o => o.OtherColumn == false);
    queryable.Delete();
}
Entities are not loaded into the application, and only a single SQL statement is executed, as you specified.
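If you want the 500-row batching from the question, the library also documents a batch size option; a sketch (verify the exact option name against the EF+ documentation):
// Sketch: batch delete with an explicit batch size, per the EF+ docs.
queryable.Delete(x => x.BatchSize = 500);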

Related

Long Running Query in Entity Framework with multiple table joins

I have a query that joins about 10 tables, some of which are self-referencing. I use an "IN" statement for the condition on the ID column (indexed) of the topmost table.
var aryOrderId = DetermineOrdersToGet(); // logic to determine which order ids to get
var result = dbContext.Orders.Where(o => aryOrderId.Contains(o.id))
    .Include(o => o.Customer)
    .Include(o => o.Items.Select(oi => oi.ItemAttributes))
    .Include(o => o.Items.Select(oi => oi.Dimensions))
    .Include(o => o.CustomOptions.Select(oc => oc.CustomOptions1))
    // .....a bunch more.....
    .ToList();
I would like to figure out a way to speed this up without redesigning my tables and flattening the structure. Currently 50-200 records take 10-20 seconds.
This data can be read only. I don't need to update these records.
Can I convert this to a stored procedure?
How hard is this to do?
Will I be able to get noticeable performance gains?
One of the slower parts of the database query is the transport of the selected data from the DBMS to your local process. Hence it is wise to select only the properties you actually plan to use.
For example, it seems that an Order has zero or more ItemAttributes, and every ItemAttribute belongs to exactly one Order via a foreign key OrderId.
If you fetch all Orders with an Id in aryOrderId, each order with its thousand ItemAttributes, you know that every ItemAttribute will have a foreign key OrderId with the same value as the Id of the Order it belongs to. It is a waste to send the same value 1000 times.
When querying data using Entity Framework, always use Select, and select only the properties you actually plan to use. Only use Include if you intend to change the fetched objects.
var result = dbContext.Orders
    .Where(order => aryOrderId.Contains(order.Id))
    .Select(order => new
    {   // select only the properties you plan to use:
        Id = order.Id,
        ...
        Customer = new
        {   // again: only the properties you plan to use
            Id = order.Customer.Id,
            Name = order.Customer.Name,
            ...
        },
        ItemAttributes = order.ItemAttributes.Select(itemAttribute => new
        {
            ...
        })
        .ToList(),
        Dimensions = order.Dimensions.Select(dimension => new
        {
            ...
        })
        .ToList(),
        // ....a bunch more.....
    })
    .ToList();
If, after selecting only the properties that you actually plan to use, the query still takes too long, think again: do I really need all these properties?
Another solution to limit the execution time is fetching the data 'per page', using Skip / Take, as sketched below. The danger is of course that when you are viewing page 10, the data of page 1 might have changed in a way that means page 10 should be interpreted differently.
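A sketch of that paging approach (the page size, ordering key, and pageIndex variable are assumptions):
// Fetch one page at a time; a stable OrderBy is required, otherwise
// Skip/Take can return overlapping or missing rows between pages.
int pageSize = 50;
var page = dbContext.Orders
    .Where(order => aryOrderId.Contains(order.Id))
    .OrderBy(order => order.Id)
    .Skip(pageIndex * pageSize)
    .Take(pageSize)
    .Select(order => new { order.Id /* , ...only what you use */ })
    .ToList();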
As jtate mentions, if you don't need everything from the joined tables, don't include them. Instead, utilize .Select() to retrieve just the data you want from the entity and its associated relationships.
I.e.
var query = dbContext.Orders
.Where(x => aryOrderId.Contains(x.OrderId))
.Select(x => new
{
x.OrderId,
x.OrderNumber,
OrderItems = x.Items.Select(i => new
{
i.ItemId,
Attributes = i.Attributes.Select(a => a.AttributeName).ToList(),
Dimensions = i.Dimensions.Select(d => new {d.DimensionId, d.Name}).ToList(),
}).ToList(),
// ...
}).ToList();
You can structure the query, or queries however you like to find an optimal result.
Alternatively you can consider utilizing a view on the database and binding an entity to the view, as sketched below. This option works well for read-only views of data. Provided you include the relevant IDs, you can always retrieve the applicable "real" entities at any time to load a details page or perform an action/update against the entity.
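A sketch of the view approach in EF6 code-first (the view name vw_OrderSummary and its columns are assumptions):
using System.Data.Entity;

// A read-only entity bound to a database view; EF queries a view exactly
// like a table, as long as you never try to SaveChanges through it.
public class OrderSummary
{
    public int OrderId { get; set; }
    public string OrderNumber { get; set; }
    public string CustomerName { get; set; }
}

public class ReportingContext : DbContext
{
    public DbSet<OrderSummary> OrderSummaries { get; set; }

    protected override void OnModelCreating(DbModelBuilder modelBuilder)
    {
        modelBuilder.Entity<OrderSummary>()
            .HasKey(v => v.OrderId)
            .ToTable("vw_OrderSummary");
    }
}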
Answering your 3 questions: yes, you can use a stored procedure, and that's what I would do in this situation. It is not hard at all; EF makes it quite simple. You can either have it return a new complex type or map it to an entity. Since you say the data can be read-only, you are probably okay with a basic function import returning a complex type (EF's default behavior). Either way, you will see noticeable performance gains.
For db-first, see http://www.entityframeworktutorial.net/stored-procedure-in-entity-framework.aspx
Basically, you'll follow these steps.
Create the stored procedure on your database
Update the model from the database. When it asks which objects to include, you should be able to select your stored procedure.
Click Finish. EF will generate a complex type that has all the properties returned by your stored procedure, and it will add a signature to your context for executing the stored procedure, so it can be called like this: var results = myContext.myProcedure(param1, param2); There are screenshots of this at the link above.
You can also go in and modify the model to customize the details, such as the name of the complex type and the name of the function (by default the function will match the name of the SP and will return an ObjectResult<T> where T is your complex type, which will be the name of the procedure with "_Result" as a suffix).
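A sketch of consuming the generated function import (the names are assumptions; "_Result" is the default suffix mentioned above):
// The generated signature returns ObjectResult<MyProcedure_Result>.
using (var context = new MyEntities())
{
    var results = context.myProcedure(param1, param2).ToList();
    foreach (var row in results)
    {
        Console.WriteLine(row.SomeColumn); // properties mirror the SP's result set
    }
}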

Bulk Insert/Update with EF6?

I’m looking for a way to insert or update about 155,000 records using EF6. It has become obvious that EF6 out of the box is going to take way too long to look up a record, decide whether it’s an insert or an update, create or update an object, and then commit it to the database.
Looking around I’ve seen third-party libraries like EntityFramework.Extended, but it looks like they are designed to do mass updates like “UPDATE Table WHERE Field = value”, which doesn’t quite fit what I’m looking to do.
In my case I read in an XML doc, create a list of objects from that document, and then use EF to either insert or update rows in a table. Would I be better off going back to plain ADO.Net and using bulk inserts that way?
BTW: this is using an Oracle database, not SQL Server.
You may use the EntityFramework.BulkInsert-ef6 package:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using EntityFramework.BulkInsert.Extensions;

class Program
{
    static void Main(string[] args)
    {
        var data = new List<Demo>();
        for (int i = 0; i < 1000000; i++)
        {
            data.Add(new Demo { InsertDate = DateTime.Now, Key = Guid.NewGuid(), Name = "Example " + i });
        }
        Stopwatch sw = Stopwatch.StartNew(); // StartNew() also starts the stopwatch
        using (Model1 model = new Model1())
        {
            model.BulkInsert(data);
        }
        sw.Stop();
        Console.WriteLine($"Elapsed time for {data.Count} rows: {sw.Elapsed}");
        Console.ReadKey();
    }
}
Running this on my local HDD gives:
Elapsed time for 1000000 rows: 00:00:24.9646688
P.S. The package provider claims that this version of the bulk package is outdated. Anyhow, it has fit my needs for years now, and the package proposed by the author is no longer free of charge.
If you are looking for a "free" way to do it, I recommend going back to ADO.NET and using array binding, which is what I do under the hood in my library; see the sketch below.
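A sketch of ODP.NET array binding (the table, columns, and parallel arrays are assumptions; each parameter's Value is an array, and ArrayBindCount tells the driver how many rows to send in one round trip):
using Oracle.ManagedDataAccess.Client;

int[] ids = { 1, 2, 3 };
string[] names = { "a", "b", "c" };

using (var conn = new OracleConnection(connectionString))
using (var cmd = conn.CreateCommand())
{
    conn.Open();
    cmd.CommandText = "INSERT INTO demo (id, name) VALUES (:id, :name)";
    cmd.ArrayBindCount = ids.Length; // number of rows bound per execute
    cmd.Parameters.Add(new OracleParameter("id", OracleDbType.Int32) { Value = ids });
    cmd.Parameters.Add(new OracleParameter("name", OracleDbType.Varchar2) { Value = names });
    cmd.ExecuteNonQuery();
}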
Disclaimer: I'm the owner of Entity Framework Extensions
This library supports all major providers, including Oracle:
Oracle DevArt
Oracle DataAccess
Oracle DataAccessManaged
This library allows you to perform all bulk operations you need for your scenarios:
Bulk SaveChanges
Bulk Insert
Bulk Delete
Bulk Update
Bulk Merge
Example
// Easy to use
context.BulkSaveChanges();
// Easy to customize
context.BulkSaveChanges(bulk => bulk.BatchSize = 100);
// Perform Bulk Operations
context.BulkDelete(customers);
context.BulkInsert(customers);
context.BulkUpdate(customers);
// Customize Primary Key
context.BulkMerge(customers, operation => {
    operation.ColumnPrimaryKeyExpression = customer => customer.Code;
});
This library will save you a ton of time without your having to write any ADO.NET!

How to Create Custom Stored Procedures Using Code First Fluent API

There are many articles that show how to create insert, update, and delete procedures using Code First, like this one.
What about a custom procedure, like this simple select statement:
Select * from Customers
I could make changes to the Up and Down migration methods, but is there a way to create custom procs using the Fluent API directly?
Here is a solution which I don't recommend, but if you want to use DbContext.Database, here it is:
using (var db = new MyDbContext(connectionString))
{
    db.Database.ExecuteSqlCommand("CREATE PROCEDURE MyProcedure ... END;");
    var command = "EXEC MyProcedure;";
    IEnumerable<Customer> customers = db.Database.SqlQuery<Customer>(command);
}
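For completeness, the migration-based route the question mentions looks like this (a sketch; the procedure name and body are assumptions):
using System.Data.Entity.Migrations;

public partial class AddGetAllCustomersProc : DbMigration
{
    public override void Up()
    {
        Sql("CREATE PROCEDURE GetAllCustomers AS BEGIN SELECT * FROM Customers END");
    }

    public override void Down()
    {
        Sql("DROP PROCEDURE GetAllCustomers");
    }
}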

Entity Framework 6: is there a way to iterate through a table without holding each row in memory

I would like to be able to iterate through every row in an entity table without holding every row in memory. This is a read-only operation and every row can be discarded after being processed.
If there is a way to discard the row after processing, that would be fine. I know that this can be achieved using a DataReader (which is outside the scope of EF), but can it be achieved within EF?
Or is there a way to obtain a DataReader from within EF without directly using SQL?
More detailed example:
Using EF I can code:
foreach (Quote quote in context.Quotes)
    sw.WriteLine(quote.QuoteId.ToString() + "," + quote.Quotation);
but to achieve the same result with a DataReader I need to code:
// get the connection to the database
SqlConnection connection = context.Database.Connection as SqlConnection;
// open a new connection to the database
connection.Open();
// get a DataReader for our table
SqlCommand command = new SqlCommand(context.Quotes.ToString(), connection);
SqlDataReader dr = command.ExecuteReader();
// get a recipient for our database fields
object[] L = new object[dr.FieldCount];
while (dr.Read())
{
dr.GetValues(L);
sw.WriteLine(((int)L[0]).ToString() + "," + (string)L[1]);
}
The difference is that the former runs out of memory (because it pulls the entire table into client memory) while the latter runs to completion (and is much faster), because it only retains a single row in memory at any one time.
But equally importantly, the latter example loses the strong typing of EF, and should the database change, errors can be introduced.
Hence, my question: can we get a similar result with strongly typed rows coming back in EF?
Based on your last comment, I'm still confused. Take a look at both pieces of code below.
EF
using (var ctx = new AppContext())
{
foreach (var order in ctx.Orders)
{
Console.WriteLine(order.Date);
}
}
Data Reader
var constr = ConfigurationManager.ConnectionStrings["AppContext"].ConnectionString;
using (var con = new SqlConnection(constr))
{
con.Open();
var cmd = new SqlCommand("select * from dbo.Orders", con);
var reader = cmd.ExecuteReader();
while (reader.Read())
{
Console.WriteLine(reader["Date"]);
}
}
Even though EF issues a few extra initial queries, both of them execute a similar query, as can be seen in a profiler.
I haven't tested it, but try foreach (Quote quote in context.Quotes.AsNoTracking()) {...}. .AsNoTracking() should not put entities in the cache, so I assume they will be collected by the GC once they go out of scope.
Another option is to set context.Entry(quote).State = EntityState.Detached; in the foreach loop. This should have a similar effect to option 1.
A third option (which should definitely work, but requires more coding) would be to implement batch processing (select the top N entities, process them, select the next top N), as sketched below. In this case make sure that you dispose of the context and create a new one every iteration (so the GC can eat it :)) and use a proper OrderBy() in the query.
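A sketch of that third option, using keyset paging and a fresh context per batch (the batch size and entity/property names are assumptions; sw is the StreamWriter from the question):
const int batchSize = 1000;
int lastId = 0;
while (true)
{
    using (var context = new AppContext())
    {
        var batch = context.Quotes
            .AsNoTracking()
            .Where(q => q.QuoteId > lastId)
            .OrderBy(q => q.QuoteId) // ordering by the key keeps paging stable
            .Take(batchSize)
            .ToList();

        if (batch.Count == 0)
            break;

        foreach (var quote in batch)
            sw.WriteLine(quote.QuoteId + "," + quote.Quotation);

        lastId = batch[batch.Count - 1].QuoteId;
    }
}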
You need to use an EntityDataReader, which behaves in a way similar to a traditional ADO.NET DataReader.
The problem is that, to do so, you need to use ObjectContext instead of DbContext, which makes things harder.
See this SO answer (not the accepted one): How can I return a datareader when using Entity Framework 4?
Even though this refers to EF4, in EF6 things work the same way. Usually an ORM is not intended for streaming data; that's why this functionality is so well hidden.
You can also look at this project: Entity Framework (Linq to Entities) to IDataReader Adapter
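A sketch of dropping down to the ObjectContext from an existing DbContext in EF6 (ExecuteStoreQuery with this overload never tracks the results, and the ObjectResult<T> it returns materializes rows one at a time as you enumerate):
using System.Data.Entity.Infrastructure;

var objectContext = ((IObjectContextAdapter)context).ObjectContext;

var quotes = objectContext.ExecuteStoreQuery<Quote>(
    "SELECT QuoteId, Quotation FROM Quotes");

foreach (var quote in quotes)
    sw.WriteLine(quote.QuoteId + "," + quote.Quotation);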
I have done this with paging, cleaning the context after each page load.
Sample:
Load first 50 rows
Iterate over them
Clean the Context or create a new one.
Load second 50 rows
...
Clean the Context = Set all its Entries as Detached.

Update statement with Entity Framework

Simple question: is it possible to achieve this query with Entity Framework when updating one entity?
update test set value = value + 1 where id = 10
Use the Batch Update feature of the Entity Framework Extended Library, like this:
dbContext.Tests.Update(t => t.Id == 10, t => new Test() { Value = t.Value + 1 });
Not really, not in this form.
You will have to select all entities that match your criteria, iterate over them, and update them.
If you are looking for something that will do it right in the DB, because your set could be huge, you will have to use SQL directly. (I don't remember whether EF has a way to execute UPDATE queries directly the way LINQ to SQL does.)
It should be; it will just be a little more constrained in general.
var myEntity = context.Tests.First(item => item.Id == 10);
myEntity.Value += 1;
context.SaveChanges();
It should produce similar SQL; you can watch the profiler to see what SQL is actually being generated, but it should be very similar to your statement. One easy way to see it is sketched below.
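If you are on EF6, the built-in Database.Log hook shows that SQL without a profiler (a sketch; the context and entity names follow the example above):
using (var context = new AppContext())
{
    context.Database.Log = Console.WriteLine; // writes each generated command to the console

    var myEntity = context.Tests.First(item => item.Id == 10);
    myEntity.Value += 1;
    context.SaveChanges(); // logs the SELECT and then the UPDATE
}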