What Methodology To Implement For Insert Statements In Entity Framework - entity-framework

When inserting a lot of interrelated nested objects, is it best practice to first create and persist the inner entities, so as to respect the foreign key relationships, and then move up the hierarchy, or is it better to create all the objects with their inner objects and persist only the outer object? So far, I have found that when doing the latter, Entity Framework seems to figure out intelligently what to insert first with regard to the relationships. But is there a caveat that one should be aware of? The first method seems to me to be classical SQL logic, whereas the latter seems to suit the idea of Entity Framework more.

It's all about managing the risk of data corruption and ensuring that the database stays valuable.
When you persist the inner records first, you open yourself up to the possibility of bad data if the outer save fails. Depending on your application needs, orphaned inner records could be anywhere from a major problem to a minor annoyance.
When you save everything together, if a single record fails the entire save fails.
Of course, if the inner records are meaningful without the outer records (again, determined by your business needs) then you might be preventing the application from progressing.
In short: if the inner record is dependent on the outer, save together. If the inner is meaningful on its own, make the decision based on performance / readability.
Examples:
A House has meaning without a Homeowner. It would be acceptable to create a House on its own.
An HOA (Homeowners' Association) does not make sense without Homeowners. These should be created together.
Obviously, these examples are contrived and become trivial when using existing data. For the purpose of this question, we're assuming that both are being created at the same time.
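The "save together" behaviour boils down to writing the whole graph in one transaction. Here is a minimal sketch of that idea using Python's built-in sqlite3 module rather than Entity Framework. The hoa and homeowner tables and the save_together helper are hypothetical names made up for illustration. A failing inner insert rolls back the outer row too, so nothing is orphaned:

```python
import sqlite3


def make_db():
    """Create an in-memory database with a parent and a dependent table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE hoa (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        """CREATE TABLE homeowner (
               id INTEGER PRIMARY KEY,
               hoa_id INTEGER NOT NULL REFERENCES hoa(id),
               name TEXT NOT NULL)"""
    )
    return conn


def save_together(conn, hoa_name, owner_names):
    """Insert an HOA and its homeowners atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            cur = conn.execute("INSERT INTO hoa (name) VALUES (?)", (hoa_name,))
            hoa_id = cur.lastrowid
            conn.executemany(
                "INSERT INTO homeowner (hoa_id, name) VALUES (?, ?)",
                [(hoa_id, name) for name in owner_names],
            )
        return True
    except sqlite3.IntegrityError:
        return False


def count(conn, table):
    """Return the number of rows in one of the two known tables."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

If the homeowners were saved in a separate, earlier transaction instead, a later failure on the HOA would leave them behind, which is the orphan risk described above.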

Related

Problem with boundary for different aggregates

I have a problem with the boundaries of aggregates. I was trying to read about aggregates, aggregate roots, and boundaries, looking for some code examples but I still struggle with it.
The app that I'm working on is an app to manage architecture projects.
Among the screens in the app there will be a screen with all details for the selected project, and one with all jobs for the selected constructor.
I have one AggregateRoot - ArchitectureProject. It has an Architect, Stages, etc. and it has a list of ConstructorJobs (as it has to be on the screen with project details). ConstructorJob has its name, some value, and a Constructor. A Constructor can have some ConstructorType. As for me, Constructor is another AggregateRoot. I have a problem with ConstructorJob. Where should I place it? What should be responsible for managing it?
I was trying to think about what cannot exist without what, and ConstructorJob cannot exist without Project, but on the other hand it has to have a Constructor as well...
I can't imagine that Constructor would belong to the Project aggregate, as ConstructorType would be a 4th level child to it, so searching for all constructors of that type would be painful, wouldn't it?
I would appreciate any explanation, how to handle such cases.
I think you are missing an important rule which usually makes your life a lot easier:
Rule: Reference Other Aggregates by Identity
See also Vaughn Vernon's Book Implementing Domain-Driven Design, chapter 10 - Aggregates.
It is important to note that Aggregates in the sense of domain-driven design are not so much focused on if the existence of one aggregate makes sense without the other. It is more about transactional boundaries. So an aggregate should create a boundary around elements that should only change together within the same transaction - to adhere to consistency.
So I guess that you will change your Project in different use cases than those in which you would change the Constructor, which I guess can be referenced in different projects.
This means you should reference other aggregates within aggregates only by id which avoids modelling huge aggregates with deep hierarchies. It also means that if your aggregates tend to grow bigger over time that you might have missed some new aggregate which you initially modelled as entity and should be an aggregate on its own.
As for me, Constructor is another AggregateRoot. I have a problem with ConstructorJob. Where should I place it? What should be responsible for managing it?
In your case I would model it the following way:
The ConstructorJob is a Value Object which holds some data (name, etc.) and also a reference to a Constructor aggregate. But this reference is not a reference in terms of object reference like you would do it with a child entity of an aggregate root. The constructor aggregate is referenced by an identifier (UUID, integer or whatever you are using as id type) in the ConstructorJob.
The ConstructorJob value object would be part of the Project aggregate. The project aggregate could of course directly hold the id of the constructor aggregate but I guess in your case the value object might fit quite well.
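The model described above could be sketched roughly like this. It is a language-neutral illustration written in Python. The field names, the add_job and jobs_for methods, and the use of UUIDs as the id type are all assumptions and not part of the original question:

```python
from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass(frozen=True)
class ConstructorJob:
    """Value object: no identity of its own, refers to a Constructor by id only."""
    name: str
    value: float
    constructor_id: UUID


@dataclass
class Constructor:
    """Separate aggregate root, loaded and saved independently of any project."""
    id: UUID
    name: str
    constructor_type: str


@dataclass
class ArchitectureProject:
    """Aggregate root that owns its jobs and changes them in one transaction."""
    id: UUID
    name: str
    jobs: list = field(default_factory=list)

    def add_job(self, name, value, constructor_id):
        """Attach a job that points at a constructor by identity, not by object."""
        job = ConstructorJob(name, value, constructor_id)
        self.jobs.append(job)
        return job

    def jobs_for(self, constructor_id):
        """Return the jobs in this project assigned to the given constructor."""
        return [job for job in self.jobs if job.constructor_id == constructor_id]
```

Because the project only stores constructor_id, searching constructors by type stays a query against the Constructor aggregate alone and never has to walk through projects.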

What are the benefits of ORM lazy loading?

I'm researching data layer underpinnings for a new web-based reporting system and have spent a lot of time evaluating ORMs over the last few days. That said, I've never dealt with "lazy loading" before and am confused about why it's the default setting for LINQ queries in the Entity Framework. It seems like it creates a lot of network traffic and unnecessarily tasks the database with additional queries that could otherwise be resolved with joins.
Can someone describe a scenario in which lazy loading would be beneficial?
Some meta:
The new system will be working against a database with hundreds of tables and many terabytes of data in a production environment with over 3,000 concurrent users on the system 24 hours a day. They will be retrieving large datasets continuously. Is it possible that an ORM just isn't the right solution for our needs, especially since the app will be web-based?
When we talk about lazy loading we are talking about Navigation Properties (how we follow foreign keys). What lazy loading will do for us is to populate the entity from a remote table as we attempt to access that entity. For example if we have a model like this
    public class TestEntity
    {
        public int Id { get; set; }
        // Must be virtual for EF's proxy-based lazy loading to apply.
        public virtual AnotherEntity RemoteEntity { get; set; }
    }
And call the following
    var something = WhateverContext.TestEntities.First().RemoteEntity;
We will get 2 database calls, one for WhateverContext.TestEntities.First() and one for loading the remote entity.
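To make the two round trips visible, here is a small simulation of the same access pattern. It is not Entity Framework code. FakeDatabase and LazyEntity are hypothetical stand-ins that only count calls, written in Python to keep the sketch self-contained:

```python
class FakeDatabase:
    """Stand-in for a real database that counts round trips."""

    def __init__(self):
        self.calls = 0
        self._test_entities = {1: {"id": 1, "remote_id": 10}}
        self._another_entities = {10: {"id": 10, "label": "remote"}}

    def first_test_entity(self):
        self.calls += 1
        return self._test_entities[1]

    def another_entity(self, entity_id):
        self.calls += 1
        return self._another_entities[entity_id]


class LazyEntity:
    """Entity whose navigation property is fetched on first access."""

    def __init__(self, db, row):
        self._db = db
        self.id = row["id"]
        self._remote_id = row["remote_id"]
        self._remote = None

    @property
    def remote_entity(self):
        # Lazy load: query on first access only, then reuse the cached value.
        if self._remote is None:
            self._remote = self._db.another_entity(self._remote_id)
        return self._remote
```

Loading the entity costs one call, touching remote_entity costs a second, and further accesses cost nothing. An eager load would fetch both rows in the first call instead.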
I'm a web guy (and more specifically an MVC guy), and for web stuff I don't think there is ever a good reason for wanting to do this. One database call is always going to be quicker than two if we require the same set of data.
The situation where I think lazy loading is actually worth considering is when you don't know at the time of your first query whether you will need the second entity at all. In my opinion this is much more relevant for Windows applications, where we have a user who is performing actions in real time (rather than stateless MVC, where users are requesting whole pages at once). For example, I think lazy loading shines when we have a list of data with a details link: we don't load the details until the user decides they want to see them.
I don't feel this extends to paging, sorting and filtering, IMO there should be one specifically crafted database query per page of data you are displaying, which returns exactly the data set required to display that page.
In terms of your performance question, I feel that EF (or another ORM) can probably meet your needs here but you want to be careful with how you are retrieving large datasets due to the way EF tracks entities. Check out my EF performance tuning cheat sheet, and read up on DetectChanges and AsNoTracking if you do decide to use EF with large queries.
Most ORMs will give you the option, when you're building up your object selections, to say "don't be lazy, go ahead and join", so if you're worried about it from an efficiency perspective, don't be. You can make it work (usually).
There are 2 particular cases I know of where lazy loading helps:
Chaining commands
What if you want to create a basic select, but then you want to run it through a sort and a filter function that's based on user input. You can simply pass the ORM object in, and attach the sort and filtering functionality to it. Instead of evaluating it each time, it only evaluates when it's actually used.
Avoiding huge, deep, highly-relational queries
What if you just need the IDs of some related fields? If it loads lazily, you don't have to worry about it joining a whole bunch of data and tables that you don't need, potentially slowing down the query and overusing bandwidth. Of course, if you DID want everything else, then you'll need to be explicit, or you may run into a problem where it lazily runs a query for each detail record. Like I mentioned at the outset, that's easily overcome in any ORM worth using.
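The "chaining commands" case is really deferred execution of the query itself. A minimal sketch of the idea, assuming a hypothetical Query class that filters in memory (a real ORM would translate the chain to SQL instead), looks like this:

```python
class Query:
    """Collects filter and sort steps and runs them only when iterated."""

    def __init__(self, source, steps=()):
        self._source = source  # callable that actually fetches the rows
        self._steps = tuple(steps)

    def where(self, predicate):
        def step(rows):
            return [row for row in rows if predicate(row)]
        return Query(self._source, self._steps + (step,))

    def order_by(self, key):
        def step(rows):
            return sorted(rows, key=key)
        return Query(self._source, self._steps + (step,))

    def __iter__(self):
        # Nothing has touched the source until this point.
        rows = self._source()
        for step in self._steps:
            rows = step(rows)
        return iter(rows)


class CountingSource:
    """Hypothetical data source that records how often it is evaluated."""

    def __init__(self, rows):
        self.rows = rows
        self.evaluations = 0

    def __call__(self):
        self.evaluations += 1
        return list(self.rows)
```

A Query object can be passed through sort and filter functions driven by user input, and the source is only hit once, when the result is finally enumerated.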
A simple case is a result set of N records which you do not want to bring to the client at once. The benefit is that you are able to lazily load only what is needed for the client's demands, such as sorting, filtering, etc. An example would be a paging view where one could page through records and sort them accordingly, so the client only needs N records at a given time.
When you perform the LINQ query it translates that to SQL commands on the server side to provide only what is needed in the given context. It boils down to offloading work to the database and minimizing what you need to send back to the client.
Some will argue that ORM-based lazy loading is wrong. However, that starts to move into semantics fairly quickly, and it should be more about approach to design than about what is right and wrong.

Entity Framework TPC with multiple inheritance

I am using EF with TPC and I have multiple levels of inheritance. Let's say I have
Employee (abstract)
Developer (inherits from Employee)
SeniorDeveloper (inherits from Developer)
I inserted some rows in the database and EF reads them correctly.
BUT
When I insert a new SeniorDeveloper, the values get written to both the SeniorDeveloper AND Developer database tables, hence querying just the Developers (context.Employees.OfType<Developer>()) also gets the recently added SeniorDevelopers.
Is there a way to tell EF, that it should store only in one table, or why does EF fall back to TPT strategy?
Since it doesn't look like EF supports the multiple inheritance with TPC, we ended up using TPC for Employee to Developer and TPT between Developer and SeniorDeveloper...
I believe there is a reason for this, although I may not see the full picture and might just be speculating.
The situation
Indeed, the only way (that I see) for EF to be able to list only the non-senior developers (your querying use-case) in a TPT scenario by reading only the Developer table would be by using a discriminator, and we know that EF doesn't use one in TPT/TPC strategies.
Why? Well, remember that all senior developers are developers, so it's only natural (and necessary) that they have a Developer record as well as a SeniorDeveloper record.
The only exception is if Developer is an abstract type, in which case you can use a TPC strategy to remove the Developer table altogether. In your case however, Developer is concrete.
The current solution
Remembering this, and without a discriminator in the Developer table, the only way to determine if any developer is a non-senior developer is by checking if it is not a senior developer; in other words, by verifying that there is no record of the developer in the SeniorDeveloper table, or any other subtype table.
That did sound a little obvious, but now we understand why the SeniorDeveloper table must be used and accessed when its base type (Developer) is concrete (non-abstract).
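The check described above can be sketched in plain SQL, run here through Python's built-in sqlite3 module. This is not the SQL that EF generates. The table layout, the sample rows and the computed Discriminator column are assumptions chosen only to show how a type can be derived at query time from the absence of a subtype row:

```python
import sqlite3


def make_db():
    """TPT-style layout: every senior developer also has a Developer row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Developer (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL);
        CREATE TABLE SeniorDeveloper (
            Id INTEGER PRIMARY KEY REFERENCES Developer(Id),
            YearsOfExperience INTEGER NOT NULL);
        INSERT INTO Developer VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO SeniorDeveloper VALUES (2, 12);
        """
    )
    return conn


# The discriminator is not stored anywhere; it is computed from the join.
QUERY = """
    SELECT d.Id, d.Name,
           CASE WHEN s.Id IS NULL THEN 'Developer'
                ELSE 'SeniorDeveloper' END AS Discriminator
    FROM Developer AS d
    LEFT OUTER JOIN SeniorDeveloper AS s ON s.Id = d.Id
    ORDER BY d.Id
"""


def typed_rows(conn):
    """Return (Id, Name, Discriminator) for every developer."""
    return conn.execute(QUERY).fetchall()


def non_senior_names(conn):
    """Keep only rows whose computed type is the concrete base type."""
    return [name for _, name, kind in typed_rows(conn) if kind == "Developer"]
```

Even a query for plain developers has to touch the SeniorDeveloper table, which is the point being made here.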
The current implementation
I'm writing this from memory so I hope it isn't too off, but this is also what Slauma mentioned in another comment. You probably want to fire up a SQL profiler and verify this.
The way it is implemented is by requesting a UNION of projections of the tables. These projections simply add a discriminator declaring their own type in some encoded way[1]. In the union set, the rows can then be filtered based on this discriminator.
[1] If I remember correctly, it goes something like this: 0X for the base type, 0X0X for the first subtype in the union, 0X1X for the second subtype, and so on.
Trade-off #1
We can already identify a trade-off: EF can either store a discriminator in the table, or it can "generate one" at "run time".
The disadvantages of a stored discriminator are that it is less space efficient, and possibly "ugly" (if that's an argument). The advantages are lookup performance in a very specific case (we only want the records of the base type).
The disadvantages of a "run time" discriminator are that lookup performance is not as good for that same use-case. The advantages are that it is more space efficient.
At first sight, it would seem that sometimes we might prefer to trade a little bit of space for query performance, and EF wouldn't let us.
In reality, it's not always clear when; by requesting a UNION of two tables, we just look up two indexes instead of one, and the performance difference is negligible. With a single level of inheritance, it can't be worse than 2x (since all subtype sets are disjoint). But wait, there's more.
Trade-off #2
Remember that I said the performance advantage of the stored-discriminator approach would only appear in the specific use-case where we look up records of the base type. Why is that?
Well, if you're searching for developers that may or may not be senior developers[2], you're forced to look up the SeniorDeveloper table anyway. While this, again, seems obvious, what may be less obvious is that EF can't know in advance if the rows will only be of one type or another. This means that it would have to issue two queries in the worst case: one on the Developer table, and if there is even one senior developer in the result set, a second one on the SeniorDeveloper table.
Unfortunately, the extra roundtrip probably has a bigger performance impact than a UNION of the two tables. (I say probably, I haven't verified it.) Worse, it increases for each subtype for which there is a row in the result set. Imagine a type with 3, or 5, or even 10 subtypes.
And that's your trade-off #2.
[2] Remember that this kind of operation could come from any part of your application(s), while resolving the trade-off must be done globally to satisfy all processes/applications/users. Also couple that with the fact that the EF team must make these trade-offs for all EF users (although it is true that they could add some configuration for these kinds of trade-offs).
A possible alternative
By batching SQL queries, it would be possible to avoid the multiple roundtrips. EF would have to send some procedural logic to the server for the conditional lookups (T-SQL). But since we already established in trade-off #1 that the performance advantage is most likely negligible in many cases, I'm not sure this would ever be worth the effort. Maybe someone could open an issue ticket for this to determine if it makes sense.
Conclusion
In the future, maybe someone can optimize a few typical operations in this specific scenario with some creative solutions, then provide some configuration switches when the optimization involves such trade-offs.
Right now however, I think EF has chosen a fair solution. In a strange way, it's almost cleaner.
A few notes
I believe the use of union is an optimization applied in certain cases. In other cases, it would be an outer join, but the use of a discriminator (and everything else) remains the same.
You mentioned multiple inheritance, which sort of confused me initially. In common object-oriented parlance, multiple inheritance is a construct in which a type has multiple base types. Many object-oriented type systems don't support this, including the CTS (used by all .NET languages). You mean something else here.
You also mentioned that EF would "fallback" to a TPT strategy. In the case of Developer/SeniorDeveloper, a TPC strategy would have the same results as a TPT strategy, since Developer is concrete. If you really want a single table, you must then use a TPH strategy.

find(1234) including relationships

I have a model "Events" (Zend_Db_Table_Abstract) that's got various relationships to other models. Usually I think I would do something like this to find it and its relationships:
$events = new Events();
$event = $events->find($id)->current();
$eventsRelationship1 = $event->findDependentRowset('Relationship1');
As the relationship is already set up I am wondering if there's any sort of automatic join available or something. Every time I fetch my event I need to have all the relationships, too. Currently I see only two ways to achieve that:
Build the query myself, hard coded. Don't like this, because it's working around the already set up relationship and "model method convenience".
Fetch every related object with a single query. This one's ugly, too, as I have to trigger too many queries.
This goes even a step further when thinking about getting a set of multiple rows. For a single event I may query the database multiple times, but when fetching 100 rows joins are just elementary.
So, does anyone know a way to create joins by using those relationships or is there no other way than hardcoding the query?
Thanks in advance
Arne
The way to solve this challenge is to 'upgrade' your database access to use the dataMapper pattern.
You are essentially adding an extra layer between the models in your application and their representation in the db. This mapper layer allows you to read/write data from different tables, rather than having a direct link between one model and one table.
Here is a good tutorial to follow. (There are some bits you can skip - I left out all the getters and setters as it's just me using the code).
It takes a little while to get your head round the way it works, when you've just been using Zend_Db_Table_Abstract, but it is worth it.
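As a rough illustration of the mapper idea, here is a language-neutral sketch in Python with sqlite3. It is not Zend code. The events and attendees tables, the Event class and the EventMapper are all hypothetical. The mapper loads several events and their related rows with a single join and assembles the objects itself:

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Event:
    """Plain model object that knows nothing about tables."""
    id: int
    title: str
    attendees: list = field(default_factory=list)


class EventMapper:
    """Sits between Event objects and the tables they are stored in."""

    def __init__(self, conn):
        self._conn = conn

    def find_many(self, ids):
        """Load events and their attendees with one joined query."""
        placeholders = ",".join("?" for _ in ids)
        rows = self._conn.execute(
            f"""SELECT e.id, e.title, a.name
                FROM events AS e
                LEFT JOIN attendees AS a ON a.event_id = e.id
                WHERE e.id IN ({placeholders})
                ORDER BY e.id, a.id""",
            list(ids),
        ).fetchall()
        events = {}
        for event_id, title, attendee in rows:
            event = events.setdefault(event_id, Event(event_id, title))
            if attendee is not None:  # events without attendees yield NULL
                event.attendees.append(attendee)
        return list(events.values())


def make_db():
    """Create sample tables for the sketch."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE events (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE attendees (
            id INTEGER PRIMARY KEY,
            event_id INTEGER NOT NULL REFERENCES events(id),
            name TEXT NOT NULL);
        INSERT INTO events VALUES (1, 'Launch'), (2, 'Retro');
        INSERT INTO attendees VALUES (1, 1, 'Ann'), (2, 1, 'Bob');
        """
    )
    return conn
```

The join lives in one place, the mapper, so fetching 100 events stays a single query without hard-coding SQL wherever events are used.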

DDD: Persisting aggregates

Let's consider the typical Order and OrderItem example. Assuming that OrderItem is part of the Order Aggregate, it can only be added via Order. So, to add a new OrderItem to an Order, we have to load the entire Aggregate via the Repository, add a new item to the Order object and persist the entire Aggregate again.
This seems to have a lot of overhead. What if our Order has 10 OrderItems? This way, just to add a new OrderItem, not only do we have to read 10 OrderItems, but we also have to re-insert all these 10 OrderItems again. (This is the approach that Jimmy Nilsson has taken in his DDD book. Every time he wants to persist an Aggregate, he clears all the child objects and then re-inserts them again. This can cause other issues, as the IDs of the children change every time because of the IDENTITY column in the database.)
I know some people may suggest to apply Unit of Work pattern at the Aggregate Root so it keeps track of what has been changed and only commit those changes. But this violates Persistence Ignorance (PI) principle because persistence logic is leaking into the Domain Model.
Has anyone thought about this before?
Mosh
This doesn't have to be a problem; some ORMs support lazy lists.
e.g.
You could load the order entity and add items to the Details collection w/o actually materializing all of the other entities in that list.
I think N/Hibernate supports this.
If you are writing your own entity persistence code w/o any ORM, then you are pretty much out of luck, you would have to re-implement the same dirty tracking machinery as ORMappers give you for free.
The entire aggregate must be loaded from the database because DDD assumes that aggregate roots ensure consistency within the boundaries of aggregates. For these rules to be checked, all necessary data must be loaded. If there is a requirement that an order can be worth no more than $100000 for a particular customer, the aggregate root (Order) must check this rule before persisting changes. This does not imply that all the existing items must be loaded and their values summed up. Order can maintain a pre-calculated sum of existing items which is updated on adding new ones. This way, checking the business rule requires only Order data to be loaded when adding new items.
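A minimal sketch of the pre-calculated sum, written in Python for brevity. The class, its fields and the OrderLimitExceeded exception are hypothetical names, and persistence is left out entirely:

```python
from decimal import Decimal


class OrderLimitExceeded(Exception):
    """Raised when adding an item would push the order over its limit."""


class Order:
    """Aggregate root that enforces the limit from a stored running total."""

    def __init__(self, order_id, total=Decimal("0"), limit=Decimal("100000")):
        self.id = order_id
        self.total = total    # loaded from the Order row, not from the items
        self.limit = limit
        self.new_items = []   # only the items added since loading

    def add_item(self, product, price, quantity=1):
        line_total = Decimal(price) * quantity
        if self.total + line_total > self.limit:
            raise OrderLimitExceeded(
                f"order {self.id} would reach {self.total + line_total}"
            )
        self.total += line_total
        self.new_items.append((product, Decimal(price), quantity))
```

The rule is checked without loading a single existing OrderItem. The cost is that total must be kept in sync whenever items are changed or removed.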
I'm not 100% sure about this approach, but I think applying the Unit of Work pattern could be the answer. Keeping in mind that any transaction should be done in application or domain services, you could populate the unit of work class/object with the objects from the aggregate that you have changed. After that, let the UoW class/object do the magic (of course, building a proper UoW might be hard for some cases).
Here is a description of the unit of work pattern from here :
A Unit of Work keeps track of everything you do during a business transaction that can affect the database. When you're done, it figures out everything that needs to be done to alter the database as a result of your work.
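A bare-bones sketch of that bookkeeping, assuming a hypothetical UnitOfWork class that application or domain services populate explicitly, so the domain objects themselves stay persistence-ignorant. The writer callback stands in for whatever actually talks to the database:

```python
class UnitOfWork:
    """Tracks new, changed and removed objects and flushes them together."""

    def __init__(self):
        self._new = []
        self._dirty = []
        self._removed = []

    def register_new(self, obj):
        self._new.append(obj)

    def register_dirty(self, obj):
        # A new object is inserted with its latest state; no update needed.
        if obj not in self._new and obj not in self._dirty:
            self._dirty.append(obj)

    def register_removed(self, obj):
        if obj in self._new:
            # Never reached the database, so there is nothing to delete.
            self._new.remove(obj)
            return
        if obj in self._dirty:
            self._dirty.remove(obj)
        self._removed.append(obj)

    def commit(self, writer):
        """Send each pending operation to writer(op, obj), then reset."""
        operations = (
            [("insert", obj) for obj in self._new]
            + [("update", obj) for obj in self._dirty]
            + [("delete", obj) for obj in self._removed]
        )
        for op, obj in operations:
            writer(op, obj)
        self._new, self._dirty, self._removed = [], [], []
        return operations
```

With this in place, adding one OrderItem to an Order results in one insert and at most one update of the Order row, instead of re-inserting every existing item.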