Is there a way to define a Spring Data Specification (returning a JPA predicate) whose sole purpose is to perform eager fetching?
I have an entity which defines various relationships using lazy loading, but there are several queries where I want to return the entire entity representation, including all related collections, and the criteria of these queries may vary. I've seen a few posts (e.g. on the Spring forum) that discuss the potential introduction of fetch groups, which would likely be the ideal solution; however, since fetch groups are not yet part of the JPA spec, Spring Data does not provide support for them.
I'm looking for a reusable way to perform eager fetching on a variety of dynamic queries.
For example, I've considered developing a reusable Specification whose sole purpose is to perform the eager loading, and which could be combined with other specifications, e.g.:
private static Specification&lt;MyEntity&gt; eager() {
    return new Specification&lt;MyEntity&gt;() {
        @Override
        public Predicate toPredicate(Root&lt;MyEntity&gt; root, CriteriaQuery&lt;?&gt; query, CriteriaBuilder cb) {
            // Left-fetch every collection attribute defined on the entity
            for (PluralAttribute&lt;? super MyEntity, ?, ?&gt; fetch : root.getModel().getPluralAttributes()) {
                root.fetch(fetch, JoinType.LEFT);
            }
            query.distinct(true);
            return null;
        }
    };
}
The goal is to reuse this specification in various queries, e.g.:
repository.findAll(where(eager()).and(otherCriteria).or(otherCriteria));
However, the eager fetching specification is not a true predicate, so it returns null, and would cause obvious problems (i.e. NullPointerExceptions) when chained with other predicates.
(Note that this specification alone does work as expected; i.e. the following query will properly fetch: repository.findAll(eager());)
Is there an appropriate non-null Predicate that can be returned from the "eager" specification, or are there any other reusable approaches for triggering eager fetches using Spring Data JPA specifications (without having to tack the eager load onto another specification)?
We have improved the handling of null Specifications and Predicates in the course of fixing DATAJPA-300. You might wanna give the 1.4 snapshots a try and see how this affects your scenario.
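One direction worth noting: instead of returning null, the "eager" specification could return a neutral, always-true predicate (in the real Criteria API that would be cb.conjunction()), so that chaining never sees a null. The sketch below is a simplified, self-contained stand-in for the Specification combinator — the Spec type and its single-argument toPredicate are hypothetical simplifications, not the real Spring Data interfaces — illustrating both the neutral-predicate idea and null-safe combining:

```java
import java.util.function.Predicate;

// Hypothetical, simplified stand-in for Spring Data's Specification; the real
// toPredicate takes Root, CriteriaQuery and CriteriaBuilder.
interface Spec<T> {
    Predicate<T> toPredicate(); // may return null, like the eager-fetch spec

    // Null-safe combinator: a null predicate is treated as "no restriction",
    // which is roughly the behavior the DATAJPA-300 fix introduces.
    default Spec<T> and(Spec<T> other) {
        return () -> {
            Predicate<T> left = this.toPredicate();
            Predicate<T> right = other.toPredicate();
            if (left == null) return right;
            if (right == null) return left;
            return left.and(right);
        };
    }
}

public class SpecDemo {
    public static void main(String[] args) {
        // An "eager fetch"-style spec that contributes no restriction: instead
        // of null it returns an always-true predicate, the analogue of
        // CriteriaBuilder.conjunction() in the real Criteria API.
        Spec<String> eager = () -> s -> true;
        Spec<String> startsWithA = () -> s -> s.startsWith("a");

        Predicate<String> combined = eager.and(startsWithA).toPredicate();
        System.out.println(combined.test("apple"));  // true
        System.out.println(combined.test("banana")); // false
    }
}
```

Either approach (neutral predicate or null-tolerant combining) keeps a side-effect-only specification chainable.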
Related
In the following, the author advises against partially initializing domain entities.
As we stated earlier, each customer must have no more than 5 contacts. By not returning the contacts along with the customers themselves, we leave a hole in our domain model which allows us to add a 6th contact and thus break this invariant.
Because of that, the practice of partial initialization should be avoided. If your repository returns a list of domain entities (or just a single domain entity), make sure the entities are fully initialized meaning that all their properties are filled out.
https://enterprisecraftsmanship.com/posts/partially-initialized-entities-anti-pattern/
So, should we load the whole object graph (a customer with all contacts and all related things), or would Entity Framework lazy loading help?
It probably has less to do with the object graph and more to do with the invariants involved.
As someone posted in the comments of that post, a performance issue may very well arise when there are thousands of permitted contacts. An example to this effect may be that a Customer may only have, say, 5 active Order instances. Should all Order instances linked to the customer be loaded? Most certainly not. In fact, an Order is another aggregate, and an instance of one aggregate should not be contained in another aggregate. You could use a value object containing the id of the other aggregate, but for a great many of these the same performance issue may manifest itself.
An alternative may be to simply keep a ContactCount or, in my example, an ActiveOrderCount, which is kept consistent. If the actual relationships are to be stored or removed, these may be attached to the relevant aggregate when adding/removing in order to persist the change, but that is a transient representation.
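The count-based alternative can be sketched in plain Java (all names here are hypothetical, not from the original post): the aggregate tracks a contact count instead of holding the full contact collection, so the "at most 5 contacts" invariant can be enforced without loading every contact.

```java
// Hedged sketch: the aggregate keeps a count that is updated atomically with
// the add operation, so the invariant holds without loading all contacts.
public class Customer {
    private static final int MAX_CONTACTS = 5;
    private int contactCount;

    public Customer(int contactCount) {
        this.contactCount = contactCount;
    }

    public void addContact(String contactId) {
        if (contactCount >= MAX_CONTACTS) {
            throw new IllegalStateException(
                "A customer may have at most " + MAX_CONTACTS + " contacts");
        }
        contactCount++;
        // The transient contact (contactId) would be handed to persistence to
        // store the relationship; it is not kept on the aggregate itself.
    }

    public int getContactCount() {
        return contactCount;
    }

    public static void main(String[] args) {
        Customer c = new Customer(4);
        c.addContact("contact-5");
        System.out.println(c.getContactCount()); // 5
    }
}
```

The trade-off is that the count must be kept consistent with the stored relationships (e.g. updated in the same transaction).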
So, should we load the whole object graph (a customer with all contacts and all related things), or would Entity Framework lazy loading help?
The answer is, actually, a resounding "yes". However, your object model should not be deep. You should make every attempt to create small aggregates. I try to model my aggregates with a single root entity containing value objects. The entire aggregate is loaded. Lazy loading is probably an indication that you are querying your domain, which is something I suggest one not do. Rather, create a simple query mechanism that uses some read model to return the relevant data for your front-end.
The anti-pattern of partially loaded entities has to do with both graphs (children and relatives) and the data within an entity. The reason it is an anti-pattern is that any code written to accept, and expect, an entity should be given a complete and valid entity.
This is not to say that you must always load a complete entity; it is that if you ever return an entity, it should be a complete, or completable, entity (proxies associated with a live DbContext).
An example of a partially loaded entity and why it goes bad:
Someone writes the following method that an MVC controller will call to get a customer and return it to a view...
public IEnumerable&lt;Customer&gt; GetCustomers(string criteria)
{
    using (var context = new MyDbContext())
    {
        return context.Customers.Where(x =&gt; x.IsActive &amp;&amp; x.CustomerName.StartsWith(criteria)).ToList();
    }
}
Code like this may have worked earlier with simpler entities, but Customer has related data like Orders, and when MVC went to serialize it, an error occurred because the Orders proxies could not lazy load once the DbContext was disposed. The options were to somehow eager-load all related details with this call to return the complete customer, completely disable lazy loading proxies, or return an incomplete customer. Since this method would be used to display a summary list of just customer details, the author could choose to do something like:
public IEnumerable&lt;Customer&gt; GetCustomers(string criteria)
{
    using (var context = new MyDbContext())
    {
        return context.Customers.Where(x =&gt; x.IsActive &amp;&amp; x.CustomerName.StartsWith(criteria))
            .Select(x =&gt; new Customer
            {
                CustomerId = x.CustomerId,
                CustomerName = x.CustomerName,
                // ... Any other fields that we want to display...
            }).ToList();
    }
}
The problem seems solved. The trouble with this approach, or with turning off lazy-load proxies, is that you are returning a class that implies "I am a Customer entity". That object may be serialized to a view, deserialized back from the view, and passed to another method that expects a Customer entity. Modifications to your code down the road will need to somehow determine which "Customer" objects are actually associated with a DbContext (or are complete, disconnected entities) and which are these partial, incomplete Customer objects.
Eager-loading all of the related data would avoid the issue of the partial entity, but it is both wasteful in terms of performance and memory usage and prone to bugs as entities evolve: when relatives are added, they need to be eager-fetched in the repository, or you risk lazy-load hits, errors, or incomplete entity views being introduced down the road.
Now, in the early days of EF &amp; NHibernate you would be advised to always return complete entities, or to write your repositories to never return entities and instead return DTOs. For example:
public IEnumerable&lt;CustomerDTO&gt; GetCustomers(string criteria)
{
    using (var context = new MyDbContext())
    {
        return context.Customers.Where(x =&gt; x.IsActive &amp;&amp; x.CustomerName.StartsWith(criteria))
            .Select(x =&gt; new CustomerDTO
            {
                CustomerId = x.CustomerId,
                CustomerName = x.CustomerName,
                // ... Any other fields that we want to display...
            }).ToList();
    }
}
This is a better approach than the previous one because, by returning and using the CustomerDTO, there is absolutely no confusion between this partial object and a Customer entity. However, this solution has its drawbacks. One is that you may have several similar but different views that need customer data, some of which need a bit extra or some of the related data. Other methods will have different search requirements. Some will want pagination or sorting. With this approach you end up, as in the article's example, with a repository returning several similar but different DTOs and a large number of variant methods for different criteria, inclusions, etc. (CustomerDTO, CustomerWithAddressDTO, etc.)
With modern EF there is a better solution available for repositories: return IQueryable&lt;TEntity&gt; rather than IEnumerable&lt;TEntity&gt; or even TEntity. For example, to search for customers leveraging IQueryable:
public IQueryable&lt;Customer&gt; GetCustomers()
{
    return Context.Customers.Where(x =&gt; x.IsActive);
}
Then, when your MVC controller goes to get a list of customers with its criteria:
using (var contextScope = ContextScopeFactory.Create())
{
    return CustomerRepository.GetCustomers()
        .Where(x =&gt; x.CustomerName.Contains(criteria))
        .Select(x =&gt; new CustomerViewModel
        {
            CustomerId = x.CustomerId,
            CustomerName = x.CustomerName,
            // ... Details from customer and related entities as needed.
        }).ToList();
}
By returning IQueryable the repository does not need to worry about complete vs. incomplete representations of entities. It can enforce core rules such as active state checking, but leave it up to the consumers to filter, sort, paginate, or otherwise consume the data as they see fit. This keeps the repositories very lightweight and simple to work with while allowing controllers and services that consume them to be unit tested with mocks in place of the repositories. The controllers should consume the entities returned by the repository, but take care not to return these entities themselves. Instead they can populate view models (or DTOs) to hand over to the web client or API consumer to avoid partial entities being passed around and confused for real entities.
This applies even when a repository is expected to return just one entity; returning IQueryable still has its advantages.
For instance, compare:
public Customer GetCustomerById(int customerId)
{
    return Context.Customers.SingleOrDefault(x =&gt; x.CustomerId == customerId);
}
vs.
public IQueryable&lt;Customer&gt; QGetCustomerById(int customerId)
{
    return Context.Customers.Where(x =&gt; x.CustomerId == customerId);
}
These look very similar, but to the consumer (controller/service) they behave a bit differently.
var customer = CustomerRepository.GetCustomerById(customerId);
vs.
var customer = CustomerRepository.QGetCustomerById(customerId).Single();
Slightly different, but the second is far more flexible. What if we just wanted to check whether a customer exists?
var customerExists = CustomerRepository.GetCustomerById(customerId) != null;
vs.
var customerExists = CustomerRepository.QGetCustomerById(customerId).Any();
The first would execute a query that loads the entire customer entity. The second merely executes an exists-check query. When it comes to loading related data, the first method would need to rely on lazy loading or simply not have related details available, whereas the IQueryable method could:
var customer = CustomerRepository.QGetCustomerById(customerId).Include(x => x.Related).Single();
or better, if loading a view model with or without related data:
var customerViewModel = CustomerRepository.QGetCustomerById(customerId)
    .Select(x =&gt; new CustomerViewModel
    {
        CustomerId = x.CustomerId,
        CustomerName = x.CustomerName,
        RelatedName = x.Related.Name,
        // ... etc.
    }).Single();
Disclaimer: Actual mileage may vary depending on your EF version. EF Core has had a number of changes compared to EF6 around lazy loading and query building.
A requirement for this pattern is that the DbContext must either be injected (DI) or provided via a unit-of-work pattern, as the consumer of the repository will need to interact with the entities and their DbContext when materializing the query created by the repository.
A case where using a partially initialized entity is perfectly valid is when performing a delete without pre-fetching the entity. For instance, in cases where you are certain a particular ID or range of IDs needs to be deleted, rather than loading those entities you can instantiate a new instance with just the entity's PK populated and tell the DbContext to delete it. The key point when considering the use of incomplete entities is that the entity must only live within the scope of the operation and never be returned to callers.
Let's say you have a domain object:
class ArgumentEntity
{
    public int Id { get; set; }
    public List&lt;AnotherEntity&gt; AnotherEntities { get; set; }
}
And you have ASP.NET Web API controller to deal with it:
[HttpPost("{id}")]
public IActionResult DoSomethingWithArgumentEntity(int id)
{
    ArgumentEntity entity = this.Repository.GetById(id);
    this.DomainService.DoDomething(entity);
    ...
}
It receives an entity identifier, loads the entity by id, and executes some business logic on it with a domain service.
The problem:
The problem here is with related data. ArgumentEntity has an AnotherEntities collection that will be loaded by EF only if you explicitly ask for it via the Include/Load methods.
DomainService is part of the business layer and should know nothing about persistence, related data, or other EF concepts.
The DoDomething service method expects to receive an ArgumentEntity instance with the AnotherEntities collection loaded.
You might say it's easy: just Include the required data in Repository.GetById and load the whole object with its related collection.
Now let's come back from the simplified example to the reality of a large application:
ArgumentEntity is much more complex. It contains multiple related collections, and those related entities have related data of their own.
You have multiple methods on DomainService, and each method requires a different combination of related data to be loaded.
I can imagine possible solutions, but all of them are far from ideal:
Always load the whole entity -&gt; but this is inefficient and often impossible.
Add several repository methods (GetByIdOnlyHeader, GetByIdWithAnotherEntities, GetByIdFullData) to load specific data subsets in the controller -&gt; but the controller becomes aware of which data to load and pass to each service method.
Add several repository methods (GetByIdOnlyHeader, GetByIdWithAnotherEntities, GetByIdFullData) to load specific data subsets in each service method -&gt; inefficient: one SQL query per service method call. What if you call 10 service methods in one controller action?
Have each domain method call a repository method to load the additional required data (e.g. EnsureAnotherEntitiesLoaded) -&gt; ugly, because my business logic becomes aware of the EF concept of related data.
The question:
How would you solve the problem of loading required related data for the entity before passing it to business layer?
In your example I can see the method DoSomethingWithArgumentEntity, which obviously belongs to the Application Layer. This method calls the Repository, which belongs to the Data Access Layer. I think this situation does not conform to classic Layered Architecture: you should not call the DAL directly from the Application Layer.
So your code can be rewritten in another manner:
[HttpPost("{id}")]
public IActionResult DoSomethingWithArgumentEntity(int id)
{
    this.DomainService.DoDomething(id);
    ...
}
In the DomainService implementation you can read from the repo whatever is needed for this specific operation. This avoids your troubles in the Application Layer. In the Business Layer you will have more freedom to implement the reading: with several repository methods that read partially loaded entities, with EnsureXXX methods, or something else. Knowledge about what needs to be read for an operation is placed in the operation's code, and you no longer need this knowledge in the app layer.
Every time a situation like this emerges, it is a strong signal that your entity is not properly designed. As krzys said, the entity does not have cohesive parts. In other words, if you often need parts of an entity separately, you should split that entity.
Nice question :)
I would argue that "related data" is not in itself a strictly EF concept. Related data is a valid concept with NHibernate, with Dapper, or even if you use files for storage.
I agree with the other points mostly, though. So here's what I usually do: I have one repository method, in your case GetById, which has two parameters: the id and a params Expression&lt;Func&lt;T,object&gt;&gt;[]. Then, inside the repository, I do the includes. This way you don't have any dependency on EF in your business logic (the expressions can be parsed manually for another type of data storage framework if necessary), and each BLL method can decide for itself what related data it actually needs.
public async Task&lt;ArgumentEntity&gt; GetByIdAsync(int id, params Expression&lt;Func&lt;ArgumentEntity, object&gt;&gt;[] includes)
{
    // ctx is a reference to your context; use IQueryable so Include can be chained
    IQueryable&lt;ArgumentEntity&gt; baseQuery = ctx.ArgumentEntities;
    foreach (var include in includes)
    {
        baseQuery = baseQuery.Include(include);
    }
    return await baseQuery.SingleAsync(a =&gt; a.Id == id);
}
Speaking in the context of DDD, it seems that you have missed some modeling aspects in your project, which led you to this issue. The Entity you wrote about does not look highly cohesive. If different related data is needed for different processes (service methods), it seems you have not yet found the proper Aggregates. Consider splitting your Entity into several Aggregates with high cohesion. Then every process correlated with a particular Aggregate will need all, or most of, the data that Aggregate contains.
So I don't know the answer to your question, but if you can afford to take a few steps back and refactor your model, I believe you will not encounter such problems.
I have a situation where I will be using a repository pattern and pulling objects from the database with a lazy loaded GetAll method that returns IQueryable. However, I also need to build dynamic objects that will be included with the lazy loaded objects (the query).
Is it possible to add pre-built objects to a lazy loaded IQueryable and still keep the lazy loading benefits? For instance:
public override IQueryable&lt;Foo&gt; GetAll()
{
    return _entities; // lazy loaded
}

public override IQueryable&lt;Foo&gt; GetAllPlusDynamic()
{
    var entities = GetAll();
    foreach (var d in GetAllDynamic())
    {
        entities.Add(d); // eagerly loaded
    }
    return entities;
}
I am unsure if I understand you correctly, but referring to your comment...
Yes, basically query the database for a set of objects and then query
another data source (in this case a service) and build a set of
objects.
... I would say that it's not possible.
An object of type IQueryable<T> used with Entity Framework (LINQ to Entities) is basically a description of a query which the underlying data store can execute, usually an abstract description (expression tree) which gets translated into SQL.
Every part of such a query description - where expressions, select expressions, Any(...) expressions, etc. - must be translatable into the native language (SQL) of the data store. In particular, it's not possible to include method calls - like a service call - in an expression, because the database cannot understand or execute them.
IQueryable&lt;T&gt; knows an underlying "provider". This provider is responsible for translating the expression tree held by the IQueryable&lt;T&gt; object into "something", for example the T-SQL used by SQL Server, or the SQL dialects used by MySQL or Oracle. I believe it's possible to write your own provider which might then be able to perform the service calls and the database queries somehow, but I also believe that this is not an easy task.
Using the standard SQL providers for Entity Framework, you have to perform the database query and the service call separately, one after the other: run the query and materialize the entities in memory, then run the service call on the result collection for each entity.
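That two-step composition can be sketched in plain Java (the answer above is about EF/LINQ, but the shape is language-agnostic; enrich and the in-memory list are hypothetical stand-ins for the service call and the materialized query result):

```java
import java.util.List;
import java.util.stream.Collectors;

public class TwoStepQuery {
    // Hypothetical stand-in for a service call that no query provider could
    // translate to SQL.
    static String enrich(String customer) {
        return customer + " [enriched]";
    }

    public static void main(String[] args) {
        // Step 1: the part the data store can execute (simulated here by an
        // in-memory list standing in for the materialized query result).
        List<String> fromDb = List.of("alice", "bob");

        // Step 2: the service call runs per entity, in memory, after
        // materialization - never inside the query expression itself.
        List<String> result = fromDb.stream()
                .map(TwoStepQuery::enrich)
                .collect(Collectors.toList());

        System.out.println(result); // [alice [enriched], bob [enriched]]
    }
}
```

The key point is that the non-translatable call only ever touches objects that have already left the query provider.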
Can anyone give some references for non-SQL database query interface design patterns?
For a SQL-based database, a query can be built by combining query tokens.
But for a non-SQL database, how should the query be designed, given that queries can be very complex?
EDIT:
I am using db4o to store some objects, and I may need to query according to a certain id, a time range, or a combination of them.
How should the query methods be designed?
public IEnumerable<Foo> GetFoos(int id);
public IEnumerable<Foo> GetFoos(int id, TimeRange range);
Building a lot of overloads seems stupid; what if a new kind of query is needed?
In C#, it's definitely best to use LINQ. Native Queries often fail to optimize, which will lead db4o to hydrate all objects and actually call the lambda expression on each instantiated object. That is nothing but an automated fallback to LINQ-to-objects, and it's damn slow in comparison. Merely hydrating 60k of our typical objects takes multiple seconds.
Hint: a breakpoint on the lambda expression must never be hit; if it is, the query fell back to the slow path.
Even when using Db4oTool.exe to optimize native queries as a post-build step, simple queries can still run into problems when properties or auto-properties are used in domain objects.
The LINQ provider always gave the best results for me. It has the most concise syntax, and its optimizations work. The LINQ provider is also very complete; it just might fall back to LINQ-to-objects more often than you expect.
Also, it's important for the LINQ provider to have certain DLLs in the project folder. Which ones depends a bit on the version. If you are using builds &gt;= 14204, make sure Mono.Reflection.dll is in your app folder.
For older versions, all of the following must be present:
Db4objects.Db4o.Instrumentation.dll
Db4objects.Db4o.NativeQueries.dll
Mono.Cecil.dll
Cecil.FlowAnalysis.dll
Note that for native queries these are still required even in newer builds.
It looks like db4o uses its own query mechanism, which Versant calls Native Queries (note: there are separate syntaxes for .NET and Java native queries). Something like:
IObjectContainer container = Database();

container.Query(delegate(Foo foo) {
    return foo.id == id;
});

container.Query(delegate(Foo foo) {
    return foo.id == id;
},
delegate(Foo foo) {
    return range.IsIn(foo.time);
});
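The overload explosion in the question can also be avoided by composing criteria instead of adding a method per combination. A hedged sketch in plain Java over an in-memory list (all names hypothetical; with db4o or any other store the same shape applies, with the predicates translated by the query engine):

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Sketch: instead of GetFoos(id), GetFoos(id, range), ... the finder takes
// any number of criteria and ANDs them together, so a new kind of query
// needs no new overload.
public class FooQuery {
    record Foo(int id, long time) {}

    @SafeVarargs
    static List<Foo> find(List<Foo> source, Predicate<Foo>... criteria) {
        Predicate<Foo> combined = f -> true; // neutral starting point
        for (Predicate<Foo> c : criteria) {
            combined = combined.and(c);
        }
        return source.stream().filter(combined).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Foo> foos = List.of(new Foo(1, 100), new Foo(1, 900), new Foo(2, 100));
        // id == 1 AND time in [0, 500]
        List<Foo> hits = find(foos,
                f -> f.id() == 1,
                f -> f.time() >= 0 && f.time() <= 500);
        System.out.println(hits.size()); // 1
    }
}
```

Adding a new criterion is then just a new predicate at the call site, not a new repository method.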
I'm using DataNucleus as a JPA implementation to store my classes in my web application. I use a set of converters which all have toDTO() and fromDTO().
My issue is that I want to avoid the whole DB being sent over the wire:
If I lazy load, the converter will try to access ALL the fields and load them (resulting in very eager loading).
If I don't lazy load, I'll get a huge part of the DB, since a user contains groups, groups contain users, and so on.
Is there a way to explicitly load some fields and leave the others as NULL in my loaded class?
I've tried the DataNucleus docs with no luck.
Your DTOs are probably too fine-grained, i.e. don't plan to have a DTO per JPA entity. If you have to use DTOs, make them more coarse-grained and construct them manually.
Recently we have had the whole "to DTO or not to DTO, that is the question" discussion AGAIN. The requirement for them (especially in the context of a JPA app) is often no longer there, but one of the arguments FOR DTOs tends to be that the view has coarser data requirements.
To load only the data you really require, you would need to use a custom select clause containing only those elements that you are about to use for your DTOs. I know how painful this is, especially when it involves joins, which is why I created Blaze-Persistence Entity Views, which take care of making the query efficient.
You define your DTO as an interface with mappings to the entity, using the attribute name as the default mapping. This looks very simple and a lot like a subset of an entity, though it doesn't have to be; you can use any JPQL expression as the mapping for your DTO attributes.