EF - multiple includes to eager load hierarchical data. Bad practice? - entity-framework

I need to eager load a hierarchy structure so that I can recursively iterate through it. The eager loading is necessary to prevent multiple DB queries while traversing the tree. It seems the consensus is that you can't eager load an unlimited number of levels of the tree, so I did something like
var item = db.ItemHierarchies
             .Include("Children.Children.Children.Children.Children")
             .Where(x => x.condition == condition);
to load 5 levels of children. This seems to get the job done. I'm wondering what the drawback is to doing this? If there is none then theoretically could I add 50 levels of includes here without slowing things down?

I recommend taking a look at the SQL that is generated as you add eager loading to your query.
var item = db.ItemHierarchies
             .Include("Children")
             .Include("Children.Children")
             .Include("Children.Children.Children")
             .Include("Children.Children.Children.Children")
             .Include("Children.Children.Children.Children.Children");

var sql = ((System.Data.Objects.ObjectQuery)item).ToTraceString();
// http://visualstudiomagazine.com/blogs/tool-tracker/2011/11/seeing-the-sql.aspx
You'll see that the SQL quickly gets very big and complicated and can potentially have serious performance implications. You'd do well to limit your eager loading to data that you are certain you will need and to consider using explicit loading for some of the related entities - especially if you're working with connected entities in which case you can explicitly load collection properties when they're needed.
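For example, here's a rough sketch of explicit loading with a DbContext (the entity and property names follow the question and are assumptions):
var item = db.ItemHierarchies
             .Single(x => x.condition == condition);

// Load just one level of children, on demand, only when it's actually needed.
db.Entry(item)
  .Collection(x => x.Children)
  .Load();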
Also note that you may not need multiple separate Includes. For example, the following needs to be separate Includes because they're addressing separate properties (Widgets and Spanners) of the root.
var item = db.ItemHierarchies
             .Include("Widgets")
             .Include("Spanners.Flanges");
But the following isn't necessary:
var item = db.ItemHierarchies
             .Include("Widgets")          // This isn't necessary.
             .Include("Widgets.Flanges"); // This loads both Widgets and Flanges.

Well, honestly... it's an extremely bad practice.
Let's assume you had 50 objects at your root, and 50 per level.
You may end up retrieving 312,500,000 "capsules" of information.
Now, one might ask: "So what is wrong with that?!"
I mean, if that is what is required, then why not do it?
Rule #1: we develop software that is meant to be used by human beings.
And the fact is that no human is capable of glancing at 312,500,000 items of information at once and learning or concluding anything beneficial from it (except, perhaps, that looking at it doesn't help).
Rule #2: the UI should be based on what is needed, not on what is possible.
And since we've already established that showing 312,500,000 capsules of data is not needed, there is no reason to bring all of it at once.
Now you might come forward and say: "But I don't care about the UI, really! All I need is to iterate over that data in order to process some information!"
In that case you would probably want to save your results somewhere for future reference, which means it's a batch job, so why not apply batch-job rules to it, such as processing it item by item, which may also give you the option of splitting the work across more machines if needed.
So you see, no matter which path you choose, there should be no reason to do it
(which is the definition of a bad practice).
Update:
After reading interesting concerns in the comments, I would like to update this answer with more analysis:
Deciding what is a bad practice must always be done in reference to what is to be achieved, or what the role of each part in the system is. In the current situation (after reading the comments) it has been stated or implied that the data storage is actually a persistence medium for objects, as opposed to a different concept where the data is the 'heart' of the application.
We can distinguish two kinds of data usage:
1) Data-centric, as in data-centric applications such as banks, CRMs, ERPs, websites or other service-based solutions.
vs.
2) Data-persistence medium, where the data is only saved for when the application is not active, for example: a simple app save file, a game save file, etc.
The main difference is that a data-persistence medium is accessed by only a single instance of the app at a single point in time, meaning the data is not designed to be shared by many instances. If the data is to be shared, we are dealing with a data-centric application.
If your app just needs a data-persistence medium, loading all the information cannot be considered a bad practice, but you still need to make sure you are not exhausting memory; and in that frame of work, SQL Server might not be what you need or the best tool to use.
In the other case, a data-centric application, my original answer stands: it is a bad practice to bring all the information into each instance of the application.

Related

REST design principles: Referencing related objects vs Nesting objects

My team and I are refactoring a REST-API and I have come to a question.
For terms of brevity, let us assume that we have an SQL database with 4 tables: Teachers, Students, Courses and Classrooms.
Right now all the relations between the items are represented in the REST-API by referencing the URL of the related item. For example, for a course we could have the following:
{ "id":"Course1", "teacher": "http://server.com/teacher1", ... }
In addition, if I ask for a list of courses through a GET call to /courses, I get a list of references as shown below:
{
  ... //pagination details
  "items": [
    {"href": "http://server1.com/course1"},
    {"href": "http://server1.com/course2"}...
  ]
}
All this is nice and clean, but if I want a list of all the course titles with the teachers' names, and I have 2000 courses and 500 teachers, I have to do the following:
1. Make approximately 2500 requests just to read the data.
2. Implement the join between the teachers and courses myself.
3. Optimize with caching etc., so that it runs as fast as possible.
My problem is that this method creates a lot of network traffic with thousands of REST-API calls and that I have to re-implement the natural join that the database would do way more efficiently.
Colleagues say that this approach is the standard way of implementing a REST-API, but then a relatively simple query becomes a big hassle.
My question therefore is:
1. Is it wrong if we nest the teacher information in the courses?
2. Should the listing of items e.g. GET /courses return a list of references or a list of items?
Edit: After some research I would say the model I have in mind corresponds mainly to the one shown in jsonapi.org. Is this a good approach?
My problem is that this method creates a lot of network traffic with thousands of REST-API calls and that I have to re-implement the natural join that the database would do way more efficiently. Colleagues say that this approach is the standard way of implementing a REST-API, but then a relatively simple query becomes a big hassle.
Your colleagues have lost the plot.
Here's your heuristic - how would you support this use case on a web site?
You would probably do it by defining a new web page that produces the report you need. You'd run the query, use the result set to generate a bunch of HTML, and ta-da! The client has the information they need in a standardized representation.
A REST-API is the same thing, with more emphasis on machine readability. Create a new document, with a schema so that your clients can understand the semantics of the document you return to them, tell the clients how to find the target URI for the document, and voila.
Creating new resources to handle new use cases is the normal approach to REST.
Yes, I totally think you should design something similar to jsonapi.org. As a rule of thumb, I would say "prefer a solution that requires fewer network calls". That's especially true if the number of network calls goes down by an order of magnitude.
Of course, it doesn't eliminate the need to limit the request/response size if it becomes unreasonable.
Real-life solutions must strike a proper balance. A clean API is nice as long as it works.
So in your case I would do something like:
GET /courses?include=teachers
Or
GET /courses?includeTeacher=true
Or
GET /courses?includeTeacher=brief|full
In the last one, the response can contain only the teacher's id for brief, and full teacher details for full.
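For illustration only, here is roughly how such an include flag might be handled in an ASP.NET Core action (the framework choice, the _repository field and the response shapes are assumptions, not part of the question's code base):
// Requires Microsoft.AspNetCore.Mvc, System and System.Linq.
[HttpGet("/courses")]
public IActionResult GetCourses([FromQuery] string include = null)
{
    var courses = _repository.GetCourses(); // assumed data source

    // Only join in the teacher data when the client explicitly asks for it.
    if (string.Equals(include, "teachers", StringComparison.OrdinalIgnoreCase))
    {
        return Ok(courses.Select(c => new
        {
            c.Id,
            c.Title,
            Teacher = new { c.Teacher.Id, c.Teacher.Name }
        }));
    }

    // Default behaviour stays as it is today: references only.
    return Ok(courses.Select(c => new
    {
        c.Id,
        c.Title,
        Teacher = $"http://server.com/teachers/{c.Teacher.Id}"
    }));
}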
My problem is that this method creates a lot of network traffic with thousands of REST-API calls and that I have to re-implement the natural join that the database would do way more efficiently. Colleagues say that this approach is the standard way of implementing a REST-API, but then a relatively simple query becomes a big hassle.
Have you actually measured the overhead generated by each request? If not, how do you know that the overhead will be too high? From an object-oriented programmer's perspective it may sound bad to perform each call on its own; your design, however, lacks one important asset which helped the Web grow to its current size: caching.
Caching can occur on multiple levels. You can do it at the API level, the client might do something, or an intermediary server might do it. Fielding even made it a constraint of REST! So, if you want to comply with the REST architectural philosophy, you should also support caching of responses. Caching helps to reduce the number of requests that have to be calculated or even processed by a single server. With the help of stateless communication you might even introduce a multitude of servers that all perform calculations for billions of requests and act as one cohesive system to the client. An intermediary cache may further reduce the number of requests that actually reach the server significantly.
A URI as a whole (including any path, matrix or query parameters) is actually a key for a cache. Upon receiving a GET request, an application checks whether its current cache already contains a stored response for that URI and, if the stored data is "fresh enough", returns the stored response on behalf of the server directly to the client. If the stored data has already exceeded the freshness threshold, the cache will throw away the stored data and route the request to the next hop in line (which might be the actual server, or a further intermediary).
Spotting resources that are ideal for caching might not always be easy, though the majority of data doesn't change so quickly that caching should be neglected entirely. Thus, it should be of at least general interest to introduce caching, especially the more traffic your API produces.
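As a small, framework-specific illustration (the choice of ASP.NET Core is my assumption, not something from the post), a freshness window can be declared per resource so that clients and intermediaries may reuse a stored response:
// Requires Microsoft.AspNetCore.Mvc. Emits "Cache-Control: public, max-age=60",
// so any cache along the way may serve this response for up to 60 seconds.
[HttpGet("/courses")]
[ResponseCache(Duration = 60, Location = ResponseCacheLocation.Any)]
public IActionResult GetCourses() => Ok(_repository.GetCourses()); // _repository is an assumed data source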
While certain media-types such as HAL JSON, jsonapi, ... allow you to embed content gathered from related resources into the response, embedding content has some potential drawbacks such as:
Utilization of the cache might be low due to mixing data that changes quickly with data that is more static
Server might calculate data the client won't need
One server calculates the whole response
If related resources are only linked to instead of directly embedded, a client certainly has to fire off a further request to obtain that data, though that request is more likely to be (at least partly) served by a cache which, as mentioned a couple of times throughout this post, reduces the workload on the server. Besides that, a positive side effect could be that you gain more insight into what clients are actually interested in (if you run the intermediary cache yourself, for example).
Is it wrong if we nest the teacher information in the courses?
It is not wrong, but it might not be ideal, as explained above.
Should the listing of items e.g. GET /courses return a list of references or a list of items?
It depends. There is no right or wrong.
As REST is just a generalization of the interaction model used on the Web, basically the same concepts apply to REST as well. Depending on the size of the "item", it might be beneficial to return a short summary of the item's content and add a link to the item itself. Similar things are done on the Web as well. For a list of students enrolled in a course, this might be the name and matriculation number, plus a link under which further details of that student can be requested, accompanied by a link-relation name that gives the link some semantic context which a client can use to decide whether invoking that URI makes sense or not.
Such link-relation names are either standardized by IANA, come from common approaches such as Dublin Core or schema.org, or are custom extensions as defined in RFC 8288 (Web Linking). For the above-mentioned list of students enrolled in a course you could, for example, make use of the about relation name to hint to a client that further information on the current item can be found by following the link. If you want to enable pagination, the first, next, prev and last relation names can and probably should be used as well, and so forth.
This is actually what HATEOAS is all about: linking data together and giving the links meaningful relation names to span a kind of semantic net between resources. By simply embedding things into a response, such semantic graphs might be harder to build and maintain.
In the end it basically boils down to an implementation choice whether you want to embed or reference resources. I hope I could shed some light on the usefulness of caching and the benefits it can yield, especially in large-scale systems, as well as on the benefit of providing link-relation names for URIs, which enhance the semantic context of the relations used within your API.

How to handle application death and other mid-operation faults with Mongo DB

Since Mongo doesn't have transactions that can be used to ensure that nothing is committed to the database unless it's consistent (non-corrupt) data, if my application dies between making a write to one document and making a related write to another document, what techniques can I use to remove the corrupt data and/or recover in some way?
The greater idea behind NoSQL was to use a carefully modeled data structure for a specific problem, instead of hitting every problem with a hammer. That is also true for transactions, which should be referred to as 'short-lived transactions', because the typical RDBMS transaction hardly helps with 'real', long-lived transactions.
The kind of transaction supported by RDBMSs is often required only because the limited data model forces you to store the data across several tables, instead of using embedded arrays (think of the typical invoice / invoice items examples).
In MongoDB, try to use de-normalized, write-heavy data structures and keep related data in a single document; this improves read speed and data locality and ensures consistency. Such a data model is also easier to scale, because a single read only hits a single server, instead of having to collect data from multiple sources.
However, there are cases where the data must be read in a variety of contexts and de-normalization becomes unfeasible. In that case, you might want to take a look at Two-Phase Commits or choose a completely different concurrency approach, such as MVCC (in a sentence, that's what the likes of svn, git, etc. do). The latter, however, is hardly a drop-in replacement for RDBMs, but exposes a completely different kind of concurrency to a higher level of the application, if not the user.
Thinking about this myself, I want to identify some categories of effects:
1. Your operation has only one database save (saving data into one document)
2. Your operation has two database saves (updates, inserts, or deletions), A and B
2.1. They are independent
2.2. B is required for A to be valid
2.3. They are interdependent (A is required for B to be valid, and B is required for A to be valid)
3. Your operation has more than two database saves
I think this is a full list of the general possibilities. In case 1, you have no problem: one database save is atomic. In case 2.1, same thing; if they're independent, they might as well be two separate operations.
For case 2.2, if you do A first and then B, at worst you will have some extra data (B data) that takes up space in your system but is otherwise harmless. In case 2.3, you'll likely have some corrupt data in the event of a catastrophic failure. And case 3 is just a composition of case 2s.
Some examples for the different cases:
1.0. You change a car document's color to 'blue'
2.1. You change the car document's color to 'red' and the driver's hair color to 'red'
2.2. You create a new engine document and add its ID to the car document
2.3.a. You change your car's 'gasType' to 'diesel', which requires changing your engine to a 'diesel' type engine.
2.3.b. Another example of 2.3: You hitch car document A to another car document B, A getting the "towedBy" property set to B's ID, and B getting the "towing" property set to A's ID
3.0. I'll leave examples of this to your imagination
In many cases, it's possible to turn a 2.3 scenario into a 2.2 scenario. In the 2.3.a example, the car document and the engine are separate documents (let's ignore the possibility of putting the engine inside the car document for this example). It's invalid both to have a diesel engine with non-diesel gas and to have a non-diesel engine with diesel gas, so both have to change. But it may be valid to have no engine at all and have diesel gas. So you can add a step that keeps the whole thing valid at every point: first remove the engine, then replace the gas, then change the type of the engine, and lastly add the engine back onto the car.
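A minimal sketch of that re-ordered sequence with the MongoDB .NET driver (collection and field names such as engineId and gasType follow the example above and are assumptions, not a prescribed schema); each step is a single-document write, so the data stays valid no matter where the application dies:
// Requires MongoDB.Bson, MongoDB.Driver and System.Threading.Tasks.
static async Task ConvertCarToDieselAsync(IMongoDatabase db, ObjectId carId, ObjectId engineId)
{
    var cars = db.GetCollection<BsonDocument>("cars");
    var engines = db.GetCollection<BsonDocument>("engines");
    var car = Builders<BsonDocument>.Filter.Eq("_id", carId);
    var engine = Builders<BsonDocument>.Filter.Eq("_id", engineId);

    // 1. Detach the engine (a car with no engine but diesel gas is valid).
    await cars.UpdateOneAsync(car, Builders<BsonDocument>.Update.Unset("engineId"));
    // 2. Change the car's gas type.
    await cars.UpdateOneAsync(car, Builders<BsonDocument>.Update.Set("gasType", "diesel"));
    // 3. Change the engine's type.
    await engines.UpdateOneAsync(engine, Builders<BsonDocument>.Update.Set("type", "diesel"));
    // 4. Re-attach the engine; car and engine agree again.
    await cars.UpdateOneAsync(car, Builders<BsonDocument>.Update.Set("engineId", engineId));
}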
If you can end up with corrupt data from a 2.3 scenario, you'll want a way to detect the corruption. In example 2.3.b, things might break if one document has the "towing" property but the other document doesn't have a corresponding "towedBy" property. So this might be something to check after a catastrophic failure: find all documents that have "towing" where the document with the id in that property doesn't have its "towedBy" set to the right ID. The choices then are to delete the "towing" property or to set the appropriate "towedBy" property. They both seem equally valid, but it might depend on your application.
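For the 2.3.b case, a post-crash consistency check could look roughly like this (again a sketch with the MongoDB .NET driver; the field names come from the example and the repair choice is left open):
// Requires MongoDB.Bson, MongoDB.Driver and System.Threading.Tasks.
static async Task FindDanglingTowLinksAsync(IMongoCollection<BsonDocument> cars)
{
    var towingCars = await cars
        .Find(Builders<BsonDocument>.Filter.Exists("towing"))
        .ToListAsync();

    foreach (var towingCar in towingCars)
    {
        var towed = await cars
            .Find(Builders<BsonDocument>.Filter.Eq("_id", towingCar["towing"]))
            .FirstOrDefaultAsync();

        var consistent = towed != null
            && towed.Contains("towedBy")
            && towed["towedBy"] == towingCar["_id"];

        if (!consistent)
        {
            // Repair: either unset "towing" here, or set the matching
            // "towedBy" on the other document - an application-specific choice.
        }
    }
}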
In some situations, you might be able to find corrupt data like this, but you won't know what the data was before those things were set. In those cases, setting a default is probably better than nothing. Some types of corruption are better than others (particularly the kind that will cause errors in your application rather than simply incorrect display data).
If the above kind of code analysis or corruption repair becomes unfeasible, or if you want to avoid any data corruption at all, your last resort would be to take mnemosyn's suggestion and implement Two-Phase Commits, MVCC, or something similar that allows you to identify and roll back changes in an indeterminate state.

Loading a context

Have I understood this correctly, please?
When you are running a web application to view pages and you create an instance of the context, does that instance load all the database data into it?
If it does, doesn't that take up a lot of memory? A blog with five years of posts could have 1,500 to 2,000 (or more) posts in it; with all the comments, tags, etc. that would be a great deal of data.
So what does happen when you create the instance of a context?
A context only loads the records that you request, so when you first instantiate one it will be empty and won't perform any queries against the database until you tell it to. Any entities you load through it will (usually) be cached within the context, though, so the context uses more and more memory with every query you run and can become very large over time.
For that reason, and because contexts are relatively cheap to instantiate, it's a good idea to only keep them alive while you actually need them, and dispose of them as soon as you're done. This is part of the "unit of work" pattern -- basically using a new context for each set of operations that go together as one unit or transaction.
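A minimal sketch of that pattern (BloggingContext and Post are example types of my own, not something from the question):
public void RenamePost(int postId, string newTitle)
{
    using (var db = new BloggingContext())
    {
        var post = db.Posts.Find(postId); // only this one record is loaded
        post.Title = newTitle;
        db.SaveChanges();
    } // the context, and the entities it cached, are released here
}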
Edited to add:
If you're performing read-only queries (i.e. you just want to display data, you don't need to make changes and save them back to the database), you might check out non-tracking queries (e.g. the .AsNoTracking() method if you're using a DbContext/DbSet, or the MergeOption.NoTracking property if you're using an ObjectContext/ObjectSet) -- that will avoid caching the results in the context, increasing performance and reducing memory use.
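For instance, a read-only query with a DbContext might look like this (BloggingContext and Posts are assumed example types; AsNoTracking() needs a using System.Data.Entity; directive in EF 5/6):
using (var db = new BloggingContext())
{
    // The returned posts are not cached or change-tracked by the context.
    var recentPosts = db.Posts
                        .AsNoTracking()
                        .OrderByDescending(p => p.PublishedOn)
                        .Take(20)
                        .ToList();
}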

Database Design Dilemma

I am creating a simple DB application for reports. According to DB design theory, you should never store the same information twice. This makes sense for most DB applications, but I need something where you can select a generic topic and then either keep the new instance copy of that generic topic untouched or change its information; modifying the instance copy should not modify the generic topic, but the relationship between the original topic and the instance copy needs to be tracked.
Confusing, I know. Here is a diagram that may help:
I need the report to be immutable or mutable depending on the situation.
A quick example: you select a customer, then you finish your report. A month later the customer's phone number changes, so you update the customer portion of the DB, but you do not want to pull up the finished report and have the new information appear in the already completed report.
What would be the most elegant solution to this scenario?
This may work:
But with this approach I would find myself using loops and if statements to identify the relationships between Generic, Checked Off and Report.
for (NSManagedObject *managedObject in checkedOffTaskObjects) {
    if ([[reportObject valueForKeyPath:@"tasks"] containsObject:managedObject]) {
        if ([[managedObject valueForKeyPath:@"tasks"] containsObject:genericTaskObjectAtIndexPath]) {
            cell.backgroundView = [[[UIImageView alloc] initWithImage:[UIImage imageNamed:@"cellbackground.png"]] autorelease];
        }
    }
}
I know a better solution exists, but I cannot see it.
Thank you for your time.
It's tricky to be very precise without knowing much about what exactly you're modelling, but here goes...
As you've noted, there are at least two strategies to get the "mutable instance copies of a prototype" functionality you want:
1) When creating an instance based on a prototype, completely copy the instance data from the prototype. No link between them thereafter.
PRO: faster access to the instance data with less logic involved.
CON 1: Any update to your prototype will not make it into the instances. e.g. if you have the address of a company wrong in the prototype.
CON 2: you're duplicating database data -- to a certain extent -- wasteful if you have huge records.
2) When creating an instance based on a prototype, store a reference to the 'parent' record, i.e. the prototype, and then only store updated fields in the actual instance.
PRO 1: Updates to prototype get reflected in all instances.
PRO 2: More efficient use of storage space (less duplication of data)
CON: more logic around pulling an instance from the database (a rough sketch of what that might look like follows below).
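For illustration, here is a rough, database-agnostic C# sketch of strategy 2 (the type and property names are my own, not taken from your Core Data model): the instance stores only overridden fields and falls back to its prototype for everything else.
public class TopicPrototype
{
    public string Name { get; set; }
    public string Phone { get; set; }
}

public class TopicInstance
{
    public TopicPrototype Prototype { get; set; }

    // Null means "not overridden, use the prototype's value".
    public string NameOverride { get; set; }
    public string PhoneOverride { get; set; }

    public string Name  { get { return NameOverride  ?? Prototype.Name;  } }
    public string Phone { get { return PhoneOverride ?? Prototype.Phone; } }
}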
In summary: there's no magical solution I can think of that gets you the best of both worlds. They're both valid strategies, depending on your exact problem and constraints (runtime speed versus storage size, for example).
If you go for 2), I certainly don't think it's a disaster -- particularly if you model things well and work out the most efficient way to structure things in Core Data.

Lazy and Deferred TreeViewer questions

I actually have two questions, but they are kind of related, so here they go as one...
How do I ensure garbage collection of tree nodes that are not currently displayed when using TreeViewer(SWT.VIRTUAL) and ILazyTreeContentProvider?
If a node has 5000 children, once they are displayed by the viewer they are never let go, hence an Out of Memory Error if your tree has a great number of nodes and leaves and the heap size is not big enough.
Is there some kind of best practice for avoiding memory leaks caused by a never-closed view holding a tree viewer with great amounts of data (hundreds of thousands of objects, or even millions)?
Perhaps there is some callback interface which allows greater flexibility with viewer/content provider elements?
Is it possible to combine deferred (DeferredTreeContentManager) AND lazy (ILazyTreeContentProvider) loading for a single TreeViewer(SWT.VIRTUAL)?
As far as I understand from looking at the examples and APIs, it is only possible to use one or the other at a given time, but not both in conjunction, e.g. fetch ONLY the visible children for a given node AND fetch them in a separate thread using the Job API. What bothers me is that the deferred approach loads ALL children. Although it does so in a different thread, it still loads all elements even though only a minimal subset is displayed at once.
I can provide code examples for my questions if required...
I am currently struggling with these myself, so if I manage to come up with something in the meantime I will gladly share it here.
Thanks!
Regards,
Svilen
I find the Eclipse framework sometimes schizophrenic. I suspect that the DeferredTreeContentManager as it relates to the ILazyTreeContentProvider is one of these cases.
In another example, at EclipseCon this past year they recommended that you use adapter factories (IAdapterFactory) to adapt your models to the binding context needed at the time. For example, if you want your model to show up in a tree, do it this way.
treeViewer = new TreeViewer(parent, SWT.BORDER);
IAdapterFactory adapterFactory = new AdapterFactory();
Platform.getAdapterManager().registerAdapters(adapterFactory, SomePojo.class);
treeViewer.setLabelProvider(new WorkbenchLabelProvider());
treeViewer.setContentProvider(new BaseWorkbenchContentProvider());
Register your adapter and the BaseWorkbenchContentProvider will find the adaption in the factory. Wonderful. Sounds like a plan.
"Oh by-the-way, when you have large datasets, please do it this way", they say:
TableViewer tableViewer = new TableViewer(parent, SWT.VIRTUAL);
// skipping the noise
tableViewer.setItemCount(100000);
tableViewer.setContentProvider(new LazyContentProvider());
tableViewer.setLabelProvider(new TableLabelProvider());
tableViewer.setUseHashlookup(true);
tableViewer.setInput(null);
It turns out that the first and second examples are not only incompatible, but mutually exclusive. These two approaches were probably implemented by different teams that didn't have a common plan, or maybe the API is in the middle of a transition to a common framework. Nevertheless, you're on your own.