REST - should endpoints include summary data?

Simple model:
Products, which have weights (can be mixed - ounces, grams, kilograms etc)
Categories, which have many products in them
REST endpoints:
/products - get all products and post to add a new one
/products/id - delete,update,patch a single product
/categories/id - delete,update,patch a single category
/categories - get all categories and post to add a new one
The question: the frontend client wants to display a chart of the total weight of products in ALL categories. Imagine a bar chart or a pie chart showing all categories and the total weight of products in each.
Obviously the product model has a 'weight_value' and a 'weight_unit' so you know the weight of a product and its measure (oz, grams etc).
I see 2 ways of solving this:
1. In the category model, add a computed field that totals the weight of all the products in that category. The total is calculated on the fly (not stored) and so is always up to date. Note that the client always needs all categories anyway (e.g. to populate the drop-down for choosing the category a new product belongs to), so it will now automatically always have the total weight of each category too. Constructing the chart is easy - no need to fetch the chart data from the server, it's already on the client. The first load of that data might be slow, though, and the totals go stale when a product is added (an insignificant change to the overall number, and stale totals are fine anyway).
2. Create a separate endpoint, say /categories/totals, that returns, for every category: its id, name and total weight. This endpoint would loop through all the categories, calculate the weight for each, and return a category dataset with the weights.
The problem I see with (1) is that it is not that performant. I know it's not very scalable, especially when a category ends up with a lot of products (millions!) but this is a hobby project so not worried about that.
The advantage of (1) is that you always have the data on the client and don't have to make a separate request to get the data for the chart.
However, the advantage of (2) is that a request to /categories/id no longer incurs a potentially expensive totalling operation (that work now lives in its own dedicated endpoint). Of course, that endpoint will have to run quite a complex database query to calculate the weights of all products across all categories (although handing that off to the database is the way to go - that is what databases are good for!).
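Something like this is roughly the query I have in mind (a sqlite3 sketch; the table and column names just follow the models above, and the conversion factors cover only the units mentioned):

import sqlite3

# Normalise every product's weight to grams, then sum per category.
conn = sqlite3.connect("shop.db")
rows = conn.execute("""
    SELECT c.id, c.name,
           SUM(p.weight_value *
               CASE p.weight_unit
                   WHEN 'oz' THEN 28.3495   -- ounces to grams
                   WHEN 'kg' THEN 1000.0    -- kilograms to grams
                   ELSE 1.0                 -- already grams
               END) AS total_weight_g
    FROM categories c
    LEFT JOIN products p ON p.category_id = c.id
    GROUP BY c.id, c.name
""").fetchall()

for category_id, name, total in rows:
    print(category_id, name, total or 0.0)  # empty categories sum to NULL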
I'm really stumped on which is the better way to do this. Which way is more true to the RESTful way (HATEOAS, basically)?

I would go with 2, favouring scalability and best practices. It does not really make sense to perform the weight calculation every time a category is requested, and even though you may not anticipate this being a problem since it is a 'hobby' project, it's always best to avoid shortcuts where the benefits are minimal (or so experience has taught me!).
Choosing 1, the only advantage is that you set up one less endpoint and save one extra call to get the weight total - but the extra call shouldn't add much overhead, and setting up the extra endpoint shouldn't take much effort.
Regardless of whether you choose 1 or 2, I would consider caching the weight total (for a reasonable amount of time, depending on the accuracy required) to increase performance even further.
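To make that concrete, a minimal time-based cache - compute_totals() is a hypothetical stand-in for the expensive totalling query, and the TTL is an arbitrary choice:

import time

_cache = {"value": None, "expires": 0.0}
TTL_SECONDS = 60  # tune to the accuracy the chart actually needs

def cached_totals():
    now = time.time()
    if _cache["value"] is None or now >= _cache["expires"]:
        _cache["value"] = compute_totals()  # recompute only after expiry
        _cache["expires"] = now + TTL_SECONDS
    return _cache["value"]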

Related

How to handle static data in ES/CQRS?

After reading dozens of articles and watching hours of videos, I don't seem to get an answer to a simple question:
Should static data be included in the events of the write/read models?
Let's take the oh-so-common "orders" example.
In all examples you'll likely see something like:
class Event: ...

class OrderCreated(Event):
    ...  # order-level fields elided

class LineAdded(Event):
    itemID: str
    itemCount: int
    itemPrice: float
But in practice, you will also have lots of "static" data (products, locations, categories, vendors, etc).
For example, we have a STATIC products table, with their SKUs, description, etc. But in all examples, the STATIC data is never part of the event.
What I don't understand is this:
Command side: should the STATIC data be included in the event? If so, which data? Should the entire "product" record be included? But a product also has a category and a vendor. Should their data be in the event as well?
Query side: should the STATIC data be included in the model/view? Or can/should it be JOINED with the static table when an actual query is executed?
If static data is NOT part of the event, then the projector cannot add it to the read model, which implies that the query MUST use joins.
If static data IS part of the event, then let's say we change something in the products table (e.g. typo in the item description), this change will not be reflected in the read model.
So, what's the right approach to using static data with ES/CQRS?
Should static data be included in the events of the write/read models?
"It depends".
First thing to note is that ES/CQRS are a distraction from this question.
CQRS is simply the creation of two objects where there was previously only one. -- Greg Young
In other words, CQRS is a response to the idea that we want to make different trade offs when reading information out of a system than when writing information into the system.
Similarly, ES just means that the data model should be an append-only sequence of immutable documents that describe changes of information.
Storing snapshots of your domain entities (be that a single document in a document store, or rows in a relational database, or whatever) has to solve the same problems with "static" data.
For data that is truly immutable (the ratio of a circle's circumference and diameter is the same today as it was a billion years ago), pretty much anything works.
When you are dealing with information that changes over time, you need to be aware of the fact that the answer changes depending on when you ask it.
Consider:
Monday: we accept an order from a customer
Tuesday: we update the prices in the product catalog
Wednesday: we invoice the customer
Thursday: we update the prices in the product catalog
Friday: we print a report for this order
What price should appear in the report? Does the answer change if the revised prices went down rather than up?
Recommended reading: Helland 2015
Roughly, if you are going to need now's information later, then you need to either (a) write the information down now or (b) write down the information you'll need later to look up now's information (ex: id + timestamp).
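For illustration, using event classes in the style of the question (the field names are illustrative only, not prescriptive):

from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class LineAdded:             # (a) write the information down now
    itemID: str
    itemCount: int
    itemPrice: float         # the price as it was at order time

@dataclass(frozen=True)
class LineAddedByReference:  # (b) write down how to look it up later
    itemID: str
    itemCount: int
    pricedAt: datetime       # id + timestamp to query the catalog as-of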
Furthermore, in a distributed system, you'll need to think about the implications when part of the system is unavailable (ex: what happens if we are trying to invoice, but the product catalog is unavailable? can we cache the data ahead of time?)
Sometimes, this sort of thing can turn into a complete tangle until you discover that you are missing some domain concept (the invoice depends on a price from a quote, not the catalog price) or that you have your service boundaries drawn incorrectly (Udi Dahan talks about this often).
So the "easy" part of the answer is that you should expect time to be a concept you model in your solution. After that, it gets context sensitive very quickly, and discovering the "right" answer may involve investigating subtle questions.

Storing parameters for rules

I am using Red Hat Decision Manager 7.1 (Drools) to create a rule for assigning a case to a department. The rule itself is quite simple; however, it requires quite a lot of parameters (~12), like the agent type, working area, case type, customer seniority and more. The resulting "action" is the department to which the case is assigned.
I tried to place the parameters in a decision table, but the table quickly bloated to over 15,000 rows and will probably get even larger than that. I did, however, notice that in many cases the difference between two rows is one or two parameters (e.g. the same row where the only difference is agent type "Local" vs. "Regional"), resulting in a different assignment.
I am thinking of replacing the table with something else, like a tree structure, so I can group similar rows under the same node and then navigate the tree to make the decision. To do this I plan to prioritize the parameters and give parameters with higher priority a higher place in the tree.
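Sketched in Python (not Drools) just to illustrate the idea - the parameter names, values and departments here are all made up:

PRIORITY = ["agent_type", "working_area", "case_type"]

# "*" is a wildcard, so rows that differ only in low-priority
# parameters collapse into a single branch.
DECISION_TREE = {
    "Local": {
        "North": {"*": "Dept A"},                 # any case type
        "*": {"Claim": "Dept B", "*": "Dept C"},  # any other area
    },
    "Regional": {
        "*": {"*": "Dept D"},
    },
}

def assign(case: dict) -> str:
    node = DECISION_TREE
    for param in PRIORITY:
        # exact value first, wildcard as fallback
        node = node.get(case.get(param), node.get("*"))
        if node is None:
            raise KeyError(f"no rule matches {case}")
        if not isinstance(node, dict):
            return node
    return node

# assign({"agent_type": "Local", "working_area": "North",
#         "case_type": "Claim"}) returns "Dept A"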
Does anyone have experience with such a problem? I looked at decision trees, but they focus more on ML and probabilities, so I'm not sure this is what I need.
Is there any other method to deal with bloated tables that become unmanageable? I cannot go to our customer and ask them to maintain a 15,000-row Excel sheet. They'll shoot me there and then.
Thanks
Alon.

REST design principles: Referencing related objects vs Nesting objects

My team and I are refactoring a REST-API and I have come to a question.
For terms of brevity, let us assume that we have an SQL database with 4 tables: Teachers, Students, Courses and Classrooms.
Right now all the relations between the items are represented in the REST-API by referencing the URL of the related item. For example, for a course we could have the following:
{ "id":"Course1", "teacher": "http://server.com/teacher1", ... }
In addition, if I ask for a list of courses through a GET call to /courses, I get a list of references as shown below:
{
    ... // pagination details
    "items": [
        {"href": "http://server1.com/course1"},
        {"href": "http://server1.com/course2"},
        ...
    ]
}
All this is nice and clean, but if I want a list of all the course titles with the teachers' names, and I have 2000 courses and 500 teachers, I have to:
1. make approximately 2500 queries just to read the data;
2. implement the join between the teachers and courses myself;
3. optimize with caching etc., so that it runs as fast as possible.
My problem is that this method creates a lot of network traffic with thousands of REST-API calls and that I have to re-implement the natural join that the database would do way more efficiently.
Colleagues say that this approach is the standard way of implementing a REST-API, but then a relatively simple query becomes a big hassle.
My question therefore is:
1. Is it wrong if we nest the teacher information in the courses?
2. Should the listing of items e.g. GET /courses return a list of references or a list of items?
Edit: After some research I would say the model I have in mind corresponds mainly to the one shown in jsonapi.org. Is this a good approach?
My problem is that this method creates a lot of network traffic with thousands of REST-API calls and that I have to re-implement the natural join that the database would do way more efficiently. Colleagues say that this approach is the standard way of implementing a REST-API, but then a relatively simple query becomes a big hassle.
Your colleagues have lost the plot.
Here's your heuristic - how would you support this use case on a web site?
You would probably do it by defining a new web page that produces the report you need. You'd run the query, use the result set to generate a bunch of HTML, and ta-da! The client has the information it needs in a standardized representation.
A REST-API is the same thing, with more emphasis on machine readability. Create a new document, with a schema so that your clients can understand the semantics of the document you return to them, tell the clients how to find the target URI for the document, and voila.
Creating new resources to handle new use cases is the normal approach to REST.
Yes, I totally think you should design something similar to jsonapi.org. As a rule of thumb, I would say "prefer a solution that requires fewer network calls". That's especially true if the number of network calls drops by an order of magnitude.
Of course it doesn't eliminate the need to limit the request/response size if it becomes unreasonable.
Real life solutions must have a proper balance. Clean API is nice as long as it works.
So in your case I would do something like:
GET /courses?include=teachers
Or
GET /courses?includeTeacher=true
Or
GET /courses?includeTeacher=brief|full
In the last one, the response would have only the teacher's id for brief and full teacher details for full.
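As a rough sketch of how such an include parameter could be handled on the server (Flask here; load_courses() is a hypothetical data-access helper and the field names are made up):

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/courses")
def list_courses():
    include = request.args.get("include", "")
    items = []
    for course in load_courses():  # hypothetical data-access helper
        item = {"id": course.id, "title": course.title,
                "teacher": f"/teachers/{course.teacher_id}"}  # reference only
        if "teachers" in include:
            # embed the related representation instead of just linking it
            item["teacher"] = {"id": course.teacher_id,
                               "name": course.teacher_name}
        items.append(item)
    return jsonify({"items": items})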
My problem is that this method creates a lot of network traffic with thousands of REST-API calls and that I have to re-implement the natural join that the database would do way more efficiently. Colleagues say that this approach is the standard way of implementing a REST-API, but then a relatively simple query becomes a big hassle.
Have you actually measured the overhead generated by each request? If not, how do you know that the overhead will be too high? From an object-oriented programmer's perspective it may sound bad to perform each call on its own; your design, however, lacks one important asset that helped the Web grow to its current size: caching.
Caching can occur on multiple levels. You can do it on the API level, the client might do something, or an intermediary server might do it. Fielding even made it a constraint of REST! So, if you want to comply with the REST architectural philosophy, you should also support caching of responses. Caching helps to reduce the number of requests that have to be calculated or even processed by a single server. With the help of stateless communication you might even introduce a multitude of servers that all perform calculations for billions of requests and act as one cohesive system to the client. An intermediary cache may further reduce the number of requests that actually reach the server significantly.
A URI as a whole (including any path, matrix or query parameters) is actually a key for a cache. Upon receiving a GET request, an intermediary checks whether its current cache already contains a stored response for that URI, and returns the stored response on behalf of the server directly to the client if the stored data is "fresh enough". If the stored data has already exceeded the freshness threshold, the intermediary throws the stored data away and routes the request to the next hop in line (which might be the actual server, or a further intermediary).
Spotting resources that are ideal for caching might not be easy at times, though the majority of data doesn't change so quickly that caching should be neglected entirely. Thus it should be of general interest, at least, to introduce caching, especially the more traffic your API produces.
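For illustration, making a response cacheable only takes standard HTTP headers - a rough Flask sketch, where load_course() and the max-age value are assumptions:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/courses/<course_id>")
def get_course(course_id):
    resp = jsonify(load_course(course_id))  # hypothetical helper
    # clients and intermediaries may reuse this response for 5 minutes
    resp.headers["Cache-Control"] = "public, max-age=300"
    return resp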
While certain media-types such as HAL JSON, jsonapi, ... allow you to embed content gathered from related resources into the response, embedding content has some potential drawbacks, such as:
- utilization of the cache might be low due to mixing data that changes quickly with data that is more static
- the server might calculate data the client won't need
- one server calculates the whole response
If related resources are only linked to instead of directly embedded, a client will of course have to fire off a further request to obtain that data, though that request is more likely to be (at least partly) served by a cache, which, as mentioned a couple of times throughout this post, reduces the workload on the server. Besides that, a positive side effect could be that you gain more insight into what the clients are actually interested in (e.g. if an intermediary cache is run by you).
Is it wrong if we nest the teacher information in the courses?
It is not wrong, but it might not be ideal, as explained above.
Should the listing of items e.g. GET /courses return a list of references or a list of items?
It depends. There is no right or wrong.
As REST is just a generalization of the interaction model used in the Web, basically the same concepts apply to REST as well. Depending on the size of the "item", it might be beneficial to return a short summary of the item's content and add a link to the full item; similar things are done in the Web as well. For a list of students enrolled in a course, this might be the name and matriculation number, while further details of each student could be asked for by following a link - accompanied by a link-relation name that gives the link some semantic context, which a client can use to decide whether invoking that URI makes sense or not.
Such link-relation names are either standardized by IANA, taken from common approaches such as Dublin Core or schema.org, or custom extensions as defined in RFC 8288 (Web Linking). For the above-mentioned list of students enrolled in a course, you could e.g. make use of the about relation name to hint to a client that further information on the current item can be found by following the link. If you want to enable pagination, the relation names first, next, prev and last can and probably should be used as well, and so forth.
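Attaching those pagination relation names to a collection response can be quite simple - a small sketch in Python, where the URI scheme is an assumption:

def pagination_links(base: str, page: int, last_page: int) -> dict:
    # relation names from the IANA registry, as per RFC 8288
    links = {"first": f"{base}?page=1", "last": f"{base}?page={last_page}"}
    if page > 1:
        links["prev"] = f"{base}?page={page - 1}"
    if page < last_page:
        links["next"] = f"{base}?page={page + 1}"
    return links

# pagination_links("/courses", page=2, last_page=9) returns
# {'first': '/courses?page=1', 'last': '/courses?page=9',
#  'prev': '/courses?page=1', 'next': '/courses?page=3'}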
This is actually what HATEOAS is all about: linking data together and giving the links meaningful relation names to span a kind of semantic net between resources. By simply embedding things into a response, such semantic graphs might be harder to build and maintain.
In the end it basically boils down to an implementation choice whether you want to embed or reference resources. I hope I could shed some light on the usefulness of caching and the benefits it can yield, especially in large-scale systems, as well as on the benefit of providing link-relation names for URIs, which enhance the semantic context of the relations used within your API.

Updating redundant/denormalized data in NoSQL (Aerospike)

I have a problem where I need to update data that has been denormalized (as it tends to be in NoSQL), because a single update to one piece of data needs to be applied to all of its redundant copies.
For example: consider an e-commerce database where there is one table, "Products", which contains all the details about a product - let's say name, imageName, LogoImage.
Now the LogoImage of various "Products" entries can be the same, and when I need to update the LogoImage I have to update every entry that contains the given LogoImage - which seems like a very poor solution.
So is there any better way to do that?
P.S.: If we separate logos and Products into 2 different tables, then when I need to get 1000 products at a time, I need to get the related logos by implementing a client-level join kind of thing, which is also not a good solution.
You're suggesting using the database as your CDN and storing the binary image in it? That's not a great approach, in my opinion. You should be storing that image in an actual CDN like Amazon CloudFront, or a simple one like Amazon S3, or on your own webserver as a file. Whichever it is, the point is that you should refer to it by URI. In Aerospike you would store the metadata about that image, not the image itself.
Next, you can have two sets - prod for products and prodimg for product images. The various products store a list of IDs referring to the product image set. The product image set has metadata about each image as a separate record { uri, name, title, width, length, ... }. If anything changes about this image, you just update the one record with the metadata for that image in prodimg. No need to change anything about the products.
And you don't really need JOIN functionality in this case. Your application can get the prod record first, and use the bin (images) that has all the IDs of the images for the product (each referring to a key of a record in prodimg). You can then issue either a few get operations (reads) or a single batch-read for all of them if there are many. The latencies for Aerospike are such that this will return faster and scale better than an equivalent JOIN in an RDBMS. A batch-read is a multi-node, multi-core, multi-threaded operation. A cluster of 3 multi-core nodes has plenty of parallel computing power.
Again, if you "need 1000 products at a time" use batch-read. In the Java client that's an AerospikeClient.get() with a list of Key objects. In the Python client that's an aerospike.Client.get_many. Every Aerospike client has batch-read functionality.
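A minimal sketch of that read pattern with the Python client - the namespace, set and bin names here are assumptions, not anything prescribed by Aerospike:

import aerospike

# Connect; the host address is a placeholder.
client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

# 1. Read the product record, whose 'images' bin holds the image IDs.
_, _, product = client.get(("test", "prod", "prod-1001"))

# 2. Batch-read all referenced image metadata in one round trip.
image_keys = [("test", "prodimg", img_id) for img_id in product["images"]]
for _, meta, bins in client.get_many(image_keys):
    if meta is not None:  # keys that don't exist come back with no metadata
        print(bins["uri"], bins["name"])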

Core Data - max number of entities

I am working on an app where I use Core Data.
I already tried to do it with one entity, but that didn't work.
I now have around twenty entities, and my question is: is there a limit to the number of entities, or a recommended number?
Is there a better way to store that amount of data?
UPDATE:
What I am storing are grades from school - not the letters A, B, C, D, E, F, but a number from 1 to 10. And each grade has its own weighting (the number of times it counts); for example, some grades count twice because they are more important.
So I first thought of having an array with a string for the name of the subject and then two arrays, one storing the grades and the other the corresponding weightings.
Like this:
var subjects: [String,[Int],[Int]]
but this isn't possible, and I don't even know how I would put this into Core Data and get it back properly.
Because I couldn't figure it out, I thought of just making an entity for each subject, but there are a lot of them - hence this question.
There's no limit to the number of entities, but it's possible to go overboard and create more than you actually need. The recommended number is "as many as you need and no more", which obviously will vary a great deal depending on the nature of the data and how the app uses it. Whether there's a better way than your current approach is totally dependent on the fine details of exactly what you're doing, and so is impossible to answer without a far more detailed question.
You could set up a Subject entity that has one-to-many relationships to ordered sets of Grade and Weight entities. However, since each grade apparently has a corresponding weight, it would be more accurate to store each grade's weight in the Grade entity itself.
This still may not represent your real-world model.
If your subject is something general, like math or english, you could have more than one subject per grade (e.g., algebra, geometry, trigonometry), or more than one level per subject (e.g., algebra 1, algebra 2), which may or may not have a different grade.
If your subject is very specific, your data may end up spread across unique one-to-one relationships, instead of one-to-many relationships.
You would also need to consider whether you can use ordered or unordered relationships, or whether an attribute exists that you can use to sort an entity.
You should consider these different facets of what you're trying to model (as well as the specific fetches you'd want to perform), before you try to design or implement the model, to allow you to efficiently represent this particular object graph.