Where should data be transformed for the database?

Where should data be transformed for the database? - forms

Where should data be transformed for the database? I believe this is called data normalization/sanitization.
In my database I have user created shops- like an Etsy. Let's say a user inputs a price for an item in their shop as "1,000.00". But my database stores the prices as an integer/pennies- "100000". Where should "1,000.00" be converted to "100000"?
These are the two ways I thought of.
In the frontend: The input data is converted from "1,000.00" to "100000" in the frontend before the HTTP request. In this case, the backend would validate that the price it is an integer.
In the backend: "1,000.00" is sent to the backend as is, then the backend validates that it is a price format, then the backend converts the price to an integer "100000" before being stored in the database.
It seems either would work but is one way better than the other or is there a standard way? I would think the second way is best to reduce code duplication since there is likely to be multiple frontends - mobile, web, etc- and one backend. But the first way also seems cleaner- just send what the api needs.
I'm working with a MERN application if that makes any difference, but I would think this would be language agnostic.

I would go with a mix between the two (which is a best practice, AFAIK).
The frontend has to do some kind of validation anyway, because it you don't want to wait for the backend to get the validation response. I would add the actual conversion here as well.
The validation code will be adapted to each frontend I guess, because each one has a different framework / language that it uses, so I don't necessarily see the code duplication here (the logic yes, but not the actual code). As a last resort, you can create a common validation library.
The backend should validate again the converted value sent by the frontends, just as a double check, and then store it in the database. You never know if the backend will be integrated with other components in the future and input sanitization is always a best practice.

Short version
Both ways would work and I think there is no standard way. I prefer formatting in the frontend.
Long version
What I do in such cases is to look at my expected business requirements and make a list of pros and cons. Here are just a few thoughts about this.
In case, you decide for doing every transformation of the price (formatting, normalization, sanitization) in the frontend, your backend will stay smaller and you will have less endpoints or endpoints with less options. Depending on the frontend, you can choose the perfect fit for you end user. The amount of code which is delivered will stay smaller, because the application can be cached and makes all the formatting stuff.
If you implement everything in the backend, you have full control about which format is delivered to your users. Especially when dealing with a lot of different languages, it could be helpful to get the correct display value directly from the server.
Furthermore, it can be helpful to take a look at some different APIs of well-known providers and how these handle prices.
The Paypal API uses an amount object to transfer prices as decimals together with a currency code.
"amount": {
"value": "9.87",
"currency": "USD"
}
It's up to you how to handle it in the frontend. Here is a link with an example request from the docs:
https://developer.paypal.com/docs/api/payments.payouts-batch/v1#payouts_post
Stripe uses a slightly different model.
{
unit_amount: 1600,
currency: 'usd',
}
It has integer values in the base unit of the currency as the amount and a currency code to describe prices. Here are two examples to make it more clear:
https://stripe.com/docs/api/prices/create?lang=node
https://stripe.com/docs/checkout/integration-builder
In both cases, the normalization and sanitization has to be done before making requests. The response will also need formatting before showing it to the user. Of course, most of these requests are done by backend code. But if you look at the prebuilt checkout pages from Stripe or Paypal, these are also using normalized and sanitized values for their frontend integrations: https://developer.paypal.com/docs/business/checkout/configure-payments/single-page-app
Conclusion/My opinion
I would always prefer keeping the backend as simple as possible for security reasons. Less code (aka endpoints) means a smaller attack surface. More configuration possibilities means a lot more effort to make the application secure. Furthermore, you could write another (micro)service which overtakes some transformation tasks, if you have a business requirement to deliver everything perfectly formatted from the backend. Example use cases may be if you have a preference for backend code over frontend code (think about your/your team's skills), if you want to deploy a lot of different frontends and want to make sure that they all use a single source of truth for their display values or maybe if you need to fulfill regulatory requirements to know exactly what is delivered to your user.
Thank you for your question and I hope I have given you some guidance for your own decision. In the end, you will always wish you had chosen a different approach anyway... ;)

For sanitisation, it has to be on the back end for security reasons. Sanitisation is concerned with ensuring only certain field's values from your web form are even entertained. It's like a guest list at an exclusive club. It's concerned not just with the field (e.g. username) but also the value (e.g. bobbycat, or ); DROP TABLE users;). So it's about ensuring security of your database.
Normalisation, on the other hand, is concerned with transforming the data before storing them in the database. It's the example you brought up: 1,000 to 1000 because you are storing it as integers without decimals in the database.
Where does this code belong? I think there's no clear winner because it depends on your use case.
If it's a simple matter like making sure the value is an integer and not a string, you should offload that to the web form (I.e. the front end), since forms already have a "type" attribute to enforce these.
But imagine a more complicated scenario. Let's say you're building an app that allows users to construct a Facebook ads campaign (your app being the third party developer app, like Smartly.io). That means there will be something like 30 form fields that must be filled out before the user hits "create campaign". And the value in some form fields affect the validity of other parts of the form.
In such a situation, it might make sense to put at least some of the validation in the back end because there is a series of operations your back end needs to run (like create the Post, then create the Ad) in order to ensure validity. It wouldn't make sense for you to code those validations and normalisations in the front end.
So in short, it's a balance you'll need to strike. Offload the easy stuff to the front end, leveraging web APIs and form validations. Leave the more complex normalisation steps to the back end.
On a related note, there's a broader concept of ETL (extract, transform, load) that you'd use if you were trying to consume data from another service and then transforming it to fit the way you store info in your own database. In that case, it's usually a good idea to keep it as a repository on its own - using something like Apache Airflow to manage and visualise the cron jobs.

Related

REST API Design - Single General Endpoint or Many Specific endpoints

This is a relatively subjective question, but I want to get other people's opinion nonetheless
I am designing a REST Api that will be accessed by internal systems (a couple of clients apps at most).
In general the API needs to update parameters of different car brands. Each car brand has around 20 properties, some of which are shared between all car brands, and some specific for each brand.
I am wondering what is a better approach to the design for the endpoints of this API.
Whether I should use a single endpoint, that takes in a string - that is a JSON of all the properties of the car brand, along with an ID of the car brand.
Or should I provide a separate endpoint per car brand, that has a body with the exact properties necessary for that car brand.
So in the first approach I have a single endpoint that has a string parameter that I expect to be a JSON with all necessary values
PUT /api/v1/carBrands/
Whereas in the second approach in the second scenario I have an endpoint per type of car brand, and each endpoint has a typed dto object representing all the values it needs.
PUT /api/v1/carBrand/1
PUT /api/v1/carBrand/2
.
.
.
PUT /api/v1/carBrand/n
The first approach seems to save a lot of repetitive code - afterall the only difference is the set of parameters. However, since this accepts an arbitrary string, there is no way for the enduser to know what he should pass - he will need someone to tell it to him and/or read from documentation.
The second approach is a lot more readable, and any one can fill in the data, since they know what it is. But it involves mostly replicating the same code around 20 times.
Its really hard for me to pick an option, since both approaches have their drawbacks. How should I judge whats the better option

I am wondering what is a better approach to the design for the endpoints of this API.
Based on your examples, it looks as though you are asking about resource design, and in particular whether you should use one large resource, or a family of smaller ones.
REST doesn't answer that question... not directly, anyway. What REST does do is identify that caching granularity is at the resource level. If there are two pieces of information, and you want the invalidation of one to also invalidate the other, then those pieces of information should be part of the same resource, which is to say they should be accessed using the same URI.
If that's not what you want, then you should probably be leaning toward using separated resources.
I wouldn't necessarily expect that making edits to Ford should force the invalidation of my local copy of Ferrari, so that suggests that I may want to treat them as two different resources, rather than two sub-resources.
Compare
/api/v1/carBrands#Ford
/api/v1/carBrands#Ferrari
with
/api/v1/carBrands/Ford
/api/v1/carBrands/Ferrari
In the former case, I've got one resource in my cache (/api/v1/carBrands); any changes I make to it invalidate the entire resource. In the latter case, I've got two resources cached; changing one ignores the other.
It's not wrong to use one or the other; both are fine, and have plenty of history. They make different trade offs, one or the other may be a better fit for the problem you are trying to solve today.

REST design principles: Referencing related objects vs Nesting objects

My team and I we are refactoring a REST-API and I have come to a question.
For terms of brevity, let us assume that we have an SQL database with 4 tables: Teachers, Students, Courses and Classrooms.
Right now all the relations between the items are represented in the REST-API through referencing the URL of the related item. For example for a course we could have the following
{ "id":"Course1", "teacher": "http://server.com/teacher1", ... }
In addition, if ask a list of courses thought a call GET call to /courses, I get a list of references as shown below:
{
... //pagination details
"items": [
{"href": "http://server1.com/course1"},
{"href": "http://server1.com/course2"}...
]
}
All this is nice and clean but if I want a list of all the courses titles with the teachers' names and I have 2000 courses and 500 teachers I have to do the following:
Approximately 2500 queries just to read the data.
Implement the join between the teachers and courses
Optimize with caching etc, so that I will do it as fast as possible.
My problem is that this method creates a lot of network traffic with thousands of REST-API calls and that I have to re-implement the natural join that the database would do way more efficiently.
Colleagues say that this is approach is the standard way of implementing a REST-API but then a relatively simple query becomes a big hassle.
My question therefore is:
1. Is it wrong if we we nest the teacher information in the courses.
2. Should the listing of items e.g. GET /courses return a list of references or a list of items?
Edit: After some research I would say the model I have in mind corresponds mainly to the one shown in jsonapi.org. Is this a good approach?

My problem is that this method creates a lot of network traffic with thousands of REST-API calls and that I have to re-implement the natural join that the database would do way more efficiently. Colleagues say that this is approach is the standard way of implementing a REST-API but then a relatively simple query becomes a big hassle.
Your colleagues have lost the plot.
Here's your heuristic - how would you support this use case on a web site?
You would probably do it by defining a new web page, that produces the report you need. You'd run the query, you the result set to generate a bunch of HTML, and ta-da! The client has the information that they need in a standardized representation.
A REST-API is the same thing, with more emphasis on machine readability. Create a new document, with a schema so that your clients can understand the semantics of the document you return to them, tell the clients how to find the target uri for the document, and voila.
Creating new resources to handle new use cases is the normal approach to REST.

Yes, I totally think you should design something similar to jsonapi.org. As a rule of thumb, I would say "prefer a solution that requires less network calls". It's especially true if amount of network calls will be less by order of magnitude.
Of course it doesn't eliminate the need to limit the request/response size if it becomes unreasonable.
Real life solutions must have a proper balance. Clean API is nice as long as it works.
So in your case I would so something like:
GET /courses?include=teachers
Or
GET /courses?includeTeacher=true
Or
GET /courses?includeTeacher=brief|full
In the last one the response can have only the teacher's id for brief and full teacher details for full.

My problem is that this method creates a lot of network traffic with thousands of REST-API calls and that I have to re-implement the natural join that the database would do way more efficiently. Colleagues say that this is approach is the standard way of implementing a REST-API but then a relatively simple query becomes a big hassle.
Have you actually measured the overhead generated by each request? If not, how do you know that the overhead will be too intense? From an object-oriented programmers perspective it may sound bad to perform each call on their own, your design, however, lacks one important asset which helped the Web to grew to its current size: caching.
Caching can occur on multiple levels. You can do it on the API level or the client might do something or an intermediary server might do it. Fielding even mad it a constraint of REST! So, if you want to comply to the REST architecture philosophy you should also support caching of responses. Caching helps to reduce the number of requests having to be calculated or even processed by a single server. With the help of stateless communication you might even introduce a multitude of servers that all perform calculations for billions of requests that act as one cohesive system to the client. An intermediary cache may further help to reduce the number of requests that actually reach the server significantly.
A URI as a whole (including any path, matrix or query parameters) is actually a key for a cache. Upon receiving a GET request, i.e., an application checks whether its current cache already contains a stored response for that URI and returns the stored response on behalf of the server directly to the client if the stored data is "fresh enough". If the stored data already exceeded the freshness threshold it will throw away the stored data and route the request to the next hop in line (might be the actual server, might be a further intermediary).
Spotting resources that are ideal for caching might not be easy at times, though the majority of data doesn't change that quickly to completely neglect caching at all. Thus, it should be, at least, of general interest to introduce caching, especially the more traffic your API produces.
While certain media-types such as HAL JSON, jsonapi, ... allow you to embed content gathered from related resources into the response, embedding content has some potential drawbacks such as:
Utilization of the cache might be low due to mixing data that changes quickly with data that is more static
Server might calculate data the client wont need
One server calculates the whole response
If related resources are only linked to instead of directly embedded, a client for sure has to fire off a further request to obtain that data, though it actually is more likely to get (partly) served by a cache which, as mentioned a couple times now throughout the post, reduces the workload on the server. Besides that, a positive side effect could be that you gain more insights into what the clients are actually interested in (if an intermediary cache is run by you i.e.).
Is it wrong if we we nest the teacher information in the courses.
It is not wrong, but it might not be ideal as explained above
Should the listing of items e.g. GET /courses return a list of references or a list of items?
It depends. There is no right or wrong.
As REST is just a generalization of the interaction model used in the Web, basically the same concepts apply to REST as well. Depending on the size of the "item" it might be beneficial to return a short summary of the items content and add a link to the item. Similar things are done in the Web as well. For a list of students enrolled in a course this might be the name and its matriculation number and the link further details of that student could be asked for accompanied by a link-relation name that give the actual link some semantical context which a client can use to decide whether invoking such URI makes sense or not.
Such link-relation names are either standardized by IANA, common approaches such as Dublin Core or schema.org or custom extensions as defined in RFC 8288 (Web Linking). For the above mentioned list of students enrolled in a course you could i.e. make use of the about relation name to hint a client that further information on the current item can be found by following the link. If you want to enable pagination the usage of first, next, prev and last can and probably should be used as well and so forth.
This is actually what HATEOAS is all about. Linking data together and giving them meaningful relation names to span a kind of semantic net between resources. By simply embedding things into a response such semantic graphs might be harder to build and maintain.
In the end it basically boils down to implementation choice whether you want to embed or reference resources. I hope, I could shed some light on the usefulness of caching and the benefits it could yield, especially on large-scale systems, as well as on the benefit of providing link-relation names for URIs, that enhance the semantical context of relations used within your API.

Forcing web api consumers to accept new fields in responses

I'm creating v2 of an existing RESTful web api.
The responses are JSON lists of objects, roughly in the form:
[
{
name1=value1,
name2=value2,
},
{
name1=value3,
name2=value4,
}
]
One problem we've observed with v1 is that some clients will access fields by integer position, instead of by name. This means that if we decide to add fields to the response (which we had originally considered a compatibility-preserving change), then some of our client's code breaks, unless we add the fields at the end. Even then, other clients code breaks anyway, because they will fail in some way when they encounter an unexpected attribute name.
To counter this in v2, we are considering randomly reordering the fields in every response. This will force clients to index fields by name instead of by position.
Additionally, we are considering adding a randomly-named field to every response. This will force clients to ignore fields they don't recognize.
While this sounds somewhat barmy, it does have the advantage that we will be able to add new fields, safe in the knowledge that this isn't breaking any clients. This means we can issue compatible updates to v2.1, v2.3, etc at the same URL, and that means we will only have to maintain & support a smaller number of API versions.
The alternative is to issue compatibility-breaking v3, v4, at new URLs, which means that we will have to maintain & support many incompatible API versions, which will stretch us that little bit thinner.
Is this a bad idea, and if so, why? Are there any other similar ideas I should think about?
Update: The first couple of responses are pointing out that if I document the issue (i.e. indicate in the docs that fields may be added or reordered) then I am no longer to blame if client code breaks when I subsequently add or reorder fields. Sadly I don't think this is an appropriate option for us: Many dozens of organisations rely on the functionality of our APIs for real-world transactions with substantial financial impact. These organisations are not technically oriented - and the resulting implementations at the client end cover the whole spectrum of technical proficiency. We already did document that fields may get added or reordered in the docs for v1, and that clearly didn't work, because now we're having to issue v2 because many clients, due to lack of time or experience or ability, still wrote code that breaks when we add new fields. If I were now to add fields to the interface, it breaks a dozen different company's interfaces to us, which means they (and us) are bleeding money every minute. If I were to refuse to revert the change or fix it, saying "They should have read the docs!", then I will soon be out of the job, and rightly so. We may attempt to educate the 'failing' partners, but this is doomed to fail as the problem gets larger every month as we continue to grow. My question is, can I systemically head the whole issue off at the pass, preventing this situation from ever arising, no matter what clients try to do? If the techniques I suggest would work, why shouldn't I use them? Why isn't everyone else using them?

If you want your media types to be "evolvable", make that point very clear in the documentation. Similarly, if the placement order of fields is not guaranteed, make that explicitly clear too. If you supply sample code for your API, make sure it does not rely on field ordering.
However, even assuming that you have to maintain different versions of your media types, you don't have to version the URI. REST gives you the ability to maintain the same version-agnostic URI but use HTTP content negotiation (via the Accept and Content-Type headers) to offer different payloads at the same URI.
Therefore any client that doesn't explicitly wish to accept your new v2/v3/etc encoding won't get it. By default, you can return the old v1 encoding with the original field ordering and all of those brittle client apps will work fine. However, new client developers will know (thanks to your documentation) to indicate via Accept that they are willing and able to see the new fields and they don't care about their order. Best of all, you can (and should) use the same URI throughout. Remember - different payloads like this are just different representations of the same underlying resource, so the URI should be the same.

I've decided to run with the described techniques, to the max. I haven't heard any objections to them that hold any water for me. Brian's answer, about re-using the same URI for different API versions, is a solid and much-appreciated complementary idea (with upvote), but I can't award it 'best answer' because it doesn't get to the core of my original question.

Best workflow for any RESTful operation in web CRUD

As a general rule for any RESTful CRUD operation, I follow these steps:
Validating information on client-side
Sending required information in JSON format to the server (possibly a web service)
Validating information on the server
Doing the operation
Returning JSON as the result of operation
Updating DOM based on server's response
Though this list is general, I think it's the most complete list. The only problem is that, I do it for any and every operation. I mean, DRY (don't repeat yourself) tells us to stop repeating things. Is it considered a repetition? Or should we follow these steps always?

Well, you can skip validating the data client-side if you wish…
Seriously, those are the necessary minimum for doing a lot of things; you must validate server side to prevent a whole host of potential problems and the other parts are just fundamental. OK, you can skip the sending to the server, but then you're not interacting with a REST service in the first place. You could also skip updating the DOM, but then you're not showing the results. In other words, every step of that sequence serves its own purpose that is independent from the other ones: they're not redundant.
But that doesn't mean that you should ignore DRY. Not at all. Instead, you should factor out as much of that code as possible into a single place so as to keep the number of repetitions to a minimum. (Maybe even find a framework to do some of that for you.)

What ist a RESTful-resource in the context of large data sets, i.E. weather data?

So I am working on a webservice to access our weather forecast data (10000 locations, 40 parameters each, hourly values for the next 14 days = about 130 million values).
So I read all about RESTful services and its ideology.
So I understand that an URL is adressing a ressource.
But what is a ressource in my case?
The common use case is that you want to get the data for a couple of parameters over a timespan at one or more location. So clearly giving every value its own URL is not pratical and would result in hundreds of requests. I have the feeling that my specific problem doesn't excactly fit into the RESTful pattern.
Update: To clarify: There are two usage patterns of the service. 1. Raw data; rows and rows of data for several locations and parameters.
Interpreted data; the raw data calculated into symbols (Suns & clouds, for example) and other parameters.
There is not one 'forecast'. Different clients have different needs for data.
The reason I think this doesn't fit into the REST-pattern is, that while I can actually have a 'forecast' ressource, I still have to submit a lot of request parameters. So a simple GET-request on a ressource doesn't work, I end up POSTing data all over the place.

So I am working on a webservice to access our weather forecast data (10000 locations, 40 parameters each, hourly values for the next 14 days = about 130 million values). ... But what is a ressource in my case?
That depends on the details of your problem domain. Simply having a large amount of data is not a good reason to avoid REST. There are smart ways and dumb ways to model and expose that data.
As you rightly see, your main goal at this point should be to understand what exactly a resource is. Knowing only enough about weather forecasting to follow the Weather Channel, I won't be much help here. It's for domain experts like yourself to make that call.
If you were to explain in a little more detail the major domain concepts you're working with, it might make it a little easier to give specific advice.
For example, one resource might be Forecast. When weatherpeople talk about Forecasts, what words keep coming up? When you think about breaking a forecast down into smaller elements, what words do you use to describe the pieces?
Do this process recursively, and you'll probably be able to make a list of important terms. Don't forget that these terms can describe things or actions. Think about what these terms really mean, what data you can use to model them, how they can be aggregated.
At this point you'll have the makings of something you can start building a RESTful system around - but not before.
Don't forget that a RESTful system is not a data dump wrapped in HTTP - it's a hypertext-driven system.
Also don't forget that media types are the point of contact between your server and its clients. A media type is only limited by your imagination and can model datasets of any size if you're clever about it. It can contain XML, JSON, YAML, binary elements such as a Bloom Filter, or whatever works for the problem.

Firstly, there is no once-and-for-all right answer.
Each valid url is something that makes sense to query, think of them as equivalents to providing query forms for people looking for your data - that might help you narrow down the scenarios.
It is a matter of personal taste and possibly the toolkit you use, as to what goes into the basic url path and what parameters are encoded. The debate is a bit like the XML debate over putting values in elements vs attributes. It is not always a rational or logically decided issue nor will everybody be kind in their comments on your decisions.
If you are using a backend like Rails, that implies certain conventions. Even if you're not using Rails, it makes sense to work in the same way unless you have a strong reason to change. That way, people writing clients to talk to Rails-based services will find yours easier to understand and it saves you on documentation time ;-)

Maybe you can use forecast as the ressource and go deeper to fine grained services with xlink.

Would it be possible to do something like this,Since you have so many parameters so i was thinking if somehow you can relate it to a mix of id / parameter combo to decrease the url size
/WeatherForeCastService//day/hour

www.weatherornot.com/today/days/x // (where x is number of days)
www.weatherornot.com/today/9am/hours/h // (where h is number of hours)

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse