Validate entity fields using Snowplow Micro - snowplow

According to this Snowplow Micro blog post, you can validate:
The value of specific fields sent with specific events is as expected
The correct contexts / entities are sent with the appropriate events
However, it doesn’t look like it is possible to see any detail about what values were passed for the attached entities.
This means that Micro is good for validating certain events were logged and that entities were attached, but we can’t verify anything about the attached entities beyond their existence. If, as part of an automated QA process, we want to validate that when an entity has a particular property set another property is also set, how should we go about achieving that?

Credit to Paul Boocock on Discourse:
In the parameters object, the cx property represents the contexts, but they are Base64 encoded. If you decode this you will get another JSON object containing the entities.
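A minimal sketch of that decoding step in Python, for use in an automated check. It assumes the cx value is URL-safe Base64 (possibly with the "=" padding stripped) and that the decoded object wraps the entities in a "data" array; the helper names decode_cx, entities_of and check_dependent_property are hypothetical, not part of Snowplow Micro.

```python
import base64
import json


def decode_cx(cx: str) -> dict:
    """Decode a Base64-encoded cx parameter into the contexts JSON object."""
    # Restore any stripped "=" padding so the decoder accepts the string.
    padded = cx + "=" * (-len(cx) % 4)
    return json.loads(base64.urlsafe_b64decode(padded).decode("utf-8"))


def entities_of(cx: str) -> list:
    """Return the attached entities (assumes they sit in a "data" array)."""
    return decode_cx(cx).get("data", [])


def check_dependent_property(entity: dict, if_set: str, then_required: str) -> bool:
    """False only when `if_set` is present in the entity data but `then_required` is not."""
    data = entity.get("data", {})
    return if_set not in data or then_required in data
```

A QA test could then fetch the logged events from Micro, pass each event's cx value through entities_of, and assert check_dependent_property on the entity it cares about.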

Related

REST API Design: Path variable vs request body on UPDATE (Best practices)

When creating an UPDATE endpoint to change a resource, the ID should be set in the path variable and also in the request body.
Before updating a resource, I check if the resource exists, and if not, I would respond with 404 Not Found.
Now I ask myself which of the two values I should use, and whether I should check that both are the same.
For example:
PUT /users/42
// request body
{
"id": 42,
"username": "user42"
}
You should PUT only the properties you can change into the request body and omit read-only properties. So you should check the id in the URI, because it is the only one that should exist in the message.
It is convenient to accept the "id" field in the payload, but you have to be sure it is the same as the path parameter. I solve this problem by setting the id field to the value of the path parameter (be sure to explain that in the Swagger documentation of the API). In pseudo-code:
idParam = request.getPathParam("id");
object = request.getPayload();
object.id = idParam;
So all these calls are equivalent:
PUT /users/42 + {"id":"42", ...}
PUT /users/42 + {"id":"41", ...}
PUT /users/42 + {"id":null, ...}
PUT /users/42 + {...}
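The pseudo-code above can be sketched in Python roughly as follows. The function names are hypothetical; require_matching_id is an added, stricter variant for APIs that would rather reject a contradictory body than silently overwrite it.

```python
from typing import Any, Dict


def apply_path_id(path_id: int, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the payload with "id" forced to the path parameter."""
    merged = dict(payload)
    merged["id"] = path_id  # whatever the body said, the URI wins
    return merged


def require_matching_id(path_id: int, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Stricter variant: reject a body whose "id" contradicts the URI."""
    body_id = payload.get("id")
    if body_id is not None and str(body_id) != str(path_id):
        raise ValueError(f"body id {body_id!r} does not match path id {path_id!r}")
    return apply_path_id(path_id, payload)
```

With apply_path_id, the four calls listed above all produce the same object; with require_matching_id, the second one (id "41") would be rejected, for example with a 400 response.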
Why do you need the id both in the URL and in the body? Now you have to either validate that they are the same or ignore one of them. If the duplication is a requirement for some reason, then pick which one is definitive and ignore the other. If you don't have to have this strange duplication, then I'd say pass it in the body only.
If you take a closer look at how HTTP works, you might notice that the URI a request is sent to is also used as the key for caching results. Any non-safe operation performed on that URI, such as POST, PUT or PATCH, will lead (intermediary) caches to automatically invalidate any stored responses for that URI. As such, if you use a URI other than the actual resource URI, you are bypassing that feature and risk being served outdated state from caches. As caching is one of the few constraints REST has, simply skipping all caching via certain directives isn't ideal in the first place.
With regard to including the ID of the resource or domain entity in the URI and/or in the payload: a common mistake in designing so-called REST APIs is that the domain object is mapped 1:1 onto a resource. We once had a customer who went through a merger and as a result ended up with the same products being addressed by multiple IDs. To reduce the data in their DB, they at one point tried to consolidate it, but they still had to support the old URIs they had exposed for their products. In the end they realized that exposing the product ID via the URI wasn't ideal in their situation, as it led to plenty of downstream changes that affected their customers. As such, a recommendation here is to use UUIDs that don't give the target resource any semantic meaning and don't ever change. If the product ID in the back end changes, it doesn't affect the exposed URI at all. Sure, you might need a further table/collection to map from the product to the actual resource URI, but in the end you have designed your system with the eventuality of change in mind, which it is now more likely to cope with.
I've read so many times that the product ID shouldn't be part of the resource as it is already present in the URI. First, the whole URI is the unique identifier of that resource, not only a part of it. Next, as mentioned above, IMO the product ID shouldn't be part of the URI in the first place; it should be part of the resource's state. After all, the product ID is one of the product's properties and therefore should be included there accordingly. As such, the media type exposed should contain everything a client needs to identify the product ID from the payload. The media type the resource's state is exchanged with should also provide a means to include the ID if you want to perform an update. If you take HTML as an example, here you get served an HTML form by the server which basically teaches you where to send the request to, which HTTP operation to use, which media type to marshal the request with, and the actual properties of the resource, including the ones you are not meant to change. HTML does this, for example, via hidden input fields. Other form-based media types, such as HAL forms, JsonForms or Ion, might provide other mechanisms though.
So, to sum my post up:
Don't map the product ID onto URIs. Use a mapping from product ID to UUIDs instead.
Use form-based media types that support clients in creating requests. These media types should allow unmodifiable properties to be included, via hidden input fields and the like.

How to properly access children by filtering parents in a single REST API call

I'm rewriting an API to be more RESTful, but I'm struggling with a design issue. I'll explain the situation first and then my question.
SITUATION:
I have two sets of resources, users and items. Each user has a list of items, so the resource path would look something like this:
api/v1/users/{userId}/items
Also, each user has an isPrimary property, but only one user can be primary at a time. This means that if you want to get the primary user, you'd do something like this:
api/v1/users?isPrimary=true
This should return a single "primary" user.
I have a client of my API that wants to get the items of the primary user, but can't make two API calls (one to get the primary user and a second to get the items of that user, using the userId). Instead, the client would like to make a single API call.
QUESTION:
How should I go about designing an API that fetches the items of a single user in only one API call when all the client has is the isPrimary query parameter for the user?
MY THOUGHTS:
I think I have a few options:
Option 1) api/v1/users?isPrimary=true will return the list of items along with the user data.
I don't like this one, because I have other API clients that call api/v1/users or api/v1/users?isPrimary=true to only get and parse through user data NOT item data. A user can have thousands of items, so returning those items every time would be taxing on both the client and the service.
Option 2) api/v1/users/items?isPrimary=true
I also don't like this because it's ugly and not really RESTful, since there is no {userId} in the path and isPrimary isn't a property of items.
Option 3) api/v1/users?isPrimary=true&isShowingItems=true
This is like the first one, but I use another query parameter to flag whether or not to show the items belonging to the user in the response. The problem is that the query parameter is misleading because there is no isShowingItems property associated with a user.
Any help that you all could provide will be greatly appreciated. Thanks in advance.
There's no real standard solution for this, and all of your solutions are in my mind valid. So my answer will be a bit subjective.
Have you looked at HAL for your API format? HAL has a standard way to embed data from one resource into another (using _embedded), and this sounds like a pretty valid use case for it.
The server can decide whether to embed the items based on a number of criteria, but one cheap solution might be to just add a query parameter like ?embed=items
Even if you don't use HAL, conceptually you could still copy this behavior similarly. Or maybe you only use _embedded. At least it's re-using an existing idea over building something new.
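A rough sketch of what the ?embed=items idea could look like, written in Python and producing a HAL-style body. The function name user_to_hal, the embed parameter handling and the per-item URL /items/{id} are assumptions made for illustration; only the _links and _embedded keys come from HAL itself.

```python
from typing import Any, Dict, List, Optional


def user_to_hal(user: Dict[str, Any], items: List[Dict[str, Any]],
                embed: Optional[str] = None) -> Dict[str, Any]:
    """Build a HAL-style representation of a user, embedding items on request."""
    user_href = f"/api/v1/users/{user['id']}"
    body: Dict[str, Any] = {
        "id": user["id"],
        "isPrimary": user["isPrimary"],
        "_links": {
            "self": {"href": user_href},
            "items": {"href": f"{user_href}/items"},
        },
    }
    # ?embed=items,other -> {"items", "other"}; a missing parameter embeds nothing
    requested = set(filter(None, (embed or "").split(",")))
    if "items" in requested:
        body["_embedded"] = {
            "items": [
                {**item, "_links": {"self": {"href": f"{user_href}/items/{item['id']}"}}}
                for item in items
            ]
        }
    return body
```

Clients that only want user data keep calling api/v1/users?isPrimary=true and never pay for the items; the one client that needs both adds embed=items to the same call.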
Aside from that practical solution, there is nothing un-RESTful about exposing data at multiple endpoints. So if you created a resource like:
/v1/primary-user-with-items
Then this might be ugly and inconsistent with the rest of your API, but not inherently 'not RESTful' (sorry for the double negative).
You could include a List<User.Fieldset> parameter called fieldsets, and then include things if they are specified in fieldsets. This has the benefit that you can reuse the pattern by adding fieldsets onto any object in your API that has fields you might wish to include.
api/v1/users?isPrimary=true&fieldsets=items

simple model when requesting collection and extended model when requesting resource - how

I have the following URI: /articles/:id, where article is a resource on a web service with an associated model/class. When a collection is requested, I need to return only partial data for each resource (to save bandwidth and for speed), but when a single item from the collection is requested I need to send the full data. My question is: should I use two models/classes for the same resource on the server and instantiate a different one depending on whether a collection or a single resource is requested? Or should there be only one model/class, with not all fields filled with data when a collection is requested? Or is there another approach?
I suggest using the approach suggested here with a fields query parameter.
If the API is going to be open to everyone to use and client usage is going to be unpredictable, then by default you probably need to limit the fields that you return. Just make sure you document in some way all the possible fields that could be used, in case a client actually needs them.
If the API is going to be consumed only by an app or apps you made, then by default you could return all of the fields and then your app can pass that fields parameter to speed things up.
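As a sketch of the fields approach with a single model, assuming a comma-separated fields query parameter: the field names in ALL_FIELDS, the collection default and the function name select_fields are hypothetical placeholders for whatever the article model actually contains.

```python
from typing import Any, Dict, Iterable, Optional

# Hypothetical article model: every field a client may ask for.
ALL_FIELDS = ("id", "title", "author", "summary", "body")
# Hypothetical lightweight default used when a collection is requested.
COLLECTION_DEFAULT = ("id", "title")


def select_fields(article: Dict[str, Any], fields_param: Optional[str],
                  default: Iterable[str]) -> Dict[str, Any]:
    """Project one article onto the requested fields, or onto the default set."""
    if fields_param:
        wanted = [f.strip() for f in fields_param.split(",") if f.strip()]
    else:
        wanted = list(default)
    unknown = [f for f in wanted if f not in ALL_FIELDS]
    if unknown:
        raise ValueError(f"unknown fields: {', '.join(unknown)}")
    return {f: article[f] for f in wanted}
```

A collection handler would call select_fields(article, fields, COLLECTION_DEFAULT) for each article, while the single-resource handler would pass ALL_FIELDS as the default, so the same model serves both cases.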

Should the natural or surrogate key be returned in an API?

This is the first time I've thought about this...
Until now, I have always used the natural key in my API. For example, in a REST API that deals with entities, the URL would look like /entities/{id}, where id is a natural key known to the user (the ID is passed in the POST request that creates the entity). After the entity is created, the user can use multiple commands (GET, DELETE, PUT...) to manipulate the entity. The entity also has a surrogate key generated by the database.
Now, think about the following sequence:
1. A user creates an entity with id 1. (POST /entities with body containing id 1)
2. Another user deletes the entity (DELETE /entities/1)
3. The same other user creates the entity again (POST /entities with body containing id 1)
4. The first user decides to modify the entity (PUT /entities/1 with body)
Before step 4 is executed, there is still an entity with id 1 in the database, but it is not the same entity created during step 1. The problem is that step 4 identifies the entity to modify based on the natural key which is the same for the deleted and new entity (while the surrogate key is different). Therefore, step 4 will succeed and the user will never know it is working on a new entity.
I generally also use optimistic locking in my applications, but I don't think it helps here. After step 1, the entity's version field is 0. After step 3, the new entity's version field is also 0. Therefore, the version check won't help. Is this the right case to use a timestamp field for optimistic locking?
Is the "good" solution to return the surrogate key to the user? That way, the user always provides the surrogate key to the server, which can use it to ensure it is working on the same entity and not on a new one.
Which approach do you recommend?
It depends on how you want your users to use your API.
REST APIs should try to be discoverable. So if there is benefit in exposing natural keys in your API because it will allow users to modify the URI directly and get to a new state, then do it.
A good example is categories or tags. We could have the following URIs:
GET /some-resource?tag=1 // returns all resources tagged with 'blue'
GET /some-resource?tag=2 // returns all resources tagged with 'red'
or
GET /some-resource?tag=blue // returns all resources tagged with 'blue'
GET /some-resource?tag=red // returns all resources tagged with 'red'
There is clearly more value to a user in the second group, as they can see that the tag is a real word. This then allows them to type ANY word in there to see what's returned, whereas the first group does not allow this: it limits discoverability.
A different example would be orders
GET /orders/1 // returns order 1
or
GET /orders/some-verbose-name-that-adds-no-meaning // returns order 1
In this case there is little value in adding some verbose name to the order to allow it to be discoverable. A user is more likely to want to view all orders first (or a subset) and filter by date or price etc, and then choose an order to view
GET /orders?orderBy={date}&order=asc
Additional
After our discussion over chat, your issue seems to be with versioning and how to manage resource locking.
If you allow resources to be modified by multiple users, you need to send a version number with every request and response. The version number is incremented when any changes are made. If a request sends an older version number when trying to modify a resource, throw an error.
In the case where you allow the same URIs to be reused, there is a potential for conflict, as the version number always begins from 0. In this case, you will also need to send over a GUID (surrogate key) along with the version number. Or don't use natural URIs (see the original answer above to decide when to do this or not).
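A minimal in-memory sketch of the GUID-plus-version check, written in Python. The Store class, its method names and StaleEntityError are hypothetical; a real service would run the same comparison against its database and translate the error into an HTTP error response.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict


class StaleEntityError(Exception):
    """The caller's GUID or version does not match the stored entity."""


@dataclass
class Entity:
    natural_id: str
    data: dict
    version: int = 0
    # Surrogate key: fresh for every incarnation of the same natural id.
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))


class Store:
    def __init__(self) -> None:
        self._by_id: Dict[str, Entity] = {}

    def create(self, natural_id: str, data: dict) -> Entity:
        if natural_id in self._by_id:
            raise ValueError(f"entity {natural_id!r} already exists")
        entity = Entity(natural_id, dict(data))
        self._by_id[natural_id] = entity
        return entity

    def delete(self, natural_id: str) -> None:
        del self._by_id[natural_id]

    def update(self, natural_id: str, guid: str, version: int, data: dict) -> Entity:
        current = self._by_id[natural_id]
        # Same natural id is not enough: incarnation and version must match too.
        if current.guid != guid or current.version != version:
            raise StaleEntityError(natural_id)
        current.data = dict(data)
        current.version += 1
        return current
```

In the four-step sequence from the question, the first user's PUT carries the GUID of the deleted entity, so update raises StaleEntityError even though both incarnations are at version 0.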
There is another option which is to disallow reuse of URIs. This really depends on the use case and your business requirements. It may be fine to reuse a URI as conceptually it means the same thing. Example would be if you had a folder on your computer. Deleting the folder and recreating it, is the same as emptying the folder. Conceptually the folder is the same 'thing' but with different properties.
User account is probably an area where reusing URIs is not a good idea. If you delete an account /accounts/u1, that URI should be marked as deleted, and no other user should be able to create an account with username u1. Conceptually, a new user using the same URI is not the same as when the previous user was using it.
It's interesting to see people trying to rediscover solutions to known problems. This issue is not specific to a REST API; it applies to any indexed storage. The only solution I have ever seen implemented is: don't re-use surrogate keys.
If you are generating your surrogate key at the client, use UUIDs or split sequences, but for preference do it serverside.
Also, you should never use surrogate keys to de-reference data if a simple natural key exists in the data. Indeed, even if the natural key is a compound entity, you should consider very carefully whether to expose a surrogate key in the API.
You mentioned the possibility of using a timestamp as your optimistic locking.
Depending how strictly you're following a RESTful principle, the Entity returned by the POST will contain an "edit self" link; this is the URI to which a DELETE or UPDATE can be performed.
Taking your steps above as an example:
Step 1
User A does a POST of Entity 1. The returned Entity object will contain a "self" link indicating where updates should occur, like:
/entities/1/timestamp/312547124138
Step 2
User B gets the existing Entity 1, with the above "self" link, and performs a DELETE to that timestamp versioned URI.
Step 3
User B does a POST of a new Entity 1, which returns an object with a different "self" link, e.g.:
/entities/1/timestamp/312547999999
Step 4
User A, with the original Entity that they obtained in Step 1, tries doing a PUT to the "self" link on their object, which was:
/entities/1/timestamp/312547124138
...your service will recognise that, although Entity 1 does exist, User A is trying a PUT against a version which has since become stale.
The service can then perform the appropriate action. Depending how sophisticated your algorithm is, you could either merge the changes or reject the PUT.
I can't remember the appropriate HTTP status code that you should return, following a PUT to a stale version... It's not something that I've implemented in the Rest framework that I work on, although I have planned to enable it in future. It might be that you return a 410 ("Gone").
Step 5
I know you don't have a step 5, but..! User A, upon finding their PUT has failed, might re-retrieve Entity 1. This could be a GET to their (stale) version, i.e. a GET to:
/entities/1/timestamp/312547124138
...and your service would return a redirect to GET from either a generic URI for that object, e.g.:
/entities/1
...or to the specific latest version, i.e.:
/entities/1/timestamp/312547999999
They can then make the changes intended in Step 4, subject to any application-level merge logic.
Hope that helps.
Your problem can be solved either by using ETags for versioning (a record can only be modified if the current ETag is supplied) or by soft deletes (so the deleted record still exists, but with a trashed bool which is reset by a PUT).
It sounds like you might also benefit from a batch endpoint and using transactions.

RESTful url - getting new subentity

There are 2 models: Entity and Subentity. Entity can have many connected Subentities (one:many relation).
There is a method on the server that returns a new Subentity (let's call it GetEmptySubentity). The point is, when you want to create a new Subentity, you press a button, and the model comes from the server with some fields pre-filled. Some of those pre-filled Subentity values depend on the corresponding Entity, so I need to pass an Entity id in this request.
So should the correct URL to get the empty Subentity be something like /Entity/{id}/Subentity/empty? Or am I getting something wrong?
Yes, you are. According to the uniform interface / HATEOAS constraint, you should send hyperlinks to your REST clients, and they should use the API by following those hyperlinks. To do this you need a hypermedia format, for example HTML, ATOM+XML, HAL+JSON, LD+JSON & Hydra, etc. (use Google). With HTML, the result should contain an HTML form with input fields that have default values, etc. You should add semantics to that form with RDFa, so that by processing the HTML your REST client will know that the link is about creating a new resource. Of course, it is easier to parse the other hypermedia formats. With them you can use the same concept with RDF (with JSON-LD or ATOM, for example), or you can use link relations with vendor-specific MIME types (with HAL or ATOM, for example), or your own custom solution that describes those input fields. So you usually get the necessary information with the hyperlink, and you don't have to send another request to get the default values.
If you want to make things complicated, you can send a request for the default values to the entity itself, so that the server returns the property values rather than a form with input fields. Optionally, you can send a request which returns the entire link; for example, GET /Entity/{id}/SubEntity?offset=0&count=0 can return an empty array of subentities together with the form for creation. You can use additional query or path parameters if that form is really big and you don't want to send it with every response related to the SubEntity collection. The URL specification says only that the path should contain the hierarchical part and the query should contain the non-hierarchical part of the URL.
Btw. REST is just a delivery method, you don't have to map it to your database entities. The REST resource and URL structure can be completely different from your database, since you can use any type of data storage mechanisms with REST, even the file system...