I am storing Uber webhook events in my DB because the same event may be fired twice for different scopes, as mentioned here: https://developer.uber.com/docs/webhooks. I am handling multiple user profiles and want to know if the events are unique across users. If not, I need to store both the event id and the user the event was generated for in my DB model.
The event ID should be practically unique across space and time, as it is a UUID (universally unique identifier) generated using version 4 (random) of the RFC 4122 variant specification.
"event_id": "3a3f3da4-14ac-4056-bbf2-d0b9cdcb0777"
Version 4 UUIDs have the form xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx where x is any hexadecimal digit and y is one of 8, 9, A, or B
The version 4 UUID is meant for generating UUIDs from truly-random or pseudo-random numbers.
Depending on the quality of the cryptographic random number generator and whether sufficient entropy was fed into it, the resulting event ID should be effectively globally unique (the chance of a collision is negligible).
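For illustration, here is a minimal sketch of how a version 4 UUID is built from 16 cryptographically random bytes (the function name is just an example, not Uber's implementation):

function generate_uuid_v4(): string {
    $bytes = random_bytes(16);                        // 128 bits from a CSPRNG
    $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40);  // force the version nibble to 4
    $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80);  // force the variant bits to 10xx (8, 9, a or b)
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
}

With 122 random bits, the chance of two independently generated IDs colliding is negligible for any realistic number of events.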
Most of the basic ticketing platforms work like this: There is a website where users can buy a ticket for a specific event. After their information is stored in the database together with an unique ID or hash to identify the ticket, they receive an email with a QR-code. When entering the event, that QR-code (containing the unique ID/hash) gets scanned and the back-end checks if the ticket corresponding to that ID or hash exists, is valid, and is not yet used.
The security guards scanning the tickets at the entrance would have a dedicated but very basic app: The app scans the content of the QR-code on the ticket (the QR code only holds the unique ID/hash of the ticket), and sends that unique ID/hash to the back-end. If the checks on back-end conclude the ticket is valid the app shows a green screen to the guard, otherwise a red one. So the scanning-app is very 'dumb', simple, and basic.
The question I have is: What if the logic of the back-end changes?
Example: In 2017 we programmed a ticketing platform and decided to use MD5-hashes as the unique identifier for all the tickets. So every ticket in the database has its info saved (username, user-email, what event,...) together with its own unique MD5-hash. The QR-code the owner of the ticket receives by mail also only contains that MD5-hash. This way we can link the physical ticket with the correct data in our database when using our dedicated scanning-app.
But now in 2021 we have decided that MD5 is not secure enough anymore and from now on we will be using SHA256-hashes to give the tickets unique IDs.
So from now on the column hash_id in the database contains SHA256 hashes, and all the QR-codes we send by mail to the buyers contain the SHA256-hash matching their bought ticket.
And we change (pseudo-code):
if ($_POST['action_check_hash']) {
    if (md5_hash_is_valid_ticket($_POST['hash_on_ticket'])) {
        return GREEN_SCREEN;
    } else {
        return RED_SCREEN;
    }
}
to
if ($_POST['action_check_hash']) {
    if (sha256_hash_is_valid_ticket($_POST['hash_on_ticket'])) {
        return GREEN_SCREEN;
    } else {
        return RED_SCREEN;
    }
}
Considering the given explanation we could also just compare the strings, nullifying the problem. But for the sake of this question let's imagine we're using a hash-algorithm-dependent function to check whether a valid ticket with the given hash exists.
The tickets do not have an expiration date.
So now user Foo who just bought his ticket yesterday comes to the entrance of the event and the security guard scans Foo's QR-code containing the SHA256-hash. sha256_hash_is_valid_ticket($_POST['hash_on_ticket']) returns a green screen and Foo can enter.
Now user Gux, who bought his ticket in 2018, arrives at the entrance with his - actually valid - ticket containing an MD5-hash in its QR-code (because the old version of our platform used MD5), but the security guard gets a red screen because sha256_hash_is_valid_ticket($_POST['hash_on_ticket']) fails. Obviously the SHA256-dependent function can't interpret the MD5-hash. So Gux does have a valid ticket, but can't enter the event because the back-end can't interpret his old ticket anymore.
How would we solve this? How do we make sure, when updating our back-end or platform, that already bought tickets do still work?
I was thinking something like this:
We put a specific version-number on our tickets so we can identify what version of the back-end we should use.
E.g:
The QR-code of the tickets in 2017 would contain:
V1.0;098f6bcd4621d373cade4e832627b4f6 (V1.0;MD5_HASH)
And the tickets bought in 2021 or later would contain:
V2.0;9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08 (V2.0;SHA256_HASH)
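On the scanner side (or at the very top of the back-end endpoint) the QR payload would first be split into its version and hash parts, for example:

// $qr_content is e.g. "V1.0;098f6bcd4621d373cade4e832627b4f6"
list($version_on_ticket, $hash_on_ticket) = explode(';', $qr_content, 2);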
The back-end would look something like this:
if ($_POST['action_check_hash']) {
    $valid = false;
    // Checking what version number the QR-code contains
    if ($_POST['version_on_ticket'] == "V1.0") {
        $valid = md5_hash_is_valid_ticket($_POST['hash_on_ticket']);
    } else if ($_POST['version_on_ticket'] == "V2.0") {
        $valid = sha256_hash_is_valid_ticket($_POST['hash_on_ticket']);
    } else {
        $valid = false;
    }
    if ($valid) {
        return GREEN_SCREEN;
    }
    return RED_SCREEN;
}
With each update we make sure that the hash in the QR-code is prepended with a new version-number. In the back-end we then know what algorithm or logic to use by checking the version-number. Is this a good way to solve it?
A few solutions I came up with:
The one explained above: defining a version-number on the ticket which the scanner app sends to the back-end together with the hash. The back-end uses that version-number to interpret the data correctly.
We save the timestamp of when a ticket was bought in the database. Then we'd use a separate table containing a version-number and the timestamp of when the update was pushed. The same principle as the previous solution but without exposing the version-number to the client.
We give each ticket a unique URL so we can update the tickets of our users on-the-go (good for digital solutions, but what if the user prints his ticket?)
We notify all of our users that our platform is updated and their previous tickets are not valid anymore, and mail them their tickets containing the new QR-code (for free). This will cause a lot of issues and will be expensive for customer support.
TL;DR:
How do we keep old tickets working when they use a format/logic not supported by the newest version of our back-end? And how do we let the back-end know it needs to switch to the old logic to interpret the incoming data correctly?
I am really looking forward to your answers!
Yes, using a version number for this is a good idea. In general, you should assume that every format of data (tokens, identifiers, etc.) you emit to users will change and include a version number. Then, when (not if) you need to make a change, you can just bump the version number. This change can be the hash algorithm, the data you need to parse out of the identifier, or any other data format change you need to make.
The way I personally like to do this is to use a BER-encoded integer and just prepend it to the data, encoding it with hex or base64 or whatever you're using. This is an encoding in which the lower seven bits of each byte carry the data, and the top bit is 1 if the integer continues into the next byte and 0 if this is the last byte. It is variable-length, but it optimizes for smaller values (which are usually more frequent) and is easy to parse. Of course, you can pick any sensible scheme you like.
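As a sketch of that scheme (a base-128 varint with a continuation bit, written here in little-endian LEB128 byte order; classic BER puts the most significant group first, but either works as long as you are consistent - the function names and $raw_ticket_id are made up):

function varint_encode(int $n): string {
    $out = '';
    while ($n > 0x7f) {
        $out .= chr(($n & 0x7f) | 0x80); // lower 7 bits, continuation bit set
        $n >>= 7;
    }
    return $out . chr($n);               // final byte, continuation bit clear
}

function varint_decode(string $data): array {
    $n = 0;
    $shift = 0;
    $i = 0;
    do {
        $byte = ord($data[$i++]);
        $n |= ($byte & 0x7f) << $shift;
        $shift += 7;
    } while ($byte & 0x80);
    return [$n, $i];                     // [decoded version, bytes consumed]
}

// Prepend the version to the raw identifier bytes, then encode the whole thing:
$token = bin2hex(varint_encode(2) . $raw_ticket_id);

// On the way back in, read the version first, then hand the rest to the matching logic:
list($version, $len) = varint_decode(hex2bin($token));
$identifier = substr(hex2bin($token), $len);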
Note that if you had picked a version number up front, you could simply have stored the identifier in the database with the version number prepended to it and then you wouldn't need an additional field specifying the type. Of course, hindsight is always 20/20.
The normal MO for creating items in a database is to let the database control the generation of the primary key (id). That's usually true whether you're using auto-incremented integer ids or UUIDs.
I'm building a client-side app (Angular, but the tech is irrelevant) that I want to be able to build offline behaviour into. In order to allow offline object creation (and association) I need the client application to generate primary keys for new objects. This is both to allow for associations with other objects created offline and also to allow for idempotence (making sure I don't accidentally save the same object to the server twice due to a network issue).
The challenge, though, is what happens when that object gets sent to the server. Do you use a temporary client-side ID which you then replace with the ID that the server subsequently generates, or do you use some sort of ID translation layer between the client and the server - this is what Trello did when building their offline functionality.
However, it occurred to me that there may be a third way. I'm using UUIDs for all tables on the back end. And so this made me realise that I could in theory insert a UUID into the back end that was generated on the front end. The whole point of UUIDs is that they're universally unique so the front end doesn't need to know the server state to generate one. In the unlikely event that they do collide then the uniqueness criteria on the server would prevent a duplicate.
Is this a legitimate approach? The risks seem to be 1. collisions and 2. any form of security that I haven't anticipated. Collisions seem to be taken care of by the way that UUIDs are generated, but I can't tell if there are risks in allowing a client to choose the ID of an inserted object.
However, it occurred to me that there may be a third way. I'm using UUIDs for all tables on the back end. And so this made me realise that I could in theory insert a UUID into the back end that was generated on the front end. The whole point of UUIDs is that they're universally unique so the front end doesn't need to know the server state to generate one. In the unlikely event that they do collide then the uniqueness criteria on the server would prevent a duplicate.
Yes, this is fine. Postgres even has a UUID type.
Set the default ID to be a server-generated UUID if the client does not send one.
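A minimal sketch of that arrangement, assuming Postgres and PDO (the items table, $pdo, $clientUuid and $payload are made-up names for illustration):

// One-time schema (gen_random_uuid() is built into Postgres 13+):
//   CREATE TABLE items (
//       id   uuid PRIMARY KEY DEFAULT gen_random_uuid(),
//       body text NOT NULL
//   );

// Insert with a client-generated UUID; the primary-key constraint rejects duplicates,
// and ON CONFLICT DO NOTHING makes a retried request idempotent.
$stmt = $pdo->prepare('INSERT INTO items (id, body) VALUES (:id, :body) ON CONFLICT (id) DO NOTHING');
$stmt->execute(['id' => $clientUuid, 'body' => $payload]);

// If the client sent no id, omit the column and let the database default apply:
$pdo->prepare('INSERT INTO items (body) VALUES (:body)')->execute(['body' => $payload]);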
Collisions.
UUIDs are designed to not collide.
Any form of security that I haven't anticipated.
Avoid UUIDv1 because...
This involves the MAC address of the computer and a time stamp. Note that UUIDs of this kind reveal the identity of the computer that created the identifier and the time at which it did so, which might make it unsuitable for certain security-sensitive applications.
You can instead use uuid_generate_v1mc which obscures the MAC address.
Avoid UUIDv3 because it uses MD5. Use UUIDv5 instead.
UUIDv4 is the simplest: it's a 122-bit random number, and it's built into Postgres (the others are in the commonly available uuid-ossp extension). However, it depends on the strength of the random number generator of each client. But even a bad UUIDv4 generator is better than incrementing an integer.
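If you accept client-generated UUIDs, it's also cheap to validate the format server-side before inserting; a minimal sketch (the regex just checks the 8-4-4-4-12 shape with the version and variant nibbles described earlier):

function is_valid_uuid_v4(string $id): bool {
    // 8-4-4-4-12 hex digits, version nibble 4, variant nibble 8/9/a/b
    return (bool) preg_match('/^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i', $id);
}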
I have a custom profile for a proprietary device (my smartphone app will be the only thing communicating with my peripheral) that includes two simple services. Each service allows the client to read and write a single byte of data on the peripheral. I would like to add the ability to read and write both bytes in a single transaction.
I tried adding a third service that simply included the two existing single byte services but all that appears to do is assign a UUID that combines the UUIDs for the existing services and I don't see how to use the combined UUID since it doesn't have any Characteristic Values.
The alternatives I'm considering are to make a separate service for the two bytes and combine their effects on my server, or I could replace all of this with a single service that includes the two bytes along with a boolean flag for each byte that indicates whether or not the associated byte should be written.
The first alternative seems overly complicated and the second would preclude individual control of notifications and indications for the separate bytes.
Is there a way to use included services to accomplish my goals?
It's quite an old question, but in case anyone else comes across it I'll leave a comment here.
There are two parts. One is a late answer for Lance F: you had a wrong understanding of the BLE design principles. Services are defined at the host level of the BLE stack, and you considered your problem from the application-level point of view, wanting an atomic transaction to provide you with a compound object of two distinct entities. Otherwise why would you have defined two services?
The second part is an answer to the actual question taken as quote from "Getting Started with Bluetooth Low Energy" by Kevin Townsend et al., O'Reilly, 2014, p.58:
Included services can help avoid duplicating data in a GATT server. If a service will be referenced by other services, you can use this mechanism to save memory and simplify the layout of the GATT server. In the previous analogy with classes and objects, you could see include definitions as pointers or references to an existing object instance.
This is an update of my answer to clarify why there is no need for included services in the problem stated by Lance F.
I am mostly familiar with BLE use in medical devices, so I briefly sketch the SIG defined Glucose Profile as an example to draw some analogies with your problem.
Let's imagine a server device which has the Glucose Service with 2 defined characteristics: Glucose Measurement and Glucose Measurement Context. A client can subscribe to notifications of either or both of these characteristics. Some time later the client device can change its subscriptions by simply writing to the Client Characteristic Configuration descriptor of the corresponding characteristic.
The server also has a special mandatory characteristic - the Record Access Control Point (RACP) - which is used by a client to retrieve or update the glucose measurement history.
If a client wants to get a number of stored history records it writes to the RACP { OpCode: 4 (Report number of stored records), Operator: 1 (All records) }. Then a server sends an indication from the RACP { OpCode: 5 (Number of stored records response), Operator: 0 (Null), Operand: 17 (some number) }.
If a client wants to get any specific records it writes to the RACP { OpCode: 1 (Report stored records), Operator: 4 (Within range of, inclusive), Operand: [13, 14] (for example the records 13 and 14) }. In response a server sends requested records one by one as notifications of the Glucose Measurement and Glucose Measurement Context characteristics, and then sends an indication from the RACP characteristic to report a status of the operation.
So Glucose Measurement and Glucose Measurement Context are your Mode and Rate characteristics, and then you also need one more control characteristic - an analog of the RACP. Now you need to define a number of codes, operators, and operands. Create whichever structure suits you best, for example: Code: 1 - update, Operator: 1 - Mode only, Operand: the actual value. A client writes it to the control point characteristic. The server gets notified on write, interprets it, and acts in the way defined by your custom profile.
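Purely to illustrate the idea of a packed opcode/operator/operand value (the numbers below are invented for a custom profile, the real Glucose RACP operand layout differs slightly, and the actual GATT write would go through your platform's BLE API rather than code like this):

// "Update Mode only to 3": OpCode 1 (update), Operator 1 (Mode only), Operand 3
$control_point_value = pack('CCC', 1, 1, 3);   // three unsigned bytes: 01 01 03

// "Report stored records 13 to 14" in the spirit of the RACP example:
// OpCode 1, Operator 4 (within range of, inclusive), two 16-bit little-endian operands
$racp_value = pack('CCvv', 1, 4, 13, 14);      // bytes: 01 04 0D 00 0E 00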
This is the first time I've thought about this...
Until now, I always used the natural key in my API. For example, a REST API allowing to deal with entities, the URL would be like /entities/{id} where id is a natural key known to the user (the ID is passed to the POST request that creates the entity). After the entity is created, the user can use multiple commands (GET, DELETE, PUT...) to manipulate the entity. The entity also has a surrogate key generated by the database.
Now, think about the following sequence:
A user creates entity with id 1. (POST /entities with body containing id 1)
Another user deletes the entity (DELETE /entities/1)
The same other user creates the entity again (POST /entities with body containing id 1)
The first user decides to modify the entity (PUT /entities/1 with body)
Before step 4 is executed, there is still an entity with id 1 in the database, but it is not the same entity created during step 1. The problem is that step 4 identifies the entity to modify based on the natural key which is the same for the deleted and new entity (while the surrogate key is different). Therefore, step 4 will succeed and the user will never know it is working on a new entity.
I generally also use optimistic locking in my applications, but I don't think it helps here. After step 1, the entity's version field is 0. After step 3, the new entity's version field is also 0. Therefore, the version check won't help. Is this the right case to use a timestamp field for optimistic locking?
Is the "good" solution to return surrogate key to the user? This way, the user always provides the surrogate key to the server which can use it to ensure it works on the same entity and not on a new one?
Which approach do you recommend?
It depends on how you want your users to use your API.
REST APIs should try to be discoverable. So if there is benefit in exposing natural keys in your API because it will allow users to modify the URI directly and get to a new state, then do it.
A good example is categories or tags. We could have the following URIs:
GET /some-resource?tag=1 // returns all resources tagged with 'blue'
GET /some-resource?tag=2 // returns all resources tagged with 'red'
or
GET /some-resource?tag=blue // returns all resources tagged with 'blue'
GET /some-resource?tag=red // returns all resources tagged with 'red'
There is clearly more value to a user in the second group, as they can see that the tag is a real word. This then allows them to type ANY word in there to see what's returned, whereas the first group does not allow this: it limits discoverability.
A different example would be orders
GET /orders/1 // returns order 1
or
GET /orders/some-verbose-name-that-adds-no-meaning // returns order 1
In this case there is little value in adding some verbose name to the order to allow it to be discoverable. A user is more likely to want to view all orders first (or a subset), filter by date or price etc., and then choose an order to view:
GET /orders?orderBy={date}&order=asc
Additional
After our discussion over chat, your issue seems to be with versioning and how to manage resource locking.
If you allow resources to be modified by multiple users, you need to send a version number with every request and response. The version number is incremented when any changes are made. If a request sends an older version number when trying to modify a resource, throw an error.
In the case where you allow the same URIs to be reused, there is a potential for conflict because the version number always begins at 0. In this case, you will also need to send over a GUID (surrogate key) and a version number. Or don't use natural URIs (see the original answer above to decide when to do this or not).
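A minimal sketch of that check on the server, in the same pseudo-code style as the ticket example earlier in this thread (the helper names and CONFLICT_ERROR constant are made up):

$entity = find_entity_by_guid($_POST['guid']);          // look up by surrogate key, not by the natural URI id
if ($entity == null || $entity['version'] != $_POST['version']) {
    return CONFLICT_ERROR;                              // deleted/recreated or modified since the client last read it
}
apply_update($entity, $_POST['body']);
$entity['version'] = $entity['version'] + 1;            // bump the version on every successful change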
There is another option, which is to disallow reuse of URIs. This really depends on the use case and your business requirements. It may be fine to reuse a URI, as conceptually it means the same thing. An example would be a folder on your computer. Deleting the folder and recreating it is the same as emptying the folder. Conceptually the folder is the same 'thing' but with different properties.
User account is probably an area where reusing URIs is not a good idea. If you delete an account /accounts/u1, that URI should be marked as deleted, and no other user should be able to create an account with username u1. Conceptually, a new user using the same URI is not the same as when the previous user was using it.
It's interesting to see people trying to rediscover solutions to known problems. This issue is not specific to a REST API - it applies to any indexed storage. The only solution I have ever seen implemented is: don't re-use surrogate keys.
If you are generating your surrogate key at the client, use UUIDs or split sequences, but preferably do it server-side.
Also, you should never use surrogate keys to de-reference data if a simple natural key exists in the data. Indeed, even if the natural key is a compound entity, you should consider very carefully whether to expose a surrogate key in the API.
You mentioned the possibility of using a timestamp as your optimistic locking mechanism.
Depending on how strictly you're following RESTful principles, the Entity returned by the POST will contain an "edit self" link; this is the URI to which a DELETE or PUT can be performed.
Taking your steps above as an example:
Step 1
User A does a POST of Entity 1. The returned Entity object will contain a "self" link indicating where updates should occur, like:
/entities/1/timestamp/312547124138
Step 2
User B gets the existing Entity 1, with the above "self" link, and performs a DELETE to that timestamp versioned URI.
Step 3
User B does a POST of a new Entity 1, which returns an object with a different "self" link, e.g.:
/entities/1/timestamp/312547999999
Step 4
User A, with the original Entity that they obtained in Step 1, tries doing a PUT to the "self" link on their object, which was:
/entities/1/timestamp/312547124138
...your service will recognise that although Entity 1 does exist, User A is trying a PUT against a version which has since become stale.
The service can then perform the appropriate action. Depending how sophisticated your algorithm is, you could either merge the changes or reject the PUT.
I can't remember the appropriate HTTP status code that you should return following a PUT to a stale version... It's not something that I've implemented in the REST framework that I work on, although I have planned to enable it in future. It might be that you return a 410 ("Gone").
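A sketch of that stale-version check, again in pseudo-code (the helpers are hypothetical; 410 is used here, but 409 Conflict would also be defensible):

// PUT /entities/{id}/timestamp/{timestamp}
$current = current_timestamp_version_of($id);   // e.g. 312547999999 (hypothetical helper)
if ($timestamp != $current) {
    return response(410);                       // the "self" link in the URI is stale
}
apply_put($id, $request_body);
return response(200);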
Step 5
I know you don't have a step 5, but..! User A, upon finding their PUT has failed, might re-retrieve Entity 1. This could be a GET to their (stale) version, i.e. a GET to:
/entities/1/timestamp/312547124138
...and your service would return a redirect to GET from either a generic URI for that object, e.g.:
/entities/1
...or to the specific latest version, i.e.:
/entities/1/timestamp/312547999999
They can then make the changes intended in Step 4, subject to any application-level merge logic.
Hope that helps.
Your problem can be solved either by using ETags for versioning (a record can only be modified if the current ETag is supplied) or by soft deletes (so the deleted record still exists, but with a trashed bool which is reset by a PUT).
Sounds like you might also benefit from a batch endpoint and using transactions.
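For the ETag variant, a minimal sketch in the same PHP-ish pseudo-code (the helper names are illustrative; PHP exposes the If-Match header as $_SERVER['HTTP_IF_MATCH']):

$record = load_record($id);                              // hypothetical helper
$clientEtag = $_SERVER['HTTP_IF_MATCH'] ?? null;         // the ETag the client last saw
if ($clientEtag !== '"' . $record['etag'] . '"') {
    return response(412);                                // Precondition Failed: changed, deleted, or recreated since
}
save_record($id, $new_data);
// send a fresh ETag header with the response so the client can make its next conditional request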
document/show?id=4cf8ce8a8aad6957ff00005b
Generally I think you should be cautious about exposing internals (such as DB ids) to the client. The URL can easily be manipulated and the user possibly has access to objects you don't want him to have.
For MongoDB in particular, the object ID might even reveal some additional internals (see here), i.e. the ids aren't completely random. That might be an issue too.
Besides that, I think there's no reason not to use the id.
I generally agree with @MartinStettner's reply. I wanted to add a few points, mostly elaborating on what he said. Yes, a small amount of information is decodeable from the ObjectId, and it is trivially accessible if someone recognizes it as a MongoDB ObjectId. The two downsides are:
It might allow someone to guess a different valid ObjectId, and request that object.
It might reveal info about the record (such as its creation date) or the server that you didn't want someone to have.
The "right" fix for the first item is to implement some sort of real access control: 1) a user has to login with a username and password, 2) the object is associated with that username, 3) the app only serves objects to a user that are associated with that username.
MongoDB doesn't do that itself; you'll have to rely on other means. Perhaps your web-app framework, and/or some ad-hoc access control list (which itself could be in MongoDB).
But here is a "quick fix" that mostly solves both problems: create some other "id" for the record, based on a large, high-quality random number.
How large does "large" need to be? A 128-bit random number has 3.4 * 10^38 possible values. So if you have 10,000,000 objects in your database, someone guessing a valid value is a vanishingly small probability: 1 in 3.4 * 10^31. Not good enough? Use a 256-bit random number... or higher!
How do you represent this number in the document? You could use a string (encoding the number as hex or base64) or MongoDB's binary type. (Consult your driver's API docs to figure out how to create a binary object as part of a document.)
While you could add a new field to your document to hold this, you'd probably also want an index on it. So the document size is bigger and you spend more memory on that index. Here's what you might not have thought of: simply USE that "truly random id" as your document's "_id" field. Thus the per-document size is only a little higher, and you use the index that you [probably] had there anyway.
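A sketch of that approach, assuming the official MongoDB PHP library (the database and collection names are made up):

$randomId = bin2hex(random_bytes(16));                   // 128 bits from a CSPRNG, hex-encoded (32 chars)
$collection = (new MongoDB\Client)->mydb->documents;     // hypothetical database/collection names
$collection->insertOne([
    '_id'   => $randomId,                                // the random value doubles as the primary key
    'title' => 'example document',
]);
// URLs then carry the unguessable value: document/show?id=<that 32-char hex string>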
I can set both the 128-character session string and other collection document object ids as cookies, and when the user visits, do an asynchronous fetch where I fetch the session, user, and account all at once, instead of fetching the session first and only then fetching the user and account. If the session document is valid, I'll share the user and account documents.
If I do this, every single request for a user or account document will also require the 128-character session cookie, which makes exposing the user and account object ids safer. It means that if anyone is guessing a user ID or account ID, they also have to guess the 128-character string to get any answer from the system.
Another security measure you could take is to wrap the id in some salt whose positioning only you know, such as:
XXX4cf8ce8XXXXa8aad6957fXXXXXXXf00005bXXXX
Now you know exactly how to slice that up to get the ID.
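As a sketch of that slicing (the salt positions and lengths here are invented, in a real scheme the padding would be random characters rather than literal X's, and this is obfuscation on top of, not a substitute for, real access control):

// Build the wrapped value (only the server knows where the salt sits):
$objectId = '4cf8ce8a8aad6957ff00005b';
$wrapped  = 'XXX' . substr($objectId, 0, 7)
          . 'XXXX' . substr($objectId, 7, 10)
          . 'XXXXXXX' . substr($objectId, 17, 7)
          . 'XXXX';

// Slice it back apart using the known offsets and lengths:
$recovered = substr($wrapped, 3, 7) . substr($wrapped, 14, 10) . substr($wrapped, 31, 7);
// $recovered === $objectId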