MongoDB ObjectIDs are guessable.
I'm running an application that has publicly available resources located at
http://application.com/resource/**ObjectID**
These resources need to be publicly accessible (not behind a login); however, I'm trying to reduce the chance of a hacker brute-forcing ObjectIDs and scraping them at will.
My idea is to include a randomly generated key with each MongoDB document, so that it can be matched up when the request is made. For example:
http://application.com/resource/**ObjectID**/**Key**
http://application.com/resource/**ObjectID**?key=**Key**
or even
http://**Key**.application.com/resource/**ObjectID**
If the key doesn't match the one stored in the document, then the server will return 404.
I realize this isn't true protection in the sense of guaranteed privacy, because if someone in the middle is sniffing URLs, they can access the resource. I'm just trying to prevent someone from brute-forcing ObjectIDs.
Is this approach feasible and effective?
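For illustration, here is a minimal sketch of the key-check idea, assuming Flask and PyMongo; the database, collection, and field names are just placeholders:

```python
import secrets

from bson import ObjectId
from bson.errors import InvalidId
from flask import Flask, abort
from pymongo import MongoClient

app = Flask(__name__)
resources = MongoClient()["app"]["resources"]  # placeholder database/collection names

def create_resource(data):
    # Attach an unguessable access key to each document at creation time.
    data["access_key"] = secrets.token_urlsafe(16)
    resources.insert_one(data)
    return data

@app.route("/resource/<object_id>/<key>")
def get_resource(object_id, key):
    try:
        oid = ObjectId(object_id)
    except InvalidId:
        abort(404)
    doc = resources.find_one({"_id": oid, "access_key": key})
    if doc is None:
        # Same response whether the ID or the key is wrong, so a prober
        # learns nothing from the status code.
        abort(404)
    return {"id": str(doc["_id"]), "name": doc.get("name")}
```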
In the context of the original answer you referenced on ObjectID generation, guessable is qualified by "given enough time".
Rather than focusing on whether your IDs might be guessable by brute force, I would look at approaches to detect and mitigate any brute force attacks. This removes the aspect of giving an adversary enough time to try all possible combinations (an approach that could work regardless of your ID format).
For example, obvious signatures to detect brute force attacks might include:
a large number of 404 requests from a specific IP address
successive requests for incremental ObjectIDs (which should be rare)
invalid ObjectIDs (if the adversary is unaware of the expected format)
There are many different strategies and countermeasures to consider, but a helpful starting point would be OWASP's info on Blocking Brute Force Attacks.
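As an illustration of the first signature, here is a minimal sketch that counts recent 404s per client IP and temporarily blocks noisy addresses (assuming a Flask app; the threshold and window are arbitrary, and in production this state would live in something shared like Redis rather than in-process memory):

```python
import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)

WINDOW_SECONDS = 60               # arbitrary sliding window
MAX_404S = 20                     # arbitrary threshold
recent_404s = defaultdict(deque)  # client IP -> timestamps of recent 404s

@app.before_request
def block_noisy_clients():
    timestamps = recent_404s[request.remote_addr]
    now = time.time()
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    if len(timestamps) >= MAX_404S:
        abort(429)  # too many recent misses from this address

@app.after_request
def record_404(response):
    if response.status_code == 404:
        recent_404s[request.remote_addr].append(time.time())
    return response
```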
These resources need to be publicly accessible (not behind a login); however, I'm trying to reduce the chance of a hacker brute-forcing ObjectIDs and scraping them at will.
If the resources are public, there may be an easier way for an adversary to find them: crawling public pages and fetching the linked resources. In this case you could still apply anti-crawling strategies, but these become trickier if you want to avoid affecting legitimate users. For example, a large number of valid requests from a specific IP might indicate a corporate or ISP proxy rather than someone trying to abuse your service. Smart crawlers can also mimic user patterns (delays between requests, randomness in requested URLs, and so on) in order to try to defeat any protections.
As a starting point see: Anti-crawling Techniques.
First of all, it is hard to see your goal here. The resources should be publicly accessible, yet at the same time you are worried that someone can guess their location. It would help if you included the reason why guessing ObjectIDs in your application would cause a problem.
I assume that you are building some kind of sharing service which allows users to exchange data (like Dropbox with file sharing), and that you are basically trying to protect the data behind the ObjectID.
There is no problem with the approach you outlined in such a case, so I will just add another one: create your own object IDs (randomly generated long strings). One potential problem is that they can collide, but this will be extremely rare, so if you wrap the insert in a try/catch and retry on failure, you should be fine.
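A rough sketch of that idea with PyMongo, using a long random string as the `_id` and retrying on the (rare) collision; names are illustrative:

```python
import secrets

from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

resources = MongoClient()["app"]["resources"]  # placeholder names

def insert_with_random_id(doc, attempts=3):
    # _id is unique-indexed by default, so a collision raises DuplicateKeyError.
    for _ in range(attempts):
        try:
            doc["_id"] = secrets.token_urlsafe(24)  # ~32 URL-safe characters
            resources.insert_one(doc)
            return doc["_id"]
        except DuplicateKeyError:
            continue  # astronomically unlikely; just pick another ID
    raise RuntimeError("could not generate a unique ID")
```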
Related
I'm learning about HATEOAS. The backend server I'm working on will use a third party REST API that uses HATEOAS. That API has an end point to return the url for each resource and also returns the related resource links with regular requests.
But I'm wondering what's a good way to manage these links on the server to avoid hardcoding them. For example if the third party changes the url of the resource, how will the server detect that change? Are there any standard practices for managing HATEOAS resource links?
Possible ways I can think of
When the server starts, get all the resource URLs and cache them. Whenever the third-party API needs to be called, reuse these cached URLs. Whenever there is a 404 or related error, update the resource URL. Or update the URLs periodically at intervals.
Get the resource URL each time before calling the endpoint. This is simplest, but it essentially doubles the number of requests.
But neither sounds like a robust approach.
While discovery is generally a good thing and should allow a HATEOAS system to introduce changes in ways that 'hardcoded URLs' don't, if URLs start breaking arbitrarily I would still consider this a major issue.
You should be able to store URLs / links on your side and have some expectation that those keep working.
There are some mechanisms that deal with changes though:
The server should return a 301 / 308 redirect if a resource has moved. If that happens, you should also update your stored references.
The server can emit Sunset or Deprecation headers. See: https://www.rfc-editor.org/rfc/rfc8594
Those are more general answers, but ultimately the existence of best practices does not mean that vendors will abide by them. With that in mind I think your best bet is to try and find out what the deprecation policy is of your vendor and see what they recommend.
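To make that concrete, here is a sketch of how a client might honour those two mechanisms when resolving a stored link, using the `requests` library; the `save_link` callback stands in for however you persist links:

```python
import requests

def resolve_link(url, save_link):
    """Fetch a stored HATEOAS link, updating it if the server says it moved."""
    response = requests.get(url, allow_redirects=False, timeout=10)

    # Permanent redirects (301 / 308): adopt the new URL as our stored reference.
    if response.status_code in (301, 308) and "Location" in response.headers:
        new_url = response.headers["Location"]
        save_link(new_url)                      # persist the replacement
        return resolve_link(new_url, save_link)

    # RFC 8594 Sunset header: the resource has an announced end-of-life.
    if "Sunset" in response.headers:
        print(f"warning: {url} is sunsetting on {response.headers['Sunset']}")

    response.raise_for_status()
    return response
```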
Use a cached resource if it is valid, request a refresh when you don't have a local valid copy.
RFC 7234 defines the caching semantics of HTTP.
Ideally, you don't implement the caching rules yourself, but instead you use a general purpose cache.
In its ideal form, your bespoke implementation is talking to a headless browser, and the headless browser worries about the caching rules for you.
In theory, you need the initial URL to start the process, and everything else comes from that.
Each resource you get from the server should include links to other edges on the graph of service for that resource.
So, once you get the initial resource, all of the rest come automatically.
That said, it's not untoward to have "well known" entry points that are, ideally, unchanging URLs. But in the end, those are just "bookmarks", and not necessarily guaranteed end points.
Consider a shopping site such as Amazon. Outside of amazon.com, you don't know any of their URLs. They're all provided on the various forms and pages, and the human simply navigates the site. Those URLs can be changing all the time, and no one would know. With HATEOAS, it's up to the machine to follow the links, rather than a human. But the process of navigation is the same.
As others have mentioned, the idea of caching a root resource has merit. Then you rely on the caching headers to tell you how often you have to refresh the links.
But that said, operationally there's no difference between following a normal link and following a cached link. Underneath, the cached resource loads faster, but you still need to "follow the link", because that's where the caching behavior kicks in. This is different from assuming the link is good, i.e. assuming you already know the result of a resource lookup. Your application follows the link. Always. The underlying infrastructure is responsible for making it efficient.
So your code should not, say, load up a root resource, stuff a map full of links, and then assume they're good. Rather, the code should request the root resource, perhaps as a Map of links (datatypes for the win), and let the next layer handle the details, because it all depends on the type of caching involved. Some caches have fixed durations where no follow-up is necessary. With others, you make the request anyway and the server tier responds "nothing changed", so you can use your local copy, but you're still required to ask in the first place.
Those are implementation details that the SERVER mandates (not the client). It's a server contract. If they want you pinging them each and every time, so be it. That's the contract they're presenting to you, and if you want to be a Good Citizen, then you should honor that contract.
Ideally, the server makes good decisions on these kinds of issues for the sake of efficiency, but in the end it's really up to them.
The client has to go along. The client in a HATEOAS system cedes a lot to the server. They're simply not decisions for the client to make.
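To make the "ask anyway, the server says nothing changed" flow concrete, here is a sketch of a conditional GET with ETags using the `requests` library; in practice you would more likely delegate this to a general-purpose HTTP cache in front of your code, as the answer above suggests:

```python
import requests

_cache = {}  # url -> (etag, cached_body); a stand-in for a proper HTTP cache

def fetch(url):
    headers = {}
    cached = _cache.get(url)
    if cached:
        headers["If-None-Match"] = cached[0]   # revalidate, don't assume freshness

    response = requests.get(url, headers=headers, timeout=10)

    if response.status_code == 304 and cached:
        return cached[1]                       # server says: nothing changed

    if "ETag" in response.headers:
        _cache[url] = (response.headers["ETag"], response.text)
    return response.text
```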
I am designing an API for domain admin to manage user cookie sessions, specifically
GET users/{userKey}/sessions to get a list of a user's all sessions
DELETE users/{userKey}/sessions/{sessionId} to delete a user's specific session
I want to expose another method for the admin to delete (reset) all of a user's sessions. I am considering two options, and I wonder which one is more RESTful:
DELETE users/{userKey}/sessions - {sessionId} left blank to delete all sessions
POST users/{userKey}/sessions/reset
REST was never designed for bulk transaction support; it's for representing the state of individual objects. That said, API design is very opinionated and you have to balance REST "pureness" with functionality. If I were designing this, I would go with option 1 and use DELETE at the "sessions" endpoint, since you are removing all of the user's sessions and not just a single one or a subset.
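A minimal sketch of option 1 in Flask; the route shapes follow the question and the in-memory session store is a placeholder:

```python
from flask import Flask, jsonify

app = Flask(__name__)
session_store = {}  # userKey -> {sessionId: session data}; stand-in for the real store

@app.route("/users/<user_key>/sessions", methods=["DELETE"])
def delete_all_sessions(user_key):
    removed = session_store.pop(user_key, {})
    return jsonify({"deleted": len(removed)})

@app.route("/users/<user_key>/sessions/<session_id>", methods=["DELETE"])
def delete_one_session(user_key, session_id):
    removed = session_store.get(user_key, {}).pop(session_id, None)
    if removed is None:
        return "", 404
    return "", 204
```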
This answer may be opinion based, so take it as such.
I would use DELETE if you are removing the resource (since you are going to be removing sessions).
If you keep the sessions (but change some data in those resources, e.g. a sliding expiration), then I would consider using PATCH, as you're modifying (resetting, not replacing) existing sessions.
I would go with DELETE on users/sessions.
If you think about it, a reset is simply an admin dropping a session. The user gets their new session when/if they return. So a reset route does not make much sense as you are not reissuing sessions to all of your users in this action.
My preference is users/sessions rather than users/{*}/sessions. The latter route suggests that you want to remove all sessions of the parent resource, in this case a single user.
I want to expose another method for the admin to delete (reset) all of a user's sessions. I am considering two options, and I wonder which one is more RESTful....
You probably want to be using POST.
POST serves many useful purposes in HTTP, including the general purpose of “this action isn’t worth standardizing.” -- Fielding, 2008.
HTTP DELETE isn't often the right answer
Relatively few resources allow the DELETE method -- its primary use is for remote authoring environments, where the user has some direction regarding its effect. -- RFC 7231
HTTP Methods belong to the "transfer documents over a network" domain, not to your domain.
REST doesn't actually care about the spelling of the target URI -- that's part of the point. General-purpose HTTP components don't assume that the URI has any specific semantics encoded into it. It is just an opaque identifier.
That means that you can apply whatever URI design heuristics you like. It's a lot like choosing a name for a variable or a namespace in a general-purpose programming language; the compiler or interpreter doesn't usually care whether the symbol "means" anything or not. We choose names that make things easier for the human beings who interact with the code.
And so it is with URIs as well. You'll probably want to use a spelling that is consistent with other identifiers in your API, so that it looks as though the API were designed by "one mind".
A common approach starts from the notion that a resource is any information that can be named (Fielding, 2000). Therefore, it's our job to first (a) figure out the name of the resource that handles this request, then (b) figure out an identifier that "matches" that name. Resources are closely analogous to documents, so if you can think of the name of the document in which you would write this message, you are a good way toward figuring out the name (for example, we might write expiring sessions into the "security log", or into the "sessions log"; great, now figure out the corresponding URI).
If I ran the zoo: I imagine that
GET /users/{userKey}/sessions
would probably return a representation of the user's cookie sessions. Because this representation would be something that changes when we delete all of the user's sessions, I would want to post the delete request to this same target URI
POST /users/{userKey}/sessions
because doing it that way makes the cache invalidation story a bit easier.
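A sketch of that alternative: the representation and the "expire everything" request share the same target URI, so caches that invalidate on unsafe methods drop the stale representation (Flask; the action body shape is just an assumption):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
session_store = {}  # userKey -> {sessionId: session data}; stand-in

@app.route("/users/<user_key>/sessions", methods=["GET", "POST"])
def user_sessions(user_key):
    if request.method == "GET":
        return jsonify(list(session_store.get(user_key, {}).keys()))

    # POST to the same target URI; the body names the action (shape is illustrative).
    body = request.get_json(silent=True) or {}
    if body.get("action") == "expire-all":
        session_store.pop(user_key, None)
        return "", 204
    return "", 400
```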
In the context of a restful web service, is it acceptable to have side effects for GET methods?
Single-use download links, for example:
GET /downloads/664d92b3-b373-4dac-a4fb-7a41d015109a
will return 200 and "the thing" on the first request, and 404 on the next.
The HTTP spec says GET methods should be safe, and according to https://www.rfc-editor.org/rfc/rfc7231#section-4.2.1:
Request methods are considered "safe" if their defined semantics are
essentially read-only; i.e., the client does not request, and does
not expect, any state change on the origin server as a result of
applying a safe method to a target resource.
and
This definition of safe methods does not prevent an implementation
from including behavior that is potentially harmful, that is not
entirely read-only, or that causes side effects while invoking a safe
method. What is important, however, is that the client did not
request that additional behavior and cannot be held accountable for
it.
Several clarifying examples are provided which make me think safe methods are not allowed to purposefully remove the resource.
For example, most servers append request information to access
log files at the completion of every response, regardless of the
method, and that is considered safe even though the log storage might
become full and crash the server.
And
Likewise, a safe request initiated
by selecting an advertisement on the Web will often have the side
effect of charging an advertising account.
And
For example, it is
common for Web-based content editing software to use actions within
query parameters, such as "page?do=delete". If the purpose of such a
resource is to perform an unsafe action, then the resource owner MUST
disable or disallow that action when it is accessed using a safe
request method.
Single-use links are obviously a reality. I just wonder whether they're abusing the spec, or whether I just don't get it.
Having an opinion is fine but having worked on these specs and understanding their subtleties would be most convincing.
What you're suggesting is acceptable in some situations, and not necessarily an abuse of the spec.
Firstly, 2616 says regarding safe methods that they:
SHOULD NOT have the significance of taking an action other than
retrieval
And the phrase "SHOULD NOT" is defined as follows (emphasis added):
This phrase, or the phrase "NOT RECOMMENDED" mean that there may
exist valid reasons in particular circumstances when the particular
behavior is acceptable or even useful, but the full implications
should be understood and the case carefully weighed before
implementing any behavior described with this label.
The new version you linked to (which I think supersedes 2616) doesn't use the term "SHOULD NOT" - but they haven't replaced it with "MUST NOT" either. They also acknowledge that side effects are not ruled out as long as the client is not held responsible. So I think the idea of safe methods is the same.
So since the spec acknowledges that there are situations where it's ok, how do we know if yours is such a situation - and more importantly, how do we stay generally within the "spirit" of the spec i.e. make sure we're not abusing it?
I'd refer to this quote from 7231:
The purpose of distinguishing between safe and unsafe methods is to
allow automated retrieval processes (spiders) and cache performance
optimization (pre-fetching) to work without fear of causing harm.
If your app is a private intranet app and you're not concerned with the issues mentioned here, your approach is ok. Put another way: taking into consideration all the possible ways that a GET could happen, are you ok with this side effect?
Working outside RESTful guidelines is not always bad. It's just important to make sure you understand the effect it has.
With all that said, if you are looking for a way to implement reliable, consistent one-time delivery of a resource over HTTP, it's well worth reading Bill de hÓra's HTTPLR spec (http://www.dehora.net/doc/httplr/draft-httplr-01.html). This approach relies on the client acknowledging receipt of the message. You might be able to use something like it to allow user agents that are unaware of the one-use policy (spiders, etc.) to GET the resource without causing side effects, but still allow participating clients to cause the resource to become unavailable after one GET.
A transactional approach like this has the added benefit of allowing the client to re-try the download as often as they need to. This is important because otherwise the server cannot know whether the client successfully received the message or not.
If you really need to enforce the once-only policy from the server side for any possible user agent, then your original approach might be best, but bear in mind it's really an "at most once" policy.
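A rough sketch of the acknowledgement idea (not the HTTPLR protocol itself): GET stays safe and repeatable, and a participating client sends an explicit unsafe request to retire the link once it has the content. Flask; the paths, the ack endpoint, and the in-memory store are all placeholders:

```python
from flask import Flask, abort

app = Flask(__name__)
downloads = {
    "664d92b3-b373-4dac-a4fb-7a41d015109a": {"body": "the thing", "retired": False},
}

@app.route("/downloads/<token>")
def fetch_download(token):
    item = downloads.get(token)
    if item is None or item["retired"]:
        abort(404)
    return item["body"]     # safe: repeated GETs keep working until acknowledged

@app.route("/downloads/<token>/ack", methods=["POST"])
def acknowledge_download(token):
    item = downloads.get(token)
    if item is None:
        abort(404)
    item["retired"] = True  # the unsafe step is an explicit client action
    return "", 204
```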
Sometimes breaking the spec is the only way; an example is web-page visit counters that use a hidden image. The image is requested with GET but updates a counter.
However, some things can go wrong. Applications that follow the spec are allowed to presume that making a GET request won't have any side effects. So it is perfectly valid, for example, for some kind of antivirus-enabled email server to follow the links found in an email to make sure everything is safe. If you send this "download-once" link in an email, the recipient might never see it. For the same reason, a yes/no answer with two different links in an email is hard to deploy. The same applies to web pages: I recall Google crawling the links of a unique-per-user page, known to Google only because it contained an analytics script; because the page also contained these infamous links with side effects, Google was actually changing the answers of the people who visited it...
Fake hits are not really a problem in the case of the hidden-image counter (such counters are not considered very reliable anyway), but in the case of a "download-once" link they could be problematic.
What is the REST-ful way of deleting multiple items?
My use case is that I have a Backbone Collection wherein I need to be able to delete multiple items at once. The options seem to be:
Send a DELETE request for every single record (which seems like a bad idea if there are potentially dozens of items);
Send a DELETE where the IDs to delete are strung together in the URL (e.g., "/records/1;2;3");
In a non-REST way, send a custom JSON object containing the IDs marked for deletion.
All options are less than ideal.
This seems like a gray area of the REST convention.
Option 1 (a DELETE request per record) is a viable RESTful choice, but obviously has the limitations you have described.
Option 2 (IDs strung together in the URL): don't do this. It would be construed by intermediaries as meaning "DELETE the (single) resource at /records/1;2;3", so a 2xx response may cause them to purge their cache of /records/1;2;3 (but not /records/1, /records/2 or /records/3), proxy a 410 response for /records/1;2;3, or do other things that don't make sense from your point of view.
Option 3 (a custom JSON body) is best, and can be done RESTfully. If you are creating an API and you want to allow mass changes to resources, you can use REST to do it, but exactly how is not immediately obvious to many. One method is to create a 'change request' resource (e.g. by POSTing a body such as records=[1,2,3] to /delete-requests) and poll the created resource (specified by the Location header of the response) to find out whether your request has been accepted, rejected, is in progress, or has completed. This is useful for long-running operations. Another way is to send a PATCH request to the list resource, /records, the body of which contains a list of resources and actions to perform on those resources (in whatever format you want to support). This is useful for quick operations where the response code for the request can indicate the outcome of the operation.
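As an illustration of the first method, here is a sketch of a "delete request" resource you POST to and then poll (Flask; the URI names, body shape, and in-memory job table are all assumptions, and a real implementation would process the job in the background):

```python
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)
jobs = {}  # job id -> job state; stand-in for real background processing

@app.route("/delete-requests", methods=["POST"])
def create_delete_request():
    ids = (request.get_json(silent=True) or {}).get("records", [])
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"records": ids, "status": "accepted"}
    # 202 + Location: the client polls the created resource for progress.
    return "", 202, {"Location": f"/delete-requests/{job_id}"}

@app.route("/delete-requests/<job_id>")
def get_delete_request(job_id):
    job = jobs.get(job_id)
    if job is None:
        return "", 404
    return jsonify(job)
```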
Everything can be achieved whilst keeping within the constraints of REST, and usually the answer is to make the "problem" into a resource, and give it a URL.
So, batch operations, such as delete here, or POSTing multiple items to a list, or making the same edit to a swathe of resources, can all be handled by creating a "batch operations" list and POSTing your new operation to it.
Don't forget, REST isn't the only way to solve any problem. “REST” is just an architectural style and you don't have to adhere to it (but you lose certain benefits of the internet if you don't). I suggest you look down this list of HTTP API architectures and pick the one that suits you. Just make yourself aware of what you lose out on if you choose another architecture, and make an informed decision based on your use case.
There are some bad answers to this question on Patterns for handling batch operations in REST web services? which have far too many upvotes, but ought to be read too.
If GET /records?filteringCriteria returns an array of all records matching the criteria, then DELETE /records?filteringCriteria could delete all such records.
In this case the answer to your question would be DELETE /records?id=1&id=2&id=3.
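In Flask, for instance, repeated `id` parameters can be read with `getlist` (a sketch; the record store is a placeholder):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
records = {1: "a", 2: "b", 3: "c"}  # stand-in for the real store

@app.route("/records", methods=["DELETE"])
def delete_records():
    # DELETE /records?id=1&id=2&id=3
    ids = [int(i) for i in request.args.getlist("id")]
    deleted = [i for i in ids if records.pop(i, None) is not None]
    return jsonify({"deleted": deleted})
```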
I think Mozilla Storage Service SyncStorage API v1.5 is a good way to delete multiple records using REST.
Deletes an entire collection.
DELETE https://<endpoint-url>/storage/<collection>
Deletes multiple BSOs from a collection with a single request.
DELETE https://<endpoint-url>/storage/<collection>?ids=<ids>
ids: deletes BSOs from the collection whose ids are in the provided comma-separated list. A maximum of 100 ids may be provided.
Deletes the BSO at the given location.
DELETE https://<endpoint-url>/storage/<collection>/<id>
http://moz-services-docs.readthedocs.io/en/latest/storage/apis-1.5.html#api-instructions
This seems like a gray area of the REST convention.
Yes, so far I have only come across one REST API design guide that mentions batch operations (such as a batch delete): the Google API design guide.
This guide mentions the creation of "custom" methods that can be associated with a resource by using a colon, e.g. https://service.name/v1/some/resource/name:customVerb, and it also explicitly mentions batch operations as a use case:
A custom method can be associated with a resource, a collection, or a service. It may take an arbitrary request and return an arbitrary response, and also supports streaming request and response. [...] Custom methods should use HTTP POST verb since it has the most flexible semantics [...] For performance critical methods, it may be useful to provide custom batch methods to reduce per-request overhead.
So you could do the following according to google's api guide:
POST /api/path/to/your/collection:batchDelete
...to delete a bunch of items of your collection resource.
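A sketch of such a custom method in Flask; the literal colon is simply part of the path, and the collection name and body shape are placeholders:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
items = {1: "a", 2: "b", 3: "c"}  # stand-in collection

@app.route("/api/items:batchDelete", methods=["POST"])
def batch_delete_items():
    ids = (request.get_json(silent=True) or {}).get("ids", [])
    deleted = [i for i in ids if items.pop(i, None) is not None]
    return jsonify({"deleted": deleted})
```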
I've allowed for a wholesale replacement of a collection, e.g. PUT ~/people/123/shoes where the body is the entire collection representation.
This works for small child collections of items where the client wants to review the items, prune out some, add some others, and then update the server. They could PUT an empty collection to delete all.
This would mean GET ~/people/123/shoes/9 would still remain in cache even though a PUT deleted it, but that's just a caching issue and would be a problem if some other person deleted the shoe.
My data/systems APIs always use ETags as opposed to expiry times so the server is hit on each request, and I require correct version/concurrency headers to mutate the data. For APIs that are read-only and view/report aligned, I do use expiry times to reduce hits on origin, e.g. a leaderboard can be good for 10 mins.
For much larger collections, like ~/people, I tend not to need multiple delete, the use-case tends not to naturally arise and so single DELETE works fine.
In future, and from experience with building REST APIs and hitting the same issues and requirements, like audit, I'd be inclined to use only GET and POST verbs and design around events, e.g. POST a change of address event, though I suspect that'll come with its own set of problems :)
I'd also allow front-end devs to build their own APIs that consume stricter back-end APIs, since there are often practical, valid client-side reasons why they don't like strict "Fielding zealot" REST API designs, and for productivity and cache-layering reasons.
You can POST a deleted resource :). The URL will be
POST /deleted-records
and the body will be
{"ids": [1, 2, 3]}
SO says this may be subjective. I'm hoping not; I just can't seem to understand how this works in practice, and it seems like a specific enough technical question with, I hope, a definitive answer.
Context: LAPP stack.
I've read that using a single database user as the login for all connections to the database, and handling security yourself from there, is a bad idea. Databases have sufficient security models and it makes sense to use them.
Database handles have some resource cost associated with them, hence the existence of Apache::DBI, DBIx::Connector, and DBI::connect_cached(), to re-use a recent connection to a database. Making use of them should make a web app faster by avoiding the cost of connecting to a database.
The reason these seem to be mutually exclusive best practices is that, in my understanding, #1 implies that any database connection will be made with separate per-user credentials, which implies (as Apache::DBI documents) that re-using such connections will likely quickly cause your database backend to run out of connections.
The default maximum number of connections for PostgreSQL is 100.
The default number of servers, multiplied by the number of subprocesses allowed for each, for Apache 2 running with the prefork MPM, far exceeds that, so it seems Apache::DBI's docs are right.
Thus the question: What do people do then, in practice?
Does this mean people using a LAPP stack generally connect using a single database user, and implement their own security/permissions model? Or does it mean they don't pool connections? Or do they choose between these two strategies based on speed vs security needs if they go with a LAPP stack, and if they need both, go with a desktop app or some other connection model?
Or if these are not, in fact, mutually exclusive strategies, what am I missing in my understanding here?
I've read that using a single database user as the login for all connections to the database, and handling security yourself from there, is a bad idea. Databases have sufficient security models and it makes sense to use them.
You probably misread this, or read it in a highly biased location. A more balanced view is (hopefully) this:
Managing perms (ACL or RBAC or other) in your own tables within the database is a bloody mess and hard to get right. It can cripple performance, too, if done improperly (think: "select * from table join perms where convoluted_permission_scenario".) Depending on who you ask, you'll get more or less extreme viewpoints, e.g. here's (the very controversial) Zed Shaw: http://vimeo.com/2723800.
Managing perms at the DB level is just as much of a bloody mess. Not all engines implement row-level permissions, and even then there occasionally are leaks. For instance, calling a function in a where clause could (can?) leak rows in Postgres (until a recent version?) if raise gets called. And frankly, if you go past a superficial analysis of what is going on, it basically amounts to the former — just standardized and (usually) in C.
Managing perms at the app level without a database is also a bloody mess. It'll cripple performance no matter what you do from the moment where you need to join outside of SQL, unless you're dealing with trivial amounts of data. If you try it, you'll do fine… until your database grows too large and you basically don't.
So, in short: it's a bloody mess no matter where you manage it. Because permissions are a mess. In addition to the casual and idealistic "Joe needs write access to this set of nodes", you also need to cope with more down to earth scenarios such as "John is going off on vacation for Christmas and needs to temporarily delegate his write permissions on this set of nodes to his assistant Jane". Moreover, whichever scenario you do pick, you need to manage read access (which is usually the most frequent) in such a way that it's fast so you can scale. There's no silver bullet.
Moreover, even in the first and last of the above scenarios, it's ideal to have three DB users. One for reads, one for read/writes, and one for schema changes. Most apps don't, because it's yet another bloody mess to configure your ORM that way, hence the typical one DB user per app.
Anyway, getting back to your question: what people do in practice is one or two database users (read vs read/write/modify), implement RBAC or ACL within the database itself, and avoid access restriction logic like the plague on public-facing pages for performance reasons.
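For what it's worth, a sketch of the one-or-two-users setup in Python with psycopg2 (role names and connection details are illustrative; the same split works just as well with Perl's DBI and a connection cache):

```python
import psycopg2

# Two connections under different database roles: a read-only role for the
# bulk of traffic and a read/write role for mutations (names are illustrative).
read_conn = psycopg2.connect(host="localhost", dbname="app",
                             user="app_reader", password="...")
read_conn.autocommit = True  # plain reads don't need explicit transactions here

write_conn = psycopg2.connect(host="localhost", dbname="app",
                              user="app_writer", password="...")

def fetch_page(page_id):
    with read_conn.cursor() as cur:
        cur.execute("SELECT title, body FROM pages WHERE id = %s", (page_id,))
        return cur.fetchone()

def update_page(page_id, body):
    with write_conn.cursor() as cur:
        cur.execute("UPDATE pages SET body = %s WHERE id = %s", (body, page_id))
    write_conn.commit()
```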