How to balance a REST API and openness to prevent data stealing

One of our web sites is a typical "advertise your apartment for free" site.
Revenue is directly tied to the amount of public usage and the number of listings registered (our marketing department's argument).
On the other side, REST pushes you to keep a clean, clear API when designing it (our software department's argument), which is an open invitation for any competitor to steal our data. In this view, the web server becomes little more than an intelligent database.
We have clearly identified the problem, but have no idea how to reconcile these constraints. Any tips would help.

Throttle the calls to the data-rich endpoints by IP to, say, 1000 per day (or triple what a normal user would use).
If you expose data, it can be stolen. Also think about search endpoints that return large datasets, even when they are triggered by JavaScript or forms - I have personally written crawlers that get around such measures.
You may also think (if the data is that important) about decrypting it in the client based on keys and authentication sent from the server (but this only raises the bar; it does not remove the ability to steal).
Add a captcha/reCAPTCHA for users who are scanning too quickly or too much.
In short:
As always, only expose the minimum API needed to do the job (attack surface minimisation).
Log and throttle (a throttling sketch follows this list).
Force sign-in(?). This at least MAY put off some scanners.
Use a captcha mechanism for users you think may be bots trawling your data.
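A minimal sketch of the per-IP throttle with captcha escalation, assuming a Flask app and an in-memory counter; in production a shared store such as Redis would be needed, and all names and thresholds here are illustrative:

from collections import defaultdict
from datetime import date
from flask import Flask, request, abort

app = Flask(__name__)

DAILY_LIMIT = 1000          # roughly triple what a normal user would need
CAPTCHA_THRESHOLD = 300     # start challenging well before the hard limit

hits = defaultdict(int)     # (day, ip) -> requests seen; in-memory, per process

@app.before_request
def throttle_by_ip():
    key = (date.today().isoformat(), request.remote_addr)
    hits[key] += 1
    if hits[key] > DAILY_LIMIT:
        abort(429)                                        # hard limit: Too Many Requests
    if hits[key] > CAPTCHA_THRESHOLD and not request.cookies.get("captcha_ok"):
        abort(403, "Please solve the captcha first")      # escalate to a challenge

@app.route("/listings/<int:listing_id>")
def listing(listing_id):
    return {"id": listing_id}                             # the data-rich endpoint being protected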

Random email addresses being signed up to my website

Over the past few months random email addresses, some of which are on known spam lists, have been added at the rate of 2 or 3 a day to my website.
I know they aren't real humans - for a start, the website serves a very narrow geographical area and many of these emails are clearly from a different country; others are info@ addresses that appear to have been harvested from a website, rather than something a human would use to sign up to a site.
What I can't work out is, what are reasons for somebody doing this? I can't see any benefit to an external party beyond being vaguely destructive. (I don't want to link to the site here, it's just a textbox where you enter email and press join).
These emails are never verified - my question isn't about how to prevent this, but what are some valid reasons why somebody might do this. I think it's important to understand why malicious users do what they do.
This is probably a list bombing attack, which is definitely not valid. The only valid use I can think of is for security research, and that's a corner case.
List bomb
I suspect this is part of a list bombing attack, which is when somebody uses a tool or service to maliciously sign up a victim for as much junk email as possible. I work in anti-spam and have seen victims' perspectives on this: it's nearly all opt-in verifications, meaning the damage is only one per service. It sounds like you're in the Confirmed Opt-In (COI) camp, so congratulations, it could be worse.
We don't have good solutions for list bombing. There are too many problems to entertain a global database of hashed emails that have recently opted into lists (so list maintainers could look up an address, conclude it's being bombed, and refuse to invite). A global database of hashed emails opting out of bulk mail (like the US Do Not Call list or the now-defunct Blue Frog's Do Not Intrude registry but without the controversial DDoS-the-spammers portion) could theoretically work in this capacity, though there'd still be a lot of hurdles to clear.
At the moment, the best thing you can do is to rate-limit (which this attacker is savvy enough to avoid) and use captchas. You can measure your success based on the click rate of the links in your COI emails; if it's still low, you still have a problem.
In your particular case, asking the user to identify a region via drop-down, with no default, may give you an easy way to reject subscriptions or trigger more complex captchas.
If you're interested in a more research-driven approach, you could try to fingerprint the subscription requests and see if you can identify the tool (if it's client-run, and I believe most are) or the service (if it's cloud-run, in which case you can hopefully just blacklist a few CIDR ranges instead). Pay attention to requesters' HTTP headers, especially the referer. Browser fingerprinting is its own arms race; take a gander at the EFF's Panopticlick or Brian Krebs's piece on AntiDetect.
Security research
The only valid case I can consider, whose validity is debatable, is that of security research (which is my field). When I'm given a possible phishing link, I'm going to anonymize it. This means I'll enter fake data rather than reveal my source. I'd never intentionally go after a subscription mechanism (at least with an email I don't control), but I suppose automation could accidentally stumble into such a thing.
You can avoid that by requiring POST requests to subscribe. No (well-designed) subscription mechanism should accept GET requests or action links without parameters (though there are plenty that do). No (well-designed) web crawler, for search or archiving or security, should generate POST requests, at least without several controls to ensure it's acceptable (such as already concluding that it's a bad actor's site). I'm going to be generous and not call out any security vendors that I know do this.
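To illustrate the POST-only point, here is a minimal sketch of a subscription endpoint that a link-following crawler cannot trigger with a GET; the framework, route, and field names are illustrative assumptions, not the poster's actual site:

from flask import Flask, request, abort

app = Flask(__name__)

# Only POST is accepted; Flask answers GET on this route with 405 automatically,
# so an "action link" harvested by a crawler cannot create a subscription.
@app.route("/subscribe", methods=["POST"])
def subscribe():
    email = request.form.get("email", "")
    region = request.form.get("region", "")   # region drop-down with no default, as suggested above
    if not email or not region:
        abort(400)
    # queue a confirmed-opt-in (COI) email instead of adding the address directly
    return "Check your inbox to confirm.", 202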

How appropriate is it to use SAML_login with AEM for more than 1M users?

I am investigating slow login times and some profile synchronisation problems on a large enterprise AEM project. The system has around 1.5M users, and the website is served by 10 publishers.
The way this project is built, SAML_login is enabled for all of these end users, and there is a third-party IdP which I assume SAML_login talks to. I'm no expert on SSO or the SAML_login process, so as a first step I'm trying to understand whether this is the correct way to go.
Because of this setup and the number of users, a SAML_login call takes 15 seconds on average. This is getting more unacceptable by the day as the user count rises. Even more importantly, synchronization between the 10 publishers occasionally fails, so some users sometimes can't use the system as expected.
Because the users are stored in the JCR for SAML_login, you can't even inspect the /home/users folder from the CRX browser; it times out, since it is impossible to show 1.5M rows at once. My educated guess is that this is why the SAML_login call takes so long.
I've come across articles that explain how to set up SAML_login on AEM, which makes it sound legitimate for the way it is used in this case. But in my opinion this is a very poor setup, as the JCR is not designed as a quick-access data store for this kind of usage scenario.
My understanding so far is that this approach might work well with a limited number of users, but with this many users it is not a workable approach. So my first question would be: am I right? :)
If I'm not, there is certainly a bottleneck somewhere that I'm not aware of yet; what could that bottleneck be, and how could it be improved?
The AEM SAML Authentication handler has some performance limitations with the default configuration. When your browser makes an HTTP POST request to AEM under /saml_login, it includes a Base64-encoded "SAMLResponse" request parameter. AEM processes that response directly and does not contact any external systems.
Even though the SAML response is processed on AEM itself, the bottlenecks of the /saml_login call are the following:
The initial login, where AEM creates the user node for the first time - you can mitigate this by creating the nodes ahead of time, e.g. with a script that pre-creates the SAML user nodes under /home/users.
Each login, when the session is first created - a token node is created under the user node at /home/users/.../{usernode}/.tokens. This can be avoided by enabling the encapsulated token feature.
Finally, the last bottleneck occurs when AEM saves the SAMLResponse XML under the user node (kept for later use by SAML-based logout). This can be avoided by not implementing SAML-based logout. The latest com.adobe.granite.auth.saml bundle supports turning off the saving of the SAML response; service packs AEM 6.4.8 and AEM 6.5.4 include this feature. To enable it, set the OSGi configuration properties storeSAMLResponse=false and handleLogout=false, and the SAML response will not be stored.
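For reference, a sketch of what that configuration could look like as an OSGi JSON config file; only the two property names come from the description above, while the factory PID (com.adobe.granite.auth.saml.SamlAuthenticationHandler) and the file name/location are assumptions to be checked against your own handler configuration:

File (name assumed): com.adobe.granite.auth.saml.SamlAuthenticationHandler~mysite.cfg.json

{
  "storeSAMLResponse": false,
  "handleLogout": false
}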

How can I search for past sent emails with Sendgrid?

As Sendgrid's documentation makes clear, their web GUI activity page is only searchable for the past 7 days.
How do I search for activity from farther in the past?
The Web API documentation is here, but I can't find anything about plain searching for info on sent emails. All I see are endpoints for viewing particular categories of emails' various fates, like blocks, bounces, invalid emails, and "filters", which sound like actions rather than filters.
It's got to be possible to just find info about some particular sent email, right?
It's not possible. As you noted, the documentation clearly states that:
Email activity only shows the most recent 7 days. To access data in
real time, we recommend that you consider implementing our Event
Webhook.
If you want to record all the history associated with your account you should record and save it yourself. You can record all the emails you send provided you have an endpoint to do so. See here: https://sendgrid.com/docs/User_Guide/Settings/parse.html
Later Edit:
"real time" means "as it happens", it does not mean "history searchable at any point in time".
When you use an API, as a developer, the responsibility to log all API calls and responses lies with you. While it's true that bounces aren't necessarily reported in the API call response, the SendGrid API offers several ways in which you can be notified. Personal opinion: I know this functionality is often omitted in the MVP because you need to go to market as soon as possible, but an ELK stack is not that hard to set up.
There are several ways you can look for bounces and other events as you can see here: https://sendgrid.com/docs/Classroom/Track/Bounces/bounce_reports_how_can_i_be_notified.html
Webhook for events: http://sendgrid.com/docs/API_Reference/Webhooks/event.html (a minimal receiver sketch follows after this list)
Enabling Bounce Forwarding on your account
Bounce API: https://sendgrid.com/docs/API_Reference/Web_API_v3/bounces.html
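A minimal sketch of an Event Webhook receiver that keeps your own searchable history; it assumes a Flask app and that the webhook POSTs a JSON array of event objects, and the log-file storage is just an illustration (an ELK stack or database would be the real target):

import json
from flask import Flask, request

app = Flask(__name__)

# Point the SendGrid Event Webhook at this URL; it POSTs a JSON array of events.
@app.route("/sendgrid/events", methods=["POST"])
def sendgrid_events():
    events = request.get_json(force=True) or []
    with open("sendgrid_events.log", "a") as log:
        for event in events:
            # typical fields: email, event (delivered/bounce/open/...), timestamp, sg_message_id
            log.write(json.dumps(event) + "\n")
    return "", 204

You can then search sendgrid_events.log (or the store it feeds) for any past email, well beyond the 7-day window.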
If you really need to find out what happened on day X with email send Y, you can contact their Support team. They can probably look it up for you.
Personal opinion:
That 7 days is not a random number. I'm willing to bet that SendGrid does in fact log all the calls you made, but it can't serve them back for an arbitrary point in the past. When you use the Facebook API, Twitter API, etc., you don't expect them to provide you with historical data for every API call you've ever made. This is an ungodly amount of data; we're talking about an API that is used to send probably upwards of millions of emails per day, maybe more. I believe they actually did the math and concluded that recalling older historical data would put unnecessary strain on the system, and it would take a long time to answer such a request.
I'm sorry if I went on a bit of a rant but people often don't think about the volume of data needed to store such things and how much it would cost to search it.

Keeping things RESTful

New to REST, and not having even known what REST was, I began watching a few videos and picked up a book to help guide me towards the correct approach. Unfortunately, my first version is completely botched to hell, and I'm likely going to have to break any customers using that implementation shortly. To ensure that I don't need to do this again, I need your assistance!
I have a few DB tables that I'm concerned with here:
'PrimaryBuyer' & 'AllBuyers'
They share a majority of fields, but AllBuyers has a few things PrimaryBuyer does not, and vice versa. Each primary buyer is given a unique 'CaseNumber' when entered into the system. This, in addition to a 'SequenceNumber', is then used to identify rows in 'AllBuyers'. The CaseNumber is returned to the user of the web service to store for future use; the sequence numbers, however, are implied by each buyer's position within the XML/JSON.
To illustrate these tables: if I were to buy a car, I would be the primary buyer and would thus be entered into BOTH the PrimaryBuyer and AllBuyers tables. However, if my credit were bad, I could have my spouse cosign the loan. This would make her a secondary buyer, and she would be entered exclusively into the 'AllBuyers' table.
I currently have one REST URI set up as '/buyers/', which mandates that all information for all buyers is entered at once. Similarly, if I do an update through this URI, the primary buyer is updated in both tables and any secondary buyers in the payload replace previously existing ones.
Ultimately, there is no way to directly access the 'PrimaryBuyer' and 'AllBuyers' tables.
I've been trying to think of a way around this problem, but have been unable to come up with anything that's both RESTful and not a pain for customers. Is it ridiculous to think that the user should (say on an add) POST to /primarybuyer/, take the returned casenumber, and then POST the same information and then some to /allbuyers/? That seems like it would be a little silly on bandwidth, among other things. Should things be left in their current state?
Hopefully that's not too much information to answer such a seemingly simple question.
Is it ridiculous to think that the user should (say on an add) POST to
/primarybuyer/, take the returned casenumber, and then POST the same
information and then some to /allbuyers/?
When you talk about "user" do you mean the person or the system (browser) that uses your service ?
Using a REST service is normally made by a system, today that means a lot browser+Javascript. To do the job for the user person is done at a web facade, and in back is running all your (Javascript) code to make the appropiated REST calls.
Why not POST the buyers' information as a serialized object to the REST server? You could do this via the parameters section of the request. When the request reaches the server, you deserialize the object and implement the logic of updating the database.
Kind of like how Amazon does it? http://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_AddInstanceGroups.html
If the client application does a POST to /PrimaryBuyer, there is no reason that the server cannot also copy that case information into the /AllBuyers resource, and vice versa.
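To make the single-resource option concrete, a sketch of a combined payload for the existing /buyers/ URI, where the server fans the data out to both tables; the field names are illustrative assumptions, and the secondary buyers' sequence numbers are implied by their position in the array:

POST /buyers/
{
  "primaryBuyer": { "firstName": "John", "lastName": "Doe" },
  "secondaryBuyers": [
    { "firstName": "Jane", "lastName": "Doe" }
  ]
}

Response:
{ "caseNumber": "C-10042" }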

Transactions in REST?

I'm wondering how you'd implement the following use-case in REST. Is it even possible to do without compromising the conceptual model?
Read or update multiple resources within the scope of a single transaction. For example, transfer $100 from Bob's bank account into John's account.
As far as I can tell, the only way to implement this is by cheating. You could POST to the resource associated with either John or Bob and carry out the entire operation using a single transaction. As far as I'm concerned this breaks the REST architecture because you're essentially tunneling an RPC call through POST instead of really operating on individual resources.
Consider a RESTful shopping basket scenario. The shopping basket is conceptually your transaction wrapper. In the same way that you can add multiple items to a shopping basket and then submit that basket to process the order, you can add Bob's account entry to the transaction wrapper and then John's account entry to the wrapper. When all the pieces are in place, you can POST/PUT the transaction wrapper with all the component pieces.
There are a few important cases that aren't answered by this question, which I think is too bad, because it has a high ranking on Google for the search terms :-)
Specifically, a nice property would be: if you POST twice (because some cache hiccupped in the middle), you should not transfer the amount twice.
To get this, you create the transaction as a resource. It can contain all the data you already know, with the transaction in a pending state.
POST /transfer/txn
{"source":"john's account", "destination":"bob's account", "amount":10}
{"id":"/transfer/txn/12345", "state":"pending", "source":...}
Once you have this transaction, you can commit it, something like:
PUT /transfer/txn/12345
{"id":"/transfer/txn/12345", "state":"committed", ...}
{"id":"/transfer/txn/12345", "state":"committed", ...}
Note that multiple PUTs don't matter at this point; even a GET on the txn would return the current state. Specifically, a second PUT would detect that the transaction was already in the appropriate state and just return it -- or, if you try to put it into the "rolledback" state after it's already in the "committed" state, you would get an error, and the actual committed transaction back.
As long as you talk to a single database, or a database with an integrated transaction monitor, this mechanism will actually work just fine. You might additionally introduce time-outs for transactions, which you could even express using Expires headers if you wanted to.
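A minimal sketch of the server side of this scheme, assuming Flask and an in-memory store; the only point is that the commit PUT is idempotent and an illegal state change is rejected:

import itertools
from flask import Flask, request, jsonify, abort

app = Flask(__name__)
transactions = {}                  # txn id -> its current representation
ids = itertools.count(12345)

@app.route("/transfer/txn", methods=["POST"])
def create_txn():
    txn = request.get_json()
    txn_id = str(next(ids))
    txn.update(id="/transfer/txn/" + txn_id, state="pending")
    transactions[txn_id] = txn
    return jsonify(txn), 201

@app.route("/transfer/txn/<txn_id>", methods=["PUT"])
def put_txn(txn_id):
    txn = transactions.get(txn_id) or abort(404)
    wanted = request.get_json().get("state")
    if txn["state"] == wanted:
        return jsonify(txn)                        # repeated PUT: already there, just return it
    if txn["state"] == "pending" and wanted == "committed":
        # move the money between the accounts here, inside one database transaction
        txn["state"] = "committed"
        return jsonify(txn)
    abort(409)                                     # e.g. rolling back an already committed txn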
In REST terms, resources are nouns that can be acted on with CRUD (create/read/update/delete) verbs. Since there is no "transfer money" verb, we need to define a "transaction" resource that can be acted upon with CRUD. Here's an example in HTTP+POX. First step is to CREATE (HTTP POST method) a new empty transaction:
POST /transaction
This returns a transaction ID, e.g. "1234" and according URL "/transaction/1234". Note that firing this POST multiple times will not create the same transaction with multiple IDs and also avoids introduction of a "pending" state. Also, POST can't always be idempotent (a REST requirement), so it's generally good practice to minimize data in POSTs.
You could leave the generation of a transaction ID up to the client. In this case, you would POST /transaction/1234 to create transaction "1234" and the server would return an error if it already existed. In the error response, the server could return a currently unused ID with an appropriate URL. It's not a good idea to query the server for a new ID with a GET method, since GET should never alter server state, and creating/reserving a new ID would alter server state.
Next up, we UPDATE (PUT HTTP method) the transaction with all data, implicitly committing it:
PUT /transaction/1234
<transaction>
<from>/account/john</from>
<to>/account/bob</to>
<amount>100</amount>
</transaction>
If a transaction with ID "1234" has been PUT before, the server gives an error response, otherwise an OK response and a URL to view the completed transaction.
NB: in /account/john, "john" should really be John's unique account number.
Great question. REST is mostly explained with database-like examples, where something is stored, updated, retrieved, or deleted. There are few examples like this one, where the server is supposed to process the data in some way. I don't think Roy Fielding included any in his thesis, which was based on HTTP after all.
But he does talk about "representational state transfer" as a state machine, with links moving you to the next state. In this way, the documents (the representations) keep track of the client state, instead of the server having to do it. In this sense, there is no client state on the server, only state in terms of which link you are on.
I've been thinking about this, and it seems reasonable to me that to get the server to process something for you, when you upload it the server would automatically create related resources and give you the links to them (in fact, it wouldn't need to create them automatically: it could just tell you the links, and only create them if and when you follow them - lazy creation). It would also give you links to create new related resources - a related resource has the same URI but is longer (adds a suffix). For example:
You upload (POST) the representation of the concept of a transaction with all the information. This looks just like a RPC call, but it's really creating the "proposed transaction resource". e.g URI: /transaction
Glitches will cause multiple such resources to be created, each with a different URI.
The server's response states the created resource's URI and its representation - this includes the link (URI) for creating the related resource of a new "committed transaction resource". Other related resources include the link to delete the proposed transaction. These are states in the state machine, which the client can follow. Logically, these are part of the resource that has been created on the server, beyond the information the client supplied. e.g. URIs: /transaction/1234/proposed, /transaction/1234/committed
You POST to the link to create the "committed transaction resource", which creates that resource and changes the state of the server (the balances of the two accounts). By its nature, this resource can only be created once and can't be updated. Therefore, glitches that would commit the transaction many times can't occur.
You can GET those two resources, to see what their state is. Assuming that a POST can change other resources, the proposal would now be flagged as "committed" (or perhaps, not available at all).
This is similar to how webpages operate, with the final webpage saying "are you sure you want to do this?" That final webpage is itself a representation of the state of the transaction, which includes a link to go to the next state. Not just financial transactions; also (eg) preview then commit on wikipedia. I guess the distinction in REST is that each stage in the sequence of states has an explicit name (its URI).
In real-life transactions/sales, there are often different physical documents for different stages of a transaction (proposal, purchase order, receipt etc). Even more for buying a house, with settlement etc.
OTOH, this feels like playing with semantics to me; I'm uncomfortable with the nominalization of converting verbs into nouns to make it RESTful, "because it uses nouns (URIs) instead of verbs (RPC calls)" - i.e. the noun "committed transaction resource" instead of the verb "commit this transaction". I guess one advantage of nominalization is that you can refer to the resource by name, instead of needing to specify it in some other way (such as maintaining session state, so you know what "this" transaction is...).
But the important question is: What are the benefits of this approach? i.e. In what way is this REST-style better than RPC-style? Is a technique that's great for webpages also helpful for processing information, beyond store/retrieve/update/delete? I think that the key benefit of REST is scalability; one aspect of that is not needing to maintain client state explicitly (but making it implicit in the URI of the resource, and the next states as links in its representation). In that sense it helps. Perhaps this helps in layering/pipelining too? OTOH only the one user will look at their specific transaction, so there's no advantage in caching it so others can read it, the big win for http.
I've drifted away from this topic for 10 years. Coming back, I can't believe the religion masquerading as science that you wade into when you google rest+reliable. The confusion is mythic.
I would divide this broad question into three:
Downstream services. Any web service you develop will have downstream services that you use, and whose transaction syntax you have no choice but to follow. You should try and hide all this from users of your service, and make sure all parts of your operation succeed or fail as a group, then return this result to your users.
Your services. Clients want unambiguous outcomes from web-service calls, and the usual REST pattern of making POST, PUT or DELETE requests directly on substantive resources strikes me as a poor, and easily improved, way of providing this certainty. If you care about reliability, you need to identify action requests. This ID can be a GUID created on the client or a seed value from a relational DB on the server; it doesn't matter. For server-generated IDs, use a "preflight" request-response to exchange the ID of the action. If this request fails or half succeeds, no problem: the client just repeats the request, and unused IDs do no harm. This is important because it lets all subsequent requests be fully idempotent, in the sense that if they are repeated n times they return the same result and cause nothing further to happen. The server stores all responses against the action ID, and if it sees the same request again, it replays the same response (a sketch of this follows below). A fuller treatment of the pattern is in this google doc. The doc suggests an implementation that, I believe(!), broadly follows REST principles. Experts will surely tell me how it violates others. This pattern can be usefully employed for any unsafe call to your web service, whether or not there are downstream transactions involved.
Integration of your service into "transactions" controlled by upstream services. In the context of web services, full ACID transactions are generally considered not worth the effort, but you can greatly help consumers of your service by providing cancel and/or confirm links in your confirmation response, and thus achieve transactions by compensation.
Your requirement is a fundamental one. Don't let people tell you your solution is not kosher. Judge their architectures in the light of how well, and how simply, they address your problem.
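A sketch of the replay idea from point 2, assuming an in-memory response store keyed by a client-supplied action ID; the header name, endpoint, and storage are illustrative:

from flask import Flask, request, jsonify, abort

app = Flask(__name__)
responses = {}     # action id -> the response that was computed for it

@app.route("/transfers", methods=["POST"])
def transfer():
    action_id = request.headers.get("X-Action-Id") or abort(400, "action id required")
    if action_id in responses:
        return jsonify(responses[action_id])    # repeat of a handled request: replay the answer
    body = request.get_json()
    # ... perform the transfer exactly once here ...
    result = {"action": action_id, "status": "completed", "amount": body["amount"]}
    responses[action_id] = result               # store so any retry gets the same answer
    return jsonify(result), 201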
If you stand back to summarize the discussion here, it's pretty clear that REST is not appropriate for many APIs, particularly when the client-server interaction is inherently stateful, as it is with non-trivial transactions. Why jump through all the hoops suggested, for client and server both, in order to pedantically follow some principle that doesn't fit the problem? A better principle is to give the client the easiest, most natural, productive way to compose with the application.
In summary, if you're really doing a lot of transactions (types, not instances) in your application, you really shouldn't be creating a RESTful API.
You'd have to roll your own "transaction id" type of tx management. So it would be 4 calls:
http://service/transaction (some sort of tx request)
http://service/bankaccount/bob (give tx id)
http://service/bankaccount/john (give tx id)
http://service/transaction (request to commit)
You'd have to handle storing the actions in a DB (if load balanced) or in memory or such, then handle commit, rollback, and timeout.
Not really a RESTful day in the park.
First of all, transferring money is not something you cannot do in a single resource call. The action you want to perform is sending money, so you add a money-transfer resource to the sender's account.
POST: accounts/alice, new Transfer {target:"BOB", amount:100, currency:"CHF"}.
Done. You do not need to know that this is a transaction that must be atomic, etc. You just transfer money, i.e. send money from A to B.
But for the rare cases, here is a general solution:
If you want to do something very complex involving many resources, in a defined context, with a lot of restrictions that actually cross the what-vs-why barrier (business vs. implementation knowledge), you need to transfer state. Since REST should be stateless, you as a client need to carry the state around.
If you transfer state, you need to hide the information inside it from the client. The client should not see internal information that is only needed by the implementation and carries no business meaning. If that information has no business value, the state should be encrypted and a metaphor such as a token, pass, or similar should be used.
This way one can pass internal state around, and by using encryption and signing the system can still be secure and sound. Finding the right abstraction for why the client passes state information around is up to the design and architecture.
The real solution:
Remember that REST talks HTTP, and HTTP comes with the concept of cookies. Those cookies are often forgotten when people talk about REST APIs and workflows and interactions spanning multiple resources or requests.
Remember what Wikipedia says about HTTP cookies:
Cookies were designed to be a reliable mechanism for websites to remember stateful information (such as items in a shopping cart) or to record the user's browsing activity (including clicking particular buttons, logging in, or recording which pages were visited by the user as far back as months or years ago).
So basically, if you need to pass state around, use a cookie. It was designed for exactly this reason, it is HTTP, and therefore it is compatible with REST by design :).
The better solution:
If you talk about a client performing a workflow involving multiple requests, you are usually talking about a protocol. Every form of protocol comes with a set of preconditions for each potential step, like "perform step A before you can do B".
This is natural, but exposing a protocol to clients makes everything more complex. To avoid it, just think about what we do when we have to handle complex interactions in the real world: we use an agent.
Using the agent metaphor, you can provide a resource that performs all the necessary steps for you and stores the actual assignment / instructions it is acting upon in its list (so we can use POST on the agent, or an 'agency').
A complex example:
Buying a house:
You need to prove your credibility (e.g. by providing your police record), you need to sort out the financial details, you need to buy the actual house using a lawyer and a trusted third party holding the funds, verify that the house now belongs to you, and add the purchase to your tax records, etc. (just as an example; some steps may be wrong or whatever).
These steps might take several days to complete, and some can be done in parallel, etc.
In order to do this, you just give the agent the task "buy house", like:
POST: agency.com/ { task: "buy house", target:"link:toHouse", credibilities:"IamMe"}.
Done. The agency sends you back a reference that you can use to see and track the status of this job, and the rest is done automatically by the agents of the agency.
Think of a bug tracker, for instance. Basically, you report the bug and can use the bug ID to check what's going on. You can even use a service to listen for changes to this resource. Mission done.
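A sketch of the agent/agency idea, assuming a Flask app with a background worker; the task fields mirror the example above, and the status values and routes are illustrative:

import threading, uuid
from flask import Flask, request, jsonify, abort

app = Flask(__name__)
jobs = {}      # job id -> {"task": ..., "status": ...}

def run_job(job_id):
    # the "agents" would perform the individual steps here, possibly over days
    jobs[job_id]["status"] = "done"

@app.route("/agency", methods=["POST"])
def submit_task():
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"task": request.get_json(), "status": "in progress"}
    threading.Thread(target=run_job, args=(job_id,)).start()
    return jsonify({"ref": "/agency/" + job_id}), 202      # a reference to track the job

@app.route("/agency/<job_id>")
def job_status(job_id):
    return jsonify(jobs.get(job_id) or abort(404))         # check what's going on, like a bug id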
You must not use server side transactions in REST.
One of the REST constraints is:
Stateless
The client–server communication is further constrained by no client context being stored on the server between requests. Each request from any client contains all of the information necessary to service the request, and any session state is held in the client.
The only RESTful way is to create a transaction redo log and keep it in the client state. With each request the client sends the redo log, and the server redoes the transaction and either:
rolls the transaction back but provides a new transaction redo log (one step further), or
finally completes the transaction.
But maybe it's simpler to use a server-session-based technology that supports server-side transactions.
I think it is totally acceptable to break the pure theory of REST in this situation. In any case, I don't think there is anything actually in REST that says you can't touch dependent objects in business cases that require it.
I really think it's not worth the extra hoops you would jump through to create a custom transaction manager, when you could just leverage the database to do it.
In the simple case (without distributed resources), you could consider the transaction as a resource, where the act of creating it attains the end objective.
So, to transfer between <url-base>/account/a and <url-base>/account/b, you could post the following to <url-base>/transfer.
<transfer>
<from><url-base>/account/a</from>
<to><url-base>/account/b</to>
<amount>50</amount>
</transfer>
This would create a new transfer resource and return the new url of the transfer - for example <url-base>/transfer/256.
At the moment of a successful POST, then, the 'real' transaction is carried out on the server, and the amount is removed from one account and added to the other.
This, however, doesn't cover a distributed transaction (if, say 'a' is held at one bank behind one service, and 'b' is held at another bank behind another service) - other than to say "try to phrase all operations in ways that don't require distributed transactions".
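A sketch of what carrying out that 'real' transaction could look like in the simple, non-distributed case, assuming a relational store; the table and column names are illustrative:

import sqlite3

def create_transfer(db_path, from_account, to_account, amount):
    conn = sqlite3.connect(db_path)
    try:
        with conn:   # one database transaction: both updates commit together or not at all
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, from_account))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, to_account))
            cur = conn.execute("INSERT INTO transfers (from_id, to_id, amount) VALUES (?, ?, ?)",
                               (from_account, to_account, amount))
            return "/transfer/%d" % cur.lastrowid   # URL of the new transfer resource
    finally:
        conn.close()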
I believe this would be a case for using a unique identifier generated on the client, to ensure that a connection hiccup does not result in a duplicate being saved by the API.
I think sending a client-generated GUID field along with the transfer object, and ensuring that the same GUID is not inserted again, would be a simpler solution to the bank-transfer problem.
I don't know about more complex scenarios, such as booking multiple airline tickets, or micro architectures.
I found a paper on the subject, relating experiences of dealing with transaction atomicity in RESTful services.
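A sketch of the client side of that idea, assuming the server rejects or ignores a transfer whose GUID it has already seen; the endpoint and field names are illustrative:

import uuid
import requests

def send_transfer(base_url, from_acct, to_acct, amount, retries=3):
    transfer = {
        "guid": str(uuid.uuid4()),   # generated once; reused unchanged on every retry
        "from": from_acct,
        "to": to_acct,
        "amount": amount,
    }
    for _ in range(retries):
        try:
            resp = requests.post(base_url + "/transfer", json=transfer, timeout=5)
            if resp.status_code in (200, 201, 409):   # 409: server has already seen this GUID
                return resp
        except requests.ConnectionError:
            pass                                      # hiccup: retry with the same GUID
    raise RuntimeError("transfer could not be confirmed")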
I guess you could include the TAN in the URL/resource:
PUT /transaction to get the ID (e.g. "1")
[PUT, GET, POST, whatever] /1/account/bob
[PUT, GET, POST, whatever] /1/account/bill
DELETE /transaction with ID 1
Just an idea.