How appropriate is it to use SAML_login with AEM with more than 1m users? - single-sign-on

I am investigating a slow login time and some profile synchronisation problems in a large enterprise AEM project. The system has around 1.5m users, and the website is served by 10 publishers.
The way this project is built, SAML_login is enabled for all of these end users, and there is a third-party IDP which I assume SAML_login talks to. I'm no expert on SSO or the SAML_login process, so as a first step I'm trying to understand whether this is the correct way to go.
Because of this setup and the number of users, a SAML_login call takes 15 seconds on average. This is becoming less acceptable day by day as the user count rises. Even more importantly, the synchronisation between the 10 publishers occasionally fails, so some users sometimes can't use the system as expected.
Because the users are stored in the JCR for SAML_login, you cannot even browse the /home/users folder from the CRX browser; it times out, since it is impossible to show 1.5m rows at once. My educated guess is that this is also why the SAML_login call takes so long.
I've come across articles that describe how to set up SAML_login on AEM, which makes this usage sound legitimate. But in my opinion this is a very poor setup, because the JCR is not designed as a quick-access data store for this kind of usage scenario.
My understanding so far is that this approach might work well, but only with a limited number of users; with this many users it is not a workable solution. So my first question would be: am I right? :)
If I'm not right, there is certainly a bottleneck somewhere that I'm not aware of yet. What could that bottleneck be, so it can be improved upon?

The AEM SAML authentication handler has some performance limitations with the default configuration. When your browser sends an HTTP POST request to AEM under /saml_login, it includes a base64-encoded "SAMLResponse" request parameter. AEM processes that response directly and does not contact any external systems.
Even though the SAML response is processed on AEM itself, the bottlenecks of the /saml_login call are the following:
Initial login, where AEM creates the user node for the first time - you can create these nodes ahead of time by writing a script that pre-creates the SAML user nodes (under /home/users); see the sketch after this list.
During each login, when the session is first created - a token node is created under the user node at /home/users/.../{usernode}/.tokens. This can be avoided by enabling the encapsulated token feature.
Finally, the last bottleneck occurs when the SAMLResponse XML is saved under the user node (for later use, required for SAML-based logout). This can be avoided by not implementing SAML-based logout. The latest com.adobe.granite.auth.saml bundle supports turning off the saving of the SAML response; service packs AEM 6.4.8 and AEM 6.5.4 include this feature. To enable it, set the OSGi configuration properties storeSAMLResponse=false and handleLogout=false, and the SAML response will no longer be stored.
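For the first bottleneck, a minimal sketch of such a pre-creation script is shown below. It assumes the Granite user-management servlet at /libs/granite/security/post/authorizables is available on the instance; the host, credentials, intermediate path and user IDs are all placeholders you would replace with your own.

# Sketch: pre-create SAML user nodes ahead of time via AEM's user-management servlet.
# Host, credentials, intermediatePath and the user IDs below are placeholders.
import requests

AEM_HOST = "http://localhost:4502"
ADMIN_AUTH = ("admin", "admin")

def precreate_user(user_id):
    resp = requests.post(
        AEM_HOST + "/libs/granite/security/post/authorizables",
        auth=ADMIN_AUTH,
        data={
            "createUser": "",                        # tells the servlet to create a user
            "authorizableId": user_id,               # must match the SAML subject / user ID
            "rep:password": "unused-for-saml-users", # SAML users never log in with this
            "intermediatePath": "/home/users/saml",  # optional custom parent path
        },
    )
    resp.raise_for_status()

if __name__ == "__main__":
    for uid in ["user0001", "user0002"]:             # in practice, read the IDs from an IDP export
        precreate_user(uid)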


How to manage HATEOAS links when the server is the client?

I'm learning about HATEOAS. The backend server I'm working on will use a third-party REST API that uses HATEOAS. That API has an endpoint that returns the URL for each resource, and it also returns the related resource links with regular requests.
But I'm wondering what a good way is to manage these links on the server to avoid hardcoding them. For example, if the third party changes the URL of a resource, how will the server detect that change? Are there any standard practices for managing HATEOAS resource links?
Possible ways I can think of:
When the server starts, get all the resource URLs and cache them. Whenever the third-party API needs to be called, reuse these cached URLs. Whenever there is a 404 or a related error, update the resource URL, or update the URLs periodically at set intervals.
Get the resource URL each time before calling the endpoint. Simplest, but it essentially doubles the number of requests.
But neither sounds like a robust approach.
While discovery is generally a good thing and should allow a HATEOAS system to introduce changes in ways that 'hardcoded URLs' don't, if URLs start breaking arbitrarily I would still consider this a major issue.
You should be able to store URLs / links on your side and have some expectation that they keep working.
There are some mechanisms that deal with changes, though:
The server should return 301 / 308 redirects if a resource has moved. If this is the case, you should also update your stored references.
The server can emit Sunset or Deprecation headers. See: https://www.rfc-editor.org/rfc/rfc8594
Those are more general answers, but ultimately the existence of best practices does not mean that vendors will abide by them. With that in mind, I think your best bet is to try and find out what your vendor's deprecation policy is and see what they recommend.
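To illustrate the redirect mechanism above, here is a minimal sketch (Python with the requests library; the link store is just a dict and the URLs are made-up placeholders) of a client that updates its stored link whenever the server answers with a 301 or 308:

# Sketch: keep a local link store and update it on permanent redirects.
# The URLs and the `links` dict are illustrative placeholders.
from urllib.parse import urljoin
import requests

links = {"orders": "https://api.example.com/v1/orders"}

def fetch(name):
    url = links[name]
    resp = requests.get(url, allow_redirects=False)
    if resp.status_code in (301, 308):                        # resource moved permanently
        links[name] = urljoin(url, resp.headers["Location"])  # remember the new location
        resp = requests.get(links[name])
    return resp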
Use a cached resource if it is valid; request a refresh when you don't have a valid local copy.
RFC 7234 defines the caching semantics of HTTP.
Ideally, you don't implement the caching rules yourself, but instead you use a general purpose cache.
In its ideal form, your bespoke implementation is talking to a headless browser, and the headless browser worries about the caching rules for you.
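For example, rather than hand-rolling the RFC 7234 rules, you can wrap your HTTP session in an off-the-shelf caching layer. A minimal sketch using the third-party CacheControl package for Python (one of several such libraries; the URL is a placeholder):

# Sketch: let a general-purpose HTTP cache apply the RFC 7234 rules for you.
import requests
from cachecontrol import CacheControl

session = CacheControl(requests.Session())

# The first call hits the network; repeated calls are served from the cache
# (or revalidated) according to the Cache-Control headers the server sent.
resp = session.get("https://api.example.com/root")
print(resp.json())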
In theory, you need the initial URL to start the process, and everything else comes from that.
Each resource you get from the server should include links to other edges on the graph of service for that resource.
So, once you get the initial resource, all of the rest come automatically.
That said, it's not untoward to have "well known" entry points that are, ideally, unchanging URLs. But in the end, those are just "bookmarks", and not necessarily guaranteed end points.
Consider a shopping site such as Amazon. Outside of amazon.com, you don't know any of their URLs. They're all provided on the various forms and pages, and the human simply navigates the site. Those URLs can be changing all the time, and no one would know. With HATEOAS, it's up to the machine to follow the links, rather than a human. But the process of navigation is the same.
As others have mentioned, the idea of caching a root resource has merit. Then you rely on the caching headers to tell you how often you have to refresh the links.
But that said, operationally there's no difference between following a normal link and following a cached link. Underneath, the cached resource loads faster, but you still need to "follow the link", because that's where the caching behavior kicks in. This is different from assuming the link is good, i.e. assuming you already know the result of a resource lookup. Your application follows the link. Always. The underlying infrastructure is responsible for making it efficient.
So your code should not, say, load up a root resource, stuff a map full of links, and then assume they're good. Rather, the code should request the root resource, perhaps as a map of links (datatypes for the win), and let the next layer handle the details. Because it all depends on the type of caching involved. Some caches have explicit durations during which no follow-up is necessary. With others, you make the request anyway and the server tier responds with "nothing changed", so you can use your local copy, but you're still required to ask in the first place.
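That "ask anyway, server says nothing changed" case is exactly what HTTP conditional requests provide. A minimal sketch (Python with the requests library; the URL is a placeholder, and the server is assumed to support ETag validation):

# Sketch: always follow the link, but let a conditional GET make it cheap.
import requests

URL = "https://api.example.com/root"
cached_body, cached_etag = None, None

def get_root():
    global cached_body, cached_etag
    headers = {"If-None-Match": cached_etag} if cached_etag else {}
    resp = requests.get(URL, headers=headers)
    if resp.status_code == 304:               # "nothing changed": reuse the local copy
        return cached_body
    cached_body = resp.json()                 # fresh representation
    cached_etag = resp.headers.get("ETag")
    return cached_body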
Those are implementation details that the SERVER mandates (not the client). It's a server contract. If they want you pinging them each and every time, so be it. That's the contract they're presenting to you, and if you want to be a good citizen, then you should honor that contract.
Ideally, the server makes good decisions on these kinds of issues for the sake of efficiency, but in the end it's really up to them.
The client has to go along. The client in a HATEOAS system cedes a lot to the server. They're simply not decisions for the client to make.

How exactly does a backend work from a developer's perspective?

There's a ton of videos and websites trying to explain backend vs. frontend, but unfortunately none of them explains it in a way that tells you how to develop a backend-driven website (at least I haven't found anything good).
So I wanted to make sure that I understood it, and I kindly ask you to confirm or correct me on this topic.
Example:
I want to build a mini Google. I have a database containing 1000 stored websites.
Assumption #1:
Every time I type something into the search bar, the autofill suggestions change. This means that every time I type, another website / API gets called, returning the current autofill suggestions. On the developer side, this means the endpoint is e.g. a Python script which gets called with the currently typed word as a parameter and returns all suggestions as e.g. JSON:
// Client-side script (JavaScript; show() is a placeholder that renders the suggestions)
async function ontype(input) {
    const res = await fetch("https://api.googlemini.com/suggestions?q=" + encodeURIComponent(input));
    const suggestions = await res.json();
    show(suggestions);
}
Assumption #2:
This also means I could manually call the website containing the Python script, provide a random word, and it would always return JSON containing the autofill suggestions for that word.
Question #1:
If A#1 turns out to be true but A#2 turns out to be false, how could I prevent a user from randomly accessing the "API" while still returning results when it is called by a script?
Assumption #3:
After pressing enter, my website googlemini.com/search?... would be called. As google.com/search reloads every time you search for a new query (or go to page 2, etc.), I assume that instead of calling an API, the server, when it gets the client request, first searches through its database, sorts the results and then returns a whole HTML page as a static webpage:
# Server-side script (Flask-style Python; searchdatabase() and buildhtml() are placeholders)
from flask import Flask, request
app = Flask(__name__)

@app.route("/search")
def oncall():
    query = request.args.get("q")
    results = searchdatabase(query)
    html = buildhtml(results)
    return html
Question #2:
Often I hear (or at least understand it this way) that the database and the web server are 2 separate servers. How would that work? Wouldn't that mean the database server needs to be accessible from the web too (of course it would have security layers etc., but technically it would)? How could I access the database server from the web server?
Question #3:
Are there, on a technical basis, any other ways to build backend services?
That's it. I would also appreciate any recommendations such as videos, websites or other resources for learning how to technically set up and/or secure backend servers.
Thanks in advance.
For your first question: yes, there is a way to prevent misuse.
What you can do is add an identifier to the API, such as an auth token, to identify the user. Every time a user accesses the API you increment a counter on the server, and whenever the count exceeds a limit within a given time span you reject the call. The limit can be set in such a way that it doesn't trouble honest users but punishes the abusive ones. There are even more complex and effective methods, but this is the basic idea.
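A minimal sketch of that idea (pure Python, in-memory; the limit and the time window are arbitrary example values):

# Sketch: naive per-token rate limiter (in-memory; limit and window are arbitrary examples).
import time
from collections import defaultdict

LIMIT = 100       # max calls...
WINDOW = 3600     # ...per hour, per token

calls = defaultdict(list)   # token -> timestamps of recent calls

def allow(token):
    now = time.time()
    recent = [t for t in calls[token] if now - t < WINDOW]
    if len(recent) >= LIMIT:
        calls[token] = recent
        return False             # over the limit: reject the call
    recent.append(now)
    calls[token] = recent
    return True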
For question number two, let me explain a simple concept: a database is a very efficient, resourceful and expensive data-storage solution, and we never want to use it casually, as a variable store or something like that. We always want to access the database in a call: get the data, process the data, update the data. It is not strictly necessary to run the database on a separate server, but we usually want the database to be accessible to various platforms (Android, iOS, Windows), so it's better to add some abstraction and keep the database as a separate entity.
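To make the separate-server part concrete: the web server usually reaches the database over a private network connection using the database's own driver and credentials, so the database does not have to be exposed to the public web. A minimal sketch (assuming PostgreSQL with the psycopg2 driver; host name, credentials and the table are placeholders):

# Sketch: a web-server process connecting to a database running on a different host.
# Assumes PostgreSQL with psycopg2; host, credentials and table are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="db.internal.example.com",   # private hostname of the database server
    port=5432,
    dbname="minigoogle",
    user="webapp",
    password="secret",
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT url, title FROM websites WHERE title ILIKE %s", ("%search term%",))
    rows = cur.fetchall()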
For the last question, I am not entirely sure what you mean by "other ways", but here is a list of some backend technologies; some of these can be used in isolation, some cannot, and some are combined with other tools.
Django
Flask
Django REST framework
GraphQL
SQL
PHP
Node
Deno

How to balance a REST API and openness to prevent data stealing

One of our web sites is a typical "advertise your apartment for free" site. Revenue is directly tied to the amount of public usage and to the number of listings registered (the argument of our marketing department).
On the other side, REST pushes you to maintain a clear API when designing your service (the argument of our software department), which is an open invitation for any competitor to steal the data. In this view, the web server becomes almost an intelligent database.
We have clearly identified our problem, but have no idea how to resolve these constraints. Any tips would help.
Throttle the calls to the data-rich endpoints by IP, to say 1000 per day (or triple what a normal user would use).
If you expose data then it can be stolen. And think about search endpoints that return large datasets, even if they are driven by JavaScript or forms - I have personally written crawlers that circumvent these measures.
You may also think (if the data is that important) about decrypting it in the client based on keys and authentication sent from the server (but this only raises the bar; it does not remove the ability to steal).
Add a captcha / re-captcha for users who are scanning too quickly or too much.
In short:
As always only expose the minimum API to do the job (attack surface minimisation)
Log and throttle
Force sign in(?). This at least MAY put off some scanners
Use a captcha mechanism for users you think may be bots trawling your data
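As an illustration of the log-and-throttle point, a minimal sketch of a per-IP daily limit (pure Python, in-memory; the limit of 1000 is just the example figure mentioned above):

# Sketch: naive per-IP daily request counter (in-memory; limit is an example value).
import datetime
from collections import defaultdict

DAILY_LIMIT = 1000
counters = defaultdict(int)   # (ip, date) -> number of requests seen today

def allow(ip):
    key = (ip, datetime.date.today().isoformat())
    if counters[key] >= DAILY_LIMIT:
        return False          # over the daily quota: reject, or serve a captcha instead
    counters[key] += 1
    return True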

Beginner GXT issues

We have a working web application which has been developed with ExtJS on the client side and Struts, Spring and Hibernate on the server side. Now we are considering migrating to GXT (or maybe GWT itself). The thing is, I'm very new to GWT/GXT, and we are trying to decide whether or not to go down this road.
1) Until now we have had 2 domains for our web app: one that the application (Struts + ...) is deployed to, and the other mainly a cookie-less custom CDN. The transfer between client and server is mostly XHR requests, sending/receiving JSON and/or JSONP. But with the new approach ahead of us, I have come to understand that we are supposed to have only ONE domain for the whole GXT application. Is that correct, or have I forgotten to consider something here?
And if not, is it possible to deploy just part of the application (i.e. com.ourcompany.webapp.gxt.server.*) to the main server, and the content that has been compiled and generated by the GWT compiler to the other, CDN-like domain?
2) The other big issue we are facing is that the current application consists of 3 huge modules. One is responsible for "SignIn", another for "Webtop", and the third for "the modules each user has access to". The latter is generated on the server according to the "access rights" of each user, and obviously can differ from one user to another.
The only thing I could find on this matter which might be related is Code Splitting, although I'm not totally sure it is the right solution for this.
We want the application, on start-up, to check whether the user is logged in or not. If not, it should load the SignIn set of JavaScript files (i.e. webapp.signin.nocache.js); then, after the user has entered the correct username/password, it should unload this signin file and load webtop.nocache.js AND modules.nocache.js.
I would really appreciate it if you could help me out.
1) If your GWT app is loaded from a different domain then you have to face the same-origin policy: you cannot make an XHR to a different domain. You could use the ScriptTagProxy to get around this, but it does not feel very natural.
2) You can use code splitting in order to load a particular part of your application dynamically. All you have to do is wrap your split point in an async call.
A detailed compile report gives you a pretty good overview of how well code splitting is working.
But code splitting does not unload already loaded code. If it is really important to do so, you have to redirect the user to another URL in order to load the appropriate user-dependent module.
Once JavaScript code has been loaded and executed, it is impossible to remove that code from the browser's memory.
Greetings,
Peter

Connectedness & HATEOAS

It is said that in a well-designed RESTful system, clients only need to know the root URI or a few well-known URIs, and the client should discover all other links through these initial URIs. I do understand the benefits (decoupled clients) of this approach, but the downside for me is that the client needs to discover the links each time it tries to access something, i.e. given the following hierarchy of resources:
/collection1
collection1
|-sub1
| |-sub1sub1
| | |-sub1sub1sub1
| |   |-sub1sub1sub1sub1
| |-sub1sub2
|-sub2
| |-sub2sub1
| |-sub2sub2
|-sub3
| |-sub3sub1
| |-sub3sub2
If we follow the "clients only need to know the root URI" approach, then a client should only be aware of the root URI, i.e. /collection1 above, and the rest of the URIs should be discovered by the client through hypermedia links. I find this cumbersome because, each time a client needs to do a GET on, say, sub1sub1sub1sub1, should the client first do a GET on /collection1, then follow the link defined in the returned representation, and then do several more GETs on sub-resources to reach the desired resource? Or is my understanding of connectedness completely wrong?
Best regards,
Suresh
You will run into this mismatch when you try to build a REST API that does not match the flow of the user agent that is consuming the API.
Consider that when you run a client application, the user is always presented with some initial screen. If you match the content and options on this screen with the root representation, then the available links and desired transitions will match nicely. As the user selects options on the screen, you can transition to other representations, and the client UI should be updated to reflect the new representation.
If you try to model your REST API as some kind of linked-data repository and your client UI as an independent set of transitions, then you will find HATEOAS quite painful.
Yes, it's right that the client application should traverse the links, but once it has discovered a resource, there's nothing wrong with keeping a reference to that resource and using it for longer than one request. If your client has the possibility of remembering things permanently, it can do so.
Consider how a web browser keeps its bookmarks. You probably have ten or a hundred bookmarks in the browser, and you probably found some of them deep in a hierarchy of pages, but the browser dutifully remembers them without needing to remember the path it took to find them.
A richer client application could remember the URI of sub1sub1sub1sub1 and reuse it if it still works. It's likely that it still represents the same thing (it ought to). If it no longer exists, or fails for any other client reason (4xx), you could retrace your steps to see if you can find a suitable replacement.
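A minimal sketch of that strategy (Python with the requests library; the root URL, the link-relation names and the "links" JSON shape are all assumptions, since the actual API format isn't specified here):

# Sketch: cache a discovered URI, and re-traverse from the root only when it stops working.
# Assumes each resource returns JSON with a "links" map of relation-name -> URI (placeholder format).
import requests

ROOT = "https://api.example.com/collection1"   # placeholder root URI

def discover(rels):
    # Follow hypermedia links from the root, one relation name at a time.
    url = ROOT
    for rel in rels:
        url = requests.get(url).json()["links"][rel]
    return url

cached_uri = None

def get_deep_resource():
    global cached_uri
    if cached_uri:
        resp = requests.get(cached_uri)
        if resp.status_code < 400:
            return resp.json()
    # Cached URI missing or broken (4xx): retrace the steps from the root.
    cached_uri = discover(["sub1", "sub1sub1", "sub1sub1sub1", "sub1sub1sub1sub1"])
    return requests.get(cached_uri).json()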
And of course what Darrel Miller said :-)
I don't think that that is a strict requirement. As I understand it, it is legal for a client to access resources directly and start from there. The important thing is that you do not do this for state transitions, i.e. do not automatically proceed with /foo2 after operating on /foo1 and so forth. Retrieving /products/1234 directly in order to edit it seems perfectly fine. The server could always return, say, a redirect to /shop/products/1234 to remain backwards compatible (which is desirable for search engines, bookmarks and external links as well).