Large scale multi-tenanting scenarios with Keycloak - keycloak

I'm trying to understand how Keycloak can be used in a large scale multi-tenanted scenario.
The standard approach seems to be to use a realm for each tenant. This isolates each tenant's users and settings and makes a lot of sense.
In the Keycloak example for multi-tenanting it says it "demonstrates the simplest possible scenario for Keycloak Multi Tenancy support" (emphasis mine). I might be reading too much into this, but to me that implies there are other standard approaches. I haven't been able to find much discussion about these options though.
I've also read that there are potentially performance issues with more than 100 realms. It might be that these performance issues have been fixed, but this also suggests to me that Keycloak wouldn't handle a large scale multi-tenanting scenario with 1,000+ tenants.
So my questions are:
Are there any other recommended approaches for multi-tenanting, other than "one realm per tenant"?
Are there any large scale multi-tenanting deployments of Keycloak in the wild that demonstrate its ability to cope with lots of realms?
Are there any recommendations for sources of information that I should be looking at?

As of today (with Keycloak 18 released), the performance issues with 100+ realms still exist. But in the real world this is not an issue at all.
Just split your tenants across multiple Keycloak clusters. In case you need to integrate your realms, this can be done independently of the specific cluster/instance through federation across the clusters, so there's no need to run everything on a single cluster.
From an operations point of view - having such a large number of tenants/realms on a single cluster would also be suboptimal - as you'd have a hard time organizing maintenance windows and downtimes. So splitting things up a bit is not the worst thing to consider.
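If you go down the realm-per-tenant road, provisioning is usually automated against the Admin REST API. Below is a minimal sketch using the keycloak-admin-client Java library; the server URL, credentials and the "tenant-acme" realm name are placeholders, and this is only an illustration of the idea, not an official recipe.

```java
// Sketch only: create one realm per tenant via the Keycloak admin client.
// Server URL, credentials and realm name are placeholders.
import org.keycloak.admin.client.Keycloak;
import org.keycloak.admin.client.KeycloakBuilder;
import org.keycloak.representations.idm.RealmRepresentation;

public class TenantProvisioner {

    public static void main(String[] args) {
        // Point this at whichever cluster the tenant is assigned to.
        // (On Keycloak versions before 17, append /auth to the server URL.)
        Keycloak keycloak = KeycloakBuilder.builder()
                .serverUrl("https://keycloak-cluster-1.example.com")
                .realm("master")
                .clientId("admin-cli")
                .username("admin")
                .password("admin-password")
                .build();

        // One realm per tenant keeps users and settings isolated.
        RealmRepresentation realm = new RealmRepresentation();
        realm.setRealm("tenant-acme");
        realm.setEnabled(true);
        keycloak.realms().create(realm);

        keycloak.close();
    }
}
```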

Related

Drools for Multi-Client Web Application feasible?

I am working on a multi-client web-based application that analyses sensor data and shall invoke actions based on this data with a rule engine.
Every client of this application has a set of environmental sensors (10s - 100s) and a set of rules to be evaluated every time the sensor values change (the sensor values are copied into a database).
A basic set of rules will often be reused by different clients but the rules are individually parameterized (e.g. time dependant) for each client and every client has a different amount of sensors and rules, which can be configured individually. Some rules might even be specific to single clients.
I believe that drools might be a good choice for such an implementation - using drools guvnor to manage the rules for each client. Every client would have his own knowledge base and rule execution session.
I wonder if such an environment would scale and if there is a benchmark or real-world example where someone has used drools for such scenario.
Most benchmarks I could find assess different rule engines by their ability to apply rules to a growing number of facts. The number of facts in my scenario would be relatively stable (per client), and scalability would rather be limited by the number of clients and the concurrent use of many knowledge bases and sessions.
Any comment about benchmarks or rule engine comparison regarding this scalability problem is welcome. I'd also be glad to hear about real-world implementations where every client has his own rules and dataset to work on.
The main problem with benchmarks is that they will vary a lot depending on the specific rules that you write for your own domain. Most benchmarks are tweaked to perform better in the rule engine that is being tested. If you have a session per client and a stable number of clients, you will face no problem. Once you get the initial version of your project working, you can fine-tune the engine to improve performance.
The most "difficult" thing, in my opinion, is to get the infrastructure right; by that I mean deciding when to create the sessions and how to select the rules for each client. Because that's part of your specific domain, you will need to code it and manage all the sessions yourself.
Hope it helps
Acting on sensor data is one of the examples given for "Complex Event Processing". The following link may give a deeper insight into this subject.
Drools Fusion is also capable of CEP.

What are the pros and cons of DynamoDB with respect to other NoSQL databases?

We use MongoDB database add-on on Heroku for our SaaS product. Now that Amazon launched DynamoDB, a cloud database service, I was wondering how that changes the NoSQL offerings landscape?
Specifically for cloud based services or SaaS vendors, how will using DynamoDB be better or worse as compared to say MongoDB? Are there any cost, performance, scalability, reliability, drivers, community etc. benefits of using one versus the other?
For starters, it will be fully managed by Amazon's expert team, so you can bet that it will scale very well with virtually no input from the end user (developer).
Also, since it's built and managed by Amazon, you can assume that they have designed it to work very well with their infrastructure, so performance should be top notch. In addition to being specifically built for their infrastructure, they have chosen to use SSDs as storage, so right from the start, disk throughput will be significantly higher than other data stores on AWS that are HDD-backed.
I haven't seen any drivers yet, and I think it's too early to tell how the community will react to this, but I suspect that Amazon will provide drivers for all of the most popular languages and that the community will receive this well, and in turn create additional drivers and tools.
Using MongoDB through an add-on for Heroku effectively turns MongoDB into a SaaS product as well.
In reality one would be comparing whatever service a chosen provider offers to what Amazon can offer, rather than comparing one persistence solution to another.
This is very hard to do. Each provider will have varying levels of service at different price points, and one might consider being able to run the database on your own hardware locally for development purposes a welcome option.
I think the key difference to consider is that MongoDB is software you can install anywhere (including at AWS, at another cloud service, or in-house), whereas DynamoDB is a SaaS available exclusively as a hosted service from Amazon (AWS). If you want to retain the option of hosting your application in-house, DynamoDB is not an option. If hosting outside of AWS is not a consideration, then DynamoDB should be your default choice unless very specific features are of higher consideration.
There's a table in the following link that summarizes the attributes of DynamoDB and Cassandra:
http://www.datastax.com/dev/blog/amazon-dynamodb
Something that needs improvement in DynamoDB in order to become more usable is the ability to index attributes other than the primary key.
UPDATE 1 (06/04/2013)
On 04/18/2013, Amazon announced support for Local Secondary Indexes, which made DynamoDB far more usable:
http://aws.amazon.com/about-aws/whats-new/2013/04/18/amazon-dynamodb-announces-local-secondary-indexes/
I have to be honest; I was very excited when I heard about the new DynamoDB and did attend the webinar yesterday. However, it's difficult to make a decision right now because everything they said was still very vague; I have no idea what functionality will be available through their service.
The one thing I do know is that scaling is handled automatically, which is pretty awesome; yet there are still so many unknowns that it's tough to make a proper analysis until all the facts are in and we can start using it.
Thus far I still see Mongo as working much better for me (personally) on the project I've been working on.
Like most DB decisions, it's really going to come down to a project by project decision of what's best for your need.
I anxiously await more information on the product; for now, though, it is in beta, and I wouldn't jump ship to adopt the latest and greatest only to end up being a tester :)
I think one of the key differences between DynamoDB and other NoSQL offerings is the provisioned throughput - you pay for a specific throughput level on a table, and provided you keep your data well-partitioned, you can always expect that throughput to be met. So as your application load grows you can scale up and keep your performance more-or-less constant.
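As a rough illustration of what that provisioned throughput looks like in code, here is a sketch using the AWS SDK for Java (v1); the table name, key attribute and capacity numbers are made up for the example.

```java
// Sketch only: create a table with explicitly provisioned read/write capacity.
// Table name, key and capacity values are placeholders.
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeDefinition;
import com.amazonaws.services.dynamodbv2.model.CreateTableRequest;
import com.amazonaws.services.dynamodbv2.model.KeySchemaElement;
import com.amazonaws.services.dynamodbv2.model.KeyType;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
import com.amazonaws.services.dynamodbv2.model.ScalarAttributeType;

public class CreateSessionsTable {

    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        CreateTableRequest request = new CreateTableRequest()
                .withTableName("Sessions")
                .withAttributeDefinitions(
                        new AttributeDefinition("sessionId", ScalarAttributeType.S))
                .withKeySchema(
                        new KeySchemaElement("sessionId", KeyType.HASH))
                // You pay for this capacity, and requests beyond it get throttled.
                .withProvisionedThroughput(new ProvisionedThroughput(100L, 50L));

        client.createTable(request);
    }
}
```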
Amazon DynamoDB seems like a pretty decent NoSQL solution. It is fast, and it is pretty easy to use. Other than having an AWS account, there really isn't any setup or maintenance required. The feature set and API are fairly small right now compared to MongoDB/CouchDB/Cassandra, but I would expect that to grow over time as feedback from the developer community is received. Right now, all of the official AWS SDKs include a DynamoDB client.
Pros
Lightning Fast (uses SSDs internally)
Really (really) reliable. (chances of write failures are lower)
Seamless scaling (no need to do manual sharding)
Works as a web service (no server, no configuration, no installation)
Easily integrated with other AWS features (you can store a whole table into S3, use EMR, etc.)
Replication is managed internally, so the chance of accidental data loss is negligible.
Cons
Very (very) limited querying.
Scanning is painful (I remember a scan from Java once running for 6 hours)
Pre-defined throughput, which means a sudden increase beyond the set throughput will be throttled.
Throughput is partitioned as the table is sharded internally (which means that if you provisioned a throughput of 1,000 and the table is split into two partitions, and you are only reading the latest data from one partition, your effective read throughput is only 500).
No joins; limited indexing allowed (basically 2).
No views, triggers, scripts or stored procedures.
It's really good as an alternative to session storage in a scalable application. Another good use would be logging/auditing in an extensive system. NOT preferable for a feature-rich application with frequent enhancements or changes.

What are the challenges in embedding text search (Lucene/Solr/Hibernate Search) in applications that are hosted at client sites

We have an enterprise Java web app that our customers (external) deploy on their intranets. I am exploring different full text search options: Lucene/Solr/Hibernate Search, and one common concern is the deployment/administration/tuning overhead for this.
This is particularly challenging in our case, since we do not host these applications. From what I have seen, most uses of these technologies have been in hosted applications. Our customers typically deploy our application in a clustered environment and do not have any experience with Lucene/Solr.
Does anyone have any experience with this? What challenges have you encountered with this approach? How did you overcome them? At this point I am trying to determine if this is feasible.
Thank you
It is very feasible to deploy applications onto client sites that use Lucene (or Solr).
Some things to keep in mind:
Administration
* You need a way to version your index, so if/when you change the document structure in the index, it can be upgraded.
* You therefore need a good way to force a re-index of all existing data. Probably also a good idea to provide an Admin option to allow an Admin to trigger a re-index as well (see the sketch after this list).
* You could also provide an Admin option to allow optimize() to be called on your index, or have this scheduled. Best to test the actual impact this will have first, since it may not be needed depending on the shape of your index.
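One low-tech way to handle the index versioning above is a marker file kept next to the index directory. This is only a sketch of that idea (the paths, the version constant and the Java 11+ Files helpers are my own assumptions, not anything Lucene prescribes); if the stored version doesn't match what the shipped code expects, you delete and rebuild the index from the source data.

```java
// Sketch only: detect whether a full re-index is needed after an upgrade,
// using a version marker file next to the index (paths/version are placeholders).
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class IndexVersionCheck {

    static final String EXPECTED_VERSION = "7"; // bump whenever the document structure changes
    static final Path VERSION_FILE = Paths.get("/var/myapp/index.version");

    static boolean reindexRequired() throws IOException {
        if (!Files.exists(VERSION_FILE)) {
            return true; // fresh install or legacy index with no marker
        }
        String stored = Files.readString(VERSION_FILE).trim();
        return !EXPECTED_VERSION.equals(stored);
    }

    static void markReindexed() throws IOException {
        Files.writeString(VERSION_FILE, EXPECTED_VERSION);
    }
}
```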
Deployment
If you are deploying into a clustered environment, the simplest (and fastest in terms of dev speed and runtime speed) solution could be to create the index on each node.
Tuning
* Do you have a reasonable approximation of the dataset you will be indexing? You will need to ensure you understand how your index scales (in both speed and size), since what you consider a reasonable dataset size, may not be the same as your clients... Therefore, you at least need to be able to let clients know what factors will lead to overly large index size, and possibly slower performance.
There are two advantages to embedding Lucene in your app over sending the queries to a separate Solr cluster: performance and ease of deployment/installation. Embedding Lucene means running Lucene in the same JVM, which means no additional server round trips. Commits should be batched in a separate thread. Embedding Lucene also means including a few more JAR files in your classpath, so there is no separate install for Solr.
If your app is cluster aware, then the embedded Lucene option becomes highly problematic. An update to one node in the cluster needs to be searchable from any node in the cluster. Synchronizing the Lucene index on all nodes yields no better performance than using Solr. With Solr 4, you may find the administration to be less of a barrier to entry for your customers. Check out the literature on the grossly misnamed SolrCloud.
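For a sense of how small the embedded option is, here is a minimal sketch of indexing and querying with Lucene in the same JVM (assuming a reasonably recent Lucene release; the index path and field names are placeholders).

```java
// Sketch only: embedded Lucene, index one document and query it back.
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class EmbeddedSearch {

    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/var/myapp/index"));
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Indexing happens in the application's own JVM: no server round trips.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body", "embedded lucene example", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Searching reads the same on-disk index.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(
                    new QueryParser("body", analyzer).parse("embedded"), 10);
            System.out.println("hits: " + hits.totalHits);
        }
    }
}
```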

How does "distributed computing" apply to web development or programming in general?

I am about to use Apache Hadoop, the headlines read:
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
I can relate "scalability" to programming, but I just don't know how this "distributing" can help me in my development. According to wikipedia:
A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal.
So does this mean I can deploy my web apps across multiple computers and do some sort of "intense computing"? The terms that come into my mind are Content Delivery Networks and Cloud Computing.
Web development has always been about distributed computing, since clients have been on different machines to the servers they talk to, web pages can pull in resources from many servers to build a page's content, and servers may talk to other machines to achieve their goals. CDNs make this more obvious than before, but really they're just an evolution, an introduction of a virtualization/indirection layer between what you ask for and the hardware used to provide it.
Clouds are about taking the concepts of virtualization and applying them to remote hosting, both of low-level OSes and higher-level software platforms. The really interesting thing about them is that this enables different business models on the part of customers (and with different risks too, but that's mostly not related to the fact that it's distributed computing but rather that it is not wholly under your control in your own jurisdiction).
I've found that the most effective use of distributed computing is when you think in terms of connecting together distinct services, each of which with different capabilities (which might be for technical reasons, or might not; sometimes, it's for business or legal reasons that things have to be divided up) and where each of those services may be provided by many components in multiple locations. There are, and continue to remain, issues with balancing the need for performance (which is a force that brings components together) and the need for robustness (which tends to lead to distribution and replication) within the overall context of the general capabilities map.
My goodness! That paragraph sounds like terrible piffle! What I'm trying to say is that it's all trade-offs, and you should be prepared for not getting it right first time.
(Hadoop is a mechanism for doing a distributed file store, and for efficiently applying certain classes of operation – those that fit well with MapReduce or other similar scatter-gather algorithms – across that whole dataset. If that shoe fits, use it. But it doesn't solve all problems, and thank goodness for that! Things that can do everything tend to look very much like things that can't actually do anything at all, and usefulness and comprehensibility come in the restrictions.)
Hadoop is typically used to process massive data sets by distributing the processing of that data set across multiple machines.
What this means is you probably don't want to use it to "deploy an application". You might use it to process stats on your application, however. For instance, you might have very large logs of user data. This would happen if your user data grows to become too large to fit on a single hard drive, and/or would take too long for one machine to process stats on (using standard methods like an SQL query).
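For example, computing per-user event counts from huge log files follows the classic MapReduce shape. The sketch below assumes a made-up log format ("timestamp userId action ..."); it is only an illustration of the pattern, not a recommendation for any particular app.

```java
// Sketch only: count events per user ID across very large log files.
// The log layout (user ID in the second whitespace-separated field) is assumed.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UserEventCount {

    public static class EventMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\\s+");
            if (fields.length > 1) {
                ctx.write(new Text(fields[1]), ONE); // emit (userId, 1)
            }
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text userId, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) {
                total += c.get();
            }
            ctx.write(userId, new LongWritable(total)); // total events per user
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "user event count");
        job.setJarByClass(UserEventCount.class);
        job.setMapperClass(EventMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```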
Ygam, while the traditional roles of "clients" and "servers" were pretty stable from about 1960 until about 2005, I believe with every fiber of my being that distributed computing now means that we all carry processors around in our pockets.
Phones do computing work. Phones do NOT need centralized servers, but they DO benefit from them.
Phones, smartphones and tablets are an example of where distributed computation is going.
You can make a wifi base-station out of an Android device now. So a phone becomes a server of sorts, just for that instant in the coffee shop when you turn it on for that cute person next to you without internet... and now I digress...

Non-relational databases (NoSQL) for small to medium sized applications

The benefits of a non-relational database (such as a key-value pair storage) are evident when used in large scale datasets (google, facebook, linkedin). How do you think small to medium sized applications can benefit from using non-relational databases?
IBM mainframes have had "non-relational" databases since the 60s (hierarchical databases such as IMS and its variants). These databases are still in use because they are extremely fast and handle huge scale well.
The point of relational databases was to provide a regular, relatively abstract method for storing and retrieving data in which the tuning can be done relatively independently of the data model (not true for IMS). They were designed in reaction to the inability to reorganize hierarchical databases easily. The upside is nice organization; the downside is medium, not high, performance.
Google provides scalable storage and MapReduce to handle scale. It isn't relational.
There was a huge push early in the last decade to store data in XML, in essentially hierarchical form, because XML is implicitly hierarchical. That was a huge mistake IMHO, because it repeated the inconvenience of hierarchical databases but had none of the performance. I'm not very surprised this movement seems to have pretty much died.
Most of the practical push to non-relational seems to me to be towards performance and scale. I don't see how this helps "small" applications much.
People have proposed, but not done, a lot of practical data management using knowledge-based schemes. Doug Lenat's CYC comes to mind here. The ability of the database to help an application draw non-obvious conclusions strikes me as very interesting for "small" applications that are trying to be "smart". But there aren't a lot of these yet.
The sweet spot of using a NoSQL database at that scale is when the database model (key-value, document, etc.) is a good match to the application's needs and the advanced relational functionality is not needed.
At the small end of the spectrum, performance is a non-issue because just about everything is fast. Storage engines are a non-issue, and if you don't need a sophisticated query engine, the lack of SQL support is a non-issue too.
You are left with how well it fits and how easy it is to use. Honestly though, tooling does become an issue. Relational database tooling is mature; NoSQL tooling is less feature-rich and less battle-hardened. Too often it is roll-your-own tooling. Definitely consider what tools you'd be giving up and how much you need them.
There is an additional slate of advantages for smaller projects when considering a NoSQL service (like Amazon SimpleDB and Microsoft Azure) as compared to a product. If you only have to pay for what you use and you don't use much, it can be cheaper than running a dedicated server, going all the way down to free for something like the SimpleDB free usage tier.
You also avoid some of the server and database maintenance costs. This can be a big win if you don't have a DBA, or when your DBAs are already overworked. Of course you'll still have admin work to do, but it is significantly reduced, and typically simpler.
When it comes to graph databases (like Neo4j, a project I'm involved in) they excel at scaling to complexity. This means they provide "better substrates for modeling business domains" (see The State of NoSQL, also by Ben Scofield). As I see it, this is very important in small to medium sized apps.
This may be better explained through examples, so here's some links to example apps/domain modeling:
* Access control lists the graph database way
* Social networks in the database: using a graph database
* Domain modeling gallery
The question perhaps requires a bit more context... assuming a Python environment, consider the tutorial at the y_serial project: http://yserial.sourceforge.net/
NoSQL is not merely adopted for reasons of scalability. Serialization (of any arbitrary Python object) and persistence are very convenient at any scale -- so consider the key-value system as one approach.
Well, one of the problems with an RDBMS is that you need to spend effort mapping your programming language's domain models to the relational schema of your RDBMS. This effort is usually spent configuring your ORM layer.
With NoSQL databases you are not forced to map your objects to a relational model and in most cases your objects are serialized as-is. Because of the lack of an intermediary schema, data migrations and versioning become easier.
Another benefit is scalability and performance. Since most of the time your data is retrieved by 'keys', effectively everything uses an index. Trivial sharding is possible by doing a % (MOD) on the key against the number of your available NoSQL instances, providing natural data partitioning, which is crucial for sharding.
If you're interested in seeing how developing with a NoSQL database differs from an RDBMS, I have a tutorial where I show how to go about designing a simple blog application using Redis.
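For a flavour of what that looks like (a sketch of my own, not code from the tutorial), here is the kind of schema-less access you get with the Jedis client; the key names are made up, and the sharding comment just illustrates the MOD idea mentioned above.

```java
// Sketch only: a blog post stored as a Redis hash plus a per-author list of
// post IDs, via the Jedis client. Key names and values are placeholders.
import java.util.Map;
import redis.clients.jedis.Jedis;

public class BlogStore {

    public static void main(String[] args) {
        try (Jedis redis = new Jedis("localhost", 6379)) {
            // Allocate a post ID; the post is stored as-is, no schema needed.
            long postId = redis.incr("post:id:counter");
            String postKey = "post:" + postId;
            redis.hset(postKey, Map.of(
                    "title", "Hello NoSQL",
                    "body", "Posts are just hashes.",
                    "author", "alice"));

            // Secondary access path: newest post IDs per author.
            redis.lpush("author:alice:posts", String.valueOf(postId));

            // Trivial sharding would pick an instance as: postId % numberOfInstances.
            System.out.println(redis.hgetAll(postKey));
        }
    }
}
```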
If you match up a few common PaaS cloud services like a Key-Value store, a BLOB store, and a Message Queue store you have some handy tools that can free small application developers from the tyranny of the DBA and the infrastructure folks.
Today small developers often resort to Jet MDBs. Why? Easy: shared access is as simple as storing the MDB file on a file share visible to the entire application community. When they can get away with it (i.e. get the necessary support from the gatekeepers) they might use SQL Server Express, MySQL, etc.
Sadly those gatekeepers can be pretty hostile to deal with in a large organization. Mention a "database" and suddenly you face the DBA gang and associated delays, application reviews, prioritization, etc. Mention needing a server and you face that other firing squad.
Using a NoSQL solution and related cloud services can eliminate a ton of this if you don't need an RDBMS.
For one thing, all that's really required is an account with a public cloud provider. This is something that becomes fairly easy once the concept has been approved. And easier for you as a developer once you've been approved and assigned an account, though of course there are the usual bookkeeping issues.
But let's even set that aside. What if your organization implemented a private cloud for such uses? Lots of the issues of outside billing go away, data insecurity worries go away, etc.
Such a thing could be implemented and provisioned in a semi-anonymous fashion, almost as easily as administering file shares. The anonymity comes in because once you've been approved to develop on the in-house cloud nobody needs to nitpick the details of your activities using it any more than they need to examine a request before you can create a file on an existing file share.
Obviously there would be storage and CPU quotas to manage. Nobody can afford to just keep scaling up indefinitely, and rogue applications might consume vast quantities of resources. So what you need is some sort of quota system to cap usage. Whether this is monitored by infrastructure folks is an implementation decision, or it might be treated just like file share use: run out and somebody yells at the programmer, who in turn looks into it and requests more if appropriate (or fixes his bugs).
But you end up with "utility computing" and by "using no SQL" you don't incur the cost (and issues) of dealing with DBAs. They can still sit quietly surfing the Web in their big offices while you get some work done.
Amazon SimpleDB can be useful for those who need a non-relational database for storing smaller, non-structured data. It restricts storage to 10 GB per domain, offers simplicity and flexibility, and automatically indexes all data. Pricing is based on your actual usage, and you can store any UTF-8 string data in it.