mongodb connection handling for multi-tenant architecture

We are currently designing a SaaS application with a subscriber/user mode of operation. For example, a single subscriber can have 5, 10, or up to 25 users in their account, depending on what type of package they are on.
At the moment we are going with a single database per tenant approach. This has several advantages for us from the standpoint of the application.
I have read about the connection limits associated with Mongo and I am a little confused and worried. I was hoping someone could clarify it for me in simple terms, as I haven't worked much with Mongo.
From what I understand, there is a hard limit of 20,000 connections on a mongod process and on each mongos process.
How does that translate to this multi-tenant approach? I am basically trying to assess how I would deploy the application in terms of replica sets, and whether sharding is necessary, so that I don't hit these limits. How does one handle such a scenario, for example, if you have 10,000 tenants, each with multiple users, which would exceed the limit?
Our application generally wouldn't need sharding, as each tenant's collections wouldn't reach the point where they would need to be sharded. From what I understand, though, MongoDB will create databases in a round-robin fashion across the shards and so distribute the load, which may help with the connection issue.
This is me just trying to make sense of what I've read and I'm hoping someone can clear this up for me.
Thanks in advance
EDIT
If I just add replica sets, will that alleviate the problem, even though, from what I understand, only the primary can accept writes?

You just have to store database connections in a pool and reuse them when you access the same database again. This will keep the number of connections reasonable unless you are using tens of thousands of databases, which wouldn't be a good idea anyway.
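A minimal sketch of that pattern, assuming the pymongo driver (the tenant_<id> database-naming scheme here is hypothetical): one client per process owns the connection pool, and the per-tenant database handles are cheap views over it, so no new connections are opened per tenant.

    from pymongo import MongoClient

    # One client per process: it maintains an internal connection pool,
    # so every tenant database shares the same small set of sockets.
    client = MongoClient("mongodb://localhost:27017", maxPoolSize=50)

    def tenant_db(tenant_id):
        # Database handles are lightweight; no connection is opened here.
        return client["tenant_%s" % tenant_id]

    # Both tenants reuse sockets from the same pool.
    tenant_db("acme").users.insert_one({"name": "alice"})
    tenant_db("globex").users.find_one({"name": "bob"})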


Best way to host many OrientDB databases

For our architecture we are contemplating something like multi-tenancy. In our approach, each tenant would get their own database. When I say database, I don't mean server; I mean a database within an OrientDB server.
The question is: is there a best-practice way to do this? The three options we see are:
Stand up an entire OrientDB server to host a single database.
This seems inefficient, especially since we are going to move towards a clustered/replicated architecture.
Put multiple databases into a single OrientDB Server
Here I am curious about scalability. Is there a practical limit to how many databases a single OrientDB cluster can hold? Each tenant may make many connections to the database. If, say, each tenant makes 20 or so database connections and we have 1,000 tenants, I now have 20,000 connections going to the database. Obviously we would have many servers supporting this load, so it would be distributed.
Some middle ground where we have a certain number of tenants hosted in each clustered instance of OrientDB
Not sure how to draw the line here.
Just wondering if there are best practices around this. Thanks, and keep up the good work.
The physical limits are given by memory size, the number of transactions managed per second, and the number of open files allowed by the OS.
Every db in OrientDB is just a folder on the filesystem; if you never access the db, it does not use system resources. But as soon as you access and query it, OrientDB has to keep files open, establish connections to the clients, allocate disk cache, and so on.
My suggestion is to have at most a few tens of small databases on the same OrientDB instance.

no single point of failure with traditional RDBMS

I am working on a trading application that depends on an Oracle DB.
The DB has crashed twice, and the business owner wants a solution in which the application keeps working even if the DB crashes.
My team leader proposed Cassandra (NoSQL) as a solution, since it has no single point of failure, but this option would move us from the traditional relational model to the NoSQL model, which I consider a drawback.
My question here: is there a way to avoid a single point of DB failure with a traditional relational DBMS like MySQL, PostgreSQL, etc.?
Sounds like you just need a cluster of Oracle database instances rather than a single instance, i.e. something like Oracle RAC.
If your solution for the Oracle server being offline is to use Cassandra, what happens if the Cassandra cluster goes down? And are you really in a situation where it makes sense to rewrite and re-architect your entire application to use a different type of data store just to avoid downtime from Oracle? I would suspect this only makes sense for applications with huge usage and load numbers, where any downtime is going to cost serious money (and not just cause the business folks embarrassment in front of their bosses).
Is there a way to avoid a single point of DB failure with traditional relational DBMS
No, that's not possible with a single node: when the node dies, it is gone.
Any fault-tolerant system will use several nodes that replicate each other. You can still use traditional RDBMS, but you will need to configure mirroring in order for the system to tolerate a node failure.
NoSQL isn't the only possible solution. You can set up replication with MySQL:
http://dev.mysql.com/doc/refman/5.0/en/replication-solutions.html
and
http://mysql-mmm.org/
and, concerning failover discussions:
http://serverfault.com/questions/274094/automated-failover-strategy-for-master-slave-mysql-replication-why-would-this
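On the client side, drivers can also ride through a failover once replication is in place. As a hedged sketch (host names hypothetical; this assumes PostgreSQL streaming replication is already configured, and psycopg2 built against libpq 10 or newer):

    import psycopg2

    # libpq tries the hosts in order and, with target_session_attrs,
    # only accepts the node that is currently accepting writes.
    conn = psycopg2.connect(
        host="db-primary,db-standby",
        port="5432,5432",
        dbname="trading",
        target_session_attrs="read-write",
    )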

Should I split NoSQL database on separate servers?

We are still in the early stages of determining whether we go with RDBMS or NoSQL.
One of the areas of interest is: if we go with NoSQL (probably CouchDB, although it could be MongoDB), would separate NoSQL databases on different servers be better than one instance of a NoSQL server?
We will be building a file management system where certain files/videos will be grouped on different servers. Account-related files/videos will be stored on the accounts server, and so on. To query for an account-related file, we will most likely search the database on the accounts server.
I can see that in the future someone will say "why can't I search over all servers for a type of file or video?"
Clearly having one database would be better here. However, apart from the latency of the HTTP requests needed to query the servers, are there better ways to do this, or pros and cons of having one large database?
JD
The idea of (most) NoSQL products is that they provide horizontal scalability. This means that your single logical instance could be spread over dozens of servers. In MongoDB, for example, you can use auto-sharding. For your program, this is completely transparent: your code is (almost) the same as you'd use for a single database server, but the data resides on, say, 5 servers.
There are many upsides: you have a central spot from which to administer your database, you can query across all the data if necessary, you don't need to mess around with multiple DB connections in your code, the DB automatically balances those collections that need to be balanced, map/reduce operations can run in parallel if the query allows it, etc.
The most important ones for me: there's not much administration overhead, and you don't have to put too much thought into this now, because you can add auto-sharding later.
I would not try to do sharding myself, because it is re-inventing the wheel, and it's not easy either. This was one of the key drivers for NoSQL in the first place.
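To make that concrete, here is a hedged sketch of turning on auto-sharding through a mongos router, assuming the pymongo driver (database, collection, and host names are illustrative):

    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-host:27017")

    # Shard the "files" database and key the collection on account_id,
    # so each account's documents cluster together on one shard.
    client.admin.command("enableSharding", "files")
    client.admin.command("shardCollection", "files.videos",
                         key={"account_id": 1})

    # Application code is unchanged: reads and writes go through mongos.
    client["files"]["videos"].insert_one({"account_id": 7, "name": "intro.mp4"})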

MongoDb vs CouchDb: write speeds for geographically remote clients

I would like all of my users to be able to read from and write to the datastore very quickly. It seems like MongoDB has blazing-fast reads, but the writes could be very, very slow if the one master DB has to be located far away from the client.
CouchDB seems to have slow reads, but what about its writes in the case where the client is very far away from the master?
With CouchDB, we can have multiple masters, meaning we can always have a write node close to the client. Could CouchDB actually be faster for writes than MongoDB in the case where our user base is spread out very far geographically?
I would love to use MongoDB due to its blazing speed, but some of my users very far away from the only master would have a horrible experience.
For worldwide systems, wouldn't CouchDB be better? Isn't MongoDB completely ruled out in the case where you have users all around the world?
MongoDB, if you're listening: why don't you do some simple multi-master setups, where conflict resolution can be part of the update semantics?
This seems to be the only thing standing between MongoDB and complete domination of the NoSQL market share. Everything else is very impressive.
Disclosure: I am a MongoDB fan and user; I have zero experience with CouchDB.
I have a heavy-duty app that is very read/write intensive; I'd say reads outnumber writes by a factor of around 30:1. The way Mongo is designed, reads are always going to be much faster than writes. The trick (in my experience) is to make your writes so efficient that you can dedicate a higher percentage of your system resources to the writes.
When building a product on top of Mongo, the key thing to remember is the _id field. This field is automatically generated and added to all of your JSON objects, and it will look something like 47cc67093475061e3d95369d. When you design your queries (finds), try to query on this field wherever possible, as it contains the machine location (and I think also the disk location??? I should check this) where the object lives, so using a find or update on this field will really speed things up. Consider this in the design of your system.
Example:
Two of the collections in my database are "users" and "posts". A user can create multiple posts. These two collections have to reference each other a lot in the implementation of my app.
In each post object I store the _id of the parent user.
In each user object I store an array of all the posts the user has authored.
Now on each user page I can generate a list of all the authored posts without a resource-stressful query, just a direct lookup by _id. The bigger the Mongo cluster, the bigger the difference this is going to make.
If you're at all familiar with Oracle's physical-location ROWIDs, you may understand this concept, only in Mongo it is much more awesome and powerful.
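A sketch of that users/posts cross-referencing pattern, assuming pymongo (field and collection names are illustrative):

    from pymongo import MongoClient

    db = MongoClient()["app"]

    def create_post(user_id, text):
        # Each post stores the _id of its parent user...
        post_id = db.posts.insert_one({"author_id": user_id,
                                       "text": text}).inserted_id
        # ...and the user keeps an array of authored post _ids.
        db.users.update_one({"_id": user_id},
                            {"$push": {"post_ids": post_id}})
        return post_id

    def authored_posts(user_id):
        # Direct _id lookups use the automatic _id index, so no scanning.
        user = db.users.find_one({"_id": user_id}, {"post_ids": 1}) or {}
        return list(db.posts.find({"_id": {"$in": user.get("post_ids", [])}}))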
I was scared last year when we decided to finally ditch MySQL for Mongo, but I can tell you the following about my experience:
- Data porting is always horrible, but it went as well as I could have imagined.
- Mongo is probably the best-documented NoSQL DB out there, and the open-source community is fantastic.
- When they say fast and scalable, they're not kidding; it flies.
- Schema design is very easy and much more natural and orderly than key/value-type DBs, in my opinion.
- The whole system seems designed for minimal user complexity; adding nodes etc. is a breeze.
Ok, seriously, I swear Mongo didn't pay me to write this (I wish), but apologies for the love fest.
Whatever your choice, best of luck.
Here is a detailed article that 10gen has created, which gives examples of when you should choose MongoDB or CouchDB, with reasons as well.
http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB
Edit
The above link was removed, but can be viewed here: http://web.archive.org/web/20120614072025/http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB
Your question, as it stands now, is full of speculation and guessing.
...why can't we opt out of consistency for certain writes, so long as we're sure that the person that wrote the data will be able to read it consistently, whereas others will observe eventual consistency
What if those writes affect other writes? What if those writes would prevent other people from doing things? It's hard to tell the possible side effects, since you didn't give us any specifics.
My main suggestion to you is that you do some testing. Unless you've tested it, speculation about bottlenecks is a complete waste of time. You don't need to test over remote machines; set up some local DBs, add some artificial lag, and then run your tests.
This way you can test the different options you've got and see where MongoDB is better or where CouchDB excels. Then you can either take one of them and live with its cons, or you can try to tweak your database model itself and do more tests.
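A rough sketch of such a local test harness, assuming pymongo (ports are illustrative; the second instance is presumed to sit behind artificially added latency, e.g. a TCP proxy):

    import time
    from pymongo import MongoClient

    def avg_write_ms(uri, n=1000):
        coll = MongoClient(uri)["bench"]["docs"]
        coll.drop()
        start = time.perf_counter()
        for i in range(n):
            coll.insert_one({"i": i, "payload": "x" * 256})
        return (time.perf_counter() - start) / n * 1000

    print("direct:", avg_write_ms("mongodb://localhost:27017"))
    print("lagged:", avg_write_ms("mongodb://localhost:27018"))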
Nobody here will be able to give you a general solution to your specific problem (well, unless you give us all your code and pay us to work on it :P). Databases aren't easy, especially when you need to scale them under particular requirements.
For worldwide systems, wouldn't CouchDB be better? Isn't MongoDB completely ruled out in the case where you have users all around the world?
MongoDB supports sharding, so you don't need a single master. In fact, it looks like you have a ready-made shard key (region).
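As a hedged sketch of what region-keyed sharding can look like (this uses zone sharding from later MongoDB releases, 3.4+, which postdates this discussion; pymongo assumed, shard and zone names hypothetical):

    from pymongo import MongoClient
    from bson.min_key import MinKey
    from bson.max_key import MaxKey

    admin = MongoClient("mongodb://mongos-host:27017").admin

    admin.command("enableSharding", "app")
    admin.command("shardCollection", "app.users", key={"region": 1, "_id": 1})

    # Pin the EU key range to shards tagged with the EU zone, so writes
    # from EU users land on EU hardware.
    admin.command("addShardToZone", "shard-eu-0", zone="EU")
    admin.command("updateZoneKeyRange", "app.users",
                  min={"region": "EU", "_id": MinKey()},
                  max={"region": "EU", "_id": MaxKey()},
                  zone="EU")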
MongoDB also supports replica sets along with sharding, so if you need to run in multiple data centers (DCs), you can put a master and one of the replicas in the same DC. In fact, they also suggest adding a 3rd node in a separate DC as a hot-backup failover.
You will have to drill into the more detailed configuration of MongoDB, but you can definitely control where data is stored, and you can give the other replicas in a DC priority so that they are "next in line" for promotion to master.
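A hedged sketch of biasing elections that way via replica set member priorities (pymongo assumed; host names hypothetical, and the reconfiguration must be run against the current primary):

    from pymongo import MongoClient

    client = MongoClient("mongodb://nyc-1:27017", directConnection=True)
    config = client.admin.command("replSetGetConfig")["config"]

    for member in config["members"]:
        # Members in the primary data center get higher election priority.
        member["priority"] = 2.0 if member["host"].startswith("nyc-") else 1.0

    config["version"] += 1
    client.admin.command({"replSetReconfig": config})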
At this point, however, you're well into the details of MongoDB, and you'll need to dig around and "play" quite a bit. Then again, you'll need lots of "play time" for any solution that's really going to handle masters across data centers.

What is the recommended approach towards multi-tenant databases in MongoDB?

I'm thinking of creating a multi-tenant app using MongoDB. I don't yet have any idea how many tenants I'd have, but I would like to be able to scale into the thousands.
I can think of three strategies:
All tenants in the same collection, using tenant-specific fields for security
1 Collection per tenant in a single shared DB
1 Database per tenant
The voice in my head is suggesting that I go with option 2.
Thoughts and implications, anyone?
I have the same problem to solve and am also considering the options.
As I have years of experience creating SaaS multi-tenant applications, I was also going to select the second option, based on my previous experience with relational databases.
While doing my research, I found this article on the MongoHQ support site (Wayback Machine link added since the original is gone):
https://web.archive.org/web/20140812091703/http://support.mongohq.com/use-cases/multi-tenant.html
The guys there said to avoid the 2nd option at all costs, which, as I understand it, is not particularly specific to MongoDB. My impression is that this applies to most of the NoSQL DBs I researched (CouchDB, Cassandra, Couchbase Server, etc.) due to the specifics of their database design.
Collections (or buckets, or whatever they are called in different DBs) are not the same thing as security schemas in an RDBMS: although they behave as containers for documents, they are useless for enforcing good tenant separation. I couldn't find a NoSQL database that can apply security restrictions based on collections.
Of course, you can use MongoDB's role-based security to restrict access at the database/server level (http://docs.mongodb.org/manual/core/authorization/).
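For instance, a hedged sketch of that database-level restriction (pymongo assumed; user and database names hypothetical, and authentication must be enabled on the server):

    from pymongo import MongoClient

    admin_client = MongoClient("mongodb://root:secret@localhost:27017")

    # This user can read and write tenant_acme's database and nothing else.
    admin_client["tenant_acme"].command(
        "createUser", "acme_app",
        pwd="s3cret",
        roles=[{"role": "readWrite", "db": "tenant_acme"}],
    )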
I would recommend the 1st option when:
You have enough time and resources to deal with the complexity of the design, implementation and testing of this scenario.
You are not going to have many differences in structure and functionality in the database for different tenants.
Your application design will allow tenants to make only minimal customizations at runtime.
You want to optimize space and minimize usage of hardware resources.
You are going to have thousands of tenants.
You want to scale out fast and at good cost.
You are NOT going to back up data per tenant (keeping separate backups for each tenant). It is possible to do that even in this scenario, but the effort would be huge.
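For concreteness, a minimal sketch of option 1 in code: one shared collection, with every document and every query scoped by a tenant field (pymongo assumed; collection and field names illustrative):

    from pymongo import MongoClient

    docs = MongoClient()["saas"]["documents"]

    # A compound index keeps per-tenant queries fast in the shared collection.
    docs.create_index([("tenant_id", 1), ("created_at", -1)])

    def insert_for_tenant(tenant_id, doc):
        return docs.insert_one(dict(doc, tenant_id=tenant_id))

    def find_for_tenant(tenant_id, query):
        # Forgetting this tenant_id filter is the classic security bug of
        # the shared-collection approach, hence the testing caveat above.
        return docs.find(dict(query, tenant_id=tenant_id))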
I would go for variant 3 if:
You are going to have a small list of tenants (several hundred).
The specifics of the business require you to support big differences in the database structure for different tenants (e.g. integration with 3rd-party systems, import/export of data).
Your application design will allow customers (tenants) to make significant changes in the application at runtime (adding modules, customizing fields, etc.).
You have enough resources to scale out with new hardware nodes quickly.
You are required to keep versions/backups of data per tenant. Restores for a single tenant will also be easy.
There are legal/regulatory restrictions that force you to keep different tenants in different databases (or even data centers).
You want to fully utilize the out-of-the-box security features of MongoDB, such as roles.
There are big differences in size between tenants (you have many small tenants and a few very large ones).
If you post additional details about your application, perhaps I can give you more detailed advice.
I found a good answer in the comments on this link:
http://blog.boxedice.com/2010/02/28/notes-from-a-production-mongodb-deployment/
Basically option #2 seems to be the best way to go.
Quote from David Mytton's comment:
We decided not to have a database per customer because of the way MongoDB allocates its data files. Each database uses its own set of files: "The first file for a database is dbname.0, then dbname.1, etc. dbname.0 will be 64MB, dbname.1 128MB, etc., up to 2GB. Once the files reach 2GB in size, each successive file is also 2GB. Thus if the last datafile present is, say, 1GB, that file might be 90% empty if it was recently reached." (from the manual)
As users sign up to the trial and give things a go, we'd get more and more databases that were at least 2GB in size, even if the whole of the data file wasn't used. We found this used a massive amount of disk space compared to having several databases for all customers, where the disk space can be used to maximum efficiency.
Sharding will be on a per-collection basis as standard, which presents a problem where a collection never reaches the minimum size to start sharding, as is the case for quite a few of ours (e.g. collections just storing user login details). However, we have requested that this also be possible at a per-database level. See http://jira.mongodb.org/browse/SHARDING-41
There are no performance tradeoffs to using lots of collections. See http://www.mongodb.org/display/DOCS/Using+a+Large+Number+of+Collections
There is a reasonable article on MSDN about multi-tenant data architecture which you might wish to refer to. Some key topics touched on by this article:
Economic considerations
Security
Tenant considerations
Regulatory (legal)
Skill set concerns
Also touched upon are some patterns for Software as a Service (SaaS) configuration.
Additionally, worth a gander is an interesting write-up from the SQL Anywhere guys.
My own personal take: unless you are certain of enforced security/trust, I would go with option 3, or, if scalability concerns prohibit that, fall back to option 2 at a minimum. That said... I'm no pro with MongoDB. I get pretty nervous using a shared "schema", but I will happily defer to more experienced practitioners.
I would go for option 2.
However, you could set the mongod.exe command-line option --smallfiles, which means the biggest extent file size will be 0.5 GB rather than 2 GB. I tested this with Mongo 1.4.2, so option 3 is not impossible either.
According to my research in MongoDB. Trucos y consejos. Aplicaciones multitenant. ("MongoDB. Tips and tricks. Multi-tenant applications."),
that option is not recommended if you do not know how many tenants you will have; there could be thousands, and it would be complicated when it comes to sharding. Also imagine having thousands of collections in a single database... So in your case it is recommended to use option one. Now, if you are going to have a limited number of users, that is different, and yes, you could use option two as you thought.
While the discussion here is about NoSQL and primarily MongoDB, we at Citus are using PostgreSQL and building a distributed/sharded multi-tenant database.
Our use-case guide walks through an example app, covering the schema and various multi-tenant-specific features.
For more unstructured data, we use PostgreSQL's JSONB column type to store such tenant-specific data.
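A hedged sketch of that JSONB approach, assuming the psycopg2 driver (table and column names are illustrative): a shared table keyed by tenant_id, with a JSONB column absorbing the fields that vary per tenant.

    import json
    import psycopg2

    conn = psycopg2.connect("dbname=saas")
    with conn, conn.cursor() as cur:
        # Shared multi-tenant table; the jsonb payload holds whatever
        # tenant-specific structure a given tenant needs.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS events (
                tenant_id bigint NOT NULL,
                id        bigserial,
                payload   jsonb,
                PRIMARY KEY (tenant_id, id)
            )
        """)
        cur.execute(
            "INSERT INTO events (tenant_id, payload) VALUES (%s, %s)",
            (42, json.dumps({"plan": "gold", "custom_color": "teal"})),
        )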