MongoDB data modelling: any drawbacks in using lots of databases?

I have recently moved part of the back-end of a web app to MongoDB. The web app itself is a validation tool, and the workflow looks like:
the user uploads a file (typically hundreds of thousands of lines)
the validator checks it, outputting a lot of messages (possibly more than one per line)
...and finally produces a few statistics
I modelled my application so that each user has its own DB containing:
The file (saved through GridFS)
A collection containing the messages (possibly over a million lines, in some cases)
A collection with the statistics
We have a few hundred users, so MongoDB will end up holding a few hundred DBs.
Of course I could have held all the data in the same DB, using namespaces to separate the different users' data. However, I felt it was handy to put the DB name in the connection URI, and I found it more intuitive to issue a "drop database" statement to purge a user rather than searching for and removing their data in one large DB.
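For illustration, a minimal pymongo sketch of this layout (the user, collection, and file names are hypothetical):

    # Sketch of the per-user database layout, assuming pymongo and GridFS;
    # database, collection, and file names are hypothetical.
    from pymongo import MongoClient
    import gridfs

    client = MongoClient("mongodb://localhost:27017/user_12345")  # DB in the URI
    db = client["user_12345"]              # one database per user

    fs = gridfs.GridFS(db)                 # the uploaded file, via GridFS
    with open("upload.csv", "rb") as f:
        file_id = fs.put(f, filename="upload.csv")

    db.messages.insert_one({"line": 1, "severity": "warning", "text": "..."})
    db.statistics.insert_one({"lines": 250000, "errors": 12})

    client.drop_database("user_12345")     # purging a user is one call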
I am pretty new to MongoDB, so my question is: is there any drawback in having several DBs in the same MongoDB instance? Or is there any special consideration that I should give to the problem?

I'm not familiar with MongoDB specifically. In general, opening a connection to a database is a relatively slow operation that ties up system resources. Whether this is enough to matter in your case I can't say.
Having a different db for each user would make it difficult to perform queries that access data for multiple users. Maybe you have no need to do this.
Still, I would think it would be a whole lot simpler in general to just put a user id in each record rather than create a separate database. What's the gain of separate databases? Okay, deleting a user means saying "drop database". But deleting a user from a single database just means saying "delete from tableX where user=?; delete from tableY where user=?", and so on for however many relevant tables you have. I can't imagine it's hundreds, right? Maybe half a dozen lines of code or so?
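In MongoDB terms, purging one user from a shared database is a handful of targeted deletes rather than a drop; a hedged pymongo sketch (collection names and the metadata layout are hypothetical):

    # Sketch of purging one user from a single shared database, assuming
    # pymongo; collection names are hypothetical.
    from pymongo import MongoClient
    import gridfs

    db = MongoClient("mongodb://localhost:27017/")["validator"]

    def purge_user(user_id):
        # One targeted delete per collection that carries a user_id field.
        db.messages.delete_many({"user_id": user_id})
        db.statistics.delete_many({"user_id": user_id})
        # GridFS files need per-file deletes so their chunks go too
        # (assumes uploads were stored with metadata={"user_id": ...}).
        fs = gridfs.GridFS(db)
        for f in fs.find({"metadata.user_id": user_id}):
            fs.delete(f._id)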

Related

Asp Net Boilerplate - Setup Schema-Per-Tenant Multitenancy (EntityFrameworkCore & PostgreSQL)

We are looking into using Asp Net Boilerplate. It looks very promising; we love the framework, but we would like to be able to use a schema-per-tenant multitenancy configuration. Instead of sharing the data in the same DB & tables, each tenant would "have" a schema in which the whole database structure would be replicated.
One of our data tables will be quite big (sometimes 1+ million entries per tenant), and we were advised that for performance reasons it's better to keep the number of entries as low as possible. Also, this particular table will be queried and inserted into a lot. It would be unrealistic for this table to hold the data of 40+ tenants. For that reason, and others, we would prefer to have a distinct schema per tenant.
Our DB is a single PostgreSQL server (might scale up to more in the future). We use EntityFramework & Npgsql. We already noticed that it is possible to set up a different ConnectionString for specific tenants that would have bigger data requirements.
See the "separate schema per tenant" approach in http://www.summa.com/blog/2013/09/17/approaches-to-multi-tenancy
Any idea on how to achieve schema-per-tenant multitenancy? There are a lot of moving parts in this, and I'm not sure where to start.
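For reference, the raw PostgreSQL mechanic behind schema-per-tenant (Asp Net Boilerplate specifics aside) is the per-connection search_path; a minimal psycopg2 sketch of just that part, with hypothetical schema and table names:

    # Sketch of schema-per-tenant in PostgreSQL via search_path, assuming
    # psycopg2; schema and table names are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")
    conn.autocommit = True

    with conn.cursor() as cur:
        cur.execute("CREATE SCHEMA IF NOT EXISTS tenant_acme")
        cur.execute("CREATE TABLE IF NOT EXISTS tenant_acme.orders (id serial PRIMARY KEY)")
        # Bind this connection to the tenant's schema:
        cur.execute("SELECT set_config('search_path', 'tenant_acme', false)")
        cur.execute("SELECT count(*) FROM orders")  # resolves to tenant_acme.orders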

How do I share reference PostgreSQL tables between databases?

The system I'm designing has a set of reference tables that rarely have to be updated. New databases will be constantly started to process files that will have to query that information.
What's the best arrangement for coordinating communication between that set of information and the work databases? I certainly don't want to duplicate the reference information in every new work database. The work databases will likely be deleted once their work is completed.
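One standard PostgreSQL approach to this shape of problem (not from this thread) is postgres_fdw, which lets each work database query the reference tables where they live instead of copying them; a hedged sketch, with all names hypothetical:

    # Sketch of exposing a reference database's tables inside a work database
    # via postgres_fdw; one standard approach, all names hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=work_db user=app")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS postgres_fdw")
        cur.execute("""
            CREATE SERVER IF NOT EXISTS ref_srv
            FOREIGN DATA WRAPPER postgres_fdw
            OPTIONS (host 'localhost', dbname 'reference_db')
        """)
        cur.execute("""
            CREATE USER MAPPING IF NOT EXISTS FOR CURRENT_USER
            SERVER ref_srv OPTIONS (user 'app')
        """)
        # The reference tables now appear as foreign tables, no duplication:
        cur.execute("IMPORT FOREIGN SCHEMA public FROM SERVER ref_srv INTO public")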

Postgres Multi-tenant administration/maintenance

We have a SaaS application where each tenant has its own database in Postgres. How would I apply a patch to all the databases? For example, if I want to add a table or add a column to a table, I have to either write a program that loops through all the databases and executes SQL against each, or go through them one by one in pgAdmin.
Is there a smarter and/or faster way?
Any help is greatly appreciated.
Yes, there's a smarter way.
Don't create a new database for each tenant. If everything is in one database then you only need to alter one database.
Pick one database, alter each table to have a TENANT column, and add that column to the primary key. Then insert into this database every record for all tenants, and drop the other databases. (Obviously this is considerably more work than it sounds, as your application will need to be changed.)
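Per table, the consolidation would look something like this sketch (psycopg2, with hypothetical table and column names):

    # Sketch of folding a TENANT column into a table's primary key,
    # run against the one surviving database; names are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=main user=app")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("ALTER TABLE invoices ADD COLUMN tenant text NOT NULL DEFAULT 'acme'")
        cur.execute("ALTER TABLE invoices DROP CONSTRAINT invoices_pkey")
        cur.execute("ALTER TABLE invoices ADD PRIMARY KEY (tenant, id)")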
The differences with your approach are extensively discussed elsewhere:
What problems will I get creating a database per customer?
What are the advantages of using a single database for EACH client?
Multiple schemas versus enormous tables
Practicality of multiple databases per client vs one database
Multi-tenancy - single database vs multiple database
If you don't put everything in one database then I'm afraid you have to alter them all individually, and doing it programmatically would be simplest.
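A minimal sketch of that loop with psycopg2 (connection details and the naming convention for tenant databases are hypothetical):

    # Sketch of applying one DDL statement to every tenant database,
    # assuming psycopg2; connection details and naming are hypothetical.
    import psycopg2

    DDL = "ALTER TABLE invoices ADD COLUMN IF NOT EXISTS notes text"

    admin = psycopg2.connect("dbname=postgres user=app")
    with admin.cursor() as cur:
        cur.execute("SELECT datname FROM pg_database WHERE datname LIKE 'tenant_%'")
        tenant_dbs = [row[0] for row in cur.fetchall()]
    admin.close()

    for dbname in tenant_dbs:
        conn = psycopg2.connect(dbname=dbname, user="app")
        conn.autocommit = True
        with conn.cursor() as cur:
            cur.execute(DDL)
        conn.close()
        print("patched", dbname)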
At a higher level, all multi-tenant applications follow one of three approaches:
One tenant's data lives in one database,
One tenant's data lives in one schema, or
Add a tenant_id / account_id column to your tables (shared schema).
I usually find that developers use the following criteria when they evaluate these different approaches.
Isolation: Since you can put each tenant into its own database on one hand, or have tenants share the same table on the other, this is the most apparent dimension. If you give your users raw SQL access, or you're in a regulated industry such as healthcare, you may need strict guarantees from your database. That said, PostgreSQL 9.5 comes with row level security policies that make this less of a concern for most applications.
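A hedged sketch of what such a policy looks like (psycopg2; the table name and the app.tenant_id setting are hypothetical):

    # Sketch of PostgreSQL row level security in a shared-schema design,
    # assuming psycopg2; table name and the app.tenant_id GUC are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("ALTER TABLE orders ENABLE ROW LEVEL SECURITY")
        cur.execute("""
            CREATE POLICY tenant_isolation ON orders
            USING (tenant_id = current_setting('app.tenant_id')::int)
        """)
        # Each connection pins its tenant before querying:
        cur.execute("SELECT set_config('app.tenant_id', '42', false)")
        cur.execute("SELECT * FROM orders")  # only tenant 42's rows are visible

Note that the table's owner bypasses RLS unless ALTER TABLE ... FORCE ROW LEVEL SECURITY is also applied.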
Extensibility: If your tenants share the same schema (approach #3) and have fields that vary between them, then you need to think about how to accommodate these fields.
This article on multi-tenant databases has a great summary of the different approaches. For example, you can add a dozen columns, call them C1, C2, and so forth, and have your application infer the actual data in each column based on the tenant_id. PostgreSQL 9.4 comes with JSONB support and natively allows you to use semi-structured fields to express variations between different tenants' data.
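For instance, a minimal sketch of the JSONB route (psycopg2; table and field names are hypothetical):

    # Sketch of per-tenant custom fields via a JSONB column in a shared
    # schema, assuming psycopg2; table and field names are hypothetical.
    import json
    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS tickets (
                id        serial PRIMARY KEY,
                tenant_id int    NOT NULL,
                extra     jsonb  NOT NULL DEFAULT '{}'
            )
        """)
        # Tenant 7 tracks a 'priority' field that other tenants don't have.
        cur.execute("INSERT INTO tickets (tenant_id, extra) VALUES (%s, %s)",
                    (7, json.dumps({"priority": "high"})))
        cur.execute("SELECT id FROM tickets WHERE tenant_id = %s "
                    "AND extra->>'priority' = %s", (7, "high"))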
Scaling: Another criterion is how easily your database scales out. If you create a database or schema per tenant (#1 or #2 above), your application can make use of existing Ruby gems or Django packages to simplify app integration. That said, you'll need to manually manage your tenants' data and the machines they live on. Similarly, you'll need to build your own sharding logic to propagate foreign key constraints and ALTER TABLE commands.
With approach #3, you can use existing open source scaling solutions, such as Citus. For example, this blog post describes how to easily shard a multi-tenant app with Postgres.
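For reference, distributing a table by tenant in Citus is a single function call; a hedged sketch (the table name is hypothetical):

    # Sketch of sharding a shared-schema table by tenant with Citus, assuming
    # psycopg2 against a Citus coordinator; the table name is hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")
    conn.autocommit = True
    with conn.cursor() as cur:
        # Citus co-locates all of a tenant's rows on the same shard.
        cur.execute("SELECT create_distributed_table('events', 'tenant_id')")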
It's time for me to give back to the community :) After 4 years, our multi-tenant platform is in production and I would like to share the following observations/experiences with all of you.
We used a database per tenant. This has given us extreme flexibility: individual databases' backups are not huge, and hence we can easily import them into our staging environment to investigate customer issues.
We use Liquibase for database development and upgrades. This has been a tremendous help to us, allowing us to package the entire build into a simple war file. All changes are easily versioned and managed very efficiently. There is a bit of a learning curve here and there, but nothing substantial; investing 2-5 days in it can save you significant time.
Given that we use Spring/JPA/Hibernate, we use a technique called dynamic data source routing. When a user logs in, we look up the related data source and connect their session to the right database. That's also when the Liquibase scripts get applied for updates.
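Stripped of the Spring machinery (typically an AbstractRoutingDataSource), the routing idea reduces to a per-login lookup; a language-agnostic sketch in Python, with hypothetical names:

    # Sketch of dynamic data source routing: on login, resolve the tenant's
    # database from a registry and bind the session to it. Names are
    # hypothetical; in Spring this is typically an AbstractRoutingDataSource.
    import psycopg2

    TENANT_DSNS = {
        "acme":   "dbname=tenant_acme user=app",
        "globex": "dbname=tenant_globex user=app",
    }

    def connect_for(tenant):
        # The routing step: pick the data source keyed by the logged-in tenant.
        return psycopg2.connect(TENANT_DSNS[tenant])

    conn = connect_for("acme")  # this session now talks to acme's database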
That's it for now; I will come back with more later on.
Well, there are certainly problems with one database for all tenants in our case:
The backup file gets huge and becomes impractical and hard to manage.
For troubleshooting, when we need to restore a customer's data in our dev env, we just use that customer's backup file, which is far smaller than a backup of one database holding all customers would be.
Again, Liquibase has been key in allowing us to manage updates across all the tenants seamlessly and without any issues. Without Liquibase, I can see lots of complications with this approach. So: Liquibase, Liquibase and more Liquibase.
I also suspect that we would need more powerful hardware to manage one huge database with large joins across millions of records, versus much lighter databases with much smaller queries.
In case of problems, the service doesn't go down for everyone; the impact is limited to one or a few tenants.
In general, for our purposes, this has been a great architectural decision and we are benefiting from it every day. At one point we had a customer that didn't have archiving active, and their database grew to over 3 GB. With offshore teams, slower internet, and storage/bandwidth prices, one can see how things could become complicated very quickly.
Hope this helps someone.
--Rex

Existing Postgres Database vs Solr

We have an app that uses a Postgres database with about 50 tables. Each table contains about 3 million records (on average). The tables get updated with new data every now and then. Now we want to implement a search feature in our app. The search needs to be performed on one table at a time (no joins needed).
I've read about Postgres full text search support, and that looks promising. But it seems that Solr is super fast in comparison. Can I use my existing Postgres database with Solr? If tables get updated, would I need to re-index everything again?
It is definitely worth giving Solr a try. We moved many MySQL queries involving JOINs on multiple tables with sorting on different fields to Solr. We are very happy with Solr's search speed, sort speed, faceting capabilities and highly configurable text analysis/tokenization options.
If tables get updated, would I need to re-index everything again?
No, you can run delta imports to only re-index your new and updated documents. See https://wiki.apache.org/solr/DataImportHandler.
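Triggering a delta import is just a request to the DataImportHandler endpoint; a hedged sketch (Solr host and core name are hypothetical):

    # Sketch of triggering a DataImportHandler delta import, assuming the
    # requests library; Solr host and core name are hypothetical.
    import requests

    resp = requests.get(
        "http://localhost:8983/solr/mycore/dataimport",
        params={"command": "delta-import", "commit": "true"},
    )
    print(resp.status_code)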
Get started with https://lucene.apache.org/solr/4_1_0/tutorial.html and all the links in there.
Since nobody has leapt in, I'll answer.
I'm afraid it all depends. It depends on (at least)
how big the text is in each "document"
how flexible you want your searching to be
how much integration you need between database and text-search
how fast is fast enough
how much experience you have with both
When I've had a database that needed some text searching, I've just used PG's built-in options. If I didn't have superuser access to the db, or were already running a big Java setup, then Solr might well have appealed.

MongoDB - Single Database or Multiple Databases for SaaS Offering

We have decided to use MongoDB for a SaaS offering we are creating. Each company that signs up gets its own URL (mycompany.domain.com) and its own private set of users, projects, etc. Since we are using a NoSQL solution and wouldn't have to manage pushing out schema updates to every database like we would with MySQL, I am wondering whether it would be better to have one huge database containing all the data, or one database per client.
Since MongoDB can shard a database across multiple servers, I'm thinking there wouldn't be a huge performance hit with one giant database, but I also think backups and exporting data would be much easier with one database per client. Any thoughts?
Go with one database, but make sure to take advantage of some sort of replication for backup purposes!
Look into sharding, or look into replica sets.
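If you do go with one database, the usual pattern is a tenant field on every document, which can double as the shard key; a hedged pymongo sketch (database and collection names are hypothetical):

    # Sketch of the shared-database pattern in MongoDB: a company_id on every
    # document, doubling as the shard key. Assumes pymongo and a sharded
    # cluster; database and collection names are hypothetical.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    db = client["saas"]

    db.projects.insert_one({"company_id": "acme", "name": "Website redesign"})
    acme_projects = list(db.projects.find({"company_id": "acme"}))  # tenant-scoped

    # On a sharded cluster, shard the collection on the tenant key:
    client.admin.command("enableSharding", "saas")
    client.admin.command("shardCollection", "saas.projects",
                         key={"company_id": 1})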