Asp Net Boilerplate - Setup Schema-Per-Tenant Multitenancy (EntityFrameworkCore & PostgreSQL)

We are looking into using Asp Net Boilerplate, which looks very promising. We love the framework, but we would like to use a schema-per-tenant multitenancy configuration. Instead of sharing data in the same database and tables, each tenant would "have" its own schema, in which the whole database structure would be replicated.
One of our tables will be quite big (sometimes more than 1 million entries per tenant), and we were advised that, for performance reasons, it is better to keep the number of entries per table as low as possible. This table will also be queried and inserted into heavily, so it would be unrealistic for it to hold the data of 40+ tenants. For that reason, and others, we would prefer a distinct schema per tenant.
Our DB is a single PostgreSQL server (it might scale out to more in the future). We use EntityFramework & Npgsql. We already noticed that it is possible to set up a different ConnectionString for specific tenants that have bigger data requirements.
See "separate schema per tenant" in http://www.summa.com/blog/2013/09/17/approaches-to-multi-tenancy
Any idea how to achieve schema-per-tenant multitenancy? There are a lot of moving parts in this, and I'm not sure where to start.
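To make the database side of the idea concrete, here is a minimal sketch in Python (not C#, and not Asp Net Boilerplate or Npgsql API) of the SQL a schema-per-tenant setup would need: one statement to create the tenant's schema, the table DDL replicated inside it, and a SET search_path to issue after opening a connection for that tenant. CREATE SCHEMA and SET search_path are standard PostgreSQL statements; the tenant_ prefix, the function names and the measurements table are assumptions made up for illustration. How to hook this into the framework is not covered here.

```python
import re
from typing import List

# Conservative whitelist for schema names (an assumption of this sketch).
# PostgreSQL identifiers are limited to 63 bytes.
_SCHEMA_NAME = re.compile(r"^[a-z_][a-z0-9_]{0,62}$")


def schema_for(tenant_key: str) -> str:
    """Map a tenant key to a schema name, rejecting anything unsafe."""
    name = f"tenant_{tenant_key.lower()}"
    if not _SCHEMA_NAME.match(name):
        raise ValueError(f"unsafe tenant key: {tenant_key!r}")
    return name


def provision_sql(tenant_key: str, table_ddl: List[str]) -> List[str]:
    """Statements that create the tenant schema and replicate the tables in it."""
    schema = schema_for(tenant_key)
    statements = [f'CREATE SCHEMA IF NOT EXISTS "{schema}"']
    # Each DDL template carries a {schema} placeholder (hypothetical convention).
    statements += [ddl.format(schema=f'"{schema}"') for ddl in table_ddl]
    return statements


def select_tenant_sql(tenant_key: str) -> str:
    """Statement to run right after opening a connection for this tenant."""
    return f'SET search_path TO "{schema_for(tenant_key)}"'


# Hypothetical table template standing in for "the whole database structure".
TABLES = ["CREATE TABLE {schema}.measurements (id bigserial PRIMARY KEY)"]

for statement in provision_sql("Acme", TABLES):
    print(statement)
print(select_tenant_sql("Acme"))
```

Validating the tenant key before it is interpolated matters because schema names cannot be passed as ordinary query parameters.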

Related

Multiple database in EF6

We are involved in a fairly new development in which we are rebuilding our current web shop platform.
The current platform does not use EF6 or any other ORM; it accesses the database through stored procedures. The new platform will use EF6.
We have a doubt regarding the database design of the new platform. The current platform uses several different databases, split by content.
For example, we have dedicated databases for product catalogs and another dedicated database for handling orders.
Currently all data access is done through stored procedures, so we have no problem with links between different databases.
The problem appeared when we started to use EF6. Each database is associated with its own context, and data in one context cannot be reached from another
unless we implement these relationships directly in the source code across several contexts. It looks like this means we will lose the power of EF6.
The questions we have are:
Is it bad design to maintain different databases for the same application when using EF6?
If this is a poor design and we choose a single database, will performance be acceptable with hundreds of tables (almost 1000) and several terabytes of data?
On the other hand, if we opt for the design with several databases (which would be much better in our case), what is the best way to handle them in EF6?
Thank you very much for your help!
First of all, EF is not designed to work across databases. You can't write cross-database (cross-context) queries, lazy loading does not work across contexts, and so on.
This is a big limitation in your case.
EF can work with several schemas (I don't use that myself and I don't like it, but that is just my opinion).
You can use your stored procedures with EF, but as I understand it you are planning to stop using them.
In my experience, I have written several applications with more than one database, but the use of the different databases was very limited. In those cases I use cross-database views (e.g. one database per company, plus some common tables that the company databases expose through views selecting from the common tables). In your case, if the tables are spread across all the databases, I don't think this is an option.
So, in my opinion, you could change the approach.
If you have backup problems, you could shard the huge tables (I am thinking of fact tables and tables with pictures) and create cross-database views. BTW, cross-database referential integrity is not supported in SQL Server, so you need to write triggers to check it.
If you need to separate different application functions (e.g. WMS, CRM and so on), you can use namespaces without worrying about how the tables are stored in the DB.

Data mining with postgres in production environment - is there a better way?

There is a web application which has been running for years, and during its lifetime it has gathered a lot of user data. The data is stored in a relational DB (Postgres). Not all of this data is needed to run the application (to do the business). However, from time to time business people ask me to provide reports on this data, and this causes some problems:
sometimes these SQL queries are long running
queries are executed against the production DB (not cool)
it is not easy to deliver reports on a weekly or monthly basis
some of the data is stored in a way which is not suitable for such querying (queries are inefficient)
My idea (note that I am a developer, not a data mining specialist) for improving this whole process of delivering reports is:
create a separate DB which is regularly updated with production data
optimize how the data is stored
create a dashboard to present reports
Question: is there a better way? Is there another DB which is a better fit for such data analysis? Or should I look into modern data mining tools?
Thanks!
Do you really do data mining (as in: classification, clustering, anomaly detection), or is "data mining" for you any reporting on the data? In the latter case, all the "modern data mining tools" will disappoint you, because they serve a different purpose.
Have you used the indexing functionality of Postgres well? Your scenario sounds as if selection and aggregation are most of the work, and SQL databases are excellent for this - if well designed.
For example, materialized views and triggers can be used to process data into a schema that is more usable for your reporting.
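As a rough illustration of the trigger half of that suggestion, the sketch below keeps a small per-day summary table up to date whenever a row is inserted into the transactional table, so reports never scan the raw rows. It uses Python's built-in sqlite3 module as a stand-in so that it runs anywhere; in Postgres the trigger would be written as a trigger function instead, and a materialized view refreshed on a schedule is the alternative mentioned above. The orders and daily_sales tables and their columns are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres database
conn.executescript("""
CREATE TABLE orders (
    id      INTEGER PRIMARY KEY,
    day     TEXT NOT NULL,      -- e.g. '2024-01-31'
    amount  REAL NOT NULL
);

-- Reporting-friendly table: one row per day instead of one row per order.
CREATE TABLE daily_sales (
    day          TEXT PRIMARY KEY,
    order_count  INTEGER NOT NULL DEFAULT 0,
    total        REAL NOT NULL DEFAULT 0
);

-- Keep the summary current on every insert into the transactional table.
CREATE TRIGGER orders_after_insert AFTER INSERT ON orders
BEGIN
    INSERT OR IGNORE INTO daily_sales (day) VALUES (NEW.day);
    UPDATE daily_sales
       SET order_count = order_count + 1,
           total = total + NEW.amount
     WHERE day = NEW.day;
END;
""")

conn.executemany(
    "INSERT INTO orders (day, amount) VALUES (?, ?)",
    [("2024-01-30", 10.0), ("2024-01-30", 5.5), ("2024-01-31", 20.0)],
)

# The report reads the small summary table, not the raw orders.
report = conn.execute(
    "SELECT day, order_count, total FROM daily_sales ORDER BY day"
).fetchall()
print(report)
```

The trade-off is a little extra work on every insert in exchange for cheap reads; updates and deletes on orders would need their own triggers, which the sketch leaves out.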
There are a thousand ways to approach this issue, but I think that the path of least resistance for you would be Postgres replication. Check out this Postgres replication tutorial for a quick proof of concept. (There are many hits when you Google for Postgres replication, and that link is just one of them.) Here is a link documenting streaming replication from the PostgreSQL site's wiki.
I am suggesting this because it meets all of your criteria and also stays within the bounds of the technology you're familiar with. The only learning curve would be the replication part.
Replication solves your issue because it would create a second database which would effectively become your "read-only" db which would be updated via the replication process. You would keep the schema the same but your indexing could be altered and reports/dashboards customized. This is the database you would query. Your main database would be your transactional database which serves the users and the replicated database would serve the stakeholders.
This is a wide topic, so please do your due diligence and research it. But it's also something that can work for you and can be turned around quickly.
If you really want to try data mining with PostgreSQL, there are some tools which can be used.
The very simple way is KNIME. It is easy to install and has full-featured data mining tools. You can access your data directly from the database, process it, and save it back to the database.
The hardcore way is MADlib. It installs data mining functions written in Python and C directly in Postgres, so you can mine with SQL queries.
Both projects are stable enough to try.
For reporting, we use a non-transactional (read-only) database. We don't care about normalization. If I were you, I would use another database for reporting. I would design the tables following OLAP principles (star schema, snowflake) and use an ETL tool to dump the data periodically (maybe weekly) to the read-only database, then start creating reports from it.
Reports are used for decision support, so they don't have to be in real time and usually don't have to be current. In other words, it is acceptable for a report to cover data only up to last week or last month.
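A periodic dump of that kind could look roughly like the following. This is only a sketch: it uses two in-memory SQLite databases in place of the production and reporting servers, does a full refresh rather than an incremental load, and the sales, dim_product and fact_sales tables are hypothetical names chosen for the example. A real ETL tool would add scheduling, error handling and incremental loading.

```python
import sqlite3


def load_reporting(source, reporting):
    """Full-refresh ETL: rebuild one dimension table and one fact table."""
    reporting.executescript("""
        DROP TABLE IF EXISTS fact_sales;
        DROP TABLE IF EXISTS dim_product;
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY,
            name        TEXT UNIQUE
        );
        CREATE TABLE fact_sales (
            product_key INTEGER REFERENCES dim_product(product_key),
            month       TEXT,
            units       INTEGER,
            revenue     REAL
        );
    """)
    # Extract and pre-aggregate from the transactional schema.
    rows = source.execute("""
        SELECT product, substr(sold_on, 1, 7), SUM(quantity), SUM(quantity * unit_price)
          FROM sales
         GROUP BY product, substr(sold_on, 1, 7)
    """).fetchall()
    # Load into the star schema: dimension row first, then the fact row.
    for product, month, units, revenue in rows:
        reporting.execute(
            "INSERT OR IGNORE INTO dim_product (name) VALUES (?)", (product,)
        )
        (key,) = reporting.execute(
            "SELECT product_key FROM dim_product WHERE name = ?", (product,)
        ).fetchone()
        reporting.execute(
            "INSERT INTO fact_sales VALUES (?, ?, ?, ?)", (key, month, units, revenue)
        )
    reporting.commit()


# Stand-in for the production database, with made-up rows.
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE sales (product TEXT, sold_on TEXT, quantity INTEGER, unit_price REAL);
    INSERT INTO sales VALUES ('widget', '2024-01-03', 2, 5.0);
    INSERT INTO sales VALUES ('widget', '2024-01-20', 1, 5.0);
    INSERT INTO sales VALUES ('gadget', '2024-02-01', 4, 2.5);
""")
reporting = sqlite3.connect(":memory:")  # stand-in for the read-only reporting DB
load_reporting(source, reporting)

monthly = reporting.execute("""
    SELECT p.name, f.month, f.units, f.revenue
      FROM fact_sales f JOIN dim_product p USING (product_key)
     ORDER BY p.name, f.month
""").fetchall()
print(monthly)
```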

Postgres Multi-tenant administration/maintenance

We have a SaaS application where each tenant has its own database in Postgres. How would I apply a patch to all the databases? For example, if I want to add a table or add a column to a table, I have to either write a program that loops through all the databases and executes the SQL against each of them, or go through them one by one in pgAdmin.
Is there a smarter and/or faster way?
Any help is greatly appreciated.
Yes, there's a smarter way.
Don't create a new database for each tenant. If everything is in one database then you only need to alter one database.
Pick one database, alter each table to have the column TENANT and add this to the primary key. Then insert into this database every record for all tenants and drop the other databases (obviously considerably more work than this as your application will need to be changed).
The differences with your approach are extensively discussed elsewhere:
What problems will I get creating a database per customer?
What are the advantages of using a single database for EACH client?
Multiple schemas versus enormous tables
Practicality of multiple databases per client vs one database
Multi-tenancy - single database vs multiple database
If you don't put everything in one database, then I'm afraid you have to alter them all individually, and doing it programmatically would be simplest.
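For what "programmatically" might look like, here is a minimal sketch of a patch loop. It assumes each database records which patches it has already received in a schema_version table, so the loop can be re-run safely and a failure in one tenant does not stop the others. SQLite in-memory databases stand in for the per-tenant Postgres databases, and the patch list, table names and tenant names are all hypothetical.

```python
import sqlite3

# Ordered, append-only list of (version, statement) patches (hypothetical).
PATCHES = [
    (1, "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE customers ADD COLUMN email TEXT"),
]


def current_version(conn):
    """Highest patch version recorded in this database (0 if none)."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def patch_database(conn):
    """Apply every patch newer than the recorded version; return versions applied."""
    done = current_version(conn)
    applied = []
    for version, statement in PATCHES:
        if version <= done:
            continue
        conn.execute(statement)
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
        conn.commit()
        applied.append(version)
    return applied


def patch_all(tenants):
    """tenants maps tenant name -> open connection; returns error text per failed tenant."""
    failures = {}
    for name, conn in tenants.items():
        try:
            patch_database(conn)
        except sqlite3.Error as exc:
            conn.rollback()
            failures[name] = str(exc)  # keep going with the remaining tenants
    return failures


tenants = {name: sqlite3.connect(":memory:") for name in ("acme", "globex")}
failures = patch_all(tenants)
print(failures)
```

Migration tools implement the same bookkeeping with more safeguards, so a hand-written loop like this is mainly useful for understanding what such tools do.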
At a higher level, all multi-tenant applications follow one of three approaches:
One tenant's data lives in one database,
One tenant's data lives in one schema, or
Add a tenant_id / account_id column to your tables (shared schema).
I usually find that developers use the following criteria when they evaluate these different approaches.
Isolation: Since you can put each tenant into its own database on the one hand, and have tenants share the same tables on the other, this is the most apparent dimension. If you give your users raw SQL access, or you're in a regulated industry such as healthcare, you may need strict guarantees from your database. That said, PostgreSQL 9.5 comes with row-level security policies that make this less of a concern for most applications.
Extensibility: If your tenants share the same schema (approach #3) and have fields that vary between them, then you need to think about how to merge these fields.
This article on multi-tenant databases has a great summary of different approaches. For example, you can add a dozen columns, call them C1, C2, and so forth, and have your application infer the actual data in each column based on the tenant_id. PostgreSQL 9.4 comes with JSONB support and natively allows you to use semi-structured fields to express variations between different tenants' data.
Scaling: Another criterion is how easily your database would scale out. If you create a database or schema per tenant (#1 or #2 above), your application can make use of existing Ruby Gems or Django packages to simplify app integration. That said, you'll need to manually manage your tenants' data and the machines they live on. Similarly, you'll need to build your own sharding logic to propagate foreign key constraints and ALTER TABLE commands.
With approach #3, you can use existing open source scaling solutions, such as Citus. For example, this blog post describes how to easily shard a multi-tenant app with Postgres.
It's time for me to give back to the community :) After 4 years, our multi-tenant platform is in production, and I would like to share the following observations/experiences with all of you.
We used a database per tenant. This has given us extreme flexibility: the individual database backups are not huge, so we can easily import them into our staging environment to investigate customer issues.
We use Liquibase for database development and upgrades. This has been a tremendous help to us, allowing us to package the entire build into a simple war file. All changes are easily versioned and managed very efficiently. There is a bit of a learning curve here and there, but nothing substantial. Investing 2-5 days can save you significant time.
Given that we use Spring/JPA/Hibernate, we use a technique called Dynamic Data Source Routing. When a user logs in, we find the related datasource with a lookup and connect their session to the right database. That's also when the Liquibase scripts get applied for updates.
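The routing idea described above can be sketched independently of Spring. The class below is not Spring's or Hibernate's implementation; it is a small Python illustration, with made-up names, of the same three steps: look up the tenant's datasource, run pending upgrades the first time that tenant connects, and hand back a connection to the right database. SQLite in-memory databases stand in for the per-tenant databases, and the upgrade callback stands in for the Liquibase run.

```python
import sqlite3
from typing import Callable, Dict


class TenantRouter:
    """Hands out one connection per tenant, upgrading each database on first use."""

    def __init__(self, datasources: Dict[str, str],
                 upgrade: Callable[[sqlite3.Connection], None]):
        self._datasources = datasources   # tenant id -> database location
        self._upgrade = upgrade           # stand-in for the migration step
        self._connections: Dict[str, sqlite3.Connection] = {}

    def connection_for(self, tenant_id: str) -> sqlite3.Connection:
        if tenant_id not in self._datasources:
            raise KeyError(f"unknown tenant: {tenant_id}")
        conn = self._connections.get(tenant_id)
        if conn is None:
            # First login for this tenant: connect, then apply pending updates.
            conn = sqlite3.connect(self._datasources[tenant_id])
            self._upgrade(conn)
            self._connections[tenant_id] = conn
        return conn


def upgrade(conn):
    # Hypothetical upgrade script; must be safe to run more than once.
    conn.execute("CREATE TABLE IF NOT EXISTS projects (name TEXT)")


# Each ":memory:" connection is a separate, private database.
router = TenantRouter({"acme": ":memory:", "globex": ":memory:"}, upgrade)
router.connection_for("acme").execute("INSERT INTO projects VALUES ('rocket skates')")

acme_count = router.connection_for("acme").execute(
    "SELECT COUNT(*) FROM projects").fetchone()[0]
globex_count = router.connection_for("globex").execute(
    "SELECT COUNT(*) FROM projects").fetchone()[0]
print(acme_count, globex_count)
```

A production version would use a connection pool per tenant rather than a single cached connection, and would need to be thread-safe.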
That is it for now; I will come back with more later on.
Well, in our case there would certainly be problems with one database for all tenants.
The backup file gets huge and becomes impractical to manage.
For troubleshooting, we need to restore a customer's data in our dev environment. We just use that customer's backup file, which is usually much smaller than it would be if we used one database for all customers.
Again, Liquibase has been key in allowing us to manage updates across all the tenants seamlessly and without any issues. Without Liquibase, I can see lots of complications with this approach. So Liquibase, Liquibase and more Liquibase.
I also suspect that we would need more powerful hardware to manage a huge database with large joins across millions of records, compared with much lighter databases with much smaller queries.
In case of problems, the service doesn't go down for everyone; the impact is limited to one or a few tenants.
In general, for our purposes, this has been a great architectural decision and we are benefiting from it every day. One time we had one customer that didn't have their archiving active and their database size grew to over 3 GB. With offshore teams and slower internet as well as storage/bandwidth prices, one can see how things may become complicated very quickly.
Hope this helps someone.
--Rex

Is Azure table or NoSQL in general not so good when updating data

I have only looked at Azure Table, but this may well apply to other NoSQL databases too.
Suppose I have an entity consisting of the following properties:
First name - Last name - Hometown - Country
In Azure Table there is no concept of relations. So if I have thousands of entities and I want to change every entity that has 'Canada' as its country to some other country, it may have to go through all of those thousands of entities to find the ones with 'Canada' and change them.
I wonder, is NoSQL only beneficial if your data is static and not changed after you have written it? Or can this problem be solved in NoSQL databases?
In the case of NoSQL data stores the advantages are different from SQL databases. Things like scalability or availability can be better in a NoSQL database like Azure table, but there are tradeoffs. For example you are generally unable to efficiently query any part of a record, only the keys.
In designing your schema for Azure Table you have to consider the use cases of your data layer and let that dictate the schema. In this example, if I thought I would have to update all records in a given country, I would make that part of the partition or row key. That way your query to get all data in a given country is fast and can be updated quickly.
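A toy in-memory model can show why the key choice matters. This is not the Azure Table API; it is a plain Python dictionary of partitions, with invented class, method and key names, that mimics addressing entities by a (partition key, row key) pair. With country as the partition key, finding everyone in 'Canada' touches one partition instead of scanning every entity.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy key-value table: entities addressed by (partition_key, row_key)."""

    def __init__(self):
        self._partitions = defaultdict(dict)

    def upsert(self, partition_key, row_key, entity):
        self._partitions[partition_key][row_key] = dict(entity)

    def query_partition(self, partition_key):
        # Touches only one partition, not the whole table.
        return dict(self._partitions.get(partition_key, {}))

    def move_partition(self, old_key, new_key):
        """Re-key every entity stored under old_key; returns how many moved."""
        rows = self._partitions.pop(old_key, {})
        self._partitions[new_key].update(rows)
        return len(rows)


table = PartitionedTable()
# Country is the partition key; "last_first" is a hypothetical row key scheme.
table.upsert("Canada", "smith_john", {"hometown": "Toronto"})
table.upsert("Canada", "roy_anne", {"hometown": "Montreal"})
table.upsert("France", "martin_luc", {"hometown": "Lyon"})

canadians = table.query_partition("Canada")
moved = table.move_partition("Canada", "CA")
print(sorted(canadians), moved)
```

In this toy model the move is a dictionary operation. In a real key-value store, changing a value that is part of the key generally means writing each entity under the new key and deleting the old one, so the lookup becomes cheap but the rewrite still costs one operation per entity.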

MongoDB - Single Database or Multiple Databases for SaaS Offering

We have decided to use MongoDB for a SaaS offering we are creating. Each company that signs up gets their own url (mycompany.domain.com) and their own private set of users, projects, etc... Since we are using a NoSQL solution, and wouldn't have to manage pushing out schema updates to every database like we would with MySQL, I am wondering if it would be better to have one huge database containing all the data, or to have one database per client.
Since MongoDB can shard the database across multiple servers, I'm thinking there wouldn't be a huge performance hit if we had a giant database, but I also think backups and exporting data would be much easier if there was one database per client. Any thoughts?
Go with one but make sure to take advantage of some sort of replication for backup purposes!
Look into sharding or look into replica-sets.