Implementing a multi-tenant data structure using multiple schemas or a customerId table column - postgresql

I am developing a multi-tenant store web application (software as a service) that will be used by many customers. I would like to use just one database. I would appreciate suggestions/feedback on how to go about this in the database:
1. Separate schemas for each customer. Whenever a new customer signs up, I create a separate schema.
2. A single schema for all customers, with a CUSTOMER table whose customerId is referenced in all other tables (e.g. orders, payments, etc.). Whenever a new customer signs up, I create an entry in the CUSTOMER table.
In case you want to know what technologies are being used:
Postgres, Spring Boot MVC, REST, Maven, JPA.
Thanks.

There are major tradeoffs here. With customer IDs, your foreign keys become more complex (the customer ID should probably be part of every foreign key), and that means additional indexes. It also means you need some way of enforcing this restriction. The big issue is that bugs in your application can quite easily disclose data belonging to other customers.
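As a rough illustration, here is a minimal sketch (plain JDBC, hypothetical table names) of what the shared-schema layout looks like when the customer ID is carried through every key:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Minimal sketch: with the shared-schema approach, customer_id is part of every
// key, so a child row cannot reference another customer's parent row.
public class SharedSchemaDdl {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/store", "app", "secret");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE customer (customer_id bigint PRIMARY KEY, name text NOT NULL)");
            st.execute("CREATE TABLE orders (" +
                       "  customer_id bigint NOT NULL REFERENCES customer (customer_id)," +
                       "  order_id    bigint NOT NULL," +
                       "  total       numeric NOT NULL," +
                       "  PRIMARY KEY (customer_id, order_id))");
            st.execute("CREATE TABLE payment (" +
                       "  customer_id bigint NOT NULL," +
                       "  order_id    bigint NOT NULL," +
                       "  payment_id  bigint NOT NULL," +
                       "  amount      numeric NOT NULL," +
                       "  PRIMARY KEY (customer_id, order_id, payment_id)," +
                       // the composite FK ensures a payment can only reference an
                       // order that belongs to the same customer
                       "  FOREIGN KEY (customer_id, order_id) REFERENCES orders (customer_id, order_id))");
        }
    }
}
```

Note the cost mentioned above: every foreign key and primary key grows by one column, and the matching indexes grow with them.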
With multiple schemas you have the issue that you end up with many more tables, and this can cause performance problems, for pg_dump in particular. However, with appropriate search paths it is a bit harder to compromise other clients' data. On the other hand, this approach is harder to use with a connection pool.
In general I think the schema approach is better, because you can always scale out by partitioning by customer set, and the better security is important. However, it means you must have a good understanding of search_path and set it to a sensible value in every database transaction.
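A minimal sketch of what that could look like from a Spring Boot/JPA application using plain JDBC, assuming one schema per tenant named like tenant_<id> (the naming convention is an assumption):

```java
import java.sql.Connection;
import java.sql.Statement;

// Minimal sketch: point search_path at the tenant's schema at the start of each
// transaction. SET LOCAL only works inside a transaction (autocommit off), and
// it resets at commit/rollback, so a pooled connection does not leak the tenant
// to the next borrower.
public final class TenantSearchPath {
    public static void apply(Connection conn, String tenantSchema) throws Exception {
        // search_path cannot be bound as a parameter, so validate the schema name
        if (!tenantSchema.matches("[a-z0-9_]+")) {
            throw new IllegalArgumentException("bad schema name: " + tenantSchema);
        }
        try (Statement st = conn.createStatement()) {
            st.execute("SET LOCAL search_path TO " + tenantSchema + ", public");
        }
    }
}
```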

Related

How to have a sequence status in NoSQL without transaction

This question is only for understanding purposes. This might be a noob question.
Assume that I have a tabular or document NoSQL database which does not support transactions, and an orders table/collection which has a status column. Initially the status will be DRAFTED. My objective is to create an invoice, store it in another table, invoices, and update the status in my orders table to INVOICE_CREATED. Assume that I cannot store the invoice in my orders table itself (denormalized) for some reason (large size could be one).
How do I handle this logic without a transaction in a NoSQL? Do I need to model my tables in some other way?
Let's assume that:
Invoice creation does not have any third party dependency. It is done inside the system itself.
I cannot use an SQL database
I should not use the transaction support provided by DBs like Mongo
NOTE: These might be a lot of assumptions and might not be real-world use cases. But I am just trying to understand HOW we should model NoSQL databases.

Relational DB in microservices

I have a monolithic application that currently uses a PostgreSQL DB and the schemas are set up as you would expect for most relational databases with various table data being linked back to the user via FKs on the user_id.
I'm trying to learn more about microservices and am trying to migrate my Python API to a microservice architecture. I have a reasonable understanding of how I'm going to break up the larger app into smaller parts; however, I'm not entirely clear on how I'm supposed to deal with the data side of things.
I understand that one single large DB is against general design principles of microservices but I'm not clear on what the alternative would be.
My biggest concern is cascading across the individual databases that would hold microservice data. In a simple relational DB, I can just cascade on delete and the DB will handle the work across the various tables. In the case of microservices, how would that work? Would I need to have a separate service that handles deleting user data across the other services' DBs?
I don't really understand how I would migrate a traditional application with a relational DB to a microservice architecture.
EDIT:
To clarify - a specific architectural/design problem I'm facing is as follows:
I have split up my application into a few microservices. The ones that, in my mind, are still relational are:
Geolocation - A service that checks geometry data, records it in PostGIS, and returns certain information. A primary purpose is to record the location of a particular user for later reference.
Image - A simple upload service to upload images and store their metadata in the DB.
Load-Image - A simple service that returns a random set of images based on parameters such as location, and user profile data such as Age, Gender, etc.
Profile - A service that simply manages user data such as Age, Gender, etc
Normally, these three items would each have a table in one larger DB rather than their own individual DBs. Filtering images by, say, location and age is a very simple JOIN and filter.
How would something like that work in a microservice architecture? If the data is held in entirely different DBs, how would I set up the logic to filter the data? I could duplicate data that doesn't change often, like profile info, and add it to a MongoDB document that would contain image data including user_id and profile data - however, location data can change regularly, and constant updates don't sound practical.
What would be the best approach? Or should I stick with a shared RDBMS for just those few services?
It comes down to the duplication of data, why we want it, and how we manage it.
Early in our careers we were taught about the duplication of data to make it redundant, for example in database replication or backups. We were also taught that data can be modelled in a relational manner, with constraints enforcing the integrity of the model. In fact, the integrity of the model is sacrosanct. Without integrity, how can you have consistency? The answer is that you can't. Kinda.
When you work with distributed systems and service orientation, you do so because you want to minimise interactions, thereby reducing coupling between components. However, there is a cost to this. The more distributed your architecture, the less coupling it has, and the more duplication of data will be necessary. This is taken to an extreme with microservices, where effectively the same data may be present in many different places, in varying degrees of consistency.
Instead of being bad, however, in this context data duplication is an essential feature of your system. It is an enabler of an architectural style with many great benefits. Put another way, without duplication of data, you get less distribution, you get more coupling, which makes your system more expensive to build, own, and change.
So, now we understand duplication of data and why we want it, let's move onto how we manage having lots of duplication. Let's try an example:
In a relational database, let's say we have a table called Customers, which contains a customer ID and customer details, and another table called Orders, which contains the order ID, customer ID, and the order details. Let's say we also have an ordering application which needs to delete all of a customer's orders when the customer is deleted, for GDPR.
Because we are migrating our system to microservices, we decide to create a service called Customers.
So we create a service with the following operation:
DELETE /customers/{customerId} - deletes a customer
We create another service called Orders with the following operations:
GET /orders/customers/{customerId} - gets all the orders for a customer
DELETE /orders/{orderId} - deletes an order
We build a UX screen for deleting a customer. The UX first calls the orders service to get all the orders for the customer. Then it iterates over the list of orders, calling the orders service to delete each order. Then it calls the customers service to delete the customer.
This example is very simplistic, but as you can see, there is no option but to orchestrate the "Delete Customer" operation from the caller, which in this case is the user interface. Of course, what would be a single atomic transaction in a database does not translate to multiple HTTP(S) calls, so it is possible that some of the calls may not succeed, leaving the system as a whole in an inconsistent state. In this instance the inconsistency would need to be resolved via some recovery mechanism.
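A minimal sketch of that orchestration using Java's built-in HttpClient, assuming hypothetical service base URLs and leaving the JSON parsing out:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch of the "delete customer" orchestration described above.
// Each HTTP call can fail independently, which is exactly why a
// recovery/compensation mechanism is needed.
public class DeleteCustomerOrchestrator {
    private static final HttpClient http = HttpClient.newHttpClient();

    public static void deleteCustomer(long customerId) throws Exception {
        // 1. fetch the customer's orders from the orders service
        HttpResponse<String> orders = http.send(
            HttpRequest.newBuilder(URI.create("http://orders-svc/orders/customers/" + customerId))
                       .GET().build(),
            HttpResponse.BodyHandlers.ofString());

        // 2. delete each order (order IDs parsed from the JSON body, omitted here)
        for (long orderId : parseOrderIds(orders.body())) {
            http.send(HttpRequest.newBuilder(URI.create("http://orders-svc/orders/" + orderId))
                      .DELETE().build(), HttpResponse.BodyHandlers.discarding());
        }

        // 3. finally delete the customer itself; if step 2 or 3 fails part-way,
        // the system is inconsistent until some compensating/retry mechanism runs
        http.send(HttpRequest.newBuilder(URI.create("http://customers-svc/customers/" + customerId))
                  .DELETE().build(), HttpResponse.BodyHandlers.discarding());
    }

    private static long[] parseOrderIds(String json) { /* JSON parsing omitted */ return new long[0]; }
}
```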
In a microservice architecture, we have both options: a database per service or a shared database. There are advantages and disadvantages to each pattern. Database per service is the best practice, but when the monolithic application has lots of functions, procedures, or database-specific features at the database level, we can use the shared-database approach. I know this is not the best practice; if you have the time and bandwidth, you should go for database per service.
As your concern is cascading across individual databases, you need to remove cascading from the database, implement the transaction handling in your application, and execute all cascade-related queries from that transaction.
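As a rough sketch, this is what application-managed cascading inside one explicit transaction could look like in the shared-database case (hypothetical table names); across separate per-service databases you would instead need something like a saga or compensating actions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Minimal sketch: the ON DELETE CASCADE work moves out of the schema and into
// application code, so every dependent delete runs inside the same transaction
// and either all succeed or all roll back.
public class DeleteUserCascade {
    public static void deleteUser(long userId) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/app", "app", "secret")) {
            conn.setAutoCommit(false);
            try {
                execute(conn, "DELETE FROM orders   WHERE user_id = ?", userId);
                execute(conn, "DELETE FROM images   WHERE user_id = ?", userId);
                execute(conn, "DELETE FROM profiles WHERE user_id = ?", userId);
                execute(conn, "DELETE FROM users    WHERE id = ?", userId);
                conn.commit();
            } catch (Exception e) {
                conn.rollback();   // nothing is half-deleted if any statement fails
                throw e;
            }
        }
    }

    private static void execute(Connection conn, String sql, long id) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, id);
            ps.executeUpdate();
        }
    }
}
```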

Postgres inherit from schema

I am starting a new project which will be available as SaaS for multiple customers. I am thinking of creating one database and then creating an individual schema for every customer.
I have defined some rules, and the first rule is that all the customers must always have the same schema. No matter what. If one customer gets an update, all the other customers get the update as well.
For this purpose, my question is: is it possible to inherit a schema from another schema in the same database? If not, do I have to manually create all the tables and indexes in the new schema and inherit them from the tables in the master schema?
I am using Postgresql 9.6 but I can upgrade it as well if needed.
I am open to suggestions.
Thanks in advance
There is no automated way to establish inheritance between all tables in two schemas; you'd have to do it one by one (a function can help).
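For illustration, a minimal sketch of that one-by-one approach driven from JDBC, assuming a hypothetical master schema as the template; note that indexes and primary keys are not inherited and would still need to be created per child table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: list the tables in the master schema from the catalog and
// create an inheriting child table for each one in the new tenant schema.
public class InheritSchema {
    public static void main(String[] args) throws Exception {
        String master = "master";          // existing template schema (hypothetical name)
        String tenant = "tenant_acme";     // new customer schema (hypothetical name)
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/saas", "admin", "secret")) {
            List<String> tables = new ArrayList<>();
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT tablename FROM pg_tables WHERE schemaname = '" + master + "'")) {
                while (rs.next()) tables.add(rs.getString(1));
            }
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE SCHEMA " + tenant);
                for (String t : tables) {
                    // inherits columns and NOT NULL constraints from the master table;
                    // indexes and primary keys must still be added per table
                    st.execute("CREATE TABLE " + tenant + "." + t +
                               " () INHERITS (" + master + "." + t + ")");
                }
            }
        }
    }
}
```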
However, I invite you to stop and think about your data model for a bit. How many users do you expect? If there could be many, plan differently, because databases with thousands of schemas become unwieldy (e.g. catalog lookups will become slow).
You might be better off with one schema for all users. If you are concerned with separation of the data and security, row level security might be the solution for you.
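A minimal sketch of what row level security can look like in a single shared schema, assuming a hypothetical orders table with a customer_id column and a session setting named app.current_tenant:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Minimal sketch: every table carries a customer_id column and a policy
// restricts visible rows to the tenant named in a session setting. Note that
// table owners and superusers bypass RLS unless FORCE ROW LEVEL SECURITY is set.
public class RowLevelSecuritySetup {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/saas", "admin", "secret");
             Statement st = conn.createStatement()) {
            st.execute("ALTER TABLE orders ENABLE ROW LEVEL SECURITY");
            // the application sets app.current_tenant once per transaction, e.g.
            //   SET LOCAL app.current_tenant = '42';
            st.execute("CREATE POLICY tenant_isolation ON orders " +
                       "USING (customer_id = current_setting('app.current_tenant')::bigint)");
        }
    }
}
```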

Postgres Multi-tenant administration/maintenance

We have a SaaS application where each tenant has its own database in Postgres. How would I apply a patch to all the databases? For example, if I want to add a table or add a column to a table, I have to either write a program that loops through all the databases and executes SQL against them, or go through them one by one using pgAdmin.
Is there a smarter and/or faster way?
Any help is greatly appreciated.
Yes, there's a smarter way.
Don't create a new database for each tenant. If everything is in one database then you only need to alter one database.
Pick one database, alter each table to have a TENANT column, and add it to the primary key. Then insert into this database every record for all tenants and drop the other databases (obviously this is considerably more work than it sounds, as your application will need to be changed).
The differences with your approach are extensively discussed elsewhere:
What problems will I get creating a database per customer?
What are the advantages of using a single database for EACH client?
Multiple schemas versus enormous tables
Practicality of multiple databases per client vs one database
Multi-tenancy - single database vs multiple database
If you don't put everything in one database then I'm afraid you have to alter them all individually, and doing it programmatically would be simplest.
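For example, a minimal programmatic sketch in Java/JDBC, assuming the list of tenant database URLs comes from your own configuration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;

// Minimal sketch: apply the same DDL patch to each tenant database in turn.
public class ApplyPatchToAllTenants {
    public static void main(String[] args) throws Exception {
        List<String> tenantUrls = List.of(
            "jdbc:postgresql://localhost:5432/tenant_a",
            "jdbc:postgresql://localhost:5432/tenant_b");   // normally loaded from config

        String patch = "ALTER TABLE orders ADD COLUMN IF NOT EXISTS notes text";

        for (String url : tenantUrls) {
            try (Connection conn = DriverManager.getConnection(url, "admin", "secret");
                 Statement st = conn.createStatement()) {
                st.execute(patch);
                System.out.println("patched " + url);
            }
        }
    }
}
```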
At a higher level, all multi-tenant applications follow one of three approaches:
One tenant's data lives in one database,
One tenant's data lives in one schema, or
Add a tenant_id / account_id column to your tables (shared schema).
I usually find that developers use the following criteria when evaluating these different approaches.
Isolation: Since you can put each tenant into its own database on one hand, or have tenants share the same tables on the other, this is the most apparent dimension. If you provide your users raw SQL access, or you're in a regulated industry such as healthcare, you may need strict guarantees from your database. That said, PostgreSQL 9.5 comes with row-level security policies that make this less of a concern for most applications.
Extensibility: If your tenants are sharing the same schema (approach #3), and your tenants have fields that vary between them, then you need to think about how to merge these fields.
This article on multi-tenant databases has a great summary of the different approaches. For example, you can add a dozen columns, call them C1, C2, and so forth, and have your application infer the actual data in each column based on the tenant_id. PostgreSQL 9.4 comes with JSONB support and natively allows you to use semi-structured fields to express variations between different tenants' data (a minimal sketch of this appears at the end of this answer).
Scaling: Another criterion is how easily your database would scale out. If you create a tenant per database or schema (#1 or #2 above), your application can make use of existing Ruby gems or Django packages to simplify app integration. That said, you'll need to manually manage your tenants' data and the machines they live on. Similarly, you'll need to build your own sharding logic to propagate foreign key constraints and ALTER TABLE commands.
With approach #3, you can use existing open source scaling solutions, such as Citus. For example, this blog post describes how to easily shard a multi-tenant app with Postgres.
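Regarding the extensibility point above, here is a minimal sketch of the JSONB option, assuming a hypothetical custom_fields column on an orders table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: each tenant stores its own extra attributes in a single
// semi-structured JSONB column instead of C1..Cn placeholder columns.
public class JsonbCustomFields {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/saas", "app", "secret")) {
            try (Statement st = conn.createStatement()) {
                st.execute("ALTER TABLE orders ADD COLUMN IF NOT EXISTS custom_fields jsonb DEFAULT '{}'");
            }
            // query a tenant-specific field without it existing as a real column
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT order_id FROM orders " +
                    "WHERE customer_id = ? AND custom_fields ->> 'priority' = ?")) {
                ps.setLong(1, 42L);
                ps.setString(2, "high");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) System.out.println(rs.getLong(1));
                }
            }
        }
    }
}
```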
It's time for me to give back to the community :) So after 4 years, our multi-tenant platform is in production and I would like to share the following observations/experiences with all of you.
We used a database per tenant. This has given us extreme flexibility, as the database backups are not huge and hence we can easily import them into our staging environment to investigate customer issues.
We use Liquibase for database development and upgrades. This has been a tremendous help to us, allowing us to package the entire build into a simple war file. All changes are easily versioned and managed very efficiently. There is a bit of a learning curve here and there, but nothing substantial; investing 2-5 days in it can save you significant time.
Given that we use Spring/JPA/Hibernate, we use a technique called dynamic data source routing: when a user logs in, we look up the related data source and connect their session to the right database. That's also when the Liquibase scripts get applied for updates.
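A minimal sketch of that routing with Spring's AbstractRoutingDataSource (part of spring-jdbc); the TenantContext holder here is a hypothetical class populated by an authentication filter at login:

```java
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;

// Minimal sketch: Spring picks the tenant's DataSource per request based on the
// key returned here. The keys must match the map registered at startup via
// setTargetDataSources(Map<Object, Object>).
public class TenantRoutingDataSource extends AbstractRoutingDataSource {
    @Override
    protected Object determineCurrentLookupKey() {
        return TenantContext.getCurrentTenantId();
    }
}

// hypothetical holder, set by an authentication filter when the user logs in
class TenantContext {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();
    static void setCurrentTenantId(String tenantId) { CURRENT.set(tenantId); }
    static String getCurrentTenantId() { return CURRENT.get(); }
    static void clear() { CURRENT.remove(); }
}
```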
That's it for now; I will come back with more later on.
Well, there would certainly be problems with one database for all tenants in our case:
The backup file gets huge and becomes almost impractical to manage.
For troubleshooting, when we need to restore a customer's data in our dev environment, we just use that customer's backup file, and that file is nowhere near as big as it would be if we used one database for all customers.
Again, Liquibase has been key in allowing us to manage updates across all the tenants seamlessly and without any issues. Without Liquibase, I can see lots of complications with this approach. So: Liquibase, Liquibase and more Liquibase.
I also suspect that we would need more powerful hardware to manage one huge database with large joins across millions of records, versus much lighter databases with much smaller queries.
In case of problems, the service doesn't go down for everyone; the impact is limited to one or a few tenants.
In general, for our purposes, this has been a great architectural decision and we are benefiting from it every day. One time we had a customer that didn't have their archiving active, and their database size grew to over 3 GB. With offshore teams and slower internet, as well as storage/bandwidth prices, one can see how things may become complicated very quickly.
Hope this helps someone.
--Rex

Foreign Key mapping in Core Data

I understand that Core Data is not a relational database, but I need to understand how it can be used to support a client/server model where the server uses a Rails, ActiveRecord, MySQL setup.
My app is pulling records from the server using JSON and I am mapping the relationships using Core Data.
The foreign key in the SQLite database is showing the PK field of the related table even though I have set the User Info key/value primaryAttributeKey => id. (I can't remember where I saw this mentioned.)
Is there any way to set up the models so they will use my id as the PK, so that it will clean up the export of related data back to the server?
Edward,
The PK is just a field in your object. If you want to maintain PKs in Core Data, they are just numbers. As you build your object graph, you have to maintain them in parallel with your relationships. Of course, exporting records created on the device back to your server will be difficult -- FKs and PKs are unique to each table, and that uniqueness is determined on the server. Hence, tracking these numbers is not that useful.
May I suggest that your JSON needs to be structured such that it is redundant -- that it has both the data and the various PKs and FKs, if any?
Finally, you appear to be building a CRUD-focused API. Generally, those are low-performance APIs for remote devices. There are other problems with CRUD APIs, such as inconsistent business logic between servers and clients. I would suggest you rethink your APIs.
Andrew