Feedback about my database design (multi tenancy) - postgresql

The idea of the SaaS tool is to have dynamic tables with dynamic custom fields and values of different types, we were thinking to use "force.com/salesforce.com" example but is seems to be too complicated to maintain moving forward, also making some reports to create with a huge abstraction level, so we came up with simple idea but we have to be sure that this is kinda good approach.
This is the architecture we have today (in few steps).
Each tenant has it own separate database on the cluster (Postgres 12).
TABLE table, used to keep all of those tables as reference, this entity has ManyToOne relation to META table and OneToMany relation with DATA table.
META table is used for metadata configuration, has OneToMany relation with FIELDS (which has name of the fields as well as the type of field e.g. TEXT/INTEGER/BOOLEAN/DATETIME etc. and attribute value - as string, only as reference).
DATA table has ManyToOne relation to TABLES and 50 character varying columns with names like: attribute1...50 which are NULL-able.
Example flow today:
When user wants to open a TABLE DATA e.g. "CARS", we load the META table with all the FIELDS (to get fields for this query). User specified that he want to query against: Brand, Class, Year, Price columns.
We are checking by the logic, the reference for Brand, Class, Year and Price in META>FIELDS table, so we know that Brand = attribute2, Class = attribute 5, Year = attribute6 and Price = attribute7.
We parse his request into a query e.g.: SELECT [attr...2,5,6,7] FROM DATA and then show the results to user, if user decide to do some filters on it, based on this data e.g. Year > 2017 AND Class = 'A' we use CAST() functionality of SQL for example SELECT CAST(attribute6 AS int) AND attribute5 FROM DATA WHERE CAST(attribute6 AS int) > 2017 AND attribute5 = 'A';, so then we can actually support most principles of SQL.
However moving forward we are scared a bit:
Manage such a environment for more tenants while we are going to have more tables (e.g. 50 per customer, with roughly 1-5 mil per TABLE (5mil is maximum which we allow, for bigger data we have BigQuery) which is giving us 50-250 mil rows in single table DATA_X) which might affect performance of the queries, especially when we gave possibilities to manage simple WHERE statements (less,equal,null etc.) using some abstraction language e.g. GET CARS [BRAND,CLASS,PRICE...] FILTER [EQ(CLASS,A),MT(YEAR,2017)] developed to be similar to JQL (Jira Query Language).
Transactions lock, as we allow to batch upload CSV into the DATA_X so once they want to load e.g. 1GB of the data, it kinda locks the table for other systems to access the DATA table.
Keeping multiple NULL columns which can affect space a bit (for now we are not that scared as while TABLE creation, customer can decide how many columns he wants, so based on that we are assigning this TABLE to one of hardcoded entities DATA_5, DATA_10, DATA_15, DATA_20, DATA_30, DATA_50, where numbers corresponds to limitations of the attribute columns, and those entities are different, we also support migration option if they decide to switch from 5 to 10 attributes etc.
We are on super early stage, so we can/should make those before we scale, as we knew that this is most likely not the best approach, but we kept it to run the project for small customers which for now is working just fine.
We were thinking also about JSONB objects but that is not the option, as we want to keep it simple for getting the data.
What do you think about this solution (fyi DATA has PRIMARY key out of 2 tables - (ID,TABLEID) and built in column CreatedAt which is used form most of the queries, so there will be maximum 3 indexes)?
If it seem bad, what would you recommend as the alternative to this solution based on the details which I shared (basically schema-less RDBMS)?

IMHO, I anticipate issues when you wanted to join tables and also using cast etc.
We had followed the approach below that will be of help to you
We have a table called as Cars and also have a couple of tables like CarsMeta, CarsExtension columns. The underlying Cars table will have all the common fields for a ll tenant's. Also, we will have the CarsMeta table point out what are the types of columns that you can have for extending the Cars entity. In the CarsExtension table, you will have columns like StringCol1...5, IntCol1....5, LongCol1...10
In this way, you can easily filter for data also like,
If you have a filter on the base table, perform the search, if results are found, match the ids to the CarsExtension table to get the list of exentended rows for this entity
In case the filter is on the extended fields, do a search on the extension table and match with that of the base entity ids.
As we will have the extension table organized like below
id - UniqueId
entityid - uniqueid (points to the primary key of the entity)
StringCol1 - string,
...
IntCol1 - int,
...
In this case, it will be easy to do a join for entity and then get the data along with the extension fields.
In case you are having the table metadata and data being inferred from separate tables, it will be a difficult task to maintain this over long period of time and also huge volume of data.
HTH

Related

DynamoDB model that supports queries on any given attribute

The application we're designing has a function where users can dynamically add new elements to an entity that then need to be efficiently searched. The number of these elements is essentially unlimited. Our team has been looking at DynamoDB as a data store option, and we've been wrestling with the key/value model and how to get this dynamic data under an index for efficient querying.
I think I have a single-table solution that handles the problem elegantly and also allows for querying on any given attribute in the data store, but am disturbed that I can't find an example of it anywhere else. Hopefully it's not fundamentally flawed in some way - I would appreciate any critique!
The model is essentially the Entity-Attribute-Value approach used for adding dynamic or sparse data to RDBMs. So instead of storing different entities/objects in a DynamoDB table like so:
PK SK SK-1 SK-2 SK-3 SK-N... PK SK SK-1 SK-N...
Key Key Key Key --> Name Money
Entity Id Value Value Value Value Person 22 Fred 30000
... which lets me query things like "all persons where name = Fred" but where you would eventually run out of sort key indexes and you would need to know which index goes with which key before you query, the data could be stored in EAV format like so:
PK SK & GSI-PK GSI-SK PK SK & GSI-PK GSI-SK
Id Entity#Key Value 22 Person#Name Fred
Id Entity#Key Value --> 22 Person#Money 30000
Id Entity#Key Value 22 Person#Sex M
Id Entity#Key Value 22 Person#DOB 09/00
Now, with one global secondary index (GSI-1 PK over Entity.Key and GSI-1 SK over Value) I can do a range search on any value for any key and get a list of Ids that match. Users can add their attributes or even entirely new entities and have them persisted in a way that's instantly indexed without us having to revamp the DynamoDB schema.
The one major downside to this approach that I can think of is that data returned from a query on an Entity#Key-Value only contains values for that key and the entity Id, not the entire entity. That's fine for charts and graphs but a problem if you want to get a grid-type result with one query. I also worry about hot partition keys on the index, but hopefully we could solve that with intelligent write sharding.
That's pretty much it. With a few tweaks the model can be extended to support the logging of all changes on each key and allow some nice time series queries against those changes, but my question is if anyone has found it useful to take an EAV type approach to a KV store like DynamoDB, or if there's another way to handle querying a dynamic schema?
You can have pk as the id of the entity. And then a sort key of {attributeName}. You may still want to have the base entity with fields like createdAt, etc.
So you might have:
PK SORT Attributes:
#Entity#22 #Entity#Details createdAt=2020
#Entity#22 #Attribute#name key=name value=Fred
#Entity#22 #Attribute#money key=money value=30000
To get all the attributes of an entity you simply do a query with no filter of pk={id}. You cannot dynamically sort by every given attribute, this is exactly what DynamoDB is not good at, I repeat! That case is exactly what NOSQL performs poorly at.
What you can do is use streaming to do aggregation. So you can for instance store the top 10 wealthiest people:
PK SORT Attributes:
#Money#Highest #1 id=#Entity#22 value=30000
#Money#Highest #2 id=#Entity#52 value=30000
Which you would calculate in a DynamoDB Streams. But you couldn't dynamically index values, DynamoDB works by effectively copying data from one form to another so that it can be efficiently retrieved. So you would be copying your entire database for each new attribute you wanted to search by, or otherwise you would have to use Scans and that wouldn't make any sense to do because you would get no benefit to using DynamoDB if all you ever did was do Scans all the time.
Your processes need to be very well understood to make good use of DynamoDb, if you want to index data at will, and do all sorts of different queries, you probably want an SQL database or elasticsearch.

ssas tabular model use roles so customer can only see suppliers used

I have a ssas tabular model that has dimCustomrer, dimSupplier and a fact table.
I created a role for customers so they can only see themselves and this works perfectly fine.
Is it possible for the role to filter the Dimsupplier table so they can only see the suppliers they have used?
Searching google I can't find anything that seems to help.
Do you want to filter dimSupplier based on the fact table? Yes, that is possible by enabling bi-directional filtering on the relationship between dimSupplier and the fact table, and also setting the security filtering behaviour to "both directions":
However, bi-directional filtering can have some performance impact if your model contains large amounts of data.
Alternatively, you could import an aggregated version of the fact table, that only contains all the distinct combinations of customers and suppliers (2 columns). Call this table BridgeCustomerSupplier, or something like that. Then, you add a row-level filter to your role on the dimSupplier table, and use LOOKUPVALUE, to see if there's a matching record in BridgeCustomerSupplier, for the current supplier (in dimSupplier - the row context), and the current customer.

Is there a way to include a column from one table in many other tables (while maintaining consistency) in PostgreSQL?

I'm trying to build a database (in PostgreSQL 9.6.6) that allows for one "master column" (items.id) to be replicated in to many (automatically generated) tables (e.g. rank1.id, rank2.id, rank3.id, ...). Only items will have INSERT's (or DELETE's) performed and when they are the newly added id's should also show up (or be removed) in the rankX table(s). To be more concrete:
items:
id | name | description
rank1:
id | rank
rank2:
id | rank
...
Where the id's are always the same, and there is always the same number of rows in each of the tables. The rankX.rank values, however, will be different (imagine users ranking how funny a series of images are -- the images all have the same id's but different users might rank them differently).
What I was thinking was that when a new user was added and a new rankX table created I would do the following:
Have rankX.id referencing a foreign key items.id (with ON DELETE CASCADE)
Copy any items.id that already exist
Auto-generate a trigger function that mirrors the INSERT's to items to the rankX table
This seems cumbersome and wasteful of space since all of the xxxx.id columns are identical and I will end up with hundreds or thousands of trigger functions. As someone new to relational databases I was hoping there was an easier way to achieve this.
So, I have a few questions:
Is there a more efficient way to define my tables such that all of this copying isn't necessary?
If this the best way, can you give an example of how you would set up the triggers (and associated functions)?
Do I need to worry about running out of space on the server as I create (potentially many) sets of triggers of this type?

How does PostgreSQL deal with performance when having millions of entries

It may be a silly basic question but as described in the title, I am wondering how PostgreSQL deals with performance when having millions of entries (with the possibility of reaching a billion entries).
To put it in a more concrete way, I want to store data (audio, photos and videos) in my database (I'm only storing their path, files are organised in the file system), but I have to decide wether I use a single table "data" to store all the different types of data, or multiple tables ("data_audio", "data_photos", "data_videos") to separate those types.
The reason why I am asking this question is that I have something like 95% of photos and 5% of audio and videos, and if I want to query my database for an audio entry, I don't want it to be slowed by all the photos entries (searching for a line among a thousand must be different than searching among a million). So I would like to know how PostgreSQL deals with this and if there exists some way to have the best optimisation.
I have read this topic that is really interesting and seems relevant:
How does database indexing work?
Is it the way I should do?
Recap of the core stored informations I will have in my core tables:
1st option:
DATA TABLE (containing audio, photos and videos):
id of type bigserial
_timestamp of type timestamp
path_file of type text
USERS TABLE:
id of type serial
forename of type varchar(255)
surname of type varchar(255)
birthdate of type date
email_address of type varchar(255)
DATA USERS RELATION TABLE:
id_data of type bigserial
id_user of type serial
ACTIVITIES TABLE:
id of type serial
name of type varchar(255)
description of type text
DATA ACTIVITIES RELATION TABLE:
id_data of type bigserial
id_activity of type serial
(SEARCH queries are mainly done on DATA._timestamp and ACTIVITIES.name fields after filtering data by USERS.id)
2nd option (only switching the previous DATA TABLE with the following three tables and keeping all the other tables):
DATA_AUDIO TABLE
DATA_PHOTOS TABLE
DATA_VIDEOS TABLE
Additional question:
Is it a good idea to have a database per user ? (in the storyline, to be able to query the database for data depends on whether you have the permission or not, and if you want to retrieve data from two different users, you have to ask the permission from both users, and the permission process is a process in its own right, it is not handled here, so let’s say that when you query the database, it will always be queries on the same user)
I hope I have been clear, thanks in advance for any help or advices!
Cyrille
Answers:
PostgreSQL is cool with millions and billions of rows.
If the different types of data all have the same attributes and are the same from the database perspective (have the same relationships to other tables etc.), then keep them in one table. If not, use different tables.
The speed of index access to a table does not depend on the size of the table.
If the data of different users have connections, like they use common base tables or you want to be able to join tables for different users, it is best to keep them in different schemas in one database. If it is important that they be separated no matter what, keep them in different databases.
It is also an option to keep data for different users in one table, if you use Row Level Security or let your application take care of it.
This decision depends strongly on your use case and architecture.
Warning: don't create clusters with thousands of databases and databases with thousands of schemas. That causes performance problems in the catalogs.

How to handle many to many in DynamoDB

I am new to NoSql and DynamoDb, but from RDBMS..
My tables are being moved from MySql to DynamoDb. I have tables:
customer (columns: cid [PK], name, contact)
Hardware (columns: hid[PK], name, type )
Rent (columns: rid[PK], cid, hid, time) . => this is the association of customer and Hardware item.
one customer can have many Hardware Items and one Hardware Item can be shared among many customers.
Requirements: seperate lists of customers and hadware items should be able to retrieve.
Rent details- which customer barrowed which Hardeware Item.
I referred this - secondary index table. This is about keeping all columns in one table.
I thought to have 2 DynamoDb tables:
Customer - This has all attributes similar to columns AND set of hardware Item hash keys. (Then my issue is, when customer table is queried to retrieve only customers, all hardware keys are also loaded.)
Any guidance please for table structure? How to save, and load, and even updates ?
Any java samples please? (couldn't find any useful resource which similar to my scenario)
Have a look on DynamoDB's Adjacency List Design Pattern
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-adjacency-graphs.html
In your case, based on Adjacency List Design Pattern, your schema can be designed as following
The prefix of partition key and sort key indicate the type of record.
If the record type is customer, both partition key and sort key should have the prefix 'customer-'.
If the record is that the customer rents the hardware, the partition key's prefix should be 'customer-' and the sort key's prefix should be 'hardware-'
base table
+------------+------------+-------------+
|PK |SK |Attributes |
|------------|------------|-------------|
|customer-cid|customer-cid|name, contact|
|hardware-hid|hardware-hid|name, type |
|customer-cid|hardware-hid|time |
+------------+------------+-------------+
Global Secondary Index Table
+------------+------------+----------+
|GSI-1-PK |GSI-1-SK |Attributes|
|------------|------------|----------|
|hardware-hid|customer-cid|time |
+------------+------------+----------+
customer and hardware should be stored in the same table. customer can refer to hardware by using
SELECT * FROM base_table WHERE PK=customer-123 AND SK.startsWith('hardware-')
if you hardware want to refer back to customer, you should use GSI table
SELECT * FROM GSI_table WHERE PK=hardware-333 AND SK.startsWith('customer-')
notice: the SQL I wrote is just pseudo code, to provide you an idea only.
Take a look at this answer, as it covers many of the basics which are relevant to you.
DynamoDB does not support foreign keys as such. Each table is independent and there are no special tools for keeping two tables synchronised.
You would probably have an attribute in your customers table called hardwares. The attribute would be a list of hardware ids the customer has. If you wanted to see all hardware items belonging to a customer you would:
Perform GetItem on the customer id. Or use Query depending on how you are looking the customer up.
For each hardware id in the customer's hardware attribute, perform a GetItem on the Hardware table.
With DynamoDB you generally end up doing more in the client application relative to an RDBMS solution. The benefits are that its fast and simple. But you will find you probably move a lot of your work from the database server to your client server.