Guidelines when designing a database in PostgreSQL

I am designing a database in PostgreSQL and I would like some expert advice before refactoring my work.
The database naturally contains different parts that I plan to separate into schemas, so that object names reflect their logical organization. About 20 tables are for scientific purposes, 20 others are technical, and 20 more are for administrative tasks.
Is that a good idea, or am I setting myself up for a management overhead that I will regret later?
The database contains 3 tables that are huge. By huge, I mean more than 60 million rows each, and they might grow a little. I think I will create a special tablespace for those tables. I would like to do this to logically separate where the data are stored, because the rest of the database should be backed up differently from those three tables.
Furthermore, one of those 3 tables contains binary data that is not heavy per row but adds up when multiplied by the number of rows, and this table also grows faster than the other two. I will therefore periodically purge it after backing it up.
Is it a good idea to have more than one tablespace in a database? If so, are there any precautions to take when proceeding this way?
Thank you in advance for your advice.

Choosing good names & grouping database objects is always a wise choice, and such overhead is usually not significant.
About separating tablespaces within a single database: it also should not cause any special problems. I have a similar database (but in MySQL) with a large file table whose contents I had to move to another server for optimization reasons, and I have had no problems with it so far.
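As a minimal sketch of what that separation could look like in PostgreSQL (the schema names, the tablespace path, and the table definitions below are only illustrative, not taken from the question):

    -- Group objects by purpose into schemas; names are examples only.
    CREATE SCHEMA science;
    CREATE SCHEMA admin_tasks;

    -- A dedicated tablespace for the large tables. The directory must
    -- already exist and be owned by the postgres OS user.
    CREATE TABLESPACE bigdata LOCATION '/mnt/bigdisk/pg_bigdata';

    -- Create a large table directly in that tablespace...
    CREATE TABLE science.measurements (
        id         bigserial PRIMARY KEY,
        payload    bytea,
        created_at timestamptz NOT NULL DEFAULT now()
    ) TABLESPACE bigdata;

    -- ...or move an existing table there (this rewrites the table and
    -- takes an exclusive lock while doing so).
    ALTER TABLE science.samples SET TABLESPACE bigdata;

Note that a separate tablespace by itself does not change how pg_dump or a base backup sees the data, so the different backup strategy for the three big tables still has to be handled by the backup tooling.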
There is an even more important matter in RDBMS design, and that's CORRECT TABLE INDEXING. I think choosing good indexes is the most critical phase of designing a relational database, and you'll see its effect soon (when you begin to write JOIN queries!).
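For example (the table and column names here are hypothetical), that usually means indexing the columns you join and filter on, since foreign-key columns are not indexed automatically in PostgreSQL:

    -- Index the join column of a hypothetical child table.
    CREATE INDEX observations_sample_id_idx ON observations (sample_id);

    -- A composite index matching a frequent "filter by device, newest first" pattern.
    CREATE INDEX observations_device_time_idx ON observations (device_id, recorded_at DESC);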
In general, designing and implementing a database is an experimental job that depends on your situation and expertise, so you can't expect a single solid set of instructions.

Related

Data modeling in columnar database vs multi-dimensional for reporting

While learning Redshift (my first columnar database), I am struggling to figure out the right approach for designing the model. Columnar databases do promote a flat table design, yet it is acknowledged that a star or snowflake schema could be a better choice in some cases.
Here is a simple example of where I am struggling.
As you can see, the multi-dimensional approach has a few dimensions and one fact table. I could have made it a snowflake design, but I kept it as a simple star schema.
Approach 1: Factor the common columns out of the tables (in this scenario, demographics). This could reduce the table size for Customer & Store but adds an extra dimension.
Approach 2: Flat table design with all the columns
My Questions:
Which approach do data modelers use to design data models in columnar databases like Redshift? Or do they use a different approach?
Considering this example, what is the best way to design a data model for data warehousing?
Which approach is good for reporting (considering that a client PC/laptop would have limited memory)? Even cloud reporting may become costly when a heavy data set is used.
Approach 3 will produce a massive data set for reporting. This could be a costly affair for reporting (using Power BI, Tableau, or any other self-service reporting tool).
The multi-dimensional approach is best for self-service reporting (cost & performance), but then it defeats the purpose of a columnar database.
Approach 1 is also good for reporting, but with more joins & complexity.
Sorry, late to the party.
I will post this as an answer, because it is too long for a comment.
I saw in chat that your test results show that the star schema is better. But it was tested on a regular row-store database (MSSQL), not a columnar one (such as Vertica, Redshift, Snowflake, BigQuery...).
I can share some experience from a project where I tested both approaches, OBT (one big table) and star schema, while implementing a DWH for reporting. This was already more than 2 years ago, so don't expect many details.
Database: Redshift, 2 nodes of dc2.8xlarge. Might be a bit overkill, but the other option was a bunch of lower-level nodes, which wouldn't have been more cost-efficient. This example covers just one data area.
Data: ~6 tables which could be joined into something resembling a star schema: 3 fact tables and, depending on the denormalization level, 5-8 dimensions.
With various approaches and different optimization paths, using the star schema it was common for SQL times to reach about 30 seconds. That is not bad, but also not very responsive from the user's perspective.
Queries on the flat, denormalized fact tables rarely exceed 5 seconds. Some tables contain more than 100 columns, and row counts are between 50M and 100M. To keep things simple, we use ZSTD compression for all columns.
In columnar databases data compresses very well, since many similar or identical values sit together in a single column.
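For illustration, a trimmed-down sketch of such a flat table in Redshift, with explicit column encodings, might look like this (the table, columns, and keys are hypothetical, not from the actual project):

    -- One-big-table (OBT) sketch; the column list is abbreviated.
    CREATE TABLE reporting.sales_obt (
        order_id      BIGINT        ENCODE zstd,
        order_ts      TIMESTAMP     ENCODE zstd,
        customer_id   BIGINT        ENCODE zstd,
        customer_name VARCHAR(200)  ENCODE zstd,
        customer_city VARCHAR(100)  ENCODE zstd,
        store_name    VARCHAR(200)  ENCODE zstd,
        product_name  VARCHAR(200)  ENCODE zstd,
        quantity      INTEGER       ENCODE zstd,
        amount        DECIMAL(12,2) ENCODE zstd
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    SORTKEY (order_ts);

Whether every column really should be ZSTD, or the sort key left differently encoded, is a tuning detail; the point is that wide denormalized rows compress well column by column.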
We took the OBT approach, and there are some pros and cons:
pros:
Responsive reports in the reporting tool (the most important one).
Fewer objects for ETL developers to handle.
Analysts who query the database directly can write simpler queries using fewer tables.
No need to worry about data inconsistencies if some dimensions are outdated, which could happen in a star schema.
Easier approach for reporting tool cache clearing.
Easier reporting performance tuning.
Easier modeling in reporting tool, do not need to define table join strategies.
cons:
Might take more space. Didn't really test this closely, as storage space is not an issue for us.
Filters in reporting tools might take a bit longer to provide a list of values (select distinct one_column from table).
Table refresh might take a bit longer for one big table compared to multiple smaller tables.
Hopefully this helps.

Postgres: many tables vs one huge table

I am using a PostgreSQL database.
My application manages many objects of the same type.
For each object, my application performs intense database writing - each object has a row inserted at least once every 30 seconds. I also need to retrieve the data by object id.
My question is: how is it best to design the database? Use one huge table for all the objects (slower inserts), or one table per object (more complicated retrievals)?
Tables are meant to hold a huge number of objects of the same type. So your second option, one table per object, doesn't seem right. But of course, more information is needed.
My tip: start with one table. If you run into problems - mainly performance - try to split it up. It's not that hard.
Logically, you should use one table.
However, the so-called "write amplification" problem exhibited by PostgreSQL seems to have been one of the main reasons why Uber switched from PostgreSQL to MySQL. Quote:
"For tables with a large number of secondary indexes, these
superfluous steps can cause enormous inefficiencies. For instance, if
we have a table with a dozen indexes defined on it, an update to a
field that is only covered by a single index must be propagated into
all 12 indexes to reflect the ctid for the new row."
Whether this is a problem for your workload, only measurement can tell - I'd recommend starting with one table, measuring performance, and then switching to multi-table (or partitioning, or perhaps switching the DBMS altogether) only if the measurements justify it.
A single table is probably the best solution if you are certain that all objects will continue to have the same attributes.
INSERT does not get significantly slower as the table grows – it is the number of indexes that slows down data modification.
I'd rather be worried about data growth. Do you have a design for getting rid of old data? Big DELETEs can be painful; sometimes partitioning helps.
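As a hedged sketch of that last point (PostgreSQL 11 or later; the table and column names are invented, not from the question), declarative partitioning lets old data be dropped as a whole partition instead of with a big DELETE:

    -- One row per object write, partitioned by time.
    CREATE TABLE object_events (
        object_id   bigint      NOT NULL,
        recorded_at timestamptz NOT NULL,
        payload     jsonb
    ) PARTITION BY RANGE (recorded_at);

    CREATE TABLE object_events_2024_01 PARTITION OF object_events
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

    -- Retrieval by object id still needs an index (propagated to each partition).
    CREATE INDEX ON object_events (object_id, recorded_at);

    -- Purging a month is a quick metadata operation, not a huge DELETE.
    ALTER TABLE object_events DETACH PARTITION object_events_2024_01;
    DROP TABLE object_events_2024_01;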

Databases or schemas for an application on Postgres with many tables

I'm in the process of rolling out a new feature on my webapp that will ultimately result in users having the ability to create dynamic tables in the database. Over time I expect that this may result in thousands, or tens of thousands of tables being created.
I understand that Postgres doesn't have an explicit limit on the number of tables in a database, but that performance might degrade if that number gets too large. To mitigate this, I'm thinking of breaking up the underlying storage into either different databases or different schemas. My main question is: is one of those choices better than the other? If so, why? It seems easier to implement with schemas, but I'm not sure whether that will actually solve some of the potential longer-term performance issues that might come up.
Note that the tables are completely independent - so there are no concerns about needing to join them with other tables.
Also, assume I'm handling any validation that might get me into trouble with malicious and/or unexpected users being able to create database tables.
From the Database File Layout section of the manual:
Each table and index is stored in a separate file.
So, this is the first point to take into account. You should have a filesystem which does a good job with a large number of files in a single directory, unless you use different tablespaces.
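You can check this yourself; for instance (the table name is just a placeholder):

    -- Shows the data file backing a table, relative to the data directory.
    SELECT pg_relation_filepath('some_schema.some_table');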
Note that you can have different tablespaces even in the same schema or in the same database, so the use of different schemas could be motivated by other reasons, like having tables with the same name (actually, schemas in PostgreSQL are just a way of partitioning the namespace).
As for databases, I think staying with just one database would be good for you; I assume that each additional database can introduce non-trivial overhead.
Finally: since the system works by using its own catalog, which is a set of relational tables, I suppose you could scale quite well; maybe you will need to add some indexes on the catalog tables, if they are not already present.
One last piece of advice: before investing time and resources in the project, run a simulation of it by programmatically generating a thousand tables, filling them with random data, and simulating their use under your system's expected load.
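A rough sketch of such a simulation in plain PostgreSQL (the schema name, table count, and row count are arbitrary) could be:

    -- Generate 1000 throw-away tables and fill each with some random rows.
    CREATE SCHEMA sim;

    DO $$
    BEGIN
        FOR i IN 1..1000 LOOP
            EXECUTE format('CREATE TABLE sim.t_%s (id serial PRIMARY KEY, val numeric)', i);
            EXECUTE format('INSERT INTO sim.t_%s (val) SELECT random() FROM generate_series(1, 1000)', i);
        END LOOP;
    END
    $$;

Then run your application's typical queries against that synthetic set and watch planning times and maintenance behaviour (autovacuum, backup duration) before committing to the design.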

PostgreSQL: What is the maximum number of tables that can be stored in a PostgreSQL database?

Q1: What is the maximum number of tables that can be stored in a database?
Q2: What is the maximum number of tables that can be unioned in a view?
Q1: There's no explicit limit in the docs. In practice, some operations are O(n) in the number of tables; expect planning times to increase, and problems with things like autovacuum, as you get to many thousands or tens of thousands of tables in a database.
Q2: It depends on the query. Generally, huge unions are a bad idea. Table inheritance will work a little better, but if you're using constraint_exclusion it will result in greatly increased planning times.
Both these questions suggest an underlying problem with your design. You shouldn't need massive numbers of tables, and giant unions.
Going by the comment in the other answer, you should really just be creating a few tables. You seem to want to create one table per phone number, which is nonsensical, and to create views per number on top of that. Do not do this, it's mismodelling the data and will make it harder, not easier, to work with. Indexes, where clauses, and joins will allow you to use the data more effectively when it's logically structured into a few tables. I suggest studying basic relational modelling.
If you run into scalability issues later, you can look at partitioning, but you won't need thousands of tables for that.
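As a hedged illustration of that point (the table and columns are invented for the example), one table with an index replaces a table and a view per phone number:

    CREATE TABLE calls (
        id           bigserial PRIMARY KEY,
        phone_number text        NOT NULL,
        called_at    timestamptz NOT NULL,
        duration_s   integer
    );

    CREATE INDEX calls_phone_number_idx ON calls (phone_number, called_at);

    -- "The table for one number" becomes a WHERE clause, not a separate table or view.
    SELECT * FROM calls WHERE phone_number = '+15551234567' ORDER BY called_at;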
Both are, in a practical sense, without limit.
The number of tables a database can hold is restricted by the space on your disk system. However, having a database with more than a few thousand tables is probably more an expression of an incorrect analysis of your application domain. Same goes for unions: if you have to union more than a handful of tables you probably should look at your table structure.
One practical scenario where this can happen is with PostGIS: having many tables with similar attributes that could be joined in a single view (this is a flaw in the design of PostGIS, IMHO), but that would typically be handled on the application side (e.g. a GIS).
Can you explain your scenario where you would need a very large number of tables that need to be queried in one sweep?

Which NoSQL databases support text array columns (and indexes on these columns) like the PostgreSQL text[] type?

I need to move data from a PostgreSQL to a NoSQL database. In the process we are evaluating different NoSQL databases, and Cassandra came up as a possibility, but from the documentation it seems that Cassandra doesn't support a text array as a column type; is this correct? Which NoSQL databases support this type of column and support indexes on such columns?
For example, to store this and have an index on a column with this type of data:
City:['Washington','Washington DC']
Thanks in advance!
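For reference, the PostgreSQL setup being migrated away from would typically look something like this (a minimal sketch; the table name is illustrative):

    CREATE TABLE places (
        id   serial PRIMARY KEY,
        city text[]
    );

    -- A GIN index lets array containment queries use the index.
    CREATE INDEX places_city_gin ON places USING gin (city);

    INSERT INTO places (city) VALUES (ARRAY['Washington', 'Washington DC']);

    -- Find rows whose city array contains 'Washington DC'.
    SELECT * FROM places WHERE city @> ARRAY['Washington DC'];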
Not exactly an answer to your question (not enough reputation to comment (?!?)), but understanding that your problem is scale, and that you are coming from PostgreSQL, have you tried Postgres-XC yet? That may be a much easier transition than to NoSQL. NoSQL databases, as I assume you know, have very different performance characteristics and nuances that might actually do more harm than good. Postgres-XC is a multi-master, write-scalable fork of PostgreSQL that sits somewhere between 9.1 and 9.2 from a PostgreSQL feature standpoint, and it is an active project. 9.2 conformance was slated for this month or last, if I recall correctly. It's relatively easy to set up for what it is - you'll build two GTMs, one as a primary and one as a failover, and give them enough memory. Then you can scale horizontally by adding pairs of coordinators and data nodes, 1 coordinator and 1 data node per server. Your application tier can talk to any of the coordinators, transactions are shipped to the appropriate coordinators, and you can specify the distribution of your data per table - either replicated for small reference tables or distributed for large ones. If you design your queries well, you can get massive performance improvements because your queries can be shipped and executed simultaneously on multiple coordinator/data node pairs.
I know you are looking for NoSQL, but I mention this because we too had a vertical vs. horizontal scaling problem, and in the end I found it was easier to build NoSQL capability into a relational system than it was to build relational capability into a NoSQL system. Of course it all depends on your data; sometimes NoSQL is absolutely the best choice. Sometimes it can be a major headache too: for example, some NoSQL databases have problems with filesystem growth, so whereas you thought you bought horizontal scalability, you wind up eating your SAN out of house and home.
Anyway, hope that helps! I would have left it as a comment, but Stack Overflow has that strange reputation thing going on.
I forgot to mention: with Postgres-XC you can also specify which columns to distribute on and with what kind of algorithm. I typically distribute by hash and make sure of two things: first, that the hash can be generated application-side, so that I don't have to do joins on tables that have gazillions of rows; and second, that the hash keeps the distribution across servers even while also keeping related information together on the same server, so as to increase the shippability of queries. That is, if you have a customer table and a customer orders table, distribute both on a hash of some unique customer information that is in both tables, and make sure you can generate that hash application-side. I hope that makes sense; I'm not sure if I did a good job explaining. If you would like further clarification on that, please let me know; the docs on XC are a bit scattered right now, so a lot of what I related is OJT.
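If I recall the Postgres-XC DDL correctly (treat this as a hedged sketch; the tables are invented and the DISTRIBUTE BY clause should be checked against the XC docs), the layout described above looks roughly like:

    -- Small reference table replicated to every data node.
    CREATE TABLE country_codes (
        code text PRIMARY KEY,
        name text
    ) DISTRIBUTE BY REPLICATION;

    -- Large tables distributed by the same customer hash key, so a customer
    -- and their orders land on the same data node and joins can be shipped.
    CREATE TABLE customers (
        customer_id bigint PRIMARY KEY,
        name        text
    ) DISTRIBUTE BY HASH (customer_id);

    CREATE TABLE customer_orders (
        order_id    bigint PRIMARY KEY,
        customer_id bigint NOT NULL,
        total       numeric
    ) DISTRIBUTE BY HASH (customer_id);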