PostgreSQL: What is the maximum number of tables can store in postgreSQL database? - postgresql

Q1: What is the maximum number of tables can store in database?
Q2: What is the maximum number of tables can union in view?

Q1: There's no explicit limit in the docs. In practice, some operations are O(n) on number of tables; expect planning times to increase, and problems with things like autovacuum as you get to many thousands or tens of thousands of tables in a database.
Q2: It depends on the query. Generally, huge unions are a bad idea. Table inheritance will work a little better, but if you're using constraint_exclusion will result in greatly increased planning times.
Both these questions suggest an underlying problem with your design. You shouldn't need massive numbers of tables, and giant unions.
Going by the comment in the other answer, you should really just be creating a few tables. You seem to want to create one table per phone number, which is nonsensical, and to create views per number on top of that. Do not do this, it's mismodelling the data and will make it harder, not easier, to work with. Indexes, where clauses, and joins will allow you to use the data more effectively when it's logically structured into a few tables. I suggest studying basic relational modelling.
If you run into scalability issues later, you can look at partitioning, but you won't need thousands of tables for that.

Both are, in a practical sense, without limit.
The number of tables a database can hold is restricted by the space on your disk system. However, having a database with more than a few thousand tables is probably more an expression of an incorrect analysis of your application domain. Same goes for unions: if you have to union more than a handful of tables you probably should look at your table structure.
One practical scenario where this can happen is with Postgis: having many tables with similar attributes that could be joined in a single view (this is a flaw in the design of Postgis IMHO), but that would typically be handled at the application side (e.g. a GIS).
Can you explain your scenario where you would need a very large number of tables that need to be queried in one sweep?

Related

Data modeling in columnar database vs multi-dimensional for reporting

In my way of learning Redshift (my first columnar database), I am struggling to figure out the approach for designing the model. Columnar database does promote flat table design, yet admits that star schema or snowflake could be a better choice for some cases.
Here is a simple example of where I am struggling
As you can see multi-dimensional approach have few dimensions and 1 fact table. I could have made it snowflake design but I kept it simple for star schema.
Approach 1: Used common columns from tables (in this scenario demographics). This could reduce the table size for Customer & Store but will include the extra dimension.
Approach 2: Flat table design with all the columns
My Questions:
Which approach data modeler use to design data model in columnar databases like Redshift? Or they use different approach?
Considering this example, what is the best way to design a data model for data warehousing.
Which approach is good for reporting (considering that client PC\Laptop would have limited memory). Or even cloud reporting may become costly when heavy data set is used.
Approach 3 will produce a massive amount of data set for reporting. This could be a costly affair if doing reporting (using Power BI or Tableau or any other Self reporting tool)
Multidimenion approach is best for self reporting (cost & performance) but then it defeats the purpose of columnar database.
Approach 1 is also good for reporting but with more joins & complexity.
Sorry, late to the party.
I will post is as answer, because it is too long for a comment.
I saw in chat that test results show that star schema is better. But it was tested on regular (MSSQL), not columnar database (just as vertica, redshift, snowflake, bigquery..).
There is some experience from project implementation where I tested both approaches - OBT and star schema while implementing dwh for reporting. Ths was already more than 2 years ago, so don't expect much details.
Database: Redshift 2 nodes of dc2.8xlarge. Might be a bit overkill, but other option was to have a bunch of lower level nodes, which wouldn't be more cost efficient. This example will be just for one data area.
Data: ~ 6 tables which could be joined as somewhat similar to star schemas. Containing of 3 fact tables and based on denormalization level 5-8 dimensions.
With various approaches and different optimization paths, using star schema it would be common to reach SQL times to about 30 seconds. Which is not bad, but also not too responsive from user perspective.
SQLs on flat denormalized fact tables rarely exceed 5 seconds. Some tables contain more than 100 columns, row counts are between 50M and 100M. To not overcomplicate, we use zstd compression for all columns.
In columnar databases data compresses very well as many similar or same values are used in single column.
We took OBT table approach and there are some pros and cons:
pros:
Responsive reports in reporting tool (most important one)
Fewer objects for ETL developers to handle.
Analysts which query database directly can create simpler queries using less tables.
Don't need to worry about data inconsistencies if some dimensions are outdated, which could happen in star schema.
Easier approach for reporting tool cache clearing.
Easier reporting performance tuning.
Easier modeling in reporting tool, do not need to define table join strategies.
cons:
Might take more space. Didn't really tested this closely as storage space is not an issue for us.
Filters in reporting tools might take a bit longer to provide list of values (select distinct one_column from table)
Table refresh might take a bit longer for one big table compared to multiple smaller tables.
Hopefully this helps.

postgres many tables vs one huge table

I am using postgresql db.
my application manages many objects of the same type.
for each object my application performs intense db writing - each object has a line inserted to db at least once every 30 seconds. I also need to retrieve the data by object id.
my question is how it's best to design the database? use one huge table for all the objects (slower inserts) or use table for each object (more complicated retrievals)?
Tables are meant to hold a huge number of objects of the same type. So, your second option, that is one table per object, doesn't seem to look right. But of course, more information is needed.
My tip: start with one table. If you run into problems - mainly performance - try to split it up. It's not that hard.
Logically, you should use one table.
However, so called "write amplification" problem exhibited by PostgreSQL seems to have been one of the main reasons why Uber switeched from PostgreSQL to MySQL. Quote:
"For tables with a large number of secondary indexes, these
superfluous steps can cause enormous inefficiencies. For instance, if
we have a table with a dozen indexes defined on it, an update to a
field that is only covered by a single index must be propagated into
all 12 indexes to reflect the ctid for the new row."
Whether this is a problem for your workload, only measurement can tell - I'd recommend starting with one table, measuring performance, and then switching to multi-table (or partitioning, or perhaps switching the DBMS altogether) only if the measurements justify it.
A single table is probably the best solution if you are certain that all objects will continue to have the same attributes.
INSERT does not get significantly slower as the table grows – it is the number of indexes that slows down data modification.
I'd rather be worried about data growth. Do you have a design for getting rid of old data? Big DELETEs can be painful; sometimes partitioning helps.

Guideline when designing a database in Postgresql

I am designing a database in Postgresql and I would like to have some expert advices before refactorizing my work.
The database naturally contains different parts that I plan to separate into schemas in order to have a mangling of object names that reflect logical organization of them. About 20 tables are for scientific purposes and 20 others are technical and 20 furthers are about administrative tasks.
Is that a good idea or am I misleading myself into a management overhead that I will regret later?
The database contains 3 tables that are huge. By huge, I mean there is more than 60 millions of rows in it and they might grow a little bit. I think I will create special tablespace for that tables. I would like to do it, in order to separate logically the place where data are stored because the rest of the database should be backuped in a different way than that three tables.
Further more one those 3 tables contains binary data that are not heavy but weight a bit when multiplying by the amount of rows and also this table grows faster than the 2 others. Then I will periodically purge it after backuping the table.
Is it a good idea to have more than one tablespace in a database? If so, is there any precaution to be taken when proceeding this way?
Thank you in advance for your advices.
Choosing good names & grouping database stuffs is always a wise choice, and such overheads are not usually considerable.
About separating tablespace of a single database, it also should not cause any special problem, I've a similar database (but in mysql) that has a large file table and I had to move all of it's content to another server for some optimization reasons and i had no problem with it till now.
There is a very more important matter in RDBMS designing and that's CORRECT TABLE INDEXING. I think choosing good indexes is most critical phase of designing a relational database and you'll see it's effect soon (when you begin to write JOIN queries!).
In general, designing and implementing database is an experimental job that depends to your situation and expertness, so you can't seek for a solid instruction.

Postgres partitioning?

My software runs a cronjob every 30 minutes, which pulls data from Google Analytics / Social networks and inserts the results into a Postgres DB.
The data looks like this:
url text NOT NULL,
rangeStart timestamp NOT NULL,
rangeEnd timestamp NOT NULL,
createdAt timestamp DEFAULT now() NOT NULL,
...
(various integer columns)
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table. At this rate, the cronjob will generate about 480 000 records a day and about 14.5 million a month.
I think the solution would be using several tables, for example I could use a specific table to store data generated in a given month: stats_2015_09, stats_2015_10, stats_2015_11 etc.
I know Postgres supports table partitioning. However, I'm new to this concept, so I'm not sure what's the best way to do this. Do I need partitioning in this case, or should I just create these tables manually? Or maybe there is a better solution?
The data will be queried later in various ways, and those queries are expected to run fast.
EDIT:
If I end up with 12-14 tables, each storing 10-20 millions rows, Postgres should be still able to run select statements quickly, right? Inserts don't have to be super fast.
Partitioning is a good idea under various circumstances. Two that come to mind are:
Your queries have a WHERE clause that can be readily mapped onto one or a handful of partitions.
You want a speedy way to delete historical data (dropping a partition is faster than deleting records).
Without knowledge of the types of queries that you want to run, it is difficult to say if partitioning is a good idea.
I think I can say that splitting the data into different tables is a bad idea because it is a maintenance nightmare:
You can't have foreign key references into the table.
Queries spanning multiple tables are cumbersome, so simple questions are hard to answer.
Maintaining tables becomes a nightmare (adding/removing a column).
Permissions have to be carefully maintained, if you have users with different roles.
In any case, the place to start is with Postgres's documentation on partitioning, which is here. I should note that Postgres's implementation is a bit more awkward than in other databases, so you might want to review the documentation for MySQL or SQL Server to get an idea of what it is doing.
Firstly, I would like to challenge the premise of your question:
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table.
As far as I know, there is no fundamental reason why the database would not cope fine with a single table of many millions of rows. At the extreme, if you created a table with no indexes, and simply appended rows to it, Postgres could simply carry on writing these rows to disk until you ran out of storage space. (There may be other limits internally, I'm not sure; but if so, they're big.)
The problems only come when you try to do something with that data, and the exact problems - and therefore exact solutions - depend on what you do.
If you want to regularly delete all rows which were inserted more than a fixed timescale ago, you could partition the data on the createdAt column. The DELETE would then become a very efficient DROP TABLE, and all INSERTs would be routed through a trigger to the "current" partition (or could even by-pass it if your import script was aware of the partition naming scheme). SELECTs, however, would probably not be able to specify a range of createAt values in their WHERE clause, and would thus need to query all partitions and combine the results. The more partitions you keep around at a time, the less efficient this would be.
Alternatively, you might examine the workload on the table and see that all queries either already do, or easily can, explicitly state a rangeStart value. In that case, you could partition on rangeStart, and the query planner would be able to eliminate all but one or a few partitions when planning each SELECT query. INSERTs would need to be routed through a trigger to the appropriate table, and maintenance operations (such as deleting old data that is no longer needed) would be much less efficient.
Or perhaps you know that once rangeEnd becomes "too old" you will no longer need the data, and can get both benefits: partition by rangeEnd, ensure all your SELECT queries explicitly mention rangeEnd, and drop partitions containing data you are no longer interested in.
To borrow Linus Torvald's terminology from git, the "plumbing" for partitioning is built into Postgres in the form of table inheritance, as documented here, but there is little in the way of "porcelain" other than examples in the manual. However, there is a very good extension called pg_partman which provides functions for managing partition sets based on either IDs or date ranges; it's well worth reading through the documentation to understand the different modes of operation. In my case, none quite matched, but forking that extension was significantly easier than writing everything from scratch.
Remember that partitioning does not come free, and if there is no obvious candidate for a column to partition by based on the kind of considerations above, you may actually be better off leaving the data in one table, and considering other optimisation strategies. For instance, partial indexes (CREATE INDEX ... WHERE) might be able to handle the most commonly queried subset of rows; perhaps combined with "covering indexes", where Postgres can return the query results directly from the index without reference to the main table structure ("index-only scans").

Databases or schemas for an application on Postgres with many tables

I'm in the process of rolling out a new feature on my webapp that will ultimately result in users having the ability to create dynamic tables in the database. Over time I expect that this may result in thousands, or tens of thousands of tables being created.
I understand that postgres doesn't have explicit limits on the number of tables in the database, however that performance might degrade if that number gets too large. In order to mitigate this I'm thinking of breaking up the underlying storage into either different databases or different schemas. My main question is: is one of those choices choices better than the other? If so, why? It seems easier to implement with schemas, however I'm not sure if that will actually solve some of the potential longer term performance issues that might come up.
Note that the tables are completely independent - so there are no concerns about needing to joins with other tables.
Also, assume I'm handing any validation that might get me into trouble with malicious and/or unexpected users being able to create database tables.
From the Database File Layout of the manual:
Each table and index is stored in a separate file.
So, this is the first point to take into account. You should have a filesystem which does a good job with a large number of files in a single directory, unless you use different tablespaces.
Note that you can have different tablespaces even in the same schema or in the same database, so the use of different schemas could by motivated by other reasons, like having tables with the same name (actually, schemas in PostgreSQL are just a way of partitioning the namespace).
For databases, I think the solution with just a database could be good for you, I assume that each database can introduce a non trivial overhead.
Finally: since the system works by using its own catalog, which is a set of relational tables, I suppose you could scale quite well, maybe you will need to add some indexes on the catalog tables, if they are not present.
The last advice: before investing time and resources on the project, do a simulation of it, by generating programmatically a thousand tables, filling them with random data, and simulating their use under the hypotheses of the load of your system.