Databases or schemas for an application on Postgres with many tables

I'm in the process of rolling out a new feature on my webapp that will ultimately result in users having the ability to create dynamic tables in the database. Over time I expect that this may result in thousands, or tens of thousands of tables being created.
I understand that Postgres doesn't have explicit limits on the number of tables in a database, but that performance might degrade if that number gets too large. To mitigate this, I'm thinking of breaking up the underlying storage into either different databases or different schemas. My main question is: is one of those choices better than the other? If so, why? It seems easier to implement with schemas, but I'm not sure whether that will actually solve the longer-term performance issues that might come up.
Note that the tables are completely independent, so there are no concerns about needing to join with other tables.
Also, assume I'm handling any validation that might get me into trouble with malicious and/or unexpected users being able to create database tables.

From the Database File Layout of the manual:
Each table and index is stored in a separate file.
So, this is the first point to take into account. You should have a filesystem which does a good job with a large number of files in a single directory, unless you use different tablespaces.
Note that you can use different tablespaces even within the same schema or the same database, so the use of different schemas could be motivated by other reasons, like having tables with the same name (schemas in PostgreSQL are really just a way of partitioning the namespace).
As for databases, I think a single database could work well for you; I assume that each additional database introduces a non-trivial overhead.
Finally, since the system works from its own catalog, which is a set of relational tables, I suppose you could scale quite well. You may need to add some indexes on the catalog tables if they are not already present.
One last piece of advice: before investing time and resources in the project, simulate it by programmatically generating a thousand tables, filling them with random data, and exercising them under the load you expect for your system.
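The simulation suggested above could start from a small script that emits the SQL for the test tables. This is a minimal sketch, not a benchmark harness: the schema names (`sim_0`, `sim_1`, ...), the table names (`dyn_00000`, ...) and the column layout are all made up for illustration, and the script only builds the statements rather than connecting to a server.

```python
def make_simulation_sql(n_tables, n_schemas=1, rows_per_table=100):
    """Return SQL statements that create and fill n_tables test tables.

    Tables are spread round-robin over n_schemas schemas. All object
    names are hypothetical placeholders.
    """
    statements = [
        f"CREATE SCHEMA IF NOT EXISTS sim_{s};" for s in range(n_schemas)
    ]
    for t in range(n_tables):
        table = f"sim_{t % n_schemas}.dyn_{t:05d}"
        statements.append(
            f"CREATE TABLE {table} "
            "(id serial PRIMARY KEY, payload text, val double precision);"
        )
        # generate_series(), md5() and random() are built-in PostgreSQL functions.
        statements.append(
            f"INSERT INTO {table} (payload, val) "
            f"SELECT md5(g::text), random() "
            f"FROM generate_series(1, {rows_per_table}) AS g;"
        )
    return statements


if __name__ == "__main__":
    # Print the script so it can be redirected to a file and run with psql -f.
    print("\n".join(make_simulation_sql(1000, n_schemas=4)))
```

Running it once with `n_schemas=1` and once with a larger value gives two layouts to compare under the same query load.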

Related

Guideline when designing a database in Postgresql

I am designing a database in PostgreSQL and I would like some expert advice before refactoring my work.
The database naturally contains different parts that I plan to separate into schemas, so that object names reflect their logical organization. About 20 tables are for scientific purposes, 20 others are technical, and a further 20 are for administrative tasks.
Is that a good idea, or am I setting myself up for a management overhead that I will regret later?
The database contains 3 huge tables. By huge, I mean they hold more than 60 million rows and might grow a little. I think I will create a special tablespace for those tables. I would like to do so in order to separate the place where the data is stored, because the rest of the database should be backed up differently from those three tables.
Furthermore, one of those 3 tables contains binary data that is not heavy per row but adds up when multiplied by the number of rows, and this table also grows faster than the other 2. I will therefore purge it periodically after backing it up.
Is it a good idea to have more than one tablespace in a database? If so, are there any precautions to take when proceeding this way?
Thank you in advance for your advice.
Choosing good names and grouping database objects is always a wise choice, and the overhead is usually negligible.
As for using separate tablespaces within a single database, that should not cause any special problems either. I have a similar database (but in MySQL) with a large file table; I had to move all of its content to another server for optimization reasons, and I have had no problems with it so far.
A much more important matter in RDBMS design is correct table indexing. I think choosing good indexes is the most critical phase of designing a relational database, and you'll see its effect soon (when you begin to write JOIN queries!).
In general, designing and implementing a database is an experimental job that depends on your situation and expertise, so you can't expect a fixed set of instructions.

PostgreSQL: What is the maximum number of tables that can be stored in a PostgreSQL database?

Q1: What is the maximum number of tables that can be stored in a database?
Q2: What is the maximum number of tables that can be unioned in a view?
Q1: There's no explicit limit in the docs. In practice, some operations are O(n) on number of tables; expect planning times to increase, and problems with things like autovacuum as you get to many thousands or tens of thousands of tables in a database.
Q2: It depends on the query. Generally, huge unions are a bad idea. Table inheritance will work a little better, but if you're using constraint_exclusion it will result in greatly increased planning times.
Both these questions suggest an underlying problem with your design. You shouldn't need massive numbers of tables, and giant unions.
Going by the comment in the other answer, you should really just be creating a few tables. You seem to want to create one table per phone number, which is nonsensical, and to create views per number on top of that. Do not do this, it's mismodelling the data and will make it harder, not easier, to work with. Indexes, where clauses, and joins will allow you to use the data more effectively when it's logically structured into a few tables. I suggest studying basic relational modelling.
If you run into scalability issues later, you can look at partitioning, but you won't need thousands of tables for that.
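To illustrate the "few tables plus an index and a WHERE clause" advice, here is a minimal sketch. The original schema is not shown, so the `calls` table and its columns are hypothetical, and SQLite (from Python's standard library) is used only so the example runs without a server; the same DDL idea carries over to PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table for all phone numbers instead of one table per number.
conn.execute(
    "CREATE TABLE calls ("
    "phone_number TEXT NOT NULL, "
    "started_at TEXT NOT NULL, "
    "duration_s INTEGER NOT NULL)"
)
# The index makes per-number lookups cheap without splitting the data.
conn.execute("CREATE INDEX calls_phone_idx ON calls (phone_number)")

conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("555-0100", "2024-01-01T10:00", 60),
        ("555-0199", "2024-01-01T10:05", 15),
        ("555-0100", "2024-01-02T11:00", 30),
    ],
)


def calls_for(connection, phone_number):
    """Return (started_at, duration_s) rows for one number, oldest first."""
    return connection.execute(
        "SELECT started_at, duration_s FROM calls "
        "WHERE phone_number = ? ORDER BY started_at",
        (phone_number,),
    ).fetchall()
```

A per-number view, if one is really wanted, then becomes a parameterized query like `calls_for` rather than a separate database object.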
Both are, in a practical sense, without limit.
The number of tables a database can hold is restricted by the space on your disk system. However, a database with more than a few thousand tables is probably a sign of an incorrect analysis of your application domain. The same goes for unions: if you have to union more than a handful of tables, you should probably look at your table structure.
One practical scenario where this can happen is with PostGIS: having many tables with similar attributes that could be joined in a single view (this is a flaw in the design of PostGIS IMHO), but that would typically be handled on the application side (e.g. a GIS).
Can you explain your scenario where you would need a very large number of tables that need to be queried in one sweep?

Is it better to use multiple databases when you are managing independent sets of things in MongoDB?

Suppose, as an example, you have a blogging website that uses MongoDB to store its data.
Is it better to have a database per blogger, given that their blogs and comments are completely independent from other bloggers? Or just lump everything together? Or doesn't it make much difference?
I'm imagining the same web app (not independent sites/URLs per blogger) being used by all bloggers. So when someone logs in or accesses a blog, the code would find the right database to use and haul data out of it.
Does this have any downsides? Is this normal for handling these kinds of things?
I am making plenty of assumptions about your needs. But, generally, there are 3 paths to multi-tenant apps in MongoDB:
Single collection per customer; never, ever do this.
Single database per customer. Good. You will trade off free space if your product is on the freemium model. Either way, you will want to run with "smallfiles" option. As stated, you will build the routing system for your environment. Thus, you will want to connect to the proper database for the proper customer.
customer_id key per document + path slug. Good. The trade off here is recovery of free space. Traditionally, MongoDB does not recover space used by deleted documents. Thus customers creating and deleting blog posts would create unused space. By using 'usePowerOf2Sizes' collections, you will recover disk space of deleted documents. However, 'usePowerOf2Sizes' creates bloated padding space.
To get over the disk space padding, take a look at the compression used here: http://blog.appsignal.com/blog/2013/07/30/taming-mongodb-disk-usage.html
To recap, I would recommend using customer_id plus the compression. It gives you the best of both worlds.
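The routing difference between the database-per-customer and customer_id-per-document options can be sketched in a few lines. This is only an illustration: the `blog_` prefix, the `slug` field name and both function names are assumptions, and no driver is involved.

```python
import re


def tenant_database_name(customer_id, prefix="blog_"):
    """Database per customer: derive a database name from the customer id.

    Characters outside [A-Za-z0-9_] are replaced and the result is
    truncated, since MongoDB restricts database name characters and length.
    Note that replacement or truncation could make two ids collide.
    """
    safe = re.sub(r"[^A-Za-z0-9_]", "_", customer_id)
    return (prefix + safe)[:63]


def tenant_query(customer_id, slug):
    """Shared collection: every query carries the customer_id key."""
    return {"customer_id": customer_id, "slug": slug}
```

With a driver such as PyMongo, the first function's result would select the database (for example `client[name]`), while the second function's result would be passed as the filter to a find call on the shared collection.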
As stated in the comments under the original question, there's really no performance benefit to splitting up your MongoDB store into separate databases per blogger, due to the overhead of having each database and minimum storage.
On the flipside: You are going to make some cross-user analysis more difficult for yourself. As a very simple example, based on your blogging example: Imagine you want to look at average post count per user. This is pretty simple if your users (and posts) are in the same database (typically in the same collections), and you can likely use the aggregation framework for this task. This task will not be so straightforward with an unbounded number of databases, where you'll need to first enumerate all databases, then perform your aggregations/averaging once per database. This could end up being a slower operation than within a single-database architecture.
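As a rough sketch of that average-post-count example: in the single-database layout the whole computation fits in one aggregation pipeline, while the database-per-blogger layout needs a count collected from every database first. The `customer_id` field name is an assumption, and the two helper functions are plain-Python stand-ins so the logic can be checked without a server.

```python
from collections import Counter

# Single-database layout: one pipeline over the shared posts collection.
# $group, $sum and $avg are standard aggregation operators; the field
# name "customer_id" is assumed. Users with zero posts are not counted.
AVG_POSTS_PIPELINE = [
    {"$group": {"_id": "$customer_id", "posts": {"$sum": 1}}},
    {"$group": {"_id": None, "avg_posts": {"$avg": "$posts"}}},
]


def average_posts_single_db(posts):
    """Plain-Python equivalent of the pipeline for a list of post dicts."""
    counts = Counter(p["customer_id"] for p in posts)
    return sum(counts.values()) / len(counts) if counts else 0.0


def average_posts_per_db(post_counts_by_db):
    """Database-per-blogger layout.

    post_counts_by_db maps database name -> post count; building it means
    enumerating every database and issuing one count per database.
    """
    counts = list(post_counts_by_db.values())
    return sum(counts) / len(counts) if counts else 0.0
```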
Having said all that: You still might have some reason to split data across databases. Maybe you have to separate data due to legal reasons, or to ensure customers that their sensitive data won't be commingled with other companies' data. Maybe your customer needs full read/write access to their database, and so you use per-database configuration as a security boundary. I'm sure there are other reasons as well...
It is perfectly normal to allocate hundreds of databases if that is all you will see.
Database separation can have many benefits. Databases can be sharded independently, since sharding occurs at the database level. Databases also have the upside of being completely isolated instances (including locks) of the data within them (good example: space allocation occurs at the database level).
This means they can be moved around the network as a user's data is accessed more, and since a single user's data might not be that big, this would be easier than moving all of your users' data to a more powerful node.
However, you must consider the problems of managing the connections to each database in the application. There will be overhead, and you will need far more complex code than what is considered standard.
As for space, you will not see a drastic increase in usage. The most problematic part of using separate databases is the journal allocation. Every collection you use in separate databases will also, of course, pre-allocate itself, but this is actually considered one of the upsides of database separation (movement of databases between nodes, isolation).
So the space problem is really only a problem if your scenario makes it one.
is this normal for handling these kinds of things?
For a normal blogger site, no, and I do not know enough about the complexities of your scenario to say any different. Normal operation would be to lump everything together, since you could reach thousands, maybe millions, of users, and database separation just won't scale to that very well.

Which NoSQL databases support text array columns (and indexes on these columns) like the PostgreSQL text[] type?

I need to move data from PostgreSQL to a NoSQL database. In the process we are evaluating different NoSQL databases, and Cassandra came up as a possibility, but from the documentation it seems that Cassandra doesn't support a text array as a column type. Is this correct? Which NoSQL databases support this type of column, and support indexes on it?
For example to store this and have an index on a column with this type of data:
City:['Washington','Washington DC']
Thanks in advance!
Not exactly an answer to your question (not enough reputation to comment (?!?)), but understanding that your problem is scale, and you are coming from PostgreSQL, have you tried Postgres-XC yet? That may be a much easier transition than to NoSQL. NoSQL databases, as I assume you know, have very different performance characteristics and nuances that might actually do more harm than good.
Postgres-XC is a multi-master, write-scalable fork of PostgreSQL that sits somewhere between 9.1 and 9.2 from a PostgreSQL feature standpoint, and it is an active project. 9.2 conformance was slated for this month or last, if I recall correctly. It's relatively easy to set up for what it is: you build 2 GTMs, one as a primary and one as a failover, and give them enough memory. Then you can scale horizontally by adding pairs of coordinators and data nodes, 1 coordinator and 1 data node per server.
Your application tier can talk to any of the coordinators, transactions are shipped to the appropriate coordinators, and you can specify the distribution of your data by table: either replicated for small reference tables or distributed for large ones. If you design your queries well, you can get a massive performance improvement because your queries can be shipped and executed simultaneously on multiple coordinator/data node pairs.
I know you are looking for NoSQL, but I mention this because we too had a vertical vs horizontal scale problem and in the end I found it was easier to build NoSQL capability into a relational system than it was to build relational capability into a NoSQL system. And of course it all depends on your data, sometimes NoSQL is absolutely the best choice. Sometimes it can be a major headache too, for example some NoSQL databases have problems with filesystem growth so whereas you thought you bought horizontal scalability you wound up eating your SAN out of house and home.
Anyway, hope that helps! I would have left it as a comment but stackoverflow has that strange reputation thing going on.
I forgot to mention also, with Postgres-XC you can specify on which columns you wish to distribute and by what kind of algorithm. I typically distribute by hash, and make sure of two things, first that hash can be generated application-side so that I don't have to do joins on tables that are gadzillions of rows and second that the hash keeps the distribution level across servers correct but while also keeping related information together on the same server so as to increase the shippability of queries. That is, if you have a customer table and a customer orders table, distribute both on a hash of some customer unique information that is in both tables and make sure you can generate that application-side. I hope that makes sense, I'm not sure if I did a good job explaining. If you would like further clarification on that please let me know, the docs are a bit scattered on XC right now, so a lot of what I related is OJT.
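The "hash that can be generated application-side" idea can be sketched as follows. This is not Postgres-XC's internal hash function; it is a hypothetical application-computed key that would be stored in both the customer table and the customer orders table and used as the distribution column. The function name, the choice of SHA-256 and the bucket count are all assumptions.

```python
import hashlib


def distribution_key(customer_unique_info, n_buckets=1024):
    """Map a stable customer identifier to a bucket number in [0, n_buckets).

    The same input always yields the same bucket, so a customer row and
    all of that customer's order rows carry an identical key value.
    """
    digest = hashlib.sha256(customer_unique_info.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


# Both rows get the same key because it is derived from the same input.
customer_row = {"email": "alice@example.com"}
customer_row["dist_key"] = distribution_key(customer_row["email"])
order_row = {"customer_email": "alice@example.com", "total": 42}
order_row["dist_key"] = distribution_key(order_row["customer_email"])
```

Because the application can compute the key before issuing a query, it can filter on the key directly instead of joining to look it up.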

PostgreSQL temporary table cache in memory?

Context:
I want to store some temporary results in some temporary tables. These tables may be reused in several queries that may occur close in time, but at some point the evolutionary algorithm I'm using may not need some old tables any more and keep generating new tables. There will be several queries, possibly concurrently, using those tables. Only one user doing all those queries. I don't know if that clarifies everything about sessions and so on, I'm still uncertain about how that works.
Objective:
What I would like to do is create temporary tables (if they don't already exist), keep them in memory as far as possible, and, if at some point there is not enough memory, delete those that would otherwise be written to the HDD (I guess those will be the least recently used).
Examples:
The client will be doing queries for EMAs with different parameters and an aggregation of them with different coefficients, each individual may vary in terms of the coefficients used and so the parameters for the EMAs may repeat as they are still in the gene pool, and may not be needed after a while. There will be similar queries with more parameters and the genetic algorithm will find the right values for the parameters.
Questions:
Is that what "on commit drop" means? I've seen descriptions about sessions and transactions but I don't really understand those concepts. Sorry if the question is stupid.
If it is not, do you know about any simple way to get Postgres to do this?
Workaround:
In the worst case I should be able to make a rough estimate of how many tables I can keep in memory and try to implement the LRU myself, but it's never going to be as good as what Postgres could do.
Thank you very much.
This is a complicated topic and probably one to discuss in some depth. I think it is worth both explaining why PostgreSQL doesn't support this and also what you can do instead with recent versions to approach what you are trying to do.
PostgreSQL has a pretty good approach to caching diverse data sets across multiple users. In general you don't want to allow a programmer to specify that a temporary table must be kept in memory if it becomes very large. Temporary tables however are managed quite differently from normal tables in that they are:
Buffered by the individual back-end, not the shared buffers
Locally visible only, and
Unlogged.
What this means is that typically you aren't generating a lot of disk I/O for temporary tables. The tables do not normally flush WAL segments, and they are managed by the local back-end so they don't affect shared buffer usage. This means that only occasionally is data going to be written to disk and only when necessary to free memory for other (usually more frequent) tasks. You certainly aren't forcing disk writes and only need disk reads when something else has used up memory.
The end result is that you don't really need to worry about this. PostgreSQL already tries, to a certain extent, to do what you are asking it to do, and temporary tables have much lower disk I/O requirements than standard tables do. It does not force the tables to stay in memory though and if they become large enough, the pages may expire into the OS disk cache, and eventually on to disk. This is an important feature because it ensures that performance gracefully degrades when many people create many large temporary tables.
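If you still want the do-it-yourself LRU mentioned in the question's workaround, a client-side tracker could look like the sketch below. It is only bookkeeping: it does not measure memory, the class and table names are hypothetical, and the returned DROP statements would have to be executed in the same session that created the temporary tables, since those tables are visible only locally.

```python
from collections import OrderedDict


class TempTableLRU:
    """Track temporary table names and evict the least recently used ones."""

    def __init__(self, max_tables):
        self.max_tables = max_tables
        self._tables = OrderedDict()

    def touch(self, name):
        """Record a use of `name`; return DROP statements for evicted tables."""
        if name in self._tables:
            # Already tracked: mark as most recently used, nothing to drop.
            self._tables.move_to_end(name)
            return []
        self._tables[name] = True
        dropped = []
        while len(self._tables) > self.max_tables:
            oldest, _ = self._tables.popitem(last=False)
            dropped.append(f'DROP TABLE IF EXISTS "{oldest}";')
        return dropped
```

Calling `touch` before each query that uses a table keeps the ordering current; `max_tables` would be the guessed number of tables that fit in memory.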