Why create an empty (no rows, no columns) table in PostgreSQL?

In an answer to this question I've learned that you can create an empty table in PostgreSQL:
create table t();
Is there any real use case for this? Why would you create an empty table? Because you don't know what columns it will have?

These are the things, from my point of view, that a column-less table is good for. They probably fall more into the warm-and-fuzzy category.
1. One practical use of creating a table before you add any user-defined columns to it is that it allows you to iterate fast when creating a new system, or just when doing rapid dev iterations in general.
2. A variation on 1: it lets you stub out tables that your app logic or procedures can reference, even if the columns have yet to be put in place.
3. I could see it coming in handy when you're at a big company with lots of developers. Maybe you want to reserve a name months in advance, before your work is complete. Just add the new column-less table to the build. Of course they could still hijack it, but you may be able to win the argument that you had it in use well before they came along with their other plans. Kind of fringe, but a valid benefit.
All of these are handy and I miss them when I'm not working in PostgreSQL.

I don't know the precise reason for its inclusion in PostgreSQL, but a zero-column table - or rather a zero-attribute relation - plays a role in the theory of relational algebra, on which SQL is (broadly) based.
Specifically, a zero-attribute relation with no tuples (in SQL terms, a table with no columns and no rows) is the relational equivalent of zero or false, while a relation with no attributes but one tuple (SQL: no columns, but one row, which isn't possible in PostgreSQL as far as I know) is true or one. Hugh Darwen, an outspoken advocate of relational theory and critic of SQL, dubbed these "Table Dum" and "Table Dee", respectively.
In normal algebra x + 0 == x and x * 0 == 0, whereas x * 1 == x; the idea is that in relational algebra, Table Dum and Table Dee can be used as similar primitives for joins, unions, etc.
PostgreSQL internally refers to tables (as well as views and sequences) as "relations", so although it is geared around implementing SQL, which isn't defined by this kind of pure relational algebra, there may be elements of that in its design or history.
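As a rough illustration in PostgreSQL itself (my own sketch, not part of the original answer): with the zero-column, zero-row t from the question, a cross join behaves like multiplying by zero, while a one-row, zero-column result (PostgreSQL 9.4+ accepts an empty SELECT list) behaves like multiplying by one.
SELECT * FROM (VALUES (1), (2), (3)) AS r(x) CROSS JOIN t;               -- no rows: r "times" Table Dum
SELECT * FROM (VALUES (1), (2), (3)) AS r(x) CROSS JOIN (SELECT) AS dee;  -- the three rows of r, unchanged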

It is not an empty table - only an empty result. PostgreSQL rows contain some invisible (by default) system columns. I am not sure, but it may be an artifact from the dark ages, when Postgres was an object-relational database and supported the POSTQUEL language. Such an empty table can work as an abstract ancestor in a class hierarchy.
List of system columns
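As a quick sanity check (using the t from the question above), those hidden system columns are still queryable even though the table has no user-defined columns:
SELECT tableoid, ctid, xmin, xmax FROM t;  -- returns zero rows, but the four system columns exist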

I don't think mine is the intended usage, but recently I've used an empty table as a lock for a view which I create and change dynamically with EXECUTE. The function which creates/replaces the view takes ACCESS EXCLUSIVE on the empty table, and the other functions which use the view take ACCESS SHARE.
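A rough sketch of that pattern (the table name, view handling, and lock placement here are my own illustration, not the original code):
-- zero-column table that exists purely to be locked
CREATE TABLE view_rebuild_lock ();
-- in the function that (re)creates the view, before the EXECUTE:
LOCK TABLE view_rebuild_lock IN ACCESS EXCLUSIVE MODE;
-- in the functions that query the view:
LOCK TABLE view_rebuild_lock IN ACCESS SHARE MODE;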

Related

Design question for a table with too many joins OR polymorphic relations in Postgres 11.7

I've been given a table that I'm not sure how to design. I'm hoping for some design suggestions, or pointers in the right direction. The table is called edge and is meant to store some event traces, and IDs that link out to a host of possible lookup tables. Leaving out everything but IDs, here's what the table contains, all UUIDs:
ID
InvID
OrgID
FacilityID
FromAssemblyID
FromAssociatedTo
FromAssociatedToID
FromClinicID
FromFacilityDepartmentID
FromFacilityID
FromFacilityLocationID
FromScanAtFacilityID
FromScanID
FromSCaseID
FromSterilizerLoadID
FromWasherLoadID
FromWebUserID
ToAssemblyID
ToAssociatedTo
ToAssociatedToID
ToClinicID
ToFacilityDepartmentID
ToFacilityID
ToFacilityLocationID
ToNodeDTS
ToScanAtFacilityID
ToScanID
ToSCaseID
ToSterilizerLoadID
ToUserName
ToWasherLoadID
ToWebUserID
That's an overwhelming number of IDs to possibly join on. I remember reading that the Postgres planner kind of gives up when you've got a dozen-plus joins, the idea being that there are so many permutations to explore that the planning time could quickly overwhelm the query time. If you boil it down, the "from" and "to" links are only ever going to have one key value across all of those fields. So, implemented as polymorphic/promiscuous relations, it would look something like this:
ID
InvID
OrgID
FacilityID
FromID
FromType
ToID
ToType
ToWebUserID
This table is going to be ginormous, so speed is/will be a consideration.
I encouraged the author not to use a polymorphic design, although the appeal is obvious. (I like Karwin's SQL Antipatterns book.) But now, confronted with nearly three dozen IDs, I'm a bit stumped.
Is there a common solution to this kind of problem? Namely, where you've got a central table like this with connections to a wide variety of possible tables? I don't have a Data Warehousing background, but this looks somewhat like that. (The author of this table has read Kimball's books, but not done any Data Warehouse implementations either.)
Important: We're using JOIN to do lookups on related values that might change, we're not using it to change the size of the result set. Just pretend it would always be LEFT JOIN.
With that in mind, what I've thought of is to skip joining on the From and To IDs, and instead use custom function calls to look up required values from the related tables, like (pseudo-code):
GetUserName(uuid) : citext
...and so on for other values of interest in this and other tables...
The function would return '' when the UUID is 0000etc.
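A rough sketch of what such a function might look like (the web_user table, its column names, and the use of the citext extension are assumptions on my part):
CREATE FUNCTION get_user_name(p_id uuid) RETURNS citext
LANGUAGE sql STABLE AS $$
    -- returns '' when there is no match (e.g. for the all-zeros placeholder UUID)
    SELECT coalesce(
        (SELECT u.user_name FROM web_user AS u WHERE u.id = p_id),
        ''::citext
    );
$$;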
I appreciate that this isn't the crispest question in the history of SO, and what I'm hoping for is pointers in a fruitful direction.
This smacks of “premature optimization” (which is a source of evil) based on something that you “remember reading”, so maybe some enlightenment about join optimization will help.
One rule of thumb that I follow in questions like this is to model things so that your queries become simple and natural. Experience shows that that often leads to good performance.
I assume that the table you show is the fact table of a star schema, and the foreign keys point to the many dimension tables, so that your query will look like
SELECT ...
FROM fact
JOIN dim1 ON fact.dim1_id = dim1.id
JOIN dim2 ON fact.dim2_id = dim2.id
JOIN dim3 ON fact.dim3_id = dim3.id
...
WHERE dim1.col1 = ...
AND dim2.col2 BETWEEN ... AND ...
AND dim3.col3 < ...
...
Now PostgreSQL will by default only consider all join permutations of the first eight tables (join_collapse_limit), and the rest of the tables are just joined in the order in which they appear in the query.
Moreover, if the number of tables reaches the threshold of 12 (geqo_threshold), the genetic query optimizer takes over, a component that simulates evolution by mutation and survival of the fittest with randomly chosen execution plans (really!) and consequently doesn't always come up with the same execution plan for the same query.
So my advice would be to write the queries in a way that the first seven dimension tables are the ones with the biggest chance of reducing the number of result rows most significantly (based on the WHERE conditions). You can also increase join_collapse_limit, because if your queries take a long time to run anyway, you can easily afford to let the planner spend more time thinking about the best plan.
Then you'd set geqo = off to disable the genetic query optimizer.
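For reference, both knobs are ordinary configuration parameters and can be changed per session, for example:
SET join_collapse_limit = 32;  -- let the planner reorder joins beyond the default of 8 tables
SET geqo = off;                -- disable the genetic query optimizer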
If you design your queries according to these principles, you should be able to get good execution plans without messing up the data model.

Use case for hstore against multiple columns

I'm having some troubles deciding on which approach to use.
I have several entity "types", let's call them A, B and C, which share a certain number of attributes (about 10-15). I created a table called ENTITIES, with a column for each of the common attributes.
A, B and C also have some (mostly) unique attributes (all boolean, roughly 10 to 30 of them each).
I'm unsure what is the best approach to follow in modelling the tables:
Create a column in the ENTITIES table for each attribute, meaning that entity types that don't share that attribute will just have a null value.
Use separate tables for the unique attributes of each entity type, which is a bit harder to manage.
Use an hstore column, each entity will store its unique flags in this column.
???
I'm inclined to use 3, but I'd like to know if there's a better solution.
(4) Inheritance
The cleanest style from a database-design point-of-view would probably be inheritance, like #yieldsfalsehood suggested in his comment. Here is an example with more information, code and links:
Select (retrieve) all records from multiple schemas using Postgres
The current implementation of inheritance in Postgres has a number of limitations, though. Among others, you cannot define a common foreign key constraint for all inheriting tables. Read the last chapter about caveats carefully.
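A minimal sketch of what option (4) could look like (column names invented for illustration):
CREATE TABLE entities (
    entity_id   serial PRIMARY KEY,
    common_attr text
);
CREATE TABLE entity_a (
    flag_a1 boolean,
    flag_a2 boolean
) INHERITS (entities);
-- SELECT * FROM entities; also returns the rows of entity_a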
(3) hstore, json (pg 9.2+) / jsonb (pg 9.4+)
A good alternative for lots of different or a changing set of attributes, especially since you can even have functional indices on attributes inside the column:
unique index or constraint on hstore key
Index for finding an element in a JSON array
jsonb indexing in Postgres 9.4
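For illustration, assuming ENTITIES gets a jsonb column named attrs (my assumption, not from the question), such indexes could look like:
CREATE INDEX entities_flag_x_idx ON entities ((attrs ->> 'flag_x'));  -- expression index on one attribute
CREATE INDEX entities_attrs_idx  ON entities USING gin (attrs);       -- GIN index for containment queries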
(2) EAV (entity-attribute-value)
EAV type of storage has its own set of advantages and disadvantages. This question on dba.SE provides a very good overview.
(1) One table with lots of columns
It's the simple, kind of brute-force alternative. Judging from your description, you would end up with around 100 columns, most of them boolean and most of them NULL most of the time. Add a column entity_id to mark the type. Enforcing constraints per type is a bit awkward with lots of columns. I wouldn't bother with too many constraints that might not be needed.
The maximum number of columns allowed is 1600, and even with most of the columns being NULL, this upper limit still applies. As long as you keep it down to 100 - 200 columns, I wouldn't worry. NULL storage is very cheap in Postgres (basically 1 bit per column, though it's a bit more complex than that). That's only around 10 - 20 bytes extra per row. Contrary to what one might assume (!), that is most probably much smaller on disk than the hstore solution.
While such a table looks monstrous to the human eye, it is no problem for Postgres to handle. RDBMSes specialize in brute force. You might define a set of views (for each type of entity) on top of the base table with just the columns of interest and work with those where applicable. That's like the reverse approach of inheritance. But this way you can have common indexes and foreign keys etc. Not that bad. I might do that.
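A rough sketch of such a per-type view (assuming the type-marker column is called entity_type and borrowing a couple of the boolean flags; all names are illustrative):
CREATE VIEW entity_a_view AS
SELECT entity_id, common_attr, flag_a1, flag_a2
FROM   entities
WHERE  entity_type = 'A';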
All that said, the decision is still yours. It all depends on the details of your requirements.
In my line of work, we have rapidly changing requirements, and we rarely get downtime for proper schema upgrades. Having done both the big record with lots of nulls and the highly normalized (name, value) approach, I've been thinking that it might be nice to have all the common attributes in proper columns, and the different/less common ones in an hstore or jsonb bucket for the rest.

ETL Process when and how to add in Foreign Keys T-SQL SSIS

I am in the early stages of creating a Data Warehouse based loosely on the Kimball methodology.
I am currently investigating my source data. I understand that by adding a Primary key (not a natural key) I will then be able to make the connections between the facts and dimensions.
Sounds like a silly question but how exactly is this done? Are there any good articles that run through this process?
I would imagine we bring in all of the Dimensions first. And when the fact data is brought over, a lookup is performed that "pushes" the Foreign key into the Fact table? At what point is this done? Within SSIS what is the "best practice" method? Is this all done in one package, for example?
Is that roughly how it happens?
In this case do we have to be particularly careful about the order in which we load our data, or could we end up loading facts for which there is no corresponding dimension?
I would imagine we bring in all of the Dimensions first. And when the fact data is brought over, a lookup is performed that "pushes" the Foreign key into the Fact table? At what point is this done? Within SSIS what is the "best practice" method? Is this all done in one package, for example?
It would depend on your schema and table design.
Assuming it's a star schema and the FK is based on the data value itself:
DIM1 <- FACT1 -> DIM2
  ^
  |
FACT2 -> DIM3
you'll first fill DIM1 and DIM2 before inserting into FACT1 as you would need the FK.
Assuming it's a snowflake schema:
DIM1_1
  ^
  |
DIM1 <- FACT1 -> DIM2
you'll first fill DIM1_1 then DIM1 and DIM2 before inserting into FACT1.
Assuming the FK relation is based on something else (mostly a number) instead of the data value itself (kinda an optimization when dealing with huge amounts of data and/or strings as dimension values), you won't need to wait until you insert the data into the DIM table. I'm sure it's very confusing :), so I'll try to explain in short. The steps involved would be something like this (assume a simple star schema with 2 tables, FACT1 and DIMENSION1):
Extract FACT and DIMENSION values from the data set you are processing.
Generate a unique number based on the DIMENSION's value (which, say, is a string), using a reproducible algorithm (e.g. SHA1: given the same string, it always gives the same number).
Insert into FACT1 table, the number and FACT values.
Insert into DIMENSION1 table, the number and DIMENSION values.
Steps 3 & 4 can be done in parallel, as long as there is NO constraint in place. A join on a numeric column would be more efficient than one on a string.
And there is no need to store the mapping for #2 because it's reproducible (just ensure you pick the right algo).
Obviously this can be extended for snowflake schema and/or multiple dimensions.
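A minimal T-SQL sketch of steps 2-4 (the Staging, FACT1 and DIMENSION1 table and column names are invented for illustration; DimKey is assumed to be varbinary(20), since HASHBYTES('SHA1', ...) returns 20 bytes):
-- steps 2+3: load the fact rows with a key derived reproducibly from the dimension value
INSERT INTO FACT1 (DimKey, FactValue)
SELECT HASHBYTES('SHA1', s.DimValue), s.FactValue
FROM   Staging AS s;
-- steps 2+4: load the dimension with the same derived key; can run in parallel with the fact load
INSERT INTO DIMENSION1 (DimKey, DimValue)
SELECT DISTINCT HASHBYTES('SHA1', s.DimValue), s.DimValue
FROM   Staging AS s;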
HTH

How to alter Postgres table data based on its contents?

This is probably a super simple question, but I'm struggling to come up with the right keywords to find it on Google.
I have a Postgres table that has among its contents a column of type text named content_type. That stores what type of entry is stored in that row.
There are only about 5 different types, and I decided I want to change one of them to display as something else in my application (I had been directly displaying these).
It struck me that it's funny that my view is being dictated by my database model, and I decided I would convert the types being stored in my database as strings into integers, and enumerate the possible types in my application with constants that convert them into their display names. That way, if I ever got the urge to change any category names again, I could just change it with one alteration of a constant. I also have the hunch that storing integers might be somewhat more efficient than storing text in the database.
First, a quick threshold question of, is this a good idea? Any feedback or anything I missed?
Second, and my main question, what's the Postgres command I could enter to make an alteration like this? I'm thinking I could start by renaming the old content_type column to old_content_type and then creating a new integer column content_type. However, what command would look at a row's old_content_type and fill in the new content_type column based off of that?
If you're finding that you need to change the display values, then yes, it's probably a good idea not to store them in a database. Integers are also more efficient to store and search, but I really wouldn't worry about it unless you've got millions of rows.
You just need to run an update to populate your new column:
update table_name
set content_type = (case when old_content_type = 'a' then 1
                         when old_content_type = 'b' then 2
                         else 3 end);
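For completeness, the surrounding DDL the question describes might look like this (assuming the table really is called table_name); the update above then backfills the new column:
alter table table_name rename column content_type to old_content_type;
alter table table_name add column content_type integer;
-- run the update shown above, then optionally:
alter table table_name drop column old_content_type;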
If you're on Postgres 8.4 then using an enum type instead of a plain integer might be a good idea.
Ideally you'd have these fields referring to a table containing the definitions of type. This should be via a foreign key constraint. This way you know that your database is clean and has no invalid values (i.e. referential integrity).
There are many ways to handle this:
Having a table for each field that can contain a number of values (i.e. like an enum) is the most obvious - but it breaks down when you have a table that requires many attributes.
You can use the Entity-attribute-value model, but beware that it is all too easy to abuse and causes problems when things grow.
You can use, or refer to, my implementation solution PET (Parameter Enumeration Tables). This is a halfway house between 1 & 2.
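For the record, a minimal sketch of the lookup-table-plus-foreign-key idea mentioned above (the table, column, and constraint names here are placeholders):
create table content_types (
    content_type_id integer primary key,
    display_name    text not null
);
alter table table_name
    add constraint content_type_fk
    foreign key (content_type) references content_types (content_type_id);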

Relations With No Attributes

Aheo asks if it is ok to have a table with just one column. How about one with no columns, or, given that this seems difficult to do in most modern "relational" DBMSes, a relation with no attributes?
There are exactly two relations with no attributes, one with an empty tuple, and one without. In The Third Manifesto, Date and Darwen (somewhat) humorously name them TABLE_DEE and TABLE_DUM (respectively).
They are useful to the extent that they are the identity of a variety of relational operators, playing roles equivalent to 1 and 0 in ordinary algebra.
A table with a single column is a set -- as long as you don't care about ordering the values, or associating any other info with them, it seems fine. You can check for membership in it, and basically that's all you can do. (If you don't have a UNIQUE constraint on the single column, I guess you could also count the number of occurrences... a multiset).
But what in blazes would a table with no columns (or a relation with no attributes) mean -- or, how would it be any good?!
DEE and cartesian product form a monoid. In practice, if you have Date's relational summarize operator, you'd use DEE as your grouping relation to obtain grand-totals. There are many other examples where DEE is practically useful, e.g. in a functional setting with a binary join operator you'd get n-ary join = foldr join dee
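Loosely translated to SQL (the orders table and amount column are made up for illustration): an aggregate query with no GROUP BY groups over the single empty group, which is the closest everyday analogue of summarizing per DEE to get grand totals.
SELECT count(*) AS n, sum(amount) AS total
FROM   orders;   -- always returns exactly one row, even when orders is empty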
"There are exactly two relations with no attributes, one with an empty tuple, and one without. In The Third Manifesto, Date and Darwen (somewhat) humorously name them TABLE_DEE and TABLE_DUM (respectively).
They are useful to the extent that they are the identity of a variety of relational operators, playing roles equivalent to 1 and 0 in ordinary algebra."
And of course they also play the role of "TRUE" and "FALSE" in boolean algebra. Meaning that they are useful when propositions such as "The shop is open" and "The alarm is set" are to be represented in a database.
A consequence of this is that they can also be usefully employed in any expression of the relational algebra for their properties of "acting as an IF/ELSE": joining to TABLE_DUM means retaining no tuples at all from the other argument, joining to TABLE_DEE means retaining them all. So joining R to a relvar S which can be equal to either TABLE_DEE or TABLE_DUM is the RA equivalent of "if S then R else FI", with FI standing for the empty relation.
Hm. So the lack of "real-world examples" got to me, and I tried my best. Perhaps surprisingly, I got half way there!
cjs=> CREATE TABLE D ();
CREATE TABLE
cjs=> SELECT COUNT (*) FROM D;
count
-------
0
(1 row)
cjs=> INSERT INTO D () VALUES ();
ERROR: syntax error at or near ")"
LINE 1: INSERT INTO D () VALUES ();
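For what it's worth, the missing half may be reachable with DEFAULT VALUES, which needs no column list at all (a hedged sketch continuing the transcript above):
INSERT INTO D DEFAULT VALUES;
SELECT COUNT (*) FROM D;   -- should now report 1: one tuple with no attributes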
A table with a single column would make sense as a simple lookup. Let's say you have a list of strings you want to filter user-entered text against. That table would store the words you would want to filter out.
It is difficult to see the utility of TABLE_DEE and TABLE_DUM from an SQL database perspective. After all, it is not guaranteed that your favorite DB vendor allows you to create one or the other.
It is also difficult to see their utility in relational algebra alone; one has to look beyond that. To get a flavor of how these constants can come alive, consider relational algebra put into proper mathematical shape, that is, as close as possible to Boolean algebra. D&D's Algebra A is a step in this direction. Then one can express classic relational algebra operations via more fundamental ones, and those two constants become really handy.
It is also difficult to see utility of TABLE_DEE and TABLE_DUM in relational algebra. One have to look beyond that. To get you a flavor how these constants can come alive consider relational algebra put into proper mathematical shape, that is as close as it is possible to Boolean algebra. D&D Algebra A is a step in this direction. Then, one can express classic relational algebra operations via more fundamental ones and those two constants become really handy.