Use case for hstore against multiple columns - postgresql

I'm having trouble deciding which approach to use.
I have several entity "types", let's call them A, B and C, which share a certain number of attributes (about 10-15). I created a table called ENTITIES, with a column for each of the common attributes.
A, B and C also have some (mostly) unique attributes (all boolean, roughly 10 to 30).
I'm unsure which approach is best for modelling the tables:
1. Create a column in the ENTITIES table for each attribute, meaning that entity types that don't share that attribute will just have a null value.
2. Use separate tables for the unique attributes of each entity type, which is a bit harder to manage.
3. Use an hstore column; each entity will store its unique flags in this column.
4. ???
I'm inclined to use 3, but I'd like to know if there's a better solution.

(4) Inheritance
The cleanest style from a database-design point of view would probably be inheritance, like @yieldsfalsehood suggested in his comment. Here is an example with more information, code and links:
Select (retrieve) all records from multiple schemas using Postgres
The current implementation of inheritance in Postgres has a number of limitations, though. Among others, you cannot define a common foreign key constraint for all inheriting tables. Read the last chapter about caveats carefully.
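A minimal sketch of what the inheritance layout could look like. All table and column names here are made up for illustration; only the common columns live in the parent table.

```sql
-- Hypothetical names; the parent holds the attributes shared by all types.
CREATE TABLE entities (
    entity_id serial PRIMARY KEY,
    name      text NOT NULL
);

-- Each child table adds only its own boolean flags.
CREATE TABLE entities_a (
    is_flag_a1 boolean,
    is_flag_a2 boolean
) INHERITS (entities);

CREATE TABLE entities_b (
    is_flag_b1 boolean
) INHERITS (entities);

INSERT INTO entities_a (name, is_flag_a1) VALUES ('first A', true);
INSERT INTO entities_b (name, is_flag_b1) VALUES ('first B', false);

-- Querying the parent returns rows from all children (common columns only).
SELECT entity_id, name FROM entities;

-- ONLY restricts the query to rows stored in the parent table itself.
SELECT entity_id, name FROM ONLY entities;
```

Note that the PRIMARY KEY declared on the parent is not enforced across the child tables, which is one of the caveats mentioned above.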
(3) hstore, json (pg 9.2+) / jsonb (pg 9.4+)
A good alternative for lots of different or a changing set of attributes, especially since you can even have functional indices on attributes inside the column:
unique index or constraint on hstore key
Index for finding an element in a JSON array
jsonb indexing in Postgres 9.4
EAV type of storage has its own set of advantages and disadvantages. This question on dba.SE provides a very good overview.
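As a rough sketch of the hstore variant with indexes on the flags: the table, column and key names below are assumptions for illustration, and the hstore extension must be available on the server.

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

-- Hypothetical table: common attributes as columns, type-specific flags in hstore.
CREATE TABLE entities_hs (
    entity_id   serial PRIMARY KEY,
    entity_type text   NOT NULL,
    name        text   NOT NULL,
    flags       hstore NOT NULL DEFAULT ''
);

INSERT INTO entities_hs (entity_type, name, flags) VALUES
    ('A', 'first A', 'is_archived=>true, is_public=>false'),
    ('B', 'first B', 'is_fragile=>true');

-- Expression index on a single key inside the hstore column.
CREATE INDEX entities_hs_is_archived_idx ON entities_hs ((flags -> 'is_archived'));

-- GIN index to support containment (@>) and key-existence (?) searches.
CREATE INDEX entities_hs_flags_gin_idx ON entities_hs USING gin (flags);

-- Matches the expression index (hstore values are text).
SELECT name FROM entities_hs WHERE flags -> 'is_archived' = 'true';

-- Can use the GIN index.
SELECT name FROM entities_hs WHERE flags @> 'is_fragile=>true';
```

Keep in mind that hstore stores every value as text, so boolean flags have to be compared as strings or cast explicitly.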
(1) One table with lots of columns
It's the simple, kind of brute-force alternative. Judging from your description, you would end up with around 100 columns, most of them boolean and most of them NULL most of the time. Add a column entity_id to mark the type. Enforcing constraints per type is a bit awkward with lots of columns. I wouldn't bother with too many constraints that might not be needed.
The maximum number of columns allowed is 1600. With most of the columns being NULL, this upper limit applies. As long as you keep it down to 100 - 200 columns, I wouldn't worry. NULL storage is very cheap in Postgres (basically 1 bit per column, but it's more complex than that.). That's only like 10 - 20 bytes extra per row. Contrary to what one might assume (!), most probably much smaller on disk than the hstore solution.
While such a table looks monstrous to the human eye, it is no problem for Postgres to handle. RDBMSes specialize in brute force. You might define a set of views (for each type of entity) on top of the base table with just the columns of interest and work with those where applicable. That's like the reverse approach of inheritance. But this way you can have common indexes and foreign keys etc. Not that bad. I might do that.
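The "views on top of one wide table" idea could be sketched like this. Column names are hypothetical, only three flags are shown instead of dozens, and the type marker is called entity_type here purely for readability.

```sql
-- Hypothetical wide table: every type-specific flag gets its own nullable column.
CREATE TABLE entities_wide (
    entity_id   serial PRIMARY KEY,
    entity_type text NOT NULL CHECK (entity_type IN ('A', 'B', 'C')),
    name        text NOT NULL,
    -- flags used only by type A
    a_flag_1    boolean,
    a_flag_2    boolean,
    -- flags used only by type B
    b_flag_1    boolean
);

-- One view per entity type, exposing just the columns of interest.
CREATE VIEW entities_a_v AS
SELECT entity_id, name, a_flag_1, a_flag_2
FROM   entities_wide
WHERE  entity_type = 'A';

CREATE VIEW entities_b_v AS
SELECT entity_id, name, b_flag_1
FROM   entities_wide
WHERE  entity_type = 'B';

INSERT INTO entities_wide (entity_type, name, a_flag_1) VALUES ('A', 'first A', true);
INSERT INTO entities_wide (entity_type, name, b_flag_1) VALUES ('B', 'first B', false);

SELECT * FROM entities_a_v;
```

Indexes and foreign keys are defined once on entities_wide and apply to every type, which is the advantage over inheritance described above.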
All that said, the decision is still yours. It all depends on the details of your requirements.

In my line of work, we have rapidly changing requirements, and we rarely get downtime for proper schema upgrades. Having done both the big record with lots of nulls and the highly normalized (name, value) design, I've been thinking that it might be nice to have all the common attributes in proper columns, and the different/less common ones in an "hstore" or jsonb bucket for the rest.

Related

Binary to binary cast with JSONb

How to avoid the unnecessary CPU cost?
See this historic question with failure tests. Example: j->'x' is a JSONb value representing a number and j->'y' one representing a boolean. From the first version of JSONb (released in 2014 with 9.4) until today (6 years!), with PostgreSQL v12, it seems that we still need to enforce a double conversion:
1. Discard the "binary JSONb number" information of j->'x' and transform it into the printable string j->>'x'; discard the "binary JSONb boolean" information of j->'y' and transform it into the printable string j->>'y'.
2. Parse the string to obtain a "binary SQL float" by casting it, (j->>'x')::float AS x; parse the string to obtain a "binary SQL boolean" by casting it, (j->>'y')::boolean AS y.
Is there no syntax or optimized function that lets a programmer enforce the direct conversion?
I don't see one in the guide... Or was it never implemented: is there a technical barrier to it?
NOTES about typical scenario where we need it
(responding to comments)
Imagine a scenario where your system needs to store many, many small datasets (real example!) with minimal disk usage, managing them all with centralized control/metadata/etc. JSONb is a good solution, and offers at least 2 good alternatives for storing them in the database:
1. Metadata (with schema descriptor) and the whole dataset in an array of arrays;
2. Separating metadata and table rows into two tables.
(and variations where metadata is translated to a cache of text[], etc.) Alternative-1, monolithic, is the best for the "minimal disk usage" requirement, and faster for full information retrieval. Alternative-2 can be the choice for random access or partial retrieval, when the table Alt2_DatasetLine also has one more column, like time, for time series.
You can create all the SQL VIEWs in a separate schema, for example
CREATE VIEW mydatasets.t1234 AS
  SELECT (j->>'d')::date AS d, j->>'t' AS t, (j->>'b')::boolean AS b,
         (j->>'i')::int AS i, (j->>'f')::float AS f
  FROM (
    SELECT jsonb_array_elements(j_alldata) j FROM Alt1_AllDataset
    WHERE dataset_id=1234
  ) t
  -- or FROM alt2...
;
And the CREATE VIEWs can all be automatic, running the SQL string dynamically ... we can reproduce the above "stable schema casting" by simple formatting rules, extracted from metadata:
SELECT string_agg( CASE
WHEN x[2]!='text' THEN format(E'(j->>\'%s\')::%s AS %s',x[1],x[2],x[1])
ELSE format(E'j->>\'%s\' AS %s',x[1],x[1])
END, ',' ) as x2
FROM (
SELECT regexp_split_to_array(trim(x),'\s+') x
FROM regexp_split_to_table('d date, t text, b boolean, i int, f float', ',') t1(x)
) t2;
... It's a "real life scenario"; this (apparently ugly) model is surprisingly fast for small-traffic applications. There are other advantages besides disk usage reduction: flexibility (you can change the dataset schema without needing to change the SQL schema) and scalability (2, 3, ... 1 billion different datasets in the same table).
Returning to the question: imagine a dataset with ~50 or more columns; the SQL VIEW would be faster if PostgreSQL offered a "binary to binary casting".
Short answer: No, there is no better way to extract a jsonb number as a PostgreSQL number than (for example)
CAST(j ->> 'attr' AS double precision)
A JSON number happens to be stored as PostgreSQL numeric internally, so that wouldn't work “directly” anyway. But there is no reason in principle why there could not be a more efficient way to extract such a value as numeric.
So, why don't we have that?
Nobody has implemented it. That is often an indication that nobody thought it worth the effort. I personally think that this would be a micro-optimization – if you want to go for maximum efficiency, you extract that column from the JSON and store it directly as column in the table.
It is not necessary to modify the PostgreSQL source to do this. It is possible to write your own C function that does exactly what you envision. If many people thought this was beneficial, I'd expect that somebody would already have written such a function.
PostgreSQL has just-in-time compilation (JIT). So if an expression like this is evaluated for a lot of rows, PostgreSQL will build executable code for that on the fly. That mitigates the inefficiency and makes it less necessary to have a special case for efficiency reasons.
It might not be quite as easy as it seems for many data types. JSON standard types don't necessarily correspond to PostgreSQL types in all cases. That may seem contrived, but look at this recent thread in the Hackers mailing list that deals with the differences between the numeric types between JSON and PostgreSQL.
All of the above are not reasons that such a feature could never exist, I just wanted to give reasons why we don't have it.
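One thing worth checking against your server version: as far as I'm aware, PostgreSQL 11 added casts from jsonb scalars to boolean and to the numeric types, so the cast can be applied to the result of -> without going through ->>. A small sketch comparing the two spellings (the sample document is made up, and whether the direct cast is measurably faster is not something this shows):

```sql
-- Assumes PostgreSQL 11 or later for the direct jsonb -> number/boolean casts.
WITH t(j) AS (
    SELECT '{"x": 1.5, "y": true}'::jsonb
)
SELECT (j ->> 'x')::float   AS x_via_text,  -- text round trip, as in the question
       (j ->  'x')::float   AS x_direct,    -- cast applied to the jsonb value itself
       (j ->> 'y')::boolean AS y_via_text,
       (j ->  'y')::boolean AS y_direct
FROM t;
```

The direct cast raises an error if the jsonb value is not of the matching JSON type (for example, casting a JSON string to float), so it is stricter than the text round trip.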

Postgres case sensitivity : Implications of Lower()

I have run into the issue of case sensitive searching in postgres, and have started to deal with this by using LOWER on each side of every WHERE test.
So far so good. However, I understand that in order to make use of indexes, they should be created using LOWER too, which makes sense.
However, what of the PK? Presumably these are not going to be effective, because it does not seem possible to create a PK using a function on the chosen PK field. Isn't this a concern for any filtering or joining which is done on PKs?
Is there a way of working around this?
Here are some thoughts on this subject.
First, you could add a constraint for any column requiring that the data stored be lower case. That would solve the problem inside the database.
Second, you could use a trigger to convert any value to lower case.
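Third, you can use ilike. This can make use of indexes for case-insensitive searches, but only trigram indexes (from the pg_trgm extension); a plain B-tree index generally does not help with ilike.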
And fourth, if all your primary keys are synthetic numeric primary keys, then you don't need to worry about case sensitivity.
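The first two suggestions could look roughly like this. The table, constraint, function and trigger names are invented for the example; you would normally pick one of the two, but they also work together because BEFORE triggers run before CHECK constraints are evaluated.

```sql
-- Hypothetical table with a text primary key that must always be lower case.
CREATE TABLE users_demo (
    username text PRIMARY KEY,
    CONSTRAINT username_is_lower CHECK (username = lower(username))
);

-- Trigger function that folds the key to lower case before it is stored.
CREATE FUNCTION users_demo_lower_username() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    NEW.username := lower(NEW.username);
    RETURN NEW;
END
$$;

CREATE TRIGGER users_demo_lower
BEFORE INSERT OR UPDATE ON users_demo
FOR EACH ROW EXECUTE PROCEDURE users_demo_lower_username();

-- Stored as 'alice'; without the trigger this insert would fail the CHECK.
INSERT INTO users_demo VALUES ('Alice');

SELECT username FROM users_demo;
```

With the data normalized like this, the ordinary primary key index can be used as long as the search value is lower-cased on the application side or in the query.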
You can still create a functional index on PK (even consisting of many columns):
CREATE TABLE test(a text, b text, c text, d text, primary key (a,b,c));
CREATE INDEX ON test (lower(a), lower(b), lower(c));
Though, it sounds like there is need for some data improvement operations to be done if you are experiencing this kind of behaviour almost everywhere in your database (like store everything in lower case).
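A sketch of how such an index would be used in a query. The table name test_ci and the sample row are assumptions, and making the index UNIQUE is an extra step not in the answer above; it additionally rejects keys that differ only in case.

```sql
CREATE TABLE test_ci (a text, b text, c text, d text, PRIMARY KEY (a, b, c));

-- Expression index; UNIQUE also enforces case-insensitive uniqueness of the key.
CREATE UNIQUE INDEX test_ci_lower_idx ON test_ci (lower(a), lower(b), lower(c));

INSERT INTO test_ci VALUES ('Foo', 'Bar', 'Baz', 'payload');

-- The WHERE clause has to use the same expressions as the index
-- for the planner to consider it.
SELECT d
FROM   test_ci
WHERE  lower(a) = lower('FOO')
AND    lower(b) = lower('bar')
AND    lower(c) = lower('BAZ');
```

Joins on the key work the same way: compare lower(t1.a) = lower(t2.a) on both sides and index both tables on the expression.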

Why to create empty (no rows, no columns) table in PostgreSQL

In an answer to this question I've learned that you can create an empty table in PostgreSQL.
create table t();
Is there any real use case for this? Why would you create an empty table? Because you don't know what columns it will have?
These are the things that, from my point of view, a column-less table is good for. They probably fall more into the warm and fuzzy category.
1. One practical use of creating a table before you add any user-defined columns to it is that it allows you to iterate fast when creating a new system or just doing rapid dev iterations in general.
2. Kind of more of 1, but it lets you stub out tables that your app logic or procedures can make reference to, even if the columns have yet to be put in place.
3. I could see it coming in handy in a case where you're at a big company with lots of developers. Maybe you want to reserve a name months in advance before your work is complete. Just add the new column-less table to the build. Of course they could still hijack it, but you may be able to win the argument that you had it in use well before they came along with their other plans. Kind of fringe, but a valid benefit.
All of these are handy and I miss them when I'm not working in PostgreSQL.
I don't know the precise reason for its inclusion in PostgreSQL, but a zero-column table - or rather a zero-attribute relation - plays a role in the theory of relational algebra, on which SQL is (broadly) based.
Specifically, a zero-attribute relation with no tuples (in SQL terms, a table with no columns and no rows) is the relational equivalent of zero or false, while a relation with no attributes but one tuple (SQL: no columns, but one row, which isn't possible in PostgreSQL as far as I know) is true or one. Hugh Darwen, an outspoken advocate of relational theory and critic of SQL, dubbed these "Table Dum" and "Table Dee", respectively.
In normal algebra x + 0 == x and x * 0 == 0, whereas x * 1 == x; the idea is that in relational algebra, Table Dum and Table Dee can be used as similar primitives for joins, unions, etc.
PostgreSQL internally refers to tables (as well as views and sequences) as "relations", so although it is geared around implementing SQL, which isn't defined by this kind of pure relation algebra, there may be elements of that in its design or history.
It is not an empty table, only an empty result. PostgreSQL rows contain some columns that are invisible by default. I am not sure, but it may be an artifact from the dark ages, when Postgres was an object-relational database and supported the POSTQUEL language. Such an empty table can work as an abstract ancestor in a class hierarchy.
List of system columns
I don't think mine is the intended usage, but recently I've used an empty table as a lock for a view which I create and change dynamically with EXECUTE. The function which creates/replaces the view takes ACCESS EXCLUSIVE on the empty table, and the other functions which use the view take ACCESS SHARE.
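A rough sketch of that locking pattern, written as plain transactions rather than functions. The names view_lock and dynamic_v are hypothetical, and the fixed view definition stands in for whatever the EXECUTE'd statement would build.

```sql
-- Column-less table used purely as a lock object.
CREATE TABLE view_lock ();

-- Writer: the code path that (re)creates the view.
BEGIN;
LOCK TABLE view_lock IN ACCESS EXCLUSIVE MODE;      -- blocks readers until COMMIT
CREATE OR REPLACE VIEW dynamic_v AS SELECT 1 AS n;  -- placeholder definition
COMMIT;

-- Reader: any code path that uses the view.
BEGIN;
LOCK TABLE view_lock IN ACCESS SHARE MODE;          -- waits while a writer holds the lock
SELECT * FROM dynamic_v;
COMMIT;
```

Table locks are held until the end of the transaction, so both sides have to do their work inside the same transaction that took the lock.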

How should a table with two sets of almost duplicate column names be designed?

I have a table that has around 40 columns. The only difference in the column names is that the last 20 all start with "B" before the column name. This table is used for comparing. In other words, we compare the data in the first 20 columns to the data in the last 20 columns.
I know this is very bad design, so how should this table be redesigned, so that there are only 20 columns, yet we can still compare the data?
EDIT: if it helps, we also use this data to find a matched cohort
Also note that performance is the main concern here. By duplicating the columns, retrieving the data is extremely fast.
Thanks!
Two possible architectures and a query tip.
1) Build your table with a "Type" column, and use that to flag "primary" vs. "alternate". In your case, "A" vs. "B" might be appropriate.
2) Build a vertical partition, two identical tables (for primary and alternate data), that share a common primary key. (If Id = 42 is in one table, it must be in the other--unless "alternate" data is optional, in which case don't populate the second table.) Also optionally, have a third table that tracks all possible primary keys, along with any data that is known to always be common to both tables.
Tip: Read up on SELECT...EXCEPT and SELECT...INTERSECT. They run disturbingly quickly, and are ideal for comparing all columns and rows between two datasets for differences (except) and matches (intersect). You can use this fairly easily with either of the two structures, and it would work with your existing code as well (though it might be fussier to write the query).
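To illustrate the tip with the two-table layout from option 2: the tables, columns and rows below are made up, and only three columns are used instead of twenty. EXCEPT and INTERSECT are standard SQL; the statements here are written for PostgreSQL.

```sql
-- Hypothetical "primary" and "alternate" tables sharing the same key and columns.
CREATE TABLE cohort_primary   (id int PRIMARY KEY, age int, score int);
CREATE TABLE cohort_alternate (id int PRIMARY KEY, age int, score int);

INSERT INTO cohort_primary   VALUES (1, 30, 7), (2, 41, 5), (3, 28, 9);
INSERT INTO cohort_alternate VALUES (1, 30, 7), (2, 41, 6), (3, 28, 9);

-- Differences: rows in primary that have no identical row in alternate.
SELECT * FROM cohort_primary
EXCEPT
SELECT * FROM cohort_alternate;

-- Matches: rows that are identical in both tables.
SELECT * FROM cohort_primary
INTERSECT
SELECT * FROM cohort_alternate;
```

With this sample data, the EXCEPT query returns only the row with id 2 (its score differs), and the INTERSECT query returns the rows with ids 1 and 3.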

How to alter Postgres table data based on its contents?

This is probably a super simple question, but I'm struggling to come up with the right keywords to find it on Google.
I have a Postgres table that has among its contents a column of type text named content_type. That stores what type of entry is stored in that row.
There are only about 5 different types, and I decided I want to change one of them to display as something else in my application (I had been directly displaying these).
It struck me that it's funny that my view is being dictated by my database model, and I decided I would convert the types being stored in my database as strings into integers, and enumerate the possible types in my application with constants that convert them into their display names. That way, if I ever got the urge to change any category names again, I could just change it with one alteration of a constant. I also have the hunch that storing integers might be somewhat more efficient than storing text in the database.
First, a quick threshold question of, is this a good idea? Any feedback or anything I missed?
Second, and my main question, what's the Postgres command I could enter to make an alteration like this? I'm thinking I could start by renaming the old content_type column to old_content_type and then creating a new integer column content_type. However, what command would look at a row's old_content_type and fill in the new content_type column based off of that?
If you're finding that you need to change the display values, then yes, it's probably a good idea not to store them in a database. Integers are also more efficient to store and search, but I really wouldn't worry about it unless you've got millions of rows.
You just need to run an update to populate your new column:
update table_name set content_type = (case when old_content_type = 'a' then 1
when old_content_type = 'b' then 2 else 3 end);
If you're on Postgres 8.3 or later (enum types were added in 8.3), then using an enum type instead of a plain integer might be a good idea.
Ideally you'd have these fields referring to a table containing the definitions of type. This should be via a foreign key constraint. This way you know that your database is clean and has no invalid values (i.e. referential integrity).
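A sketch of that lookup-table approach applied to the migration in the question. The entries table, its sample rows and the content_types table are assumptions; only the content_type / old_content_type column names come from the question.

```sql
-- Stand-in for the existing table with a free-text type column.
CREATE TABLE entries (entry_id serial PRIMARY KEY, content_type text NOT NULL);
INSERT INTO entries (content_type) VALUES ('a'), ('b'), ('a');

-- Lookup table holding the allowed types and their current names.
CREATE TABLE content_types (
    content_type_id int  PRIMARY KEY,
    display_name    text NOT NULL UNIQUE
);
INSERT INTO content_types VALUES (1, 'a'), (2, 'b');

-- Keep the old text values around while the new integer column is filled.
ALTER TABLE entries RENAME COLUMN content_type TO old_content_type;
ALTER TABLE entries ADD COLUMN content_type int REFERENCES content_types;

-- Populate the new column by joining on the old text value.
UPDATE entries e
SET    content_type = ct.content_type_id
FROM   content_types ct
WHERE  ct.display_name = e.old_content_type;

-- Once every row is mapped, tighten the column and drop the old one.
ALTER TABLE entries ALTER COLUMN content_type SET NOT NULL;
ALTER TABLE entries DROP COLUMN old_content_type;
```

Renaming a category then becomes a single UPDATE on content_types, and the foreign key rejects any type id that is not defined there.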
There are many ways to handle this:
1. Having a table for each field that can contain a number of values (i.e. like an enum) is the most obvious, but it breaks down when you have a table that requires many attributes.
2. You can use the Entity-attribute-value model, but beware that this is too easy to abuse and can cause problems when things grow.
3. You can use, or refer to, my implementation solution PET (Parameter Enumeration Tables). This is a halfway house between 1 & 2.