Postgres ENUM data type or CHECK CONSTRAINT? - postgresql

I have been migrating a MySQL db to Pg (9.1), and have been emulating MySQL ENUM data types by creating a new data type in Pg, and then using that as the column definition. My question -- could I, and would it be better to, use a CHECK CONSTRAINT instead? The MySQL ENUM types are implemented to enforce that only specific values are entered in the rows. Could that be done with a CHECK CONSTRAINT? And, if yes, would it be better (or worse)?

Based on the comments and answers here, and some rudimentary research, I have the following summary to offer for comments from the Postgres-erati. I will really appreciate your input.
There are three ways to restrict entries in a Postgres database table column. Consider a table to store "colors" where you want only 'red', 'green', or 'blue' to be valid entries.
Enumerated data type
CREATE TYPE valid_colors AS ENUM ('red', 'green', 'blue');
CREATE TABLE t (
color VALID_COLORS
);
Advantages are that the type can be defined once and then reused in as many tables as needed. A standard query can list all the values for an ENUM type, and can be used to make application form widgets.
SELECT n.nspname AS enum_schema,
t.typname AS enum_name,
e.enumlabel AS enum_value
FROM pg_type t JOIN
pg_enum e ON t.oid = e.enumtypid JOIN
pg_catalog.pg_namespace n ON n.oid = t.typnamespace
WHERE t.typname = 'valid_colors'
enum_schema | enum_name | enum_value
-------------+---------------+------------
public | valid_colors | red
public | valid_colors | green
public | valid_colors | blue
Disadvantages are that the ENUM type is stored in the system catalogs, so a query like the one above is required to view its definition; these values are not apparent when viewing the table definition. And, since an ENUM type is actually a data type separate from the built-in NUMERIC and TEXT data types, the regular numeric and string operators and functions don't work on it. So, one can't do a query like
SELECT * FROM t WHERE color LIKE 'bl%';
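(A workaround, if such a query is ever needed, is to cast the enum value to text first; a minimal sketch against the table t above:)
SELECT * FROM t WHERE color::text LIKE 'bl%';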
Check constraints
CREATE TABLE t (
colors TEXT CHECK (colors IN ('red', 'green', 'blue'))
);
Two advantages are that, one, "what you see is what you get," that is, the valid values for the column are recorded right in the table definition, and two, all native string or numeric operators work.
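For example, the LIKE query that fails on the ENUM column works directly here (a sketch against the table t above):
SELECT * FROM t WHERE colors LIKE 'bl%';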
Foreign keys
CREATE TABLE valid_colors (
id SERIAL PRIMARY KEY NOT NULL,
color TEXT
);
INSERT INTO valid_colors (color) VALUES
('red'),
('green'),
('blue');
CREATE TABLE t (
color_id INTEGER REFERENCES valid_colors (id)
);
Essentially the same as creating an ENUM type, except that the native numeric or string operators work, and one doesn't have to query system catalogs to discover the valid values. A join is required to link the color_id to the desired text value.
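For example, listing the rows of t with their text values needs something like the following join (a sketch against the tables above):
SELECT t.color_id, v.color
FROM t
JOIN valid_colors v ON v.id = t.color_id;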

As other answers point out, check constraints have flexibility issues, but setting a foreign key on an integer id requires joining during lookups. Why not just use the allowed values as natural keys in the reference table?
To adapt the schema from punkish's answer:
CREATE TABLE valid_colors (
color TEXT PRIMARY KEY
);
INSERT INTO valid_colors (color) VALUES
('red'),
('green'),
('blue');
CREATE TABLE t (
color TEXT REFERENCES valid_colors (color) ON UPDATE CASCADE
);
Values are stored inline as in the check constraint case, so there are no joins, but new valid value options can be easily added and existing value instances can be updated via ON UPDATE CASCADE (e.g. if it's decided "red" should actually be "Red", update valid_colors accordingly and the change propagates automatically).
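A minimal sketch of that rename, assuming the schema above:
UPDATE valid_colors SET color = 'Red' WHERE color = 'red';
-- rows in t that referenced 'red' now read 'Red', thanks to ON UPDATE CASCADE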

One of the big disadvantages of Foreign keys vs Check constraints is that any reporting or UI displays will have to perform a join to resolve the id to the text.
In a small system this is not a big deal, but if you are working on an HR or similar system with very many small lookup tables, then this can be a very big deal, with lots of joins taking place just to get the text.
My recommendation would be that if the values are few and rarely changing, then use a constraint on a text field otherwise use a lookup table against an integer id field.

PostgreSQL has enum types, and they work as they should. I don't know if an enum is "better" than a constraint; they both just work.

From my point of view, given the same set of values:
An enum is a better solution if you will use it on multiple columns.
If you want to limit the values of only one column in your application, a check constraint is a better solution.
Of course, there are a whole lot of other parameters which could creep into your decision process (typically, the fact that built-in operators are not available), but I think these two are the most prevalent ones.

I'm hoping somebody will chime in with a good answer from the PostgreSQL database side as to why one might be preferable to the other.
From a software developer point of view, I have a slight preference for using check constraints, since PostgreSQL enums require a cast in your SQL to do an update/insert, such as:
INSERT INTO table1 (colA, colB) VALUES('foo', 'bar'::myenum)
where "myenum" is the enum type you specified in PostgreSQL.
This certainly makes the SQL non-portable (which may not be a big deal for most people), but also is just yet another thing you have to deal with while developing applications, so I prefer having VARCHARs (or other typical primitives) with check constraints.
As a side note, I've noticed that MySQL enums do not require this type of cast, so this is something particular to PostgreSQL in my experience.

Related

Can the foreign data wrapper fdw_postgres handle the GEOMETRY data type of PostGIS?

I am accessing data from a different DB via fdw_postgres. It works well:
CREATE FOREIGN TABLE fdw_table
(
name TEXT,
area double precision,
use TEXT,
geom GEOMETRY
)
SERVER foreign_db
OPTIONS (schema_name 'schema_A', table_name 'table_B')
However, when I query for the data_type of the fdw_table I get the following result:
name text
area double precision
use text
geom USER-DEFINED
Can fdw_postgres not handle the GEOMETRY data type of PostGIS? What does USER-DEFINED mean in this context?
From the documentation on the data_type column:
Data type of the column, if it is a built-in type, or ARRAY if it is
some array (in that case, see the view element_types), else
USER-DEFINED (in that case, the type is identified in udt_name and
associated columns).
So this is not specific to FDWs; you'd see the same definition for a physical table.
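If you want the actual type name rather than USER-DEFINED, you can query udt_name as the documentation suggests; a sketch (assuming the foreign table is named fdw_table as above):
SELECT column_name, data_type, udt_name
FROM information_schema.columns
WHERE table_name = 'fdw_table';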
postgres_fdw can handle custom datatypes just fine, but there is currently one caveat: if you query the foreign table with a WHERE condition involving a user-defined type, it will not push this condition to the foreign server.
In other words, if your WHERE clause only references built-in types, e.g.:
SELECT *
FROM fdw_table
WHERE name = $1
... then the WHERE clause will be sent to the foreign server, and only the matching rows will be retrieved. But when a user-defined type is involved, e.g.:
SELECT *
FROM fdw_table
WHERE geom = $1
... then the entire table is retrieved from the foreign server, and the filtering is performed locally.
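You can check what actually gets pushed down with EXPLAIN VERBOSE, which shows the query postgres_fdw sends to the remote side; a sketch against the foreign table above:
EXPLAIN VERBOSE
SELECT * FROM fdw_table WHERE name = 'foo';
-- look at the "Remote SQL:" line to see whether the WHERE condition was shipped to the foreign server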
Postgres 9.6 will resolve this, by allowing you to attach a list of extensions to your foreign server object.
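On 9.6 that looks roughly like this (assuming the server is named foreign_db as above and PostGIS is installed on both ends):
ALTER SERVER foreign_db OPTIONS (ADD extensions 'postgis');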
Well, obviously you are going to need any non-standard types defined at both ends. Don't forget the FDW functionality is supposed to support a variety of different database platforms, so there isn't any magic way to import remote operations on a datatype. Actually, given that one end could be running on MS-Windows and the other on ARM-based Linux there's not even a sensible way of doing it just with PostgreSQL.

Many-to-Many in Postgres?

I went with PostgreSQL because it is an ORDBMS rather than a standard relational DBMS. I have a class/object (below) that I would like to implement into the database.
class User{
int id;
String name;
ArrayList<User> friends;
}
Now, a user has many friends, so, logically, the table should be declared like so:
CREATE TABLE user_table(
id INT,
name TEXT,
friends TYPEOF(user_table)[]
)
However, to my knowledge, it is not possible to use a row of a table as a type (-10 points for PostgreSQL), so, instead, my array of friends is stored as integers:
CREATE TABLE user_table(
id INT,
name TEXT,
friends INT[]
)
This is an issue because the elements of an array cannot hold references - only the array itself can. Added to this, there seems to be no way to import the whole user (that is to say, the user and all the user's friends) without doing multiple queries.
Am I using postgreSQL wrong? It seems to me that the only efficient way to use it is by using a relational approach.
I want a cleaner object-oriented approach similar to that of Java.
I'm afraid you are indeed using PostgreSQL wrong, and possibly misunderstanding the purpose of Object-relational databases as opposed to classic relational databases. Both classes of database are still inherently relational, but the former provides allowances for inheritance and user-defined types that the latter does not.
This answer to one of your previous questions provides you with some great pointers to achieve what you're trying to do using the Postgres pattern.
Well, first off PostgreSQL absolutely supports arrays of complex types like you describe (although I don't think it has a TYPEOF operator). How would the declaration you describe work, though? You are trying to use the table type in the declaration of the table. If what you want is a composite type in an array (and I'm not really sure that it is) you would declare this in two steps:
CREATE TYPE ima_type AS ( some_id integer, some_val text);
CREATE TABLE ima_table
( some_other_id serial NOT NULL
, friendz ima_type []
)
;
That runs fine. You can also create arrays of table types, because every table definition is a type definition in Postgres.
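A quick sketch of putting a value in and getting it back out (the values are made up, using the ima_table definition above):
INSERT INTO ima_table (friendz)
VALUES (ARRAY[ROW(1, 'alice')::ima_type, ROW(2, 'bob')::ima_type]);
SELECT some_other_id, (friendz[1]).some_val FROM ima_table;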
However, in a relational database, a more traditional model would use two tables:
CREATE TABLE persons
( person_id serial NOT NULL PRIMARY KEY
, person_name text NOT NULL
)
;
CREATE TABLE friend_lookup
( person_id integer REFERENCES persons
, friend_id integer REFERENCES persons (person_id)
, CONSTRAINT uq_person_friend UNIQUE (person_id, friend_id)
)
;
Ignoring the fact that the persons table has absolutely no way to prevent duplicate persons (what about misspellings, middle initials, spacing, honorifics, etc?; also two different people can have the same name), this will do what you want and allow for a simple query that lists all friends.
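For example, listing every person with their friends is a single pair of joins (a sketch against the two tables above):
SELECT p.person_name, f.person_name AS friend_name
FROM friend_lookup fl
JOIN persons p ON p.person_id = fl.person_id
JOIN persons f ON f.person_id = fl.friend_id;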

In Postgres, is it performance critical to define low cardinality column as int and not text?

I have a column with 4 options.
The column is defined as text.
The table is a big table, 100 million records and growing.
The table is used as a report table.
The index on the table is on provider_id, date, enum_field.
I wonder if I should change the enum_field from text to int, and how performance-critical this is.
Using Postgres 9.1.
Table:
provider_report:
id bigserial NOT NULL,
provider_id bigint,
date timestamp without time zone,
enum_field character varying,
....
Index:
provider_id,date,enum_field
TL;DR version: worrying about this is probably not worth your time.
Long version:
There is an enum type in Postgres:
create type myenum as enum('foo', 'bar');
There are pros and cons related to using it vs a varchar or an integer field. Mostly pros imho.
In terms of size, it's stored as an oid, so an int32 type. This makes it smaller than a varchar populated with typical values (e.g. 'draft', 'published', 'pending', 'completed', whatever your enum is about), and the same size as an int type. If you have very few values, a smallint / int16 will admittedly be smaller. Some of your performance change will come from there (smaller vs larger field, i.e. mostly negligible).
Validation is possible in each case, be it through the built-in catalog lookup for the enum, or a check constraint or a foreign key for a varchar or an int. Some of your performance change will come from there, and it'll probably not be worth your time either.
Another benefit of the enum type is that it is ordered. In the above example, 'foo'::myenum < 'bar'::myenum, making it possible to order by enumcol. To achieve the same using a varchar or an int, you'll need a separate table with a sortidx column or something... In this case, the enum can yield an enormous benefit if you ever want to order by your enum's values. This brings us to (imho) the only gotcha, which is related to how the enum type is stored in the catalog...
Internally, each enum's value carries an oid, and the latter are stored as is within the table. So it's technically an int32. When you create the enum type, its values are stored in the correct order within the catalog. In the above example, 'foo' would have an oid lower than 'bar'. This makes it very efficient for Postgres to order by an enum's value, since it amounts to sorting int32 values.
When you ALTER your enum, however, you may end up in a situation where you change that order. For instance, imagine you alter the above enum in such a way that myenum is now ('foo', 'baz', 'bar'). For reasons tied to efficiency, Postgres does not assign a new oid for existing values and rewrite the tables that use them, let alone invalidate cached query plans that use them. What it does instead is populate a separate field in the pg_catalog, so as to make it yield the correct sort order. From that point forward, ordering by the enum field requires an extra lookup, which de facto amounts to joining the table with a separate values table that carries a sortidx field -- much like you would do with a varchar or an int if you ever wanted to sort them.
This is usually fine and perfectly acceptable. Occasionally, it's not. When it's not, there is a solution: alter the tables with the enum type, and change their values to varchar. Also locate and adjust functions and triggers that make use of it as you do. Then drop the type entirely, and recreate it to get fresh oid values. And finally alter the tables back to where they were, and readjust the functions and triggers. Not trivial, but certainly feasible.
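A rough sketch of that procedure, assuming a table t with a column status of type myenum (the table and column names are illustrative):
ALTER TABLE t ALTER COLUMN status TYPE text USING status::text;
DROP TYPE myenum;
CREATE TYPE myenum AS ENUM ('foo', 'baz', 'bar');  -- fresh oids, in the desired order
ALTER TABLE t ALTER COLUMN status TYPE myenum USING status::myenum;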
It would be best to define enum_field as an ENUM type. It will take minimal space and will check which values are allowed.
As for performance: the only reliable way to know if it really affects performance is to test it (with a proper set of correct tests). My guess is that the difference will be less than 5%.
And if you really want to change the table - don't forget to VACUUM it after the change.

postgresql hstore key/value vs traditional SQL performance

I need to develop a key/value backend, something like this:
Table T1 id-PK, Key - string, Value - string
INSERT INTO T1 (key, value) VALUES ('String1', 'Value1')
INSERT INTO T1 (key, value) VALUES ('String1', 'Value2')
Table T2 id-PK2, id2->external key to id
some other data in T2, which references data in T1 (like users which have those K/V etc)
I heard about PostgreSQL hstore with GIN/GIST. What is better (performance-wise)?
Doing this the traditional way with SQL joins and having separate columns(Key/Value) ?
Does PostgreSQL hstore perform better in this case?
The format of the data should be any key=>any value.
I also want to do text matching, e.g. partial searches (LIKE '%...' in SQL, or the hstore equivalent).
I plan to have around 1M-2M entries in it and probably scale at some point.
What do you recommend ? Going the SQL traditional way/PostgreSQL hstore or any other distributed key/value store with persistence?
If it helps, my server is a VPS with 1-2GB RAM, so not particularly good hardware. I was also thinking of having a cache layer on top of this, but I think it rather complicates the problem. I just want good performance for 2M entries. Updates will be done often, but searches even more often.
Thanks.
Your question is unclear because you're not clear about your objective.
The key here is the index (pun intended) - if you're dealing with a large number of keys, you want to be able to retrieve them with the fewest lookups and without pulling up unrelated data.
The short answer is you probably don't want to use hstore, but let's look into more detail...
Does each id have many key/value pairs (hundreds+)? Don't use hstore.
Will any of your values contain large blocks of text (4kb+)? Don't use hstore.
Do you want to be able to search by keys in wildcard expressions? Don't use hstore.
Do you want to do complex joins/aggregation/reports? Don't use hstore.
Will you update the value for a single key? Don't use hstore.
Multiple keys with the same name under an id? Can't use hstore.
So what's the use of hstore? Well, one good scenario would be if you wanted to hold key/value pairs for an external application where you know you always want to retrieve all key/values and will always save the data back as a block (i.e., it's never edited in place). At the same time you do want some flexibility to be able to search this data - albeit very simply - rather than storing it in, say, a block of XML or JSON. In this case, since the number of key/value pairs is small, you save on space because you're compressing several tuples into one hstore.
Consider this as your table:
CREATE TABLE kv (
id /* SOME TYPE */ PRIMARY KEY,
key_name TEXT NOT NULL,
key_value TEXT,
UNIQUE(id, key_name)
);
I think the design is poorly normalized. Try something more like this:
CREATE TABLE t1
(
t1_id serial PRIMARY KEY,
<other data which depends on t1_id and nothing else>,
-- possibly an hstore, but maybe better as a separate table
t1_props hstore
);
-- if properties are done as a separate table:
CREATE TABLE t1_properties
(
t1_id int NOT NULL REFERENCES t1,
key_name text NOT NULL,
key_value text,
PRIMARY KEY (t1_id, key_name)
);
If the properties are small and you don't need to use them heavily in joins or with fancy selection criteria, an hstore may suffice. Elliot laid out some sensible things to consider in that regard.
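If you do go the hstore route, querying it stays simple; a sketch assuming the t1 definition above and the hstore extension installed:
SELECT t1_id, t1_props -> 'color' AS color
FROM t1
WHERE t1_props ? 'color';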
Your reference to users suggests that this is incomplete, but you didn't really give enough information to suggest where those belong. You might get by with an array in t1, or you might be better off with a separate table.

When to use inherited tables in PostgreSQL?

In which situations should you use inherited tables? I tried to use them very briefly, and inheritance didn't seem to work like it does in the OOP world.
I thought it worked like this:
Table users has all fields required for all user levels. Tables like moderators, admins, bloggers, etc. inherit from it, but fields are not checked against the parent. For example, users has an email field and the inherited bloggers table now has it too, but it's not unique across users and bloggers at the same time, i.e. the same as if I had added an email field to both tables.
The only usage I could think of is for commonly used fields, like row_is_deleted, created_at, modified_at. Is this the only usage for inherited tables?
There are some major reasons for using table inheritance in postgres.
Let's say, we have some tables needed for statistics, which are created and filled each month:
statistics
- statistics_2010_04 (inherits statistics)
- statistics_2010_05 (inherits statistics)
In this sample, we have 2,000,000 rows in each table. Each table has a CHECK constraint to make sure only data for the matching month gets stored in it.
So what makes the inheritance a cool feature - why is it cool to split the data?
PERFORMANCE: When selecting data, we SELECT * FROM statistics WHERE date BETWEEN x and Y, and Postgres only uses the tables, where it makes sense. Eg. SELECT * FROM statistics WHERE date BETWEEN '2010-04-01' AND '2010-04-15' only scans the table statistics_2010_04, all other tables won't get touched - fast!
Index size: We have no big fat table with a big fat index on column date. We have small tables per month, with small indexes - faster reads.
Maintenance: We can run vacuum full, reindex, cluster on each month table without locking all other data
For the correct use of table inheritance as a performance booster, look at the PostgreSQL manual.
You need to set CHECK constraints on each table to tell the database on which key your data gets split (partitioned).
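A minimal sketch of one month's child table (the column names are assumed from the example above):
CREATE TABLE statistics_2010_04
(
CHECK (date >= DATE '2010-04-01' AND date < DATE '2010-05-01')
) INHERITS (statistics);
-- with constraint_exclusion enabled, queries filtered on date skip the other child tables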
I make heavy use of table inheritance, especially when it comes to storing log data grouped by month. Hint: If you store data which will never change (log data), create your indexes with CREATE INDEX ... WITH (fillfactor=100); this means no space for updates will be reserved in the index - the index is smaller on disk.
UPDATE:
fillfactor default is 100, from http://www.postgresql.org/docs/9.1/static/sql-createtable.html:
The fillfactor for a table is a percentage between 10 and 100. 100 (complete packing) is the default
"Table inheritance" means something different than "class inheritance" and they serve different purposes.
Postgres is all about data definitions. Sometimes really complex data definitions. OOP (in the common Java-colored sense of things) is about subordinating behaviors to data definitions in a single atomic structure. The purpose and meaning of the word "inheritance" is significantly different here.
In OOP land I might define (being very loose with syntax and semantics here):
import life
class Animal(life.Autonomous):
metabolism = biofunc(alive=True)
def die(self):
self.metabolism = False
class Mammal(Animal):
hair_color = color(foo=bar)
def gray(self, mate):
self.hair_color = age_effect('hair', self.age)
class Human(Mammal):
alcoholic = vice_boolean(baz=balls)
The tables for this might look like:
CREATE TABLE animal
(name varchar(20) PRIMARY KEY,
metabolism boolean NOT NULL);
CREATE TABLE mammal
(hair_color varchar(20) REFERENCES hair_color(code) NOT NULL,
PRIMARY KEY (name))
INHERITS (animal);
CREATE TABLE human
(alcoholic boolean NOT NULL,
FOREIGN KEY (hair_color) REFERENCES hair_color(code),
PRIMARY KEY (name))
INHERITS (mammal);
But where are the behaviors? They don't fit anywhere. This is not the purpose of "objects" as they are discussed in the database world, because databases are concerned with data, not procedural code. You could write functions in the database to do calculations for you (often a very good idea, but not really something that fits this case) but functions are not the same thing as methods -- methods as understood in the form of OOP you are talking about are deliberately less flexible.
There is one more thing to point out about inheritance as a schematic device: as of Postgres 9.2 there is no way to reference a foreign key constraint across all of the partitions/table family members at once. You can write checks to do this or get around it another way, but it's not a built-in feature (it comes down to issues with complex indexing, really, and nobody has written the bits necessary to make that automatic). Instead of using table inheritance for this purpose, often a better match in the database for object inheritance is to make schematic extensions to tables. Something like this:
CREATE TABLE animal
(name varchar(20) PRIMARY KEY,
ilk varchar(20) REFERENCES animal_ilk NOT NULL,
metabolism boolean NOT NULL);
CREATE TABLE mammal
(animal varchar(20) REFERENCES animal PRIMARY KEY,
ilk varchar(20) REFERENCES mammal_ilk NOT NULL,
hair_color varchar(20) REFERENCES hair_color(code) NOT NULL);
CREATE TABLE human
(mammal varchar(20) REFERENCES mammal PRIMARY KEY,
alcoholic boolean NOT NULL);
Now we have a canonical reference for the instance of the animal that we can reliably use as a foreign key reference, and we have an "ilk" column that references a table of xxx_ilk definitions which points to the "next" table of extended data (or indicates there is none if the ilk is the generic type itself). Writing table functions, views, etc. against this sort of schema is so easy that most ORM frameworks do exactly this sort of thing in the background when you resort to OOP-style class inheritance to create families of object types.
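For instance, a view that reassembles a full human record is just a couple of joins (a sketch against the tables above):
CREATE VIEW human_view AS
SELECT a.name, a.metabolism, m.hair_color, h.alcoholic
FROM animal a
JOIN mammal m ON m.animal = a.name
JOIN human h ON h.mammal = m.animal;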
Inheritance can be used in an OOP paradigm as long as you do not need to create foreign keys on the parent table. For example, if you have an abstract class vehicle stored in a vehicle table and a table car that inherits from it, all cars will be visible in the vehicle table, but a foreign key from a driver table on the vehicle table won't match these records.
Inheritance can also be used as a partitioning tool. This is especially useful when you have tables meant to grow forever (log tables etc.).
Main use of inheritance is for partitioning, but sometimes it's useful in other situations. In my database there are many tables differing only in a foreign key. My "abstract class" table "image" contains an "ID" (the primary key for it must be in every table) and a PostGIS 2.0 raster. Inherited tables such as "site_map" or "artifact_drawing" have a foreign key column (a "site_name" text column for "site_map", an "artifact_id" integer column for the "artifact_drawing" table, etc.) and primary and foreign key constraints; the rest is inherited from the "image" table. I suspect I might have to add a "description" column to all the image tables in the future, so this might save me quite a lot of work without causing real issues (well, the database might run a little slower).
EDIT: another good use: with two-table handling of unregistered users, other RDBMSs have problems with handling the two tables, but in PostgreSQL it is easy - just add ONLY when you are not interested in data in the inherited "unregistered user" table.
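A sketch of that, assuming a parent table users and an inherited unregistered_users table (the names are made up for illustration):
SELECT * FROM users;        -- registered and unregistered users
SELECT * FROM ONLY users;   -- registered users only, child rows excluded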
The only experience I have with inherited tables is in partitioning. It works fine, but it's not the most sophisticated and easy to use part of PostgreSQL.
Last week we were looking the same OOP issue, but we had too many problems with Hibernate - we didn't like our setup, so we didn't use inheritance in PostgreSQL.
I use inheritance when I have more than 1-to-1 relationships between tables.
Example: suppose you want to store object map locations with attributes x, y, rotation, scale.
Now suppose you have several different kinds of objects to display on the map and each object has its own map location parameters, and map parameters are never reused.
In these cases table inheritance would be quite useful to avoid having to maintain unnormalised tables or having to create location IDs and cross-reference them to other tables.
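A rough sketch of what that could look like (table and column names are made up, following the attributes mentioned above):
CREATE TABLE map_location
(
x real,
y real,
rotation real,
scale real
);
CREATE TABLE tree_location
(
tree_id integer
) INHERITS (map_location);
CREATE TABLE building_location
(
building_id integer
) INHERITS (map_location);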
I tried some operations on it. I will not say whether there is any actual use case for database inheritance, but I will give you some details to help you make your decision. Here is an example from PostgreSQL: https://www.postgresql.org/docs/15/tutorial-inheritance.html
You can try the SQL script below.
CREATE TABLE IF NOT EXISTS cities (
name text,
population real,
elevation int -- (in ft)
);
CREATE TABLE IF NOT EXISTS capitals (
state char(2) UNIQUE NOT NULL
) INHERITS (cities);
ALTER TABLE cities
ADD test_id varchar(255); -- both tables will now contain the test_id column
DROP TABLE cities; -- Cannot drop because capitals depends on it
ALTER TABLE cities
ADD CONSTRAINT fk_test FOREIGN KEY (test_id) REFERENCES sometable (id);
As you can see from my comments, let me summarize:
When you add/delete/update fields, the inheriting table is also affected.
You cannot drop the parent table.
Foreign keys are not inherited.
From my perspective, in growing applications we cannot easily predict future changes, so I would avoid applying this early in database development.
When features are stable and we want to create a database model which is much like an existing one, we can consider that use case.
Use it as little as possible. And that usually means never, as it boils down to a way of creating structures that violate the relational model, for instance by breaking the information principle and by creating bags instead of relations.
Instead, use table partitioning combined with proper relational modelling, including further normal forms.