In Postgres, is it performance-critical to define a low-cardinality column as int and not text? - postgresql

I have a column with 4 possible values.
The column is defined as text.
The table is big: 100 million records and growing.
The table is used as a report table.
The index on the table is on provider_id, date, enum_field.
I wonder whether I should change enum_field from text to int, and how performance-critical this is.
I'm using Postgres 9.1.
Table:
provider_report:
id bigserial NOT NULL,
provider_id bigint,
date timestamp without time zone,
enum_field character varying,
....
Index:
provider_id,date,enum_field

TL;DR version: worrying about this is probably not worth your time.
Long version:
There is an enum type in Postgres:
create type myenum as enum('foo', 'bar');
There are pros and cons related to using it vs a varchar or an integer field. Mostly pros imho.
In terms of size, it's stored as an oid, so int32 type. This makes it smaller than a varchar populated with typical values (e.g. 'draft', 'published', 'pending', 'completed', whatever your enum is about), and the same size as an int type. If you have very few values, a smallint / int16 will admittedly be smaller. Some of your performance change will come from there (smaller vs larger field, i.e. mostly negligible).
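If you want to see the size difference for yourself, pg_column_size reports the bytes used by a value. A minimal sketch, assuming the myenum type from above already exists; the sample string is only illustrative:

```sql
-- Compare the storage size of a value in different types.
-- Assumes: create type myenum as enum('foo', 'bar');
select pg_column_size('published'::varchar) as varchar_bytes,
       pg_column_size('foo'::myenum)        as enum_bytes,
       pg_column_size(1::int)               as int_bytes,
       pg_column_size(1::smallint)          as smallint_bytes;
```

Keep in mind that row headers and alignment padding also count toward the on-disk row size, so the per-row saving can be smaller than the per-value numbers suggest.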
Validation is possible in each case, be it through the built-in catalog lookup for the enum, or a check constraint or a foreign key for a varchar or an int. Some of your performance change will come from there, and it'll probably not be worth your time either.
Another benefit of the enum type is that it is ordered. In the above example, 'foo'::myenum < 'bar'::myenum, making it possible to order by enumcol. To achieve the same using a varchar or an int, you'll need a separate table with a sortidx column or something... In this case, the enum can yield an enormous benefit if you ever want to order by your enum's values. This brings us to (imho) the only gotcha, which is related to how the enum type is stored in the catalog...
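To illustrate the ordering, here is a small sketch. The table name docs and the column status are made up for the example; only myenum comes from above:

```sql
-- Assumes: create type myenum as enum('foo', 'bar');
-- "docs" and "status" are hypothetical names.
create table docs (
    id     serial primary key,
    status myenum not null
);
insert into docs (status) values ('bar'), ('foo'), ('bar');

-- Comparison follows the declaration order of the enum, not alphabetical order.
select 'foo'::myenum < 'bar'::myenum as foo_sorts_first;

-- Rows with 'foo' come back before rows with 'bar'.
select id, status from docs order by status, id;
```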
Internally, each enum's value carries an oid, and the latter are stored as is within the table. So it's technically an int32. When you create the enum type, its values are stored in the correct order within the catalog. In the above example, 'foo' would have an oid lower than 'bar'. This makes it very efficient for Postgres to order by an enum's value, since it amounts to sorting int32 values.
When you ALTER your enum, however, you may end up in a situation where you change that order. For instance, imagine you alter the above enum in such a way that myenum is now ('foo', 'baz', 'bar'). For reasons tied to efficiency, Postgres does not assign a new oid for existing values and rewrite the tables that use them, let alone invalidate cached query plans that use them. What it does instead is populate a separate field in pg_catalog, so as to make it yield the correct sort order. From that point forward, ordering by the enum field requires an extra lookup, which de facto amounts to joining the table with a separate values table that carries a sortidx field, much like you would do with a varchar or an int if you ever wanted to sort them.
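You can look at this in the catalog directly. A sketch, assuming the myenum type from above; the field in question is, as far as I can tell, the enumsortorder column of pg_enum:

```sql
-- Insert a value in the middle of the existing order (available since 9.1).
-- Note: before PostgreSQL 12 this cannot run inside a transaction block.
alter type myenum add value 'baz' before 'bar';

-- Existing oids are left untouched; enumsortorder carries the logical position.
select oid, enumlabel, enumsortorder
from pg_enum
where enumtypid = 'myenum'::regtype
order by enumsortorder;
```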
This is usually fine and perfectly acceptable. Occasionally, it's not. When it's not, there is a solution: alter the tables with the enum type, and change their values to varchar. Also locate and adjust functions and triggers that make use of it as you do. Then drop the type entirely, and then recreate it to get fresh oid values. And finally alter the tables back to where they were, and readjust the functions and triggers. Not trivial, but certainly feasible.
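The steps above could look roughly like this. It is only a sketch for a single hypothetical table docs with a column status of type myenum; functions, triggers, column defaults and views that depend on the type are not covered and would need the same treatment:

```sql
-- Hypothetical table "docs" with a column "status" of type myenum.
-- Each ALTER TABLE rewrites the table and holds an exclusive lock while it runs.
begin;

-- 1. Move the column away from the enum type.
alter table docs alter column status type varchar using status::text;

-- 2. Drop and recreate the type so the values get fresh oids in the right order.
drop type myenum;
create type myenum as enum ('foo', 'baz', 'bar');

-- 3. Move the column back to the enum type.
alter table docs alter column status type myenum using status::text::myenum;

commit;
```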

It would be best to define enum_field as an ENUM type. It takes minimal space and checks which values are allowed.
As for performance: the only reliable way to know whether it really affects performance is to test it (with a proper set of correct tests). My guess is that the difference will be less than 5%.
And if you really want to change the table, don't forget to VACUUM it after the change.
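One way to run such a test without touching the live table is to compare plans on a copy. This is only a sketch: the type name report_kind, its four labels, the copy's name, and the filter values are all placeholders, since the question does not give the real ones:

```sql
-- Placeholder type; replace the labels with the 4 real values.
create type report_kind as enum ('a', 'b', 'c', 'd');

-- Copy a slice of the data with the column converted to the enum.
create table provider_report_test as
select id, provider_id, date, enum_field::text::report_kind as enum_field
from provider_report
where date >= now() - interval '30 days';

create index provider_report_test_idx
    on provider_report_test (provider_id, date, enum_field);
analyze provider_report_test;

-- Run the same query against both tables and compare timings and buffer counts.
explain (analyze, buffers)
select count(*)
from provider_report_test
where provider_id = 42
  and date >= now() - interval '7 days'
  and enum_field = 'a';
```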

Related

Is there an efficiency difference between varchar and int as PK

Could somebody tell me whether it is a good idea to use varchar as a PK? I mean, is it less efficient than, or equal to, int/uuid?
For example, a car VIN: I want to use it as a PK, but I'm not sure how well it will be indexed or work as an FK, or whether there are pitfalls.
It depends on which kind of data you are going to store.
In some cases (I would say in most cases) it is better to use integer-based primary keys:
1. bigint needs only 8 bytes, while varchar can require more space. For this reason, a varchar comparison is often more costly than a bigint comparison.
2. When joining tables, it is more efficient to join on integer-based values rather than strings.
3. An integer-based key is more appropriate as a unique key for table relations, for instance if you are going to store this primary key in other tables as a separate column. Again, varchar will require more space in the other tables too (see point 1).
This post on stackexchange compares non-integer types of primary keys on a particular example.
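To make the trade-off concrete for the VIN case, here is a sketch of both layouts. All table and column names are made up for illustration:

```sql
-- Option A: natural key. Every referencing row stores the full 17-character VIN.
create table car_a (
    vin varchar(17) primary key
);
create table service_a (
    id  bigserial primary key,
    vin varchar(17) not null references car_a (vin)
);

-- Option B: surrogate key. The VIN stays unique, but references store an 8-byte bigint.
create table car_b (
    car_id bigserial primary key,
    vin    varchar(17) not null unique
);
create table service_b (
    id     bigserial primary key,
    car_id bigint not null references car_b (car_id)
);
```

Option A avoids a join when you only need the VIN from the child table; option B keeps the child table and its indexes smaller.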

Fast way to check if PostgreSQL jsonb column contains certain string

For the past two days I've been reading a lot about jsonb, full text search, GIN indexes, trigram indexes and whatnot, but I still cannot find a definitive, or at least good enough, answer on how to quickly check whether a JSONB column in a row contains a certain string as a value. Since it's search functionality, the behavior should be like that of ILIKE.
What I have is:
A table, let's call it app.table_1, which contains a lot of columns, one of which is of type JSONB; let's call it column_jsonb.
The data inside column_jsonb will always be flat (no nested objects, etc.), but the keys can vary. An example of the data in the column, with obfuscated values, looks like this:
{"Key1": "Value1", "Key2": "Value2", "Key3": null, "Key4": "Value4", "Key5": "Value5"}
I have a GIN index on this column, which doesn't seem to affect the search time significantly (I am testing with 20k records now, which takes about 550 ms). The index looks like this:
CREATE INDEX ix_table_1_column_jsonb_gin
ON app.table_1 USING gin
(column_jsonb jsonb_path_ops)
TABLESPACE pg_default;
I am interested only in the VALUES and the way I am searching them now is this:
EXISTS(SELECT value FROM jsonb_each(column_jsonb) WHERE value::text ILIKE search_term)
Here search_term is variable coming from the front end with the string that the user is searching for
I have the following questions:
Is it possible to make the check faster without modifying the data model? I've read that a trigram index might be useful for similar cases, but it seems to me that converting jsonb to text and then checking will be slower, and I am not sure whether a trigram index will actually work if the column's original type is JSONB and I explicitly cast each row to text. If I'm wrong, I would really appreciate an explanation, with an example if possible.
Is there some JSONB function that I am not aware of which offers what I am looking for out of the box? I'm constrained to PostgreSQL 11.9, so some new things coming with version 12 are not available to me.
If it's not possible to achieve a significant improvement with the current data structure, can you propose a way to restructure the data in column_jsonb, maybe another column of some other type with the data persisted in some other way? I don't know...
Thank you very much in advance!
If the data structure is flat, and you regularly need to search the values, and the values are all the same type, a traditional key/value table would seem more appropriate.
create table table1_options (
table1_id bigint not null references table1(id),
key text not null,
value text not null
);
create index table1_options_key on table1_options(key);
create index table1_options_value on table1_options(value);
select *
from table1_options
where value ilike 'some search%';
I've used simple B-Tree indexes, but you can use whatever you need to speed up your particular searches.
The downsides are that all values must have the same type (doesn't seem to be a problem here) and you need an extra table for each table. That last one can be mitigated somewhat with table inheritance.
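One caveat on the index choice: a plain B-Tree index generally cannot serve ILIKE, and it cannot serve patterns with a leading wildcard. If the search has to behave like ILIKE '%term%', a trigram index from the pg_trgm extension is one option. A sketch, assuming the table1_options table above and that the extension can be installed:

```sql
-- pg_trgm ships with PostgreSQL's contrib modules.
create extension if not exists pg_trgm;

-- GIN trigram index on the searched column; it supports LIKE and ILIKE.
create index table1_options_value_trgm
    on table1_options using gin (value gin_trgm_ops);

-- This pattern, including the leading wildcard, can now use the index.
select *
from table1_options
where value ilike '%some search%';
```

Very short search terms (fewer than three characters) produce few or no trigrams, so they tend to benefit little from this index.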

Prevent non-collation characters in a NVarChar column using constraint?

A slightly weird requirement, but here it goes. We have a CustomerId VarChar(25) column in a table. We need to make it NVarChar(25) to work around issues with type conversions (see "CHARINDEX vs LIKE search gives very different performance, why?").
But we don't want to allow non-Latin characters to be stored in this column. Is there any way to place such a constraint on the column? I'd rather let the database handle this check. In general we are OK with NVarChar for all of our strings, but some columns, like IDs, are not good candidates for it because of the possibility of look-alike strings from different languages.
Example:
CustomerId NVarChar(1) - PK
Value 1: BOPOH
Value 2: ВОРОН
Those 2 strings are different (the second one is Cyrillic).
I want to prevent this entry scenario. I want to make sure Value 2 cannot be saved into the field.
Just in case it helps somebody. Not sure it's the most "elegant" solution, but I placed a constraint like this on those fields:
ALTER TABLE [dbo].[Carrier] WITH CHECK ADD CONSTRAINT [CK_Carrier_CarrierId] CHECK ((CONVERT([varchar](25),[CarrierId],(0))=[CarrierId]))
GO
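The idea behind the constraint, as I understand it, is that converting to varchar replaces characters the code page cannot represent, so the round-trip comparison fails for them. A quick way to check the behavior on your own server (this assumes a database collation with a Latin code page; with a Cyrillic collation the second value would likely pass):

```sql
-- Latin letters survive the conversion to varchar, so the comparison should match.
SELECT CASE WHEN CONVERT(varchar(25), N'BOPOH') = N'BOPOH'
            THEN 'same' ELSE 'different' END AS latin_check;

-- Cyrillic letters are not representable in a Latin code page,
-- so the converted value should no longer equal the original.
SELECT CASE WHEN CONVERT(varchar(25), N'ВОРОН') = N'ВОРОН'
            THEN 'same' ELSE 'different' END AS cyrillic_check;
```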

Postgres ENUM data type or CHECK CONSTRAINT?

I have been migrating a MySQL db to Pg (9.1), and have been emulating MySQL ENUM data types by creating a new data type in Pg, and then using that as the column definition. My question: could I, and would it be better to, use a CHECK CONSTRAINT instead? The MySQL ENUM types are implemented to enforce specific value entries in the rows. Could that be done with a CHECK CONSTRAINT? And, if yes, would it be better (or worse)?
Based on the comments and answers here, and some rudimentary research, I have the following summary to offer for comments from the Postgres-erati. Will really appreciate your input.
There are three ways to restrict entries in a Postgres database table column. Consider a table to store "colors" where you want only 'red', 'green', or 'blue' to be valid entries.
Enumerated data type
CREATE TYPE valid_colors AS ENUM ('red', 'green', 'blue');
CREATE TABLE t (
color VALID_COLORS
);
Advantages are that the type can be defined once and then reused in as many tables as needed. A standard query can list all the values for an ENUM type, and can be used to make application form widgets.
SELECT n.nspname AS enum_schema,
t.typname AS enum_name,
e.enumlabel AS enum_value
FROM pg_type t JOIN
pg_enum e ON t.oid = e.enumtypid JOIN
pg_catalog.pg_namespace n ON n.oid = t.typnamespace
WHERE t.typname = 'valid_colors'
 enum_schema |  enum_name   | enum_value
-------------+--------------+------------
 public      | valid_colors | red
 public      | valid_colors | green
 public      | valid_colors | blue
Disadvantages are, the ENUM type is stored in system catalogs, so a query as above is required to view its definition. These values are not apparent when viewing the table definition. And, since an ENUM type is actually a data type separate from the built in NUMERIC and TEXT data types, the regular numeric and string operators and functions don't work on it. So, one can't do a query like
SELECT * FROM t WHERE color LIKE 'bl%';
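That said, a workaround exists if you occasionally need string operators: cast the enum column to text first. A sketch against the table t defined above:

```sql
-- Casting the enum to text makes the usual string operators available.
SELECT * FROM t WHERE color::text LIKE 'bl%';
```

The cast is applied per row, so a plain index on color will not help this particular query.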
Check constraints
CREATE TABLE t (
colors TEXT CHECK (colors IN ('red', 'green', 'blue'))
);
Two advantages are that, one, "what you see is what you get," that is, the valid values for the column are recorded right in the table definition, and two, all native string or numeric operators work.
Foreign keys
CREATE TABLE valid_colors (
id SERIAL PRIMARY KEY NOT NULL,
color TEXT
);
INSERT INTO valid_colors (color) VALUES
('red'),
('green'),
('blue');
CREATE TABLE t (
color_id INTEGER REFERENCES valid_colors (id)
);
Essentially the same as creating an ENUM type, except, the native numeric or string operators work, and one doesn't have to query system catalogs to discover the valid values. A join is required to link the color_id to the desired text value.
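For reference, the join in question would look something like this, using the two tables defined above:

```sql
-- Resolve each row's color_id to its text value.
SELECT t.color_id, vc.color
FROM t
JOIN valid_colors vc ON vc.id = t.color_id;
```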
As other answers point out, check constraints have flexibility issues, but setting a foreign key on an integer id requires joining during lookups. Why not just use the allowed values as natural keys in the reference table?
To adapt the schema from punkish's answer:
CREATE TABLE valid_colors (
color TEXT PRIMARY KEY
);
INSERT INTO valid_colors (color) VALUES
('red'),
('green'),
('blue');
CREATE TABLE t (
color TEXT REFERENCES valid_colors (color) ON UPDATE CASCADE
);
Values are stored inline as in the check constraint case, so there are no joins, but new valid value options can be easily added and existing values instances can be updated via ON UPDATE CASCADE (e.g. if it's decided "red" should actually be "Red", update valid_colors accordingly and the change propagates automatically).
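A short sketch of both operations, assuming the valid_colors and t tables defined just above (the sample rows are made up):

```sql
-- Some sample data.
INSERT INTO t (color) VALUES ('red'), ('blue');

-- Adding a new allowed value is a plain insert, no DDL needed.
INSERT INTO valid_colors (color) VALUES ('yellow');

-- Renaming a value in the reference table propagates to t via ON UPDATE CASCADE.
UPDATE valid_colors SET color = 'Red' WHERE color = 'red';

-- The row that held 'red' now holds 'Red'.
SELECT color FROM t;
```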
One of the big disadvantages of Foreign keys vs Check constraints is that any reporting or UI displays will have to perform a join to resolve the id to the text.
In a small system this is not a big deal but if you are working on a HR or similar system with very many small lookup tables then this can be a very big deal with lots of joins taking place just to get the text.
My recommendation would be that if the values are few and rarely changing, then use a constraint on a text field otherwise use a lookup table against an integer id field.
PostgreSQL has enum types, works as it should. I don't know if an enum is "better" than a constraint, they just both work.
From my point of view, given the same set of values:
An enum is a better solution if you will use it on multiple columns.
If you want to limit the values of only one column in your application, a check constraint is a better solution.
Of course, there are a whole lot of other parameters which could creep into your decision process (typically, the fact that builtin operators are not available), but I think these two are the most prevalent ones.
I'm hoping somebody will chime in with a good answer from the PostgreSQL database side as to why one might be preferable to the other.
From a software developer's point of view, I have a slight preference for using check constraints, since PostgreSQL enums require a cast in your SQL to do an update/insert, such as:
INSERT INTO table1 (colA, colB) VALUES('foo', 'bar'::myenum)
where "myenum" is the enum type you specified in PostgreSQL.
This certainly makes the SQL non-portable (which may not be a big deal for most people), but also is just yet another thing you have to deal with while developing applications, so I prefer having VARCHARs (or other typical primitives) with check constraints.
As a side note, I've noticed that MySQL enums do not require this type of cast, so this is something particular to PostgreSQL in my experience.
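A note on when the cast is actually needed: as far as I know, an untyped string literal is resolved to the enum type automatically, and the explicit cast only becomes necessary when the value reaches the server already typed as text or varchar, which is what some drivers do with bound string parameters. A sketch, assuming a myenum type with the labels 'foo' and 'bar' and a hypothetical table1:

```sql
-- Assumes: CREATE TYPE myenum AS ENUM ('foo', 'bar');
-- Hypothetical table matching the statement above.
CREATE TABLE table1 (colA text, colB myenum);

-- An untyped literal is resolved to the column's enum type; no cast needed.
INSERT INTO table1 (colA, colB) VALUES ('foo', 'bar');

-- A value already typed as text is not converted implicitly, so this needs the cast.
INSERT INTO table1 (colA, colB) VALUES ('foo', 'bar'::text::myenum);
```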

postgresql hstore key/value vs traditional SQL performance

I need to develop a key/value backend, something like this:
Table T1 id-PK, Key - string, Value - string
INSERT INTO T1 (Key, Value) VALUES ('String1', 'Value1');
INSERT INTO T1 (Key, Value) VALUES ('String1', 'Value2');
Table T2 id-PK2, id2->external key to id
some other data in T2, which references data in T1 (like users which have those K/V etc)
I heard about PostgreSQL hstore with GIN/GIST. What is better (performance-wise)?
Doing this the traditional way with SQL joins and having separate columns(Key/Value) ?
Does PostgreSQL hstore perform better in this case?
The format of the data should be any key=>any value.
I also want to do text matching e.g. partially search for (LIKE % in SQL or using the hstore equivalent).
I plan to have around 1M-2M entries in it and probably scale at some point.
What do you recommend ? Going the SQL traditional way/PostgreSQL hstore or any other distributed key/value store with persistence?
If it helps, my server is a VPS with 1-2GB RAM, so not a pretty good hardware. I was also thinking to have a cache layer on top of this, but I think it rather complicates the problem. I just want good performance for 2M entries. Updates will be done often but searches even more often.
Thanks.
Your question is unclear because you're not clear about your objective.
The key here is the index (pun intended): if you're dealing with a large number of keys, you want to be able to retrieve them with the fewest lookups and without pulling up unrelated data.
The short answer is that you probably don't want to use hstore, but let's look in more detail...
Does each id have many key/value pairs (hundreds+)? Don't use hstore.
Will any of your values contain large blocks of text (4kb+)? Don't use hstore.
Do you want to be able to search by keys in wildcard expressions? Don't use hstore.
Do you want to do complex joins/aggregation/reports? Don't use hstore.
Will you update the value for a single key? Don't use hstore.
Multiple keys with the same name under an id? Can't use hstore.
So what's the use of hstore? Well, one good scenario would be if you wanted to hold key/value pairs for an external application where you know you always want to retrieve all key/values and will always save the data back as a block (i.e., it's never edited in place). At the same time you do want some flexibility to be able to search this data, albeit very simply, rather than storing it in, say, a block of XML or JSON. In this case, since the number of key/value pairs is small, you save on space because you're compressing several tuples into one hstore.
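A sketch of that scenario, with a made-up table app_settings holding one small hstore per row:

```sql
-- hstore ships with PostgreSQL's contrib modules.
CREATE EXTENSION IF NOT EXISTS hstore;

-- Hypothetical table: one block of key/value pairs per row.
CREATE TABLE app_settings (
    id    serial PRIMARY KEY,
    props hstore NOT NULL
);

-- A GIN index supports containment and key-existence searches.
CREATE INDEX app_settings_props_gin ON app_settings USING gin (props);

INSERT INTO app_settings (props) VALUES ('theme => dark, lang => en');

-- Simple search: rows whose pairs include lang => en, returning one value.
SELECT props -> 'theme' AS theme
FROM app_settings
WHERE props @> 'lang => en';

-- The whole block is written back at once rather than edited key by key.
UPDATE app_settings SET props = 'theme => light, lang => en' WHERE id = 1;
```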
Consider this as your table:
CREATE TABLE kv (
id /* SOME TYPE */ PRIMARY KEY,
key_name TEXT NOT NULL,
key_value TEXT,
UNIQUE(id, key_name)
);
I think the design is poorly normalized. Try something more like this:
CREATE TABLE t1
(
t1_id serial PRIMARY KEY,
<other data which depends on t1_id and nothing else>,
-- possibly an hstore, but maybe better as a separate table
t1_props hstore
);
-- if properties are done as a separate table:
CREATE TABLE t1_properties
(
t1_id int NOT NULL REFERENCES t1,
key_name text NOT NULL,
key_value text,
PRIMARY KEY (t1_id, key_name)
);
If the properties are small and you don't need to use them heavily in joins or with fancy selection criteria, an hstore may suffice. Elliot laid out some sensible things to consider in that regard.
Your reference to users suggests that this is incomplete, but you didn't really give enough information to suggest where those belong. You might get by with an array in t1, or you might be better off with a separate table.
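With the separate-table variant, the partial matching mentioned in the question becomes an ordinary query. A sketch against the t1 and t1_properties tables above; the key name and pattern are placeholders:

```sql
-- Find rows of t1 that have a given key whose value starts with a prefix.
SELECT t1.t1_id, p.key_name, p.key_value
FROM t1
JOIN t1_properties p USING (t1_id)
WHERE p.key_name = 'color'
  AND p.key_value LIKE 'bl%';
```

Whether an index can serve the LIKE depends on the pattern and the index type, so it is worth checking the plan with EXPLAIN on realistic data.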