Does the value of an indexed field affect performance? - postgresql

I have the following table:
create type size_type as enum('tiny', 'small', 'tall');
CREATE TABLE IF NOT EXISTS people (
    size size_type NOT NULL
);
Imagine it has tons of data. If I index the size field, will the length of the strings in the enum affect the performance of the database when executing queries? For example, will ('ti', 's', 'ta') be more performant than ('tiny', 'small', 'tall'), or does it not matter?

It does not matter at all. Under the hood, an enum value is stored as a fixed 4-byte value (an OID referencing its pg_enum entry), regardless of the length of the label text.
The documentation offers the following explanation:
An enum value occupies four bytes on disk. The length of an enum value's textual label is limited by the NAMEDATALEN setting compiled into PostgreSQL; in standard builds this means at most 63 bytes.
The translations from internal enum values to textual labels are kept in the system catalog pg_enum. Querying this catalog directly can be useful.
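For example, a query along these lines lists the labels and their sort order for the size_type enum from the question:
SELECT e.enumlabel, e.enumsortorder
FROM pg_enum e
JOIN pg_type t ON t.oid = e.enumtypid
WHERE t.typname = 'size_type'
ORDER BY e.enumsortorder;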

Related

Is it safe to use "char" (note the quotes) data type in PostgreSQL for production application?

I have a column that stores a numeric value ranging from 1 to 5.
I considered using smallint and character(1), but I feel like there might be a better data type: "char" (note the quotes).
"char" requires only 1 byte, which led me to think there could be a performance gain over smallint and character(1).
However, in PostgreSQL's own documentation, it says that "char" is not intended for general-purpose use, only for use in the internal system catalogs.
Does this mean I shouldn't use "char" data type for my production application?
If so, which data type would you recommend to store a numeric value ranging from 1 to 5?
The data type "char" is a Postgres type that's not in standard SQL. It's "safe" and you can use it freely, even if "not intended for general-purpose use". There are no restrictions, other than it being non-standard. The type is used across many system catalogs, so it's not going away. If you ever need to migrate to another RDBMS, make the target varchar(1) or similar.
"char" lends itself to 1-letter enumeration types that never grow beyond a handful of distinct plain ASCII letters. That's how I use it - including in production DBs. (You may want to add a CHECK or FOREIGN KEY constraint to columns enforcing valid values - see the sketch below.)
For a "numeric value ranging from 1 to 5" I would still prefer smallint, which seems more appropriate for numeric data.
A storage benefit (1 byte vs. 2 bytes) only kicks in with multiple columns - and/or (more importantly!) indexes that can retain a smaller footprint after alignment padding is applied.
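Minimal sketches of both patterns, with hypothetical table and column names; the letter codes and the range check are illustrative assumptions:
CREATE TABLE orders (
    status "char" NOT NULL CHECK (status IN ('n', 'p', 's')) -- e.g. 'n'ew, 'p'aid, 's'hipped
);
CREATE TABLE ratings (
    score smallint NOT NULL CHECK (score BETWEEN 1 AND 5) -- the "1 to 5" case
);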
Notable updates in Postgres 15, quoting the release notes:
Change the I/O format of type "char" for non-ASCII characters (Tom Lane)
Bytes with the high bit set are now output as a backslash and three
octal digits, to avoid encoding issues.
And:
Create a new pg_type.typcategory value for "char" (Tom Lane)
Some other internal-use-only types have also been assigned to this
category.
(No effect on enumeration with plain ASCII letters.)
Related:
Shall I use enum when are too many "categories" with PostgreSQL?
How to store one-byte integer in PostgreSQL?
Calculating and saving space in PostgreSQL

DECIMAL types in T-SQL

I have a table with about 200 million rows and multiple columns of data type DECIMAL(p,s) with varying precisions/scales.
Now, as far as I understand, DECIMAL(p,s) is a fixed-size column, with the size depending on the precision; see:
https://learn.microsoft.com/en-us/sql/t-sql/data-types/decimal-and-numeric-transact-sql?view=sql-server-ver16
Now, when altering the table and changing a column from DECIMAL(15,2) to DECIMAL(19,6), for example, I would have expected there to be almost no work to do on the side of SQL Server, since the bytes required to store the value are the same. Yet the ALTER itself takes a long time - so what exactly does the server do when I execute the ALTER statement?
Also, is there any benefit (other than having constraints on a column) to storing a DECIMAL(15,2) instead of, for example, a DECIMAL(19,2)? It seems to me the storage requirements would be the same, but I could store larger values in the latter.
Thanks in advance!
The precision and scale of a decimal / numeric type matter considerably.
As far as SQL Server is concerned, decimal(15,2) is a different data type to decimal(19,6), and is stored differently. You therefore cannot make the assumption that just because the overall storage requirements do not change, nothing else does.
SQL Server stores decimal data types in byte-reversed (little-endian) format, with the scale being the first incrementing value. Changing the definition can therefore require the data to be rewritten: SQL Server will use an internal worktable to safely convert the data and update the values on every page.
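For illustration, the kind of statement discussed looks like this (table and column names are assumptions); even though decimal(15,2) and decimal(19,6) fall into the same 9-byte storage bucket, SQL Server may still rewrite every affected page via that worktable:
ALTER TABLE dbo.Amounts ALTER COLUMN Amount DECIMAL(19,6) NOT NULL; -- keep NOT NULL only if the column already disallows NULLs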

Optimal approach for large bulk updates to indexed bit sets

The environment for this question is PostgreSQL 9.6.5 on AWS RDS.
The question is about an optimal schema design and batch update strategy for a table with 300 million rows containing the following logical data model:
id: primary key, string up to 40 characters long
code: integer 1-999
year: integer year
flags: a variable number (1,000+), each associated with a name; new flags are added over time. Ideally, a flag should be thought of as having three values: absent (null), on (true/1) and off (false/0). It is possible, at the cost of additional updates (see below), to treat a flag as a simple bit (on or off, no absent). "On" values are typically very sparse: < 1/1000.
Queries typically involve boolean expressions on the presence or absence of one or more flags (by name) with code and year occasionally involved also.
The data is updated in batch via Apache Spark, i.e., updates can be represented as flat file(s), e.g., in COPY format, or as SQL operations. Only one update is active at any one time. Updates to code and year are very infrequent. Updates to flags affect 1-5% of rows per update (3-15 million rows). The update rows could include all flags and their values, just the "on" flags to be updated, or just the flags whose values have changed. In the first case, Spark would need to query the data to get the current values of the flags.
There will be a small read load during updates.
The question is about an optimal schema and associated update strategy to support the query & updates as described.
Some comments from research so far:
Using 1,000+ boolean columns would create a very efficient row representation but, in addition to some DDL complexity, would require 1,000+ indexes.
Bit strings would be great if there was a way to index individual bits. Also, they do not offer a good way to represent absent flags. Using this approach would require maintaining a lookup table between flag names and bit IDs. Merging updates, if needed, works with ||, though, given PostgreSQL's MVCC there doesn't seem to be much benefit to updating just flags as opposed to replacing an entire row.
JSONB fields offer indexing. They also offer null representation but that comes at a cost: all flags that are "off" would need to be explicitly set, which would make the fields quite large. If we ignore null representation, JSONB fields would be relatively small. To further shrink them, we could use short 1-3 character field names with a lookup table. Same comments re: merging as with bit strings.
tsvector/tsquery: have no experience with this data type but, in theory, seems to be an exact representation of a set of "on" flags by name. Must use a lookup table mapping flag names to tokens with the additional requirement to ensure there are no collisions due to stemming.
Don't store the flags in the main table.
Assuming that the main table is called data, define something like the following:
CREATE TABLE flag_names (
    id smallint PRIMARY KEY,
    name text NOT NULL
);
CREATE TABLE flag (
    flagname_id smallint NOT NULL REFERENCES flag_names(id),
    data_id text NOT NULL REFERENCES data(id),
    value boolean NOT NULL,
    PRIMARY KEY (flagname_id, data_id)
);
If a new flag is created, insert a new row in flag_names.
If a flag is set to TRUE or FALSE, insert or update a row in the flag table.
Join flag with data to test if a certain flag is set.
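A sketch of that lookup and of a batch upsert, assuming the schema above (the flag name and the literal values are hypothetical):
SELECT d.*
FROM data d
JOIN flag f ON f.data_id = d.id
JOIN flag_names fn ON fn.id = f.flagname_id
WHERE fn.name = 'some_flag'
  AND f.value; -- flag is present and set to TRUE

-- merge a batch of flag changes (upsert, available since PostgreSQL 9.5):
INSERT INTO flag (flagname_id, data_id, value)
VALUES (1, 'abc123', true)
ON CONFLICT (flagname_id, data_id) DO UPDATE
SET value = EXCLUDED.value;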

PostgreSQL: storing a value as integer vs varchar

I want to store a 15 digit number in a table.
In terms of lookup speed should I use bigint or varchar?
Additionally, if it's populated with millions of records will the different data types have any impact on storage?
In terms of space, a bigint takes 8 bytes, while a 15-character string takes up to 16 bytes on disk (a 1-byte header for short strings plus up to 15 bytes of data). Also, generally speaking, equality checks are slightly faster on numeric types. Other than that, it largely depends on what you intend to do with this data. If you intend to use numerical/mathematical operations or query by ranges, you should use a bigint.
You can store them as bigint, as comparisons on integer types are faster than on varchar. I would also advise creating an index on the column if you are expecting millions of records, to make data retrieval faster.
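A minimal sketch of that setup (table and column names are assumptions):
CREATE TABLE accounts (
    account_number bigint NOT NULL
);
CREATE INDEX accounts_account_number_idx ON accounts (account_number);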
You can also check this forum:
Indexing is fastest on integer types. Char types are probably stored
underneath as integers (guessing.) VARCHARs are variable allocation
and when you change the string length it may have to reallocate.
Generally integers are fastest, then single characters then CHAR[x]
types and then VARCHARS.

Change varchar to boolean in PostgreSQL

I've started working on a project where there is a fairly large table (about 82,000,000 rows) that I think is very bloated. One of the fields is defined as:
consistency character varying NOT NULL DEFAULT 'Y'::character varying
It's used as a boolean, the values should always either be ('Y'|'N').
Note: there is no check constraint, etc.
I'm trying to come up with reasons to justify changing this field. Here is what I have:
It's being used as a boolean, so make it that. Explicit is better than implicit.
It will protect against coding errors because right now anything that can be converted to text will blindly go in there.
Here are my question(s).
What about size/storage? The db is UTF-8. So, I think there really isn't much of a savings in that regard. It should be 1 byte for a boolean, but also 1 byte for a 'Y' in UTF-8 (at least that's what I get when I check the length in Python). Is there any other storage overhead here that would be saved?
Query performance? Will Postgres get any performance gain for a WHERE clause with "= TRUE" vs. "= 'Y'"?
PostgreSQL (unlike Oracle) has a fully-fledged boolean type. Generally, a "yes/no flag" should be boolean. That's the appropriate type!
What about size/storage?
A boolean column occupies 1 byte on disk.
(The manual) about text or character varying:
the storage requirement for a short string (up to 126 bytes) is 1 byte
plus the actual string
That's at least 2 bytes for a single character.
Actual storage is more complicated than that. There is some fixed overhead per table, page and row, there is special NULL storage and some types require data alignment. See:
How many records can I store in 5 MB of PostgreSQL on Heroku?
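For a quick check of the per-value sizes, a throwaway sketch (the temp table and column names are arbitrary; the sizes reported by pg_column_size are typical, not guaranteed):
CREATE TEMP TABLE t (b boolean, c text);
INSERT INTO t VALUES (true, 'Y');
SELECT pg_column_size(b) AS boolean_bytes, pg_column_size(c) AS text_bytes FROM t; -- typically 1 and 2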
Encoding UTF8 doesn't make any difference here. Basic ASCII characters are bit-compatible with other encodings like LATIN-1.
In your case, according to your description, you should keep the NOT NULL constraint you already have - independent of the data type.
Query performance?
Will be slightly better in any case with boolean. Besides being smaller, the logic for boolean is simpler and varchar or text are also generally burdened with COLLATION rules. But don't expect much for something that simple.
Instead of:
WHERE consistency = 'Y'
You could write:
WHERE consistency = true
But rather simplify to just:
WHERE consistency
No further evaluation needed.
Change type
Transforming your table is simple:
ALTER TABLE tbl ALTER consistency TYPE boolean
USING CASE consistency WHEN 'Y' THEN true ELSE false END;
This CASE expression folds everything that is not TRUE ('Y') to FALSE. The NOT NULL constraint just stays.
Neither storage size nor query performance will be significantly better after switching from a single-character varchar to a boolean. Although you are right that it's technically cleaner to use boolean for what is really a binary value, the cost of the change is probably significantly higher than the benefit. If you're worried about correctness, you could put a check constraint on the column, for example:
ALTER TABLE tablename ADD CONSTRAINT consistency CHECK (consistency IN ('Y', 'N'));