I've started working on a project where there is a fairly large table (about 82,000,000 rows) that I think is very bloated. One of the fields is defined as:
consistency character varying NOT NULL DEFAULT 'Y'::character varying
It's used as a boolean, the values should always either be ('Y'|'N').
Note: there is no check constraint, etc.
I'm trying to come up with reasons to justify changing this field. Here is what I have:
It's being used as a boolean, so make it that. Explicit is better than implicit.
It will protect against coding errors, because right now anything that can be converted to text will go in there blindly.
Here are my question(s).
What about size/storage? The db is UTF-8. So, I think there really isn't much of a savings in that regard. It should be 1 byte for a boolean, but also 1 byte for a 'Y' in UTF-8 (at least that's what I get when I check the length in Python). Is there any other storage overhead here that would be saved?
Query performance? Will Postgres get any performance gain for a WHERE clause of "= TRUE" vs. "= 'Y'"?
PostgreSQL (unlike Oracle) has a fully-fledged boolean type. Generally, a "yes/no flag" should be boolean. That's the appropriate type!
What about size/storage?
A boolean column occupies 1 byte on disk.
(The manual) about text or character varying:
the storage requirement for a short string (up to 126 bytes) is 1 byte
plus the actual string
That's at least 2 bytes for a single character.
Actual storage is more complicated than that. There is some fixed overhead per table, page and row, there is special NULL storage and some types require data alignment. See:
How many records can I store in 5 MB of PostgreSQL on Heroku?
Encoding UTF8 doesn't make any difference here. Basic ASCII-characters are bit-compatible with other encodings like LATIN-1.
In your case, according to your description, you should keep the NOT NULL constraint you already have - independent of the data type.
Query performance?
Will be slightly better in any case with boolean. Besides being smaller, the logic for boolean is simpler and varchar or text are also generally burdened with COLLATION rules. But don't expect much for something that simple.
Instead of:
WHERE consistency = 'Y'
You could write:
WHERE consistency = true
But rather simplify to just:
WHERE consistency
No further evaluation needed.
Change type
Transforming your table is simple:
ALTER TABLE tbl ALTER consistency TYPE boolean
USING CASE consistency WHEN 'Y' THEN true ELSE false END;
This CASE expression folds everything that is not TRUE ('Y') to FALSE. The NOT NULL constraint just stays.
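Note that the column in the question also carries DEFAULT 'Y'::character varying. Postgres will typically refuse to cast that old default automatically, so a fuller sketch (same assumed table name tbl) drops it first and re-adds a boolean default afterwards:
ALTER TABLE tbl ALTER consistency DROP DEFAULT;

ALTER TABLE tbl ALTER consistency TYPE boolean
USING CASE consistency WHEN 'Y' THEN true ELSE false END;

ALTER TABLE tbl ALTER consistency SET DEFAULT true;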
Neither storage size nor query performance will be significantly better switching from a single VARCHAR to a BOOLEAN. Although you are right that it's technically cleaner to use boolean when you are talking about a binary value, the cost to change is probably significantly higher than the benefit. If you're worried about correctness, then you could put a check on the column, for example:
ALTER TABLE tablename ADD CONSTRAINT consistency CHECK (consistency IN ('Y', 'N'));
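On a table with ~82 million rows, a plain ADD CONSTRAINT checks every existing row while it holds its lock. A common pattern (a sketch reusing the names above) is to add the constraint as NOT VALID and validate it in a second, less intrusive step:
ALTER TABLE tablename
    ADD CONSTRAINT consistency CHECK (consistency IN ('Y', 'N')) NOT VALID;
ALTER TABLE tablename VALIDATE CONSTRAINT consistency;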
Related
I have a column that stores a numeric value ranging from 1 to 5.
I considered using smallint and character(1), but I feel like there might be a better data type that is "char" (note the quotes).
"char" requires only 1 byte, which led me to thinking a performance gain over smallint and character(1).
However, in PostgreSQL's own documentation, it says that "char" is not intended for general-purpose use, only for use in the internal system catalogs.
Does this mean I shouldn't use "char" data type for my production application?
If so, which data type would you recommend to store a numeric value ranging from 1 to 5?
The data type "char" is a Postgres type that's not in standard SQL. It's "safe" and you can use it freely, even if "not intended for general-purpose use". There are no restrictions, other than it being non-standard. The type is used across many system catalogs, so it's not going away. If you ever need to migrate to another RDBMS, make the target varchar(1) or similar.
"char" lends itself to 1-letter enumeration types that never grow beyond a hand full of distinct plain ASCII letters. That's how I use it - including productive DBs. (You may want to add a CHECK or FOREIGN KEY constraint to columns enforcing valid values.)
For a "numeric value ranging from 1 to 5" I would still prefer smallint, seems more appropriate for numeric data.
A storage benefit (1 byte vs. 2 bytes) only kicks in for multiple columns - and/or (more importantly!) indexes that can retain a smaller footprint after applying alignment padding.
Notable updates in Postgres 15, quoting the release notes:
Change the I/O format of type "char" for non-ASCII characters (Tom Lane)
Bytes with the high bit set are now output as a backslash and three
octal digits, to avoid encoding issues.
And:
Create a new pg_type.typcategory value for "char" (Tom Lane)
Some other internal-use-only types have also been assigned to this
category.
(No effect on enumeration with plain ASCII letters.)
Related:
Shall I use enum when are too many "categories" with PostgreSQL?
How to store one-byte integer in PostgreSQL?
Calculating and saving space in PostgreSQL
I have a table with about 200 million rows and multiple columns of datatype DECIMAL(p,s) with varying precisions/scales.
Now, as far as I understand, DECIMAL(p,s) is a fixed-size column, with a size depending on the precision, see:
https://learn.microsoft.com/en-us/sql/t-sql/data-types/decimal-and-numeric-transact-sql?view=sql-server-ver16
Now, when altering the table and changing a column from DECIMAL(15,2) to DECIMAL(19,6), for example, I would have expected there to be almost no work to be done on the side of the SQL Server, as the required bytes to store the value are the same, yet the altering itself does take a long time - so what exactly does the server do when I execute the ALTER statement?
Also, is there any benefit (other than having constraints on a column) to storing a DECIMAL(15,2) instead of, for example, a DECIMAL(19,2)? It seems to me the storage requirements would be the same, but I could store larger values in the latter.
Thanks in advance!
The precision and scale of a decimal / numeric type matters considerably.
As far as SQL Server is concerned, decimal(15,2) is a different data type to decimal(19,6), and is stored differently. You therefore cannot make the assumption that just because the overall storage requirements do not change, nothing else does.
SQL Server stores decimal data types in byte-reversed (little-endian) format, with the scale being the first incrementing value. Changing the definition can therefore require the data to be re-written; SQL Server will use an internal worktable to safely convert the data and update the values on every page.
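For illustration, a T-SQL sketch of the statement in question (table and column names are assumptions). Even though DECIMAL(15,2) and DECIMAL(19,6) both occupy 9 bytes, this is not a metadata-only change: every row is re-written via the worktable, which is why the ALTER takes so long.
ALTER TABLE dbo.Measurements
    ALTER COLUMN amount DECIMAL(19,6) NOT NULL;  -- keep the nullability matching the existing column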
How to avoid the unnecessary CPU cost?
See this historic question with failure tests. Example: j->'x' is a JSONB value representing a number and j->'y' a boolean. From the first versions of JSONB (released in 2014 with 9.4) until today (6 years!), with PostgreSQL v12... it seems that we need to enforce a double conversion:
Discard the j->'x' "binary JSONB number" information and transform it into the printable string j->>'x'; discard the j->'y' "binary JSONB boolean" information and transform it into the printable string j->>'y'.
Parse the string to obtain a "binary SQL float" by casting it, (j->>'x')::float AS x; parse the string to obtain a "binary SQL boolean" by casting it, (j->>'y')::boolean AS y.
Is there no syntax or optimized function that lets a programmer enforce the direct conversion?
I don't see one in the guide... Or was it never implemented: is there a technical barrier to it?
NOTES about a typical scenario where we need it
(responding to comments)
Imagine a scenario where your system needs to store many, many small datasets (a real example!) with minimal disk usage, managing them all with centralized control/metadata/etc. JSONB is a good solution, and offers at least 2 good alternatives for storing them in the database:
Metadata (with a schema descriptor) and the whole dataset in an array of arrays;
Separating Metadata and table rows in two tables.
(and variations where the metadata is translated to a cache of text[], etc.) Alternative 1, monolithic, is the best for the "minimal disk usage" requirement, and faster for full information retrieval. Alternative 2 can be the choice for random access or partial retrieval, when the table Alt2_DatasetLine also has one more column, like time, for time series.
You can create all the SQL VIEWs in a separate schema, for example:
CREATE VIEW mydatasets.t1234 AS
SELECT (j->>'d')::date AS d, j->>'t' AS t, (j->>'b')::boolean AS b,
(j->>'i')::int AS i, (j->>'f')::float AS f
FROM (
select jsonb_array_elements(j_alldata) j FROM Alt1_AllDataset
where dataset_id=1234
) t
-- or FROM alt2...
;
And the CREATE VIEWs can all be automated, running the SQL string dynamically ... we can reproduce the above "stable schema casting" with simple formatting rules extracted from the metadata:
SELECT string_agg( CASE
WHEN x[2]!='text' THEN format(E'(j->>\'%s\')::%s AS %s',x[1],x[2],x[1])
ELSE format(E'j->>\'%s\' AS %s',x[1],x[1])
END, ',' ) as x2
FROM (
SELECT regexp_split_to_array(trim(x),'\s+') x
FROM regexp_split_to_table('d date, t text, b boolean, i int, f float', ',') t1(x)
) t2;
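For example, a sketch of running it dynamically (reusing the assumed names from above: schema mydatasets, table Alt1_AllDataset), where the generated column list is fed to EXECUTE inside a DO block:
DO $do$
DECLARE
  cols text;
BEGIN
  -- build the column list with the same formatting rules as above
  SELECT string_agg(CASE
           WHEN x[2] <> 'text' THEN format(E'(j->>\'%s\')::%s AS %s', x[1], x[2], x[1])
           ELSE format(E'j->>\'%s\' AS %s', x[1], x[1])
         END, ', ')
  INTO   cols
  FROM  (SELECT regexp_split_to_array(trim(x), '\s+') AS x
         FROM   regexp_split_to_table('d date, t text, b boolean, i int, f float', ',') t1(x)) t2;

  -- run the generated CREATE VIEW
  EXECUTE format($v$
    CREATE VIEW mydatasets.t1234 AS
    SELECT %s
    FROM (SELECT jsonb_array_elements(j_alldata) AS j
          FROM   Alt1_AllDataset
          WHERE  dataset_id = 1234) t
  $v$, cols);
END
$do$;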
... It's a "real life scenario", this (apparently ugly) model is surprisingly fast for small traffic applications. And other advantages, besides disk usage reduction: flexibility (you can change datataset schema without need of change in the SQL schema) and scalability (2, 3, ... 1 billion of different datasets on the same table).
Returning to the question: for a dataset with ~50 or more columns, the SQL VIEW would be faster if PostgreSQL offered a "binary to binary" cast.
Short answer: No, there is no better way to extract a jsonb number in PostgreSQL than (for example)
CAST(j ->> 'attr' AS double precision)
A JSON number happens to be stored as PostgreSQL numeric internally, so that wouldn't work “directly” anyway. But there is no principal reason why there could not be a more efficient way to extract such a value as numeric.
So, why don't we have that?
Nobody has implemented it. That is often an indication that nobody thought it worth the effort. I personally think that this would be a micro-optimization – if you want to go for maximum efficiency, you extract that column from the JSON and store it directly as column in the table.
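For example, a sketch of that last suggestion (hypothetical table t with a jsonb column j; generated columns need PostgreSQL 12 or later), where the cast is paid once at write time instead of in every query:
ALTER TABLE t
    ADD COLUMN x double precision
    GENERATED ALWAYS AS ((j ->> 'x')::double precision) STORED;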
It is not necessary to modify the PostgreSQL source to do this. It is possible to write your own C function that does exactly what you envision. If many people thought this was beneficial, I'd expect that somebody would already have written such a function.
PostgreSQL has just-in-time compilation (JIT). So if an expression like this is evaluated for a lot of rows, PostgreSQL will build executable code for that on the fly. That mitigates the inefficiency and makes it less necessary to have a special case for efficiency reasons.
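The relevant settings, for reference (JIT exists since PostgreSQL 11):
SHOW jit;               -- whether JIT compilation is enabled for this session
SHOW jit_above_cost;    -- planner cost threshold above which JIT is considered
SET jit = on;           -- enable it for the current session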
It might not be quite as easy as it seems for many data types. JSON standard types don't necessarily correspond to PostgreSQL types in all cases. That may seem contrived, but look at this recent thread in the Hackers mailing list that deals with the differences between the numeric types between JSON and PostgreSQL.
None of the above are reasons why such a feature could never exist; I just wanted to give reasons why we don't have it.
The environment for this question is PostgreSQL 9.6.5 on AWS RDS.
The question is about an optimal schema design and batch update strategy for a table with 300 million rows containing the following logical data model:
id: primary key, string up to 40 characters long
code: integer 1-999
year: integer year
flags: variable number (1000+) each associated with a name, new flags added over time. Ideally, a flag should be thought of as having three values: absent (null), on (true/1) and off (false/0). It is possible, at the cost of additional updates (see below), to treat a flag as a simple bit (on or off, no absent). "On" values are typically very sparse: < 1/1000.
Queries typically involve boolean expressions on the presence or absence of one or more flags (by name) with code and year occasionally involved also.
The data is updated in batch via Apache Spark, i.e., updates can be represented as flat file(s), e.g., in COPY format, or as SQL operations. Only one update is active at any one time. Updates to code and year are very infrequent. Updates to flags affect 1-5% of rows per update (3-15 million rows). It is possible for the update rows to include all flags and their values, just the "on" flags to be updated or just the flags whose values have changed. In the former case, Spark would need to query the data to get the current values of flags.
There will be a small read load during updates.
The question is about an optimal schema and associated update strategy to support the query & updates as described.
Some comments from research so far:
Using 1,000+ boolean columns would create a very efficient row representation but, in addition to some DDL complexity, would require 1,000+ indexes.
Bit strings would be great if there was a way to index individual bits. Also, they do not offer a good way to represent absent flags. Using this approach would require maintaining a lookup table between flag names and bit IDs. Merging updates, if needed, works with ||, though, given PostgreSQL's MVCC there doesn't seem to be much benefit to updating just flags as opposed to replacing an entire row.
JSONB fields offer indexing. They also offer null representation, but that comes at a cost: all flags that are "off" would need to be explicitly set, which would make the fields quite large. If we ignore null representation, JSONB fields would be relatively small. To further shrink them, we could use short 1-3 character field names with a lookup table. Same comments re: merging as with bit strings. (See the sketch after this list.)
tsvector/tsquery: have no experience with this data type but, in theory, seems to be an exact representation of a set of "on" flags by name. Must use a lookup table mapping flag names to tokens with the additional requirement to ensure there are no collisions due to stemming.
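To make the JSONB option concrete, a sketch (column names are assumptions) that stores only the "on" flags as object keys and indexes them for containment queries:
CREATE TABLE data (
    id    varchar(40) PRIMARY KEY,
    code  smallint,
    year  smallint,
    flags jsonb NOT NULL DEFAULT '{}'
);
CREATE INDEX data_flags_idx ON data USING gin (flags jsonb_path_ops);

-- "is flag f17 on?" becomes a containment test that can use the index:
SELECT id FROM data WHERE flags @> '{"f17": true}';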
Don't store the flags in the main table.
Assuming that the main table is called data, define something like the following:
CREATE TABLE flag_names (
id smallint PRIMARY KEY,
name text NOT NULL
);
CREATE TABLE flag (
flagname_id smallint NOT NULL REFERENCES flag_names(id),
data_id text NOT NULL REFERENCES data(id),
value boolean NOT NULL,
PRIMARY KEY (flagname_id, data_id)
);
If a new flag is created, insert a new row in flag_names.
If a flag is set to TRUE or FALSE, insert or update a row in the flag table.
Join flag with data to test if a certain flag is set.
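A sketch of such a query, using the tables defined above (the flag name 'my_flag' is hypothetical): all rows of data whose flag 'my_flag' is on.
SELECT d.*
FROM   data d
JOIN   flag f        ON f.data_id = d.id AND f.value
JOIN   flag_names fn ON fn.id = f.flagname_id
WHERE  fn.name = 'my_flag';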
Using IBM DB2 9.7, in a 32k tablespace, and assuming a 10000-byte (ten thousand bytes) long column fits nicely in the tablespace: is there a difference between these two, and is one preferred over the other?
VARCHAR(10000)
CLOB(536870912) INLINE LENGTH 10000
Is either one preferred in terms of functionality and performance? At a quick look, the CLOB appears more versatile: all content shorter than 10000 bytes is stored in the tablespace, but if bigger content is required then that is fine too, it is just stored elsewhere on disk.
There are a number of restrictions on the way CLOB can be used in a query:
Special restrictions apply to expressions resulting in a CLOB data type, and to structured type columns; such expressions and columns are not permitted in:
A SELECT list preceded by the DISTINCT clause
A GROUP BY clause
An ORDER BY clause
A subselect of a set operator other than UNION ALL
A basic, quantified, BETWEEN, or IN predicate
An aggregate function
VARGRAPHIC, TRANSLATE, and datetime scalar functions
The pattern operand in a LIKE predicate, or the search string operand in a POSSTR function
The string representation of a datetime value
So if you need to do any of those things, VARCHAR is to be preferred.
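For instance (a sketch with a hypothetical table doc holding a CLOB column body), each of the following is rejected for the CLOB column but works with VARCHAR(10000):
SELECT DISTINCT body FROM doc;        -- DISTINCT over a CLOB
SELECT body FROM doc ORDER BY body;   -- ORDER BY a CLOB
SELECT body FROM doc GROUP BY body;   -- GROUP BY a CLOB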
I don't have a definitive answer about performance (unfortunately, information like this just doesn't seem to be in the documentation--or at least, it is not easily locatable). However, logically speaking, the DB has more work to do with a CLOB. It has to decide whether to return the CLOB directly in the result or not. That has to mean at least some overhead. Here is a good discussion of some of the issues, though it doesn't give a clear answer on performance, either.
My default position would be to use VARCHAR unless CLOB is really needed (data in the column can be bigger than the VARCHAR limit).