I have a table with about 200 million rows and multiple columns of data type DECIMAL(p,s) with varying precisions/scales.
Now, as far as I understand, DECIMAL(p,s) is a fixed-size column, with a size depending on the precision, see:
https://learn.microsoft.com/en-us/sql/t-sql/data-types/decimal-and-numeric-transact-sql?view=sql-server-ver16
Now, when altering the table and changing a column from DECIMAL(15,2) to DECIMAL(19,6), for example, I would have expected there to be almost no work for SQL Server to do, as the number of bytes required to store the value is the same. Yet the ALTER itself takes a long time. So what exactly does the server do when I execute the ALTER statement?
Also, is there any benefit (other than having constraints on a column) to storing a DECIMAL(15,2) instead of, for example, a DECIMAL(19,2)? It seems to me the storage requirements would be the same, but I could store larger values in the latter.
Thanks in advance!
The precision and scale of a decimal / numeric type matters considerably.
As far as SQL Server is concerned, decimal(15,2) is a different data type to decimal(19,6), and is stored differently. You therefore cannot make the assumption that just because the overall storage requirements do not change, nothing else does.
SQL Server stores decimal values in byte-reversed (little-endian) format, with the scale being the first incrementing value, so changing the definition can require the data to be rewritten. SQL Server will use an internal worktable to safely convert the data and update the values on every page.
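To make the point concrete, here is a minimal Python sketch. The storage bands come from the documentation linked in the question; the `unscaled_integer` helper is a simplified model of my own (the value with the decimal point shifted by the scale), not SQL Server's actual on-disk layout.

```python
from decimal import Decimal


def decimal_storage_bytes(precision: int) -> int:
    """Storage size of decimal(p,s), following the documented precision bands."""
    if not 1 <= precision <= 38:
        raise ValueError("precision must be between 1 and 38")
    if precision <= 9:
        return 5
    if precision <= 19:
        return 9
    if precision <= 28:
        return 13
    return 17


def unscaled_integer(value: str, scale: int) -> int:
    """Simplified model (an assumption): the number with its decimal point
    shifted `scale` places to the right, i.e. value * 10**scale."""
    return int(Decimal(value).scaleb(scale))


# Same storage size for both definitions ...
print(decimal_storage_bytes(15), decimal_storage_bytes(19))
# ... but the same logical value maps to a different integer per scale.
print(unscaled_integer("123.45", 2), unscaled_integer("123.45", 6))
```

In this model, decimal(15,2) and decimal(19,6) both take 9 bytes, yet 123.45 becomes 12345 under scale 2 and 123450000 under scale 6, so every stored value has to be recomputed when the scale changes.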
I have the following table:
create type size_type as enum('tiny', 'small', 'tall');
CREATE TABLE IF NOT EXISTS people (
size size_type NOT NULL
);
Imagine it has tons of data. If I index the size field, will the length of the enum labels affect the performance of the database when executing queries? For example, will ('ti','s','ta') be more performant than ('tiny', 'small', 'tall'), or does it not matter?
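It does not matter at all. Under the hood, an enum value is stored as a 4-byte OID that points to a row in pg_enum, regardless of the label text. (The real, i.e. 4-byte floating point, number in that catalog is pg_enum.enumsortorder, which only determines the sort order of the labels.)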
The documentation offers the following explanation:
An enum value occupies four bytes on disk. The length of an enum value's textual label is limited by the NAMEDATALEN setting compiled into PostgreSQL; in standard builds this means at most 63 bytes.
The translations from internal enum values to textual labels are kept in the system catalog pg_enum. Querying this catalog directly can be useful.
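As a toy illustration of why the label length is irrelevant, here is a small Python model. The identifiers and the two dictionaries are made up for the example; they only mimic the idea of a catalog that translates a 4-byte stored value into a label.

```python
import struct

# Hypothetical catalog contents: internal value -> label (numbers are invented).
CATALOG_LONG = {16401: "tiny", 16403: "small", 16405: "tall"}
CATALOG_SHORT = {16401: "ti", 16403: "s", 16405: "ta"}


def stored_bytes(internal_value: int) -> bytes:
    """What a row holds in this model: only the 4-byte internal value."""
    return struct.pack("<I", internal_value)


def label_for(catalog: dict, raw: bytes) -> str:
    """Translate the stored 4 bytes back to a label via the catalog."""
    (internal_value,) = struct.unpack("<I", raw)
    return catalog[internal_value]


raw = stored_bytes(16403)
print(len(raw), label_for(CATALOG_LONG, raw), label_for(CATALOG_SHORT, raw))
```

The row (and any index entry built on it) carries the same four bytes whichever label set is in use; only the catalog lookup returns different text.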
How to avoid the unnecessary CPU cost?
See this historic question with failure tests. Example: j->'x' is a JSONb value representing a number and j->'y' is one representing a boolean. From the first versions of JSONb (released in 2014 with 9.4) until today (6 years!), with PostgreSQL v12, it seems that we need to enforce a double conversion:
Discard the "binary JSONb number" information of j->'x' and transform it into the printable string j->>'x'; discard the "binary JSONb boolean" information of j->'y' and transform it into the printable string j->>'y'.
Parse the string to obtain a "binary SQL float" by casting: (j->>'x')::float AS x; parse the string to obtain a "binary SQL boolean" by casting: (j->>'y')::boolean AS y.
Is there no syntax or optimized function that lets a programmer enforce the direct conversion?
I don't see one in the guide. Or was it never implemented: is there a technical barrier to it?
NOTES about typical scenario where we need it
(responding to comments)
Imagine a scenario where your system needs to store a great many small datasets (real example!) with minimal disk usage, managing them all with centralized control/metadata/etc. JSONb is a good solution, and offers at least 2 good alternatives for storing them in the database:
Metadata (with schema descriptor) and all dataset in an array of arrays;
Separating Metadata and table rows in two tables.
(and variations where the metadata is translated to a cache of text[], etc.) Alternative-1, monolithic, is the best for the "minimal disk usage" requirement, and faster for full information retrieval. Alternative-2 can be the choice for random access or partial retrieval, when the table Alt2_DatasetLine also has one more column, like time, for time series.
You can create all SQL VIEWS in a separate schema, for example:
CREATE VIEW mydatasets.t1234 AS
SELECT (j->>'d')::date AS d, j->>'t' AS t, (j->>'b')::boolean AS b,
(j->>'i')::int AS i, (j->>'f')::float AS f
FROM (
select jsonb_array_elements(j_alldata) j FROM Alt1_AllDataset
where dataset_id=1234
) t
-- or FROM alt2...
;
And the CREATE VIEW statements can all be generated automatically by running the SQL string dynamically ... we can reproduce the above "stable schema casting" with simple formatting rules, extracted from the metadata:
SELECT string_agg( CASE
WHEN x[2]!='text' THEN format(E'(j->>\'%s\')::%s AS %s',x[1],x[2],x[1])
ELSE format(E'j->>\'%s\' AS %s',x[1],x[1])
END, ',' ) as x2
FROM (
SELECT regexp_split_to_array(trim(x),'\s+') x
FROM regexp_split_to_table('d date, t text, b boolean, i int, f float', ',') t1(x)
) t2;
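For readers who prefer to generate the view definition in application code, the same formatting rule can be sketched in Python. The function name `build_select_list` is hypothetical, and like the query above it does no identifier quoting, so it assumes the metadata is trusted.

```python
import re


def build_select_list(descriptor: str) -> str:
    """Turn a 'name type, name type, ...' descriptor into the casting
    expressions used in the view, mirroring the string_agg query."""
    parts = []
    for column in descriptor.split(","):
        # Each entry is "<name> <type>", separated by whitespace.
        name, type_name = re.split(r"\s+", column.strip())
        if type_name != "text":
            parts.append(f"(j->>'{name}')::{type_name} AS {name}")
        else:
            # j->>'name' already yields text, so no cast is needed.
            parts.append(f"j->>'{name}' AS {name}")
    return ",".join(parts)


print(build_select_list("d date, t text, b boolean, i int, f float"))
```

The output is the column list of the view shown earlier, ready to be placed between SELECT and FROM.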
... It's a "real life scenario"; this (apparently ugly) model is surprisingly fast for low-traffic applications. There are other advantages besides reduced disk usage: flexibility (you can change the dataset schema without changing the SQL schema) and scalability (2, 3, ... 1 billion different datasets in the same table).
Returning to the question: imagine a dataset with ~50 or more columns; the SQL VIEW would be faster if PostgreSQL offered a "binary to binary casting".
Short answer: No, there is no better way to extract a jsonb number as a PostgreSQL number than (for example)
CAST(j ->> 'attr' AS double precision)
A JSON number happens to be stored as PostgreSQL numeric internally, so that wouldn't work “directly” anyway. But there is no reason in principle why there could not be a more efficient way to extract such a value as numeric.
So, why don't we have that?
Nobody has implemented it. That is often an indication that nobody thought it worth the effort. I personally think that this would be a micro-optimization – if you want to go for maximum efficiency, you extract that column from the JSON and store it directly as column in the table.
It is not necessary to modify the PostgreSQL source to do this. It is possible to write your own C function that does exactly what you envision. If many people thought this was beneficial, I'd expect that somebody would already have written such a function.
PostgreSQL has just-in-time compilation (JIT). So if an expression like this is evaluated for a lot of rows, PostgreSQL will build executable code for that on the fly. That mitigates the inefficiency and makes it less necessary to have a special case for efficiency reasons.
It might not be quite as easy as it seems for many data types. JSON standard types don't necessarily correspond to PostgreSQL types in all cases. That may seem contrived, but look at this recent thread in the Hackers mailing list that deals with the differences between the numeric types between JSON and PostgreSQL.
None of the above is a reason why such a feature could never exist; I just wanted to give reasons why we don't have it.
In Java I can say Integer.MAX_VALUE to get the largest number that the int type can hold.
Is there a similar constant/function in Postgres? I'd like to avoid hard-coding the number.
Edit: the reason I am asking is this. There is a legacy table with an ID of type integer, backed by a sequence. There are a lot of incoming rows in this table. I want to calculate how much time is left before the integer runs out, so I need to know "how many IDs are left" divided by "how fast we are spending them".
There's no constant for this, but I think it's more reasonable to hard-code the number in Postgres than it is in Java.
In Java, the philosophical goal is for Integer to be an abstract value, so it makes sense that you'd want to behave as if you don't know what the max value is.
In Postgres, you're much closer to the bare metal and the definition of the integer type is that it is a 4-byte signed integer.
There is a legacy table with an ID of type integer, backed by a sequence.
In that case, you can get the max value of the sequence by:
select seqmax from pg_sequence where seqrelid = 'your_sequence_name'::regclass;
This might be better than getting the MAX_INT, because the sequence may have been created/altered with a specific max value that is different from MAX_INT.
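The "how many IDs are left divided by how fast we are spending them" calculation from the question could then look roughly like this in Python. The function names and the sample numbers are made up for illustration; in practice you would feed in the sequence's current value, its seqmax, and a measured insert rate.

```python
# Largest value of a 4-byte signed integer, used as the default ceiling.
INT4_MAX = 2**31 - 1  # 2147483647


def ids_left(last_value: int, max_value: int = INT4_MAX) -> int:
    """Number of IDs still available before the ceiling is reached."""
    return max_value - last_value


def days_until_exhaustion(last_value: int, ids_per_day: float,
                          max_value: int = INT4_MAX) -> float:
    """Remaining IDs divided by the (assumed constant) daily consumption."""
    if ids_per_day <= 0:
        raise ValueError("ids_per_day must be positive")
    return ids_left(last_value, max_value) / ids_per_day


# Hypothetical figures: 1.9 billion IDs used so far, 500,000 new rows per day.
print(days_until_exhaustion(1_900_000_000, 500_000))
```

Passing the sequence's own seqmax as `max_value` covers the case where the sequence was created with a lower limit than the integer type allows.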
Using IBM DB2 9.7, in a 32k tablespace, assume a 10000-byte (ten thousand bytes) long column fits nicely in the tablespace. Is there a difference between these two, and is one preferred over the other?
VARCHAR(10000)
CLOB(536870912) INLINE LENGTH 10000
Is either one preferred in terms of functionality and performance? At a quick look, the CLOB seems more versatile: all content shorter than 10000 bytes is stored in the tablespace, but if bigger content is required then that is fine too; it is just stored elsewhere on disk.
There are a number of restrictions on the way CLOB can be used in a query:
Special restrictions apply to expressions resulting in a CLOB data type, and to structured type columns; such expressions and columns are not permitted in:
A SELECT list preceded by the DISTINCT clause
A GROUP BY clause
An ORDER BY clause
A subselect of a set operator other than UNION ALL
A basic, quantified, BETWEEN, or IN predicate
An aggregate function
VARGRAPHIC, TRANSLATE, and datetime scalar functions
The pattern operand in a LIKE predicate, or the search string operand in a POSSTR function
The string representation of a datetime value.
So if you need to do any of those things, VARCHAR is to be preferred.
I don't have a definitive answer about performance (unfortunately, information like this just doesn't seem to be in the documentation--or at least, it is not easily locatable). However, logically speaking, the DB has more work to do with a CLOB. It has to decide whether to return the CLOB directly in the result or not. That has to mean at least some overhead. Here is a good discussion of some of the issues, though it doesn't give a clear answer on performance, either.
My default position would be to use VARCHAR unless CLOB is really needed (data in the column can be bigger than the VARCHAR limit).
I've started working on a project where there is a fairly large table (about 82,000,000 rows) that I think is very bloated. One of the fields is defined as:
consistency character varying NOT NULL DEFAULT 'Y'::character varying
It's used as a boolean, the values should always either be ('Y'|'N').
Note: there is no check constraint, etc.
I'm trying to come up with reasons to justify changing this field. Here is what I have:
It's being used as a boolean, so make it that. Explicit is better than implicit.
It will protect against coding errors, because right now anything that can be converted to text will go blindly in there.
Here are my question(s).
What about size/storage? The db is UTF-8. So, I think there really isn't much of a savings in that regard. It should be 1 byte for a boolean, but also 1 byte for a 'Y' in UTF-8 (at least that's what I get when I check the length in Python). Is there any other storage overhead here that would be saved?
Query performance? Will Postgres get any performance gains for a WHERE clause of "=TRUE" vs. "='Y'"?
PostgreSQL (unlike Oracle) has a fully-fledged boolean type. Generally, a "yes/no flag" should be boolean. That's the appropriate type!
What about size/storage?
A boolean column occupies 1 byte on disk.
The manual on text or character varying:
the storage requirement for a short string (up to 126 bytes) is 1 byte
plus the actual string
That's at least 2 bytes for a single character.
Actual storage is more complicated than that. There is some fixed overhead per table, page and row, there is special NULL storage and some types require data alignment. See:
How many records can I store in 5 MB of PostgreSQL on Heroku?
Encoding UTF8 doesn't make any difference here. Basic ASCII-characters are bit-compatible with other encodings like LATIN-1.
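A quick back-of-the-envelope check of these numbers in Python. The helper names are hypothetical, and the calculation deliberately ignores everything mentioned above as "more complicated" (row and page overhead, alignment padding), so treat the result as a rough upper bound on the raw difference, not a prediction of the table size.

```python
BOOLEAN_BYTES = 1          # a boolean column occupies 1 byte
SHORT_HEADER_BYTES = 1     # header of a short string (up to 126 bytes)


def short_string_bytes(value: str) -> int:
    """1 header byte plus the encoded string, per the manual's rule."""
    payload = len(value.encode("utf-8"))
    if payload > 126:
        raise ValueError("rule only applies to strings of up to 126 bytes")
    return SHORT_HEADER_BYTES + payload


def raw_saving_bytes(rows: int, value: str = "Y") -> int:
    """Raw per-column difference between the string and a boolean."""
    return (short_string_bytes(value) - BOOLEAN_BYTES) * rows


print(short_string_bytes("Y"))          # 2
print(raw_saving_bytes(82_000_000))     # 82000000
```

For the 82,000,000 rows in the question that is about 82 MB of raw column data at most, and alignment padding may well absorb some or all of it.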
In your case, according to your description, you should keep the NOT NULL constraint you already have - independent of the data type.
Query performance?
Will be slightly better in any case with boolean. Besides being smaller, the logic for boolean is simpler and varchar or text are also generally burdened with COLLATION rules. But don't expect much for something that simple.
Instead of:
WHERE consistency = 'Y'
You could write:
WHERE consistency = true
But rather simplify to just:
WHERE consistency
No further evaluation needed.
Change type
Transforming your table is simple:
ALTER TABLE tbl ALTER consistency TYPE boolean
USING CASE consistency WHEN 'Y' THEN true ELSE false END;
This CASE expression folds everything that is not TRUE ('Y') to FALSE. The NOT NULL constraint just stays.
Neither storage size nor query performance will be significantly better after switching from a single-character VARCHAR to a BOOLEAN. Although you are right that it's technically cleaner to use boolean when you are talking about a binary value, the cost of the change is probably significantly higher than the benefit. If you're worried about correctness, then you could put a check constraint on the column, for example:
ALTER TABLE tablename ADD CONSTRAINT consistency CHECK (consistency IN ('Y', 'N'));