Does the size physically occupied by a value in a database column depend on my string size or on my column's max size? - postgresql

For example, I have a column of varchar(2000) for messages.
If most of my messages have a length of 50 characters, is the "real place in memory" they occupy optimized?
Or does each of them occupy 2000 characters?
I use PostgreSQL.

The storage (and memory) space needed only depends on the actual data stored in the column. A column defined as varchar(2000) that only contains at most 50 characters does not need more storage or memory than a column defined as varchar(50).
Quote from the manual:
If the string to be stored is shorter than the declared length, [...] values of type character varying will simply store the shorter string
(Emphasis mine)
Note that this is different for the character data type, but that shouldn't be used anyway.
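If you want to check this yourself, here is a minimal sketch (the table and column names are made up) that stores the same 50-character string under two different declared limits and compares the stored sizes with pg_column_size:
CREATE TABLE msg_wide   (body varchar(2000));
CREATE TABLE msg_narrow (body varchar(50));

INSERT INTO msg_wide   VALUES (repeat('x', 50));
INSERT INTO msg_narrow VALUES (repeat('x', 50));

-- Both queries report the same size: the 50 characters plus a small
-- length header, regardless of the declared maximum.
SELECT pg_column_size(body) FROM msg_wide;
SELECT pg_column_size(body) FROM msg_narrow;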

Related

What is the best way to store varying-size columns in Postgres for language translation?

Let's say I create a table in Postgres to store language translations. Let's say I have a table like EVENT that has multiple columns which need translation to other languages. Rather than adding new columns for each language in EVENT, I would just add new rows in LANGUAGE with the same language_id.
EVENT:

id | EVENT_NAME (fk to LANGUAGE.short) | EVENT_DESCRIPTION (fk to LANGUAGE.long)
---|-----------------------------------|----------------------------------------
0  | 1                                 | 2

LANGUAGE:

language_id | language_type | short (char varying 200) | med (char varying 50) | long (char varying 2000)
------------|---------------|--------------------------|-----------------------|-------------------------
1           | english       | game of soccer           | null                  | null
1           | spanish       | partido de footbol       | null                  | null
2           | english       | null                     | null                  | A really great game of soccer
2           | spanish       | null                     | null                  | Un gran partido de footbol
If I want the language-specific version I would create a parameterized statement and pass in the language like this:
select e.id, name.short, descr.long
from event e, language name, language descr
where e.id = 0
and e.event_name = name.language_id
and name.language_type = ?
and e.event_description = descr.language_id
and descr.language_type = ?
My first thought was to have just a single column for the translated text, but I would have to make it big enough to hold any translation. Setting it to 2000 when many records will only be 50 characters seemed wrong. Hence I thought I could add different columns with different sizes and just use the appropriate size for the data I'm storing (the event name can be restricted to 50 characters on the front end and the description can be restricted to 2000 characters).
In the LANGUAGE table only one of the 3 columns (short, med, long) will be set per row. This is just my initial thought, but I'm trying to understand if this is a bad approach. Does the disk still reserve 2250 characters if I just set the short value? I read a while back that if you do this sort of thing in Oracle it has to reserve the space for all columns in the disk block; otherwise, if you update the record, it would have to do it dynamically, which could be slow. Is Postgres the same?
It looks like you can specify a character varying type without a length. Would it be more efficient (space-wise) to define a single column and not specify the size, or a single column with the size specified as 2000?
Just use a single column of data type text. That will perform just as well as character varying(n) in PostgreSQL, because the implementation is exactly the same, minus the length check. PostgreSQL only stores as many characters as the string actually has, so there is no overhead in using text even for short strings.
In the words of the documentation:
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column.
Use text.
With Postgres, unless you really need to put a hard limit on the text size (and usually you do not), use text.
I see posts claiming that varchar without n and text have the same performance but can be slightly slower than varchar(n). This is incorrect as of PostgreSQL 14: text is the most performant.
There is no performance difference among these three types [char, varchar, text], apart from increased storage space when using the blank-padded type [ie. char], and a few extra CPU cycles to check the length when storing into a length-constrained column [ie. varchar, char].
Storage for varchar and text is the same.
The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes of overhead instead of 1. Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values.
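To see the header overhead from that quote in practice, here is a small sketch (hypothetical table, for illustration only) using pg_column_size on stored values:
CREATE TABLE t (short_val text, long_val text);
INSERT INTO t VALUES (repeat('x', 10), repeat('x', 500));

-- The short string is stored with a 1-byte header (10 + 1 bytes), the
-- longer one with a 4-byte header; very long values may additionally be
-- compressed or moved out of line (TOAST).
SELECT pg_column_size(short_val), pg_column_size(long_val) FROM t;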
Do you need to reinvent this wheel?
Internationalization and localization tools already exist; they're built into many application frameworks.
Most localization systems work by using a string written in the developer's preferred language (for example, English) and then localizing that string as necessary. Then it becomes a simple hash lookup.
Rather than doing the localization in the database, query the message and then let the application framework localize it. This is simpler and faster and easier to work with and avoids putting additional load on the database.
If you really must do it in the database, here's some notes:
language is a confusing name for your table, perhaps messages or translations or localizations.
Use standard IETF language tags such as en-US.
Put any fixed set of text, such as language tags, into a table for referential integrity.
You could use an enum, but they're awkward to read and change.
Rather than having short, medium, and long columns, consider having one row for each message and a size column. This avoids a lot of null columns.
In fact, don't have "size" at all. It's arbitrary and subjective, as is evidenced by your null columns.
Don't prefix columns with the name of the table; fully qualify them as needed.
Separate the message from its localization for referential integrity.
create table languages (
  id serial primary key,
  tag text unique not null
);

-- This table might seem a bit silly, but it allows tables to refer to
-- a single message and enforce referential integrity.
create table messages (
  id bigserial primary key
);

create table message_localizations (
  -- If a message or language is deleted, its localization will be deleted too.
  message_id bigint not null references messages on delete cascade,
  language_id int not null references languages on delete cascade,
  localization text not null,

  -- This serves as both a primary key and enforces one localization
  -- per message and language.
  primary key(message_id, language_id)
);

create table events (
  id bigserial primary key,
  name_id bigint not null references messages,
  description_id bigint not null references messages
);
Then join each message with its localization.
select
  events.id,
  ml1.localization as name,
  ml2.localization as description
from events
-- left join so there is a result even if there is no localization. YMMV.
left join languages lt on lt.tag = 'en-US'
left join message_localizations ml1
  on ml1.message_id = name_id and ml1.language_id = lt.id
left join message_localizations ml2
  on ml2.message_id = description_id and ml2.language_id = lt.id
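For example, populating the schema with the soccer message from the question might look like this (the numeric ids in the comments assume empty tables and are illustrative only):
insert into languages (tag) values ('en-US'), ('es-ES');

insert into messages default values;  -- say this gets id 1 (event name)
insert into messages default values;  -- and this gets id 2 (event description)

insert into message_localizations (message_id, language_id, localization) values
  (1, 1, 'game of soccer'),
  (1, 2, 'partido de footbol'),
  (2, 1, 'A really great game of soccer'),
  (2, 2, 'Un gran partido de footbol');

insert into events (name_id, description_id) values (1, 2);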
But, again, you probably want to use an existing localization tool.

I want to know the maximum size of NVARCHAR that can be supported by a DB2 unique index

I want to understand the maximum size (relative to the page size) of an NVARCHAR column supported by a unique index in DB2 LUW 11.1.x.
For example, I need an answer like:
(1) NVARCHAR(512) is the maximum size of an NVARCHAR column supported by a unique index.
(2) X is the maximum byte size for any unique index.
I have this link, but it's not very clear to me:
https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.sql.ref.doc/doc/r0001029.html
The page you show indicates that the maximum size of a single column in an index key is pagesize/4 in bytes, including overhead; so you need to understand how NVARCHAR(x) maps to bytes.
This depends on the setting of the nchar_mapping database configuration parameter.
The default setting is to use 4 bytes per character (CODEUNITS32), so you'd be limited to NVARCHAR(255) for a 4k page, NVARCHAR(511) for an 8k page, NVARCHAR(1023) for a 16k page, and NVARCHAR(2047) for a 32k page table space. For a 4k page, for example, the per-column key limit is 4096 / 4 = 1024 bytes; at 4 bytes per character that leaves room for 255 characters once the overhead is accounted for.

Can we create a column of character varying(MAX) with PostgreSQL database

I am unable to set the max size of a particular column in PostgreSQL with a MAX keyword. Is there any keyword like MAX? If not, how can we create the column with the maximum size?
If you want to create an "unbounded" varchar column, just use varchar without a length restriction.
From the manual:
If character varying is used without length specifier, the type accepts strings of any size
So you can use:
create table foo
(
  unlimited varchar
);
Another alternative is to use text:
create table foo
(
  unlimited text
);
More details about character data types are in the manual:
http://www.postgresql.org/docs/current/static/datatype-character.html
You should use the TEXT data type for this use case, IMO.
https://www.postgresql.org/docs/9.1/datatype-character.html

Does the order of columns in a Postgres table impact performance?

In Postgres does the order of columns in a CREATE TABLE statement impact performance? Consider the following two cases:
CREATE TABLE foo (
  a TEXT,
  B VARCHAR(512),
  pkey INTEGER PRIMARY KEY,
  bar_fk INTEGER REFERENCES bar(pkey),
  C bytea
);
vs.
CREATE TABLE foo2 (
  pkey INTEGER PRIMARY KEY,
  bar_fk INTEGER REFERENCES bar(pkey),
  B VARCHAR(512),
  a TEXT,
  C bytea
);
Will the performance of foo2 be better than that of foo because of better byte alignment of the columns? When Postgres executes CREATE TABLE, does it follow the column order specified, or does it reorganize the columns in an optimal order for byte alignment or performance?
Question 1
Will the performance of foo2 be better than foo because of better byte alignment for the columns?
Yes, the order of columns can have a small impact on performance. Type alignment is the more important factor, because it affects the footprint on disk. You can minimize storage size (play "column tetris") and squeeze more rows on a data page - which is the most important factor for speed.
Normally, it's not worth bothering. With an extreme example like in this related answer you get a substantial difference:
Calculating and saving space in PostgreSQL
Type alignment details:
Making sense of Postgres row sizes
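As a rough illustration of the alignment effect (a sketch with made-up tables; exact sizes depend on your platform's alignment rules), compare two layouts of the same columns:
-- bigint is aligned to 8 bytes, so each boolean placed before a bigint
-- forces 7 bytes of padding in the first layout.
CREATE TABLE padded (a boolean, b bigint, c boolean, d bigint);
CREATE TABLE packed (b bigint, d bigint, a boolean, c boolean);

INSERT INTO padded SELECT true, g, false, g FROM generate_series(1, 100000) g;
INSERT INTO packed SELECT g, g, true, false FROM generate_series(1, 100000) g;

-- The second table stores the same data in noticeably fewer pages.
SELECT pg_size_pretty(pg_relation_size('padded'));
SELECT pg_size_pretty(pg_relation_size('packed'));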
The other factor is that retrieving column values is slightly faster if you have fixed-size columns first. I quote the manual here:
To read the data you need to examine each attribute in turn. First check whether the field is NULL according to the null bitmap. If it is, go to the next. Then make sure you have the right alignment. If the field is a fixed width field, then all the bytes are simply placed. If it's a variable length field (attlen = -1) then it's a bit more complicated. All variable-length data types share the common header structure struct varlena, which includes the total length of the stored value and some flag bits.
There is an open TODO item to allow reordering of column positions in the Postgres Wiki, partly for these reasons.
Question 2
When Postgres executes a CREATE TABLE does it follow the column order specified or does it re-organize the columns in optimal order for byte alignment or performance?
Columns are stored in the defined order; the system does not try to optimize.
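You can see the stored order for yourself by querying pg_attribute (assuming the foo table from the question exists):
SELECT attnum, attname, atttypid::regtype
FROM   pg_attribute
WHERE  attrelid = 'foo'::regclass
AND    attnum > 0
AND    NOT attisdropped
ORDER  BY attnum;
The attnum values follow the order of the CREATE TABLE statement exactly.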
I fail to see any relevance of column order to TOAST tables like another answer seems to imply.
As far as I understand, PostgreSQL adheres to the order in which you enter the columns when saving records. Whether this affects performance is debatable. PostgreSQL stores all table data in pages, each 8 kB in size. 8 kB is the default, but it can be changed at compile time.
Each row in the table will take up space within the page. Since your table definition contains variable-length columns, a page can hold a variable number of records. What you want to do is make sure you can fit as many records into one page as possible. That is why you will notice performance degradation when a table has a huge number of columns or the column sizes are huge.
This being said, declaring a varchar(8192) does not mean the page will be filled up with one record, but declaring a CHAR(8192) will use up one whole page irrespective of the amount of data in the column.
There is one more thing to consider when declaring TOASTable types such as TEXT columns. These are columns that could exceed the maximum page size. A table that has TOASTable columns will have an associated TOAST table to store the data and only a pointer to the data is stored with the table. This can impact performance, but can be improved with proper indexes on the TOASTable columns.
To conclude, I would have to say that the order of the columns does not play much of a role in the performance of a table. Most queries utilise indexes, which are stored separately, to retrieve records, and therefore column order matters little there. It comes down to how many pages need to be read to retrieve the data.

How many records can I store in 5 MB of PostgreSQL on Heroku?

I'm going to store records in a single table with 2 fields:
id -> 4 characters
password_hash -> 64 characters
How many records like the one above will I be able to store in a 5 MB PostgreSQL database on Heroku?
P.S.: given a single table with x columns and a length of y - how can I calculate the space it will take in a database?
Disk space occupied
Calculating the space on disk is not trivial. You have to take into account:
The overhead per table. Small, basically the entries in the system catalog.
The overhead per row (HeapTupleHeader) and per data page (PageHeaderData). Details about page layout in the manual.
Space lost to column alignment, depending on data types.
Space for a NULL bitmap. Effectively free for tables of 8 columns or less, irrelevant for your case.
Dead rows after UPDATE / DELETE. (Until the space is eventually vacuumed and reused.)
Size of index(es). You'll have a primary key, right? Index size is similar to that of a table with just the indexed columns and less overhead per row.
The actual space requirement of the data, depending on respective data types. Details for character types (incl. fixed length types) in the manual:
The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes of overhead instead of 1.
More details for all types in the system catalog pg_type.
The database encoding, in particular for character types. UTF-8 uses up to four bytes to store one character (but 7-bit ASCII characters always occupy just one byte, even in UTF-8); see the quick illustration after this list.
Other small things that may affect your case, like TOAST - which should not affect you with 64-character strings.
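A quick illustration of the encoding point above (assuming a UTF-8 database):
-- 'café' is 4 characters but 5 bytes in UTF-8, because 'é' needs two bytes.
SELECT char_length('café') AS characters,
       octet_length('café') AS bytes;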
Calculate with test case
A simple method to find an estimate is to create a test table, fill it with dummy data, and measure with database object size functions:
SELECT pg_size_pretty(pg_relation_size('tbl'));
Including indexes:
SELECT pg_size_pretty(pg_total_relation_size('tbl'));
See:
Measure the size of a PostgreSQL table row
A quick test shows the following results:
CREATE TABLE test(a text, b text);

INSERT INTO test  -- quick fake of matching rows
SELECT chr(g/1000 + 32) || to_char(g%1000, 'FM000')
     , repeat(chr(g%120 + 32), 64)
FROM   generate_series(1, 50000) g;

SELECT pg_size_pretty(pg_relation_size('test'));        -- 5640 kB
SELECT pg_size_pretty(pg_total_relation_size('test'));  -- 5648 kB
After adding a primary key:
ALTER TABLE test ADD CONSTRAINT test_pkey PRIMARY KEY(a);
SELECT pg_size_pretty(pg_total_relation_size('test')); -- 6760 kB
So, with 5 MB to fill, I'd expect a maximum of around 44k rows without and around 36k rows with the primary key (scaling down the 50,000-row test that took 5648 kB and 6760 kB respectively, with a little headroom).