Does a table with an XML field consume space for NULL values? How much? - postgresql

Is there an "official benchmark" or a simple rule of thumb to decide when space or performance will be affected?
My table has many simple and indexed fields,
CREATE TABLE t (
  id serial PRIMARY KEY,
  name varchar(250) NOT NULL,
  ...
  xcontent xml, -- does a NULL here use disk space? hurt performance?
  ...
  UNIQUE(name)
);
and it is a kind of "sparse content": many xcontent values will be NULL... So, do these XML NULLs consume any disk space?
Notes
I can normalize: the table t would become nt,
CREATE TABLE nt (
  id serial PRIMARY KEY,
  name varchar(250) NOT NULL,
  ...
  UNIQUE(name)
);
CREATE TABLE nt2 (
  t_id int REFERENCES nt(id),
  xcontent xml NOT NULL
);
CREATE VIEW nt_full AS
  SELECT nt.*, nt2.xcontent FROM nt LEFT JOIN nt2 ON id=t_id;
So, do I need this complexity? Will this new table arrangement consume less disk space? Consider the use of
SELECT id, name FROM nt WHERE name>'john'; -- Q1A
SELECT id, name FROM nt_full WHERE name>'john'; -- Q1B
SELECT id, name FROM t WHERE name>'john'; -- Q1C
SELECT id, xcontent FROM nt_full WHERE name>'john'; -- Q2A
SELECT id, xcontent FROM t WHERE name>'john'; -- Q2B
So, in theory, will the performance of Q1A vs Q1B vs Q1C be the same?
And Q2A vs Q2B?
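I suppose I could also compare the actual plans instead of guessing, something like:
EXPLAIN (ANALYZE, BUFFERS) SELECT id, name FROM nt WHERE name > 'john';       -- Q1A
EXPLAIN (ANALYZE, BUFFERS) SELECT id, name FROM nt_full WHERE name > 'john';  -- Q1B
EXPLAIN (ANALYZE, BUFFERS) SELECT id, xcontent FROM t WHERE name > 'john';    -- Q2B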

The answer to the question "how much space does a null value take" is: no space at all - at least not in the "data" area.
For each nullable column in the table there is one bit in the row header that marks the column value as null (or not null). So the "space" that the null value takes is already present in the row header - regardless of whether the column is null or not.
Thus the null "value" does not occupy any space in the data block storing the row.
This is documented in the manual: http://www.postgresql.org/docs/current/static/storage-page-layout.html
Postgres will not store long string values (xml, varchar, text, json, ...) in the actual data block if they exceed a certain threshold (about 2000 bytes). If a value is longer than that, it will be stored in a special storage area "away" from your actual data. So splitting the table into two tables with a 1:1 relationship doesn't really buy you that much. Unless you are storing a lot of rows (hundreds of millions), I doubt you will be able to notice the difference - but this also depends on your usage patterns.
The data that is stored "out-of-line" is also automatically compressed.
Details about this can be found in the manual: http://www.postgresql.org/docs/current/static/storage-toast.html
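If you want to see both effects for yourself, here is a quick sketch using pg_column_size() (the temp table and values are made up purely for illustration):
CREATE TEMP TABLE xdemo (id int, xcontent xml);
INSERT INTO xdemo VALUES
  (1, NULL),                                              -- costs only a bit in the null bitmap
  (2, '<doc>small</doc>'),                                -- stored inline in the row
  (3, ('<doc>' || repeat('x', 10000) || '</doc>')::xml);  -- compressed and/or moved out of line (TOAST)
SELECT id,
       pg_column_size(xcontent) AS stored_bytes,          -- returns NULL for the NULL value
       pg_column_size(xdemo.*)  AS whole_row_bytes
FROM xdemo;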
One reason why the separate table might be an advantage is the necessary "vacuum" cleanup. If you update the XML data a lot but the rest of the table hardly ever changes, then splitting this up into two tables might improve the overall performance, because the "XML table" will need less "maintenance" and the "main" table won't be changed at all.

A varchar value has a small header on top of the content: 1 byte for short strings (up to 126 bytes) and 4 bytes for longer ones.
So if you define it as varchar(250)
and put 10 chars in, it consumes about 11 bytes;
100 chars consume about 101 bytes.
A NULL consumes no space at all in the data area (only a bit in the row's null bitmap). No problem.
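You can check the per-value footprint yourself with pg_column_size(); a quick sketch (the temp table and values are made up for illustration):
CREATE TEMP TABLE vdemo (v varchar(250));
INSERT INTO vdemo VALUES ('0123456789'), (repeat('x', 200)), (NULL);
SELECT coalesce(left(v, 10), 'NULL') AS value_prefix,
       pg_column_size(v)             AS stored_bytes  -- about 11 for the short string, about 204 for the long one, NULL for NULL
FROM vdemo;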
If you are in a situation where you need to store large amounts of XML data and end up using (for instance) a blob-like type, you should put that in another table and keep your primary table lean and fast.

Related

What is the best way to store varying size columns in postres for language translation?

Let's say I create a table in Postgres to store language translations. Let's say I have a table like EVENT that has multiple columns which need translation to other languages. Rather than adding new columns for each language in EVENT, I would just add new rows in LANGUAGE with the same language_id.
EVENT:
id | EVENT_NAME (fk to LANGUAGE.short) | EVENT_DESCRIPTION (fk to LANGUAGE.long)
---+-----------------------------------+----------------------------------------
 0 | 1                                 | 2

LANGUAGE:
language_id | language_type | short (char varying 200) | med (char varying 50) | long (char varying 2000)
------------+---------------+--------------------------+-----------------------+--------------------------
          1 | english       | game of soccer           | null                  | null
          1 | spanish       | partido de footbol       | null                  | null
          2 | english       | null                     | null                  | A really great game of soccer
          2 | spanish       | null                     | null                  | Un gran partido de footbol
If I want the language specific version I would create a parameterized statement and pass in the language like this:
select event.id, name.short, desc.long
from event e, language name, language desc
where e.id = 0
and e.event_name = name.language_id
and name.language_type = ?
and e.event_desc = desc.language_id
and desc.language_type = ?
My first thought was to have just a single column for the translated text, but I would have to make it big enough to hold any translation. Setting it to 2000 when many records will only be 50 characters seemed wrong. Hence I thought maybe to add different columns with different sizes and just use the appropriate size for the data I'm storing (event name can be restricted to 50 characters on the front end and description can be restricted to 2000 characters).
In the language table only one of the 3 columns (short, med, long) will be set per row. This is just my initial thought, but I'm trying to understand if this is a bad approach. Does the disk still reserve 2250 characters if I just set the short value? I read a while back that if you do this sort of thing in Oracle it has to reserve the space for all columns in the disk block, otherwise if you update the record it would have to do it dynamically, which could be slow. Is Postgres the same?
It looks like you can specify a character varying type without a precision. Would it be more efficient (space-wise) to just define a single column and not specify the size, or a single column with the size specified as 2000?
Just use a single column of data type text. That will perform just as well as character varying(n) in PostgreSQL, because the implementation is exactly the same, minus the length check. PostgreSQL only stores as many characters as the string actually has, so there is no overhead in using text even for short strings.
In the words of the documentation:
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column.
Use text.
With Postgres, unless you really need to put a hard limit on the text size (and usually you do not), use text.
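If you later do want a hard limit, a CHECK constraint on a text column gives you one without baking the length into the type. A sketch (the table and constraint names are made up):
CREATE TABLE translations (
    translation_id bigserial PRIMARY KEY,
    translation    text NOT NULL,
    CONSTRAINT translation_len CHECK (char_length(translation) <= 2000)  -- easy to drop or replace later
);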
I have seen posts saying that varchar without n and text perform the same but can be slightly slower than varchar(n).
This is incorrect as of PostgreSQL 14; text is the most performant.
There is no performance difference among these three types [char, varchar, text], apart from increased storage space when using the blank-padded type [i.e. char], and a few extra CPU cycles to check the length when storing into a length-constrained column [i.e. varchar, char].
Storage for varchar and text is the same.
The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes of overhead instead of 1. Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values.
Do you need to reinvent this wheel?
Internationalization and localization tools already exist; they're built into many application frameworks.
Most localization systems work by using a string written in the developer's preferred language (for example, English) and then localizing that string as necessary. Then it becomes a simple hash lookup.
Rather than doing the localization in the database, query the message and then let the application framework localize it. This is simpler and faster and easier to work with and avoids putting additional load on the database.
If you really must do it in the database, here's some notes:
language is a confusing name for your table, perhaps messages or translations or localizations.
Use standard IETF language tags such as en-US.
Any fixed set of text, such as language tags, put into a table for referential integrity.
You could use an enum, but they're awkward to read and change.
Rather than having short, medium, and long columns, consider having one row for each message and a size column. This avoids a lot of null columns.
In fact, don't have "size" at all. It's arbitrary and subjective, as is evidenced by your null columns.
Don't prefix columns with the name of the table; fully qualify them as needed.
Separate the message from its localization for referential integrity.
create table languages (
  id serial primary key,
  tag text unique not null
);
-- This table might seem a bit silly, but it allows tables to refer to
-- a single message and enforce referential integrity.
create table messages (
  id bigserial primary key
);
create table message_localizations (
  -- If a message or language is deleted, so will its localization.
  message_id bigint not null references messages on delete cascade,
  language_id int not null references languages on delete cascade,
  localization text not null,
  -- This serves as both a primary key and enforces one localization
  -- per message and language.
  primary key(message_id, language_id)
);
create table events (
  id bigserial primary key,
  name_id bigint not null references messages,
  description_id bigint not null references messages
);
Then join each message with its localization.
select
  events.id,
  ml1.localization as name,
  ml2.localization as description
from events
-- left join so there is a result even if there is no localization. YMMV.
left join languages lt on lt.tag = 'en-US'
left join message_localizations ml1
  on ml1.message_id = name_id and ml1.language_id = lt.id
left join message_localizations ml2
  on ml2.message_id = description_id and ml2.language_id = lt.id
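For completeness, rows could be added to this schema like so (the literal ids and strings are only to keep the example readable; normally you would let the sequences assign the ids):
-- Register the languages once.
INSERT INTO languages (tag) VALUES ('en-US'), ('es-ES');
-- One message for the event name, one for its description.
INSERT INTO messages (id) VALUES (1), (2);
INSERT INTO events (name_id, description_id) VALUES (1, 2);
-- Localize both messages.
INSERT INTO message_localizations (message_id, language_id, localization) VALUES
    (1, (SELECT id FROM languages WHERE tag = 'en-US'), 'game of soccer'),
    (1, (SELECT id FROM languages WHERE tag = 'es-ES'), 'partido de fútbol'),
    (2, (SELECT id FROM languages WHERE tag = 'en-US'), 'A really great game of soccer'),
    (2, (SELECT id FROM languages WHERE tag = 'es-ES'), 'Un gran partido de fútbol');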
But, again, you probably want to use an existing localization tool.

Postgresql - Optimize ordering of columns in table range partitioning with multiple columns range

I am experimenting with creating a data warehouse for a relatively big dataset. Based on a ~10% sample of the data, I decided to partition some tables that are expected to exceed memory, which is currently 16 GB.
Based on the recommendation in the PostgreSQL docs:
These benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.
One particular table I am not sure how to partition. This table is frequently queried in two different ways, with a WHERE clause that may include the primary key OR another indexed column, so I figured I need a range partition using the existing primary key and the other column (with the other column added to the primary key).
Knowing that the order of columns matters, and given the below information my question is:
What is the best order for primary key and range partitioning columns?
Original table:
CREATE TABLE items (
  item_id BIGSERIAL NOT NULL, -- primary key
  src_doc_id bigint NOT NULL, -- every item can exist in one src_doc only and a src_doc can have multiple items
  item_name character varying(50) NOT NULL, -- used in `WHERE` clause with src_doc_id and guaranteed to be unique from source
  attr_1 bool,
  attr_2 bool, -- +15 other columns, all bool or integer types
  PRIMARY KEY (item_id)
);
CREATE INDEX index_items_src_doc_id ON items USING btree (src_doc_id);
CREATE INDEX index_items_item_name ON items USING hash (item_name);
Table size for 10% of the dataset is ~2 GB (result of pg_total_relation_size) with 3M+ rows; loading and querying performance is excellent. But given that this table is expected to grow to 30M rows and ~20 GB, I do not know what to expect in terms of performance.
Partitioned table being considered:
CREATE TABLE items (
  item_id BIGSERIAL NOT NULL,
  src_doc_id bigint NOT NULL,
  item_name character varying(50) NOT NULL,
  attr_1 bool,
  attr_2 bool,
  PRIMARY KEY (item_id, src_doc_id) -- should the order be reversed?
) PARTITION BY RANGE (item_id, src_doc_id); -- should the order be reversed?
CREATE INDEX index_items_src_doc_id ON items USING btree (src_doc_id);
CREATE INDEX index_items_item_name ON items USING hash (item_name);
-- Ranges are not initially known, so MAXVALUE is used as the upper bound.
-- When the upper bound is known, the partition is detached and reattached
-- using the known upper bound, and a new partition is added for the next range.
CREATE TABLE items_00 PARTITION OF items FOR VALUES FROM (MINVALUE, MINVALUE) TO (MAXVALUE, MAXVALUE);
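For what it's worth, the detach/reattach step described in the comment would look something like this (the boundary value 1000000 is made up; a plain ATTACH scans the partition to validate the new bound unless a matching CHECK constraint is added first):
BEGIN;
ALTER TABLE items DETACH PARTITION items_00;
ALTER TABLE items ATTACH PARTITION items_00
    FOR VALUES FROM (MINVALUE, MINVALUE) TO (1000000, MINVALUE);  -- now a bounded range
CREATE TABLE items_01 PARTITION OF items
    FOR VALUES FROM (1000000, MINVALUE) TO (MAXVALUE, MAXVALUE);  -- open-ended range for new rows
COMMIT;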
Table usage
On loading data, the load process (a Python script) looks up existing items based on src_doc_id and item_name and stores item_id, so it does not reinsert existing items. item_id gets referenced in a lot of other tables; no foreign keys are used.
On querying for analytics item information is always looked up based on item_id.
So I can't decide the suitable order for the table's PRIMARY KEY and PARTITION BY RANGE columns:
Should it be (item_id, src_doc_id) or (src_doc_id, item_id)?

Fewer rows versus fewer columns

I am currently modeling a table schema for PostgreSQL that has a lot of columns and is intended to hold a lot of rows. I don't know if it is faster to have more columns or to split the data into more rows.
The schema looks like this (shortened):
CREATE TABLE child_table (
  PRIMARY KEY(id, position),
  id bigint REFERENCES parent_table(id) ON DELETE CASCADE,
  position integer,
  account_id bigint REFERENCES accounts(account_id) ON DELETE CASCADE,
  attribute_1 integer,
  attribute_2 integer,
  attribute_3 integer,
  -- about 60 more columns
);
At most 10 rows of child_table are related to one row of parent_table. The order is given by the value in position, which ranges from 1 to 10. parent_table is intended to hold 650 million rows. With this schema I would end up with 6.5 billion rows in child_table.
Is it smart to do this? Or is it better to model it this way so that I only have 650 million rows:
CREATE TABLE child_table (
  PRIMARY KEY(id),
  id bigint,
  parent_id bigint REFERENCES other_table(id) ON DELETE CASCADE,
  account_id_1 bigint REFERENCES accounts(account_id) ON DELETE CASCADE,
  attribute_1_1 integer,
  attribute_1_2 integer,
  attribute_1_3 integer,
  account_id_2 bigint REFERENCES accounts(account_id) ON DELETE CASCADE,
  attribute_2_1 integer,
  attribute_2_2 integer,
  attribute_2_3 integer,
  -- [...]
);
The number of columns and rows matters less than how well they are indexed. Indexes drastically reduce the number of rows which need to be searched. In a well-indexed table, the total number of rows is irrelevant. If you try to smash 10 rows into one row you'll make indexing much harder. It will also make writing efficient queries which use those indexes harder.
Postgres has many different types of indexes to cover many different types of data and searches. You can even write your own (though that shouldn't be necessary).
Exactly 10 rows of child_table are at maximum related to one row of parent_table.
Avoid encoding business logic in your schema. Business logic changes all the time, especially arbitrary numbers like 10.
One thing you might consider is reducing the number of attribute columns, 60 is a lot, especially if they are actually named attribute_1, attribute_2, etc. Instead, if your attributes are not well defined, store them as a single JSON column with keys and values. Postgres' JSON operations are very efficient (provided you use the jsonb type) and provide a nice middle ground between a key/value store and a relational database.
Similarly, if any sets of attributes are simple lists (like address1, address2, address3), you can also consider using Postgres arrays.
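A minimal sketch of that jsonb variant, reusing the names from the question (the attribute keys are hypothetical):
CREATE TABLE child_table (
    id         bigint  REFERENCES parent_table(id) ON DELETE CASCADE,
    position   integer,
    account_id bigint  REFERENCES accounts(account_id) ON DELETE CASCADE,
    attributes jsonb,
    PRIMARY KEY (id, position)
);
-- A GIN index supports containment searches across all attribute keys.
CREATE INDEX child_table_attributes_idx ON child_table USING gin (attributes);
-- Example: find rows whose attributes contain a given key/value pair.
SELECT id, position FROM child_table WHERE attributes @> '{"attribute_1": 42}';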
I can't give better advice than this without specifics.

T-SQL implicit conversion between 2 varchars

I have some T-SQL (SQL Server 2008) that I inherited and am trying to find out why some of the queries are running really slowly. In the Actual Execution Plan I have three clustered index scans which are costing me 19%, 21% and 26%, so this seems to be the source of my problem.
The contents of the fields are usually numeric (but some job numbers have an alpha prefix)
The database design (vendor-supplied) is pretty poor. The max length of a job number in their application is 12 characters, but in the tables that are joined it is defined as varchar(50) in some places and varchar(15) in others. My parameter is a varchar(12), but I get the same thing if I change it to a varchar(50).
The node contains this:
Predicate: [Live_Costing].[dbo].[TSTrans].[JobNo] as [sts1].[JobNo]=CONVERT_IMPLICIT(varchar(50),[#JobNo],0)
sts1 is a derived table, but the table it pulls jobno from is a varchar(50)
I don't understand why it's doing an implicit conversion between 2 varchars. Is it just because they are different lengths?
I'm fairly new to reading execution plans.
Is there an easy way to figure out which node in the exec plan relates to which part of the query?
Is the predicate the join clause?
Some variables can have a collation.
Regardless, you need to verify your collations, which can be specified at the server, database, table, and column level.
First, check your collation between tempdb and the vendor supplied database. It should match. If it doesn't, it will tend to do implicit conversions.
Assuming you cannot modify the vendor supplied code base, one or more of the following should help you:
1) Predefine your temp tables and specify the same collation for the key field as in the database in use, rather than tempdb (see the sketch after this list).
2) Provide collations when doing string comparisons.
3) Specify collation for key values if using "select into" with a temp table
4) Make sure your collations on your tables and columns match your database collation (VERY important if you imported only specific tables from a vendor into an existing database.)
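For example, a T-SQL sketch of points 1) and 2) above (the temp table is hypothetical; the base table name is taken from the plan fragment in the question):
-- 1) Predefine the temp table with an explicit collation (and a matching length).
CREATE TABLE #JobNos (
    JobNo varchar(50) COLLATE DATABASE_DEFAULT NOT NULL
);

-- 2) Or force the collation explicitly in the string comparison.
SELECT sts1.JobNo
FROM Live_Costing.dbo.TSTrans AS sts1
JOIN #JobNos AS j
    ON sts1.JobNo = j.JobNo COLLATE DATABASE_DEFAULT;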
If you can change the vendor-supplied code base, I would suggest reviewing the cost of making all of your character keys the same fixed length rather than varchar, since varchar carries some per-value storage overhead. The caveat is that a fixed-length character field declared NOT NULL will be right-padded with spaces (unavoidable).
Ideally, you would have int keys, and only use varchar fields for user interaction/lookup:
create table Products(ProductID int not null identity(1,1) primary key clustered, ProductNumber varchar(50) not null)
alter table Products add constraint uckProducts_ProductNumber unique(ProductNumber)
Then do all joins on ProductID rather than ProductNumber, and just filter on ProductNumber; that would be perfectly fine.

Does the order of columns in a Postgres table impact performance?

In Postgres does the order of columns in a CREATE TABLE statement impact performance? Consider the following two cases:
CREATE TABLE foo (
  a TEXT,
  B VARCHAR(512),
  pkey INTEGER PRIMARY KEY,
  bar_fk INTEGER REFERENCES bar(pkey),
  C bytea
);
vs.
CREATE TABLE foo2 (
  pkey INTEGER PRIMARY KEY,
  bar_fk INTEGER REFERENCES bar(pkey),
  B VARCHAR(512),
  a TEXT,
  C bytea
);
Will performance of foo2 be better than foo because of better byte alignment for the columns? When Postgres executes CREATE TABLE does it follow the column order specified or does it re-organize the columns in optimal order for byte alignment or performance?
Question 1
Will the performance of foo2 be better than foo because of better byte alignment for the columns?
Yes, the order of columns can have a small impact on performance. Type alignment is the more important factor, because it affects the footprint on disk. You can minimize storage size (play "column tetris") and squeeze more rows on a data page - which is the most important factor for speed.
Normally, it's not worth bothering. With an extreme example like in this related answer you get a substantial difference:
Calculating and saving space in PostgreSQL
Type alignment details:
Making sense of Postgres row sizes
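A deliberately small (hypothetical) illustration of that "column tetris": putting the columns with the largest alignment requirement first avoids padding bytes between columns, and you can compare the per-row footprint with pg_column_size on record values of the same shapes.
CREATE TABLE padded (
    flag  boolean,  -- 1 byte, then 7 padding bytes before the 8-byte-aligned bigint
    big   bigint,
    small integer
);
CREATE TABLE packed (
    big   bigint,   -- widest alignment first
    small integer,
    flag  boolean   -- a trailing 1-byte column needs no padding before it
);
-- The first value comes out a few bytes larger than the second, purely from alignment padding.
SELECT pg_column_size(ROW(true, 1::bigint, 1)) AS padded_bytes,
       pg_column_size(ROW(1::bigint, 1, true)) AS packed_bytes;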
The other factor is that retrieving column values is slightly faster if you have fixed size columns first. I quote the manual here:
To read the data you need to examine each attribute in turn. First check whether the field is NULL according to the null bitmap. If it is, go to the next. Then make sure you have the right alignment. If the field is a fixed width field, then all the bytes are simply placed. If it's a variable length field (attlen = -1) then it's a bit more complicated. All variable-length data types share the common header structure struct varlena, which includes the total length of the stored value and some flag bits.
There is an open TODO item to allow reordering of column positions in the Postgres Wiki, partly for these reasons.
Question 2
When Postgres executes a CREATE TABLE does it follow the column order specified or does it re-organize the columns in optimal order for byte alignment or performance?
Columns are stored in the defined order, the system does not try to optimize.
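You can see the stored order in the catalog; it is simply the declared order (using foo2 from the question):
SELECT attnum, attname, atttypid::regtype
FROM pg_attribute
WHERE attrelid = 'foo2'::regclass
  AND attnum > 0
  AND NOT attisdropped
ORDER BY attnum;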
I fail to see any relevance of column order to TOAST tables like another answer seems to imply.
As far as I understand, PostgreSQL adheres to the order in which you enter the columns when saving records. Whether this affects performance is debatable. PostgreSQL stores all table data in pages, each 8 kB in size. 8 kB is the default, but it can be changed at compile time.
Each row in the table will take up space within the page. Since your table definition contains variable-length columns, a page can hold a variable number of records. What you want to do is make sure you can fit as many records into one page as possible. That is why you will notice performance degradation when a table has a huge number of columns or the column sizes are huge.
This being said, declaring a varchar(8192) does not mean the page will be filled up with one record, but declaring a CHAR(8192) will use up one whole page irrespective of the amount of data in the column.
There is one more thing to consider when declaring TOASTable types such as TEXT columns. These are columns that could exceed the maximum page size. A table that has TOASTable columns will have an associated TOAST table to store the data and only a pointer to the data is stored with the table. This can impact performance, but can be improved with proper indexes on the TOASTable columns.
To conclude, I would have to say that the order of the columns does not play much of a role in the performance of a table. Most queries use indexes, which are stored separately, to retrieve records, so column order matters little. It comes down to how many pages need to be read to retrieve the data.