What is the best way to store varying size columns in postgres for language translation? - postgresql

Let's say I create a table in Postgres to store language translations. Let's say I have a table like EVENT that has multiple columns which need translation to other languages. Rather than adding new columns for each language in EVENT, I would just add new rows in LANGUAGE with the same language_id.
EVENT:
 id | EVENT_NAME (fk to LANGUAGE.short) | EVENT_DESCRIPTION (fk to LANGUAGE.long)
----+-----------------------------------+-----------------------------------------
  0 | 1                                 | 2

LANGUAGE:
 language_id | language_type | short (char varying 200) | med (char varying 50) | long (char varying 2000)
-------------+---------------+--------------------------+-----------------------+--------------------------------
           1 | english       | game of soccer           | null                  | null
           1 | spanish       | partido de footbol       | null                  | null
           2 | english       | null                     | null                  | A really great game of soccer
           2 | spanish       | null                     | null                  | Un gran partido de footbol
If I want the language-specific version I would create a parameterized statement and pass in the language like this:
select e.id, name.short, descr.long
from event e, language name, language descr
where e.id = 0
and e.event_name = name.language_id
and name.language_type = ?
and e.event_description = descr.language_id
and descr.language_type = ?
My first thought was to have just a single column for the translated text, but I would have to make it big enough to hold any translation. Setting it to 2000 when many records will only be 50 characters seemed wrong. Hence I thought maybe to add different columns with different sizes and just use the appropriate one for the data I'm storing (the event name can be restricted to 50 characters on the front end and the description to 2000 characters).
In the LANGUAGE table only one of the three columns (short, med, long) will be set per row. This is just my initial thought, but I'm trying to understand whether it is a bad approach. Does the disk still reserve 2250 characters if I only set the short value? I read a while back that if you do this sort of thing in Oracle, it has to reserve the space for all columns in the disk block, otherwise updating the record would have to be done dynamically, which could be slow. Is Postgres the same?
It looks like you can specify a character varying type without a length. Would it be more efficient (space-wise) to define a single column without specifying a size, or a single column with the size set to 2000?

Just use a single column of data type text. That will perform just as well as character varying(n) in PostgreSQL, because the implementation is exactly the same, minus the length check. PostgreSQL only stores as many characters as the string actually has, so there is no overhead in using text even for short strings.
In the words of the documentation:
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column.
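As a sketch of what that could look like for the LANGUAGE table in the question (the single text column replaces short/med/long; the CHECK constraint is optional and the 2000-character limit is only an assumption):
CREATE TABLE language (
  language_id   int  NOT NULL,
  language_type text NOT NULL,   -- e.g. 'english', 'spanish'
  translation   text NOT NULL CHECK (char_length(translation) <= 2000),
  PRIMARY KEY (language_id, language_type)
);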

Use text.
With Postgres, unless you really need to put a hard limit on the text size (and usually you do not), use text.
I see posts saying that varchar without n and text have the same performance but can be slightly slower than varchar(n).
This is incorrect with PostgreSQL 14. text is the most performant.
There is no performance difference among these three types [char, varchar, text], apart from increased storage space when using the blank-padded type [ie. char], and a few extra CPU cycles to check the length when storing into a length-constrained column [ie. varchar, char].
Storage for varchar and text is the same.
The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes of overhead instead of 1. Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values.
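If you want to check the storage claims yourself, pg_column_size() reports the number of bytes a stored value uses; a rough sketch (the exact figures assume a single-byte encoding):
CREATE TEMP TABLE size_check (t text, v varchar(2000));
INSERT INTO size_check VALUES ('game of soccer', 'game of soccer');
SELECT pg_column_size(t) AS text_bytes,     -- expect 15: 14 characters + 1-byte header
       pg_column_size(v) AS varchar_bytes   -- same as text: the storage is identical
FROM size_check;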
Do you need to reinvent this wheel?
Internationalization and localization tools already exist; they're built into many application frameworks.
Most localization systems work by using a string written in the developer's preferred language (for example, English) and then localizing that string as necessary. Then it becomes a simple hash lookup.
Rather than doing the localization in the database, query the message and then let the application framework localize it. This is simpler and faster and easier to work with and avoids putting additional load on the database.
If you really must do it in the database, here's some notes:
language is a confusing name for your table, perhaps messages or translations or localizations.
Use standard IETF language tags such as en-US.
Put any fixed set of text, such as language tags, into a table for referential integrity.
You could use an enum, but they're awkward to read and change.
Rather than having short, medium, and long columns, consider having one row for each message and a size column. This avoids a lot of null columns.
In fact, don't have "size" at all. It's arbitrary and subjective, as is evidenced by your null columns.
Don't prefix columns with the name of the table, fully qualify them as needed.
Separate the message from its localization for referential integrity.
create table languages (
  id serial primary key,
  tag text unique not null
);

-- This table might seem a bit silly, but it allows tables to refer to
-- a single message and enforce referential integrity.
create table messages (
  id bigserial primary key
);

create table message_localizations (
  -- If a message or language is deleted, its localizations are deleted too.
  message_id bigint not null references messages on delete cascade,
  language_id int not null references languages on delete cascade,
  localization text not null,

  -- This serves as both a primary key and enforces one localization
  -- per message and language.
  primary key(message_id, language_id)
);

create table events (
  id bigserial primary key,
  name_id bigint not null references messages,
  description_id bigint not null references messages
);
Then join each message with its localization.
select
  events.id,
  ml1.localization as name,
  ml2.localization as description
from events
-- left join so there is a result even if there is no localization. YMMV.
left join languages lt on lt.tag = 'en-US'
left join message_localizations ml1
  on ml1.message_id = name_id and ml1.language_id = lt.id
left join message_localizations ml2
  on ml2.message_id = description_id and ml2.language_id = lt.id
Demonstration.
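For completeness, one way the schema above could be populated to reproduce the question's example (the literal ids assume freshly created sequences, so they are only illustrative):
INSERT INTO languages (tag) VALUES ('en-US'), ('es-ES');   -- ids 1 and 2

INSERT INTO messages DEFAULT VALUES;   -- id 1: the event name
INSERT INTO messages DEFAULT VALUES;   -- id 2: the event description

INSERT INTO message_localizations (message_id, language_id, localization) VALUES
  (1, 1, 'game of soccer'),
  (1, 2, 'partido de fútbol'),
  (2, 1, 'A really great game of soccer'),
  (2, 2, 'Un gran partido de fútbol');

INSERT INTO events (name_id, description_id) VALUES (1, 2);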
But, again, you probably want to use an existing localization tool.

Related

Queries using LIKE are exceptionally slow

I have a database with over 30,000,000 entries. When performing queries (including an ORDER BY clause) on a text field, the = operator gives relatively fast results. However, we have noticed that when using the LIKE operator, the query becomes remarkably slow, taking minutes to complete. For example:
SELECT * FROM work_item_summary WHERE manager LIKE '%manager' ORDER BY created;
Creating indices on the keywords being searched will of course greatly speed up the query. The problem is we must support queries on any arbitrary pattern, and on any column, making this solution not viable.
My questions are:
Why are LIKE queries this much slower than = queries?
Is there any other way these generic queries can be optimized, or is this about as good as one can get for a database with so many entries?
Your query plan shows a sequential scan, which is slow for big tables, and also not surprising since your LIKE pattern has a leading wildcard that cannot be supported with a plain B-tree index.
You need to add index support. Either a trigram GIN index to support any and all patterns, or a COLLATE "C" B-tree expression index on the reversed string to specifically target leading wildcards.
See:
PostgreSQL LIKE query performance variations
How to index a column for leading wildcard search and check progress?
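For example, the trigram option could look roughly like this (a sketch, assuming the table and column names from the question):
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX work_item_summary_manager_trgm_idx
  ON work_item_summary USING gin (manager gin_trgm_ops);

-- After ANALYZE, patterns with a leading wildcard can use this index:
-- EXPLAIN SELECT * FROM work_item_summary
--         WHERE manager LIKE '%manager' ORDER BY created;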
One technique to speed up queries that search for a partial word (e.g. '%something%') is a rotational indexing technique, which is not implemented in most RDBMS.
This technique consists of cutting out each word of a sentence and then slicing it into "rotations", i.e. creating a list of parts of the word from which the first letter is successively removed.
As an example, the word "electricity" is exploded into 10 entries:
lectricity
ectricity
ctricity
tricity
ricity
icity
city
ity
ty
y
Then you put all those entries into a dictionary, which is a simple indexed table, and reference the root word.
The tables for this are:
CREATE TABLE T_WRD
(WRD_ID    BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
 WRD_WORD  VARCHAR(64) NOT NULL UNIQUE,
 WRD_DROW  VARCHAR(64) GENERATED ALWAYS AS (REVERSE(WRD_WORD)) STORED NOT NULL UNIQUE);

CREATE TABLE T_WORD_ROTATE_STRING_WRS
(WRD_ID      BIGINT NOT NULL REFERENCES T_WRD (WRD_ID),
 WRS_ROTATE  SMALLINT NOT NULL,
 WRD_ID_PART BIGINT NOT NULL REFERENCES T_WRD (WRD_ID),
 PRIMARY KEY (WRD_ID, WRS_ROTATE));
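A search for a partial word can then be rewritten as an anchored prefix search against the rotation dictionary, roughly like this (a sketch against the tables above; the pattern 'tric' is just an example):
SELECT DISTINCT root.WRD_WORD
FROM T_WRD part
JOIN T_WORD_ROTATE_STRING_WRS wrs ON wrs.WRD_ID_PART = part.WRD_ID
JOIN T_WRD root ON root.WRD_ID = wrs.WRD_ID
WHERE part.WRD_WORD LIKE 'tric%';   -- anchored prefix instead of '%tric%'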

Does a table with an XML field consume space for NULL values? How much?

Is there an "official benchmark" or a simple rule of thumb to decide when space or performance will be affected?
My table has many simple and indexed fields:
CREATE TABLE t (
  id serial PRIMARY KEY,
  name varchar(250) NOT NULL,
  ...
  xcontent xml,   -- does a NULL here use disk space? hurt performance?
  ...
  UNIQUE(name)
);
and it is a kind of "sparse content": many xcontent values will be NULL. So, do these XML NULLs consume any disk space?
Notes
I can normalize: table t becomes nt,
CREATE TABLE nt (
  id serial PRIMARY KEY,
  name varchar(250) NOT NULL,
  ...
  UNIQUE(name)
);
CREATE TABLE nt2 (
  t_id int REFERENCES nt(id),
  xcontent xml NOT NULL
);
CREATE VIEW nt_full AS
  SELECT nt.*, nt2.xcontent FROM nt LEFT JOIN nt2 ON id = t_id;
So, do I need this complexity? Will this new table arrangement consume less disk space? Consider the use of:
SELECT id, name FROM nt WHERE name>'john'; -- Q1A
SELECT id, name FROM nt_full WHERE name>'john'; -- Q1B
SELECT id, name FROM t WHERE name>'john'; -- Q1C
SELECT id, xcontent FROM nt_full WHERE name>'john'; -- Q2A
SELECT id, xcontent FROM t WHERE name>'john'; -- Q2B
So, in theory, will the performance of Q1A vs Q1B vs Q1C be the same?
And Q2A vs Q2B?
The answer to the question "how much space does a null value take" is: no space at all - at least not in the "data" area.
For each nullable column in the table there is one bit in the row header that marks the column value as null (or not null). So the "space" that the null value takes is already present in the row header - regardless of whether the column is null or not.
Thus the null "value" does not occupy any space in the data block storing the row.
This is documented in the manual: http://www.postgresql.org/docs/current/static/storage-page-layout.html
Postgres will not store long string values (xml, varchar, text, json, ...) in the actual data block if they exceed a certain threshold (about 2000 bytes). If the value is longer than that, it will be stored in a special storage area "away" from your actual data. So splitting up the table into two tables with a 1:1 relationship doesn't really buy you that much. Unless you are storing a lot of rows (hundreds of millions), I doubt you will be able to notice the difference - but this also depends on your usage patterns.
The data that is stored "out-of-line" is also automatically compressed.
Details about this can be found in the manual: http://www.postgresql.org/docs/current/static/storage-toast.html
One reason why the separate table might be an advantage is the necessary "vacuum" cleanup. If you update the XML data a lot but the rest of the table hardly ever changes, then splitting this up into two tables might improve the overall performance, because the "XML table" will need less "maintenance" and the "main" table won't be changed at all.
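A quick way to convince yourself is pg_column_size(), which returns NULL for a NULL value and the stored size otherwise (a sketch; the sample values mean nothing):
CREATE TEMP TABLE t_null_check (id serial PRIMARY KEY, xcontent xml);
INSERT INTO t_null_check (xcontent) VALUES (NULL), ('<a>some content</a>');

SELECT id,
       pg_column_size(xcontent)       AS xml_bytes,   -- NULL for the NULL row
       pg_column_size(t_null_check.*) AS row_bytes    -- size of the whole row value
FROM t_null_check;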
A varchar value consumes 1 byte more than the content for short strings (up to 126 bytes), and 4 bytes more for longer ones. So if you define it as varchar(250) and put 10 chars in, it consumes 11 bytes; 100 chars consume 104 bytes.
A NULL consumes no space in the data area at all, just a bit in the row's null bitmap. No problem.
If you are in a situation where you need to store large amounts of XML data and end up using a (for instance) blob-like type, you should put that in another table and keep your primary table lean and fast.

Create big integer from the big end of a uuid in PostgreSQL

I have a third-party application connecting to a view in my PostgreSQL database. It requires the view to have a primary key but can't handle the UUID type (which is the primary key for the view). It also can't handle the UUID as the primary key if it is served as text from the view.
What I'd like to do is convert the UUID to a number and use that as the primary key instead. However,
SELECT x'14607158d3b14ac0b0d82a9a5a9e8f6e'::bigint
Fails because the number is out of range.
So instead, I want to use SQL to take the big end of the UUID and create an int8 / bigint. I should clarify that maintaining order is 'desirable' but I understand that some of the order will change by doing this.
I tried:
SELECT x(substring(UUID::text from 1 for 16))::bigint
but the x operator for converting hex doesn't seem to like brackets. I abstracted it into a function but
SELECT hex_to_int(substring(UUID::text from 1 for 16))::bigint
still fails.
How can I get a bigint from the 'big end' half of a UUID?
Fast and without dynamic SQL
Cast the leading 16 hex digits of a UUID in text representation as bitstring bit(64) and cast that to bigint. See:
Convert hex in text representation to decimal number
Conveniently, excess hex digits to the right are truncated in the cast to bit(64) automatically - exactly what we need.
Postgres accepts various formats for input. Your given string literal is one of them:
14607158d3b14ac0b0d82a9a5a9e8f6e
The default text representation of a UUID (and the text output in Postgres for data type uuid) adds hyphens at predefined places:
14607158-d3b1-4ac0-b0d8-2a9a5a9e8f6e
The manual:
A UUID is written as a sequence of lower-case hexadecimal digits, in
several groups separated by hyphens, specifically a group of 8 digits
followed by three groups of 4 digits followed by a group of 12 digits,
for a total of 32 digits representing the 128 bits.
If input format can vary, strip hyphens first to be sure:
SELECT ('x' || translate(uuid_as_string, '-', ''))::bit(64)::bigint;
Cast actual uuid input with uuid::text.
db<>fiddle here
Note that Postgres uses signed integers, so the bigint overflows to negative numbers in the upper half - which should be irrelevant for this purpose.
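If the expression is needed in the view (or in an index), it can be wrapped in a small IMMUTABLE function; the function name here is just an example:
CREATE OR REPLACE FUNCTION uuid_to_bigint(u uuid)
  RETURNS bigint
  LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
$$ SELECT ('x' || translate(u::text, '-', ''))::bit(64)::bigint $$;

-- Usage:
-- SELECT uuid_to_bigint('14607158-d3b1-4ac0-b0d8-2a9a5a9e8f6e');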
DB design
If at all possible add a bigserial column to the underlying table and use that instead.
This is all very shaky, both the problem and the solution you describe in your self-answer.
First, a mismatch between a database design and a third-party application is always possible, but usually indicative of a deeper problem. Why does your database use the uuid data type as a PK in the first place? They are not very efficient compared to a serial or a bigserial. Typically you would use a UUID if you are working in a distributed environment where you need to "guarantee" uniqueness over multiple installations.
Secondly, why does the application require the PK to begin with (incidentally: views do not have a PK, the underlying tables do)? If it is only to view the data then a PK is rather useless, particularly if it is based on a UUID (and there is thus no conceivable relationship between the PK and the rest of the tuple). If it is used to refer to other data in the same database or do updates or deletes of existing data, then you need the exact UUID and not some extract of it because the underlying table or other relations in your database would have the exact UUID. Of course you can convert all UUID's with the same hex_to_int() function, but that leads straight back to my point above: why use uuids in the first place?
Thirdly, do not mess around with things you have little or no knowledge of. This is not intended to be offensive, take it as well-meant advice (look around on the internet for programmers who tried to improve on cryptographic algorithms or random number generation by adding their own twists of obfuscation; quite entertaining reads). There are 5 algorithms for generating UUID's in the uuid-ossp package and while you know or can easily find out which algorithm is used in your database (the uuid_generate_vX() functions in your table definitions, most likely), do you know how the algorithm works? The claim of practical uniqueness of a UUID is based on its 128 bits, not a 64-bit extract of it. Are you certain that the high 64-bits are random? My guess is that 64 consecutive bits are less random than the "square root of the randomness" (for lack of a better way to phrase the theoretical drop in periodicity of a 64-bit number compared to a 128-bit number) of the full UUID. Why? Because all but one of the algorithms are made up of randomized blocks of otherwise non-random input (such as the MAC address of a network interface, which is always the same on a machine generating millions of UUIDs). Had 64 bits been enough for randomized value uniqueness, then a uuid would have been that long.
What a better solution would be in your case is hard to say, because it is unclear what the third-party application does with the data from your database and how dependent it is on the uniqueness of the "PK" column in the view. An approach that is likely to work if the application does more than trivially display the data without any further use of the "PK" would be to associate a bigint with every retrieved uuid in your database in a (temporary) table and include that bigint in your view by linking on the uuids in your (temporary) tables. Since you can not trigger on SELECT statements, you would need a function to generate the bigint for every uuid the application retrieves. On updates or deletes on the underlying tables of the view or upon selecting data from related tables, you look up the uuid corresponding to the bigint passed in from the application. The lookup table and function would look somewhat like this:
CREATE TEMPORARY TABLE temp_table(
  tempint bigserial PRIMARY KEY,
  internal_uuid uuid);

CREATE INDEX ON temp_table(internal_uuid);

CREATE FUNCTION temp_int_for_uuid(pk uuid) RETURNS bigint AS $$
DECLARE
  id bigint;
BEGIN
  SELECT tempint INTO id FROM temp_table WHERE internal_uuid = pk;
  IF NOT FOUND THEN
    INSERT INTO temp_table(internal_uuid) VALUES (pk)
      RETURNING tempint INTO id;
  END IF;
  RETURN id;
END; $$ LANGUAGE plpgsql STRICT;
Not pretty, not efficient, but fool-proof.
Cast a hex literal built from a substr of the UUID to bit(64) and then to bigint to get a decimal number:
select ('x'||substr(UUID, 1, 16))::bit(64)::bigint
See SQLFiddle
Solution found.
UUID::text will return a string with hyphens. In order for substring(UUID::text from 1 for 16) to create a string that x can parse as hex the hyphens need to be stripped first.
The final query looks like:
SELECT hex_to_int(substring((select replace(id::text,'-','')) from 1 for 16))::bigint FROM table
The hex_to_int function needs to be able to handle a bigint, not just an int. It looks like:
CREATE OR REPLACE FUNCTION hex_to_int(hexval character varying)
  RETURNS bigint AS
$BODY$
DECLARE
  result bigint;
BEGIN
  EXECUTE 'SELECT x''' || hexval || '''::bigint' INTO result;
  RETURN result;
END;
$BODY$
LANGUAGE plpgsql;

T-SQL implicit conversion between 2 varchars

I have some T-SQL (SQL Server 2008) that I inherited and am trying to find out why some of the queries are running really slowly. In the Actual Execution Plan I have three clustered index scans which are costing me 19%, 21% and 26%, so this seems to be the source of my problem.
The contents of the fields are usually numeric (but some job numbers have an alpha prefix)
The database design (vendor supplied) is pretty poor. The max length of a job number in their application is 12 chars, but in the tables that are joined it is defined as varchar(50) in some places and varchar(15) in others. My parameter is a varchar(12), but I get the same thing if I change it to a varchar(50).
The node contains this:
Predicate: [Live_Costing].[dbo].[TSTrans].[JobNo] as [sts1].[JobNo]=CONVERT_IMPLICIT(varchar(50),[#JobNo],0)
sts1 is a derived table, but the table it pulls jobno from is a varchar(50)
I don't understand why it's doing an implicit conversion between 2 varchars. Is it just because they are different lengths?
I'm fairly new to execution plans.
Is there an easy way to figure out which node in the exec plan relates to which part of the query?
Is the predicate the join clause?
Regards
Mark
Some variables can have a collation of their own.
Regardless, you need to verify your collations, which can be specified at the server, database, table, and column level.
First, check your collation between tempdb and the vendor supplied database. It should match. If it doesn't, it will tend to do implicit conversions.
Assuming you cannot modify the vendor supplied code base, one or more of the following should help you:
1) Predefine your temp tables and specify the same collation for the key field as in the db in use, rather than tempdb.
2) Provide collations when doing string comparisons.
3) Specify collation for key values if using "select into" with a temp table
4) Make sure your collations on your tables and columns match your database collation (VERY important if you imported only specific tables from a vendor into an existing database.)
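A quick way to compare the relevant collations (T-SQL; the table name is taken from the question, the rest are standard system functions):
SELECT DATABASEPROPERTYEX('tempdb', 'Collation')  AS tempdb_collation,
       DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS current_db_collation,
       SERVERPROPERTY('Collation')                AS server_collation;

-- Column-level collations for one of the joined tables:
SELECT name, collation_name
FROM sys.columns
WHERE object_id = OBJECT_ID('dbo.TSTrans');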
If you can change the vendor supplied code base, I would suggest reviewing the cost for making all of your char keys the same length and NOT varchar. Varchar has an overhead of 10. The caveat is that if you create a fixed length character field not null, it will be padded to the right (unavoidable).
Ideally, you would have int keys, and only use varchar fields for user interaction/lookup:
create table Products(ProductID int not null identity(1,1) primary key clustered, ProductNumber varchar(50) not null)
alter table Products add constraint uckProducts_ProductNumber unique(ProductNumber)
Then do all joins on ProductID rather than ProductNumber, and just filter on ProductNumber; that would be perfectly fine.
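For example (OrderLines and its columns are hypothetical):
SELECT p.ProductID, p.ProductNumber, ol.Quantity
FROM Products p
JOIN OrderLines ol ON ol.ProductID = p.ProductID   -- join on the int key
WHERE p.ProductNumber = 'AB-12345';                -- filter on the natural key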

Does the order of columns in a Postgres table impact performance?

In Postgres does the order of columns in a CREATE TABLE statement impact performance? Consider the following two cases:
CREATE TABLE foo (
  a TEXT,
  B VARCHAR(512),
  pkey INTEGER PRIMARY KEY,
  bar_fk INTEGER REFERENCES bar(pkey),
  C bytea
);
vs.
CREATE TABLE foo2 (
  pkey INTEGER PRIMARY KEY,
  bar_fk INTEGER REFERENCES bar(pkey),
  B VARCHAR(512),
  a TEXT,
  C bytea
);
Will performance of foo2 be better than foo because of better byte alignment for the columns? When Postgres executes CREATE TABLE does it follow the column order specified or does it re-organize the columns in optimal order for byte alignment or performance?
Question 1
Will the performance of foo2 be better than foo because of better byte
alignment for the columns?
Yes, the order of columns can have a small impact on performance. Type alignment is the more important factor, because it affects the footprint on disk. You can minimize storage size (play "column tetris") and squeeze more rows on a data page - which is the most important factor for speed.
Normally, it's not worth bothering. With an extreme example like in this related answer you get a substantial difference:
Calculating and saving space in PostgreSQL
Type alignment details:
Making sense of Postgres row sizes
The other factor is that retrieving column values is slightly faster if you have fixed size columns first. I quote the manual here:
To read the data you need to examine each attribute in turn. First
check whether the field is NULL according to the null bitmap. If it
is, go to the next. Then make sure you have the right alignment. If
the field is a fixed width field, then all the bytes are simply
placed. If it's a variable length field (attlen = -1) then it's a bit
more complicated. All variable-length data types share the common
header structure struct varlena, which includes the total length of
the stored value and some flag bits.
There is an open TODO item to allow reordering of column positions in the Postgres Wiki, partly for these reasons.
Question 2
When Postgres executes a CREATE TABLE does it follow the column order
specified or does it re-organize the columns in optimal order for byte
alignment or performance?
Columns are stored in the defined order, the system does not try to optimize.
I fail to see any relevance of column order to TOAST tables like another answer seems to imply.
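A rough way to compare the two layouts is to measure a composite value with the same column order using pg_column_size(); with the question's mix of types the difference is small, which illustrates the "normally not worth bothering" point (sample values only):
SELECT pg_column_size(ROW('some name'::text, 'some description'::varchar(512),
                          1::int, 2::int, '\x00'::bytea))    AS foo_row_bytes,
       pg_column_size(ROW(1::int, 2::int, 'some description'::varchar(512),
                          'some name'::text, '\x00'::bytea)) AS foo2_row_bytes;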
As far as I understand, PostgreSQL adheres to the order in which you enter the columns when saving records. Whether this affects performance is debatable. PostgreSQL stores all table data in pages, each 8 kB in size. 8 kB is the default, but it can be changed at compile time.
Each row in the table will take up space within the page. Since your table definition contains variable-width columns, a page can hold a variable number of records. What you want to do is make sure you can fit as many records into one page as possible. That is why you will notice performance degradation when a table has a huge number of columns or huge column sizes.
This being said, declaring a varchar(8192) does not mean the page will be filled up with one record, but declaring a CHAR(8192) will use up one whole page irrespective of the amount of data in the column.
There is one more thing to consider when declaring TOASTable types such as TEXT columns. These are columns that could exceed the maximum page size. A table that has TOASTable columns will have an associated TOAST table to store the data and only a pointer to the data is stored with the table. This can impact performance, but can be improved with proper indexes on the TOASTable columns.
To conclude, I would say that the order of the columns does not play much of a role in the performance of a table. Most queries use indexes, which are stored separately, to retrieve records, so column order matters little. It comes down to how many pages need to be read to retrieve the data.
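Whether that matters in a given case can be checked directly: the catalog shows how many 8 kB pages a table currently occupies (the table name is just an example):
SELECT relname, relpages, reltuples   -- relpages/reltuples are estimates updated by VACUUM/ANALYZE
FROM pg_class
WHERE relname = 'foo';

-- Including TOAST and indexes:
SELECT pg_size_pretty(pg_total_relation_size('foo'));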