Create a big integer from the big end of a UUID in PostgreSQL

I have a third-party application connecting to a view in my PostgreSQL database. It requires the view to have a primary key but can't handle the UUID type (which is the primary key for the view). It also can't handle the UUID as the primary key if it is served as text from the view.
What I'd like to do is convert the UUID to a number and use that as the primary key instead. However,
SELECT x'14607158d3b14ac0b0d82a9a5a9e8f6e'::bigint
Fails because the number is out of range.
So instead, I want to use SQL to take the big end of the UUID and create an int8 / bigint. I should clarify that maintaining order is 'desirable' but I understand that some of the order will change by doing this.
I tried:
SELECT x(substring(UUID::text from 1 for 16))::bigint
but the x operator for converting hex doesn't seem to like brackets. I abstracted it into a function but
SELECT hex_to_int(substring(UUID::text from 1 for 16))::bigint
still fails.
How can I get a bigint from the 'big end' half of a UUID?

Fast and without dynamic SQL
Cast the leading 16 hex digits of the UUID's text representation to the bit string type bit(64), then cast that to bigint. See:
Convert hex in text representation to decimal number
Conveniently, excess hex digits to the right are truncated in the cast to bit(64) automatically - exactly what we need.
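For example, using the hex digits from the question, the full 32-digit string and just its leading 16 digits should produce the same bigint, since the excess is truncated:
SELECT ('x' || '14607158d3b14ac0b0d82a9a5a9e8f6e')::bit(64)::bigint;
-- same result as using only the leading 16 hex digits:
SELECT ('x' || '14607158d3b14ac0')::bit(64)::bigint;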
Postgres accepts various formats for input. Your given string literal is one of them:
14607158d3b14ac0b0d82a9a5a9e8f6e
The default text representation of a UUID (and the text output in Postgres for data type uuid) adds hyphens at predefined places:
14607158-d3b1-4ac0-b0d8-2a9a5a9e8f6e
The manual:
A UUID is written as a sequence of lower-case hexadecimal digits, in
several groups separated by hyphens, specifically a group of 8 digits
followed by three groups of 4 digits followed by a group of 12 digits,
for a total of 32 digits representing the 128 bits.
If input format can vary, strip hyphens first to be sure:
SELECT ('x' || translate(uuid_as_string, '-', ''))::bit(64)::bigint;
Cast actual uuid input with uuid::text.
db<>fiddle here
Note that Postgres uses signed integers, so the bigint "overflows" to negative numbers in the upper half - which should be irrelevant for this purpose.
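Applied to an actual uuid column, the whole expression might look like this (table and column names are placeholders):
SELECT ('x' || translate(id::text, '-', ''))::bit(64)::bigint AS id_bigint
FROM   my_table;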
DB design
If at all possible add a bigserial column to the underlying table and use that instead.
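A minimal sketch of that, with a hypothetical table name; adding a serial column fills existing rows with sequence values automatically:
ALTER TABLE my_table ADD COLUMN view_id bigserial;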

This is all very shaky, both the problem and the solution you describe in your self-answer.
First, a mismatch between a database design and a third-party application is always possible, but usually indicative of a deeper problem. Why does your database use the uuid data type as a PK in the first place? They are not very efficient compared to a serial or a bigserial. Typically you would use a UUID if you are working in a distributed environment where you need to "guarantee" uniqueness over multiple installations.
Secondly, why does the application require the PK to begin with (incidentally: views do not have a PK, the underlying tables do)? If it is only to view the data, then a PK is rather useless, particularly if it is based on a UUID (and there is thus no conceivable relationship between the PK and the rest of the tuple). If it is used to refer to other data in the same database or to do updates or deletes of existing data, then you need the exact UUID and not some extract of it, because the underlying table or other relations in your database would have the exact UUID. Of course you can convert all UUIDs with the same hex_to_int() function, but that leads straight back to my point above: why use UUIDs in the first place?
Thirdly, do not mess around with things you have little or no knowledge of. This is not intended to be offensive; take it as well-meant advice (look around on the internet for programmers who tried to improve on cryptographic algorithms or random number generation by adding their own twists of obfuscation; quite entertaining reads). There are 5 algorithms for generating UUIDs in the uuid-ossp package and while you know or can easily find out which algorithm is used in your database (the uuid_generate_vX() functions in your table definitions, most likely), do you know how the algorithm works? The claim of practical uniqueness of a UUID is based on its 128 bits, not a 64-bit extract of it. Are you certain that the high 64 bits are random? My guess is that 64 consecutive bits are less random than the "square root of the randomness" (for lack of a better way to phrase the theoretical drop in periodicity of a 64-bit number compared to a 128-bit number) of the full UUID. Why? Because all but one of the algorithms are made up of randomized blocks of otherwise non-random input (such as the MAC address of a network interface, which is always the same on a machine generating millions of UUIDs). Had 64 bits been enough for randomized value uniqueness, then a UUID would have been that long.
What a better solution would be in your case is hard to say, because it is unclear what the third-party application does with the data from your database and how dependent it is on the uniqueness of the "PK" column in the view. An approach that is likely to work if the application does more than trivially display the data without any further use of the "PK" would be to associate a bigint with every retrieved uuid in your database in a (temporary) table and include that bigint in your view by linking on the uuids in your (temporary) tables. Since you can not trigger on SELECT statements, you would need a function to generate the bigint for every uuid the application retrieves. On updates or deletes on the underlying tables of the view or upon selecting data from related tables, you look up the uuid corresponding to the bigint passed in from the application. The lookup table and function would look somewhat like this:
CREATE TEMPORARY TABLE temp_table(
    tempint       bigserial PRIMARY KEY,
    internal_uuid uuid);

CREATE INDEX ON temp_table(internal_uuid);

CREATE FUNCTION temp_int_for_uuid(pk uuid) RETURNS bigint AS $$
DECLARE
    id bigint;
BEGIN
    -- Return the existing mapping for this uuid, if any ...
    SELECT tempint INTO id FROM temp_table WHERE internal_uuid = pk;
    IF NOT FOUND THEN
        -- ... otherwise create the mapping on the fly.
        INSERT INTO temp_table(internal_uuid) VALUES (pk)
        RETURNING tempint INTO id;
    END IF;
    RETURN id;
END; $$ LANGUAGE plpgsql STRICT;
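For instance, the view served to the application could then expose the mapped bigint instead of the uuid. A sketch, with hypothetical view, table and column names:
CREATE VIEW app_view AS
SELECT temp_int_for_uuid(t.id) AS pk,  -- bigint surrogate presented as the "PK"
       t.*                             -- plus whatever columns the application needs
FROM   underlying_table t;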
Not pretty, not efficient, but fool-proof.

Build a hex literal from a substr of the UUID and cast it to bit(64), then to bigint, to get a decimal number:
select ('x'||substr(UUID, 1, 16))::bit(64)::bigint
See SQLFiddle

Solution found.
UUID::text will return a string with hyphens. In order for substring(UUID::text from 1 for 16) to create a string that x can parse as hex, the hyphens need to be stripped first.
The final query looks like:
SELECT hex_to_int(substring((select replace(id::text,'-','')) from 1 for 16))::bigint FROM table
The hex_to_int function needs to be able to handle a bigint, not just int. It looks like:
CREATE OR REPLACE FUNCTION hex_to_int(hexval character varying)
  RETURNS bigint AS
$BODY$
DECLARE
    result bigint;
BEGIN
    -- Let the SQL parser evaluate the hex literal (works for up to 16 hex digits).
    EXECUTE 'SELECT x''' || hexval || '''::bigint' INTO result;
    RETURN result;
END;
$BODY$
LANGUAGE plpgsql;

Related

What is the best way to store varying size columns in Postgres for language translation?

Let's say I create a table in Postgres to store language translations. Say I have a table like EVENT that has multiple columns which need translation to other languages. Rather than adding new columns for each language in EVENT, I would just add new rows in LANGUAGE with the same language_id.
EVENT:

id | EVENT_NAME (fk to LANGUAGE.short) | EVENT_DESCRIPTION (fk to LANGUAGE.long)
---+-----------------------------------+----------------------------------------
0  | 1                                 | 2

LANGUAGE:

language_id | language_type | short (char varying 200) | med (char varying 50) | long (char varying 2000)
------------+---------------+--------------------------+-----------------------+------------------------------
1           | english       | game of soccer           | null                  | null
1           | spanish       | partido de footbol       | null                  | null
2           | english       | null                     | null                  | A really great game of soccer
2           | spanish       | null                     | null                  | Un gran partido de footbol
If I want the language-specific version, I would create a parameterized statement and pass in the language like this:
select event.id, name.short, desc.long
from event e, language name, language desc
where e.id = 0
and e.event_name = name.language_id
and name.language_type = ?
and e.event_desc = desc.language_id
and desc.language_type = ?
My first thought was to have just a single column for the translated text, but I would have to make it big enough to hold any translation. Setting it to 2000 when many records will only be 50 characters seemed wrong. Hence I thought I might add different columns with different sizes and just use the appropriate size for the data I'm storing (the event name can be restricted to 50 characters on the front end and the description can be restricted to 2000 characters).
In the language table only one of the 3 columns (short, med, long) will be set per row. This is just my initial thought, but I'm trying to understand if this is a bad approach. Does the disk still reserve 2250 characters if I just set the short value? I read a while back that if you do this sort of thing in Oracle it has to reserve the space for all columns in the disk block, otherwise if you update the record it would have to do it dynamically, which could be slow. Is Postgres the same?
It looks like you can specify a character varying type without a length. Would it be more efficient (space-wise) to define a single column and not specify the size, or a single column with the size set to 2000?
Just use a single column of data type text. That will perform just as well as character varying(n) in PostgreSQL, because the implementation is exactly the same, minus the length check. PostgreSQL only stores as many characters as the string actually has, so there is no overhead in using text even for short strings.
In the words of the documentation:
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column.
Use text.
With Postgres, unless you really need to put a hard limit on the text size (and usually you do not), use text.
I see posts that varchar without n and text are the same performance but can be slightly slower than varchar(n).
This is incorrect with PostgreSQL 14. text is the most performant.
There is no performance difference among these three types [char, varchar, text], apart from increased storage space when using the blank-padded type [ie. char], and a few extra CPU cycles to check the length when storing into a length-constrained column [ie. varchar, char].
Storage for varchar and text is the same.
The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes of overhead instead of 1. Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values.
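As a minimal sketch of that simplification for the LANGUAGE table from the question (column names adapted freely), the short/med/long columns collapse into one text column:
CREATE TABLE language (
    language_id   int  NOT NULL,
    language_type text NOT NULL,   -- e.g. 'english', 'spanish'
    translation   text NOT NULL,   -- one column, any length
    PRIMARY KEY (language_id, language_type)
);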
Do you need to reinvent this wheel?
Internationalization and localization tools already exist; they're built into many application frameworks.
Most localization systems work by using a string written in the developer's preferred language (for example, English) and then localizing that string as necessary. Then it becomes a simple hash lookup.
Rather than doing the localization in the database, query the message and then let the application framework localize it. This is simpler and faster and easier to work with and avoids putting additional load on the database.
If you really must do it in the database, here's some notes:
language is a confusing name for your table, perhaps messages or translations or localizations.
Use standard IETF language tags such as en-US.
Put any fixed set of text, such as language tags, into a table for referential integrity.
You could use an enum, but they're awkward to read and change.
Rather than having short, medium, and long columns, consider having one row for each message and a size column. This avoids a lot of null columns.
In fact, don't have "size" at all. It's arbitrary and subjective, as is evidenced by your null columns.
Don't prefix columns with the name of the table, fully qualify them as needed.
Separate the message from its localization for referential integrity.
create table languages (
  id serial primary key,
  tag text unique not null
);

-- This table might seem a bit silly, but it allows tables to refer to
-- a single message and enforce referential integrity.
create table messages (
  id bigserial primary key
);

create table message_localizations (
  -- If a message or language is deleted, so is its localization.
  message_id bigint not null references messages on delete cascade,
  language_id int not null references languages on delete cascade,
  localization text not null,

  -- This serves as both a primary key and enforces one localization
  -- per message and language.
  primary key(message_id, language_id)
);

create table events (
  id bigserial primary key,
  name_id bigint not null references messages,
  description_id bigint not null references messages
);
Then join each message with its localization.
select
  events.id,
  ml1.localization as name,
  ml2.localization as description
from events
-- left join so there is a result even if there is no localization. YMMV.
left join languages lt on lt.tag = 'en-US'
left join message_localizations ml1
  on ml1.message_id = name_id and ml1.language_id = lt.id
left join message_localizations ml2
  on ml2.message_id = description_id and ml2.language_id = lt.id
Demonstration.
But, again, you probably want to use an existing localization tool.

SERIAL pseudo-type in PostgreSQL?

Currently I'm using the serial data type in my project. I want to know the following details before changing a previously created SEQUENCE to the SERIAL data type.
From which version onwards does PostgreSQL support the SERIAL data type? Also, if I change my id to SERIAL, will it affect my future code?
What is the max size of serial and its impact?
Can ALTER SEQUENCE impact the sequence numbers?
Are there any drawbacks of the serial data type in the future?
How do I create a gap-less sequence?
All answers can be found in the manual
serial goes back to Postgres 7.2
The underlying sequence is a bigint; the max size is documented in the manual. Also see this note in the documentation of CREATE SEQUENCE:
Sequences are based on bigint arithmetic, so the range cannot exceed the range of an eight-byte integer (-9223372036854775808 to 9223372036854775807).
Obviously. As documented in the manual, that command can set minvalue or restart with a new value, and manipulate many other properties that affect the number generation.
You should use identity columns instead (see the sketch below).
Not possible - that's not what sequences are intended for. See e.g. here for a possible implementation of a gapless number generator.
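A minimal sketch of an identity column, the suggested replacement for serial (table and column names are illustrative):
CREATE TABLE items (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL
);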

Duration of PostgreSQL ALTER COLUMN TYPE int to bigint

Let's say I have a table that has an id that is an INTEGER GENERATED BY DEFAULT AS IDENTITY
I'm looking to document how to change the type, in case an integer turns out to be too small and I need to change the id type from integer to bigint. I'm mainly worried about the time complexity of the change, since it will likely occur when the number of rows in the table is near the maximum an integer can store.
What would the time complexity for the following command be?
ALTER TABLE project ALTER COLUMN id TYPE BIGINT;
This command will have to rewrite the whole table, because bigint takes 8 bytes of storage rather than the 4 of an integer. The table will be locked from concurrent access while this is taking place, so with a big table you should be prepared for a longer downtime.
If you expect that this could be necessary, perform the change as soon as possible, while the table is still small.
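As a rough check of how soon that will be, you could compare the current maximum id against the integer limit (using the project table and id column from the question):
SELECT max(id)::numeric / 2147483647 AS fraction_of_int_range_used
FROM   project;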

How do I use string as a key to PostgreSQL advisory lock?

There is a locking mechanism in PostgreSQL called advisory locking. It provides the following API functions.
The function that allows us to obtain such a lock accepts a bigint argument: pg_advisory_lock(key bigint), or two integer keys: pg_advisory_lock(key1 int, key2 int) (second form).
What abstraction mechanism can I use to be able to use string keys instead of integer ones? Maybe some hashing functions will be able to do the job?
Is it possible to implement this solely in PostgreSQL, without the need to cast the string to an integer at the application level?
If the desired goal is hard to achieve, maybe I can use two integers to identify the row in the table. The second integer could be the primary key of the row, but what integer can I use as a table identifier?
You've already hit upon the most likely candidate: using a synthetic primary key of a table plus a table identifier as a key.
You can use the table's oid (object identifier) from pg_class to specify the table. The convenience cast to the pseudo-type regclass looks this up for you, or you can select c.oid from pg_class c inner join pg_namespace n on n.oid = c.relnamespace where n.nspname = 'public' and c.relname = 'mytable' to get it by schema.
There's a small problem, because oid is internally an unsigned 32-bit integer, but the two-arg form of pg_advisory_lock takes a signed integer. This isn't likely to be a problem in practice as you need to go through a lot of OIDs before that's an issue.
e.g.
SELECT pg_advisory_lock('mytable'::regclass::integer, 42);
However, if you're going to do that, you're basically emulating row locking using advisory locks. So why not just use row locking?
SELECT 1
FROM mytable
WHERE id = 42
FOR UPDATE OF mytable;
Now, if you really had to use a string key, you're going to have to accept that there'll be collisions because you'll be using a fairly small hash.
PostgreSQL has built-in hash functions which are used for hash joins. They are not cryptographic hashes - they're designed to be fast and they produce a fairly small result. That's what you need for this purpose.
They actually hash to int4, and you'd really prefer int8, so you run an even higher risk of collisions. The alternative is to take a slow cryptographic hash like md5 and truncate it, though, and that's just ugly.
So if you really, really feel you must, you could do something like:
select pg_advisory_lock( hashtext('fredfred') );
... but only if your app can cope with the fact that it is inevitable that other strings can produce the same hash, so you might see a row as "locked" that is not truly locked.
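If your PostgreSQL version ships hashtextextended() (available since version 11), the same approach with a 64-bit hash shrinks, though it does not eliminate, the collision risk. A sketch:
select pg_advisory_lock( hashtextextended('fredfred', 0) );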

T-SQL implicit conversion between 2 varchars

I have some T-SQL (SQL Server 2008) that I inherited and am trying to find out why some of the queries are running really slowly. In the Actual Execution Plan I have three clustered index scans which are costing me 19%, 21% and 26%, so this seems to be the source of my problem.
The contents of the fields are usually numeric (but some job numbers have an alpha prefix)
The database design (vendor supplied) is pretty poor. The max length of a job number in their application is 12 chars, but in the tables that are joined it is defined as varchar(50) in some places and varchar(15) in others. My parameter is a varchar(12), but I get the same thing if I change it to a varchar(50).
The node contains this:
Predicate: [Live_Costing].[dbo].[TSTrans].[JobNo] as [sts1].[JobNo]=CONVERT_IMPLICIT(varchar(50),[#JobNo],0)
sts1 is a derived table, but the table it pulls jobno from is a varchar(50)
I don't understand why it's doing an implicit conversion between 2 varchars. Is it just because they are different lengths?
I'm fairly new to execution plans.
Is there an easy way to figure out which node in the exec plan relates to which part of the query?
Is the predicate the join clause?
Regards
Mark
Some variables can have collation.
Regardless you need to verify your collations, which can be specified at server, DB, table, and column level.
First, check your collation between tempdb and the vendor supplied database. It should match. If it doesn't, it will tend to do implicit conversions.
Assuming you cannot modify the vendor supplied code base, one or more of the following should help you:
1) Predefine your temp tables and specify the same collation for the key field as in the db in use, rather than tempdb.
2) Provide collations when doing string comparisons.
3) Specify collation for key values if using "select into" with a temp table
4) Make sure your collations on your tables and columns match your database collation (VERY important if you imported only specific tables from a vendor into an existing database.)
If you can change the vendor supplied code base, I would suggest reviewing the cost of making all of your char keys the same length and NOT varchar. Varchar has a small storage overhead (2 bytes per value). The caveat is that if you create a fixed-length character field not null, it will be padded to the right (unavoidable).
Ideally, you would have int keys, and only use varchar fields for user interaction/lookup:
create table Products(ProductID int not null identity(1,1) primary key clustered, ProductNumber varchar(50) not null)
alter table Products add constraint uckProducts_ProductNumber unique(ProductNumber)
Then do all joins on ProductID rather than ProductNumber; just filtering on ProductNumber would be perfectly fine.