How do I use a string as a key for a PostgreSQL advisory lock?

There is a locking mechanism in PostgreSQL called advisory locking, and it provides API functions for taking such locks.
The function that obtains a lock accepts either a single bigint key: pg_advisory_lock(key bigint), or two integer keys: pg_advisory_lock(key1 int, key2 int) (the second form).
What abstraction can I use so that I can work with string keys instead of integer ones? Perhaps some hashing function could do the job?
Is it possible to implement this solely in PostgreSQL, without having to convert the string to an integer at the application level?
If the desired goal is hard to achieve, maybe I can use two integers to identify the row in a table. The second integer could be the primary key of the row, but what integer can I use as a table identifier?

You've already hit upon the most likely candidate: using a synthetic primary key of a table plus a table identifier as a key.
You can use the table's oid (object identifier) from pg_class to specify the table. The convenience cast to the pseudo-type regclass looks this up for you, or you can run select c.oid from pg_class c inner join pg_namespace n on n.oid = c.relnamespace where n.nspname = 'public' and c.relname = 'mytable' to look it up by schema and table name.
There's a small problem: oid is internally an unsigned 32-bit integer, but the two-argument form of pg_advisory_lock takes signed integers. This is unlikely to be a problem in practice, as you need to go through a lot of OIDs before it becomes an issue.
e.g.
SELECT pg_advisory_lock('mytable'::regclass::integer, 42);
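Any lock taken this way should eventually be released with the same key pair. A sketch of the full pairing (casting through oid explicitly here; 42 stands in for the row's primary key):
SELECT pg_advisory_lock('mytable'::regclass::oid::int, 42);
-- ... work on the row with id = 42 ...
SELECT pg_advisory_unlock('mytable'::regclass::oid::int, 42);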
However, if you're going to do that, you're basically emulating row locking using advisory locks. So why not just use row locking?
SELECT 1
FROM mytable
WHERE id = 42
FOR UPDATE OF mytable;
Now, if you really had to use a string key, you're going to have to accept that there'll be collisions because you'll be using a fairly small hash.
PostgreSQL has built-in hash functions which are used for hash joins. They are not cryptographic hashes - they're designed to be fast and they produce a fairly small result. That's what you need for this purpose.
They actually hash to int4, and you'd really prefer int8, so you run an even higher risk of collisions. The alternative is to take a slow cryptographic hash like md5 and truncate it, though, and that's just ugly.
So if you really, really feel you must, you could do something like:
select pg_advisory_lock( hashtext('fredfred') );
... but only if your app can cope with the fact that it is inevitable that other strings can produce the same hash, so you might see a row as "locked" that is not truly locked.
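If the lock only needs to be held for the duration of the current transaction, a sketch using the transaction-scoped variant pg_advisory_xact_lock may be more convenient, since it is released automatically at commit or rollback (the same collision caveat applies):
BEGIN;
SELECT pg_advisory_xact_lock( hashtext('fredfred') );  -- released automatically at COMMIT/ROLLBACK
-- ... do the protected work ...
COMMIT;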

Related

Is there an efficiency difference between varchar and int as PK?

Could somebody tell me whether it is a good idea to use varchar as a PK? I mean, is it less efficient than int/uuid, or about equal?
For example: a car VIN. I want to use it as a PK, but I'm not sure how well it will be indexed or work as a FK, or whether there are pitfalls.
It depends on which kind of data you are going to store.
In some cases (I would say in most cases) it is better to use integer-based primary keys:
for instance, a bigint needs only 8 bytes, while a varchar can require more space. For this reason, a varchar comparison is often more costly than a bigint comparison.
when joining tables it is more efficient to join on integer-based values rather than strings
an integer-based key as a unique key is more appropriate for table relations. For instance, if you are going to store this primary key in other tables as a separate column, a varchar will require more space in those tables too (see point 1).
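For illustration, a minimal sketch of the usual pattern (table and column names here are hypothetical): keep the natural key (the VIN) unique, but propagate a compact surrogate key to the referencing tables.
CREATE TABLE car (
    car_id bigserial PRIMARY KEY,        -- compact surrogate key used for joins and FKs
    vin    varchar(17) NOT NULL UNIQUE   -- natural key still enforced, but not repeated elsewhere
);
CREATE TABLE service_record (
    record_id bigserial PRIMARY KEY,
    car_id    bigint NOT NULL REFERENCES car,  -- 8 bytes per row instead of the 17-character VIN
    note      text
);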
This post on stackexchange compares non-integer types of primary keys on a particular example.

Create big integer from the big end of a uuid in PostgreSQL

I have a third-party application connecting to a view in my PostgreSQL database. It requires the view to have a primary key but can't handle the UUID type (which is the primary key for the view). It also can't handle the UUID as the primary key if it is served as text from the view.
What I'd like to do is convert the UUID to a number and use that as the primary key instead. However,
SELECT x'14607158d3b14ac0b0d82a9a5a9e8f6e'::bigint
Fails because the number is out of range.
So instead, I want to use SQL to take the big end of the UUID and create an int8 / bigint. I should clarify that maintaining order is 'desirable' but I understand that some of the order will change by doing this.
I tried:
SELECT x(substring(UUID::text from 1 for 16))::bigint
but the x operator for converting hex doesn't seem to like brackets. I abstracted it into a function but
SELECT hex_to_int(substring(UUID::text from 1 for 16))::bigint
still fails.
How can I get a bigint from the 'big end' half of a UUID?
Fast and without dynamic SQL
Cast the leading 16 hex digits of a UUID in text representation as bitstring bit(64) and cast that to bigint. See:
Convert hex in text representation to decimal number
Conveniently, excess hex digits to the right are truncated in the cast to bit(64) automatically - exactly what we need.
Postgres accepts various formats for input. Your given string literal is one of them:
14607158d3b14ac0b0d82a9a5a9e8f6e
The default text representation of a UUID (and the text output in Postgres for data type uuid) adds hyphens at predefined places:
14607158-d3b1-4ac0-b0d8-2a9a5a9e8f6e
The manual:
A UUID is written as a sequence of lower-case hexadecimal digits, in
several groups separated by hyphens, specifically a group of 8 digits
followed by three groups of 4 digits followed by a group of 12 digits,
for a total of 32 digits representing the 128 bits.
If input format can vary, strip hyphens first to be sure:
SELECT ('x' || translate(uuid_as_string, '-', ''))::bit(64)::bigint;
Cast actual uuid input with uuid::text.
Note that Postgres uses signed integers, so the bigint overflows to negative numbers in the upper half - which should be irrelevant for this purpose.
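Applied to an actual uuid column, the same cast might look like this (table and column names are placeholders):
SELECT ('x' || translate(uuid_col::text, '-', ''))::bit(64)::bigint AS id_bigint
FROM   some_table;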
DB design
If at all possible add a bigserial column to the underlying table and use that instead.
This is all very shaky, both the problem and the solution you describe in your self-answer.
First, a mismatch between a database design and a third-party application is always possible, but usually indicative of a deeper problem. Why does your database use the uuid data type as a PK in the first place? They are not very efficient compared to a serial or a bigserial. Typically you would use a UUID if you are working in a distributed environment where you need to "guarantee" uniqueness over multiple installations.
Secondly, why does the application require the PK to begin with (incidentally: views do not have a PK, the underlying tables do)? If it is only to view the data then a PK is rather useless, particularly if it is based on a UUID (and there is thus no conceivable relationship between the PK and the rest of the tuple). If it is used to refer to other data in the same database or do updates or deletes of existing data, then you need the exact UUID and not some extract of it because the underlying table or other relations in your database would have the exact UUID. Of course you can convert all UUID's with the same hex_to_int() function, but that leads straight back to my point above: why use uuids in the first place?
Thirdly, do not mess around with things you have little or no knowledge of. This is not intended to be offensive, take it as well-meant advice (look around on the internet for programmers who tried to improve on cryptographic algorithms or random number generation by adding their own twists of obfuscation; quite entertaining reads). There are 5 algorithms for generating UUID's in the uuid-ossp package and while you know or can easily find out which algorithm is used in your database (the uuid_generate_vX() functions in your table definitions, most likely), do you know how the algorithm works? The claim of practical uniqueness of a UUID is based on its 128 bits, not a 64-bit extract of it. Are you certain that the high 64-bits are random? My guess is that 64 consecutive bits are less random than the "square root of the randomness" (for lack of a better way to phrase the theoretical drop in periodicity of a 64-bit number compared to a 128-bit number) of the full UUID. Why? Because all but one of the algorithms are made up of randomized blocks of otherwise non-random input (such as the MAC address of a network interface, which is always the same on a machine generating millions of UUIDs). Had 64 bits been enough for randomized value uniqueness, then a uuid would have been that long.
What a better solution would be in your case is hard to say, because it is unclear what the third-party application does with the data from your database and how dependent it is on the uniqueness of the "PK" column in the view. An approach that is likely to work if the application does more than trivially display the data without any further use of the "PK" would be to associate a bigint with every retrieved uuid in your database in a (temporary) table and include that bigint in your view by linking on the uuids in your (temporary) tables. Since you can not trigger on SELECT statements, you would need a function to generate the bigint for every uuid the application retrieves. On updates or deletes on the underlying tables of the view or upon selecting data from related tables, you look up the uuid corresponding to the bigint passed in from the application. The lookup table and function would look somewhat like this:
CREATE TEMPORARY TABLE temp_table(
    tempint       bigserial PRIMARY KEY,
    internal_uuid uuid
);
CREATE INDEX ON temp_table(internal_uuid);

CREATE FUNCTION temp_int_for_uuid(pk uuid) RETURNS bigint AS $$
DECLARE
    id bigint;
BEGIN
    SELECT tempint INTO id FROM temp_table WHERE internal_uuid = pk;
    IF NOT FOUND THEN
        INSERT INTO temp_table(internal_uuid) VALUES (pk)
        RETURNING tempint INTO id;
    END IF;
    RETURN id;
END; $$ LANGUAGE plpgsql STRICT;
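A usage sketch (view and table names are hypothetical): expose the generated bigint by calling the function from the view, so the application sees a stable bigint for each uuid within the session:
CREATE VIEW app_view AS
SELECT temp_int_for_uuid(t.id) AS pk_bigint,  -- t.id is the uuid PK of the underlying table
       t.*
FROM   underlying_table t;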
Not pretty, not efficient, but fool-proof.
Build a hex literal from the leading characters of the UUID text (hyphens stripped) and cast it to bit(64), then to bigint:
select ('x' || substr(replace(UUID::text, '-', ''), 1, 16))::bit(64)::bigint
Solution found.
UUID::text will return a string with hyphens. In order for substring(UUID::text from 1 for 16) to create a string that x can parse as hex the hyphens need to be stripped first.
The final query looks like:
SELECT hex_to_int(substring(replace(id::text, '-', '') from 1 for 16)) FROM table
The hex_to_int function needs to be able to handle a bigint, not just an int. It looks like:
CREATE OR REPLACE FUNCTION hex_to_int(hexval character varying)
  RETURNS bigint AS
$BODY$
DECLARE
    result bigint;
BEGIN
    EXECUTE 'SELECT x''' || hexval || '''::bigint' INTO result;
    RETURN result;
END;
$BODY$
LANGUAGE plpgsql;

Integer array lookup using Postgres

Let's say I have a table called tag:
CREATE TABLE tag (
    id   SERIAL PRIMARY KEY,
    text TEXT NOT NULL UNIQUE
);
And I use integer arrays on other tables to reference those tags:
CREATE TABLE device (
    id      SERIAL PRIMARY KEY,
    tag_ids INTEGER[] NOT NULL DEFAULT '{}'
);
What is the simplest and most efficient way that I can map the tag_ids to the appropriate rows in tag such that I can query the device table and the results will include a tags column with the text of each tag in a text array?
I understand that this is not the preferred technique and has a number of significant disadvantages. I understand that there is no way to enforce referential integrity in arrays. I understand that a many-to-many join would be much simpler and probably better way to implement tagging.
The database normalization lectures aside, is there a graceful way to do this in postgres? Would it make sense to write a function to accomplish this?
Untested, but IIRC:
SELECT
    d.*, t."text"
FROM
    device d
    left outer join tag t on (d.tag_ids @> ARRAY[t.id])
should be able to use a GiST or GIN index on d.tag_ids. That's also useful for queries where you want to say "find rows containing tag [x]".
See the array operators docs for details on the @> / <@ containment operators.
The intarray module provides a gist opclass for arrays of integer which I'd recommend using; it's more compact and faster to update.
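A sketch of the index being recommended (the index name is illustrative; gist__int_ops is the opclass provided by the intarray extension):
CREATE EXTENSION IF NOT EXISTS intarray;
CREATE INDEX device_tag_ids_idx ON device USING gist (tag_ids gist__int_ops);
-- or, without intarray, the built-in GIN opclass for arrays:
-- CREATE INDEX device_tag_ids_idx ON device USING gin (tag_ids);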
I would recommend a combination of unnest and join for this.
i.e. something of the form:
select d.id, d.tag_id, t.text
from (select id, unnest(tag_ids) as tag_id from device) as d
join tag as t on (t.id = d.tag_id)
order by d.id;
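If what you ultimately want is a single text-array column of tag names per device (as the question asks), a sketch using an ARRAY subquery achieves that:
select d.id,
       array(select t.text from tag t where t.id = any(d.tag_ids)) as tags
from device d;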

In Postgres, is it performance-critical to define a low-cardinality column as int and not text?

I have a column with 4 options.
The column is defined as text.
The table is big - 100 million records and growing.
The table is used as a report table.
The index on the table is (provider_id, date, enum_field).
I wonder if I should change enum_field from text to int and how performance-critical this is.
Using Postgres 9.1.
Table:
provider_report:
id bigserial NOT NULL,
provider_id bigint,
date timestamp without time zone,
enum_field character varying,
....
Index:
provider_id,date,enum_field
TL;DR version: worrying about this is probably not worth your time.
Long version:
There is an enum type in Postgres:
create type myenum as enum('foo', 'bar');
There are pros and cons related to using it vs a varchar or an integer field. Mostly pros imho.
In terms of size, it's stored as an oid, so an int32 type. This makes it smaller than a varchar populated with typical values (e.g. 'draft', 'published', 'pending', 'completed', whatever your enum is about), and the same size as an int type. If you have very few values, a smallint / int16 will admittedly be smaller. Some of your performance change will come from there (smaller vs larger field, i.e. mostly negligible).
Validation is possible in each case, be it through the built-in catalog lookup for the enum, or a check constraint or a foreign key for a varchar or an int. Some of your performance change will come from there, and it'll probably not be worth your time either.
Another benefit of the enum type is that it is ordered. In the above example, 'foo'::myenum < 'bar'::myenum, making it possible to order by enumcol. To achieve the same using a varchar or an int, you'll need a separate table with a sortidx column or something... In this case, the enum can yield an enormous benefit if you ever want to order by your enum's values. This brings us to (imho) the only gotcha, which is related to how the enum type is stored in the catalog...
Internally, each enum's value carries an oid, and the latter are stored as is within the table. So it's technically an int32. When you create the enum type, its values are stored in the correct order within the catalog. In the above example, 'foo' would have an oid lower than 'bar'. This makes it very efficient for Postgres to order by an enum's value, since it amounts to sorting int32 values.
When you ALTER your enum, however, you may end up in a situation where you change that order. For instance, imagine you alter the above enum in such a way that myenum is now ('foo', 'baz', 'bar'). For reasons tied to efficiency, Postgres does not assign a new oid for existing values and rewrite the tables that use them, let alone invalidate cached query plans that use them. What it does instead is populate a separate field in the pg_catalog, so as to make it yield the correct sort order. From that point forward, ordering by the enum field requires an extra lookup, which de facto amounts to joining the table with a separate values table that carries a sortidx field -- much like you would do with a varchar or an int if you ever wanted to sort them.
This is usually fine and perfectly acceptable. Occasionally, it's not. When it's not, there is a solution: alter the tables with the enum type and change their values to varchar; also locate and adjust functions and triggers that make use of it as you do. Then drop the type entirely, recreate it to get fresh oid values, and finally alter the tables back to where they were and readjust the functions and triggers. Not trivial, but certainly feasible.
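A sketch of that procedure for a hypothetical table and column (mytable.enumcol), simplified to the DDL steps only:
-- 1. temporarily move the column off the enum type
ALTER TABLE mytable ALTER COLUMN enumcol TYPE text USING enumcol::text;
-- 2. drop and recreate the type so its values get fresh oids in the desired order
DROP TYPE myenum;
CREATE TYPE myenum AS ENUM ('foo', 'baz', 'bar');
-- 3. move the column back to the enum type
ALTER TABLE mytable ALTER COLUMN enumcol TYPE myenum USING enumcol::myenum;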
It would be best to define enum_field as an ENUM type. It takes minimal space and checks which values are allowed.
As for performance: the only reliable way to know whether it really affects performance is to test it (with a proper set of correct tests). My guess: the difference will be less than 5%.
And if you really want to change the table - don't forget to VACUUM it after the change.
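If you do go the ENUM route, the conversion of the existing varchar column might look like this (the enum labels are hypothetical placeholders for your four values):
CREATE TYPE provider_enum AS ENUM ('value_a', 'value_b', 'value_c', 'value_d');
ALTER TABLE provider_report
    ALTER COLUMN enum_field TYPE provider_enum
    USING enum_field::text::provider_enum;
VACUUM ANALYZE provider_report;  -- as suggested above, vacuum after the change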

postgresql hstore key/value vs traditional SQL performance

I need to develop a key/value backend, something like this:
Table T1 id-PK, Key - string, Value - string
INSERT INTO T1 (key, value) VALUES ('String1', 'Value1');
INSERT INTO T1 (key, value) VALUES ('String1', 'Value2');
Table T2 id-PK2, id2->external key to id
some other data in T2, which references data in T1 (like users which have those K/V etc)
I heard about PostgreSQL hstore with GIN/GIST. What is better (performance-wise)?
Doing it the traditional way with SQL joins and separate Key/Value columns?
Does PostgreSQL hstore perform better in this case?
The format of the data should be any key=>any value.
I also want to do text matching, e.g. partial searches (LIKE with % in SQL, or the hstore equivalent).
I plan to have around 1M-2M entries in it and probably scale at some point.
What do you recommend ? Going the SQL traditional way/PostgreSQL hstore or any other distributed key/value store with persistence?
If it helps, my server is a VPS with 1-2GB RAM, so not particularly powerful hardware. I was also thinking of adding a cache layer on top of this, but I think that rather complicates the problem. I just want good performance for 2M entries. Updates will be done often, but searches even more often.
Thanks.
Your question is unclear because you're not clear about your objective.
The key here is the index (pun intended) - if you're dealing with a large number of keys you want to be able to retrieve them with the fewest lookups and without pulling up unrelated data.
Short answer is you probably don't want to use hstore, but let's look at it in more detail...
Does each id have many key/value pairs (hundreds+)? Don't use hstore.
Will any of your values contain large blocks of text (4kb+)? Don't use hstore.
Do you want to be able to search by keys in wildcard expressions? Don't use hstore.
Do you want to do complex joins/aggregation/reports? Don't use hstore.
Will you update the value for a single key? Don't use hstore.
Multiple keys with the same name under an id? Can't use hstore.
So what's the use of hstore? Well, one good scenario would be if you wanted to hold key/value pairs for an external application where you know you always want to retrieve all key/values and will always save the data back as a block (i.e., it's never edited in place). At the same time you do want some flexibility to be able to search this data - albeit very simply - rather than storing it in, say, a block of XML or JSON. In this case, since the number of key/value pairs is small, you save on space because you're compressing several tuples into one hstore.
Consider this as your table:
CREATE TABLE kv (
    id        /* SOME TYPE */ PRIMARY KEY,
    key_name  TEXT NOT NULL,
    key_value TEXT,
    UNIQUE (id, key_name)
);
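By contrast, for the narrow use case described above, the hstore-shaped variant of the same data might look like this (a sketch; it assumes the hstore extension is installed, and the names are illustrative):
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE TABLE kv_doc (
    id    serial PRIMARY KEY,
    props hstore NOT NULL DEFAULT ''
);
-- GIN index to support simple key/value containment searches
CREATE INDEX kv_doc_props_idx ON kv_doc USING gin (props);
-- store the whole block at once, fetch it back whole, or filter on a pair
INSERT INTO kv_doc (props) VALUES ('color => red, size => large');
SELECT props FROM kv_doc WHERE props @> 'color => red';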
I think the design is poorly normalized. Try something more like this:
CREATE TABLE t1
(
    t1_id serial PRIMARY KEY,
    <other data which depends on t1_id and nothing else>,
    -- possibly an hstore, but maybe better as a separate table
    t1_props hstore
);

-- if properties are done as a separate table:
CREATE TABLE t1_properties
(
    t1_id     int NOT NULL REFERENCES t1,
    key_name  text NOT NULL,
    key_value text,
    PRIMARY KEY (t1_id, key_name)
);
If the properties are small and you don't need to use them heavily in joins or with fancy selection criteria, an hstore may suffice. Elliot laid out some sensible things to consider in that regard.
Your reference to users suggests that this is incomplete, but you didn't really give enough information to suggest where those belong. You might get by with an array in t1, or you might be better off with a separate table.