Hashing a String to a Numeric Value in PostgreSQL

Hashing a String to a Numeric Value in PostgreSQL - postgresql

I need to Convert Strings stored in my Database to a Numeric value. Result can be Integer (preferred) or Bigint. This conversion is to be done at Database side in a PL/pgSQL function.
Can someone please point me to some algorithm or any API's that can be used to achieve this?
I have been searching for this on Google for hours now, could not find anything useful so far :(

Just keep the first 32 bits or 64 bits of the MD5 hash. Of course, it voids the main property of md5 (=the probability of collision being infinitesimal) but you'll still get a wide dispersion of values which presumably is good enough for your problem.
SQL functions derived from the other answers:
For bigint:
create function h_bigint(text) returns bigint as $$
select ('x'||substr(md5($1),1,16))::bit(64)::bigint;
$$ language sql;
For int:
create function h_int(text) returns int as $$
select ('x'||substr(md5($1),1,8))::bit(32)::int;
$$ language sql;

You can create a md5 hash value without problems:
select md5('hello, world');
This returns a string with a hex number.
Unfortunately there is no built-in function to convert hex to integer but as you are doing that in PL/pgSQL anyway, this might help:
https://stackoverflow.com/a/8316731/330315

PostgreSQL has hash functions for many column types. You can use hashtext if you want integer hash values, or hashtextextended if you prefer bigint hash values.
Note that hashXXXextended functions take one extra parameter for seed and 0 means that no seed will be used.
In the commit message the author Robert Haas says:
Just in case somebody wants a 64-bit hash value that is compatible
with the existing 32-bit hash values, make the low 32-bits of the
64-bit hash value match the 32-bit hash value when the seed is 0.
Example
postgres=# SELECT hashtextextended('test string of type text', 0);
hashtextextended
----------------------
-6578719834206879717
(1 row)
postgres=# SELECT hashtext('test string of type text');
hashtext
-------------
-1790427109
(1 row)
What about other data types?
You can check all available hash functions on your backend by checking the output for \df hash*. Below you can see the functions available in PG 14.0.
hanefi=# \df hash*
List of functions
Schema | Name | Result data type | Argument data types | Type
------------+--------------------------+------------------+--------------------------+------
pg_catalog | hash_aclitem | integer | aclitem | func
pg_catalog | hash_aclitem_extended | bigint | aclitem, bigint | func
pg_catalog | hash_array | integer | anyarray | func
pg_catalog | hash_array_extended | bigint | anyarray, bigint | func
pg_catalog | hash_multirange | integer | anymultirange | func
pg_catalog | hash_multirange_extended | bigint | anymultirange, bigint | func
pg_catalog | hash_numeric | integer | numeric | func
pg_catalog | hash_numeric_extended | bigint | numeric, bigint | func
pg_catalog | hash_range | integer | anyrange | func
pg_catalog | hash_range_extended | bigint | anyrange, bigint | func
pg_catalog | hash_record | integer | record | func
pg_catalog | hash_record_extended | bigint | record, bigint | func
pg_catalog | hashbpchar | integer | character | func
pg_catalog | hashbpcharextended | bigint | character, bigint | func
pg_catalog | hashchar | integer | "char" | func
pg_catalog | hashcharextended | bigint | "char", bigint | func
pg_catalog | hashenum | integer | anyenum | func
pg_catalog | hashenumextended | bigint | anyenum, bigint | func
pg_catalog | hashfloat4 | integer | real | func
pg_catalog | hashfloat4extended | bigint | real, bigint | func
pg_catalog | hashfloat8 | integer | double precision | func
pg_catalog | hashfloat8extended | bigint | double precision, bigint | func
pg_catalog | hashhandler | index_am_handler | internal | func
pg_catalog | hashinet | integer | inet | func
pg_catalog | hashinetextended | bigint | inet, bigint | func
pg_catalog | hashint2 | integer | smallint | func
pg_catalog | hashint2extended | bigint | smallint, bigint | func
pg_catalog | hashint4 | integer | integer | func
pg_catalog | hashint4extended | bigint | integer, bigint | func
pg_catalog | hashint8 | integer | bigint | func
pg_catalog | hashint8extended | bigint | bigint, bigint | func
pg_catalog | hashmacaddr | integer | macaddr | func
pg_catalog | hashmacaddr8 | integer | macaddr8 | func
pg_catalog | hashmacaddr8extended | bigint | macaddr8, bigint | func
pg_catalog | hashmacaddrextended | bigint | macaddr, bigint | func
pg_catalog | hashname | integer | name | func
pg_catalog | hashnameextended | bigint | name, bigint | func
pg_catalog | hashoid | integer | oid | func
pg_catalog | hashoidextended | bigint | oid, bigint | func
pg_catalog | hashoidvector | integer | oidvector | func
pg_catalog | hashoidvectorextended | bigint | oidvector, bigint | func
pg_catalog | hashtext | integer | text | func
pg_catalog | hashtextextended | bigint | text, bigint | func
pg_catalog | hashtid | integer | tid | func
pg_catalog | hashtidextended | bigint | tid, bigint | func
pg_catalog | hashvarlena | integer | internal | func
pg_catalog | hashvarlenaextended | bigint | internal, bigint | func
(47 rows)
Caveats
If you want to have consistent hashes across different systems, make sure you have the same collation behaviour.
The built-in collatable data types are text, varchar, and char. If you have different collation options, you can see different hash values. For example the sort order of the strings ‘a-a’ and ‘a+a’ flipped in glibc 2.28 (Debian 10, RHEL 8) compared to earlier releases.
If you wish to use the hashes on the same machine, you need not worry, as long as you do not update glibc or use a different collation.
See more details at: https://www.citusdata.com/blog/2020/12/12/dont-let-collation-versions-corrupt-your-postgresql-indexes/

Must it be an integer? The pg_crypto module provides a number of standard hash functions (md5, sha1, etc). They all return bytea. I suppose you could throw away some bits and convert bytea to integer.
bigint is too small to store a cryptographic hash. The largest non-bytea binary type Pg supports is uuid. You could cast a digest to uuid like this:
select ('{'||encode( substring(digest('foobar','sha256') from 1 for 16), 'hex')||'}')::uuid;
uuid
--------------------------------------
c3ab8ff1-3720-e8ad-9047-dd39466b3c89

This is an implementation of Java's String.hashCode():
CREATE OR REPLACE FUNCTION hashCode(_string text) RETURNS INTEGER AS $$
DECLARE
val_ CHAR[];
h_ INTEGER := 0;
ascii_ INTEGER;
c_ char;
BEGIN
val_ = regexp_split_to_array(_string, '');
FOR i in 1 .. array_length(val_, 1)
LOOP
c_ := (val_)[i];
ascii_ := ascii(c_);
h_ = 31 * h_ + ascii_;
raise info '%: % = %', i, c_, h_;
END LOOP;
RETURN h_;
END;
$$ LANGUAGE plpgsql;

Related

Postgres 13 permission issues with Revoke

I am a relatively new user of Postgres 13. Let me first tell you, that the Postgres database is hosted on AWS aurora. I have a user that owns a schema and I have a specific table that this user should only be able to SELECT and INSERT rows to this table and execute TRIGGERS.
I have REVOKED ALL on this table for this user and GRANTED SELECT, INSERT, TRIGGER ON TABLE TO USER. The INSERT, SELECT, and TRIGGER work as expected. However, when I execute a SQL UPDATE on that table it still lets me update a row in that table! I also forgot to tell you I REVOKED ALL and performed the same GRANTS to rds_superuser on this table since this user is referenced to rds_superuser.
Any help would be greatly appreciated!
Following are the results of \d:
Column | Type | Collation | Nullable | Default
-----------------------+--------------------------+-----------+----------+------------------------
id | uuid | | not null | uuid_generate_v4()
patient_medication_id | bigint | | not null |
raw_xml | character varying | | not null |
digital_signature | character varying | | not null |
create_date | timestamp with time zone | | not null | CURRENT_TIMESTAMP
created_by | character varying | | |
update_date | timestamp with time zone | | |
updated_by | character varying | | |
deleted_at | timestamp with time zone | | |
deleted_by | character varying | | |
status | character varying(1) | | | 'A'::character varying
message_type | character varying | | not null |
Indexes:
"rx_cryptographic_signature_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
"rx_cryptographic_signature_fk" FOREIGN KEY (patient_medication_id) REFERENCES tryon.patient_medication(id)
Triggers:
audit_trigger_row AFTER INSERT ON tryon.rx_cryptographic_signature FOR EACH ROW EXECUTE FUNCTION td_audit.if_changed_fn('id', '{}')
audit_trigger_stm AFTER TRUNCATE ON tryon.rx_cryptographic_signature FOR EACH STATEMENT EXECUTE FUNCTION td_audit.if_changed_fn() tr_log_delete_attempt BEFORE DELETE ON tryon.rx_cryptographic_signature FOR EACH STATEMENT EXECUTE FUNCTION tryon.fn_log_update_delete_attempt()
tr_log_update_attempt BEFORE UPDATE ON tryon.rx_cryptographic_signature FOR EACH STATEMENT EXECUTE FUNCTION tryon.fn_log_update_delete_attempt()
Following are the results of \z:
Schema | Name | Type | Access privileges | Column privileges | Policies
--------+----------------------------+-------+---------------------------------------+-------------------+----------
tryon | rx_cryptographic_signature | table | TD_Administrator=art/TD_Administrator+| |
| | | td_administrator=art/TD_Administrator+| |
| | | rds_pgaudit=art/TD_Administrator +| |
| | | rds_superuser=art/TD_Administrator | |
(1 row)
Thanks so much for your help!!

What Postgres 13 index types support distance searches?

Original Question
We've had great results using a K-NN search with a GiST index with gist_trgm_ops. Pure magic. I've got other situations, with other datatypes like timestamp where distance functions would be quite useful. If I didn't dream it, this is, or was, available through pg_catalog. Looking around, I can't find a way to search on indexes by such properties. I think what I'm after, in this case, is AMPROP_DISTANCE_ORDERABLE under-the-hood.
Just checked, and pg_am did have a lot more attributes than it does now, prior to 9.6.
Is there another way to figure out what options various indexes have with a catalog search?
Catalogs
jjanes' answer inspired me to look at the system information functions some more, and to spend a day in the pg_catalog tables. The catalogs for indexes and operators are complicated. The system information functions are a big help. This piece proved super useful for getting a handle on things:
https://postgrespro.com/blog/pgsql/4161264
I think the conclusion is "no, you can't readily figure out what data types and indexes support proximity searches." The relevant attribute is a property of a column in a specific index. However, it looks like nearest-neighbor searching requires a GiST index, and that there are readily-available index operator classes to add K-NN searching to a huge range of common types. Happy for corrections on these conclusions, or the details below.
Built-in Distance Support
https://www.postgresql.org/docs/current/gist-builtin-opclasses.html
From various bits of the docs, it sounds like there are distance (proximity, nearest neighbor, K-NN) operators for GiST indexes on a handful of built-in geometric types.
box
circle
point
poly
B-tree Operator Classes
Not listed as such in the docs, but visible with this query:
select am.amname AS index_method
, opc.opcname AS opclass_name
, opc.opcintype::regtype AS indexed_type
, opc.opcdefault AS is_default
from pg_am am
, pg_opclass opc
where opc.opcmethod = am.oid
and am.amname = 'btree'
order by 1,2;
B-tree GiST Distance Support
https://www.postgresql.org/docs/current/btree-gist.html
I guess a B-tree is a special case of a GiST, and there's a B-tree operator class to match. The docs say these native types are supported:
int2
int4
int8
float4
float8
timestamp with time zone
timestamp without time zone
time without time zone
date
interval
oid
money
BRIN Built-in Operator Classes
https://www.postgresql.org/docs/current/brin-builtin-opclasses.html
There are over 70 listed in the internals docs.
GIN Built-in Operator Classes
https://www.postgresql.org/docs/12/gin-builtin-opclasses.html
array_ops
jsonb_ops
jsonb_path_ops
tsvector_ops
Alternative Text Opts
https://www.postgresql.org/docs/current/indexes-opclass.html
There are special operator classes for text comparisons made character-by-character, rather than through a collation. Or so the docs say:
text_pattern_ops
varchar_pattern_ops
bpchar_pattern_ops
pg_trgm
Beyond this, the included pg_trgm module includes operators for GIN and GiST, with the GiST version optimizing <->. I think this shows up as:
text
Note: Postgres 14 modifies pg_trgm to allow you to adjust the "signature length" for the index entry. Longer is possibly more accurate, shorter signatures are smaller on disk. If you've been using pg_trgm, it might be worth experimenting with the signature length in PG 14.
https://www.postgresql.org/docs/current/pgtrgm.html
SP-GiST Built-in Operator Classes
box_ops
kd_point_ops
network_ops
poly_ops
quad_point_ops
range_ops
text_ops
pg_operator search
Here's a search on pg_operator to look for matches starting from the <-> operator itself:
select oprnamespace::regnamespace::text as schema_name,
oprowner::regrole as owner,
oprname as operator,
oprleft::regtype as left,
oprright::regtype as right,
oprresult::regtype as result,
oprcom::regoperator as commutator
from pg_operator
where oprname = '<->'
order by 1
Output from one of our severs:
| schema_name | owner | operator | left | right | result | commutator |
+-------------+----------+----------+-----------------------------+-----------------------------+------------------+--------------------------------------------------------------+
| extensions | postgres | <-> | text | text | real | <->(text,text) |
| extensions | postgres | <-> | money | money | money | <->(money,money) |
| extensions | postgres | <-> | date | date | integer | <->(date,date) |
| extensions | postgres | <-> | real | real | real | <->(real,real) |
| extensions | postgres | <-> | double precision | double precision | double precision | <->(double precision,double precision) |
| extensions | postgres | <-> | smallint | smallint | smallint | <->(smallint,smallint) |
| extensions | postgres | <-> | integer | integer | integer | <->(integer,integer) |
| extensions | postgres | <-> | bigint | bigint | bigint | <->(bigint,bigint) |
| extensions | postgres | <-> | interval | interval | interval | <->(interval,interval) |
| extensions | postgres | <-> | oid | oid | oid | <->(oid,oid) |
| extensions | postgres | <-> | time without time zone | time without time zone | interval | <->(time without time zone,time without time zone) |
| extensions | postgres | <-> | timestamp without time zone | timestamp without time zone | interval | <->(timestamp without time zone,timestamp without time zone) |
| extensions | postgres | <-> | timestamp with time zone | timestamp with time zone | interval | <->(timestamp with time zone,timestamp with time zone) |
| pg_catalog | postgres | <-> | box | box | double precision | <->(box,box) |
| pg_catalog | postgres | <-> | path | path | double precision | <->(path,path) |
| pg_catalog | postgres | <-> | line | line | double precision | <->(line,line) |
| pg_catalog | postgres | <-> | lseg | lseg | double precision | <->(lseg,lseg) |
| pg_catalog | postgres | <-> | polygon | polygon | double precision | <->(polygon,polygon) |
| pg_catalog | postgres | <-> | circle | circle | double precision | <->(circle,circle) |
| pg_catalog | postgres | <-> | point | circle | double precision | <->(circle,point) |
| pg_catalog | postgres | <-> | circle | point | double precision | <->(point,circle) |
| pg_catalog | postgres | <-> | point | polygon | double precision | <->(polygon,point) |
| pg_catalog | postgres | <-> | polygon | point | double precision | <->(point,polygon) |
| pg_catalog | postgres | <-> | circle | polygon | double precision | <->(polygon,circle) |
| pg_catalog | postgres | <-> | polygon | circle | double precision | <->(circle,polygon) |
| pg_catalog | postgres | <-> | point | point | double precision | <->(point,point) |
| pg_catalog | postgres | <-> | box | line | double precision | <->(line,box) |
| pg_catalog | postgres | <-> | tsquery | tsquery | tsquery | 0 |
| pg_catalog | postgres | <-> | line | box | double precision | <->(box,line) |
| pg_catalog | postgres | <-> | point | line | double precision | <->(line,point) |
| pg_catalog | postgres | <-> | line | point | double precision | <->(point,line) |
| pg_catalog | postgres | <-> | point | lseg | double precision | <->(lseg,point) |
| pg_catalog | postgres | <-> | lseg | point | double precision | <->(point,lseg) |
| pg_catalog | postgres | <-> | point | box | double precision | <->(box,point) |
| pg_catalog | postgres | <-> | box | point | double precision | <->(point,box) |
| pg_catalog | postgres | <-> | lseg | line | double precision | <->(line,lseg) |
| pg_catalog | postgres | <-> | line | lseg | double precision | <->(lseg,line) |
| pg_catalog | postgres | <-> | lseg | box | double precision | <->(box,lseg) |
| pg_catalog | postgres | <-> | box | lseg | double precision | <->(lseg,box) |
| pg_catalog | postgres | <-> | point | path | double precision | <->(path,point) |
| pg_catalog | postgres | <-> | path | point | double precision | <->(point,path) |
+-------------+----------+----------+-----------------------------+-----------------------------+------------------+--------------------------------------------------------------+
Did I miss any index opts worth knowing about?
Checking Out Live Indexes
Here's a longer-than-it-should-be-because-I-still-find-the-catalogs-confusing query to pull out the columns from each user index, and figure out their more interesting properties. For a nice, short catalog search of much utility, see https://dba.stackexchange.com/questions/186944/how-to-list-all-the-indexes-along-with-their-type-btree-brin-hash-etc
with
basic_details as (
select relnamespace::regnamespace::text as schema_name,
indrelid::regclass::text as table_name,
indexrelid::regclass::text as index_name,
unnest(indkey) as column_ordinal_position , -- WITH ORDINALITY would be nice here, didn't get it working.
generate_subscripts(indkey, 1) + 1 as column_position_in_index --
from pg_index
join pg_class on pg_class.oid = pg_index.indrelid
),
enriched_details as (
select basic_details.schema_name,
basic_details.table_name,
basic_details.index_name,
basic_details.column_ordinal_position,
basic_details.column_position_in_index,
columns.column_name,
columns.udt_name as column_type_name
from basic_details
join information_schema.columns as columns
on columns.table_schema = basic_details.schema_name
and columns.table_name = basic_details.table_name
and columns.ordinal_position = basic_details.column_ordinal_position
where schema_name not like 'pg_%'
)
select *,
-- https://postgrespro.com/blog/pgsql/4161264
coalesce(pg_index_column_has_property(index_name,column_position_in_index,'distance_orderable'), false) as supports_knn_searches,
coalesce(pg_index_column_has_property(index_name,column_position_in_index,'search_array'), false) as supports_in_searches,
coalesce(pg_index_column_has_property(index_name,column_position_in_index,'returnable'), false) as supports_index_only_scans,
(select indexdef
from pg_indexes
where pg_indexes.schemaname = enriched_details.schema_name
and pg_indexes.indexname = enriched_details.index_name) as index_definition
from enriched_details
order by supports_in_searches desc,
schema_name,
table_name,
index_name

timestamp type supports KNN with GiST indexes using the <-> operator created by the btree_gist extension.
You can check if a specific column of a specific index supports it, like this:
select pg_index_column_has_property('pgbench_history_mtime_idx'::regclass,1,'distance_orderable');

As best as I can tell, here's the state of play as of PG 14:
GiST indexes may support nearest-neighbor (K-NN) proximity <--> search, and always have.
SP-GiST added such support as of PG 12.
RUM indexes (not in core) also support K-NN.
In all cases, support is done in the operator class:
https://www.postgresql.org/docs/current/indexes-opclass.html
That's what determines if distance_orderable works for a specific data type on a specific kind of index. Built-in, some of the geometric and text vector types work out-of-the box. Other than that small set, many more types are supported via specific operator classes, such as:
https://www.postgresql.org/docs/current/btree-gist.html
https://www.postgresql.org/docs/current/pgtrgm.html
In the case of SP-GiST, there are a lot fewer types supported than with GiST, once you've installed btree_gist:
https://www.postgresql.org/docs/14/spgist-builtin-opclasses.html
It looks like text_opts and range_opts do not support proximity searches. However, for tsrange, etc., there are likely enough options with other tools.

PostgreSQL: what is the difference between 0 and (0)::smallint as default?

I'm playing around with column types and default values, and came across two columns, both smallint but with different defaults:
has 0
has (0)::smallint
When I change 2 to the value 1 has, both are the same and nothing seems changed. Of course, I googled to find the answer to my question, and found that :: is used for type casting. But I wonder, if it is a default value, why would you need type casting?

There is no difference, and the casting is not necessary.
# create table somesmallints (a smallint default 0, b smallint default 0::smallint, c smallint default (0)::smallint);
CREATE TABLE
# \d+ somesmallints;
Table "public.somesmallints"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+----------+-----------+----------+---------------+---------+--------------+-------------
a | smallint | | | 0 | plain | |
b | smallint | | | (0)::smallint | plain | |
c | smallint | | | (0)::smallint | plain | |
# select 0 = 0::smallint;
?column?
----------
t
(1 row)

how to change the value of a column of a sequence?

Which is the best way to
1)rename this sequence from episode_id_seq to sequence_id_seq
2)rename the value of sequence_name to sequence_id_seq from episode_id_seq
3)rename the owned value from episode.id to sequence.id
test777=# \d episode_id_seq
Sequence "public.episode_id_seq"
Column | Type | Value
---------------+---------+---------------------
sequence_name | name | episode_id_seq
last_value | bigint | 1
start_value | bigint | 1
increment_by | bigint | 1
max_value | bigint | 9223372036854775807
min_value | bigint | 1
cache_value | bigint | 1
log_cnt | bigint | 32
is_cycled | boolean | f
is_called | boolean | t
Owned by: public.episode.id

You can use ALTER SEQUENCE
ALTER SEQUENCE episode_id_seq RENAME TO sequence_id_seq;
As above.
ALTER SEQUENCE episode_id_seq OWNED BY sequence.id;

Bit masking in Postgres

I have this query
SELECT * FROM "functions" WHERE (models_mask & 1 > 0)
and the I get the following error:
PGError: ERROR: operator does not exist: character varying & integer
HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.
The models_mask is an integer in the database. How can I fix this.
Thank you!

Check out the docs on bit operators for Pg.
Essentially & only works on two like types (usually bit or int), so model_mask will have to be CASTed from varchar to something reasonable like bit or int:
models_mask::int & 1 -or- models_mask::int::bit & b'1'
You can find out what types an operator works with using \doS in psql
pg_catalog | & | bigint | bigint | bigint | bitwise and
pg_catalog | & | bit | bit | bit | bitwise and
pg_catalog | & | inet | inet | inet | bitwise and
pg_catalog | & | integer | integer | integer | bitwise and
pg_catalog | & | smallint | smallint | smallint | bitwise and
Here is a quick example for more information
# SELECT 11 & 15 AS int, b'1011' & b'1111' AS bin INTO foo;
SELECT
# \d foo
Table "public.foo"
Column | Type | Modifiers
--------+---------+-----------
int | integer |
bin | "bit" |
# SELECT * FROM foo;
int | bin
-----+------
11 | 1011

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Hashing a String to a Numeric Value in PostgreSQL - postgresql

You can create a md5 hash value without problems: select md5('hello, world'); This returns a string with a hex number. Unfortunately there is no built-in function to convert hex to integer but as you are doing that in PL/pgSQL anyway, this might help: https://stackoverflow.com/a/8316731/330315

Related

Postgres 13 permission issues with Revoke

What Postgres 13 index types support distance searches?

PostgreSQL: what is the difference between 0 and (0)::smallint as default?

how to change the value of a column of a sequence?

Bit masking in Postgres

Categories

Resources