I've got a Postgres ORDER BY issue with the following table:
em_code name
EM001 AAA
EM999 BBB
EM1000 CCC
To insert a new record to the table,
I select the last record with SELECT * FROM employees ORDER BY em_code DESC
Strip alphabets from em_code usiging reg exp and store in ec_alpha
Cast the remating part to integer ec_num
Increment by one ec_num++
Pad with sufficient zeors and prefix ec_alpha again
When em_code reaches EM1000, the above algorithm fails.
First step will return EM999 instead EM1000 and it will again generate EM1000 as new em_code, breaking the unique key constraint.
Any idea how to select EM1000?
Since Postgres 9.6, it is possible to specify a collation which will sort columns with numbers naturally.
https://www.postgresql.org/docs/10/collation.html
-- First create a collation with numeric sorting
CREATE COLLATION numeric (provider = icu, locale = 'en#colNumeric=yes');
-- Alter table to use the collation
ALTER TABLE "employees" ALTER COLUMN "em_code" type TEXT COLLATE numeric;
Now just query as you would otherwise.
SELECT * FROM employees ORDER BY em_code
On my data, I get results in this order (note that it also sorts foreign numerals):
Value
0
0001
001
1
06
6
13
۱۳
14
One approach you can take is to create a naturalsort function for this. Here's an example, written by Postgres legend RhodiumToad.
create or replace function naturalsort(text)
returns bytea language sql immutable strict as $f$
select string_agg(convert_to(coalesce(r[2], length(length(r[1])::text) || length(r[1])::text || r[1]), 'SQL_ASCII'),'\x00')
from regexp_matches($1, '0*([0-9]+)|([^0-9]+)', 'g') r;
$f$;
Source: http://www.rhodiumtoad.org.uk/junk/naturalsort.sql
To use it simply call the function in your order by:
SELECT * FROM employees ORDER BY naturalsort(em_code) DESC
The reason is that the string sorts alphabetically (instead of numerically like you would want it) and 1 sorts before 9.
You could solve it like this:
SELECT * FROM employees
ORDER BY substring(em_code, 3)::int DESC;
It would be more efficient to drop the redundant 'EM' from your em_code - if you can - and save an integer number to begin with.
Answer to question in comment
To strip any and all non-digits from a string:
SELECT regexp_replace(em_code, E'\\D','','g')
FROM employees;
\D is the regular expression class-shorthand for "non-digits".
'g' as 4th parameter is the "globally" switch to apply the replacement to every occurrence in the string, not just the first.
After replacing every non-digit with the empty string, only digits remain.
This always comes up in questions and in my own development and I finally tired of tricky ways of doing this. I finally broke down and implemented it as a PostgreSQL extension:
https://github.com/Bjond/pg_natural_sort_order
It's free to use, MIT license.
Basically it just normalizes the numerics (zero pre-pending numerics) within strings such that you can create an index column for full-speed sorting au naturel. The readme explains.
The advantage is you can have a trigger do the work and not your application code. It will be calculated at machine-speed on the PostgreSQL server and migrations adding columns become simple and fast.
you can use just this line
"ORDER BY length(substring(em_code FROM '[0-9]+')), em_code"
I wrote about this in detail in this related question:
Humanized or natural number sorting of mixed word-and-number strings
(I'm posting this answer as a useful cross-reference only, so it's community wiki).
I came up with something slightly different.
The basic idea is to create an array of tuples (integer, string) and then order by these. The magic number 2147483647 is int32_max, used so that strings are sorted after numbers.
ORDER BY ARRAY(
SELECT ROW(
CAST(COALESCE(NULLIF(match[1], ''), '2147483647') AS INTEGER),
match[2]
)
FROM REGEXP_MATCHES(col_to_sort_by, '(\d*)|(\D*)', 'g')
AS match
)
I thought about another way of doing this that uses less db storage than padding and saves time than calculating on the fly.
https://stackoverflow.com/a/47522040/935122
I've also put it on GitHub
https://github.com/ccsalway/dbNaturalSort
The following solution is a combination of various ideas presented in another question, as well as some ideas from the classic solution:
create function natsort(s text) returns text immutable language sql as $$
select string_agg(r[1] || E'\x01' || lpad(r[2], 20, '0'), '')
from regexp_matches(s, '(\D*)(\d*)', 'g') r;
$$;
The design goals of this function were simplicity and pure string operations (no custom types and no arrays), so it can easily be used as a drop-in solution, and is trivial to be indexed over.
Note: If you expect numbers with more than 20 digits, you'll have to replace the hard-coded maximum length 20 in the function with a suitable larger length. Note that this will directly affect the length of the resulting strings, so don't make that value larger than needed.
Let's try this simple example to represent the problem I'm facing.
Assume this table:
CREATE TABLE testing1
(
id serial NOT NULL,
word text,
CONSTRAINT testing1_pkey PRIMARY KEY (id)
);
and that data:
insert into testing1 (word) values ('Heliod, God');
insert into testing1 (word) values ('Heliod''s Inter');
insert into testing1 (word) values ('Heliod''s Pilg');
insert into testing1 (word) values ('Heliod, Sun');
Then I want to run this query to get the results ordered by the word column:
SELECT
id, word
FROM testing1
WHERE UPPER(word::text) LIKE UPPER('heliod%')
ORDER BY word asc;
But look at the output, it's not ordered. I would expect the rows to be in that order, using their ids: 2, 3, 1, 4 (or, if I use the word's values: Heliod's Inter, Heliod's Pilg, Heliod, God, Heliod, Sun). This is what I get:
I thought that maybe something could confuse postgresql because of the WHERE criteria I used, but the below happens if I just order by on the rows:
Am I missing something here? I couldn't find anything in the docs about ordering values that contain quotes (I suspect that the quotes cause that behaviour because of their special meaning in postgresql, but I may be wrong).
I am using UTF-8 encoding for my database (not sure if it matters though) and this issue is happening on Postgresql version 12.7.
The output of
show lc_ctype;
is
"en_GB.UTF-8"
and the output of
show lc_collate;
is
"en_GB.UTF-8"
That is the correct way to order the rows in en_US.UTF-8. It does 'weird' (to someone used to ASCII) things with punctuation and whitespace, skipping on a first pass and considering it only for otherwise tied values.
If you don't want those rules, maybe use the C collation instead.
Indeed, I've tried #jjanes's suggestion to use the C collation and the output is the one I would expect:
SELECT
id, word
FROM testing1
ORDER BY word collate "C" ;
How weird, I have been using postgresql for some years now and I never noticed that behaviour.
Relevant section from the docs:
23.2.2.1. Standard Collations
On all platforms, the collations named default, C, and POSIX are available. > Additional collations may be available depending on operating system
support. The default collation selects the LC_COLLATE and LC_CTYPE values
specified at database creation time. The C and POSIX collations both
specify “traditional C” behavior, in which only the ASCII letters “A”
through “Z” are treated as letters, and sorting is done strictly by
character code byte values.
This example, where intuition expects "foo bar", shows the strange behaviour:
SELECT 'foo'|| '#'::char ||'bar' -- char ok, foo#bar
SELECT 'foo'|| '#' ||'bar' -- literal ok
SELECT 'foo'|| '#'::text ||'bar' -- text ok
SELECT 'foo'|| ' '::char ||'bar' -- STRANGE! LOSTING SPACE!
SELECT ('foo'|| ' '::char ||'bar')='foobar' -- yes, it is true... strange
SELECT 'foo'|| ' '::text ||'bar' -- text OK
SELECT 'foo'|| (' '::char)::text ||'bar' -- char-to-text lost!
SELECT 'foo'|| ' ' ||'bar' -- literal OK
Why does PostgreSQL do that? It is not intuitive, and seems error-prone behaviour.
PS: where does the PostgreSQL guide say (it needs a red alert) something about this?
This is one of many reasons it's often recommended to stick with varchar() or, in Postgres text types instead.
SQL standard mandates that CHAR() values are padded with space to fill the remaining bytes. For instance:
'A'::CHAR(5)
Will result in "A " being stored. Now if have another field of different length, but same content:
'A'::CHAR10 = 'A'::CHAR(5)
We would want this to say TRUE, right? So the spaces added for padding to fill the CHAR() have to be trimmed.
This is less of a Postgres question then it is a disk storage or sql standard question. Something has to be written to those bytes on the disk and space is the standard. Some DB's only trim for comparison or conversion, and others, like Postgres, trim for nearly any function.
Since you are casting a space to a CHAR(1) (1 being the default length when none is specified, although it doesn't matter for this quesstion) your space gets lost as padding. This is just one of the caveats of using CHAR(). It's a damned if you do, damned if you don't situation.
Instead cast that thing to a VARCHAR() or TEXT as they are nearly always superior types to CHAR().
Noted in the Postgresql documentation:
Tip: There is no performance difference among these three types, apart
from increased storage space when using the blank-padded type, and a
few extra CPU cycles to check the length when storing into a
length-constrained column. While character(n) has performance
advantages in some other database systems, there is no such advantage
in PostgreSQL; in fact character(n) is usually the slowest of the
three because of its additional storage costs and slower sorting. In
most situations text or character varying should be used instead.
The PostgreSQL types bytea and bit varying sound similar:
bytea stores binary strings.
bit varying stores strings of 1's and 0's.
The documentation does not mention a maximum size for either. Is it 1GB like character varying?
I have two separate use cases, both over a table with millions of rows:
Storing MD5 hashes
That would be a bytea with a length of 16 bytes or a bit(128). It would be used for:
Deduplication: Heavy use of GROUP BY, with an index I suppose.
Querying with WHERE md5 = for exact matches only.
Displaying as a hex string for human use.
Storing arbitrary binary data
Strings of binary data of varying length up to 4kB for:
Bitwise operations to find the strings matching a certain mask. Example at the end of this post.
Extracting some bytes, for instance get the integer value of the byte 14 in my string.
Some deduplication.
Working example for the bitwise operation, using bit varying. The mask is X'00FF00' and the it returns only the row X'AAAAAA'. I shortened the strings for the example but it would be over their full length, up to 4kB. Is it possible to do something similar with bytea?
CREATE TABLE test1 (mystring bit varying);
INSERT INTO test1 VALUES (X'AAAAAA'), (X'ABCABC');
SELECT * FROM test1 WHERE mystring & X'00FF00' = X'00AA00';
Which of bytea and bit varying is the more appropriate?
I saw the UUID type is made to store exactly 16 bytes, would that be any advantage to store the MD5's?
In general, if you're not using bitwise operations you should be using bytea.
I store larger values in bytea and then convert substrings to bit varying for bitwise operations where possible, mostly because clients understand bytea much more consistently than bit varying and the I/O format is more compact.
MD5 values should be stored as bytea. Bitwise operations on them make no sense, and you generally want to fetch them as binary.
I think bit varying really has two uses:
To store flags fields that are literally bit strings; and
As an interim data type for internal calculations
For pretty much everything else, use bytea.
There's nothing stopping you storing a 4k bitfield if that's what it is, though.
It appears the maximum length of bytea is 1 GB. [1]
For bitwise operation use bit varying (explanation see below)
For storing MD5 hash use bytea. It will take less storage than bit varying
The benefit using UUID is UUID algorithm somehow guarantees your uniqueness, not only in your table, but also in your database or even across your database (even if you generate UUID in your application). I think if you are using UUID without dashes it will be more efficient for storing, comparing and sorting in UUID (comparison between bytea and UUID see below).
For bitwise operation use bit varying
If you concern about storage:
bit varying takes more storage than bytea. If you are okay then you should try comparing the function they both offer:
bit varying
vs
bytea
So far I can see bit varying will be more suitable for you to do bitwise operation though bytea is generally accepted way to store arbitrary data.
PostgreSQL offers a single bytea operator: concatenation. You can append one byte value to another bytea value using the concatenation operator ||. [1]
Note that you cannot compare two bytea value, even for equality/inequality. You can, of course, convert bytea value into another value using the CAST(), and that opens up other operators. [1]
Comparison between UUID and bytea
create table u(uuid uuid primary key, payload character(300));
create table b( bytea bytea primary key, payload character(300));
INSERT INTO u
SELECT uuid_generate_v4()
FROM generate_series(1,1000*1000);
INSERT INTO b
SELECT random_bytea(16)
FROM generate_series(1,1000*1000);
VACUUM ANALYZE u;
VACUUM ANALYZE b;
## Your table size
SELECT pg_size_pretty(pg_total_relation_size('u'));
pg_size_pretty
----------------
81 MB
SELECT pg_size_pretty(pg_total_relation_size('b'));
pg_size_pretty
----------------
101 MB
## Speed comparison
\timing on
## Common select
select * from u limit 1000;
Time: 1.433 ms
select * from b limit 1000;
Time: 1.396 ms
## Random Select
SELECT * FROM u OFFSET random()*1000 LIMIT 10000;
Time: 42.453 ms
SELECT * FROM b OFFSET random()*1000 LIMIT 10000;
Time: 10.962 ms
Conclusion : I don't think there will be more benefit using UUID except its uniqueness and smaller size (will be faster to insert)
Note: No Index, there is only one connection
Some source :
PostgreSQL: "The Comprehensive Guide to Building, Programming, And Administratoring PostgreSQL Databases" Book
I'd like to make a random string for use in session verification using PostgreSQL. I know I can get a random number with SELECT random(), so I tried SELECT md5(random()), but that doesn't work. How can I do this?
You can fix your initial attempt like this:
SELECT md5(random()::text);
Much simpler than some of the other suggestions. :-)
I'd suggest this simple solution:
This is a quite simple function that returns a random string of the given length:
Create or replace function random_string(length integer) returns text as
$$
declare
chars text[] := '{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}';
result text := '';
i integer := 0;
begin
if length < 0 then
raise exception 'Given length cannot be less than 0';
end if;
for i in 1..length loop
result := result || chars[1+random()*(array_length(chars, 1)-1)];
end loop;
return result;
end;
$$ language plpgsql;
And the usage:
select random_string(15);
Example output:
select random_string(15) from generate_series(1,15);
random_string
-----------------
5emZKMYUB9C2vT6
3i4JfnKraWduR0J
R5xEfIZEllNynJR
tMAxfql0iMWMIxM
aPSYd7pDLcyibl2
3fPDd54P5llb84Z
VeywDb53oQfn9GZ
BJGaXtfaIkN4NV8
w1mvxzX33NTiBby
knI1Opt4QDonHCJ
P9KC5IBcLE0owBQ
vvEEwc4qfV4VJLg
ckpwwuG8YbMYQJi
rFf6TchXTO3XsLs
axdQvaLBitm6SDP
(15 rows)
You can get 128 bits of random from a UUID. This is the method to get the job done in modern PostgreSQL.
CREATE EXTENSION pgcrypto;
SELECT gen_random_uuid();
gen_random_uuid
--------------------------------------
202ed325-b8b1-477f-8494-02475973a28f
May be worth reading the docs on UUID too
The data type uuid stores Universally Unique Identifiers (UUID) as defined by RFC 4122, ISO/IEC 9834-8:2005, and related standards. (Some systems refer to this data type as a globally unique identifier, or GUID, instead.) This identifier is a 128-bit quantity that is generated by an algorithm chosen to make it very unlikely that the same identifier will be generated by anyone else in the known universe using the same algorithm. Therefore, for distributed systems, these identifiers provide a better uniqueness guarantee than sequence generators, which are only unique within a single database.
How rare is a collision with UUID, or guessable? Assuming they're random,
About 100 trillion version 4 UUIDs would need to be generated to have a 1 in a billion chance of a single duplicate ("collision"). The chance of one collision rises to 50% only after 261 UUIDs (2.3 x 10^18 or 2.3 quintillion) have been generated. Relating these numbers to databases, and considering the issue of whether the probability of a Version 4 UUID collision is negligible, consider a file containing 2.3 quintillion Version 4 UUIDs, with a 50% chance of containing one UUID collision. It would be 36 exabytes in size, assuming no other data or overhead, thousands of times larger than the largest databases currently in existence, which are on the order of petabytes. At the rate of 1 billion UUIDs generated per second, it would take 73 years to generate the UUIDs for the file. It would also require about 3.6 million 10-terabyte hard drives or tape cartridges to store it, assuming no backups or redundancy. Reading the file at a typical "disk-to-buffer" transfer rate of 1 gigabit per second would require over 3000 years for a single processor. Since the unrecoverable read error rate of drives is 1 bit per 1018 bits read, at best, while the file would contain about 1020 bits, just reading the file once from end to end would result, at least, in about 100 times more mis-read UUIDs than duplicates. Storage, network, power, and other hardware and software errors would undoubtedly be thousands of times more frequent than UUID duplication problems.
source: wikipedia
In summary,
UUID is standardized.
gen_random_uuid() is 128 bits of random stored in 128 bits (2**128 combinations). 0-waste.
random() only generates 52 bits of random in PostgreSQL (2**52 combinations).
md5() stored as UUID is 128 bits, but it can only be as random as its input (52 bits if using random())
md5() stored as text is 288 bits, but it only can only be as random as its input (52 bits if using random()) - over twice the size of a UUID and a fraction of the randomness)
md5() as a hash, can be so optimized that it doesn't effectively do much.
UUID is highly efficient for storage: PostgreSQL provides a type that is exactly 128 bits. Unlike text and varchar, etc which store as a varlena which has overhead for the length of the string.
PostgreSQL nifty UUID comes with some default operators, castings, and features.
Building on Marcin's solution, you could do this to use an arbitrary alphabet (in this case, all 62 ASCII alphanumeric characters):
SELECT array_to_string(array
(
select substr('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', trunc(random() * 62)::integer + 1, 1)
FROM generate_series(1, 12)), '');
Please use string_agg!
SELECT string_agg (substr('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', ceil (random() * 62)::integer, 1), '')
FROM generate_series(1, 45);
I'm using this with MD5 to generate a UUID also. I just want a random value with more bits than a random () integer.
I was playing with PostgreSQL recently, and I think I've found a little better solution, using only built-in PostgreSQL methods - no pl/pgsql. The only limitation is it currently generates only UPCASE strings, or numbers, or lower case strings.
template1=> SELECT array_to_string(ARRAY(SELECT chr((65 + round(random() * 25)) :: integer) FROM generate_series(1,12)), '');
array_to_string
-----------------
TFBEGODDVTDM
template1=> SELECT array_to_string(ARRAY(SELECT chr((48 + round(random() * 9)) :: integer) FROM generate_series(1,12)), '');
array_to_string
-----------------
868778103681
The second argument to the generate_series method dictates the length of the string.
While not active by default, you could activate one of the core extensions:
CREATE EXTENSION IF NOT EXISTS pgcrypto;
Then your statement becomes a simple call to gen_salt() which generates a random string:
select gen_salt('md5') from generate_series(1,4);
gen_salt
-----------
$1$M.QRlF4U
$1$cv7bNJDM
$1$av34779p
$1$ZQkrCXHD
The leading number is a hash identifier. Several algorithms are available each with their own identifier:
md5: $1$
bf: $2a$06$
des: no identifier
xdes: _J9..
More information on extensions:
pgCrypto: http://www.postgresql.org/docs/9.2/static/pgcrypto.html
Included Extensions: http://www.postgresql.org/docs/9.2/static/contrib.html
EDIT
As indicated by Evan Carrol, as of v9.4 you can use gen_random_uuid()
http://www.postgresql.org/docs/9.4/static/pgcrypto.html
#Kavius recommended using pgcrypto, but instead of gen_salt, what about gen_random_bytes? And how about sha512 instead of md5?
create extension if not exists pgcrypto;
select digest(gen_random_bytes(1024), 'sha512');
Docs:
F.25.5. Random-Data Functions
gen_random_bytes(count integer) returns bytea
Returns count cryptographically strong random bytes. At most 1024
bytes can be extracted at a time. This is to avoid draining the
randomness generator pool.
The INTEGER parameter defines the length of the string. Guaranteed to cover all 62 alphanum characters with equal probability (unlike some other solutions floating around on the Internet).
CREATE OR REPLACE FUNCTION random_string(INTEGER)
RETURNS TEXT AS
$BODY$
SELECT array_to_string(
ARRAY (
SELECT substring(
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
FROM (ceil(random()*62))::int FOR 1
)
FROM generate_series(1, $1)
),
''
)
$BODY$
LANGUAGE sql VOLATILE;
I do not think that you are looking for a random string per se. What you would need for session verification is a string that is guaranteed to be unique. Do you store session verification information for auditing? In that case you need the string to be unique between sessions. I know of two, rather simple approaches:
Use a sequence. Good for use on a single database.
Use an UUID. Universally unique, so good on distributed environments too.
UUIDs are guaranteed to be unique by virtue of their algorithm for generation; effectively it is extremely unlikely that you will generate two identical numbers on any machine, at any time, ever (note that this is much stronger than on random strings, which have a far smaller periodicity than UUIDs).
You need to load the uuid-ossp extension to use UUIDs. Once installed, call any of the available uuid_generate_vXXX() functions in your SELECT, INSERT or UPDATE calls. The uuid type is a 16-byte numeral, but it also has a string representation.
create extension if not exists pgcrypto;
then
SELECT encode(gen_random_bytes(20),'base64')
or even
SELECT encode(gen_random_bytes(20),'hex')
This is for 20 bytes = 160 bits of randomness (as long as sha1 for example).
select * from md5(to_char(random(), '0.9999999999999999'));
select encode(decode(md5(random()::text), 'hex')||decode(md5(random()::text), 'hex'), 'base64')