PostgreSQL: Difference between "bytea" and "bit varying" types - postgresql

The PostgreSQL types bytea and bit varying sound similar:
bytea stores binary strings.
bit varying stores strings of 1's and 0's.
The documentation does not mention a maximum size for either. Is it 1GB like character varying?
I have two separate use cases, both over a table with millions of rows:
Storing MD5 hashes
That would be a bytea with a length of 16 bytes or a bit(128). It would be used for:
Deduplication: Heavy use of GROUP BY, with an index I suppose.
Querying with WHERE md5 = for exact matches only.
Displaying as a hex string for human use.
Storing arbitrary binary data
Strings of binary data of varying length up to 4kB for:
Bitwise operations to find the strings matching a certain mask. Example at the end of this post.
Extracting some bytes, for instance get the integer value of the byte 14 in my string.
Some deduplication.
Working example for the bitwise operation, using bit varying. The mask is X'00FF00' and it returns only the row X'AAAAAA'. I shortened the strings for the example, but the operation would run over their full length, up to 4kB. Is it possible to do something similar with bytea?
CREATE TABLE test1 (mystring bit varying);
INSERT INTO test1 VALUES (X'AAAAAA'), (X'ABCABC');
SELECT * FROM test1 WHERE mystring & X'00FF00' = X'00AA00';
Which of bytea and bit varying is the more appropriate?
I saw the UUID type is made to store exactly 16 bytes, would that be any advantage to store the MD5's?

In general, if you're not using bitwise operations you should be using bytea.
I store larger values in bytea and then convert substrings to bit varying for bitwise operations where possible, mostly because clients understand bytea much more consistently than bit varying and the I/O format is more compact.
MD5 values should be stored as bytea. Bitwise operations on them make no sense, and you generally want to fetch them as binary.
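As a rough client-side illustration of why the binary form is the compact one, here is a minimal Python sketch. The payload is a made-up value; the point is only that the raw digest is 16 bytes (what a bytea column would hold), while the hex string shown to humans is 32 characters.

```python
import hashlib
import uuid

payload = b"example payload"  # hypothetical input, not from the question

digest = hashlib.md5(payload).digest()        # 16 raw bytes
hex_text = hashlib.md5(payload).hexdigest()   # 32 hex characters for display

# The same 16 bytes also fit the 128-bit layout of a uuid value.
as_uuid = uuid.UUID(bytes=digest)

print(len(digest), len(hex_text), as_uuid.bytes == digest)
```

Converting to hex only at display time keeps the stored and indexed value at half the size of the text form.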
I think bit varying really has two uses:
To store flags fields that are literally bit strings; and
As an interim data type for internal calculations
For pretty much everything else, use bytea.
There's nothing stopping you storing a 4k bitfield if that's what it is, though.
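If the mask test ends up being done on the client instead, the same check is easy to express on raw bytes. The following is only a sketch in Python using the two values and the mask from the question; the helper name matches_mask is made up for illustration.

```python
def matches_mask(value: bytes, mask: bytes, expected: bytes) -> bool:
    """Return True when (value AND mask) equals expected, byte by byte."""
    if not (len(value) == len(mask) == len(expected)):
        raise ValueError("all three byte strings must have the same length")
    return all((v & m) == e for v, m, e in zip(value, mask, expected))


rows = [bytes.fromhex("AAAAAA"), bytes.fromhex("ABCABC")]
mask = bytes.fromhex("00FF00")
expected = bytes.fromhex("00AA00")

hits = [row.hex().upper() for row in rows if matches_mask(row, mask, expected)]
print(hits)  # ['AAAAAA']

# Integer value of a single byte (zero-based index 1 here).
second_byte = rows[0][1]
print(second_byte)  # 170
```

Indexing a bytes object yields the integer value of that byte directly, which covers the "get the integer value of byte 14" use case as well.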

It appears the maximum length of bytea is 1 GB. [1]
For bitwise operation use bit varying (explanation see below)
For storing MD5 hash use bytea. It will take less storage than bit varying
The benefit of UUID is that its generation algorithm makes duplicates extremely unlikely, not only within your table but also within your database or even across databases (even if you generate the UUIDs in your application). I think that if you store the value without dashes in a UUID column, it will be more efficient to store, compare and sort (see the comparison between bytea and UUID below).
For bitwise operation use bit varying
If you are concerned about storage:
bit varying takes more storage than bytea. If that is acceptable, compare the functions each type offers:
bit varying
vs
bytea
So far, it looks to me as if bit varying is more suitable for your bitwise operations, though bytea is the generally accepted way to store arbitrary data.
According to [1], PostgreSQL offers a single bytea operator: concatenation. You can append one bytea value to another using the concatenation operator ||. [1]
The same source says that you cannot compare two bytea values, even for equality/inequality. That is not the case in current PostgreSQL versions, where bytea supports the usual comparison operators (=, <>, <, and so on). You can also convert a bytea value into another type using CAST(), and that opens up other operators. [1]
Comparison between UUID and bytea
create table u(uuid uuid primary key, payload character(300));
create table b( bytea bytea primary key, payload character(300));
INSERT INTO u
SELECT uuid_generate_v4()
FROM generate_series(1,1000*1000);
INSERT INTO b
SELECT random_bytea(16)
FROM generate_series(1,1000*1000);
VACUUM ANALYZE u;
VACUUM ANALYZE b;
## Your table size
SELECT pg_size_pretty(pg_total_relation_size('u'));
pg_size_pretty
----------------
81 MB
SELECT pg_size_pretty(pg_total_relation_size('b'));
pg_size_pretty
----------------
101 MB
## Speed comparison
\timing on
## Common select
select * from u limit 1000;
Time: 1.433 ms
select * from b limit 1000;
Time: 1.396 ms
## Random Select
SELECT * FROM u OFFSET random()*1000 LIMIT 10000;
Time: 42.453 ms
SELECT * FROM b OFFSET random()*1000 LIMIT 10000;
Time: 10.962 ms
Conclusion: I don't think UUID brings any benefit beyond its uniqueness and its smaller size (which should make inserts faster).
Note: No Index, there is only one connection
Some source :
PostgreSQL: "The Comprehensive Guide to Building, Programming, And Administratoring PostgreSQL Databases" Book

Related

Postgres Varchar(n) or text which is better [duplicate]

What's the difference between the text data type and the character varying (varchar) data types?
According to the documentation
If character varying is used without length specifier, the type accepts strings of any size. The latter is a PostgreSQL extension.
and
In addition, PostgreSQL provides the text type, which stores strings of any length. Although the type text is not in the SQL standard, several other SQL database management systems have it as well.
So what's the difference?
There is no difference, under the hood it's all varlena (variable length array).
Check this article from Depesz: http://www.depesz.com/index.php/2010/03/02/charx-vs-varcharx-vs-varchar-vs-text/
A couple of highlights:
To sum it all up:
char(n) – takes too much space when dealing with values shorter than n (pads them to n), and can lead to subtle errors because of adding trailing spaces, plus it is problematic to change the limit
varchar(n) – it's problematic to change the limit in live environment (requires exclusive lock while altering table)
varchar – just like text
text – for me a winner – over (n) data types because it lacks their problems, and over varchar – because it has distinct name
The article does detailed testing to show that the performance of inserts and selects for all 4 data types is similar. It also takes a detailed look at alternate ways of constraining the length when needed. Function-based constraints or domains provide the advantage of an instant increase of the length constraint, and on the basis that decreasing a string length constraint is rare, depesz concludes that one of them is usually the best choice for a length limit.
As "Character Types" in the documentation points out, varchar(n), char(n), and text are all stored the same way. The only difference is extra cycles are needed to check the length, if one is given, and the extra space and time required if padding is needed for char(n).
However, when you only need to store a single character, there is a slight performance advantage to using the special type "char" (keep the double-quotes — they're part of the type name). You get faster access to the field, and there is no overhead to store the length.
I just made a table of 1,000,000 random "char" chosen from the lower-case alphabet. A query to get a frequency distribution (select count(*), field ... group by field) takes about 650 milliseconds, vs about 760 on the same data using a text field.
(this answer is a Wiki, you can edit - please correct and improve!)
UPDATING BENCHMARKS FOR 2016 (pg9.5+)
And using "pure SQL" benchmarks (without any external script)
1. use any string_generator with UTF8
2. main benchmarks:
2.1. INSERT
2.2. SELECT comparing and counting
CREATE FUNCTION string_generator(int DEFAULT 20,int DEFAULT 10) RETURNS text AS $f$
SELECT array_to_string( array_agg(
substring(md5(random()::text),1,$1)||chr( 9824 + (random()*10)::int )
), ' ' ) as s
FROM generate_series(1, $2) i(x);
$f$ LANGUAGE SQL IMMUTABLE;
Prepare specific test (examples)
DROP TABLE IF EXISTS test;
-- CREATE TABLE test ( f varchar(500));
-- CREATE TABLE test ( f text);
CREATE TABLE test ( f text CHECK(char_length(f)<=500) );
Perform a basic test:
INSERT INTO test
SELECT string_generator(20+(random()*(i%11))::int)
FROM generate_series(1, 99000) t(i);
And other tests,
CREATE INDEX q on test (f);
SELECT count(*) FROM (
SELECT substring(f,1,1) || f FROM test WHERE f<'a0' ORDER BY 1 LIMIT 80000
) t;
... And use EXPLAIN ANALYZE.
UPDATED AGAIN 2018 (pg10)
little edit to add 2018's results and reinforce recommendations.
Results in 2016 and 2018
My results, averaged over many machines and many tests: all the same (the differences are statistically smaller than the standard deviation).
Recommendation
Use the text datatype and avoid the old varchar(x), because sometimes it is not a standard, e.g. in CREATE FUNCTION clauses varchar(x)≠varchar(y).
Express limits (with the same performance as varchar!) with a CHECK clause in CREATE TABLE, e.g. CHECK(char_length(x)<=10). With a negligible loss of performance in INSERT/UPDATE you can also control ranges and string structure, e.g. CHECK(char_length(x)>5 AND char_length(x)<=20 AND x LIKE 'Hello%')
On PostgreSQL manual
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
I usually use text
References: http://www.postgresql.org/docs/current/static/datatype-character.html
In my opinion, varchar(n) has its own advantages. Yes, they all use the same underlying type and all that. But it should be pointed out that index entries in PostgreSQL have a size limit of 2712 bytes per row.
TL;DR:
If you use the text type without a constraint and have indexes on these columns, it is quite possible that you will hit this limit for some of your columns and get an error when you try to insert data. Using varchar(n), you can prevent that.
Some more details: The problem is that PostgreSQL does not raise any error when you create an index on a text column, or on a varchar(n) column where n is greater than 2712. It does raise an error, however, when you try to insert a record whose compressed size is greater than 2712 bytes. This means that you can easily insert a string of 100,000 repeated characters, because it compresses to far below 2712 bytes, yet you may not be able to insert some 4000-character string whose compressed size exceeds 2712 bytes. Using varchar(n) with n not much greater than 2712, you are safe from these errors.
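The effect of compression on that limit can be illustrated outside the database. The sketch below uses Python's zlib purely as a stand-in; PostgreSQL uses its own compressor, so the exact sizes will differ and only the contrast between the two inputs is the point. The 2712 figure is the limit quoted above.

```python
import os
import zlib

LIMIT = 2712  # index row size limit quoted above, in bytes

repetitive = b"a" * 100_000     # long but highly compressible
random_like = os.urandom(4000)  # much shorter but essentially incompressible

for name, data in (("repetitive", repetitive), ("random", random_like)):
    compressed = len(zlib.compress(data))
    verdict = "fits" if compressed <= LIMIT else "too large"
    print(name, len(data), compressed, verdict)
```

The 100,000-byte repetitive value shrinks to well under the limit, while the 4000-byte random value does not shrink at all.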
text and varchar have different implicit type conversions. The biggest impact that I've noticed is handling of trailing spaces. For example ...
select ' '::char = ' '::varchar, ' '::char = ' '::text, ' '::varchar = ' '::text
returns true, false, true and not true, true, true as you might expect.
Somewhat OT: If you're using Rails, the standard formatting of webpages may be different. For data entry forms text boxes are scrollable, but character varying (Rails string) boxes are one-line. Show views are as long as needed.
A good explanation from http://www.sqlines.com/postgresql/datatypes/text:
The only difference between TEXT and VARCHAR(n) is that you can limit
the maximum length of a VARCHAR column, for example, VARCHAR(255) does
not allow inserting a string more than 255 characters long.
Both TEXT and VARCHAR have the upper limit at 1 Gb, and there is no
performance difference among them (according to the PostgreSQL
documentation).
The difference is between tradition and modern.
Traditionally you were required to specify the width of each table column. If you specify too much width, expensive storage space is wasted, but if you specify too little width, some data will not fit. Then you would resize the column, and had to change a lot of connected software, fix introduced bugs, which is all very cumbersome.
Modern systems allow for unlimited string storage with dynamic storage allocation, so the incidental large string would be stored just fine without much waste of storage of small data items.
While a lot of programming languages have adopted a data type of 'string' with unlimited size, like C#, javascript, java, etc, a database like Oracle did not.
Now that PostgreSQL supports 'text', a lot of programmers are still used to VARCHAR(N), and reason like: yes, text is the same as VARCHAR, except that with VARCHAR you MAY add a limit N, so VARCHAR is more flexible.
You might as well reason, why should we bother using VARCHAR without N, now that we can simplify our life with TEXT?
In my recent years with Oracle, I have used CHAR(N) or VARCHAR(N) on very few occasions. Because Oracle does (did?) not have an unlimited string type, I used for most string columns VARCHAR(2000), where 2000 was at some time the maximum for VARCHAR, and in all practical purposes not much different from 'infinite'.
Now that I am working with PostgreSQL, I see TEXT as real progress. No more emphasis on the VAR feature of the CHAR type. No more emphasis on let's use VARCHAR without N. Besides, typing TEXT saves 3 keystrokes compared to VARCHAR.
Younger colleagues would now grow up without even knowing that in the old days there were no unlimited strings. Just like that in most projects they don't have to know about assembly programming.
I wasted way too much time because of using varchar instead of text for PostgreSQL arrays.
PostgreSQL Array operators do not work with string columns. Refer these links for more details: (https://github.com/rails/rails/issues/13127) and (http://adamsanderson.github.io/railsconf_2013/?full#10).
If you only use TEXT type you can run into issues when using AWS Database Migration Service:
Large objects (LOBs) are used but target LOB columns are not nullable
Due to their unknown and sometimes large size, large objects (LOBs) require more processing
and resources than standard objects. To help with tuning migrations of systems that contain
LOBs, AWS DMS offers the following options
If you are sticking to PostgreSQL for everything, you're probably fine. But if you are going to interact with your db via ODBC or external tools like DMS, you should consider not using TEXT for everything.
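character varying(n), varchar(n) - (Both the same). Storing a longer value raises an error unless the excess characters are all spaces; an explicit cast to varchar(n) truncates the value to n characters without raising an error.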
character(n), char(n) - (Both the same). fixed-length and will pad with blanks till the end of the length.
text - Unlimited length.
Example:
Table test:
a character(7)
b varchar(7)
insert "ok " to a
insert "ok " to b
We get the results:
a | (a)char_length | b | (b)char_length
----------+----------------+-------+----------------
"ok "| 7 | "ok" | 2

What is the difference between numeric(19,0) and bigint in POSTGRES?

I am trying to map the java type to the SQL type and I encountered such a scenario.
If I elaborate, I was using the auto-ddl API to apply scripts to my database while starting my Spring container. Now I am trying to generate the scripts with Liquibase by generating the db-changelog for a PostgreSQL server.
Are numeric(19,0) and BIGINT the same in PostgreSQL? Please shed some light on this.
The main difference is the storage:
bigint (and smallint and integer) are stored as integer values in the processor's native format (usually two's complement binary numbers).
The range is limited (but high), the storage space occupied is 8 bytes, and arithmetic is blazingly fast.
numeric is stored as binary coded decimal of variable length.
The range and the precision is almost unlimited (up to 131072 digits before the decimal point; up to 16383 digits after the decimal point), but arithmetic operations are comparatively slow.
BIGINT range is -9223372036854775808 to 9223372036854775807, so you can't store a number greater than 9223372036854775807, but NUMERIC (19, 0) can do.
Please find the following example:
CREATE TABLE TestTable (NumVal NUMERIC (19,0), IntVal BIGINT);
INSERT INTO TestTable (NumVal, IntVal) VALUES
('9223372036854775808', 9223372036854775807);
SELECT * FROM TestTable;
Here you can't store 9223372036854775808 in a BIGINT column, but you can store that value, or a greater one, in NUMERIC (19, 0).
db<>fiddle for the same.
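The two ranges can be checked with plain integer arithmetic. This is a small Python sketch, with helper names invented for illustration; it only restates the limits given above (a signed 64-bit range versus nineteen decimal digits).

```python
BIGINT_MIN = -2**63
BIGINT_MAX = 2**63 - 1
NUMERIC_19_0_MAX = 10**19 - 1  # nineteen 9s


def fits_bigint(n: int) -> bool:
    return BIGINT_MIN <= n <= BIGINT_MAX


def fits_numeric_19_0(n: int) -> bool:
    return abs(n) <= NUMERIC_19_0_MAX


value = 9223372036854775808  # one more than the largest bigint
print(BIGINT_MAX, fits_bigint(value), fits_numeric_19_0(value))
```

So every bigint fits in numeric(19,0), but not the other way round.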
Numeric has variable storage size, while bigint is always 8 bytes.
SELECT pg_column_size(123456789112345678911111555678::numeric(30,0)) AS numeric30,
pg_column_size(1234567891123456789::numeric(19,0)) AS numeric19,
pg_column_size(123::numeric(19,0)) AS numeric3,
pg_column_size(1234567891123456789::bigint) AS bigint;
numeric30|numeric19|numeric3|bigint
---------|---------|--------|------
22| 16| 8| 8
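One way to read these sizes, sketched in Python: assume numeric keeps groups of four decimal digits in 2 bytes each, behind a small header. The header constants below are assumptions picked because they reproduce the output above; they are not a statement of the exact storage format, and the estimate ignores fractional digits and stripped trailing zero groups.

```python
import math

VARLENA_HEADER = 4   # assumed length header on the value
NUMERIC_HEADER = 2   # assumed numeric-specific header
BYTES_PER_GROUP = 2  # one 16-bit integer per group of four decimal digits


def estimated_numeric_size(decimal_digits: int) -> int:
    """Estimate the size in bytes of an integer-valued numeric."""
    groups = math.ceil(decimal_digits / 4)
    return VARLENA_HEADER + NUMERIC_HEADER + BYTES_PER_GROUP * groups


for digits in (30, 19, 3):
    print(digits, estimated_numeric_size(digits))
```

Under these assumptions a numeric only matches the 8 bytes of a bigint for values of up to four digits and grows by 2 bytes for every further four digits.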
Additionally, from the documentation (emphasis mine):
Calculations with numeric values yield exact results where possible,
e.g. addition, subtraction, multiplication. However, calculations on
numeric values are very slow compared to the integer types, or to the
floating-point types described in the next section.

Why unsigned integer is not available in PostgreSQL?

I came across this post (What is the difference between tinyint, smallint, mediumint, bigint and int in MySQL?) and realized that PostgreSQL does not support unsigned integer.
Can anyone help to explain why is it so?
Most of the time, I use unsigned integer as auto incremented primary key in MySQL. In such design, how can I overcome this when I port my database from MySQL to PostgreSQL?
Thanks.
It's not in the SQL standard, so the general urge to implement it is lower.
Having too many different integer types makes the type resolution system more fragile, so there is some resistance to adding more types into the mix.
That said, there is no reason why it couldn't be done. It's just a lot of work.
It is already answered why postgresql lacks unsigned types. However I would suggest to use domains for unsigned types.
http://www.postgresql.org/docs/9.4/static/sql-createdomain.html
CREATE DOMAIN name [ AS ] data_type
[ COLLATE collation ]
[ DEFAULT expression ]
[ constraint [ ... ] ]
where constraint is:
[ CONSTRAINT constraint_name ]
{ NOT NULL | NULL | CHECK (expression) }
Domain is like a type but with an additional constraint.
For a concrete example you could use
CREATE DOMAIN uint2 AS int4
CHECK(VALUE >= 0 AND VALUE < 65536);
Here is what psql gives when I try to abuse the type.
DS1=# select (346346 :: uint2);
ERROR: value for domain uint2 violates check constraint "uint2_check"
You can use a CHECK constraint, e.g.:
CREATE TABLE products (
id integer,
name text,
price numeric CHECK (price > 0)
);
Also, PostgreSQL has serial, smallserial and bigserial types for auto-increment.
The talk about DOMAINS is interesting but not relevant to the only possible origin of that question. The desire for unsigned ints is to double the range of ints with the same number of bits, it's an efficiency argument, not the desire to exclude negative numbers, everybody knows how to add a check constraint.
When asked by someone about it, Tom Lane stated:
Basically, there is zero chance this will happen unless you can find
a way of fitting them into the numeric promotion hierarchy that doesn't
break a lot of existing applications. We have looked at this more than
once, if memory serves, and failed to come up with a workable design
that didn't seem to violate the POLA.
What is the "POLA"? It is the principle of least astonishment: a change should not surprise users by altering behaviour they already rely on.
You can implement unsigned ints as extension types without too much trouble. If you do it with C functions, there will be almost no performance penalty. You won't need to extend the parser to deal with literals, because PostgreSQL has an easy way to interpret strings as literals: just write '4294966272'::uint4. Casts shouldn't be a huge deal either. You don't even need to raise range exceptions; you can simply define '4294966272'::uint4::int to be -1024. Or you can throw an error.
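Reinterpreting an unsigned 32-bit value as signed is plain two's-complement arithmetic: 4294966272, which is 2^32 - 1024, becomes -1024. A minimal Python sketch of both directions follows; the function names are made up for illustration and are not part of any PostgreSQL API.

```python
def uint32_to_int32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed one."""
    if not 0 <= value < 2**32:
        raise ValueError("value does not fit in 32 unsigned bits")
    return value - 2**32 if value >= 2**31 else value


def int32_to_uint32(value: int) -> int:
    """Reinterpret a signed 32-bit value as an unsigned one."""
    if not -2**31 <= value < 2**31:
        raise ValueError("value does not fit in 32 signed bits")
    return value % 2**32


print(uint32_to_int32(4294966272))  # -1024
print(int32_to_uint32(-1024))       # 4294966272
```

The mapping is lossless in both directions, which is why storing and retrieving through a signed column works even though ordering and arithmetic on the stored values do not.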
If I wanted this, I would have done it. But since I'm using Java on the other side of SQL, to me it is of little value since Java doesn't have those unsigned integers either. So I gain nothing. I'm already annoyed if I get a BigInteger from a bigint column, when it should fit into long.
Another thing: if I did need to store unsigned 32-bit or 64-bit values, I could use PostgreSQL int4 or int8 respectively, just remembering that the natural ordering and arithmetic won't work reliably. Storing and retrieving is unaffected by that.
Here is how I can implement a simple unsigned int8:
First I will use
CREATE TYPE uint8 (
INPUT = uint8_in,
OUTPUT = uint8_out,
-- optional: RECEIVE = uint8_receive, SEND = uint8_send, ANALYZE = uint8_analyze
INTERNALLENGTH = 8,
PASSEDBYVALUE,
ALIGNMENT = double,
STORAGE = plain,
CATEGORY = 'N',
PREFERRED = false
);
the minimal 2 functions uint8_in and uint8_out I must first define.
CREATE FUNCTION uint8_in(cstring)
RETURNS uint8
AS 'uint8_funcs'
LANGUAGE C IMMUTABLE STRICT;
CREATE FUNCTION uint8_out(uint8)
RETURNS cstring
AS 'uint8_funcs'
LANGUAGE C IMMUTABLE STRICT;
need to implement this in C uint8_funcs.c. So I go use the complex example from here and make it simple:
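#include "postgres.h"
#include "fmgr.h"
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(uint8_in);
Datum uint8_in(PG_FUNCTION_ARGS) {
char *str = PG_GETARG_CSTRING(0);
unsigned long long result; /* matches the %llu conversion below */
/* parse an unsigned decimal literal; %llx would read the input as hex */
if (sscanf(str, "%llu", &result) != 1)
ereport(ERROR,
(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
errmsg("invalid input syntax for uint8: \"%s\"", str)));
PG_RETURN_DATUM(UInt64GetDatum((uint64) result));
}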
ah well, or you can just find it done already.
According to the latest documentation, signed integers are supported but no unsigned integer appears in the table of numeric types. The serial type is somewhat similar to an unsigned type, except that it starts from 1 rather than zero, but its upper limit is the same as the signed type's. So the system truly has no unsigned support. As pointed out by Peter, the door is open to implementing an unsigned version, but in my experience with C programming the code would need a lot of changes, which is simply too much work.
https://www.postgresql.org/docs/10/datatype-numeric.html
integer 4 bytes typical choice for integer -2147483648 to +2147483647
serial 4 bytes autoincrementing integer 1 to 2147483647
Postgres does have an unsigned integer type that is unbeknownst to many: OID.
The oid type is currently implemented as an unsigned four-byte integer. […]
The oid type itself has few operations beyond comparison. It can be
cast to integer, however, and then manipulated using the standard
integer operators. (Beware of possible signed-versus-unsigned confusion
if you do this.)
It is not a numeric type though, and trying to do any arithmetic (or even bitwise operations) with it is going to fail. Also, it's just 4 bytes (INTEGER), there is no corresponding 8 byte (BIGINT) unsigned type.
So it's not really a good idea to use this yourself, and I agree with all the other answers that in a Postgresql database design you should always use an INTEGER or BIGINT column for your serial primary key - having it start in the negative (MINVALUE) or allowing it to wrap around (CYCLE) if you want to exhaust the full domain.
However, it is quite useful for input/output conversion, like your migration from another DBMS. Inserting the value 2147483648 into an integer column will lead to an "ERROR: integer out of range", while using the expression 2147483648::OID works just fine.
Similarly, when selecting an integer column as text with mycolumn::TEXT, you will get negative values at some point, but with mycolumn::OID::TEXT you will always get a natural number.
See an example at dbfiddle.uk.

postgres hstore with composite value type, or a better way of attributing an inverted index

can't seem to figure out the syntax for populating a hstore with a value of composite type -- note: I do not want to convert a record to a hstore.
select hstore('hello => ROW(1,2)');
I know it's something simple; However, google is not my friend today.
use case : custom inverted index.
The data is modelling an inverted index of lexemes, the composite data types are various probabilities related to the lexemes which I will use to implement document clustering. Does anyone know a better way of doing this ? I'm open to using an external system if it allows attaching attributes to key->posting pairs in the inverted index.
I'd use something external if it had solid support for what I am trying to do, I suspect that sticking 3-10k lexemes per tuple and then doing batch processing on them is gonna be nasty as the whole hstore will have to be parsed and converted.
At the moment my lexemes are in the range of 50-1k per tuple, mostly in the low hundreds, and i'm just doing it for my research algorithms. But there has to be a better way of doing this.
Strings in hstore are double-quoted. hstore only supports text values, nothing else, so you must store other types as their text representations:
SELECT hstore('k => "(1,2)"');
eg:
regress=> SELECT (hstore('k => "(1,2)"')) -> 'k';
?column?
----------
(1,2)
(1 row)
This means you have to cast the values to use them, eg:
regress=> CREATE TYPE pair AS (a integer, b integer);
CREATE TYPE
regress=> SELECT ((hstore('k => "(1,2)"')) -> 'k')::pair;
pair
-------
(1,2)
(1 row)
or using arrays instead to avoid the composite type:
regress=> SELECT ((hstore('k => "{1,2}"')) -> 'k')::integer[];
int4
-------
{1,2}
(1 row)
Arrays are indexed from 1 with the [] operator, eg [1].
This is generally horrendously inefficient because the values must be parsed and converted every single time. It's not really practical to suggest alternatives when you haven't said much about the nature of your problem and why you want hstore in the first place.

How do you create a random string that's suitable for a session ID in PostgreSQL?

I'd like to make a random string for use in session verification using PostgreSQL. I know I can get a random number with SELECT random(), so I tried SELECT md5(random()), but that doesn't work. How can I do this?
You can fix your initial attempt like this:
SELECT md5(random()::text);
Much simpler than some of the other suggestions. :-)
I'd suggest this simple solution:
This is a quite simple function that returns a random string of the given length:
Create or replace function random_string(length integer) returns text as
$$
declare
chars text[] := '{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}';
result text := '';
i integer := 0;
begin
if length < 0 then
raise exception 'Given length cannot be less than 0';
end if;
for i in 1..length loop
result := result || chars[1+random()*(array_length(chars, 1)-1)];
end loop;
return result;
end;
$$ language plpgsql;
And the usage:
select random_string(15);
Example output:
select random_string(15) from generate_series(1,15);
random_string
-----------------
5emZKMYUB9C2vT6
3i4JfnKraWduR0J
R5xEfIZEllNynJR
tMAxfql0iMWMIxM
aPSYd7pDLcyibl2
3fPDd54P5llb84Z
VeywDb53oQfn9GZ
BJGaXtfaIkN4NV8
w1mvxzX33NTiBby
knI1Opt4QDonHCJ
P9KC5IBcLE0owBQ
vvEEwc4qfV4VJLg
ckpwwuG8YbMYQJi
rFf6TchXTO3XsLs
axdQvaLBitm6SDP
(15 rows)
You can get 122 bits of randomness from a version 4 UUID (the remaining 6 of its 128 bits are fixed). This is the method to get the job done in modern PostgreSQL.
CREATE EXTENSION pgcrypto;
SELECT gen_random_uuid();
gen_random_uuid
--------------------------------------
202ed325-b8b1-477f-8494-02475973a28f
May be worth reading the docs on UUID too
The data type uuid stores Universally Unique Identifiers (UUID) as defined by RFC 4122, ISO/IEC 9834-8:2005, and related standards. (Some systems refer to this data type as a globally unique identifier, or GUID, instead.) This identifier is a 128-bit quantity that is generated by an algorithm chosen to make it very unlikely that the same identifier will be generated by anyone else in the known universe using the same algorithm. Therefore, for distributed systems, these identifiers provide a better uniqueness guarantee than sequence generators, which are only unique within a single database.
How rare is a collision with UUID, or guessable? Assuming they're random,
About 100 trillion version 4 UUIDs would need to be generated to have a 1 in a billion chance of a single duplicate ("collision"). The chance of one collision rises to 50% only after 2^61 UUIDs (2.3 x 10^18 or 2.3 quintillion) have been generated. Relating these numbers to databases, and considering the issue of whether the probability of a Version 4 UUID collision is negligible, consider a file containing 2.3 quintillion Version 4 UUIDs, with a 50% chance of containing one UUID collision. It would be 36 exabytes in size, assuming no other data or overhead, thousands of times larger than the largest databases currently in existence, which are on the order of petabytes. At the rate of 1 billion UUIDs generated per second, it would take 73 years to generate the UUIDs for the file. It would also require about 3.6 million 10-terabyte hard drives or tape cartridges to store it, assuming no backups or redundancy. Reading the file at a typical "disk-to-buffer" transfer rate of 1 gigabit per second would require over 3000 years for a single processor. Since the unrecoverable read error rate of drives is 1 bit per 10^18 bits read, at best, while the file would contain about 10^20 bits, just reading the file once from end to end would result, at least, in about 100 times more mis-read UUIDs than duplicates. Storage, network, power, and other hardware and software errors would undoubtedly be thousands of times more frequent than UUID duplication problems.
source: wikipedia
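Figures of this kind come from the birthday approximation n ≈ sqrt(2 · N · ln(1/(1 − p))), where N is the number of possible values and p the desired collision probability. Below is a small Python sketch, assuming 122 random bits per version 4 UUID; the function name is made up for illustration.

```python
import math

RANDOM_BITS = 122  # assumed: a version 4 UUID fixes 6 of its 128 bits


def uuids_for_collision_probability(p: float, bits: int = RANDOM_BITS) -> float:
    """Approximate number of random values needed to reach collision probability p."""
    space = 2.0 ** bits
    return math.sqrt(2.0 * space * -math.log1p(-p))


one_in_a_billion = uuids_for_collision_probability(1e-9)
fifty_percent = uuids_for_collision_probability(0.5)
print(f"{one_in_a_billion:.3e}")
print(f"{fifty_percent:.3e}")
```

The first result is on the order of 10^14, in line with the "about 100 trillion" above, and the second is on the order of 10^18, the same magnitude as the quoted 50% figure.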
In summary,
UUID is standardized.
gen_random_uuid() yields 122 random bits stored in 128 bits (2**122 combinations; 6 bits are fixed by the version and variant fields), so almost nothing is wasted.
random() only generates 52 bits of random in PostgreSQL (2**52 combinations).
md5() stored as UUID is 128 bits, but it can only be as random as its input (52 bits if using random())
md5() stored as text is 288 bits, but it can only be as random as its input (52 bits if using random()): over twice the size of a UUID and a fraction of the randomness.
md5() as a hash, can be so optimized that it doesn't effectively do much.
UUID is highly efficient for storage: PostgreSQL provides a type that is exactly 128 bits. Unlike text and varchar, etc which store as a varlena which has overhead for the length of the string.
PostgreSQL nifty UUID comes with some default operators, castings, and features.
Building on Marcin's solution, you could do this to use an arbitrary alphabet (in this case, all 62 ASCII alphanumeric characters):
SELECT array_to_string(array
(
select substr('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', trunc(random() * 62)::integer + 1, 1)
FROM generate_series(1, 12)), '');
Please use string_agg!
SELECT string_agg (substr('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', ceil (random() * 62)::integer, 1), '')
FROM generate_series(1, 45);
I'm using this with MD5 to generate a UUID also. I just want a random value with more bits than a random () integer.
I was playing with PostgreSQL recently, and I think I've found a little better solution, using only built-in PostgreSQL methods - no pl/pgsql. The only limitation is it currently generates only UPCASE strings, or numbers, or lower case strings.
template1=> SELECT array_to_string(ARRAY(SELECT chr((65 + round(random() * 25)) :: integer) FROM generate_series(1,12)), '');
array_to_string
-----------------
TFBEGODDVTDM
template1=> SELECT array_to_string(ARRAY(SELECT chr((48 + round(random() * 9)) :: integer) FROM generate_series(1,12)), '');
array_to_string
-----------------
868778103681
The second argument to the generate_series method dictates the length of the string.
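One caveat with round(random() * 25) is that the two end values are only half as likely as the others, because each of them owns half a rounding interval. The Python sketch below computes the exact probabilities for a uniform input on [0, 1); the function name is made up for illustration.

```python
from fractions import Fraction


def round_bucket_probabilities(span: int) -> list:
    """Probability of each result k of round(x * span) for x uniform on [0, 1)."""
    probs = []
    for k in range(span + 1):
        # k is produced when x * span lands in [k - 0.5, k + 0.5), clipped to [0, span]
        low = max(Fraction(0), Fraction(2 * k - 1, 2))
        high = min(Fraction(span), Fraction(2 * k + 1, 2))
        probs.append((high - low) / span)
    return probs


probs = round_bucket_probabilities(25)
print(probs[0], probs[1], probs[25])  # 1/50 1/25 1/50
```

With the queries above this means 'A' and 'Z' (or '0' and '9') come up half as often as the other characters, which matters little for a display string but is worth knowing for anything security-related.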
While not active by default, you could activate one of the core extensions:
CREATE EXTENSION IF NOT EXISTS pgcrypto;
Then your statement becomes a simple call to gen_salt() which generates a random string:
select gen_salt('md5') from generate_series(1,4);
gen_salt
-----------
$1$M.QRlF4U
$1$cv7bNJDM
$1$av34779p
$1$ZQkrCXHD
The leading number is a hash identifier. Several algorithms are available each with their own identifier:
md5: $1$
bf: $2a$06$
des: no identifier
xdes: _J9..
More information on extensions:
pgCrypto: http://www.postgresql.org/docs/9.2/static/pgcrypto.html
Included Extensions: http://www.postgresql.org/docs/9.2/static/contrib.html
EDIT
As indicated by Evan Carrol, as of v9.4 you can use gen_random_uuid()
http://www.postgresql.org/docs/9.4/static/pgcrypto.html
@Kavius recommended using pgcrypto, but instead of gen_salt, what about gen_random_bytes? And how about sha512 instead of md5?
create extension if not exists pgcrypto;
select digest(gen_random_bytes(1024), 'sha512');
Docs:
F.25.5. Random-Data Functions
gen_random_bytes(count integer) returns bytea
Returns count cryptographically strong random bytes. At most 1024
bytes can be extracted at a time. This is to avoid draining the
randomness generator pool.
The INTEGER parameter defines the length of the string. Guaranteed to cover all 62 alphanum characters with equal probability (unlike some other solutions floating around on the Internet).
CREATE OR REPLACE FUNCTION random_string(INTEGER)
RETURNS TEXT AS
$BODY$
SELECT array_to_string(
ARRAY (
SELECT substring(
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
FROM (ceil(random()*62))::int FOR 1
)
FROM generate_series(1, $1)
),
''
)
$BODY$
LANGUAGE sql VOLATILE;
I do not think that you are looking for a random string per se. What you would need for session verification is a string that is guaranteed to be unique. Do you store session verification information for auditing? In that case you need the string to be unique between sessions. I know of two, rather simple approaches:
Use a sequence. Good for use on a single database.
Use an UUID. Universally unique, so good on distributed environments too.
UUIDs are guaranteed to be unique by virtue of their algorithm for generation; effectively it is extremely unlikely that you will generate two identical numbers on any machine, at any time, ever (note that this is much stronger than on random strings, which have a far smaller periodicity than UUIDs).
You need to load the uuid-ossp extension to use UUIDs. Once installed, call any of the available uuid_generate_vXXX() functions in your SELECT, INSERT or UPDATE calls. The uuid type is a 16-byte numeral, but it also has a string representation.
create extension if not exists pgcrypto;
then
SELECT encode(gen_random_bytes(20),'base64')
or even
SELECT encode(gen_random_bytes(20),'hex')
This is for 20 bytes = 160 bits of randomness (as long as sha1 for example).
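For comparison, if the token is generated in the application rather than in the database, the same idea can be sketched in Python with the standard secrets module. This is an alternative placement of the logic, not what the queries above do.

```python
import base64
import secrets

token = secrets.token_bytes(20)  # 20 bytes = 160 bits from the OS random source

as_hex = token.hex()                                 # 40 characters
as_base64 = base64.b64encode(token).decode("ascii")  # 28 characters, one '=' pad

print(len(as_hex), len(as_base64))
```

Base64 is the shorter of the two encodings, while hex avoids the '+', '/' and '=' characters that may need escaping in URLs or cookies.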
select * from md5(to_char(random(), '0.9999999999999999'));
select encode(decode(md5(random()::text), 'hex')||decode(md5(random()::text), 'hex'), 'base64')