What is the maximum length of a primary key column? I'm going to use varchar as the primary key. I have found no information on how long it can be, since PostgreSQL does not require a length limit for varchar when it is used as a primary key.
The maximum length for a value in a B-tree index, which includes primary keys, is about one third of the size of a buffer page. With the default 8192-byte page that is floor(8192/3) = 2730 bytes before page overhead is subtracted; the limit the server actually enforces is slightly lower (2712 bytes in the error message quoted further down).
The maximum varchar length is not a Postgres configuration setting. A single field value can't exceed 1GB in size.
http://wiki.postgresql.org/wiki/FAQ#What_is_the_maximum_size_for_a_row.2C_a_table.2C_and_a_database.3F
That having been said, it's probably not a good idea to have a large varchar column as a primary key. Consider using a serial or bigserial (http://www.postgresql.org/docs/current/interactive/datatype-numeric.html#DATATYPE-SERIAL)
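A minimal sketch of that suggestion, assuming two hypothetical tables, document and document_note (neither appears in the question). The string stays in the table under a UNIQUE constraint, and a bigserial column is the key that other tables reference. The UNIQUE constraint is still backed by a B-tree index, so the index size limit discussed above applies to it as well.

```sql
-- Hypothetical tables for illustration only.
CREATE TABLE document (
    document_id bigserial PRIMARY KEY,          -- surrogate key, 8 bytes
    code        varchar(100) NOT NULL UNIQUE    -- natural value, still unique
);

-- Referencing tables store the bigint, not the string.
CREATE TABLE document_note (
    note_id     bigserial PRIMARY KEY,
    document_id bigint NOT NULL REFERENCES document (document_id),
    note        text NOT NULL
);
```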
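You should run a test.
I ran tests on PostgreSQL 8.4 with a table that has a single varchar column as its primary key. I was able to store 235000 ASCII characters, 116000 Polish diacritical characters (e.g. 'ć') or 75000 Chinese characters (e.g. '汉'). For larger values I got this message:
ERROR: index row size 5404 exceeds btree maximum, 2712
The message also included this hint:
Values larger than 1/3 of a buffer page cannot be indexed.
So the limit applies to the size of the index entry, not to the character count. The values above were accepted only because they consisted of repeated characters (about 230 kB of raw data in each case), which PostgreSQL compresses before storing them in the index; less compressible values hit the limit much sooner. A value whose index entry is too large is rejected with the error above, so uniqueness is never checked against a truncated string.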
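If long values really must be unique, one common workaround is to index a fixed-size digest instead of the value itself. The sketch below assumes a hypothetical table long_text; md5() is a built-in function, and a digest collision, while very unlikely, would be reported as a duplicate.

```sql
-- Hypothetical table for illustration only.
CREATE TABLE long_text (
    id   bigserial PRIMARY KEY,
    body text NOT NULL
);

-- The index entry is always a 32-character digest, regardless of the
-- length of body, so it stays far below the B-tree size limit.
CREATE UNIQUE INDEX long_text_body_md5_idx ON long_text (md5(body));

-- A lookup has to use the same expression to benefit from the index;
-- the second condition guards against a digest collision.
SELECT id
FROM   long_text
WHERE  md5(body) = md5('some very long value')
AND    body = 'some very long value';
```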
Well, this is a very large amount of data that you can put in that column. However, as noted above, your design is poor if you have to use such long values as keys. You should use an artificial primary key.
Related
Could somebody tell me whether it is a good idea to use varchar as a PK? I mean, is it less efficient than int/uuid, or equal to it?
For example, a car VIN: I want to use it as the PK, but I'm not sure how well it will be indexed or work as an FK, or whether there are pitfalls.
It depends on which kind of data you are going to store.
In some cases (I would say in most cases) it is better to use integer-based primary keys:
A bigint needs only 8 bytes, while a varchar can require more space. For this reason, a varchar comparison is often more costly than a bigint comparison.
Joining tables is more efficient on integer-based values than on strings.
An integer-based unique key is more appropriate for table relations, for instance if you are going to store this primary key in other tables as a separate column. Again, a varchar will require more space in the other table too (see the first point).
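The size difference in the first point can be checked directly. This is only a sketch: the VIN below is a made-up 17-character sample, and the stored size of a varchar also includes a small length header on top of the payload bytes.

```sql
-- pg_column_size() reports the bytes used to store a value;
-- octet_length() reports the payload bytes of a string.
SELECT pg_column_size(1234567890::bigint)      AS bigint_bytes,
       octet_length('1HGCM82633A004352'::text) AS vin_payload_bytes;
```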
This post on stackexchange compares non-integer types of primary keys on a particular example.
I have a logs table with many rows where the PK is generated by the uuid_generate_v4() function.
What I'm curious about: is there a limit for generated UUIDs? For example, if I have 10.000.000.000 rows, will it be unable to generate a unique primary key?
Since a UUID is a 128 bit number, the maximum number of different UUIDs would be 2^128 = 340.282.366.920.938.463.463.374.607.431.768.211.456 (if that big number calculator made no mistake, but it sure is very, very large). So you're far, far, far away from that with just 10.000.000.000.
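A rough way to put numbers on this is the birthday approximation p ≈ n^2 / (2N) for n random values drawn from N possibilities. The sketch assumes that uuid_generate_v4() produces standard version-4 UUIDs, in which 122 of the 128 bits are random, and it uses double precision, so the results are approximate.

```sql
SELECT power(2::float8, 128) AS all_128_bit_values,
       power(2::float8, 122) AS random_v4_values,
       -- approximate chance of at least one collision among 10^10 rows
       power(1e10::float8, 2) / (2 * power(2::float8, 122))
           AS approx_collision_probability;
```

Under these assumptions the collision probability for 10.000.000.000 rows comes out on the order of 1e-17, which is negligible in practice.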
Suppose I have key/value/timerange tuples, e.g.:
CREATE TABLE historical_values(
key TEXT,
value NUMERIC,
from_time TIMESTAMPTZ,
to_time TIMESTAMPTZ
)
and would like to be able to efficiently query values (sorted descending) for a specific key and time, e.g.:
SELECT value
FROM historical_values
WHERE
key = [KEY]
AND from_time <= [TIME]
AND to_time >= [TIME]
ORDER BY value DESC
What kind of index/types should I use to get the best lookup performance? I suspect my solution will involve a tstzrange and a gist index, but I'm not sure how to make that play well with the key matching and value ordering requirements.
Edit: Here's some more information about usage.
Ideally uses features available in Postgres v9.6.
Relation will contain approx. 1k keys and 5m values per key. Values are large integers (up to 32 bytes), mostly unique. Time ranges between few hours to a couple years. Time horizon is 5 years. No NULL values allowed, but some time ranges are open-ended (could either use NULL or a time far into the future for to_time).
The primary key is the key and time range (as there is only one historical value for a time range, per key).
Common operations are a) updating to_time to "close" a historical value, and b) inserting a new value with from_time = NOW.
All values may be queried. Partitioning is an option.
DB design
For a big table like that ("1k keys and 5m values per key") I would suggest to optimize storage like:
CREATE TABLE hist_keys (
key_id serial PRIMARY KEY
, key text NOT NULL UNIQUE
);
CREATE TABLE hist_values (
hist_value_id bigserial PRIMARY KEY -- optional, see below!
, key_id int NOT NULL REFERENCES hist_keys
, value numeric
, from_time timestamptz NOT NULL
, to_time timestamptz NOT NULL
, CONSTRAINT range_valid CHECK (from_time <= to_time) -- or < ?
);
Also helps index performance.
And consider partitioning: list-partitioning on key_id. Maybe even add sub-partitioning (range partitioning this time) on from_time. Read the manual here.
With one partition per key_id (and constraint exclusion enabled!), Postgres would only look at the small partition (and index) for the given key, instead of the whole big table. Major win.
But I would strongly suggest upgrading to at least Postgres 10 first, which added "declarative partitioning". It makes managing partitions a lot easier.
Better yet, skip forward to Postgres 11 (currently beta), which adds major improvements for partitioning (incl. performance improvements). Most notably, for your goal to get the best lookup performance, quoting the chapter on partitioning in the release notes for Postgres 11:
Allow faster partition elimination during query processing (Amit Langote, David Rowley, Dilip Kumar)
This speeds access to partitioned tables with many partitions.
Allow partition elimination during query execution (David Rowley, Beena Emerson)
Previously partition elimination could only happen at planning time,
meaning many joins and prepared queries could not use partition elimination.
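A minimal sketch of list-partitioning on key_id with the declarative syntax of Postgres 10 or later. The partition names are hypothetical, and the primary key and the foreign key to hist_keys are left out here because support for them on partitioned tables depends on the version.

```sql
-- Parent table: holds no rows itself, only routes them to partitions.
CREATE TABLE hist_values (
    key_id    int         NOT NULL,
    value     numeric,
    from_time timestamptz NOT NULL,
    to_time   timestamptz NOT NULL,
    CONSTRAINT range_valid CHECK (from_time <= to_time)
) PARTITION BY LIST (key_id);

-- One partition per key (hypothetical names); repeat for every key_id.
CREATE TABLE hist_values_1 PARTITION OF hist_values FOR VALUES IN (1);
CREATE TABLE hist_values_2 PARTITION OF hist_values FOR VALUES IN (2);
```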
Index
From the perspective of the value column, the small subset of selected rows is arbitrary for every new query. I don't expect you'll find a useful way to support ORDER BY value DESC with an index. I'd concentrate on the other columns. Maybe add value as last column to each index if you can get index-only scans out of it (possible for btree and GiST).
Without partitioning:
CREATE UNIQUE INDEX hist_btree_idx ON hist_values (key_id, from_time, to_time DESC);
UNIQUE is optional, but see below.
Note the importance of opposing sort orders for from_time and to_time. See (closely related!):
Optimizing queries on a range of timestamps (two columns)
This is almost the same index as the one implementing your PK on (key_id, from_time, to_time). Unfortunately, we cannot use it as PK index. Quoting the manual:
Also, it must be a b-tree index with default sort ordering.
So I added a bigserial as surrogate primary key in my suggested table design above and NOT NULL constraints plus the UNIQUE index to enforce your uniqueness rule.
In Postgres 10 or later consider an IDENTITY column instead:
Auto increment table column
You might even do without a PK constraint in this exceptional case to avoid duplicating the index and keep the table at minimum size. It depends on the complete situation. You may need it for FK constraints or similar. See:
How does PostgreSQL enforce the UNIQUE constraint / what type of index does it use?
A GiST index like you already suspected may be even faster. I suggest keeping your original timestamptz columns in the table (16 bytes instead of 32 bytes for a tstzrange) and adding key_id to the index after installing the additional module btree_gist:
CREATE INDEX hist_gist_idx ON hist_values
USING GiST (key_id, tstzrange(from_time, to_time, '[]'));
The expression tstzrange(from_time, to_time, '[]') constructs a range including upper and lower bound. Read the manual here.
Your query needs to match the index:
SELECT value
FROM hist_values
WHERE key_id = [KEY_ID]
AND tstzrange(from_time, to_time, '[]') @> tstzrange([TIME_FROM], [TIME_TO], '[]')
ORDER BY value DESC;
It's equivalent to your original when [TIME_FROM] and [TIME_TO] are both set to [TIME].
@> being the range "contains" operator.
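For a single point in time, as in the original question, the range containment operator @> also accepts a plain timestamptz on the right-hand side. A sketch with made-up literal values (key 42 and the timestamp are placeholders, not data from the question):

```sql
SELECT value
FROM   hist_values
WHERE  key_id = 42
AND    tstzrange(from_time, to_time, '[]') @> timestamptz '2018-06-01 00:00:00+00'
ORDER  BY value DESC;
```

The range expression has to be written exactly as in the index definition, otherwise the planner cannot match it to the expression index.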
With list-partitioning on key_id
With a separate table for each key_id, we can omit key_id from the index, improving size and performance - especially for the GiST index - for which we then also don't need the additional module btree_gist. Results in ~ 1000 partitions and the corresponding indexes:
CREATE INDEX hist999_gist_idx ON hist_values USING GiST (tstzrange(from_time, to_time, '[]'));
Related:
Store the day of the week and time?
I am working on a project where I have to store millions of rows with a column x of type bytea (with a maximum size of 128 bytes). I need to query the data by x (i.e. where x = ?). Now I was wondering if I can use x directly as a primary key without any negative performance impact?
I also have to join that table on the primary key from another table, therefore I would also have to store bytea as foreign key in another table.
As far as I know, most database systems make use of a B+-Tree which has a search complexity of θ(log(n)). When using bytea as primary key, I am not sure if Postgres can efficiently organize such a B+-Tree?
If you can guarantee that the value of the bytea never changes, you can use it as primary key.
But it is not necessarily wise to do so: if that key is stored in other tables as well, this will waste space, and an artificial primary key might be better.
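A sketch of the artificial-key variant, assuming hypothetical tables payload and payload_ref and Postgres 10 or later for the identity column. The bytea value keeps a unique B-tree index for the x = ? lookups, while joins and foreign keys use the 8-byte surrogate.

```sql
-- Hypothetical tables for illustration only.
CREATE TABLE payload (
    payload_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    x          bytea NOT NULL UNIQUE
               CHECK (octet_length(x) <= 128)   -- size limit from the question
);

CREATE TABLE payload_ref (
    payload_id bigint NOT NULL REFERENCES payload (payload_id)
);

-- Equality lookup by x uses the unique index on x.
SELECT payload_id FROM payload WHERE x = '\xdeadbeef'::bytea;
```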
There is one table in my database that has a row with an ID equal to 0 (zero).
The primary key is a serial column.
I'm used to seeing sequences start with 1. So, is there a problem if I keep this ID as zero?
The Serial data type creates integer columns which happen to auto-increment. Hence you should be able to add any integer value to the column (including 0).
From the docs
The type names serial and serial4 are equivalent: both create integer columns.
....(more about Serial) we have created an integer column and arranged for its default values to be assigned from a sequence generator
http://www.postgresql.org/docs/current/static/datatype-numeric.html#DATATYPE-SERIAL
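A small sketch with a hypothetical table demo_serial: an explicit 0 is accepted like any other integer, and rows that omit the column still draw their value from the sequence.

```sql
-- Hypothetical table for illustration only.
CREATE TABLE demo_serial (
    id    serial PRIMARY KEY,
    label text NOT NULL
);

-- Explicit value: the sequence is not consulted at all.
INSERT INTO demo_serial (id, label) VALUES (0, 'explicit zero');

-- No id given: the default nextval() of the sequence is used.
INSERT INTO demo_serial (label) VALUES ('generated');

SELECT id, label FROM demo_serial ORDER BY id;
```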
This is presented as an answer because it’s too long for a comment.
You’re actually talking about two things here.
A primary key is a column designated to be the unique identifier for the table. There may be other unique columns, but the primary key is the one you have settled on, possibly because it’s the most stable value. (For example a customer’s email address is unique, but it’s subject to change, and it’s harder to manage).
The primary key can be any common data type, as long as it is guaranteed to be unique. In some cases, the primary key is a natural property of the row data, in which case it is a natural primary key.
In (most?) other cases, the primary key is an arbitrary value with no inherent meaning. In that case it is called a surrogate key.
The simplest surrogate key, the one which I like to call the lazy surrogate key, is a serial number. Technically, it’s not truly surrogate in that there is an inherent meaning in the sequence, but it is otherwise arbitrary.
For PostgreSQL, the data type typically associated with a serial number is integer, and this is implied in the SERIAL type. If you were doing this in MySQL/MariaDB, you might use unsigned integer, which doesn’t have negative values. PostgreSQL doesn’t have unsigned, so the data can indeed be negative.
The point about serial numbers is that they normally start at 1 and increment by 1. In PostgreSQL, you could have set up your own sequence manually (SERIAL is just a shortcut for that), in which case you can start with any value you like, such as 100, 0 or even -100 etc.
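A sketch of such a manual setup, using a hypothetical sequence and table. An ascending sequence has a default minimum of 1, so MINVALUE has to be lowered before it can start at 0 or below.

```sql
-- Hypothetical names for illustration only.
CREATE SEQUENCE invoice_id_seq MINVALUE -100 START WITH -100;

CREATE TABLE invoice (
    invoice_id integer PRIMARY KEY DEFAULT nextval('invoice_id_seq'),
    note       text
);

-- Tie the sequence to the column so it is dropped together with it,
-- which is what SERIAL arranges automatically.
ALTER SEQUENCE invoice_id_seq OWNED BY invoice.invoice_id;
```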
To actually give an answer:
A primary key can have any compatible value you like, as long as it’s unique.
A serial number can also have any compatible value, but it is standard practice to start at 1, because that’s how we humans count.
Reasons to override the start-at-one principle include:
I sometimes use 0 as a sort of default if a valid row hasn’t been selected.
You might use negative ids to indicate non-standard data, such as for testing or for virtual values; for example a customer with a negative id might indicate an internal allocation.
You might start your real sequence from a higher number and use lower ids for something similar to the point above.
Note that modern versions of PostgreSQL have a preferred standard alternative in the form of GENERATED BY DEFAULT AS IDENTITY. In line with modern SQL trends, it is much more verbose, but it is much more manageable than the old SERIAL.
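A sketch of the identity form with a hypothetical customer table (Postgres 10 or later). The sequence options in parentheses make the generated ids start at 0, and BY DEFAULT still allows an explicit value such as a negative id.

```sql
-- Hypothetical table for illustration only.
CREATE TABLE customer (
    customer_id integer GENERATED BY DEFAULT AS IDENTITY
                (MINVALUE 0 START WITH 0) PRIMARY KEY,
    name        text NOT NULL
);

-- Generated id: the first value handed out is 0.
INSERT INTO customer (name) VALUES ('first customer');

-- Explicit id: permitted because the column is BY DEFAULT, not ALWAYS.
INSERT INTO customer (customer_id, name) VALUES (-1, 'internal allocation');
```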