There is one table in my database that has a row with ID equal to 0 (zero).
The primary key is a serial column.
I'm used to seeing sequences start at 1. So, is there a problem if I keep this ID as zero?
The serial data type creates integer columns that happen to auto-increment. Hence you should be able to insert any integer value into the column (including 0).
From the docs:
The type names serial and serial4 are equivalent: both create integer columns.
....(more about Serial) we have created an integer column and arranged for its default values to be assigned from a sequence generator
http://www.postgresql.org/docs/current/static/datatype-numeric.html#DATATYPE-SERIAL
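For example, a minimal sketch (the table and column names are made up, not from the question):

CREATE TABLE items (id serial PRIMARY KEY, name text);

INSERT INTO items (id, name) VALUES (0, 'explicit zero');  -- explicit value bypasses the sequence
INSERT INTO items (name) VALUES ('first default');         -- draws 1 from the sequence

The explicit 0 does not advance the sequence, so the next default value is still 1.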
This is presented as an answer because it’s too long for a comment.
You’re actually talking about two things here.
A primary key is a column designated to be the unique identifier for the table. There may be other unique columns, but the primary key is the one you have settled on, possibly because it’s the most stable value. (For example a customer’s email address is unique, but it’s subject to change, and it’s harder to manage).
The primary key can be any common data type, as long as it is guaranteed to be unique. In some cases, the primary key is a natural property of the row data, in which case it is a natural primary key.
In (most?) other cases, the primary key is an arbitrary value with no inherent meaning. In that case it is called a surrogate key.
The simplest surrogate key, the one which I like to call the lazy surrogate key, is a serial number. Technically, it’s not truly surrogate in that there is an inherent meaning in the sequence, but it is otherwise arbitrary.
For PostgreSQL, the data type typically associated with a serial number is integer, and this is implied in the SERIAL type. If you were doing this in MySQL/MariaDB, you might use unsigned integer, which doesn’t have negative values. PostgreSQL doesn’t have unsigned, so the data can indeed be negative.
The point about serial numbers is that they normally start at 1 and increment by 1. In PostgreSQL, you could have set up your own sequence manually (SERIAL is just a shortcut for that), in which case you can start with any value you like, such as 100, 0 or even -100 etc.
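For instance, a minimal sketch of the manual setup (sequence and table names are made up); note that a start value below 1 also needs MINVALUE lowered:

CREATE SEQUENCE my_table_id_seq START WITH 0 MINVALUE 0;

CREATE TABLE my_table (
  id      integer PRIMARY KEY DEFAULT nextval('my_table_id_seq'),
  payload text
);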
To actually give an answer:
A primary key can have any compatible value you like, as long as it’s unique.
A serial number can also have any compatible value, but it is standard practice to start at 1, because that's how we humans count.
Reasons to override the start-at-one principle include:
I sometimes use 0 as a sort of default if a valid row hasn’t been selected.
You might use negative ids to indicate non-standard data, such as for testing or for virtual values; for example a customer with a negative id might indicate an internal allocation.
You might start your real sequence from a higher number and use lower ids for something similar to the point above.
Note that modern versions of PostgreSQL have a preferred standard alternative in the form of GENERATED BY DEFAULT AS IDENTITY. In line with modern SQL trends, it is much more verbose, but it is much more manageable than the old SERIAL.
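A sketch of the modern form (the table name is illustrative):

CREATE TABLE orders (
  id   bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  note text
);

GENERATED BY DEFAULT still lets you insert an explicit value such as 0, while GENERATED ALWAYS rejects explicit values unless you override them.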
Related
Could somebody tell me whether it is a good idea to use varchar as a PK? I mean, is it less efficient than, or equal to, int/uuid?
For example: a car VIN. I want to use it as the PK, but I'm not sure how well it will be indexed, how it will work as an FK, or whether there are other pitfalls.
It depends on which kind of data you are going to store.
In some cases (I would say in most cases) it is better to use integer-based primary keys:
for instance, bigint needs only 8 bytes, while varchar can require more space; for this reason, a varchar comparison is often more costly than a bigint comparison.
while joining tables it is more efficient to join them on integer-based values rather than strings
an integer-based key is more appropriate for table relations, for instance if you are going to store this primary key in other tables as a separate column; again, varchar will require more space in those other tables too (see point 1).
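To illustrate with the VIN example from the question (purely a sketch; table and column names are assumptions):

CREATE TABLE car (
  car_id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,  -- 8-byte surrogate key
  vin    varchar(17) NOT NULL UNIQUE                           -- natural key kept unique anyway
);

CREATE TABLE service_record (
  record_id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  car_id    bigint NOT NULL REFERENCES car (car_id)            -- joins and FK storage stay at 8 bytes
);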
This post on stackexchange compares non-integer types of primary keys on a particular example.
I have a third-party application connecting to a view in my PostgreSQL database. It requires the view to have a primary key but can't handle the UUID type (which is the primary key for the view). It also can't handle the UUID as the primary key if it is served as text from the view.
What I'd like to do is convert the UUID to a number and use that as the primary key instead. However,
SELECT x'14607158d3b14ac0b0d82a9a5a9e8f6e'::bigint
fails because the number is out of range.
So instead, I want to use SQL to take the big end of the UUID and create an int8 / bigint. I should clarify that maintaining order is 'desirable' but I understand that some of the order will change by doing this.
I tried:
SELECT x(substring(UUID::text from 1 for 16))::bigint
but the x operator for converting hex doesn't seem to like brackets. I abstracted it into a function but
SELECT hex_to_int(substring(UUID::text from 1 for 16))::bigint
still fails.
How can I get a bigint from the 'big end' half of a UUID?
Fast and without dynamic SQL
Cast the leading 16 hex digits of a UUID in text representation as bitstring bit(64) and cast that to bigint. See:
Convert hex in text representation to decimal number
Conveniently, excess hex digits to the right are truncated in the cast to bit(64) automatically - exactly what we need.
Postgres accepts various formats for input. Your given string literal is one of them:
14607158d3b14ac0b0d82a9a5a9e8f6e
The default text representation of a UUID (and the text output in Postgres for data type uuid) adds hyphens at predefined places:
14607158-d3b1-4ac0-b0d8-2a9a5a9e8f6e
The manual:
A UUID is written as a sequence of lower-case hexadecimal digits, in several groups separated by hyphens, specifically a group of 8 digits followed by three groups of 4 digits followed by a group of 12 digits, for a total of 32 digits representing the 128 bits.
If the input format can vary, strip hyphens first to be sure:
SELECT ('x' || translate(uuid_as_string, '-', ''))::bit(64)::bigint;
Cast actual uuid input with uuid::text.
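For example, applied to a hypothetical table t with a uuid column id, that might look like:

SELECT ('x' || translate(id::text, '-', ''))::bit(64)::bigint AS id64
FROM   t;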
db<>fiddle here
Note that Postgres uses signed integer, so the bigint overflows to negative numbers in the upper half - which should be irrelevant for this purpose.
DB design
If at all possible add a bigserial column to the underlying table and use that instead.
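A sketch of that, with a made-up table name:

ALTER TABLE underlying_table ADD COLUMN app_id bigserial UNIQUE;  -- backfills 1, 2, 3, ... for existing rows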
This is all very shaky, both the problem and the solution you describe in your self-answer.
First, a mismatch between a database design and a third-party application is always possible, but usually indicative of a deeper problem. Why does your database use the uuid data type as a PK in the first place? They are not very efficient compared to a serial or a bigserial. Typically you would use a UUID if you are working in a distributed environment where you need to "guarantee" uniqueness over multiple installations.
Secondly, why does the application require the PK to begin with (incidentally: views do not have a PK, the underlying tables do)? If it is only to view the data then a PK is rather useless, particularly if it is based on a UUID (and there is thus no conceivable relationship between the PK and the rest of the tuple). If it is used to refer to other data in the same database or do updates or deletes of existing data, then you need the exact UUID and not some extract of it because the underlying table or other relations in your database would have the exact UUID. Of course you can convert all UUID's with the same hex_to_int() function, but that leads straight back to my point above: why use uuids in the first place?
Thirdly, do not mess around with things you have little or no knowledge of. This is not intended to be offensive, take it as well-meant advice (look around on the internet for programmers who tried to improve on cryptographic algorithms or random number generation by adding their own twists of obfuscation; quite entertaining reads). There are 5 algorithms for generating UUID's in the uuid-ossp package and while you know or can easily find out which algorithm is used in your database (the uuid_generate_vX() functions in your table definitions, most likely), do you know how the algorithm works? The claim of practical uniqueness of a UUID is based on its 128 bits, not a 64-bit extract of it. Are you certain that the high 64-bits are random? My guess is that 64 consecutive bits are less random than the "square root of the randomness" (for lack of a better way to phrase the theoretical drop in periodicity of a 64-bit number compared to a 128-bit number) of the full UUID. Why? Because all but one of the algorithms are made up of randomized blocks of otherwise non-random input (such as the MAC address of a network interface, which is always the same on a machine generating millions of UUIDs). Had 64 bits been enough for randomized value uniqueness, then a uuid would have been that long.
What a better solution would be in your case is hard to say, because it is unclear what the third-party application does with the data from your database and how dependent it is on the uniqueness of the "PK" column in the view. An approach that is likely to work if the application does more than trivially display the data without any further use of the "PK" would be to associate a bigint with every retrieved uuid in your database in a (temporary) table and include that bigint in your view by linking on the uuids in your (temporary) tables. Since you can not trigger on SELECT statements, you would need a function to generate the bigint for every uuid the application retrieves. On updates or deletes on the underlying tables of the view or upon selecting data from related tables, you look up the uuid corresponding to the bigint passed in from the application. The lookup table and function would look somewhat like this:
CREATE TEMPORARY TABLE temp_table(
  tempint       bigserial PRIMARY KEY,
  internal_uuid uuid);

CREATE INDEX ON temp_table(internal_uuid);

CREATE FUNCTION temp_int_for_uuid(pk uuid) RETURNS bigint AS $$
DECLARE
  id bigint;
BEGIN
  SELECT tempint INTO id FROM temp_table WHERE internal_uuid = pk;
  IF NOT FOUND THEN
    INSERT INTO temp_table(internal_uuid) VALUES (pk)
      RETURNING tempint INTO id;
  END IF;
  RETURN id;
END; $$ LANGUAGE plpgsql STRICT;
Not pretty, not efficient, but fool-proof.
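The view served to the application could then look somewhat like this (base_table and its columns are placeholders; the temporary table must exist in the session before the view is queried):

CREATE VIEW app_view AS
SELECT temp_int_for_uuid(t.id) AS pk,  -- stable bigint stand-in for the uuid
       t.payload
FROM   base_table t;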
Cast a hex literal built from a substr of the UUID to bit(64) and then to bigint to get a decimal number:
select ('x'||substr(UUID, 1, 16))::bit(64)::bigint
See SQLFiddle
Solution found.
UUID::text will return a string with hyphens. In order for substring(UUID::text from 1 for 16) to create a string that x can parse as hex, the hyphens need to be stripped first.
The final query looks like:
SELECT hex_to_int(substring((select replace(id::text,'-','')) from 1 for 16))::bigint FROM table
The hex_to_int function needs to be able to handle a bigint, not just int. It looks like:
CREATE OR REPLACE FUNCTION hex_to_int(hexval character varying)
  RETURNS bigint AS
$BODY$
DECLARE
  result bigint;
BEGIN
  EXECUTE 'SELECT x''' || hexval || '''::bigint' INTO result;
  RETURN result;
END;
$BODY$
LANGUAGE plpgsql;
A new database is to store the log data of a (series of) web servers. The structure of the log records has been converted to this ‘naïve’ general schema:
CREATE TABLE log (
  matchcode  SERIAL PRIMARY KEY,
  stamp      TIMESTAMP WITH TIME ZONE,
  ip         INET,
  bytes      NUMERIC,
  vhost      TEXT,
  path       TEXT,
  user_agent TEXT,
  -- and so on
);
And so on, there are many more fields, but this shows the general principle. The bulk of the data is contained in free-text fields as shown above. Of course, this will make the database rather big in the long run. We are talking about a web server log, so this doesn't come as a big surprise.
The domain of those text fields is limited, though. There is e.g. a very finite set of vhosts that will be seen, a much larger, but still decidedly finite set of paths, user agents and so on. In cases like this, would it be more appropriate to factor the text fields out into sub-tables, and reference them only via identifiers? I am thinking along a line like this:
CREATE TABLE vhost ( ident SERIAL PRIMARY KEY, vhost TEXT NOT NULL UNIQUE );
CREATE TABLE path ( ident SERIAL PRIMARY KEY, path TEXT NOT NULL UNIQUE );
CREATE TABLE user_agent ( ident SERIAL PRIMARY KEY, user_agent TEXT NOT NULL UNIQUE );
CREATE TABLE log (
  matchcode  SERIAL PRIMARY KEY,
  stamp      TIMESTAMP WITH TIME ZONE,
  ip         INET,
  bytes      NUMERIC,
  vhost      INTEGER REFERENCES vhost (ident),
  path       INTEGER REFERENCES path (ident),
  user_agent INTEGER REFERENCES user_agent (ident),
  -- and so on
);
I have tried both approaches now. As expected, the second one is much smaller, by give or take a factor of three. However, querying it becomes significantly slower due to all the joins involved. The difference is about an order of magnitude.
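To make this concrete, a typical lookup against the normalized variant needs something along these lines (illustrative only, not my exact query):

SELECT l.stamp, v.vhost, p.path, u.user_agent
FROM   log l
JOIN   vhost      v ON v.ident = l.vhost
JOIN   path       p ON p.ident = l.path
JOIN   user_agent u ON u.ident = l.user_agent
WHERE  l.stamp >= now() - interval '1 day';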
From what I understand, the table should be sufficiently normal in both cases. At some later point in the project, there’ll maybe be additional attributes attached to the various text values (like additional information about each vhost and so on).
The practical considerations are obvious, it’s basically a space/time tradeoff. In the long run, what is considered best practice in such a case? Are there other, perhaps more theoretical implications for such a scenario that I might want to be aware of?
The domain of those text fields is limited, though. There is e.g. a very finite set of vhosts that will be seen, a much larger, but still decidedly finite set of paths, user agents and so on. In cases like this, would it be more appropriate to factor the text fields out into sub-tables, and reference them only via identifiers?
There are a couple of different ways to look at this kind of problem. Regardless of which way you look at it, appropriate is a fuzzy word.
Constraints
Let's imagine that you create a table and a foreign key constraint to allow the "vhost" column to accept only five values. Can you also constrain the web server to write only those five values to the log file? No, you can't.
You can add some code to insert new virtual hosts into the referenced table. You can even automate that with triggers. But when you do that, you're no longer constraining the values for "vhost". This remains true whether you use natural keys or surrogate keys.
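For example, a minimal sketch of such code on a recent Postgres version (the function name is made up; ON CONFLICT needs 9.5+), which a trigger on a staging table could call:

CREATE FUNCTION vhost_id(p_vhost text) RETURNS integer AS $$
  INSERT INTO vhost (vhost) VALUES (p_vhost)
  ON CONFLICT (vhost) DO UPDATE SET vhost = EXCLUDED.vhost  -- no-op update so RETURNING also sees existing rows
  RETURNING ident;
$$ LANGUAGE sql;

But note that this merely records whatever shows up; it no longer constrains the values.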
Data compression with ID numbers
You can also think of this as a problem of data compression. You save space--potentially a lot of space--by using integers as foreign keys to tables of unique text. You might not save time. Queries that require a lot of joins are often slower than queries that just read data directly. You've already seen that.
In your case, which has to do with machine-generated log files, I prefer to store them as they come from the device (web server, flow sensor, whatever) unless there's a compelling reason not to.
In a few cases, I've worked on systems where domain experts have determined that certain kinds of values shouldn't be transferred from a log file to a database. For example, domain experts might decide that negative numbers from a sensor means the sensor is broken. This kind of constraint is better handled with a CHECK() constraint, but the principle is the same.
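A sketch of that kind of rule, with a hypothetical sensor table:

ALTER TABLE sensor_reading
  ADD CONSTRAINT sensor_reading_value_sane CHECK (value >= 0);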
Take what the device gives you unless there's a compelling reason not to.
What is the limit on the length of a primary key column? I'm going to use varchar as a primary key. I've found no info on how long it can be, since PostgreSQL does not require you to specify a varchar limit when it is used as a primary key.
The maximum length for a value in a B-tree index, which includes primary keys, is one third of the size of a buffer page, by default floor(8192/3) = 2730 bytes.
I believe that maximum varchar length is a Postgres configuration setting. However, it looks as though it can't exceed 1GB in size.
http://wiki.postgresql.org/wiki/FAQ#What_is_the_maximum_size_for_a_row.2C_a_table.2C_and_a_database.3F
That having been said, it's probably not a good idea to have a large varchar column as a primary key. Consider using a serial or bigserial (http://www.postgresql.org/docs/current/interactive/datatype-numeric.html#DATATYPE-SERIAL)
You should do a test.
I've made tests with a table that has a single varchar column as primary key, on PostgreSQL 8.4. The result is that I was able to store 235000 ASCII characters, 116000 Polish diacritical characters (e.g. 'ć') or 75000 Chinese characters (e.g. '汉'). For larger sets I got a message:
ERROR: index row size 5404 exceeds btree maximum, 2712
However, the message also said:
Values larger than 1/3 of a buffer page cannot be indexed.
So the values were allowed; however, not the whole string was used for the uniqueness check.
Well, this is a very large amount of data that you can put in that column. However, as noted above, your design is poor if you have to use such long values as keys. You should use an artificial primary key.
Can I define a primary key over three attributes? I am using Visual Paradigm and Postgres.
CREATE TABLE answers (
  time SERIAL NOT NULL,
  "{Users}{userID}user_id" int4 NOT NULL,
  "{Users}{userID}question_id" int4 NOT NULL,
  reply varchar(255),
  PRIMARY KEY (time, "{Users}{userID}user_id", "{Users}{userID}question_id"));
A picture may clarify the question.
Yes you can, just as you showed (though I question your naming of the 2nd and 3rd columns).
From the docs:
"Primary keys can also constrain more than one column; the syntax is similar to unique constraints:
CREATE TABLE example (
    a integer,
    b integer,
    c integer,
    PRIMARY KEY (a, c)
);
A primary key indicates that a column or group of columns can be used as a unique identifier for rows in the table. (This is a direct consequence of the definition of a primary key. Note that a unique constraint does not, by itself, provide a unique identifier because it does not exclude null values.) This is useful both for documentation purposes and for client applications. For example, a GUI application that allows modifying row values probably needs to know the primary key of a table to be able to identify rows uniquely.
A table can have at most one primary key (while it can have many unique and not-null constraints). Relational database theory dictates that every table must have a primary key. This rule is not enforced by PostgreSQL, but it is usually best to follow it.
"
Yes, you can. There is just such an example in the documentation. However, I'm not familiar with the bracketed terms you're using. Are you doing some variable evaluation before creating the database schema?
Yes, you can.
If you'd run it, you would see it in no time.
I would really, really, really suggest rethinking the naming convention. A time column that contains a serial integer? Column names like "{Users}{userID}user_id"? Oh my.