Is there efficient difference between varchar and int as PK - postgresql

Could somebody tell is it good idea use varchar as PK. I mean is it less efficient or equal to int/uuid?
In example: car VIN I want to use it as PK but I'm not sure as good it will be indexed or work as FK or maybe there is some pitfalls.

It depends on which kind of data you are going to store.
In some cases (I would say in most cases) it is better to use integer-based primary keys:
for instance, bigint needs only 8 bytes, varchar can require more space. For this reason, a varchar comparison is often more costly than a bigint comparison.
while joining tables it would be more efficient to join them using integer-based values rather that strings
an integer-based key as a unique key is more appropriate for table relations. For instance, if you are going to store this primary key in another tables as a separate column. Again, varchar will require more space in other table too (see p.1).
This post on stackexchange compares non-integer types of primary keys on a particular example.

Related

Postgres JSONB unique constraint

I have a table as following table.
create table person {
firstname varchar,
lastname varchar,
person_info jsonb,
..
}
I already have unique constraints on firstname + lastname. I recently identify there is always something different in person_info jsonb. I want to uniquely identify by person_info jsonb.
Should I add person_info as part of unique constraints firstname + lastname + person_info ? Is there any performance impact with such implementation ? I heard JSONB is not good for index when number of data increases.
I am thinking to use store person_info hashvalue in different field and combine this new hashvalue field as part of unique index.
I would appreciate if I get some help from expert on this.
This seems like a wrong idea.
A primary key should be immutable and uniquely identify a table row.
Names are not good for that, because
different people can have the same name
names can change
This is probably why you are tempted to add additional information to truly identify each individual row.
Unless you have some immutable attribute that uniquely identifies each person (such as the social security nubmer), you should generate an artificial primary key for the table:
ALTER TABLE person
ADD id bigint
GENERATED ALWAYS AS IDENTITY
PRIMARY KEY;
Indexing a jsonb is possible, but you will get problems with long values since index entries are limited in size, and you will get an error if you exceed the limit.
I recommend that any attribute that you might want to index is not stored in a jsonb, but as a regular table column.
JSONB indexing IMHO refers to the ability to index fields inside the binary JSON rather than the whole block. Be aware also that key ordering is not kept! So if you can obtain two different hashes for two json with the exact same data but different ordering. Instead, if you can find which json fields gives you uniqueness, than you can use directly those for indexing.
Try also to look at this page

Is a PK of type bytea slower compared to a PK of a sequential type integer in Postgres?

I am working on a project where I have to store millions of rows with a column x of type bytea (with a maximum size of 128 bytes). I need to query the data by x (i.e. where x = ?). Now I was wondering if I can use x directly as a primary key without any negative performance impact?
I also have to join that table on the primary key from another table, therefore I would also have to store bytea as foreign key in another table.
As far as I know, most database systems make use of a B+-Tree which has a search complexity of θ(log(n)). When using bytea as primary key, I am not sure if Postgres can efficiently organize such a B+-Tree?
If you can guarantee that the value of the bytea never changes, you can use it as primary key.
But it is not necessarily wise to do so: if that key is stored in other tables as well, this will waste space, and an artificial primary key might be better.

Can a primary key in postgres have zero value?

There is one table at my database that have a row with ID equals to 0 (zero).
The primary key is a serial column.
I'm used to see sequences starting with 1. So, is there a ploblem if i keep this ID as zero?
The Serial data type creates integer columns which happen to auto-increment. Hence you should be able to add any integer value to the column (including 0).
From the docs
The type names serial and serial4 are equivalent: both create integer columns.
....(more about Serial) we have created an integer column and arranged for its default values to be assigned from a sequence generator
http://www.postgresql.org/docs/current/static/datatype-numeric.html#DATATYPE-SERIAL
This is presented as an answer because it’s too long for a comment.
You’re actually talking about two things here.
A primary key is a column designated to be the unique identifier for the table. There may be other unique columns, but the primary key is the one you have settled on, possibly because it’s the most stable value. (For example a customer’s email address is unique, but it’s subject to change, and it’s harder to manage).
The primary key can be any common data type, as long as it is guaranteed to be unique. In some cases, the primary key is a natural property of the row data, in which case it is a natural primary key.
In (most?) other cases, the primary key is an arbitrary value with no inherent meaning. In that case it is called a surrogate key.
The simplest surrogate key, the one which I like to call the lazy surrogate key, is a serial number. Technically, it’s not truly surrogate in that there is an inherent meaning in the sequence, but it is otherwise arbitrary.
For PostgreSQL, the data type typically associated with a serial number is integer, and this is implied in the SERIAL type. If you were doing this in MySQL/MariaDB, you might use unsigned integer, which doesn’t have negative values. PostgreSQL doesn’t have unsigned, so the data can indeed be negative.
The point about serial numbers is that they normally start at 1 and increment by 1. In PostgreSQL, you could have set up your own sequence manually (SERIAL is just a shortcut for that), in which case you can start with any value you like, such as 100, 0 or even -100 etc.
To actually give an answer:
A primary key can have any compatible value you like, as long as it’s unique.
A serial number can also have any compatible value, but it is standard practice to start as 1, because that’s how we humans count.
Reasons to override the start-at-one principle include:
I sometimes use 0 as a sort of default if a valid row hasn’t been selected.
You might use negative ids to indicate non-standard data, such as for testing or for virtual values; for example a customer with a negative id might indicate an internal allocation.
You might start your real sequence from a higher number and use lower ids for something similar to the point above.
Note that modern versions of PostgreSQL have a preferred standard alternative in the form of GENERATED BY DEFAULT AS IDENTITY. In line with modern SQL trends, it is much more verbose, but it is much more manageable than the old SERIAL.

postgresql hstore key/value vs traditional SQL performance

I need to develop a key/value backend, something like this:
Table T1 id-PK, Key - string, Value - string
INSERT into T1('String1', 'Value1')
INSERT INTO T1('String1', 'Value2')
Table T2 id-PK2, id2->external key to id
some other data in T2, which references data in T1 (like users which have those K/V etc)
I heard about PostgreSQL hstore with GIN/GIST. What is better (performance-wise)?
Doing this the traditional way with SQL joins and having separate columns(Key/Value) ?
Does PostgreSQL hstore perform better in this case?
The format of the data should be any key=>any value.
I also want to do text matching e.g. partially search for (LIKE % in SQL or using the hstore equivalent).
I plan to have around 1M-2M entries in it and probably scale at some point.
What do you recommend ? Going the SQL traditional way/PostgreSQL hstore or any other distributed key/value store with persistence?
If it helps, my server is a VPS with 1-2GB RAM, so not a pretty good hardware. I was also thinking to have a cache layer on top of this, but I think it rather complicates the problem. I just want good performance for 2M entries. Updates will be done often but searches even more often.
Thanks.
Your question is unclear because your not clear about your objective.
The key here is the index (pun intended) - if your dealing with a large amount of keys you want to be able to retrieve them with a the least lookups and without pulling up unrelated data.
Short answer is you probably don't want to use hstore, but lets look into more detail...
Does each id have many key/value pairs (hundreds+)? Don't use hstore.
Will any of your values contain large blocks of text (4kb+)? Don't use hstore.
Do you want to be able to search by keys in wildcard expressions? Don't use hstore.
Do you want to do complex joins/aggregation/reports? Don't use hstore.
Will you update the value for a single key? Don't use hstore.
Multiple keys with the same name under an id? Can't use hstore.
So what's the use of hstore? Well, one good scenario would be if you wanted to hold key/value pairs for an external application where you know you always want to retrive all key/values and will always save the data back as a block (ie, it's never edited in-place). At the same time you do want some flexibility to be able to search this data - albiet very simply - rather than storing it in say a block of XML or JSON. In this case since the number of key/value pairs are small you save on space because your compressing several tuples into one hstore.
Consider this as your table:
CREATE TABLE kv (
id /* SOME TYPE */ PRIMARY KEY,
key_name TEXT NOT NULL,
key_value TEXT,
UNIQUE(id, key_name)
);
I think the design is poorly normalized. Try something more like this:
CREATE TABLE t1
(
t1_id serial PRIMARY KEY,
<other data which depends on t1_id and nothing else>,
-- possibly an hstore, but maybe better as a separate table
t1_props hstore
);
-- if properties are done as a separate table:
CREATE TABLE t1_properties
(
t1_id int NOT NULL REFERENCES t1,
key_name text NOT NULL,
key_value text,
PRIMARY KEY (t1_id, key_name)
);
If the properties are small and you don't need to use them heavily in joins or with fancy selection criteria, and hstore may suffice. Elliot laid out some sensible things to consider in that regard.
Your reference to users suggests that this is incomplete, but you didn't really give enough information to suggest where those belong. You might get by with an array in t1, or you might be better off with a separate table.

Primary key defined by many attributes?

Can I define a primary key according to three attributes? I am using Visual Paradigm and Postgres.
CREATE TABLE answers (
time SERIAL NOT NULL,
"{Users}{userID}user_id" int4 NOT NULL,
"{Users}{userID}question_id" int4 NOT NULL,
reply varchar(255),
PRIMARY KEY (time, "{Users}{userID}user_id", "{Users}{userID}question_id"));
A picture may clarify the question.
Yes you can, just as you showed.(though I question your naming of the 2. and 3. column.)
From the docs:
"Primary keys can also constrain more than one column; the syntax is similar to unique constraints:
CREATE TABLE example (
a integer,
b integer,
c integer,
PRIMARY KEY (a, c)
);
A primary key indicates that a column or group of columns can be used as a unique identifier for rows in the table. (This is a direct consequence of the definition of a primary key. Note that a unique constraint does not, by itself, provide a unique identifier because it does not exclude null values.) This is useful both for documentation purposes and for client applications. For example, a GUI application that allows modifying row values probably needs to know the primary key of a table to be able to identify rows uniquely.
A table can have at most one primary key (while it can have many unique and not-null constraints). Relational database theory dictates that every table must have a primary key. This rule is not enforced by PostgreSQL, but it is usually best to follow it.
"
Yes, you can. There is just such an example in the documentation.. However, I'm not familiar with the bracketed terms you're using. Are you doing some variable evaluation before creating the database schema?
yes you can
if you'd run it - you would see it in no time.
i would really, really, really suggest to rethink naming convention. time column that contains serial integer? column names like "{Users}{userID}user_id"? oh my.