I am trying to store a UUID value in my table using PostgreSQL 9.3.
Example:
create table test
(
uno UUID,
name text,
address text
);
insert into test values(1,'abc','xyz');
Note: How can I store an integer value in a column of type UUID?
The whole point of UUIDs is that they are automatically generated because the algorithm used virtually guarantees that they are unique in your table, your database, or even across databases. UUIDs are stored as 16-byte datums so you really only want to use them when one or more of the following holds true:
You need to store data indefinitely, forever, always.
You have a highly distributed system of data generation (e.g. INSERTs in a single "system" which is distributed over multiple machines, each having its own local database) where central ID generation is not feasible (e.g. a mobile data collection system with limited connectivity, later uploading to a central server).
Certain scenarios in load-balancing, replication, etc.
If one of these cases applies to you, then you are best off using the UUID as a primary key and have it generated automagically:
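-- uuid_generate_v4() is provided by the uuid-ossp extension:
-- CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE TABLE test (
uno uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
name text,
address text
);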
This way you never have to worry about the UUIDs themselves; it is all done behind the scenes. Your INSERT statement would become:
INSERT INTO test (name, address) VALUES ('abc','xyz') RETURNING uno;
The RETURNING clause is optional; include it if you want to use the generated UUID to reference related data.
PostgreSQL does not allow you to simply cast an integer to the UUID type. Generally, UUIDs are generated either internally in Postgres (see http://www.postgresql.org/docs/9.3/static/uuid-ossp.html for more detail on ways to do this) or via a client program.
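As a rough illustration of the "client program" route, here is a minimal Python sketch using only the standard `uuid` module. The helper names `int_to_uuid` and `new_random_uuid` are made up for this example. Note that mapping an integer onto the 128-bit value gives you something the uuid column will accept, but it throws away the uniqueness guarantees that are the reason to use UUIDs in the first place.

```python
import uuid


def int_to_uuid(n: int) -> uuid.UUID:
    """Use a non-negative integer below 2**128 as the raw 128-bit UUID value.

    Hypothetical helper: the result is not a random or time-based UUID,
    it is just the integer written in UUID notation.
    """
    return uuid.UUID(int=n)


def new_random_uuid() -> uuid.UUID:
    """Generate a random (version 4) UUID on the client side."""
    return uuid.uuid4()


if __name__ == "__main__":
    print(int_to_uuid(1))     # 00000000-0000-0000-0000-000000000001
    print(new_random_uuid())  # different on every call
```

The string form of either value can then be passed as the `uno` parameter of an ordinary INSERT.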
If you just want unique IDs for your table itself, but don't actually need UUIDs (those are geared toward universal uniqueness, across tables and servers, etc.), you can use the serial type, which creates an implicit sequence for you which automatically increments when you INSERT into a table.
Consider you have the very simple table definition of:
CREATE TABLE first_name
(
id Integer NOT NULL,
name varchar(10),
PRIMARY KEY (id)
);
Now consider you have two rows like:
id , name
1 Dan
2 Jack
Imagine you have X processes that read the max(id) value from time to time,
then decide on the next sequential id to write for the new record.
The problem is that with multiple processes doing this, another process may already have inserted that id between the read and the write.
What is the best way in Postgres to guarantee that reading the latest id and writing the next one is atomic when multiple processes do this all the time?
I know there is the serial type (like MySQL's auto-increment), which manages sequential values for the column automatically. How does it behave when multiple processes insert without any locking of their own and rely only on the serial definition? Is that sufficient, and are we protected against concurrency problems?
Example for the second declaration from point 2:
CREATE TABLE first_name
(
id Serial,
name varchar(10),
PRIMARY KEY (id)
);
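To get the value most recently generated by the sequence in your current session (not necessarily the table-wide maximum), query it with currval: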
select currval(pg_get_serial_sequence('first_name', 'id'));
for example:
clima=# CREATE TABLE first_name
(
id serial,
name varchar(10)
);
CREATE TABLE
clima=# insert into first_name(name) select 'Diego';
INSERT 0 1
clima=# select currval(pg_get_serial_sequence('first_name', 'id'));
currval
---------
1
(1 row)
Yes and no; it depends. Do you have transactions, and with which scope?
The whole point of SERIAL is that the database solves the concurrency issue for you.
With respect to postgresql:
Here's a page from the postgresql documentation (Data Types/Numeric Types/Serial Types) which tells you that SERIAL columns are built on sequences.
Note: Because smallserial, serial and bigserial are implemented using sequences...
Here we see sequence generators: CREATE SEQUENCE, a PostgreSQL (only?) construct that lets you make your own integer sequences without tying them to an (identity) column. It discusses the semantics, which include the property that not every sequential number might appear in your sequence (because the sequence ids are generated even if the row isn't actually added to the table, e.g., if the inserting transaction is rolled back).
Because nextval and setval calls are never rolled back, sequence objects cannot be used if "gapless" assignment of sequence numbers is needed. It is possible to build gapless assignment by using exclusive locking of a table containing a counter; but this solution is much more expensive than sequence objects, especially if many transactions need sequence numbers concurrently.
(Also you can "cache" sequence generation but then you have issues with non-sequential sequence ids).
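The two properties described above (concurrent callers never get the same number, and numbers drawn by a rolled-back transaction are not given back) can be illustrated with a toy model in Python. This is only a sketch of the semantics, not how PostgreSQL implements sequences; `ToySequence`, `insert`, `demo_gap` and `demo_concurrent` are names invented for the example.

```python
import threading


class ToySequence:
    """Toy model of a sequence: nextval() is atomic and never undone."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def nextval(self) -> int:
        with self._lock:
            self._value += 1
            return self._value


def insert(table: list, seq: ToySequence, name: str, commit: bool = True) -> int:
    """Draw an id first, then keep the row only if the 'transaction' commits."""
    new_id = seq.nextval()
    if commit:
        table.append((new_id, name))
    return new_id


def demo_gap() -> list:
    """A rolled-back insert still consumes its id, leaving a gap."""
    seq, table = ToySequence(), []
    insert(table, seq, "Dan")
    insert(table, seq, "Jack", commit=False)  # rolled back: id 2 is lost
    insert(table, seq, "Diego")
    return [row_id for row_id, _ in table]


def demo_concurrent(workers: int = 8, per_worker: int = 1000) -> list:
    """Many threads draw ids at once; no id is handed out twice."""
    seq, ids = ToySequence(), []
    ids_lock = threading.Lock()

    def work() -> None:
        local = [seq.nextval() for _ in range(per_worker)]
        with ids_lock:
            ids.extend(local)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ids


if __name__ == "__main__":
    print(demo_gap())                  # [1, 3]
    ids = demo_concurrent()
    print(len(ids) == len(set(ids)))   # True
```

So the ids are unique without any table lock, but they are not guaranteed to be gapless.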
Finally, here we see you can also use the SQL-standard GENERATED ... AS IDENTITY to the same effect.
From https://stackoverflow.com/a/40597571/3284469
If you don't specify a primary key, RDBMS will help you choose an unique and non-null key, OR create an internal key (probably an int type) as primary key for this table.
Could you give some examples for the "OR" case, where an RDBMS (PostgreSQL in particular, and possibly also MySQL or SQL Server) creates an "internal key (probably an int type) as primary key" for a table without a primary key specified?
Does PostgreSQL have something similar to MySQL?
Thanks.
For Postgres:
From "5.4. System Columns":
oid
The object identifier (object ID) of a row. This column is only present if the table was created using WITH OIDS, or if the default_with_oids configuration variable was set at the time. This column is of type oid (same name as the column); see Section 8.18 for more information about the type.
and
ctid
The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. The OID, or even better a user-defined serial number, should be used to identify logical rows.
Both come close to what you're searching for but have restrictions as you can read in the documentation. So, as the manual states, using a user-defined PK is the better choice.
For SQL Server:
There is the undocumented pseudo column %%physloc%%. It describes the physical location of a row. That, however, might be subject to change if the row gets physically moved for whatever reason. And it's undocumented; that is, its behavior might change at any time between releases or even just patches, or it might be removed completely without further notice. So using a user-defined PK is the better choice here as well.
A new database is to store the log data of a (series of) web servers. The structure of the log records has been converted to this ‘naïve’ general schema:
CREATE TABLE log (
matchcode SERIAL PRIMARY KEY,
stamp TIMESTAMP WITH TIME ZONE,
ip INET,
bytes NUMERIC,
vhost TEXT,
path TEXT,
user_agent TEXT,
-- and so on
);
And so on; there are many more fields, but this shows the general principle. The bulk of the data is contained in free-text fields, as shown above. Of course, this will make the database rather big in the long run. We are talking about a web server log, so this doesn’t come as a big surprise.
The domain of those text fields is limited, though. There is e.g. a very finite set of vhosts that will be seen, a much larger, but still decidedly finite set of paths, user agents and so on. In cases like this, would it be more appropriate to factor the text fields out into sub-tables, and reference them only via identifiers? I am thinking along a line like this:
CREATE TABLE vhost ( ident SERIAL PRIMARY KEY, vhost TEXT NOT NULL UNIQUE );
CREATE TABLE path ( ident SERIAL PRIMARY KEY, path TEXT NOT NULL UNIQUE );
CREATE TABLE user_agent ( ident SERIAL PRIMARY KEY, user_agent TEXT NOT NULL UNIQUE );
CREATE TABLE log (
matchcode SERIAL PRIMARY KEY,
stamp TIMESTAMP WITH TIME ZONE,
ip INET,
bytes NUMERIC,
vhost INTEGER REFERENCES vhost ( ident ) ,
path INTEGER REFERENCES path ( ident ),
user_agent INTEGER REFERENCES user_agent ( ident ),
-- and so on
);
I have tried both approaches now. As expected, the second one is much smaller, by a factor of roughly three. However, querying it is significantly slower because of all the joins involved; the difference is about an order of magnitude.
From what I understand, the table should be sufficiently normal in both cases. At some later point in the project, there’ll maybe be additional attributes attached to the various text values (like additional information about each vhost and so on).
The practical considerations are obvious, it’s basically a space/time tradeoff. In the long run, what is considered best practice in such a case? Are there other, perhaps more theoretical implications for such a scenario that I might want to be aware of?
The domain of those text fields is limited, though. There is e.g. a
very finite set of vhosts that will be seen, a much larger, but still
decidedly finite set of paths, user agents and so on. In cases like
this, would it be more appropriate to factor the text fields out into
sub-tables, and reference them only via identifiers?
There are a couple of different ways to look at this kind of problem. Regardless of which way you look at it, appropriate is a fuzzy word.
Constraints
Let's imagine that you create a table and a foreign key constraint to allow the "vhost" column to accept only five values. Can you also constrain the web server to write only those five values to the log file? No, you can't.
You can add some code to insert new virtual hosts into the referenced table. You can even automate that with triggers. But when you do that, you're no longer constraining the values for "vhost". This remains true whether you use natural keys or surrogate keys.
Data compression with ID numbers
You can also think of this as a problem of data compression. You save space--potentially a lot of space--by using integers as foreign keys to tables of unique text. You might not save time. Queries that require a lot of joins are often slower than queries that just read data directly. You've already seen that.
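To make the "compression with ID numbers" idea concrete, here is a small runnable sketch. It is an assumption-laden stand-in: it uses Python's built-in `sqlite3` module instead of PostgreSQL so that it runs without a server, it keeps only the `vhost` and `bytes` columns, and the helpers `vhost_id`, `add_log` and `bytes_per_vhost` are invented for the example. It shows the "insert new virtual hosts into the referenced table" step and the join that every read then has to pay for.

```python
import sqlite3


def build() -> sqlite3.Connection:
    """Create a cut-down version of the factored-out schema in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE vhost (
            ident INTEGER PRIMARY KEY,
            vhost TEXT NOT NULL UNIQUE
        );
        CREATE TABLE log (
            matchcode INTEGER PRIMARY KEY,
            vhost     INTEGER REFERENCES vhost (ident),
            bytes     INTEGER
        );
        """
    )
    return conn


def vhost_id(conn: sqlite3.Connection, name: str) -> int:
    """Return the id for a vhost text, adding it to the lookup table if new."""
    conn.execute("INSERT OR IGNORE INTO vhost (vhost) VALUES (?)", (name,))
    row = conn.execute(
        "SELECT ident FROM vhost WHERE vhost = ?", (name,)
    ).fetchone()
    return row[0]


def add_log(conn: sqlite3.Connection, vhost: str, nbytes: int) -> None:
    """Store a log row that holds only the integer id of the vhost."""
    conn.execute(
        "INSERT INTO log (vhost, bytes) VALUES (?, ?)",
        (vhost_id(conn, vhost), nbytes),
    )


def bytes_per_vhost(conn: sqlite3.Connection) -> list:
    """Reading the text back requires a join against the lookup table."""
    return conn.execute(
        """
        SELECT v.vhost, SUM(l.bytes)
        FROM log AS l
        JOIN vhost AS v ON v.ident = l.vhost
        GROUP BY v.vhost
        ORDER BY v.vhost
        """
    ).fetchall()


if __name__ == "__main__":
    conn = build()
    add_log(conn, "a.example", 100)
    add_log(conn, "b.example", 50)
    add_log(conn, "a.example", 25)
    print(bytes_per_vhost(conn))
```

Each distinct vhost string is stored once, however many log rows refer to it; the price is the extra lookup on write and the join on read.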
In your case, which has to do with machine-generated log files, I prefer to store them as they come from the device (web server, flow sensor, whatever) unless there's a compelling reason not to.
In a few cases, I've worked on systems where domain experts have determined that certain kinds of values shouldn't be transferred from a log file to a database. For example, domain experts might decide that negative numbers from a sensor mean the sensor is broken. This kind of constraint is better handled with a CHECK() constraint, but the principle is the same.
Take what the device gives you unless there's a compelling reason not to.
I need to migrate a DDL from Postgres to DB2, and it has to work the same way as in Postgres. There is a table that generates values from a sequence, but the values can also be explicitly given.
Postgres
create sequence hist_id_seq;
create table benchmarksql.history (
hist_id integer not null default nextval('hist_id_seq') primary key,
h_c_id integer,
h_c_d_id integer,
h_c_w_id integer,
h_d_id integer,
h_w_id integer,
h_date timestamp,
h_amount decimal(6,2),
h_data varchar(24)
);
(Look at the sequence call in the hist_id column to define the value of the primary key)
The business logic inserts into the table by explicitly providing an ID, and in other cases, it leaves the database to choose the number.
If I change this in DB2 to GENERATED ALWAYS, it will throw errors because there are some provided values. On the other hand, if I create the table with GENERATED BY DEFAULT, DB2 will throw an error when trying to insert with the same value (SQL0803N), because the "internal sequence" does not take into account the already inserted values, and it does not retry with the next value.
And, I do not want to restart the sequence each time a provided ID was inserted.
This is the problem in BenchmarkSQL when trying to port it to DB2: https://sourceforge.net/projects/benchmarksql/ (File sqlTableCreates)
How can I implement the same database logic in DB2 as it does in Postgres (and apparently in Oracle)?
You're operating under a misconception: that sources external to the db get to dictate its internal keys. Ideally/conceptually, autogenerated ids will never need to be seen outside of the db, as conceptually there should be unique natural keys for export or reporting. Still, there are times when applications will need to manage some ids, often when setting up related entities (eg, JPA seems to want to work this way).
However, if you add an id value that you generated from a different source, the db won't be able to manage it. How could it? It's not efficient - for one thing, attempting to do so would do one of the following:
Be unsafe in the face of multiple clients (attempt to add duplicate keys)
Serialize access to the table (for a potentially slow query, too)
(This usually shows up when people attempt something like: SELECT MAX(id) + 1, which would require locking the entire table for thread safety, likely including statements that don't even touch that column. If you try to find any "first-unused" id - trying to fill gaps - this gets more complicated and problematic)
Neither is ideal, so it's best to not have the problem in the first place. This is usually done by having id columns be autogenerated, but (as pointed out earlier) there are situations where we may need to know what the id will be before we insert the row into the table. Fortunately, there's a standard SQL object for this, SEQUENCE. This provides a db-managed, thread-safe, fast way to get ids. It appears that in PostgreSQL you can use sequences in the DEFAULT clause for a column, but DB2 doesn't allow it. If you don't want to specify an id every time (it should be autogenerated some of the time), you'll need another way; this is the perfect time to use a BEFORE INSERT trigger:
CREATE TRIGGER Add_Generated_Id NO CASCADE BEFORE INSERT ON benchmarksql.history
REFERENCING NEW AS Incoming_Entity
FOR EACH ROW
WHEN (Incoming_Entity.hist_id IS NULL)
SET Incoming_Entity.hist_id = NEXT VALUE FOR hist_id_seq
(something like this - not tested. You didn't specify where in the project this would belong)
So, if you then add a row with something like:
INSERT INTO benchmarksql.history (hist_id, h_data) VALUES(null, 'a')
or
INSERT INTO benchmarksql.history (h_data) VALUES('a')
an id will be generated and attached automatically. Note that ALL ids added to the table must come from the given sequence (as #mustaccio pointed out, this appears to be true even in PostgreSQL), or any UNIQUE CONSTRAINT on the column will start throwing duplicate-key errors. So any time your application needs an id before inserting a row in the table, you'll need some form of
SELECT NEXT VALUE FOR hist_id_seq
FROM sysibm.sysdummy1
... and that's it, pretty much. This is completely thread and concurrency safe, will not maintain/require long-term locks, nor require serialized access to the table.
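Why every id has to come from the sequence can be seen in a toy Python model. This is only a sketch of the logic, not DB2 or PostgreSQL behavior in detail: `ToyTable`, `DuplicateKeyError` and `next_id` are invented names, the dictionary stands in for the table with its unique key, and `itertools.count` stands in for `hist_id_seq`.

```python
import itertools


class DuplicateKeyError(Exception):
    """Stand-in for the database's duplicate-key error."""


class ToyTable:
    def __init__(self) -> None:
        self.rows = {}                    # hist_id -> data, keys are unique
        self._seq = itertools.count(1)    # stand-in for hist_id_seq

    def next_id(self) -> int:
        """What the application does when it needs an id before inserting."""
        return next(self._seq)

    def insert(self, data: str, hist_id=None) -> int:
        if hist_id is None:               # the BEFORE INSERT trigger's job
            hist_id = self.next_id()
        if hist_id in self.rows:          # the unique constraint's job
            raise DuplicateKeyError(hist_id)
        self.rows[hist_id] = data
        return hist_id


if __name__ == "__main__":
    t = ToyTable()
    reserved = t.next_id()                # id drawn from the sequence: safe
    t.insert("a", hist_id=reserved)
    t.insert("b")                         # id generated automatically
    t.insert("x", hist_id=3)              # hand-picked id, sequence unaware
    try:
        t.insert("c")                     # sequence now hands out 3 as well
    except DuplicateKeyError as exc:
        print("duplicate key:", exc)
```

The first two inserts cooperate because both ids came from the same sequence; the hand-picked id is what later causes the duplicate-key error.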
I need to develop a key/value backend, something like this:
Table T1 id-PK, Key - string, Value - string
INSERT INTO T1 (Key, Value) VALUES ('String1', 'Value1');
INSERT INTO T1 (Key, Value) VALUES ('String1', 'Value2');
Table T2 id-PK2, id2->external key to id
some other data in T2, which references data in T1 (like users which have those K/V etc)
I heard about PostgreSQL hstore with GIN/GiST indexes. Which is better (performance-wise)?
Doing this the traditional way with SQL joins and separate columns (Key/Value)?
Does PostgreSQL hstore perform better in this case?
The format of the data should be any key=>any value.
I also want to do text matching, e.g. partial searches (LIKE with % in SQL, or the hstore equivalent).
I plan to have around 1M-2M entries in it and probably scale at some point.
What do you recommend? The traditional SQL way, PostgreSQL hstore, or some other distributed key/value store with persistence?
If it helps, my server is a VPS with 1-2 GB of RAM, so the hardware is not very good. I was also thinking of adding a cache layer on top of this, but I think that would complicate the problem. I just want good performance for 2M entries. Updates will be frequent, but searches even more so.
Thanks.
Your question is unclear because you're not clear about your objective.
The key here is the index (pun intended) - if you're dealing with a large number of keys, you want to be able to retrieve them with the fewest lookups and without pulling up unrelated data.
The short answer is that you probably don't want to use hstore, but let's look at this in more detail...
Does each id have many key/value pairs (hundreds+)? Don't use hstore.
Will any of your values contain large blocks of text (4kb+)? Don't use hstore.
Do you want to be able to search by keys in wildcard expressions? Don't use hstore.
Do you want to do complex joins/aggregation/reports? Don't use hstore.
Will you update the value for a single key? Don't use hstore.
Multiple keys with the same name under an id? Can't use hstore.
So what's the use of hstore? Well, one good scenario would be if you wanted to hold key/value pairs for an external application where you know you always want to retrieve all key/values and will always save the data back as a block (i.e., it's never edited in place). At the same time you do want some flexibility to be able to search this data - albeit very simply - rather than storing it in, say, a block of XML or JSON. In this case, since the number of key/value pairs is small, you save on space because you're compressing several tuples into one hstore.
Consider this as your table:
CREATE TABLE kv (
id /* SOME TYPE */ PRIMARY KEY,
key_name TEXT NOT NULL,
key_value TEXT,
UNIQUE(id, key_name)
);
I think the design is poorly normalized. Try something more like this:
CREATE TABLE t1
(
t1_id serial PRIMARY KEY,
<other data which depends on t1_id and nothing else>,
-- possibly an hstore, but maybe better as a separate table
t1_props hstore
);
-- if properties are done as a separate table:
CREATE TABLE t1_properties
(
t1_id int NOT NULL REFERENCES t1,
key_name text NOT NULL,
key_value text,
PRIMARY KEY (t1_id, key_name)
);
If the properties are small and you don't need to use them heavily in joins or with fancy selection criteria, an hstore may suffice. Elliot laid out some sensible things to consider in that regard.
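For the separate-table variant, a wildcard search on key names is an ordinary LIKE query. The following sketch again uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL (an assumption made so that it runs without a server; SQLite's LIKE is case-insensitive for ASCII by default, unlike PostgreSQL's). The helpers `set_prop` and `find_by_key_pattern` are made up for the example, and `t1` is reduced to its key column.

```python
import sqlite3


def build() -> sqlite3.Connection:
    """Create the t1 / t1_properties pair in memory (t1 cut down to its key)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE t1 (t1_id INTEGER PRIMARY KEY);
        CREATE TABLE t1_properties (
            t1_id     INTEGER NOT NULL REFERENCES t1,
            key_name  TEXT NOT NULL,
            key_value TEXT,
            PRIMARY KEY (t1_id, key_name)
        );
        """
    )
    return conn


def set_prop(conn: sqlite3.Connection, t1_id: int, key: str, value: str) -> None:
    """Insert or overwrite the value of a single key for one t1 row."""
    conn.execute("INSERT OR IGNORE INTO t1 (t1_id) VALUES (?)", (t1_id,))
    conn.execute(
        "INSERT OR REPLACE INTO t1_properties (t1_id, key_name, key_value) "
        "VALUES (?, ?, ?)",
        (t1_id, key, value),
    )


def find_by_key_pattern(conn: sqlite3.Connection, pattern: str) -> list:
    """Wildcard search on key names, e.g. pattern 'col%'."""
    return conn.execute(
        "SELECT t1_id, key_name, key_value FROM t1_properties "
        "WHERE key_name LIKE ? ORDER BY t1_id, key_name",
        (pattern,),
    ).fetchall()


if __name__ == "__main__":
    conn = build()
    set_prop(conn, 1, "color", "red")
    set_prop(conn, 1, "colour_hint", "dark")
    set_prop(conn, 2, "size", "L")
    set_prop(conn, 1, "color", "blue")   # updates one key in place
    print(find_by_key_pattern(conn, "col%"))
```

Updating a single key touches one row here, which is the case the earlier answer flags as a poor fit for hstore.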
Your reference to users suggests that this is incomplete, but you didn't really give enough information to suggest where those belong. You might get by with an array in t1, or you might be better off with a separate table.