PostgreSQL: increment if a row exists or create a new one

Hello, I have a simple table like this:
+------------+------------+----------------------+----------------+
|id (serial) | date(date) | customer_fk(integer) | value(integer) |
+------------+------------+----------------------+----------------+
I want to use each row as a daily accumulator: when a value arrives for a customer,
if no record exists for that customer and date, create a new row for them; if one does exist, just increment its value.
I don't know how to implement something like that. I only know how to increment a value using SET, but more logic is required here. Thanks in advance.
I'm using version 9.4

It sounds like what you want to do is an UPSERT.
http://www.postgresql.org/docs/devel/static/sql-insert.html
In this type of query, you update the record if it exists or you create a new one if it does not. The key in your table would consist of customer_fk and date.
This would be a normal insert, but with ON CONFLICT DO UPDATE SET value = value + 1.
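A minimal sketch of what that could look like, assuming the table is named mytable (as in the CTE example below) and has a unique constraint covering (customer_fk, date):
INSERT INTO mytable (date, customer_fk, value)
VALUES (CURRENT_DATE, 24, 1)
ON CONFLICT (customer_fk, date)
DO UPDATE SET value = mytable.value + 1;   -- increment the existing row instead of failing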
NOTE: This only works as of Postgres 9.5. It is not possible in previous versions. For versions prior to 9.1, the only solution is two steps. For 9.1 or later, a CTE may be used as well.
For earlier versions of Postgres, you will need to perform an UPDATE first with customer_fk and date in the WHERE clause, then check the number of affected rows: if it is 0, do the INSERT. The only problem with this is the chance of a race condition when the operation happens twice at nearly the same time (common in a web environment): the INSERT can fail for one of the two, so your count always has a chance of being slightly off.
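A rough sketch of that two-step approach (using the same table and sample values as the example further down):
-- Step 1: try to increment an existing row for this customer and date.
UPDATE mytable
SET value = value + 1
WHERE customer_fk = 24 AND date = CURRENT_DATE;

-- Step 2: only if the UPDATE reported 0 affected rows, insert a new one.
-- A concurrent session can still slip in between the two statements, so be
-- prepared to retry on a unique violation if you have a unique constraint.
INSERT INTO mytable (date, customer_fk, value)
VALUES (CURRENT_DATE, 24, 1);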
If you are using Postgres 9.1 or above, you can use an updatable CTE as cleverly pointed out here: Insert, on duplicate update in PostgreSQL?
This solution is less likely to result in a race condition since it's executed in one step.
WITH new_values (date, customer_fk, value) AS (
    VALUES (CURRENT_DATE, 24, 1)   -- the incoming date, customer and amount
),
upsert AS (
    UPDATE mytable m
    SET value = m.value + 1
    FROM new_values nv
    WHERE m.date = nv.date AND m.customer_fk = nv.customer_fk
    RETURNING m.*
)
INSERT INTO mytable (date, customer_fk, value)
SELECT date, customer_fk, value
FROM new_values
WHERE NOT EXISTS (SELECT 1
                  FROM upsert up
                  WHERE up.date = new_values.date
                    AND up.customer_fk = new_values.customer_fk);
This contains two CTEs. One holds the data you are inserting (new_values) and the other holds the result of an UPDATE using those values (upsert). The last part uses these two to check whether the records in new_values are absent from upsert, which would mean the UPDATE matched nothing, and performs an INSERT to create the record instead.
As a side note, if you were doing this in another SQL engine that conforms to the standard, you would use a MERGE query instead. [ https://en.wikipedia.org/wiki/Merge_(SQL) ]
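For reference only, the standard MERGE form of this upsert might look roughly like the following (not available in the asker's 9.4; PostgreSQL itself only gained MERGE in version 15):
MERGE INTO mytable m
USING (VALUES (CURRENT_DATE, 24, 1)) AS nv (date, customer_fk, value)
ON m.customer_fk = nv.customer_fk AND m.date = nv.date
WHEN MATCHED THEN
    UPDATE SET value = m.value + 1
WHEN NOT MATCHED THEN
    INSERT (date, customer_fk, value) VALUES (nv.date, nv.customer_fk, nv.value);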

Related

Postgresql: increment column with unique values without constraint violation?

I am trying to implement a table with revision history in Postgresql as follows:
table has a multi-column primary key or unique constraint on columns id and rev (both numeric)
to create a new entry, insert data, have id auto-generated and rev set to 0
to update an existing entry, insert a new row with the previous id and rev set to -1, then increment the rev on all entries with that id by 1
to get the latest version, select by id and rev = 0
The problem that I am facing is the update after the insert; unsurprisingly, Postgresql sometimes raises a "duplicate key" error when rows are updated in the wrong order.
Since there is no ORDER BY available on UPDATE, I searched for other solutions and noticed that the updates pass without errors if the starting point is the result set of a subquery:
UPDATE test
SET rev = test.rev + 1
FROM (
    SELECT rev
    FROM test
    WHERE id = 99
    ORDER BY rev DESC
) AS prev
WHERE test.rev = prev.rev
The descending order ensures that the greater values get incremented first, so that the lower values do not violate the unique constraint when they get updated.
One catch, though: I can't tell from the documentation whether this works due to some implementation detail (which might change without notice in the future) or is indeed guaranteed by the language specification - can someone explain?
I was also wondering whether it is performance-wise better to have the rev column in the index (as described above, which leads to at least a partial index rebuild on every update, but maybe also to faster reads) or to define a (non-unique) index on id only and ignore the performance impact that could be caused by an (initially) larger query set. (I am expecting a rather low revision count per unique id on average, maybe 5.)

INSERT INTO .. SELECT causing possible race condition?

INSERT INTO A
SELECT * FROM B WHERE timestamp > (SELECT max(timestamp) FROM A);
or, written differently:
WITH selection AS
(SELECT * FROM B WHERE timestamp > (SELECT max(timestamp) FROM A))
INSERT INTO A SELECT * FROM selection;
If these queries run multiple times simultaneously, is it possible that I will end up with duplicated rows in A?
How does Postgres process these queries? Is it one or multiple?
If it is multiple queries (find max(timestamp)[1], select[2] then insert[3]) I can imagine this will cause duplicated rows.
If that is correct, would wrapping it in BEGIN/END (a transaction) help?
Yes, that might result in duplicate values.
A single statement sees a consistent view of the data in all tables as of the point in time when the statement started.
Wrapping that single statement in a transaction won't change that (a single statement is always executed atomically, regardless of the number of sub-queries involved).
The statement will never see uncommitted data from other transactions, which is the root cause of why you can wind up with duplicate values.
The only safe way to avoid duplicate values, is to create a unique constraint (or index) on that column. In that case the INSERT would result in an error if such a value already exists.
If you want to avoid the error, use insert ... on conflict
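A minimal sketch of that, assuming A already has a suitable unique constraint in place:
-- Conflicting rows are skipped instead of aborting the whole INSERT.
INSERT INTO A
SELECT * FROM B WHERE timestamp > (SELECT max(timestamp) FROM A)
ON CONFLICT DO NOTHING;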
This depends on the isolation level set in your database; see the PostgreSQL documentation on transaction isolation.
By default this is Read Committed, which means each statement sees a snapshot of the data taken when that statement begins. If two of these queries read before either one writes, you will get duplicate data in these tables.
If you want to avoid having duplicate entries, you have a few options.
Try using the isolation level Serializable (a sketch follows this list).
Apply a unique index on a column in A that identifies the source row from B. The timestamp is not a great contender, as you might legitimately have two rows with the same timestamp; the id carried over from B is probably a good option.
Take a lock at the application level before performing such a query.
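A minimal sketch of the Serializable option; the retry handling is left to the application:
BEGIN ISOLATION LEVEL SERIALIZABLE;

INSERT INTO A
SELECT * FROM B WHERE timestamp > (SELECT max(timestamp) FROM A);

COMMIT;
-- If the INSERT or COMMIT fails with a serialization error (SQLSTATE 40001),
-- simply retry the whole transaction.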

Proper table to track employee changes over time?

I have been using Python to do this in memory, but I would like to know the proper way to set up an employee mapping table in Postgres.
row_id | employee_id | other_id | other_dimensions | effective_date | expiration_date | is_current
Unique constraint on (employee_id, other_id), so a new row would be inserted whenever there is a change
I would want the expiration date from the previous row to be updated to the new effective_date minus 1 day, and the is_current should be updated to False
Ultimate purpose is to be able to map each employee back accurately on a given date
Would love to hear some best practices so I can move away from my file-based method where I read the whole roster into memory and use pandas to make changes, then truncate the original table and insert the new one.
Here's a general example built using the column names you provided that I think does more or less what you want. Don't treat it as a literal ready-to-run solution, but rather an example of how to make something like this work that you'll have to modify a bit for your own actual use case.
The rough idea is to make an underlying raw table that holds all your data, and establish a view on top of this that gets used for ordinary access. You can still use the raw table to do anything you need to do to or with the data, no matter how complicated, but the view provides more restrictive access for regular use. Rules are put in place on the view to enforce these restrictions and perform the special operations you want. While it doesn't sound like it's significant for your current application, it's important to note that these restrictions can be enforced via PostgreSQL's roles and privileges and the SQL GRANT command.
We start by making the raw table. Since the is_current column is likely to be used for reference a lot, we'll put an index on it. We'll take advantage of PostgreSQL's SERIAL type to manage our raw table's row_id for us. The view doesn't even need to reference the underlying row_id. We'll default the is_current to a True value as we expect most of the time we'll be adding current records, not past ones.
CREATE TABLE raw_employee (
    row_id SERIAL PRIMARY KEY,
    employee_id INTEGER,
    other_id INTEGER,
    other_dimensions VARCHAR,
    effective_date DATE,
    expiration_date DATE,
    is_current BOOLEAN DEFAULT TRUE
);

CREATE INDEX employee_is_current_index ON raw_employee (is_current);
Now we define our view. To most of the world this will be the normal way to access employee data. Internally it's a special SELECT run on-demand against the underlying raw_employee table that we've already defined. If we had reason to, we could further refine this view to hide more data (it's already hiding the low-level row_id as mentioned earlier) or display additional data produced either via calculation or relations with other tables.
CREATE OR REPLACE VIEW employee AS
    SELECT employee_id, other_id,
           other_dimensions, effective_date, expiration_date,
           is_current
    FROM raw_employee;
Now our rules. We construct these so that whenever someone tries an operation against our view, internally it'll perform an operation against our raw table according to the restrictions we define. First, INSERT; it mostly just passes the data through unchanged, but it has to account for the hidden row_id:
CREATE OR REPLACE RULE employee_insert AS ON INSERT TO employee DO INSTEAD
    INSERT INTO raw_employee VALUES (
        NEXTVAL('raw_employee_row_id_seq'),
        NEW.employee_id, NEW.other_id,
        NEW.other_dimensions,
        NEW.effective_date, NEW.expiration_date,
        NEW.is_current
    );
The NEXTVAL part enables us to lean on PostgreSQL for row_id handling. Next is our most complicated one: UPDATE. Per your described intent, it has to match against employee_id, other_id pairs and perform two operations: updating the old record to be no longer current, and inserting a new record with updated dates. You didn't specify how you wanted to manage new expiration dates, so I took a guess. It's easy to change it.
CREATE OR REPLACE RULE employee_update AS ON UPDATE TO employee DO INSTEAD (
    UPDATE raw_employee SET is_current = FALSE
        WHERE raw_employee.employee_id = OLD.employee_id AND
              raw_employee.other_id = OLD.other_id;
    INSERT INTO raw_employee VALUES (
        NEXTVAL('raw_employee_row_id_seq'),
        COALESCE(NEW.employee_id, OLD.employee_id),
        COALESCE(NEW.other_id, OLD.other_id),
        COALESCE(NEW.other_dimensions, OLD.other_dimensions),
        COALESCE(NEW.effective_date, OLD.expiration_date - '1 day'::INTERVAL),
        COALESCE(NEW.expiration_date, OLD.expiration_date + '1 year'::INTERVAL),
        TRUE
    );
);
The use of COALESCE enables us to update columns that have explicit updates, but keep old values for ones that don't. Finally, we need to make a rule for DELETE. Since you said you want to ensure you can track employee histories, the best way to do this is also the simplest: we just disable it.
CREATE OR REPLACE RULE employee_delete_protect AS
ON DELETE TO employee DO INSTEAD NOTHING;
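As mentioned earlier, roles and privileges can back this up so that regular users only ever go through the view. A rough sketch, with a made-up role name:
-- Hypothetical application role; it never touches the raw table directly.
-- CREATE ROLE app_user LOGIN;   -- assumed to exist already
REVOKE ALL ON raw_employee FROM app_user;
GRANT SELECT, INSERT, UPDATE ON employee TO app_user;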
Now we ought to be able to insert data into our raw table by performing INSERT operations on our view. Here are two sample employees; the first has a few weeks left but the second is about to expire. Note that at this level we don't need to care about the row_id. It's an internal implementation detail of the lower level raw table.
INSERT INTO employee VALUES (
    1, 1,
    'test', CURRENT_DATE - INTERVAL '1 week', CURRENT_DATE + INTERVAL '3 weeks',
    TRUE
);

INSERT INTO employee VALUES (
    2, 2,
    'another test', CURRENT_DATE - INTERVAL '1 month', CURRENT_DATE,
    TRUE
);
The final example is deceptively simple after all the build-up that we've done. It performs an UPDATE operation on the view, and internally it results in an update to the existing employee #2 plus a new entry for employee #2.
UPDATE employee SET expiration_date = CURRENT_DATE + INTERVAL '1 year'
WHERE employee_id = 2 AND other_id = 2;
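A quick way to see the effect (not part of the original flow) is to look at employee #2 through the view:
-- Expect two rows: the superseded record with is_current = FALSE
-- and the newly inserted current one.
SELECT * FROM employee WHERE employee_id = 2 AND other_id = 2;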
Again I'll stress that this isn't meant to just take and use without modification. There should be enough info here though for you to make something work for your specific case.

Update statement where one column depends on another updated column

I want to update two columns in my table, one of them depends on the calculation of another updated column. The calculation is rather complex, so I don't want to repeat that every time, I just want to use the newly updated value.
CREATE TABLE test (
    A int,
    B int,
    C int,
    D int
);

INSERT INTO test VALUES (0, 0, 5, 10);

UPDATE test
SET B = C * D * 100,
    A = B / 100;
So my question: is it even possible to get 50 as the value for column A in just one query?
Another option would be to use persistent computed columns, but will that work when I have dependencies on another computed column?
You can't achieve what you are trying to do in a single query. This is due to a concept called all-at-once operations, which translates to: "In SQL Server, operations which appear in the same logical phase are evaluated at the same time."
The operations below won't yield the result you are expecting:
insert into table1
values (t1, t1 + 100, t1 + 200)   -- SQL won't use a "new" incremented t1 in the later expressions
-- the same goes for UPDATE:
update t1
set t1 = t1 * 100,
    t2 = t1                       -- SQL won't use the updated t1 value (* 100); t2 gets the old t1
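To make that concrete with the table from the question (C = 5, D = 10, B initially 0), both assignments see the old values:
UPDATE test
SET B = C * D * 100,   -- old C and D: 5 * 10 * 100 = 5000
    A = B / 100;       -- old B (0), not the freshly computed 5000, so A ends up 0, not 50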
References:
TSQL Querying by Itzik Ben-Gan

SQLite - a smart way to remove and add new objects

I have a table in my database and I want each row to have a unique id, with the rows numbered sequentially.
For example: I have 10 rows, each with an id starting from 0 and ending at 9. When I remove a row from the table, let's say row number 5, a "hole" appears. Afterwards I add more data, but the "hole" is still there.
It is important for me to know the exact number of rows and to have data at every id, so that I can access the table at arbitrary positions.
Is there a way in SQLite to do this? Or do I have to manage the removing and adding of data manually?
Thank you in advance,
Ilya.
It may be worth considering whether you really want to do this. Primary keys usually should not change through the lifetime of the row, and you can always find the total number of rows by running:
SELECT COUNT(*) FROM table_name;
That said, the following trigger should "roll down" every ID number whenever a delete creates a hole:
CREATE TRIGGER sequentialize_ids AFTER DELETE ON table_name FOR EACH ROW
BEGIN
    UPDATE table_name SET id = id - 1 WHERE id > OLD.id;
END;
I tested this on a sample database and it appears to work as advertised. If you have the following table:
id name
1 First
2 Second
3 Third
4 Fourth
And delete where id=2, afterwards the table will be:
id name
1 First
2 Third
3 Fourth
This trigger can take a long time and has very poor scaling properties (it takes longer for each row you delete and each remaining row in the table). On my computer, deleting 15 rows at the beginning of a 1000 row table took 0.26 seconds, but this will certainly be longer on an iPhone.
I strongly suggest that you re-think your design. In my opinion you're asking for trouble in the future (e.g. if you create another table and want to have some relations between the tables).
If you want to know the number of rows just use:
SELECT count(*) FROM table_name;
If you want to access rows in the order of id, just define this field using PRIMARY KEY constraint:
CREATE TABLE test (
id INTEGER PRIMARY KEY,
...
);
and get rows using ORDER BY clause with ASC or DESC:
SELECT * FROM table_name ORDER BY id ASC;
Sqlite creates an index for the primary key field, so this query is fast.
I think that you would be interested in reading about LIMIT and OFFSET clauses.
The best source of information is the SQLite documentation.
If you don't want to take Stephen Jennings's very clever but performance-killing approach, just query a little differently. Instead of:
SELECT * FROM mytable WHERE id = ?
Do:
SELECT * FROM mytable ORDER BY id LIMIT 1 OFFSET ?
Note that OFFSET is zero-based, so you may need to subtract 1 from the variable you're indexing in with.
If you want to reclaim deleted row ids, the VACUUM command or the auto_vacuum pragma may be what you seek:
http://www.sqlite.org/faq.html#q12
http://www.sqlite.org/lang_vacuum.html
http://www.sqlite.org/pragma.html#pragma_auto_vacuum