How to maintain record history on table with one-to-many relationships? - postgresql

I have a "services" table for detailing services that we provide. Among the data that needs recording are several small one-to-many relationships (all with a foreign key constraint to the service_id) such as:
service_owners -- user_ids responsible for delivery of service
service_tags -- e.g. IT, Records Management, Finance
customer_categories -- ENUM value
provider_categories -- ENUM value
software_used -- self-explanatory
The problem I have is that I want to keep a history of updates to a service, for which I'm using an update trigger on the table, that performs an insert into a history table matching the original columns. However, if a normalized approach to the above data is used, with separate tables and foreign keys for each one-to-many relationship, any update on these tables will not be recognised in the history of the service.
Does anyone have any suggestions? It seems like I need to store child keys in the service table to maintain the integrity of the service history. Is a delimited text field a valid approach here or, as I am using postgreSQL, perhaps arrays are also a valid option? These feel somewhat dirty though!
Thanks.

If your table is:
create table T (
ix int identity primary key,
val nvarchar(50)
)
And your history table is:
create table THistory (
ix int identity primary key,
val nvarchar(50),
updateType char(1), -- C=Create, U=Update or D=Delete
updateTime datetime,
updateUsername sysname
)
Then you just need to put an update trigger on all tables of interest. You can then find out what the state of any/all of the tables were at any point in history, to determine what the relationships were at that time.

I'd avoid using arrays in any database whenever possible.
I don't like updates for the exact reason you are saying here...you lose information as it's over written. My answer is quite simple...don't update. Not sure if you're at a point where this can be implemented...but if you can I'd recommend using the main table itself to store historical (no need for a second set of history tables).
Add a column to your main header table called 'active'. This can be a character or a bit (0 is off and 1 is on). Then it's a bit of trigger magic...when an update is preformed, you insert a row into the table identical to the record being over-written with a status of '0' (or inactive) and then update the existing row (this process keeps the ID column on the active record the same, the newly inserted record is the inactive one with a new ID).
This way no data is ever lost (admittedly you are storing quite a few rows...) and the history can easily be viewed with a select where active = 0.
The pain here is if you are working on something already implemented...every existing query that hits this table will need to be updated to include a check for the active column. Makes this solution very easy to implement if you are designing a new system, but a pain if it's a long standing application. Unfortunately existing reports will include both off and on records (without throwing an error) until you can modify the where clause

Related

PostgreSQL - on conflict update for GENERATED ALWAYS AS IDENTITY

I have a table of values which I want to manage with my application ...
let's say this is the table
CREATE TABLE student (
id_student int4 NOT NULL GENERATED ALWAYS AS IDENTITY,
id_teacher int2
student_name varchar(255),
age int2
CONSTRAINT provider_pk PRIMARY KEY (id_student)
);
In the application, each teacher can see the list of all his/her students .. and they can edit or add new students
I am trying to figure out how to UPSERT data in the table in PostgreSQL ... what I am doing now is for each teacher (after the manipulation in the app) they are allowed to edit on FE only in JS (without the necessity of saving each change individually)... so after the edit, they click SAVE button and that's the time I need to store the changes and new records in the DB ...
what I do now, is I delete all records for that particular teacher and store the new object/array they created (by editing, adding, .. whatever) - so it's easy and I don't have to check for changes and new records ... the drawbacks is a brutal waste of the sequence for ID_STUDENT (autogenerated on the DB side) and of course a huge overhead on indexes while inserting (=rebuilding) considering there will be a lot of teachers saving a lot of their students .. that might cause some perf issues ... not to mention the fragmenting (HWM) so I would have to VACUUM regularly on this table
In Oracle, I could easily use MERGE INTO (which is fantastic for this use case) but the MERGE is not in the PostgreSQL :(
the only thing I know about is the INSERT ON CONFLICT UPDATE ... but the problem is, how am I supposed to apply this on GENERATED ALWAYS AS IDENTITY key? I do not provide this sequence (on top of that I don't even know the latest number) and therefore I cannot trigger the ON CONFLICT (id_student) ....
is there any nice way out of this sh*t ? Or DELETE / INSERT is really the way to go?
You shouldn't be too worried about the data churn – after all, an UPDATE also writes a new version of the row, so it wouldn't be that much different. And the sequence is no problem, because you used bigint for the primary key, right (anything else would have been a mistake)?
If you want to use INSERT ... ON CONFLICT in combination with an auto-generated sequence, you need some way beside the primary key to identify a row, that is, you need a UNIQUE constraint that you can use with ON CONFLICT. If there is no candidate for such a constraint, how can you identify the records for the teacher?

Storing duplicate data as a column in Postgres?

In some database project, I have a users table which somehow has a computed value avg_service_rating. And there is another table called services with all the services associated to the user and the ratings for that service. Is there a computationally-lite way which I can maintain the avg_service_rating rating without updating it every time an INSERT is done on the services table? Perhaps like a generate column but with a function call instead? Any direct advice or link to resources will be greatly appreciated as well!
CREATE TABLE users (
username VARCHAR PRIMARY KEY,
avg_service_ratings NUMERIC -- is it possible to store some function call for this column?,
...
);
CREATE TABLE service (
username VARCHAR NOT NULL REFERENCE users (username);
service_date DATE NOT NULL,
rating INTEGER,
PRIMARY KEY (username, service_date),
);
If the values should be consistent, a generated column won't fit the bill, since it is only recomputed if the row itself is modified.
I see two solutions:
have a trigger on the services table that updates the users table whenever a rating is added or modified. That slows down data modifications, but not your queries.
Turn users into a view. The original users table would be renamed, and it loses the avg_service_rating column, which is computed on the fly by the view.
To make the illusion perfect, create an INSTEAD OF INSERT OR UPDATE OR DELETE trigger on the view that modifies the underlying table. Then your application does not need to be changed.
With this solution you pay a certain price both on SELECT and on data modifications, but the latter price will be lower, since you don't have to modify two tables (and users might receive fewer modifications than services). An added advantage is that you avoid data duplication.
A generated column would only be useful if the source data is in the same table row.
Otherwise your options are a view (where you could call a function or calculate the value via a subquery), or an AFTER UPDATE OR INSERT trigger on the service table, which updates users.avg_service_ratings. With a trigger, if you get a lot of updates on the service table you'd need to consider possible concurrency issues, but it would mean the figure doesn't need to be calculated every time a row in the users table is accessed.

DB2 access specific row, in an non Unique table, for update / delete operations

Can I do row-specific update / delete operations in a DB2 table Via SQL, in a NON QUNIQUE Primary Key Context?
The Table is a PHYSICAL FILE on the NATIVE SYSTEM of the AS/400.
It was, like many other Files, created without the unique definition, which leads DB2 to the conclusion, that The Table, or PF has no qunique Key.
And that's my problem. I can't override the structure of the table to insert a unique ID ROW, because, I would have to recompile ALL my correlating Programs on the AS/400, which is a serious issue, much things would not work anymore, "perhaps". Of course, I can do that refactoring for one table, but our system has thousands of those native FILES, some well done with Unique Key, some without Unique definition...
Well, I work most of the time with db2 and sql on that old files. And all files which have a UNIQUE Key are no problem for me to do those important update / delete operations.
Is there some way to get an additional column to every select with a very unique row id, respective row number. And in addition, what is much more important, how can I update this RowNumber.
I did some research and meanwhile I assume, that there is no chance to do exact alterations or deletes, when there is no unique key present. What I would wish would be some additional ID-ROW which is always been sent with the table, which I can Refer to when I do my update / delete operations. Perhaps my thinking here has an fallacy as non Unique Key Tables are purposed to be edited in other ways.
Try the RRN function.
SELECT RRN(EMPLOYEE), LASTNAME
FROM EMPLOYEE
WHERE ...;
UPDATE EMPLOYEE
SET ...
WHERE RRN(EMPLOYEE) = ...;

PostgreSQL 9.5 ON CONFLICT DO UPDATE command cannot affect row a second time

I have a table from which I want to UPSERT into another, when try to launch the query, I get the "cannot affect row a second time" error. So I tried to see if I have some duplicate on my first table regarding the field with the UNIQUE constraint, and I have none. I must be missing something, but since I cannot figure out what (and my query is a bit complex because it is including some JOIN), here is the query, the field with the UNIQUE constraint is "identifiant_immeuble" :
with upd(a,b,c,d,e,f,g,h,i,j,k) as(
select id_parcelle, batimentimmeuble,etatimmeuble,nb_loc_hab_ma,nb_loc_hab_ap,nb_loc_pro, dossier.id_dossier, adresse.id_adresse, zapms.geom, 1, batimentimmeuble2
from public.zapms
left join geo_pays_gex.dossier on dossier.designation_siea=zapms.id_dossier
left join geo_pays_gex.adresse on adresse.id_voie=(select id_voie from geo_pays_gex.voie where (voie.designation=zapms.nom_voie or voie.nom_quartier=zapms.nom_quartier) and voie.rivoli=lpad(zapms.rivoli,4,'0'))
and adresse.num_voie=zapms.num_voie
and adresse.insee=zapms.insee_commune::integer
)
insert into geo_pays_gex.bal2(identifiant_immeuble, batimentimmeuble, id_etat_addr, nb_loc_hab_ma, nb_loc_hab_ap, nb_loc_pro, id_dossier, id_adresse, geom, raccordement, batimentimmeuble2)
select a,b,c,d,e,f,g,h,i,j,k from upd
on conflict (identifiant_immeuble) do update
set batimentimmeuble=excluded.batimentimmeuble, id_etat_addr=excluded.id_etat_addr, nb_loc_hab_ma=excluded.nb_loc_hab_ma, nb_loc_hab_ap=excluded.nb_loc_hab_ap, nb_loc_pro=excluded.nb_loc_pro,
id_dossier=excluded.id_dossier, id_adresse=excluded.id_adresse,geom=excluded.geom, raccordement=1, batimentimmeuble2=excluded.batimentimmeuble2
;
As you can see, I use several intermediary tables in this query : one to store the street's names (voie), one related to this one storing the adresses (adresse, basically numbers related through a foreign key to the street's names table), and another storing some other datas related to the projects' names (dossier).
I don't know what other information I could give to help find an answer, I guess it is better I do not share the actual content of my tables since it may touch some privacy regulations or such.
Thanks for your attention.
EDIT : I found a workaround by deleting the entries present in the zapms table from the bal2 table, as such
delete from geo_pays_gex.bal2 where bal2.identifiant_immeuble in (select id_parcelle from zapms);
it is not entirely satisfying though, since I would have prefered to keep track of the data creator and the date of creation, as much as the fact that the data has been modified (I have some fields to store this information) and here I simply erase all this history... And I have another table with the primary key of the bal2 table as a foreign key. I am still in the DB creation so I can afford to truncate this table, but in production it wouldn't be possible since I would lose some datas.

How to use BULK INSERT when rows depend on foreign keys values?

My question is related to this one I asked on ServerFault.
Based on this, I've considered the use of BULK INSERT. I now understand that I have to prepare myself a file for each entities I want to save into the database. No matter what, I still wonder whether this BULK INSERT will avoid the memory issue on my system as described in the referenced question on ServerFault.
As for the Streets table, it's quite simple! I have only two cities and five sectors to care about as the foreign keys. But then, how about the Addresses? The Addresses table is structured like this:
AddressId int not null identity(1,1) primary key
StreetNumber int null
NumberSuffix_Value int not null DEFAULT 0
StreetId int null references Streets (StreetId)
CityId int not null references Cities (CityId)
SectorId int null references Sectors (SectorId)
As I said on ServerFault, I have about 35,000 addresses to insert. Shall I memorize all the IDs? =P
And then, I now have the citizen people to insert who have an association with the addresses.
PersonId int not null indentity(1,1) primary key
Surname nvarchar not null
FirstName nvarchar not null
IsActive bit
AddressId int null references Addresses (AddressId)
The only thing I can think of is to force the IDs to static values, but then, I lose any flexibility that I had with my former approach with the INSERT..SELECT stategy.
What are then my options?
I force the IDs to be always the same, then I have to SET IDENTITY_INSERT ON so that I can force the values into the table, this way I always have the same IDs for each of my rows just as suggested here.
How to BULK INSERT with foreign keys? I can't get any docs on this anywhere. =(
Thanks for your kind assistance!
EDIT
I edited in order to include the BULK INSERT SQL instruction that finally made it for me!
I had my Excel workbook ready with the information I needed to insert. So, I simply created a few supplemental worksheet and began to write formulas in order to "import" the information data to these new sheets. I had one for each of my entities.
Streets;
Addresses;
Citizens.
As for the two other entities, it wasn't worthy to bulk insert them, as I had only two cities and five sectors (cities subdivisions) to insert. Once the both the cities and sectors inserted, I noted their respective IDs and began to ready my record sets for bulk insert. Using the power of Excel to compute the values and to "import" the foreign keys was a charm of itself, by the way. Afterwards, I have saved each of the worksheets to a separated CSV file. My records were then ready to bulked.
USE [DatabaseName]
GO
delete from Citizens
delete from Addresses
delete from Streets
BULK INSERT Streets
FROM N'C:\SomeFolder\SomeSubfolder\Streets.csv'
WITH (
FIRSTROW = 2
, KEEPIDENTITY
, FIELDTERMINATOR = N','
, ROWTERMINATOR = N'\n'
, CODEPAGE = N'ACP'
)
GO
FIRSTROW
Indicates the row number at which to begin the insert. In my situation, my CSVs contained the column headers, so the second row was the one to begin with. Aside, one could possibly want to start anywhere in his file, let's say the 15th row.
KEEPIDENTITY
Allows one to bulk-insert specified in-file entity IDs even though the table has an identity column. This parameter is the same as SET INDENTITY_INSERT my_table ON before a row insert when you wish to insert with a precise id.
As for the other parameters, they speak by themselves.
Now that this is explained, the same code was repeated for each of the two remaining entities to insert Addresses and Citizens. And because the KEEPIDENTITY was specified, all of my foreign keys remained still, though my primary keys were set as identities in SQL Server.
Only a few tweaks though, just the exact same thing as marc_s said in his answer, just import your data as fast as you can into a staging table with no restriction at all. This way, you're gonna make your life much easier, while following good practices nevertheless. =)
The basic idea is to bulk insert your data into a staging table that doesn't have any restrictions, any constraints etc. - just bulk load the data as fast as you can.
Once you have the data in the staging table, then you need to start to worry about constraints etc. when you insert the data from the staging table into the real tables.
Here, you could e.g.
insert only those rows into your real work tables that match all the criteria (and mark them as "successfully inserted" in your staging table)
handle all rows that are left in the staging table that aren't successfully inserted by some error / recovery process - whatever that could be: printing a report with all the "problem" rows, tossing them into an "error bin" or whatever - totally up to you.
Key point is: the actual BULK INSERT should be into a totally unconstrained table - just load the data as fast as you can - and only then in a second step start to worry about constraints and lookup data and references and stuff like that