How to manage foreign key errors from an insert for the purpose of data validation (T-SQL)

I am building a database in SQL Server 2000 and need to perform data validation by testing for foreign key violations. This post is related to an earlier post I made (Trigger exits on first failed insert and can't set xact_abort OFF in SQL Server 2000), which focused on how to port a working SQL Server 2005 implementation to SQL Server 2000. Following the advice received on that post, which indicated that wholesale recoding was required, I am now reconsidering the design itself - hence this post. To recap on my application:
I receive a daily data feed containing ~5k records into a Staging table. When this insert is done, a single record is then added to a table called TRIGGER_DATA.
I have created a trigger ‘on insert’ on this table which then attempts to insert the data therein into a FACT_data table one record at a time.
The FACT_data table is foreign keyed to many DIM tables which define the acceptable inputs the field can take.
If any record violates a foreign key constraint, the insert should fail and the record should instead be inserted into a Load_error table (which has no foreign keys and in which all fields are nullable).
Given the volume of records in each insert, I thought it would be a bad idea to create the trigger on the Stage_data table, since this would result in ~5k trigger firings in one go each day. However, since I cannot set xact_abort off in a trigger under SQL Server 2000, and it therefore aborts in the trigger on the first failure, I am wondering whether it might actually be a half-decent solution.
Questions:
The basic question I am now asking myself is: what is the typical approach for doing this? It seems to me that this kind of data validation through checking for FK violations must be common, and therefore a consensus best practice may have emerged (although I really can't find any for the SQL Server 2000 platform!).
Am I correct that a trigger on the Stage_data table would be bad practice given the volume of records in each insert, or is it acceptable?
Is my approach of looping through each record from within the trigger and testing the insert OK?
What are your thoughts on this alternative that I have just thought of? Stop using triggers altogether and, after the Stage table is loaded, update a 'stack' table with a record saying that data has been received and is ready to be validated and loaded into the FACT table (perhaps along with a priority level indicating the order in which tasks must be processed). This stack or 'job' table would then be a register of all requested inserts along with their status (created/in-progress/completed). I would then have a stored procedure continually poll this table and process the top-priority record. This would mean that all stored procedure calls would happen outside the trigger.
Many thanks

You don't need a trigger at all. Unless there is some reason that you need split-second timing of this daily data load, just schedule a job (stored proc) that runs as often as necessary to look for data in the staging table.
When it finds any, process the records one at a time: load the ones that are OK, and do whatever you want with the ones that have broken FKs (delete them, move them to a work queue, etc.).
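Since SQL Server 2000 has no TRY/CATCH, one way to sketch that per-record step is a cursor plus an @@ERROR check after each insert. The table names below come from the question; the column list is a placeholder.

-- Sketch only: assumes Stage_data, FACT_data and Load_error share a compatible column list.
DECLARE @id INT, @some_value VARCHAR(50), @err INT

DECLARE stage_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT id, some_value FROM Stage_data

OPEN stage_cur
FETCH NEXT FROM stage_cur INTO @id, @some_value

WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT INTO FACT_data (id, some_value) VALUES (@id, @some_value)
    SET @err = @@ERROR          -- 547 = constraint violation (e.g. a broken FK)

    IF @err <> 0
        INSERT INTO Load_error (id, some_value, error_code) VALUES (@id, @some_value, @err)

    FETCH NEXT FROM stage_cur INTO @id, @some_value
END

CLOSE stage_cur
DEALLOCATE stage_cur

The FK error message is still raised to the caller, but because this runs in a procedure rather than a trigger, the loop keeps going and the offending row is simply diverted.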
If you use a schedule frequency high enough that there is some risk of the next job starting while the last one is still running, then you should create a sentinel table that your stored proc can write to in order to say that the job is running. This could work one of two ways: either you have a single record that says "running" or "not running", or you have one record per job (like a transaction log) with a status code indicating whether the job is complete or not.
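A sketch of the one-record-per-job variant (Load_job and all of its columns are invented names):

-- A job/sentinel table: one row per run, acting as both lock and log.
CREATE TABLE Load_job (
    job_id      INT IDENTITY(1,1) PRIMARY KEY,
    started_at  DATETIME NOT NULL DEFAULT GETDATE(),
    finished_at DATETIME NULL,
    status      VARCHAR(20) NOT NULL DEFAULT 'running'   -- 'running' / 'complete' / 'failed'
)

and at the top of the scheduled procedure:

IF EXISTS (SELECT 1 FROM Load_job WHERE status = 'running')
    RETURN   -- previous run still in progress; the next scheduled run will pick the work up

DECLARE @job INT
INSERT INTO Load_job (status) VALUES ('running')
SET @job = SCOPE_IDENTITY()

-- ... validate and load Stage_data here ...

UPDATE Load_job SET status = 'complete', finished_at = GETDATE() WHERE job_id = @job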

Related

Can I configure a table such that inserted rows always have a greater primary key

I would like to configure a table in Postgres to behave like an append only log. This table will have an automatically generated primary ID.
Workers will work on the items in the table in order and should only need to store the last row ID that they have completed.
How can I prevent rows from being written to the table (perhaps because some transactions take longer than others) where the row ID is less than the greatest value already in the table?
There is no way to prevent concurrent inserts in the table (short of locking the table, which is a bad idea, because it breaks autovacuum).
So there is no way to guarantee that rows are inserted in a certain order. The order in which rows are inserted isn't really a meaningful concept in PostgreSQL.
If you really want that, you have to use a different mechanism to serialize inserts, for example using PostgreSQL advisory locks or synchronization mechanisms on the client side.
Except that the numbers assigned are session-specific, so a session that starts earlier but lasts longer can write to the table with an ID that is less than one from a session that started later but finished sooner. So either you create your own sequence-number generation that involves locking, or you use an INSERT timestamp.
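If you go the advisory-lock route mentioned above, a minimal sketch could look like this; the lock key (42) and the table/column names are arbitrary placeholders.

-- Every writer takes the same transaction-scoped advisory lock, so inserts are fully serialized
-- and IDs become visible in commit order. The lock is released automatically at COMMIT/ROLLBACK.
BEGIN;
SELECT pg_advisory_xact_lock(42);
INSERT INTO log_table (payload) VALUES ('...');
COMMIT;

The obvious cost is that writers are throttled to one at a time, which is exactly the serialization the answer warns about.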

SQL Transactions - allow read original data before commit (snapshot?)

I am facing an issue, possibly quite easy to solve, I am just new to advanced transaction settings.
Every 30 minutes I am running an INSERT query that is getting latest data from a linked server to my client's server, to a table we can call ImportTable. For this I have a simple job that looks like this:
BEGIN TRAN
DELETE FROM ImportTable
INSERT INTO ImportTable (columns)
SELECT (columns)
FROM QueryGettingResultsFromLinkedServer
COMMIT
The thing is, each time the job runs the ImportTable is locked for the query run time (2-5 minutes) and nobody can read the records. I wish the table to be read-accessible all the time, with as little downtime as possible.
Now, I read that it is possible to allow SNAPSHOT ISOLATION in the database settings, which could probably solve my problem (it is set to FALSE at the moment), but I have never played with different transaction isolation levels, and as this is not my DB but my client's, I'd rather not alter any database settings if I am not sure whether it could break something.
I know I could have an intermediary table that the records are inserted to and then inserted to the final table and that is certainly a possible solution, I was just hoping for something more sophisticated and learning something new in the process.
PS: My client's server & database is fairly new and barely used, so I expect very little impact if I change some settings, but still, I cannot just randomly change various settings for learning purposes.
Many thanks!
Inserts won't normally block the table unless the lock is escalated to table level. In this case, you are deleting the whole table first and inserting the data again - why not insert only the updated data? For the query you are running, read committed snapshot isolation (RCSI) will help you, but you will have the added impact of row versioning, which means SQL Server will store versions of the changed rows in tempdb.
Please see the MCM isolation videos by Kimberly Tripp for an in-depth understanding, and don't forget to test in a staging environment first.
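For reference, the two database-level switches being discussed look roughly like this (YourDb is a placeholder; run it against a test copy first, and note that flipping READ_COMMITTED_SNAPSHOT wants the database free of other active connections):

-- RCSI: existing READ COMMITTED queries start reading row versions instead of blocking on writers.
ALTER DATABASE YourDb SET READ_COMMITTED_SNAPSHOT ON;

-- Or only allow explicit SNAPSHOT transactions, leaving default behaviour unchanged:
ALTER DATABASE YourDb SET ALLOW_SNAPSHOT_ISOLATION ON;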
You are making this harder than it needs to be.
The problem is the 2-5 minutes that you are letting be part of the transaction.
It is only a few thousand rows - that part takes just a few milliseconds.
If you need ImportTable to be available during those few milliseconds, then put it in a snapshot.
DELETE FROM ImportTableStaging;

INSERT INTO ImportTableStaging (columns)
SELECT (columns)
FROM QueryGettingResultsFromLinkedServer;

BEGIN TRAN
DELETE FROM ImportTable;

INSERT INTO ImportTable WITH (TABLOCK) (columns)
SELECT (columns)
FROM ImportTableStaging;
COMMIT
If you are worried about concurrent updates to ImportTableStaging, then use a #temp table.

How to trigger creation/update of another row if one row is created/updated in PostgreSQL

I am receiving CSV records from outside; when I create or update an entry in PostgreSQL, I need to create a mirror entry that differs only in sign. This could be done at the program level, but I am curious to know whether it would be possible using triggers.
The examples I can find all end with code like
FOR EACH ROW EXECUTE PROCEDURE foo()
and they usually deal with checks, add additional info using NEW.additionalfield, or insert into another table. If I use a trigger this way to insert another row into the same table, it seems the trigger will be fired again and the creation becomes recursive.
Any way to work this out?
When dealing with triggers, the rules of thumb are:
If it changes the current row, based on some business rules or other (e.g. adding extra info or processing calculated fields), it belongs in a BEFORE trigger.
If it has side effects on one or more rows in separate tables, it belongs in an AFTER trigger.
If it runs integrity checks on any table that no other built-in constraints (checks, unique keys, foreign keys, exclude, etc.) can take care of, it belongs in a CONSTRAINT [after] trigger.
If it has side effects on one or more other rows within the same table, you should probably revisit your schema, your code flow, or both.
Regarding that last point, there actually are workarounds in Postgres, such as trying to get a lock or checking xmin vs the transaction's xid, to avoid getting bogged down in recursive scenarios. A recent version additionally introduced pg_trigger_depth(). But I'd still advise against it.
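For completeness, if you did insert the mirror row from a trigger anyway, the usual pg_trigger_depth() guard looks roughly like this; the entries table and its amount column are assumptions based on the question.

-- Sketch: skip the body when the trigger was fired by its own INSERT (depth > 1).
CREATE OR REPLACE FUNCTION mirror_entry() RETURNS trigger AS $$
BEGIN
    IF pg_trigger_depth() > 1 THEN
        RETURN NEW;                                      -- recursive call: do nothing
    END IF;
    INSERT INTO entries (amount) VALUES (-NEW.amount);   -- mirror row with the sign flipped
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER entries_mirror
AFTER INSERT ON entries
FOR EACH ROW EXECUTE PROCEDURE mirror_entry();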
Note that a constraint trigger can be created as deferrable initially deferred. This will delay the constraint trigger until the very end of the transaction, rather than immediately after the statement.
Your question and nickname hint that you're wondering how to automatically balance a set of lines in a double-entry book-keeping application. Assuming so, do NOT create the balancing entry automatically. Instead, begin a transaction, enter each line separately, and have a (for each row, deferrable initially deferred) constraint trigger pick things up from there and reject the entire batch if anything is unbalanced. Proceeding that way will spare you a mountain of headaches when you want to balance more than two or three lines with each other.
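A sketch of that deferred constraint trigger; journal_lines, txn_id and amount are invented names, and the check simply requires each batch to sum to zero at commit time.

CREATE OR REPLACE FUNCTION check_batch_balanced() RETURNS trigger AS $$
BEGIN
    IF (SELECT COALESCE(SUM(amount), 0)
        FROM journal_lines
        WHERE txn_id = NEW.txn_id) <> 0 THEN
        RAISE EXCEPTION 'journal batch % does not balance', NEW.txn_id;
    END IF;
    RETURN NULL;   -- the return value of an AFTER trigger is ignored
END;
$$ LANGUAGE plpgsql;

-- Deferred to commit, so all lines of the batch are visible when the check runs.
CREATE CONSTRAINT TRIGGER journal_balanced
AFTER INSERT OR UPDATE ON journal_lines
DEFERRABLE INITIALLY DEFERRED
FOR EACH ROW EXECUTE PROCEDURE check_batch_balanced();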
Another reading might be that you want to create an audit trail. If so, create additional audit tables and use after triggers to populate them. There are multiple ways to create and manage these audit tables. Look into slowly changing dimensions. (Fwiw, type 6 with a start_end column of type tsrange or tstzrange works well for the audit tables if you're interested in a table's full history including its history of relationships with other audit tables.) Use the "live" tables for your application to keep things fast, and use the audit-tables when you need historical reporting.
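And for the audit-trail reading, a bare-bones after trigger could be along these lines (live_table and live_table_audit are hypothetical; a full slowly-changing-dimension design would replace the single timestamp with a validity range such as tstzrange):

-- Audit copy of every insert/update, populated by an AFTER trigger.
CREATE TABLE live_table_audit (
    operation  text        NOT NULL,
    changed_at timestamptz NOT NULL,
    changed_by text        NOT NULL,
    LIKE live_table
);

CREATE OR REPLACE FUNCTION audit_live_table() RETURNS trigger AS $$
BEGIN
    INSERT INTO live_table_audit SELECT TG_OP, now(), current_user, NEW.*;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER live_table_audit_trg
AFTER INSERT OR UPDATE ON live_table
FOR EACH ROW EXECUTE PROCEDURE audit_live_table();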

PostgreSQL INSERT - auto-commit mode vs non auto-commit mode

I'm new to PostgreSQL and still learning a lot as I go. My company is using PostgreSQL and we are populating the database with tons of data. The data we collect is quite bulky in nature and is derived from certain types of video footage. For example, data related to about 15 minutes worth of video took me about 2 days to ingest into the database.
My problem is that I have data sets which relate to hours' worth of video, which would take weeks to ingest into the database. I was informed that part of the reason this is taking so long is that PostgreSQL has auto-commit set to true by default, and committing transactions takes a lot of time/resources. I was told that if I turned auto-commit off, the process would speed up tremendously. However, my concern is that multiple users are going to be populating this database. If I change the program to commit after, say, every 10 seconds, and two people are attempting to populate the same table, then the first person gets an ID and, when he is on, say, record 7, the second person attempting to insert into the same table is given the same ID key; once the first person commits his changes, the second person's ID key will already be used, thus throwing an error.
So what is the best way to insert data into a PostgreSQL database when multiple people are ingesting data at the same time? Is there a way to work around issuing the same ID key to multiple people when inserting data in auto-commit mode?
If the IDs are coming from the serial type or a PostgreSQL sequence (which is used by the serial type), then you never have to worry about two users getting the same ID from the sequence. It simply isn't possible. The nextval() function only ever hands out a given ID a single time.
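As a small illustration (video_frames is a made-up table): the id values come from a sequence, so concurrent loaders can each batch thousands of rows per commit without ever colliding; the only side effect is that rolled-back batches leave gaps in the numbering.

CREATE TABLE video_frames (
    id       bigserial PRIMARY KEY,      -- backed by a sequence; never hands out a value twice
    captured timestamptz NOT NULL,
    note     text
);

-- Each loader groups many inserts into one transaction instead of committing per row:
BEGIN;
INSERT INTO video_frames (captured, note) VALUES (now(), 'frame 1'), (now(), 'frame 2');
-- ... thousands more rows ...
COMMIT;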

How can I be sure that a row, or series of rows returned in one select statement are excluded from other queries to the database in separate threads

I have a PostgreSQL 9.2.2 database that serves orders to my ERP system. Among other things, the database tables contain boolean columns indicating whether a customer has been added or not. The code I use extracts the rows from the database and sends them to our ERP system one at a time (single-threaded). My code works perfectly in this regard; however, over the past year our volume has grown enough to require a multi-threaded solution.
I don't think the MVCC modes will work for me because the added_customer column is only updated once a customer has been successfully added. The default MVCC modes could cause the same row to be worked on at the same time resulting in duplicate web service calls. What I want to avoid is duplicate web service calls to our ERP system as they can be rather heavy, although admittedly I am not an expert on MVCC nor the other modes that PostgreSQL provides.
My question is: How can I be sure that a row, or series of rows returned in one select statement are excluded from other queries to the database in separate threads?
You will need to record the fact that the rows are being processed somehow. You will also need to deal with concurrent attempts to mark them as being processed and handle failures with sending them to your ERP system.
You may find SELECT ... FOR UPDATE useful to get a set of rows and simultaneously lock them against updates. One approach might be for each thread to select a target row, try to add its ID to a "processing" table, then remove it in the same transaction in which you update added_customer.
If a thread fetches no candidate rows, or fails to insert then it just needs to sleep briefly and try again. If anything goes badly wrong then you should have rows left in the "processing" table that you can inspect/correct.
Of course the other option is to just grab a set of candidate rows and spawn a separate process/thread for each that communicates with the ERP. That keeps the database fetching single-threaded while allowing multiple channels to the ERP.
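A sketch of the claim step described above; orders is a stand-in for your table, added_customer comes from the question, and processing_queue is the invented "processing" table (its primary key is what makes a duplicate claim fail, since SKIP LOCKED does not exist yet on 9.2).

CREATE TABLE processing_queue (
    order_id   integer PRIMARY KEY,            -- PK makes a second claim of the same row fail
    claimed_at timestamptz NOT NULL DEFAULT now()
);

-- Claim one candidate row; a failed INSERT here just means "sleep and retry".
BEGIN;
SELECT id FROM orders
WHERE added_customer = false
  AND id NOT IN (SELECT order_id FROM processing_queue)
LIMIT 1
FOR UPDATE;

INSERT INTO processing_queue (order_id) VALUES (123);   -- 123 = the id returned above
COMMIT;

-- ... call the ERP web service for row 123 ...

-- Mark the row done and release the claim in one transaction.
BEGIN;
UPDATE orders SET added_customer = true WHERE id = 123;
DELETE FROM processing_queue WHERE order_id = 123;
COMMIT;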
You can add a column user_is_processed to the table. It can hold the process ID of the backend that updates the record.
Then use a small serializable transaction to set user_is_processed, i.e. to "lock the row for processing".
Something like:
START TRANSACTION ISOLATION LEVEL SERIALIZABLE;
UPDATE user_table
SET user_is_processed = pg_backend_pid()
WHERE <some condition>
AND user_is_processed IS NULL; -- no one is processing it now
COMMIT;
The key thing here - with SERIALIZABLE only one transaction can successfully update the record (all other concurrent SERIALIZABLE updates will fail with ERROR: could not serialize access due to concurrent update).
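After that UPDATE commits, the worker can fetch exactly the rows it claimed (and if it hit the serialization error instead, it simply retries after a short sleep):

SELECT *
FROM user_table
WHERE user_is_processed = pg_backend_pid();   -- only the rows this backend marked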