PostgreSQL INSERT - auto-commit mode vs non-auto-commit mode

I'm new to PostgreSQL and still learning a lot as I go. My company is using PostgreSQL and we are populating the database with tons of data. The data we collect is quite bulky in nature and is derived from certain types of video footage. For example, data related to about 15 minutes worth of video took me about 2 days to ingest into the database.
My problem is that I have data sets relating to hours' worth of video, which would take weeks to ingest into the database. I was told that part of the reason ingestion takes so long is that PostgreSQL has auto-commit set to true by default, and committing every single transaction takes a lot of time and resources. I was also told that turning auto-commit off would speed the process up tremendously. However, my concern is that multiple users are going to be populating this database. If I change the program to commit only every 10 seconds, say, and two people are populating the same table, the first person gets an id and is on, say, record 7 when the second person starts inserting into the same table and is handed the same id key; once the first person commits his changes, the second person's id key will already be used, thus throwing an error.
So what is the best way to insert data into a PostgreSQL database when multiple people are ingesting data at the same time? Is there a way to avoid handing out the same id key to multiple people when inserting data with auto-commit turned off?

If the IDs are coming from the serial type or a PostgreSQL sequence (which is used by the serial type), then you never have to worry about two users getting the same ID from the sequence. It simply isn't possible. The nextval() function only ever hands out a given ID a single time.
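As a minimal illustration (the table name video_frames is made up for the example), two sessions can run this insert concurrently and will always receive distinct ids, because each call to nextval() on the backing sequence hands out a value exactly once, even if the transaction later rolls back:
CREATE TABLE video_frames (
    id      bigserial PRIMARY KEY,  -- id values come from an implicit sequence
    payload text
);

-- Run this from two sessions at once; RETURNING shows each session its own id,
-- and the two sessions can never be handed the same value.
INSERT INTO video_frames (payload) VALUES ('frame data') RETURNING id;
The only side effect of batching or rolling back transactions is gaps in the id sequence, which is normal and harmless.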

Related

Can I configure a table such that inserted rows always have a greater primary key

I would like to configure a table in Postgres to behave like an append only log. This table will have an automatically generated primary ID.
Workers will work on the items in the table in order and should only need to store the last row ID that they have completed.
How can I prevent rows from being written to the table where the row ID is less than the greatest value already in the table (for example because some transactions take longer than others)?
There is no way to prevent concurrent inserts in the table (short of locking the table, which is a bad idea, because it breaks autovacuum).
So there is no way to guarantee that rows are inserted in a certain order. The order in which rows are inserted isn't really a meaningful concept in PostgreSQL.
If you really want that, you have to use a different mechanism to serialize inserts, for example using PostgreSQL advisory locks or synchronization mechanisms on the client side.
Except that the numbers assigned are session specific, so a session that starts earlier but lasts longer can write a row with an id that is less than one from a session that started later but finished sooner. So either you create your own sequence-number generation that involves locking, or you use an INSERT timestamp.
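As a rough sketch of the advisory-lock approach mentioned above (the lock key 42 and the table name work_log are arbitrary placeholders): every writer takes the same transaction-scoped lock, so inserts are fully serialized and ids come out in commit order, at the cost of all writers queueing behind one another.
BEGIN;
-- Transaction-scoped advisory lock: held until COMMIT or ROLLBACK,
-- so only one inserter at a time can be inside this block.
SELECT pg_advisory_xact_lock(42);
INSERT INTO work_log (payload) VALUES ('next item');
COMMIT;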

SQL Transactions - allow reading original data before commit (snapshot?)

I am facing an issue, possibly quite easy to solve, I am just new to advanced transaction settings.
Every 30 minutes I am running an INSERT query that is getting latest data from a linked server to my client's server, to a table we can call ImportTable. For this I have a simple job that looks like this:
BEGIN TRAN
DELETE FROM ImportTable
INSERT INTO ImportTable (columns)
SELECT (columns)
FROM QueryGettingResultsFromLinkedServer
COMMIT
The thing is, each time the job runs the ImportTable is locked for the query run time (2-5 minutes) and nobody can read the records. I wish the table to be read-accessible all the time, with as little downtime as possible.
Now, I read that it is possible to allow SNAPSHOT ISOLATION in the database settings that could probably solve my problem (set to FALSE at the moment), but I have never played with different transaction isolation types and as this is not my DB but my client's, I'd rather not alter any database settings if I am not sure if it can break something.
I know I could have an intermediary table that the records are inserted to and then inserted to the final table and that is certainly a possible solution, I was just hoping for something more sophisticated and learning something new in the process.
PS: My client's server & database is fairly new and barely used, so I expect very little impact if I change some settings, but still, I cannot just randomly change various settings for learning purposes.
Many thanks!
Inserts won't normally block the table unless the lock is escalated to table level. In this case you are deleting the whole table first and inserting the data again; why not insert only the changed data? For the readers, read committed snapshot isolation (RCSI) will help you, but it has the added impact of row versioning, which means SQL Server stores versions of the changed rows in tempdb.
Please see the MCM isolation videos by Kimberly Tripp for an in-depth understanding, and don't forget to test in a staging environment first.
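For reference, and only if the client is comfortable with it, the two database-level options look roughly like this (YourClientDb is a placeholder; try it on a test copy first, since switching READ_COMMITTED_SNAPSHOT needs brief exclusive use of the database):
-- Let readers see the last committed version of rows instead of blocking.
ALTER DATABASE YourClientDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

-- Optional: also allow explicit SET TRANSACTION ISOLATION LEVEL SNAPSHOT.
ALTER DATABASE YourClientDb SET ALLOW_SNAPSHOT_ISOLATION ON;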
You are making this harder than it needs to be.
The problem is the 2-5 minutes that you let be part of the transaction.
It is only a few thousand rows, and moving them by itself takes just a few milliseconds.
If you need ImportTable to be available even during those few milliseconds, then put it under snapshot isolation. Stage the slow linked-server query outside the transaction:
-- Stage the slow linked-server pull outside of any transaction.
DELETE FROM ImportTableStaging;
INSERT INTO ImportTableStaging (columns)
SELECT (columns)
FROM QueryGettingResultsFromLinkedServer;

-- Swap the staged rows into ImportTable; this is the only part that
-- locks ImportTable, and it lasts milliseconds.
BEGIN TRAN;
DELETE FROM ImportTable;
INSERT INTO ImportTable WITH (TABLOCK) (columns)
SELECT (columns)
FROM ImportTableStaging;
COMMIT;
If you are worried about concurrent updates to ImportTableStaging, then use a #temp table instead, as sketched below.
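A sketch of that #temp variant, keeping the same placeholder column list as above: the temp table is private to the session, so concurrent runs cannot see or modify each other's staged rows.
-- Stage into a session-private temp table (slow part, outside the transaction).
SELECT (columns)
INTO #ImportStaging
FROM QueryGettingResultsFromLinkedServer;

-- Swap into the real table; only this part holds locks on ImportTable.
BEGIN TRAN;
DELETE FROM ImportTable;
INSERT INTO ImportTable WITH (TABLOCK) (columns)
SELECT (columns)
FROM #ImportStaging;
COMMIT;

DROP TABLE #ImportStaging;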

Hyperion RDBMS Table

As we know, details of every job are stored in the RDBMS in the table Hsp_Job_Status. But unfortunately this table gets truncated each time we restart the services. As per business requirements, we needed to keep a record of the BRs launched by users and their details. So we developed a workaround and created a trigger on the table such that each new row/update is inserted into a backup table. This was working fine up until now.
Recently, after a restart, the values of the Job_id (i.e. the primary key) stopped appearing in order. The series restarted from an earlier number: it had been running in the 106XX range, but after the restart the numbering started again from 100XX. As Hsp_Job_Status is truncated during the restart, there was no duplicate-primary-key issue in that table, but it did create duplicate values in the backup table, which has broken the backup table and the procedure that uses it.
Usually the series continues even after the table is truncated, so something may have gone wrong during the restart. Can you please suggest what I should check and do to resolve this issue?
Thanks in advance.
Partial answer: the simple solution is to add an instance prefix to the Job_Id and, on service startup, increment the active instance. The instance table can then include details from startup/shutdown events to help drive SLA metrics. Unfortunately, I don't know how you would go about implementing such a scheme, since it's been many years since I've spoken any SQL dialects.
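To make that a little more concrete, one hedged sketch of the idea (none of these object names exist in the Hyperion schema; they are purely illustrative) is a small instance table that is bumped at every service startup, with the backup table keyed on (instance_id, job_id) instead of job_id alone:
-- Illustrative only; not part of the shipped Hsp_Job_Status schema.
-- Adjust the date/time type to whatever your repository RDBMS supports.
CREATE TABLE Service_Instance (
    instance_id INT NOT NULL PRIMARY KEY,
    started_at  DATETIME NOT NULL
);

-- Run once at every service startup to open a new instance.
INSERT INTO Service_Instance (instance_id, started_at)
SELECT COALESCE(MAX(instance_id), 0) + 1, CURRENT_TIMESTAMP
FROM Service_Instance;

-- The backup trigger then stores the current MAX(instance_id) next to each
-- job_id, so duplicate job_ids after a restart no longer collide.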

How to manage foreign key errors from insert for the purpose of data validation (t-sql)

I am building a database in SQL Server 2000 and need to perform data validation by testing for foreign key violations. This post is related to an earlier post I made (Trigger exits on first failed insert and cant set xact_abort OFF in SQL Server 2000), which focussed on how to port a working SQL Server 2005 implementation to SQL Server 2000. Following the advice received there, which indicated that wholesale recoding was required, I am now reconsidering the design itself - hence this post. To recap on my application:
I receive a daily data feed containing ~5k records into a Staging table. When this insert is done, a single record is then added to a table called TRIGGER_DATA.
I have created a trigger ‘on insert’ on this table which then attempts to insert the data therein into a FACT_data table one record at a time.
The FACT_data table is foreign keyed to many DIM tables which define the acceptable inputs the field can take.
If any record violates a foreign key constraint the insert should fail and the record should instead be inserted into a Load_error table (which has no foreign key and all fields are Nullable).
Given the volume of records in each insert, I thought it would be a bad idea to create the trigger on the Stage_data table, since this would result in ~5k trigger firings in one go each day. However, since I cannot set xact_abort off in a trigger under SQL Server 2000, and the trigger therefore aborts on the first failure, I am wondering if it might actually be a half-decent solution.
Questions:
The basic question I am now asking myself is: what is the typical approach for doing this? It seems to me that this kind of data validation by checking for FK violations must be common, so a consensus best practice may have emerged (although I really can't find any for the SQL Server 2000 platform!).
Am I correct that a trigger on the Stage_data table would be bad practice given the volume of records in each insert, or is it acceptable?
Is my approach of looping through each record from within the trigger and testing each insert OK?
What are your thoughts on this alternative that I have just thought of: stop using triggers altogether and, after the Stage table is loaded, update a 'stack' table with a record saying that data has been received and is ready to be validated and loaded into the FACT table (perhaps along with a priority level indicating the order in which tasks must be processed). This stack or 'job' table would then be a register of all requested inserts along with their status (created/in-progress/completed). I would then have a stored procedure continually poll this table and process the top-priority record. This would mean that all stored proc calls happen outside the trigger.
Many thanks
You don't need a trigger at all. Unless there is some reason that you need split-second timing of this daily data load, just schedule a job (stored proc) that runs as often as necessary to look for data in the staging table.
When it finds any, process the records one at a time and load the ones that are OK and do whatever you do with the ones that have broken FKs (delete, move to a work queue, etc.).
If you use a schedule frequency high enough that there is some risk of the next job starting while the last one is still running, then you should create a sentinel table that your stored proc can write to in order to say that the job is running. This could work one of two ways: either you have a single record that says "running" or "not running", or you have one record per job (like a transaction log) with a status code indicating whether the job is complete or not.
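A minimal sketch of the one-record sentinel variant for SQL Server 2000 (all names are illustrative); note that the simple flag still has a small race window if two runs start at exactly the same moment, which the scheduler normally prevents:
-- One-row sentinel table, created once.
CREATE TABLE Load_Sentinel (
    is_running BIT NOT NULL,
    started_at DATETIME NULL
);
INSERT INTO Load_Sentinel (is_running) VALUES (0);

-- At the top of the scheduled stored procedure:
IF EXISTS (SELECT 1 FROM Load_Sentinel WHERE is_running = 1)
    RETURN;  -- a previous run is still in progress, so bail out

UPDATE Load_Sentinel SET is_running = 1, started_at = GETDATE();

-- ... walk the staging rows one at a time, inserting the good ones into
-- FACT_data and routing FK failures into Load_error ...

UPDATE Load_Sentinel SET is_running = 0;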

How should i keep track of the delete operations in database without using triggers?

The application polls the database at certain intervals. On each poll, the application reads all the tables.
As part of an optimization, we want the application to read a table only if an INSERT/UPDATE/DELETE has happened, so I want to use the timestamp concept.
Having a separate timestamp column can help me track any row modifications.
While querying a table, I can check whether the in-memory stored timestamp is less than the max timestamp in the table. If it is, it means that some row has been modified.
But if a row gets deleted, the latest timestamp associated with that row is no longer present. Hence the above algorithm fails in this case, since the max timestamp does not give the correct value.
Is there a way in which I can track the delete operations as well without using triggers?
Any help would be highly appreciated.
I am using Sybase ASA database.
Maybe you could implement logical deletion: instead of removing a record, you simply mark it as deleted with a specific flag, for example.
You still have the max timestamp, and you can exclude the flagged records from the selection queries (maybe create some views on top of the table to do the job automatically).
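A minimal sketch of that soft-delete idea (the table and column names are illustrative; in Sybase ASA the modified_at column could be declared with DEFAULT TIMESTAMP so it updates automatically on every change):
-- Add a deletion flag instead of physically removing rows.
ALTER TABLE orders ADD is_deleted BIT NOT NULL DEFAULT 0;

-- "Deleting" now just flips the flag, so the row's timestamp still advances
-- and MAX(modified_at) detects the change on the next poll.
UPDATE orders SET is_deleted = 1 WHERE order_id = 42;

-- Readers use a view that hides logically deleted rows.
CREATE VIEW active_orders AS
SELECT * FROM orders WHERE is_deleted = 0;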