What is the difference between inserting duplicate rows and inserting new rows

When I use VoltDB (an in-memory database), I found a strange phenomenon: inserting duplicate rows (500 ms) takes more time than inserting new rows (10 ms), and the table has been partitioned. To my knowledge, inserting a duplicate row should return faster because there is no need to do a real insert.

It looks like what you are measuring is the response time back to the client, which may not reflect the execution time. It may depend on what the client does with success results vs. error results, or it may just be random network latency, or how many other transactions were queued ahead of your requests when you sent them.
To measure the actual time it takes to insert a new record or to attempt to insert a duplicate record (and for it to fail and return a unique constraint violation), you may need to isolate and repeat enough transactions to get a good measurement from the built-in statistics.
Suppose you have a partitioned table with a primary key. Here is how I would measure this (a full session sketch follows the steps):
Call "exec #Statistics PROCEDUREPROFILE 1;", but you can ignore these initial results.
Insert 50-100 unique records into your table by calling the TABLENAME.insert procedure.
Call "exec #Statistics PROCEDUREPROFILE 1;" and look at the avg_execution_time column for the TABLENAME.insert procedure. This average should be just for the unique inserts you did between calls to #Statistics.
Insert 50-100 duplicate records using the TABLENAME.insert procedure.
Call "exec #Statistics PROCEDUREPROFILE 1;" and look at the avg_execution_time column for the TABLENAME.insert procedure. This average should be just for the duplicate inserts you did between calls to #Statistics.
I haven't tested unique inserts vs. insert constraint violations for speed myself. In general, inserts are fast regardless of outcome, but I suspect your instinct is correct: the failed duplicate inserts may execute faster, since the uniqueness check happens either way, and when a duplicate is found there is no need to do the actual insert.
Disclaimer: I work for VoltDB.

Related

PostgreSQL: Optimizing bulk INSERT INTO/SELECT FROM

In my PostgreSQL schema I have a jobs table and an accounts table. Once a day I need to schedule a job for each account by inserting a row per account into the jobs table. This can be done using a simple INSERT INTO ... SELECT FROM statement, but is there any empirical way I can know whether I am straining my DB with this bulk insert, and whether I should chunk the inserts instead?
Postgres often does miraculous work so I have no idea if bulk inserting 500k records at a time is better than 100 x 5k, for example. The bulk insert works today but can take minutes to complete.
One additional data point: the jobs table has a uniqueness constraint on account ID, so this statement includes an ON CONFLICT clause too.
In PostgreSQL, it doesn't matter how many rows you modify in a single transaction, so it is preferable to do everything in a single statement so that you don't end up with half the work done in case of a failure. The only consideration is that transactions should not take too long, but if that happens once a day, it is no problem.
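For instance, a minimal sketch of the single statement, assuming hypothetical jobs(account_id, run_date) and accounts(id) columns (the question's actual ON CONFLICT action may differ):

INSERT INTO jobs (account_id, run_date)
SELECT id, current_date
FROM accounts
ON CONFLICT (account_id) DO NOTHING;

Run once a day, this inserts one job per account in a single transaction; if it fails partway, nothing is committed and it can simply be retried.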

Should I use a database trigger on an insert-heavy table?

My use-case is that I need to copy a few columns from TABLE A to another TABLE B and also derive values of a few other columns of TABLE B by some calculation.
As per current estimates, around 50,000 rows will be inserted on a daily basis into TABLE A.
TABLE B should be updated with all data before end of day.
Hence, I can either use a trigger which is invoked on each INSERT operation on TABLE A, or schedule some job at EOD which reads all data in bulk from TABLE A, does some calculation, and inserts into TABLE B.
As I am new to triggers, I am not sure which option I should pick for this use-case. Any suggestion as to which would be the better approach?
So far, what I have read about triggers is that they can slow down DB performance if they are invoked frequently.
Since around 50,000 insert operations will happen daily, can I assume that 50,000 falls under heavy operations where triggers would not be beneficial?
EDIT 1: the 50,000 insert operations will grow to 100,000 insert operations daily.
A Postgres DB is used.
If you are doing bulk COPY into an unindexed table, adding a simple trigger will slow you down by a lot (like 5 fold). But if you are using single-row INSERTs or the table is indexed, the marginal slow down of adding a simple trigger will be pretty small.
50,000 inserts per day is not very many. You should have no trouble using a trigger on that, unless the trigger has to scan a large table on each call or something.
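For illustration, a minimal sketch of such a trigger in PostgreSQL; the names table_a, table_b, val, and derived are hypothetical, and the calculation is a placeholder:

CREATE FUNCTION copy_a_to_b() RETURNS trigger AS $$
BEGIN
    -- copy a few columns and derive another by some calculation
    INSERT INTO table_b (id, val, derived)
    VALUES (NEW.id, NEW.val, NEW.val * 2);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER a_to_b
    AFTER INSERT ON table_a
    FOR EACH ROW EXECUTE FUNCTION copy_a_to_b();

(EXECUTE FUNCTION is PostgreSQL 11+ syntax; older versions use EXECUTE PROCEDURE.)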

Postgres: INSERT ... SELECT from a table, ignoring any errors

I am trying to bulk load records from a temp table into a table using an INSERT ... SELECT statement with an ON CONFLICT ... DO UPDATE strategy.
I want to load as many records as possible. Currently, if there are any foreign key violations, no records get inserted and everything gets rolled back. Is there a way to insert the valid records and skip the faulty ones?
In https://dba.stackexchange.com/a/46477 I saw a strategy of including the foreign table in the query to ignore the faulty rows. I don't want to do that either, as I may have many foreign keys on that table and it would make my query more complex and table-specific. I would like it to be generic.
Sample use case: if I have 100 rows in the temp table and rows 5 and 7 cause insertion failures, I want to insert the remaining 98 records and identify which two rows failed.
I want to avoid inserting record by record and catching the error, as that is not efficient. I am doing this whole exercise to avoid loading the table row by row.
Oracle provides support for catching bulk errors in one shot.
Sample: https://stackoverflow.com/a/36430893/8575780
https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:1422998100346727312
I have already explored loading using COPY; it catches NOT NULL constraints and other data type errors, but when a foreign key violation happens, nothing gets committed.
I am looking for something closer to what pgloader does when it faces errors.
https://pgloader.readthedocs.io/en/latest/pgloader.html#batches-and-retry-behaviour
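For reference, the record-by-record approach I want to avoid would look roughly like this plpgsql sketch (temp_t, target_t, and their id/val columns are placeholders):

DO $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT id, val FROM temp_t LOOP
        BEGIN
            INSERT INTO target_t (id, val) VALUES (r.id, r.val);
        EXCEPTION WHEN foreign_key_violation THEN
            -- the faulty row is identified and skipped
            RAISE NOTICE 'skipped row with id %', r.id;
        END;
    END LOOP;
END $$;

Each EXCEPTION block opens a subtransaction (an implicit savepoint), which is exactly the per-row overhead that makes this too slow for large loads.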

Postgres: parallel/efficient load of a huge amount of data with psycopg

I want to load many rows from a CSV file.
The files contain data like this: "article_name, article_time, start_time, end_time".
There is a constraint on the table: for the same article name, I don't insert a new row if the new article_time falls in an existing range [start_time, end_time] for the same article.
I.e.: don't insert row y if there exists a range [start_time_x, end_time_x] such that article_time_y is inside [start_time_x, end_time_x], with article_name_y = article_name_x.
I tried with psycopg by selecting the existing article names and checking manually if there is an overlap --> too long.
I tried again with psycopg, this time by setting an EXCLUDE USING ... constraint and trying to insert with "ON CONFLICT DO NOTHING" specified (so that it does not fail), but it was still too long.
I tried the same thing, but this time inserting many values at each call to execute (psycopg): it got a little better (1M rows processed in almost 10 minutes), but still not as fast as it needs to be for the amount of data I have (500M+).
I tried to parallelize by calling the same script many times on different files, but the timing didn't get any better; I guess because of the locks on the table each time we want to write something.
Is there any way to create a lock only on rows containing the same article_name (and not a lock on the whole table)?
Could you please help with any idea to make this parallelizable and/or more time efficient?
Thanks a lot, folks.
Your idea with the exclusion constraint and INSERT ... ON CONFLICT is good.
You could improve the speed as follows:
Do it all in a single transaction.
Like Vao Tsun suggested, maybe COPY the data into a staging table first and do it all with a single SQL statement.
Remove all indexes except the exclusion constraint from the table where you modify data and re-create them when you are done.
Speed up insertion by disabling autovacuum and raising max_wal_size (or checkpoint_segments on older PostgreSQL versions) while you load the data.
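A sketch of the staging-table variant, assuming a hypothetical articles table; the exclusion constraint needs the btree_gist extension, and the range semantics shown here (no overlapping [start_time, end_time] per article_name) may need adjusting to match the exact rule above:

CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE articles (
    article_name text,
    article_time timestamp,
    start_time   timestamp,
    end_time     timestamp,
    EXCLUDE USING gist (
        article_name WITH =,
        tsrange(start_time, end_time) WITH &&
    )
);

-- load the raw CSV into an unconstrained staging table,
-- then move everything across in one statement;
-- rows that violate the exclusion constraint are skipped
CREATE TEMP TABLE staging (LIKE articles);
COPY staging FROM '/path/to/articles.csv' WITH (FORMAT csv);  -- or \copy from the client
INSERT INTO articles SELECT * FROM staging ON CONFLICT DO NOTHING;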

Delete Takes a Long Time

I've got a table which has about 5.5 million records. I need to delete some records from it based on date. My query looks like this:
DELETE FROM Table WHERE [Date] between '2011-10-31 04:30:23' and '2011-11-01 04:30:42'
It's about 9000 rows, but this operation takes a very long time. How can I speed it up? [Date] is of type datetime2, and the table has a clustered int primary key. Update and delete triggers are disabled.
It's very possible that [Date] is being cast to a string on every row, resulting in a sequential scan of the entire table.
You should try casting your parameters to a date instead:
DELETE FROM Table WHERE [Date] between convert(datetime, '2011-10-31 04:30:23') and convert(datetime, '2011-11-01 04:30:42')
Also, make sure there's an index on [Date].
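For example (SQL Server syntax, using the table and column names from the question):

CREATE NONCLUSTERED INDEX IX_Table_Date ON [Table] ([Date]);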
First, make sure you have an index on [Date].
If there is an index, check the execution plan and make sure it is being used. Note that it doesn't always follow that using an index is the most efficient way of processing a delete: if you are deleting a large proportion of the records (a rule of thumb is in excess of 10%), the additional overhead of the index look-ups can be greater than a full scan.
With a large table it's also well worth making sure that the statistics are up to date (run sp_updatestats), because if the database has an incorrect understanding of the number of rows in the table, it will make inappropriate choices in its execution plan. For example, if the statistics are incorrect, the database may decide to ignore your index even if it exists, because it thinks there are far fewer records in the table than there are. Odd distributions of dates can have similar effects.
I'd probably try dropping the index on date and then recreating it. Indexes are B-trees, and to work efficiently they need to be balanced. If your data has accumulated over time, the index may well be lopsided and queries might take a long time to find the appropriate data. Both this and the statistics issue should be handled automatically by your database maintenance job, but that is often overlooked.
Finally, you don't say whether there are many other indexes on the table. If there are, you might be running into issues with the database having to reorganize those indexes as it processes the delete, as well as update them. It's a bit drastic, but one option is to drop all other indexes on the table before running the delete, then recreate them afterwards.