Batch insert performance

I would like to improve the speed of some inserts on an application I am working on.
I had originally created a batch like this:
insert tableA (field1, field2) values (1,'test1'), (2, 'test2')
This works great on SQL Server 2008 and above, but I need my inserts to work on SQL Server 2005. My question is: would I get any performance benefit using a batch like this:
insert tableA (field1, field2)
select 1, 'test1'
union all
select 2, 'test2'
Over this batch:
insert tableA (field1, field2) values (1, 'test1')
insert tableA (field1, field2) values (2, 'test2')

Although it may seem that less code to process should give you a performance gain, using the first option (UNION ALL) is a bad idea.
All of the SELECT statements have to be parsed and executed before any data is inserted into the table. That consumes a lot of memory and may take forever to finish, even for a fairly small amount of data. Separate INSERT statements, on the other hand, are executed "on the fly".
Run a simple test in SSMS that will prove this: create a simple table with 4 fields and try inserting 20k rows each way. The second solution will execute in seconds, while the first... you'll see :).
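A minimal sketch of such a test, with a made-up throwaway table (in practice you would generate both scripts, since nobody types 20k rows by hand):
-- hypothetical test table
CREATE TABLE dbo.InsertTest (f1 int, f2 varchar(20), f3 varchar(20), f4 varchar(20));

-- variant 1: separate statements, each executed as soon as it is read
INSERT dbo.InsertTest (f1, f2, f3, f4) VALUES (1, 'a', 'b', 'c');
INSERT dbo.InsertTest (f1, f2, f3, f4) VALUES (2, 'a', 'b', 'c');
-- ...repeated up to 20000

-- variant 2: one statement; every SELECT must be parsed before any row is inserted
INSERT dbo.InsertTest (f1, f2, f3, f4)
SELECT 1, 'a', 'b', 'c'
UNION ALL SELECT 2, 'a', 'b', 'c'
-- ...20000 UNION ALL branches in a single statement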
Another problem is that when the data in some row is not correct (a type mismatch, for example), you'll receive an error like Conversion failed when converting the varchar value 'x' to data type int., but you'll have no indication of which row failed - you'd have to find it yourself. With separate inserts, you'll know exactly which insert failed.
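To see the difference in error reporting (assuming field1 is an int, as the example values suggest):
-- separate statements: the failing row is obvious because each statement is one row
INSERT tableA (field1, field2) VALUES (1, 'test1');
INSERT tableA (field1, field2) VALUES ('x', 'test2');  -- the error points at this statement

-- single statement: the same bad value fails the whole batch,
-- with no hint as to which row caused it
INSERT tableA (field1, field2)
SELECT 1, 'test1'
UNION ALL
SELECT 'x', 'test2';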

T-SQL Indices when using OR statement

I've looked through several sources to get information on indices regarding AND statements, table joins, etc., but I've yet to find much useful information for the case where OR statements are present. That being said, how would someone ideally handle creating indexes for a situation like the one here?
Updated the SQL statement to not use LIKE.
SELECT *
FROM table_name
WHERE table_name.field1 = 'criteria'
  AND (table_name.field2 = 1 OR
       table_name.field3 = 0)
Obviously, I would want to create an index on field1. But whether I should include fields 2 and 3, or handle this in another way, I'm not sure. If this were a simple three-part AND statement I would probably use CREATE INDEX IND_index_name on table_name (field1, field2, field3), but I have reason to believe this logic doesn't work the same way for OR statements. Based on the statement given, I can assume that field1 always needs to be evaluated first, but then I have multiple ways I could handle this. The potential solutions I am evaluating are listed below, but I'm not sure if there may yet be another, better solution. Any help would be greatly appreciated!
CREATE INDEX IND_index_name ON table_name (field1) INCLUDE (field2, field3)

CREATE INDEX IND_index_name ON table_name (field1, field2, field3)

CREATE INDEX IND_index_name1 ON table_name (field1, field2)
CREATE INDEX IND_index_name2 ON table_name (field1, field3)

CREATE INDEX IND_index_name1 ON table_name (field1)
CREATE INDEX IND_index_name2 ON table_name (field2)
CREATE INDEX IND_index_name3 ON table_name (field3)
As additional info, I do not have access to the SQL Server Management Studio tools because I am using DBeaver. For the sake of this example, let's assume it takes almost a full workday to run without indexes. The answers submitted here will be used to tackle a much larger, more complex query where data from table_name would be used in several subsequent queries that run after the query shown above.
There is no single way to index OR.
I would try 3 separate single-column indexes (the last option above); SQL Server can then seek each index and combine the results for the two sides of the OR.
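A sketch of that option, plus a common companion trick: rewriting the OR as a UNION so each branch can seek its own index (names are the placeholders from the question):
CREATE INDEX IND_index_name1 ON table_name (field1);
CREATE INDEX IND_index_name2 ON table_name (field2);
CREATE INDEX IND_index_name3 ON table_name (field3);

-- equivalent to the original query: each branch is a plain AND
-- that can use its own index; UNION (not UNION ALL) removes rows
-- matching both branches, assuming the table has no duplicate rows
SELECT * FROM table_name WHERE field1 = 'criteria' AND field2 = 1
UNION
SELECT * FROM table_name WHERE field1 = 'criteria' AND field3 = 0;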

Redshift large 'in' clause best practices

We have a query in which a list of parameter values is provided in the IN clause. Some time back this query failed to execute because the list in the IN clause grew so large that the resulting statement exceeded Redshift's 16 MB query-size limit. We then tried processing the data in batches so as to limit the list size and not breach the 16 MB limit.
My question is: what are the factors/pitfalls to keep in mind while supplying such a large list to the IN clause of a query, and is there any alternative way to deal with such a large list?
If you have control over how you are generating your code, you could split it up as follows.
First, drop and recreate the filter table:
drop table if exists myfilter;
create table myfilter (filter_text varchar(max));
Second, populate the filter table in parts of a suitable size, e.g. 1000 values at a time:
insert into myfilter
values ({{myvalue1}}), ({{myvalue2}}), ({{myvalue3}});  -- etc., up to 1000 rows per statement
Repeat that step until all of your values are inserted.
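For illustration, with made-up placeholder values, the repeated batches might look like this:
insert into myfilter values ('val0001'), ('val0002') /* ... */, ('val1000');
insert into myfilter values ('val1001'), ('val1002') /* ... */, ('val2000');
-- ...and so on until the full list is loaded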
Then, use that filter table as follows
select * from master_table
where some_value in (select filter_text from myfilter);
drop table myfilter;
A large IN list is not best practice in itself; it's better to use a join for long lists:
construct a virtual table (a subquery or CTE) holding your values
join your target table to the virtual table
like this:
with your_list as (
    select 'first_value' as search_value
    union select 'second_value'
    ...
)
select ...
from target_table t1
join your_list t2
    on t1.col = t2.search_value

PostgreSQL multi-value upserts

Is it possible to perform a multi-value upsert in PostgreSQL? I know multi-value inserts exist, as does the ON CONFLICT clause to perform an update if the key is violated... but is it possible to bring the two together? Something like so...
INSERT INTO table1(col1, col2) VALUES (1, 'foo'), (2,'bar'), (3,'baz')
ON CONFLICT ON CONSTRAINT theConstraint DO
UPDATE SET (col2) = ('foo'), ('bar'), ('baz')
I googled the crud out of this and couldn't find anything regarding it.
I have an app that is utilizing pg-promise, and I'm doing batch processing. It works, but it's horrendously slow (like 50-ish rows every 5 seconds or so...). I figured if I could do away with the batch processing and instead correctly build this multi-valued upsert query, it could improve performance.
Edit:
Well... I just tried it myself and no, it doesn't work. Unless I'm doing it incorrectly. So now I guess my question has changed to, what's a good way to implement something like this?
Multi-valued upsert is definitely possible, and a significant part of why insert ... on conflict ... was implemented.
CREATE TABLE table1(col1 int, col2 text, constraint theconstraint unique(col1));
INSERT INTO table1 VALUES (1, 'parrot'), (4, 'turkey');
INSERT INTO table1 VALUES (1, 'foo'), (2,'bar'), (3,'baz')
ON CONFLICT ON CONSTRAINT theconstraint
DO UPDATE SET col2 = EXCLUDED.col2;
results in
regress=> SELECT * FROM table1 ORDER BY col1;
 col1 |  col2
------+--------
    1 | foo
    2 | bar
    3 | baz
    4 | turkey
(4 rows)
If the docs were unclear, please submit appropriate feedback to the pgsql-general mailing list. Or even better, propose a patch to the docs.

Insert into subselect slow

I'm trying to fill a table "SAMPLE" that requires ids from three other tables.
The table "SAMPLE" that needs to be filled holds the following:
id (integer, not null, pk)
code (text, not null)
subsystem_id (integer, fk)
system_id (integer, not null, fk)
manufacturer_id (integer, fk)
The current query looks like this:
insert into SAMPLE (system_id, manufacturer_id, code, subsystem_id)
values ((select id from system where initial = 'P'),
        (select id from manufacturer where name = 'nameXY'),
        'P0001',
        (select id from subsystem where code = 'NAME PATTERN'));
It is ridiculously slow, inserting 8k rows in around a minute.
I'm not sure if this is a really bad query problem or if my postgres configuration is heavily messed up.
For clarification, more table information:
subsystem:
This table holds 9 fixed values with a basic pattern I can access easily.
system:
This table holds 4 fixed values that can be identified using the "initial" attribute.
manufacturer:
This table holds the name of a manufacturer.
The "SAMPLE" table will be the only connection between those tables so I'm not sure if I can use joins.
I'm pretty sure 8k values should be a gigantic joke to insert for a database so I'm really confused.
My specs:
Win 7 x86_64
8GB RAM
Intel i5 3470S (quad core), 2.9 GHz
Postgres is v9.3
I didn't see any peak during my query so I suspect something is up with my configuration. If you need information about it, let me know.
Note: It is possible that I have codes or names that can not be found in the subsystem or manufacturer tables. Instead of adding nothing, I want to add a NULL value to the cell then.
8000 inserts per minute is roughly 133 per second, or about 7.5 ms per statement.
This is to be expected if the INSERTs happen in a loop, each statement in its own transaction.
Each transaction commits to disk and waits for the disk to confirm that the data is written to durable storage. This is known to be slow.
Add a transaction around the loop with BEGIN and END and it will run at normal speed.
Ideally you wouldn't even have a loop but a more complex query that does a single INSERT to create all the rows from their sources, if possible.
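A minimal sketch of that change, reusing the statement from the question; only the BEGIN and COMMIT are new:
BEGIN;  -- one transaction for the whole batch
insert into SAMPLE (system_id, manufacturer_id, code, subsystem_id)
values ((select id from system where initial = 'P'),
        (select id from manufacturer where name = 'nameXY'),
        'P0001',
        (select id from subsystem where code = 'NAME PATTERN'));
-- ...the remaining ~8000 inserts, unchanged...
COMMIT;  -- a single disk flush here instead of one per row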
I could not test it because I have no PostgreSQL installed and no database with a similar structure, but maybe it would be faster to get the insert data from a single statement:
-- LEFT JOINs so that a missing manufacturer or subsystem yields NULL
-- (as the question requires) instead of dropping the row
INSERT INTO Sample (system_id, manufacturer_id, code, subsystem_id)
SELECT s.id AS system_id,
       m.id AS manufacturer_id,
       'P0001' AS code,
       ss.id AS subsystem_id
FROM system s
LEFT JOIN manufacturer m
       ON m.name = 'nameXY'
LEFT JOIN subsystem ss
       ON ss.code = 'NAME PATTERN'
WHERE s.initial = 'P'
I hope this works.

Returning inserted rows in PostgreSQL

I'm currently working on a report generation servlet that agglomerates information from several tables and generates a report. In addition to returning the resulting rows, I'm also storing them into a reports table so they won't need to be regenerated later, and will persist if the tables they're drawn from are wiped. To do the latter I have a statement of the form (NB: x is externally generated and actually a constant in this statement):
INSERT INTO reports
(report_id, col_a, col_b, col_c)
SELECT x as report_id, foo.a, bar.b, bar.c
FROM foo, bar
This works fine, but then I need a second query to actually return the resulting rows back, e.g.
SELECT col_a, col_b, col_c
FROM reports
WHERE report_id = x
This works fine, and since it only involves the single table it shouldn't be expensive, but it seems like I should be able to directly return the results of the insertion, avoiding the second query. Is there some syntax for doing this that I've not been able to find? (I should note, I'm fairly new at DB work, so if the right answer is to just run the second query, as it's only slightly slower, so be it.)
In PostgreSQL 8.2 and later, you can use this construct:
INSERT INTO reports (report_id, col_a, col_b, col_c)
SELECT x as report_id, foo.a, bar.b, bar.c
FROM foo, bar
RETURNING col_a, col_b, col_c
Or, without a SELECT:
INSERT INTO distributors (did, dname) VALUES (DEFAULT, 'XYZ Widgets')
RETURNING did;
See the PostgreSQL documentation for INSERT ... RETURNING.
You could also use an SRF (set-returning function), although that may be overkill. It depends on what you are trying to do. For example, if you are only returning the information to perform a piece of logic that will go directly back to the database to perform more queries, it may make sense to use an SRF.
http://wiki.postgresql.org/wiki/Return_more_than_one_row_of_data_from_PL/pgSQL_functions
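A minimal sketch of the SRF route, assuming made-up column types for reports (adjust to the real schema; 42 stands in for the externally generated x):
CREATE OR REPLACE FUNCTION insert_report(p_report_id int)
RETURNS TABLE (col_a int, col_b text, col_c text) AS $$
BEGIN
    -- insert the report rows and hand them straight back to the caller
    RETURN QUERY
    INSERT INTO reports (report_id, col_a, col_b, col_c)
    SELECT p_report_id, foo.a, bar.b, bar.c
    FROM foo, bar
    RETURNING reports.col_a, reports.col_b, reports.col_c;
END;
$$ LANGUAGE plpgsql;

-- usage: one round trip performs both the insert and the read-back
SELECT * FROM insert_report(42);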