How many records can I insert at once - tsql

I am using an Azure SQL database and I'm wondering if there is a limit on how many records I can insert at once. I have a case where I need to insert 7000+ records.
If there is a limit, anyone know a good way of inserting the records in batches?
foreach (var record in records)
{
    db.Records.Add(record);
}
db.SaveChanges();

There is no straight answer to this question. Of course there is a limit on how many records you can insert at once. It depends on many things: 32/64-bit, memory, database size, transaction size, how you add entities to EF, the number of records in one action, transaction locks on your database in a production situation, etc. You can write a test that checks how big the chunks should be, as in fastest-way-of-inserting-in-entity-framework. Or you can use AddRange: Ef bulk insert.
db.AddRange(records);
db.SaveChanges();
You have to decide whether you need to insert the records in one transaction, or whether you can use several transactions to keep your table locks smaller.
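The batching idea itself is database-agnostic. As a minimal sketch (in Python with the stdlib sqlite3 module rather than EF, purely to illustrate one-transaction-per-chunk inserts; the table and column names are made up):

```python
import sqlite3

def insert_in_batches(conn, records, batch_size=1000):
    """Insert records in chunks, committing one transaction per chunk."""
    cur = conn.cursor()
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        cur.executemany("INSERT INTO records (name) VALUES (?)", chunk)
        conn.commit()  # one transaction per batch keeps each lock short

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT)")

# 7000+ rows, well within what a single batched pass handles comfortably
insert_in_batches(conn, [(f"row{i}",) for i in range(7000)], batch_size=1000)
print(conn.execute("SELECT COUNT(*) FROM records").fetchone()[0])  # 7000
```

The trade-off mirrors the answer above: one big transaction is atomic but holds locks longer; smaller batches release locks sooner but leave the table partially filled if a later batch fails.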

Related

Efficient Way To Update Multiple Records In PostgreSQL

I have to update multiple records in a table in the most efficient way possible, with the least latency and without heavy CPU utilisation. The number of records to update at a time ranges from 1 to 1000.
We do not want to lock the database when this update occurs as other services are utilising it.
Note: There are no dependencies generated from this table towards any other table in the system.
After looking in many places I've narrowed it down to a few ways to do the task:
simple-update: A simple update query to the table with the update command and already-known IDs
Either multiple update queries (one query for each individual record), or
Usage of the update ... from clause, as mentioned here, as a single query (one query for all records)
delete-then-insert: First delete the outdated data, then insert the updated data with new IDs (since there is no dependency on records, new IDs are acceptable)
insert-then-delete: First insert the updated records with new IDs, then delete the outdated data using the old IDs (since there is no dependency on records, new IDs are acceptable)
temp-table: First insert the updated records into a temporary table. Second, update the original table with the inserted records from the temporary table. Finally, remove the temporary table.
We must not drop the existing table and create a new one in its place
We must not truncate the existing table because we have a huge number of records that we cannot store in the buffer memory
I'm open to any more suggestions.
Also, what will be the impact of making the update all at once vs doing it in batches of 100, 200 or 500?
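As a rough illustration of the batched simple-update option (a Python sqlite3 sketch, not PostgreSQL; the items table and the batch size are made up), one transaction per chunk of known IDs looks like:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, "old") for i in range(1000)])

# Batch the updates: one transaction per chunk of already-known ids,
# so no single transaction holds locks across all 1000 rows.
updates = [("new", i) for i in range(1000)]  # (value, id) pairs
batch = 500
for start in range(0, len(updates), batch):
    with conn:  # commits this chunk as a single transaction
        conn.executemany("UPDATE items SET val = ? WHERE id = ?",
                         updates[start:start + batch])
```

Smaller batches shorten each lock window at the cost of more round trips; the measurements in the answer below quantify that trade-off for one concrete setup.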
References:
https://dba.stackexchange.com/questions/75631
https://dba.stackexchange.com/questions/230257
As mentioned by @Frank Heikens in the comments, I'm sure that different people will have different statistics based on their system design. I did some checks and found some insights to share from one of my development systems.
Configurations of the system used:
AWS
Engine: PostgreSQL
Engine version: 12.8
Instance class: db.m6g.xlarge
Instance vCPU: 4
Instance RAM: 16GB
Storage: 1000 GiB
I used a Lambda function and the pg package to write data into a table (default FILLFACTOR) that contains 3,409,304 records.
Both lambda function and database were in the same region.
UPDATE 1000 records in the database with a single query

| Run | Time taken |
| --- | ---------- |
| 1 | 143.78 ms |
| 2 | 115.277 ms |
| 3 | 98.358 ms |
| 4 | 98.065 ms |
| 5 | 114.78 ms |
| 6 | 111.261 ms |
| 7 | 107.883 ms |
| 8 | 89.091 ms |
| 9 | 88.42 ms |
| 10 | 88.95 ms |
UPDATE 1000 records in the database in 2 batches of 500 records concurrently

| Run | Time taken |
| --- | ---------- |
| 1 | 43.786 ms |
| 2 | 48.099 ms |
| 3 | 45.677 ms |
| 4 | 40.578 ms |
| 5 | 41.424 ms |
| 6 | 44.052 ms |
| 7 | 42.155 ms |
| 8 | 37.231 ms |
| 9 | 38.875 ms |
| 10 | 39.231 ms |
DELETE + INSERT 1000 records in the database

| Run | Time taken |
| --- | ---------- |
| 1 | 230.961 ms |
| 2 | 153.159 ms |
| 3 | 157.534 ms |
| 4 | 151.055 ms |
| 5 | 132.865 ms |
| 6 | 153.485 ms |
| 7 | 131.588 ms |
| 8 | 135.99 ms |
| 9 | 287.143 ms |
| 10 | 175.562 ms |
I did not proceed to check updating records via a buffer table because I had found my answer.
I looked at the database metrics graphs provided by AWS, and it was clear that DELETE + INSERT was more CPU-intensive. And from the statistics shared above, DELETE + INSERT took more time compared to UPDATE.
If updates are done concurrently in batches, yes, updates will be faster, depending on the number of connections (a connection pool is recommended).
Using a buffer table, truncate, and other methods might be more suitable approaches if needed to update almost all the records in a giant table, though I currently do not have metrics to support this. However, for a limited number of records, UPDATE is a fine choice to proceed with.
Also, be mindful that if these are not executed properly, a failed DELETE + INSERT can lose records, and a failed INSERT + DELETE can leave you with duplicate records.
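One way to guard against that failure mode is to run the DELETE + INSERT as a single transaction, so a mid-swap failure rolls back instead of losing records. A minimal sketch in Python with sqlite3 (table names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])
conn.commit()  # settle the original rows before attempting the swap

new_rows = [(10, "a2"), (11, "b2")]
try:
    with conn:  # one transaction: rolls back on any exception
        conn.execute("DELETE FROM t")
        conn.executemany("INSERT INTO t VALUES (?, ?)", new_rows)
        raise RuntimeError("simulated failure mid-swap")
except RuntimeError:
    pass

# The rollback preserved the original rows.
print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 2
```

The same idea applies in PostgreSQL: wrapping the DELETE and INSERT in one BEGIN/COMMIT means either both take effect or neither does.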

Redshift table design for efficiency

I have a redshift cluster with a single dc1.large node. I've got data writing into it, on order of 50 million records a day, in the format of a timestamp, a user ID and an item ID. The item ID (varchar) is unique, the user ID (varchar) is not, and the timestamp (timestamp) is not.
In my redshift DB of about 110m records, if I have a table with no sort key, it takes about 30 seconds to search for a single item ID.
If I have a table with a sort key on item ID, I get a single item ID search time of about 14-16 seconds.
If I have a table with an interleaved sort key across all three columns, the single item ID search time is still 14-16 seconds.
What I'm hoping to achieve is the ability to query for the records of thousands or tens of thousands of item IDs on order of a second.
The query just looks like
select count(*) from rs_table where itemid = 'id123';
or
select count(*) from rs_table where itemid in ('id123','id124','id125');
This query comes back in 541ms
select count(*) from rs_table;
AWS documentation suggests that there is a compile time for queries the first time they're run, but I don't think that's what I'm seeing (and it would not be ideal if it were, since each unique set of 10,000 IDs might never be queried in exactly the same order again).
I have to assume I'm doing something wrong with either the sort key design, the query, or some combination of the two - for only ~10g of table space, something like redshift shouldn't take this long to query, right?
Josh,
We probably need a few additional pieces of information to give you a good recommendation.
Here are some things to start thinking about.
Are most of your queries record lookups as you describe above?
What is your distribution key?
Do you join this table with other large fact tables?
If you load 50M records per day and you only have 110M records in the table, does that mean that you only store 2 days?
Do you do massive deletes and then load another 50M records per day?
Do you run ANALYZE after your loads?
If you deleted a large number of records, did you run VACUUM?
If all of your queries are similar to the ones that you describe, why are you using Redshift? Amazon DynamoDB or MongoDB (even Cassandra) would be great database choices for the types of queries that you describe.
If you run analytical workloads, Redshift is an excellent platform. If you are more interested in "record lookups", the NoSQL options, as well as MySQL or MariaDB, might give you better performance.
Also, if this is a dev/test environment and you have loaded and deleted large amounts of data without ever running a VACUUM you would see significant performance degradation.

SQL Query with a little bit more complicated WHERE clause performance issue

I have some performance issue in this case:
Very simplified query:
SELECT COUNT(*) FROM Items WHERE ConditionA OR ConditionB OR ConditionC OR ...
Simply I have to determine how many Items the user has access through some complicated conditions.
When there is a large number of records (100,000+) in the Items table and, say, ~10 complicated conditions concatenated in the WHERE clause, I get the result in about 2 seconds in my case. The problem is when only a few records meet the conditions, e.g. when I get only 10 Items out of 100,000.
How can I improve the performance in this "Get my items" case?
Additional information:
the query is generated by EF 6.1
MS SQL 2012 Express
SQL Execution Plan
Add an additional table to your schema. Instead of building a long query, insert the value for each condition into this new table, along with a key for that user/session. Then, JOIN the two tables together.
This should perform much better, because it will allow the database engine to make better use of indexes on your Items table.
Additionally, this will position you to eventually pre-define sets of permissions for your users, such that you don't need to insert them right at the moment you do the check. The permission sets will already be in the table, and the new table can also be indexed, which will improve performance further.
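A minimal sketch of this idea, using Python's sqlite3 and made-up table/column names (AllowedOwners standing in for the per-session condition table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (id INTEGER PRIMARY KEY, ownerId INTEGER)")
conn.executemany("INSERT INTO Items VALUES (?, ?)",
                 [(i, i % 100) for i in range(1000)])

# New table holding the values each condition would have matched,
# keyed by a user/session id, inserted just before the check.
conn.execute("CREATE TABLE AllowedOwners (sessionId TEXT, ownerId INTEGER)")
conn.execute("CREATE INDEX ix_allowed ON AllowedOwners (sessionId, ownerId)")
conn.executemany("INSERT INTO AllowedOwners VALUES (?, ?)",
                 [("s1", o) for o in (3, 7)])

# JOIN instead of a long OR chain; the engine can use indexes on both sides.
count = conn.execute("""
    SELECT COUNT(*) FROM Items i
    JOIN AllowedOwners a ON a.ownerId = i.ownerId
    WHERE a.sessionId = 's1'
""").fetchone()[0]
print(count)  # 20  (owners 3 and 7 each hold 10 of the 1000 items)
```

With pre-defined permission sets, the INSERT step disappears and only the indexed JOIN runs at check time.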

Optimize SQLite query that has multiple parameters using FMDB

I have a query that starts pretty simple:
SELECT zudf16, zudf17, zudf18, zudf19, zudf20, zitemid, zItemName, zPhotoName, zBasePrice FROM zskus WHERE zmanufacturerid=? AND zudf16=? ORDER BY zItemID
As choices are made I build on it:
SELECT zudf16, zudf17, zudf18, zudf19, zudf20, zitemid, zItemName, zPhotoName, zBasePrice FROM zskus WHERE zmanufacturerid=? AND zudf16=? AND zudf19=? AND zudf20=? ORDER BY zItemID
For each parameter I add, the query drops in performance, especially on iPad 1. Each of these columns has an index on it.
How can I increase the performance of this query as it gets more targeted, rather than losing performance?
This is across around 60,000 records that match the initial ManufacturerID but with a total of about 200,000 rows total in the database.
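One thing worth trying is a single composite index covering the filtered columns, since SQLite will generally pick only one index per table per query. A sketch in Python's sqlite3 (the column names are from the question; the data here is synthetic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE zskus (
    zmanufacturerid INTEGER, zudf16 TEXT, zudf19 TEXT, zitemid INTEGER)""")
conn.executemany("INSERT INTO zskus VALUES (?, ?, ?, ?)",
                 [(i % 5, f"a{i % 3}", f"b{i % 4}", i) for i in range(1000)])

# A single composite index matching the WHERE columns (plus the ORDER BY
# column last) lets SQLite satisfy the whole query with one index scan,
# instead of choosing just one of several single-column indexes.
conn.execute("""CREATE INDEX ix_skus
    ON zskus (zmanufacturerid, zudf16, zudf19, zitemid)""")

plan = conn.execute("""EXPLAIN QUERY PLAN
    SELECT zitemid FROM zskus
    WHERE zmanufacturerid = ? AND zudf16 = ? AND zudf19 = ?
    ORDER BY zitemid""", (1, "a1", "b1")).fetchall()
print(plan[0][-1])  # the plan detail should name the composite index
```

For the "as choices are made" pattern, one composite index per common filter combination (most selective columns first) would be the next refinement to measure.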

EF CF slow on generating insert statements

I have a project that pulls data from a service (returning XML), which is deserialized into objects/entities.
I'm using EF Code First and testing worked fine until it came to a big chunk of data. Not too big, only 150K records. I used SQL Profiler to check the SQL statements and they are really fast, but there is a huge slowdown in generating the insert statements.
Simply put, the data model is simple: class Client has five child object sets and one many-to-many relationship.
The IDs for the model are provided by the service, so I cleaned up duplicate instances of the same entity (same ID).
var clientList = service.GetAllClients(); // returns IEnumerable<Client>, about 10K clients
var filteredList = Client.RemoveDuplicateInstancesSameEntity(clientList); // returns IEnumerable<Client>
int cur = 0;
int batch = 100;
while (true)
{
    logger.Trace("POINT A: get next batch");
    var importSegment = filteredList.Skip(cur).Take(batch).OrderBy(x => x.Id).ToList();
    if (!importSegment.Any())
        break;
    logger.Trace("POINT B: saving to DB");
    importSegment.ForEach(c => repository.addClient(c));
    logger.Trace("POINT C: calling persist");
    repository.persist();
    cur = cur + batch;
}
The logic is simple: break the work into batches to speed up the process. Every 100 Clients create about 1000 insert statements (for the child records and one many-to-many table).
I used the profiler and logging to analyze this. The log shows POINT B as the last step every time, but I don't see any insert statements in the profiler yet. Then, two minutes later, I see all the insert statements, followed by POINT B for the next batch, and two minutes again.
Did I do anything wrong, or is there a setting or anything else I can do to improve this?
Inserting 1K records seems fast. The database is wiped when the process starts, so there are no records in it. It doesn't seem to be an issue with SQL slowness, but with EF generating the insert statements.
The project works, but it is slow. I want to speed it up and understand more about EF when it comes to big chunks of data. Or is this normal?
The first 100 are fast, and then it gets slower and slower. The issue seems to be at POINT B. Is there too much data for the repo/DbContext to handle in a timely manner?
The repo inherits from DbContext, and addClient is simply
dbcontext.Client.Add(client)
Thank you very much.