Insert Statement Times out due to nvarchar(max) field - entity-framework

I have a SQL Server database on Azure that processes incoming EDI documents. Basically, it receives the data and saves it to a row in a table. The 'edi_data' column can be as large as 7 MB.
We have been using this for about two years with no problems. However, in the last two days the insert statement has exceeded the 30-second timeout and so throws an error.
The DTU of this database has been increased from 15 to 100. Raising the DTU did help to process more transmissions, but this error is occurring again today. DTU usage is not getting above 35% after the increase.
This is the INSERT as generated by Entity Framework 6:
(@0 int, @1 int, @2 datetime2(7), @3 nvarchar(max), @4 nvarchar(max),
@5 nvarchar(max), @6 nvarchar(max), @7 nvarchar(max), @8 nvarchar(max),
@9 nvarchar(max), @10 nvarchar(max), @11 nvarchar(max), @12 nvarchar(max),
@13 nvarchar(max), @14 nvarchar(max), @15 nvarchar(max), @16 nvarchar(max),
@17 bit, @18 int)
INSERT [dbo].[transmission]([transmission_status_id],
[transmission_attempts], [transmission_date], [edi_data], [originator_num],
[recipient_num], [error_message], [encryption_type], [gisb_version],
[receipt_signing_protocol], [receipt_type], [http_request],
[request_headers], [http_response], [response_headers], [edi_type],
[original_file_name], [file_name], [archive_flag], [group_control_code],
[orig_transmission_id], [direction])
VALUES (@0, @1, @2, @3, @4, @5, @6, @7, @8, @9, @10, @11, @12, NULL, @13, @14, @15, @16, @17, NULL, NULL, @18)
Is there some other way to resolve this other than increasing the DTU even more? (I know I can increase the Command Timeout to more than 30 seconds, but I would like to fix the speed issue if possible.)

The nvarchar(max) and varbinary(max) types are meant for storing CLOBs and BLOBs. While one can save a relatively small buffer or string directly, it's far more efficient to use a streaming API to copy large amounts of data to the server.
This can lead to significant memory savings for the client too, as it won't have to allocate the entire 7 MB string at once and then pay the inevitable garbage-collection penalty.
Entity Framework doesn't offer any streaming functionality directly. ADO.NET, and specifically the SqlClient provider, does. The SqlClient Streaming Support article in the documentation explains how to use streaming to load or store big files into a BLOB field.
A SqlParameter with a text type like NVarChar can accept a TextReader as a value instead of a string. SqlClient will read the data from the reader and send it to the database.
Stealing the doc samples, for this table:
CREATE TABLE [TextStreams] (
    [id] INT PRIMARY KEY IDENTITY(1, 1),
    [textdata] NVARCHAR(MAX)
)
The following method will copy data from the source stream to the server:
private static async Task StreamTextToServer() {
    using (SqlConnection conn = new SqlConnection(connectionString)) {
        await conn.OpenAsync();
        using (SqlCommand cmd = new SqlCommand("INSERT INTO [TextStreams] (textdata) VALUES (@textdata)", conn)) {
            using (StreamReader file = File.OpenText("textdata.txt")) {
                // Add a parameter which uses the StreamReader we just opened
                // Size is set to -1 to indicate "MAX"
                cmd.Parameters.Add("@textdata", SqlDbType.NVarChar, -1).Value = file;

                // Send the data to the server asynchronously
                await cmd.ExecuteNonQueryAsync();
            }
        }
    }
}
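Applied to the question's scenario, where the 7 MB EDI payload is already held in memory as a string, a StringReader can be wrapped around it and passed the same way. This is only a sketch: it shows just the edi_data column (the real statement binds all the columns above), and connectionString/ediData are assumed to be supplied by the caller.

// Requires System.Data, System.Data.SqlClient, System.IO and System.Threading.Tasks.
private static async Task StreamEdiToServer(string connectionString, string ediData) {
    using (var conn = new SqlConnection(connectionString)) {
        await conn.OpenAsync();
        using (var cmd = new SqlCommand("INSERT INTO [dbo].[transmission] ([edi_data]) VALUES (@edi_data)", conn))
        using (var reader = new StringReader(ediData)) {
            // StringReader derives from TextReader, so SqlClient streams the payload
            // to the server in chunks instead of binding one huge parameter value.
            cmd.Parameters.Add("@edi_data", SqlDbType.NVarChar, -1).Value = reader;
            await cmd.ExecuteNonQueryAsync();
        }
    }
}

Since the string already exists, the memory saving is smaller than with a file, but the send itself is still chunked rather than buffered as one value.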

Related

Database calls, 484ms apart, are producing incorrect results in Postgres

We have "things" sending data to AWS IoT. A rule forwards the payloads to a Lambda which is responsible for inserting or updating the data into Postgres (AWS RDS). The Lambda is written in python and uses PG8000 for interacting with the db. The lambda event looks like this:
{
    "event_uuid": "8cd0b9b1-be93-49f8-1234-af4381052672",
    "date": "2021-07-08T16:09:25.138809Z",
    "serial_number": "a1b2c3",
    "temp": "34"
}
Before inserting the data into Postgres, a query is run on the table to look for any existing event_uuids which are required to be unique. For a specific reason, there is no UNIQUE constraint on the event_uuid column. If the event_uuid does not exist, the data is inserted. If the event_uuid does exist, the data is updated. This all works great, except for the following case.
THE ISSUE: one of our things is sending two of the same payloads in very quick succession. It's an issue with one of our things but it's not something we can resolve at the moment and we need to account for it. Here are the timestamps from CloudWatch of when each payload was received:
2021-07-08T12:10:09.288-04:00
2021-07-08T12:10:09.772-04:00
As a result of the payloads being received 484ms apart, the Lambda is inserting both payloads instead of inserting the first and performing an update with the second one.
Any ideas on how to get around this?
Here is part of the Lambda code...
conn = make_conn()
event_query = f"""
    SELECT json_build_object('uuid', uuid)
    FROM samples
    WHERE event_uuid='{event_uuid}'
    AND serial_number='{serial_number}'
"""
event_resp = fetch_one(conn, event_query)

if event_resp:
    update_sample_query = f"""
        UPDATE samples SET temp={temp} WHERE uuid='{event_resp["uuid"]}'
    """
else:
    insert_sample_query = f"""
        INSERT INTO samples (uuid, event_uuid, temp)
        VALUES ('{uuid4()}', '{event_uuid}', {temp})
    """

How does read-through work in Ignite?

My cache is empty, so SQL queries return null.
Read-through means that if there is a cache miss, Ignite will automatically go down to the underlying db (or persistent store) to load the corresponding data.
If new data is inserted into the underlying db table, do I have to take the cache server down to load the newly inserted data from the db table, or will it sync automatically?
Does it work the same as Spring's @Cacheable, or does it work differently?
It looks to me like the answer is no. Cache SQL queries don't work since there is no data in the cache, but when I tried cache.get I got the following results:
case 1:
System.out.println("data == " + cache.get(new PersonKey("Manish", "Singh")).getPhones());
result ==> data == 1235
case 2 :
PersonKey per = new PersonKey();
per.setFirstname("Manish");
System.out.println("data == " + cache.get(per).getPhones());
This throws the following error:
[stack-trace screenshots: error image, image2]
Read-through semantics can be applied when there is a known set of keys to read. This is not the case with SQL, so in case your data is in an arbitrary 3rd party store (RDBMS, Cassandra, HBase, ...), you have to preload the data into memory prior to running queries.
However, Ignite provides native persistence storage [1] which eliminates this limitation. It allows you to use any Ignite API without having anything in memory, and this includes SQL queries as well. Data will be fetched into memory on demand while you're using it.
[1] https://apacheignite.readme.io/docs/distributed-persistent-store
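For illustration, a minimal sketch of enabling native persistence, assuming a recent Ignite 2.x Java API:

// Requires org.apache.ignite.Ignition and the org.apache.ignite.configuration classes.
IgniteConfiguration cfg = new IgniteConfiguration();
DataStorageConfiguration storageCfg = new DataStorageConfiguration();
storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
cfg.setDataStorageConfiguration(storageCfg);

Ignite ignite = Ignition.start(cfg);
ignite.cluster().active(true); // persistent clusters must be activated explicitly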
When you insert something into the database and it is not in the cache yet, then get operations will retrieve missing values from DB if readThrough is enabled and CacheStore is configured.
But currently it doesn't work this way for SQL queries executed on cache. You should call loadCache first, then values will appear in the cache and will be available for SQL.
When you perform your second get, the exact combination of name and lastname is sought in the DB. It is converted into a CQL query containing a lastname=null condition, and it fails because lastname cannot be null.
UPD:
To get all records that have firstname column equal to 'Manish' you can first do loadCache with an appropriate predicate and then run an SQL query on cache.
cache.loadCache((k, v) -> v.lastname.equals("Manish"));
SqlFieldsQuery qry = new SqlFieldsQuery("select firstname, lastname from Person where firstname='Manish'");
try (FieldsQueryCursor<List<?>> cursor = cache.query(qry)) {
    for (List<?> row : cursor)
        System.out.println("firstname:" + row.get(0) + ", lastname:" + row.get(1));
}
Note that loadCache is a complex operation and requires running over all records in the DB, so it shouldn't be called too often. You can provide null as the predicate, in which case all records will be loaded from the database.
Also, to make SQL run fast on the cache, you should mark the firstname field as indexed in the QueryEntity configuration (a sketch of that follows at the end of this answer).
In your case 2, have you tried specifying lastname as well? By your stack trace it's evident that Cassandra expects it to be not null.
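Regarding the indexing note above, a minimal QueryEntity sketch might look like this (field and class names are assumed to match the PersonKey/Person classes used in the question):

// Assumes org.apache.ignite.cache.QueryEntity / QueryIndex and java.util.* are imported.
CacheConfiguration<PersonKey, Person> ccfg = new CacheConfiguration<>("personCache");
QueryEntity entity = new QueryEntity(PersonKey.class.getName(), Person.class.getName());

LinkedHashMap<String, String> fields = new LinkedHashMap<>();
fields.put("firstname", String.class.getName());
fields.put("lastname", String.class.getName());
fields.put("phones", String.class.getName());
entity.setFields(fields);

// Index on firstname so "where firstname='Manish'" doesn't scan the whole cache.
entity.setIndexes(Collections.singletonList(new QueryIndex("firstname")));
ccfg.setQueryEntities(Collections.singletonList(entity));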

C# Comparing lists of data from two separate databases using LINQ to Entities

I have 2 SQL Server databases, hosted on two different servers. I need to extract data from the first database, which is going to be a list of integers. Then I need to compare this list against data in multiple tables in the second database. Depending on some conditions, I need to update or insert some records in the second database.
My solution:
(WCF Service/Entity Framework using LINQ to Entities)
Getting the list of integers from the 1st db takes less than a second and returns 20,942 records.
I use the list of integers to compare against a table in the second db using the following query:
List<int> pastDueAccts; //Assuming this is the list from Step#1
var matchedAccts = from acct in context.AmAccounts
where pastDueAccts.Contains(acct.ARNumber)
select acct;
The above query is taking so long that it gives a timeout error, even though the AmAccount table only has ~400 records.
After I get these matchedAccts, I need to update or insert records in a separate table in the second db.
Can someone help me do step #2 more efficiently? I think the Contains function makes it slow. I tried brute force too, by putting a foreach loop in which I extract one record at a time and do the comparison. It still takes too long and gives a timeout error. The database server shows only 30% of the memory has been used.
Profile the SQL query being sent to the database by using SQL Profiler. Capture the SQL statement sent to the database and run it in SSMS. You should be able to capture the overhead imposed by Entity Framework at this point. Can you paste the SQL statement emitted in step #2 into your question?
The query itself is going to have all 20,942 integers in it.
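Roughly, the Contains call gets translated into one giant literal IN list, something like the following (abbreviated, with made-up values; the exact column list and aliasing depend on the EF version):

SELECT [Extent1].*
FROM [dbo].[AmAccounts] AS [Extent1]
WHERE [Extent1].[ARNumber] IN (1001, 1002, 1003 /* ... 20,939 more values ... */)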
If your AmAccount table will always have a low number of records like that, you could just return the entire list of ARNumbers, compare them to the list, then be specific about which records to return:
List<int> pastDueAccts; //Assuming this is the list from Step#1
List<int> amAcctNumbers = (from acct in context.AmAccounts
                           select acct.ARNumber).ToList();

//Get a list of integers that are in both lists
var pastDueAmAcctNumbers = pastDueAccts.Intersect(amAcctNumbers);

var pastDueAmAccts = from acct in context.AmAccounts
                     where pastDueAmAcctNumbers.Contains(acct.ARNumber)
                     select acct;
You'll still have to worry about how many ids you are supplying to that query, and you might end up needing to retrieve them in batches.
UPDATE
Hopefully somebody has a better answer than this, but with so many records and doing this purely in EF, you could try batching it like I stated earlier:
//Suggest disabling auto detect changes
//Otherwise you will probably have some serious memory issues
//With 2MM+ records
context.Configuration.AutoDetectChangesEnabled = false;

List<int> pastDueAccts; //Assuming this is the list from Step#1
const int batchSize = 100;

for (int i = 0; i < pastDueAccts.Count; i += batchSize)
{
    //GetRange must not run past the end of the list on the final batch
    var batch = pastDueAccts.GetRange(i, Math.Min(batchSize, pastDueAccts.Count - i));
    var pastDueAmAccts = from acct in context.AmAccounts
                         where batch.Contains(acct.ARNumber)
                         select acct;
}

EF CF slow on generating insert statements

I have a project that pulls data from a service (returning XML), which is deserialized into objects/entities.
I'm using EF Code First and testing works fine until it comes to a big chunk of data; not too big, only 150K records. I used SQL Profiler to check the SQL statements and they're really fast, but there is a huge slowdown in generating the insert statements.
Simply put, the data model is simple: class Client has several child object sets (5) and 1 many-to-many relationship.
The IDs for the model are provided by the service, so I cleaned up duplicate instances of the same entity (same ID).
var clientList = service.GetAllClients(); // returns IEnumerable<Client>, about 10K clients
var filteredList = Client.RemoveDuplicateInstancesSameEntity(clientList); // returns IEnumerable<Client>

int cur = 0;
int batch = 100;
while (true)
{
    logger.Trace("POINT A: get next batch");
    var importSegment = filteredList.Skip(cur).Take(batch).OrderBy(x => x.Id);

    if (!importSegment.Any())
        break;

    logger.Trace("POINT B: Saving to DB");
    importSegment.ForEach(c => repository.addClient(c));

    logger.Trace("POINT C: calling persist");
    repository.persist();

    cur = cur + batch;
}
The logic is simple: break it up into batches to speed up the process. Each 100 Clients create about 1,000 insert statements (for the child records and 1 many-to-many table).
I'm using the profiler and logging to analyze this. The log shows POINT B as the last step every time, but I don't see any insert statements in the profiler yet. Then 2 minutes later I see all the insert statements, and then POINT B for the next batch, and 2 minutes again.
Did I do anything wrong, or is there a setting or anything else I can do to improve this?
Inserting 1K records seems to be fast. The database is wiped when the process starts, so there are no records in there. It doesn't seem to be an issue with SQL slowness, but with EF generating the insert statements.
Although the project works, it is slow. I want to speed it up and understand more about EF when it comes to big chunks of data. Or is this normal?
The first 100 are fast, and then it gets slower and slower and slower. It seems like the issue is at POINT B. Is the issue that the repo/dbcontext accumulates too much data and can't handle it in a timely manner?
The repo inherits from DbContext, and addClient is simply:
dbcontext.Client.Add(client);
Thank you very much.
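A minimal sketch of one commonly suggested mitigation, assuming a DbContext-derived repository like the one described: keep the change tracker small by turning off automatic change detection and using a fresh context per batch. ClientContext and its Clients set are assumed names, not from the post.

var filteredList = Client.RemoveDuplicateInstancesSameEntity(service.GetAllClients()).ToList();
const int batchSize = 100;

for (int cur = 0; cur < filteredList.Count; cur += batchSize)
{
    // A new context per batch keeps the tracked graph small, so DetectChanges
    // stays cheap instead of degrading as more Clients accumulate.
    using (var context = new ClientContext())
    {
        context.Configuration.AutoDetectChangesEnabled = false;
        context.Configuration.ValidateOnSaveEnabled = false;

        foreach (var client in filteredList.Skip(cur).Take(batchSize))
            context.Clients.Add(client);

        context.SaveChanges();
    }
}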

Query for null records where filestream is enabled

I have a table that we just enabled FILESTREAM on. We created a new varbinary column and set it to store to a filestream. Then we copied everything from the existing column to the new one in order to get the file data pushed to the file system.
So far so good.
However, we weren't able to take the DB offline while doing this (uptime SLA), and there were 2 records out of 7,400 that came in after the update statement ran but before we renamed the columns. We currently have 2 columns, FileData and FileDataOld, where FileData is the one tied to the filestream.
The average file size is a little over 2MB. So, I decided to run a very simple select statement to find the records that didn't go:
select DocumentId, FileName
from docslist
where FileData is null
When I ran this query, the CPU spiked to 80% and sat there for quite a while. Ultimately I killed the select after 2 minutes because that was just insane.
If I run something like:
select DocumentId, FileName from docslist
It returns almost instantly.
However, as soon as I try to query where FileData or FileDataOld is null it spins off into forever land.
Using Resource Monitor, when I query for 'FileData is null' I can see it pulling every byte of every single document off the file system, which is pretty odd; you'd think that info would be stored within the table itself.
If I query for FileDataOld is null, it looks like it's trying to load the entire table (16 GB) into memory.
How can I improve this? I just need to get the 2 records that came in after the update statement and force those documents to move over.
Can't you do:
select DocumentId, FileName from docslist WHERE DATALENGTH(FileData)>0
On MSDN it says:
DATALENGTH is especially useful with varchar, varbinary, text, image,
nvarchar, and ntext data types because these data types can store
variable-length data.
The DATALENGTH of NULL is NULL.
Reference here
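Following that note, since the DATALENGTH of NULL is NULL, the two rows that never got copied could be found with a check along these lines (column names taken from the question; just a sketch of the same idea):

select DocumentId, FileName
from docslist
where DATALENGTH(FileData) is null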