Performance Issue in Entity Framework Database first approach

I have a Windows Forms application that uses the database-first approach of Entity Framework. The application is used to read txt files of 50,000 rows. Each row has 20 values, and for each value we have business rules that must be validated against database values before the values are inserted into 4-5 different tables.
So we have a minimum of 50 database calls in our code for each row.
Initially I faced an "out of memory" exception, and because of that I was not able to process txt files above 500 rows. I have since fixed that issue and the application runs fine, but it takes disproportionately longer as the row count grows: 500 rows (3.5 min), 1,000 rows (10 min), 2,000 rows (30 min).
I would like to know whether the context object holds on to the data every time I insert into the database, and why a simple select query (lambda expression) takes more time once the row count goes above 500.
Appreciate your response/comments/suggestions.
Note:
1) We have around 15-20 tables.
2) One edmx file with some custom classes.
3) The context and edmx file reside in a separate class library, which we use as a DLL reference.
Thanks in advance.
Manoj

Related

How to improve spring data JPA performance

I am attempting to improve the performance of my application, in which one of the operations is to read data from a CSV file and store the values from each row as one POJO (so 1500 CSV rows = 1500 POJOs) in a PostgreSQL database. It is a Spring Boot application and uses a JpaRepository (with default configurations) as the means of persistence. My original attempt was basically this statement in each iteration of the loop as it read each row in the CSV file:
autowiredRepoInstance.save(objectInstance);
However, with the spring.jpa.show-sql=true setting in the application.properties file, I saw that one insert was being done for each POJO. My attempt at improving the performance was to declare an ArrayList outside the loop, add each instance of the POJO to that list within the loop, and at every 500th item perform a save, as below (ignoring for now the cases where the count is not an exact multiple of 500):
loop over each CSV row {
    objList.add(objectInstance);
    if (objList.size() == 500) {
        autowiredRepoInstance.save(objList);
        objList.clear();
    }
}
However, this was also generating individual insert statements. What settings can I change to improve performance? Specifically, I would like to minimize the number of SQL statements/operations and have the underlying Hibernate use the "multirow" inserts that PostgreSQL allows:
https://www.postgresql.org/docs/9.6/static/sql-insert.html
But any other suggestions are also welcomed.
Thank you.
First read all the data from the CSV and process it like below:
Create a buffered reader over the input file.
Create a stream over the buffered reader and apply filter/map operations to process the data.
The output of the above is a list of entities.
Divide the list of entities into a list of lists of entities (if you have huge data, e.g. more than a million records).
Pass each inner list of entities (a size of 10,000 works) to the JPA repository's save method in batches (if possible, use a parallel stream); a rough sketch follows below.
I processed 1.3 million records in less than a minute with the above process.
Or use a batch-processing technology.
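A minimal sketch of the chunked save (not the answerer's exact code), assuming a Spring Data JPA 2.x repository for a placeholder entity type, and assuming Hibernate JDBC batching is enabled, e.g. spring.jpa.properties.hibernate.jdbc.batch_size=500, spring.jpa.properties.hibernate.order_inserts=true, and reWriteBatchedInserts=true on the PostgreSQL JDBC URL so the driver rewrites each batch into a multirow insert:

import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;

public class ChunkedSaver {

    private static final int CHUNK_SIZE = 500;

    // "repo" stands in for the question's autowiredRepoInstance and "entities"
    // for the POJOs already mapped from the CSV rows.
    public static <T, ID> void saveInChunks(JpaRepository<T, ID> repo, List<T> entities) {
        for (int start = 0; start < entities.size(); start += CHUNK_SIZE) {
            List<T> chunk = entities.subList(start, Math.min(start + CHUNK_SIZE, entities.size()));
            repo.saveAll(chunk); // one repository call per chunk; Hibernate/JDBC batch the INSERTs
        }
    }
}

On Spring Data 1.x the call is save(Iterable) rather than saveAll(..), as in the question's own snippet.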

SQL Query slow during batch update of table

I have a PostgreSQL table with about 250K records. It gets updated a few times an hour; however, each update deletes the entire table and re-adds the records (a batch job).
I don't have much control over that process. During the time it takes the transaction to delete/re-load, queries on the table basically lock/hang until the job finishes. The job takes about a minute to run. We have real-time users looking at this data (which is spatial data on a map with a time slider), and they notice the missing records very easily.
Is there anything that can be done about these 60-or-so-second query times during the update? I've thought about loading into a second table, dropping the original, and renaming the second to the original (a rough sketch of that swap is below), but that introduces more chance of error. Are there any settings that will just grab the data as-is and not necessarily look for a consistent view of the data?
Basically just looking for ideas on how to handle this situation.
I'm running postgresql 9.3.1
Thanks
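For illustration only, here is roughly what the staging-table swap mentioned in the question could look like from Java/JDBC, assuming hypothetical table names mytable and mytable_staging. PostgreSQL DDL is transactional, so readers only block for the instant the renames take; views, indexes, permissions, and foreign keys referencing the old table would still need handling.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TableSwap {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "batch_user", "secret")) {
            conn.setAutoCommit(false);

            // The batch job loads the fresh data into mytable_staging here,
            // while readers keep querying the live mytable.

            try (Statement st = conn.createStatement()) {
                // Swap the tables inside a single transaction.
                st.execute("ALTER TABLE mytable RENAME TO mytable_old");
                st.execute("ALTER TABLE mytable_staging RENAME TO mytable");
            }
            conn.commit();

            try (Statement st = conn.createStatement()) {
                st.execute("DROP TABLE mytable_old");
            }
            conn.commit();
        }
    }
}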

BigQuery streaming insert using template tables data availability issue

We have been using BigQuery for over a year now with no issues. We load data as batch jobs every few hours and it usually is instantly available.
We just started experimenting with streaming inserts using template tables. With our first test, we saw no errors and the data showed up instantly. The test created approximately 120 tables. A simple select count (using the web ui) on the tables came up with the right total number of ~8000 rows. After a couple of hours of more streaming, the total dropped to ~1400 rows.
Unsure about what happened, we dropped the dataset, recreated the template table and re-ran the streaming. This time around, the tables showed up right away but the data did not. On our third attempt the tables themselves did not show up for more than a couple of hours. We are on the fourth attempt and this time we only streamed data belonging to one table. The table showed up right away, but it has been over an hour and the data does not show up.
The streaming service uses the latest Java library, inserts only one record at a time, and logs the response. When no exception is thrown, the response is always {"kind":"bigquery#tableDataInsertAllResponse"} with no errors.
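For reference, a single-row streaming insert against a template table looks roughly like the sketch below; this uses the newer google-cloud-bigquery client and made-up dataset, table, and field names, not the exact production code.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.HashMap;
import java.util.Map;

public class StreamOneRow {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Illustrative dataset/table/field names only.
        TableId baseTable = TableId.of("my_dataset", "events_template");
        Map<String, Object> row = new HashMap<>();
        row.put("event_id", "abc-123");
        row.put("payload", "hello");

        InsertAllRequest request = InsertAllRequest.newBuilder(baseTable)
                .setTemplateSuffix("_20160101") // targets/creates events_template_20160101
                .addRow(row)
                .build();

        InsertAllResponse response = bigquery.insertAll(request);
        if (response.hasErrors()) {
            System.err.println("Insert errors: " + response.getInsertErrors());
        }
    }
}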
Any help trying to understand what is happening would be great. Thanks.
Looks like we've identified the issue. It appears there's a race, in the template-tables path only, that causes our system to think the first chunk of data was deleted by user action (table truncation, which it obviously wasn't), so that chunk is dropped. We've identified the fix and will push it out shortly.
Thanks for letting us know!

getting data from DB in spring batch and store in memory

In my Spring Batch program, I am reading records from a file and checking against the DB whether the data (say, column1 from the file) already exists in table1.
Table1 is fairly small and static. Is there a way I can get all the data from table1 and store it in memory in the Spring Batch code? Right now, for every record in the file, a select query hits the DB.
The file has 3 columns delimited with "|".
The file I am reading has on average 12 million records, and the job takes around 5 hours to complete.
Preload the table into memory using StepExecutionListener.beforeStep (or @BeforeStep); a sketch follows below.
With this trick the data is loaded once, before step execution.
This also works for step restarts.
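A minimal sketch of that idea, assuming a hypothetical FileRecord item type and that this processor is registered as a listener on the step; the "skip rows already in table1" behaviour is an assumption, not something stated in the question:

import java.util.HashSet;
import java.util.Set;
import javax.sql.DataSource;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.jdbc.core.JdbcTemplate;

public class Column1CheckProcessor implements ItemProcessor<FileRecord, FileRecord> {

    private final DataSource dataSource;
    private Set<String> existingKeys;

    public Column1CheckProcessor(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @BeforeStep
    public void loadTable1(StepExecution stepExecution) {
        // Runs once before the step: pull the small, static table1 into memory.
        existingKeys = new HashSet<>(new JdbcTemplate(dataSource)
                .queryForList("SELECT column1 FROM table1", String.class));
    }

    @Override
    public FileRecord process(FileRecord item) {
        // In-memory lookup instead of one SELECT per file record;
        // returning null filters the item out (here: skip rows already in table1).
        return existingKeys.contains(item.getColumn1()) ? null : item;
    }
}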
I'd use caching like a standard web app. Add service caching using Spring's caching abstractions and that should take care of it IMHO.
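A minimal sketch of the caching idea, assuming Spring's cache abstraction with @EnableCaching configured on some configuration class and a JdbcTemplate; the service and cache names are illustrative:

import java.util.HashSet;
import java.util.Set;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;

@Service
public class Table1Service {

    private final JdbcTemplate jdbcTemplate;

    public Table1Service(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // The first call hits the DB; later calls are served from the cache.
    @Cacheable("table1Column1")
    public Set<String> column1Values() {
        return new HashSet<>(jdbcTemplate.queryForList("SELECT column1 FROM table1", String.class));
    }
}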
Load the static table in JobExecutionListener.beforeJob(..), keep it in the job's ExecutionContext, and access it from multiple steps using 'Late Binding of Job and Step Attributes'; a sketch follows below.
You may refer to section 5.4 of this link: http://docs.spring.io/spring-batch/reference/html/configureStep.html
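A rough sketch of that approach; the class name and context key are made up, and the listener would be wired into the job configuration:

import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.jdbc.core.JdbcTemplate;

public class Table1Loader implements JobExecutionListener {

    private final JdbcTemplate jdbcTemplate;

    public Table1Loader(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void beforeJob(JobExecution jobExecution) {
        // Load table1 once and stash it in the job's ExecutionContext.
        List<String> values = jdbcTemplate.queryForList("SELECT column1 FROM table1", String.class);
        jobExecution.getExecutionContext().put("table1Column1", new ArrayList<>(values));
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        // nothing to do
    }
}

// A @StepScope bean in any step can then pick the list up via late binding, e.g.
// @Value("#{jobExecutionContext['table1Column1']}") List<String> table1Column1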

Crystal Reports - very large database, very long processing time

I'm really at a loss as to how to proceed.
I have a very large database, and the table I'm accessing has approx. 600,000 records. This database is accessed by an accounting application, which provides the report with the SQL query through which the report accesses the database.
My report has a linked subreport with restrictions that are placed in the report header. When this report is run, the average time to refresh, using a very basic query, is 36 minutes. When two more items are added to the query, the report takes 2.5 hours.
Here is what I've tried:
cleaned up the report, leaving only the items that are absolutely necessary - no difference
removed most formulas (removing the remaining formulas makes no time difference)
tried editing the SQL query - wasn't allowed because of the accounting application
tried flipping subreport and main report - didn't work
added other groupings - no difference
removed groupings - no difference
checked all the servers for lack of temp disc space - no issue
tried "on demand" subreport - no change
checked Parameters (discrete vs. range) and it is as it should be
tried bursting indexes, grouping on server, etc. - no difference
the report requires 2 passes. I've tried getting it down to one pass unsuccessfully.
There must be something I'm missing.
There do not appear to be any other modifications to the report using regular Crystal functions. Is there any way to speed up access to the data without having to go through all 600,000 records? The SQL query that accesses this data is long and has many requests, and it is not something I can change.
Can I add something (formula?) that nullifies these requests? I'm reaching now...
A couple of things we have had success with are adding indexes to the databases and, instead of importing tables into the report, writing a stored procedure to retrieve the desired results.
If indices and stored procedures don't get you where you need to be, you have reached the "denormalise until it works" part of life with a database. You might want to look at creating an MI database with tables optimized for your reporting needs, plus some data transformation scripts that extract the data from production into your MI database. Depending on your platform, Oracle / MS have tools to help you do this.
We use Crystal Reports with a billing system, and we had queries in the database that took over 1.5 hours to complete. That doesn't even take into account the rendering/formatting of the reports.
We created materialized views and force the client to refresh them daily. A materialized view is basically a database view that holds the returned dataset; the dataset is not refreshed unless you explicitly tell it to refresh.
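As an illustration of the daily refresh step only (the answer does not say which database is involved; this assumes PostgreSQL-style materialized views, a hypothetical view name billing_summary_mv, and a plain JDBC call run from a scheduled job):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RefreshMaterializedView {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/billing", "report_user", "secret");
             Statement st = conn.createStatement()) {
            // Re-materialize the cached dataset so the report reads precomputed rows.
            st.execute("REFRESH MATERIALIZED VIEW billing_summary_mv");
        }
    }
}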
Do you know what the SQL query is? If so, you can move the report outside the accounting application and paste the query directly into the Command in the database expert. I've had to do this in a couple of cases with another application I work with.