What's the drawback of SqlBulkCopy - sqlbulkcopy

I did some research on "the best way to insert huge amounts of data into a DB with C#", and a lot of people suggested SqlBulkCopy. I tried it out and it really amazed me: SqlBulkCopy is undoubtedly very fast. It seems like a perfect way to insert data (especially huge amounts of it). So why don't we use it all the time? Are there any drawbacks to using SqlBulkCopy?

A SqlBulkCopy equivalent (OracleBulkCopy) does exist for Oracle v11 as well, but it's provided by the Oracle .NET assemblies you get when you install Oracle Client. A bulk copy class is basically implemented separately for each target database engine, by that engine's provider.
One HUGE drawback, though: there is no row-level error reporting. If, for example, you've updated data in a DataSet, are flushing it back to the DB with an adapter, and there's a key violation (or any other failure), the culprit DataRows will have .HasErrors set to true, and you can add that to your exception message when it's raised.
With SqlBulkCopy, you just get the type of the error and that's it. Good luck debugging it.
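One common workaround for the missing row-level detail (not something the answer above proposes) is to retry a failed batch one row at a time so the offending rows can be named. A minimal sketch of that idea in Python, with in-memory SQLite standing in for SQL Server and `executemany` standing in for the bulk copy; the `people` table and the function name are made up for illustration:

```python
import sqlite3

def insert_batch_with_diagnostics(conn, rows):
    """Try the fast path first; on failure, replay row by row to find culprits.

    Returns a list of (row, error message) for every rejected row.
    """
    sql = "INSERT INTO people (id, name) VALUES (?, ?)"
    try:
        with conn:  # commits on success, rolls back the whole batch on error
            conn.executemany(sql, rows)
        return []
    except sqlite3.IntegrityError:
        pass  # the batch-level error does not say which row failed

    rejected = []
    for row in rows:
        try:
            with conn:  # each row gets its own small transaction
                conn.execute(sql, row)
        except sqlite3.IntegrityError as exc:
            rejected.append((row, str(exc)))
    return rejected

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
bad = insert_batch_with_diagnostics(conn, [(1, "Ann"), (2, "Bob"), (1, "Dup")])
```

Note that this variant keeps the good rows and only reports the bad ones. If the batch must be all-or-nothing, the slow pass would have to run inside a single transaction that is rolled back at the end.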

Two reasons I can think of:
As far as I know, it's only available for Microsoft SQL Server.
In a lot of normal workloads, you don't do bulk inserts, but occasional inserts intermixed with selects and updates. Microsoft themselves state that a normal insert is more efficient for that, on the SqlBulkCopy MSDN page.
Note that if you want a SqlBulkCopy to be equivalent to a normal insert, at the very least you'll have to pass it the SqlBulkCopyOptions.CheckConstraints option.

Related

Azure Database, EF, Time out issues

I have taken over an existing MVC website which uses Entity Framework and Hangfire, is hosted on Azure, and uses an Azure database. Every so often the website times out.
I'm new to the Azure portal, Entity Framework and Hangfire.
If I increase the DTUs, the timeout issues clear up.
I'm looking for ways to diagnose why the website times out. I have added error logging using ELMAH and checked Hangfire, but this doesn't give me any further information.
Is there anything in azure portal that can help?
If it "times out" and "increasing DTUs resolves the timeouts", and these observations are true (it's on you to really convince yourself of this; don't make the assumption lightly), then the usual and obvious candidate is a slow SQL query. Entity Framework is often used with LINQ to create SQL queries without writing SQL. These queries are often fine for very simple tasks, such as someData.Where(x=>x.Id == 1).First(). However, if LINQ is used to join tables or create complex associations, the generated SQL can become monstrously bad from a performance perspective. You can add logging to write out the SQL generated by LINQ, or you can trace the database to see what SQL is running on it. If tracing is out of the question, there are still meta queries you can use to view things like cached query plans, and SQL Server can give you estimated costs and cached execution counts.
You can still hang yourself without using LINQ, since you can still use stored procedures with EF. Way too many developers are still naive about SQL performance; you need to comb over your back end, learn the schema and the stored procedures, and inspect the SQL contents of everything. Check for any database triggers (easy to miss). Red flags are subqueries, too many joins, too many results from a query, lots of string manipulation in a query, joining tables on strings, or XML/JSON-based SQL work.
Be aware that "slow sql queries" will become slower when load is high. And when slow sql queries build up, they only take more time to resolve. This can also cause debilitating table locking, depending on the nature of the query.
But queries can be performant and still cause locking, e.g. one table is being written to often and that's blocking other writes to or reads from that table. This is a little harder to diagnose, but you can figure it out by carefully inspecting logs of database calls and how long they take to execute. There are also SQL queries you can run on the database to diagnose long-running queries, or to see which tables are locked at a given point in time.
Finally, check for any back-end webjobs for your application. If timeouts occur on recurring days or at recurring times, then somebody's batch SQL could be blocking your production database from being read.
But this is all speculation. I think you need to do more research to determine what is actually causing the site to become unresponsive. If you can log response times for common queries, you can confirm or rule out SQL-based latency as the culprit and work from there. There's nothing inherently "amiss" about any of the technologies you specified.
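Logging response times can be as simple as wrapping each database call in a timer. A rough, language-neutral sketch in Python with in-memory SQLite (the `pages` table and the `timed_query` helper are hypothetical; in the real application the same wrapper would sit around the EF calls and write to the existing error log):

```python
import sqlite3
import time

def timed_query(conn, sql, params=(), log=None):
    """Run a query, return its rows, and append (sql, seconds) to log."""
    started = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed = time.perf_counter() - started
    if log is not None:
        log.append((sql, elapsed))
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO pages (title) VALUES (?)", [("home",), ("about",)])
conn.commit()

timings = []
rows = timed_query(conn, "SELECT title FROM pages ORDER BY id", log=timings)
# The slowest statement seen so far is the first candidate to investigate.
slowest = max(timings, key=lambda entry: entry[1])
```

Comparing these timings against the moments the site times out shows whether the database is the bottleneck at all.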
If queries are performant but still causing issues, a long-term solution is to add something like a message queue and batch your SQL work intelligently, or just make the database work asynchronous so it doesn't block the UI.
You should correlate any logged timeouts with azure's monitoring. Azure can give you CPU/RAM/page visits and such on the dashboard.
SQL Azure is a bit of a different beast. It doesn't have the on-demand performance of a dedicated DB unless you're prepared to throw serious $$ at it. And even then ...
EF, when written for well, can perform quite well. When written poorly it can be a dog, and those problems are compounded on a platform like SQL Azure.
The first thing is to check that your EF contexts are set up to use an execution strategy suited to Azure: https://learn.microsoft.com/en-us/ef/ef6/fundamentals/connection-resiliency/retry-logic
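The linked page covers EF6's built-in retrying execution strategy. The sketch below is not that implementation, just the general idea behind it (retry transient failures with an increasing delay), written in Python so it runs standalone. `TransientError`, `run_with_retries` and the delay values are assumptions made for illustration:

```python
import time

class TransientError(Exception):
    """Stand-in for a dropped connection or throttling error (hypothetical)."""

def run_with_retries(operation, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on TransientError wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error to the caller
            sleep(base_delay * 2 ** attempt)

# Usage: an operation that fails twice before succeeding.
calls = {"count": 0}

def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("connection dropped")
    return "ok"

delays = []  # capture the waits instead of actually sleeping
result = run_with_retries(flaky_query, sleep=delays.append)
```

Only errors known to be transient should be retried this way; retrying a genuinely slow query just adds load.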
The next thing would be to see what kinds of SQL tracing you can run on Azure. Tracing is essential to see what EF is doing behind the scenes. I'm not familiar with the tools available for Azure; my Azure experience was running SQL Server on VMs, because SQL Azure was too immature, not HIPAA compliant at the time, and expensive for the DTU estimates we were able to get. Worst case, can you restore a database backup into a SQL Server instance and temporarily point a copy of your application environment at it to run through common usage scenarios? Using a SQL trace you can pick up on exactly when and how often EF is executing queries, and what queries it is executing.
Things to look at:
How many queries are running? If you are loading a set of records and expect one query, are there a whole heap of queries getting sent? This would indicate lazy-load calls being triggered.
What queries are being run? Is it selecting a lot more fields than are being displayed? This would be potentially a case where entire entities are being loaded where a .Select() could be used to reduce the amount of data. Perhaps even the case where entire sets of entities are being loaded that aren't relevant to what is displayed/done, such as cases where someone is using .ToList() prior to just doing a .Count() or .Any() or doing a .FirstOrDefault() just to do a != null check.
Is the database properly indexed? Copy some of the heavier queries into SQL Server Management Studio and execute them with the execution plan enabled. Are there indexing suggestions?
The common sins of developing with EF and other ORMs boil down to "pulling too much, too often." It's surprising how many clients I've worked with have development teams that have not used a profiler to inspect their ORM use efficiency. (and I'm talking 0% so far.)
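To make the "whole heap of queries" symptom concrete, here is a small sketch that counts statements for the lazy pattern versus a single join. It uses Python and in-memory SQLite rather than EF, and the `orders`/`order_lines` tables and the `QueryCounter` helper are invented for the example; a profiler trace of a lazy-loading EF page tends to show the same shape.

```python
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
        CREATE TABLE order_lines (
            id INTEGER PRIMARY KEY,
            order_id INTEGER REFERENCES orders(id),
            sku TEXT);
    """)
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(i, f"customer-{i}") for i in range(1, 6)])
    conn.executemany("INSERT INTO order_lines (order_id, sku) VALUES (?, ?)",
                     [(i, f"sku-{i}-{n}") for i in range(1, 6) for n in range(2)])
    conn.commit()
    return conn

class QueryCounter:
    """Counts every statement sent through it, like a very small profiler."""
    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def query(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params).fetchall()

def load_lazily(db):
    """One query for the orders, then one more per order (the N+1 pattern)."""
    result = {}
    for order_id, _customer in db.query("SELECT id, customer FROM orders"):
        lines = db.query(
            "SELECT sku FROM order_lines WHERE order_id = ? ORDER BY id",
            (order_id,))
        result[order_id] = [sku for (sku,) in lines]
    return result

def load_eagerly(db):
    """A single join that brings back parents and children together."""
    result = {}
    rows = db.query("""
        SELECT o.id, l.sku
          FROM orders o JOIN order_lines l ON l.order_id = o.id
         ORDER BY o.id, l.id""")
    for order_id, sku in rows:
        result.setdefault(order_id, []).append(sku)
    return result

lazy_db, eager_db = QueryCounter(make_db()), QueryCounter(make_db())
lazy_result, eager_result = load_lazily(lazy_db), load_eagerly(eager_db)
```

Both functions return the same data; the difference is five orders costing six round trips in one case and one in the other.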

Should PostgreSQL or SQL Server 2008 R2 be used with a .NET application using Entity Framework?

I have a database in PostgreSQL with millions of records, and I have to develop a website that will use this database through Entity Framework (with dotConnect for PostgreSQL as the driver if I stay on PostgreSQL).
Since SQL Server and .Net are both native to the Windows platform, should I migrate the database from PostgreSQL to SQL Server 2008 R2 for performance reasons?
I have read some blogs comparing the two RDBMSs but I am still confused about which system I should use.
There is no clear answer here, as it's subjective; however, this is what I would consider:
The overhead of learning a new DBMS and its tools.
The SQL dialects each RDBMS uses and if you are using that dialect currently.
The cost (monetary and time) required to migrate from PostgreSQL to another RDBMS
Do you or your client have an ongoing budget for the new RDBMS? If not, don't make the mistake of developing an application to use an RDBMS that will never see the light of day.
Personally, if your current database is working well, I wouldn't change. Why fix what isn't broken?
You need to find out if there is actually a problem, and if moving to SQL Server will fix it before doing any application changes.
Start by ignoring the fact that you're using .NET and Entity Framework. Look at the queries that your web application is going to make, and try them directly against the database. See if it's returning the information quickly enough.
Only if, after you've tuned indexes etc. you can't make the answers come back in a time you're happy with should you decide the database is a problem. At that point it makes sense to try the same tests against a SQL Server database, but don't just assume SQL Server is going to be faster. You might find out that neither can do what you need, and you need to use faster disks or more memory etc.
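As a rough illustration of "try the query directly and tune indexes first", the sketch below asks the database for its plan before and after adding an index. It uses Python with in-memory SQLite and `EXPLAIN QUERY PLAN` only so that it runs without a server; against PostgreSQL the equivalent tool is `EXPLAIN` (or `EXPLAIN ANALYZE`). The `records` table and index name are hypothetical.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return SQLite's plan for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return "; ".join(row[3] for row in rows)  # fourth column is the plan text

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, owner TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO records (owner, payload) VALUES (?, ?)",
    [(f"owner-{i % 50}", "x" * 20) for i in range(1000)])
conn.commit()

sql = "SELECT id, payload FROM records WHERE owner = ?"

# Without an index the only option is to scan the whole table.
plan_before = query_plan(conn, sql, ("owner-7",))

conn.execute("CREATE INDEX idx_records_owner ON records (owner)")

# With the index in place the planner can seek straight to the matching rows.
plan_after = query_plan(conn, sql, ("owner-7",))
```

The exact plan wording varies between database versions, so it is the change from a scan to an index search that matters, not the literal text.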
The mechanism you're using to talk to a database (DotConnect or Microsoft drivers) will likely be a very minor performance consideration, considering the amount of information flowing (SQL statements in one direction and result sets in the other) is going to be almost identical for both technologies.

Experiences with PostgreSQL Java/JDBC Copy API for bulk inserts

With version 8.4, PostgreSQL finally integrated a proprietary API into their JDBC driver which allows stream-based inserts and selects. The so-called Copy API grants access to the COPY TO/COPY FROM SQL commands, which read text data from a stream/reader into one table at a time, or write text data from one table to a stream/writer. Constraints and triggers are honored for insert operations. Basic transformations (delimiter, quotation, null values etc.) are available. The performance gain is quite impressive, probably because of less object instantiation and a much simpler protocol between client and server backend.
Does anyone have experience with this API, good or bad? Is it production ready? Are there any pitfalls one has to be aware of? BTW: the fact that it is a proprietary API is a non-issue for me.
The COPY API has been present in the PostgreSQL C library for at least 6 years. It is very stable.
See: http://www.postgresql.org/docs/9.0/interactive/libpq-copy.html
and http://www.postgresql.org/docs/9.0/interactive/sql-copy.html
The JDBC implementation should have the same properties, but I haven't used it.
PS. I think there is a misunderstanding when you call this "proprietary". Both the protocol specification and the server/client/driver source code are free (as in freedom).

Data Warehousing Postgres

We're considering using SSIS to maintain a PostgreSql data warehouse. I've used it before between SQL Servers with no problems, but am having a lot of difficulty getting it to play nicely with Postgres. I’m using the evaluation version of the OLEDB PGNP data provider (http://www.postgresql.org/about/news.1004).
I wanted to start with something simple like UPSERT on the fact table (10k-15k rows are updated/inserted daily), but this is proving very difficult (not to mention I’ll want to use surrogate keys in the future).
I’ve attempted (Link) and (http://consultingblogs.emc.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx), which are effectively the same (except I don’t really understand the union all at the end when I’m trying to upsert). But I run into the same problem with parameters when doing the update using an OLE DB Command, which I tried to overcome using (http://technet.microsoft.com/en-us/library/ms141773.aspx), but that just doesn’t seem to work. I get a validation error:
The external columns for complent.... are out of sync with the datasource columns... external column “Param_2” needs to be removed from the external columns.
(This error is repeated for the first two parameters as well. I never came across this using the SQL connection, as it supports named parameters.)
Has anyone come across this?
AND:
The fact that this simple task is apparently so difficult to do in SSIS suggests I’m using the wrong tool for the job. Is there a better (and still flexible) way of doing this? Or would another ETL package be better for use between two Postgres databases? Other options include any listed on (http://en.wikipedia.org/wiki/Extract,_transform,_load#Open-source_ETL_frameworks). I could just go and write a load of SQL to do this for me, but I wanted a neat and easily maintainable solution.
I have used the Slowly Changing Dimension wizard for this with good success. It may give you what you are looking for:
http://msdn.microsoft.com/en-us/library/ms141715.aspx
As for the external columns being out of sync: SSIS is case sensitive. I encountered this issue multiple times and it makes me want to pull my hair out.
This simple task is going to take some work either way. SSIS is by no means an enterprise class ETL product yet, but it does give you some quick and easy functionality, and is sufficient for most ETL work. I guess it is also about your level of comfort with it as well.
SCD is way too slow for what I want. I need to use set-based SQL.
It turned out that a lot of my problems were with bugs in the provider.
I opened a forum topic (http://www.pgoledb.com/forum/viewtopic.php?f=4&t=49) and had a useful discussion with the moderator/support/developer person.
Also, Postgres doesn't let you do cross-DB queries, so I solved the problem this way:
Data Source from Production DB to a temp Archive DB table
Run set based query between temp table and archive table
Truncate temp table
Note that the temp table is not actually a temp table, but a copy of the archive table schema used to hold the data temporarily.
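The three steps above can be sketched roughly as follows. This is a stand-in written in Python against in-memory SQLite, not the SSIS package itself; the `fact_archive` and `fact_staging` tables and their columns are made up, and since SQLite has no TRUNCATE the last step uses DELETE (on Postgres it would be a TRUNCATE).

```python
import sqlite3

def upsert_via_staging(conn, rows):
    """Load rows into the staging table, merge them set-based, then empty it."""
    with conn:  # one transaction: commit on success, roll back on error
        # Step 1: land the incoming rows in the staging table.
        conn.executemany(
            "INSERT INTO fact_staging (id, amount) VALUES (?, ?)", rows)
        # Step 2a: update archive rows that already exist.
        conn.execute("""
            UPDATE fact_archive
               SET amount = (SELECT s.amount FROM fact_staging s
                              WHERE s.id = fact_archive.id)
             WHERE id IN (SELECT id FROM fact_staging)
        """)
        # Step 2b: insert rows the archive has not seen yet.
        conn.execute("""
            INSERT INTO fact_archive (id, amount)
            SELECT s.id, s.amount FROM fact_staging s
             WHERE NOT EXISTS (SELECT 1 FROM fact_archive a WHERE a.id = s.id)
        """)
        # Step 3: empty the staging table (SQLite has no TRUNCATE).
        conn.execute("DELETE FROM fact_staging")

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_archive (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE fact_staging (id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO fact_archive VALUES (1, 10.0), (2, 20.0);
""")
upsert_via_staging(conn, [(2, 25.0), (3, 30.0)])
archive = conn.execute(
    "SELECT id, amount FROM fact_archive ORDER BY id").fetchall()
staging_left = conn.execute("SELECT COUNT(*) FROM fact_staging").fetchone()[0]
```

Running the merge as two set-based statements keeps the work inside the database, which is the point of avoiding the row-by-row SCD approach.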
Took a while, but I got there in the end.
Re: "SSIS is by no means an enterprise class ETL product yet". What enterprise ETL solution would you suggest?

entity framework performance

I am using Entity Framework to layer on my SQL Server 2008 database. The EF is present in my web service and the webservice is invoked by a Silverlight client.
I am seeing a serious performance issue in how long the first query takes to execute in EF. This doesn't happen on subsequent calls.
A little bit of googling revealed that this is caused by EF constructing the in-memory model of the DB objects once per app domain. I found this Microsoft link explaining pre-generation of views for performance improvement. Even after implementing the steps, the performance actually degraded instead of improving. I am curious whether anyone has tried this approach successfully, and whether there are any other avenues for improving performance.
I am using .NET 3.5.
A couple of areas to look at for EF performance:
Do as much of the processing as possible before calling things like ToList(). ToList() will bring everything in the set into memory. By default, EF will keep building the expression tree and only actually process it when you need the data in memory. That first query will be against the database, but afterwards the processing will be in memory. When working with large data, you definitely want as much of the heavy lifting done by the database as possible.
EF 1 only has the option to pull the entire row back. Therefore if you have a column that is a large string or binary blob, it is going to be pulled down and into memory whether you need it or not. You can create a projection that doesn't include this column, but then you don't get the benefits of having it be an entity.
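A small sketch of the difference between the two points above, using Python and in-memory SQLite instead of EF so it can be run as-is (the `documents` table, its 10 KB blobs and the function names are invented for the example). The first function is the analogue of calling ToList() and then filtering; the second pushes the filter and the count to the database and never touches the blob column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, status TEXT, body BLOB)")
conn.executemany(
    "INSERT INTO documents (status, body) VALUES (?, ?)",
    [("open" if i % 4 == 0 else "closed", b"x" * 10_000) for i in range(200)])
conn.commit()

def count_open_in_memory(conn):
    """Pulls every row, blobs included, then filters on the client."""
    rows = conn.execute("SELECT id, status, body FROM documents").fetchall()
    open_count = sum(1 for _id, status, _body in rows if status == "open")
    return open_count, len(rows)  # (answer, rows transferred)

def count_open_in_database(conn):
    """Lets the database filter and count; one small row comes back."""
    rows = conn.execute(
        "SELECT COUNT(*) FROM documents WHERE status = 'open'").fetchall()
    return rows[0][0], len(rows)  # (answer, rows transferred)

in_memory = count_open_in_memory(conn)
in_database = count_open_in_database(conn)
```

Both calls give the same answer, but the first one moves 200 rows of roughly 10 KB each to the client to get it.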
You can look at the SQL generated by EF using the suggestion in this post:
How do I view the SQL generated by the Entity Framework?
The same laws of physics apply for EF queries as they do for ordinary SQL. Check your database tables and make sure that you have indexes on primary and foreign keys, that your database is properly normalized, and so forth. If performance is degrading after Microsoft's suggestions, then that's my guess as to the problem area.
Are you hosting the webservice in IIS? Is it running on the same site as the Silverlight App? What about the database itself? Is it running on a dedicated machine? Are there other apps hitting it? The first call to a dormant database is painful (I've had situations where it would actually time out in my environment.)
There are a number of factors to take into consideration here. But it comes down to more than just EF's overhead.
Edit: I didn't fully qualify this, but the process of opening the first connection to SQL Server is slow regardless of your data access solution.
Use SQL Profiler to check how many queries are executed to retrieve your data. If it's a large number, use the Include() method of ObjectQuery to retrieve child objects together with the parent in one query.