Accessing SQL Server in parallel - ado.net

I'm trying to use the Task Parallel Library to offload expensive ADO.NET database access from the UI thread (formerly the program I'm rewriting would simply freeze, occasionally updating a VB6 text box with its progress, until the data in the database was fully loaded). I have a complex dependency structure (26 individual tasks), and I'm trying to figure out how much of it is worth parallelizing.
I'd like to know whether I/O access like this can be parallelized at all with a performance gain. If not, I'll just load the data sequentially and update the UI whenever enough information is loaded to perform that task, but it'd be nice to get an extra boost by loading maybe two things at a time instead of just one (even if I don't get a full 2x speedup).

It's possible that parallelizing this will increase performance, but not guaranteed. It all depends on where your bottleneck is.
For example, if a request is expensive because it loads lots of data, then it probably consumes much of your client's network bandwidth. Parallelizing in this case wouldn't help much, if at all.
If, on the other hand, the bottleneck is the SQL processing and your SQL request leaves the SQL Server with spare capacity in its own bottleneck, then you can profit from SQL Server's (very good) parallelizing capabilities.
It is also possible that parallelizing slows you down. If, for example, the SQL Server has little RAM and access to only a single disk, forcing it to run multiple queries in parallel may lead to more seek activity on the hard disk, which can dramatically slow down the overall read rate.
So, as it often is, the answer isn't a simple yes or no, but "it depends".
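Since the answer depends on where the bottleneck is, the practical way to find out is to time both variants. The following is a minimal Python sketch of such a comparison, not ADO.NET code: `load_table` is a hypothetical stand-in that only sleeps to simulate a blocking query, and the table names and the 50 ms delay are made up for illustration. Against a real server the concurrent variant may come out faster, the same, or slower, for the reasons given above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_table(name, delay=0.05):
    """Hypothetical stand-in for one blocking database query."""
    time.sleep(delay)  # simulates waiting on the server or the network
    return f"{name}: loaded"


def load_sequential(names):
    """Run the loads one after another."""
    return [load_table(n) for n in names]


def load_concurrent(names, workers=2):
    """Run up to `workers` loads at a time; map() keeps the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_table, names))


def timed(fn, *args):
    """Return (result, elapsed seconds) for one call of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    tables = ["customers", "orders", "products", "invoices"]
    _, t_seq = timed(load_sequential, tables)
    _, t_con = timed(load_concurrent, tables)
    print(f"sequential: {t_seq:.3f}s, two at a time: {t_con:.3f}s")
```

With the simulated delay, two loads at a time take roughly half as long because the waits overlap perfectly. A real measurement replaces `load_table` with the actual query and is what tells you whether the server, the disk, or the network is the limit.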

Related

SQL Stored Procedures Call vs Multiple RecordSets

I work with Classic ASP, and all my pages make multiple calls (stored procedures) to the database to construct the page (reports, forms...).
Is it better to make one call that returns multiple recordsets, or to keep doing what I'm doing (multiple calls)?
I know there may be something better in other languages (PHP, C#...), but my app was built entirely in Classic ASP.
Thanks
As always, there is a case for both ways.
To optimize for amount of total work done, as Blam said, you should do one big call to reduce the round trip time. Not only for network latency, but also for all the network overhead of putting together packets and handling sockets.
However, this would mean that your page gets no data until all database accesses are done. So to improve response time, you may want to consider doing a pipeline where there are some database calls, but you are also processing some of the database results while other calls are made. This is a fairly unusual case since most of the time, processing is fairly light.
A common reason to break up the stored procedure is for reuse. If you have one big stored procedure, then to reuse any part of the stored procedure, you have to reuse all of it. (Unless you do messy branches and conditions inside your stored procedure that probably hurt performance due to query plan optimizations.) If you have multiple pages that can share some of the code, you probably want to break it up.
In a typical web farm, the database and the page servers are fairly close together so that network latency is not too bad. I profiled some of our production loads, and there are several places where we make multiple database calls serially, taking less than 1 ms for 10 database calls.
If your database network latency is significant, it may be worth it to do database calls in parallel. This way, you can break your stored procedure up for code reuse and not worry about network latency.
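The trade-off between serial calls, one big call, and parallel calls can be put into rough numbers. The sketch below is a deliberately simplified model, not a measurement: it assumes every call has the same latency and the same server-side work, and the parallel case assumes the server has enough spare capacity to run all calls at once. All function names and figures are illustrative.

```python
def serial_time(n_calls, latency_ms, work_ms):
    """n calls issued one after another: each pays latency plus its work."""
    return n_calls * (latency_ms + work_ms)


def batched_time(n_calls, latency_ms, work_ms):
    """One call returning n result sets: latency is paid only once."""
    return latency_ms + n_calls * work_ms


def parallel_time(n_calls, latency_ms, work_ms):
    """n calls in flight together: waits and work overlap (idealized)."""
    return latency_ms + work_ms


if __name__ == "__main__":
    # Assumed figures: 10 calls, 2 ms of work each, two latency scenarios.
    for label, latency in (("same data center", 1), ("remote database", 50)):
        print(
            label,
            serial_time(10, latency, 2),
            batched_time(10, latency, 2),
            parallel_time(10, latency, 2),
        )
```

Under these assumptions the three options are close together when latency is small, and they diverge sharply once latency dominates, which is when batching or parallel calls start to matter.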
As a general rule, make your code clean and pretty without worrying about performance. Throw more hardware at the problem until you can't make it faster by paying more money. Typically, hardware is a lot cheaper than developers.

How to use Task Parallel Library with ADO.NET data access

I'm trying to optimise ADO.NET (.NET 4.5) data access with the Task Parallel Library (.NET 4.5). For example, when selecting 1,000,000,000 records from a database, how can we use the machine's multicore processor effectively with the Task Parallel Library? If anyone has found useful sources to get some idea, please post :)
The following applies to all DB access technologies, not just ADO.NET.
Client-side processing is usually the wrong place to solve data access problems. You can achieve several orders of magnitude improvement in performance by optimizing your schema, creating proper indexes and writing proper SQL queries.
Why transfer 1M records to a client for processing, over a limited network connection with significant latency, when a proper query could return the 2-3 records that matter?
RDBMS systems are designed to take advantage of available processors, RAM and disk arrays to perform queries as fast as possible. DB servers typically have far larger amounts of RAM and faster disk arrays than client machines.
What type of processing are you trying to do? Are you perhaps trying to analyze transactional data? In this case you should first extract the data to a reporting database or, better yet, an OLAP database. A star schema with proper indexes and precalculated analytics can be 1000x faster than an OLTP schema for analysis.
Improved SQL coding can also result in a 10x-50x improvement or more. A typical mistake by programmers not accustomed to SQL is to use cursors instead of set operations to process data. This usually leads to horrendous performance degradation, on the order of 50x and worse.
Pulling all data to the client to process them row-by-row is even worse. This is essentially the same as using cursors, only the data has to travel over the wire and processing will have to use the client's often limited memory.
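The difference between the two styles can be sketched with Python's built-in `sqlite3` module. This is only an illustration of the shape of the code: the `orders` table and its rows are invented, and because SQLite runs in-process there is no network involved here, so the example shows how much data each approach moves rather than a realistic timing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10), ("bob", 5), ("alice", 7), ("bob", 1), ("carol", 3)],
)


def totals_row_by_row(conn):
    """Anti-pattern: pull every row to the client and aggregate in a loop."""
    totals = {}
    for customer, amount in conn.execute("SELECT customer, amount FROM orders"):
        totals[customer] = totals.get(customer, 0) + amount
    return totals


def totals_set_based(conn):
    """Let the database aggregate; only one row per customer comes back."""
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    )
    return dict(rows)


if __name__ == "__main__":
    print(totals_row_by_row(conn))
    print(totals_set_based(conn))
```

Both functions produce the same totals, but the first transfers every order while the second transfers one row per customer. With millions of rows and a real network in between, that difference is what the answer is warning about.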
The only place where asynchronous processing offers any advantage is when you want to fire off a long operation and execute code when processing finishes. ADO.NET already provides asynchronous operations using the APM model (BeginExecute/EndExecute). You can use TPL to wrap this in a task to simplify programming, but you won't get any performance improvements.
It could be that your problem is not suited to database processing at all. If your algorithm requires that you scan the entire dataset multiple times, it would be better to extract all the data to a suitable file format in one go, and transfer it to another machine for processing.

Memcache inhibits the website

I added memcached to my website.
And the site started running very slowly.
If I disable memcached, the application goes back to working quickly.
Why is this happening? And how can I avoid it?
Thanks,
kukuwka
That is impossible to answer without knowing how you are using it and what data you are storing. For example, if you are using it as the HttpCache provider (if you are using ASP.NET), and you were previously using the in-process cache provider, then it will behave very differently; the in-process provider has no serialization or network costs, so you might be storing some insanely large objects in the cache. That is fine when it is in-process, but for any other provider this is very very bad; you will have to transfer and deserialize for every usage (and serialize and transfer for every storage).
There are ways to improve the serialization/deserialization/network times, but it sounds like you are simply storing too much data (or inappropriate data) in the cache at the moment. I'd address that first, and then look at tuning it.
Memcached doesn't mean "make things faster." It provides fast and very scalable access to a shared cache of something that is otherwise expensive to acquire.
If you add caching to something that's cheap, it may end up being slower.
For example, if it takes you five seconds to do something and you can cache that, then you'll save almost five seconds on each subsequent request assuming the results are still useful.
If it takes you a few nanoseconds to do it, then it'll slow you down considerably to fetch the results over the network.
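The break-even point can be written down as a small model. This is a rough sketch under stated assumptions: every request pays one cache fetch, a miss additionally pays the original computation, and the cost of storing the value on a miss is ignored. The function names and the 1 ms fetch time are illustrative, not measured values.

```python
def expected_cost(compute_s, fetch_s, hit_rate):
    """Average seconds per request with a remote cache in front.

    Every request pays the cache fetch; misses also pay the computation.
    """
    return fetch_s + (1.0 - hit_rate) * compute_s


def caching_helps(compute_s, fetch_s, hit_rate):
    """True if the cached path is cheaper on average than recomputing."""
    return expected_cost(compute_s, fetch_s, hit_rate) < compute_s


if __name__ == "__main__":
    fetch = 0.001  # assumed 1 ms to fetch and deserialize from the cache
    print(caching_helps(5.0, fetch, hit_rate=0.9))   # expensive operation
    print(caching_helps(1e-8, fetch, hit_rate=0.9))  # nanosecond operation
```

For the five-second operation the cache wins easily; for the nanosecond operation the fetch alone costs far more than simply redoing the work. Large objects make it worse, since serialization and transfer time grow with size and push `fetch_s` up.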

One big call vs. multiple smaller TSQL calls

I have an ADO.NET/TSQL performance question. We have two options in our application:
1) One big database call with multiple result sets, then in code step through each result set and populate my objects. This results in one round trip to the database.
2) Multiple small database calls.
There is much more code reuse with Option 2 which is an advantage of that option. But I would like to get some input on what the performance cost is. Are two small round trips twice as slow as one big round trip to the database, or is it just a small, say 10% performance loss? We are using C# 3.5 and Sql Server 2008 with stored procedures and ADO.NET.
I would think it in part would depend on when you need the data. For instance if you return ten datasets in one large process, and see all ten on the screen at once, then go for it. But if you return ten datasets and the user may only click through the pages to see three of them then sending the others was a waste of server and network resources. If you return ten datasets but the user really needs to see sets seven and eight only after making changes to sets 5 and 6, then the user would see the wrong info if you returned it too soon.
If you use separate stored procs for each data set called in one master stored proc, there is no reason at all why you can't reuse the code elsewhere, so code reuse is not really an issue in my mind.
It sounds a wee bit obvious, but only send what you need in one call.
For example, we have a "getStuff" stored proc for presentation. The "updateStuff" proc calls "getStuff" proc and the client wrapper method for "updateStuff" expects type "Thing". So one round trip.
Chatty servers are one thing you prevent up front with minimal effort. Then, you can tune the DB or client code as needed... but it's hard to factor out the roundtrips later no matter how fast your code runs. In the extreme, what if your web server is in a different country to your DB server...?
Edit: it's interesting to note the SQL guys (HLGEM, astander, me) saying "one trip" and the client guys saying "multiple, code reuse"...
I am struggling with this problem myself. And I don't have an answer yet, but I do have some thoughts.
Having reviewed the answers given by others to this point, there is still a third option.
In my application, around ten or twelve calls are made to the server to get the data I need. Some of the data fields are varchar max and varbinary max fields (pictures, large documents, videos and sound files). All of my calls are synchronous, i.e., while the data is being requested, the user (and the client-side program) has no choice but to wait. He may only want to read or view the data, which only makes total sense when it is ALL there, not just partially there. The process, I believe, is slower this way, and I am developing an alternative approach based on asynchronous calls to the server from a DLL library which raises events to announce its progress to the client. The client is programmed to handle the DLL events and set a variable on the client side indicating which calls have been completed. The client program can then do what it must do to prepare the data received in call #1 while the DLL is proceeding asynchronously to get the data of call #2. When the client is ready to process the data of call #2, it must check the status and wait to proceed if necessary (I am hoping this will be a short wait or no wait at all). In this manner, both server and client-side software get the job done more efficiently.
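The overlap described above, preparing the result of call #1 while call #2 is still being fetched, can be sketched in Python with a background worker instead of DLL events. This is only an analogy to the poster's design: `fetch` and `prepare` are hypothetical stand-ins, and the 20 ms sleep simulates a blocking server call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(call_id):
    """Hypothetical stand-in for one blocking server call."""
    time.sleep(0.02)
    return f"data-{call_id}"


def prepare(data):
    """Hypothetical stand-in for client-side processing of one result."""
    return data.upper()


def run_pipeline(call_ids):
    prepared = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Queue every call up front; the single worker fetches them in
        # order in the background while this thread prepares earlier ones.
        futures = [pool.submit(fetch, cid) for cid in call_ids]
        for future in futures:
            # result() blocks only if that call has not finished yet
            prepared.append(prepare(future.result()))
    return prepared


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3]))
```

The `future.result()` call plays the role of "check the status and wait to proceed if necessary": if the fetch is already done it returns immediately, otherwise the client waits only for the remainder of that one call.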
If you're that concerned with performance, try a test of both and see which performs better.
Personally, I prefer the second method. It makes life easier for the developers, makes code more re-usable, and modularizes things so changes down the road are easier.
I personally like option two for the reason you stated: code reuse.
But consider this: for small requests the latency might be longer than what you do with the request. You have to find that right balance.
As the ADO.Net developer, your job is to make the code as correct, clear, and maintainable as possible. This means that you must separate your concerns.
It's the job of the SQL Server connection technology to make it fast.
If you implement a correct, clear, maintainable application that solves the business problems, and it turns out that the database access is the major bottleneck that prevents the system from operating within acceptable limits, then, and only then, should you start pursuing ways to fix the problem. This may or may not include consolidating database queries.
Don't optimize for performance until a need arises to do so. This means that you should analyze your anticipated use patterns and determine what the typical frequency of use for this process will be, and what user interface latency will result from the present design. If the user will receive feedback from the app in less than a few (2-3) seconds, and the application load from this process is not an inordinate load on server capacity, then don't worry about it. If, on the other hand, the user is waiting an unacceptable amount of time for a response (subjective but definitely measurable), or if the server is being overloaded, then it's time to begin optimization. And then, which optimization techniques will make the most sense, or be the most cost effective, depends on what your analysis of the issue tells you.
So, in the meantime, focus on maintainability. That means, in your case, code reuse.
Personally I would go with 1 larger round trip.
This will definitely be influenced by the exact reusability of the calling code, and how it might be refactored.
But as mentioned, this will depend on your exact situation, where maintainability vs performance could be a factor.

Reasons for & against a Database

I had a discussion with a coworker about the architecture of a program I'm writing and I'd like some more opinions.
The Situation:
The Program should update at near-realtime (+/- 1 Minute).
It involves the movement of objects on a coordinate system.
There are some events that occur at regular intervals (i.e. creation of the objects).
Movements can change at any time through user input.
My solution was:
Build a server that runs continuously and stores the data internally.
The server dumps a state-of-the-program at regular intervals to protect against power failures and/or crashes.
He argued that the program requires a database and that I should use cronjobs to update the data. I can store movement information by storing start point, end point and speed, and update the position in the cronjob (and calculate collisions with other objects there) by calculating direction and speed.
His reasons:
Requires more CPU & Memory because it runs constantly.
Power failures/crashes might destroy data.
Databases are faster.
My reasons against this are mostly:
Not very precise as events can only occur at full minutes (wouldn't be that bad though).
Requires (possibly costly) transformation of data on every run from relational data to objects.
RDBMS are a general solution for a specialized problem so a specialized solution should be more efficient.
Power failures (or other crashes) can leave the data in an undefined state with only partially updated data unless (possibly costly) precautions (like transactions) are taken.
What are your opinions about that?
Which arguments can you add for any side?
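The start point, end point and speed representation mentioned in the question can be sketched as follows. This is a minimal illustration assuming straight-line movement at constant speed on a 2D coordinate system; the function name, the units, and the rule that the object stops at the end point are all assumptions, not part of the original design.

```python
import math


def position_at(start, end, speed, elapsed):
    """Position of an object moving in a straight line from start to end.

    start, end: (x, y) tuples; speed: units per second; elapsed: seconds
    since the movement began. The object stops once it reaches `end`.
    """
    dx, dy = end[0] - start[0], end[1] - start[1]
    distance = math.hypot(dx, dy)
    if distance == 0:
        return start
    # Fraction of the path covered so far, clamped to the range [0, 1]
    fraction = min(max(speed * elapsed / distance, 0.0), 1.0)
    return (start[0] + dx * fraction, start[1] + dy * fraction)


if __name__ == "__main__":
    print(position_at((0, 0), (10, 0), speed=1, elapsed=5))
```

Because the position is a pure function of the stored values and the elapsed time, it can be computed on demand at any moment. Neither the cronjob nor the long-running server has to update it every tick unless something else, such as collision detection, needs it.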
Databases are not faster. How silly... How can a database be faster than writing a custom data structure and storing it in memory? Databases are generalized tools to persist data to disk for you so you don't have to write all the code to do that yourself. Because they have to address the needs of numerous disparate (and sometimes inconsistent) business functions (persistence (durability), transactional integrity, caching, relational integrity, atomicity, etc.) and do it in a way that protects the application developer from having to worry about it so much, by definition it is going to be slower. That doesn't necessarily mean his conclusion is wrong, however.
Each of his other objections can be addressed by writing the code to address that issue yourself... But you see where that is going... At some point, the development effort of writing the custom code to address the issues that are important for your application outweighs the performance hit of just using a database - which already does all that stuff out of the box... How many of these issues are important? And do you know how to write the code necessary to address them?
From what you've described here, I'd say your solution does seem to be the better option. You say it runs once a minute, but how long does it take to run? If only a few seconds, then the transformation to relational data would likely be inconsequential, as would any other overhead. At most, this would likely take 30 seconds. This is assuming, again, that the program is quite small.
However, if it is larger, and assuming that it will get larger, doing a straight dump is a better method. You might not want to do a full dump every run, but that's up to you, just remember that it could wind up taking a lot of space (same goes if you're using a database).
If you're going to dump the state, you would need to have some sort of a redundancy system in place, along with quasi-transactions. You would want to store several copies, in case something happens to the newest version. Say, the power goes out while you're storing, and you have no backups beyond this half-written one. Transactions, you would need something to tell that the file has been fully written, so if something does go wrong, you can always tell what the most recent successful save was.
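One common way to get this "fully written or not at all" behaviour is to write the dump to a temporary file and rename it into place only once it is complete. The sketch below assumes the state can be serialized as JSON and keeps a single `.bak` copy for brevity (the paragraph above suggests keeping several); the function names and file layout are illustrative.

```python
import json
import os
import tempfile


def save_state(state, path):
    """Dump `state` so that `path` never contains a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())  # make sure the bytes reach the disk
    if os.path.exists(path):
        os.replace(path, path + ".bak")  # keep the previous good dump
    os.replace(tmp_path, path)  # atomic rename of the finished file


def load_state(path):
    """Return the newest dump that parses, or None if there is none."""
    for candidate in (path, path + ".bak"):
        try:
            with open(candidate) as f:
                return json.load(f)
        except (OSError, ValueError):
            continue  # missing or corrupt: fall back to the older copy
    return None
```

If the power fails while the temporary file is being written, the previous dump is untouched. If it fails between the two renames, `load_state` falls back to the `.bak` copy, so the most recent successful save can always be identified.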
Oh, and for his argument of it running constantly: if you have it set to a cronjob, or even a self-enclosed sleep statement or similar, it doesn't use any CPU time when it's not running, the same amount that it would if you're using an RDBMS.
If you're writing straight to disk, then this will be the faster method over a database, and faster retrieval, since, as you pointed out, there is no overhead.
Summary: A database is a good idea if you have a lot of idle processor time or historical records, but if resources are a legitimate concern, then it can become too much overhead and a dump with precautions taken is better.
MySQL can now model spatial data.
http://dev.mysql.com/doc/refman/4.1/en/gis-introduction.html
http://dev.mysql.com/doc/refman/5.1/en/spatial-extensions.html
You could use the database to keep track of world locations, user locations, item locations, etc.