calculating the best speedup - cpu-architecture

calculating the best speedup - cpu-architecture

I have been going over a previous exam for my computer architecture course that i got an incorrect answer, how could i calculate the best possibly speedup?
I understand theres a limit as to how mucha program can be sped up im just unsure of the forumla (he problem is part b). Any help will be upvoted and very much appreciated thanks!
(6 points) To accelerate an application, two enhancements with the following speedups are proposed:
　　Speedup1 = 25
　　Speedup2 = 15
Enhancement 1 is usable for 40% of the instructions and enhancement 2 is usable for 30% of the instructions. Two enhancements do not overlap.
a) What is the speedup if both enhancements are applied?
b) If you keep improving these two enhancements, what is the best speedup you can reach?

Rather than trying to memorize a formula, use common sense. Imagine that both portions of speed-up-able code could be sped up infinitely: that is, made to take no time at all. What would be left? How much time would it take?

Let t be the total run rime, then:
(a) Since you are not asking for this section, I am giving a full solution, for future readers.
t' = modified run time = 0.4t / 25 + 0.3t / 15 + 0.3t = 0.336t,
Thus, speedup = t/t' = t / 0.336t ~= 2.97
(b) The question asks keep advancing THESE speed ups, so you cannot improve the whole program. Then the best speed up you can get, according to amdahl's law is bounded by the sequential, un-improveable part. Amdahl's law says that the maximum speed up will be 1/SEQUENTIAL_PART What is the sequential part in your case? Make sure you understand why.
The idea of amdahl's law is, assuming you can speed up the improved part to infinite speed up, the total speed up will still be bounded by the non-improved part.

Related

Too many for loop iterations - for loop terminates

In a classification task, I need to do feature selection. So out of featSize = 98 features (variables), I want to know which ones are applicable. For each combination I train the classifier by tuning its hyperparameters. I've come across a problem in my usage of a for loop:
for b = 1:(2^featSize) - 1
% this is to choose the features. e.g. [1 0 0] selects the first
% feature out of three features if featSize = 3.
end
Matlab gives a warning: Warning: Too many FOR loop iterations. Stopping after 9223372036854775806 iterations.
Am I using the for loop in a prohibitive way? Is there another alternative method of completing this step?

Building a model for every possible combination of features is intractable. It's clear from your for loop that you would have to build an exponential number of models to cover every feature subset.
There are many approaches to feature selection that are practical to implement. The one most similar to your method is forward-selection. Many algorithms offer a regularization parameter instead (e.g. LASSO or ridge-regression). Some options for regression are discussed here https://stats.stackexchange.com/questions/127444/a-guide-to-regularization-strategies-in-regression
This talk covers many approaches to the problem of feature selection https://www.youtube.com/watch?v=JsArBz46_3s&index=21&list=PLGVZCDnMOq0ovNxfxOqYcBcQOIny9Zvb-&t=0s

2^98 = 316.9e27 = 300 thousand million million million million. If you run a billion* loop iterations a second, it would take ten thousand million** years to run that loop. I don't think you can afford the electricity bill... :)
It is scary, isn't it, how quickly exponential things explode?
Luckily, you don't need to loop this often to visit all pairs of features. If you have 98 features, then you have 98^2 pairs, not 2^98. Actually, you have 98*97, if you don't want to pair a feature with itself, and 98*97/2 if the order doesn't matter.
You can write a double loop to visit each pair:
N = 98
for ii = 1:N-1
for jj = ii+1:N
% do something with the pair [ii,jj]
end
end
* A billion as in a million million -- not the US billion.
** 2^98 /1e12 /60 /60 /24 /365 == 10.049e+9 -- I didn't take leap years or leap seconds into account... :)

I think you are requesting the for loop to do 2^98 = 316,910,000,000,000,000,000,000,000,000 iterations, so you will need to reduce the number of iterations.

As others have noted, yes, you are using the for loop in a prohibitive way, almost destructively. It's absurd to ask any regular computer, much less a super computer to run that many iterations of a loop. So that is that part of your question answered.
Regarding developing another method of tackling this, I don't know much about machine learning (I guess this is bad to say as I'm attempting to solve this), but regardless, it doesn't seem like you've provided enough information for us to help you there. Either way, you will need to somehow drastically reduce the number of iterations of the loop for this to run efficiently, and avoid the error.

What size tour can I reasonably expect to solve with GLPK?

I'm playing around with the travelling salesman example provided with GLPK, and trying to get a feel for what problem size I can reasonably expect to solve. I've managed to solve a 50 node graph, but 100 nodes doesn't seem to be converging in a reasonable timescale (30 minutes or so on modern hardware).
GLPK has lots of options for the MIP solver. I've tried various combinations but I'm not at all clear what options might help. This page has some discussion but is somewhat out of date and the advice is rather general.
Is it reasonable to expect GLPK to solve a 100 node tour or greater in a practical time frame (say, less than 4 hours)? When does the problem size become intractable? Are any of the many command-line options likely to help?

Are any of the many command-line options likely to help?
The tspsol command line applicaton (based on the GLPK C library) or the C API in glptsp.h are tailor-made for solving TSP.
Is it reasonable to expect GLPK to solve a 100 node tour or greater in a practical time frame (say, less than 4 hours)? When does the problem size become intractable?
My guess is that it also greatly depends on the problem instance as well. If you look into the C code, you will see that heuristics are used to generate an initial tour. How well a heuristic works, well...
I assume you know that the TSP is famous for being difficult to solve, see computational complexity.

I was able to solve eil51 in 2 seconds and rat99 in about 50 seconds by starting with a relaxation of the TSP and then eliminating sub-tours until a feasible solution is found. Doing a better modelling is going to achieve more than fiddling with the solver options. GLPK should be able to deal with bigger instances if the modelling is good enough. eil51 and rat99 are instances documented on TSPLIB.
I recommend reading some material on tsp modelling, there are a lot of articles and books out there. A nice starting point maybe 'Applied Integer Programming' by Der-San Chen et al.

How to find the time value of operation to optimize new algorithm design?

My question is specific to iPhone, iPod, and iPad, since I am assuming that the architecture makes a big difference. I'm hoping there is either a specification somewhere (for the various chips perhaps), or a reliable way to measure T for each specific instruction. I know I can use any number of tools to measure aggregate processor time used, memory used, etc. I want to quantify at a lower level.
So, I'm able to figure out how many times I go through the main part of the algorithm. For example, I iterate n * (n-1) times in a naive implementation, and between n (best case) and n + n * (n-1) (worst case) in another. I can also make a reasonable count of the total number of instructions (+ - = % * /, and logic statements), and I can compare those counts, but that's assuming the weight of each operation is the same. Also, I don't have any idea how to weight the actual time value of a logic statement (if, else, for, while) vs a mathematical operator... is "if" as much work as "+" each time I use it? I would love to know where to find this information.
So, for clarity, my goal is to discover how much processor time I am demanding of the CPU (or GPU or any U) so that I can design an optimal algorithm around processor time. Can someone give me an idea of where to start for iOS hardware?
Edit: This link to ClockServices.c and SIMD stuff in the developer portal might be a good start for people interested in this. A few more cups of coffee tonight and I might get through it ;)

On a modern platform, processor time isn't the only limiting factor. Often, memory access is.
Still, processor time:
Your basic approach at an estimation for the processor load is OK, though, and is sensible: Make a rough estimate of the cost based on your knowledge of typical platforms.
In this article, Table 1 shows the times for typical primitive operations in .NET. While your platform may vary, the relative time is usually very similar. Maybe you can find - or even make - one for iStuff.
(I haven't come across one so thorough for other platforms, except processor / instruction set manuals, but they deal with assembly instructions)
memory locality:
A cache miss can cost you hundreds of cycles, a disk access a thousand times as much. So controlling your memory access patterns (i.e. reducing the working set, restructuring and accessing data in a cache-friendly way) is an important part of evaluating an algorithm.

xCode has instruments to measure performance of each function/operation, you can simply use them.

Fastest language for FOR loops

I'm trying to figure out the best programming language for an analytical model I'm building. Primary consideration is speed at which it will run FOR loops.
Some detail:
The model needs to perform numerous (~30 per entry, over 12 cycles) operations on a set of elements from an array -- there are ~300k rows, and ~150 columns in the array. Most of these operations are logical in nature, e.g., if place(i) = 1, then j(i) = 2.
I've built an earlier version of this model using Octave -- to run it takes ~55 hours on an Amazon EC2 m2.xlarge instance (and it uses ~10 GB of memory, but I'm perfectly happy to throw more memory at it). Octave/Matlab won't do elementwise logical operations, so a large number of for loops are needed -- I'm relatively certain that I've vectorized as much as possible -- the loops that remain are necessary. I've gotten octave-multicore to work with this code, which makes some improvement (~30% speed reduction when I get it running on 8 EC2 cores), but ends up being unstable with file locking, etc.
+I'm really looking for a step change in run-time -- I know that actually using Matlab might get me as much as a 50% improvement from looking at some benchmarks, but that is cost-prohibitive. The original plan when starting this was to actually run a Monte Carlo with this, but at 55 hours a run, that's completely impractical.
The next version of this is going to be a complete rebuild from the ground up (for IP reasons I won't get in to if nothing else), so I'm completely open to any programming language. I'm most familiar with Octave/Matlab, but have dabbled in R, C, C++, Java. I'm also proficient w/ SQL if the solution involves storing the data in a database. I'll learn any language for this -- these aren't complicated functionality we're looking for, no interfacing with other programs, etc., so not too concerned about learning curve.
So with all that said, what's the fastest programming language specifically for FOR loops? From a search of SO and Google, Fortran and C bubble to the top, but looking for some more advice before diving in to one or the other.
Thanks!

This for loop looks no more complex than this when it hits the CPU:
for(int i = 0; i != 1024; i++) translates to
mov r0, 0 ;;start the counter
top:
;;some processing
add r0, r0, 1 ;;increment the counter by 1
jne top: r0, 1024 ;;jump to the loop top if we havn't hit the top of the for loop (1024 elements)
;;continue on
As you can tell, this is sufficiently simple you can't really optimize it very well[1]... Refocus towards the algorithm level.
The first cut at the problem is to look at cache locality. Look up the classic example of matrix multiplication and swapping the i and j indexes.
edit: As a second cut, I would suggest evaluating the algorithm for data-dependencies between iterations and data dependency between localities in your 'matrix' of data. It may be a good candidate for parallelization.
[1] There are some micro-optimizations possible, but those will not produce the speedsups you're looking for.

~300k * ~150 * ~30 * ~12 = ~16G iterations, right?
This number of primitive operations should complete in a matter of minutes (if not seconds) in any compiled language on any decent CPU.
Fortran, C/C++ should do it almost equally well. Even managed languages, such as Java and C#, would only fall behind by a small margin (if at all).
If you have a problem of ~16G iterations running 55 hours, this means that they are very far from being primitive (80k per second? this is ridiculous), so maybe we should know more. (as was already suggested, is disk access limiting performance? is it network access?)

As #Rotsor said, 16G operations / 55 hours is about 80,000 operations per second, or one operation every 12.5 microseconds. That's a lot of time per operation.
That means your loops are not the cause of poor performance, it's what's in the innermost loop that's taking the time. And Octave is an interpreted language. That alone means an order of magnitude slowdown.
If you want speed, you first need to be in a compiled language. Then you need to do performance tuning (aka profiling) or, just single step it in a debugger at the instruction level. That will tell you what it is actually doing in its heart of hearts. Once you've got it to where it's not wasting cycles, fancier hardware, cores, CUDA, etc. will give you a parallelism speedup. But it's silly to do that if your code is taking unnecessarily many cycles. (Here's an example of performance tuning - a 43x speedup just by trimming the fat.)
I can't believe the number of responders talking about matlab, APL, and other vectorized languages. Those are interpreters. They give you concise source code, which is not at all the same thing as fast execution. When it comes down to the bare metal, they are stuck with the same hardware as every other language.
Added: to show you what I mean, I just ran this C++ code, which does 16G operations, on this moldy old laptop, and it took 94 seconds, or about 6ns per iteration. (I can't believe you baby-sat that thing for 2 whole days.)
void doit(){
double sum = 0;
for (int i = 0; i < 1000; i++){
for (int j = 0; j < 16000000; j++){
sum += j * 3.1415926;
}
}
}

In terms of absolute speed, probably Fortran followed by C, followed by C++. In practical application, well written code in any of the three, compiled with a descent compiler should be plenty fast.
Edit- generally you are going to see much better performance with any kind of looped or forking (e.g.- if statements) code with a compiled language, versus an interpreted language.
To give an example, on a recent project I was working on, the data sizes and operations were about 3/4 the size of what you're talking about here, but like your code, had very little room for vectorization, and required significant looping. After converting the code from matlab to C++, runtimes went from 16-18 hours, down to around 25 minutes.

For what you're discussing, Fortran is probably your first choice. The closest second place is probably C++. Some C++ libraries use "expression templates" to gain some speed over C for this kind of task. It's not entirely certain those will do you any good, but C++ can be at least as fast as C, and possibly somewhat faster.
At least in theory, there's no reason Ada couldn't be competitive as well, but it's been so long since I used it for anything like this that I hesitate to recommend it -- not because it isn't good, but because I just haven't kept track of current Ada compilers well enough to comment on them intelligently.

Any compiled language should perform the loop itself on roughly equal terms.
If you can formulate your problem in its terms, you might want to look at CUDA or OpenCL and run your matrix code on the GPU - though this is less good for code with lots of conditionals.
If you want to stay on conventional CPUs, you may be able to formulate your problem in terms of SSE scatter/gather and bitmask operations.

Probably the assembly language for whatever your platform is. But compilers (especially special-purpose ones that specifically target a single platform (e.g., Analog Devices or TI DSPs)) are often as good as or better than humans. Also, compilers often know about tricks that you don't. For example, the aforementioned DSPs support zero-overhead loops and the compiler will know how to optimize code to use those loops.

Matlab will do element-wise logical operations and they are generally quite fast.
Here is a quick example on my computer (AMD Athalon 2.3GHz w/3GB) :
d=rand(300000,150);
d=floor(d*10);
>> numel(d(d==1))
ans =
4501524
>> tic;d(d==1)=10;toc;
Elapsed time is 0.754711 seconds.
>> numel(d(d==1))
ans =
0
>> numel(d(d==10))
ans =
4501524
In general I've found matlab's operators are very speedy, you just have to find ways to express your algorithms directly in terms of matrix operators.

C++ is not fast when doing matrixy things with for loops. C is, in fact, specifically bad at it. See good math bad math.
I hear that C99 has __restrict pointers that help, but don't know much about it.
Fortran is still the goto language for numerical computing.

How is the data stored? Your execution time is probably more effected by I/O (especially disk or worse, network) than by your language.
Assuming operations on rows are orthogonal, I would go with C# and use PLINQ to exploit all the parallelism I could.

Might you not be best with a hand-coded assembler insert? Assuming, of course, that you don't need portability.
That and an optimized algorithm should help (and perhaps restructuring the data?).
You might also want to try multiple algorithms and profile them.

APL.
Even though it's interpreted, its primitive operators all operate on arrays natively, therefore you rarely need any explicit loops. You write the same code, whether the data is scalar or array, and the interpreter takes care of any looping needed internally, and thus with the minimum overhead - the loops themselves are in a compiled language, and will have been heavily optimised for the specific architecture of the CPU it's running on.
Here's an example of the simplicity of array handling in APL:
A <- 2 3 4 5 6 8 10
((2|A)/A) <- 0
A
2 0 4 0 6 8 10
The first line sets A to a vector of numbers.
The second line replaces all the odd numbers in the vector with zeroes.
The third line queries the new values of A, and the fourth line is the resulting output.
Note that no explicit looping was required, as scalar operators such as '|' (remainder) automatically extend to arrays as required. APL also has built-in primitives for searching and sorting, which will probably be faster than writing your own loops for these operations.
Wikipedia has a good article on APL, which also provides links to suppliers such as IBM and Dyalog.

Any modern compiled or JITted language is going to render down to pretty much the same machine language code, giving a loop overhead of 10 nano seconds or less, per iteration, on modern processors.
Quoting #Rotsor:
If you have a problem of ~16G iterations running 55 hours, this means that they are very far from being primitive (80k per second? this is ridiculous), so maybe we should know more.
80k operations per second is around 12.5 microseconds each - a factor of 1000 greater than the loop overhead you'd expect.
Assuming your 55 hour runtime is single threaded, and if your per item operations are as simple as suggested, you should be able to (conservatively) achieve a speedup of 100x and cut it down to under an hour very easily.
If you want to run faster still, you'll want to look at writing multi-threaded solution, in which case a language that provides good support would be essential. I'd lean towards PLINQ and C# 4.0, but that's because I already know C# - YMMV.

what about a lazy loading language like clojure. it is a lisp so like most lisp dialects lacks a for loop but has many other forms that operate more idiomatically for list processing. It might help your scaling issues as well because the operations are thread safe and because the language is functional it has fewer side effects. If you wanted to find all the items in the list that were 'i' values to operate on them you might do something like this.
(def mylist ["i" "j" "i" "i" "j" "i"])
(map #(= "i" %) mylist)
result
(true false true true false true)

Is LOC correct parameter for project estimation? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
Is LOC correct parameter for project estimation?
there are so many scenarios where complexity takes much more time for a single line of code,
other than LOC what could be the suggested parameter for project estimation?
As peoples are talking about functional point of program does it mean for use case related information?
i am trying to find out any solid base for full software developement estimation which can consist analysis, design, testcase preparation, and coding, please suggest?

Steve McConnell in Rapid Development (Microsoft Press, 1996):
Because different programming
languages produce such different bangs
for a given number of lines of code,
much of the software industry is
moving toward a measure called
"function points" to estimate program
sizes. A function point is a synthetic
measure of program size that is based
on a weighted sum of the number of
inputs, outputs, inquiries, and files.
Function points are useful because
they allow you to think about program
size in a languageindependent way.
Google "Function Point" for more information.

Seeing as developers are likely to* spend most of their time trying to test changes, lines-of-code is never a good indicator of size of a problem.
Let's suppose you have an existing large application - changing a single line of code may seem trivial, but the test planning and execution could take weeks.
Likewise, adding a relatively large amount of code in a single limited-scope module which is easily testable might be only a few days.
* they should do, at least. If they're spending more time writing code than testing it, it is probably full of bugs. And I mean BEFORE it reaches your dedicated QA team.

Only if you use it in the inverse.
-- Edit
But no. It isn't. It's a mostly useless measure, and generally harmful. As you note, less code is almost always better.
Other things to check? Well, what are you trying to measure? What result do you want to see from a change in the things that you would be checking? What sort of decisions will you be making on the basis of these changes?

LOC is one proxy measure for measuring the problem size.
LOC estimate can be used, and LOC count is relatively cheap to measure from historical projects. But LOC can be problematic if used for anything else than a proxy for problem size, as already pointed out by other answers.
Problem size is rather constant given the requirements. From a size estimate you can go to effort, schedule and cost estimates. It depends on your planning drivers such as cost or schedule. From the historical data you can find correlation how problem size translates to effort and how other planning drivers further influence the outcome. So you need to measure size measure and effort vs. other parameters and keep on fine-tuning your estimation process. There are some LOC-to-effort measures available in the literature, but they are not very accurate in your domain, using the technology you are using, and the team you have.
Other proxies for problem size are function points and story points. My experience on function points is that they are rarely worth the effort. On the other hand, story points in agile methods work very well since they are deliberately abstract (thus avoiding a lot of problems with with LOC) and measured on a sprint-by-sprint basis, with instant feedback into following sprints.

No, it isn't. The reason is simple: if you produce a new line of code during your development, are you one step closer to a solution? If you estimated 1000 lines of code to complete a task, are you now 0.1% complete with that task?
Lines of code can be used as a metric but only in the negative sense: for a greater number of lines of code, it is reasonable to assume that you have a greater number of bugs. Based on historical data, there is generally a linear correlation between lines of code and bug count.
Here are some useful and measurable factors that are worth considering:
Hours of labor.
Dollars spent: this is a good one because it strongly enforces the reality that you'd rather find bugs at the developer's desktop than in the hands of a tester or customer).
Milestones met: is the system available for the customers on the right date?
Requirements completed: this can be a funny one - what if you discover a new customer need during the project?
In short, lines of code is very nearly the worst possible metric you could ever use.

The only way to get any reasonable estimate on project duration is to COMPLETELY implement and deliver some subset of the final requirements. Then you can estimate the remaining requirements by comparing their complexity against the completed work.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse