What is the von Neumann bottleneck? [closed]

What is the von Neumann bottleneck, and how does functional programming reduce its effect? Can someone explain in a simple way, through a practical and comprehensive example, the advantage (if any) of using Scala over Java here?
More importantly, why is avoiding imperative control structures and preferring functions so important for performance? Ideally, an actual coding example showing how a problem solved with and without functions is affected by the von Neumann bottleneck would be very helpful.

Using Scala will not necessarily fix your performance problems, even if you use functional programming.
More importantly, there are many causes of poor performance, and you don't know the right solution without profiling.
The von Neumann bottleneck has to do with the fact that, in a von Neumann architecture, the CPU and memory are separate, and therefore the CPU often has to wait for memory. Modern CPUs address this by caching memory. This isn't a perfect fix, since it requires the CPU to guess correctly about which memory it needs to cache. However, high-performance code makes it easy for the CPU to guess correctly by structuring data efficiently and iterating over data linearly (i.e. good data locality).
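To make the locality point concrete, here is a minimal Scala sketch (my own illustration, not from the answer): both loops sum the same matrix, but the first walks each row linearly while the second jumps between rows on every access.

```scala
// A minimal sketch of data locality. The row-order loop reads memory
// linearly (cache-friendly); the column-order loop strides across rows
// and typically causes far more cache misses on large matrices.
object LocalityDemo {
  def main(args: Array[String]): Unit = {
    val n = 2000
    val matrix = Array.fill(n, n)(1.0)

    var rowOrder = 0.0
    for (i <- 0 until n; j <- 0 until n) rowOrder += matrix(i)(j)   // linear access

    var colOrder = 0.0
    for (j <- 0 until n; i <- 0 until n) colOrder += matrix(i)(j)   // strided access

    println(rowOrder == colOrder) // same result, very different memory behaviour
  }
}
```

Timing the two loops with something like System.nanoTime will usually show the row-order version running several times faster on large matrices, purely because of cache behaviour.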
Scala can simplify parallel programming, which is probably what you are looking for. This is not directly related to the von Neumann bottleneck.
Even so, Scala is not automatically the answer if you want to do parallel programming. There are several reasons for this.
Java is also capable of parallel programming, and has many types of parallel collections for that purpose.
Java 8 Streams are Java's answer to Scala's parallel collections. They can be used for functional programming.
Parallel programming is not guaranteed to improve performance, and can make a program slower on small data sets, due to setup costs.
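As a rough illustration of that trade-off, here is a minimal Scala sketch using the standard parallel collections (assuming Scala 2.12 or earlier, where .par needs no extra dependency; in 2.13+ it lives in the separate scala-parallel-collections module):

```scala
// A minimal sketch of Scala's parallel collections.
object ParallelSketch {
  def main(args: Array[String]): Unit = {
    val numbers = 1 to 2000000

    val sequential = numbers.map(n => n.toLong * n).sum       // one core
    val parallel   = numbers.par.map(n => n.toLong * n).sum   // spread across cores

    println(sequential == parallel)
    // On small inputs the .par version is often *slower*: splitting the work
    // and combining the results costs more than the work itself.
  }
}
```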
There is one case where you are correct that Scala overcomes the von Neumann bottleneck, and that is with big data. When the data won't fit easily on a single machine, you can store it across many machines, such as a Hadoop cluster. Hadoop's distributed filesystem is designed to keep data and CPUs close together to avoid network traffic. The easiest way to program for Hadoop is currently with Apache Spark in Scala. Here are some Spark examples; as of Spark 2.x, the Scala examples are much simpler than the Java examples.
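For flavour, here is a small, self-contained Spark sketch in Scala (not one of the linked examples): the classic word count. The input path and the local master are placeholders; on a real cluster the master and file locations would come from spark-submit and HDFS.

```scala
// A small Spark word-count sketch. Assumes a Spark 2.x+ dependency on the
// classpath; the input path below is a placeholder.
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("word-count")
      .master("local[*]")   // use all local cores; on a cluster this is set by spark-submit
      .getOrCreate()

    val counts = spark.sparkContext
      .textFile("hdfs:///data/input.txt")   // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```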

Related

Identification and Diagnosis of Efficiency Problems in HPC

There are many articles and books on problems in HPC, but I feel like I am missing guidance on how to diagnose scaling and efficiency issues. For example, I am reading a book called "Introduction to High Performance Computing for Scientists and Engineers" by Georg Hager and Gerhard Wellein, which discusses a wide variety of problems and solutions such as:
Cache misses
Load Imbalance
Poor Vectorization of code
etc.
But if I were handed a piece of code that was even remotely complex (i.e. more than nested for-loops), I would have a very hard time discovering what the bottleneck was, or proving that the code had reached the limits of a given piece of hardware.
By analogy with medicine, I can currently list out a bunch of possible diseases that make people "less efficient", but this is hardly useful. I need to figure out how to diagnose my "patients" and then prescribe a "cure".
Could I please be referred to literature that teaches how to diagnose HPC problems (efficiency, scalability, etc.)? Almost a step-by-step guide: put the stethoscope on the chest, then listen, ...
This question is really two questions: one is how to find bottlenecks, the other is how to know the limits of your hardware and whether you are at them.
For the first, you must run the code inside a profiler. Any profiler with a "top-down" view of your code by time will show you the bottlenecks.
Try the profilers suggested here (the answer applies to C++ and Fortran): Good profiler for Fortran and MPI - both Allinea MAP and HPCToolkit have the sort of presentation you need. (NB: I work for Allinea.)
The second question is the more "open" part. That one needs your book or an optimization guide. However, a good start is to see how much vectorization you have (some of the profilers above can show this), as this is where the most compute power is to be found.
The bigger question is what the theoretical limit of your problem is - e.g. some problems are not amenable to vectorization, some have memory access patterns that can never be cache-friendly, some have simple communication needs whereas others require costly, regular global updates.

Which language should I prefer working with if I want to use the Fast Artificial Neural Network Library (FANN)?

I am working on reducing the dimensionality of a set of (Boolean) vectors, with both the number and dimensionality of the vectors tending to be of the order of 10^5-10^6, using autoencoders. Hence, even though speed is not of the essence (it is supposed to be a pre-computation for a clustering algorithm), one would obviously expect the computations to take a reasonable amount of time. Seeing how the library itself is written in C++, would it be a good idea to stick to it, or to code in Java (since the rest of the code is written in Java)? Or would it not matter at all?
That question is difficult to answer. It depends on:
How computationally demanding will your code be? If the hard part is done by the library and your code only generates the input and post-processes the output, Java would be a valid choice. Compare it to MATLAB: the language is very slow, but the built-in algorithms are super-fast.
How skilled are you (or your team, or your future students) in Java and C++? Consider that learning C++ takes a lot of time. If you have only a small-scale project, it could be easier to buy a bigger machine, or to wait two days instead of one for the results.
Do you have legacy code in one of the languages that you want to couple to or reuse?
Overall, I would advise you to set up a benchmark example in whichever language you like more, then give it a try. If the speed is OK, stick with it. If you wait too long, think about alternatives (new hardware, parallel execution, a different language).
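As a sketch of that benchmarking advice, here is a rough JVM-side timing harness in Scala (Scala simply stands in for "whichever language you like"; pretendTraining is a placeholder for the real autoencoder/FANN call):

```scala
// A rough benchmarking sketch. The workload is a placeholder; swap in the
// real library call you want to measure.
object RoughBenchmark {
  def pretendTraining(vectors: Array[Array[Boolean]]): Long = {
    var ones = 0L
    for (v <- vectors; b <- v) if (b) ones += 1   // placeholder workload
    ones
  }

  def timed[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    val result = body
    println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val data = Array.fill(100000, 100)(scala.util.Random.nextBoolean())
    timed("warm-up")(pretendTraining(data))            // let the JIT compile first
    println(timed("measured run")(pretendTraining(data)))
  }
}
```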

Ways to speed up a MATLAB application [closed]

I have a question about speeding up an application built in MATLAB. I need to know the effect of using vectorization and parallel computation on speeding up the application, and whether there is a better method than both of those in such a case. Thanks.
The first thing you need to do when your MATLAB code runs too slowly is to run it in the profiler. In recent versions of MATLAB, this can be done by pressing the "Run and Time" button on the main toolbar. This way, you will know which functions, and which lines in those functions, take up the most time. Once you know this, you may do one of the following, depending on your circumstances and the nature of the particular piece of code:
Think about whether your algorithm is optimal in terms of O() complexity.
Try turning loops into vector operations. The efficacy of this has declined in recent versions of MATLAB because of improvements in how loops are executed.
If you have a multi-core CPU, try using the Parallel Computing Toolbox. If your code parallelizes well, you will get a speed-up nearly equal to the number of cores.
If you have an NVIDIA GPU, try using the GPU support. You can get a speed-up by a factor of 10 or more with some problems, but not all problems are amenable to this sort of optimization.
If everything else fails, you may outsource the slowest piece of your code to a low level language like C. See here for how to do this. You could then use low-level profiling tools like Intel vTune to get the absolute maximum speed from the low-level code.
If it is still too slow, you may need to buy an FPGA. See here for a brief tutorial.

What is the bottleneck algorithm for medical imaging applications? [closed]

What is the computational bottleneck algorithm for medical imaging applications? We are trying to figure out if there is a benefit to run these algorithms on regular cloud server instances or GPU accelerated server instances.
Unless the software has been specifically designed with GPU processing power in mind, GPU accelerated instances will be about the same performance as regular commodity server instances, only at a higher price.
I'm willing to gamble and say that the bottleneck of any algorithm, medical or not, imaging or not, is the rate at which you can throw data at the CPU, the number of cores, and the clock rate.
Get some fast CPUs, insanely fast RAM, and blindingly fast striped/mirrored storage, and do it that way.
I suspect you'll find that running in "the cloud" is actually counterproductive, as many cloud service providers don't tune their storage backends to cater for high-performance computing, but rather to providing a little bit of IO to the masses.
I think you'd be better off with your own dedicated hardware; that way, you can spend more time and money on efficiently tuning the hardware stack to match your software stack. Any cloud service provider (including Amazon) will impose some trade-offs and compromises.
Oh, and don't forget about not putting all your eggs in one basket. What happens when Amazon goes offline and nobody can examine any X-rays? Think of the poor schmuck who put a heart-monitoring application on Amazon cloud instances when Amazon went offline in a massive outage.
Aside from the compromises of cloud hosting, and the problems of being redundant and resilient to provider outages and of keeping critical infrastructure off the cloud, there are other questions surrounding the architecture of your application itself. Will it scale linearly?
I bet it won't.
By benchmarking a GPU implementation against cloud server instances, you can see huge FPS differences [1, 2] for operations on large (e.g., CR) images. On the other hand, the GPU's memory can become heavily occupied, causing delays and continuous frame dropouts. A cloud server solution could therefore be more stable, with fewer dropouts and a smoother feel, but with lower FPS.
[1] Zhang, Lequan, et al. "A high-frequency, high frame rate duplex ultrasound linear array imaging system for small animal imaging." IEEE transactions on ultrasonics, ferroelectrics, and frequency control 57.7 (2010).
[2] Miguez, D., et al. "A technical note on variable inter-frame interval as a cause of non-physiological experimental artefacts in ultrasound." Royal Society open science 4.5 (2017): 170245.

How to do hardware independent parallel programming?

These days there are two main hardware environments for parallel programming: one is multi-threaded CPUs, and the other is graphics cards, which can do parallel operations on arrays of data.
The question is, given that there are two different hardware environments, how can I write a program which is parallel but independent of these two hardware environments?
I mean that I would like to write a program and, regardless of whether I have a graphics card, a multi-threaded CPU, or both, the system should automatically choose what to execute it on: the graphics card, the CPU, or both.
Are there any software libraries or language constructs which allow this?
I know there are ways to target the graphics card directly to run code on, but my question is about how we as programmers can write parallel code without knowing anything about the hardware, with the software system scheduling it onto either the graphics card or the CPU.
If you require me to be more specific as to the platform/language, I would like the answer to be about C++ or Scala or Java.
Thanks
Martin Odersky's research group at EPFL just recently received a multi-million-euro European Research Grant to answer exactly this question. (The article contains several links to papers with more details.)
In a few years from now programs will rewrite themselves from scratch at run-time (hey, why not?)...
...as of right now (as far as I am aware) it's only viable to target related groups of parallel systems with a given paradigm, and a GPU ("embarrassingly parallel") is significantly different from a "conventional" CPU (2-8 "threads"), which in turn is significantly different from a 20k-processor supercomputer.
There are actually parallel run-times/libraries/protocols like Charm++ or MPI (think "Actors") that can scale -- with specially engineered algorithms for certain problems -- from a single CPU to tens of thousands of processors, so the above is a bit of hyperbole. However, there are enormous fundamental differences between a GPU -- or even a Cell microprocessor -- and a much more general-purpose processor.
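For a feel of that message-passing ("Actors") style in one of the languages the asker mentioned, here is a tiny Scala sketch using Akka classic actors (Akka is my choice of illustration; the answer itself only names Charm++ and MPI):

```scala
// A tiny sketch of the actor model: independent workers that communicate
// only by messages, with no shared mutable state.
import akka.actor.{Actor, ActorSystem, Props}

class Squarer extends Actor {
  def receive: Receive = {
    case n: Int => println(s"${self.path.name}: $n squared is ${n * n}")
  }
}

object ActorSketch {
  def main(args: Array[String]): Unit = {
    val system  = ActorSystem("sketch")
    val workers = Vector.tabulate(4)(i => system.actorOf(Props(new Squarer), s"squarer-$i"))

    // fire-and-forget messages, distributed round-robin over the workers
    (1 to 20).foreach(n => workers(n % workers.size) ! n)

    Thread.sleep(500)   // crude: give the actors time to print before shutting down
    system.terminate()
  }
}
```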
Sometimes a square peg just doesn't fit into a round hole.
Happy coding.
OpenCL is precisely about running the same code on CPUs and GPUs, on any platform (Cell, Mac, PC...).
From Java you can use JavaCL, which is an object-oriented wrapper around the OpenCL C API that will save you lots of time and effort (handles memory allocation and conversion burdens, and comes with some extras).
From Scala, there's ScalaCL, which builds upon JavaCL to completely hide away the OpenCL language: it converts some parts of your Scala program into OpenCL code at compile-time (it comes with a compiler plugin to do so).
Note that Scala has featured parallel collections as part of its standard library since 2.9.0; these are usable in a pretty similar way to ScalaCL's OpenCL-backed parallel collections (Scala's parallel collections can be created from regular collections with .par, while ScalaCL's parallel collections are created with .cl).
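A short sketch of the two call shapes described above: the .par half is standard Scala (out of the box in 2.12 and earlier), while the .cl half follows ScalaCL's advertised usage and is left commented out, since ScalaCL was alpha-quality software.

```scala
// Contrasting CPU-parallel collections with ScalaCL's GPU-backed ones.
object ParVsCl {
  def main(args: Array[String]): Unit = {
    val xs = (1 to 1000000).toArray

    // Standard library: multi-threaded on the CPU
    val cpuResult = xs.par.map(x => x.toLong * 2).sum

    // ScalaCL (illustrative only): the same code shape, but the map body
    // would be compiled to OpenCL and run on the GPU (or an OpenCL CPU device)
    // import scalacl._
    // val gpuResult = xs.cl.map(x => x * 2).sum

    println(cpuResult)
  }
}
```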
The (very-)recently announced MS C++ AMP looks like the kind of thing you're after. It seems (from reading the news articles) that initially it's targeted at using GPUs, but the longer term aim seems to be to include multi-core too.
Sure. See ScalaCL for an example, though it's still alpha code at the moment. Note also that it uses some Java libraries that perform the same thing.
I will cover the more theoretical answer.
Different parallel hardware architectures implement different models of computation. Bridging between these is hard.
In the sequential world we've been happily hacking away at basically the same single model of computation: the Random Access Machine. This creates a nice common language between hardware implementors and software writers.
No such single optimal model for parallel computation exists. Since the dawn of modern computers a large design space has been explored; current multicore CPUs and GPUs cover but a small fraction of this space.
Bridging these models is hard because parallel programming is essentially about performance. You typically make something work on two different models or systems by adding a layer of abstraction to hide specifics. However, it is rare that an abstraction does not come with a performance cost. This will typically land you at the lowest common denominator of both models.
And now answering your actual question. Having a computational model (language, OS, library, ...) that is independent of CPU or GPU will typically not abstract over both while retaining the full power you're used to with your CPU, due to the performance penalties. To keep everything relatively efficient the model will lean towards GPUs by restricting what you can do.
Silver lining:
What does happen is hybrid computations. Some computations are more suited for other kinds of architectures. You also rarely do only one type of computation, so that a 'sufficiently smart compiler/runtime' will be able to distinguish what part of your computation should run on what architecture.