Parallel programming model: Scala vs OpenCL

In terms of the parallel programming model, what is the difference between what Scala and OpenCL provide/support?
Taking a trivial task as an example: how would you parallelize adding two vectors with a billion elements each?
I assume Scala would be much easier, from a programmer's point of view.
vectorA + vectorB -> vectorC
Or are they not at the same level for comparison?
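For a sense of what the JVM side looks like, here is a minimal sketch of the vector-add in Java using parallel streams (Java 8+); the class and method names are illustrative, not from any library discussed in this thread:

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class VectorAdd {
    // Element-wise addition of two equal-length vectors, spread
    // across the common ForkJoinPool by Java's parallel streams.
    static double[] add(double[] a, double[] b) {
        double[] c = new double[a.length];
        IntStream.range(0, a.length).parallel()
                 .forEach(i -> c[i] = a[i] + b[i]);
        return c;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3};
        double[] b = {10, 20, 30};
        System.out.println(Arrays.toString(add(a, b))); // [11.0, 22.0, 33.0]
    }
}
```

The same pattern scales to a billion elements (memory permitting), with the runtime choosing how to split the index range across cores.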

I came across a very interesting article:
http://www.theinquirer.net/inquirer/news/2257035/amd-thinks-most-programmers-will-not-use-cuda-or-opencl
While AMD has worked hard to drive OpenCL, as Nvidia has with CUDA,
both companies are now looking at delivering the performance
advantages of using those two languages and incorporating them into
languages such as Java, Python and R.
Maybe they need to look into Scala as well :)

Scala is based on the JVM. That means any Java-optimized GPU work can easily be ported to Scala.
If the JVM optimizes bytecode on the fly, then Scala will automatically support it too.
GPU programming is the future, unless we start seeing hundreds of i7 cores, etc.
The issue with CPUs is that they are very complex, hence higher power consumption per core, heat issues, etc.
However, a GPU can offload tasks from the CPU the same way the math coprocessor offloaded tasks in the early days.
A desktop CPU + GPU die would be interesting though: moving the CPU inside the GPU card :-)

Related

AnyLogic Computer Processor Advice needed - Single-core speed vs. number of cores?

I model on an ancient PC and recently got some lab funds for a new modeling computer. The choice of processor confounds me. For optimal AnyLogic simulation modeling, should I focus on maxing out single-core speed, or maximize the number of processor cores? Also, would a high-end graphics card help? I have heard from my engineering colleagues that for certain modeling tools graphics cards do help with the workload. Any advice helps. Thanks.
This is what AnyLogic answered when I asked for the perfect computer to buy:
The recommended platform for AnyLogic is a powerful PC/laptop running a
64-bit operating system (Windows preferable), plus a CPU with multiple
cores like an i7, and at least 8 GB of RAM.
In general, a faster CPU (3 GHz or more recommended) means faster
single-run execution. More cores mean faster execution of experiments
that run the model multiple times in parallel (optimization, parameter
variation, Monte Carlo, etc.). Also, pedestrians and transporters
benefit from many cores (even in a single run, since the algorithm
driving the movement of pedestrians and transporters uses all available
cores).
For the time being, AnyLogic doesn't support GPU processing. RAM is
crucial when you have a lot of agents and many parallel runs (e.g. if a
single run takes 1 GB, then 8 parallel runs will take 8 GB). For
working with GIS maps, a good Internet connection may be needed, for
example if the model requests a lot of routes from an online route
provider.
On average, a middle-range PC/laptop is sufficient for most models; a
high-end PC or server/instance will be useful for really heavy models.
Just to add to Felipe's reply: the graphics card is completely irrelevant; AnyLogic does not support outsourcing computations to its tensor cores.
Focus on decent processor speed, 8-12 cores, at least 16 GB of RAM, and (crucially!) an SSD. Good to go :)
Oh, and you may want to use Windows. Linux and macOS seem to exhibit more problems/bugs in AnyLogic than Windows does.

64-bit Advantages for Discrete Event Simulation

As I understand it, Intel 64-bit CPUs can address a larger address space (>4 GB), which is useful for a large simulation. Interesting architectural hardware advantages:
16 general purpose registers instead of 8
Additional SSE registers
A no execute (NX) bit to prevent buffer overrun attacks
BACKGROUND
Historically, these simulations have been performed on 32-bit IA (Intel Architecture) systems. I am wondering where the opportunity (if any) lies to reduce simulation times with 64-bit CPUs; I expect that software would need to be recompiled to take advantage of 64-bit capability. This type of simulation would not benefit from a MAC (multiply and accumulate), nor does it use floating-point calculations.
QUESTION
That being said, is there an Intel 64-bit instruction or capability that offers an appreciable advantage over the 32-bit instruction set and would accelerate simulation (computationally intensive, lengthy 32-bit algorithms)?
If you have experience implementing simulations and have transitioned from 32-bit to 64-bit CPUs, please state this in your response (relevant experience is important). I look forward to insightful responses from the community.
The most immediate computational benefit regarding CPU instructions I can think of would be AVX, although this is only loosely related to x86_64 and is more of a CPU-generational issue.
In our company, we developed multiple, highly-complex discrete event simulations, simulating aircraft (including electrics, hydraulics, avionics software and everything related). They are all built with or ported to x86_64. The reasons are mostly due to memory addressing, allowing for larger caches and wider choice of algorithms (e.g. data-centric design, concurrency), graphics content also tends to be huge nowadays. However, optimizations regarding x86_64 instructions themselves, such as AVX, are left to compilers. I never saw code written in assembler or using compiler intrinsics to actually refer to specific x86_64 instructions explicitly.
To summarize, based on my experience, x86_64 CPUs allow for certain optimizations, often sacrificing memory consumption in favor of CPU processing:
Wider choice of algorithms, especially regarding concurrency, where data may need to be laid out in a way favoring parallel processing at the cost of occupied memory
Intermediate results or other processing output may be cached more easily in memory to avoid recomputation or to optimize for temporal or state-related coherence
AVX instructions may help compilers to vectorize more code than with MMX/SSE

Ghz to MIPS? Rough estimate anyone?

From the research I have done so far, I learned that MIPS is highly dependent on the application being run, or the language.
But can anyone give me their best guess for a 2.5 Ghz computer in MIPS? Or any other number of Ghz?
C++ if that helps.
MIPS stands for "Million Instructions Per Second", but that value becomes difficult to calculate for modern computers. Many processor architectures (such as x86 and x86_64, which make up most desktop and laptop computers) fall into the CISC category of processors. CISC architectures often contain instructions that perform several different tasks at once. One of the consequences of this is that some instructions take more clock cycles than other instructions. So even if you know your clock frequency (in this case 2.5 gigahertz), the number of instructions run per second depends mostly on which instructions a program uses. For this reason, MIPS has largely fallen out of use as a performance metric.
For some of my many benchmarks, identified in
http://www.roylongbottom.org.uk/
I produce an assembly code listing from which actual assembler instructions used can be calculated (Note that these are not actual micro instructions used by the RISC processors). The following includes %MIPS/MHz calculations based on these and other MIPS assumptions.
http://www.roylongbottom.org.uk/cpuspeed.htm
The results only apply for Intel CPUs. You will see that MIPS results depend on whether CPU, cache or RAM data is being used. For a modern CPU at 2500 MHz, likely MIPS are between 1250 and 9000 using CPU/L1 cache but much less accessing data in RAM. Then there are SSE SIMD integer instructions. Real integer MIPS for simple register based additions are in:
http://www.roylongbottom.org.uk/whatcpu%20results.htm#anchorC2D
Where my 2.4 GHz Core 2 CPU is shown to run at up to 17531 MIPS.
Roy
MIPS officially stands for Million Instructions Per Second, but the Hacker's Dictionary defines it as Meaningless Indication of Processor Speed. This is because many companies use the theoretical maximum for marketing, which is never achieved in real applications. E.g. current Intel processors can execute up to 4 instructions per cycle; following this logic, at 2.5 GHz they achieve 10,000 MIPS. In real applications, of course, this number is never reached. Another problem, which slavik already mentions, is that instructions do different amounts of useful work. There are even NOPs, which by definition do nothing useful yet contribute to the MIPS rating.
To correct this, people began using Dhrystone MIPS in the 1980s. Dhrystone is a synthetic benchmark (i.e. it is not based on a useful program), and one Dhrystone MIPS is defined relative to the benchmark performance of a VAX 11/780. This is only slightly less ridiculous than the definition above.
Today, performance is commonly measured by SPEC CPU benchmarks, which are based on real-world programs. If you know these benchmarks and your own applications very well, you can make reasonable predictions of performance without actually running your application on the CPU in question.
The key is to understand that performance will vary widely based on a number of characteristics. E.g. there used to be a program called The Many Faces of Go which essentially hard-codes knowledge about the board game in many conditional if-clauses; the performance of this program is almost entirely determined by the branch predictor. Other programs use huge amounts of memory that do not fit into any cache, so their performance is determined by the bandwidth and/or latency of main memory. Some applications depend heavily on the throughput of floating-point instructions, while others never use any. You get the idea: an accurate prediction is impossible without knowing the application.
Having said all that, an average number would be around 2 instructions per cycle, i.e. 5,000 MIPS at 2.5 GHz. However, real numbers can easily be ten or even a hundred times lower.
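The back-of-the-envelope arithmetic above can be written down directly (the names here are illustrative):

```java
public class MipsEstimate {
    // Peak MIPS = clock in MHz * theoretical instructions per cycle.
    // 1 GHz = 1000 MHz, and 1 MHz = one million cycles per second.
    static double peakMips(double clockGhz, double instrPerCycle) {
        return clockGhz * 1000.0 * instrPerCycle;
    }

    public static void main(String[] args) {
        // Marketing peak: 4 instructions/cycle at 2.5 GHz.
        System.out.println(peakMips(2.5, 4)); // 10000.0
        // A more realistic average of ~2 instructions/cycle.
        System.out.println(peakMips(2.5, 2)); // 5000.0
    }
}
```

Both numbers are upper bounds; as noted above, cache misses and stalls can push the real figure an order of magnitude lower.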

How to do hardware independent parallel programming?

These days there are two main hardware environments for parallel programming: multi-threaded CPUs, and graphics cards, which can perform parallel operations on arrays of data.
The question is: given these two different hardware environments, how can I write a program that is parallel but independent of them?
I mean that I would like to write a program and, regardless of whether I have a graphics card, a multi-threaded CPU, or both, the system should automatically choose what to execute it on: the graphics card, the CPU, or both.
Are there any software libraries or language constructs that allow this?
I know there are ways to target the graphics card directly to run code on, but my question is how we as programmers can write parallel code without knowing anything about the hardware, letting the software system schedule it to either the graphics card or the CPU.
If you need me to be more specific as to the platform/language, I would like the answer to be about C++, Scala, or Java.
Thanks
Martin Odersky's research group at EPFL just recently received a multi-million-euro European Research Grant to answer exactly this question. (The article contains several links to papers with more details.)
In a few years from now, programs will rewrite themselves from scratch at run-time (hey, why not?)...
...as of right now (as far as I am aware) it's only viable to target related groups of parallel systems with given paradigms, and a GPU ("embarrassingly parallel") is significantly different from a "conventional" CPU (2-8 "threads"), which is significantly different from a 20k-processor supercomputer.
There are actually parallel run-times/libraries/protocols like Charm++ or MPI (think "Actors") that can scale, with specially engineered algorithms for certain problems, from a single CPU to tens of thousands of processors, so the above is a bit of hyperbole. However, there are enormous fundamental differences between a GPU, or even a Cell microprocessor, and a much more general-purpose processor.
Sometimes a square peg just doesn't fit into a round hole.
Happy coding.
OpenCL is precisely about running the same code on CPUs and GPUs, on any platform (Cell, Mac, PC...).
From Java you can use JavaCL, which is an object-oriented wrapper around the OpenCL C API that will save you lots of time and effort (handles memory allocation and conversion burdens, and comes with some extras).
From Scala, there's ScalaCL, which builds upon JavaCL to completely hide away the OpenCL language: it converts some parts of your Scala program into OpenCL code at compile-time (it comes with a compiler plugin to do so).
Note that Scala has featured parallel collections as part of its standard library since 2.9.0, which are usable in a pretty similar way to ScalaCL's OpenCL-backed parallel collections (Scala's parallel collections can be created from regular collections with .par, while ScalaCL's parallel collections are created with .cl).
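Java streams later brought a similar one-call switch from sequential to parallel execution to the Java side; a rough analogue of Scala's .par (this is a sketch of plain Java 8+ streams, not ScalaCL):

```java
import java.util.List;

public class ParSwitch {
    public static void main(String[] args) {
        List<Integer> xs = List.of(1, 2, 3, 4, 5);

        // Sequential pipeline.
        int seqSum = xs.stream().mapToInt(Integer::intValue).sum();

        // Same pipeline run on the common ForkJoinPool: one call changed.
        int parSum = xs.parallelStream().mapToInt(Integer::intValue).sum();

        System.out.println(seqSum + " " + parSum); // 15 15
    }
}
```

The appeal in both ecosystems is the same: the algorithm is written once against a collection API, and the parallel/sequential decision is a single annotation-like switch rather than a rewrite.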
The (very-)recently announced MS C++ AMP looks like the kind of thing you're after. It seems (from reading the news articles) that initially it's targeted at using GPUs, but the longer term aim seems to be to include multi-core too.
Sure. See ScalaCL for an example, though it's still alpha code at the moment. Note also that it uses some Java libraries that perform the same thing.
I will cover the more theoretical answer.
Different parallel hardware architectures implement different models of computation. Bridging between these is hard.
In the sequential world we've been happily hacking away at basically the same single model of computation: the Random Access Machine. This creates a nice common language between hardware implementers and software writers.
No such single optimal model for parallel computation exists. Since the dawn of modern computers a large design space has been explored; current multicore CPUs and GPUs cover but a small fraction of this space.
Bridging these models is hard because parallel programming is essentially about performance. You typically make something work on two different models or systems by adding a layer of abstraction to hide specifics. However, it is rare for an abstraction to come without a performance cost, and this will typically land you with a lowest common denominator of both models.
And now answering your actual question. Having a computational model (language, OS, library, ...) that is independent of CPU or GPU will typically not abstract over both while retaining the full power you're used to with your CPU, due to the performance penalties. To keep everything relatively efficient the model will lean towards GPUs by restricting what you can do.
Silver lining:
What does happen is hybrid computations. Some computations are more suited for other kinds of architectures. You also rarely do only one type of computation, so that a 'sufficiently smart compiler/runtime' will be able to distinguish what part of your computation should run on what architecture.

MATLAB and using multiple cores to run calculations

Hey all. I'm trying to work out how to get MATLAB running as well as possible. I have a pretty decent new machine.
12GB RAM
Core i7 3.2 GHz CPU
and lots of free space.
and a strong graphics card.
However, when I run MATLAB's benchmark test (the bench command), it ranks the computer near the worst, around a Windows XP single-core 1.7 GHz machine.
Any ideas why, and how I can improve this?
Thanks very much
Firstly, I would recommend re-running the bench command a few times to make sure MATLAB has fully loaded all the libraries etc. it needs. Much of MATLAB is loaded on demand, so it's always best to time the second or third run.
MATLAB automatically takes advantage of multiple cores when executing certain operations which are multithreaded: for example, many elementwise operations such as +, .*, and so on, as well as BLAS-backed operations (and probably others). This page lists the things which are multithreaded.
Parallel Computing Toolbox is useful when MATLAB's intrinsic multithreading can't help (if it can, then it's usually the fastest way to do things). This gives you explicit parallelism via PARFOR, SPMD and distributed arrays.
You need the Parallel Computing Toolbox. A lot of MATLAB functions are multithreaded, but to parallelize your own code you'll need it. A dumb hack is to open several instances of command-line MATLAB. You could also write multithreaded MEX files, but the right way to go about it is to purchase and use the aforementioned toolbox.
This may be obvious, but make sure that you have enabled multithreaded computation in the preferences (File > Preferences > General > Multithreading). In some versions of MATLAB, it's not enabled by default.