What is the difference between Binary Semaphores and Monitors? - mutex

I know the difference between semaphores and monitors, but in the case of binary semaphores, how are they different from a monitor?
An explanation with an example would help.

Related

Do modern CPUs have compression instructions?

I have been curious about this for a while, since compression is used in just about everything.
Are there any basic compression support instructions in the silicon on a typical modern CPU chip?
If not, why are they not included?
Why is this different from encryption, where some CPUs have hardware support for algorithms such as AES?
They don’t have general-purpose compression instructions.
AES operates on very small data blocks: it accepts two 128-bit inputs, does some non-trivial computation on them, and produces a single 128-bit output. A dedicated instruction to speed up that computation helps a lot.
On modern hardware, lossless compression speed is often limited by RAM latency. A dedicated instruction can't improve that; bigger and faster caches can, but modern CPUs already have very sophisticated multi-level caches, and they already work well enough for compression.
If you need to compress many gigabits per second, there are several standalone accelerators, but these are not part of the processor; they are usually separate chips connected over PCIe. And they are very niche products, because most users just don't need to compress that much data that fast.
However, modern CPUs have a lot of stuff for lossy multimedia compression.
Most of them have multiple vector instruction set extensions (MMX, SSE, AVX), and some of these instructions help a lot in, for example, video compression. In particular, _mm_sad_pu8 (SSE), _mm_sad_epu8 (SSE2), and _mm256_sad_epu8 (AVX2) are very helpful for estimating compression errors of 8x8 blocks of 8-bit pixels. The AVX2 version processes 4 rows of such a block in just a few cycles (5 cycles on Haswell, 1 on Skylake, 2 on Ryzen).
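To make it concrete, here is a minimal scalar sketch (in Scala, purely illustrative) of the sum-of-absolute-differences metric those instructions compute; the hardware versions chew through 8, 16, or 32 bytes per instruction instead of one element per loop iteration:

    // Scalar sketch of SAD: sum of absolute differences between two
    // blocks of unsigned 8-bit pixels (e.g. an 8x8 block and its prediction).
    def sad(a: Array[Byte], b: Array[Byte]): Int = {
      require(a.length == b.length)
      var sum = 0
      var i = 0
      while (i < a.length) {
        // & 0xFF reinterprets the signed JVM byte as an unsigned 8-bit pixel
        sum += math.abs((a(i) & 0xFF) - (b(i) & 0xFF))
        i += 1
      }
      sum
    }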
Finally, many CPUs have integrated GPUs, which include specialized silicon for hardware video encoding and decoding, usually H.264; newer ones also support H.265. There's a table of this for Intel GPUs; AMD has separate names for its encoding and decoding parts. That silicon is even more power-efficient than SIMD instructions in the CPU cores.
Many applications in all kinds of domains can certainly benefit from, and do use, data compression algorithms. So it would be nice to have hardware support for compression and/or decompression, similar to the hardware support for other popular functions such as encryption/decryption, various mathematical transformations, bit counting, and so on. However, compression and decompression typically operate on large amounts of data (many MBs or more), and different algorithms exhibit different memory access patterns that are either unfriendly to traditional memory hierarchies or even adversely impacted by them. In addition, because it operates on so much data, compression or decompression implemented directly in the main CPU pipeline would keep the CPU almost fully busy for long periods. Encryption, on the other hand, typically works on small amounts of data at a time, so it makes sense to have hardware support for it directly in the CPU.
It is precisely for these reasons that hardware compression/decompression engines (accelerators) have been implemented as ASICs or on FPGAs by many companies, either as coprocessors (on-die, on-package, or external) or as expansion cards (connected through PCIe/NVMe), including:
Intel QuickAssist adapters.
Microsoft Xpress.
IBM PCIe data compression/decompression card.
Cisco hardware compression adapters.
AHA378.
Many academic proposals.
That said, it is possible to achieve very high throughput on a single modern x86 core. Intel published a paper in 2010 discussing the results of an implementation of the DEFLATE decompression algorithm called igunzip. They used a single Nehalem-based physical core and experimented with one and with two logical cores, achieving impressive decompression throughput of more than 2 Gbit/s. The key x86 instruction is PCLMULQDQ. However, modern hardware accelerators (such as QuickAssist) can be about 10 times faster.
Intel has a number of related patents:
Apparatus for Hardware Implementation of Lossless Data Compression.
Hardware apparatuses and methods for data decompression.
Systems, Methods, and Apparatuses for Decompression using Hardware and Software.
Systems, methods, and apparatuses for compression using hardware and software.
It's hard to determine, though, which Intel products employ the techniques or designs proposed in these patents.

Parallel programming model: Scala vs OpenCL

In terms of the parallel programming model, what is the difference between what Scala and OpenCL provide/support?
Take a trivial task as an example: how would you parallelize adding two vectors with 1 billion elements?
I assume Scala should be much easier from a programmer's point of view (see the sketch right after this question), something like:
vectorA + vectorB -> setC
Or are they not at the same level for comparison?
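For what it's worth, the "Scala side" of that trivial example might look roughly like the following. This is a minimal sketch using the standard-library parallel collections (available since Scala 2.9; in 2.13+ they moved to a separate module); the names mirror the question's pseudocode, and a smaller size is used so it fits in memory:

    // Element-wise addition of two vectors, spread across CPU cores
    // via the standard parallel collections.
    val n = 10000000                       // the question says 1 billion; smaller here
    val vectorA = Array.fill(n)(1.0)
    val vectorB = Array.fill(n)(2.0)

    // zip the vectors, switch to a parallel view with .par,
    // add element-wise, and collect back into an array
    val setC: Array[Double] =
      (vectorA zip vectorB).par.map { case (a, b) => a + b }.toArray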
I came across a very interesting article:
http://www.theinquirer.net/inquirer/news/2257035/amd-thinks-most-programmers-will-not-use-cuda-or-opencl
While AMD has worked hard to drive OpenCL, as Nvidia has with CUDA, both companies are now looking at delivering the performance advantages of using those two languages and incorporating them into languages such as Java, Python and R.
Maybe they need to look into Scala as well :)
Scala is based on the JVM. That means any GPU-optimized Java stuff can be easily ported to Scala.
If the JVM optimizes bytecode for the GPU on the fly, then Scala will automatically benefit from it as well.
GPU programming is the future, unless we start seeing hundreds of i7 cores, etc.
The issue with the CPU is that each core is very complex, which means higher power consumption per core, heat issues, and so on.
However, a GPU can offload tasks from the CPU in the same way the math coprocessor offloaded tasks in the early days.
A desktop CPU + GPU die would be interesting though: moving the CPU inside the GPU card :-)

GHz to MIPS? Rough estimate anyone?

From the research I have done so far, I have learned that MIPS is highly dependent on the application being run and on the language.
But can anyone give me their best guess in MIPS for a 2.5 GHz computer? Or for any other clock speed?
C++ if that helps.
MIPS stands for "Million Instructions Per Second", but that value becomes difficult to calculate for modern computers. Many processor architectures (such as x86 and x86_64, which make up most desktop and laptop computers) fall into the CISC category of processors. CISC architectures often contain instructions that perform several different tasks at once. One of the consequences of this is that some instructions take more clock cycles than other instructions. So even if you know your clock frequency (in this case 2.5 gigahertz), the number of instructions run per second depends mostly on which instructions a program uses. For this reason, MIPS has largely fallen out of use as a performance metric.
For some of my many benchmarks, identified in
http://www.roylongbottom.org.uk/
I produce an assembly code listing from which the actual assembler instructions used can be counted (note that these are not the actual micro-instructions used by RISC-style processors). The following includes %MIPS/MHz calculations based on these and other MIPS assumptions.
http://www.roylongbottom.org.uk/cpuspeed.htm
The results only apply for Intel CPUs. You will see that MIPS results depend on whether CPU, cache or RAM data is being used. For a modern CPU at 2500 MHz, likely MIPS are between 1250 and 9000 using CPU/L1 cache but much less accessing data in RAM. Then there are SSE SIMD integer instructions. Real integer MIPS for simple register based additions are in:
http://www.roylongbottom.org.uk/whatcpu%20results.htm#anchorC2D
where my 2.4 GHz Core 2 CPU is shown to run at up to 17531 MIPS.
Roy
MIPS officially stands for Million Instructions Per Second but the Hacker's Dictionary defines it as Meaningless Indication of Processor Speed. This is because many companies use the theoretical maximum for marketing which is never achieved in real applications. E.g. current Intel processors can execute up to 4 instructions per cycle. Following this logic at 2.5 GHz it achieves 10,000 MIPS. In real applications, of course, this number is never achieved. Another problem, which slavik already mentions, is that instructions do different amounts of useful work. There are even NOPs, which–by definition–do nothing useful yet contribute to the MIPS rating.
To correct this, people began using Dhrystone MIPS in the 1980s. Dhrystone is a synthetic benchmark (i.e. it is not based on a useful program), and one Dhrystone MIPS is defined relative to the benchmark performance of a VAX 11/780. This is only slightly less ridiculous than the definition above.
Today, performance is commonly measured by SPEC CPU benchmarks, which are based on real-world programs. If you know these benchmarks and your own applications very well, you can make reasonable predictions of performance without actually running your application on the CPU in question.
The key is to understand that performance will vary widely based on a number of characteristics. For example, there used to be a program called The Many Faces of Go which essentially hard-codes knowledge about the board game in many conditional if-clauses. The performance of this program is almost entirely determined by the branch predictor. Other programs use huge amounts of memory that do not fit into any cache; their performance is determined by the bandwidth and/or latency of main memory. Some applications may depend heavily on the throughput of floating-point instructions while others never use any floating-point instructions at all. You get the idea: an accurate prediction is impossible without knowing the application.
Having said all that, an average number would be around 2 instructions per cycle, i.e. 5,000 MIPS at 2.5 GHz. However, real numbers can easily be ten or even a hundred times lower.
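To make the arithmetic explicit, here is a tiny back-of-envelope sketch (Scala, purely illustrative; the IPC figures are the guesses from the answers above, not measurements):

    // Rough MIPS estimate: clock in GHz * 1000 gives millions of cycles
    // per second; multiply by an assumed average instructions-per-cycle.
    def roughMips(clockGHz: Double, ipc: Double): Double =
      clockGHz * 1000 * ipc

    println(roughMips(2.5, 4.0))  // ~10000 MIPS: theoretical peak of a 4-wide core
    println(roughMips(2.5, 2.0))  // ~5000 MIPS: the "average" figure above
    println(roughMips(2.5, 0.2))  // memory-bound code can easily be 10x lower or worse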

Is there any difference in execution times between two different processors with same amount of gigaflops?

I have a hardware related question that I debated with a friend.
Consider two processors from two different manufacturers with the same number of gigaflops, put into otherwise identical computers (i.e. RAM and so on are the same for both).
Now, given a simple program, will there be any difference in execution times between the two computers? I.e., will they handle the code differently (for-loops, while-loops, if-statements and such)?
And if so, is that difference noticeable, or can one say that the computers would perform approximately the same?
Short answer: Yes, they will be different, possibly very much so.
FLOPS only counts floating-point operations, so it is a very crude measure of CPU performance. It is, in general, a decent proxy for certain kinds of scientific computation, but not for general performance.
There are CPUs which are strong in FLOPS but have more moderate integer performance; the Alpha is a historical example. This means that an Alpha and an x86 CPU with similar FLOPS could have very different MIPS performance.
The truth is that it is very hard to make a good generic benchmark, though many have tried.
Another critical factor in comparing the performance of two processors with the same FLOP measure is the rate at which they can move data between CPU and RAM. Add memory cache into your thinking to complicate matters further.

How to do hardware independent parallel programming?

These days there are two main hardware environments for parallel programming: one is multi-threaded CPUs and the other is graphics cards, which can do parallel operations on arrays of data.
The question is, given these two different hardware environments, how can I write a program that is parallel but independent of which of them it runs on?
I mean that I would like to write a program, and regardless of whether I have a graphics card, a multi-threaded CPU, or both, the system should automatically choose what to execute it on: the graphics card, the CPU, or both.
Are there any software libraries or language constructs which allow this?
I know there are ways to target the graphics card directly to run code on, but my question is how we as programmers can write parallel code without knowing anything about the hardware, with the software system scheduling it onto either the graphics card or the CPU.
If you require me to be more specific as to the platform/language, I would like the answer to be about C++ or Scala or Java.
Thanks
Martin Odersky's research group at EPFL just recently received a multi-million-euro European Research Grant to answer exactly this question. (The article contains several links to papers with more details.)
In a few years from now programs will rewrite themselves from scratch at run-time (hey, why not?)...
...as of right now (as far as I am aware) it's only viable to target related groups of parallel systems with a given paradigm, and a GPU ("embarrassingly parallel") is significantly different from a "conventional" CPU (2-8 "threads"), which in turn is significantly different from a 20k-processor supercomputer.
There are actually parallel run-times/libraries/protocols like Charm++ or MPI (think "Actors") that can scale, with specially engineered algorithms for certain problems, from a single CPU to tens of thousands of processors, so the above is a bit of hyperbole. However, there are enormous fundamental differences between a GPU, or even a Cell microprocessor, and a much more general-purpose processor.
Sometimes a square peg just doesn't fit into a round hole.
Happy coding.
OpenCL is precisely about running the same code on CPUs and GPUs, on any platform (Cell, Mac, PC...).
From Java you can use JavaCL, which is an object-oriented wrapper around the OpenCL C API that will save you lots of time and effort (handles memory allocation and conversion burdens, and comes with some extras).
From Scala, there's ScalaCL, which builds upon JavaCL to completely hide away the OpenCL language: it converts some parts of your Scala program into OpenCL code at compile time (it comes with a compiler plugin to do so).
Note that Scala has featured parallel collections as part of its standard library since 2.9.0, and they are usable in a pretty similar way to ScalaCL's OpenCL-backed parallel collections (Scala's parallel collections can be created out of regular collections with .par, while ScalaCL's parallel collections are created with .cl).
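As a rough sketch of that symmetry (the .par line is plain standard-library Scala; the .cl line is only indicative of ScalaCL's documented convention and needs the ScalaCL plugin and runtime on the classpath):

    val xs = (1 to 1000000).toArray

    val sequential = xs.map(x => x * 2)       // single core
    val onAllCores = xs.par.map(x => x * 2)   // standard parallel collections
    // val onTheGpu = xs.cl.map(x => x * 2)   // ScalaCL-style, OpenCL-backed (illustrative)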
The (very-)recently announced MS C++ AMP looks like the kind of thing you're after. It seems (from reading the news articles) that initially it's targeted at using GPUs, but the longer term aim seems to be to include multi-core too.
Sure. See ScalaCL for an example, though it's still alpha code at the moment. Note also that it uses some Java libraries that perform the same thing.
I will cover the more theoretical answer.
Different parallel hardware architectures implement different models of computation. Bridging between these is hard.
In the sequential world we've been happily hacking away at basically the same single model of computation: the Random Access Machine. This creates a nice common language between hardware implementors and software writers.
No such single optimal model for parallel computation exists. Since the dawn of modern computers a large design space has been explored; current multicore CPUs and GPUs cover but a small fraction of this space.
Bridging these models is hard because parallel programming is essentially about performance. You typically make something work on two different models or systems by adding a layer of abstraction to hide specifics. However, it is rare that an abstraction comes without a performance cost, and this will typically land you at the lowest common denominator of both models.
And now answering your actual question. Having a computational model (language, OS, library, ...) that is independent of CPU or GPU will typically not abstract over both while retaining the full power you're used to with your CPU, due to the performance penalties. To keep everything relatively efficient the model will lean towards GPUs by restricting what you can do.
Silver lining:
What does happen is hybrid computation. Some computations are better suited to one kind of architecture than another, and you rarely do only one type of computation, so a 'sufficiently smart compiler/runtime' can decide which part of your computation should run on which architecture.