Are WAW and WAR hazards unique to RISC processors? - microprocessors

Whenever I find something on hazards, it is in the context of RISC processors like MIPS.
Are WAW and WAR hazards unique to RISC processors?
Or can CISC processors also encounter those hazards?

No, WAW and WAR hazards are common to any system in which reads and writes of data are potentially executed in an order different from the order in which the instructions appear (not just processors, even!).
MIPS (and sometimes RISC-V) are just frequently used examples, as they are easier to understand and are good learning processors.
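As a concrete illustration (a minimal sketch: the C++ variables r1..r9 stand in for machine registers, and the labels I1..I4 are purely illustrative), the sequence below contains the dependences that become RAW, WAR and WAW hazards once a pipeline, an out-of-order core, or a memory system overlaps or reorders the operations:

#include <cstdio>

int main() {
    int r1 = 0, r2 = 2, r3 = 3, r4 = 0, r5 = 5, r6 = 6, r7 = 7, r8 = 8, r9 = 9;

    r1 = r2 + r3; // I1: writes r1
    r4 = r1 + r5; // I2: reads r1  -> RAW (true dependence) on I1
    r5 = r6 + r7; // I3: writes r5 -> WAR (anti-dependence): I3 must not complete
                  //     before I2 has read the old value of r5
    r1 = r8 + r9; // I4: writes r1 -> WAW (output dependence): if I4's write
                  //     finished before I1's, r1 would end with the wrong value

    std::printf("r1=%d r4=%d r5=%d\n", r1, r4, r5);
    return 0;
}

Executed strictly in program order (as C++ guarantees here), no hazard can occur; the hazards only appear when hardware overlaps these operations without tracking the dependences.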

Related

Factors in designing Instruction Set Architecture

What are the two major factors to be considered while designing an Instruction Set Architecture?
I know what an ISA is, but what are the factors to be considered? I already reviewed Wikipedia, but it doesn't help much.
I found these listed as design issues for an ISA:
Backward compatibility
Are interrupts needed?
But I am a bit confused about whether these are the two factors. Please, can anyone help? I am preparing for exams in Computer Organization and Architecture.
You can read the full article here:
The Importance of the Design of the Instruction Set
In this chapter we will be exploring one of the most interesting and important aspects of CPU design: the design of the CPU's instruction set. The instruction set architecture (or ISA) is one of the most important design issues that a CPU designer must get right from the start. Features like caches, pipelining, superscalar implementation, etc., can all be grafted on to a CPU design long after the original design is obsolete. However, it is very difficult to change the instructions a CPU executes once the CPU is in production and people are writing software that uses those instructions. Therefore, one must carefully choose the instructions for a CPU.
You might be tempted to take the "kitchen sink" approach to instruction set design and include as many instructions as you can dream up in your instruction set. This approach fails for several reasons we'll discuss in the following paragraphs. Instruction set design is the epitome of compromise management. Good CPU design is the process of selecting what to throw out rather than what to leave in. It's easy enough to say "let's include everything." The hard part is deciding what to leave out once you realize you can't put everything on the chip.
Nasty reality #1: Silicon real estate. The first problem with "putting it all on the chip" is that each feature requires some number of transistors on the CPU's silicon die. CPU designers work with a "silicon budget" and are given a finite number of transistors to work with. This means that there aren't enough transistors to support "putting all the features" on a CPU. The original 8086 processor, for example, had a transistor budget of less than 30,000 transistors. The Pentium III processor had a budget of over eight million transistors. These two budgets reflect the differences in semiconductor technology in 1978 vs. 1998.
Nasty reality #2: Cost. Although it is possible to use millions of transistors on a CPU today, the more transistors you use the more expensive the CPU. Pentium IV processors, for example, cost hundreds of dollars (circa 2002). A CPU with only 30,000 transistors (also circa 2002) would cost only a few dollars. For low-cost systems it may be more important to shave some features and use fewer transistors, thus lowering the CPU's cost.
Nasty reality #3: Expandability. One problem with the "kitchen sink" approach is that it's very difficult to anticipate all the features people will want. For example, Intel's MMX and SIMD instruction enhancements were added to make multimedia programming more practical on the Pentium processor. Back in 1978 very few people could have possibly anticipated the need for these instructions.
Nasty reality #4: Legacy Support. This is almost the opposite of expandability. Often it is the case that an instruction the CPU designer feels is important turns out to be less useful than anticipated. For example, the LOOP instruction on the 80x86 CPU sees very little use in modern high-performance programs. The 80x86 ENTER instruction is another good example. When designing a CPU using the "kitchen sink" approach, it is often common to discover that programs almost never use some of the available instructions. Unfortunately, you cannot easily remove instructions in later versions of a processor because this will break some existing programs that use those instructions. Generally, once you add an instruction you have to support it forever in the instruction set. Unless very few programs use the instruction (and you're willing to let them break) or you can automatically simulate the instruction in software, removing instructions is a very difficult thing to do.
Nasty reality #5: Complexity. The popularity of a new processor is easily measured by how much software people write for that processor. Most CPU designs die a quick death because no one writes software specific to that CPU. Therefore, a CPU designer must consider the assembly programmers and compiler writers who will be using the chip upon introduction. While a "kitchen sink" approach might seem to appeal to such programmers, the truth is no one wants to learn an overly complex system. If your CPU does everything under the sun, this might appeal to someone who is already familiar with the CPU. However, pity the poor soul who doesn't know the chip and has to learn it all at once.
These problems with the "kitchen sink" approach all have a common solution: design a simple instruction set to begin with and leave room for later expansion. This is one of the main reasons the 80x86 has proven to be so popular and long-lived. Intel started with a relatively simple CPU and figured out how to extend the instruction set over the years to accommodate new features.

The output of the Wordcount is being stored in different files

The output of the WordCount is being stored in multiple files.
However, the developer doesn't have control over where (IP, path) the files reside on the cluster.
In the MapReduce API, there is a provision for developers to write a reducer program to address this. How can this be handled in Apache Beam with the DirectRunner or any other runner?
Indeed -- the WordCount example pipeline in Apache Beam writes its output using TextIO.Write, which doesn't (by default) specify the number of output shards.
By default, each runner independently decides how many shards to produce, typically based on its internal optimizations. The user can, however, control this via the .withNumShards() API, which forces a specific number of shards. Of course, forcing a specific number may require more work from a runner, which may or may not result in somewhat slower execution.
Regarding "where the files stay on the cluster" -- it is Apache Beam's philosophy that this complexity should be abstracted away from the user. In fact, Apache Beam raises the level of abstraction so that users don't need to worry about this. It is the runner's and/or storage system's responsibility to manage this efficiently.
Perhaps to clarify -- we can draw an easy parallel between low-level programming (e.g., direct assembly), unmanaged programming (e.g., C or C++), and managed programming (e.g., C# or Java). As you go higher in abstraction, you can no longer control data locality (e.g., processor caching), but you gain power, ease of use, and portability.

Tools for one-off processing and conversion of large data

I am about to start a research project that will require a lot of data conversion and processing operations. On one hand, the data is rather massive - 10GB is typical for a raw dataset - so efficiency is an issue. On the other hand, many of these operations will be one-off and rarely re-run, so building a deployable application is overkill. It is not a user application, but mostly an experiment.
Some characteristics and constraints:
A lot of chained format conversions - JSON and XML to tabular format, then some patching, then text indexing, then exporting to some other format, etc.
I have a multi-core machine, but not several machines, at least to begin with.
Data does not fit as a whole in main memory, and from my experience, exploiting several cores is called for.
What are some recommended tools for handling such a project? My preferences are:
Easy-as-possible handling of multiple formats (JSON, XML, CSV)
Supporting multiple sources and sinks (text files, archives, databases)
Makes use of multiple cores
Little as possible administration, deployment issues, etc.
Programming language is not an issue, and I can manage Windows or Linux. Thanks!

Multi-Core Programming. Boost's MPI, OpenMP, TBB, or something else?

I am totally a novice in Multi-Core Programming, but I do know how to program C++.
Now, I am looking around for a multi-core programming library. I just want to give it a try, just for fun, and right now I have found 3 APIs, but I am not sure which one I should stick with. Right now, I see Boost's MPI, OpenMP and TBB.
For anyone who has experience with any of these 3 APIs (or any other API), could you please tell me the differences between them? Are there any factors to consider, like AMD or Intel architecture?
As a starting point I'd suggest OpenMP. With this you can very simply do three basic types of parallelism: loops, sections, and tasks.
Parallel loops
These allow you to split loop iterations over multiple threads. For instance:
#pragma omp parallel for
for (int i=0; i<N; i++) {...}
If you were using two threads, then the first thread would perform the first half of the iterations and the second thread would perform the second half.
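For reference, here is a minimal, self-contained sketch of such a loop (the array size N and the printed messages are just illustrative; compile with something like g++ -fopenmp):

#include <cstdio>
#include <omp.h>

int main() {
    const int N = 16;
    int squares[N];

    // Iterations are divided among the threads in the team; with two threads
    // and the default schedule, thread 0 typically handles i = 0..7 and
    // thread 1 handles i = 8..15.
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        squares[i] = i * i;
        std::printf("thread %d computed squares[%d]\n", omp_get_thread_num(), i);
    }

    std::printf("squares[%d] = %d\n", N - 1, squares[N - 1]);
    return 0;
}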
Sections
These allow you to statically partition the work over multiple threads. This is useful when there is obvious work that can be performed in parallel. However, it's not a very flexible approach.
#pragma omp parallel sections
{
#pragma omp section
{...}
#pragma omp section
{...}
}
Tasks
Tasks are the most flexible approach. These are created dynamically and their execution is performed asynchronously, either by the thread that created them, or by another thread.
#pragma omp task
{...}
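To see why tasks suit irregular, recursive work, here is a minimal sketch (the naive recursion and the fib(20) input are illustrative only; a real program would add a cutoff to avoid creating huge numbers of tiny tasks):

#include <cstdio>
#include <omp.h>

// Naive recursive Fibonacci: each recursive call is spawned as a task that
// any thread in the team may pick up and execute asynchronously.
static long fib(int n) {
    if (n < 2) return n;
    long x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait   // wait for both child tasks before combining results
    return x + y;
}

int main() {
    long result = 0;
    #pragma omp parallel   // create the team of threads
    #pragma omp single     // one thread builds the task tree; the others run tasks
    result = fib(20);
    std::printf("fib(20) = %ld\n", result);  // prints 6765
    return 0;
}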
Advantages
OpenMP has several things going for it.
Directive-based: the compiler does the work of creating and synchronizing the threads.
Incremental parallelism: you can focus on just the region of code that you need to parallelise.
One source base for serial and parallel code: The OpenMP directives are only recognized by the compiler when you run it with a flag (-fopenmp for gcc). So you can use the same source base to generate both serial and parallel code. This means you can turn off the flag to see if you get the same result from the serial version of the code or not. That way you can isolate parallelism errors from errors in the algorithm.
You can find the entire OpenMP spec at http://www.openmp.org/
Under the hood OpenMP is multi-threaded programming but at a higher level of abstraction than TBB and its ilk. The choice between the two, for parallel programming on a multi-core computer, is approximately the same as the choice between any higher and lower level software within the same domain: there is a trade off between expressivity and controllability.
Intel vs AMD is irrelevant I think.
And your choice ought to depend on what you are trying to achieve; for example, if you want to learn TBB then TBB is definitely the way to go. But if you want to parallelise an existing C++ program in easy steps, then OpenMP is probably a better first choice; TBB will still be around later for you to tackle. I'd probably steer clear of MPI at first unless I was certain that I would be transferring from shared-memory programming (which is mostly what you do on a multi-core) to distributed-memory programming (on clusters or networks). As ever, the technology you choose ought to depend on your requirements.
I'd suggest you play with MapReduce for some time. You can install several virtual machine instances on the same machine, each of which runs a Hadoop instance (Hadoop is an open source implementation of MapReduce originally developed at Yahoo!). There are a lot of tutorials online for setting up Hadoop.
btw, MPI and OpenMP are not the same thing. OpenMP is for shared-memory programming, which generally means multi-core programming, not parallel programming across several machines.
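To make that distinction concrete, here is a minimal MPI sketch (assuming an MPI implementation such as Open MPI or MPICH is installed; build with its mpic++ wrapper and launch with, e.g., mpirun -np 4 ./a.out). Each rank is a separate process with its own address space, so data moves via explicit messages rather than shared memory:

#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's id
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes

    if (rank == 0) {
        int value = 42;
        for (int dest = 1; dest < size; dest++)
            MPI_Send(&value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
        std::printf("rank 0 of %d sent %d to everyone\n", size, value);
    } else {
        int value = 0;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank %d of %d received %d\n", rank, size, value);
    }

    MPI_Finalize();
    return 0;
}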
It depends on your focus. If you are mainly interested in multi-threaded programming, go with TBB. If you are more interested in process-level concurrency, then MPI is the way to go.
Another interesting library is OpenCL. It basically allows you to use all your hardware (CPU, GPU, DSP, ...) in the best way.
It has some interesting features, like the ability to launch hundreds (or thousands) of work-items, which behave like lightweight threads, without a significant performance penalty.
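For a feel of what that looks like, here is a minimal sketch of an OpenCL host program that launches one work-item per array element (assuming an OpenCL runtime and headers are installed; link with -lOpenCL; error checking and resource cleanup are omitted, and the kernel name square and the size N are illustrative):

#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Kernel source: each work-item squares one element of the input array.
static const char* kSource = R"CLC(
__kernel void square(__global const float* in, __global float* out) {
    size_t i = get_global_id(0);
    out[i] = in[i] * in[i];
}
)CLC";

int main() {
    const size_t N = 1024;  // one work-item per element
    std::vector<float> in(N), out(N);
    for (size_t i = 0; i < N; i++) in[i] = static_cast<float>(i);

    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

    cl_int err;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    cl_program program = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, &err);
    clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(program, "square", &err);

    cl_mem inBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  N * sizeof(float), in.data(), &err);
    cl_mem outBuf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, N * sizeof(float),
                                   nullptr, &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &inBuf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &outBuf);

    // Enqueue N work-items; the runtime maps them onto the device's cores.
    size_t global = N;
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr,
                           0, nullptr, nullptr);
    clEnqueueReadBuffer(queue, outBuf, CL_TRUE, 0, N * sizeof(float), out.data(),
                        0, nullptr, nullptr);

    std::printf("out[10] = %f (expected 100)\n", out[10]);
    return 0;
}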

Garbage-collectors for multi-core llvm?

I've been looking at LLVM for quite some time as a new back-end for the language I'm currently implementing. It seems to have good performance, rather high-level code-generation APIs, and enough low-level support to implement exotic optimizations. In addition, although I haven't checked it myself, Apple seems to have successfully demonstrated the use of LLVM for garbage-collected multi-core programs.
So far, so good. As I'm interested in both garbage collection and multi-core, the next step would be to choose a multi-core-capable garbage collector for LLVM. Which brings me to the question: what is available? I'm aware of Jon Harrop's HLVM work, but that's about it.
Note that I need cross-platform, so Apple's GC is probably not what I'm looking for (unless there's a cross-platform version). Also note that I have nothing against stop-the-world garbage-collectors.
Thanks in advance,
Yoric
LLVM docs say that it does not support multi-threaded collectors yet.
As the matrix indicates, LLVM's garbage collection infrastructure is already suitable for a wide variety of collectors, but does not currently extend to multithreaded programs. This will be added in the future as there is interest.
The docs do say that to do multi-threaded garbage collection you need to stop the world and that this is a non-portable thing:
Threaded
Denotes a multithreaded mutator; the collector must still stop the mutator ("stop the world") before beginning reachability analysis. Stopping a multithreaded mutator is a complicated problem. It generally requires highly platform specific code in the runtime, and the production of carefully designed machine code at safe points.
However, shared state between threads is a nasty scaling issue. If your language communicates solely through message passing between 'tasks', and there is therefore no shared state between worker threads, then you could use a per-thread collector for each per-thread heap.
The quotes that Will gave are about LLVM's intrinsic support for GC, where you augment LLVM with C++ code telling it how to walk the stack, interpret stack frames, inject read and write barriers and so on. The primary goal of my HLVM project is to become useful with minimal effort and risk so I chose to use the shadow stack for an "uncooperative environment" in order to avoid hacking on immature internals of LLVM. Consequently, those statements about LLVM's intrinsic support for GC do not apply to HLVM's garbage collector because it does not use that infrastructure at all. My results are extremely compelling: you can achieve excellent performance with minimal effort (serial performance and parallel performance).
I believe HLVM already runs out-of-the-box across Unixes, including Mac OS X, because it requires only POSIX threads. I strongly disagree with the claim that writing a stop-the-world GC is difficult: it took me 5 days to write a 100-line multicore garbage collector, and I barely know anything about computers. I cannot believe it would be difficult to port to Windows either.