What's the difference between a strongly ordered architecture and a weakly ordered architecture? - cpu-architecture

We know that x86_64 is a strongly ordered architecture. By contrast, architectures such as ARM64, PowerPC, and Alpha are weakly ordered.
What's the difference between them? Does this influence process scheduling? Does a weakly ordered architecture make an executable program's behavior less predictable?
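
To make "ordering" concrete, here is a minimal sketch (Scala 2.12+ on the JVM, plain non-volatile fields, object and field names made up) of the classic store-buffering litmus test. On a sequentially consistent machine the result r1 == 0 && r2 == 0 is impossible, yet unfenced code can observe it even on x86 because of store buffering, and weakly ordered architectures allow still more reorderings:

    // Store-buffering litmus test (illustrative sketch only).
    // Each thread stores to one plain var and then loads the other.
    // Sequential consistency forbids r1 == 0 && r2 == 0, but unfenced
    // accesses may occasionally produce it; how often depends on the
    // hardware's memory model and on the JIT.
    object StoreBuffering {
      var x, y, r1, r2 = 0

      def main(args: Array[String]): Unit = {
        var reorderings = 0
        for (_ <- 1 to 100000) {
          x = 0; y = 0
          val t1 = new Thread(() => { x = 1; r1 = y })
          val t2 = new Thread(() => { y = 1; r2 = x })
          t1.start(); t2.start()
          t1.join(); t2.join()
          if (r1 == 0 && r2 == 0) reorderings += 1
        }
        println(s"r1 == 0 && r2 == 0 observed $reorderings times")
      }
    }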

Related

Mach vs Spring Operating Systems

I have to write a paper on an operating system topic. I decided to get some information about Mach and Spring (both being microkernel, object-oriented operating systems). I'd like the paper to be a comparison of some specific aspects of the two systems (a full comparison would be too large for a 6-7 page paper). Which aspects of the two systems can I compare?
Spring was designed as a post-Mach system, so any information you can find out about Spring from the Sun authors is very likely to (at least indirectly) provide comparisons to Mach. Some direct points of comparison:
Spring's domains are analogous to Mach's tasks
Spring's nucleus is directly comparable with the Mach 3 kernel, and the two provide different virtual memory managers
Spring IPC uses doors (which made their way into the Solaris OS), which have different characteristics from Mach's ports
IPC in both is abstracted behind an Interface Definition Language, and the two can be compared

Factors in designing Instruction Set Architecture

What are the two major factors to be considered while designing an Instruction Set Architecture?
I know what an ISA is, but what are the factors to be considered? I already reviewed Wikipedia, but it doesn't help much.
I found these listed as design issues for an ISA:
Backward compatibility
Are interrupts needed?
But are these the two factors? I am a bit confused. Please help, anyone! I'm preparing for exams in Computer Organization and Architecture.
You can read the full article here.
The Importance of the Design of the Instruction Set
In this chapter we will be exploring one of the most interesting and important aspects of CPU design: the design of the CPU's instruction set. The instruction set architecture (or ISA) is one of the most important design issues that a CPU designer must get right from the start. Features like caches, pipelining, superscalar implementation, etc., can all be grafted on to a CPU design long after the original design is obsolete. However, it is very difficult to change the instructions a CPU executes once the CPU is in production and people are writing software that uses those instructions. Therefore, one must carefully choose the instructions for a CPU.
You might be tempted to take the "kitchen sink" approach to instruction set design and include as many instructions as you can dream up in your instruction set. This approach fails for several reasons we'll discuss in the following paragraphs. Instruction set design is the epitome of compromise management. Good CPU design is the process of selecting what to throw out rather than what to leave in. It's easy enough to say "let's include everything." The hard part is deciding what to leave out once you realize you can't put everything on the chip.
Nasty reality #1: Silicon real estate. The first problem with "putting it all on the chip" is that each feature requires some number of transistors on the CPU's silicon die. CPU designers work with a "silicon budget" and are given a finite number of transistors to work with. This means that there aren't enough transistors to support "putting all the features" on a CPU. The original 8086 processor, for example, had a transistor budget of less than 30,000 transistors. The Pentium III processor had a budget of over eight million transistors. These two budgets reflect the differences in semiconductor technology in 1978 vs. 1998.
Nasty reality #2: Cost. Although it is possible to use millions of transistors on a CPU today, the more transistors you use the more expensive the CPU. Pentium IV processors, for example, cost hundreds of dollars (circa 2002). A CPU with only 30,000 transistors (also circa 2002) would cost only a few dollars. For low-cost systems it may be more important to shave some features and use fewer transistors, thus lowering the CPU's cost.
Nasty reality #3: Expandability. One problem with the "kitchen sink" approach is that it's very difficult to anticipate all the features people will want. For example, Intel's MMX and SIMD instruction enhancements were added to make multimedia programming more practical on the Pentium processor. Back in 1978 very few people could have possibly anticipated the need for these instructions.
Nasty reality #4: Legacy Support. This is almost the opposite of expandability. Often it is the case that an instruction the CPU designer feels is important turns out to be less useful than anticipated. For example, the LOOP instruction on the 80x86 CPU sees very little use in modern high-performance programs. The 80x86 ENTER instruction is another good example. When designing a CPU using the "kitchen sink" approach, it is often common to discover that programs almost never use some of the available instructions. Unfortunately, you cannot easily remove instructions in later versions of a processor because this will break some existing programs that use those instructions. Generally, once you add an instruction you have to support it forever in the instruction set. Unless very few programs use the instruction (and you're willing to let them break) or you can automatically simulate the instruction in software, removing instructions is a very difficult thing to do.
Nasty reality #5: Complexity. The popularity of a new processor is easily measured by how much software people write for that processor. Most CPU designs die a quick death because no one writes software specific to that CPU. Therefore, a CPU designer must consider the assembly programmers and compiler writers who will be using the chip upon introduction. While a "kitchen sink" approach might seem to appeal to such programmers, the truth is no one wants to learn an overly complex system. If your CPU does everything under the sun, this might appeal to someone who is already familiar with the CPU. However, pity the poor soul who doesn't know the chip and has to learn it all at once.
These problems with the "kitchen sink" approach all have a common solution: design a simple instruction set to begin with and leave room for later expansion. This is one of the main reasons the 80x86 has proven to be so popular and long-lived. Intel started with a relatively simple CPU and figured out how to extend the instruction set over the years to accommodate new features.

Are atomic operations enough for a mutex?

Are atomic operations alone enough to implement a mutex on x86? I am asking this in relation to out-of-order execution. Apart from atomic access to the integer that specifies whether the mutex is locked, are there any additional actions that must happen, and why?
The answer depends upon what you mean by "just atomic operations". Properly aligned reads/writes on x86 are atomic, and both Dekker's and Peterson's algorithms for a mutex use only reads and writes. But neither algorithm works correctly without also using (possibly implicit) memory fences. The problem is that both algorithms assume a stronger memory consistency model than x86 has. Specifically, x86 allows a load programmed after a store to happen earlier if the two accesses are not to the same address. See here for a detailed example.
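
As an illustrative sketch (names are made up; the volatile semantics of JVM atomics stand in for the explicit fences), here is Peterson's algorithm for two threads:

    import java.util.concurrent.atomic.{AtomicInteger, AtomicIntegerArray}

    // Peterson's lock for exactly two threads (ids 0 and 1), built only
    // from reads and writes. The volatile semantics of the atomics act
    // as the implicit fences mentioned above; with plain, unfenced vars
    // the store to flag(me) could be reordered with the later load of
    // flag(other), and both threads could enter the critical section.
    class PetersonLock {
      private val flag = new AtomicIntegerArray(2) // flag(i) == 1: thread i wants in
      private val turn = new AtomicInteger(0)

      def lock(me: Int): Unit = {
        val other = 1 - me
        flag.set(me, 1)    // announce intent (volatile store)
        turn.set(other)    // yield priority to the other thread
        // spin while the other thread wants in and it is its turn
        while (flag.get(other) == 1 && turn.get == other) {}
      }

      def unlock(me: Int): Unit = flag.set(me, 0)
    }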
If by "just atomic operations", you include LOCK-prefixed instructions and XCHG (which has an implicit LOCK prefix), then the answer is yes, since said instructions have an implicit memory fence. For example, an XCHG instruction can be used to perform the sequentially consistent store required by Dekker's and Peterson's algorithms, or used to implement the usual test-and-set approach.

Is Scala faster than Java 7 for number crunching and for heavy string processing?

Assume there are two classes of applications:
(1) intensive number crunching and numerical/mathematical computation
(2) intensive regular-expression matching, XPath searching, and other string manipulation where strings are mostly stored in collection classes
In both cases, assume clients access these applications thousands of times per second, or even in parallel.
So if I have the choice to implement the applications in the server backends, I can choose either Java 7 or Scala. Which one should I choose to get faster performance and produce more reliable code?
Google did some benchmarks recently that you might find interesting - see paper linked to here: http://www.readwriteweb.com/hack/2011/06/cpp-go-java-scala-performance-benchmark.php
The paper is surprisingly unscientific, but you will get a rough feel for what can be done. Of particular interest may be section V.F:
Daniel Mahler improved the Scala version by creating a more functional version, which is kept in the Scala Pro directories. This version is only 270 lines of code, about 25% of the C++ version, and not only is it shorter, run-time also improved by about 3x. It should be noted that this version performs algorithmic improvements as well, and is therefore not directly comparable to the other Pro versions.
It's not clear to me whether this version with algorithmic improvements is included in their speed benchmark table (I don't think so), but it does indicate that you may be able to produce performance improvements by adopting algorithmic improvements that are more viable to implement in Scala. It won't do much for simple string processing, however.
A big factor will be how competent you are in programming these languages, and how good you are at optimizing them. Java is obviously more verbose but you're less likely to run into performance "gotchas".
Two points which might enable better performance for numerical computations than in Java (both are sketched in the code below):
The practical one: Scala makes it extremely easy to enable parallel computation of "embarrassingly parallel" problems. While the same could be done in Java, it would require much more time and expertise, making it likely that it will only be done in rare circumstances.
The technical one: Scala can specialize generic data structures for primitive types, making boxing/unboxing unnecessary. The Java compiler is not able to do that.
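
A rough sketch of both points (names like Cell and NumericDemo are made up; .par is built in on Scala 2.12, while 2.13+ needs the separate scala-parallel-collections module):

    // The @specialized annotation makes the compiler emit dedicated
    // Int/Double variants of this generic class, so values are stored
    // unboxed instead of as java.lang.Integer / java.lang.Double.
    class Cell[@specialized(Int, Double) T](var value: T)

    object NumericDemo {
      def main(args: Array[String]): Unit = {
        val xs = (1 to 10000000).toArray

        // "Embarrassingly parallel" map/reduce: .par turns the serial
        // version into a parallel one with a one-word change.
        val sumOfSquares    = xs.map(x => x.toLong * x).sum
        val sumOfSquaresPar = xs.par.map(x => x.toLong * x).sum
        println(sumOfSquares == sumOfSquaresPar)

        val c = new Cell(42)   // uses the specialized Int variant, no boxing
        c.value += 1
        println(c.value)
      }
    }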
Scala uses Java's String so the amount of possible improvements here is quite limited. But there are other data structures like ropes which provide better performance than String in some cases.
Depending on your expertise and effort, I would expect that you can get better results here or there. Normally, with an infinite amount of development time and money, you can improve, improve and improve your code in every language. (Think of bigger and bigger caches, specialised sorters, precomputed defaults and so on).
With a good understanding of both languages and some experience with performance questions in your field, I wouldn't expect much difference, but you could save some time thanks to Scala's more collection-friendly approach, and the time saved on normal development could be spent on performance analysis and improvement.
In principle, there is no real reason why Scala would be faster than Java for number-crunching applications.
I would not choose Java or Scala or any other JVM language if I wanted to write a serious high-performance number crunching application.
From my own experience (and of course this is only anecdotal evidence and definitely not proof that it is true in all cases), the JVM is not the best-suited platform for heavy number crunching. If raw number-crunching speed is important, you would probably be better off with something closer to the metal, for example C++, which lets you use Intel SSE instructions and other low-level optimizations, or use the GPU with CUDA if your algorithm is suitable for that.

Garbage collectors for multi-core LLVM?

I've been looking at LLVM for quite some time as a new back-end for the language I'm currently implementing. It seems to have good performance, rather high-level code-generation APIs, and enough low-level support to implement exotic optimizations. In addition, and although I haven't checked it myself, Apple seems to have successfully demonstrated the use of LLVM for garbage-collected multi-core programs.
So far, so good. As I'm interested in both garbage collection and multi-core, the next step would be to choose a multi-core-capable garbage collector that works with LLVM. Which brings me to the question: what is available? I'm aware of Jon Harrop's HLVM work, but that's about it.
Note that I need cross-platform support, so Apple's GC is probably not what I'm looking for (unless there's a cross-platform version). Also note that I have nothing against stop-the-world garbage collectors.
Thanks in advance,
Yoric
The LLVM docs say that it does not support multi-threaded collectors yet:
As the matrix indicates, LLVM's garbage collection infrastructure is already suitable for a wide variety of collectors, but does not currently extend to multithreaded programs. This will be added in the future as there is interest.
The docs do say that to do multi-threaded garbage collection you need to stop the world and that this is a non-portable thing:
Threaded
Denotes a multithreaded mutator; the collector must still stop the mutator ("stop the world") before beginning reachability analysis. Stopping a multithreaded mutator is a complicated problem. It generally requires highly platform specific code in the runtime, and the production of carefully designed machine code at safe points.
However, shared state between threads is a nasty scaling issue. If your language communicates solely through message passing between 'tasks', and therefore has no shared state between worker threads, could you then use a per-thread collector for the per-thread heap?
The quotes that Will gave are about LLVM's intrinsic support for GC, where you augment LLVM with C++ code telling it how to walk the stack, interpret stack frames, inject read and write barriers, and so on. The primary goal of my HLVM project is to become useful with minimal effort and risk, so I chose to use the shadow stack for an "uncooperative environment" in order to avoid hacking on the immature internals of LLVM. Consequently, those statements about LLVM's intrinsic support for GC do not apply to HLVM's garbage collector, because it does not use that infrastructure at all. My results are extremely compelling: you can achieve excellent performance with minimal effort (serial performance and parallel performance).
I believe HLVM already runs out of the box across Unixes, including Mac OS X, because it requires only POSIX threads. I strongly disagree with the claim that writing a stop-the-world GC is difficult: it took me 5 days to write a 100-line multicore garbage collector, and I barely know anything about computers. I cannot believe it would be difficult to port to Windows either.