What is/ how to select "prefetch" for parallel flowable? - rx-java2

I was reading the parallel flows documentation here and it mentioned:
By default, the parallelism level is set to the number of available CPUs (Runtime.getRuntime().availableProcessors()) and the prefetch amount from the sequential source is set to Flowable.bufferSize() (128). Both can be specified via overloads of parallel().
I still don't understand the purpose of this prefetch, and why it is so big. I guess this means the operators below it will hold onto more than one emission (128 by default). However, I can't imagine this is a good idea, since downstream operators will effectively be single-threaded until we have more than 128 emissions from upstream? (E.g., if we have 130, the first 128 will be prefetched by one thread, the last 2 will be given to the second one, and all other threads will do nothing.)
I guess smaller objects in faster flowables should have a larger prefetch, since the cost of passing data along the rx chain is relatively higher for them. I am not sure which numbers to pick here, though.

Why LRU implementation is expensive in full associative TLB?

I have a book statement:
Implementation of LRU in full associative TLB is very expensive, so the general way is to use random substitution.
I don't understand why it's expensive under full associative cache. Isn't that just adding an additional reference bit...?
LRU requires maintaining a total order relation between all valid cache lines in a cache set. For example, consider a 3-way cache set with the following lines A, B, and C ordered from the most recently accessed to the least recently accessed (represented as ABC). If C is accessed next, then the order becomes CAB. If a new line, D, needs to be filled in the same cache set, since there are no invalid lines, the LRU replacement policy will choose B to be evicted and replaced by the new line. Then the order becomes DCA.
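The ordering example above can be replayed with a small sketch. This is only an illustrative software model of one cache set, not how hardware tracks the order; the function name `access` and the list representation (most recently used first) are assumptions made for the example.

```python
def access(order, line, ways):
    """Return (new MRU-to-LRU order, evicted line or None) after accessing `line`."""
    order = list(order)
    evicted = None
    if line in order:
        order.remove(line)        # hit: the line will move to the front
    elif len(order) == ways:
        evicted = order.pop()     # miss in a full set: evict the least recently used line
    order.insert(0, line)         # the accessed line becomes the most recently used
    return order, evicted


state = ["A", "B", "C"]                      # ABC, most recent first
state, _ = access(state, "C", ways=3)
print("".join(state))                        # CAB
state, victim = access(state, "D", ways=3)
print("".join(state), "evicted:", victim)    # DCA evicted: B
```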
For a 3-way cache, there are up to 3*2 = 6 possible orders for the lines in each set. In general, for an N-way cache, there are up to N! (N factorial) possible orders. Theoretically, you need at least log2(N!) bits (rounded up to the nearest integer) per cache set to maintain the LRU property accurately. Note that log2(N!) is Θ(Nlog(N)), so it grows superlinearly with respect to the number of ways. No normal person likes anything whose cost grows superlinearly.
A particularly cheap case is a 2-way cache, where the LRU state requires only log2(2!) = 1 bit. It is much more expensive for any other number of ways though.
In practice, though, there is no easy way to maintain a single number that represents the LRU state of a set. If the current LRU state is X and then some access to a line occurs, how can the next LRU state be determined? There is no simple mathematical relation that can be implemented in hardware. So instead of using a single number, a realistic implementation would use multiple numbers, one per cache line. In this case, these numbers are called ages. Such design would even require (many) more bits than the theoretical minimum log2(N!) to maintain the LRU state.
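To get a feel for these numbers, here is a rough sketch that compares the theoretical minimum, ceil(log2(N!)), with a simple age-based scheme. The age scheme modelled here (one rank of ceil(log2(N)) bits per line) is an assumption chosen for illustration; real designs vary.

```python
from math import factorial


def min_lru_bits(ways):
    # ceil(log2(ways!)), computed exactly with integer arithmetic
    return (factorial(ways) - 1).bit_length()


def age_counter_bits(ways):
    # Assumed scheme: one age per line, each wide enough to hold a rank 0..ways-1
    return ways * (ways - 1).bit_length()


for ways in (2, 4, 8, 16):
    print(ways, "ways:", min_lru_bits(ways), "bits minimum,",
          age_counter_bits(ways), "bits with per-line ages")
```

For a fully associative TLB the number of ways equals the number of entries, which is why the per-set cost becomes large.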
Aside from the hardware overhead, the LRU replacement policy is not necessarily optimal for performance. It depends on the memory access patterns of the applications in the target market domain and the rest of the cache hierarchy.
LRU has been used in many real processors. Caches that are 2-way associative typically use LRU. For example, AMD SledgeHammer uses LRU for both L1I and L1D caches. The Itanium 2 processor's L1 instruction cache uses LRU and it is 4-way associative. Usually, when the number of ways is larger than two, caches don't use LRU.

How to calculate the best numberOfPartitions for coalesce?

So, I understand that in general one should use coalesce() when:
the number of partitions decreases due to a filter or some other operation that may result in reducing the original dataset (RDD, DF). coalesce() is useful for running operations more efficiently after filtering down a large dataset.
I also understand that it is less expensive than repartition as it reduces shuffling by moving data only if necessary. My problem is how to define the parameter that coalesce takes (idealPartionionNo). I am working on a project which was passed to me from another engineer and he was using the below calculation to compute the value of that parameter.
// DEFINE OPTIMAL PARTITION NUMBER
implicit val NO_OF_EXECUTOR_INSTANCES = sc.getConf.getInt("spark.executor.instances", 5)
implicit val NO_OF_EXECUTOR_CORES = sc.getConf.getInt("spark.executor.cores", 2)
val idealPartionionNo = NO_OF_EXECUTOR_INSTANCES * NO_OF_EXECUTOR_CORES * REPARTITION_FACTOR
This is then used with a partitioner object:
val partitioner = new HashPartitioner(idealPartionionNo)
but also used with:
RDD.filter(x=>x._3<30).coalesce(idealPartionionNo)
Is this the right approach? What is the main idea behind the idealPartionionNo value computation? What is the REPARTITION_FACTOR? How do I generally work to define that?
Also, since YARN is responsible for identifying the available executors on the fly is there a way of getting that number (AVAILABLE_EXECUTOR_INSTANCES) on the fly and use that for computing idealPartionionNo (i.e. replace NO_OF_EXECUTOR_INSTANCES with AVAILABLE_EXECUTOR_INSTANCES)?
Ideally, some actual examples of the form:
Here's a dataset (size);
Here's a number of transformations and possible reuses of an RDD/DF.
Here is where you should repartition/coalesce.
Assume you have n executors with m cores and a partition factor equal to k
then:
The ideal number of partitions would be ==> ???
Also, if you can refer me to a nice blog that explains these I would really appreciate it.
In practice optimal number of partitions depends more on the data you have, transformations you use and overall configuration than the available resources.
If the number of partitions is too low you'll experience long GC pauses, different types of memory issues, and lastly suboptimal resource utilization.
If the number of partitions is too high then maintenance cost can easily exceed processing cost. Moreover, if you use non-distributed reducing operations (like reduce in contrast to treeReduce), a large number of partitions results in a higher load on the driver.
You can find a number of rules which suggest oversubscribing partitions compared to the number of cores (factor 2 or 3 seems to be common) or keeping partitions at a certain size but this doesn't take into account your own code:
If you allocate a lot you can expect long GC pauses and it is probably better to go with smaller partitions.
If a certain piece of code is expensive then your shuffle cost can be amortized by a higher concurrency.
If you have a filter you can adjust the number of partitions based on a discriminative power of the predicate (you make different decisions if you expect to retain 5% of the data and 99% of the data).
In my opinion:
With one-off jobs, keep a higher number of partitions to stay on the safe side (slower is better than failing).
With reusable jobs, start with a conservative configuration, then execute - monitor - adjust configuration - repeat.
Don't try to use a fixed number of partitions based on the number of executors or cores. First understand your data and code, then adjust the configuration to reflect your understanding.
Usually, it is relatively easy to determine the amount of raw data per partition for which your cluster exhibits stable behavior (in my experience it is somewhere in the range of few hundred megabytes, depending on the format, data structure you use to load data, and configuration). This is the "magic number" you're looking for.
Some things you have to remember in general:
Number of partitions doesn't necessarily reflect data distribution. Any operation that requires shuffle (*byKey, join, RDD.partitionBy, Dataset.repartition) can result in non-uniform data distribution. Always monitor your jobs for symptoms of a significant data skew.
Number of partitions in general is not constant. Any operation with multiple dependencies (union, coGroup, join) can affect the number of partitions.
Your question is a valid one, but Spark partitioning optimization depends entirely on the computation you're running. You need to have a good reason to repartition/coalesce; if you're just counting an RDD (even if it has a huge number of sparsely populated partitions), then any repartition/coalesce step is just going to slow you down.
Repartition vs coalesce
The difference between repartition(n) (which is the same as coalesce(n, shuffle = true)) and coalesce(n, shuffle = false) has to do with the execution model. The shuffle model takes each partition in the original RDD, randomly sends its data around to all executors, and results in an RDD with the new (smaller or greater) number of partitions. The no-shuffle model creates a new RDD which loads multiple partitions as one task.
Let's consider this computation:
sc.textFile("massive_file.txt")
.filter(sparseFilterFunction) // leaves only 0.1% of the lines
.coalesce(numPartitions, shuffle = shuffle)
If shuffle is true, then the text file / filter computations happen in a number of tasks given by the defaults in textFile, and the tiny filtered results are shuffled. If shuffle is false, then the number of total tasks is at most numPartitions.
If numPartitions is 1, then the difference is quite stark. The shuffle model will process and filter the data in parallel, then send the 0.1% of filtered results to one executor for downstream DAG operations. The no-shuffle model will process and filter the data all on one core from the beginning.
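The difference can be summarized with a toy model of how many tasks run the read-and-filter stage. This is a plain Python sketch of the reasoning above, not Spark code; `upstream_tasks` and the figure of 1000 input partitions are made up for the example.

```python
def upstream_tasks(input_partitions, num_partitions, shuffle):
    """Tasks that run the read + filter stage in this simplified model."""
    if shuffle:
        # The shuffle is a stage boundary: reading and filtering keep
        # the parallelism of the input.
        return input_partitions
    # Without a shuffle, several input partitions are loaded by one task,
    # so there are at most num_partitions tasks.
    return min(input_partitions, num_partitions)


# Hypothetical file split into 1000 input partitions, coalesced to 1.
print(upstream_tasks(1000, 1, shuffle=True))   # 1000
print(upstream_tasks(1000, 1, shuffle=False))  # 1
```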
Steps to take
Consider your downstream operations. If you're just using this dataset once, then you probably don't need to repartition at all. If you are saving the filtered RDD for later use (to disk, for example), then consider the tradeoffs above. It takes experience to become familiar with these models and when one performs better, so try both out and see how they perform!
As others have answered, there is no formula which calculates what you ask for. That said, you can make an educated guess on the first part and then fine-tune it over time.
The first step is to make sure you have enough partitions. If you have NO_OF_EXECUTOR_INSTANCES executors and NO_OF_EXECUTOR_CORES cores per executor then you can process NO_OF_EXECUTOR_INSTANCES*NO_OF_EXECUTOR_CORES partitions at the same time (each would go to a specific core of a specific instance).
That said, this assumes everything is divided equally between the cores and everything takes exactly the same time to process. This is rarely the case. There is a good chance that some tasks will finish before others, either because of locality (e.g. the data needs to come from a different node) or simply because they are not balanced (e.g. if you have data partitioned by root domain, then partitions including google would probably be quite big). This is where the REPARTITION_FACTOR comes into play. The idea is that we "overbook" each core, so if one task finishes very quickly and another finishes slowly, we have the option of dividing the remaining tasks between the cores. A factor of 2-3 is generally a good idea.
Now let's take a look at the size of a single partition. Let's say your entire data is X MB in size and you have N partitions. Each partition would be on average X/N MB. If N is large relative to X, then you might have a very small average partition size (e.g. a few KB). In this case it is usually a good idea to lower N, because the overhead of managing each partition becomes too high. On the other hand, if the size is very large (e.g. a few GB), then you need to hold a lot of data at the same time, which would cause issues such as garbage collection pressure, high memory usage, etc.
There is no single optimal size, but people generally seem to prefer partitions of 100-1000MB; in truth, tens of MB would probably also be fine.
Another thing to note is how your partitions change as the computation proceeds. For example, let's say you start with 1000 partitions of 100MB each but then filter the data so each partition becomes 1K; then you should probably coalesce. Similar issues can happen when you do a groupby or join. In such cases both the size of the partitions and the number of partitions change and might reach an undesirable size.
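As a rough illustration of this reasoning, the sketch below computes the oversubscribed partition count and checks the average partition size. It is plain Python rather than Spark code, and the function names and the 10 MB / 1000 MB thresholds are assumptions picked from the ranges mentioned above, not recommended constants.

```python
def oversubscribed_partitions(executors, cores_per_executor, factor):
    # Tasks that can run at once, "overbooked" by the repartition factor
    return executors * cores_per_executor * factor


def review_partitioning(total_mb, partitions, min_mb=10, max_mb=1000):
    """Classify the average partition size X/N against assumed thresholds."""
    average_mb = total_mb / partitions
    if average_mb < min_mb:
        return "coalesce"      # many tiny partitions: management overhead dominates
    if average_mb > max_mb:
        return "repartition"   # huge partitions: GC and memory pressure
    return "ok"


print(oversubscribed_partitions(5, 2, 3))      # 30
print(review_partitioning(100_000, 1000))      # 1000 partitions of 100MB -> ok
print(review_partitioning(1.0, 1000))          # about 1K each after a filter -> coalesce
```

The thresholds are the part to tune by monitoring real jobs; the structure of the check is the only thing the sketch is meant to show.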

CPU clock cycle misunderstanding

I don't fully understand CPU clock speeds such as 3.4GHz. I know this means 3.4 billion clock cycles per second.
So if a machine uses single-clock-cycle instructions, it can execute about 3.4 billion instructions per second.
But a pipelined design basically needs more cycles per instruction, though each cycle is shorter than the cycle of a single-cycle design.
But although pipeline has more throughput, anyway cpu can do 3.4 billions cycle per second. So, it can execute 3.4 billions/5 instructions(if one instruction needs 5 cycles), which means less than single cycle implementation(3.4 > 3.4/5). What am I missing?
Does CPU clock such as 3.4Ghz just means for based on pipeline cycle, not for based on single cycle implementation?
Pipelining
Pipelining doesn't involve cycles shorter than a single clock cycle. Here's how pipelining works:
We have a complicated task to do. We take that task and break it down into a number of stages, each of which is relatively simple to carry out. We study the amount of work in each stage to make sure each stage takes about the same amount of time as any other.
With a processor, we do roughly the same thing--but in this case, it's not "install these fourteen bolts", it's things like fetching and decoding instructions, reading operands, executing (often a couple of stages here), and writing back results.
Like the automotive production line, we provide each stage of the pipeline with a specialized set of tools for doing exactly (and only) what is needed at that stage. When we finish doing one stage of processing on a car/instruction, it moves along to the next stage, and this stage gets the next car/instruction to process.
In an ideal situation, the process works (roughly) like this:
It took Ford about 12 hours to build one Model N car (the predecessor to the Model T). Thanks primarily to pipelining the production line, it took only about 2 and a half hours to build a Model T. More importantly, even though a Model T took 2.5 hours start to finish, that time was broken down into no fewer than 84 discrete steps, so when everything ran smoothly the production line as a whole could produce another car (about) every two minutes.
That didn't always happen though. If one stage ran short of parts, the stages after it had to wait. If the pause lasted very long, it would back things up so the preceding stages had to wait too.
The same can happen in a processor pipeline. For example, when a branch happens, the processor may have to wait a while before the next instruction can be fetched. If an instruction needs an operand from memory, that can lead to a pause (a "pipeline bubble") as well.
To prevent pauses in his pipeline, Henry Ford hired people to study the stages, figure out how many of each kind of part would need to be on hand for each stage, and so on. I don't know for sure, but I think it's a fair guess that there were probably a few people designated to watch the supply of parts at different stations, and send somebody running to let a warehouse manager know if (for whatever reason) the supply of parts for a particular stage looked like it was running short so they'd need more soon.
Processors do a little of the same thing--they have things like branch predictors and prefetchers that attempt to figure out ahead of time what will be needed by the stream of instructions being executed, and try to ensure that everything is on hand when it's needed (with caches, for example, to temporarily store things that seem likely to be needed soon).
So, like the Model T, it takes some relatively long amount of time for each instruction to execute start to finish, but we get another product finished at much shorter intervals--ideally once a clock (but see my other answer--modern designs often execute more than one instruction per clock).
A typical modern CPU can execute a number of unrelated instructions (those that don't depend on the same resources) concurrently.
To do that, it typically ends up with a basic structure vaguely like this:
So, we have an instruction stream coming in on the left. We have three decoders, each of which can decode one instruction each clock cycle (but there may be limitations, so complex instructions all have to pass through one decoder, and the other two decoders can only do simple instructions).
From there, the instructions pass into a reorder buffer, which keeps a "scoreboard" of which resources are used by each instruction, and which resources are affected by that instruction (where a "resource" would typically be something like a CPU register or a flag in the flags register).
The circuitry then compares those scoreboards to determine dependencies. For example, if one instruction writes to register 0, and a later one reads from register 0, then those instructions must execute serially. At each clock, it tries to find the N oldest instructions that don't have dependencies for execution.
There are then a number of independent execution units. Each of these is basically a "pure" function--it takes some inputs, carries out a specified transformation on it, and produces an output. This makes it easy to replicate them as needed, and have as many running in parallel as we want/can afford. Those are typically grouped, with one port going to each group. In each clock, we can send one instruction through that port to one of the execution units in that group. Once an instruction arrives at the execution unit, it may take more than one clock to finish execution.
Once those execute, we have a set of retirement units that take the results, and write them back to the registers in execution order. Again we have multiple units so we can retire multiple instructions per clock.
Note: this drawing tries to be semi-realistic about the rough number of decoders, retirement units, and ports that it depicts, but what it shows is a general idea--different CPUs will have more or fewer specific resources. For almost any of them, the number of decoded instructions in the scoreboard units is low though--a realistic number would be more like 50 instructions.
In any case, actual execution of instructions is one of the hardest parts of this to measure or reason about. The number of ports gives us a hard upper limit on the number of instructions that can start executing in any given clock. The number of decoders and retirement units give an upper limit on the number of instructions that can be started/finished per clock. The execution itself...well, there are a lot of execution units, and each one (at least potentially) takes a different number of clocks to execute an instruction.
With the design as shown above, you'd have a hard upper limit of three instructions per clock. That's the most you can decode or retire. With a different design, that could obviously go up or down (e.g., with 4 decoders, 4 ports and 4 retirement units, the upper limit could go up to 4).
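The hard limit is simply the narrowest stage. A tiny sketch makes that explicit; the port count of 5 is a made-up value for illustration, since only the three decoders and the decode/retire limit of three are stated above.

```python
def peak_ipc(decoders, ports, retirement_units):
    # No more instructions can flow per clock than the narrowest stage allows
    return min(decoders, ports, retirement_units)


def instructions_per_second(ipc, clock_hz):
    return ipc * clock_hz


print(peak_ipc(decoders=3, ports=5, retirement_units=3))  # 3 (ports=5 is hypothetical)
print(peak_ipc(decoders=4, ports=4, retirement_units=4))  # 4
# Upper bound for the three-wide design at 3.4GHz, in instructions per second:
print(instructions_per_second(peak_ipc(3, 5, 3), 3.4e9))
```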
Realistically, with that design you wouldn't normally expect to see three instructions execute in most clock cycles. There are enough dependencies between instructions that you'd probably expect closer to 2 as a long term average (and much more likely a little less than 2). Increasing the available resources (more decoders, more retirement units, etc.) will rarely help that a whole lot--you might get to an average of three instructions per clock, but hoping for four is probably unrealistic.
As others have noted the full details of how a modern CPU operates are complicated. But part of your question has a simple answer:
Does CPU clock such as 3.4Ghz just means for based on pipeline cycle, not for based on single cycle implementation?
The clock frequency of a CPU refers to how many times per second the clock signal switches. The clock signal is not divided into smaller pipelined segments. The purpose of pipelining is to allow for faster clock switching speeds. So 3.4GHz refers to the number of times per second that a single pipeline stage can perform whatever work it needs to do when executing an instruction. The total work for executing an instruction is done over multiple cycles each of which could be in a different pipeline stage.
Your question also shows some misconceptions about how pipelining works:
But although pipeline has more throughput, anyway cpu can do 3.4 billions cycle per second. So, it can execute 3.4 billions/5 instructions(if one instruction needs 5 cycles), which means less than single cycle implementation(3.4 > 3.4/5). What am I missing?
In the simple case the throughput of a single-cycle CPU and a pipelined CPU is the same. The latency of the pipelined CPU is higher because it requires more cycles (i.e. 5 in your example) to execute a single instruction. But after the pipeline is full, one instruction completes every cycle, so the throughput can be the same as for a single-cycle non-pipelined CPU. So in the simple case, using your example, a single-cycle CPU could execute 3.4 billion instructions in 1 second, while the pipelined CPU with 5 stages could execute 3.4 billion minus 4 instructions in 1 second (the first instruction completes in cycle 5, and one more completes in every cycle after that). Subtracting 4 from 3.4 billion is a negligible difference, whereas dividing by 5 would be a very significant difference.
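The arithmetic can be checked with a short sketch of the idealized case (no stalls, one instruction entering the pipeline per cycle). The function names are made up for the example, and the model assumes the first result appears only after the pipeline has filled.

```python
def single_cycle_completed(cycles):
    # One instruction starts and finishes in every cycle
    return cycles


def pipelined_completed(cycles, stages):
    # The first instruction finishes in cycle `stages`; after that,
    # one instruction finishes per cycle.
    return max(0, cycles - (stages - 1))


def unpipelined_multicycle_completed(cycles, cycles_per_instruction):
    # The "divide by 5" reading: each instruction must finish
    # before the next one may start.
    return cycles // cycles_per_instruction


cycles = 3_400_000_000  # one second at 3.4GHz
print(single_cycle_completed(cycles))               # 3400000000
print(pipelined_completed(cycles, 5))               # 3399999996
print(unpipelined_multicycle_completed(cycles, 5))  # 680000000
```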
A couple of other things to note are that the simple case I described isn't really true because of dependencies between instructions that require pipeline stalls. And most modern CPUs can execute more than one instruction per cycle.

How to efficiently do scattered summing with SSE/x86

I've been tasked with writing a program that does streaming sums of vectors into scattered memory locations, at the absolute max speed possible. The input data is a destination ID and an XYZ float vector, so something like:
[198, {0.4,0,1}], [775, {0.25,0.8,0}], [12, {0.5,0.5,0.02}]
and I need to sum them into memory like so:
memory[198] += {0.4,0,1}
memory[775] += {0.25,0.8,0}
memory[12] += {0.5,0.5,0.02}
To complicate matters, there will be multiple threads doing this at the same time, reading from different input streams but summing to the same memory. I don't anticipate there being a lot of contention for the same memory locations, but there will be some. The data sets will be pretty large - multiple streams of 10+ GB apiece that we'll be streaming simultaneously from multiple SSDs to get the highest possible read bandwidth. I'm assuming SSE for the math, although it certainly doesn't have to be that way.
The results won't be used for a while, so I don't need to pollute the cache... but I'm summing into memory, not just writing, so I can't use something like MOVNTPS, right? But since the threads won't be stepping on each other that much, how can I do this without a lot of locking overhead? Would you do this with memory fencing?
Thanks for any help. I can assume Nehalem and above, if that makes a difference.
You can use spin locks for synchronized access to array elements (one per ID) and SSE for summing. In C++, depending on the compiler, intrinsic functions may be available, e.g. the Streaming SIMD Extensions intrinsics and InterlockedExchange in Visual C++.
Your program's performance will be limited by memory bandwidth. Don't expect significant speed improvement from multithreading unless you have a multi-CPU (not just multi-core) system.
Start one thread per CPU. Statically distribute destination data between these threads. And provide each thread with the same input data. This allows better use of NUMA architecture. And avoids extra memory traffic for thread synchronization.
In case of single-CPU system, use only one thread accessing destination data.
Probably, the only practical use for more cores in CPUs is to load input data with additional threads.
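The static-distribution idea can be sketched as follows: every thread scans the same input, but only sums records whose destination it owns, so no locking is needed. This is a Python model of the ownership logic only (it says nothing about real memory bandwidth or NUMA placement); the names `owner_of` and `accumulate_owned`, the table size of 1024, and the contiguous-range ownership rule are assumptions for illustration.

```python
from threading import Thread

TABLE_SIZE = 1024  # hypothetical number of destination slots


def owner_of(dest_id, num_threads):
    # Contiguous ranges: thread 0 owns the first block of IDs, and so on
    return dest_id * num_threads // TABLE_SIZE


def accumulate_owned(thread_index, num_threads, records, memory):
    for dest_id, (x, y, z) in records:
        if owner_of(dest_id, num_threads) != thread_index:
            continue  # another thread owns this destination
        slot = memory[dest_id]
        slot[0] += x
        slot[1] += y
        slot[2] += z


records = [(198, (0.4, 0.0, 1.0)), (775, (0.25, 0.8, 0.0)),
           (12, (0.5, 0.5, 0.02)), (198, (0.1, 0.0, 0.0))]
memory = [[0.0, 0.0, 0.0] for _ in range(TABLE_SIZE)]

threads = [Thread(target=accumulate_owned, args=(t, 2, records, memory))
           for t in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(memory[198], memory[775], memory[12])
```

Each destination is written by exactly one thread, which is what removes the need for spin locks or atomic operations in this scheme; the price is that every thread reads the whole input.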
One obvious optimization is to align destination data by 16 bytes (to avoid touching two cache lines while accessing single data element).
You can use SIMD to perform the addition, or allow compiler to automatically vectorize your code, or just leave this operation completely unoptimized - it doesn't matter, it's nothing compared to the memory bandwidth problems.
As for polluting the cache with output data, MOVNTPS cannot help here, but you can use PREFETCHNTA to prefetch output data elements several steps ahead while minimizing cache pollution. Whether it will improve or degrade performance, I don't know. It avoids cache thrashing, but leaves most of the cache unused.

Why are CPU registers fast to access?

Register variables are a well-known way to get fast access (register int i). But why are registers on the top of hierarchy (registers, cache, main memory, secondary memory)? What are all the things that make accessing registers so fast?
Registers are circuits which are literally wired directly to the ALU, which contains the circuits for arithmetic. Every clock cycle, the register unit of the CPU core can feed a half-dozen or so variables into the other circuits. Actually, the units within the datapath (ALU, etc.) can feed data to each other directly, via the bypass network, which in a way forms a hierarchy level above registers — but they still use register-numbers to address each other. (The control section of a fully pipelined CPU dynamically maps datapath units to register numbers.)
The register keyword in C does nothing useful and you shouldn't use it. The compiler decides what variables should be in registers and when.
Registers are a core part of the CPU, and much of the instruction set of a CPU will be tailored for working against registers rather than memory locations. Accessing a register's value will typically require very few clock cycles (likely just 1). As soon as memory is accessed, things get more complex: cache controllers and memory buses get involved, and the operation is going to take considerably more time.
Several factors lead to registers being faster than cache.
Direct vs. Indirect Addressing
First, registers are directly addressed based on bits in the instruction. Many ISAs encode the source register addresses in a constant location, allowing them to be sent to the register file before the instruction has been decoded, speculating that one or both values will be used. The most common memory addressing modes indirect through a register. Because of the frequency of base+offset addressing, many implementations optimize the pipeline for this case. (Accessing the cache at different stages adds complexity.) Caches also use tagging and typically use set associativity, which tends to increase access latency. Not having to handle the possibility of a miss also reduces the complexity of register access.
Complicating Factors
Out-of-order implementations and ISAs with stacked or rotating registers (e.g., SPARC, Itanium, Xtensa) do rename registers. Specialized caches such as Todd Austin's Knapsack Cache (which directly indexes the cache with the offset) and some stack cache designs (e.g., using a small stack frame number and directly indexing a chunk of the specialized stack cache using that frame number and the offset) avoid register read and addition. Signature caches associate a register name and offset with a small chunk of storage, providing lower latency for accesses to the lower members of a structure. Index prediction (e.g., XORing offset and base, avoiding carry propagation delay) can reduce latency (at the cost of handling mispredictions). One could also provide memory addresses earlier for simpler addressing modes like register indirect, but accessing the cache in two different pipeline stages adds complexity. (Itanium only provided register indirect addressing, with optional post-increment.) Way prediction (and hit speculation in the case of direct mapped caches) can reduce latency (again with misprediction handling costs). Scratchpad (a.k.a. tightly coupled) memories do not have tags or associativity and so can be slightly faster (as well as have lower access energy), and once an access is determined to be to that region a miss is impossible. The contents of a Knapsack Cache can be treated as part of the context and the context not be considered ready until that cache is filled. Registers could also be loaded lazily (particularly for Itanium stacked registers), theoretically, and so have to handle the possibility of a register miss.
Fixed vs. Variable Size
Registers are usually fixed size. This avoids the need to shift the data retrieved from aligned storage to place the actual least significant bit into its proper place for the execution unit. In addition, many load instructions sign extend the loaded value, which can add latency. (Zero extension is not dependent on the data value.)
Complicating Factors
Some ISAs do support sub-registers, notably x86 and zArchitecture (descended from S/360), which can require pre-shifting. One could also provide fully aligned loads at lower latency (likely at the cost of one cycle of extra latency for other loads); subword loads are common enough and the added latency small enough that special casing is not common. Sign extension latency could be hidden behind carry propagation latency; alternatively sign prediction could be used (likely just speculative zero extension) or sign extension treated as a slow case. (Support for unaligned loads can further complicate cache access.)
Small Capacity
A typical register file for an in-order 64-bit RISC will be only about 256 bytes (32 8-byte registers). 8KiB is considered small for a modern cache. This means that multiplying the physical size and static power to increase speed has a much smaller effect on the total area and static power. Larger transistors have higher drive strength and other area-increasing design factors can improve speed.
Complicating Factors
Some ISAs have a large number of architected registers and may have very wide SIMD registers. In addition, some implementations add additional registers for renaming or to support multithreading. GPUs, which use SIMD and support multithreading, can have especially high capacity register files; GPU register files are also different from CPU register files in typically being single ported, accessing four times as many vector elements of one operand/result per cycle as can be used in execution (e.g., with 512-bit wide multiply-accumulate execution, reading 2Kib (2048 bits) of each of three operands and writing 2Kib of the result).
Common Case Optimization
Because register access is intended to be the common case, area, power, and design effort is more profitably spent to improve performance of this function. If 5% of instructions use no source registers (direct jumps and calls, register clearing, etc.), 70% use one source register (simple loads, operations with an immediate, etc.), 25% use two source registers, and 75% use a destination register, while 50% access data memory (40% loads, 10% stores) — a rough approximation loosely based on data from SPEC CPU2000 for MIPS —, then more than three times as many of the (more timing-critical) reads are from registers than memory (1.3 per instruction vs. 0.4) and
Complicating Factors
Not all processors are designed for "general purpose" workloads. E.g., a processor using in-memory vectors and targeting dot product performance, using registers for vector start address, vector length, and an accumulator, might have little reason to optimize register latency (extreme parallelism simplifies hiding latency), and memory bandwidth would be more important than register bandwidth.
Small Address Space
A last, somewhat minor advantage of registers is that the address space is small. This reduces the latency for address decode when indexing a storage array. One can conceive of address decode as a sequence of binary decisions (this half of a chunk of storage or the other). A typical cache SRAM array has about 256 wordlines (columns, index addresses) — 8 bits to decode — and the selection of the SRAM array will typically also involve address decode. A simple in-order RISC will typically have 32 registers — 5 bits to decode.
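The decode depth is just the base-2 logarithm of the number of entries, which a short sketch can confirm. The helper name `decode_bits` is made up for the example.

```python
def decode_bits(entries):
    # Number of binary decisions needed to select one of `entries` locations,
    # i.e. ceil(log2(entries)) computed with integer arithmetic
    return (entries - 1).bit_length()


print(decode_bits(32))   # 5 bits for a 32-entry register file
print(decode_bits(256))  # 8 bits for a 256-wordline SRAM array
```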
Complicating Factors
Modern high-performance processors can easily have 8 bit register addresses (Itanium had more than 128 general purpose registers in a context and higher-end out-of-order processors can have even more registers). This is also a less important consideration relative to those above, but it should not be ignored.
Conclusion
Many of the above considerations overlap, which is to be expected for an optimized design. If a particular function is expected to be common, not only will the implementation be optimized but the interface as well. Limiting flexibility (direct addressing, fixed size) naturally aids optimization and smaller is easier to make faster.
Registers are essentially internal CPU memory. So accesses to registers are easier and quicker than any other kind of memory accesses.
Smaller memories are generally faster than larger ones; they can also require fewer bits to address. A 32-bit instruction word can hold three four-bit register addresses and have lots of room for the opcode and other things; one 32-bit memory address would completely fill up an instruction word leaving no room for anything else. Further, the time required to address a memory increases at a rate more than proportional to the log of the memory size. Accessing a word from a 4 gig memory space will take dozens if not hundreds of times longer than accessing one from a 16-word register file.
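The encoding argument is easy to check numerically. The sketch below only counts bits in a hypothetical fixed-width instruction word; the function name is an assumption made for the example.

```python
def bits_left_after_operands(word_bits, operands, bits_per_operand):
    # Bits remaining for the opcode and other fields
    return word_bits - operands * bits_per_operand


# Three 4-bit register addresses (16 registers) in a 32-bit instruction word:
print(bits_left_after_operands(32, 3, 4))   # 20
# One full 32-bit memory address in the same word:
print(bits_left_after_operands(32, 1, 32))  # 0
```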
A machine that can handle most information requests from a small fast register file will be faster than one which uses a slower memory for everything.
Every microcontroller has a CPU as Bill mentioned, that has the basic components of ALU, some RAM as well as other forms of memory to assist with its operations. The RAM is what you are referring to as Main memory.
The ALU handles all of the arithmetic and logical operations. To operate on any operands for these calculations, it loads the operands into registers, performs the operations on them, and then your program accesses the results stored in these registers directly or indirectly.
Since registers are closest to the heart of the CPU (a.k.a. the brain of your processor), they are higher up in the chain, and of course operations performed directly on registers take the fewest clock cycles.