Different structures in out-of-order processors (x86-64)

I am studying out-of-order processors. Different types of out-of-order processors use different structures, such as the ARF (architectural register file), PRF (physical register file), ROB (reorder buffer), and FSB (finished store buffer), each of which stores some kind of information needed for out-of-order execution.
So are these structures storage units? If so, which memory technology are they implemented in? Are they the same as registers, i.e., multi-ported SRAMs?

Related

The output of the WordCount is being stored in different files

The output of the WordCount is being stored in multiple files.
However, the developer has no control over where (IP, path) the files end up on the cluster.
In the MapReduce API, there is a provision for developers to write a reducer to address this. How can this be handled in Apache Beam with the DirectRunner or any other runner?
Indeed -- the WordCount example pipeline in Apache Beam writes its output using TextIO.Write, which doesn't (by default) specify the number of output shards.
By default, each runner independently decides how many shards to produce, typically based on its internal optimizations. The user can, however, control this via the .withNumShards() API, which forces a specific number of shards. Of course, forcing a specific number may require more work from a runner, which may or may not result in somewhat slower execution.
Regarding "where the files stay on the cluster" -- it is Apache Beam's philosophy that this complexity should be abstracted away from the user. In fact, Apache Beam raises the level of abstraction such that user don't need to worry about this. It is runner's and/or storage system's responsibility to manage this efficiently.
Perhaps to clarify -- we can draw an easy parallel between low-level programming (e.g., direct assembly), unmanaged programming (e.g., C or C++), and managed programming (e.g., C# or Java). As you go higher in abstraction, you can no longer control data locality (e.g., processor caching), but you gain power, ease of use, and portability.

What are the data structures used to implement the Process Control Block in Unix?

I am taking an operating systems course, and we have talked about what a process control block is, what is stored in it, and what purpose it serves, and I understand all of that, but we didn't really touch on what data structure is actually used to implement it. After googling, I've come across two candidates: either a linked list or an array. I realize that the structure could possibly differ based on the operating system, but I was wondering exactly what data structures are used to create one, specifically in the Unix operating system (since I am using a Unix machine)?
A doubly linked list is the data structure generally used to implement the Process Control Block, and in UNIX, too, the PCB is kept on a doubly linked list.
If your OS (a custom OS, say) is lightweight, you can get away with a simpler data structure such as an array. In general, though, a PCB is a large structure, so it is advisable to keep PCBs in a doubly linked list, which can accommodate any number of processes and all the information stored about each one; a simplified sketch follows below.
I have also given the same answer elsewhere.
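For illustration, here is a minimal C sketch of that layout. It is loosely inspired by Linux's task_struct (which embeds list pointers for exactly this purpose), but every field and function name below is invented for the example:

    #include <stdlib.h>
    #include <sys/types.h>

    /* Hypothetical, simplified PCB; a real kernel structure (e.g.,
     * Linux's struct task_struct) holds far more state. */
    struct pcb {
        pid_t       pid;          /* process id                  */
        int         state;        /* running, ready, blocked...  */
        void       *saved_regs;   /* CPU context on a switch     */
        struct pcb *prev, *next;  /* doubly linked process list  */
    };

    static struct pcb *process_list;  /* head of the list */

    /* Insert a new PCB at the head: O(1), and the list can grow to
     * any number of processes, unlike a fixed-size array. */
    static void pcb_insert(struct pcb *p)
    {
        p->prev = NULL;
        p->next = process_list;
        if (process_list)
            process_list->prev = p;
        process_list = p;
    }

    /* Unlink a PCB when its process exits: also O(1). */
    static void pcb_remove(struct pcb *p)
    {
        if (p->prev) p->prev->next = p->next;
        else         process_list  = p->next;
        if (p->next) p->next->prev = p->prev;
    }

The O(1) insertion and removal, plus the absence of any fixed capacity, is why the linked list wins over an array for general-purpose kernels.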

What is the difference between a store queue and a store buffer?

I am reading a number of papers and they are either using store buffer and store queue interchangeably or they are relating to different structures, and I just cannot follow along. This is what I thought a store queue was:
It is an associatively searchable FIFO queue that keeps information about store instructions in fetch order.
It keeps store addresses and data.
It keeps store instructions' data until the instructions become non-speculative, i.e., they reach the retirement stage. The data of a store instruction is sent to memory (the L1 cache in this case) from the store queue only at retirement. This is important since we do not want speculative store data written to memory: it would corrupt the in-order memory state, and we would not be able to repair that state after a misprediction.
Upon a misprediction, store queue entries corresponding to store instructions fetched after the mispredicted instruction are removed.
Load instructions send a read request to both L1 cache and the store queue. If data with the same address is found in the store queue, it is forwarded to the load instruction. Otherwise, data fetched from L1 is used.
I am not sure what a store buffer is, but I was thinking it was just some buffer space that keeps the data of retired store instructions waiting to be written to memory (again, the L1).
Now, here is why I am getting confused. In this paper, it is stated that "we propose the scalable store buffer [SSB], which places private/speculative values directly into the L1 cache, thereby eliminating the non-scalable associative search of conventional store buffers." I am thinking that the non-scalable associatively searchable conventional structure they are talking about is what I know as a store queue, because they also say that
SSB eliminates the non-scalable associative search of conventional
store buffers by forwarding processor-visible/speculative values to
loads directly from the L1 cache.
As I mentioned above, as far as I know, data forwarding to loads is done through the store queue. In the footnote on the first page, it is also stated that
We use "store queue" to refer to storage that holds stores’ values
prior to retirement and "store buffer" to refer to storage containing
retired store values prior to their release to memory.
This is in line with what I explained above, but then it conflicts with the 'store buffer' in the first quote. The footnote corresponds to one of the references in the paper. In that reference, they say
a store buffer is a mechanism that exists in many current processors
to accomplish one or more of the following: store access ordering,
latency hiding and data forwarding.
Again, I thought the mechanism accomplishing those was called a store queue. In the same paper, they later say
non-blocking caches and buffering structures such as write buffers,
store buffers, store queues, and load queues are typically employed.
So, they mention store buffer and store queue separately, but store queue is not mentioned again later. They say
the store buffer maintains the ordering of the stores and allows
stores to be performed only after all previous instructions have been
completed
and their store buffer model is the same as Mike Johnson's model. In Johnson's book (Superscalar Microprocessor Design), stores first go to a store reservation station in fetch order. From there, they are sent to the address unit, and from the address unit they are written into a "store buffer" along with their corresponding data. Load forwarding is handled through this store buffer. Once again, I thought this structure was called a store queue. In reference #2, the authors also mention that
The Alpha 21264 microprocessor has a 32-entry speculative store buffer
where a store remains until it is retired.
I looked at a paper about the Alpha 21264, which states that
Stores first transfer
their data across the data buses into the speculative store buffer.
Store data remains in the speculative store buffer until the stores retire.
Once they retire, the data is written into the data cache on idle cache cycles.
Also,
The internal memory system maintains a 32-entry load queue (LDQ) and
a 32-entry store queue (STQ) that manages the references while they
are in-flight. [...] Stores exit the STQ in fetch order after they
retire and dump into the data cache. [...] The STQ CAM logic controls
the speculative data buffer. It enables the bypass of speculative
store data to loads when a younger load issues after an older store.
So, it sounds like in the Alpha 21264 there is a store queue that keeps some information about store instructions in fetch order, but it does not keep the stores' data; store instructions' data is kept in the store buffer.
So, after all of this, I am not sure what a store buffer is. Is it just an auxiliary structure for a store queue, or is it a completely different structure that holds data waiting to be written to the L1? Or is it something else? I feel like some authors mean "store queue" when they say "store buffer". Any ideas?
Your initial understanding is correct - store buffer and store queue are distinct terms for distinct hardware structures with different uses. If some authors use them interchangeably, that is simply incorrect.
Store Buffer:
A store buffer is a hardware structure closer to the memory hierarchy that "buffers" up the write traffic (stores) from the processor so that the write-back stage of the pipeline completes as quickly as possible.
Depending on whether the cache is write-allocate/write-no-allocate, a write to the cache may take a variable number of cycles. The store buffer essentially decouples the processor pipeline from the memory pipeline. You can read some more info here.
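As a rough illustration of that decoupling (a toy sketch, not modeled on any specific processor), the store buffer behaves like a small FIFO between the pipeline's write-back stage and the cache:

    #include <stdbool.h>
    #include <stdint.h>

    /* Toy FIFO store buffer decoupling write-back from a cache
     * whose write latency varies. */
    #define SB_ENTRIES 8

    struct store { uint64_t addr; uint64_t data; };

    static struct store sb[SB_ENTRIES];
    static int sb_head, sb_tail, sb_count;

    /* Write-back stage: the store completes immediately if a free
     * entry exists, no matter how long the cache write takes. */
    static bool sb_push(uint64_t addr, uint64_t data)
    {
        if (sb_count == SB_ENTRIES)
            return false;          /* buffer full: stall the store */
        sb[sb_tail] = (struct store){ addr, data };
        sb_tail = (sb_tail + 1) % SB_ENTRIES;
        sb_count++;
        return true;
    }

    /* Memory side: drain one entry whenever a write port is idle. */
    static void sb_drain(void (*cache_write)(uint64_t, uint64_t))
    {
        if (sb_count == 0)
            return;
        cache_write(sb[sb_head].addr, sb[sb_head].data);
        sb_head = (sb_head + 1) % SB_ENTRIES;
        sb_count--;
    }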
Another typical use of a store buffer is in speculative execution. When the processor speculatively executes and commits (as in hardware transactional memory systems, or designs like the Transmeta Crusoe), the hardware must keep track of the speculative writes and undo them in case of misspeculation. This is what such a processor uses the store buffer for.
Store Queue:
A store queue is an associative array in which the processor keeps the data and addresses of in-flight stores. It is typically used in out-of-order processors for memory disambiguation: the processor really needs a load-store queue (LSQ) to perform memory disambiguation, because it must see all in-flight memory accesses to the same address before deciding to schedule one memory operation before another.
All the memory disambiguation logic is accomplished via the load-store queues in an out-of-order processor. Read more about memory disambiguation here.
If your confusion stems solely from the paper you are referring to, consider asking the authors - it is likely that their use of terminology is mixed up.
You seem to be making a big deal out of names; it's not that critical. A buffer is just some generic storage which, in this particular case, should be managed as a queue (to maintain program order, as you stated). So it could be called a store buffer (the term I'm more familiar with, actually; see also here), but in other designs it is described as a store queue (some designs combine it with the load queue, forming an LSQ).
The names don't matter that much because, as you see in your second quote, people may overload them to describe new things. In this particular case, the authors chose to split the store buffer into two parts, divided by the retirement pointer, because they believe this can avoid certain store-related stalls under some consistency models. Hey, it's their paper; for the remainder of it they get to define what they want.
One note though - the last bullet of your description of the store buffer/queue seems very architecture-specific. Forwarding local stores to loads at the highest priority may miss later stores to the same address from other threads, and would break most memory ordering models except the most relaxed ones (unless you protect against that in some other way).
This is in line with what I explained above, but then it conflicts
with the 'store buffer' in the first quote.
There is really no conflict, and your understanding seems to be consistent with the way these terms are used in the paper. Let's carefully go through what the authors have said.
SSB eliminates the non-scalable associative search of conventional
store buffers...
The store buffer holds stores that have been retired but are yet to be written into the L1 cache. This necessarily implies that any later issued load is younger in program order with respect to any of the stores in the store buffer. So to check whether the most recent value of the target cache line of the load is still in the store buffer, all that needs to be done is to search the store buffer by the load address. There can be either zero stores that match the load or exactly one store. That is, there cannot be more than one matching store. In the store buffer, for the purpose of forwarding, you only need to keep track of the last store to a cache line (if any) and only compare against that. This is in contrast to the store queue as I will discuss shortly.
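To make that concrete, here is a toy C sketch of such a lookup (hypothetical structure and sizes, not taken from the paper): because every store here is already retired, and hence older than any new load, the search is a pure address match with at most one hit per cache line:

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SHIFT 6   /* 64-byte cache lines (assumed) */
    #define SB_LINES   16

    /* Toy store buffer: only the last retired store to each cache
     * line is kept for forwarding purposes. */
    struct sb_entry { bool valid; uint64_t line; uint64_t data; };
    static struct sb_entry store_buffer[SB_LINES];

    /* A load needs only an address match; age is irrelevant. */
    static bool sb_forward(uint64_t load_addr, uint64_t *data_out)
    {
        uint64_t line = load_addr >> LINE_SHIFT;
        for (int i = 0; i < SB_LINES; i++) {
            if (store_buffer[i].valid && store_buffer[i].line == line) {
                *data_out = store_buffer[i].data;
                return true;   /* forward from the store buffer */
            }
        }
        return false;          /* miss: read the L1 cache instead */
    }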
...by forwarding processor-visible/speculative values to loads
directly from the L1 cache.
In the architecture proposed by the authors, the store buffer and the L1 cache are not in the coherence domain. The L2 is the first structure that is in the coherence domain. Therefore, the L1 contains private values and the authors use it to forward data.
We use "store queue" to refer to storage that holds stores’ values
prior to retirement and "store buffer" to refer to storage containing
retired store values prior to their release to memory.
Since the store queue holds stores that have not yet been retired, when comparing a load with the store queue, both the address and the age of each store in the queue need to be checked. Then the value is forwarded from the youngest store that is older than the load targeting the same location.
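Contrast that with a toy sketch of the store queue search (again hypothetical, for illustration only): both the address and the age of every entry matter, and the winner is the youngest matching store that is still older than the load:

    #include <stdbool.h>
    #include <stdint.h>

    #define SQ_ENTRIES 32

    /* Toy store queue: in-flight (not yet retired) stores, each
     * tagged with its age in program order. */
    struct sq_entry { bool valid; uint64_t addr, data, age; };
    static struct sq_entry store_queue[SQ_ENTRIES];

    static bool sq_forward(uint64_t load_addr, uint64_t load_age,
                           uint64_t *data_out)
    {
        int best = -1;   /* youngest store older than the load */
        for (int i = 0; i < SQ_ENTRIES; i++) {
            if (store_queue[i].valid &&
                store_queue[i].addr == load_addr &&
                store_queue[i].age  <  load_age &&
                (best < 0 || store_queue[i].age > store_queue[best].age))
                best = i;
        }
        if (best < 0)
            return false;     /* no older in-flight store matches */
        *data_out = store_queue[best].data;
        return true;
    }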
The goal of the paper you cited is to find an efficient way to increase the capacity of the store buffer. It just doesn't make any changes to the store queue because that is not in the scope of the work. However, there is another paper that targets the store queue instead.
a store buffer is a mechanism that exists in many current processors
to accomplish one or more of the following: store access ordering,
latency hiding and data forwarding.
These features apply to both store buffers and store queues. Using store buffers (and queues) is the most common way to provide these features, but there are others.
In general, though, these terms might be used by different authors or vendors to refer to different things. For example, in the Intel manual, only the term store buffer is used, and it holds both non-retired and retired-but-yet-to-be-committed stores (obviously the implementation is much more complicated than just a buffer). In fact, it's possible to have a single buffer for both kinds of stores and use a flag to distinguish between them.
In the AMD manual, the terms store buffer, store queue, and write buffer are used interchangeably to refer to the same thing as what Intel calls the store buffer, although the term write buffer does have a specific meaning in other contexts. If you are reading a document that uses any of these terms without defining them, you'll have to figure out from the context how they are used. In that particular paper you cited, the two terms have been defined precisely. Anyway, I understand that it's easy to get confused, because I've been there.

What's the difference between page and block in operating system?

I have learned that in an operating system (Linux), the memory management unit (MMU) can translate a virtual address (VA) to a physical address (PA) via the page table data structure. It seems that the page is the smallest data unit managed by the VM. But what about the block? Is it the smallest data unit transferred between the disk and system memory?
What is the difference between pages and blocks?
A block is the smallest unit of data that an operating system can either write to a file or read from a file.
What exactly is a page?
Pages are used by some operating systems instead of blocks. A page is basically a virtual block, and pages have a fixed size - 4K and 2K are the most commonly used sizes. So the two key points to remember about pages are that they are virtual blocks and that they have fixed sizes.
Why pages may be used instead of blocks
Pages are used because they make processing easier when there are many storage devices, because each device may support a different block size. With pages the operating system can deal with just a fixed size page, rather than try to figure out how to deal with blocks that are all different sizes. So, pages act as sort of a middleman between operating systems and hardware drivers, which translate the pages to the appropriate blocks. But, both pages and blocks are used as a unit of data storage.
http://www.programmerinterview.com/index.php/database-sql/page-versus-block/
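On a Unix system you can see both units directly; here is a small C sketch (the "/" path is just an example, and the numbers will vary by machine and filesystem):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/stat.h>

    int main(void)
    {
        /* VM page size: the unit the MMU and page tables use. */
        long page = sysconf(_SC_PAGESIZE);

        /* Preferred I/O block size reported by the filesystem. */
        struct stat st;
        if (stat("/", &st) != 0) {
            perror("stat");
            return 1;
        }

        printf("page size:  %ld bytes\n", page);
        printf("block size: %ld bytes\n", (long)st.st_blksize);
        return 0;
    }

On many Linux machines both print 4096, which is one reason the two concepts are so easy to conflate.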
Generally speaking, the hard disk is one of those devices called "block devices," as opposed to "character devices," because its unit of data transfer is the block.
Even if you want only a single character from a file, the OS and the drive will fetch a whole block and then give you access only to what you asked for, while the rest remains in a cache/buffer.
Note: The block size, however, can differ from one system to another.
To clarify a point:
Yes, any data transferred between the hard disk and the RAM is usually sent in blocks rather than individual bytes.
Data stored in RAM is typically managed in pages, yes; of course, assembly instructions only know byte addresses.

G-WAN Key-Value Store

What would you consider the best way to store my values in RAM via the G-WAN Key-Value Store, multi-threaded, and usable by all my scripts (whether from other virtual servers or not)?
Thank you in advance.
I wish to store different values in different "storages" so as to be able to recover each one via a "key" (type char).
The G-WAN KV store does that (for any type of data: binary too).
Once your application has millions of concurrent users, one way to speed up lookups will be to use different G-WAN servers to host either a partitioned data set or a redundant data set (it all depends on the type of your application).
The G-WAN reverse proxy, featuring an elastic load balancer, makes such setups almost transparent for developers.
I do not care that the data is lost when you restart g-wan.
Then you won't have to use a persistence layer like MySQL, etc.
So it would be fine (I think) to have a persistent pointer, but I'm not sure that this is the most suitable solution.
Look at the persistence.c example for how to share common data among all worker threads in G-WAN.
But you can avoid that if you run G-WAN with a single worker thread (./gwan -w 1). One thread is more than enough to start developing, and even to operate your application until you need to process more requests.
With a single thread, you can just use a static pointer to your G-WAN KV store (unless different scripts need to access it); a sketch of this pattern follows below.
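Here is a minimal sketch of that single-worker pattern. The trivial array-backed store below is a stand-in so the example stays self-contained; in an actual G-WAN servlet you would use G-WAN's own KV API as shown in the bundled persistence.c example, and main() would return an HTTP status code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_KEYS 64

    /* Toy stand-in for a KV store (NOT G-WAN's API). */
    typedef struct {
        char *keys[MAX_KEYS];
        char *vals[MAX_KEYS];
        int   n;
    } kv_store;

    /* Static pointer shared by all requests: safe without locking
     * only because ./gwan -w 1 means a single worker thread. */
    static kv_store *store;

    static void kv_put(kv_store *s, const char *k, const char *v)
    {
        if (s->n < MAX_KEYS) {
            s->keys[s->n] = strdup(k);
            s->vals[s->n] = strdup(v);
            s->n++;
        }
    }

    static const char *kv_get(const kv_store *s, const char *k)
    {
        for (int i = 0; i < s->n; i++)
            if (strcmp(s->keys[i], k) == 0)
                return s->vals[i];
        return NULL;
    }

    int main(void)
    {
        if (!store)                      /* lazy init on first use */
            store = calloc(1, sizeof *store);
        kv_put(store, "some-key", "some-value");
        printf("%s\n", kv_get(store, "some-key"));
        return 0;
    }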