How do MemReq and MemResp exactly work in RoCCIO?

I'm trying to figure out how I can read from and write to memory in RISC-V when I'm using RoCCIO, but I can't clearly tell what is happening, especially how I should address memory or how I should work with the memory tag.
Are there any resources that show how I can transfer data between the Rocket core and my accelerator?
In uncore/src/main/scala/consts.scala they list the different types of memory commands, but what else is there?
For example, I want to pass the starting address of an array and the number of elements I plan to fetch into the accelerator, and then start fetching them. What signalling should I use?
Thanks

Within the RoCC interface, the mem field is a connection to the L1 cache. The dmem field is a connection to the L2 cache. Which one you want to use depends upon the memory bandwidth requirements of your accelerator.
Rocket and the RoCC accelerator can either share data through the caches (remember to use a fence instruction on the Rocket core so the memory ordering is correct) or you can directly give data to Rocket through the resp field in the RoCCIO.
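For the shared-memory route, the Rocket-side code might look roughly like the sketch below. accel_start is a hypothetical stand-in for however you encode the custom instruction that passes the base address and element count to your accelerator; the fence is the real RISC-V instruction mentioned above.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical wrapper around your accelerator's "start" custom instruction;
 * the actual encoding depends on how you define your RoCC instruction. */
extern void accel_start(uint64_t base_addr, uint64_t num_elems);

void offload(uint64_t *array, size_t n)
{
    for (size_t i = 0; i < n; i++)          /* produce data on the Rocket core */
        array[i] = i;

    /* Ensure the accelerator observes the stores before it starts fetching. */
    __asm__ volatile ("fence" ::: "memory");

    accel_start((uint64_t)(uintptr_t)array, n);  /* pass base address + count */
}
```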
The L1 cache's IO can be found in Rocket's nbdcache.scala (https://github.com/ucb-bar/rocket/blob/master/src/main/scala/nbdcache.scala), whereas the L2 IO can be found in the uncore's tilelink.scala (https://github.com/ucb-bar/uncore/blob/master/src/main/scala/tilelink.scala).
Although I don't know which memory tag you are referring to, typically the tag is passed through the memory system and returned to you untouched with the response; if you have multiple requests in flight, the returned tag tells you which response belongs to which request.
I suspect that if you want to fetch an array of data, you will need a state machine in your accelerator that requests each individual address, unless you go through the L2 cache interface, in which case I believe data comes back in cache-line-sized chunks.
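The tag handling and per-element bookkeeping for such a state machine might look something like this minimal C sketch (the hardware itself would be written in Chisel; all struct and field names here are hypothetical, and real cache requests carry more fields than this):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_INFLIGHT 4           /* hypothetical limit on outstanding requests */

/* Hypothetical stand-ins for the memory request/response signals. */
struct mem_req  { uint64_t addr; uint8_t tag; };
struct mem_resp { uint64_t data; uint8_t tag; };

struct fetch_fsm {
    uint64_t base;               /* starting address passed in the RoCC command */
    size_t   count;              /* number of elements to fetch */
    size_t   next;               /* index of the next element to request */
    size_t   done;               /* number of responses received */
    uint64_t inflight_idx[MAX_INFLIGHT]; /* which element each tag is fetching */
    bool     inflight[MAX_INFLIGHT];
};

/* Try to issue the next request; returns true if a request was produced. */
bool fsm_issue(struct fetch_fsm *f, struct mem_req *req)
{
    if (f->next >= f->count)
        return false;
    for (uint8_t tag = 0; tag < MAX_INFLIGHT; tag++) {
        if (!f->inflight[tag]) {
            req->addr = f->base + f->next * sizeof(uint64_t);
            req->tag  = tag;                 /* the tag comes back untouched */
            f->inflight[tag]     = true;
            f->inflight_idx[tag] = f->next++;
            return true;
        }
    }
    return false;                            /* all tags busy: stall */
}

/* Consume a response; the tag tells us which element this data belongs to. */
void fsm_respond(struct fetch_fsm *f, const struct mem_resp *resp, uint64_t *dst)
{
    dst[f->inflight_idx[resp->tag]] = resp->data;
    f->inflight[resp->tag] = false;
    f->done++;
}
```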

Related

How do I free up memory used during an experiment?

I am constructing an experiment in AnyLogic which saves data in the Parameter Variation tab under a custom-class list. The model needs to perform a lot of simulations and repetitions to optimize the Setting variables in the model itself. After x iterations, I use a Python connector to run some code that finds new possible parameters for the underlying model.
The problem I am having right now is that around simulation run number 200, memory usage hits its maximum (4 GB) and the experiment proceeds to run extremely slowly. I have found some interesting ways to cut down on memory usage, but I believe there is only one thing that could help me right now: let the system release the memory used by past iterations. After each iteration the data of a simulation is stored, so I am fine with AnyLogic deleting the logs of that specific simulation afterwards.
Is such a thing possible? If so, how can I implement that?
Java makes use of a garbage collector to manage memory, and you have no control over it. Every now and then, based on some internal logic, it will collect all instances of classes in memory that no longer have any active references and remove them.
Thus to reduce memory you must ensure that any instances that are no longer needed are not referenced by any of the objects currently active in your model.
To identify these you must use a Java profiler like JProfiler, or some of the free alternatives - see here for more.
This will show you exactly which classes are using up all your memory, and with some deep diving you should be able to identify what is keeping references to them.

paging and binding schemes in memory management

With which binding schemes can the concept of paging in memory management be used?
By binding, I mean "mapping logical addresses to physical addresses". To my knowledge there are three binding schemes: compile-time, load-time and execution-time binding.
Paging is not involved in compiling, so we can rule that out.
Load time can have two meanings: combining the object modules of a program and libraries to produce an executable image (program) with no unresolved symbols (the unix definition), or transferring a program into memory so it may execute (non-unix).
What unix calls loading, some other systems call link editing.
Unix loading/link-editing is really part of compiling, so it doesn't involve paging at all. This operation does need to know the valid program addresses it can assign, which will permit the program to load. Conventionally these run from 0 to a very large number like 2^31 or 2^47.
Transferring an image to memory and executing can be considered either phases of the same thing, or in demand loading environments, exactly the same thing. Either way, the bit of the system that prepares the program address space has to fill out a set of tables which relate a program address to a physical address.
The program address of main() might be 0x12345, which can be viewed as offset 0x345 within page 0x12. The operating system might attach that page to physical page 0x100, meaning that main() might temporarily be at physical address 0x100345. Temporarily, because the operating system is free to change this relation (conventionally called a mapping) at any time.
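With 4 KiB pages that split is just shifts and masks; a small C illustration of the arithmetic in the example above:

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                      /* 4 KiB pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define PAGE_MASK  (PAGE_SIZE - 1)

int main(void)
{
    uint64_t vaddr  = 0x12345;             /* program address of main() in the example */
    uint64_t vpn    = vaddr >> PAGE_SHIFT; /* 0x12  */
    uint64_t offset = vaddr & PAGE_MASK;   /* 0x345 */

    /* The OS is free to pick (and later change) this mapping at any time. */
    uint64_t pfn   = 0x100;                /* physical page chosen by the OS */
    uint64_t paddr = (pfn << PAGE_SHIFT) | offset;

    printf("vpn=0x%llx offset=0x%llx -> paddr=0x%llx\n",
           (unsigned long long)vpn, (unsigned long long)offset,
           (unsigned long long)paddr);     /* paddr = 0x100345 */
    return 0;
}
```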
The dynamic nature of these mappings is a positive attribute of paging, as it permits the system to reformulate its use of physical memory to meet changing demands.

In MESI cache coherence protocol, when exactly does the state of a cache line change if the data needs to be fetched from memory?

In MESI protocol when a CPU:
Performs a read operation
Finds out the cache line is in Invalid state
There are no other non-invalid copies in other caches
It will need to fetch the data from memory, which will take a certain number of cycles. So does the state of the cache line change from (I) to (E) instantly, or only after the data is fetched from memory?
I think a cache would normally wait for the data to arrive; while it's not there yet you can't actually get a hit in the cache for other requests to the same line, only for other lines that actually are present (hit under miss). Therefore the state for that line is still Invalid; the data for that tag isn't valid, so you can't set it to a valid state yet.
You'd want another miss to the same line (miss under miss) to notice that there is already an outstanding request for that line and attach itself to that line-request buffer (e.g. an Intel x86 LFB, line fill buffer). Since finding Invalid triggers looking at the fill buffers but Exclusive doesn't, you want Invalid based on this reasoning as well.
e.g. the Skylake perf-counter event mem_load_retired.fb_hit counts, from perf list output:
[Retired load instructions which data sources were load missed L1 but
hit FB due to preceding miss to the same cache line with data not
ready.
Supports address when precise (Precise event)]
In a cache in a very old / simple or toy CPU with no memory-level parallelism (whole pipeline or just memory access totally stalls execution until the data arrives), the distinction is meaningless; nothing else happens to cache while the requested data is in-flight.
In such a CPU it's just an implementation detail. (Except it should still process MESI requests from other cores while a load is in flight so again tags need to reflect the correct state, otherwise it's extra stuff to check when deciding how to reply.)
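As a rough software illustration of what "attach itself to that line-request buffer" means (real fill buffers/MSHRs are hardware structures that track far more state; the buffer count here is arbitrary):

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SHIFT  6        /* 64-byte cache lines */
#define NUM_FBS     10       /* arbitrary number of line fill buffers */

struct fill_buffer {
    bool     valid;          /* an outstanding miss is using this entry */
    uint64_t line_addr;      /* which line is being fetched */
};

static struct fill_buffer fbs[NUM_FBS];

/* On a load that finds its line Invalid: either merge with an existing
 * outstanding request for the same line (miss under miss / "fb_hit"),
 * or allocate a new fill buffer and send a new request.               */
int handle_miss(uint64_t addr)
{
    uint64_t line = addr >> LINE_SHIFT;
    int free_slot = -1;

    for (int i = 0; i < NUM_FBS; i++) {
        if (fbs[i].valid && fbs[i].line_addr == line)
            return i;                       /* attach to the in-flight request */
        if (!fbs[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;                          /* all buffers busy: stall */
    fbs[free_slot].valid = true;
    fbs[free_slot].line_addr = line;
    /* ...send the request on to L2/memory here... */
    return free_slot;
}
```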
After data is fetched from memory.
In practice, MESI (or any other protocol) has many transient states in addition to the main M/E/S/I states. In your example, the coherence protocol would transition to a "Wait for Data Fill" state and would transition to E only after the data is fetched and the valid bit is set.
Reference: the cache coherence protocols in gem5/Ruby (http://learning.gem5.org/book/part3/MSI/cache-transitions.html; search for "was invalid, going to shared") may be useful.
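A loose C sketch of that idea, with an illustrative transient-state name rather than gem5's actual ones: the line leaves Invalid only into a "waiting for data" state, and becomes Exclusive only once the fill arrives.

```c
#include <stdbool.h>

/* Stable MESI states plus one illustrative transient state. */
enum line_state {
    STATE_I,        /* Invalid */
    STATE_IE_WAIT,  /* was Invalid, read miss sent, waiting for data fill */
    STATE_E,        /* Exclusive */
    STATE_S,        /* Shared */
    STATE_M         /* Modified */
};

struct cache_line {
    enum line_state state;
    bool            data_valid;
};

/* Read miss on an Invalid line with no other sharers: request the data. */
void on_read_miss(struct cache_line *l)
{
    l->state = STATE_IE_WAIT;   /* not E yet: the data isn't here */
    /* ...issue the memory request... */
}

/* Memory returns the line some cycles later. */
void on_data_fill(struct cache_line *l)
{
    l->data_valid = true;
    l->state = STATE_E;         /* only now does the line become Exclusive */
}
```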

Page fault - how does the OS search for the page in secondary storage?

My question is: when a page fault occurs and the required page is not in RAM, how does the OS know where to look for the given page in the entire secondary memory in order to bring it into RAM? Is the logical address the address within secondary storage, or is the required secondary storage address stored in the page table itself, or is it done some other way?
I feel like I am probably missing something very basic here, but this doubt came to my mind and a quick Google search is not providing any answers.
My question is: when a page fault occurs and the required page is not in RAM, how does the OS know where to look for the given page in the entire secondary memory in order to bring it into RAM?
If there were 50 different operating systems that supported an average of 10 different architectures each, there would be up to a maximum of 500 different answers; one of those answers would be "all software uses physical addresses and there is no virtual memory and there is no secondary memory", and another would be "a virtual address is a location on the disk and RAM is just used as a disk cache to speed it up" (see https://en.wikipedia.org/wiki/Single-level_store).
For most typical modern operating systems running on most typical architectures; if you worked out all of the information the kernel needs to know about each virtual page (e.g. what the page is pretending to be, what the page actually is, location on disk if any, location in RAM if any, something to keep track of "least recently used", something to keep track of "number of copy-on-write copies", etc); then you could scatter all the information across multiple different data structures such that:
some of the data structures are used/required by the CPU itself and some aren't
the same information may or may not be in 2 or more data structures at the same time
some data structures have an entry for each virtual page and some just have an entry for each range of multiple pages
some data structures are arrays/tables, some are trees, some are trees of tables, and some are something else.
some use "virtual address" or "virtual page number" as a key to find information; and some use something else (e.g. inverted page tables on PowerPC and Itanium use "physical address" as an index because using what you're trying to find as an index is the least intelligent thing you could possibly do, so why not?).
some of the data structures may be in the kernel and some may not be (e.g. the L4 micro-kernel manages virtual memory mapping purely in user-space via an "abstract hierarchical address space" model).
In general, the information about where a page's data is in (each different piece of?) secondary memory (if there is secondary memory) will be stored in one or more places in one or more things.
Note that when a page fault occurs the page fault handler typically needs to make multiple decisions; possibly starting with figuring out what made the access (a process, the kernel itself?) and whether the access should be allowed or denied, then figuring out what to do about it (send SIGSEGV? do a kernel panic? fetch data into the CPU's TLB? invalidate stale data from CPU's TLB? do copy-on-write cloning? fetch data from swap space? fetch data from file?); so the page fault handler ends up finding multiple different pieces of data from (potentially) multiple different places.
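As a heavily simplified C outline of that decision chain (every OS structures this differently; the enum and the fields below are purely illustrative):

```c
/* Heavily simplified, hypothetical outline of a page fault handler's
 * decision chain; real kernels have many more cases and data structures. */
enum fault_action { ACTION_SIGSEGV, ACTION_PANIC, ACTION_COW,
                    ACTION_FROM_SWAP, ACTION_FROM_FILE, ACTION_TLB_FIX };

struct fault_info {
    int from_kernel;       /* did the kernel itself make the access?      */
    int access_allowed;    /* does the region's protection allow it?      */
    int copy_on_write;     /* present, but a write to a shared/COW page?  */
    int in_swap;           /* page's data currently lives in swap space   */
    int file_backed;       /* page's data comes from a memory-mapped file */
    int stale_tlb;         /* the real mapping exists, the TLB is stale   */
};

enum fault_action handle_page_fault(const struct fault_info *f)
{
    if (!f->access_allowed)
        return f->from_kernel ? ACTION_PANIC : ACTION_SIGSEGV;
    if (f->stale_tlb)
        return ACTION_TLB_FIX;       /* invalidate / refill the TLB entry */
    if (f->copy_on_write)
        return ACTION_COW;           /* clone the page, map it writeable  */
    if (f->in_swap)
        return ACTION_FROM_SWAP;     /* fetch the data from swap space    */
    if (f->file_backed)
        return ACTION_FROM_FILE;     /* fetch the data from the file      */
    return ACTION_SIGSEGV;           /* nothing backs this address        */
}
```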
A Concrete Example
For my OS designs (which are based on asynchronous message passing and use micro-kernels), a micro-kernel is small enough that it can be custom designed and optimised for a specific architecture (without any regard to portability). The operating system design is intended for distributed systems, and for that reason shared memory (and fork()) are not supported (you don't want the page fault handler to have to fetch data from a remote computer over a congested network connection to do a "copy on write"); and the only case for "copy on write" is memory mapped files where the page is shared by one or more processes and the (local) VFS cache.
For 64-bit 80x86, the CPU requires a tree of 4 levels of tables (page tables, page directories, page directory pointer tables and page map level 4), and to improve efficiency (reduce memory consumption and reduce cache misses, etc) I use these tables as much as possible.
For page table entries (or page directory entries if 2 MiB pages are being used): if the page is not present, there are 63 bits that are ignored by the CPU which the OS can use for its own purposes; and if the page is present then (depending on which features the CPU supports) there are at least 9 bits that the OS can use for its own purposes, and flags that the CPU uses (e.g. the "read, write, no-execute" flags) can be used to augment the OS's own information.
When a page is not present, the 63 bits are split into 2 fields: one 8-bit field to keep track of the virtual type of the page (if it's supposed to act like RAM, if it's supposed to be executable, if it's supposed to use "write-back" caching, etc), and one 55-bit "where" field. If the highest bit in the "where" field is set, the page was sent to swap space and the other 54 bits are a "swap space handle" (allowing for a maximum of "2**54 * 4 KiB" of swap space); if the highest bit in the "where" field is clear, the other 54 bits are a "memory mapped file handle".
If a page fault occurs because of a "not present" page, the page fault handler uses the 8-bit field to determine if the access should be allowed or denied (or if it's already being handled due to a different thread accessing it already). If the access should be allowed, the page fault handler tells the scheduler to put the thread in a "WAITING FOR PAGE" state and marks the page as "being fetched" (so that other threads that belong to the same process know that it's already being fetched), then uses the "where" field either to send a request message asking for the page's data to the Swap Manager (which is a process in user-space), or to find a "memory mapped file descriptor" structure in kernel space that contains more information (that didn't fit in the page table entry), determine the offset of the page within the file and a file handle, and send a request to the VFS for the page's data (the VFS or Virtual File System is another process in user-space).
Later, when the Swap Manager or VFS sends a reply message containing the page's data back to the kernel, the kernel fixes up the page table entry (putting the page of data from the message into the virtual address space) and tells the scheduler to unblock the thread/s (shift them from the "WAITING FOR PAGE" state to the "READY TO RUN" state).
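As a hypothetical C encoding of that "not present" layout (the exact bit positions are an arbitrary choice for illustration; the description above only fixes the field widths):

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical layout of a "not present" page table entry as described:
 * bit 0      : present flag (0 here, so the CPU ignores the rest)
 * bits 1-8   : 8-bit virtual type of the page
 * bits 9-63  : 55-bit "where" field; top bit selects swap vs. mapped file */
#define PTE_PRESENT       (1ull << 0)
#define PTE_TYPE_SHIFT    1
#define PTE_TYPE_MASK     0xFFull
#define PTE_WHERE_SHIFT   9
#define PTE_WHERE_MASK    ((1ull << 55) - 1)
#define WHERE_IN_SWAP     (1ull << 54)          /* top bit of the 55-bit field */
#define WHERE_HANDLE_MASK ((1ull << 54) - 1)

static uint64_t make_not_present_pte(uint8_t type, bool in_swap, uint64_t handle)
{
    uint64_t where = (in_swap ? WHERE_IN_SWAP : 0) | (handle & WHERE_HANDLE_MASK);
    /* present bit deliberately left clear */
    return ((uint64_t)type << PTE_TYPE_SHIFT) | (where << PTE_WHERE_SHIFT);
}

static void decode_not_present_pte(uint64_t pte, uint8_t *type,
                                   bool *in_swap, uint64_t *handle)
{
    uint64_t where = (pte >> PTE_WHERE_SHIFT) & PTE_WHERE_MASK;
    *type    = (pte >> PTE_TYPE_SHIFT) & PTE_TYPE_MASK;
    *in_swap = (where & WHERE_IN_SWAP) != 0;
    *handle  = where & WHERE_HANDLE_MASK;
}
```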
For both of these cases (memory mapped file and swap space) if the access was an "allowed read" then the page is mapped as read only (regardless of whether the page is supposed to be writeable). If the access was an "allowed write", or if a later "allowed write" is done to a page that was previously fetched and mapped as read only; then if the page's data came from swap space the page fault handler informs the Swap Manager that the copy of the page in swap space can be discarded (can't be re-used if the same page is sent to swap space later), and if the page's data came from a memory mapped file the page fault handler informs the VFS that there's one less process with a copy of that page and copies the "copy on write" page's data to a newly allocated page.
When a page is "present", it may still be part of a memory mapped file and there may still be a copy in swap space; but there isn't enough space in the page table entry to store the "where" field. In this case, if the page is in swap space and in RAM, the Swap Manager has to accept "Process ID + virtual address" instead of a "swap space handle" (which causes a little extra overhead in Swap Manager because it has to convert "Process ID + virtual address" into "swap space handle" itself). If the page is a "copy on write" memory mapped file, then the page fault handler searches the process' list of "memory mapped file descriptors" (which causes a little extra overhead).
Note that (in theory) when an OS is running low on free RAM it wants to select a "least likely to be needed soon" page to send to swap space, but this isn't easy/practical so most operating systems use "least recently used" instead.
My kernels don't do this at all. Instead they just send "random" pages to the Swap Manager, and (initially) the Swap Manager keeps the data in RAM and doesn't send it to any of the swap providers to store; the Swap Manager uses "least recently sent to swap manager" to figure out which pages to send to a swap provider to store. A page that is used often may be sent to the swap manager many times without ever actually being sent to a swap provider (and without causing slow disk IO for frequently used pages). Also note that, because "copy on write memory mapped file" is the only case where "copy on write" is used and because there is no other form of shared memory, the VFS can keep track of how many processes are sharing a copy of pages itself, and the kernel never needs to keep track of how many processes are sharing a copy of any page (like most kernels for most operating systems do).

How can I limit the number of blocks written in a Write_10 command?

I have a product that is basically a USB flash drive based on an NXP LPC18xx microcontroller. I'm using a library provided from the manufacturer (LPCOpen) that handles the USB MSC and the SD card media (which is where I store data).
Here is the problem: internally the LPC18xx has a 64 kB buffer (limited by hardware) used to cache reads/writes, which means it can only cache up to 128 blocks (512 B each) of memory. The SCSI Write-10 command has a total-blocks field that can be up to 256 blocks (128 kB). When originally testing the product on Windows 7 it never writes more than 128 blocks at a time, but when tested on Linux it sometimes writes more than 128 blocks, which causes the microcontroller to crash.
Is there a way to tell the host OS not to request more than 128 blocks? I see references[1] to a Read-Block-Limit command (05h), but it doesn't seem to be widely supported. Also, what sense key would I return on the Write-10 command to tell Linux the write is too large? I also see references to a block limits VPD page in some device spec sheets, but cannot find a lot of documentation about how it is implemented.
[1]https://en.wikipedia.org/wiki/SCSI_command
Let me offer a disclaimer up front that this is what you SHOULD do, but none of this may work. A cursory search of the Linux SCSI driver didn't show me what I wanted to see. So, I'm not at all sure that "doing the right thing" will get you the results you want.
Going by the book, you've got to do two things: implement the Block Limits VPD page and handle too-large transfer sizes in READ and WRITE.
First, implement the Block Limits VPD page, which you can find in late revisions of SBC-3 floating around on the Internet (like this one: http://www.13thmonkey.org/documentation/SCSI/sbc3r25.pdf). It's probably worth going to the t10.org site, registering, and then downloading the last revision (http://www.t10.org/cgi-bin/ac.pl?t=f&f=sbc3r36.pdf).
The Block Limits VPD page has a maximum transfer length field that specifies the maximum number of blocks that can be transferred by all the READ and WRITE commands, and basically anything else that reads or writes data. Of course the downside of implementing this page is that you have to make sure that all the other fields you return are correct!
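A hedged C sketch of filling in just that field, with offsets as given in SBC-3's Block Limits VPD description (verify against the spec revision you implement; the other fields are left zero here and a real implementation still needs to get them right, as noted above):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define MAX_TRANSFER_BLOCKS 128u    /* 128 * 512 B = 64 kB, what the buffer can hold */

/* Build a minimal Block Limits VPD page (page code B0h). Only the
 * MAXIMUM TRANSFER LENGTH field (bytes 8-11, big-endian, in blocks) is
 * filled in here. Returns the number of bytes to send, or -1 on error. */
static int build_block_limits_vpd(uint8_t *buf, size_t len)
{
    if (len < 64)
        return -1;
    memset(buf, 0, 64);
    buf[0] = 0x00;                  /* peripheral qualifier / device type: SBC */
    buf[1] = 0xB0;                  /* page code: Block Limits */
    buf[2] = 0x00; buf[3] = 0x3C;   /* page length */
    buf[8]  = (MAX_TRANSFER_BLOCKS >> 24) & 0xFF;
    buf[9]  = (MAX_TRANSFER_BLOCKS >> 16) & 0xFF;
    buf[10] = (MAX_TRANSFER_BLOCKS >>  8) & 0xFF;
    buf[11] =  MAX_TRANSFER_BLOCKS        & 0xFF;
    return 64;
}
```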
Second, when handling READ and WRITE, if the command's transfer length exceeds your maximum, respond with an ILLEGAL REQUEST key, and set the additional sense code to INVALID FIELD IN CDB. This behavior is indicated by a table in the section that describes the Block Limits VPD, but only in late revisions of SBC-3 (I'm looking at 35h).
You might just start with returning INVALID FIELD IN CDB, since it's the easiest course of action. See if that's enough?
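And a hedged C sketch of that fallback: parse the WRITE(10) transfer length from the CDB and report ILLEGAL REQUEST / INVALID FIELD IN CDB when it exceeds what the buffer can handle (the sense_data struct is a stand-in for however your MSC stack returns sense bytes, not LPCOpen's actual API):

```c
#include <stdint.h>

#define MAX_TRANSFER_BLOCKS 128u       /* what the 64 kB buffer can actually hold */

/* SCSI sense key / additional sense code values from SPC/SBC. */
#define SENSE_KEY_ILLEGAL_REQUEST   0x05
#define ASC_INVALID_FIELD_IN_CDB    0x24
#define ASCQ_INVALID_FIELD_IN_CDB   0x00

/* Illustrative stand-in for however the MSC stack reports CHECK CONDITION. */
struct sense_data { uint8_t key, asc, ascq; };

/* WRITE(10) CDB: transfer length is a big-endian 16-bit field in bytes 7-8. */
static uint16_t write10_transfer_length(const uint8_t cdb[10])
{
    return (uint16_t)((cdb[7] << 8) | cdb[8]);
}

/* Returns 0 if the command may proceed, -1 if it was rejected. */
int check_write10(const uint8_t cdb[10], struct sense_data *sense)
{
    uint16_t blocks = write10_transfer_length(cdb);
    if (blocks > MAX_TRANSFER_BLOCKS) {
        sense->key  = SENSE_KEY_ILLEGAL_REQUEST;
        sense->asc  = ASC_INVALID_FIELD_IN_CDB;
        sense->ascq = ASCQ_INVALID_FIELD_IN_CDB;
        return -1;                     /* report CHECK CONDITION to the host */
    }
    return 0;
}
```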