ARM cache flush to an uncached region

I am wondering how the cache subsystem will act in the following situation.
Consider a Cortex-A8 with a VIPT L1 cache.
Suppose a cacheable virtual memory mapping [A,B] was created and used for some time, so the data cache now holds lines from this region. The mapping [A,B] is then replaced by an uncacheable one. After that, we perform a flush operation (DCCIMVAC) on the virtual region [A,B]. What happens in this situation? Is the cached data simply discarded (invalidated)? Is it written back to the now non-cacheable pages anyway? Or something else?
Update:
The main reason I am asking is that my board hangs immediately after one of these flush instructions in the context described above, and I have no idea why. More precisely, it hangs in the middle of the page, while flushing the cache line at offset 0xd80. If I skip this cache line flush, it proceeds further. If I change the attributes from uncacheable to cacheable, it works fine. It looks like the attributes should be changed to cacheable, but I am trying to figure out what this code was doing. I am mostly interested in whether the scenario I described is legal or whether it may lead to undefined behavior.

How to delete memory usage during an Experiment?

I am constructing an experiment in AnyLogic that saves data in the Parameter variation tab in a list of a custom class. The model needs to run many simulations and repetitions to optimize for setting variables in the model itself. After x iterations, I use a Python connector to run some code that finds new candidate parameters for the underlying model.
The problem I am having right now is that around simulation run number 200, memory usage reaches the maximum (4 GB) and the experiment becomes very slow. I have found some ways to cut memory usage, but I believe only one thing can help me right now: letting the system free the memory used for past iterations. The data of each simulation is stored after each iteration, so I am fine with AnyLogic deleting the logs of that simulation afterwards.
Is such a thing possible? If so, how can I implement that?
Java uses a garbage collector to manage memory, and you have very little direct control over it. Every now and then, based on its own internal logic, it finds all objects in memory that are no longer reachable through any active reference and removes them.
Thus, to reduce memory usage you must ensure that instances that are no longer needed are not referenced by any object currently active in your model.
To identify these you must use a Java profiler like JProfiler, or some of the free alternatives - see here for more.
This will show you exactly which classes are using up your memory, and with some deep diving you should be able to identify what is holding references to them.

What happens to performance.mark entries when the resource buffer gets full

I am building a large private (i.e. used behind a firewall) PWA and wondering how to improve the diagnostics if/when my users hit issues. I already have an error manager which uses navigator.sendBeacon to log the error on the server, but that lacks detailed info about what led up to that point.
A thought I had was to liberally mark the code with performance.mark() statements and on an error dump the performance buffer to the server. It would give me an ordered list of recent activity.
This only makes sense, however, if the browser throws away the oldest entry to make room for a new one when its internal buffer is full, and none of the documentation I found through a Google search mentions this. I am aware that I can get an event when the buffer is full and could use that to copy and clear it, but I can find nothing on what happens if I ignore the event. Nor can I find a typical buffer size. I don't want entries to keep accumulating until they fill the computer's memory either.
Can anyone give me a definite answer?
Edit: The more I look into this, the more confused I become. It appears that you can control the size of the resourceTimingBuffer, but "resource" performance entries relate to fetches and not to performance.mark(). I can't find any statement on limitations.
There are no meaningful limitations I could find. I did a test and generated more than 4000 marks and they were all there and the memory usage did not increase in any measurable way.
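If drop-oldest behaviour is what matters, one option is for the application to keep its own bounded log rather than rely on the browser's buffer. The sketch below only illustrates that idea, in Python for brevity; `RecentMarks`, its `capacity` and the step names are made up for the example and are not part of any browser API.

```python
from collections import deque
import time


class RecentMarks:
    """Bounded log of (name, timestamp) marks (hypothetical helper).

    When the log is full, appending a new mark silently drops the oldest one.
    """

    def __init__(self, capacity):
        # deque(maxlen=...) discards items from the left once it is full.
        self._marks = deque(maxlen=capacity)

    def mark(self, name):
        self._marks.append((name, time.monotonic()))

    def dump(self):
        # Oldest surviving mark first, newest last.
        return list(self._marks)


log = RecentMarks(capacity=3)
for step in ["load", "parse", "render", "save"]:
    log.mark(step)

# Only the three most recent marks survive; "load" was dropped.
names = [name for name, _ in log.dump()]
```

With a fixed capacity the memory used by the log stays bounded no matter how long the application runs.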

In MESI cache coherence protocol, when exactly does the state of a cache line change if the data needs to be fetched from memory?

In MESI protocol when a CPU:
Performs a read operation
Finds out the cache line is in Invalid state
There are no other non-invalid copies in other caches
It will need to fetch the data from memory, which takes a certain number of cycles. So does the state of the cache line change from (I) to (E) instantly, or only after the data has been fetched from memory?
I think a cache would normally wait for the data to arrive; when it's not there yet you can't actually get a hit in cache for other requests to the same line, only to other lines that actually are present (hit under miss). Therefore the state for that line is still Invalid; the data for that tag isn't valid, so you can't set it to a valid state yet.
You'd want another miss to the same line (miss under miss) to notice that there is already an outstanding request for that line and attach itself to that line-request buffer (e.g. the Intel x86 LFB, or line fill buffer). Since finding Invalid triggers a look at the fill buffers but Exclusive doesn't, this reasoning also favors keeping the line Invalid.
e.g. the Skylake perf-counter event mem_load_retired.fb_hit counts, from perf list output:
[Retired load instructions which data sources were load missed L1 but
hit FB due to preceding miss to the same cache line with data not
ready.
Supports address when precise (Precise event)]
In a cache in a very old / simple or toy CPU with no memory-level parallelism (whole pipeline or just memory access totally stalls execution until the data arrives), the distinction is meaningless; nothing else happens to cache while the requested data is in-flight.
In such a CPU it's just an implementation detail. (Except it should still process MESI requests from other cores while a load is in flight so again tags need to reflect the correct state, otherwise it's extra stuff to check when deciding how to reply.)
After data is fetched from memory.
In practice, MESI (or any other protocol) has many transient states in addition to the main M/E/S/I states. In your example, the coherence protocol would transition to a "Wait for Data Fill" state and would transition to E only after the data has been fetched and the valid bit is set.
Reference: the cache coherence protocol tutorial for gem5/Ruby at http://learning.gem5.org/book/part3/MSI/cache-transitions.html (search for "was invalid, going to shared") may be useful.
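The behaviour described in these answers can be sketched as a tiny state machine. This is a toy model, not how any real cache is built: the transient state name `IE_D`, the `CacheLine` class and the request labels are all invented for illustration, and only the read-miss path with no other sharers is modelled.

```python
from enum import Enum


class State(Enum):
    I = "Invalid"
    IE_D = "was Invalid, going to Exclusive, waiting for data"  # hypothetical transient state
    E = "Exclusive"


class CacheLine:
    def __init__(self):
        self.state = State.I
        self.data = None
        self.waiters = []  # loads attached to the outstanding fill

    def load(self, requester):
        if self.state is State.E:
            return "hit"
        if self.state is State.I:
            # Start the fill; the line is NOT marked Exclusive yet.
            self.state = State.IE_D
            self.waiters.append(requester)
            return "miss: fill requested"
        # A fill is already in flight: merge instead of issuing a second request.
        self.waiters.append(requester)
        return "miss: merged into outstanding fill"

    def data_arrives(self, data):
        # Only now does the line become valid and Exclusive.
        assert self.state is State.IE_D
        self.data = data
        self.state = State.E
        served, self.waiters = self.waiters, []
        return served


line = CacheLine()
first = line.load("load A")
second = line.load("load B")       # miss under miss to the same line
state_in_flight = line.state       # still the transient state
served = line.data_arrives(0x2A)   # both waiting loads complete
third = line.load("load C")        # now a plain hit
```

The point of the transient state is that a second load to the same line sees neither a valid line nor a plain Invalid one, so it attaches to the outstanding request instead of hitting stale data or refetching.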

IBM Window Services (DWS) CSREVW function on MVS

I'm working on IBM MVS (z/OS) and trying to get Window Services working.
For the CSREVW function, I don't understand the purpose of the pfcount parameter.
According to the documentation, it asks window services to read more than one block after my program references a block that is not in my window.
But how is window services supposed to know that I tried to reference data that is not in my window? I mean, it can't know that I'm reading data outside my window unless I call CSREVW or CSRVIEW again.
Maybe my main issue is that I have trouble understanding English, but this seems clear to me...
Here is the link to the documentation; this is explained on pages 23-24:
http://publibz.boulder.ibm.com/epubs/pdf/iea3c102.pdf
I know this is a very specific problem about an IBM service and I apologize about that.
Thank you !
Tim
I think the problem you're having is that you need to understand a little bit about how the underlying objects behind the windowing service work in virtual storage.
At the core, the various windowing services work to give you what amounts to a "private" page dataset. You allocate and reference storage, but the objects in that virtual space aren't really in memory - the system's page fault mechanism brings them in as you reference them. So yes, you're accessing data within a "window", but in reality, the data you expect to see may not be "paged in" at that moment.
Going a little deeper, when you first allocate the object, the virtual storage it's mapped to has all of the pages marked "invalid" in the underlying page table entries. That means that as soon as you touch this storage, a page fault interrupt occurs. At this point, the operating system steps in and resolves the page fault by bringing the necessary data into memory, then your program continues, oblivious to all of this processing on your behalf. You're correct that you're just referencing data within the window, but there's a lot going on under the covers to support this.
This is where PFCOUNT comes in...
Let's say you have structures that are, say, 64K long inside your virtual window. It would be sloppy and slow to reference each page of this structure and cause a page fault each time. Much better would be to use PFCOUNT to cause the page you reference and all 15 other pages needed by your object to be paged-in with a single operation. Conversely, if your data was small and you were highly random about how you access it, PFCOUNT isn't going to help you - the next page you reference could be anywhere, and it's actually wasteful to have a large PFCOUNT since you end up bringing in a lot of data you never use.
Hope that makes sense - if you'd like a challenge, take yourself a system dump and examine the system trace entries as you reference data...you'll see a very distinct pattern of page faults, I/O and resumption of your program, and hopefully it will all make sense to you.
From the manual
,pfcount
Specifies the number of additional blocks you want window services to bring into the window each time your program references data that
is not already in the window. The number you specify is added to the
minimum of one block that window services always brings in. That is,
if you specify a value of 20, window services brings in up to 21. The
number of additional blocks ranges from zero through 255.
Note that you get 1 block without asking.
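The arithmetic behind these two answers can be written out as a small sketch. It assumes a block is one 4096-byte page, which is what the "15 other pages" for a 64K structure implies; the function names are hypothetical and this is not a call into window services itself.

```python
BLOCK_SIZE = 4096   # assumption: one block is one 4 KB page
MAX_PFCOUNT = 255   # upper limit quoted from the manual


def pfcount_for(structure_bytes, block_size=BLOCK_SIZE):
    """pfcount that covers a structure of the given size in one page-in."""
    blocks = -(-structure_bytes // block_size)  # ceiling division
    # One block always comes in, so only the additional blocks are requested.
    return min(max(blocks - 1, 0), MAX_PFCOUNT)


def blocks_brought_in(pfcount):
    """Maximum number of blocks per reference, per the manual: pfcount + 1."""
    if not 0 <= pfcount <= MAX_PFCOUNT:
        raise ValueError("pfcount must be between 0 and 255")
    return pfcount + 1


pf_64k = pfcount_for(64 * 1024)   # the 64K structure from the answer above
pf_small = pfcount_for(100)       # small data fits in the block you get anyway
manual_case = blocks_brought_in(20)
```

For small, randomly accessed data the computed value is 0, which matches the advice that a large pfcount only drags in blocks that are never used.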

What does 'Mutex lock' exactly do?

There is an interesting table at this link: http://norvig.com/21-days.html#answers
The table lists:
Mutex lock/unlock 25 nanosec
fetch from main memory 100 nanosec
Nanosec?
I am surprised that a mutex lock is faster than fetching data from memory. If so, what exactly does a mutex lock do? And what does "Mutex lock" mean in the table?
Let's say that ten people had to share a pen (maybe they work at a really cash-strapped company). Since they have to write long documents with the pen, but most of the work in writing a document is just thinking of what to say, they agree that each person gets to use the pen to write one sentence of the document, and then has to make it available to the rest of the group.
Now we have a problem: what if two people are done thinking about the next sentence, and both want to use the pen at once? We could just say that both people can grab the pen, but this is a fragile old pen, so if two people grab it then it breaks. Instead, we draw a chalk line around the pen. First you put your hand across the chalk line, then you grab the pen. If one person's hand is inside the chalk line, then nobody else is allowed to put their hand inside the chalk line. If two people try to put their hand across the chalk line at the same time, under these rules only one of them will get inside the chalk line first, so the other has to pull back their hand and keep it just outside the chalk line until the pen is available again.
Let's relate this back to mutexes. A mutex is a way to protect a shared resource (the pen) for a short period of time called the critical section (the time to write one sentence of a document). Whenever you want to use the resource, you agree to call mutex_lock first (put your hand inside the chalk line). Whenever you're done with the resource, you agree to call mutex_unlock (take your hand out from the chalk line area).
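The lock/unlock agreement can be sketched with Python's `threading.Lock`, where `acquire` and `release` play the roles of mutex_lock and mutex_unlock. The names `chalk_line`, `shared` and `writer` are made up for the example, and the counter simply stands in for the shared document.

```python
import threading

chalk_line = threading.Lock()        # the mutex guarding the shared resource
shared = {"sentences_written": 0}    # the shared resource (the "pen")


def writer(sentences):
    for _ in range(sentences):
        chalk_line.acquire()         # mutex_lock: hand across the chalk line
        try:
            # Critical section: a read-modify-write that must not interleave.
            current = shared["sentences_written"]
            shared["sentences_written"] = current + 1
        finally:
            chalk_line.release()     # mutex_unlock: hand back out


writers = [threading.Thread(target=writer, args=(1000,)) for _ in range(10)]
for t in writers:
    t.start()
for t in writers:
    t.join()

total = shared["sentences_written"]
```

Because every writer takes the lock before touching the counter, no update is lost and the total equals the number of writers times the sentences each one wrote.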
Now to how mutexes are implemented. A mutex is usually implemented with shared memory. There is some shared opaque data object called a mutex, and the mutex_lock and mutex_unlock functions both take a pointer to one of these. The mutex_lock function checks and modifies data inside the mutex using an atomic test-and-set or load-linked/store-conditional instruction sequence (on x86, xchg is often used), and either "acquires the mutex" - sets the contents of the mutex object to indicate to other threads that the critical section is locked - or has to wait. Eventually, the thread gets the mutex, does the work inside the critical section, and calls mutex_unlock. This function sets the data inside the mutex to mark it as available, and possibly wakes up sleeping threads that have been trying to acquire the mutex (this depends on the mutex implementation - some implementations of mutex_lock just spin in a tight loop on xchg until the mutex is available, so there is no need for mutex_unlock to notify anybody).
Why would locking a mutex be faster than going out to memory? In short, caching. The CPU has a cache that can be accessed very quickly, so the xchg operation doesn't need to go all the way out to memory as long as the processor can ensure that no other processor is accessing that data. x86 ensures this with a notion of "owning" a cache line - if processor 0 owns a cache line, any other processor that wants to use data in that cache line has to go through processor 0. This way, there is no need for the xchg operation to look at any data beyond the cache, and cache access tends to be very fast, so acquiring an uncontested mutex is faster than a memory access.
There is one caveat to that last paragraph, though: the speed benefit only holds for an uncontested mutex lock. If two threads try to lock the same mutex at the same time, the processors that are running those threads have to communicate and deal with ownership of the relevant cache line, which greatly slows down the mutex acquisition. Also, one of the two threads will have to wait for the other thread to execute the code in the critical section and then release the mutex, further slowing down the mutex acquisition for one of the threads.
The article you linked does not mention the architecture, but judging by the mentions of L1 and L2 cache it's Intel. If so, then I think that by mutex they meant an instruction with the LOCK prefix. In this respect this post seems relevant: Intel 64 and IA-32 | Atomic operations including acquire / release semantic
The Intel software developer's manual can also help if you know what you are looking for. I'd read anything relevant I could find about the LOCK prefix.