Globally Invisible load & store memory operations - x86-64

Reading this answer, Can x86 reorder a narrow store with a wider load that fully contains it?, I have some questions about it (sorry, my low reputation does not allow me to add further comments there).
That thread describes a simple program running on a core in which a store is followed by a load in program order. The core uses store-to-load forwarding to forward to the following load the contents of the store that is still waiting in the store queue (write buffer) to commit to L1D cache.
This by itself isn't reordering yet (the load sees the store's data, and they're adjacent in the global order), but it leaves the door open for reordering. The cache line can be invalidated by another core after the load, but before the store commits. A store from another core can become globally visible after our load, but before our store.
So our load sees the data from our own store, but not the data from the other CPU's store. The other CPU can see the same effect for its own load, and thus both threads enter the critical section.
Now my point is: if the cache line is invalidated by another core after our load has executed (i.e. after it has returned data from L1D cache), our load could never have returned the new value of the store executed by the other thread, which becomes globally visible after our load but before our store commits to L1D cache. Isn't that right?
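For concreteness, here is a minimal sketch (my own illustration, not code from the linked thread; the flag[] protocol and function name are made up) of the store-then-load pattern being discussed, written with C11 atomics:

```c
/* Minimal sketch of the store-then-load pattern under discussion.
 * The flag[] protocol and try_enter() are hypothetical, for illustration. */
#include <stdatomic.h>
#include <stdbool.h>

atomic_int flag[2];              /* flag[i] == 1: thread i wants to enter */

bool try_enter(int self)
{
    int other = 1 - self;

    /* Store first ... */
    atomic_store_explicit(&flag[self], 1, memory_order_release);

    /* ... then load. On x86 this compiles to a plain store followed by a
     * plain load; the store buffer plus store-to-load forwarding means the
     * load can complete before our store is globally visible, so both
     * threads can read 0 here and both "enter". A seq_cst store (which the
     * compiler implements with MFENCE or a locked operation) rules that out. */
    if (atomic_load_explicit(&flag[other], memory_order_acquire) == 0)
        return true;

    atomic_store_explicit(&flag[self], 0, memory_order_release);
    return false;
}
```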

Related

Why are page faults usually handled by the OS, not hardware?

I find that during a TLB miss, some architectures use hardware to handle it while some use the OS. But when it comes to page faults, most systems use the OS instead of hardware.
I tried to find the answer but didn't find any article that explains why.
Could anyone help with this?
Thanks.
If the hardware could handle it on its own, it wouldn't need to fault.
The whole point is that the OS hasn't wired the page into the hardware page tables, e.g. because it's not actually in memory at all, or because the OS needs to catch an attempt to write so the OS can implement copy-on-write.
Page faults come in three categories:
valid (the process logically has the memory mapped, but the OS was lazy or playing tricks):
hard: the page needs to be paged in from disk, either from swap space or from a disk file (e.g. a memory mapped file, like a page of an executable or shared library). Usually the OS will schedule another task while waiting for I/O.
soft: no disk access required, just for example allocating + zeroing a new physical page to back a virtual page that user-space just tried to write. Or copy-on-write of a writeable page that multiple processes had mapped, but where changes by one shouldn't be visible to the other (like mmap(MAP_PRIVATE)). This turns a shared page into a private dirty page.
invalid: There wasn't even a logical mapping for that page. A POSIX OS like Linux will deliver SIGSEGV signal to the offending process/thread.
The hardware doesn't know which is which, all it knows was that a page walk didn't find a valid page-table entry for that virtual address, so it's time to let the OS decide what to do next. (i.e. raise a page-fault exception which runs the OS's page-fault handler.) valid/invalid are purely software/OS concepts.
These example reasons are not an exhaustive list. e.g. an OS might remove the hardware mapping for a page without actually paging it out, just to see if the process touches it again soon. (In which case it's just a cheap soft page fault. But if not, then it might actually page it out to disk. Or drop it if it's clean.)
For HW to be able to fully handle a page fault, we'd need data structures with a hardware-specified layout that somehow lets hardware know what to do in some possible situations. Unless you build a whole kernel into the CPU microcode, it's not possible to have it handle every page fault, especially not invalid ones which require reading the OS's process / task-management data structures and delivering a signal to user-space. Either to a signal handler if there is one, or killing the process.
And especially not hard page faults, where a multi-tasking OS will let some other process run while waiting for the disk to DMA the page(s) into memory, before wiring up the page tables for this process and letting it retry the faulting load or store instruction.
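As a concrete illustration of the "soft" category above, here is a minimal Linux sketch (my own example, not part of the original answer) that counts the minor (soft) page faults taken when freshly mapped anonymous pages are first touched; no disk I/O is involved:

```c
/* Each first write to a page of a new MAP_PRIVATE | MAP_ANONYMOUS mapping
 * must be wired into the page tables by the kernel, costing roughly one
 * "minor" (soft) page fault per page. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    size_t len = 64 * 4096;                     /* 64 pages */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    long before = minor_faults();
    memset(p, 1, len);                          /* first touch of each page */
    long after = minor_faults();

    printf("soft (minor) page faults taken: %ld\n", after - before);
    munmap(p, len);
    return 0;
}
```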

How does a process start execution under pure demand paging?

I was studying memory management in OS and got this doubt. It is said that in pure demand paging, a process starts execution with zero pages brought into main memory. The virtual address space contains many things, including data, stack, heap, and the text area (code). So if a process is going to execute and it has no page in main memory, how will the instruction register hold its first instruction, which will be executed by the CPU and result in further page faults?
This is a bad way of viewing the address space.
The virtual address space contains many things, including data, stack, heap, and the text area (code).
The address space consists of memory with different attributes: read-only, read-only/execute, read/write, and (rarely) read/write/execute.
Virtual memory is the use of secondary storage to simulate physical memory. The program loader reads the executable file and builds the address space on disk. For example, on some systems the executable file itself becomes the page file for code and data.
Once the program is loaded, the address space consists of pages that are valid to the operating system but have no mapping to physical addresses.
When the program starts running, it accesses valid pages that have no mappings, which results in page faults. The operating system's page-fault handler finds where the page is stored in secondary storage, maps the page to a physical page frame, and loads the data into the page.
So if a process is going to execute and it has no page in main memory, how will the instruction register hold its first instruction, which will be executed by the CPU and result in further page faults?
The address of the starting instruction is specified in the executable. That value gets loaded into a register (the program counter / instruction pointer). There is no first instruction in memory yet. When the program tries to fetch its first instruction, it gets a page fault.
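To make that concrete, here is a minimal sketch (my own illustration; it assumes a 64-bit ELF executable on Linux) that reads the entry-point address recorded in an executable's header, i.e. the value the loader places in the instruction pointer before any page of the program is resident:

```c
/* Print the entry point (e_entry) stored in a 64-bit ELF executable's header. */
#include <elf.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;
    FILE *f = fopen(argv[1], "rb");
    if (!f)
        return 1;

    Elf64_Ehdr hdr;
    if (fread(&hdr, sizeof hdr, 1, f) == 1 &&
        memcmp(hdr.e_ident, ELFMAG, SELFMAG) == 0)
        printf("entry point: %#lx\n", (unsigned long)hdr.e_entry);

    fclose(f);
    return 0;
}
```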

Core Data - How to force NSPrivateQueueConcurrencyType context save in serial?

I was excited about the concurrency features newly supported by Core Data since iOS 5.
A private queue is maintained and all save or fetch requests can be done via that queue.
However, can I set up the private queue for Core Data so that it executes requests one by one?
My app is downloading news items from a number of feeds. Each time downloading and parsing of one feed finishes, I save that feed's items into Core Data via the private queue.
However, since I am downloading and parsing multiple feeds simultaneously, I always have multiple groups of items, i.e., multiple save requests, for Core Data.
Now the situation is that I guess Core Data just has a number of threads and each one is saving a group of items into the database. My UI gets stuck in the meantime.
Do you think I can control the private queue so that no matter how many simultaneous save requests there are, they will be done one by one?
Core Data is (probably) only using one serial queue or thread, since it's serial. I recently converted my app from using a serial queue I had created (the app targeted iOS 4.3) to using this new option in iOS 5. In all cases, when you use 'performBlock:' the block is handled in a serial fashion. Also, you can now call '[moc performBlock:...]' from any queue, as that call is thread safe!
I believe what you want to do is have your background threads, which are most likely adding objects, use 'performBlock:' (without the wait). The block you provide is then queued and processed in a FIFO fashion. Later on, if your table wants to get objects, it can issue a 'performBlockAndWait:', or optionally your code can ask for the latest objects using 'performBlock:' and, at the end of the supplied block, message back to your app the set of objects you need.
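To illustrate the serial, FIFO behavior this relies on, here is a small plain-C libdispatch sketch (my own example, not Core Data code; the queue label is made up). 'performBlock:' on a private-queue context gives you the same one-at-a-time, in-order execution for the blocks you submit:

```c
/* Blocks submitted to a serial dispatch queue run one at a time, in FIFO
 * order. Compile with clang -fblocks on macOS/iOS. */
#include <dispatch/dispatch.h>
#include <stdio.h>

int main(void)
{
    dispatch_queue_t q = dispatch_queue_create("com.example.saves",
                                               DISPATCH_QUEUE_SERIAL);

    for (int i = 0; i < 3; i++) {
        dispatch_async(q, ^{
            /* Each "save" runs only after the previous one has finished. */
            printf("save %d\n", i);
        });
    }

    /* Wait for the queued work before exiting (for the demo only). */
    dispatch_sync(q, ^{ puts("all saves done"); });
    return 0;
}
```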
Also, I only ever save often in development builds, to verify validity. Once you are pretty sure things are working OK, you can then just perform a background save once all the data is downloaded.
EDIT: To reiterate - if you are downloading and also using images or other data while loading a view controller, and lots of things are going on, this is the WORST time to do a save. Use a timer or dispatch_after, and many seconds after everything seems stable, THEN do the save.

iPhone Image Caching

I want to save images fetched from a URL into NSTemporaryDirectory(). If I store more than 100 images, the application gets slow and sometimes quits. How many images can be stored in NSTemporaryDirectory()? Would continuously deleting files once there are 50 or more images be a good solution? Is there any other alternative for storing images without affecting the application's performance?
Clawoo is right. Check your memory management and do one more thing.
You can add your code for deleting data from the temp directory inside the didReceiveMemoryWarning method.
This method is called whenever your app receives a memory warning.
If the application becomes sluggish the problem lies in your memory management implementation. Make sure you release all your objects, especially the ones that you use to download the images (NSURLConnections, NSData, UIImage, etc).
Writing all those images to disk (whether it's a temp directory or not doesn't really matter) should NOT impact your application's performance in the long term, let alone outright kill it. The application is being shut down because it most probably runs out of memory.

Memory mapped files and "soft" page faults. Unavoidable?

I have two applications (processes) running under Windows XP that share data via a memory mapped file. Despite all my efforts to eliminate per-iteration memory allocations, I still get about 10 soft page faults per data transfer. I've tried every flag there is in CreateFileMapping() and MapViewOfFile() and it still happens. I'm beginning to wonder if it's just the way memory mapped files work.
If anyone here knows the O/S implementation details behind memory mapped files, I would appreciate comments on the following theory: if two processes share a memory mapped file and one process writes to it while another reads it, then the O/S marks the pages written to as invalid. When the other process goes to read the memory areas that now belong to invalidated pages, this causes a soft page fault (by design) and the O/S knows to reload the invalidated page. The number of soft page faults would therefore be directly proportional to the size of the data write.
My experiments seem to bear out the above theory. When I share data I write one contiguous block of data. In other words, the entire shared memory area is overwritten each time. If I make the block bigger the number of soft page faults goes up correspondingly. So, if my theory is true, there is nothing I can do to eliminate the soft page faults short of not using memory mapped files because that is how they work (using soft page faults to maintain page consistency). What is ironic is that I chose to use a memory mapped file instead of a TCP socket connection because I thought it would be more efficient.
Note: if soft page faults are harmless, please say so. I've heard that at some point, if the number is excessive, the system's performance can suffer. If soft page faults are not intrinsically harmful, I'd also like to hear any guidelines as to what number per second counts as "excessive".
Thanks.
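For context, here is a minimal sketch of the writer side of the kind of pagefile-backed shared section being described (hypothetical name and size, not the asker's actual code); where the roughly 10 soft faults per transfer actually come from is the open question:

```c
/* Writer side of a named, pagefile-backed shared mapping on Windows.
 * The name and size below are made up for illustration. */
#include <windows.h>
#include <string.h>

#define SHARED_NAME  "Local\\DemoSharedBlock"   /* hypothetical name */
#define SHARED_SIZE  (256 * 1024)               /* hypothetical size */

int main(void)
{
    HANDLE mapping = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL,
                                        PAGE_READWRITE,
                                        0, SHARED_SIZE, SHARED_NAME);
    if (!mapping)
        return 1;

    void *view = MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS,
                               0, 0, SHARED_SIZE);
    if (!view)
        return 1;

    /* Overwrite the whole block, as described in the question; the reader
     * re-reading these pages from its own view is where the soft page
     * faults are being observed. */
    memset(view, 0xAB, SHARED_SIZE);

    UnmapViewOfFile(view);
    CloseHandle(mapping);
    return 0;
}
```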