What sort of things can cause a whole system to appear to hang for 100s-1000s of milliseconds? - operating-system

I am working on a Windows game, and while rendering, some computers experience intermittent pauses ("hitches" for lack of a better term). When profiled, they appear in seemingly random places in the code. Eventually I noticed that it wasn't just my process that was affected, but (seemingly) every process on the system. All of the threads in my application hitch at once. CPU utilization drops during these hitches, and it appears as if most processes make no progress.
This leads me to believe this may be an operating system or driver issue, but it only occurs while playing the game (and only on some systems). What sort of operations might the operating system be doing that would require the kernel to pause all user threads and block? Some kind of I/O? At first I thought of paging, but my impression is that paging would only affect a single process, no?
Some systems in use: Windows, DirectX (3D), nVidia cards (unknown if it replicates on ATI), using overlapped I/O for streaming.

If you have a lot of graphics in use, it may be paging graphics memory into the swap file.
Or perhaps the stream is getting buffered on disk?
It's worth seeing if the hitches coincide with the PC's disk activity LED.

Heavy use of memory-mapped I/O. This of course includes the system pagefile, but can also include user applications that use memory-mapped I/O heavily (gcc, for one).
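As a rough illustration of why that can stall, here is a hedged Win32 sketch (the function name and error handling are mine): reading a mapped view means every first touch of a page can block on a hard page fault, i.e. a synchronous disk read.

    #include <windows.h>

    // Hypothetical example: XOR all bytes of a file through a memory mapping.
    // Each first access to a new 4 KiB page can trigger a hard page fault and
    // stall the thread on disk I/O, the kind of pause discussed above.
    char xorMappedFile(const wchar_t* path) {
        HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (file == INVALID_HANDLE_VALUE) return 0;
        HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
        DWORD size = GetFileSize(file, nullptr);
        auto* data = static_cast<const char*>(MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
        char acc = 0;
        for (DWORD i = 0; i < size; ++i)
            acc ^= data[i];  // may fault and hit the disk on every new page
        UnmapViewOfFile(data);
        CloseHandle(mapping);
        CloseHandle(file);
        return acc;
    }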

Stored Program Computer in modern computing

I was given this exact question on a quiz.
[The quiz question and its answer were posted as images.]
Does the question make any sense? My understanding is that the OS schedules a process and manages what instructions it needs the processor to execute next. This is because the OS is liable to pull all sorts of memory management tricks, especially in main memory where fragmentation is a way of life. I remember that there is supposed to be a special register on the processor called the program counter. In light of the scheduler and memory management done by the OS I have trouble figuring out the purpose of this register unless it is just for the OS. Is the concept of the Stored Program Computer really relevant to how a modern computer operates?
Hardware fetches machine code from main memory, at the address in the program counter (which increments on its own as instructions execute, or is modified by executing a jump or call instruction).
Software has to load the code into RAM (main memory) and start the process with its program counter pointing into that memory.
And yes, if the OS wants to page that memory out to disk (or lazily load it in the first place), hardware will trigger a page fault when the CPU tries to fetch code from an unmapped page.
But no, the OS does not feed instructions to the CPU one at a time.
(Unless you're debugging a program by putting the CPU into "single step" mode when returning to user-space for that process, so it traps after executing one instruction. Like x86's trap flag, for example. Some ISAs only have software breakpoints, not HW support for single stepping.)
But anyway, the OS itself is made up of machine code that runs on the CPU. CPU hardware knows how to fetch and execute instructions from memory. An OS is just a fancy program that can load and manage other programs. (Remember, in a von Neumann architecture, code is data.)
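To make the fetch-execute idea concrete, here is a toy C++ sketch of a von Neumann machine; the opcodes and the single accumulator register are invented for illustration, but the loop is the job the hardware's program counter does:

    #include <cstdint>
    #include <vector>

    // Toy von Neumann machine: code and data share one memory, and the
    // program counter (pc) selects the next instruction to fetch.
    enum Op : uint8_t { LOAD_IMM, ADD, JMP, HALT };

    int run(std::vector<uint8_t>& mem) {
        uint32_t pc = 0;   // program counter: address of the next instruction
        int acc = 0;       // a single accumulator register
        for (;;) {
            uint8_t op = mem[pc];  // fetch from memory at pc
            switch (op) {
                case LOAD_IMM: acc  = mem[pc + 1]; pc += 2; break;  // pc advances on its own
                case ADD:      acc += mem[pc + 1]; pc += 2; break;
                case JMP:      pc   = mem[pc + 1];          break;  // a jump rewrites pc
                default:       return acc;                          // HALT
            }
        }
    }

    int main() {
        // "Loading a program" is just writing bytes into memory: code is data.
        std::vector<uint8_t> program = { LOAD_IMM, 40, ADD, 2, HALT };
        return run(program);  // process exit code is 42
    }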
Even the OS has to depend on the processor architecture. Memory today is often virtualized: the memory location seen by the program is not the real physical location, but is indirected through one or more tables describing the actual location and some attributes (e.g. whether read/write/execute access is allowed). If the accessed virtual memory has not been loaded into main memory (these tables say so), an exception is generated, and the address of an exception handler is loaded into the program counter. This exception handler is provided by the OS and resides in main memory. So the program counter is quite relevant on today's computers, but the next instruction can be changed on the fly by exceptions (exceptions are also used for thread or process switching in preemptive multitasking systems).
Does the question make any sense?
Yes. It makes sense to me. It is a bit imprecise, but the meanings of each of the alternatives are sufficiently distinct to be able to say that D) is the best answer.
(In theory, you could create a von Neumann computer which was able to execute instructions out of secondary storage, registers or even the internet ... but it would be highly impractical for various reasons.)
My understanding is that the OS schedules a process and manages what instructions it needs the processor to execute next. This is because the OS is liable to pull all sorts of memory management tricks, especially in main memory where fragmentation is a way of life.
Fragmentation of main memory is not actually relevant. A modern machine uses special hardware (and page tables) to deal with that. From the perspective of executing code (application or kernel) this is all hidden. The code uses virtual addresses, and the hardware maps them to physical addresses. (This is even true when dealing with page faults, though special care will be taken to ensure that the code and page table entries for the page fault handler are in RAM pages that are never swapped out.)
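A toy model of that mapping (4 KiB pages assumed; real page tables are multi-level structures walked by the MMU in hardware, not a hash map) might look like:

    #include <cstdint>
    #include <unordered_map>

    constexpr uint64_t kPageBits = 12;  // 4096-byte pages

    // Map a virtual address to a physical one: split off the page offset,
    // look up the frame for the virtual page number, and recombine.
    uint64_t translate(const std::unordered_map<uint64_t, uint64_t>& pageTable,
                       uint64_t virtualAddr) {
        uint64_t vpn    = virtualAddr >> kPageBits;          // virtual page number
        uint64_t offset = virtualAddr & ((1ull << kPageBits) - 1);
        auto it = pageTable.find(vpn);
        if (it == pageTable.end())
            return ~0ull;  // unmapped: real hardware raises a page fault here
        return (it->second << kPageBits) | offset;           // frame number + offset
    }

In this view, fragmentation disappears: contiguous virtual pages can map to scattered physical frames without the running code ever noticing.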
I remember that there is supposed to be a special register on the processor called the program counter. In light of the scheduler and memory management done by the OS I have trouble figuring out the purpose of this register unless it is just for the OS.
The PC is fundamental. It contains the virtual memory address of the next instruction that the CPU is to execute. For application code AND for OS kernel code. When you switch between the application and kernel code, the value in the PC is updated as part of the context switch.
Is the concept of the Stored Program Computer really relevant to how a modern computer operates?
Yes. Unless you are working on a special custom machine where (say) the program has been transformed into custom silicon.

When do deadlocks occur in modern operating systems?

I know deadlocks were a hot research topic in the past. But even though I have studied lots of modern operating systems, I don't see any major problem with deadlocks now. I know that some (most) resources on which deadlocks can occur are strictly managed by the operating system itself, which seems to prevent deadlocks somehow; I really haven't seen any case related to a deadlock. I know resources are handled differently across popular systems with different design principles, but they all seem able to keep the system deadlock-free.
Try using two mutexes in your program. In the first thread, lock them in this sequence: mutex1, sleep(500 ms), mutex2; in the second thread: mutex2, sleep(1000 ms), mutex1.
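A minimal sketch of that recipe with std::thread and std::mutex (the thread function names are mine; the mutex names and timings follow the answer):

    #include <chrono>
    #include <mutex>
    #include <thread>

    std::mutex mutex1, mutex2;

    void first() {
        std::lock_guard<std::mutex> a(mutex1);  // lock mutex1 first
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
        std::lock_guard<std::mutex> b(mutex2);  // waits forever: second() holds mutex2
    }

    void second() {
        std::lock_guard<std::mutex> a(mutex2);  // lock mutex2 first
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
        std::lock_guard<std::mutex> b(mutex1);  // waits forever: first() holds mutex1
    }

    int main() {
        std::thread t1(first), t2(second);
        t1.join();  // never returns: the two threads are deadlocked
        t2.join();
    }

Taking the locks in the same global order in both threads (or taking both at once, e.g. with std::scoped_lock in C++17) removes the deadlock.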
In real systems: on Windows (including 8.1), if your application uses SendMessage with HWND_BROADCAST and one application is hung, your application will also be in a hung state. The same goes for some cases of DDE communication (including ShellExecute for some programs): if one application is not responsive, your application can hang too.
But you can use SendMessageTimeout...
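For example, a hedged sketch (the WM_NULL message and the 1000 ms timeout are arbitrary choices here) of a broadcast that skips hung windows:

    #include <windows.h>

    int main() {
        DWORD_PTR result = 0;
        // Broadcast a no-op message, waiting at most 1000 ms per window and
        // skipping any window that is already hung.
        SendMessageTimeoutW(HWND_BROADCAST, WM_NULL, 0, 0,
                            SMTO_ABORTIFHUNG, 1000, &result);
        return 0;
    }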
Deadlock will always be possible wherever processes or threads are synchronized, and synchronization of processes and threads is a must-have element of applications.
AND... SYSTEM-WIDE deadlock (Windows):
Save all your documents before this action.
Create HWND h1 with parent=0 or parent=GetDesktopWindow and styles 0x96cf0000
Create HWND h2 with parent=h1 and styles 0x96cf0000
Create HWND h3 with parent=h2 and styles 0x56cf0000 (this one must be a child window).
Use ::SetParent(h1, h3);
Then click any of these windows.
The system will try to reorder the windows in cyclic (triangular) order. The application hangs, and if any other application tries to call SetWindowPos, it will never return from that function. Task Manager won't help, and Ctrl+Alt+Del also stops working. 100% CPU usage... Only a hard reset will help you.
It is possible to prevent this, but the situation must be detected as soon as possible.
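For the curious, the recipe above might look like the following Win32 sketch. The "STATIC" window class is an assumption (the answer names none), and per the warning above, do not run this on a machine with unsaved work:

    #include <windows.h>

    int main() {
        HINSTANCE inst = GetModuleHandleW(nullptr);
        // Styles copied from the answer: 0x96cf0000 is a visible popup-style
        // window; 0x56cf0000 has the WS_CHILD bit set instead.
        HWND h1 = CreateWindowExW(0, L"STATIC", L"h1", 0x96CF0000,
                                  0, 0, 300, 200, nullptr, nullptr, inst, nullptr);
        HWND h2 = CreateWindowExW(0, L"STATIC", L"h2", 0x96CF0000,
                                  0, 0, 250, 150, h1, nullptr, inst, nullptr);
        HWND h3 = CreateWindowExW(0, L"STATIC", L"h3", 0x56CF0000,
                                  0, 0, 200, 100, h2, nullptr, inst, nullptr);
        SetParent(h1, h3);  // closes the cycle: h1 -> h2 -> h3 -> h1

        // Clicking any of the windows now triggers the cyclic reordering
        // the answer describes.
        MSG msg;
        while (GetMessageW(&msg, nullptr, 0, 0) > 0) {
            TranslateMessage(&msg);
            DispatchMessageW(&msg);
        }
        return 0;
    }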
Operating system deadlocks still happen. When a system has limited contended resources that it can't reclaim, a deadlock is still possible.
In Linux, look at kernel stalls; these happen when I/O doesn't release in a timely manner. Kernel stalls are particularly interesting between VMware and guest operating systems.
For external instigators, deadlocks happen when SAN systems and networks have issues.
New-release deadlocks happen fairly often while a kernel matures, not per user, but across the community as a whole.
Ever get a blue screen or instant reboot? Some of those are caused by lost resources.
Kernels are fairly mature, and have gotten good at reclaiming resources, but aren't perfect.
Most modern resource handlers tend to present as services now instead of being lockable objects. Most resource sharing within the operating system relies on separate channels, alleviating much of the overlap. There's a higher reliance on queues and toggles instead of direct locking contention on shared buffers. These are generalities of trends in OS parts and pieces that contribute to fewer opportunities for deadlocks, but there's no way to guarantee a deadlock-free system.

Unusual spikes in CPU utilization in CentOS 6.6 while starting PyCharm

My system has been behaving strangely for the last couple of days. I am a regular user of PyCharm, and it used to work on my system very smoothly with no hiccups at all. But for the last couple of days, whenever I start PyCharm, my CPU utilization behaves strangely, like in the image: Unusual CPU util
I am confused because when I look at the process list or try ps/top in a terminal, there is no process using more than 1 or 2% CPU. So I am not sure where these resources are being consumed.
By unusual CPU utilization I mean that first CPU1 is used 100% for a couple of minutes, then CPU2; that is, only one CPU's utilization goes to 100% for some time, followed by the other's. This goes on for 10 to 20 minutes, then the system comes back to normal.
P.S.: I don't think this problem is related to PyCharm, as I face similar issues while doing other work too; it's just that I always face it with PyCharm for sure.
POSSIBLE CAUSE: I suspect you have a thrashing problem. The CPU usage of your applications is low because none of them are actually getting much useful work done; all the processing is being taken up by moving memory pages to and from the disk. Your CPU usage probably settles down after a time because your application has entered a state where its working set has shrunk to the point where it can all be held in memory at one time.
This has probably happened because one of the apps on your machine is handling a larger data set than before, and so requires more addressable memory. Another possibility is that, for some reason, a lot more apps are running on your machine.
POTENTIAL SOLUTION: There are several ways you can address this. The simplest is to put more RAM on your machine. If this doesn't work or isn't possible, you'll have to figure out which app is the memory hog. You may simply have to work with smaller problems/data-sets or offload some of the apps onto a different box.
MIGRATING CPU LOAD: Operating systems will move tasks (user apps, kernel) between processors for many different reasons, ranging from plain randomness to certain apps having more of their addressable memory in one bank than another. Given that you are probably doing a lot of thrashing, I'm not surprised that the processor your app runs on varies over time.

Music player process

I was reading a book which says that a processor with a single core and no hyper-threading can process only one process at a time. So a doubt arises: when we do so many operations on a PC, and some background processes are always running, why doesn't the music player stop in between for a short while? I know the CPU is pretty fast, but a music player usually plays music continuously without any (observable) break. Can anyone clarify this behavior?
1) A single-core CPU without hyperthreading can, as you say, only run one process at a time. Multiple processes are handled by context switching, that is, the CPU will run one process, then switch to the next process and the next, and then back to the first process, and so on. How often a certain process is scheduled depends on lots of different factors, of which process priority is one. (Back in the day it was often necessary to run WinAmp with elevated priority to avoid glitches. Nowadays this is not needed, as CPUs are a lot faster.)
2) So, with this in mind, how come it still sounds great and without glitches?
When processing audio the CPU feeds the sound device with samples by putting them either in a hardware buffer on the sound card or in the RAM. The sound processor does not get its data directly from the CPU, instead it reads the samples from one of these two buffers. As long as we have samples in the buffer we are good, even though the CPU is off doing something else.
The details of the hardware buffer size differ between sound cards. Some (older) sound cards do not have a sound buffer at all, and there the RAM comes into play instead.
Running out of samples is called a buffer underrun. Even on modern computers this can happen: for example, if you start a heavy process while running your audio player, the CPU may not be able to switch back in time, and we can clearly hear glitches and gaps in the sound feed.
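A toy model of that buffering (no real sound API; the names, sizes, and rates are invented) looks like this: the player thread produces samples ahead of where the hardware is reading, and a glitch is exactly the moment the two positions meet.

    #include <array>
    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <thread>

    constexpr size_t kFrames = 4096;           // ring buffer capacity
    std::array<int16_t, kFrames> ring{};
    std::atomic<uint64_t> written{0};          // frames produced by the player
    std::atomic<uint64_t> played{0};           // frames consumed by the "hardware"

    int16_t nextDecodedSample() { return 0; }  // stand-in for the decoder

    void playerThread() {
        for (int i = 0; i < 100; ++i) {
            // Keep the buffer at least half full. If the scheduler preempts
            // this thread, the hardware keeps draining what is queued; only
            // when written == played (a buffer underrun) is there a glitch.
            while (written.load() - played.load() < kFrames / 2) {
                ring[written.load() % kFrames] = nextDecodedSample();
                written.fetch_add(1);
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    int main() {
        std::thread producer(playerThread);
        // Crude stand-in for the sound card's clock: ~48 frames per millisecond.
        for (int ms = 0; ms < 100; ++ms) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            played.fetch_add(48);
        }
        producer.join();
        return 0;
    }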
This is due to the operating system doing preemptive multitasking. The process is in fact being interrupted, but for a very short amount of time, not long enough for a human to notice. Another reason is that the audio card has a playback buffer which allows playback to continue while data is fed to it in chunks. So while the process feeding the card with data is interrupted for a very short time, playback can still go on.
This is handled by the Operating System Scheduler.
The scheduler will allocate a time slice to each process (this may be a few milliseconds) and will allow the process to execute for that length of time. The length allocated is determined by the algorithm used by the OS (i.e. short-term scheduling, long-term, etc.). The reason you do not notice this is that the CPU operates at such high frequencies, e.g. 1 GHz, which makes multitasking on a single core/thread transparent to the user.
http://en.wikipedia.org/wiki/Scheduling_(computing)
http://web.cs.wpi.edu/~cs3013/c07/lectures/Section05-Scheduling.pdf
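A toy round-robin simulation of those time slices (the process names and the 4 ms quantum are invented for illustration):

    #include <algorithm>
    #include <cstdio>
    #include <deque>
    #include <string>

    struct Process { std::string name; int remainingMs; };

    int main() {
        const int sliceMs = 4;  // made-up quantum
        std::deque<Process> ready = { {"music", 12}, {"browser", 8}, {"indexer", 6} };
        int t = 0;
        while (!ready.empty()) {
            Process p = ready.front();
            ready.pop_front();
            int ran = std::min(sliceMs, p.remainingMs);
            t += ran;
            p.remainingMs -= ran;
            std::printf("t=%2d ms: ran %s for %d ms\n", t, p.name.c_str(), ran);
            if (p.remainingMs > 0)
                ready.push_back(p);  // preempted: goes to the back of the queue
        }
        return 0;
    }

At millisecond scale each process appears to run continuously, which is exactly why the music never audibly stops.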

What prevents an OS to recover from a 'blue screen of death'?

If a program violates its instruction path and/or memory data, the OS halts it with some message, because the program runs in the 'virtual machine'-like space the OS provides, and the OS is unable to determine the program's next instruction.
The OS in turn is also a program, sharing the machine's resources like any other program, and it can halt in a similar fashion, but it's sometimes healthy enough to display some debugging info and a blue screen. So as a programmer I'm thinking: if it can do that, emit debugging info and make the screen blue, why wouldn't it be able to try to recover the OS altogether instead of requiring a cold reboot? After all, it's the OS; it's supposed to be the rock-solid foundation (not talking about Windows of course) of all software. If the space shuttle ran Windows, what would happen? It won't recover? :)
So: is it only that MS hasn't taken care of trying everything to recover, to the point that a reboot is not required, or is there some other, deeper problem that stops companies like MS from doing that?
It's nothing specific to Microsoft; Linux has a kernel panic mechanism, OS X has a kernel panic mechanism. I expect every non-toy operating system kernel has a panic mechanism of some sort when internal corruption is detected. The corruption could come from faulty hardware, faulty software, gamma rays hitting the memory boards just right, who knows.
The whole point behind the kernel panic is a recognition that something that shouldn't go wrong has gone wrong. What else might be invalid? Depending upon where the crash happened, it might not be safe to sync and unmount the filesystems because that might scribble corrupt data over good data on the drives.
Writing to the video card is a good way to inform the user of events (many systems have monitors attached, anyway) and writing to the video card isn't likely to corrupt on-disk data: it would take quite an error for the IOMMU or page tables to be so corrupted that they refer instead to on-disk files and most operating systems will refuse to write to block devices after a kernel panic to try to protect user data at all cost.
Consider what you would need to do to bring the system back to a running state. You'd need to tear down all applications that might be associated with corrupted kernel data structures. You'd need to restart applications, in the right order, to bring system services back up. And a reboot is a very easy way to reliably do both of those things.
You can't recover the OS for the same reasons a user-space program can't recover: when certain types of errors are seen, it means that your program is in an undefined state and therefore can't recover. Even if the problem in some sense isn't fatal (i.e. doesn't cause the program to immediately die), it's not safe to continue, because things are, or are likely to be, corrupted.
For example, be it a user-space program or the OS kernel, say a buffer overrun or a messed-up pointer corrupts the stack. How is the program supposed to recover from that? With a blown stack, when the currently executing function ends, where will it return to? The return address is likely gone. Now what?
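A contrived user-space illustration of that blown stack (deliberate undefined behavior; the exact outcome depends on compiler, stack layout, and mitigations such as stack canaries, but the principle is the same in the kernel):

    #include <cstring>

    void corrupt() {
        char buf[8];
        // Writes far past the end of buf, clobbering whatever the compiler
        // placed after it on the stack, possibly the saved return address.
        std::memset(buf, 0x41, 64);
    }   // "returning" now jumps to 0x4141... or anywhere at all

    int main() {
        corrupt();  // likely crashes; nothing after this point can be trusted
        return 0;
    }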
And it's not just Microsoft. Ever hear of a "kernel panic" in Unix?