How do real-time game engines work on Windows? - real-time

I come from an embedded background, where hard real-time is taken quite seriously. For me (fast) real time is tricky on Linux, and ludicrous on Windows, jet video games seem to be doing just fine.
Obviously, since it's only soft real time, and the frequency is (relatively) low with 60+ Hz, the simple solution is to render as fast as possible and hope for the best. But are there any advanced techniques in use that further enhance the responsiveness on Windows?
Some concrete questions:
Is there a way to handle Windows pre-empting at unfavourable moments? How do you share work on multiple cores when some of them are doing something else?
Is IO-latency in Windows deterministic? Can (should) you pressure the scheduler to context switch ASAP?
How common are scheduling techniques like Earliest Deadline First / Rate Monotonic Scheduling (or others)? (build on top of the existing one of course)
Any insight would be welcome!
Edit:
I've looked through a few resources and have added my conclusions:
1.
Q: What happens when a time critical calculation gets pre-empted?
A: The thread should have higher priority so pre-emption is rare, and when it happens, the interruption should be brief (short interrupts or IO-operations).
Windows has the capability to move threads to other cores, thou it is recommended to keep them locked ( -> Thread Affinity).
2.
Q: Should IO-operations have a higher priority than the render loop?
A: Yes, Naughty Dog's engine gives IO-operation higher priority.
Q: What kind of scheduling is used?
A: In the presentation, a job system is used (similar to reactive programming).
It uses three priorities that are non-preemptive.

I recommend you read Modern Operating System from Ph.D Andrew Tanenbaum if you haven't read it. His book also have Unix and Windows architecture.
One most major thing: Game programming is different than Embedded programming
I had worked with Windows and Android (Unix-like). Not only Windows but also Android very hard rendering too meet hard real-time. One of our tricks to reduce CPU, GPU latency is using lower feature driver support for the visual effect (which mean if hardware not support good this new graphic API, we will use older API so this will make object less shape and less beauty) or if device is not support the feature and effect is not important, just remove this effect for that device. It will push FPS (frames per second) better.
Game engine is also difference between many graphic APIs, use Vulkan or DX12 (Windows) or Metal (Apple devices) is make more FPS but require newer hardware, because they are low-level APIs. It will give you better performance, more beautiful than OpenGL or DX11, but old APIs is support on older devices.
And 3 your questions, I will answer with my knowledge (may be there is a better answer from other):
I also have a small question: Why don't you use thread, atomic, mutex C++ 11 APIs for share tasks with multi-core CPU? It easier for thread programming, cross-platform and no need to schedule anything.
Windows has classic Win32 thread APIs SetThreadPriority for set the relative priority of a thread. This API isn't recommended for Windows Runtime (Modern Windows concept from Microsoft for Windows 8 and later) which recommended using thread-pool concept instead (use modern API will automation parallel tasks with multi-core, which is better than manually config use classic API).
In Game engine, there is a presentation of Christian Gyrling from Naughty Dog talking concept Parallel and Sergey Makeev from Roblox implement open source TaskScheduler base this concept in Win32 APIs and Pthread APIs. Jeff Preshing from Ubisoft also have a presentation using C++ 11.
What did you mean about IO-latency? Keyboard, mouse or Disk IO? I will edit answer if you edit your question more clear
I can't find a document or anything similar Earliest Deadline First / Rate Monotonic Scheduling on Windows. May be this is limited by prevent kernel mode scheduling from Microsoft.

Related

Making an overlay with vulkan [duplicate]

I am a mathematician and not a programmer, I have a notion on the basics of programming and am a quite advanced power-user both in linux and windows.
I know some C and some python but nothing much.
I would like to make an overlay so that when I start a game it can get info about amd and nvidia GPUs like frame time and FPS because I am quite certain the current system benchmarks use to compare two GPUs is flawed because small instances and scenes that bump up the FPS momentarily (but are totally irrelevant in terms of user experience) result in a higher average FPS number and mislead the market either unintentionally or intentionally (for example, I cant remember the name of the game probably COD there was a highly tessellated entity on the map that wasnt even visible to the player which lead AMD GPUs to seemingly under perform when roaming though that area leading to lower average FPS count)
I have an idea on how to calculate GPU performance in theory but I dont know how to harvest the data from the GPU, Could you refer me to api manuals or references to help me making such an overlay possible?
I would like to study as little as possible (by that I mean I would like to learn what I absolutely have to learn in order to get the job done I dont intent to become a coder).
I thank you in advance.
It is generally what the Vulkan Layer system is for, which allows to intercept API commands and inject your own. But it is nontrivial to code it yourself. Here are some pre-existing open-source options for you:
To get to timing info and draw your custom overlay you can use (and modify) a tool like OCAT. It supports Direct3D 11, Direct3D 12, and Vulkan apps.
To just get the timing (and other interesting info) as CSV you can use a command-line tool like PresentMon. Should work in D3D, and I have been using it with Vulkan apps too and it seems to accept them.

What is the best way to start programming with Real Time Linux?

Although I have implemented many projects in C, I am completely new to operating systems. I tried real time linux on Discovery board (STM32) and got the correct results for blinking LED but I didn't really understand the whole process since I just followed the steps and could not find whole description for each step on the internet.
I want to implement scheduling on real time linux. What is the best way to start? Any sites, books, tutorials available?
Complete RTLinux process description will be appreciated.
Thanks in adv.
The transition from "bare metal" to OS based programming is something that I experienced in reverse. I started out a complete software guy, totally into the OS side of things and over time I have moved to the opposite of that (even designing circuits in VHDL!). My advice would be to start simple. Linux is pretty complex, and everywhere you look there are many layers of things all working together to deliver the final product. If you are dead set on a real time linux extension, I'd be happy to suggest https://xenomai.org/ which is a real time extension for linux.
However, to more specifically address your question about implementing scheduling in Linux, you can, but it will be a large amount of work and can be very complicated. The OS uses a completely fair scheduling process ( http://en.wikipedia.org/wiki/Completely_Fair_Scheduler ) and whenever you spin up a thread, it simply gets added to the list to run. This can differ slightly if you implement your code in kernel space as a driver, rely on hardware interrupts, etc., but in general, this is how Linux works. Real time generally means that it has the ability to assign threads one of several different priorities and utilize thread preemption fully at any given time which are concepts that aren't really a part of vanilla Linux. It has some notion of this, but it has limitations that can cause problems when you are looking for real time behavior from Linux.
What may be helpful to you is an RTOS. If you are looking for a full on Real Time Operating System, check out FreeRTOS http://www.freertos.org/ . It has a large community and supports a lot of different devices out of the box with a large amount of example code. They even support your specific board with an example package, so you can give it a shot with nothing to lose! http://www.freertos.org/FreeRTOS-for-Cortex-M3-STM32-STM32F100-Discovery.html . It gives you access to many OS ish constructs like network APIs, memory management, and threading without the overhead and latency of a huge OS. With an RTOS, you create tasks and assign them priorities so you become the scheduler and are no longer at the mercy of the OS. You run the OS, not the OS runs you (if that makes sense). Plus, the constructs offered within an RTOS will feel much like bare metal code and thus will be much easier to follow, understand, and fully learn. It is a more simple world to learn the base building blocks of a full blown OS such as Linux or Windows. If this option sounds good, I would suggest looking through the supported devices on FreeRTOS website and picking one you would like to experiment with and then go for it. I would highly recommend this as a way to learn about scheduling and OS constructs in general as it is as simple as you can get and open source. Once you have the basics of an RTOS down, buying a book about Linux specifically wouldn't be a bad idea. Although there are many free resources on the web related to learning about Linux, they are commonly contradictory, and can be misleading. Pile on learning Linux specific knowledge along with OS in general, and it can feel overwhelming. Starting simpler will help keep you from getting burnt out and minimize the amount of time you spend feeling lost. Linux is definitely a learning process, but like with any learning process, start simple, keep your ultimate goal in mind, make a plan, and take small, manageable steps along that plan until you look up and find yourself exactly where you want to be. Then go tackle the next mountain!
The real-time Linux landscape is quite confusing. 99.99% of the information out there is just plain obsolete.
First, there are lots of "microkernels" that run Linux as one task. (Such as the defunct RTLinux). The problem is that you must write your real-time task to a different API, and can't depend on anything in Linux, because Linux will be frozen in the background while your task runs. So unless your task is dead-simple ("stop the motors when I press this button"), this approach will cause more pain than gain.
Next, there is the realtime Linux patch set. This hasn't been doing so well. because of the next item:
Lastly, the current Linux kernel has gotten rid of the problems that caused people to need realtime in the past. You can even turn off Linux on one of your processors to have full control of the CPU. See also this paper.
To answer your question: I see two different paths you could take:
1) Start with a normal 3.xx Linux kernel and explore the various APIs and realtime techniques (i.e. realtime priorities, memory pinning, etc.) This can get you "close enough" for 99% of what people want "realtime" for. If it's good enough for high frequency trading, it's probably good enough for you.
2) If you have a hard realtime requirement and you are worried that Linux won't cut it, then (as Nick mentioned above), just go buy a processor and write your realtime code with no OS. By splitting up your "realtime" and "non-realtime" code onto different CPUs, you will make the whole system simpler and much more robust.
If you want to learn real-time operating systems then I suggest that you get an FPGA, for example the Altera DE2, and experiment with your own operating system and ucos. You can read a good text about embedded RTOS here.
You could also get a Linux Raspberry and write your own operating system for that.

The state of programming and compiling for multicore systems

I'm doing some research on multicore processors; specifically I'm looking at writing code for multicore processors and also compiling code for multicore processors.
I'm curious about the major problems in this field that would currently prevent a widespread adoption of programming techniques and practices to fully leverage the power of multicore architectures.
I am aware of the following efforts (some of these don't seem directly related to multicore architectures, but seem to have more to do with parallel-programming models, multi-threading, and concurrency):
Erlang (I know that Erlang includes constructs to facilitate concurrency, but I am not sure how exactly it is being leveraged for multicore architectures)
OpenMP (seems mostly related to multiprocessing and leveraging the power of clusters)
Unified Parallel C
Cilk
Intel Threading Blocks (this seems to be directly related to multicore systems; makes sense as it comes from Intel. In addition to defining certain programming-constructs, it also seems have features that tell the compiler to optimize the code for multicore architectures)
In general, from what little experience I have with multithreaded programming, I know that programming with concurrency and parallelism in mind is definitely a difficult concept. I am also aware that multithreaded programming and multicore programming are two different things. in multithreaded programming you are ensuring that the CPU does not remain idle (on a single-CPU system. As James pointed out the OS can schedule different threads to run on different cores -- but I'm more interested in describing the parallel operations from the language itself, or via the compiler). As far as I know you cannot truly do parallel operations. In multicore systems, you should be able to perform truly-parallel operations.
So it seems to me that currently the problems facing multicore programming are:
Multicore programming is a difficult concept that requires significant skill
There are no native constructs in today's programming languages that provide a good abstraction to program for a multicore environment
Other than Intel's TBB library I haven't found efforts in other programming-languages to leverage the power of multicore architectures for compilation (for example, I don't know if the Java or C# compiler optimizes the bytecode for multicore systems or even if the JIT compiler does that)
I'm interested in knowing what other problems there might be, and if there are any solutions in the works to address these problems. Links to research papers (and things of that nature) would be helpful. Thanks!
EDIT
If I had to condense my question down to one sentence, it would be this: What are the problems that face multicore programming today and what research is going on in the field to solve these problems?
UPDATE
It also seems to me that there are three levels where multicore needs to be concerned:
Language level: Constructs/concepts/frameworks that abstract parallelization and concurrency and make it easy for programmers to express the same
Compiler level: If the compiler is aware of what architecture it is compiling for, it can optimize the compiled code for that architecture.
OS level: The OS optimizes the running process and perhaps schedules different threads/processes to run on different cores.
I've searched on ACM and IEEE and have found a few papers. Most of them talk about how difficult it is to think concurrently and also how current languages don't have a proper way to express concurrency. Some have gone so far as to claim that the current model of concurrency that we have (threads) is not a good way to handle concurrency (even on multiple cores). I'm interested in hearing other views.
I'm curious about the major problems in this field that would currently prevent a widespread adoption of programming techniques and practices to fully leverage the power of multicore architectures.
Inertia. (BTW: that's pretty much the answer to all "what does prevent the widespread adoption" questions, whether that be models of parallel programming, garbage collection, type safety or fuel-efficient automobiles.)
We have known since the 1960s that the threads+locks model is fundamentally broken. By ~1980, we had about a dozen better models. And yet, the vast majority of languages that are in use today (including languages that were newly created from scratch long after 1980), offer only threads+locks.
The major problems with multicore programming is the same as writing any other concurrent applications, but whereas before it was uncommon to have multiple cpus in a computer, now it is hard to find any modern computer with only one core in it, so, to take advantage of multicore, multiple cpu architectures there are new challenges.
But, this problem is an old problem, whenever computer architectures go beyond compilers then it seems the fallback solution is to move back toward functional programming, as that programming paradigm, if strictly followed, can make very parallelizable programs, as you don't have any global mutable variables, for example.
But, not all problems can be done easily using FP, so the goal then is how to easily get other programming paradigms to be easy to use on multicores.
The first thing is that many programmers have avoided writing good mulithreaded applications, so there isn't a strongly prepared number of developers, as they learned habits that will make their coding harder to do.
But, as with most changes to the cpu, you can look at how to change the compiler, and for that you can look at Scala, Haskell, Erlang and F#.
For libraries you can look at the parallel framework extension, by MS as a way to make it easier to do concurrent programming.
It is at work, but I recently either IEEE Spectrum or IEEE Computer had articles on multicore programming issues, so look at what IEEE and ACM articles have been written on these issues, to get more ideas as to what is being looked at.
I think the biggest impediment will be the difficulty to get programmers to change their language as FP is very different than OOP.
One place for research besides developing languages that will work well this way, is how to handle multiple threads accessing memory, but, as with much in this area, Haskell seems to be at the forefront in testing ideas for this, so you can look at what is going on with Haskell.
Ultimately there will be new languages, and it may be that we have DSLs to help abstract the developer more, but how to educate programmers on this will be a challenge.
UPDATE:
You may find Chapter 24. Concurrent and multicore programming of interest, http://book.realworldhaskell.org/read/concurrent-and-multicore-programming.html
One of the answers mentioned the Parallel Extensions for the .NET Framework and since you mentioned C#, it's definitely something I would investigate. Microsoft has done something interesting things there, though I have to think many of their efforts seem more suited for language enhancements in C# than a separate and distinct library for concurrent programming. But I think their efforts are worth applauding and respect that we're early here. (Disclaimer: I used to be the marketing director for Visual Studio about 3 years ago)
The Intel Thread Building Blocks are also quite interesting (Intel recently released a new version, and I'm excited to head down to Intel Developer Forum next week to learn more about how to use it properly).
Lastly, I work for Corensic, a software quality startup in Seattle. We've got a tool called Jinx that is designed to detect concurrency errors in your code. A 30-day trial edition is available for Windows and Linux, so you might want to check it out. (www.corensic.com)
In a nutshell, Jinx is a very thin hypervisor that, when activated, slips in between the processor and operating system. Jinx then intelligently takes slices of execution and runs simulations of various thread timings to look for bugs. When we find a particular thread timing that will cause a bug to happen, we make that timing "reality" on your machine (e.g., if you're using Visual Studio, the debugger will stop at that point). We then point out the area in your code where the bug was caused. There are no false positives with Jinx. When it detects a bug, it's definitely a bug.
Jinx works on Linux and Windows, and in both native and managed code. It is language and application platform agnostic and can work with all your existing tools.
If you check it out, please send us feedback on what works and doesn't work. We've been running Jinx on some big open source projects and already are seeing situations where Jinx can find bugs 50-100 times faster than simply stress testing code.
The bottleneck of any high-performance application (written in C or C++) designed to make efficient use of more than one processor/core is the memory system (caches and RAM). A single core usually saturates the memory system with its reads and writes so it is easy to see why adding extra cores and threads causes an application to run slower. If a queue of people can pass through a door one a time, adding extra queues will not only clog the door but also make the passage of any one individual through the door less efficient.
The key to any multi-core application is optimization of and economizing on memory accesses. This means structuring data and code to work as much as possible inside their own caches where they don't disturb the other cores with acceses to the common cache (L3) or RAM. Once in a while a core needs to venture there but the trick is to reduce those situations as much as possible. In particular, data needs to be structured around and adapted to cache lines and their sizes (currently 64 bytes) and code needs to be compact and not call and jump all over the place which also disrupts pipelines.
My experience is that efficient solutions are unique to the application in question. The generic guidelines (above) are a basis on which to construct code but the tweak changes resulting from profiling conclusions will not be obvious to those who were not themselves involved in the optimizing work.
Look up fork/join frameworks and work-stealing runtimes. Two names for the same, or at least related, approaches, which is to recursively subdivide large tasks into lightweight units, such that all available parallelism is exploited, without having to know in advance how much parallelism there is. The idea is that it should run at serial speed on a uniprocessor, but get a linear speedup with multiple cores.
Sort of a horizontal analogue of cache-oblivious algorithms if you look at it right.
But i'd say the main problem facing multicore programming is that the great majority of computations remain stubbornly serial. There's just no way to throw multiple cores at those computations and make them stick.

Real time system concept proof project

I'm taking an introductory course (3 months) about real time systems design, but any implementation.
I would like to build something that let me understand better what I'll learn in theory, but since I have never done any real time system I can't estimate how long will take any project. It would be a concept proof project, or something like that, given my available time and knowledge.
Please, could you give me some idea? Thank you in advance.
I programm in TSQL, Delphi and C#, but I'll not have any problem in learning another language.
Suggest you consider exploring the Real-Time Specification for Java (RTSJ). While it is not a traditional environment for constructing real-time software, it is an up-and-coming technology with a lot of interest. Even better, you can witness some of the ongoing debate about what matters and what doesn't in real-time systems.
Sun's JavaRTS is freely available for download, and has some interesting demonstrations available to show deterministic behavior, and show off their RT garbage collector.
In terms of a specific project, I suggest you start simple: 1) Build a work-generator that you can tune to consume a given amount of CPU time; 2) Put this into a framework that can produce a distribution of work-generator tasks (as threads, or as chunks of work executed in a thread) and a mechanism for logging the work produced; 3) Produce charts of the execution time, sojourn time, deadline, slack/overrun of these tasks versus their priority; 4) demonstrate that tasks running in the context of real-time threads (vice timesharing) behave differently.
Bonus points if you can measure the overhead in the scheduler by determining at what supplied load (total CPU time produced by your work generator tasks divided by wall-clock time) your tasks begin missing deadlines.
Try to think of real-time tasks that are time-critical, for instance video-playing, which fails if tasks are not finished (e.g. calculating the next frame) in time.
You can also think of some industrial solutions, but they are probably more difficult to study in your local environment.
You should definitely consider building your system using a hardware development board equipped with a small processor (ARM, PIC, AVR, any one will do). This really helped remove my fear of the low-level when I started developing. You'll have to use C or C++ though.
You will then have two alternatives : either go bare-metal, or use a real-time OS.
Going bare-metal, you can learn :
How to initalize your processor from scratch and most importantly how to use interrupts, which are the fastest way you have to respond to an externel event
How to implement lightweight threads with fast context switching, something every real-time OS implements
In order to ease this a bit, look for a dev kit which comes with lots of documentation and source code. I used Embedded Artists ARM boards and they give you a lot of material.
Going with the RT OS :
You'll fast-track your project, and will be able to learn how to fine-tune a RT OS
You may try your hand at an open-source OS, such as Linux or the BSDs, and learn a lot from the source code
Either choice is good, you will get a really cool hands-on project to show off and hopefully better understand your course material. Good luck!
As most realtime systems are still implemented in C or C++ it may be good to brush up your knowledge of these programming languages. Many realtime systems are also embedded systems, so you might want to play around with a cheap open source one like BeagleBoard (http://beagleboard.org/). This will also give you a chance to learn about cross compiling etc.

Comparison of embedded operating systems?

I've been involved in embedded operating systems of one flavor or another, and have generally had to work with whatever the legacy system had. Now I have the chance to start from scratch on a new embedded project.
The primary constraints on the system are:
It needs a web-based interface.
Inputs are required to be processed in real-time (so a true RTOS is needed).
The memory available is 32MB of RAM and FLASH.
The operating systems that the team has used previously are VxWorks, ThreadX, uCos, pSOS, and Windows CE.
Does anyone have a comparison or trade study regarding operating system choice?
Are there any other operating systems that we should consider? (We've had eCos and RT-Linux suggested).
Edit - Thanks for all the responses to date. A pity I can't flag all as "accepted".
I think it would be wise to evaluate carefully what you mean by "RTOS". I have worked for years at a large company that builds high-performance embedded systems, and they refer to them as "real-time", although that's not what they really are. They are low-latency and have deterministic schedulers, and 9 times out of 10, that's what people are really after when they say RTOS.
True real-time requires hardware support and is likely not what you really mean. If all you want is low latency and deterministic scheduling (again, I think this is what people mean 90% of the time when they say "real-time"), then any Linux distribution would work just fine for you. You could probably even get by with Windows (I'm not sure how you control the Windows scheduler though...).
Again, just be careful what you mean by "Real-time".
It all depends on how much time was allocated for your team has to learn a "new" RTOS.
Are there any reasons you don't want to use something that people already have experience with?
I have plenty of experience with vxWorks and I like it, but disregard my opinion as I work for WindRiver.
uC/OS II has the advantage of being fully documented (as in the source code is actually explained) in Labrosse's Book. Don't know about Web Support though.
I know pSos is no longer available.
You can also take a look at this list of RTOSes
I worked with QNX many years ago, and have nothing but great things to say about it. Even back then, QNX 4 (which is positively chunky compared to the Neutrino microkernel) was perfectly suited for low memory situations (though 32MB is oodles compared to the 1-2MB that we had to play with), and while I didn't explicitly play with any web-based stuff, I know Apache was available.
I purchased some development hardware from netburner
It has been very easy to work with and very well documented. It is an RTOS running uCLinux. The company is great to work with.
It might be a wise decision to select an OS that your team is experienced with. However I would like to promote two good open source options:
eCos (has you mentioned)
RTEMS
Both have a lot of features and drivers for a wide variety of architectures. You haven't mentioned what architecture you will be using. They provide POSIX layers which is nice if you want to stay as portable as possible.
Also the license for both eCos and RTEMS is GPL but with an exception so that the executable that is produced by linking against the kernel is not covered by GPL.
The communities are very active and there are companies which provide commercial support and development.
We've been very happy with the Keil RTX system....light and fast and meets all of our tight real time constraints. It also has some nice debugging features built in to monitor stack overflow, etc.
I have been pretty happy with Windows CE, although it is 'heavier'.
Posting to agree with Ben Collins -- your really need to determine if you have a soft real-time requirement (primarily for human interaction) or hard real-time requirement (for interfacing with timing-sensitive devices).
Soft can also mean that you can tolerate some hiccups every once in a while.
What is the reliability requirements? My experience with more general-purpose operating systems like Linux in embedded is that they tend to experience random hiccups due to their smart average-case optimizations that try to avoid starvation and similar for individual tasks.
VxWorks is good:
good documentation;
friendly developing tool;
low latency;
deterministic scheduling.
However, I doubt that WindRiver would convert their major attention to Linux and WindRiver Linux would break into the market of WindRiver VxWorks.
Less market, less requirement of engineers.
Here is the latest study. The last one was done more than 8 years ago so this is most relevant. The tables can be used to add additional RTOS choices. You'll note that this comparison is focused on lighter machines but is equally applicable to heavier machines provided virtual memory is not required.
http://www.embedded.com/design/operating-systems/4425751/Comparing-microcontroller-real-time-operating-systems