How does an OS with a 1 kHz tick rate make nanosecond measurements? - operating-system

I've worked on low level devices such as microcontrollers, and I understand how timers and interrupts work; but, I was wondering how an operating system with a 1 kHz tick rate is able to give time measurements of events down to the nanosecond precision. Are they using timers with input capture or something like that? Is there a limit to how many measurements you can take at once?

On modern x86 platforms, the TSC can be used for high resolution time measurement, and even to generate high resolution interrupts (that's a different interface of course, but it counts in TSC ticks). There is no inherent limit to the number of parallel measurements. Before the "Invariant TSC" feature was introduced it was much harder to use the TSC that way, because the TSC frequency changed with the C-state (and so usually went out of sync across cores and in any case was hard to use as a stopwatch).

Related

STM32 ADC: leave it running at 'high' speed or switch it off as much as possible?

I am using a G0 with one ADC and 8 channels. Works fine. I use 4 channels. One is temperature that is measured constantly and I am interested in the value every 60s. Another one is almost the opposite: it is measuring sound waves for a couple a minutes per day and I need those samples at 10kHz.
I solved this by letting all 4 channels sample at 10kHz and have the four readings moved to memory by DMA (array of length 4 with 1 measurement each). Every 60s I take the temperature and when I need the audio, I retrieve the audio values.
If I had two ADC's, I would start the temperature ADC reading for 1 conversion every 60s. Non-stop. And I would only start the audio ADC for the the couple of minutes a day that it is needed. But with the one ADC solution, it seems simple to let all conversions run at this high speed continuously and that raised my question: Is there any true downside in having 40.000 conversions per second, 24 hours per day? If not, the code is simple. I just have the most recent values in memory all the time. But maybe I ruin the chip? I use too much energy I know. But there is plenty of it in this case.
You aren't going to "wear it out" by running it when you don't need to.
The main problems are wasting power and RAM.
If you have enough of these, then the lesser problems are:
The wasted power will become heat, this may upset your temperature measurements (this is a very small amount though).
Having the DMA running will increase your interrupt latency and maybe also slow down the processor slightly, if it encounters bus contention (this only matters if you are close to capacity in these regards).
Having it running all the time may also have the advantage of more stable readings, not being perturbed turning things on and off.

MATLAB program simulation with the given processor requirements

I have a system with configuration intel(R) core(TM) i3-5020U CPU # 2.2 GHz,4GB RAM. But in order to compare the performance of my MATLAB program in terms of execution time, I need to execute it on a machine with configuration Intel(R) Core(TM) i5-3570 CPU # 3.40GHz, 16 GB RAM. Is there a way to perform this kind of simulation?
TL:DR: No. Performance differences between Broadwell and IvyBridge depend on lots of complicated details. (See Agner Fog's microarch pdf for the low-level microarchitectural details, and also other stuff in the x86 tag wiki)
It's likely that performance will scale with either clock speed or memory speed within maybe 10%, even between different microarchitectures, but it might not.
Using your own system, you can probably figure out how your code scales with CPU frequency, by forcing it to stay at minimum frequency for a test run. If it's a lot less than perfect scaling, then memory speed is a big factor. (The slower your CPU, the fewer cycles are spent waiting for memory.)
You can't extrapolate IvB i5 3.4GHz performance from BDW 2.2GHz performance without knowing a lot more details about exactly what your code bottlenecks on. It's possible that it bottlenecks on the same simple thing on both CPUs, in which case you could extrapolate. e.g. if it turns out that it bottlenecks on FP multiply latency, then run-time on IvB would be 5/3rds the run time on Broadwell (times the clock frequency ratio), since BDW has 3 cycle FP multiply and add, but SnB/IvB/Haswell have 5 cycle multiply. (FMA is 5 cycles on BDW, if I recall correctly. IvB doesn't support FMA, so if Matlab takes advantage of that on BDW, it's not even running the same machine code).
More likely, it's not that simple and cache / memory performance comes into it, too. Haswell/Broadwell don't have L1 cache-bank conflicts, but SnB/IvB do.
Depending on how you run the workload on the i5 CPU, it might or might not be able to turbo up to higher than its rated 3.4GHz, further confounding any attempt to guess at performance.
It's hard to tell with different computers to measure practical efficiency. That's why you usually use theoretical efficiency with Big-O, check the wiki page for algorithm efficiency and Big-O notation.
In the case you have access to both codes (yours, and the other guy's code), you can test them in the same computer with the methods for measuring performance proposed by mathworks, which are mainly time functions in real time and cpu time.
Lastly, you can see here several challenges about benchmarking that might be interesting to consider.

Can a sub-microsecond clock resolution be achieved with current hardware?

I have a thread that needs to process a list of items every X nanoseconds, where X < 1 microsecond. I understand that with standard x86 hardware the clock resolution is at best 15 - 16 milliseconds. Is there hardware available that would enable a clock resolution < 1 microsecond? At present, the thread runs continuously as the resolution of nanosleep() is insufficient. The thread obtains the current time from a GPS reference.
You can get the current time with extremely high precision on x86 using the rdtsc instruction. It counts clock cycles (on a fixed reference clock, not the actually dynamic frequency CPU clock), so you can use it as a time source once you find the coefficients that map it to real GPS-time.
This is the clock-source Linux uses internally, on new enough hardware. (Older CPUs had the rdtsc clock pause when the CPU was halted on idle, and/or change frequency with CPU frequency scaling). It was originally intended for measuring CPU-time, but it turns out that a very precise clock with very low-cost reads (~30 clock cycles) was valuable, hence decoupling it from CPU core clock changes.
It sounds like an accurate clock isn't your only problem, though: If you need to process a list every ~1 us, without ever missing a wakeup, you need a realtime OS, or at least realtime functionality on top of a regular OS (like Linux).
Knowing what time it is when you do eventually wake up doesn't help if you slept 10 ms too long because you read a page of memory that the OS decided to evict, and had to get from disk.

What is responsible for changing core's load and frequency in multicore processor

Having looked for a description of the multicore design i keep finding several diagrams, but all of them look somewhat like this:
I know from looking at i7z command output that different cores can run at different frequencies.
This would suggest that the decisions regarding which core will be given a new process and for changing the frequency of the core itself are done either by the operating system or by the control block of the core itself.
My question is: What controls the frequencies of each individual core? Is the job of associating a READY process with the specific core placed upon the operating system or is it done by something within the processor.
Scheduling processes/threads to cores is purely up to the OS. The hardware has no understanding of tasks waiting to run. Maintaining the OS's list of processes that are runnable vs. waiting for I/O is completely a software thing.
Migrating a thread from one core to another is done by kernel code on the original core storing the architectural state to memory, then OS code on the new core restoring that saved state and resuming user-space execution.
Traditionally, frequency and voltage scaling decisions are made by the OS. Take Linux as an example: The decision-making code is called a governor (and also this arch wiki link came up high on google). It looks at things like how often processes have used their entire time slice on the current core. If the governor decides the CPU should run at a different speed, it programs some control registers to implement the change. As I understand it, the hardware takes care of choosing the right voltage to support the requested frequency.
As I understand it, the OS running on each core makes decisions independently. On hardware that allows each core to run at different frequencies, the decision-making code doesn't need to coordinate with each other. If running a high frequency on one core requires a high voltage chip-wide, the hardware takes care of that. I think the modern implementation of DVFS (dynamic voltage and frequency scaling) is fairly high-level, with the OS just telling the hardware which of N choices it wants, and the onboard power microcontroller taking care of the details of programming oscillators / clock dividers and voltage regulators.
Intel's "Turbo" feature, which opportunistically boosts the frequency above the max sustainable frequency, does the decision making in hardware. Any time the OS requests the highest advertised frequency, the CPU uses turbo when power and cooling allow.
Intel's Skylake takes this a step further: The OS can hand full control over DVFS to the hardware, optionally with constraints. That lets it react from microsecond to microsecond, rather than on a timescale of milliseconds. This does actually allow better performance in bursty workloads, because more power budget is available for turbo when it's useful. A few benchmarks are bursty enough to observe this, like some browser / javascript ones IIRC.
There was a whole talk about Skylake's new power management at IDF2015, check out the slides and/or archived webcast. The old method is described in a lot of detail there, too, to illustrate the difference, so you should really check it out if you want more detail than my summary. (The list of other IDF talks is here, thanks to Agner Fog's blog for the link)
The core frequency is controlled by a given voltage applied to a core's "oscillator".
This voltage can be changed by the Operating System but it can also be changed by the BIOS itself if a high temperature is detected in the CPU.

Idle state in RTOS, sleep state or lowest frequency?

In real time systems using an RTOS, what how would the RTOS handle an idle period? Would it run nop instructions at the lowest frequency supported by a Dynamic Voltage Scaling capable processor? or would it turn to a sleep state? Can anyone refer me to actual practical implementations. Thanks
It will depend entirely on the target hardware and possibly the needs and design of the application. For example on ARM Cortex-M you would typically invoke the WFI instruction which shuts down the core until the occurrence of an interrupt.
In many microcontroller/SoC cases, reducing the PLL clock frequency would affect the on-chip peripherals from which hardware interrupts might occur, so that is less likely. It would affect baud rates and timer resolution, and is perhaps hard to manage easily. There is a paper here on a tickless idle power management method on FreeRTOS/Cortex-M3.
In most cases the idle loop source is provided as part of the board-support, so you can customise it to your needs.