Priority values used in Windows and Ubuntu

I was wondering what the range of process priorities is in Windows and Ubuntu, and whether low values correspond to high or low priority in each case?


perf power consumption measure: How does it work?

I noticed that perf list now shows events for measuring power consumption. You can use them as follows:
$ perf stat -e power/energy-cores/ ./a.out
Performance counter stats for 'system wide':
8.55 Joules power/energy-cores/
0.949871058 seconds time elapsed
How accurate is this measurement, and how does perf estimate the power consumption?
The power/energy-cores/ perf counter is based on an MSR register called MSR_PP0_ENERGY_STATUS, which is part of the Intel RAPL interface (Intel seems to call each individual RAPL MSR a RAPL interface). A complicated model based on system activity events is used to estimate (static and dynamic) energy consumption. The PP0 in the MSR register name refers to power plane 0, the RAPL domain that contains all the cores of the socket, including their private caches. PP0, however, excludes the last-level cache, the interconnect, the memory controller, the graphics processor, and everything else in the uncore. It's impossible to measure the accuracy of MSR_PP0_ENERGY_STATUS because there is no other way to estimate the energy consumption of power plane 0 alone.
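If you want to look at the raw counter behind power/energy-cores/ yourself, you can read the MSR directly. Here is a minimal sketch, assuming a Linux machine with the msr kernel module loaded (modprobe msr) and root privileges; the addresses 0x606 and 0x639 are the documented Intel RAPL MSRs:

import os
import struct
import time

MSR_RAPL_POWER_UNIT   = 0x606
MSR_PP0_ENERGY_STATUS = 0x639  # power plane 0: the cores

def read_msr(addr, cpu=0):
    # needs the msr kernel module (modprobe msr) and root privileges
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, addr))[0]
    finally:
        os.close(fd)

# bits 12:8 of MSR_RAPL_POWER_UNIT encode the energy unit as 1/2**ESU joules
esu = (read_msr(MSR_RAPL_POWER_UNIT) >> 8) & 0x1F
energy_unit = 1.0 / (1 << esu)

before = read_msr(MSR_PP0_ENERGY_STATUS) & 0xFFFFFFFF
time.sleep(1.0)
after = read_msr(MSR_PP0_ENERGY_STATUS) & 0xFFFFFFFF

# the counter is 32 bits wide and wraps, hence the mask on the difference
joules = ((after - before) & 0xFFFFFFFF) * energy_unit
print(f"PP0 (cores) energy over ~1 s: {joules:.3f} J")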
It's possible to measure the accuracy of other RAPL domains, though, including the Package, DRAM, and PSys domains. For example, the accuracy of the Package domain energy estimation can be measured by comparing against the energy consumption of the whole system (which can be measured using a power meter) while running a workload that keeps the energy consumption of everything outside the package as constant as possible. The accuracy of MSR_PKG_ENERGY_STATUS and MSR_DRAM_ENERGY_STATUS has been measured in different ways by different people on many different processors. You can refer to the recent paper entitled RAPL in Action: Experiences in Using RAPL for Power Measurements for more information, which also includes summaries of previous works. The paper covers Sandy Bridge, Ivy Bridge, Haswell, and Skylake. The conclusion is that MSR_PKG_ENERGY_STATUS and MSR_DRAM_ENERGY_STATUS appear to be accurate on Haswell and Skylake (the implementation changed on Haswell; see An Energy Efficiency Feature Survey of the Intel Haswell Processor). But this is not necessarily true for all kinds of workloads, P-states, and processors, so the accuracy does not depend only on the microarchitecture.
The RAPL interface is discussed in Section 14.9 of the Intel Manual Volume 3. I noticed there are errors in the section. For example, it says client processors don't support the DRAM domain, which is not true; the client Haswell processor I'm using to write this answer supports the DRAM domain. The section is probably outdated and applies only to Sandy Bridge and Ivy Bridge processors. I think it's better to read the datasheet of the processor on which you want to use RAPL.
The power/energy-pkg/ perf counter can be used to measure the energy consumption of the package domain. This is the only domain known to be supported on all Intel processors starting from Sandy Bridge.
On x86 systems, these values are based on RAPL (Running Average Power Limit), an interface that provides built-in CPU energy counters. While originally designed by Intel, AMD also provides a compatible interface on Zen systems.
The accuracy depends on the actual microarchitecture. Originally, RAPL was backed by a model with certain biases. On Intel CPUs since the Haswell architecture, it is based on measurements, which are quite accurate. As far as I know, there is no good understanding of the accuracy of AMD's Zen RAPL implementation.
One important thing you have to consider is the scope of the measurements. On most systems, only the package and DRAM domains are covered [1]. So if you need to know how much power or energy your entire system consumes, you usually cannot easily answer that with RAPL.
Also note that RAPL is updated only every 1 ms, so short workloads will see significant errors from the update rate.
[1] Skylake desktop systems can implement full-system RAPL (the PSys domain). Its accuracy depends on the manufacturer.
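If you want to sample RAPL yourself without perf and without touching MSRs, Linux also exposes it through the powercap sysfs tree. A minimal sketch, assuming the intel_rapl driver is loaded and that intel-rapl:0 is the package domain on your system (check its name file):

import time
from pathlib import Path

# the intel_rapl driver exposes RAPL via sysfs, no raw MSR access needed;
# intel-rapl:0 is usually the package domain, but check its 'name' file
domain = Path("/sys/class/powercap/intel-rapl:0")
print("domain:", (domain / "name").read_text().strip())

def energy_uj():
    # cumulative energy in microjoules; wraps at max_energy_range_uj
    return int((domain / "energy_uj").read_text())

e0, t0 = energy_uj(), time.time()
time.sleep(1.0)  # well above the ~1 ms RAPL update period
e1, t1 = energy_uj(), time.time()
print(f"average package power: {(e1 - e0) / (t1 - t0) / 1e6:.2f} W")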

What is the maximum memory per worker that MATLAB can address?

Short version: Is there a maximum amount of RAM / worker, that MATLAB can address?
Long version: My wife uses MATLAB's parallel processing capabilities in data-heavy spatial analyses (I don't really understand it, I just know how to build computers that make her work quicker) and I would like to build a new computer so she can radically reduce her process times.
I am leaning toward something in the 10-16 core range, since prices on such processors seem to be dropping with each generation, and I would like to use 128 GB of RAM, because why not, if you can stomach the cost and see some meaningful time savings?
The number of cores I shoot for will depend on the maximum amount of RAM that MATLAB can address for each worker (if such a limit exists). The computer I built for similar work in 2013 has 4 physical cores (Intel i7-3770K) and 32 GB RAM (which she maxed out), and whatever I build next, I would like to have at least the same memory per core. With 128 GB of RAM as a given, 10 cores would be 12.8 GB/core, 12 cores ~10.7 GB/core, and 16 cores 8 GB/core. I am inclined to maximize cores rather than memory, but since she doesn't know what will benefit her processes the most, I would like to know how realistic those three options are. As for your next question: she has an Nvidia GPU capable of parallel processing, but she believes her processes would not benefit from its CUDA cores.
Thank you for your insights. Many, many Google searches did not yield an answer.

Does a greater decimal value mean higher or lower priority for a process?

For example P1 has priority 2 and P2 has priority 5. Which process has higher priority? Which process will be executed first?
This very much depends on the underlying operating system.
See the scheduler documentation for Windows, for Linux, for Mac OS, and so on. In other words: the answer is that you do some research for the operating system you intend to work with.
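For the two systems asked about at the top of this page: Linux (including Ubuntu) uses nice values from -20 to 19, where a lower value means higher priority, while Windows schedules threads at priority levels 0-31, derived from the process priority class, where a higher value means higher priority. A small cross-platform sketch using the third-party psutil package (an assumed dependency; on Unix alone, the stdlib os.setpriority would also work):

import sys
import psutil  # third-party: pip install psutil

p = psutil.Process()  # the current process

if sys.platform == "win32":
    # Windows: priority classes map to base priorities 0-31,
    # and HIGHER numbers mean HIGHER priority
    p.nice(psutil.HIGH_PRIORITY_CLASS)
else:
    # Unix/Linux: nice values run from -20 to 19,
    # and LOWER numbers mean HIGHER priority
    p.nice(10)  # lower our own priority; raising it needs privileges

print("current priority value:", p.nice())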

TensorFlow: CIFAR-10 multi-GPU example performs worse with more GPUs

I have to test the distributed version of TensorFlow across multiple GPUs.
I ran the CIFAR-10 multi-GPU example on an AWS g2.8x EC2 instance.
Running time for 2000 steps of cifar10_multi_gpu_train.py (code here) was 427 seconds with 1 GPU (flag num_gpu=1). Afterwards, the eval.py script returned precision @ 1 = 0.537.
With the same example running for the same number of steps (with one step being executed in parallel across all GPUs), but using 4 GPUs (flag num_gpu=4), running time was about 530 seconds, and the eval.py script returned only a slightly higher precision @ 1 of 0.552 (maybe due to randomness in the computation?).
Why is the example performing worse with a higher number of GPUs? I used a very small number of steps for testing purposes and was expecting a much higher gain in precision when using 4 GPUs.
Did I miss something or make some basic mistakes?
Did someone else try the above example?
Thank you very much.
The cifar10 example uses variables on CPU by default, which is what you need for a multi-GPU architecture. You may achieve about a 1.5x speedup with 2 GPUs compared to a single-GPU setup.
Your problem also has to do with the dual-GPU architecture of the Nvidia Tesla K80. Each card hosts two GPUs that communicate internally through an on-board PCIe switch, which introduces communication overhead (see the block diagram in Nvidia's K80 documentation).
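To see why the variable placement matters, here is a stripped-down sketch of the tower pattern that cifar10_multi_gpu_train.py uses, written against the TF 1.x API; the model and batch shapes are placeholder stand-ins, not the tutorial's actual network. Because the shared weights live on the CPU, every GPU pulls them, and pushes gradients back, over PCIe on every step, which is exactly where the K80's internal switch adds overhead:

import tensorflow as tf  # TF 1.x style, matching the tutorial

# shared weights live on the CPU, so every GPU reads/writes them over PCIe
with tf.device('/cpu:0'):
    w = tf.get_variable('w', shape=[784, 10])

tower_losses = []
for i in range(4):  # one "tower" per GPU, as in cifar10_multi_gpu_train.py
    with tf.device('/gpu:%d' % i):
        x = tf.random_normal([128, 784])   # placeholder stand-in for a batch
        logits = tf.matmul(x, w)           # pulls w from the CPU
        tower_losses.append(tf.reduce_mean(logits))

# the averaged loss and the variable update also happen on the CPU
loss = tf.add_n(tower_losses) / 4.0
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
# graph construction only; to run with fewer than 4 physical GPUs,
# create the session with allow_soft_placement=True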

32-bit and its relation to RAM?

Does 32-bit mean RAM size should be 4 GB? Or can a computer with, say, 32 GB of RAM also be 32-bit, provided the address space does not exceed 32 bits?
When we say 32-bit Windows or a 64-bit OS, which part of the OS exactly differs between the two? I mean, does some part of the kernel differ? If yes, then which part?
NOTE: this question is not a duplicate; please don't vote to close.
No, 32-bit does not necessarily refer to the size of the address bus. If the address bus is 32-bit, then the maximum RAM in the system is certainly 2^32 bytes, or 4 GB. There have been several examples of 32-bit machines that could exceed 4 GB of RAM, however, by using Physical Address Extension (PAE), which was introduced in the mid-1990s.
Another example where this comes into play is the first IBM PC. It used a 16-bit microprocessor known as the 8088. The 8088 had a 20-bit address bus and as such could address 2^20 bytes (1 MB) of RAM.
When we speak of a microprocessor having a certain number of 'bits', such as a 16-bit microprocessor or a 32-bit microprocessor, we are primarily referring to the basic data unit that the processor can handle at a time. This is determined by the size of the processor registers, which are the areas of the processor used for holding data for calculations and decisions.
Because there is a fundamental difference in how machine code grabs and processes data on a 32-bit versus a 64-bit system, all code must be compiled specifically for the machine you want it to run on. This is why there are two versions of many x86 operating systems: one for 32-bit and one for 64-bit x86. x86 microprocessors have a legacy of backwards compatibility and are therefore able to run in 16-, 32-, or 64-bit modes. This means that you can run 32-bit Windows on a 64-bit processor. If this backwards compatibility weren't built in, however, this would not be possible.
So, as far as which part of the kernel differs, the answer is all of it. The same is true for desktop applications coded for 64-bit machines: if they ship two versions, the entire binary is different, as the compiler targets one or the other.
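Here is the arithmetic from the answer above, plus a quick check of what a running program sees; the pointer-size trick works in any language and is shown here in Python:

import struct
import sys

# address-space sizes for the bus widths discussed above
for bits in (16, 20, 32, 36, 64):
    print(f"{bits:2d}-bit addresses -> 2**{bits} = {2**bits:,} bytes")
# 20-bit (8088):      1,048,576 bytes  =  1 MB
# 32-bit:         4,294,967,296 bytes  =  4 GB
# 36-bit (PAE):  68,719,476,736 bytes  = 64 GB

# a running process can tell which flavor of binary it is:
print("pointer size:", struct.calcsize("P") * 8, "bits")
print("64-bit build?", sys.maxsize > 2 ** 32)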