cpufreq - Is not possible to change "scaling_available_frequencies" - raspberry-pi

I am trying to change scaling_available_frequencies because I am working on a low power system.
I used to do this
$ cat /system/cpu/cpu0/cpufreq/scaling_available_frequencies
600000 1200000
but I want to reduce frequency to less than 600000.
So I tried the echo command to modify scaling_available_frequencies, but I'm stuck.
Is it possible to change scaling_available_frequencies?
Thank you for your help.

General answer: The scaling_available_frequencies is a read only property in userspace. It lists all the available frequencies which cpufreq kernel driver handles. So if you want to have more available frequencies, you first need to ensure that your hardware is capable of it. If there are other frequencies available and they aren't supported yet, you need to extend your driver or device tree (it depends on your particular hardware).
BTW the cpuidle driver might be also helpful if you are reducing the CPU power consumption. But again, this heavily depends on your use-case and hardware as well.

Related

Define the minimal configuration for a program

So I have my .exe ready to deploy, and for distribution, I need to know the minimal requirements for my program to run on a machine... and I really don't know how to do that.
Is there a way to know that ? Some kind of benchmark ? Or must I just set things as I think it'll work ?
Maybe should I just buy all existing components until I find the minimal ? :')
Well, thanks for your answers.
Start by seeing the first Windows' version you can deploy on (Windows XP? Vista?).
If your program is cpu or gpu intensive, and has a fixed time loop (eg. game) then you'll have to do benchmarks.
You should look at several old vs new CPUs/GPUs and trying to "guess" based on online specs posted online what the minimal requirement is. For example, if your program can't run on an old cpu, but runs blazingly fast on a new one, try to find the model that -barely- runs it, which will obviously be one somewhere in the middle.
If your program requires other special things, specify them (eg. USB 3.0, controllers supported...).
Otherwise, if your program loads slower but doesn't have runtime issues, the minimum specs should be indicative of a reasonable loading time (a minute seems to be the standard now, sadly).
Additionally, if your program is memory hungry (both hard drive or RAM), you must indicate this.
For hard drive memory, simply state your program size, along with the files included with it.
For RAM, use a profiler - it will tell you how much memory your program is using.
I've completely skipped over the fact that, in some computers, the bottleneck might be the cpu, and it might be the gpu in others. You need to know which is the bottleneck to make your judgement.
To find out is a rather simple process - remove expensive gpu operations (lower texture resolutions, turn off shaders). If the program still runs slowly, then the bottleneck is the cpu.
edit: this is a simplification of the problem and hardware is a little more complicated than this (slower multi-core cpus vs faster one-core cpus vary their performance depending on how many cores a program uses and how / a program may require a gpu to have less memory but more processing power, or the opposite... even heat dissipation can affect component efficiency: your program might run fine for 20 minutes but start to slow down if the cpu isn't cooled down properly), but "minimal hardware requirements" aren't exactly precise so this method is appropriate.
tl;dr of the spoiler:
In short, there are so many factors that affect performance that you can't measure it, so just a rough estimation is good.

Using a subset of a SUMO scenario for OMNeT++ network simulation (with VEINS)

I'm trying to evaluate an application that runs on a vehicular network using OMNeT++, Veins and SUMO. Because the application relies on realistic traffic behavior, so I decided to use the LuST Scenario, which seems to be the state of the art for such data. However, I'd like to use specific parts of this scenario instead of the entire scenario (e.g., a high and a low traffic load fragment, perhaps others). It'd be nice to keep the bidirectional functionality that VEINS offers, although I'm mostly interested in getting traffic data from SUMO into my simulation.
One obvious way to implement this would be to use a warm-up period. However, I'm wondering if there is a more efficient way -- simulating 8 hours of traffic just to get a several-minute fragment feels inefficient and may be problematic for simulations with sufficient repetitions.
Does VEINS have a built-in mechanism for warm-up periods, primarily one that avoids sending messages (which is by far the most time consuming part in the simulation), or does it have a way to wait for SUMO to advance, e.g., to a specific time stamp (which also avoids creating vehicle objects in OMNeT++ and thus all the initiation code)?
In case it's relevant -- I'm using the latest stable versions of OMNeT++ and SUMO (OMNeT++ 4.6 with SUMO 0.25.0) and my code base is based on VEINS 4a2 (with some changes, notably accepting the TraCI API version 10).
There are two things you can do here for reducing the number of sent messages in Veins:
Use the OMNeT++ Warm-Up Period as described here in the manual. Basically it means to set warmup-period in your .ini file and make sure your code checks this with if (simTime() >= simulation.getWarmupPeriod()). The OMNeT++ signals for result collection are aware of this.
The TraCIScenarioManager offers a variable double firstStepAt #unit("s") which you can use to delay the start of it. Again this can be set in the .ini file.
As the VEINS FAQ states, the TraCIScenarioManagerLaunchd offers two variables to configure the region of interest, based on rectangles or roads (string roiRoads and string roiRects). To reduce the simulated area, you can restrict simulation to a specific rectangle; for example, *.manager.rioRects="1000,1000-3000,3000" simulates a 2x2km area between the two supplied coordinates.
With both solutions (best used in combination) you still have to run SUMO - but Veins barely consums any of the time.

What is responsible for changing core's load and frequency in multicore processor

Having looked for a description of the multicore design i keep finding several diagrams, but all of them look somewhat like this:
I know from looking at i7z command output that different cores can run at different frequencies.
This would suggest that the decisions regarding which core will be given a new process and for changing the frequency of the core itself are done either by the operating system or by the control block of the core itself.
My question is: What controls the frequencies of each individual core? Is the job of associating a READY process with the specific core placed upon the operating system or is it done by something within the processor.
Scheduling processes/threads to cores is purely up to the OS. The hardware has no understanding of tasks waiting to run. Maintaining the OS's list of processes that are runnable vs. waiting for I/O is completely a software thing.
Migrating a thread from one core to another is done by kernel code on the original core storing the architectural state to memory, then OS code on the new core restoring that saved state and resuming user-space execution.
Traditionally, frequency and voltage scaling decisions are made by the OS. Take Linux as an example: The decision-making code is called a governor (and also this arch wiki link came up high on google). It looks at things like how often processes have used their entire time slice on the current core. If the governor decides the CPU should run at a different speed, it programs some control registers to implement the change. As I understand it, the hardware takes care of choosing the right voltage to support the requested frequency.
As I understand it, the OS running on each core makes decisions independently. On hardware that allows each core to run at different frequencies, the decision-making code doesn't need to coordinate with each other. If running a high frequency on one core requires a high voltage chip-wide, the hardware takes care of that. I think the modern implementation of DVFS (dynamic voltage and frequency scaling) is fairly high-level, with the OS just telling the hardware which of N choices it wants, and the onboard power microcontroller taking care of the details of programming oscillators / clock dividers and voltage regulators.
Intel's "Turbo" feature, which opportunistically boosts the frequency above the max sustainable frequency, does the decision making in hardware. Any time the OS requests the highest advertised frequency, the CPU uses turbo when power and cooling allow.
Intel's Skylake takes this a step further: The OS can hand full control over DVFS to the hardware, optionally with constraints. That lets it react from microsecond to microsecond, rather than on a timescale of milliseconds. This does actually allow better performance in bursty workloads, because more power budget is available for turbo when it's useful. A few benchmarks are bursty enough to observe this, like some browser / javascript ones IIRC.
There was a whole talk about Skylake's new power management at IDF2015, check out the slides and/or archived webcast. The old method is described in a lot of detail there, too, to illustrate the difference, so you should really check it out if you want more detail than my summary. (The list of other IDF talks is here, thanks to Agner Fog's blog for the link)
The core frequency is controlled by a given voltage applied to a core's "oscillator".
This voltage can be changed by the Operating System but it can also be changed by the BIOS itself if a high temperature is detected in the CPU.

For a Single Cycle CPU How Much Energy Required For Execution Of ADD Command

The question is obvious like specified in the title. I wonder this. Any expert can help?
OK, this is was going to be a long answer, so long that I may write an article about it instead. Strangely enough, I've been working on experiments that are closely related to your question -- determining performance per watt for a modern processor. As Paul and Sneftel indicated, it's not really possible with any real architecture today. You can probably compute this if you are looking at only the execution of that instruction given a certain silicon technology and a certain ALU design through calculating gate leakage and switching currents, voltages, etc. But that isn't a useful value because there is something always going on (from a HW perspective) in any processor newer than an 8086, and instructions haven't been executed in isolation since a pipeline first came into being.
Today, we have multi-function ALUs, out-of-order execution, multiple pipelines, hyperthreading, branch prediction, memory hierarchies, etc. What does this have to do with the execution of one ADD command? The energy used to execute one ADD command is different from the execution of multiple ADD commands. And if you wrap a program around it, then it gets really complicated.
SORT-OF-AN-ANSWER:
So let's look at what you can do.
Statistically measure running a given add over and over again. Remember that there are many different types of adds such as integer adds, floating-point, double precision, adds with carries, and even simultaneous adds (SIMD) to name a few. Limits: OSs and other apps are always there, though you may be able to run on bare metal if you know how; varies with different hardware, silicon technologies, architecture, etc; probably not useful because it is so far from reality that it means little; limits of measurement equipment (using interprocessor PMUs, from the wall meters, interposer socket, etc); memory hierarchy; and more
Statistically measuring an integer/floating-point/double -based workload kernel. This is beginning to have some meaning because it means something to the community. Limits: Still not real; still varies with architecture, silicon technology, hardware, etc; measuring equipment limits; etc
Statistically measuring a real application. Limits: same as above but it at least means something to the community; power states come into play during periods of idle; potentially cluster issues come into play.
When I say "Limits", that just means you need to well define the constraints of your answer / experiment, not that it isn't useful.
SUMMARY: it is possible to come up with a value for one add but it doesn't really mean anything anymore. A value that means anything is way more complicated but is useful and requires a lot of work to find.
By the way, I do think it is a good and important question -- in part because it is so deceptively simple.

Netlogo High performance Computing

Are there any high performance computing facilites available for running NetLogo behavior space like R servers.
Thanks.
You can use headless mode to run batches of experiments on a cluster/cloud computing platform. This involves simply running an executable so should be compatible with most setups. If you don't have access to a cluster through an institution, I know people use AWS and Google compute. You probably want an instance with many cores, since that allows a single instance of BehaviorSpace to automatically distribute the runs involved in an experiment across multiple processes. Higher processing power of course helps too. You shouldn't need much memory. The n1-highcpu-16 or n1-standard-16 instance types in Google compute looks pretty ideal to me.