Is it normal that multiconfig is very slow to parse? - yocto

I have tried using the multiconfig setup for 6 machines, but parsing is extremely slow, around 7 times slower than a single-machine setup.
I tested by just parsing (-p) one image for one machine on both setups.
Is it possible some optimizations were not done and the feature is not completely ready?
Thanks

Multiconfig allows the setups to build in parallel, and the seventh setup is likely the default "no multiconfig" one, which many people overlook when counting them.
Multiconfig is not expected to improve parsing speed; it's designed to let you run multiple builds in parallel from a single command. Parsing is therefore not something that has been optimized, nor is it planned to be, at least as things stand, since that isn't an area where we can see optimizations to make.

Related

How does on-demand pruning of the kernel work in the context of a Unikernel?

In a unikernel computing environment, the kernel can sometimes be pruned to target the needs of a specific user application.
How does this process work? Is it manual, or can it be automated?
There is no 'pruning' of the kernel. To be clear, unikernel implementations don't use Linux - they have written their own kernels (whether they want to admit to that or not).
There are well over 10 different unikernel implementations out there today, and some are very focused on providing the absolute bare minimum needed to work. Others, such as Nanos, which I work on, are in the 'POSIX' category, meaning that if it runs on Linux it'll probably run here too.
Whenever you see references to people talking about 'only including what you need' and picking certain libraries to use, keep in mind that most of these systems are not much more than a hello world - which is simple enough to do in ~20 lines of assembly. When you start trying to run real-world applications, especially ones you use every day (did you write your own database? did you write your own webserver?), and compare things like performance, you start running into actual operating system components that you can't just pick and choose.
Having said that, supporting the set of syscalls that Linux has is only a portion of the battle. There is a lot of code that goes into a general-purpose operating system like Linux (north of 20M lines of code, and that's a 'small' kernel), Windows, or macOS that is not readily recognizable by its end users. The easy things to point at are picking which network driver to use depending on which hypervisor you want to run on (Xen/KVM/Hyper-V).
The harder, less obvious things are choices such as: APIC or x2APIC? How many levels do your page tables have? Do you support SMP or not? If so, how?
You can't just 'prune' this type of stuff away. It needs to be consciously added, and even if you did have a Linux that you were 'unikernelizing', you wouldn't be able to prune it because it touches too much code. As a thought exercise, try removing support for multiple processes (no unikernel supports this). Now you are touching schedulers, shared memory, message passing, user privileges (are there users?), etc. - the list goes on forever.
Yours is actually a very common question, and it highlights a misunderstanding of how these things work in real life.
So, to answer your question: there is no work that I'm aware of where people are trying to automatically 'prune' the kernel, and even the projects that let you select functionality via a config file or similar are not something I would expect to see much progress in, for the reasons above.

Parallel processing of input/output, queries, and indexes on AS400

IBM V6.1
When using System i Navigator and you click System Values, the parallel processing setting (QQRYDEGREE) is displayed.
By default, "Do not allow parallel processing" is selected.
What will the impact on programs be when you choose to allow multiple processes? We have a lot of RPG IV programs and SQL queries being executed, and I think it will increase performance.
Basically, I want to turn this on in the production environment, but I am not sure whether I will break anything by doing so - for example, input or output of different programs running in parallel, or data getting out of sequence.
I did do some research:
https://publib.boulder.ibm.com/iseries/v5r2/ic2924/index.htm?info/rzakz/rzakzqqrydegree.htm
I understand each option, but I do not know the risk of changing it from the default to allow multiple processes.
First off, in order to get the most out of *MAX and *OPTIMIZE, you'd need a system with more than one core (enabled for IBM i / DB2) along with the DB2 Symmetric Multiprocessing (SMP) (57xx-SS1 option 26) licensed program installed, thus allowing the system to use SMP for queries and index builds.
For *IO, the system can use multiple tasks via simultaneous multithreading (SMT), even on a single-core POWER5 or later box. SMT is enabled via the Processor multitasking (QPRCMLTTSK) system value.
You're unlikely to "break" anything by changing the value, as long as your applications don't make bad assumptions about result set ordering. For example, CPYxxxIMPF makes use of SQL behind the scenes; with anything but *NONE you might end up with the rows in your DB2 table in a different order from the rows in the import file.
You will most certainly increase CPU usage. This is not a bad thing, unless you're currently pushing 90%+ CPU usage regularly. If you're only using 50% of your CPU, it's probably a good thing to make use of SMT/SMP to provide better response time, even if it increases CPU utilization to 60%.
Having said that, here's a story of it being a problem... http://archive.midrange.com/midrange-l/200304/msg01338.html
Note that in the above case, the OP was pre-building work tables at sign-on in order to minimize the wait when it was time to use them. A great idea 20 years ago with single-threaded systems. Today, the alternative would be to take advantage of SMP/SMT and build only what's needed, when it's needed.
As you note in a comment, this kind of change is difficult to test in non-production environments since workloads in DEV & TEST are different. So it's important to collect good performance data before & after the change. You might also consider moving it in stages: *NONE --> *IO --> *OPTIMIZE, and then *MAX if you wish. I'd spend at least a month at each level if you have periodic month-end jobs.
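If you want to experiment before touching the system-wide QQRYDEGREE value, one hedged option is to override the degree for a single job or connection only. The sketch below assumes the IBM Toolbox for Java (jt400) JDBC driver and the SQL CURRENT DEGREE special register; the host name and credentials are placeholders, and the exact set of accepted values can vary by release, so treat it as an illustration rather than a recipe.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DegreeExperiment {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details for the jt400 driver.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:as400://MYSYSTEM", "MYUSER", "MYPASSWORD");
             Statement stmt = conn.createStatement()) {
            // Allow parallel processing for this connection only,
            // leaving the QQRYDEGREE system value untouched.
            stmt.execute("SET CURRENT DEGREE = 'ANY'");
            // ... run the queries you want to measure here ...
        }
    }
}

The same per-job idea is available from CL via CHGQRYA DEGREE(*OPTIMIZE), which limits the blast radius while you gather before/after performance data.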

How does logical indexing work?

In some high-level languages like Matlab, you can use "logical indexing" to select a whole set of entries in an array for operating on.
I understand what logical indexing is and how to use it.
Instead, I am asking:
How does it work ("behind the scenes")?
Does it not boil down to just a for-loop?
If so, why is it so much faster than for-looping?
Interpreted languages can be thought of as a variation on assembler running on an emulated core. They have stacks and commands that work like the assembler's without actually being the assembler. They are a virtual machine.
A for loop can be thought of as telling the system: set a value, run a sequence of tasks, and when you are done come back and check on that value. If it is not at a threshold, change it in a prescribed way, then go repeat those tasks and come back again. In assembler you are running screaming fast, but in the "VM" not so much. Consider the demonstration between 13:50 and 15:30 of this link: (link)
This means that what appears to be a for loop isn't actually a for loop. It is operating system interrupts and virtualized memory. It is virus scans in the background and megasloth bloatware.
If you had a virtual system, could you make a shortcut for addressing memory that didn't use the virtualized for loop and was reasonably efficient? MATLAB tries to major on data processing, so it has to have very efficient ways of storing, sorting, and selecting data within its virtual machine.
MathWorks is not going to make the details of this accessible to the public. If it has a great idea, then it doesn't want it implemented in Python and R tomorrow. If it has a mediocre idea, then it doesn't want to be beaten in execution by Python and R tomorrow. Either way, making the nuts and bolts of that particular approach accessible to the public without an NDA is likely a losing proposition for them.
Bottom lines:
it's not a real "for", even for a for loop, because it's running virtually
they are opening up some of the internals of their data handling to improve usability
they aren't likely to disclose actual code because of negative business consequences
It is worth noting that vectorized code can outperform for loops while doing the same thing. This means they are likely applying more of those internals to the execution of the "sequence of tasks" to get the performance improvement.
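To make the "it is still a loop, just not an interpreted one" point concrete, here is a rough Java sketch of what a logical-indexing selection boils down to internally: one tight pass over a boolean mask, executed in compiled code, rather than an interpreted loop with per-iteration bookkeeping. MATLAB's actual implementation is not public, so this is purely a conceptual illustration.

public final class LogicalIndexingSketch {
    // Roughly what data(mask) has to do: collect the elements of data
    // where the corresponding mask entry is true.
    static double[] select(double[] data, boolean[] mask) {
        int count = 0;
        for (boolean m : mask) {
            if (m) count++;
        }
        double[] out = new double[count];
        int j = 0;
        for (int i = 0; i < data.length; i++) {
            if (mask[i]) {
                out[j++] = data[i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] data = {3.0, -1.0, 4.0, -1.5, 5.0};
        boolean[] mask = {true, false, true, false, true};
        // Prints [3.0, 4.0, 5.0] - the "logically indexed" subset.
        System.out.println(java.util.Arrays.toString(select(data, mask)));
    }
}

The speed difference comes from where the loop runs: here the whole selection is a single call into optimized code, whereas an interpreted for loop pays interpreter and bookkeeping overhead on every iteration.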

Instruction detection at run-time

I would like to identify and analyze the different machine instructions executed, and the clock cycles required for each of them, throughout a run of the code.
Is there any way to do this simply? Dynamic binary translation might be a way, but I am looking for an easier mechanism.
Thanks in advance
If you are programming, consider using a performance analysis tool such as a profiler, for example Intel VTune (http://en.wikipedia.org/wiki/VTune) or OProfile.
It is much less common for most programmers to have access to a cycle accurate simulator, although in the embedded space it is quite common.
Dynamic binary translation is probably NOT a good way to measure your program at individual-instruction granularity. DBT tools like http://www.pintool.org/ do allow you to insert code to read timers, but doing this around individual instructions is way too slow, and the instrumentation adds too much overhead. Doing it at function granularity can be okay; basic-block granularity, i.e. every branch, is borderline.
Bottom line: try a profiling tool like VTune first. Then go looking for a cycle accurate simulator.

Using Drools in a heavy batch process

We used Drools as part of a solution to act as a sort of filter in a very intense processing application, maybe running up to 100 rules on 500,000+ working memory objects.
It turns out that it is extremely slow.
Does anybody else have any experience using Drools in a batch-type processing application?
It kind of depends on your rules. 500K objects is reasonable given enough memory (it has to populate a RETE network in memory, so memory usage is a multiple of the 500K objects, i.e. space for the objects plus space for the network structure, indexes, etc.); it's possible you are paging to disk, which would be really slow.
Of course, if you have rules that match combinations of the same type of fact, that can cause an explosion of combinations to try, which will be really, really slow even if you have only one rule.
If you have any more information on the analysis you are doing, that would probably help with possible solutions.
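To make the combination explosion concrete, here is a back-of-the-envelope sketch (plain Java arithmetic, not Drools API): a single rule whose conditions join two facts of the same type, with no constraint the engine can index on, gives on the order of n * (n - 1) candidate pairs to evaluate in the worst case.

public class PairExplosion {
    public static void main(String[] args) {
        long n = 500_000L;                  // facts in working memory
        long candidatePairs = n * (n - 1);  // worst-case join combinations
        System.out.println(candidatePairs); // 249999500000, i.e. ~2.5e11
    }
}

Indexable equality constraints let the engine prune most of those pairs, which is one reason how the rules are written matters so much here.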
I've used Drools with a stateful working memory containing over 1M facts. With some tuning of both your rules and the underlying JVM, performance can be quite good after a few minutes of initial start-up. Let me know if you want more details.
I haven't worked with the latest version of Drools (last time I used it was about a year ago), but back then our high-load benchmarks proved it to be utterly slow. A huge disappointment after having based much of our architecture on it.
One good thing I remember about Drools is that their dev team was available on IRC and very helpful. You might give them a try; they're the experts, after all: irc.codehaus.org #drools
I'm just learning Drools myself, so maybe I'm missing something, but why is the whole batch of five hundred thousand objects added to working memory at once? The only reason I can think of is that there are rules that kick in only when two or more items in the batch are related.
If that isn't the case, then perhaps you could use a stateless session and assert one object at a time; I assume the rules will run much faster in that case (see the sketch after this answer).
Even if it is the case, do all your rules need access to all 500k objects? Could you speed things up by applying per-item rules one at a time, and then in a second phase of processing apply batch level rules using a different rulebase and working memory? This would not change the volume of data, but the RETE network would be smaller because the simple rules would have been removed.
An alternative approach would be to try and identify the related groups of objects and assert the objects in groups during the second phase, further reducing the volume of data in working memory as well as splitting up the RETE network.
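As a concrete illustration of the stateless-session suggestion above, here is a minimal sketch using the Drools 6+ KIE API (older releases have an equivalent StatelessSession built from a RuleBase). The session name "filter-rules" and the use of the classpath container are assumptions - adjust them to your own kmodule setup - and whether per-object execution is faster overall still depends on how much session setup costs in your version.

import java.util.List;

import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.StatelessKieSession;

public class BatchFilter {
    private final StatelessKieSession session;

    public BatchFilter() {
        KieServices ks = KieServices.Factory.get();
        KieContainer container = ks.getKieClasspathContainer();
        // "filter-rules" is a placeholder session name from kmodule.xml.
        this.session = container.newStatelessKieSession("filter-rules");
    }

    // Fire the rules against each object on its own; working memory is
    // populated, fired and discarded per call, so it never holds 500K facts.
    public void process(List<?> batch) {
        for (Object fact : batch) {
            session.execute(fact);
        }
    }
}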
Drools is not really designed to be run on a huge number of objects. It's optimized for running complex rules on a few objects.
The working memory initialization for each additional object is too slow and the caching strategies are designed to work per working memory object.
Use a stateless session and add the objects one at a time?
I had problems with OutOfMemoryError after parsing a few thousand objects. Setting a different default MVEL optimizer solved the problem:
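// Assumes org.mvel2.optimizers.OptimizerFactory: this switches MVEL from its
// default JIT (ASM bytecode-generating) optimizer to the purely reflective one.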
OptimizerFactory.setDefaultOptimizer(OptimizerFactory.SAFE_REFLECTIVE);
We were looking at Drools as well, but for us the number of objects is low, so this isn't an issue. I do remember reading that there are alternative versions of the same algorithm that take memory usage more into account and are optimized for speed while still being based on the same algorithm. I'm not sure whether any of them have made it into a real, usable library, though.
This optimizer can also be selected by using the JVM parameter:
-Dmvel2.disable.jit=true