How to use the multicore option in vowpal wabbit

I am running vowpal wabbit (ksvm option) on a single machine (8 core Macbook pro). How can I make it use all the 8 cores?
From the linked videos (videolectures.net) on the github page I see there is a --thread-bits option to control the number of threads, but vw --help doesn't list this option, so I guess it's not there any more in the latest version.
What is the right way to use the multicore capabilities of vowpal wabbit? I don't want to run it over multiple nodes; I am interested in using the multicore capabilities of a single machine.

Is vowpal-wabbit "multi-core"?
Only partially. It uses 2 cores by default (using C++ std::thread):
IO/Parsing thread: murmur-hash3 for hashing features, fast-atof for parsing numerics, & parse-example
Learning thread (SGD predict, estimate-error & update loop)
Neither thread makes examples appear out of order, as real parallelization might. Examples are all processed sequentially, but in a (short) parallel pipeline.
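To illustrate the idea of a short two-stage pipeline that keeps examples in order, here is a minimal Python sketch. It is not vowpal wabbit's code: parse_example and run_pipeline are hypothetical names, and the "learner" only records what it receives. The point is that one thread parses while another consumes, connected by a bounded FIFO queue, so the order of examples is preserved.

```python
import queue
import threading

def parse_example(line):
    """Stand-in for the parsing stage: split a 'label feature ...' line."""
    label, *features = line.split()
    return float(label), features

def run_pipeline(lines, maxsize=4):
    """Parse on one thread and 'learn' on another, linked by a FIFO queue."""
    examples = queue.Queue(maxsize=maxsize)  # bounded, so the pipeline stays short
    processed = []
    done = object()  # sentinel marking the end of the input

    def parser():
        for line in lines:
            examples.put(parse_example(line))
        examples.put(done)

    def learner():
        while True:
            item = examples.get()
            if item is done:
                break
            label, features = item
            # A real learner would predict and update weights here.
            processed.append((label, len(features)))

    threads = [threading.Thread(target=parser), threading.Thread(target=learner)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed
```

Because there is a single producer and a single consumer on a FIFO queue, run_pipeline(["1 a b c", "0 d e", "1 f"]) returns the examples in input order.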
Disabling multi-threading (switch to single core):
Use the option: --onethread
So the overall answer to your question is that the present options don't let you parallelize beyond 2 cores.
Cluster mode aka all-reduce
As you mentioned, there's the cluster-mode which supports data-partitioning and processing each part on a separate node in a cluster.
--thread-bits ?
Grepping the (latest) source code, I can't find any reference to --thread-bits.

Related

Is Google OR Tools TSP parallel by default?

I use the routing library a lot and am wondering if OR Tools uses all available cores on a particular machine by default. For example, when solving an integer program in Gurobi, it shows the number of cores available and the number of threads it uses automatically. How can we find that out when using the routing functions in Google OR Tools?
No. TSP is sequential.
CP-SAT is parallel by default.

Running Dymola parallel on the cluster

I am trying to run Dymola on the cluster so that everyone in my research group could submit a model and simulate jobs, is it possible to run Dymola on a cluster and utilize the power of HPC?
I could use some flags to make Dymola run parallel on a many-cores computer, but how to run a parallel simulation on many computers?
Parallelization on a single computer:
Parallelizing a Modelica model is possible, but the model needs to be
suitable by nature (which doesn't happen too often, at least in my experience), for some examples where it works well see e.g. here
modified manually by the modeler to allow parallelization, e.g. by introducing delay blocks, see here or some similar approach here.
Often Dymola will output No Parallelization in the translation log, presumably due to the model not allowing parallelization (efficiently). Also the manual states: It should be noted that for many kinds of models the internal dependencies don’t allow efficient parallelization for getting any substantial speed-up.
I'm not an expert on this, but as far as I understand, HPC depends on massive parallelization. Therefore, models generated by Dymola do not seem to be a very good fit for HPC clusters.
Dymola on multiple computers:
Running a single Dymola-simulation on multiple computers in parallel is not possible as far as I know.
I think there are several answers to this question.
The flags under Translation all refer to parallelized code inside a
single simulation executable. If you mention HPC I do not think you
need to consider this.
To run multiple simulations on a single
multi-core computer there is built-in support in Dymola. The relevant
function is simulateModelMulti, etc. The Sweep Parameters feature uses this automatically.
There is no built-in support
to distribute the simulation on several computers in a cluster.
However, if you generate your dymosim.exe with the Binary Model
Export option enabled, it can be run on other computers. You need to
distribute dymosim.exe, dsin.txt and any data files you read across
the cluster. I'm sure your HPC cluster has tools for that.
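As a rough sketch of the staging step described above, the following Python code copies the executable and its inputs into one private directory per run and then starts the runs concurrently. Everything here is an assumption for illustration: prepare_run, launch and run_all are hypothetical helpers, and the way dymosim.exe is invoked (started in a directory that contains dsin.txt, with no extra arguments) should be checked against the Dymola manual. On a real cluster the launch step would be replaced by your scheduler's job submission.

```python
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def prepare_run(source_dir, work_root, run_name, files):
    """Copy the executable and its input files into a private run directory."""
    run_dir = Path(work_root) / run_name
    run_dir.mkdir(parents=True, exist_ok=True)
    for name in files:
        shutil.copy2(Path(source_dir) / name, run_dir / name)
    return run_dir

def launch(run_dir, exe_name="dymosim.exe", args=()):
    """Start one simulation inside its own directory and return the exit code.

    Assumption: the executable picks up dsin.txt from its working directory.
    """
    run_dir = Path(run_dir).resolve()
    command = [str(run_dir / exe_name), *args]
    return subprocess.run(command, cwd=run_dir).returncode

def run_all(run_dirs, max_workers=4, runner=launch):
    """Run every prepared directory, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, run_dirs))
```

Giving each run its own directory avoids the runs overwriting each other's result files, which is the main thing to get right whichever tool ends up distributing the jobs.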

Parallel processing input/output, queries, and indexes AS400

IBM V6.1
When using System i Navigator and clicking System values, the following is displayed.
By default, the Do not allow parallel processing option is selected.
What will the impact be on processing in programs when you choose multiple processes? We have a lot of RPG IV programs and SQL queries being executed, and I think it will increase performance.
Basically, I want to turn this on in the production environment, but I am not sure whether I will break anything by doing this, for example the input or output of different programs running in parallel, or data getting out of sequence.
I did do some research :
https://publib.boulder.ibm.com/iseries/v5r2/ic2924/index.htm?info/rzakz/rzakzqqrydegree.htm
I understand each option, but I do not know the risk of changing it from the default to multiple.
First off, in order to get the most out of *MAX and *OPTIMIZE, you'd need a system with more than one core (enabled for IBM i / DB2) along with the DB2 Symmetric Multiprocessing (SMP) (57xx-SS1 option 26) license program installed; thus allowing the system to use SMP for queries and index builds.
For *IO, the system can use multiple tasks via simultaneous multithreading (SMT) even on a single core POWER 5 or higher box. SMT is enabled via the Processor multi tasking (QPRCMLTTSK) system value.
You're unlikely to "break" anything by changing the value, as long as your applications don't make bad assumptions about result set ordering. For example, CPYxxxIMPF makes use of SQL behind the scenes; with anything but *NONE you might end up with the rows in your DB2 table in a different order from the rows in the import file.
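The ordering caveat is general SQL behaviour: without an ORDER BY, the order of a result set is not guaranteed. The sketch below uses Python's built-in sqlite3 module rather than DB2 for i, and the table and column names (import_data, line_no, payload) are made up for the example. It shows the defensive pattern of keeping an explicit sequence column and sorting on it, instead of relying on the order in which rows happened to be inserted.

```python
import sqlite3

def load_rows(rows):
    """Create an in-memory table and insert (line_no, payload) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE import_data (line_no INTEGER, payload TEXT)")
    conn.executemany("INSERT INTO import_data VALUES (?, ?)", rows)
    return conn

def read_in_file_order(conn):
    """Return rows in original file order by sorting on the sequence column."""
    cur = conn.execute(
        "SELECT line_no, payload FROM import_data ORDER BY line_no"
    )
    return cur.fetchall()
```

Even if the rows arrive as [(3, "c"), (1, "a"), (2, "b")], read_in_file_order returns them sorted by line_no, so a program written this way does not care whether the load ran in parallel.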
You will most certainly increase the CPU usage. This is not a bad thing; unless you're currently pushing 90% + CPU usage regularly. If you're only using 50% of your CPU, it's probably a good thing to make use of SMT/SMP to provide better response time even if it increases the CPU utilization to 60%.
Having said that, here's a story of it being a problem... http://archive.midrange.com/midrange-l/200304/msg01338.html
Note that in the above case, the OP was pre-building work tables at sign on in order to minimize the wait when it was time to use them. Great idea 20 years ago with single threaded systems. Today, the alternative would be to take advantage of SMP/SMT and build only what's needed when needed.
As you note in a comment, this kind of change is difficult to test in non-production environments since workloads in DEV & TEST are different. So it's important to collect good performance data before & after the change. You might also consider moving it in stages: *NONE --> *IO --> *OPTIMIZE and then *MAX if you wish. I'd spend at least a month at each level, if you have periodic month end jobs.

How are Scala 2.9 parallel collections working behind the scenes?

Scala 2.9 introduced parallel collections. They are a really great tool for certain tasks. However, how do they work internally and am I able to influence the behavior/configuration?
What method do they use to figure out the optimal number of threads? If I am not satisfied with the result are there any configuration parameters to adjust?
I'm not only interested in how many threads are actually created; I am also interested in how the actual work is distributed amongst them, how the results are collected, and how much magic is going on behind the scenes. Does Scala somehow test if a collection is large enough to benefit from parallel processing?
Briefly, there are two orthogonal aspects to how your operations are parallelized:
The extent to which your collection is split into chunks (i.e. the size of the chunks) for a parallelizable operation (such as map or filter)
The number of threads to use for the underlying fork-join pool (on which the parallel tasks are executed)
For #2, this is managed by the pool itself, which discovers the "ideal" level of parallelism at runtime (see java.lang.Runtime.getRuntime.availableProcessors)
For #1, this is a separate problem and the scala parallel collections API does this via the concept of work-stealing (adaptive scheduling). That is, when a particular piece of work is done, a worker will attempt to steal work from other work-queues. If none is available, this is an indication that all of the processors are very busy and hence a bigger chunk of work should be taken.
Aleksandar Prokopec, who implemented the library gave a talk at this year's ScalaDays which will be online shortly. He also gave a great talk at ScalaDays2010 where he describes in detail how the operations are split and re-joined (there are a number of issues that are not immediately obvious and some lovely bits of cleverness in there too!).
A more comprehensive answer is available in the PDF describing the parallel collections API.
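To make the two aspects concrete (chunk size and pool size), here is a simplified Python sketch. It is not how the Scala library is implemented: it uses static chunking instead of work-stealing, and split and parallel_map are hypothetical names. The pool size defaults to os.cpu_count(), which plays the role of availableProcessors, and the partial results are re-joined in their original order.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def split(items, chunk_size):
    """Cut a list into consecutive chunks of at most chunk_size elements."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

def parallel_map(func, items, chunk_size=None, workers=None):
    """Apply func to every element, one chunk per task, preserving order."""
    workers = workers or os.cpu_count() or 1
    if chunk_size is None:
        # One chunk per worker by default (ceiling division).
        chunk_size = max(1, -(-len(items) // workers))
    chunks = split(items, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in submission order, so joining is trivial.
        mapped = pool.map(lambda chunk: [func(x) for x in chunk], chunks)
        return [y for part in mapped for y in part]
```

With fixed chunks, one slow chunk can leave the other workers idle; adaptive schemes such as the work-stealing described above exist to avoid exactly that. Note also that in CPython threads will not speed up CPU-bound functions, so this only illustrates the structure.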

Clarifications in Electric commander and tutorial

I was searching for tutorials on Electric Cloud on the net but found nothing. I also could not find good blogs dealing with it. Can somebody point me in the right direction?
Also, we are planning on using Electric Cloud for executing Perl scripts in parallel. We are not going to build software. We are trying to test our hardware in parallel by executing the same Perl script in parallel using ElectricCommander. But I think ElectricCommander might not be the right tool given its cost. Can you suggest some of the pros and cons of using ElectricCommander for this, and any other features which might be useful for our testing?
Thanks...
RE #1: All of the ElectricCommander documentation is located in the Electric Cloud Knowledge Base located at https://electriccloud.zendesk.com/entries/229369-documentation.
ElectricCommander can also be a valuable application to drive your tests in parallel. Here are just a few aspects for consideration:
Subprocedures: With EC, you can just take your existing scripts, drop them into a procedure definition and call that procedure multiple times (concurrently) in a single procedure invocation. If you want, you can further decompose your scripts into more granular subprocedures. This will drive reuse, lower cost of administration, and it will enable your procedures to run as fast as possible (see parallelism below).
Parallelism: Enabling a script to run in parallel is literally as simple as checking a box within EC. I'm not just referring to running 2 procedures at the same time without risk of data collision. I'm referring to the ability to run multiple steps within a procedure concurrently. Coupled with the subprocedure capability mentioned above, this enables your procedures to run as fast as possible as you can nest subprocedures within other subprocedures and enable everything to run in parallel where the tests will allow it.
Root-cause Analysis: Tests can generate an immense amount of data, but often only the failures, warnings, etc. are relevant (tell me what's broken). EC can be configured to look for very specific strings in your test output and will produce diagnostics based on that configuration. So if your test produces a thousand lines of output, but only 5 lines reference errors, EC will automatically highlight those 5 lines for you. This makes it much easier for developers to quickly identify the root cause.
Results Tracking: ElectricCommander's properties mechanism allows you to store any piece of information that you determine to be relevant. These properties can be associated with any object in the system whether it be the procedure itself or the job that resulted from the invocation of a procedure. Coupled with EC's reporting capabilities, this means that you can produce valuable metrics indicating your overall project health or throughput without any constraint.
Defect Tracking Integration: With EC, you can automatically file bugs in your defect tracking system when tests fail or you can have EC create a "defect triage report" where developers/QA review the failures and denote which ones should be auto-filed by EC. This eliminates redundant data entry and streamlines overall software development.
In short, EC will behave exactly the way you want it to. It will not force you to change your process to fit the tool. As far as cost goes, Electric Cloud provides a version known as ElectricCommander Workgroup Edition for cost-sensitive customers. It is available for a small annual subscription fee and is something that you may want to follow up on.
I hope this helps. Feel free to contact your account manager or myself directly if you have additional questions (dfarhang#electric-cloud.com).
Maybe you could execute the same perl script on several machines by using r-commands, or cron, or something similar.
To further address the parallel aspect of your question:
The command-line interface lets you write scripts to construct
procedures, including this kind of subprocedure with parallel steps.
So you are not limited, in the number of parallel steps, to what you
wrote previously: you can write a procedure which dynamically sizes
itself to (for example) the number of steps you would like to run in
parallel, or the number of resources you have to run steps in
parallel.
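Independently of any particular tool, the idea of a set of parallel steps that sizes itself to the number of targets or resources can be sketched in plain Python. This does not use the ElectricCommander command-line interface: build_steps, run_step and run_parallel_steps are hypothetical names, and the Python interpreter is used as a stand-in for the Perl test script so the example runs anywhere. In practice each command would be something like ["perl", "your_test.pl", target].

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def build_steps(targets):
    """Create one command per target; the step count follows the target list."""
    # Stand-in for the real test script: a tiny program that echoes its target.
    script = "import sys; print('tested', sys.argv[1])"
    return [[sys.executable, "-c", script, target] for target in targets]

def run_step(command):
    """Run one step and return (exit code, captured output)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout.strip()

def run_parallel_steps(commands, max_parallel):
    """Run all steps with at most max_parallel running at the same time."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_step, commands))
```

Here max_parallel corresponds to the number of resources you have available, while the length of the target list determines how many steps are generated.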