Is Google OR Tools TSP parallel by default? - or-tools

I use the routing library a lot and am wondering if OR Tools uses all available cores on a particular machine by default. For example, when solving an integer program in Gurobi, it shows the number of cores available and the number of threads it uses automatically. How can we find that out when using the routing functions in Google OR Tools?

No. The routing library's search is sequential; it runs on a single core by default.
CP-SAT, on the other hand, is parallel by default.

Related

Running Dymola parallel on the cluster

I am trying to run Dymola on a cluster so that everyone in my research group can submit model and simulation jobs. Is it possible to run Dymola on a cluster and utilize the power of HPC?
I can use some flags to make Dymola run in parallel on a many-core computer, but how do I run a parallel simulation across many computers?
Parallelization on a single computer:
Parallelizing a Modelica model is possible, but the model needs to be either
suitable by nature (which, at least in my experience, doesn't happen too often); for some examples where it works well, see e.g. here, or
modified manually by the modeler to allow parallelization, e.g. by introducing delay blocks (see here) or some similar approach (here).
Often Dymola will output No Parallelization in the translation log, presumably due to the model not allowing parallelization (efficiently). Also the manual states: It should be noted that for many kinds of models the internal dependencies don’t allow efficient parallelization for getting any substantial speed-up.
I'm not an expert on this, but to my understanding HPC depends on massive parallelization. Therefore, models generated by Dymola do not seem to be a good fit for HPC clusters.
Dymola on multiple computers:
Running a single Dymola-simulation on multiple computers in parallel is not possible as far as I know.
I think there are several answers to this question.
1. The flags under Translation all refer to parallelized code inside a single simulation executable. If you mention HPC, I do not think you need to consider this.
2. To run multiple simulations on a single multi-core computer, there is built-in support in Dymola. The relevant function is simulateModelMulti, etc. The Sweep Parameters feature uses this automatically.
3. There is no built-in support to distribute a simulation over several computers in a cluster. However, if you generate your dymosim.exe with the Binary Model Export option enabled, it can be run on other computers. You need to distribute dymosim.exe, dsin.txt, and any data files you read across the cluster. I'm sure your HPC cluster has tools for that.

How to use the multicore option in vowpal wabbit

I am running vowpal wabbit (ksvm option) on a single machine (8 core Macbook pro). How can I make it use all the 8 cores?
From the videos linked (videolectures.net) on the GitHub page, I see there is a --thread-bits option to control the number of threads, but vw --help doesn't list this option, so I guess it's no longer there in the latest version.
What is the right way to use the multicore capabilities of vowpal wabbit? I don't want to run it over multiple nodes, but I am interested in using the multicore capabilities on a single machine.
Is vowpal-wabbit "multi-core"?
Only partially. It uses 2 cores by default (via C++ std::thread):
an IO/parsing thread: murmur-hash3 for hashing features, fast-atof for parsing numerics, and parse-example
a learning thread: the SGD predict, estimate-error & update loop
Neither of these makes examples appear out of order, as real parallelization might. Data examples are all processed sequentially, but in a (short) parallel pipeline.
Disabling multi-threading (switch to single core):
Use the option: --onethread
So the overall answer to your question is that the present options don't let you parallelize beyond 2 cores.
Cluster mode aka all-reduce
As you mentioned, there's the cluster-mode which supports data-partitioning and processing each part on a separate node in a cluster.
--thread-bits ?
Grepping the (latest) source code, I can't find any reference to --thread-bits.

Parallel processing input/output, queries, and indexes AS400

IBM V6.1
When using System i Navigator and you click System values, the following is displayed.
By default, Do not allow parallel processing is selected.
What will the impact be on processing in programs when we choose multiple processes? We have a lot of RPG IV programs and SQL queries being executed, and I think it will increase performance.
Basically, I want to turn this on in the production environment, but I am not sure if I will break anything by doing so: for example, input or output of different programs running in parallel, or data getting out of sequence?
I did do some research :
https://publib.boulder.ibm.com/iseries/v5r2/ic2924/index.htm?info/rzakz/rzakzqqrydegree.htm
I understand each option, but I do not know the risk of changing it from the default to multiple.
First off, in order get the most out of *MAX and *OPTIMIZE, you'd need a system with more than one core (enabled for IBM i / DB2) along with the DB2 Symmetric Multiprocessing (SMP) (57xx-SS1 option 26) license program installed; thus allowing the system to use SMP for queries and index builds.
For *IO, the system can use multiple tasks via simultaneous multithreading (SMT), even on a single-core POWER5 or newer box. SMT is enabled via the Processor multi tasking (QPRCMLTTSK) system value.
You're unlikely to "break" anything by changing the value, as long as your applications don't make bad assumptions about result set ordering. For example, CPYxxxIMPF makes use of SQL behind the scenes; with anything but *NONE you might end up with the rows in your DB2 table in a different order from the rows in the import file.
You will most certainly increase the CPU usage. This is not a bad thing unless you're currently pushing 90%+ CPU usage regularly. If you're only using 50% of your CPU, it's probably a good thing to make use of SMT/SMP to provide better response time, even if it increases the CPU utilization to 60%.
Having said that, here's a story of it being a problem... http://archive.midrange.com/midrange-l/200304/msg01338.html
Note that in the above case, the OP was pre-building work tables at sign on in order to minimize the wait when it was time to use them. Great idea 20 years ago with single threaded systems. Today, the alternative would be to take advantage of SMP/SMT and build only what's needed when needed.
As you note in a comment, this kind of change is difficult to test in non-production environments, since workloads in DEV & TEST are different. So it's important to collect good performance data before and after the change. You might also consider moving in stages: *NONE --> *IO --> *OPTIMIZE, and then *MAX if you wish. I'd spend at least a month at each level if you have periodic month-end jobs.
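For reference, each step of that staged change can be made with the CHGSYSVAL CL command. This is shown as an illustration (here the *IO step); verify the value names against your release's documentation:

```
CHGSYSVAL SYSVAL(QQRYDEGREE) VALUE(*IO)
```

If you want to experiment on a single job before committing the system-wide value, the CHGQRYA command's DEGREE parameter accepts the same options, which gives you a lower-risk way to test.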

record virtual memory use in netlogo over time

I am running netlogo on a HPC cluster, and I wondered if there is any way to output-print the java heap used over time?
I am trying to optimize the heap space used for a large model with loads of GIS data, but the HPC cluster only gives limited information on how much is used at which step.
I believe tools exist for monitoring JVM heap usage; I don't know much about that, but it isn't actually a NetLogo-specific topic, so you might look into that separately.
If you want to gather the information from within NetLogo itself:
As you point out in a comment, the "About NetLogo" dialog displays heap usage numbers. The code that retrieves those numbers is here: https://github.com/NetLogo/NetLogo/blob/533131ddb63da21ac35639e61d67601a3dae7aa2/src/main/org/nlogo/util/SysInfo.scala#L28-L39
You can see that it's just calling some routines in the Java standard library (in java.lang.Runtime). You could write a little NetLogo extension that calls the same routines.

Multi-Core Programming. Boost's MPI, OpenMP, TBB, or something else?

I am totally a novice in Multi-Core Programming, but I do know how to program C++.
Now, I am looking around for a multi-core programming library. I just want to give it a try, just for fun, and right now I have found 3 APIs, but I am not sure which one I should stick with. Right now, I see Boost's MPI, OpenMP and TBB.
For anyone who has experience with any of these 3 APIs (or any other), could you please tell me the differences between them? Are there any factors to consider, like AMD or Intel architecture?
As a starting point I'd suggest OpenMP. With this you can very simply do three basic types of parallelism: loops, sections, and tasks.
Parallel loops
These allow you to split loop iterations over multiple threads. For instance:
#pragma omp parallel for
for (int i=0; i<N; i++) {...}
If you were using two threads, then the first thread would perform the first half of the iterations and the second thread the second half.
Sections
These allow you to statically partition the work over multiple threads. This is useful when there is obvious work that can be performed in parallel. However, it's not a very flexible approach.
#pragma omp parallel sections
{
#pragma omp section
{...}
#pragma omp section
{...}
}
Tasks
Tasks are the most flexible approach. These are created dynamically and their execution is performed asynchronously, either by the thread that created them, or by another thread.
#pragma omp task
{...}
Advantages
OpenMP has several things going for it.
Directive-based: the compiler does the work of creating and synchronizing the threads.
Incremental parallelism: you can focus on just the region of code that you need to parallelise.
One source base for serial and parallel code: the OpenMP directives are only recognized when you compile with a flag (-fopenmp for gcc), so you can use the same source base to generate both serial and parallel code. This means you can turn off the flag and check whether the serial version of the code gives the same result, isolating parallelism errors from errors in the algorithm.
You can find the entire OpenMP spec at http://www.openmp.org/
Under the hood OpenMP is multi-threaded programming but at a higher level of abstraction than TBB and its ilk. The choice between the two, for parallel programming on a multi-core computer, is approximately the same as the choice between any higher and lower level software within the same domain: there is a trade off between expressivity and controllability.
Intel vs AMD is irrelevant I think.
And your choice ought to depend on what you are trying to achieve; for example, if you want to learn TBB then TBB is definitely the way to go. But if you want to parallelise an existing C++ program in easy steps, then OpenMP is probably a better first choice; TBB will still be around later for you to tackle. I'd probably steer clear of MPI at first, unless I was certain that I would be moving from shared-memory programming (which is mostly what you do on a multi-core machine) to distributed-memory programming (on clusters or networks). As ever, the technology you choose ought to depend on your requirements.
I'd suggest you play with MapReduce for some time. You can install several virtual machine instances on the same machine, each running a Hadoop instance (Hadoop is Yahoo!'s open-source implementation of MapReduce). There are a lot of tutorials online for setting up Hadoop.
By the way, MPI and OpenMP are not the same thing. OpenMP is for shared-memory programming, which generally means multi-core programming, not parallel programming across several machines.
Depends on your focus. If you are mainly interested in multi threaded programming go with TBB. If you are more interested in process level concurrency then MPI is the way to go.
Another interesting library is OpenCL. It basically allows you to use all your hardware (CPU, GPU, DSP, ...) in the best way.
It has some interesting features, such as the ability to launch hundreds of threads (work-items) without the performance penalty that the same number of OS threads would incur.