MOSEK memory issue for large linear programming - MATLAB

I use MOSEK to solve a very large linear programming problem in MATLAB (32768 unknowns and 691621 constraints).
The job is submitted to a Linux-based cluster.
In the bash file I request the following amount of memory:
#$ -l h_vmem=20G
#$ -l tmem=20G
but get Mosek error: MSK_RES_ERR_SPACE (Out of space.)
I could request more memory (though it is unclear how much more would be needed), but this would mean queuing in the cluster for a long time.
Hence, I was wondering whether I can try to ameliorate the issue in some other way.
1) Quoting from some MOSEK FAQs:
Java, .NET, and Python applications run under a virtual machine. MOSEK shares memory
with the virtual machine. This implies it might be necessary to force the virtual machine to
free unused memory by explicitly calling the garbage collector (for example, before optimization
is performed) in order to make sufficient memory available to MOSEK.
Can this advice be useful? What does calling the garbage collector mean (i.e., which line should I add to my MATLAB code)?
2) From https://docs.mosek.com/9.2/pythonapi/guidelines-optimizer.html (even though this is for Python), it suggests setting:
Task.putmaxnumvar. Estimate for the number of variables.
Task.putmaxnumcon. Estimate for the number of constraints.
Task.putmaxnumcone. Estimate for the number of cones.
Task.putmaxnumbarvar. Estimate for the number of semidefinite matrix variables.
Task.putmaxnumanz. Estimate for the number of non-zeros in A.
Task.putmaxnumqnz. Estimate for the number of non-zeros in the quadratic terms.
Can I do that in Matlab? How?
3) From http://ask.cvxr.com/t/how-to-deal-with-out-of-space-error-when-using-mosek-to-solve-a-conic-optimization-problem/7510: "It will reduce memory consumption to some extent if you run on 1 thread (set MSK_IPAR_NUM_THREADS to 1 in cvx solver options or set MSK_IPAR_INTPNT_MULTI_THREAD to 0)"
Can this be done in MATLAB as well? I have tried
param_MOSEK.MSK_IPAR_NUM_THREADS = 1;
param_MOSEK.MSK_IPAR_INTPNT_MULTI_THREAD = 'MSK_OFF';
but it does not seem to work, as the output file still shows
Optimizer - threads : 16
Optimizer - solved problem : the dual
...
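For reference, this is how I pass the parameters to the solver (a minimal sketch assuming the standard mosekopt interface of the MOSEK Optimization Toolbox; the prob structure holding the LP data is set up elsewhere):
param_MOSEK.MSK_IPAR_NUM_THREADS = 1;
param_MOSEK.MSK_IPAR_INTPNT_MULTI_THREAD = 'MSK_OFF';
% 'minimize' runs the optimizer on prob with the parameters above
[r, res] = mosekopt('minimize', prob, param_MOSEK);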
Comments related to the questions below:
The code runs on my macOS 64-bit machine using 16 threads in just 180 sec. The computer has 32 GB of 2667 MHz DDR4 memory, and the run uses much less than 20G (around 9G).
The code fails when run on my university's Linux-based cluster after requesting 20G of vmem and tmem. On the cluster, MOSEK executes the presolve and the GP-based matrix reordering, and then fails. This is a typical log file:
Wed 9 Sep 08:10:47 BST 2020
Task ID is 6
< M A T L A B (R) >
Copyright 1984-2019 The MathWorks, Inc.
R2019b Update 3 (9.7.0.1261785) 64-bit (glnxa64)
November 27, 2019
For online documentation, see https://www.mathworks.com/support
For product information, visit www.mathworks.com.
MOSEK Version 9.2.5 (Build date: 2020-4-22 22:56:56)
Copyright (c) MOSEK ApS, Denmark. WWW: mosek.com
Platform: Linux/64-X86
Problem
Name :
Objective sense : min
Type : LO (linear optimization problem)
Constraints : 691597
Cones : 0
Scalar variables : 32768
Matrix variables : 0
Integer variables : 0
Optimizer started.
Presolve started.
Linear dependency checker started.
Linear dependency checker terminated.
Eliminator started.
Freed constraints in eliminator : 0
Eliminator terminated.
Eliminator - tries : 1 time : 0.00
Lin. dep. - tries : 1 time : 0.33
Lin. dep. - number : 0
Presolve terminated. Time: 2.99
GP based matrix reordering started.
GP based matrix reordering terminated.
Optimizer terminated. Time: 20.15
Interior-point solution summary
Problem status : UNKNOWN
Solution status : UNKNOWN
Primal. obj: 0.0000000000e+00 nrm: 1e+00 Viol. con: 1e+00 var: 0e+00
Dual. obj: 0.0000000000e+00 nrm: 0e+00 Viol. con: 0e+00 var: 0e+00
Optimizer summary
Optimizer - time: 20.15
Interior-point - iterations : 0 time: 19.95
Basis identification - time: 0.00
Primal - iterations : 0 time: 0.00
Dual - iterations : 0 time: 0.00
Clean primal - iterations : 0 time: 0.00
Clean dual - iterations : 0 time: 0.00
Simplex - time: 0.00
Primal simplex - iterations : 0 time: 0.00
Dual simplex - iterations : 0 time: 0.00
Mixed integer - relaxations: 0 time: 0.00
Mosek error: MSK_RES_ERR_SPACE (Out of space.)

1) Irrelevant in MATLAB.
2) Irrelevant and impossible in MATLAB. The MEX interface feeds the problem into MOSEK in one go and takes care of all allocations itself.
3) For MSK_IPAR_NUM_THREADS to be respected, you must restart the whole process, i.e., MATLAB. See https://docs.mosek.com/9.2/faq/faq.html#mosek-is-ignoring-the-limit-on-the-number-of-threads. However, when you set MSK_IPAR_INTPNT_MULTI_THREAD = 'MSK_OFF', MOSEK will use 1 thread regardless of the number of threads available, i.e., the number printed in the log is just an upper bound. You should be able to see in the task manager/top/whatever other CPU-load tracker that only 1 CPU is in use.
The basic question is: have you tried to run the problem without any memory limits, to see if it works at all and to estimate the memory consumption? Does it run on other machines?

Related

Calculate the performance of a multicore architecture?

Consider a multicore architecture with 10 computing cores: 2 processor cores and 8 coprocessors. Each processor core can deliver 2.0 GFlops, while each coprocessor can deliver 1.0 GFlops. All computing cores can perform calculations simultaneously. Any instruction can execute in either processor or coprocessor cores unless there are any explicit restrictions.
If 70% of the dynamic instructions in an application are parallelizable, what is the maximum average performance (Flops) you can get in the optimal situation? Please note that the remaining 30% of the instructions can be executed only after the execution of the parallel 70% is over.
Consider another application where all the dynamic instructions can be partitioned into 6 groups (A, B, C, D, E, F) with the following dependency. For example, A --> C implies that all the instructions in A need to be completed before starting the execution of instructions in C. Each of the first four groups (A, B, C and D) contains 20% of the dynamic instructions whereas each of the remaining two groups (E and F) contains 10% of the dynamic instructions. All the instructions in each group must be executed sequentially on the same processor or coprocessor core. How to schedule them on the multicore architecture to achieve the best possible performance? What is the maximum average performance (Flops) now?
A(20%) --> C(20%) -->
                      E(10%) --> F(10%)
B(20%) --> D(20%) -->
For the first part, you need to use Amdahl's Law, which is:
max speed-up = 1/(1-p+p/n)
where p is the parallelizable fraction and n is the improvement factor in executing the parallel portion.
(Note that the Amdahl's Law formula can be used for first order estimates on other types of changes. E.g., given a factor of N reduction in ALU energy use and P fraction of energy used by the ALU, one can find the improvement in total energy use.)
In your case, since the serial portion would be executed on the higher performance (2 GFLOPS) processor core, n is 6 ([8 coprocessor cores * 1 GFLOPS/core + 2 processor cores * 2 GFLOPS/core]/ 2 GFLOPS/processor core).
A quick calculation shows the max speed-up you can get is 2.4 relative to 1 processor core. The maximum FLOPS would therefore be the speed-up times the speed if the whole program were executed serially on one processor core, i.e., 2.4 * 2 GFLOPS = 4.8 GFLOPS.
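The same arithmetic in MATLAB (a quick check of the numbers above, nothing assumed beyond the formula):
p = 0.7;                   % parallelizable fraction
n = 6;                     % improvement factor for the parallel portion
speedup = 1/(1 - p + p/n)  % = 2.4
maxflops = speedup * 2     % = 4.8 GFLOPS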
For the second part, note that initially there are two independent instruction streams: A --> C and B --> D. Since the system has two processor cores, both streams can be executed in parallel on the higher-performance processor cores. Furthermore, both have the same amount of work (40% of the total for each stream), so on identical cores they will complete at the same time.
Since E depends on results from both C and D, it must be started after both finish. E and F would execute on a processor core (which core is arbitrary since E must wait for the tasks running on both processor cores to complete).
As you can see, 80% of the program (40% for A+C; 40% for B+D) can be parallelized by a factor of 2, and 20% of the program (E+F) is serial. You can then just plug the numbers into the Amdahl's Law formula (p = 0.8, n = 2).
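Working that out: max speed-up = 1/(0.2 + 0.8/2) = 1/0.6 ≈ 1.67, so the maximum average performance is about 1.67 * 2 GFLOPS ≈ 3.33 GFLOPS.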

How to display only select powers of an expansion in Maple

Say I have a super long polynomial in multiple variables, far too long to display on-screen or print out, so collect (http://www.maplesoft.com/support/help/Maple/view.aspx?path=collect) is unlikely to help. I would like to tell Maple to display only the terms that contain a specific variable raised to one selected power. I am sure there must be a simple way to do this. And no, I haven't looked into this extensively. Feel free to just provide a link to the answer if it already exists online.
Thanks...
If you care about speed -- perhaps because you need to do similar queries for other powers, of possibly other variables -- then consider using the coeff command. E.g., for a polynomial f, the terms in x^2 could be obtained with the command,
x^2*coeff(f,x,2);
For a trivariate dense polynomial of about 1000000 terms, as in the following example, the coeff command is several hundred times faster in Maple 16 and 17 than the has-command approach shown below.
restart:
f:=expand(randpoly(x,degree=100,dense)
*randpoly(y,degree=100,dense)
*randpoly(z,degree=100,dense)):
nops(f); # number of terms
990000
sol1:=CodeTools:-Usage( select(has,f,x^2) ):
memory used=105.36MiB, alloc change=58.22MiB, cpu time=842.00ms, real time=843.00ms
sol2:=CodeTools:-Usage( x^2*coeff(f,x,2) ):
memory used=156.84KiB, alloc change=0 bytes, cpu time=0ns, real time=4.00ms
expand(sol1-sol2);
0
# Check that the timing difference was not just due to the order in which
# the two approaches were done, by a simple repeat.
CodeTools:-Usage( select(has,f,x^2) ):
memory used=105.30MiB, alloc change=23.11MiB, cpu time=733.00ms, real time=691.00ms
CodeTools:-Usage( x^2*coeff(f,x,2) ):
memory used=156.81KiB, alloc change=0 bytes, cpu time=0ns, real time=3.00ms
That was all done in Maple 17 64-bit on Windows 7, and timings are pretty similar in Maple 16. This is in stark contrast to Maple 15 and earlier, where the coeff approach is about 3 times slower than the has approach. Those improvements relate to major work done in handling polynomial structures in Maple 16 and 17. See here and here.
Let's say that you want to see all terms of polynomial poly with x^2. Then do select(has, poly, x^2);

To find execution time on a multi-core machine

I'm preparing for a competitive exam and I have an operating systems question.
I don't know how to solve it. Please help me out.
Q-)
A program took 160 seconds to execute on a single processor but only 64 seconds on a
4 core multicore. What is the best estimate for the execution time on a 64 core machine?
I don't think this is strictly relevant to programming (you might find this more relevant on the Math StackExchange), but I'll attempt to answer it anyway.
The answer will depend entirely on how you model execution time vs. number of cores. You could model the execution time as a constant overhead plus a term inversely proportional to the number of cores. For example, I used the following model:
t = c + k/n
where t is time in seconds, n is the number of cores, and c (which could represent overhead) and k (a scaling factor) are constants.
Solve simultaneously:
c + k/1 = 160
c + k/4 = 64
to get k = 128 and c = 32.
Then just substitute n = 64:
t = 32 + 128/64 = 34
So, you get 34 seconds according to this model. Of course, since you don't know the exact model, this can only be a calculated guess.
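The same computation in MATLAB (a sketch of the model above):
% model: t = c + k/n, fitted to t(1) = 160 and t(4) = 64
A = [1 1; 1 1/4];
b = [160; 64];
x = A \ b;            % x = [c; k] = [32; 128]
t64 = x(1) + x(2)/64  % = 34 seconds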

Amdahl's law example

Can someone help me with this example please and show me how to work the second part?
The question is:
If one third of a weather prediction algorithm is inherently serial and the remainder
parallelizable, what is the minimum number of cores needed to guarantee a 150% speedup over a
single core implementation?
ii. Your boss revises the figure to 200%. What is your new answer?
Thanks very much in advance !!
Guess: If the algorithm is 1/3 serial and 2/3 parallel...I would think that each core you added would give you a 66% increase in performance...So for 150% increase, you'd need 3 more cores, and for a 200% increase, you'd need 4.
This is a guess. Your textbook might be more helpful :)
If the algorithm runs on a single core and takes 90 minutes, then 30 minutes is for the serial part and 60 minutes for the parallel part.
Add a CPU:
30 minutes for the serial part and 30 for the parallel part (the 60 minutes of parallel work is split across the two cores).
90 / 60 = 1.5, i.e., a 150% speedup.
I am a bit late, but here are the answers:
1) 150% speedup -> at least 2 cores required, as dbasnett said;
2) 200% speedup -> at least 4 cores required, based on Amdahl's law:
Here, 90 minutes overall are required to perform the calculation. P is the enhanced (i.e., parallelizable) part of the algorithm, which is 2/3 (60 of the 90 minutes), and N is the number of cores, so with one core only:
S(1) = 1 / ((1 - 2/3) + (2/3)/1) = 1
You get 1, which means 100%, which is how the algorithm performs the standard way (without multi-core acceleration and therefore no parallelization speedup).
Now, we must find the number of cores N for which the speedup equals 2, where 2 means that the algorithm performs in half the time (45 minutes instead of 90) and therefore with a 200% speedup:
1 / ((1 - 2/3) + (2/3)/N) = 2
Since:
(1 - 2/3) + (2/3)/N = 1/2
we see that:
(2/3)/N = 1/2 - 1/3 = 1/6, hence N = 4
So with 4 cores computing the 2/3 parallelizable part of the algorithm in parallel, you get a 200% speedup. The same goes for 150%: set the speedup to 1.5 and you get N = 2, as dbasnett already told you.
Pretty simple.
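A quick numerical check in MATLAB (just Amdahl's law with P = 2/3):
S = @(N) 1 ./ ((1 - 2/3) + (2/3) ./ N);  % speedup as a function of core count
S(2)   % = 1.5 -> 150% speedup with 2 cores
S(4)   % = 2.0 -> 200% speedup with 4 cores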
Note that a complex algorithm may imply further divisions of its parallelizable parts (and in theory you can have a different number of processing units per parallelizable part running concurrently).
You can further look at Wikipedia (there's also an example):
http://en.wikipedia.org/wiki/Amdahl%27s_law#Description
Anyway, the principle is the same:
Let T be the time an algorithm needs to execute in order to complete, A the serial part of it, B its parallelizable part (so T = A + B), and N the number of parallel CPUs. You can divide B into further small sections, say B = C + D + G, and perform the calculations on each part:
T(N) = A + C/N + D/N + G/N
You may, for C, D and G, e.g., adopt M CPUs instead of N (the speedup will of course differ if M != N).
And at the end, you will arrive at a point where having more CPUs doesn't matter anymore, since:
T(N) -> A as N -> infinity, so the speedup T/T(N) -> T/A
And your algorithm speedup will at most tend to the total execution time (T) divided by the execution time of the serial part only (A).
Therefore, parallel computation really comes in handy only when the serial part of your algorithm has a low execution time.

Using more than one GPU in MATLAB

This is the output of ginfo using Jacket/MATLAB:
Detected CUDA-capable GPUs:
CUDA driver 270.81, CUDA toolkit 4.0
GPU0 Tesla C1060, 4096 MB, Compute 1.3 (single,double) (in use)
GPU1 Tesla C1060, 4096 MB, Compute 1.3 (single,double)
GPU2 Quadro FX 1800, 742 MB, Compute 1.1 (single)
Display Device: GPU2 Quadro FX 1800
The problem is:
Can I use the two Teslas at the same time (parfor)? How?
How can I know the number of cores currently running/executing the program?
After running the following code with the Quadro as the device in use, I found it takes less time than the Tesla, despite the Tesla having 240 cores and the Quadro only 64. Maybe because it's the display device? Maybe because it's single precision while the Tesla is double precision?
clc; clear all; close all;
addpath('C:/Program Files/AccelerEyes/Jacket/engine');
i = im2double(imread('cameraman.tif'));
i_gpu = gdouble(i);              % transfer image to GPU memory
h = fspecial('motion', 50, 45);  % create predefined 2-D motion filter
h_gpu = gdouble(h);              % transfer filter to GPU memory
tic;
for j = 1:500
    x_gpu = imfilter(i_gpu, h_gpu);
end
i2 = double(x_gpu);              % memory transfer back to the host
t = toc
figure(2), imshow(i2);
Any help with the code will be appreciated. As you can see, it's a very trivial example used to demonstrate the power of the GPU, no more.
Using two Teslas at the same time: write a MEX file and call cudaSetDevice(0), launch one kernel, then call cudaSetDevice(1) and launch another kernel. Kernel calls and memory copies (i.e., cudaMemcpyAsync and cudaMemcpyPeerAsync) are asynchronous. I've given an example of how to write a MEX file (i.e., a DLL) in one of my other answers; just add a second kernel to that example. FYI, you don't need Jacket if you can do some C/C++ programming. On the other hand, if you don't want to spend your time learning the CUDA SDK, or you don't have a C/C++ compiler, then you're stuck with Jacket or gp-you or GPUlib until MATLAB changes the way parfor works.
An alternative is to call OpenCL from Matlab (again through a MEX file). Then you could launch kernels on all the GPUs and CPUs. Again, this requires some C/C++ programming.
From MATLAB 2012 on, gpuArray and GPU-related functions are fully integrated into MATLAB, so you might not need Jacket to achieve what you are trying to do.
In short, put gpuDevice(deviceID); before running GPU commands, and the subsequent code will run on the deviceID-th GPU. For instance:
gpuDevice(1);
a = gpuArray(rand(3)); % a is in the first GPU's memory
gpuDevice(2);
b = gpuArray(rand(4)); % b is in the second GPU's memory
To run on multiple GPUs, simply put:
c = cell(1, num_device);
parfor i = 1:num_device
    gpuDevice(i);        % each worker selects its own GPU
    a = gpuArray(magic(3));
    b = gpuArray(rand(3));
    c{i} = gather(a*b);  % bring the result back to host memory
end
You can see the GPU memory usage by typing nvidia-smi on the system command line.
This way of setting the GPU id may seem strange, but it is the conventional way to select a GPU. In CUDA, if you want to use a specific GPU, call cudaSetDevice(gpuId) and the following code will run on the gpuId-th GPU (0-based indexing, whereas gpuDevice in MATLAB is 1-based).
----------------------EDIT----------------
Tested and confirmed on MATLAB 2012b and MATLAB 2013b.
Checked using nvidia-smi that the code actually uses different GPUs' memory. You might have to scale the arrays up to something very large (e.g., rand(5000)) and check very quickly, since the temporary variables a and b disappear after the parfor loop ends.