I would like to use PARDISO (https://github.com/haasad/PyPardisoProject) on our Windows server for LU decomposition and have replaced SciPy's spsolve with PyPardiso's spsolve:
#old
dxdlam = spsolve(Ab.tocsr(), bb)
#new
dxdlam = pypardiso.spsolve(Ab.tocsr(), bb)
Is it possible to limit the CPU usage of the PyPardiso spsolve function so that, for example, only some of the cores/CPUs are used? Or, alternatively, can the number of parallel jobs be passed to the PARDISO function?
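A minimal sketch of one common approach, assuming PyPardiso calls Intel MKL's PARDISO under the hood: MKL's thread count can usually be capped via the MKL_NUM_THREADS environment variable set before MKL is first initialized, or at runtime through the optional mkl-service package (assumed installed here); the tiny matrix below is only a dummy stand-in for Ab and bb, and whether PyPardiso honors the limit should be verified on the server.
import os
# Cap the number of threads MKL (and hence its PARDISO solver) may use.
# Must be set before MKL is initialized, i.e. before the imports below.
os.environ["MKL_NUM_THREADS"] = "4"  # example value

import numpy as np
import scipy.sparse as sp
import pypardiso

# Alternative at runtime (assumes the optional mkl-service package is installed):
# import mkl
# mkl.set_num_threads(4)

# Dummy sparse system standing in for Ab and bb from the snippet above.
A = sp.eye(200, format="csr") + 0.01 * sp.random(200, 200, density=0.05, format="csr")
b = np.ones(200)
x = pypardiso.spsolve(A, b)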
I’m running RStudio Version 1.1.419 with R-3.4.3 on Windows 10. I am trying to fit an (f)arima model and setting the fractional differencing parameter during the optimization process to be between (-0.5,0.5), i.e. allowing for antipersistence (d < 0), short memory (d = 0) and long memory (d > 0). I have tried multiple functions to accomplish that. I am aware that the default of fracdiff$drange is (0,0.5). Therefore this ...
> result <- fracdiff(MeanPrice, nar = 2, nma = 1, drange = c(-0.5,0.5))
sadly returns this:
Warning: C fracdf() optimization failure
Warning message: unable to compute correlation matrix; maybe change 'h'
Is there a way to fit fracdiff or other models (maybe arfima::arfima()?) with that drange? Your help is very much appreciated.
If you look at the package documentation, it states that the h argument for fracdiff "is used to compute a finite difference approximation to the Hessian, and hence only influences the cov, cor, and std.error computations." However, as they are referring to the Hessian, I would assume that this affects the results of the MLE. There are other functions in that package that may be helpful: fdGPH for estimating the order of fractional differencing based on the Geweke and Porter-Hudak method, and similarly fdSperio.
Take a look at the forecast package. If you estimate the order of fractional differencing using the above mentioned functions, you might be able to use the same method described in the details of the arfima function.
I am writing code in Fortran that involves computing a linear least-squares solution (A^(-1)*b). For this I have used the subroutine DGELSD. I am able to get the correct answer from my code; I have cross-checked my solution against data from MATLAB.
Now when it comes to computation time, MATLAB takes much less time than my .f90 code. The main motivation for me to write the .f90 code was to do heavy computations, as MATLAB was unable to handle this problem (as the matrix size is increased, it takes more and more time). I am talking about matrices on the order of 10^6 x 10^6.
I know that I may be lacking somewhere around vectorization or parallelization of the code (as I am new to it). But would it make any difference, given that the subroutine DGELSD is already highly optimized? I am using the Intel ifort compiler with Visual Studio as the editor.
I have attached part of the main code (.f90) below for reference. Please suggest what can be done to decrease the computation time. I have a good hardware configuration for running heavy computations.
Workstation specification: Intel Xeon 32-core, 64-bit processor, 64 GB RAM.
Code information:
a) I have used the 'DGELSD' example available on the Intel MKL examples site.
b) I am using x64 architecture.
c) This is the code I am using in MATLAB for comparison.
function min_norm_check(m,n)
tic
a=15*m*n;
b=15*m*n;
c=6;
A=rand(a,b);
B=rand(b,c);
C=A\B;
toc
end
The Fortran code is given below:
! Program to check the time required to
! calculate linear least square solution
! using DGELSD subroutine
PROGRAM MAIN
IMPLICIT NONE
INTEGER :: M,N,LDA,LDB,NRHS,NB,NG
REAL :: T1,T2,START,FINISH
DOUBLE PRECISION, DIMENSION (:,:), ALLOCATABLE :: D_CAP,A,B,TG,DG
INTEGER :: I=0
NB=10
NG=10
M = 15*NB*NG
N = 15*NB*NG
NRHS = 6
!!
LDA = MAX(M,N)
LDB = MAX(M,NRHS)
!
ALLOCATE (A(LDA,N))
ALLOCATE (B(LDB,NRHS))
A = 0
B = 0
CALL RANDOM_NUMBER(A)
CALL RANDOM_NUMBER(B)
CALL CPU_TIME(START)
DO I=1,1
WRITE(*,*) I
CALL CALC_MIN_NORM_D(M,N,LDA,LDB,NRHS,A,B,D_CAP)
ENDDO
CALL CPU_TIME(FINISH)
WRITE(*,*)'("TIME =',FINISH-START,'SECONDS.")'
END
SUBROUTINE CALC_MIN_NORM_D(M,N,LDA,LDB,NRHS,A,B,D_CAP)
!
! SUBROUTINE DEFINITION TO CALCULATE THE
! LINEAR LEAST SQUARE SOLUTION OF ([A]^-1*B)
!
IMPLICIT NONE
INTEGER :: M,N,NRHS,LDA,LDB,LWMAX,INFO,LWORK,RANK
DOUBLE PRECISION RCOND
INTEGER, ALLOCATABLE, DIMENSION(:) :: IWORK
DOUBLE PRECISION :: A(LDA,N),B(LDB,NRHS),D_CAP(LDB,NRHS)
DOUBLE PRECISION, ALLOCATABLE, DIMENSION(:) :: S,WORK
!
WRITE(*,*)'IN CALC MIN NORM BEGINING'
WRITE(*,*)'DGELSD EXAMPLE PROGRAM RESULTS'
LWMAX = 1E8
ALLOCATE(S(M))
ALLOCATE(IWORK(3*M*0+11*M))
ALLOCATE(WORK(LWMAX))
! NEGATIVE RCOND MEANS USING DEFAULT (MACHINE PRECISION) VALUE
RCOND = -1.0
!
! QUERY THE OPTIMAL WORKSPACE.
!
LWORK = -1
CALL DGELSD( M, N, NRHS, A, LDA, B, LDB, S, RCOND, RANK, WORK,LWORK, IWORK, INFO )
LWORK = MIN( LWMAX, INT( WORK( 1 ) ) )
!WRITE(*,*) 'AFTER FIRST DGELSD'
!
! SOLVE THE EQUATIONS A*X = B.
!!!
CALL DGELSD( M, N, NRHS, A, LDA, B, LDB, S, RCOND, RANK, WORK,LWORK, IWORK, INFO )
!
! CHECK FOR CONVERGENCE.
!
IF( INFO.GT.0 ) THEN
WRITE(*,*)'THE ALGORITHM COMPUTING SVD FAILED TO CONVERGE;'
WRITE(*,*)'THE LEAST SQUARES SOLUTION COULD NOT BE COMPUTED.'
STOP
END IF
!
!
WRITE(*,*)' EFFECTIVE RANK = ', RANK
!D_CAP = B
END
This is the build log for a successful compilation of the file. Since I am using Visual Studio with Intel Visual Fortran, I can compile the program using the Compile option available in the IDE; that is, I don't have to use the command-line interface to build and run it.
Compiling with Intel(R) Visual Fortran Compiler 17.0.2.187 [IA-32]...
ifort /nologo /debug:full /Od /warn:interfaces /module:"Debug\\" /object:"Debug\\" /Fd"Debug\vc110.pdb" /traceback /check:bounds /check:stack /libs:dll /threads /dbglibs /c /Qlocation,link,"C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\\bin" /Qm32 "D:\Google Drive\Friday_May_27_2016\Mtech Thesis Job\All new work\Fortran coding\fortran hfgmc\Console6\Console6\Console6.f90"
Linking...
Link /OUT:"Debug\Console6.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"Debug\Console6.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /DEBUG /PDB:"D:\Google Drive\Friday_May_27_2016\Mtech Thesis Job\All new work\Fortran coding\Console6\Console6\Debug\Console6.pdb" /SUBSYSTEM:CONSOLE /IMPLIB:"D:\Google Drive\Friday_May_27_2016\Mtech Thesis Job\All new work\Fortran coding\fortran hfgmc\Console6\Console6\Debug\Console6.lib" -qm32 "Debug\Console6.obj"
Embedding manifest...
mt.exe /nologo /outputresource:"D:\Google Drive\Friday_May_27_2016\Mtech Thesis Job\All new work\Fortran coding\fortran hfgmc\Console6\Console6\Debug\Console6.exe;#1" /manifest "Debug\Console6.exe.intermediate.manifest"
Console6 - 0 error(s), 0 warning(s)
I have also included the Intel MKL library in the project settings.
I have compiled the program, and the output has been written to the file 'OUTPUT_FILE.TXT':
IN CALC MIN NORM BEGINING
DGELSD EXAMPLE PROGRAM RESULTS
EFFECTIVE RANK = 1500
IN CALC MIN NORM ENDING
("TIME TAKEN TO SOLVE DGELSD= 4.290028 SECONDS.")
On the other hand MATLAB gives the following result in the output command window:
min_norm_check(10,10)
Elapsed time is 0.224172 seconds.
Also, I don't want to outperform MATLAB; it's easy and simple to use. But with an increase in the size of the problem, MATLAB stops responding. I have left my program to run in MATLAB for more than two days, and it still hasn't produced any results.
In Theano, given a batch cost cost with shape (batch_size,), it is easy to compute the gradient of the mean cost, as in T.grad(T.mean(cost, axis=0), p), with p being a parameter used in the computation of cost. This is done efficiently by backpropagating the gradient through the computational graph. What I would now like to do is compute the mean of the squared gradients over the batch. This can be done using the following piece of code:
import theano
import theano.tensor as T

g_square = T.mean(theano.scan(lambda i: T.grad(cost[i], p)**2,
                              sequences=T.arange(cost.shape[0]))[0],
                  axis=0)
Where for convenience p is assumed to be a single theano tensor and not a list of tensors.
The computation could be performed efficiently by simply backpropagating the gradient until the last step and squaring the components of the last operation (which should be a sum over the batch index). I might be wrong on this one, but the computation should be as easy, and nearly as fast, as a simple backpropagation. However, Theano seems unable to optimize the computation, and it keeps using a loop, making the computation extremely slow.
Would anyone know of a solution to make this computation efficient, either by forcing optimizations, expressing the computation in a different way, or even going through the backpropagation process?
Thanks in advance.
Your function g_square happens to have complexity O(batch_size**2) instead of the expected O(batch_size), which makes it appear incredibly slow for larger batch sizes.
The reason is that in every iteration the forward and backward passes are computed over the whole batch, even though only cost[i] for a single data point is needed.
I assume the input to the cost computation graph, x, is a tensor whose first dimension has size batch_size. Theano has no means to automatically slice this tensor along that dimension, so the computation is always done over the whole batch.
Unfortunately I see no better solution than slicing your input and doing the loop outside Theano:
# x: the symbolic input of the cost graph; x_batch: numpy array holding the actual data batch
batch_size = x_batch.shape[0]
# compile the squared gradient of a single data point's cost once
g_square_fun = theano.function([x], T.grad(cost[0], p)**2)

g_square_value = 0
for i in range(batch_size):
    # feed one data point at a time and accumulate its squared gradient
    g_square_value += g_square_fun(x_batch[i:i+1])
g_square_value /= batch_size  # mean of the squared gradients over the batch
Perhaps when future versions of Theano come with better built-in capabilities for computing Jacobians, there will be more elegant solutions.
After digging deeper into the Theano docs I found a solution that works within the compute graph. The key idea is to clone the graph of your network inside the scan function, thereby explicitly slicing the input tensor. I tried the following code and empirically it shows O(batch_size) behavior as expected:
# x: symbolic input tensor holding the data batch
# assuming cost = network(x,p)
from theano.gof.graph import clone_get_equiv

def g_square(cost, p):
    g = T.zeros_like(p)

    def scan_fn(i, g, cost, p):
        # clone the graph computing cost, but slice its input
        cloned = clone_get_equiv([], [cost],
                                 copy_inputs_and_orphans=False,
                                 memo={x: x[i:i+1]})
        cost_slice = cloned[cost].reshape([])
        return g + T.grad(cost_slice, p)**2

    result, updates = theano.reduce(scan_fn,
                                    outputs_info=g,
                                    sequences=[T.arange(cost.size)],
                                    non_sequences=[cost.flatten(), p])
    return result
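A purely hypothetical usage sketch (the names x, p and cost are assumed to be defined as above, with p a shared parameter variable; the batch shape is made up): the returned expression can be compiled once and evaluated on a batch, and dividing by the batch size turns the accumulated sum into the mean.
import numpy as np
import theano

# Hypothetical usage; if p is not a shared variable it must be added to the inputs list.
g_sq_expr = g_square(cost, p)               # symbolic sum of squared per-sample gradients
g_sq_fn = theano.function([x], g_sq_expr)   # compiled once, no Python-level loop

x_batch = np.random.rand(32, 10).astype(theano.config.floatX)  # made-up batch shape
mean_sq_grad = g_sq_fn(x_batch) / x_batch.shape[0]             # divide by batch size for the mean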
I'm using the SCIP solver in the OPTI toolbox in MATLAB to solve a quadratic optimization problem with integer constraints. I ran it with the following specs and it has been running for a day, has already taken up 55 GB of RAM on my system, and is still counting. I'm new to optimization in MATLAB; am I doing something wrong, or is this usual? I tried with fewer maxnodes and less maxtime, but in those cases the program stops with a 'Node limit reached' error. Here's the code (H, Aeq, etc. have been defined earlier in the code):
X = sym('X%d%d', [104 1]);
fun = @(X) 1/2*X'*H*X;
options = optiset('solver', 'SCIP', 'maxnodes', 20000000, 'maxtime', 100000);
Opt = opti('fun', fun, 'eq', Aeq, Beq, 'xtype', xtype, 'options', options);
[xval,fval,exitflag,info] = solve(Opt)
This is not unusual if the quadratic function(s) are nonconvex. That easily leads to hard problems that cannot be solved to proven optimality with today's algorithms in any reasonably finite amount of time. Note that this depends not only on the size of the problem, although in general smaller problems (of a similar type) will be easier.
This being said, SCIP might already have found a near-optimal solution that is accessible even when the time or node limit is exceeded.
I have serial code that looks something like this:
sum = a;
sum += b;
sum += c;
sum += d;
I would like to parallelize it to something like this:
temp1 = a + b and in the same time temp2 = c + d
sum = temp1 + temp2
How do I do this using Intel Parallel Studio tools?
Thanks!!!
Assuming that all variables are of integral or floating-point types, there is absolutely no sense in parallelizing this code (in the sense of executing it on different threads/cores), as the overhead will be much, much higher than any benefit from it. The applicable parallelism in this example is at the level of multiple computation units and/or vectorization on a single CPU. Optimizing compilers are sophisticated enough nowadays to exploit this automatically, without code changes; however, if you wish, you may explicitly use temporary variables, as in the second part of the question.
And if you ask just out of curiosity: Intel Parallel Studio provides several ways to parallelize code. For example, let's use Cilk keywords together with C++11 lambda functions:
#include <cilk/cilk.h>
...
temp = cilk_spawn [=]{ return a+b; }();
sum = c+d;
cilk_sync;
sum += temp;
Don't expect to get performance out of this (see above), unless you use classes with a computationally heavy overloaded operator+.