GCC -O2 with -march / -ftree-vectorize

I am trying out several compiler switches against a program that performs Sobel kernel convolution on two images (2000H x 3000W and 6800H x 8500W). There are some observations that I am not able to interpret. The data follow: compiler flags and time taken in seconds (please focus on the last column, which is the convolution on the Y axis for the larger image):
Flags                                    Times (s); last column = Y-axis convolution, larger image
-O2 -march=barcelona                     0.1483326   0.833264    1.6018882   28.6711242
-O2 -ftree-vectorize                     0.1462104   0.847973    1.506708    26.628592
-O2                                      0.1468406   0.8368156   1.5999718   20.61377564
-O2 -ftree-vectorize -march=barcelona    0.1441898   0.827366    1.4687354   15.2572644
I expected -O2 -march=barcelona to be moderately better, considering the machine I am running on is an AMD Barcelona. Any ideas as to why -O2 is better than -O2 -march=barcelona?
As for -ftree-vectorize, it should be able to run instructions in parallel since my loop is dependence free. Yet -O2 -ftree-vectorize -march=barcelona is the best of the lot, even though each flag on its own is noticeably slower than plain -O2.
It would be great if I could understand this behavior.
Regards,
Sayan

Related

Differences in numerical behaviour between Scala versions

We are in the process of upgrading to Scala 2.10. We seem to have fixed all the show stoppers, but for some of our numerical calculations the answers are a little different in a small number of cases. Are there any known changes introduced between 2.9.1 and 2.10.2 that might cause this or has anybody else seen something like this before?
Since all our small tests pass this should be some cumulative effect in iterative calculations, and even then it only affects a small percentage of cases. These are complex calculations and this is the only system for performing them, so we do not have an independent way to verify which version is right, short of doing a clean room reimplementation.

Sorry, can't post just a comment yet.
What is the magnitude of the differences you're seeing, and what kind of calculation are you doing in your code? It may not be meaningful to ask which is correct, it's FP, after all. Could you try compiling your code with strictfp? Slight differences in code gen and optimization can result in differences in when/how often 80-bit intermediates are truncated to 64 -- assuming you're running on an x86 architecture. If you have access to hardware that doesn't have 80-bit fp registers you might try running both versions on that. Even then, simple expressions like LL + S1 + S2, where LL is much greater than S1 or S2, can give different results depending on the order of the adds.
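The point about the order of the adds can be seen in a few lines. The sketch below is in Python purely for illustration (it is not the poster's Scala code), and the values chosen for LL, S1 and S2 are assumptions picked so that the effect shows up in 64-bit doubles.

```python
# Floating-point addition is not associative: grouping changes the result.
big = 1e16      # plays the role of LL, much larger than the small terms
small = 1.0     # plays the role of S1 and S2 (assumed values)

# Adding the small terms one at a time: each is rounded away against big.
left_to_right = (big + small) + small

# Combining the small terms first lets their sum survive the rounding.
small_first = big + (small + small)

print(left_to_right == small_first)
print(small_first - left_to_right)
```

Both expressions are mathematically identical, so a compiler or library change that merely reorders a sum can shift the last bits of a result, and an iterative calculation can amplify that.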

Very slow execution of Matlab code under ubuntu

I was using MATLAB 2012a under Windows 7 to run some intensive code (intensive in terms of both memory usage and processing time), and it worked fine on Windows. I have now changed my OS to Ubuntu 12.04 and installed MATLAB 2013a. The amount of memory used is considerably less than it was on Windows, but the time MATLAB takes to execute the same code is extremely high.
I should mention that my code contains nothing that should take such a long time, except a call to sparse with a symbolic substitution as one of the arguments, as follows:
K=zeros(Np,Np);
for i=1:ord
K=K+sparse(t(1:ord,:),repmat(t(i,:),ord,1),double(subs(Kv(:,i),Arg(Kv,1,1,6),Arg(Kv,1,2,6))),Np,Np);
end
Note that Kv is a symbolic matrix and Arg is a function that provides OLD and NEW; it depends on a number of global variables.
I have the feeling that I forgot to install something on Ubuntu that might help accelerate the execution of MATLAB code.
Any ideas ?

I had a similar problem on Windows, but I believe the solution is the same on Ubuntu LTS.
If you increase MATLAB's Java Heap Memory, MATLAB will consume more of your system's memory but it will be faster.
To do that, go to:
File->Preferences->General->Java Heap Memory and increase it to the maximum.
The default value is 128, which is too little.

If raising the heap memory limit doesn't fix the issue, then try increasing the priority of the MATLAB process.
First start matlab, then do
ps aux|grep MATLAB
In my case the result is:
comtom 9769 28.2 19.8 4360632 761808 tty2 S<l+ 14:00 1:50 /usr/local/MATLAB/MATLAB_Production_Server/R2015a/bin/glnxa64/MATLAB -desktop
Look at the first number (the PID). Then pass it to the renice command to change the process priority (a negative value raises the priority and normally requires root privileges):
renice -3 -p 9769
That's it. The GUI is very slow because it's built against outdated Xorg libs. Changing the priority helps: you may notice some tearing in GNOME effects, but MATLAB's interface will work a lot better.

Fast Algorithms for Finding Pairwise Euclidean Distance (Distance Matrix)

I know Matlab has a built-in pdist function that will calculate pairwise distances. However, my matrix is so large (60000 by 300) that Matlab runs out of memory.
This question is a follow-up to "Matlab euclidean pairwise square distance function".
Is there any workaround for this computational inefficiency? I tried manually coding the pairwise distance calculations, and it usually takes a full day to run (sometimes 6 to 7 hours).
Any help is greatly appreciated!

Well, I couldn't resist playing around. I created a Matlab mex C file called pdistc that implements pairwise Euclidean distance for single and double precision. On my machine using Matlab R2012b and R2015a it's 20–25% faster than pdist (and the underlying pdistmex helper function) for large inputs (e.g., 60,000-by-300).
As has been pointed out, this problem is fundamentally bounded by memory and you're asking for a lot of it. My mex C code uses minimal memory beyond that needed for the output. In comparing its memory usage to that of pdist, it looks like the two are virtually the same. In other words, pdist is not using lots of extra memory. Your memory problem is likely in the memory used up before calling pdist (can you use clear to remove any large arrays?) or simply because you're trying to solve a big computational problem on tiny hardware.
So, my pdistc function likely won't be able to save you memory overall, but you may be able to use another feature I built in. You can calculate chunks of your overall pairwise distance vector. Something like this:
m = 6e3;
n = 3e2;
X = rand(m,n);
sz = m*(m-1)/2;
for i = 1:m:sz-m
D = pdistc(X', i, i+m); % mex C function, X is transposed relative to pdist
... % Process chunk of pairwise distances
end
This is considerably slower (10 times or so) and this part of my C code is not optimized well, but it will allow much less memory use – assuming that you don't need the entire array at one time. Note that you could do the same thing much more efficiently with pdist (or pdistc) by creating a loop where you passed in subsets of X directly, rather than all of it.
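The idea of looping over subsets of the data, so that only one block of distances is in memory at a time, can be sketched as follows. This is an illustrative Python sketch, not the pdistc code: the function name distance_blocks and the sample points are hypothetical, and points are stored one per row here.

```python
import math

def distance_blocks(points, block_size):
    """Yield (start_row, block), where block holds the distances from
    points[start_row : start_row + block_size] to every point."""
    for start in range(0, len(points), block_size):
        chunk = points[start:start + block_size]
        # Only this block of distances exists in memory at any one time.
        block = [[math.dist(p, q) for q in points] for p in chunk]
        yield start, block

# Hypothetical data: four 2-D points, processed two rows at a time.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]

largest = 0.0
for start, block in distance_blocks(points, block_size=2):
    # Process the chunk, here by tracking the largest distance seen so far.
    largest = max(largest, max(max(row) for row in block))

print(largest)
```

Each block is reduced to the quantity of interest and then discarded, so peak memory scales with block_size times the number of points rather than with the full matrix.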
If you have a 64-bit Intel Mac, you won't need to compile as I've included the .mexmaci64 binary, but otherwise you'll need to figure out how to compile the code for your machine. I can't help you with that. It's possible that you may not be able to get it to compile or that there will be compatibility issues that you'll need to solve by editing the code yourself. It's also possible that there are bugs and the code will crash Matlab. Also, note that you may get slightly different outputs relative to pdist with differences between the two in the range of machine epsilon (eps). pdist may or may not do fancy things to avoid overflows for large inputs and other numeric issues, but be aware that my code does not.
Additionally, I created a simple pure Matlab implementation. It is massively slower than the mex code, but still faster than a naïve implementation or the code found in pdist.
All of the files can be found here. The ZIP archive includes all of the files. It's BSD licensed. Feel free to optimize (I tried BLAS calls and OpenMP in the C code to no avail – maybe some pointer magic or GPU/OpenCL could further speed it up). I hope that it can be helpful to you or someone else.

On my system the following is the fastest (Even faster than the C code pdistc by @horchler):
function [ mD ] = CalcDistMtx ( mX )
vSsqX = sum(mX .^ 2);
mD = sqrt(bsxfun(@plus, vSsqX.', vSsqX) - (2 * (mX.' * mX)));
end
You'll need a very well tuned C code to beat this, I think.
Update
Since MATLAB R2016b MATLAB supports implicit broadcasting without the use of bsxfun().
Hence the code can be written:
function [ mD ] = CalcDistMtx ( mX )
vSsqX = sum(mX .^ 2, 1);
mD = sqrt(vSsqX.'+ vSsqX - (2 * (mX.' * mX)));
end
A generalization is given in my Calculate Distance Matrix project.
P. S.
Using MATLAB's pdist for comparison: squareform(pdist(mX.')) is equivalent to CalcDistMtx(mX).
Namely the input should be transposed.
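The formula used by CalcDistMtx relies on the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x·y. A minimal Python sketch of the same identity follows, for illustration only. It stores one point per row (CalcDistMtx expects one point per column), and the clamp to zero is an addition of this sketch, not something the MATLAB code does.

```python
import math

def dist_matrix(points):
    """Pairwise Euclidean distances via ||x||^2 + ||y||^2 - 2*x.y."""
    sq_norms = [sum(c * c for c in p) for p in points]
    n = len(points)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(points[i], points[j]))
            # Rounding can leave a tiny negative value; clamp before sqrt
            # (an assumption of this sketch, not part of the MATLAB code).
            sq = max(sq_norms[i] + sq_norms[j] - 2.0 * dot, 0.0)
            result[i][j] = math.sqrt(sq)
    return result

# Hypothetical sample data: three collinear 2-D points.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = dist_matrix(points)
for row in D:
    print(row)
```

In MATLAB the same identity pays off because the dominant term, mX.' * mX, is a single matrix product handled by optimized BLAS, which the explicit Python loops here do not reproduce.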

Computers are not infinitely large, or infinitely fast. People think that they have a lot of memory, a fast CPU, so they just create larger and larger problems, and then eventually wonder why their problem runs slowly. The fact is, this is NOT computational inefficiency. It is JUST an overloaded CPU.
As Oli points out in a comment, there are something like 2e9 values to compute, even assuming you only compute the upper or lower half of the distance matrix. (6e4^2/2 is approximately 2e9.) This will require roughly 16 gigabytes of RAM to store, assuming that only ONE copy of the array is created in memory. If your code is sloppy, you might easily double or triple that. As soon as you go into virtual memory, things get much slower.
Wanting a big problem to run fast is not enough. To really help you, we need to know how much RAM is available. Is this a virtual memory issue? Are you using 64 bit MATLAB, on a CPU that can handle all the needed RAM?
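The memory estimate above can be reproduced with a short calculation. The sketch below is in Python, assumes 8-byte double-precision values, and uses a hypothetical helper name. It counts the pairs exactly instead of rounding up to 2e9, so it lands a little under the 16 GB figure.

```python
def condensed_distance_bytes(num_points, bytes_per_value=8):
    """Return (number of distinct pairs, bytes needed to store them)."""
    num_pairs = num_points * (num_points - 1) // 2
    return num_pairs, num_pairs * bytes_per_value

pairs, size_bytes = condensed_distance_bytes(60_000)
print(pairs)              # distinct pairs among 60,000 points
print(size_bytes / 1e9)   # storage in GB (10^9 bytes), one copy only
```

Either way the result is well beyond what fits comfortably in RAM on a small machine, before counting any temporary copies.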

Will Matlab standalone be faster than Matlab from UI for long execution code?

I have built a standalone Matlab application. I was expecting it to be faster than running the application from the Matlab environment, but it is in fact a bit slower (1.3 sec per iteration vs 1.5 sec per iteration).
I am not counting the init time required by MCR but the execution of my code.
Is that the expected performance or should I be obtaining a performance improvement?
I haven't found any settings on the deployment tool that could help to reduce execution time.
Thanks in advance

Applications built with MATLAB Compiler should execute at pretty much exactly the same speed as within MATLAB.
MATLAB Compiler does not convert your MATLAB code into machine code in the same way as a C compiler does for C. What it does is to archive and encrypt your MATLAB code (note, it properly encrypts it, not just pcodes it as a comment suggests), create a thin executable wrapper and package them together, possibly also with MATLAB Compiler Runtime (MCR). MCR is very similar to MATLAB itself, without a graphical user interface, and is freely redistributable.
When you run the executable, it dearchives and decrypts your MATLAB code and runs it against the MCR. It should run exactly the same, both in terms of results and speed.
Very old versions of MATLAB Compiler (pre-version 4.0) worked in a different way, converting a subset of the MATLAB language into C code, and compiling this. This provided a potentially significant speed-up, but only a subset of the language was supported and results, unless you were careful, could sometimes be different. Similar functionality is now available in the separate MATLAB Coder product.
There are a few small things you can do to improve performance: for example, within deploytool you can specify which toolboxes your application uses. deploytool uses a dependency checker to package up all MATLAB functionality that it thinks your code might possibly depend on, but it can't always tell exactly, as the functions your code needs might change at runtime. It therefore errs on the side of caution and includes more than necessary. By specifying only the toolboxes you know to be necessary, you can speed things up a little (it also speeds up the build process quite a bit).

Different results for the same algorithm in matlab

I'm doing a linear algebra assignment comparing the performance and stability of two QR factorization algorithms, Gram-Schmidt and Householder.
My question comes up when calculating the following table:
Here the matrices Q and R are the results of the QR factorizations obtained by applying Gram-Schmidt and Householder to a Hilbert matrix A, I is the identity matrix of dimension N, and || * || is the Frobenius norm.
When I do the calculations on different computers I get different results in some cases. What could this be due to? The table above corresponds to the calculations performed on a 32-bit computer and the next table to a 64-bit one:
Do these results in Matlab depend on the computer architecture on which the calculations were made?

I'm really interested by an answer if you find one!
Unfortunately plenty of things can change the numerical results...
To be efficient, some LAPACK algorithms iterate over sub-matrix blocks. For optimal efficiency, the block size has to match the sizes of the CPU's L1/L2/L3 caches...
The block size is controlled by the LAPACK routine ILAENV; see http://www.netlib.org/lapack/lug/node120.html
Of course, if the block sizes differ, the results will differ numerically... It is possible that the LAPACK/BLAS DLLs provided by Matlab are compiled with a differently tuned version of ILAENV on the two machines, or that ILAENV has been replaced with a customized, optimized version that takes cache size into account. You could check for yourself by writing a small C program that calls ILAENV and linking it against the DLL provided by Matlab...
For the underlying BLAS, it's even worse: if an optimized version is used, fused multiply-add FPU instructions might be used when available, for example, and they are not necessarily available on all FPUs. AFAIK, Matlab uses ATLAS http://math-atlas.sourceforge.net/, and you'll have to inquire about how the library was produced... You would have to track differences in the results of basic algebra operations (like matrix*vector or matrix*matrix ...).
UPDATE: Even if ILAENV were the same, QR uses elementary rotations, so it obviously depends on the sin/cos implementation. Unfortunately no standard specifies exactly how sin and cos should behave bitwise: they can be off by a few ulp from the rounded exact result, they differ from one library to another, and they will give different results on different architectures/compilers (hardwired in x87 FPU). So unless you provide your own version of these functions (or work in Ada), compile them with specially crafted compiler options, and maybe finely control the FPU modes, there's close to no chance of getting exactly the same results on different architectures... You will also have to ask Matlab whether they took special care to ensure deterministic floating-point results when they compiled those libraries.

That depends on the Matlab implementation. Do you get the same result when you rerun on the same architecture? If yes, the problem may be caused by precision. Sometimes it is caused by a different FPU (floating-point unit) in the CPU. You could try more 32-bit/64-bit machines with different CPUs.
The best answer would come from your Matlab provider. Just call them if you have a valid license.
According to this link, one cause of the difference is that if calculations are done with the x87 instructions, intermediate values are held in 80-bit precision. Depending on compiler optimisations, numbers may stay at 80 bits for a few operations before being truncated back to 64 bits. This can cause variations. See http://gcc.gnu.org/wiki/x87note for more info.
The gcc man page says that using SSE (instead of 387) is the default on x86-64 platforms. You should be able to force it on 32-bit using something like
-mfpmath=sse -msse -msse2
