Is it possible to use pre-calculated factorization to accelerate backslash\mldivide with sparse matrix - matlab

I perform many iterations of solving a linear system of equations: Mx=b with large and sparse M.
M doesn't change between iterations but b does. I've tried several methods and so far found the backslash\mldivide to be the most efficient and accurate.
The following code is very similar to what I'm doing:
for ii=1:num_iter
dx = M\x;
x = x+dx;
end
Now I want to accelerate the computation even more by utilizing the fact that M is fixed.
Setting the flag spparms('spumoni',2) makes the sparse solver print detailed diagnostic information about the algorithm it uses.
I ran the following code:
spparms('spumoni',2);
x = M\B;
The output (monitoring):
sp\: bandwidth = 2452+1+2452.
sp\: is A diagonal? no.
sp\: is band density (0.01) > bandden (0.50) to try banded solver? no.
sp\: is A triangular? no.
sp\: is A morally triangular? no.
sp\: is A a candidate for Cholesky (symmetric, real positive diagonal)? no.
sp\: use Unsymmetric MultiFrontal PACKage with Control parameters:
UMFPACK V5.4.0 (May 20, 2009), Control:
Matrix entry defined as: double
Int (generic integer) defined as: UF_long
0: print level: 2
1: dense row parameter: 0.2
"dense" rows have > max (16, (0.2)*16*sqrt(n_col) entries)
2: dense column parameter: 0.2
"dense" columns have > max (16, (0.2)*16*sqrt(n_row) entries)
3: pivot tolerance: 0.1
4: block size for dense matrix kernels: 32
5: strategy: 0 (auto)
6: initial allocation ratio: 0.7
7: max iterative refinement steps: 2
12: 2-by-2 pivot tolerance: 0.01
13: Q fixed during numerical factorization: 0 (auto)
14: AMD dense row/col parameter: 10
"dense" rows/columns have > max (16, (10)*sqrt(n)) entries
Only used if the AMD ordering is used.
15: diagonal pivot tolerance: 0.001
Only used if diagonal pivoting is attempted.
16: scaling: 1 (divide each row by sum of abs. values in each row)
17: frontal matrix allocation ratio: 0.5
18: drop tolerance: 0
19: AMD and COLAMD aggressive absorption: 1 (yes)
The following options can only be changed at compile-time:
8: BLAS library used: Fortran BLAS. size of BLAS integer: 8
9: compiled for MATLAB
10: CPU timer is ANSI C clock (may wrap around).
11: compiled for normal operation (debugging disabled)
computer/operating system: Microsoft Windows
size of int: 4 UF_long: 8 Int: 8 pointer: 8 double: 8 Entry: 8 (in bytes)
sp\: is UMFPACK's symbolic LU factorization (with automatic reordering) successful? yes.
sp\: is UMFPACK's numeric LU factorization successful? yes.
sp\: is UMFPACK's triangular solve successful? yes.
sp\: UMFPACK Statistics:
UMFPACK V5.4.0 (May 20, 2009), Info:
matrix entry defined as: double
Int (generic integer) defined as: UF_long
BLAS library used: Fortran BLAS. size of BLAS integer: 8
MATLAB: yes.
CPU timer: ANSI clock ( ) routine.
number of rows in matrix A: 3468
number of columns in matrix A: 3468
entries in matrix A: 60252
memory usage reported in: 16-byte Units
size of int: 4 bytes
size of UF_long: 8 bytes
size of pointer: 8 bytes
size of numerical entry: 8 bytes
strategy used: symmetric
ordering used: amd on A+A'
modify Q during factorization: no
prefer diagonal pivoting: yes
pivots with zero Markowitz cost: 1284
submatrix S after removing zero-cost pivots:
number of "dense" rows: 0
number of "dense" columns: 0
number of empty rows: 0
number of empty columns 0
submatrix S square and diagonal preserved
pattern of square submatrix S:
number rows and columns 2184
symmetry of nonzero pattern: 0.904903
nz in S+S' (excl. diagonal): 62184
nz on diagonal of matrix S: 2184
fraction of nz on diagonal: 1.000000
AMD statistics, for strict diagonal pivoting:
est. flops for LU factorization: 2.76434e+007
est. nz in L+U (incl. diagonal): 306216
est. largest front (# entries): 31329
est. max nz in any column of L: 177
number of "dense" rows/columns in S+S': 0
symbolic factorization defragmentations: 0
symbolic memory usage (Units): 174698
symbolic memory usage (MBytes): 2.7
Symbolic size (Units): 9196
Symbolic size (MBytes): 0
symbolic factorization CPU time (sec): 0.00
symbolic factorization wallclock time(sec): 0.00
matrix scaled: yes (divided each row by sum of abs values in each row)
minimum sum (abs (rows of A)): 1.00000e+000
maximum sum (abs (rows of A)): 9.75375e+003
symbolic/numeric factorization: upper bound actual %
variable-sized part of Numeric object:
initial size (Units) 149803 146332 98%
peak size (Units) 1037500 202715 20%
final size (Units) 787803 154127 20%
Numeric final size (Units) 806913 171503 21%
Numeric final size (MBytes) 12.3 2.6 21%
peak memory usage (Units) 1083860 249075 23%
peak memory usage (MBytes) 16.5 3.8 23%
numeric factorization flops 5.22115e+008 2.59546e+007 5%
nz in L (incl diagonal) 593172 145107 24%
nz in U (incl diagonal) 835128 154044 18%
nz in L+U (incl diagonal) 1424832 295683 21%
largest front (# entries) 348768 30798 9%
largest # rows in front 519 175 34%
largest # columns in front 672 177 26%
initial allocation ratio used: 0.309
# of forced updates due to frontal growth: 1
number of off-diagonal pivots: 0
nz in L (incl diagonal), if none dropped 145107
nz in U (incl diagonal), if none dropped 154044
number of small entries dropped 0
nonzeros on diagonal of U: 3468
min abs. value on diagonal of U: 4.80e-002
max abs. value on diagonal of U: 1.00e+000
estimate of reciprocal of condition number: 4.80e-002
indices in compressed pattern: 13651
numerical values stored in Numeric object: 295806
numeric factorization defragmentations: 0
numeric factorization reallocations: 0
costly numeric factorization reallocations: 0
numeric factorization CPU time (sec): 0.05
numeric factorization wallclock time (sec): 0.00
numeric factorization mflops (CPU time): 552.22
solve flops: 1.78396e+006
iterative refinement steps taken: 1
iterative refinement steps attempted: 1
sparse backward error omega1: 1.80e-016
sparse backward error omega2: 0.00e+000
solve CPU time (sec): 0.00
solve wall clock time (sec): 0.00
total symbolic + numeric + solve flops: 2.77385e+007
Observe the lines:
numeric factorization flops 5.22115e+008 2.59546e+007 5%
solve flops: 1.78396e+006
total symbolic + numeric + solve flops: 2.77385e+007
These figures indicate that the numeric factorization of M accounts for 2.59546e+007/2.77385e+007 = 93.6% of the total floating-point work (and presumably a similar share of the time) required to solve the equations.
I would like to calculate the factorization in advance, outside of my iterations, and then run only the last stage, which accounts for about 6.4% of the work.
I know how to calculate the factorization ([L,U,P,Q,R] = lu(M);) but I don't know how to utilize its output as input to a solver.
I would like to run something in the spirit of:
[L,U,P,Q,R] = lu(M);
for ii=1:num_iter
dx = solve_pre_factored(L,U,P,Q,R,x);
x = x+dx;
end
Is there a way to do that in Matlab?

You have to ask yourself what all these matrices from the LU factorization do.
As the documentation states :
[L,U,P,Q,R] = lu(A) returns unit lower triangular matrix L, upper triangular matrix U, permutation matrices P and Q, and a diagonal scaling matrix R so that P*(R\A)*Q = L*U for sparse non-empty A. Typically, but not always, the row-scaling leads to a sparser and more stable factorization. The statement lu(A,'matrix') returns identical output values.
In more mathematical terms, $P R^{-1} A Q = L U$, and therefore $A = R P^{-1} L U Q^{-1}$.
Then x = M\x can be rewritten as the following steps:
$y = R^{-1} x$
$z = P y$
$u = L^{-1} z$
$v = U^{-1} u$
$w = Q v$
$x = w$
To invert U, L and R you can use \ which will recognize they are triangular (and diagonal for R) matrices - as monitoring should confirm, and use the appropriate trivial solvers for them.
Thus in a denser and matlab-written way : x = Q*(U\(L\(P*(R\x))));
Doing this will be exactly what happens inside the solver \, with only a single factorization, as you asked.
However, as stated in the comments, for a large number of solves it may become faster to compute $N = M^{-1}$ once and then do only a simple matrix-vector multiplication per iteration, which is much simpler than the process explained above. The initial computation, inv(M), takes longer and has some limitations (for example, the inverse of a sparse matrix is generally dense), so this trade-off also depends on the properties of your matrix.
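As a rough illustration of the factor-once, solve-many pattern described above, here is a minimal sketch. The small tridiagonal matrix, its size n, and the variable names are stand-ins chosen for the example, not the asker's actual system.

```matlab
% Stand-in for the real system: a small, unsymmetric, diagonally dominant
% sparse matrix (an assumption made only for this example).
n = 200;
e = ones(n,1);
M = spdiags([-e, 4*e, -2*e], -1:1, n, n);

% Factor once, outside the loop: P*(R\M)*Q = L*U.
[L,U,P,Q,R] = lu(M);

% Sanity check: the factored solve should agree with backslash.
ref    = M\e;
first  = Q*(U\(L\(P*(R\e))));
relerr = norm(first - ref) / norm(ref);

% Reuse the factors in every iteration; only triangular solves remain.
x = e;
for ii = 1:5
    dx = Q*(U\(L\(P*(R\x))));
    x  = x + dx;
end
```

On MATLAB R2017b or later, dM = decomposition(M) followed by dM\x is intended for the same purpose and may be worth timing against the explicit factors.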

(q/kdb+) Interpolation formula not working for some cases

I have the formulas below to generate a linear interpolation in q:
lsfit:{(enlist y) lsq x xexp/: til 1+z};
interp:{[xn;x;y]sum (1;xn)*flip lsfit[x;y;1]};
and the data below to interpolate:
xn:(4.7;7.5;4.9);
x:(3 5f;7.5 7.5;3 5f);
y:(1.3 1.5;2 2f;1.3 1.5);
interp'[xn;x;y]
which is generating
index value
0 enlist 1.47
1 enlist 0nf
2 enlist 1.49
Why am I getting a null (0n) in the second row?
Update: Inconsistent behaviour for other examples
xn:(6;7;8;9);
x:(6 6f;7 7f;8 8f;9 9f);
y:(1 1f;1 1f;1 1f;1 1f);
interp'[xn;x;y]
generates
index value
0 enlist 1f
1 enlist 0nf
2 enlist 0nf
3 enlist 1f
So, it looks like sometimes the formula works, rows 0 and 3, and sometimes it does not, rows 1 and 2.
How can I fix it?
Thanks!
The reason you are encountering this issue lies in the mathematical details of matrix division.
Matrix division can be performed by taking the inverse of a matrix and then matrix multiplying. In q, this can be seen by performing those operations directly.
q) enlist[2 2f] lsq (1 2f;3 4f)
-1 1
q) enlist[2 2f] mmu inv (1 2f;3 4f)
-1 1
One of your input x values to lsfit is the row 7.5 7.5. With a z value of 1, the xexp operation converts that vector into the matrix (1 1;7.5 7.5). This matrix is then used in the lsq operation.
The problem then occurs because (1 1;7.5 7.5) is not invertible. A matrix is invertible if and only if its determinant is non-zero. The determinant of a 2 x 2 matrix is AD - BC. In your example, A = 1, B = 1, C = 7.5, and D = 7.5. So the determinant is zero, the matrix is not invertible, and the output from the function is the null value 0n.
To resolve this issue, you would have to ensure that the two items in each row of x are not identical.
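The same arithmetic can be checked outside q. The following is only a sketch in MATLAB (not q), using the two rows from the question; the variable names are made up for the example, and the right division y / V plays the role that lsq plays in lsfit.

```matlab
% Matrices that lsfit builds from x = 7.5 7.5 and from x = 3 5 (z = 1).
V_bad  = [1 1; 7.5 7.5];
V_good = [1 1; 3 5];

% 2 x 2 determinant, AD - BC, written out explicitly.
det2   = @(V) V(1,1)*V(2,2) - V(1,2)*V(2,1);
d_bad  = det2(V_bad);    % 1*7.5 - 1*7.5
d_good = det2(V_good);   % 1*5   - 1*3

% For the invertible case, solve c*V = y and evaluate the line at xn = 4.7.
c  = [1.3 1.5] / V_good; % c(1) is the intercept, c(2) the slope
yn = c * [1; 4.7];
```

Here d_bad is zero, so no unique line exists through two points with the same x, while the invertible case reproduces the 1.47 reported for the first row.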
Hope that helps.

Matlab - convert sparse matrix to complex-sparse matrix

I have sparse matrix A which I need to convert to complex-sparse matrix by setting its imaginary part to zero.
A = sprand(3,3,0.5);
A_c = complex(A,0);
However, this throws me an error that A should be full.
Error using complex
Real input A must be numeric, real, and full.
Is there any work-around to achieve this?
When I first answered this question I did not consider the way complex sparse matrices are implemented in MATLAB. I tricked myself into the following answer.
Naive solution
You could apply complex() to each element of the matrix.
A_c = spfun(@(x)complex(x,0),A)
Here @(x)complex(x,0) denotes an anonymous function that, applied to each element x of the matrix A, returns a complex number with Re=x and Im=0. And spfun just returns a new sparse matrix produced by applying our anonymous function to the nonzero elements of the matrix A.
What happens is that this solution returns an object identical to the original matrix. The matrix A_c occupies the same number of bytes and is equal to the original matrix A.
>> whos A A_c
Variables in the current scope:
  Attr  Name  Size  Bytes  Class
  ====  ====  ====  =====  =====
        A     3x3      76  double
        A_c   3x3      76  double
A comment from Florian Roemer made me reconsider my answer.
Explanation
The representation of sparse matrices in MATLAB is described in a paper by Gilbert, Moler and Schreiber published in 1991.
A sparse matrix is represented as a single vector of its nonzero elements, of the corresponding storage class (i.e. double or complex) and stored in column-wise order, plus an integer vector holding the row indices of these elements, plus an integer vector holding the locations where each new column starts.
That is, an m*n sparse matrix with k nonzero elements occupies (n+1)*4 + k*12 bytes, with 4 bytes per integer and 8 bytes per real stored in double precision. For example, a 3x3 real sparse matrix with 5 nonzero entries occupies (4+5)*4 + 8*5 = 76 bytes.
A complex sparse matrix would have another real array for the imaginary parts of all nonzero entries of the matrix but only if at least one element has a nonzero imaginary part.
Consider
>> B = sprand(3,3,0.5)
B =
Compressed Column Sparse (rows = 3, cols = 3, nnz = 5 [56%])
(1, 1) -> 0.46883
....
>> B_c = B ; B_c(1,1) += 1e-100i
B_c =
Compressed Column Sparse (rows = 3, cols = 3, nnz = 5 [56%])
(1, 1) -> 4.6883e-01 + 1.0000e-100i
....
Now we have forced the allocation of additional storage for the imaginary parts of every nonzero entry of the original matrix, even though only one of the entries has a nonzero imaginary part.
>> whos B B_c
Variables in the current scope:
  Attr  Name  Size  Bytes  Class
  ====  ====  ====  =====  =====
        B     3x3      76  double
    c   B_c   3x3     116  double
Now B_c is a proper complex sparse matrix that occupies
(4+5)*4 + 8 * 5 + 8 * 5 = 116 bytes
Conclusion
If you just need a sparse matrix with zero imaginary parts, then do nothing to the original matrix.
If you need a matrix that actually allocates storage for the complex entries and carries the complex attribute, then add a small imaginary value to at least one of the nonzero entries of the original matrix.
Matlab remark: I did not test this in actual Matlab but Octave is quite happy with this solution.
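Since the transcript above uses Octave's += operator, here is a sketch of the same trick written with plain assignment, which should be accepted by both MATLAB and Octave. Pinning A(1,1) to 0.5 is an assumption added only so that a stored nonzero is guaranteed at that position.

```matlab
A = sprand(3,3,0.5);
A(1,1) = 0.5;                    % guarantee a stored nonzero at (1,1)

% Add a negligible imaginary part to one stored entry.
A_c = A;
A_c(1,1) = A_c(1,1) + 1e-100i;

stillSparse = issparse(A_c);     % storage stays sparse
isComplex   = ~isreal(A_c);      % complex attribute is now set
sameReal    = isequal(real(A_c), A);
```

The real part is unchanged, so downstream code that only needs the complex attribute should see the same values as before.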

Find the number of zero elements in a matrix in MATLAB [duplicate]

This question already has answers here:
Find specific value's count in a vector
(4 answers)
Closed 8 years ago.
I have a NxM matrix for example named A. After some processes I want to count the zero elements.
How can I do this in one line of code? I tried A==0, which returns a 2D matrix.
There is a function, nnz, that finds the number of nonzero matrix elements. You can use this function on a logical matrix, in which case it returns the number of true elements.
In this case, we apply nnz to the matrix A==0, whose elements are true where the original element was 0 and false everywhere else.
A = [1, 3, 1;
0, 0, 2;
0, 2, 1];
nnz(A==0) %// returns 3, i.e. the number of zeros of A (the amount of true in A==0)
The credits for the benchmarking belong to Divarkar.
Benchmarking
Using the following parameters and inputs, one can benchmark the solutions presented here with timeit.
Input sizes
Small sized datasize - 1:10:100
Medium sized datasize - 50:50:1000
Large sized datasize - 500:500:4000
Varying % of zeros
~10% of zeros case - A = round(rand(N)*5);
~50% of zeros case - A = rand(N);A(A<=0.5)=0;
~90% of zeros case - A = rand(N);A(A<=0.9)=0;
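A minimal timeit sketch for one point of that grid might look like the following. The size N and the choice of the ~50% zeros case are arbitrary, and timings will vary by machine and MATLAB version, so no results are claimed here.

```matlab
N = 500;                          % one arbitrary medium-sized case
A = rand(N); A(A <= 0.5) = 0;     % ~50% zeros, as in the list above

% The three candidates discussed in this thread.
f_nnz = @() nnz(A == 0);
f_sum = @() sum(A(:) == 0);
f_not = @() sum(~A(:));

% timeit calls each handle repeatedly and returns a typical run time.
t_nnz = timeit(f_nnz);
t_sum = timeit(f_sum);
t_not = timeit(f_not);

fprintf('nnz: %.3g s, sum(==0): %.3g s, sum(~): %.3g s\n', t_nnz, t_sum, t_not);
```

Looping this over the datasizes and zero fractions listed above would reproduce the shape of the benchmark, if not its exact numbers.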
The results were shown as three sets of plots (images not reproduced here):
1. Small Datasizes
2. Medium Datasizes
3. Large Datasizes
Observations
If you look closely into the NNZ and SUM performance plots for medium and large datasizes, you would notice that their performances get closer to each other for 10% and 90% zeros cases. For 50% zeros case, the performance gap between SUM and NNZ methods is comparatively wider.
As a general observation across all datasizes and all three fractions of zeros, the SUM method seems to be the undisputed winner. Interestingly, the general-case solution sum(A(:)==0) also seems to perform better than sum(~A(:)).
Some basic MATLAB to know: the (:) operator flattens any matrix into a column vector, and ~ is the NOT operator, which flips zeros to ones and nonzero values to zeros. Then we just use sum:
sum(~A(:))
This should also be about 10 times faster than the length(find(...)) scheme, in case efficiency is important.
Edit: if A may contain NaN values (for which the ~ operator raises an error), you can resort to this solution:
sum(A(:)==0)
I'll add something to the mix as well. You can use histc and compute the histogram of the entire matrix. You specify the second parameter to be which bins the numbers should be collected at. If we just want to count the number of zeroes, we can simply specify 0 as the second parameter. However, if you specify a matrix into histc, it will operate along the columns but we want to operate on the entire matrix. As such, simply transform the matrix into a column vector A(:) and use histc. In other words, do this:
histc(A(:), 0)
This should be equivalent to counting the number of zeroes in the entire matrix A.
Well, I don't know if I'm answering the question well, but you could code it as follows:
% Example matrix
M = [1 0 4 8 0 6;
     0 0 7 4 8 0;
     8 7 4 0 6 0];
n = size(M,1); % number of rows of M
p = size(M,2); % number of columns of M
nbrOfZeros = 0; % counter
for i = 1:n
    for j = 1:p
        if M(i,j) == 0
            nbrOfZeros = nbrOfZeros + 1;
        end
    end
end
nbrOfZeros % displays 7

What is the Haskell / hmatrix equivalent of the MATLAB pos function?

I'm translating some MATLAB code to Haskell using the hmatrix library. It's going well, but
I'm stumbling on the pos function, because I don't know what it does or what its Haskell equivalent will be.
The MATLAB code looks like this:
[U,S,V] = svd(Y,0);
diagS = diag(S);
...
A = U * diag(pos(diagS-tau)) * V';
E = sign(Y) .* pos( abs(Y) - lambda*tau );
M = D - A - E;
My Haskell translation so far:
(u,s,v) = svd y
diagS = diag s
a = u `multiply` (diagS - tau) `multiply` v
This actually type checks ok, but of course, I'm missing the "pos" call, and it throws the error:
inconsistent dimensions in matrix product (3,3) x (4,4)
So I'm guessing pos does something with matrix size? Googling "matlab pos function" didn't turn up anything useful, so any pointers are very much appreciated! (Obviously I don't know much MATLAB)
Incidentally this is for the TILT algorithm to recover low rank textures from a noisy, warped image. I'm very excited about it, even if the math is way beyond me!
Looks like the pos function is defined in a different MATLAB file:
function P = pos(A)
P = A .* double( A > 0 );
I can't quite decipher what this is doing. Assuming that boolean values cast to doubles where "True" == 1.0 and "False" == 0.0
In that case it turns negative values to zero and leaves positive numbers unchanged?
It looks as though pos finds the positive part of a matrix. You could implement this directly with mapMatrix
pos :: (Storable a, Num a, Ord a) => Matrix a -> Matrix a
pos = mapMatrix go
  where
    go x | x > 0     = x
         | otherwise = 0
Though Matlab makes no distinction between Matrix and Vector unlike Haskell.
But it's worth analyzing that Matlab fragment more. Per http://www.mathworks.com/help/matlab/ref/svd.html the first line computes the "economy-sized" Singular Value Decomposition of Y, i.e. three matrices such that
U * S * V' = Y
where, assuming Y is m x n with m >= n, U is m x n, S is n x n and diagonal, and V is n x n. Further, the columns of both U and V are orthonormal. In linear algebraic terms this separates the linear transformation Y into two "rotation" components and the central singular value scaling component.
Since S is diagonal, we extract that diagonal as a vector using diag(S) and then subtract a term tau, which must be a scalar or a vector of the same length. This might produce a diagonal containing negative values, which cannot be properly interpreted as singular values, so pos is there to trim out the negative values, setting them to 0. We then use diag to convert the resulting vector back into a diagonal matrix and multiply the pieces back together to get A, a modified form of Y.
Note that we can skip some steps in Haskell as svd (and its "economy-sized" partner thinSVD) return vectors of singular values instead of mostly 0'd diagonal matrices.
(u, s, v) = thinSVD y
-- note the trans here, that was the ' in Matlab
-- tau is assumed to be a scalar here
a = u `multiply` diag (mapVector (\x -> max 0 (x - tau)) s) `multiply` trans v
Above, mapVector maps the thresholding function over the Vector of singular values s (a storable Vector is not a Functor, so fmap does not apply) and then diag (from Numeric.Container) reinflates the Vector into a Matrix prior to the multiplys. With a little thought it's easy to see that max 0 is just pos applied to a single element.
(A>0) returns a logical mask marking the positions of elements of A which are larger than zero,
so for example, if you have
A = [ -1 2 -3 4
5 6 -7 -8 ]
then B = (A > 0) returns
B = [ 0 1 0 1
1 1 0 0]
Note that we have ones where the corresponding element of A is larger than zero, and zeros otherwise.
Now if you multiply this elementwise with A using the .* notation, then you are multiplying each element of A that is larger than zero by 1, and every other element by zero. That is, A .* B means
[ -1*0 2*1 -3*0 4*1
5*1 6*1 -7*0 -8*0 ]
giving finally,
[ 0 2 0 4
5 6 0 0 ]
So you need to write your own function that will return positive values intact, and negative values set to zero.
Also, u and v do not match in dimension for a general SVD, so you actually need to re-diagonalize pos(diagS - tau), i.e. turn it back into a diagonal matrix, so that the product u * diag(pos(diagS - tau)) * v' has consistent dimensions.
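The masking walk-through above can be confirmed with a few lines of MATLAB. This is just a sketch using the example matrix from this answer; the comparison against max(A, 0) is an added observation, not part of the original TILT code.

```matlab
% pos as defined in the MATLAB file quoted earlier, written as a handle.
pos = @(A) A .* double(A > 0);

A = [-1 2 -3 4;
      5 6 -7 -8];
P = pos(A);                        % negatives become zero, positives stay

expected  = [0 2 0 4; 5 6 0 0];
matches   = isequal(P, expected);
sameAsMax = isequal(P, max(A, 0)); % element-wise max with 0 is equivalent
```

This equivalence is why a Haskell translation can use max 0 applied to each element.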

Extremely large weighted average

I am using 64 bit matlab with 32g of RAM (just so you know).
I have a file (vector) of 1.3 million numbers (integers). I want to make another vector of the same length, where each point is a weighted average of the entire first vector, weighted by the inverse distance from that position (actually it's position ^-0.1, not ^-1, but for example purposes). I can't use matlab's 'filter' function, because it can only average things before the current point, right? To explain more clearly, here's an example of 3 elements
data = [ 2 6 9 ]
weights = [ 1 1/2 1/3; 1/2 1 1/2; 1/3 1/2 1 ]
results=data*weights= [ 8 11.5 12.666 ]
i.e.
8 = 2*1 + 6*1/2 + 9*1/3
11.5 = 2*1/2 + 6*1 + 9*1/2
12.666 = 2*1/3 + 6*1/2 + 9*1
So each point in the new vector is the weighted average of the entire first vector, weighting by 1/(distance from that position+1).
I could just remake the weight vector for each point, then calculate the results vector element by element, but this requires 1.3 million iterations of a for loop, each of which contains 1.3million multiplications. I would rather use straight matrix multiplication, multiplying a 1x1.3mil by a 1.3milx1.3mil, which works in theory, but I can't load a matrix that large.
I am then trying to make the matrix using a shell script and index it in matlab so only the relevant column of the matrix is called at a time, but that is also taking a very long time.
I don't have to do this in matlab, so any advice people have about utilizing such large numbers and getting averages would be appreciated. Since I am using a weight of ^-0.1, and not ^-1, it does not drop off that fast - the millionth point is still weighted at 0.25 compared to the original points weighting of 1, so I can't just cut it off as it gets big either.
Hope this was clear enough?
Here is the code for the answer below (so it can be formatted?):
data = load('/Users/mmanary/Documents/test/insertion.txt');
data=data.';
total=length(data);
x=1:total;
datapad=[zeros(1,total) data];
weights = ([(total+1):-1:2 1:total]).^(-.4);
weights = weights/sum(weights);
Fdata = fft(datapad);
Fweights = fft(weights);
Fresults = Fdata .* Fweights;
results = ifft(Fresults);
results = results(1:total);
plot(x,results)
The only sensible way to do this is with FFT convolution, the same idea that underpins fast filtering routines. It is very easy to do manually:
% Simulate some data
n = 10^6;
x = randi(10,1,n);
xpad = [zeros(1,n) x];
% Setup smoothing kernel
k = 1 ./ [(n+1):-1:2 1:n];
% FFT convolution
Fx = fft(xpad);
Fk = fft(k);
Fxk = Fx .* Fk;
xk = ifft(Fxk);
xk = xk(1:n);
Takes less than half a second for n=10^6!
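To see that the padded FFT product really reproduces the weight-matrix result, the three-element example from the question can be checked directly. This is a small sketch; the explicit Toeplitz weight matrix is only feasible at this toy size, and real() is added to discard any round-off-level imaginary parts left by ifft.

```matlab
data = [2 6 9];
n = numel(data);

% Direct result: weights 1/(distance+1) as an explicit n-by-n matrix.
direct = data * toeplitz(1 ./ (1:n));   % [8 11.5 12.6667]

% FFT convolution with the same padding and kernel layout as above.
xpad = [zeros(1,n) data];
k    = 1 ./ [(n+1):-1:2, 1:n];
xk   = ifft(fft(xpad) .* fft(k));
viaFFT = real(xk(1:n));

maxDiff = max(abs(viaFFT - direct));
```

For the asker's real data the exponent would be changed (for example to -0.1), but the padding and indexing stay the same.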
This is probably not the best way to do it, but with lots of memory you could definitely parallelize the process.
You can construct sparse matrices consisting of entries of your original matrix which have value i^(-1) (where i = 1 .. 1.3 million), multiply them with your original vector, and sum all the results together.
So for your example the product would be essentially:
a = rand(3,1);
b1 = [1 0 0;
0 1 0;
0 0 1];
b2 = [0 1 0;
1 0 1;
0 1 0] / 2;
b3 = [0 0 1;
0 0 0;
1 0 0] / 3;
c = sparse(b1) * a + sparse(b2) * a + sparse(b3) * a;
Of course, you wouldn't construct the sparse matrices this way. If you wanted fewer iterations of the inner loop, you could put more than one of the i's in each matrix.
Look into the parfor loop in MATLAB: http://www.mathworks.com/help/toolbox/distcomp/parfor.html
"I can't use matlab's 'filter' function, because it can only average things before the current point, right?"
That is not correct. You can always pad your data with zeros before filtering and trim samples from the filtered output afterwards. Since filtering with filter (you can also use conv, by the way) is a linear operation, padding with zeros does not change the result; it only shifts where each output sample lands, so you can pad -> filter -> trim to undo the delay.
Anyway, in your example, you can take the averaging kernel to be:
weights = 1 ./ [3 2 1 2 3]; % this kernel introduces a delay of 2 samples
and then simply:
result = filter(weights,1,[data, zeros(1,3)]); % or conv(data, weights, 'same'), which needs no trimming
% removing the delay introduced by the kernel
result = result(3:end-1);
You considered only 2 options:
Multiplying a 1.3M*1.3M matrix with a vector once, or multiplying two 1.3M vectors 1.3M times.
But you can divide your weight matrix into as many sub-matrices as you wish and multiply an n*1.3M matrix with the vector 1.3M/n times.
I assume the fastest option uses the smallest number of iterations, with n chosen to give the largest sub-matrix that fits in memory without making your computer swap pages to the hard drive.
Each row of 1.3M doubles takes about 10.4 MB, so n=5000 would need about 52 GB; with 32 GB of RAM, n=1000 (about 10.4 GB) is a safer starting point.
You can also make it faster by using parfor (with n divided by the number of processors).
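The block-wise idea can be sketched as follows. The data vector, its length, and the block size blk are small stand-ins picked for the example, and the expression rows' - idx relies on implicit expansion, which assumes MATLAB R2016b or later (bsxfun would be needed on older versions).

```matlab
data = randi(10, 1, 2000);      % small stand-in for the 1.3M-element vector
N    = numel(data);
blk  = 256;                     % rows of the weight matrix built per step
idx  = 1:N;
results = zeros(1, N);

for s = 1:blk:N
    rows = s:min(s + blk - 1, N);
    % blk-by-N slice of the weight matrix: (distance + 1)^(-0.1)
    Wblk = (abs(rows' - idx) + 1) .^ (-0.1);
    results(rows) = (Wblk * data')';
end

% Reference using the full matrix, feasible only at this small size.
ref     = data * toeplitz((1:N) .^ (-0.1));
maxDiff = max(abs(results - ref));
```

Only one slice of the weight matrix exists at a time, so memory use is set by blk rather than by N.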
The brute force way will probably work for you, with one minor optimisation in the mix.
The ^-0.1 operations to create the weights will take a lot longer than the + and * operations to compute the weighted-means, but you re-use the weights across all the million weighted-mean operations. The algorithm becomes:
Create a weightings vector with all the weights any computation would need:
weights = (abs(-n:n)+1).^-0.1
For each element in the vector:
Index the relevant portion of the weights vector to consider the current element as the 'centre'.
Perform the weighted-mean with the weights portion and the entire vector. This can be done with a fast vector dot-multiply followed by a scalar division.
The main loop does n^2 multiplications and n^2 additions. With n equal to 1.3 million that's about 3.4 trillion operations. A single core of a modern 3GHz CPU can do, say, 6 billion additions/multiplications a second, so that comes out to around 10 minutes. Add time for indexing the weights vector and overheads, and I still estimate you could come in under half an hour.
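The indexing step is the part that is easy to get wrong, so here is a sketch on the three-element example from the question. The exponent is set to -1 so that the output can be compared with the question's numbers, and the final division by the sum of the weights is left out for the same reason; both are choices made for this illustration.

```matlab
data = [2 6 9];
n    = numel(data);

% One weight per possible offset -(n-1)..(n-1), computed once.
w = (abs(-(n-1):(n-1)) + 1) .^ (-1);   % use -0.1 for the real data

results = zeros(1, n);
for p = 1:n
    % Offsets (1-p):(n-p) relative to element p map to these indices of w.
    wp = w((n - p + 1):(2*n - p));
    results(p) = dot(wp, data);
end
```

For the weighted mean described above, each results(p) would additionally be divided by sum(wp).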