I'm currently using a linear model of coregionalization
(see e.g. the Álvarez et al. notes, https://arxiv.org/pdf/1106.6251.pdf),
which is optimized via SVGP.
I noticed that the maximum number of inducing points I can use before running OOM has dropped considerably (now about ~5k inducing points instead of 8k when not using a coregionalized kernel). From my understanding, the limiting bottleneck should have been the same (still the MxM kernel matrix), but apparently more has changed.
In addition, I now get the warning:
.../lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:112: UserWarning:
Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
The kernel matrices are constructed as follows.
I don't use big Q or R (Q = 3, R = 3).
import gpflow
import numpy as np
from gpflow.kernels import Matern52, RBF

def coreg_k(Q, R, output_dim, active_dims):
    # create Q different kernels, each with a rank-R coregionalization matrix
    coreg = []
    k_q = []
    # lengthscales = np.logspace(-1, 3, 5)
    lengthscales = [0.1, 1, 5]
    for q in range(Q):
        coreg_tmp = gpflow.kernels.Coregion(input_dim=1, output_dim=output_dim,
                                            rank=R, active_dims=active_dims)
        coreg_tmp.W = np.random.randn(output_dim, R)
        coreg.append(coreg_tmp)
        # data kernel: Matern52 on the coordinates plus an ARD RBF on the rest
        k_tmp = []
        k_tmp.append(Matern52(input_dim=len(kernel_idxs["coords"]), active_dims=kernel_idxs["coords"],
                              lengthscales=lengthscales[q], ARD=False))
        k_tmp.append(RBF(input_dim=len(kernel_idxs["rest"]), active_dims=kernel_idxs["rest"],
                         ARD=True, lengthscales=lengthscales[q]))
        k = k_tmp[0]
        for i in range(1, len(k_tmp)):
            k += k_tmp[i]
        k_q.append(k)
    # combine all those kernels into the LMC sum
    kern_lcm = coreg[0] * k_q[0]
    for q in range(1, Q):
        kern_lcm += coreg[q] * k_q[q]
    return kern_lcm
What is taking up so much memory? The few extra parameters from the additional kernels should not change things that much.
Thanks.
In the computation of the Kuu matrix, the coregionalisation kernel constructs an M x M matrix. So if you have Q Coregion kernels, TensorFlow actually needs to allocate Q x M x M memory. This is not orders of magnitude more, but linear in the number of kernels, which roughly matches how many fewer inducing points you can fit in memory on your machine.
For a more efficient implementation of the intrinsic coregionalisation model, have a look at the multi-output framework notebook in the GPflow documentation. Hope this helps!
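To make the arithmetic concrete, here is a back-of-the-envelope sketch (my own illustration in Python; the float64 assumption matches GPflow's default dtype):

# memory for Q stacked M x M float64 kernel matrices
bytes_per_float64 = 8
def kuu_gigabytes(Q, M):
    return Q * M * M * bytes_per_float64 / 1e9

print(kuu_gigabytes(1, 8000))  # ~0.51 GB: one kernel, 8k inducing points
print(kuu_gigabytes(3, 5000))  # ~0.60 GB: Q = 3 Coregion kernels, 5k points

Both configurations land in the same ballpark, consistent with the linear-in-Q scaling described above; gradients and TensorFlow's intermediate buffers multiply the real footprint further.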
I have code that does a 4-level decomposition of an image. The levels are similar to the wavelet transform of an image, which decomposes the image into 4 parts: the approximation part and the three detail parts. The code I have implemented uses a generalized SVD to do this decomposition. Here is the code:
function [Y, U] = MSVD1(X)
% multiresolution SVD (MSVD)
% input   -> X: image (spatial domain)
% outputs -> Y: one-level MSVD decomposition of X
%            U: the unitary matrix (U in SVD)
[m, n] = size(X);
m = m/2; n = n/2;
A = zeros(4, m*n);
for j = 1:n
    for i = 1:m
        A(:, i + (j-1)*m) = reshape(X((i-1)*2+(1:2), (j-1)*2+(1:2)), 4, 1);
    end
end
[U, S] = svd(A);
T = U' * A;
Y.LL = reshape(T(1,:), m, n);
Y.LH = reshape(T(2,:), m, n);
Y.HL = reshape(T(3,:), m, n);
Y.HH = reshape(T(4,:), m, n);
end
Now the basic operation involved here is the SVD. So my question is: should the time complexity, in Big O terms, be the same as for a normal SVD of a matrix? If not, what terms do we need to take into account to find the complexity in terms of the input size of the image? Do the reshape operations also contribute to the time complexity, or are they just O(1)?
Can somebody help?
First, the complexity of the constant-size reshape (inside the loop) is O(1); hence, the complexity of the for loop is Theta(m*n). Second, the complexity of svd is O(max(m, n) * min(m, n)), and depending on which outputs the function has to return it can be O(max(m, n)^2) (according to this reference). Moreover, based on @Daniel's comment, the worst case for the reshapes at the end of your code can be O(m*n) (it is usually less than this).
Therefore, the complexity of the code is O(max(m, n)^2). Also, because of the loop, it is Omega(m*n).
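For intuition, here is a rough NumPy port of MSVD1 (my own sketch, not the original MATLAB): the block gathering is pure reshape/transpose bookkeeping, Theta(m*n) data movement, and the factorization runs on a 4 x (m*n) matrix, so the economy-size SVD is linear in the number of pixels; a full SVD, which would materialize the (m*n) x (m*n) V factor, is what drives the quadratic bound above.

import numpy as np

def msvd1(X):
    # one-level multiresolution SVD; mirrors the MATLAB MSVD1 above
    m, n = X.shape
    m, n = m // 2, n // 2
    # gather every 2x2 block of X into one column of the 4 x (m*n) matrix A,
    # in the same column-major order the MATLAB loop produces
    A = X.reshape(m, 2, n, 2).transpose(3, 1, 2, 0).reshape(4, m * n)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # economy SVD: linear in m*n here
    T = U.T @ A
    Y = {band: T[k].reshape(n, m).T for k, band in enumerate(("LL", "LH", "HL", "HH"))}
    return Y, U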
I've benchmarked my GPU against itself and against the CPU for varying matrix sizes, and found the opposite of what most GPU literature suggests: the GPU's computing advantage diminishes with array size. Code, results, and specs are shown below. Noteworthy observations:
GPU utilization remains below 10%, according to Task Manager
~(50%, 20%) = (RAM, CPU) usage for large (K > 9000) arrays
A considerable drop in the speed ratio is observed for roughly K > 8000
Splitting the K > 8000 (= 9000) Xga matrix into four pieces doubles the vectorized speed
My GPU ranks far higher among GPUs than my CPU does among CPUs (#24 vs. #174); it thus seems an on-par CPU would outperform the GPU for larger arrays
The last plot's GPU vs. CPU benchmark supports the previous point; the GPU isn't as vastly superior as expected
What's the culprit - is my code, MATLAB, or the hardware configuration under-utilizing the GPU? And how can I find out and resolve this?
%% CODE: centroid indexing in K-means algorithm
% size(X) = [16000, 3]
% size(centroids) = [K, 3]
% Xga = gpuArray(single(X)); cga = gpuArray(single(centroids));
% Speed ratio = t2/t1, if t2 > t1 - else, t1/t2

%% TIMING
f1 = fasterFunction(...);
f2 = slowerFunction(...);
t1 = gputimeit(f1)  % OR timeit(f1) for non-GPU arrays
t2 = timeit(f2)     % OR gputimeit(f2) for GPU arrays

%% FUNCTIONS
function out = vecHammer(X, c, K, m)
% fully vectorized: broadcasts an m x 3 x K difference array
[~, out] = min(reshape(permute(sum((X - permute(c, [3 2 1])).^2, 2), [1 2 3]), m, K), [], 2);
end

function out = forvecHammer(X, c, m)
% loop over points, vectorized over centroids
out = zeros(m, 1);
for j = 1:m
    [~, out(j)] = min(sum(((X(j,:))' - c').^2));
end
end

function out = forforHammer(X, c, m, K)
% double loop over points and centroids
out = zeros(m, 1); idxtemp = zeros(K, 1);
for i = 1:m
    for j = 1:K
        idxtemp(j) = sum((X(i,:) - c(j,:)).^2, 2);
    end
    [~, out(i)] = min(idxtemp);
end
end
The probable answer is that the data was simply too small, so only so much could be parallelized; my GPU chews through gigabyte-scale datasets at a few percent utilization, and this one barely measured up to 10 MB.
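As a back-of-the-envelope check (a Python sketch of my own; sizes taken from the question, single precision assumed as in the gpuArray calls): the inputs really are tiny, but the broadcast intermediate inside vecHammer grows linearly with K, which is at least consistent with the slowdown observed around K > 8000.

# sizes of the input arrays vs. the m x 3 x K broadcast intermediate in vecHammer
bytes_per_single = 4
m, d = 16000, 3
for K in (1000, 8000, 9000):
    inputs_mb = (m * d + K * d) * bytes_per_single / 1e6
    intermediate_gb = m * d * K * bytes_per_single / 1e9
    print("K=%5d: inputs ~%.2f MB, intermediate ~%.2f GB" % (K, inputs_mb, intermediate_gb))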
I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those points:
"365","4",41401.387,5330569
"364","3",51517.886,5946290
"363","2",55059.838,6097388
"362","1",43780.977,5304694
"361","7",46447.196,5471836
"360","6",50656.121,5849862
"359","5",44494.476,5460289
Here's my code:
def parsePoint(line):
    split = map(sanitize, line.split(','))
    rev = split.pop(-2)  # the label is the second-to-last field
    return LabeledPoint(rev, split)

def sanitize(value):
    return float(value.strip('"'))

parsedData = textFile.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData, iterations=10)
print model.predict(parsedData.first().features)
The prediction is something totally crazy, like -6.92840330273e+136. If I don't set iterations in train(), then I get nan as a result. What am I doing wrong? Is it my data set (the size of it, maybe?) or my configuration?
The problem is that LinearRegressionWithSGD uses stochastic gradient descent (SGD) to optimize the weight vector of your linear model. SGD is really sensitive to the chosen step size, which is used to update the intermediate solution.
What SGD does is calculate the gradient g of the cost function given a sample of the input points and the current weights w. In order to update the weights w, you move a certain distance in the opposite direction of g. That distance is your step size s.
w(i+1) = w(i) - s * g
Since you're not providing an explicit step size, MLlib assumes stepSize = 1, which apparently does not work for your use case. I'd recommend trying different, usually lower, step sizes to see how LinearRegressionWithSGD behaves:
LinearRegressionWithSGD.train(parsedData, iterations=10, step=0.001)
(Note that in the PySpark API the keyword arguments are iterations and step, rather than the Scala API's numIterations and stepSize.)
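If it helps, here is a small sweep over candidate step sizes (a sketch reusing the names from the question; the step values are illustrative guesses, and for features at this scale normalizing the data would likely help too):

# try a few step sizes and compare training MSE (divergent settings give huge or NaN values)
for step in (1.0, 0.1, 0.01, 0.001):
    model = LinearRegressionWithSGD.train(parsedData, iterations=100, step=step)
    mse = (parsedData
           .map(lambda p: (p.label - model.predict(p.features)) ** 2)
           .mean())
    print "step=%g  MSE=%g" % (step, mse)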
Suppose I have a sparse matrix M with the following properties:
size(M) -> 100000 100000
sprank(M) -> 99236
nnz(M) -> 499987
numel(M) -> 1.0000e+10
How come solving the system takes way more than 8 GB of RAM? whos('M') reports only 8.4 MB.
I'm using the following code (provided at http://www.mathworks.com/moler/exm/chapters/pagerank.pdf)
function x = pagerank(G, p)
G = G - diag(diag(G));
[n, n] = size(G);
c = full(sum(G, 1));
r = full(sum(G, 2));
% Scale column sums to be 1 (or 0 where there are no out links).
k = find(c ~= 0);
D = sparse(k, k, 1./c(k), n, n);
% Solve (I - p*G*D)*x = e
e = ones(n, 1);
I = speye(n, n);
x = (I - p*G*D)\e;
% Normalize so that sum(x) == 1.
x = x/sum(x);
Left divide! That x = (I - p*G*D)\e does way more than it seems!
From the MATLAB documentation on mldivide for sparse matrices: not all the solvers take the same amount of memory, and some of them take a lot. Left dividing in MATLAB is fantastic, but you need to know what you are doing.
If you run out of memory, I suggest having a look at some iterative solvers, such as preconditioned conjugate gradients (PCG) or algebraic multigrid (AMG); in the case of complex numbers, I think the biconjugate gradients stabilized method works fine.
If you don't know where to start, I highly recommend PCG. In the project I am working on, my code for left dividing is something like:
% CAUTION! Pseudocode - do not run as-is
try
    x = A\b;
catch
    x = pcg(A, b);   % fall back to MATLAB's iterative pcg solver
end
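For reference, the same try-direct-then-iterate pattern in Python/SciPy (my own sketch, not the answerer's code; since the PageRank system (I - p*G*D) is not symmetric, the fallback uses BiCGSTAB rather than CG):

import scipy.sparse.linalg as spla

def solve_sparse(A, b):
    # try the direct sparse solver first; it can need far more memory
    # than A itself because of fill-in during factorization
    try:
        return spla.spsolve(A.tocsc(), b)
    except MemoryError:
        # iterative fallback; works for general nonsymmetric systems
        x, info = spla.bicgstab(A, b)
        if info != 0:
            raise RuntimeError("BiCGSTAB did not converge (info=%d)" % info)
        return x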
I have a physical measurement instrument (a force platform with load cells) which gives me three values, A, B and C. It happens, though, that these values - which should be orthogonal - are actually somewhat coupled, due to physical characteristics of the measuring device, which causes cross-talk between applied and returned values of force and torque.
Then, it is recommended that a calibration matrix be used to transform the measured values into a better estimate of the actual values, roughly actual(Fz, Mx, My) ≈ C * measured(Fz, Mx, My) for some 3x3 calibration matrix C.
The problem is that it is necessary to perform a SET of measurements, so that the various measured (Fz, Mx, My) and actual (Fz, Mx, My) triples are least-squared together to get a single C matrix that works best for the system as a whole.
I can solve Ax = b problems with scipy.linalg.lstsq, or even scipy.linalg.solve (giving an exact solution) for ONE measurement, but how should I proceed to take a set of different measurements into account, when each one on its own would give a potentially different 3x3 matrix?
Any help is much appreciated, thanks for reading.
I posted a similar question containing just the mathematical part of this at math.stackexchange.com, and this answer solved the problem:
math.stackexchange.com/a/232124/27435
In case anyone has a similar problem in the future, here is an almost literal SciPy implementation of that answer (the first lines are initialization boilerplate code):
import numpy
import scipy.linalg

### Origin of the coordinate system: upper left corner!
"""
1----------2
|          |
|          |
4----------3
"""

platform_width = 600
platform_height = 400

# positions of each load cell (one per corner)
loadcell_positions = numpy.array([[0, 0],
                                  [platform_width, 0],
                                  [platform_width, platform_height],
                                  [0, platform_height]])
platform_origin = numpy.array([platform_width, platform_height]) * 0.5

# applying a known force at known positions and taking the measurements
measurements_per_axis = 5
total_load = 50
results = []
for x in numpy.linspace(0, platform_width, measurements_per_axis):
    for y in numpy.linspace(0, platform_height, measurements_per_axis):
        position = numpy.array([x, y])
        for loadpos in loadcell_positions:
            # lever arm from the platform centre to the load cell, times the force
            moments = (platform_origin - loadpos) * total_load
            load = numpy.array([total_load])
            result = numpy.hstack([load, moments])
            results.append(result)
results = numpy.array(results)
noise = numpy.random.rand(*results.shape) - 0.5
measurements = results + noise

# expand each measured 3-vector into a 3x9 block so the nine unknown
# entries of the 3x3 matrix C appear as one linear system
expands = []
for n in xrange(measurements.shape[0]):
    m = measurements[n, :]
    expand = numpy.zeros((3, 9))
    expand[0, 0:3] = m
    expand[1, 3:6] = m
    expand[2, 6:9] = m
    expands.append(expand)
expands = numpy.vstack(expands)

# perform the actual regression: expands * vec(C) against the known true loads
C = scipy.linalg.lstsq(expands, results.reshape((-1, 1)))
C = numpy.array(C[0]).reshape((3, 3))

# the result with pure noise (not actual coupling) should be
# very close to a 3x3 identity matrix (and is!)
print C
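As a quick sanity check (my addition, not part of the original answer), you can apply the recovered matrix to the noisy measurements and compare against the known loads:

# rows of `corrected` are C applied to each measured triple
corrected = measurements.dot(C.T)
print numpy.abs(corrected - results).max()  # should be on the order of the noise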
Hope this helps someone!