I am trying to implement NMF with the Alternating Least Squares (ALS) method. I am just curious about the following basic implementation of the problem:
If I understand correctly, we can solve each matrix equation stated in this pseudocode without the nonnegativity constraint, using the closed-form solution, and then set the negative entries to 0, in a brute-force way. Is this understanding correct? Is this a basic alternative to more complicated constrained optimization problems, where we use, for example, projected gradient descent? More importantly, if implemented in this basic way, will the algorithm have any practical value? I want to use NMF for variable-reduction purposes, and it is important that I use NMF, since my data is by definition non-negative. I am looking for opinions on this one.
If I understand correctly, we can solve each matrix equation stated in this pseudocode without the nonnegativity constraint, with a closed-form solution, and set the negative entries to 0, in a brute-force way. Is this understanding correct? Yes.
Is this a basic alternative to more complicated constrained optimization problems, where we use, for example, projected gradient descent? In a sense, yes. This is indeed a fast way of computing a nonnegative factorization. However, articles on NMF point out that although this method is fast, it does not guarantee convergence of the nonnegative factors. A better implementation to use would be Hierarchical Alternating Least Squares for NMF (HALS-NMF). Check this paper for a comparison of some popular NMF algorithms: http://www.cc.gatech.edu/~hpark/papers/jgo.pdf
More importantly, if implemented in this basic way, will the algorithm have any practical value? Based on my experience, I would say the results aren't as good as those of, say, HALS or BPP (Block Principal Pivoting).
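For concreteness, here is a minimal sketch of the basic clipped-ALS loop the question describes (hypothetical variable names; M is the m x n nonnegative data matrix and r the target rank, both assumed given):
% Basic ALS with clipping - a sketch of the idea, not a recommendation.
[m, n] = size(M);
W = rand(m, r); H = rand(r, n);      % random nonnegative initialization
for iter = 1:100
    H = max((W'*W) \ (W'*M), 0);     % unconstrained LS for H, then clip
    W = max((M*H') / (H*H'), 0);     % unconstrained LS for W, then clip
end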
Using nonnegative least squares in this algorithm, as opposed to clipping off negative values, would obviously be better. In general, though, I would not recommend this basic ALS/ANNLS method, because it has bad convergence properties (it often fluctuates or can even diverge). A minimal Matlab implementation of a better method, the accelerated Hierarchical Alternating Least Squares method for NMF (of Cichocki et al.), which is currently one of the fastest methods, is shown here (code by Nicolas Gillis):
% Accelerated hierarchical alternating least squares (HALS) algorithm of
% Cichocki et al.
%
% See N. Gillis and F. Glineur, "Accelerated Multiplicative Updates and
% Hierarchical ALS Algorithms for Nonnegative Matrix Factorization",
% Neural Computation 24 (4), pp. 1085-1105, 2012.
% See http://sites.google.com/site/nicolasgillis/
%
% [U,V,e,t] = HALSacc(M,U,V,alpha,delta,maxiter,timelimit)
%
% Input.
% M : (m x n) matrix to factorize
% (U,V) : initial matrices of dimensions (m x r) and (r x n)
% alpha : nonnegative parameter of the accelerated method
% (alpha=0.5 seems to work well)
% delta : parameter to stop inner iterations when they become
% ineffective (delta=0.1 seems to work well).
% maxiter : maximum number of iterations
% timelimit : maximum time allotted to the algorithm
%
% Output.
% (U,V) : nonnegative matrices s.t. UV approximate M
% (e,t) : error and time after each iteration,
% can be displayed with plot(t,e)
%
% Remark. With alpha = 0, it reduces to the original HALS algorithm.
function [U,V,e,t] = HALSacc(M,U,V,alpha,delta,maxiter,timelimit)
% Initialization
etime = cputime; nM = norm(M,'fro')^2;
[m,n] = size(M); [m,r] = size(U);
a = 0; e = []; t = []; iter = 0;
if nargin <= 3, alpha = 0.5; end
if nargin <= 4, delta = 0.1; end
if nargin <= 5, maxiter = 100; end
if nargin <= 6, timelimit = 60; end
% Scaling, p. 72 of the thesis
eit1 = cputime; A = M*V'; B = V*V'; eit1 = cputime-eit1; j = 0;
scaling = sum(sum(A.*U))/sum(sum( B.*(U'*U) )); U = U*scaling;
% Main loop
while iter <= maxiter && cputime-etime <= timelimit
    % Update of U
    if j == 1, % Do not recompute A and B at first pass
        % Use actual computational time instead of estimates rhoU
        eit1 = cputime; A = M*V'; B = V*V'; eit1 = cputime-eit1;
    end
    j = 1; eit2 = cputime; eps = 1; eps0 = 1;
    U = HALSupdt(U',B',A',eit1,alpha,delta); U = U';
    % Update of V
    eit1 = cputime; A = (U'*M); B = (U'*U); eit1 = cputime-eit1;
    eit2 = cputime; eps = 1; eps0 = 1;
    V = HALSupdt(V,B,A,eit1,alpha,delta);
    % Evaluation of the error e at time t
    if nargout >= 3
        cnT = cputime;
        e = [e sqrt( (nM-2*sum(sum(V.*A))+ sum(sum(B.*(V*V')))) )];
        etime = etime+(cputime-cnT);
        t = [t cputime-etime];
    end
    iter = iter + 1; j = 1;
end

% Update of V <- HALS(M,U,V)
% i.e., optimizing min_{V >= 0} ||M-UV||_F^2
% with an exact block-coordinate descent scheme
function V = HALSupdt(V,UtU,UtM,eit1,alpha,delta)
[r,n] = size(V);
eit2 = cputime; % Use actual computational time instead of estimates rhoU
cnt = 1; % Enter the loop at least once
eps = 1; eps0 = 1; eit3 = 0;
while cnt == 1 || (cputime-eit2 < (eit1+eit3)*alpha && eps >= (delta)^2*eps0)
    nodelta = 0; if cnt == 1, eit3 = cputime; end
    for k = 1 : r
        deltaV = max((UtM(k,:)-UtU(k,:)*V)/UtU(k,k),-V(k,:));
        V(k,:) = V(k,:) + deltaV;
        nodelta = nodelta + deltaV*deltaV'; % used to compute norm(V0-V,'fro')^2
        if V(k,:) == 0, V(k,:) = 1e-16*max(V(:)); end % safety procedure
    end
    if cnt == 1
        eps0 = nodelta;
        eit3 = cputime-eit3;
    end
    eps = nodelta; cnt = 0;
end
For full code and comparison to other methods, see
https://sites.google.com/site/nicolasgillis/code
(section Accelerated MU and HALS algorithms for NMF)
and
N. Gillis and F. Glineur, "Accelerated Multiplicative Updates and Hierarchical ALS Algorithms for Nonnegative Matrix Factorization", Neural Computation 24 (4), pp. 1085-1105, 2012.
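A hypothetical usage sketch, with the parameter values suggested in the header comments above (random data and an arbitrary rank, just for illustration):
% Factorize a random nonnegative matrix with HALSacc.
m = 200; n = 100; r = 10;           % arbitrary sizes for illustration
M  = rand(m, n);                    % nonnegative data
U0 = rand(m, r); V0 = rand(r, n);   % random nonnegative initialization
[U, V, e, t] = HALSacc(M, U0, V0, 0.5, 0.1, 100, 60);
plot(t, e); xlabel('time (s)'); ylabel('error'); % as noted in the header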
Yes, this can be done, but no, you should not do it.
The bottleneck in NMF is not the nonnegative least squares calculation; it is the calculation of the right-hand side of the least squares equations and the loss calculation (if that is used to determine convergence). In my experience, with a fast NNLS solver, the NNLS adds less than 1% relative runtime compared to basic least squares solving. Nowadays (maybe not when you asked the question) there are very fast methods such as TNT-NN and sequential coordinate descent which make things very fast.
I have tried this clipping method, and the model quality was really poor; it was hardly reminiscent of what HALS or multiplicative updates produce.
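To make the distinction concrete, here is a sketch of ALS with a real NNLS solver, using Matlab's built-in lsqnonneg (which works one column at a time and is far slower than TNT-NN or sequential coordinate descent, but shows the structure; M, W, H as in the sketch above):
% Alternating nonnegative least squares with lsqnonneg - slow but exact.
for iter = 1:50
    for j = 1:size(M, 2)
        H(:, j) = lsqnonneg(W, M(:, j));    % min ||W*h - M(:,j)||, h >= 0
    end
    for i = 1:size(M, 1)
        W(i, :) = lsqnonneg(H', M(i, :)')'; % min ||H'*w - M(i,:)'||, w >= 0
    end
end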
I am trying to implement the Gauss-Seidel method in MATLAB, but there are two major mistakes in my code that I could not fix:
My code converges very well on small matrices, but it never converges on large matrices.
The code makes redundant iterations. How can I prevent redundant iterations?
Gauss-Seidel Method on wikipedia.
N = 5;
A = rand(N,N);
b = rand(N,1);
x = zeros(N,1);
sum = 0;
xold = x;
tic
for n_iter = 1:1000
    for i = 1:N
        for j = 1:N
            if (j ~= i)
                sum = sum + (A(i,j)/A(i,i)) * xold(j);
            else
                continue;
            end
        end
        x(i) = -sum + b(i)/A(i,i);
        sum = 0;
    end
    if(abs(x(i)-xold(j))<0.001)
        break;
    end
    xold = x;
end
gs_time = toc;
prompt1 = 'Gauss-Seidel Method Time';
prompt2 = 'x Matrix';
disp(prompt2);
disp(x);
disp(prompt1);
disp(gs_time);
First off, a generality: the Gauß-Seidel and Jacobi methods are only guaranteed to converge for diagonally dominant matrices, not generic random ones. So to get correct test examples, you need to constructively ensure that condition, for instance via
A = rand(N,N)+N*eye(N)
or similar.
Else the method will diverge towards infinity in some or all components.
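If in doubt, you can check strict diagonal dominance directly:
% Row i is strictly diagonally dominant iff |A(i,i)| > sum of |A(i,j)| for j ~= i,
% which is equivalent to 2*|A(i,i)| > sum over all j of |A(i,j)|:
isDiagDominant = all( 2*abs(diag(A)) > sum(abs(A), 2) )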
Now to some other strangeness in your implementation. What does
if(abs(x(i)-xold(j))<0.001)
mean? Note that this instruction is outside the loops where i and j are the iteration variables, so potentially the index values are undefined. Because the loops have just finished, they will accidentally both have the value N, so this criterion makes at least a little sense.
What you want to test is some norm of the difference of the vectors as a whole, for instance sum(abs(x-xold))/N or max(abs(x-xold)). On the right side you might want to multiply by the same norm construction applied to x, so that the test is for the relative error, taking the scale of the problem into account.
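For example (with tol an assumed tolerance, say 1e-3):
% Relative-error stopping test over the whole vector:
if max(abs(x - xold)) < tol * max(abs(x))
    break;
end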
By the instructions in the given code, you are implementing the Jacobi iteration: you compute all the updates first and then advance the iteration vector. For the Gauß-Seidel variant you would need to replace the individual components in place, so that newly computed values are used immediately.
Also, you could shorten/simplify the inner loop:
xold = x;
for i = 1:N
    sum = b(i);
    for j = 1:N
        if (j ~= i)
            sum = sum - A(i,j) * x(j);
        end
    end
    x(i) = sum/A(i,i);
end
err = norm(x-xold)
or even shorter using the language features of Matlab:
xold = x;
for i = 1:N
    J = [1:(i-1) (i+1):N];
    x(i) = ( b(i) - A(i,J)*x(J) )/A(i,i);
end
err = norm(x-xold)
% Gauss-Seidel method for three equations
clc;
x1 = 0;
x2 = 0;
x3 = 0;
m = input('Enter number of iterations: ');
for i = 1:1:m
    x1(i+1) = (-0.01 - 0.52*x2(i) - x3(i))/0.3
    x2(i+1) = 0.67 - 1.9*x3(i) - 0.5*x1(i+1)
    x3(i+1) = (0.44 - 0.1*x1(i+1) - 0.3*x2(i+1))/0.5
    er1 = abs((x1(i+1) - x1(i))/x1(i+1))*100
    er2 = abs((x2(i+1) - x2(i))/x2(i+1))*100
    er3 = abs((x3(i+1) - x3(i))/x3(i+1))*100
    if er1 <= 0.01 && er2 <= 0.01 && er3 <= 0.01
        break;
    end
end
Figure 1. Hypothesis plot. y axis: Mean entropy. x axis: Bits.
This question is a continuation of a previous one: Matlab : Plot of entropy vs digitized code length.
I want to calculate the entropy of a random variable that is the discretized version (0/1) of a continuous random variable x. The random variable denotes the state of a nonlinear dynamical system called the Tent Map. Iterating the Tent Map yields a time series of length N.
The code should exit as soon as the entropy of the discretized time series becomes equal to the entropy of the dynamical system. It is known theoretically that the entropy of the system is H = log_e(2) = ln(2) ≈ 0.69. The objective of the code is to find the number of iterations, j, needed to produce the same entropy as the entropy of the system, H.
Problem 1: When I calculate the entropy of the binary time series, which is the information message, should I do it in the same base as H? Or should I convert the value of H to bits, because the information message is in 0/1? The two choices give different results, i.e., different values of j.
Problem 2: It can happen that the probability of 0's or 1's becomes zero, so the corresponding entropy term can become infinite. To prevent this, I thought of putting in a check using if-else. But the test
if entropy(:,j) == NaN
    entropy(:,j) = 0;
end
does not seem to be working. I shall be grateful for ideas and help to solve this problem. Thank you.
UPDATE: I implemented the suggestions and answers to correct the code. However, my logic was not right earlier. In the revised code, I want to calculate the entropy for time series of code lengths 2, 8, 16, 32 bits. For each code length the entropy is calculated, and the calculation is repeated N times, each run starting from a different initial condition of the dynamical system. This approach is adopted to check at which code length the entropy becomes 1. The plot of entropy vs bits should rise from zero and gradually approach 1, after which it saturates, remaining constant for all the remaining bits. I am unable to get this curve (Figure 1). I shall appreciate help in finding where I am going wrong.
clear all
H = 1 %in bits
Bits = [2,8,16,32,64];
threshold = 0.5;
N = 100; %Number of runs of the experiment
for r = 1:length(Bits)
    t = Bits(r)
    for Runs = 1:N
        x(1) = rand;
        for j = 2:t
            % Iterating over the Tent Map
            if x(j - 1) < 0.5
                x(j) = 2 * x(j - 1);
            else
                x(j) = 2 * (1 - x(j - 1));
            end % if
        end
        %Binarizing the output of the Tent Map
        s = (x >= threshold);
        p1 = sum(s == 1) / length(s); %calculating probability of 1's
        p0 = 1 - p1; % calculating probability of 0's
        entropy(t) = -p1 * log2(p1) - (1 - p1) * log2(1 - p1); %calculating entropy in bits
        if isnan(entropy(t))
            entropy(t) = 0;
        end
        %disp(abs(lambda-H))
    end
    Entropy_Run(Runs) = entropy(t)
end
Entropy_Bits(r) = mean(Entropy_Run)
plot(Bits,Entropy_Bits)
For problem 1, H and entropy can be in either nats or bits, so long as they are both computed using the same units; in other words, use either log for both or log2 for both. With the code sample you provided, H and entropy are correctly calculated using consistent nats units. If you prefer to work in units of bits, the conversion of H should give you H = log(2)/log(2) = 1 (or, using the conversion factor 1/log(2) ~ 1.443, H ~ 0.69 * 1.443 ~ 1).
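In code, the conversion is just a division by log(2):
H_nats = log(2);            % entropy of the system in nats, ~0.6931
H_bits = H_nats / log(2);   % the same entropy in bits: exactly 1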
For problem 2, as @noumenal already pointed out, you can check for NaN using isnan. Alternatively you could check whether p1 is strictly within (0,1), excluding 0 and 1, with:
if (p1 > 0 && p1 < 1)
    entropy(:,j) = -p1 * log(p1) - (1 - p1) * log(1 - p1); %calculating entropy in natural base e
else
    entropy(:,j) = 0;
end
First you just define a function that computes the entropy:
function [mean_entropy, bits] = compute_entropy(bits, blocks, threshold, replicate)
    if replicate
        disp('Replication is ON');
    else
        disp('Replication is OFF');
    end
    %%
    % Populate random vector
    if replicate
        seed = 849;
        rng(seed);
    else
        rng('default');
    end
    rs = rand(blocks, 1); % one initial condition per block
    %%
    % Get random
    trial_entropy = zeros(length(bits), blocks);
    for r = 1:length(rs)
        bit_entropy = zeros(length(bits), 1); % H
        % Traverse bit trials
        for b = 1:(length(bits)) % N
            tent_map = zeros(b, 1); %Preallocate for memory management
            %Initialize
            tent_map(1) = rs(r);
            for j = 2:b % j is the iterator, b is the current bit
                if tent_map(j - 1) < threshold
                    tent_map(j) = 2 * tent_map(j - 1);
                else
                    tent_map(j) = 2 * (1 - tent_map(j - 1));
                end % if
            end
            %Binarize the output of the Tent Map
            s = (tent_map >= threshold); % logical vector, not find()
            p1 = sum(s == 1) / length(s); %calculate probability of 1's
            %p0 = 1 - p1; % calculate probability of 0's
            bit_entropy(b) = -p1 * log2(p1) - (1 - p1) * log2(1 - p1); %calculate entropy in bits
            if isnan(bit_entropy(b))
                bit_entropy(b) = 0;
            end
            %disp(abs(lambda-h))
        end
        trial_entropy(:, r) = bit_entropy;
        disp('Trial Statistics')
        data = get_summary(bit_entropy);
        disp('Mean')
        disp(data.mean);
        disp('SD')
        disp(data.sd);
    end
    % TO DO Compute the mean for each BIT index in trial_entropy
    mean_entropy = 0;
    disp('Overall Statistics')
    data = get_summary(trial_entropy);
    disp('Mean')
    disp(data.mean);
    disp('SD')
    disp(data.sd);
    %This is the wrong mean...
    mean_entropy = data.mean;

    function summary = get_summary(entropy)
        summary = struct('mean', mean(entropy), 'sd', std(entropy));
    end
end
and then you just have to call it from a script:
% Entropy Script
clear all
%% Settings
replicate = false; % = false % Use true for debugging only.
%H = 1; %in bits
Bits = 2.^(1:6);
Threshold = 0.5;
%Tolerance = 0.001;
Blocks = 100; %Number of runs of the experiment
%% Run
[mean_entropy, bits] = compute_entropy(Bits, Blocks, Threshold, replicate);
%What we want
%plot(bits, mean_entropy);
%What we have
plot(1:length(mean_entropy), mean_entropy);
I am computing the ML estimate of a Bernoulli random variable. Initially I have the following code:
N = 100;                       % number of observations
muBern = 0.75;
bernoulliSamples = rand(1, N);
bernoulliSamples(bernoulliSamples < muBern) = 1;
bernoulliSamples(bernoulliSamples > muBern & bernoulliSamples ~= 1) = 0;
bernoulliSamples; % 1xN matrix of Bernoulli measurements, 1's and 0's
estimateML = zeros(1,N);
for n = 1:N
    estimateML(n) = (1/n)*sum(bernoulliSamples(1:n)); % The ML estimate for muBern
end
This works fairly well, but every run of the code gives only one possible result of taking N = 100 observations. I want to repeat this experiment I = 100 times and take the average of all the results, to get a solution that accurately represents the experiment.
I = 100; N = 100;              % number of experiments and of observations
muBern = 0.75;
bernoulliSamples = rand(I, N);
bernoulliSamples(bernoulliSamples < muBern) = 1;
bernoulliSamples(bernoulliSamples > muBern & bernoulliSamples ~= 1) = 0;
bernoulliSamples; % IxN matrix of Bernoulli measurements, 1's and 0's
estimateML = zeros(I,N);
for n = 1:N
    estimateML(n,:) = (1/n)*sum(bernoulliSamples(1:n,2)); % The ML estimate for muBern
end
I am wondering if this for loop is doing what I want it to: each row should represent a completely different experiment. Is the second code instance doing the same thing as the first one, only with 100 different results coming from 100 different experiments?
You don't need any loops. In the single-experiment case, replace the loop by this, which does the same thing:
estimateML = cumsum(bernoulliSamples) ./ (1:N);
In the multiple-experiment case, use this:
estimateML = bsxfun(@rdivide, cumsum(bernoulliSamples,2), 1:N);
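To then get the averaged curve the question asks for, one could average over the I rows (a small sketch; muBern as defined in the question):
avgEstimateML = mean(estimateML, 1);   % 1 x N: average over the I experiments
plot(1:N, avgEstimateML); hold on;
plot([1 N], [muBern muBern], '--');    % true value, for reference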
Came up with the answer; I was just overthinking it. If anyone is interested, the following is what I was looking for:
for n = 1:N
    estimateML(:,n) = (1/n)*sum(bernoulliSamples(:,1:n),2); % The ML estimate for muBern
end
I have implemented cosine similarity in Matlab like this. I have a two-dimensional 50-by-50 matrix, and to obtain the cosine similarities I compare the rows pairwise.
for j = 1:50
    x = dat(j,:);
    for i = j+1:50
        y = dat(i,:);
        c = dot(x,y);
        sim = c/(norm(x,2)*norm(y,2));
    end
end
Is this correct?
And the question is: what is the complexity (in big-O terms) of this code?
Just a note on an efficient implementation of the same thing using vectorized, matrix-wise operations (which are optimized in MATLAB). This can yield huge time savings for large matrices:
dat = randn(50, 50);
OP (double-for) implementation:
sim = zeros(size(dat));
nRow = size(dat,1);
for j = 1:nRow
    x = dat(j, :);
    for i = j+1:nRow
        y = dat(i, :);
        c = dot(x, y);
        sim(j, i) = c/(norm(x,2)*norm(y,2));
    end
end
Vectorized implementation:
normDat = sqrt(sum(dat.^2, 2)); % L2 norm of each row
datNorm = bsxfun(@rdivide, dat, normDat); % normalize each row
dotProd = datNorm*datNorm'; % dot-product vectorized (redundant!)
sim2 = triu(dotProd, 1); % keep unique upper triangular part
Comparisons for a 1000 x 1000 matrix (MATLAB 2013a, x64, Intel Core i7 960 @ 3.20GHz); first the double-for version, then the vectorized one:
Elapsed time is 34.103095 seconds.
Elapsed time is 0.075208 seconds.
sum(sum(sim-sim2))
ans =
-1.224314766369880e-14
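As a side note, if the Statistics Toolbox is available, pdist computes the cosine distance (one minus the cosine similarity) between all pairs of rows, so the full similarity matrix can also be obtained with:
% pdist returns cosine *distance*; convert back to similarity:
simFull = 1 - squareform(pdist(dat, 'cosine'));  % full symmetric matrix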
Better to end the outer loop at 49. Maybe you should also add an index to sim?
for j = 1:49
    x = dat(j,:);
    for i = j+1:50
        y = dat(i,:);
        c = dot(x,y);
        sim(j) = c/(norm(x,2)*norm(y,2));
    end
end
The complexity should be roughly O(n^2), shouldn't it?
Maybe you should have a look at correlation functions... I don't understand exactly what you want to compute, but it looks like you want to do something similar. There are built-in correlation functions in Matlab.
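For instance, corrcoef in base Matlab works column-wise, so transposing gives correlations between rows, which are closely related to cosine similarities (they are the cosines of the mean-centered rows):
% Correlation between all pairs of rows of dat:
R = corrcoef(dat');   % R(i,j) is the correlation between rows i and j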
I've searched a lot but didn't find any solution to my problem. Could you please help me vectorize these loops (or just find a way to make them much faster)?
% n is the size of C
h = 1/(n-1);
dt = 1e-6;
a = 1e-2;
F = zeros(n,n);
F2 = zeros(n,n);
C2 = zeros(n,n);
t = 0.0;
for iter = 1:12000
    F2 = F.^3 - F;
    for i = 1:n
        for j = 1:n
            F2(i,j) = F2(i,j) - (C(ij(i-1),j)+C(ij(i+1),j)+C(i,ij(j-1))+C(i,ij(j+1))-4*C(i,j)).*(a.^2)./(h.^2);
        end
    end
    F = F2;
    for i = 1:n
        for j = 1:n
            C2(i,j) = C(i,j) + (F(ij(i-1),j)+F(ij(i+1),j)+F(i,ij(j-1))+F(i,ij(j+1))-4*F(i,j)).*dt./(h^2);
        end
    end
    C = C2;
    t = t + dt;
end

function i = ij(i) % wrap indices periodically: index n+1 maps to 1, and index 0 maps to n
    if i == 0
        i = n;
        return
    elseif i == n+1
        i = 1;
    end
    return
end
thanks a lot
EDIT: Found an answer. It was totally ridiculous and I was searching way too far:
%n is still the size of C
h = 1/(n-1);
dt = 1e-6;
a = 1e-2;
F = zeros(n,n);
var1 = (a^2)/(h^2); % precompute to save some arithmetic
var2 = dt/(h^2);    % same
t = 0.0;
for iter = 1:12000
    F = C.^3 - C - var1*(C([n 1:n-1],1:n) + C([2:n 1],1:n) + C(1:n,[n 1:n-1]) + C(1:n,[2:n 1]) - 4*C);
    C = C + var2*(F([n 1:n-1],1:n) + F([2:n 1],1:n) + F(1:n,[n 1:n-1]) + F(1:n,[2:n 1]) - 4*F);
    t = t + dt;
end
Found an answer. It was totally ridiculous and I was searching way too far:
%n is still the size of C
h = 1/(n-1);
dt = 1e-6;
a = 1e-2;
F = zeros(n,n);
var1 = (a^2)/(h^2); % precompute to save some arithmetic
var2 = dt/(h^2);    % same
prev = [n 1:n-1];
next = [2:n 1];
t = 0.0;
for iter = 1:12000
    F = C.*C.*C - C - var1*(C(:,next) + C(:,prev) + C(next,:) + C(prev,:) - 4*C);
    C = C + var2*(F(:,next) + F(:,prev) + F(next,:) + F(prev,:) - 4*F);
    t = t + dt;
end
The behavior of the inner loops looks like a 2-dimensional circular convolution. That is the same as element-wise multiplication in the FFT domain, and the subtraction passes through a linear operation such as the FFT unchanged.
You'll want to use the fft2 and ifft2 functions.
Once you do that, I think you'll find that the repeated convolution can be eliminated by raising the convolution kernel (element-wise, in the FFT domain) to the power iter. If that optimization is correct, I'm predicting a speedup of 5 orders of magnitude.
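A sketch of the circular-convolution part via fft2 (only the periodic 5-point Laplacian; the cubic term F.^3 is pointwise and stays as it is):
% Periodic 5-point Laplacian of C via FFT-based circular convolution.
% The kernel is centered at index (1,1) so that element (i,j) of the
% result equals C(i-1,j)+C(i+1,j)+C(i,j-1)+C(i,j+1)-4*C(i,j), with
% periodic wraparound of the indices.
K = zeros(n, n);
K(1,1) = -4;
K(2,1) = 1; K(n,1) = 1;
K(1,2) = 1; K(1,n) = 1;
Kf = fft2(K);                        % transform once, reuse every step
lapC = real(ifft2(fft2(C) .* Kf));   % circular convolution of K with C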
You can replace, for example, C(ij(i-1),j) by using circshift(C,[1,0]) or circshift(C,[-1,0]) (I can't figure out which of the two is correct).
http://www.mathworks.com/help/matlab/ref/circshift.htm
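A quick check in the console settles the direction: circshift(C,[1,0]) shifts the rows down by one, so its element (i,j) is C(i-1,j) with wraparound, which matches C(ij(i-1),j); circshift(C,[-1,0]) correspondingly matches C(ij(i+1),j). For example:
A = (1:3)';       % column vector [1; 2; 3]
circshift(A, 1)   % gives [3; 1; 2]: element i is A(i-1), wrapped
circshift(A, -1)  % gives [2; 3; 1]: element i is A(i+1), wrapped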