I'm writing an algorithm in Octave that uses the BFGS method to fit the parameters of a logistic regression on a binary dataset.
I'm struggling with what I believe is an overfitting problem. On several datasets my algorithm converges to the same results as Octave's fminunc function. However, for one specific type of dataset it converges to very large parameter values, whereas fminunc gives reasonable values. After adding a regularization term, my algorithm converges to the same values as fminunc.
This specific type of dataset can be completely separated by a straight line. My question is: why is this a problem for my BFGS implementation but not for fminunc? How does fminunc avoid the issue without regularization? Could I implement the same thing in my algorithm?
The code of my algorithm is the following:
function [beta] = Log_BFGS(data, L_0)
clc
close
%************************************************************************
%************************************************************************
%Loading the data:
[n, e] = size(data);
d = e - 1;
n; %Number of observations.
d; %Number of features.
Y = data(:, e); %Label values
X_o = data(:, 1:d);
X = [ones(n, 1) X_o]; %Features values
%Initials conditions:
beta_0 = zeros(e, 1);
beta = [];
beta(:, 1) = beta_0;
N = 600; %Max iterations
Tol = 1e-10; %Tolerance
error = .1;
L = L_0; %Regularization parameter
B = eye(e);
options = optimset('GradObj', 'on', 'MaxIter', 600);
[beta_s] = fminunc(@(t)(costFunction(t, X, Y, L)), beta_0, options);
disp('Beta obtained with the fminunc function');
disp("--------------");
disp(beta_s)
k = 1;
a_0 = 1;
% Define the sigmoid function
h = inline('1.0 ./ (1.0 + exp(-z))');
while (error > Tol && k < N)
beta_k = beta(:, k);
x_0 = X*beta_k;
h_0 = h(x_0);
beta_r = [0 ; beta(:, k)(2:e, :)];
g_k = ((X)'*(h_0 - Y) + L*beta_r)/n;
d_k = -pinv(B)*g_k;
a = 0.1; %Fixed step size for now; I'll implement an Armijo line search here (soon)
beta(:, k+1) = beta(:, k) + a*d_k;
beta_k_1 = beta(:, k+1);
x_1 = X*beta_k_1;
h_1 = h(x_1);
beta_s = [0 ; beta(:, k+1)(2:e, :)];
g_k_1 = (transpose(X)*(h_1 - Y) + L*beta_s)/n;
s_k = beta(:, k+1) - beta(:, k);
y_k = g_k_1 - g_k;
B = B - B*s_k*s_k'*B/(s_k'*B*s_k) + y_k*y_k'/(s_k'*y_k);
k = k + 1;
error = norm(d_k);
endwhile
%Accuracy of the logistic model:
p = zeros(n, 1);
for j = 1:n
if (1./(1. + exp(-1.*(X(j, :)*beta(:, k)))) >= 0.5)
p(j) = 1;
else
p(j) = 0;
endif
endfor
R = mean(double(p == Y));
beta = beta(:, k);
%Showing the results:
disp("Estimation of logistic regression model Y = 1/(1 + e^(-beta*X)),")
disp("using the algorithm BFGS =")
disp("--------------")
disp(beta)
disp("--------------")
disp("with a convergence error in the last iteration of:")
disp(error)
disp("--------------")
disp("and a total number of")
disp(k-1)
disp("iterations")
disp("--------------")
if k == N
disp("The maximum number of iterations was reached before obtaining the desired error")
else
disp("The desired error was reached before reaching the maximum of iterations")
endif
disp("--------------")
disp("The precision of the logistic regression model is given by (max 1.0):")
disp("--------------")
disp(R)
disp("--------------")
endfunction
The results I got for the dataset are shown in the following picture. If you need the data used in this situation, please let me know.
[Figure: Results of the algorithm]
Check the objectives!
The values of the solution vector are nice to look at, but the whole optimization is driven by the objective. You say that fminunc "gives reasonable values of these parameters", but "reasonable" is not defined within this model.
It is quite possible that both your low-valued and your high-valued solution yield pretty much the same objective value. And that is all these solvers care about (when no regularization term is used).
So the important question is: is there a unique solution (which would rule out these results)? Only when your dataset has full rank! So maybe your data is rank-deficient and you obtain two equally good solutions. Of course there might be slight differences due to numerical issues, which are always a source of error, especially in more complex optimization algorithms.
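To make the point about near-identical objectives concrete, here is a minimal sketch in plain Python (standard library only, not the asker's Octave code). It assumes a made-up, perfectly separable 1-D dataset and a model with zero intercept; the names log_loss, xs and ys are illustrative only. On data like this the unregularized logistic loss keeps shrinking as the weight grows, so there is no finite minimizer, and where a solver stops presumably depends on its stopping tolerances and iteration limit.

```python
import math

def log_loss(w, b, xs, ys):
    """Mean logistic loss for the 1-D model p = sigmoid(w*x + b)."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = w * x + b
        # Margin m is positive when the point is classified correctly.
        m = z if y == 1 else -z
        total += math.log1p(math.exp(-m))
    return total / len(xs)

# Hypothetical separable data: x < 0 has label 0, x > 0 has label 1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

# All three weights classify every point correctly,
# but the loss keeps decreasing as the weight grows.
for w in (1.0, 10.0, 100.0):
    print(w, log_loss(w, 0.0, xs, ys))
```

The printed losses decrease toward zero as w goes from 1 to 100. A regularization term penalizes large weights and so restores a finite minimizer, which matches what the question observed.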
I'm having some issues getting my RK2 algorithm to work for a certain second-order linear differential equation. I have posted my current code (with the provided parameters) below. For some reason, the value of y1 deviates from the true value by a wider margin each iteration. Any input would be greatly appreciated. Thanks!
Code:
f = @(x,y1,y2) [y2; (1+y2)/x];
a = 1;
b = 2;
alpha = 0;
beta = 1;
n = 21;
h = (b-a)/(n-1);
yexact = @(x) 2*log(x)/log(2) - x +1;
ye = yexact((a:h:b)');
s = (beta - alpha)/(b - a);
y0 = [alpha;s];
[y1, y2] = RungeKuttaTwo2D(f, a, b, h, y0);
error = abs(ye - y1);
function [y1, y2] = RungeKuttaTwo2D(f, a, b, h, y0)
n = floor((b-a)/h);
y1 = zeros(n+1,1); y2 = y1;
y1(1) = y0(1); y2(1) = y0(2);
for i=1:n-1
ti = a+(i-1)*h;
fvalue1 = f(ti,y1(i),y2(i));
k1 = h*fvalue1;
fvalue2 = f(ti+h/2,y1(i)+k1(1)/2,y2(i)+k1(2)/2);
k2 = h*fvalue2;
y1(i+1) = y1(i) + k2(1);
y2(i+1) = y2(i) + k2(2);
end
end
Your exact solution is wrong. It is possible that your differential equation is missing a minus sign.
y2'=(1+y2)/x has as its solution y2(x)=C*x-1 and as y1'=y2 then y1(x)=0.5*C*x^2-x+D.
If the sign in the y2 equation were flipped, y2'=-(1+y2)/x, one would get y2(x)=C/x-1 with integral y1(x)=C*log(x)-x+D, which contains the given exact solution.
0 = y1(1) = C*log(1) - 1 + D = -1 + D ==> D = 1
1 = y1(2) = C*log(2) - 2 + D = C*log(2) - 1 ==> C = 2/log(2)
Additionally, the arrays in the integration loop have length n+1, so the loop has to run from i=1 to n. Otherwise the last element remains zero, which gives a wrong residual for the second boundary condition.
Correcting that and extending the computation by one secant step on the initial slope finds the correct solution for the discretization, as the ODE is linear. The error to the exact solution is bounded by 0.000285, which is reasonable for a second order method with step size 0.05.
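The corrections described above can be sketched as follows. This is a plain Python illustration (standard library only), not the asker's Octave code. It assumes the sign-corrected equation y2' = -(1+y2)/x, and the trial slopes 0 and 1 for the secant step are arbitrary choices; the helper name rk2_midpoint is made up for this sketch.

```python
import math

def f(x, y1, y2):
    # Sign-corrected system: y1' = y2, y2' = -(1 + y2) / x
    return y2, -(1.0 + y2) / x

def rk2_midpoint(a, b, n_steps, y1_0, y2_0):
    """Explicit midpoint rule; returns y1 at all n_steps + 1 grid points."""
    h = (b - a) / n_steps
    y1, y2 = y1_0, y2_0
    out = [y1]
    for i in range(n_steps):  # n_steps updates fill all n_steps + 1 entries
        x = a + i * h
        k1_1, k1_2 = f(x, y1, y2)
        k2_1, k2_2 = f(x + h / 2, y1 + h / 2 * k1_1, y2 + h / 2 * k1_2)
        y1 += h * k2_1
        y2 += h * k2_2
        out.append(y1)
    return out

a, b, alpha, beta, n_steps = 1.0, 2.0, 0.0, 1.0, 20

# Shoot with two trial slopes, then take one secant step on the
# boundary residual y1(b) - beta. One step suffices for a linear ODE.
s0, s1 = 0.0, 1.0
r0 = rk2_midpoint(a, b, n_steps, alpha, s0)[-1] - beta
r1 = rk2_midpoint(a, b, n_steps, alpha, s1)[-1] - beta
s = s1 - r1 * (s1 - s0) / (r1 - r0)

y1 = rk2_midpoint(a, b, n_steps, alpha, s)
grid = [a + i * (b - a) / n_steps for i in range(n_steps + 1)]
exact = [2 * math.log(x) / math.log(2) - x + 1 for x in grid]
max_err = max(abs(u - v) for u, v in zip(y1, exact))
print(s, max_err)
```

Because the discrete solution depends affinely on the initial slope, the secant step hits the second boundary value up to rounding, and the remaining difference from the exact solution is the discretization error of the midpoint rule.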
I want to write a program that makes use of Newton's method:
x_{n+1} = x_n - f(x_n)/f'(x_n)
to estimate the x of this integral:
T(X) = integral from 0 to X of dx / v(x)
where X is the total distance.
I have functions that calculate the time it takes to arrive at a certain distance, using the trapezoid method for numerical integration without calling trapz.
function T = time_to_destination(x, route, n)
h=(x-0)/n;
dx = 0:h:x;
y = (1./(velocity(dx,route)));
Xk = dx(2:end)-dx(1:end-1);
Yk = y(2:end)+y(1:end-1);
T = 0.5*sum(Xk.*Yk);
end
It gets its velocity values through ppval of a cubic spline interpolation between a set of data points. Extrapolated values should not be retrievable.
function [v] = velocity(x, route)
load(route);
if all(x >= distance_km(1))==1 & all(x <= distance_km(end))==1
estimation = spline(distance_km, speed_kmph);
v = ppval(estimation, x);
else
error('Bad input, please choose a new value')
end
end
Plot of the velocity spline if that's interesting to you evaluated at:
dx= 1:0.1:65
Now I want to write a function that solves for the distance travelled after a given time, using Newton's method without fzero / fsolve. But I have no idea how to solve for the upper bound of an integral.
According to the fundamental theorem of calculus I suppose the derivative of the integral is the function inside the integral, which is what I've tried to recreate as Time_to_destination / (1/velocity)
I added the constant I want to solve for to time to destination so its
(Time_to_destination - (input time)) / (1/velocity)
Not sure if I'm doing that right.
EDIT: I rewrote my code and it works better now, but my stop condition for Newton-Raphson doesn't seem to converge to zero. I also tried to implement the error from the trapezoid integration (ET), but I'm not sure whether I should bother with that yet. The route file is linked at the bottom.
Stop condition and error calculation of Newton's Method:
Error estimation of trapezoid:
function x = distance(T, route)
n=180
route='test.mat'
dGuess1 = 50;
dDistance = T;
i = 1;
condition = inf;
while condition >= 1e-4 && 300 >= i
i = i + 1 ;
dGuess2 = dGuess1 - (((time_to_destination(dGuess1, route,n))-dDistance)/(1/(velocity(dGuess1, route))))
if i >= 2
ET =(time_to_destination(dGuess1, route, n/2) - time_to_destination(dGuess1, route, n))/3;
condition = abs(dGuess2 - dGuess1)+ abs(ET);
end
dGuess1 = dGuess2;
end
x = dGuess2
Route file: https://drive.google.com/open?id=18GBhlkh5ZND1Ejh0Muyt1aMyK4E2XL3C
Observe that the Newton-Raphson method determines the roots of the function. I.e. you need to have a function f(x) such that f(x)=0 at the desired solution.
In this case you can define f as
f(x) = Time(x) - t
where t is the desired time. Then by the second fundamental theorem of calculus
f'(x) = 1/Velocity(x)
With these functions defined the implementation becomes quite straightforward!
First, we define a simple Newton-Raphson function which takes anonymous functions as arguments (f and f') as well as an initial guess x0.
function x = newton_method(f, df, x0)
MAX_ITER = 100;
EPSILON = 1e-5;
x = x0;
fx = f(x);
iter = 0;
while abs(fx) > EPSILON && iter <= MAX_ITER
x = x - fx / df(x);
fx = f(x);
iter = iter + 1;
end
end
Then we can invoke our function as follows
t_given = 0.3; % e.g. we want to determine distance after 0.3 hours.
n = 180;
route = 'test.mat';
f = @(x) time_to_destination(x, route, n) - t_given;
df = @(x) 1/velocity(x, route);
distance_guess = 50;
distance = newton_method(f, df, distance_guess);
Result
>> distance
distance = 25.5877
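Since the code above depends on the route file, here is a self-contained sketch of the same idea in plain Python (standard library only). The speed profile in velocity is hypothetical and merely stands in for the spline of the route data; travel_time and distance_after are illustrative names, not part of the original code.

```python
import math

def velocity(x):
    # Hypothetical smooth speed profile in km/h (always between 50 and 70).
    return 60.0 + 10.0 * math.sin(x / 5.0)

def travel_time(x, n=1000):
    """Composite trapezoid rule for the integral of 1/velocity from 0 to x."""
    h = x / n
    total = 0.5 * (1.0 / velocity(0.0) + 1.0 / velocity(x))
    for k in range(1, n):
        total += 1.0 / velocity(k * h)
    return h * total

def distance_after(t, x0, tol=1e-10, max_iter=50):
    """Newton iteration on f(x) = travel_time(x) - t with f'(x) = 1/velocity(x)."""
    x = x0
    for _ in range(max_iter):
        fx = travel_time(x) - t
        if abs(fx) < tol:
            break
        x -= fx * velocity(x)  # same as x - fx / (1 / velocity(x))
    return x

x_star = distance_after(0.3, x0=50.0)
print(x_star, travel_time(x_star))
```

The derivative never needs numerical differentiation: by the fundamental theorem of calculus it is just the integrand evaluated at the current guess.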
Also, I rewrote your time_to_destination and velocity functions as follows. This version of time_to_destination uses all the available data to make a more accurate estimate of the integral. Using these functions the method seems to converge faster.
function t = time_to_destination(x, d, v)
% x is scalar value of destination distance
% d and v are arrays containing measured distance and velocity
% Assumes d is strictly increasing and d(1) <= x <= d(end)
idx = d < x;
if ~any(idx)
t = 0;
return;
end
v1 = interp1(d, v, x);
t = trapz([d(idx); x], 1./[v(idx); v1]);
end
function v = velocity(x, d, v)
v = interp1(d, v, x);
end
Using these new functions requires that the definitions of the anonymous functions are changed slightly.
t_given = 0.3; % e.g. we want to determine distance after 0.3 hours.
load('test.mat');
f = @(x) time_to_destination(x, distance_km, speed_kmph) - t_given;
df = @(x) 1/velocity(x, distance_km, speed_kmph);
distance_guess = 50;
distance = newton_method(f, df, distance_guess);
Because the integral is estimated more accurately the solution is slightly different
>> distance
distance = 25.7771
Edit
The updated stopping condition can be implemented as a slight modification to the newton_method function. We shouldn't expect the trapezoid rule error to go to zero so I omit that.
function x = newton_method(f, df, x0)
MAX_ITER = 100;
TOL = 1e-5;
x = x0;
iter = 0;
dx = inf;
while dx > TOL && iter <= MAX_ITER
x_prev = x;
x = x - f(x) / df(x);
dx = abs(x - x_prev);
iter = iter + 1;
end
end
To check our answer we can plot the time vs. distance and make sure our estimate falls on the curve.
...
distance = newton_method(f, df, distance_guess);
load('test.mat');
t = zeros(size(distance_km));
for idx = 1:numel(distance_km)
t(idx) = time_to_destination(distance_km(idx), distance_km, speed_kmph);
end
plot(t, distance_km); hold on;
plot([t(1) t(end)], [distance distance], 'r');
plot([t_given t_given], [distance_km(1) distance_km(end)], 'r');
xlabel('time');
ylabel('distance');
axis tight;
One of the main issues with my code was that n was too low: the error of the trapezoidal sum that estimates my integral was too large for the Newton-Raphson stop condition to shrink to a very small number.
Here was my final code for this problem:
function x = distance(T, route)
load(route)
n=10e6;
x = mean(distance_km);
i = 1;
maxiter=100;
tol= 5e-4;
condition=inf
fx = @(x) time_to_destination(x, route,n);
dfx = @(x) 1./velocity(x, route);
while condition > tol && i <= maxiter
i = i + 1 ;
Guess2 = x - ((fx(x) - T)/(dfx(x)))
condition = abs(Guess2 - x)
x = Guess2;
end
end
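How the trapezoid error depends on n, and how well the estimate ET = (T(n/2) - T(n))/3 from the question tracks it, can be checked on an integral with a known value. The following is a plain Python sketch (standard library only). It uses exp(x) on [0, 1] as a stand-in integrand, which is an assumption made purely so that the exact answer is available.

```python
import math

def trapezoid(g, a, b, n):
    """Composite trapezoid rule with n equal panels."""
    h = (b - a) / n
    total = 0.5 * (g(a) + g(b))
    for k in range(1, n):
        total += g(a + k * h)
    return h * total

exact = math.e - 1.0  # integral of exp(x) over [0, 1]
for n in (10, 100, 1000):
    coarse = trapezoid(math.exp, 0.0, 1.0, n // 2)
    fine = trapezoid(math.exp, 0.0, 1.0, n)
    estimate = (coarse - fine) / 3.0  # estimate of the error (fine - exact)
    print(n, fine - exact, estimate)
```

The true error drops by roughly a factor of 100 for every tenfold increase in n, and the estimate follows it closely. A stop condition that adds abs(ET) to the Newton step size therefore cannot fall below the integration error for the chosen n.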
As you probably guessed from the title, I'm attempting to do tridiagonal Gauss-Jordan elimination. I'm trying to do it without the default solver. My answers aren't coming out correct, and I need some help finding the error in my code.
I'm getting different values for A/b and x, using the code I have.
n = 4;
#Range for diagonals
ranged = [15 20];
rangesd = [1 5];
#Vectors for tridiagonal matrix
supd = randi(rangesd,[1,n-1]);
d = randi(ranged,[1,n]);
subd = randi(rangesd,[1,n-1]);
#Creates system Ax+b
A = diag(supd,1) + diag(d,0) + diag(subd,-1)
b = randi(10,[1,n])
#Uses default solver
y = A/b
function x = naive_gauss(A,b);
#Forward elimination
for k=1:n-1
for i=k+1:n
xmult = A(i,k)/A(k,k);
for j=k+1:n
A(i,j) = A(i,j)-xmult*A(k,j);
end
b(i) = b(i)-xmult*b(k);
end
end
#Backwards elimination
x(n) = b(n)/A(n,n);
for i=n-1:-1:1
sum = b(i);
for j=i+1:n
sum = sum-A(i,j)*x(j);
end
x(i) = sum/A(i,i)
end
end
x
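Your algorithm is correct. The value of y that you compare against is wrong.
You have y = A/b, but the correct syntax for solving the system A*x = b is y = A\b. Because b is created here as a 1-by-n row vector, it also has to be transposed for the backslash solve: y = A\b'.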
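The elimination in the question treats A as a full matrix. For comparison, here is a sketch of elimination that exploits the tridiagonal structure (often called the Thomas algorithm), written in plain Python with the standard library only. It is not the asker's code; it assumes at least two unknowns and that no pivoting is needed, which holds for diagonally dominant matrices like the ones generated in the question. The function name and the test values are illustrative.

```python
def solve_tridiagonal(sub, diag, sup, rhs):
    """Solve a tridiagonal system.

    sub  : n-1 entries below the diagonal
    diag : n diagonal entries
    sup  : n-1 entries above the diagonal
    rhs  : n right-hand-side entries
    Assumes n >= 2 and that no pivoting is required.
    """
    n = len(diag)
    c = [0.0] * (n - 1)  # modified super-diagonal
    d = [0.0] * n        # modified right-hand side

    # Forward elimination: only one sub-diagonal entry per row to remove.
    c[0] = sup[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - sub[i - 1] * c[i - 1]
        if i < n - 1:
            c[i] = sup[i] / denom
        d[i] = (rhs[i] - sub[i - 1] * d[i - 1]) / denom

    # Back substitution: each row couples only to the next unknown.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

# Right-hand side built by hand from the known solution [1, 2, 3, 4].
solution = solve_tridiagonal(
    [4.0, 5.0, 1.0], [15.0, 16.0, 17.0, 18.0], [1.0, 2.0, 3.0],
    [17.0, 42.0, 73.0, 75.0])
print(solution)
```

This needs a number of operations proportional to n instead of n cubed, because the two inner loops of the full elimination collapse to a single update per row.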
I am trying to solve a differential equation with the ode solver ode45 with MATLAB. I have tried using it with other simpler functions and let it plot the function. They all look correct, but when I plug in the function that I need to solve, it fails. The plot starts off at y(0) = 1 but starts decreasing at some point when it should have been an increasing function all the way up to its critical point.
function [xpts,soln] = diffsolver(p1x,p2x,p3x,p1rr,y0)
syms x y
yp = matlabFunction((p3x/p1x) - (p2x/p1x) * y);
[xpts,soln] = ode45(yp,[0 p1rr],y0);
p1x, p2x, and p3x are polynomials and they are passed into this diffsolver function as parameters.
p1rr here is the critical point. The function should diverge after the critical point, so I want to integrate only up to that point.
EDIT: Here is the code that runs before diffsolver, the function above. I do a Pade approximation to find the polynomials p1, p2, and p3. Then I find the critical point, which is the root of p1 that is closest to the target (the target is specified by the user).
I check whether the critical point is empty (some functions may not have one). If it's not empty, the code uses the function above to solve the differential equation. Then it plots the x and y points returned by that function.
function error = padeapprox(m,n,j)
global f df p1 p2 p3 N target
error = 0;
size = m + n + j + 2;
A = zeros(size,size);
for i = 1:m
A((i + 1):size,i) = df(1:(size - i));
end
for i = (m + 1):(m + n + 1)
A((i - m):size,i) = f(1:(size + 1 - i + m));
end
for i = (m + n + 2):size
A(i - (m + n + 1),i) = -1;
end
if det(A) == 0
error = 1;
fprintf('Warning: Matrix is singular.\n');
end
V = -A\df(1:size);
p1 = [1];
for i = 1:m
p1 = [p1; V(i)];
end
p2 = [];
for i = (m + 1):(m + n + 1)
p2 = [p2; V(i)];
end
p3 = [];
for i = (m + n + 2):size
p3 = [p3; V(i)];
end
fx = poly2sym(f(end:-1:1));
dfx = poly2sym(df(end:-1:1));
p1x = poly2sym(p1(end:-1:1));
p2x = poly2sym(p2(end:-1:1));
p3x = poly2sym(p3(end:-1:1));
p3fullx = p1x * dfx + p2x * fx;
p3full = sym2poly(p3fullx); p3full = p3full(end:-1:1);
p1r = roots(p1(end:-1:1));
p1rr = findroots(p1r,target); % findroots eliminates unreal roots and chooses the one closest to the target
if ~isempty(p1rr)
[xpts,soln] = diffsolver(p1x,p2x,p3fullx,p1rr,f(1));
if rcond(A) >= 1e-10
plot(xpts,soln); axis([0 p1rr 0 5]); hold all
end
end
I saw some examples that use a separate function to generate the differential equation, but I've tried the matlabFunction() approach with other, simpler functions and it seems to work. It's only when I try to solve this function that it fails. The solved values start becoming negative when they should all be positive.
I also tried using another solver, dsolve(). But it gives me an implicit solution all the time...
Does anyone have an idea why this is happening? Any advice is appreciated. Thank you!
Since your code seems to work for simpler functions, you could try to increase the accuracy options of the ode45 solver.
This can be achieved by using odeset:
options = odeset('RelTol',1e-10,'AbsTol',1e-10);
[T,Y] = ode45(@odefun, tspan, y0, options); % odefun, tspan, y0 are placeholders for your own function, interval, and initial value
I'm trying to train a single layer of an autoencoder using minFunc, and while the cost function appears to decrease, when enabled, the DerivativeCheck fails. The code I'm using is as close to textbook values as possible, though extremely simplified.
The loss function I'm using is the squared-error:
$ J(W; x) = \frac{1}{2}||a^{l} - x||^2 $
with $a^{l}$ equal to $\sigma(W^{T}x)$, where $\sigma$ is the sigmoid function. The gradient should therefore be:
$ \delta = (a^{l} - x)*a^{l}(1 - a^{l}) $
$ \nabla_{W} = \delta(a^{l-1})^T $
Note, that to simplify things, I've left off the bias altogether. While this will cause poor performance, it shouldn't affect the gradient check, as I'm only looking at the weight matrix. Additionally, I've tied the encoder and decoder matrices, so there is effectively a single weight matrix.
The code I'm using for the loss function is (edit: I've vectorized the loop I had and cleaned code up a little):
% loss function passed to minFunc
function [ loss, grad ] = calcLoss(theta, X, nHidden)
[nInstances, nVars] = size(X);
% we get the variables a single vector, so need to roll it into a weight matrix
W = reshape(theta(1:nVars*nHidden), nVars, nHidden);
Wp = W; % tied weight matrix
% encode each example (nInstances)
hidden = sigmoid(X*W);
% decode each sample (nInstances)
output = sigmoid(hidden*Wp);
% loss function: sum(-0.5.*(x - output).^2)
% derivative of loss: -(x - output)*f'(o)
% if f is sigmoid, then f'(o) = output.*(1-output)
diff = X - output;
error = -diff .* output .* (1 - output);
dW = hidden*error';
loss = 0.5*sum(diff(:).^2, 2) ./ nInstances;
% need to unroll gradient matrix back into a single vector
grad = dW(:) ./ nInstances;
end
Below is the code I use to run the optimizer (for a single time, as the runtime is fairly long with all training samples):
examples = 5000;
fprintf('loading data..\n');
images = readMNIST('train-images-idx3-ubyte', examples) / 255.0;
data = images(:, :, 1:examples);
% each row is a different training sample
X = reshape(data, examples, 784);
% initialize weight matrix with random values
% W: (R^{784} -> R^{10}), W': (R^{10} -> R^{784})
numHidden = 10; % NOTE: this is extremely small to speed up DerivativeCheck
numVisible = 784;
low = -4*sqrt(6./(numHidden + numVisible));
high = 4*sqrt(6./(numHidden + numVisible));
W = low + (high-low)*rand(numVisible, numHidden);
% run optimization
options = {};
options.Display = 'iter';
options.GradObj = 'on';
options.MaxIter = 10;
mfopts.MaxFunEvals = ceil(options.MaxIter * 2.5);
options.DerivativeCheck = 'on';
options.Method = 'lbfgs';
[ x, f, exitFlag, output] = minFunc(@calcLoss, W(:), options, X, numHidden);
The differences I get with DerivativeCheck on are generally less than 1, but greater than 0.1. I've tried similar code using batch gradient descent, and get slightly better results (some are < 0.0001, but certainly not all).
I'm not sure if I made either a mistake with my math or code. Any help would be greatly appreciated!
update
I discovered a small typo in my code (which doesn't appear in the code below) causing exceptionally bad performance. Unfortunately, I'm still getting less-than-good results. For example, comparison between the two gradients:
calculate check
0.0379 0.0383
0.0413 0.0409
0.0339 0.0342
0.0281 0.0282
0.0322 0.0320
with differences of up to 0.0004, which I'm assuming is still failing.
Okay, I think I might have solved the problem. Generally the differences in the gradients are < 1e-4, though I do have at least one which is 6e-4. Does anyone know if this is still acceptable?
To get this result, I rewrote the code without tying the weight matrices (I'm not sure if tying them will always cause the derivative check to fail). I've also included biases, as they didn't complicate things too badly.
Something else I realized when debugging is that it's really easy to make a mistake in the code. For example, it took me a while to catch:
grad_W1 = error_h*X';
instead of:
grad_W1 = X*error_h';
While the difference between these two lines is just the transpose of grad_W1, because of the requirement of packing/unpacking the parameters into a single vector, there's no way for Matlab to complain about grad_W1 being the wrong dimensions.
I've also included my own derivative check, which gives slightly different answers than minFunc's (my derivative check gives differences that are all below 1e-4).
fwdprop.m:
function [ hidden, output ] = fwdprop(W1, bias1, W2, bias2, X)
hidden = sigmoid(bsxfun(@plus, W1'*X, bias1));
output = sigmoid(bsxfun(@plus, W2'*hidden, bias2));
end
calcLoss.m:
function [ loss, grad ] = calcLoss(theta, X, nHidden)
[nVars, nInstances] = size(X);
[W1, bias1, W2, bias2] = unpackParams(theta, nVars, nHidden);
[hidden, output] = fwdprop(W1, bias1, W2, bias2, X);
err = output - X;
delta_o = err .* output .* (1.0 - output);
delta_h = W2*delta_o .* hidden .* (1.0 - hidden);
grad_W1 = X*delta_h';
grad_bias1 = sum(delta_h, 2);
grad_W2 = hidden*delta_o';
grad_bias2 = sum(delta_o, 2);
loss = 0.5*sum(err(:).^2);
grad = packParams(grad_W1, grad_bias1, grad_W2, grad_bias2);
end
unpackParams.m:
function [ W1, bias1, W2, bias2 ] = unpackParams(params, nVisible, nHidden)
mSize = nVisible*nHidden;
W1 = reshape(params(1:mSize), nVisible, nHidden);
offset = mSize;
bias1 = params(offset+1:offset+nHidden);
offset = offset + nHidden;
W2 = reshape(params(offset+1:offset+mSize), nHidden, nVisible);
offset = offset + mSize;
bias2 = params(offset+1:end);
end
packParams.m:
function [ params ] = packParams(W1, bias1, W2, bias2)
params = [W1(:); bias1; W2(:); bias2(:)];
end
checkDeriv.m:
function [check] = checkDeriv(X, theta, nHidden, epsilon)
[nVars, nInstances] = size(X);
[W1, bias1, W2, bias2] = unpackParams(theta, nVars, nHidden);
[hidden, output] = fwdprop(W1, bias1, W2, bias2, X);
err = output - X;
delta_o = err .* output .* (1.0 - output);
delta_h = W2*delta_o .* hidden .* (1.0 - hidden);
grad_W1 = X*delta_h';
grad_bias1 = sum(delta_h, 2);
grad_W2 = hidden*delta_o';
grad_bias2 = sum(delta_o, 2);
check = zeros(size(theta, 1), 2);
grad = packParams(grad_W1, grad_bias1, grad_W2, grad_bias2);
for i = 1:size(theta, 1)
Jplus = calcHalfDeriv(X, theta(:), i, nHidden, epsilon);
Jminus = calcHalfDeriv(X, theta(:), i, nHidden, -epsilon);
calcGrad = (Jplus - Jminus)/(2*epsilon);
check(i, :) = [calcGrad grad(i)];
end
end
calcHalfDeriv.m:
function [ loss ] = calcHalfDeriv(X, theta, i, nHidden, epsilon)
theta(i) = theta(i) + epsilon;
[nVisible, nInstances] = size(X);
[W1, bias1, W2, bias2] = unpackParams(theta, nVisible, nHidden);
[hidden, output] = fwdprop(W1, bias1, W2, bias2, X);
err = output - X;
loss = 0.5*sum(err(:).^2);
end
Update
Okay, I've also figured out why tying the weights was causing issues. I wanted to go down to just [ W1; bias1; bias2 ] since W2 = W1'. This way I could simply recreate W2 by looking at W1. However, because the values of $\theta$ are changed by epsilon, this was in effect changing both matrices at the same time. The proper solution is to simply pass W1 as a separate parameter while at the same time reducing $\theta$.
Update 2
Okay, this is what I get for posting too late at night. While the first update does indeed cause things to pass correctly, it's not the correct solution.
I think the correct thing to do is to actually calculate the gradients for W1 and W2, and then set the final gradient of W1 to grad_W1 + grad_W2' (transposed, since W2 = W1'). The hand-waving argument is that since the weight matrix is acting to both encode and decode, its weights must be affected by both gradients. I haven't thought through the actual theoretical ramifications of this yet, however.
If I run this using my own derivative check, it passes the 10e-4 threshold. It does much better than before with minFunc's derivative check, though still worse than if I don't tie the weights.
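The claim that a tied weight collects both gradient contributions can be checked on the smallest possible case. The sketch below is plain Python (standard library only), not the MATLAB code above. It assumes a scalar autoencoder with a single tied weight w, no biases, and a single input x; the values of w, x and eps are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x):
    """Scalar tied autoencoder: encode with w, decode with the same w."""
    hidden = sigmoid(w * x)
    output = sigmoid(w * hidden)
    return 0.5 * (output - x) ** 2

def gradients(w, x):
    """Return (encoder contribution, decoder contribution) to dloss/dw."""
    hidden = sigmoid(w * x)
    output = sigmoid(w * hidden)
    delta_o = (output - x) * output * (1.0 - output)
    delta_h = w * delta_o * hidden * (1.0 - hidden)
    return delta_h * x, delta_o * hidden

w, x, eps = 0.5, 0.8, 1e-5

# Central difference: perturbing w changes the encoder and decoder together.
numeric = (loss(w + eps, x) - loss(w - eps, x)) / (2 * eps)
grad_enc, grad_dec = gradients(w, x)
print(numeric, grad_enc + grad_dec, grad_enc)
```

The sum of the two contributions agrees with the central difference, while the encoder contribution alone does not. In the matrix case the decoder gradient has the transposed shape, so it has to be transposed before it is added.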