SVGP for US Flight data - gpflow

My problem is an optimization issue with SVIGP on the US Flight dataset.
I implemented the SVGP model for the US flight data described in Hensman 2014, using 100 inducing points, batch_size = 1000, learning rate = 1e-5 and maxiter = 500.
The result is pretty strange: the ELBO does not increase, and it has a large variance no matter how I tune the learning rate.
Initialization
M = 100
D = 8
def init():
    kern = gpflow.kernels.RBF(D, 1, ARD=True)
    Z = X_train[:M, :].copy()
    m = gpflow.models.SVGP(X_train, Y_train.reshape([-1,1]), kern, gpflow.likelihoods.Gaussian(), Z, minibatch_size=1000)
    return m
m = init()
Inference
m.feature.trainable = True
opt = gpflow.train.AdamOptimizer(learning_rate = 0.00001)
m.compile()
opt.minimize(m, step_callback=logger, maxiter = 500)
plt.plot(logf)
plt.xlabel('iteration')
plt.ylabel('ELBO')
Result:
Added Results
Once I add more iterations and use a larger learning rate, the ELBO increases with the number of iterations, which is good to see. But it is very confusing that the RMSE (root mean square error) on both the training and the test data increases too. Do you have any suggestions?
Figures and code are shown below:
ELBOs vs iterations
Train RMSEs vs iterations
Test RMSEs vs iterations
Using logger
def logger(x):
    print(m.compute_log_likelihood())
    logx.append(x)
    logf.append(m.compute_log_likelihood())
    logt.append(time.time() - st)
    py_train = m.predict_y(X_train)[0]
    py_test = m.predict_y(X_test)[0]
    rmse_hist.append(np.sqrt(np.mean((Y_train - py_train)**2)))
    rmse_test_hist.append(np.sqrt(np.mean((Y_test - py_test)**2)))
    logger.i+=1
logger.i = 1
And the full code is shown through link.
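One possible contributor to the large variance in the logged ELBO, assuming the logged value is evaluated on a random minibatch of 1000 points rather than on the full data set, is that a minibatch objective is only a rescaled, noisy estimate of the full-data sum. The sketch below uses plain NumPy and hypothetical per-point terms (`per_point` is made up for illustration, it is not the GPflow objective) to show how noisy such an estimate can be and how a moving average smooths the curve.

```python
import numpy as np

rng = np.random.default_rng(0)
N, batch_size = 100_000, 1000

# Hypothetical per-datum contributions to an objective (illustration only).
per_point = rng.normal(-1.0, 2.0, size=N)
full_sum = per_point.sum()

def minibatch_estimate():
    """Unbiased estimate of full_sum from one random minibatch."""
    idx = rng.choice(N, size=batch_size, replace=False)
    return (N / batch_size) * per_point[idx].sum()

estimates = np.array([minibatch_estimate() for _ in range(500)])

# A moving average over 50 logged values gives a much smoother curve.
window = 50
smoothed = np.convolve(estimates, np.ones(window) / window, mode="valid")

print("full sum:          ", full_sum)
print("mean of estimates: ", estimates.mean())
print("std of estimates:  ", estimates.std())
print("std after smoothing:", smoothed.std())
```

If this is what is happening, plotting a smoothed curve, or evaluating the objective on a fixed larger subset every few iterations, may make the trend easier to read.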

Related

Inconsistency when estimating AR model coefficients in MATLAB

I'm trying to estimate the coefficients of an AR[2] model
x(t) = a_1*x(t-1) + a_2*x(t-2) + e(t), e(t) ~ N(0, sigma^2)
in MATLAB. For a_1 = 2*cos(2*pi/T)*exp(-1/tau), a_2 = -exp(-2/tau), the AR[2] model corresponds to a linear damped oscillator with period T and relaxation time tau. I simulated some data for this process with T = 30 and tau = 100 which corresponds to a_1 = 1.9368, a_2 = -0.9802:
T = 30; tau = 100;
a_1 = 2*cos(2*pi/T)*exp(-1/tau); a_2 = -exp(-2/tau);
simuMdl = arima(2,0,0);
simuMdl.Constant = 0;
simuMdl.Variance = 1e-1;
simuMdl.AR{1} = a_1;
simuMdl.AR{2} = a_2;
data = simulate(simuMdl, 600);
data = data(501:end);
plot(data)
I only take the last 100 timepoints to make sure the system is not influenced by the initial conditions any more. Now, when trying to estimate the parameters, everything works just fine when using the estimate command that uses maximum likelihood estimation:
ToEstMdl = arima(2,0,0); ToEstMdl.Constant = 0;
EstMdl = estimate(ToEstMdl, data);
EstMdl.AR
%'[1.9319] [-0.9745]'
However, when I use the Yule-Walker-Equations implemented in aryule, I get a completely different result that does not match the true parameter values at all:
aryule(data, 2)
%'1.0000 -1.4645 0.5255'
Does anyone have an idea why the Yule-Walker equations perform so much worse than the MLE approach?
Yule-Walker (YW) is a method-of-moments estimator, so its estimate improves as the number of data points grows. You can check this in your example by using all 600 data points to see the 'best' YW estimate you can get; the MLE would still be better. You can also increase the number of data points to, say, 5000 instead of 600, and you will see that the YW estimate using all 5000 points starts to approach the MLE estimate.
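The sample-size effect can be sketched in Python with NumPy. This is a minimal illustration, not a reimplementation of `aryule`: it assumes the mean is removed and the biased autocovariance estimate is used, and the helper names `simulate_ar2` and `yule_walker_ar2` are made up for this example. Note also that it returns a_1 and a_2 directly, whereas the `aryule` output shown above appears to list the polynomial coefficients with the opposite sign after a leading 1.

```python
import numpy as np

def simulate_ar2(a1, a2, n, sigma, rng, burn_in=500):
    """Simulate x(t) = a1*x(t-1) + a2*x(t-2) + e(t) and drop the burn-in."""
    e = rng.normal(0.0, sigma, size=n + burn_in)
    x = np.zeros(n + burn_in)
    for t in range(2, n + burn_in):
        x[t] = a1 * x[t - 1] + a2 * x[t - 2] + e[t]
    return x[burn_in:]

def yule_walker_ar2(x):
    """Solve the 2x2 Yule-Walker system using biased autocovariances."""
    x = x - x.mean()
    n = len(x)
    c = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(3)])
    R = np.array([[c[0], c[1]], [c[1], c[0]]])
    return np.linalg.solve(R, c[1:])

T, tau = 30, 100
a_true = np.array([2 * np.cos(2 * np.pi / T) * np.exp(-1 / tau),
                   -np.exp(-2 / tau)])

rng = np.random.default_rng(0)
x_long = simulate_ar2(a_true[0], a_true[1], 20000, np.sqrt(0.1), rng)

est_short = yule_walker_ar2(x_long[:100])  # same length as in the question
est_long = yule_walker_ar2(x_long)         # many more samples

print("true:      ", a_true)
print("YW, n=100: ", est_short)
print("YW, n=20000:", est_long)
```

Because the poles of this process lie close to the unit circle, small errors in the estimated autocorrelations are strongly amplified when the system is solved, which is one way to see why short records hurt the YW estimate so much here.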

pytorch linear regression given wrong results

I implemented a simple linear regression and I’m getting some poor results. Just wondering if these results are normal or I’m making some mistake.
I tried different optimizers and learning rates, I always get bad/poor results
Here is my code:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from torch.autograd import Variable
class LinearRegressionPytorch(nn.Module):
    def __init__(self, input_dim=1, output_dim=1):
        super(LinearRegressionPytorch, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)
    def forward(self,x):
        x = x.view(x.size(0),-1)
        y = self.linear(x)
        return y
input_dim=1
output_dim = 1
if torch.cuda.is_available():
    model = LinearRegressionPytorch(input_dim, output_dim).cuda()
else:
    model = LinearRegressionPytorch(input_dim, output_dim)
criterium = nn.MSELoss()
l_rate =0.00001
optimizer = torch.optim.SGD(model.parameters(), lr=l_rate)
#optimizer = torch.optim.Adam(model.parameters(),lr=l_rate)
epochs = 100
#create data
x = np.random.uniform(0,10,size = 100) #np.linspace(0,10,100);
y = 6*x+5
mu = 0
sigma = 5
noise = np.random.normal(mu, sigma, len(y))
y_noise = y+noise
#pass it to pytorch
x_data = torch.from_numpy(x).float()
y_data = torch.from_numpy(y_noise).float()
if torch.cuda.is_available():
    inputs = Variable(x_data).cuda()
    target = Variable(y_data).cuda()
else:
    inputs = Variable(x_data)
    target = Variable(y_data)
for epoch in range(epochs):
    #predict data
    pred_y= model(inputs)
    #compute loss
    loss = criterium(pred_y, target)
    #zero grad and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    #if epoch % 50 == 0:
    #    print(f'epoch = {epoch}, loss = {loss.item()}')
#print params
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.data)
These are the poor results:
linear.weight tensor([[1.7374]], device='cuda:0')
linear.bias tensor([0.1815], device='cuda:0')
The results should be weight = 6 , bias = 5
Problem Solution
Actually, the shape of your target is problematic. Your target needs the same shape as your outputs (which you are, correctly, reshaping with view in forward, so that each sample is one row).
Your loss should be defined like this:
loss = criterium(pred_y, target.view(-1, 1))
This network is correct
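Why the reshape matters can be sketched with NumPy, which follows the same broadcasting rules as PyTorch. The arrays below are made-up stand-ins for `pred_y` (shape (100, 1)) and `target` (shape (100,)); this is an illustration of the shape issue, not the original training code.

```python
import numpy as np

rng = np.random.default_rng(0)

target = rng.normal(size=100)                                # shape (100,)
pred = (target + 0.1 * rng.normal(size=100)).reshape(-1, 1)  # shape (100, 1)

# Mismatched shapes broadcast to a (100, 100) matrix of all pairwise differences.
diff_wrong = pred - target
# Matching shapes give the intended element-wise differences.
diff_right = pred - target.reshape(-1, 1)

mse_wrong = np.mean(diff_wrong ** 2)
mse_right = np.mean(diff_right ** 2)

print(diff_wrong.shape, mse_wrong)
print(diff_right.shape, mse_right)
```

With the mismatched shapes, every prediction is compared against every target, so the loss stays large even when the predictions are nearly perfect.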
Results
Your results will not be exactly bias = 5 (the weight will indeed go toward 6), because you are adding random noise to the target; with only 100 noisy points, the fitted bias is affected noticeably more than the weight.
If you want the bias to equal 5, remove the added noise.
You should also increase the number of epochs, as your data set is quite small and the network (a linear regression, in fact) is not very powerful. Say, 10000 should be fine, and your loss should oscillate around 0 (if you change your noise to something sensible).
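To see what weight and bias a converged fit should reach on this kind of data, one option is to compare against the closed-form least-squares solution. The sketch below regenerates data the same way as the question (with a fixed seed, which is an assumption added here for reproducibility) and uses `np.polyfit` as the reference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same data-generating process as in the question: y = 6x + 5 plus N(0, 5^2) noise.
x = rng.uniform(0, 10, size=100)
y_noise = 6 * x + 5 + rng.normal(0, 5, size=100)

# Degree-1 least-squares fit; polyfit returns the highest-order coefficient first.
weight, bias = np.polyfit(x, y_noise, deg=1)

print("least-squares weight:", weight)
print("least-squares bias:  ", bias)
```

If gradient descent ends far from these values, the training setup (loss shape, learning rate, number of epochs) is the more likely culprit than the noise.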
Noise
Each data point gets its own draw of Gaussian noise with standard deviation 5, hence your loss will stay well above zero. Linear regression cannot fit the noise itself, and the learned bias will be less precise than the slope (the optimal slope is still approximately 6 for your noise; you may try increasing the 5 to 1000 and see what weight and bias are learned).
Style (a little offtopic)
Please read the PyTorch documentation and keep your code up to date (e.g. Variable is deprecated in favor of Tensor, and rightfully so).
This part of code:
x_data = torch.from_numpy(x).float()
y_data = torch.from_numpy(y_noise).float()
if torch.cuda.is_available():
    inputs = Tensor(x_data).cuda()
    target = Tensor(y_data).cuda()
else:
    inputs = Tensor(x_data)
    target = Tensor(y_data)
Could be written succinctly like this (without much thought):
inputs = torch.from_numpy(x).float()
target = torch.from_numpy(y_noise).float()
if torch.cuda.is_available():
    inputs = inputs.cuda()
    target = target.cuda()
I know deep learning has its reputation for bad code and poor practices, but please do not help spread this approach.

How to cluster X to maximize explained variance in Y?

This is a problem I've encountered more than once, and I have a pseudo-solution in mind, but it's more or less a Monte Carlo method rather than anything clever.
What I'm trying to do is essentially convert a continuous variable into a categorical variable in such a way that each category has a significantly different mean in the response variable.
So let's say I am trying to model rates of depression against Age Groups. In my model, I want to have up to N Age Groups, and the bounds for each group can be arbitrarily sized (5-10, 11-27, 28-30, 31-64...etc). The question is, how to choose the bounds such that with N groups, the explained variance in depression rates can be maximized?
# Monte Carlo approach using iris dataset as an example
n_clust = 5
best_bounds = rep(0, n_clust)
best_groups = NULL
bestSSE = Inf
X_var = iris$Petal.Length
Y_var = iris$Sepal.Width
min_x = min(X_var)
max_x = max(X_var)
range_x = max_x - min_x
for (i in 1:10000){
  b = sort(runif(n_clust-1))
  b = cumsum(b / sum(b))
  bounds = min_x + b * range_x
  groups = cut(X_var, breaks = c(-Inf,bounds,Inf))
  model = lm(Y_var~groups)
  SSE = sum(model$residuals^2)
  if (SSE < bestSSE){
    print(SSE)
    best_bounds = bounds
    best_groups = groups
    bestSSE = SSE
  }
}
g = aggregate(Y_var, list(best_groups), mean)
names(g) = c("Cluster", "y_mean")
g$Cluster=c(best_bounds)
plot(X_var, Y_var, col='blue', pch=20)
abline(lm(Y_var~X_var), col='darkgray', lty=2)
for (i in 1:(nrow(g))){
  x0 = ifelse(i == 1, min_x-max_x, g[i-1,"Cluster"])
  x1 = ifelse(i < nrow(g), g[i,"Cluster"], 2*max_x)
  segments(x0,g[i,"y_mean"],x1,g[i,"y_mean"], col='red')
}
R_cont = summary(lm(Y_var~X_var))$r.squared
R_cat = summary(lm(Y_var~best_groups))$r.squared
title(paste("R^2:", round(R_cont,4),"vs",round(R_cat,4)))
Note: I don't care about interpretation, only predictive ability.
I would not look at this from a clustering perspective. Instead, treat it as an optimization problem. Then do gradient descent to optimize, or any other search.
Another option would be piecewise linear regression, but you want a "piecewise constant regression" actually.
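As one concrete "other search", a piecewise constant fit in one dimension can be solved exactly with dynamic programming instead of random sampling. The sketch below is a swapped-in technique, not something proposed in the answer above: the function name `best_piecewise_constant` is hypothetical, it is written in Python rather than R, and it assumes there are at least `n_groups` distinct x values.

```python
import numpy as np

def best_piecewise_constant(x, y, n_groups):
    """Minimum-SSE split of y into n_groups contiguous bins along sorted x."""
    order = np.argsort(x, kind="stable")
    xs = np.asarray(x, dtype=float)[order]
    ys = np.asarray(y, dtype=float)[order]
    n = len(ys)
    # Prefix sums give the SSE of any segment ys[i:j] in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(ys)))
    s2 = np.concatenate(([0.0], np.cumsum(ys ** 2)))

    def sse(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    # A cut at position j is only allowed between two distinct x values.
    can_cut = np.zeros(n + 1, dtype=bool)
    can_cut[0] = can_cut[n] = True
    can_cut[1:n] = xs[1:] != xs[:-1]

    # cost[k, j]: best SSE for the first j points using k groups.
    cost = np.full((n_groups + 1, n + 1), np.inf)
    prev = np.zeros((n_groups + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for k in range(1, n_groups + 1):
        for j in range(1, n + 1):
            if not can_cut[j]:
                continue
            for i in range(k - 1, j):
                if not can_cut[i] or not np.isfinite(cost[k - 1, i]):
                    continue
                c = cost[k - 1, i] + sse(i, j)
                if c < cost[k, j]:
                    cost[k, j], prev[k, j] = c, i

    # Walk back through prev to recover the start index of each group.
    starts, j = [], n
    for k in range(n_groups, 0, -1):
        j = prev[k, j]
        starts.append(j)
    cuts = sorted(starts)[1:]  # drop the leading 0
    bounds = [(xs[c - 1] + xs[c]) / 2 for c in cuts]
    return bounds, cost[n_groups, n]

# Toy data with three obvious levels.
x_demo = np.arange(9)
y_demo = np.array([0, 0, 0, 5, 5, 5, 9, 9, 9])
bounds_demo, sse_demo = best_piecewise_constant(x_demo, y_demo, 3)
print(bounds_demo, sse_demo)
```

The triple loop costs on the order of n_groups times n squared operations, which is fine for a data set the size of iris; the returned bounds are midpoints between neighbouring x values and could be passed to a binning function such as R's `cut`.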

Variable error rate of SVM Classifier using K-Fold Cross Validation Matlab

I'm using K-Fold Cross-validation to get the error rate of an SVM Classifier. This is the code with which I'm getting the error rate for 8-Fold Cross-validation:
data = load('Entrenamiento.txt');
group = importdata('Grupos.txt');
CP = classperf(group);
N = length(group);
k = 8;
indices = crossvalind('KFold',N,k);
single_error = zeros(1,k);
for j = 1:k
    test = (indices==j);
    train = ~test;
    SVMModel_1 = fitcsvm(data(train,:),group(train,:),'BoxConstraint',1,'KernelFunction','linear');
    classification = predict(SVMModel_1,data(test,:));
    classperf(CP,classification,test);
    single_error(1,j) = CP.ErrorRate;
end
confusion_matrix = CP.CountingMatrix
VP = confusion_matrix(1,1);
FP = confusion_matrix(1,2);
FN = confusion_matrix(2,1);
VN = confusion_matrix(2,2);
mean_error = mean(single_error)
However, the mean_error changes each time I run the script. This is due to crossvalind, which generates random cross-validation indices, so each time I run the script, it generates different random indices.
What should I do to calculate the true error rate? Should I calculate the mean error rate of n code executions? Or what value should I use?
You can check Wikipedia:
In k-fold cross-validation, the original sample is randomly partitioned into k equal size subsamples.
and
The k results from the folds can then be averaged (or otherwise combined) to produce a single estimation.
So there is no need to worry about getting different error rates from randomly selected folds.
Of course the results will be different.
However, if your error rate varies over a wide range, increasing k would help.
Also, rng can be used to fix the random seed and get reproducible results.
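Both ideas, fixing the seed and averaging over repeated random partitions, can be sketched in Python. This is only an illustration under stated assumptions: the data are synthetic, a simple nearest-centroid classifier stands in for the SVM, and the function name `kfold_error` is made up; a NumPy seed plays the role that `rng` plays in MATLAB.

```python
import numpy as np

def kfold_error(X, y, k, seed):
    """Mean misclassification rate of a nearest-centroid classifier over k folds."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % k  # random folds of near-equal size
    errors = []
    for j in range(k):
        test = folds == j
        train = ~test
        classes = np.unique(y[train])
        centroids = np.array([X[train & (y == c)].mean(axis=0) for c in classes])
        # Squared distance of every test point to every class centroid.
        dists = ((X[test][:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        pred = classes[dists.argmin(axis=1)]
        errors.append(np.mean(pred != y[test]))
    return float(np.mean(errors))

# Synthetic two-class data (illustration only).
data_rng = np.random.default_rng(0)
X = np.vstack([data_rng.normal(0.0, 1.0, (60, 2)),
               data_rng.normal(1.5, 1.0, (60, 2))])
y = np.repeat([0, 1], 60)

# Same seed, same partition, same error estimate.
same_a = kfold_error(X, y, k=8, seed=1)
same_b = kfold_error(X, y, k=8, seed=1)

# Different seeds give slightly different estimates; report mean and spread.
repeated = [kfold_error(X, y, k=8, seed=s) for s in range(20)]
print(same_a == same_b)
print(np.mean(repeated), np.std(repeated))
```

Reporting the mean together with the spread over repetitions gives a more honest picture than any single run.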

Naive Bayes Classifier for Multiclass: Getting Same Error Rate

I have implemented a Naive Bayes classifier for multiclass problems, but the problem is that my error rate stays the same as I increase the training data set. I have debugged this over and over but wasn't able to figure out why it's happening. So I thought I'd post it here to find out if I am doing anything wrong.
%Naive Bayes Classifier
%This function splits the data 80:20 into training and test sets, then from the 80
%we use incremental 5,10,15,20,30 percent as the training data to understand the error
%rate.
%Goal is to compare the plots in stanford paper
%http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
function[tPercent] = naivebayes(file, iter, percent)
    dm = load(file);
    for i=1:iter
        %Getting the index common to test and train data
        idx = randperm(size(dm.data,1))
        %Using same idx for data and labels
        shuffledMatrix_data = dm.data(idx,:);
        shuffledMatrix_label = dm.labels(idx,:);
        percent_data_80 = round((0.8) * length(shuffledMatrix_data));
        %Doing 80-20 split
        train = shuffledMatrix_data(1:percent_data_80,:);
        test = shuffledMatrix_data(percent_data_80+1:length(shuffledMatrix_data),:);
        %Getting the label data from the 80:20 split
        train_labels = shuffledMatrix_label(1:percent_data_80,:);
        test_labels = shuffledMatrix_label(percent_data_80+1:length(shuffledMatrix_data),:);
        %Getting the array of percents [5 10 15..]
        percent_tracker = zeros(length(percent), 2);
        for pRows = 1:length(percent)
            percentOfRows = round((percent(pRows)/100) * length(train));
            new_train = train(1:percentOfRows,:);
            new_train_label = train_labels(1:percentOfRows);
            %get unique labels in training
            numClasses = size(unique(new_train_label),1);
            classMean = zeros(numClasses,size(new_train,2));
            classStd = zeros(numClasses, size(new_train,2));
            priorClass = zeros(numClasses, size(2,1));
            % Doing the K class mean and std with prior
            for kclass=1:numClasses
                classMean(kclass,:) = mean(new_train(new_train_label == kclass,:));
                classStd(kclass, :) = std(new_train(new_train_label == kclass,:));
                priorClass(kclass, :) = length(new_train(new_train_label == kclass))/length(new_train);
            end
            error = 0;
            p = zeros(numClasses,1);
            % Calculating the posterior for each test row for each k class
            for testRow=1:length(test)
                c=0; k=0;
                for class=1:numClasses
                    temp_p = normpdf(test(testRow,:),classMean(class,:), classStd(class,:));
                    p(class, 1) = sum(log(temp_p)) + (log(priorClass(class)));
                end
                %Take the max of posterior
                [c,k] = max(p(1,:));
                if test_labels(testRow) ~= k
                    error = error + 1;
                end
            end
            avgError = error/length(test);
            percent_tracker(pRows,:) = [avgError percent(pRows)];
            tPercent = percent_tracker;
            plot(percent_tracker)
        end
    end
end
Here is the dimensionality of my data:
x =
data: [768x8 double]
labels: [768x1 double]
I am using the Pima data set from UCI.
What are the results of your implementation on the training data itself? Does it fit it at all?
It's hard to be sure, but there are a couple of things that I noticed:
It is important for every class to have training data. You can't really train a classifier to recognize a class if there was no training data for it.
If possible, the number of training examples shouldn't be skewed towards some of the classes. For example, if in a 2-class classification the training and cross-validation examples for class 1 constitute only 5% of the data, then a function that always returns class 2 will have an error of 5%. Did you try checking precision and recall separately?
You're trying to fit a normal distribution to each feature in a class and then use it for the posterior probabilities. I'm not sure how that plays out in terms of smoothing. Could you try to re-implement it with simple counting and see if it gives any different results?
It could also be that the features are highly redundant and the Bayes method overcounts probabilities.
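For checking how well the model fits the training data itself, a compact reference implementation can help. The sketch below is a minimal Gaussian naive Bayes in Python with NumPy, run on synthetic data; the function names `fit_gnb` and `predict_gnb` are made up for this example. Two details may be worth comparing against the posted MATLAB code: the classes are taken from the labels themselves (so 0/1 labels work), and the maximum is taken over all class scores, whereas `max(p(1,:))` in the question looks at only the first entry of the column vector p.

```python
import numpy as np

def fit_gnb(X, y):
    """Per-class feature means, standard deviations and priors."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    stds = np.array([X[y == c].std(axis=0, ddof=1) for c in classes])
    priors = np.array([np.mean(y == c) for c in classes])
    return classes, means, stds, priors

def predict_gnb(model, X):
    """Pick the class with the largest log posterior for each row of X."""
    classes, means, stds, priors = model
    z = (X[:, None, :] - means[None, :, :]) / stds[None, :, :]
    # Sum of per-feature Gaussian log densities, shape (n_samples, n_classes).
    log_lik = (-0.5 * z ** 2 - np.log(stds[None, :, :])
               - 0.5 * np.log(2 * np.pi)).sum(axis=2)
    log_post = log_lik + np.log(priors)[None, :]
    return classes[log_post.argmax(axis=1)]

# Synthetic two-class data with labels 0 and 1 (illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 3)), rng.normal(3.0, 1.0, (200, 3))])
y = np.repeat([0, 1], 200)

model = fit_gnb(X, y)
pred = predict_gnb(model, X)
train_error = np.mean(pred != y)
print("training error:", train_error)
```

If a reference like this shows the error changing with the training size while the original code does not, the difference is more likely in the prediction step than in the data.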