Neural Nets with Pymc3 - neural-network

I am trying to use pymc3 to sample from the posterior of a set of single-hidden-layer neural nets, so that I can then convert the model to a hierarchical one, as in Radford M. Neal's thesis. My first model looks like this:
import pymc3 as pm
import theano.tensor as T

def sample(nHiddenUnts,X,Y):
    nFeatures = X.shape[1]
    with pm.Model() as model:
        #priors
        bho = pm.Normal('hiddenOutBias',mu=0,sd=100)
        who = pm.Normal('hiddenOutWeights',mu=0,sd=100,shape= (nHiddenUnts,1) )
        bih = pm.Normal('inputBias',mu=0,sd=100 ,shape=nHiddenUnts)
        wih= pm.Normal('inputWeights',mu=0,sd=100,shape=(nFeatures,nHiddenUnts))
        netOut=T.dot( T.nnet.sigmoid( T.dot( X , wih ) + bih ) , who )+bho
        #likelihood
        likelihood = pm.Normal('likelihood',mu=netOut,sd=0.001,observed=Y)
        start = pm.find_MAP()
        step = pm.Metropolis()
        trace = pm.sample(100000, step, start, progressbar=True)
    return trace
In the second model I add hyperpriors, which are the precisions of the noise and of the input-to-hidden and hidden-to-output weights and biases (e.g. bihTau = precision of the input->hidden bias). The parameters of the hyperpriors are chosen so that they are broad, and the hyperparameters are log-transformed.
#Gamma Hyperpriors
bhoTau, log_bhoTau = model.TransformedVar('bhoTau',
    pm.Gamma.dist(alpha=1,beta=1e-2,testval=1e-4),
    pm.logtransform)
WhoTau, log_WhoTau = model.TransformedVar('WhoTau',
    pm.Gamma.dist(alpha=1,beta=1e-2,testval=1e-4),
    pm.logtransform)
bihTau, log_bihTau = model.TransformedVar('bihTau',
    pm.Gamma.dist(alpha=1,beta=1e-2,testval=1e-4),
    pm.logtransform)
wihTau, log_wihTau = model.TransformedVar('wihTau',
    pm.Gamma.dist(alpha=1,beta=1e-2,testval=1e-4),
    pm.logtransform)
noiseTau, log_noiseTau = model.TransformedVar('noiseTau',
    pm.Gamma.dist(alpha=1,beta=1e-2,testval=1e+4),
    pm.logtransform)
#priors
bho = pm.Normal('hiddenOutBias',mu=0,tau=bhoTau)
who = pm.Normal('hiddenOutWeights',mu=0,tau=WhoTau,shape=(nHiddenUnts,1) )
bih = pm.Normal('inputBias',mu=0,tau=bihTau ,shape=nHiddenUnts)
wih= pm.Normal('inputWeights',mu=0,tau=wihTau ,shape= (nFeatures,nHiddenUnts))
.
.
.
start = pm.find_MAP()
step = pm.NUTS(scaling=start)
where bho, who, bih and wih are the biases and weights of the hidden-to-output and input-to-hidden layers.
To check my models, 3 to 5 sample points in [0,1) were drawn from a one-dimensional toy function of the following form:
def g(x):
    return np.prod( x+np.sin(2*np.pi*x),axis=1)
The first model (constant hyperparameters) works fine! But when I sample from the joint posterior of the hyperparameters and parameters, i.e. replace the priors in the first listing above with the ones in the second, neither find_MAP() nor the sampling method converges, regardless of the number of samples, and the resulting ANNs do not interpolate the sample points. I then tried to integrate the hyperpriors into my model one by one. The only one that could be integrated without problems was the noise precision. If I include any of the others, the sampler does not converge to the posterior. I tried this using one step method for all the model variables and also a combination of two separate step methods over the parameters and hyperparameters. In all cases, and with different numbers of samples, the problem remained.
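As a side note on the parameterization: the first listing specifies sd, the second specifies tau (precision), and the two are related by sd = 1/sqrt(tau). The sketch below is plain Python arithmetic, not PyMC3 code. It assumes the Gamma prior is in rate form (mean alpha/beta), and the helper name sd_from_tau is made up for illustration. It only converts the test values and the prior mean from the second listing into standard deviations, so they can be compared with the sd=100 and sd=0.001 used in the first listing.

```python
import math

def sd_from_tau(tau):
    """Convert a precision (tau) into the equivalent standard deviation."""
    return 1.0 / math.sqrt(tau)

# Test values used for the weight/bias precisions and the noise precision
weight_testval_sd = sd_from_tau(1e-4)   # roughly 100, same scale as sd=100 in the first model
noise_testval_sd = sd_from_tau(1e4)     # roughly 0.01

# Mean of Gamma(alpha=1, beta=1e-2), assuming beta is a rate parameter
alpha, beta = 1.0, 1e-2
prior_mean_tau = alpha / beta
sd_at_prior_mean = sd_from_tau(prior_mean_tau)  # roughly 0.1

print(weight_testval_sd, noise_testval_sd, prior_mean_tau, sd_at_prior_mean)
```

Under these assumptions the starting values match the scale of the first model, while the mean of the hyperprior corresponds to a much smaller standard deviation; whether that matters for the convergence problem is not established here.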

Related

EEG data classification with SWLDA using matlab

I want to ask your help in EEG data classification.
I am a graduate student trying to analyze EEG data.
Now I am struggling with classifying ERP speller (P300) data with SWLDA using Matlab.
Maybe there is something wrong in my code.
I have read several articles, but they did not cover many details.
My data size is described below.
size(target) = [300 1856]
size(nontarget) = [998 1856]
Rows are trials and columns are the flattened features
(I flattened the [64 29] data into a single row; for visual representation I did not select an ROI).
I used the stepwisefit function in Matlab to classify target vs. non-target.
The code is attached below.
ingredients = [targets; nontargets];
heat = [class_targets; class_nontargets]; % target: 1, non-target: -1
randomized_set = shuffle([ingredients heat]);
for k=1:10 % 10-fold cross validation
    parition_factor = ceil(size(randomized_set,1) / 10);
    cv_test_idx = (k-1)*parition_factor + 1:min(k * parition_factor, size(randomized_set,1));
    total_idx = 1:size(randomized_set,1);
    cv_train_idx = total_idx(~ismember(total_idx, cv_test_idx));
    ingredients = randomized_set(cv_train_idx, 1:end-1);
    heat = randomized_set(cv_train_idx, end);
    [W,SE,PVAL,INMODEL,STATS,NEXTSTEP,HISTORY]= stepwisefit(ingredients, heat, 'penter', .1);
    valid_id = find(INMODEL==1);
    v_weights = W(valid_id)';
    t_ingredients = randomized_set(cv_test_idx, 1:end-1);
    t_heat = randomized_set(cv_test_idx, end); % true labels for test set
    v_features = t_ingredients(:, valid_id);
    v_weights = repmat(v_weights, size(v_features, 1), 1);
    predictor = sum(v_weights .* v_features, 2);
    m_result = predictor > 0; % class A: +1, B: 0
    t_heat(t_heat==-1) = 0;
    acc(k) = sum(m_result==t_heat) / length(m_result);
end
P.S. My code is currently very inefficient and might be bad.
My assumption is that stepwisefit tests the significance of the coefficients at every step, so that only the valid columns remain in the model.
It is not LDA, but for binary classification LDA and linear regression should not differ.
However, the results were almost at chance level (for other binary data from the internet, it worked).
I think I did something wrong, and your help could correct me.
I would appreciate any suggestions and tips for implementing a classifier for the ERP speller.
Or any ideas for implementing SWLDA in Matlab code?
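The remark in the question that LDA and linear regression coincide for two classes can be checked numerically. The following is a minimal NumPy sketch on synthetic data (the class sizes, dimensions and means are made up, and it is Python rather than Matlab). It compares the direction of the least-squares weights, fitted with an intercept on targets coded +1/-1, with the Fisher LDA direction computed from the pooled within-class scatter.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n0, d = 60, 140, 5                       # hypothetical, unbalanced class sizes
X1 = rng.normal(loc=0.5, size=(n1, d))       # class +1
X0 = rng.normal(loc=0.0, size=(n0, d))       # class -1
X = np.vstack([X1, X0])
y = np.concatenate([np.ones(n1), -np.ones(n0)])

# Least-squares regression with an intercept column
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w_ls = coef[:-1]                             # drop the intercept

# Fisher LDA direction: Sw^{-1} (m1 - m0), Sw = pooled within-class scatter
m1, m0 = X1.mean(axis=0), X0.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X0 - m0).T @ (X0 - m0)
w_lda = np.linalg.solve(Sw, m1 - m0)

# Cosine of the angle between the two weight vectors
cosine = float(w_ls @ w_lda / (np.linalg.norm(w_ls) * np.linalg.norm(w_lda)))
print(cosine)
```

The two directions agree up to scale, but the intercept (and therefore the decision threshold) generally differs when the classes are unbalanced, so thresholding the regression output at 0 is not automatically the same as the LDA decision rule.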
The name SWLDA is only used in the context of Brain Computer Interfaces, but I bet it has another name in a more general context.
If you track the recipe of SWLDA you will end up at the Krusienski 2006 papers ("A comparison..." and "Toward enhanced P300..") and from there at the book where stepwise logarithmic regression is explained: Draper and Smith, "Applied Regression Analysis", 1981. However, as far as I am aware, no paper actually gives the complete recipe for implementing it (with all its details and secrets).
My approach was using stepwiseglm:
H=predictors;
TH=variables;
lbs=labels % (1,2)
if (stepwiseflag)
    mdl = stepwiseglm(H', lbs'-1,'constant','upper','linear','distr','binomial');
    if (mdl.NumEstimatedCoefficients>1)
        inmodel = [];
        for i=2:mdl.NumEstimatedCoefficients
            inmodel = [inmodel str2num(mdl.CoefficientNames{i}(2:end))];
        end
        H = H(inmodel,:);
        TH = TH(inmodel,:);
    end
end
lbls = classify(TH',H',lbs','linear');
You can also use a k-fold cross-validation approach using Matlab's cvpartition.
c = cvpartition(lbs,'k',10);
opts = statset('display','iter');
fun = @(XT,yT,Xt,yt)...
    (sum(~strcmp(yt,classify(Xt,XT,yT,'linear'))));

Fitting a sine wave with Keras and PYMC3 yields unexpected results

I've been trying to fit a sine curve with a keras (theano backend) model using pymc3. I've been using this post [http://twiecki.github.io/blog/2016/07/05/bayesian-deep-learning/] as a reference point.
A Keras implementation fit by optimization alone does a good job; however, Hamiltonian Monte Carlo and variational sampling from pymc3 do not fit the data. The trace is stuck where the prior is initialized. When I move the prior, the posterior moves to the same spot. The posterior predictive of the Bayesian model in cell 59 barely captures the sine wave, whereas the non-Bayesian model fits it nearly perfectly in cell 63. I created a notebook here: https://gist.github.com/tomc4yt/d2fb694247984b1f8e89cfd80aff8706 which shows the code and the results.
Here is a snippet of the model:
class GaussWeights(object):
    def __init__(self):
        self.count = 0
    def __call__(self, shape, name='w'):
        return pm.Normal(
            name, mu=0, sd=.1,
            testval=np.random.normal(size=shape).astype(np.float32),
            shape=shape)

def build_ann(x, y, init):
    with pm.Model() as m:
        i = Input(tensor=x, shape=x.get_value().shape[1:])
        m = i
        m = Dense(4, init=init, activation='tanh')(m)
        m = Dense(1, init=init, activation='tanh')(m)
        sigma = pm.Normal('sigma', 0, 1, transform=None)
        out = pm.Normal('out',
                        m, 1,
                        observed=y, transform=None)
    return out

with pm.Model() as neural_network:
    likelihood = build_ann(input_var, target_var, GaussWeights())
    # v_params = pm.variational.advi(
    #     n=300, learning_rate=.4
    # )
    # trace = pm.variational.sample_vp(v_params, draws=2000)
    start = pm.find_MAP(fmin=scipy.optimize.fmin_powell)
    step = pm.HamiltonianMC(scaling=start)
    trace = pm.sample(1000, step, progressbar=True)
The model contains normal noise with a fixed std of 1:
out = pm.Normal('out', m, 1, observed=y)
but the dataset does not. It is only natural that the posterior predictive does not match the dataset, since the two were generated in very different ways. To make it more realistic you could add noise to your dataset, and then estimate sigma:
mu = pm.Deterministic('mu', m)
sigma = pm.HalfCauchy('sigma', beta=1)
pm.Normal('y', mu=mu, sd=sigma, observed=y)
What you are doing right now is similar to taking the output from the network and adding standard normal noise.
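To make that last point concrete, here is a minimal NumPy sketch (no pymc3 or Keras involved). It assumes a noise-free sine dataset and, purely for illustration, a network output mu that matches the data exactly; the sample sizes are arbitrary. Posterior predictive draws are then simulated the way a likelihood with fixed sd=1 would generate them.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 1000)
y_clean = np.sin(x)              # noise-free dataset, as described in the question
mu = y_clean.copy()              # assumption: the network output matches the data exactly

# Predictive draws under Normal(mu, sd=1): network output plus standard normal noise
draws = mu + rng.normal(0.0, 1.0, size=(200, x.size))

predictive_sd = float((draws - y_clean).std())   # spread of the predictive draws around the data
data_sd = float((y_clean - mu).std())            # spread of the data around the network output
print(predictive_sd, data_sd)
```

Even with a perfect mean, the draws scatter around the curve with a standard deviation of about 1, which is as large as the amplitude of the sine itself, while the data has no scatter at all.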
A couple of unrelated comments:
out is not the likelihood; it is just the dataset again.
If you use HamiltonianMC instead of NUTS, you need to set the step size and the integration time yourself. The defaults are not usually useful.
Seems like keras changed in 2.0 and this way of combining pymc3 and keras does not seem to work anymore.

Inconsistent/Different Test Performance/Error After Training Neural Network in Matlab

Keeping all parameters constant, I get different Mean Absolute Percentage Errors on my test data when retraining the neural network. Why is this so? Aren't all components of the neural network training process deterministic? Sometimes I see a difference of up to 1% between successive trainings.
The training code is below:
netFeb = newfit(trainX', trainY', networkConfigFeb);
netFeb.performFcn = 'mae';
netFeb = trainlm(netFeb, trainX', trainY');
% Index for the testing Data
startingInd = find(trainInd == 0, 1, 'first');
endingInd = startingInd + daysInMonth('Feb') - 1 ;
% Testing Data
testX = X(startingInd:endingInd,:);
testY = dailyPeakLoad(startingInd:endingInd,:);
actualLoadFeb = testY;
% Calculate the Forecast Load and the Mean Absolute Percentage Error
forecastLoadFeb = sim(netFeb, testX')';
errFeb = testY - forecastLoadFeb;
errpct = abs(errFeb)./testY*100;
MAPEFeb = mean(errpct(~isinf(errpct)));
As A. Donda hinted, since neural networks initialize their weights randomly, training produces a different network each time, and thus different performance. While the training process itself is deterministic, the initial values are not! As a result you may end up in different local minima or stop in different places.
If you wish to see why, take a look at Why should weights of Neural Networks be initialized to random numbers?
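The effect is easy to reproduce outside Matlab. The sketch below is a hypothetical NumPy example, not the newfit/trainlm code from the question: a tiny one-hidden-layer network is trained by plain gradient descent on a fixed toy dataset, and the only thing that changes between runs is the seed used for the initial weights. The function name train_tiny_net and all sizes are made up for illustration.

```python
import numpy as np

def train_tiny_net(seed, steps=300, lr=0.05):
    """Train a 1-5-1 tanh network with gradient descent; return the final MSE."""
    rng = np.random.default_rng(seed)          # the seed controls only the initial weights
    x = np.linspace(-1.0, 1.0, 40).reshape(-1, 1)
    y = np.sin(3.0 * x)                        # fixed, deterministic training data
    W1, b1 = rng.normal(size=(1, 5)), np.zeros(5)
    W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)
    n = x.shape[0]
    for _ in range(steps):
        h = np.tanh(x @ W1 + b1)               # hidden activations
        err = h @ W2 + b2 - y                  # prediction error
        # Gradients of 0.5 * mean squared error
        gW2 = h.T @ err / n
        gb2 = err.mean(axis=0)
        gh = (err @ W2.T) * (1.0 - h ** 2)
        gW1 = x.T @ gh / n
        gb1 = gh.mean(axis=0)
        W1 -= lr * gW1
        b1 -= lr * gb1
        W2 -= lr * gW2
        b2 -= lr * gb2
    pred = np.tanh(x @ W1 + b1) @ W2 + b2
    return float(np.mean((pred - y) ** 2))

mse_seed0 = train_tiny_net(0)
mse_seed0_again = train_tiny_net(0)   # same seed: identical result
mse_seed1 = train_tiny_net(1)         # different seed: different result
print(mse_seed0, mse_seed0_again, mse_seed1)
```

Fixing the random seed before creating the network is therefore the usual way to get repeatable results; how that is done in a particular Matlab toolbox version is not covered here.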
Edit 1:
Notes
Since the user defines the training and test data manually, the selection of the data sets introduces no randomness.

How to perform two group classification with deep neural network? (Matlab)

I'm new to machine learning (and to stackoverflow as well) and I want to do some classification tasks. I performed two-group classification on my data set (field of speech acoustics) with LIBSVM and with Matlab's Pattern Recognition Tool from the Neural Network Toolbox, creating a simple network with one hidden layer. In the hope of better classification results I want to try Deep Neural Networks, and I found this code: http://www.mathworks.com/matlabcentral/fileexchange/42853-deep-neural-network
I have some difficulty understanding it.
My data is constructed of 127 samples of 19 parameters, so my input number is 19. I want to classify them in two groups: 0 and 1, so my output number is 1. The values in my data set are normalized between 0 and 1.
My code is the following:
clear all
clc
addpath('..');
load('data.mat')
inputdata = inputs;
outputdata = outputs;
datanum = 127;
outputnum = 1;
hiddennum = 3;
inputnum = 19;
% rbm = randRBM(inputnum, outputnum);
% rbm = pretrainRBM( rbm, inputdata );
dbn = randDBN([inputnum, hiddennum, outputnum]);
dbn = pretrainDBN( dbn, inputdata );
dbn = SetLinearMapping( dbn, inputdata, outputdata );
dbn = trainDBN( dbn, inputdata, outputdata );
estimate = v2h( dbn, inputdata )
[rmse AveErrNum] = CalcRmse(dbn, inputdata, outputdata)
The code runs. The rmse is 0.4183 and the AveErrNum is 0.1969. What I need is the classification accuracy of the network's predictions against my targets (stored in outputdata), i.e. Accuracy = data classified correctly / all data.
Where do I find the network's predictions after binarization?
Am I using the right type of network for my classification?
Don't I need to divide my data into training, validation and testing samples (as in the case of a simple neural network with one hidden layer)?
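For reference, the accuracy defined above can be computed from any vector of continuous network outputs by thresholding. The sketch below is in Python with made-up numbers, not output of the File Exchange code; the 0.5 threshold and the variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical continuous network outputs and the corresponding 0/1 targets
estimate = np.array([0.1, 0.8, 0.4, 0.9, 0.6])
targets = np.array([0, 1, 1, 1, 0])

# Binarize at 0.5 (an assumed threshold), then compare with the targets
predictions = (estimate > 0.5).astype(int)
accuracy = float(np.mean(predictions == targets))   # correctly classified / all data
print(predictions, accuracy)
```

Computed on the training inputs this is only the training accuracy; an estimate of generalization needs held-out data.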
Thanks in advance for any help!

MATLAB: Naive Bayes with Univariate Gaussian

I am trying to implement a Naive Bayes classifier using a dataset published by the UCI machine learning team. I am new to machine learning and am trying to understand which techniques to use for my work-related problems, so I thought it better to understand the theory first.
I am using the pima dataset (Link to Data - UCI-ML), and my goal is to build a Naive Bayes univariate Gaussian classifier for a K-class problem (data is only available for K=2). I have split the data and calculated the mean and standard deviation for each class and the prior of each class, but now I am stuck because I am not sure what to do next and how. I have a feeling that I should be calculating the posterior probability.
Here is my code. I am using percent as a vector because I want to see the behavior as I increase the training data size within the 80:20 split. Basically, if you pass [10 20 30 40], it takes each of those percentages of the 80% training portion, e.g. 10% of the 80%, as training data.
function [classMean] = naivebayes(file, iter, percent)
dm = load(file);
for i=1:iter
    idx = randperm(size(dm.data,1))
    %Using same idx for data and labels
    shuffledMatrix_data = dm.data(idx,:);
    shuffledMatrix_label = dm.labels(idx,:);
    percent_data_80 = round((0.8) * length(shuffledMatrix_data));
    %Doing 80-20 split
    train = shuffledMatrix_data(1:percent_data_80,:);
    test = shuffledMatrix_data(percent_data_80+1:length(shuffledMatrix_data),:);
    train_labels = shuffledMatrix_label(1:percent_data_80,:)
    %Test labels must come from the label matrix, not the data matrix
    test_labels = shuffledMatrix_label(percent_data_80+1:length(shuffledMatrix_data),:);
    %Getting the array of percents
    for pRows = 1:length(percent)
        percentOfRows = round((percent(pRows)/100) * length(train));
        new_train = train(1:percentOfRows,:)
        new_trin_label = shuffledMatrix_label(1:percentOfRows)
        %get unique labels in training
        numClasses = size(unique(new_trin_label),1)
        classMean = zeros(numClasses,size(new_train,2));
        for kclass=1:numClasses
            classMean(kclass,:) = mean(new_train(new_trin_label == kclass,:))
            std(new_train(new_trin_label == kclass,:))
            priorClassforK = length(new_train(new_trin_label == kclass))/length(new_train)
            priorClassforK_1 = 1 - priorClassforK
        end
    end
end
end
First, compute the prior probability of every class label from frequency counts. Then, for a given sample and a given class in your data set, compute the conditional probability of every feature given that class. After that, multiply the conditional probabilities of all features in the sample by each other and by the prior probability of the class label under consideration. Finally, compare the resulting values across all class labels and choose the label with the maximum probability (Bayes classification rule).
For computing the conditional probabilities, you can simply use the Normal distribution function with the per-class mean and standard deviation of each feature.
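The steps above can be sketched as follows. This is a minimal NumPy version on synthetic data, written in Python rather than Matlab; the function names fit_gaussian_nb and predict_gaussian_nb and the toy data are made up for illustration. One deliberate deviation from the description: it sums log-probabilities instead of multiplying probabilities, which gives the same argmax but avoids numerical underflow.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-class, per-feature mean and std."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    stds = np.array([X[y == c].std(axis=0, ddof=1) for c in classes])
    return classes, priors, means, stds

def predict_gaussian_nb(model, X):
    """Pick the class with the largest log prior plus summed log Normal densities."""
    classes, priors, means, stds = model
    # z has shape (samples, classes, features)
    z = (X[:, None, :] - means[None, :, :]) / stds[None, :, :]
    log_pdf = -0.5 * z ** 2 - np.log(stds[None, :, :]) - 0.5 * np.log(2.0 * np.pi)
    log_post = np.log(priors)[None, :] + log_pdf.sum(axis=2)
    return classes[np.argmax(log_post, axis=1)]

# Toy data: two well-separated classes with 3 features each (hypothetical)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 3)),
               rng.normal(5.0, 1.0, size=(100, 3))])
y = np.array([0] * 100 + [1] * 100)

model = fit_gaussian_nb(X, y)
accuracy = float(np.mean(predict_gaussian_nb(model, X) == y))
print(accuracy)
```

In the question's setting, the model would be fitted on new_train with its labels and evaluated on the held-out test rows rather than on the training data as done here.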