How to perform multi-label learning with LSTM using theano? - neural-network

I have some text data with multiple labels for each document. I want to train a LSTM network using Theano for this dataset. I came across http://deeplearning.net/tutorial/lstm.html but it only facilitates a binary classification task. If anyone has any suggestions on which method to proceed with, that will be great. I just need an initial feasible direction, I can work on.
thanks,
Amit

1) Change the last layer of the model. I.e.
pred = tensor.nnet.softmax(tensor.dot(proj, tparams['U']) + tparams['b'])
should be replaced by some other layer, e.g. sigmoid:
pred = tensor.nnet.sigmoid(tensor.dot(proj, tparams['U']) + tparams['b'])
2) The cost should also be changed.
I.e.
cost = -tensor.log(pred[tensor.arange(n_samples), y] + off).mean()
should be replaced by some other cost, e.g. cross-entropy:
one = np.float32(1.0)
pred = T.clip(pred, 0.0001, 0.9999) # don't piss off the log
cost = -T.sum(y * T.log(pred) + (one - y) * T.log(one - pred), axis=1) # Sum over all labels
cost = T.mean(cost, axis=0) # Compute mean over samples
3) In the function build_model(tparams, options), you should replace:
y = tensor.vector('y', dtype='int64')
by
y = tensor.matrix('y', dtype='int64') # Each row of y is one sample's label e.g. [1 0 0 1 0]. sklearn.preprocessing.MultiLabelBinarizer() may be handy.
4) Change pred_error() so that it supports multilabel (e.g. using some metrics like accuracy or F1 score from scikit-learn).

You can change the last layer of the model. It would have a vector of target where each element is 0 or 1, depending if you have the target or not.

Related

Problem understanding Loss function behavior using Flux.jl. in Julia

So. First of all, I am new to Neural Network (NN).
As part of my PhD, I am trying to solve some problem through NN.
For this, I have created a program that creates some data set made of
a collection of input vectors (each with 63 elements) and its corresponding
output vectors (each with 6 elements).
So, my program looks like this:
Nₜᵣ = 25; # number of inputs in the data set
xtrain, ytrain = dataset_generator(Nₜᵣ); # generates In/Out vectors: xtrain/ytrain
datatrain = zip(xtrain,ytrain); # ensamble my data
Now, both xtrain and ytrain are of type Array{Array{Float64,1},1}, meaning that
if (say)Nₜᵣ = 2, they look like:
julia> xtrain #same for ytrain
2-element Array{Array{Float64,1},1}:
[1.0, -0.062, -0.015, -1.0, 0.076, 0.19, -0.74, 0.057, 0.275, ....]
[0.39, -1.0, 0.12, -0.048, 0.476, 0.05, -0.086, 0.85, 0.292, ....]
The first 3 elements of each vector is normalized to unity (represents x,y,z coordinates), and the following 60 numbers are also normalized to unity and corresponds to some measurable attributes.
The program continues like:
layer1 = Dense(length(xtrain[1]),46,tanh); # setting 6 layers
layer2 = Dense(46,36,tanh) ;
layer3 = Dense(36,26,tanh) ;
layer4 = Dense(26,16,tanh) ;
layer5 = Dense(16,6,tanh) ;
layer6 = Dense(6,length(ytrain[1])) ;
m = Chain(layer1,layer2,layer3,layer4,layer5,layer6); # composing the layers
squaredCost(ym,y) = (1/2)*norm(y - ym).^2;
loss(x,y) = squaredCost(m(x),y); # define loss function
ps = Flux.params(m); # initializing mod.param.
opt = ADAM(0.01, (0.9, 0.8)); #
and finally:
trainmode!(m,true)
itermax = 700; # set max number of iterations
losses = [];
for iter in 1:itermax
Flux.train!(loss,ps,datatrain,opt);
push!(losses, sum(loss.(xtrain,ytrain)));
end
It runs perfectly, however, it comes to my attention that as I train my model with an increasing data set(Nₜᵣ = 10,15,25, etc...), the loss function seams to increase. See the image below:
Where: y1: Nₜᵣ=10, y2: Nₜᵣ=15, y3: Nₜᵣ=25.
So, my main question:
Why is this happening?. I can not see an explanation for this behavior. Is this somehow expected?
Remarks: Note that
All elements from the training data set (input and output) are normalized to [-1,1].
I have not tryed changing the activ. functions
I have not tryed changing the optimization method
Considerations: I need a training data set of near 10000 input vectors, and so I am expecting an even worse scenario...
Some personal thoughts:
Am I arranging my training dataset correctly?. Say, If every single data vector is made of 63 numbers, is it correctly to group them in an array? and then pile them into an ´´´Array{Array{Float64,1},1}´´´?. I have no experience using NN and flux. How can I made a data set of 10000 I/O vectors differently? Can this be the issue?. (I am very inclined to this)
Can this behavior be related to the chosen act. functions? (I am not inclined to this)
Can this behavior be related to the opt. algorithm? (I am not inclined to this)
Am I training my model wrong?. Is the iteration loop really iterations or are they epochs. I am struggling to put(differentiate) this concept of "epochs" and "iterations" into practice.
loss(x,y) = squaredCost(m(x),y); # define loss function
Your losses aren't normalized, so adding more data can only increase this cost function. However, the cost per data doesn't seem to be increasing. To get rid of this effect, you might want to use a normalized cost function by doing something like using the mean squared cost.

Combining Matern and Periodic kernels in coregionalized regression

I am trying to build a multi dimensional GP regression, initially with two outputs f1(x), f2(x).
On one output is somewhat arbitrary, hence I would like to use a Matern kernel here: f1'(x)=K_Matern f1(x). The other output f2(x) shows seasonality where the amplitude is related to the value of f1(x): f2'(x)=K_season(x) f1(x). I have been trying to compose a suitable kernel by combining the two with a Coreg kernel:
K_Matern * Coreg * K_season. As this does not seem to work I was wondering where is the mistake in my thinking.
k1 = gpflow.kernels.Matern32(1, active_dims=[0], lengthscales = 1)
k2 = gpflow.kernels.Periodic(1, active_dims=[1], lengthscales = 1)
coreg = gpflow.kernels.Coregion(1, output_dim=2, rank=1, active_dims=[1])
kern = k1 * coreg * k2
lik = gpflow.likelihoods.SwitchedLikelihood([gpflow.likelihoods.StudentT(), gpflow.likelihoods.StudentT()])
X_augmented = np.vstack((np.hstack((X1, np.zeros_like(X1))), np.hstack((X2, np.ones_like(X2)))))
Y_augmented = np.vstack((np.hstack((Y1, np.zeros_like(X1))), np.hstack((Y2, np.ones_like(X2)))))
m = gpflow.models.VGP(X_augmented, Y_augmented, kern=kern, likelihood=lik, num_latent=1)
Your k2 is operating on the same dimension of X_augmented as coreg (active_dims=[1]) - which is the column that's either 0 or 1 depending on which output it relates to, clearly not what you want! If you want to use different kernels on different outputs, you need to use the multi-output framework, there is a multi-output notebook in the GPflow documentation. Specifically, you probably want a SeparateIndependentMok([Matern32(1), Periodic(1)]). Note that in this framework you need to provide all outputs for each input, and instead of augmenting X with the index-into-output, Y has one column per output (so N x 2 in your example).

Logistic regression in Matlab, confused about the results

I am testing out logistic regression in Matlab on 2 datasets created from the audio files:
The first set is created via wavread by extracting vectors of each file: the set is 834 by 48116 matrix. Each traning example is a 48116 vector of the wav's frequencies.
The second set is created by extracting frequencies of 3 formants of the vowels, where each formant(feature) has its' frequency range (for example, F1 range is 500-1500Hz, F2 is 1500-2000Hz and so on). Each training example is a 3-vector of the wav's formants.
I am implementing the algorithm like so:
Cost function and gradient:
h = sigmoid(X*theta);
J = sum(y'*log(h) + (1-y)'*log(1-h)) * -1/m;
grad = ((h-y)'*X)/m;
theta_partial = theta;
theta_partial(1) = 0;
J = J + ((lambda/(2*m)) * (theta_partial'*theta_partial));
grad = grad + (lambda/m * theta_partial');
where X is the dataset and y is the output matrix of 8 classes.
Classifier:
initial_theta = zeros(n + 1, 1);
options = optimset('GradObj', 'on', 'MaxIter', 50);
for c = 1:num_labels,
[theta] = fmincg(#(t)(lrCostFunction(t, X, (y==c), lambda)), initial_theta, options);
all_theta(c, :) = theta';
end
where num_labels = 8, lambda(regularization) is 0.1
With the first set, MaxIter = 50, and I get ~99.8% classification accuracy.
With the second set and MaxIter=50, the accuracy is poor - 62.589928
I thought about increasing MaxIter to a larger value to improve the performance, however, even at a ridiculous amount of iterations, the result doesn't go higher than 66.546763. Changing of the regularization value (lambda) doesn't seem to influence the results in any better way.
What could be the problem? I am new to machine learning and I can't seem to catch what exactly causes this drastic difference. The only reason that obviously stands out for me is that the first set's examples are very long vectors, hence, larger amount of features, and the second set's examples are represented by short 3-vectors. Is this data not enough to classify the second set? If so, what can be done about it to achieve better classification results for the second set?

Frequency array feeds FFT

The final goal I am trying to achieve is the generation of a ten minutes time series: to achieve this I have to perform an FFT operation, and it's the point I have been stumbling upon.
Generally the aimed time series will be assigned as the sum of two terms: a steady component U(t) and a fluctuating component u'(t). That is
u(t) = U(t) + u'(t);
So generally, my code follows this procedure:
1) Given data
time = 600 [s];
Nfft = 4096;
L = 340.2 [m];
U = 10 [m/s];
df = 1/600 = 0.00167 Hz;
fn = Nfft/(2*time) = 3.4133 Hz;
This means that my frequency array should be laid out as follows:
f = (-fn+df):df:fn;
But, instead of using the whole f array, I am only making use of the positive half:
fpos = df:fn = 0.00167:3.4133 Hz;
2) Spectrum Definition
I define a certain spectrum shape, applying the following relationship
Su = (6*L*U)./((1 + 6.*fpos.*(L/U)).^(5/3));
3) Random phase generation
I, then, have to generate a set of complex samples with a determined distribution: in my case, the random phase will approach a standard Gaussian distribution (mu = 0, sigma = 1).
In MATLAB I call
nn = complex(normrnd(0,1,Nfft/2),normrnd(0,1,Nfft/2));
4) Apply random phase
To apply the random phase, I just do this
Hu = Su*nn;
At this point start my pains!
So far, I only generated Nfft/2 = 2048 complex samples accounting for the fpos content. Therefore, the content accounting for the negative half of f is still missing. To overcome this issue, I was thinking to merge the real and imaginary part of Hu, in order to get a signal Huu with Nfft = 4096 samples and with all real values.
But, by using this merging process, the 0-th frequency order would not be represented, since the imaginary part of Hu is defined for fpos.
Thus, how to account for the 0-th order by keeping a procedure as the one I have been proposing so far?

Solving a system of equations using Python/Scipy for a set of measurements

I have an physical instrument of measurement (force platform with load cells) which gives me three values, A, B and C. It happens, though, that these values - that should be orthogonal - actually are somewhat coupled, due to physical characteristics of the measuring device, which causes cross-talk between applied and returned values of force and torque.
Then, it is recommended that a calibration matrix be used to transform the measured values into a better estimate of the actual values, like this:
The problem is that it is necessary to perform a SET of measurements, so that different measured(Fz, Mx, My) and actual(Fz, Mx, My) are least-squared to get some C matrix that works best for the system as a whole.
I can solve Ax = B problems with scipy.linalg.lststq, or even scipy.linalg.solve (giving an exact solution) for ONE measurement, but how should I proceed to consider a set of different measurements, each one with its own equation giving a potentially different 3x3 matrix?
Any help is much appreciated, thanks for reading.
I posted a similar question containing just the mathematical part of this at math.stackexchange.com, and this answer solved the problem:
math.stackexchange.com/a/232124/27435
In case anyone have a similar problem in the future, here is the almost literal Scipy implementation of that answer (first lines are initialization boilerplate code):
import numpy
import scipy.linalg
### Origin of the coordinate system: upper left corner!
"""
1----------2
| |
| |
4----------3
"""
platform_width = 600
platform_height = 400
# positions of each load cell (one per corner)
loadcell_positions = numpy.array([[0, 0],
[platform_width, 0],
[platform_width, platform_height],
[0, platform_height]])
platform_origin = numpy.array([platform_width, platform_height]) * 0.5
# applying a known force at known positions and taking the measurements
measurements_per_axis = 5
total_load = 50
results = []
for x in numpy.linspace(0, platform_width, measurements_per_axis):
for y in numpy.linspace(0, platform_height, measurements_per_axis):
position = numpy.array([x,y])
for loadpos in loadcell_positions:
moments = platform_origin-loadpos * total_load
load = numpy.array([total_load])
result = numpy.hstack([load, moments])
results.append(result)
results = numpy.array(results)
noise = numpy.random.rand(*results.shape) - 0.5
measurements = results + noise
# now expand ("stuff") the 3x3 matrix to get a linearly independent 3x3 matrix
expands = []
for n in xrange(measurements.shape[0]):
k = results[n,:]
m = measurements[n,:]
expand = numpy.zeros((3,9))
expand[0,0:3] = m
expand[1,3:6] = m
expand[2,6:9] = m
expands.append(expand)
expands = numpy.vstack(expands)
# perform the actual regression
C = scipy.linalg.lstsq(expands, measurements.reshape((-1,1)))
C = numpy.array(C[0]).reshape((3,3))
# the result with pure noise (not actual coupling) should be
# very close to a 3x3 identity matrix (and is!)
print C
Hope this helps someone!