Bidirectional LSTM for Classification

I have searched for "how to implement a bidirectional LSTM network for a classification problem (say, with the iris data)" and have not found a satisfying answer. In almost every case I came across a solution where the BLSTM is implemented for a sequence prediction problem. My simple question is: how can I create a bidirectional network in pybrain? Whenever I try to build one, I am writing *
My intention was to add modules later with pybrain.addInputModule() or similar. But that fails, of course, because I am not specifying seqlen as in
n = BidirectionalNetwork(seqlen=20, inputsize=1,
hiddensize=5, symmetric=False)
What will seqlen be if I have 4 inputs and 3 outputs (as in the iris data) and 150 samples? Will it be 150? Things are not clear to me, as I have no example of a classification problem.


Ways to Improve Universal Differential Equation Training with sciml_train

About a month ago I asked a question about strategies for better convergence when training a neural differential equation. I've since gotten that example to work using the advice I was given, but when I applied the same advice to a more difficult model, I got stuck again. All of my code is in Julia, primarily making use of the DiffEqFlux library. In an effort to keep this post as brief as possible, I won't share all of my code for everything I've tried, but if anyone wants access to it to troubleshoot, I can provide it.
What I'm Trying to Do
The data I'm trying to learn comes from an SIRx model:
function SIRx!(du, u, p, t)
    β, μ, γ, a, b = Float32.([280, 1/50, 365/22, 100, 0.05])
    S, I, x = u
    du[1] = μ*(1-x) - β*S*I - μ*S
    du[2] = β*S*I - (μ+γ)*I
    du[3] = a*I - b*x
end
The initial condition I used was u0 = Float32.([0.062047128, 1.3126149f-7, 0.9486445]). I generated data from t=0 to 25, sampled every 0.02 (in training, I only use every 20th point or so for speed; using more doesn't improve results). The data looks like this: [figure: Training Data]
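For readers without a Julia setup, the data generation described above can be approximated in plain Python. This is only a minimal sketch under stated assumptions: it uses a fixed-step classical RK4 integrator rather than the adaptive solvers the original Julia code would rely on, and the helper names (rk4_step, generate_data) and the substeps parameter are made up for illustration.

```python
def sirx(u, beta=280.0, mu=1 / 50, gamma=365 / 22, a=100.0, b=0.05):
    """Right-hand side of the SIRx model, term by term as in SIRx! above."""
    S, I, x = u
    return [
        mu * (1 - x) - beta * S * I - mu * S,
        beta * S * I - (mu + gamma) * I,
        a * I - b * x,
    ]


def rk4_step(f, u, h):
    """One classical Runge-Kutta step of size h (assumed stand-in solver)."""
    k1 = f(u)
    k2 = f([ui + 0.5 * h * ki for ui, ki in zip(u, k1)])
    k3 = f([ui + 0.5 * h * ki for ui, ki in zip(u, k2)])
    k4 = f([ui + h * ki for ui, ki in zip(u, k3)])
    return [ui + h / 6 * (q1 + 2 * q2 + 2 * q3 + q4)
            for ui, q1, q2, q3, q4 in zip(u, k1, k2, k3, k4)]


def generate_data(u0, t_end=25.0, dt_save=0.02, substeps=10):
    """Integrate from t=0 to t_end and store the state every dt_save."""
    n_save = round(t_end / dt_save)
    h = dt_save / substeps
    u = list(u0)
    ts, us = [0.0], [list(u)]
    for i in range(1, n_save + 1):
        for _ in range(substeps):
            u = rk4_step(sirx, u, h)
        ts.append(i * dt_save)
        us.append(list(u))
    return ts, us


u0 = [0.062047128, 1.3126149e-7, 0.9486445]
ts, us = generate_data(u0)

# Keep every 20th sample for training, as described in the post.
train_ts, train_us = ts[::20], us[::20]
print(len(ts), len(train_ts))
```

The subsampling step is the part that matters for training cost: the full trajectory has one point per 0.02 time units, while the training set keeps one point per 0.4.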
The UDE I'm training is
function SIRx_ude!(du, u, p, t)
    μ, γ = Float32.([1/50, 365/22])
    S,I,x = u
    du[1] = μ*(1-x) - μ*S + ann_dS(u, @view p[1:lenS])[1]
    du[2] = -(μ+γ)*I + ann_dI(u, @view p[lenS+1:lenS+lenI])[1]
    du[3] = ann_dx(u, @view p[lenI+1:end])[1]
end
Each of the neural networks (ann_dS, ann_dI, ann_dx) is defined using FastChain(FastDense(3, 20, tanh), FastDense(20, 1)). I tried using a single neural network with 3 inputs and 3 outputs, but it was slower and didn't perform any better. I also tried normalizing the inputs to the network first, but that made no significant difference other than slowing things down.
What I've Tried
Single shooting
The network just fits a line through the middle of the data. This happens even when I weight the earlier data points more heavily in the loss function. [figure: Single-shot Training]
Multiple Shooting
The best result I had was with multiple shooting. As seen here, it's not simply fitting a straight line, but it's not exactly fitting the data either. [figure: Multiple Shooting Result] I've tried continuity terms ranging from 0.1 to 100 and group sizes from 3 to 30, and it doesn't make a significant difference.
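To make the two tuning knobs mentioned above concrete, here is a rough Python sketch of a multiple shooting loss on a scalar toy ODE. It is not the DiffEqFlux implementation: the function names, the one-RK4-step-per-interval solver, and the choice to let neighbouring groups share one time point are all assumptions made for illustration.

```python
import math


def rk4_solve(f, u0, ts):
    """Integrate du/dt = f(u) with one RK4 step between consecutive times."""
    us = [u0]
    for t0, t1 in zip(ts, ts[1:]):
        h, u = t1 - t0, us[-1]
        k1 = f(u)
        k2 = f(u + 0.5 * h * k1)
        k3 = f(u + 0.5 * h * k2)
        k4 = f(u + h * k3)
        us.append(u + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4))
    return us


def multiple_shooting_loss(f, ts, data, group_size, continuity_weight):
    """Data misfit per group plus a penalty on jumps between groups."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    loss, prev_end, start = 0.0, None, 0
    while start < len(ts) - 1:
        stop = min(start + group_size, len(ts))
        # Each group restarts from the observed data point at its first time.
        pred = rk4_solve(f, data[start], ts[start:stop])
        loss += sum((p - d) ** 2 for p, d in zip(pred, data[start:stop]))
        if prev_end is not None:
            # Continuity term: the previous group should end where this one starts.
            loss += continuity_weight * (prev_end - pred[0]) ** 2
        prev_end = pred[-1]
        start = stop - 1  # neighbouring groups share one time point
    return loss


# Toy check: data from du/dt = -0.5*u, scored under the true and a wrong model.
ts = [0.1 * i for i in range(21)]
data = [math.exp(-0.5 * t) for t in ts]
good = multiple_shooting_loss(lambda u: -0.5 * u, ts, data, 5, 10.0)
bad = multiple_shooting_loss(lambda u: -2.0 * u, ts, data, 5, 10.0)
print(good, bad)
```

In this form, a small group size keeps every short trajectory close to the data regardless of the model, so the continuity weight is what forces the pieces to join up into one trajectory.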
Various Other Strategies
I've also tried iteratively growing the fit, 2-stage training with collocation, and mini-batching. Iteratively growing the fit works well for the first couple of iterations, but as the length increases it goes back to fitting a straight line again. 2-stage collocation training works really well for stage 1, but it doesn't actually improve performance in the second stage (I've tried both single and multiple shooting for the second stage). Finally, mini-batching worked about as well as single shooting (which is to say, not very well) but much more quickly.
My Question
In summary, I have no idea what to try. There are so many strategies, each with so many parameters that can be tweaked. I need a way to diagnose the problem more precisely so I can better decide how to proceed. If anyone has experience with this sort of problem, I'd appreciate any advice or guidance I can get.
This isn't a great SO question because it's more exploratory. Did you lower your ODE tolerances? That would improve your gradient calculation which could help. What activation function are you using? I would use something like softplus instead of tanh so that you don't have the saturating behavior. Did you scale the eigenvalues and take into account the issues explored in the stiff neural ODE paper? Larger neural networks? Different learning rates? ADAM? Etc.
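The point about saturation can be checked numerically. The sketch below is in Python rather than Julia and simply compares the derivatives of the two activations; the function names are made up for illustration.

```python
import math


def tanh_grad(z):
    """Derivative of tanh(z)."""
    return 1.0 - math.tanh(z) ** 2


def softplus(z):
    """Numerically stable form of log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def softplus_grad(z):
    """Derivative of softplus(z), which is the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  tanh'={tanh_grad(z):.2e}  softplus'={softplus_grad(z):.4f}")
```

For large positive inputs the tanh derivative collapses towards zero while the softplus derivative approaches one. Note that softplus still flattens out for large negative inputs, so it removes the saturation on one side only.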
This is much better suited for a forum for discussion like the JuliaLang Discourse. We can continue there since walking through this will not be fruitful without some back and forth.

Caffe: How to train end-to-end (image to image)?

We are quite new to caffe, but what we have seen so far, looks really promising.
After reading a few papers (1, 2), we wanted to reproduce the result of 1, specifically for a segmentation challenge (4).
We downloaded the modified caffe from 3 and were able to run it, only to find that the trained network didn't work with the dataset from 4.
At first we thought that the network needed to be trained for the specific problem.
This led to the problem of how to do 'image-to-image' (aka end-to-end) learning (4, training data).
This in turn led us to 'holistically nested edge detection' (hed, 2), where image-to-image learning seems to be used.
With hed, we were able to retrain the network on our own. But it doesn't work (it produces images that are all 0 or 0.5, i.e. black images :-( ) if we try to train the network for the dataset of 4. For initialization we wrote a script to calculate the mean-map, which we use for the dataset of 4.
Our questions are:
- How can we reproduce the result mentioned in 1 by running image-to-image training?
- How do you train networks where we have image-to-image learning?
- Since we only have 30 image-to-image pairs, should we implement deformation as mentioned in 1/3 via matlab/python, or is there a functionality within caffe already?
- Are we missing something simple from 1 or 2?
Kind regards,
Klaus and Bernhard
Ps: We asked the same question at the caffe-user group and intend to post solutions at both locations.
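On the question of augmenting a small set of image/label pairs in Python: whatever transformation is used, it has to be applied identically to the raw image and to its label map. The sketch below shows that idea with random flips and 90-degree rotations on nested lists. It is a simpler stand-in, not the deformation described in reference 1, and the function name is made up for illustration.

```python
import random


def augment_pair(image, label, rng):
    """Apply one random flip/rotation to an image and its label map together."""
    if rng.random() < 0.5:
        # Horizontal flip of both arrays.
        image = [row[::-1] for row in image]
        label = [row[::-1] for row in label]
    for _ in range(rng.randrange(4)):
        # Rotate both arrays by 90 degrees clockwise, 0 to 3 times.
        image = [list(r) for r in zip(*image[::-1])]
        label = [list(r) for r in zip(*label[::-1])]
    return image, label


rng = random.Random(0)
image = [[1, 2, 3], [4, 5, 6]]
label = [[10, 20, 30], [40, 50, 60]]  # hypothetical label map, aligned with image
aug_image, aug_label = augment_pair(image, label, rng)
print(aug_image)
print(aug_label)
```

Because both arrays go through the same sequence of operations, every pixel keeps its label after augmentation.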
After some time, and after trying several different things, I stumbled upon:
Using that caffe fork, with caffe_neural_models and caffe_neural_tool, training image (raw) to image (labels) can be done quite simply.
Just check out 'caffe_neural_models/net*' for different configurations.

Weka classification; cross-validation across predefined topic

I am using Weka to classify a dataset. Each data point is in one of five topics that I am trying to generalize across.
I would like to make each topic a test set so that I can train on topics 1-4 and test on topic 5, then train on topics 1, 3, 4 and 5, and test on 2, and so on.
Is there a way to direct Weka to perform this automatically, in one run with one dataset? That is, can I direct Weka to cross-validate by topic?
I apologize for redundancy if this question has already been asked. If it indeed has, any help in directing me towards the answer would be most appreciated.
There are a few ways that I can think of that may assist in getting the results that you desire:
As you have outlined in your question, you could generate 5 different training sets, each with the remaining topic as the testing set. Each model would need to be trained individually if you were going to use the Weka interface (supply the training data, then build a classifier and supply a testing set; repeat). This would likely be quickest if it's a one-off.
You may be able to use the FilteredClassifier with the filter of RemoveWithValues. This may be able to remove the training cases of a particular topic if the topic number is an available attribute (I am guessing that this data is not part of the model's data though, so attribute filtering may also be required if using this approach).
If you are willing to use Java to program a solution, you would be able to manipulate the data and build each of the five classifiers in one go. I am thinking that the algorithm for such a model would be as outlined below. If you plan to undertake this process a lot, it may be the better solution.
for each topic t
    training_data = all cases not belonging to topic t
    testing_data = all cases belonging to topic t
    build classifier using training_data
    evaluate classifier on testing_data
    save classifier
end for
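The splitting step of the loop above can be sketched in a few lines. This is a Python illustration of the logic only, not Weka's Java API; the function name, the row layout and the toy data are assumptions.

```python
def leave_one_topic_out(rows, topic_of):
    """Yield (held_out_topic, train_rows, test_rows), one fold per topic."""
    topics = sorted({topic_of(r) for r in rows})
    for t in topics:
        train = [r for r in rows if topic_of(r) != t]
        test = [r for r in rows if topic_of(r) == t]
        yield t, train, test


# Hypothetical toy data: (topic, feature, label) tuples, three rows per topic.
rows = [(topic, i, i % 2) for topic in range(1, 6) for i in range(3)]

for t, train, test in leave_one_topic_out(rows, topic_of=lambda r: r[0]):
    # A classifier would be built on `train` and evaluated on `test` here.
    print(f"test on topic {t}: {len(train)} train rows, {len(test)} test rows")
```

Each fold trains on four topics and tests on the fifth, which is the same as 5-fold cross-validation with the folds fixed by topic instead of drawn at random.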

How to use KNN to classify data in MATLAB?

I'm having problems understanding how K-NN classification works in MATLAB.
Here's the problem, I have a large dataset (65 features for over 1500 subjects) and its respective classes' label (0 or 1).
According to what's been explained to me, I have to divide the data into training, test and validation subsets to perform supervised training on the data, and classify it via K-NN.
First of all, what's the best ratio to divide the 3 subgroups (1/3 of the size of the dataset each?).
I've looked into the ClassificationKNN/fitcknn functions, as well as the crossval function (ideally to divide the data), but I'm really not sure how to use them.
To sum up, I wanted to
- divide data into 3 groups
- "train" the KNN (I know it's not a method that requires training, but the equivalent to training) with the training subset
- classify the test subset and get its classification error/performance
- what's the point of having a validation test?
I hope you can help me, thank you in advance
EDIT: I think I was able to do it, but, if that's not asking too much, could you see if I missed something? This is my code, for a random case:
kmax = 100; % for instance...
[trainInd,valInd,testInd] = dividerand(1000,trainRatio,valRatio,testRatio);
for know = 1:kmax
    % is it the same thing to use knnclassify or fitcknn+predict??
    predicted_class = knnclassify(val', train', train_class', know);
    mdl = fitcknn(train', train_class', 'NumNeighbors', know);
    label = predict(mdl, val');
    % consistency and precisionmax are computed in code not shown here
    if consistency > precisionmax
        mdl_final = fitcknn(train', train_class', 'NumNeighbors', know);
        label_final = predict(mdl_final, test');
    end
end
Thank you very much for all your help
For your 1st question "what's the best ratio to divide the 3 subgroups" there are only rules of thumb:
- The amount of training data is most important: the more the better. Thus, make the training set as big as possible and definitely bigger than the test or validation set.
- Test and validation data have a similar function, so it is convenient to assign them the same amount of data. But it is important to have enough data in each to be able to recognize over-adaptation (overfitting). So they should be picked fully at random from the whole data set.
Consequently, a 50/25/25 or 60/20/20 partitioning is quite common. But if your total amount of data is small in relation to the total number of weights of your chosen topology (e.g. 10 weights in your net and only 200 cases in the data), then 70/15/15 or even 80/10/10 might be better choices.
Concerning your 2nd question "what's the point of having a validation test?":
Typically, you train the chosen model on your training data and then estimate the "success" by applying the trained model to unseen data - the validation set.
If you stopped all efforts to improve accuracy at this point, you would indeed not need three partitions of your data. But typically you feel that you can improve the success of your model by, e.g., changing the number of weights or hidden layers or ..., and now a big loop starts to run with many iterations:
1) change weights and topology, 2) train, 3) validate, not satisfied, goto 1)
The long-term effect of this loop is that you increasingly adapt your model to the validation data, so the results get better not because you so intelligently improve your topology but because you unconsciously learn the properties of the validation set and how to cope with them.
Now, the final and only valid accuracy of your neural net is estimated on really unseen data: the test set. This is done only once and is also useful to reveal over-adaptation. You must not start a second, even bigger loop at this point, so that no adaptation to the test set can take place!
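Applied to the K-NN case from the question, the procedure described above might look like the following. This is a minimal Python sketch rather than MATLAB, with a hand-written K-NN, a 60/20/20 split and made-up toy data; all names in it are assumptions for illustration.

```python
import random
from collections import Counter


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def accuracy(train_x, train_y, xs, ys, k):
    """Fraction of points in (xs, ys) that K-NN classifies correctly."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(xs, ys))
    return hits / len(xs)


# Hypothetical toy data: two Gaussian blobs labelled 0 and 1.
rng = random.Random(0)
data = [([rng.gauss(3 * c, 1), rng.gauss(3 * c, 1)], c)
        for c in (0, 1) for _ in range(100)]
rng.shuffle(data)
xs, ys = [d[0] for d in data], [d[1] for d in data]

# 60/20/20 split into training, validation and test subsets.
tr_x, tr_y = xs[:120], ys[:120]
va_x, va_y = xs[120:160], ys[120:160]
te_x, te_y = xs[160:], ys[160:]

# Choose k using the validation set only.
best_k = max(range(1, 16, 2), key=lambda k: accuracy(tr_x, tr_y, va_x, va_y, k))

# Touch the test set exactly once, with the chosen k.
test_acc = accuracy(tr_x, tr_y, te_x, te_y, best_k)
print(best_k, test_acc)
```

The loop over k plays the role of the "change, train, validate" cycle, and the single call on the test subset at the end gives the accuracy to report.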

Depth of Artificial Neural Networks

According to this answer, one should never use more than two hidden layers of Neurons.
According to this answer, a middle layer should contain at most twice the amount of input or output neurons (so if you have 5 input neurons and 10 output neurons, one should use (at most) 20 middle neurons per layer).
Does that mean that all data will be modeled within that amount of Neurons?
So if, for example, one wants to do anything from modeling weather (a million input nodes from data from different weather stations) to simple OCR (of scanned text with a resolution of 1000x1000DPI) one would need the same amount of nodes?
My last question was closed. Is there another SE site where these kinds of questions are on topic?
You will likely overfit your data (aka high variance). Think of it like this: the more neurons and layers you have, the more parameters you have available to fit your data.
Remember that for a first-layer node the equation is Z = sigmoid(sum(W*x)).
For a second-layer node it becomes Z2 = sigmoid(sum(W*Z)).
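The two layer equations can be written out directly, which also makes it easy to count how many parameters each extra node or layer adds. This is a small Python sketch with made-up weights and no bias terms; the layer sizes are arbitrary.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(weights, inputs):
    """One output per weight row: sigmoid of the weighted sum of the inputs."""
    return [sigmoid(sum(w * v for w, v in zip(row, inputs))) for row in weights]


# Hypothetical weights: 3 inputs -> 2 hidden nodes -> 1 output node.
W1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W2 = [[1.0, -1.0]]
x = [1.0, 2.0, 3.0]

Z = layer(W1, x)   # first layer:  Z  = sigmoid(sum(W * x))
Z2 = layer(W2, Z)  # second layer: Z2 = sigmoid(sum(W * Z))

# Every weight is a free parameter that can be used to fit the data.
n_params = sum(len(row) for row in W1 + W2)
print(Z, Z2, n_params)
```

Adding one more hidden node here would add four weights (three incoming, one outgoing), which is the sense in which more neurons and layers mean more parameters.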
Look into the machine learning class taught at Stanford; it's a great online course and a good tool to use as a reference.
More than two hidden layers can be useful in certain architectures
such as cascade correlation (Fahlman and Lebiere 1990) and in special
applications, such as the two-spirals problem (Lang and Witbrock 1988)
and ZIP code recognition (Le Cun et al. 1989).
Fahlman, S.E. and Lebiere, C. (1990), "The Cascade Correlation
Learning Architecture," NIPS2, 524-532.
Le Cun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E.,
Hubbard, W., and Jackel, L.D. (1989), "Backpropagation applied to
handwritten ZIP code recognition", Neural Computation, 1, 541-551.
Check out the sections "How many hidden layers should I use?" and "How many hidden units should I use?" in the FAQ for more information.