Ways to Improve Universal Differential Equation Training with sciml_train

Ways to Improve Universal Differential Equation Training with sciml_train - neural-network

About a month ago I asked a question about strategies for better convergence when training a neural differential equation. I've since gotten that example to work using the advice I was given, but when I applied what the same advice to a more difficult model, I got stuck again. All of my code is in Julia, primarily making use of the DiffEqFlux library. In effort to keep this post as brief as possible, I won't share all of my code for everything I've tried, but if anyone wants access to it to troubleshoot I can provide it.
What I'm Trying to Do
The data I'm trying to learn comes from an SIRx model:
function SIRx!(du, u, p, t)
β, μ, γ, a, b = Float32.([280, 1/50, 365/22, 100, 0.05])
S, I, x = u
du[1] = μ*(1-x) - β*S*I - μ*S
du[2] = β*S*I - (μ+γ)*I
du[3] = a*I - b*x
nothing
end;
The initial condition I used was u0 = Float32.([0.062047128, 1.3126149f-7, 0.9486445]);. I generated data from t=0 to 25, sampled every 0.02 (in training, I only use every 20 points or so for speed, and using more doesn't improve results). The data looks like this: Training Data
The UDE I'm training is
function SIRx_ude!(du, u, p, t)
μ, γ = Float32.([1/50, 365/22])
S,I,x = u
du[1] = μ*(1-x) - μ*S + ann_dS(u, #view p[1:lenS])[1]
du[2] = -(μ+γ)*I + ann_dI(u, #view p[lenS+1:lenS+lenI])[1]
du[3] = ann_dx(u, #view p[lenI+1:end])[1]
nothing
end;
Each of the neural networks (ann_dS, ann_dI, ann_dx) are defined using FastChain(FastDense(3, 20, tanh), FastDense(20, 1)). I tried using a single neural network with 3 inputs and 3 outputs, but it was slower and didn't perform any better. I also tried normalizing inputs to the network first, but it doesn't make a significant difference outside of slowing things down.
What I've Tried
Single shooting
The network just fits a line through the middle of the data. This happens even when I weight the earlier datapoints more in the loss function. Single-shot Training
Multiple Shooting
The best result I had was with multiple shooting. As seen here, it's not simply fitting a straight line, but it's not exactly fitting the data eitherMultiple Shooting Result. I've tried continuity terms ranging from 0.1 to 100 and group sizes from 3 to 30 and it doesn't make a significant difference.
Various Other Strategies
I've also tried iteratively growing the fit, 2-stage training with a collocation, and mini-batching as outlined here: https://diffeqflux.sciml.ai/dev/examples/local_minima, https://diffeqflux.sciml.ai/dev/examples/collocation/, https://diffeqflux.sciml.ai/dev/examples/minibatch/. Iteratively growing the fit works well the first couple of iterations, but as the length increases it goes back to fitting a straight line again. 2-stage collocation training works really well for stage 1, but it doesn't actually improve performance on the second stage (I've tried both single and multiple shooting for the second stage). Finally, mini-batching worked about as well as single-shooting (which is to say not very well) but much more quickly.
My Question
In summary, I have no idea what to try. There are so many strategies, each with so many parameters that can be tweaked. I need a way to diagnose the problem more precisely so I can better decide how to proceed. If anyone has experience with this sort of problem, I'd appreciate any advice or guidance I can get.

This isn't a great SO question because it's more exploratory. Did you lower your ODE tolerances? That would improve your gradient calculation which could help. What activation function are you using? I would use something like softplus instead of tanh so that you don't have the saturating behavior. Did you scale the eigenvalues and take into account the issues explored in the stiff neural ODE paper? Larger neural networks? Different learning rates? ADAM? Etc.
This is much better suited for a forum for discussion like the JuliaLang Discourse. We can continue there since walking through this will not be fruitful without some back and forth.

Related

how to fit a complex model to complex data

My aim is to fit gaussian-hermite-polynoms to complex measurement data (consisting out of absolute and an phase part). There are seven independent parameters (p) to generate such an gaussian-hermite-mode. I implement everthing in Matlab, so calculating such a mode is no problem. Problem is the fitting operation. At the moment, I implement two versions. First one is with fminsearch, the second one with lsqnonlin:
fun=#(p) CalcCoeff(c1,...
p(1)*CalcGaussHermite(...
CalcCoor([p(2),p(3),alphaZ],[p(4),p(5)],[centerX,centerY],labcoor),...
[l,m,lambda,p(6),p(7)]));
p=[scale,alphaX,alphaY,z_fX,z_fY,w_0X,w_0Y];
%optimset('Display','iter','PlotFcns',#optimplotfval);
% [fpar,fval,exitflag,output] = fminsearch(fun,p,options);
options = optimoptions('lsqnonlin','Display','iter');
res=lsqnonlin(fun,p,[],[],options);
The function CalcCoeff calculates the difference in case of lsqnonlin, in case of fminsearch it calculates the overlap-integral (dot product).
As a test, I calculate a simple gaussian-mode and tried to reconstruct the parameters used. But my algorithm failed in both cases. It operates without any error message, but the parameters can't be reconstructed. So the question I ask myself is: are there too many parameters for the optimization, so the algorithm isn't able to converge? Or is it even a global problem?
I would be very pleased, if any optimization specialist would give me any hint what might go wrong.
If I asked my question in a wrong way just let me know.
Regards
Andre

What is the ideal value of loss function for a GAN

GAN originally proposed by IJ Goodfellow uses following loss function,
D_loss = - log[D(X)] - log[1 - D(G(Z))]
G_loss = - log[D(G(Z))]
So, discriminator tries to minimize D_loss and generator tries to minimize G_loss, where X and Z are training input and noise input respectively. D(.) and G(.) are map for discriminator and generator neural networks respectively.
As original paper says, when GAN is trained for several steps it reaches at a point where neither generator nor discriminator can improve and D(Y) is 0.5 everywhere, Y is some input to the discriminator. In this case, when GAN is sufficiently trained to this point,
D_loss = - log(0.5) - log(1 - 0.5) = 0.693 + 0.693 = 1.386
G_loss = - log(0.5) = 0.693
So, why can we not use D_loss and G_loss values as a metric for evaluating GAN?
If two loss functions deviate away from these ideal values then GAN surely needs to be trained well or architecture needs to designed well. As theorem 1 in the original paper discusses that these are the optimal values for the D_loss and G_loss but then why can't these be used as evaluation metric?

I think this question belongs on Cross-Validated, but anyway :
I struggled with this for quite some time, and wondered why the question wasn't asked.
What follows is where I'm currently at. Not sure if it'll help you, but it is some of my intuition.
G and D losses are good indicators of failure cases...
Of course, if G loss is a really big number and D is zero, then nothing good is happening in your GAN.
... but not good indicators of performance.
I've trained a bunch of GANs and have almost never seen the "0.5/0.5 case" except on very simple examples. Most of the time, you're happy when outputs D(x) and D(G(z)) (and therefore, the losses) are more or less stable. So don't take these values for "gold standard".
A key intuition I was missing was in simultaneousity of G and D training. At the beginning, sure G is really bad at generating stuff, but D is also really bad at discriminating them. As time passes, G gets better, but D also gets better. So after many epochs, we can think that D is really good at discriminating between fake and real. Therefore, even if G "fools" D only 5% of the time (i.e. D(x)=0.95 and D(G(z))=0.05) then it can mean that G is actually pretty good because it fools sometimes a really good discriminator.
As you know, there are not reliable metrics of image quality besides looking at it for the moment, but I've found that for my usecases, G could produce great images while fooling D only a few % of the time.
A corrolary to this simultaneous training is what's happening at the beginning of the training : You can have D(X)=0.5 and D(G(Z))=0.5, and still have G produce almost random images : it's just that D is not good enough yet to tell them apart from real images.
I see it's been a couple months since you've posted this question. If you've gained intuition in the meantime, I'd be happy to hear it !

Neural Network playing Tic Tac Toe doesn't learn

I have a neural network playing tic-tac-toe. (I know there are other better methods for this, but I want to learn about NN)
So the NN plays against a random AI. First, it should learn to make an allowed move, ie. not choosing a field that is already occupied.
It doesn't get very far with this, however.
When NN chooses an illegal move I optimize the weights such that the distance to another, randomly chosen (legal) field is minimized. (There is one output which should have values between 1 and 9).
My problem is: in changing the weights, a formerly optimized outcome is now also changed. So I have this kind of overfitting: Everytime I backpropagade to optimize the weights for one particular situation, the decision for every other situation becomes worse!
I know I should probably have 9 output neurons instead of 1 and should probably not use a random field as the target, as I assume this can mess things up. I am starting to change this.
Still, the issue seems to remain. Obviously. How can I improve the decision in one situation without forgetting every other situation?
One solution I came up with is to "remember" every game played and optimizing simultaneously over all games played.
However, after a while this becomes very demanding on the computation. Also, it seems to go into the direction of a complete enumartion of all possible board situations. This might be possible for Tic Tac Toe but if I move to another game, say Go, this becomes infeasible.
Where is my mistake? How do I generally tackle this problem? Or where could I read about it? Thanks a lot!

To tackle this problem efficiently, you sould consider Reinforcement Learning methods, instead of what you are currently doing. What your are trying to do is to learn the behaviour of an agent playing Tic Tac Toe. The agent gets a high reward when he wins a game, a high penalty when he loses and an even higher penalty when he performs an illegal move. My guess is that using methods such as Q-learning with neural networks will work perfectly, even with very simple neural nets. One useful paper on the topic could be: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf, or earlier papers on TD-Gammon (I think you can easily find tutorials on the topic using the keywords TD-Gammon, Q-learning, ...).
By the way, a more down-to-earth answer to why your model might not work is that you are seemingly using one single unit to represent categorical outputs: if you want to represent an integer between 1 and N, you should represent it using N output neurons with values between 0 and 1, and pick the neuron with the highest value as your answer. Using a single neuron with value between 1 and 9 creates an unatural assymetry between your outputs, and, for example, when the expected value is 3, your network gets a higher error for outputing a 9 than a 2. This should obviously not be the case: all wrong answers are equally wrong.
Hope this helps,
Best

Newbie to Neural Networks

Just starting to play around with Neural Networks for fun after playing with some basic linear regression. I am an English teacher so don't have a math background and trying to read a book on this stuff is way over my head. I thought this would be a better avenue to get some basic questions answered (even though I suspect there is no easy answer). Just looking for some general guidance put in layman's terms. I am using a trial version of an Excel Add-In called NEURO XL. I apologize if these questions are too "elementary."
My first project is related to predicting a student's Verbal score on the SAT based on a number of test scores, GPA, practice exam scores, etc. as well as some qualitative data (gender: M=1, F=0; took SAT prep class: Y=1, N=0; plays varsity sports: Y=1, N=0).
In total, I have 21 variables that I would like to feed into the network, with the output being the actual score (200-800).
I have 9000 records of data spanning many years/students. Here are my questions:
How many records of the 9000 should I use to train the network?
1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
If I split the data into an even number, say 9x1000 (or however many) and created a network for each one, then tested the results of each of these 9 on the other 8 sets to see which had the lowest MSE across the samples, would this be a valid way to "choose" the best network if I wanted to predict the scores for my incoming students (not included in this data at all)?
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
E.g.
750-800 = 10
700-740 = 9
etc.
Is there any benefit to doing this or should I just go ahead and try to predict the exact score?
What if ALL I cared about was whether or not the score was above or below 600. Would I then just make the output 0(below 600) or 1(above 600)?
5a. I read somewhere that it's not good to use 0 and 1, but instead 0.1 and 0.9 - why is that?
5b. What about -1(below 600), 0(exactly 600), 1(above 600), would this work?
5c. Would the network always output -1, 0, 1 - or would it output fractions that I would then have to roundup or rounddown to finalize the prediction?
Once I have found the "best" network from Question #3, would I then play around with the different parameters (number of epochs, number of neurons in hidden layer, momentum, learning rate, etc.) to optimize this further?
6a. What about the Activation Function? Will Log-sigmoid do the trick or should I try the other options my software has as well (threshold, hyperbolic tangent, zero-based log-sigmoid).
6b. What is the difference between log-sigmoid and zero-based log-sigmoid?
Thanks!

First a little bit of meta content about the question itself (and not about the answers to your questions).
I have to laugh a little that you say 'I apologize if these questions are too "elementary."' and then proceed to ask the single most thorough and well thought out question I've seen as someone's first post on SO.
I wouldn't be too worried that you'll have people looking down their noses at you for asking this stuff.
This is a pretty big question in terms of the depth and range of knowledge required, especially the statistical knowledge needed and familiarity with Neural Networks.
You may want to try breaking this up into several questions distributed across the different StackExchange sites.
Off the top of my head, some of it definitely belongs on the statistics StackExchange, Cross Validated: https://stats.stackexchange.com/
You might also want to try out https://datascience.stackexchange.com/ , a beta site specifically targeting machine learning and related areas.
That said, there is some of this that I think I can help to answer.
Anything I haven't answered is something I don't feel qualified to help you with.
Question 1
How many records of the 9000 should I use to train the network? 1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
Randomizing the selection of training data is probably not a good idea.
Keep in mind that truly random data includes clusters.
A random selection of students could happen to consist solely of those who scored above a 30 on the ACT exams, which could potentially result in a bias in your result.
Likewise, if you only select students whose SAT scores were below 700, the classifier you build won't have any capacity to distinguish between a student expected to score 720 and a student expected to score 780 -- they'll look the same to the classifier because it was trained without the relevant information.
You want to ensure a representative sample of your different inputs and your different outputs.
Because you're dealing with input variables that may be correlated, you shouldn't try to do anything too complex in selecting this data, or you could mistakenly introduce another bias in your inputs.
Namely, you don't want to select a training data set that consists largely of outliers.
I would recommend trying to ensure that your inputs cover all possible values for all of the variables you are observing, and all possible results for the output (the SAT scores), without constraining how these requirements are satisfied.
I'm sure there are algorithms out there designed to do exactly this, but I don't know them myself -- possibly a good question in and of itself for Cross Validated.
Question 3
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
My understanding is that this is not recommended as the input to a Nerual Network, but I may be wrong.
The convergence of the network should handle this for you.
Every node in the network will assign a weight to its inputs, multiply them by their weights, and sum those products as a core part of its computation.
That means that every node in the network is searching for some coefficients for each of their inputs.
To do this, all inputs will be converted to numeric values -- so conditions like gender will be translated into "0=MALE,1=FEMALE" or something similar.
For example, a node's metric might look like this at a given point in time:
2*ACT_SCORE + 0*GENDER + (-5)*VARISTY_SPORTS ...
The coefficients for each values are exactly what the network is searching for as it converges.
If you change the scale of a value, like ACT_SCORE, you just change the scale of the coefficient that will be found by the reciporical of that scaling factor.
The result should still be the same.
There are other concerns in terms of accuracy (computers have limited capacity to represent small fractions) and speed that may enter this, but not being familiar with NEURO XL, I can't say whether or not they apply for this technology.
Question 4
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
This will reduce accuracy, although you should converge to a solution much faster with fewer possible outputs (scores).
Neural Networks actually describe very high-dimensional functions in their input variables.
If you reduce the granularity of that function's output space, you essentially state that you don't care about local minima and maxima in that function, especially around the borders between your output scores.
As a result, you are sacrificing information that may be an essential component of the "true" function that you are searching for.
I hope this has been helpful, but you really should break this question down into its many components and ask them separately on different sites -- potentially some of them do belong here on StackOverflow as well.

How to use KNN to classify data in MATLAB?

I'm having problems in understanding how K-NN classification works in MATLAB.´
Here's the problem, I have a large dataset (65 features for over 1500 subjects) and its respective classes' label (0 or 1).
According to what's been explained to me, I have to divide the data into training, test and validation subsets to perform supervised training on the data, and classify it via K-NN.
First of all, what's the best ratio to divide the 3 subgroups (1/3 of the size of the dataset each?).
I've looked into ClassificationKNN/fitcknn functions, as well as the crossval function (idealy to divide data), but I'm really not sure how to use them.
To sum up, I wanted to
- divide data into 3 groups
- "train" the KNN (I know it's not a method that requires training, but the equivalent to training) with the training subset
- classify the test subset and get it's classification error/performance
- what's the point of having a validation test?
I hope you can help me, thank you in advance
EDIT: I think I was able to do it, but, if that's not asking too much, could you see if I missed something? This is my code, for a random case:
nfeats=60;ninds=1000;
trainRatio=0.8;valRatio=.1;testRatio=.1;
kmax=100; %for instance...
data=randi(100,nfeats,ninds);
class=randi(2,1,ninds);
[trainInd,valInd,testInd] = dividerand(1000,trainRatio,valRatio,testRatio);
train=data(:,trainInd);
test=data(:,testInd);
val=data(:,valInd);
train_class=class(:,trainInd);
test_class=class(:,testInd);
val_class=class(:,valInd);
precisionmax=0;
koptimal=0;
for know=1:kmax
%is it the same thing use knnclassify or fitcknn+predict??
predicted_class = knnclassify(val', train', train_class',know);
mdl = fitcknn(train',train_class','NumNeighbors',know) ;
label = predict(mdl,val');
consistency=sum(label==val_class')/length(val_class);
if consistency>precisionmax
precisionmax=consistency;
koptimal=know;
end
end
mdl_final = fitcknn(train',train_class','NumNeighbors',know) ;
label_final = predict(mdl,test');
consistency_final=sum(label==test_class')/length(test_class);
Thank you very much for all your help

For your 1st question "what's the best ratio to divide the 3 subgroups" there are only rules of thumb:
The amount of training data is most important. The more the better.
Thus, make it as big as possible and definitely bigger than the test or validation data.
Test and validation data have a similar function, so it is convenient to assign them the same amount
of data. But it is important to have enough data to be able to recognize over-adaptation. So, they
should be picked from the data basis fully randomly.
Consequently, a 50/25/25 or 60/20/20 partitioning is quite common. But if your total amount of data is small in relation to the total number of weights of your chosen topology (e.g. 10 weights in your net and only 200 cases in the data), then 70/15/15 or even 80/10/10 might be better choices.
Concerning your 2nd question "what's the point of having a validation test?":
Typically, you train the chosen model on your training data and then estimate the "success" by applying the trained model to unseen data - the validation set.
If you now would completely stop your efforts to improve accuracy, you indeed don't need three partitions of your data. But typically, you feel that you can improve the success of your model by e.g. changing the number of weights or hidden layers or ... and now a big loops starts to run with many iterations:
1) change weights and topology, 2) train, 3) validate, not satisfied, goto 1)
The long-term effect of this loop is, that you increasingly adapt your model to the validation data, so the results get better not because you so intelligently improve your topology but because you unconsciously learn the properties of the validation set and how to cope with them.
Now, the final and only valid accuracy of your neural net is estimated on really unseen data: the test set. This is done only once and is also useful to reveal over-adaption. You are not allowed to start a second even bigger loop now to prohibit any adaption to the test set!

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse