I am optimising MRI experiment settings in order to get the best possible precision of tissue property measurements from my data.
To optimise the settings (i.e. a vector of numbers) I am using the MATLAB genetic algorithm function (ga()). I want to compare the final results of optimisation experiments that have different parameter upper bounds, but I do not know how to choose the FunctionTolerance.
My current implementation takes several days. I would like to increase FunctionTolerance so that it does not take as long, yet still allows me to make reliable comparisons of the final results of the two optimisation experiments. In other words, I do not want to stop the optimisation too early: I want to stop it when it gets close to its best result, but not wait a long time for it to refine that result.
Is there a general rule for choosing FunctionTolerance or does it depend on what is being optimised?
I want to run a real-time simulation with a fixed step-size solver in Dymola. With different step sizes the results differ slightly, so is there a standard procedure for choosing the step size? Or do I have to run a lot of calculations to demonstrate step-size independence, just as grid independence has to be demonstrated in CFD?
I don't know if there is a standard procedure, but proving numerical stability is not straightforward when solving nonlinear/hybrid models numerically. Therefore I would go with a not strictly mathematical procedure. As it seems you are free to choose the step-size, I would do the following.
Option 1 (with at least a little mathematical background):
Linearize the model using "Tools -> Linear Analysis -> Poles".
The result is a plot containing the eigenvalues and a table in the "Commands" window. The latter should contain a column freq. [Hz]. (Additional information can be generated by running a "Full Linear Analysis".)
Take the highest frequency from the table and derive the necessary step-size from it, given the solver's properties (e.g. its stability region)
For Forward Euler it would make sense to use StepSize = 1/max(freq) * 1/10
For other solvers the relation can be very different, but for most explicit solvers this should be a good starting point (a small sketch follows this list)
Note: Probably other functions of the "Linear Analysis" contain useful information as well, so it is worth a try to run them.
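To make the rule above concrete, here is a minimal sketch (the 1 kHz pole frequency is only an assumed example, not a value from your model):

```python
# Minimal sketch of the rule above. The 1 kHz pole frequency is an assumed
# example, not a value taken from your model.
max_freq_hz = 1000.0                       # fastest pole frequency from the linear analysis
step_size = 1.0 / (10.0 * max_freq_hz)     # StepSize = 1/max(freq) * 1/10
print(f"suggested fixed step size: {step_size:.1e} s")   # -> 1.0e-04 s
```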
The problem with the above method is that the poles of a non-LTI system can depend on the inputs/states of the model. Therefore it can go wrong, as the result depends on the state of the system or on the time of linearization.
Option 2 (just go by trial and error):
Given you have a rough idea of what the step-size should be, you can do this:
Pick a solver and select a rather small step-size. This should provide a good result but slow simulation (e.g. 100ns in your case).
Then increase the step-size, e.g. by a factor of 10, until the difference reaches a level you consider too big to continue (a self-contained sweep sketch follows this list).
Then reduce the changes in step-size to find a sweet-spot for the trade-off between performance and precision.
Note: The above steps could also be flipped, by starting with a big step-size and reducing it until the results match well enough.
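Here is a self-contained sketch of such a sweep. The model is a deliberately trivial stand-in (a single first-order pole at roughly 1 kHz integrated with Forward Euler) so the script runs on its own; in practice run_simulation() would wrap your Dymola fixed-step simulation and the reference would come from a fine-step or variable-step run:

```python
import numpy as np

# Toy stand-in for the real model, for illustration only: dx/dt = -lam*x, x(0) = 1,
# integrated with Forward Euler. Your Dymola simulation would replace run_simulation().
LAM, T_END = 2.0 * np.pi * 1000.0, 1e-3     # assumed fastest pole ~1 kHz, 1 ms horizon

def run_simulation(step):
    x, n = 1.0, round(T_END / step)
    for _ in range(n):
        x += step * (-LAM * x)              # Forward Euler update
    return x

reference = np.exp(-LAM * T_END)            # exact final value of the toy model
for step in [1e-7, 1e-6, 1e-5, 1e-4]:       # grow the step by factors of 10
    err = abs(run_simulation(step) - reference)
    print(f"step = {step:g} s   error at t_end = {err:.2e}")
```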
Validation/Finetuning
To check that the result of either of the two above options is not totally off, it would make sense to do the following:
Create a reference result with a proven well-working solver (in Dymola I would use DASSL with a reasonable relative tolerance).
Double-check the reference result with a second solver, ideally something rather different (in Dymola this could be Radau, CVode is similar to DASSL)
Compare the results of the reference solver with your fixed-step solver and check if you are fine with the difference (a small comparison helper is sketched after this list).
If the results are similar enough, you can try to increase the step-size to a point where the difference gets too big (finetuning)
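For the comparison step, the fixed-step and reference results will usually be sampled at different time points, so interpolate one onto the other's grid before measuring the deviation. A small helper sketch (the trajectory arrays are assumed inputs, e.g. exported from Dymola result files):

```python
import numpy as np

# t_ref/x_ref: reference trajectory exported from a variable-step run (e.g. DASSL);
# t_fix/x_fix: trajectory from your fixed-step run. All four arrays are assumed inputs.
def trajectory_difference(t_ref, x_ref, t_fix, x_fix):
    """Max relative deviation after interpolating the reference onto the fixed-step grid."""
    x_ref_on_grid = np.interp(t_fix, t_ref, x_ref)
    scale = max(np.max(np.abs(x_ref_on_grid)), 1e-12)   # avoid division by zero
    return np.max(np.abs(x_fix - x_ref_on_grid)) / scale
```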
For both Options
Note that when you change the system's properties (poles) or its inputs, the above procedure(s) should be repeated - at least the validation part.
I was wondering if there exists a technical way to choose initial parameters for these kinds of problems (as they can take virtually any form). My question arises from the fact that my solution depends somewhat on the initial parameters (as usual). My fit consists of 10 parameters and approximately 5120 data points (x, y, z) and has non-linear constraints. I have been doing this by brute force, that is, trying parameters randomly and trying to observe a pattern, but it has led me nowhere.
I have also tried MATLAB's Genetic Algorithm (to find a global optimum), but with no success, as it seems my function has a ton of local minima.
For the purpose of my problem, I need to justify in some manner the reasons behind choosing the initial parameters.
Without any insight into the model and the likely values of the parameters, the search space is too large for anything feasible. Consider that just trying ten values for each of the ten parameters already corresponds to 10^10, i.e. ten billion, combinations.
There is no magical black box.
You can try Bayesian Optimization to find a global optimum for expensive black-box functions. MATLAB describes its implementation, bayesopt, as
Select optimal machine learning hyperparameters using Bayesian optimization
but you can use it to optimize any function. Bayesian Optimization works by updating a prior belief over a distribution of functions with the observed data.
To speed up the optimization I would recommend adding your existing data via the InitialX and InitialObjective input arguments.
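If you want to prototype the same idea outside MATLAB, scikit-optimize offers a similar interface; this is only a rough Python analogue (the toy objective and bounds are made up), where the x0/y0 arguments play the role of InitialX and InitialObjective:

```python
# Sketch only: a Python analogue of the idea above using scikit-optimize's gp_minimize.
# The objective and bounds are made up for illustration.
from skopt import gp_minimize

def objective(params):
    a, b = params
    return (a - 1.0) ** 2 + (b + 2.0) ** 2      # toy stand-in for an expensive black box

bounds = [(-5.0, 5.0), (-5.0, 5.0)]

# previously evaluated points and their objective values (e.g. from earlier fits)
x_seen = [[0.0, 0.0], [2.0, -1.0]]
y_seen = [objective(x) for x in x_seen]

result = gp_minimize(objective, bounds, n_calls=30, x0=x_seen, y0=y_seen, random_state=0)
print(result.x, result.fun)
```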
While trying to implement Episodic Semi-gradient Sarsa with a neural network as the approximator, I wondered how to choose the optimal action based on the currently learned weights of the network. If the action space is discrete, I can just calculate the estimated value of the different actions in the current state and choose the one which gives the maximum. But this seems not to be the best way of solving the problem. Furthermore, it does not work if the action space is continuous (like the acceleration of a self-driving car, for example).
So, basically, I am wondering how to implement the 10th line, "Choose A' as a function of q̂(S', ·, w)", in Sutton's pseudo-code.
How are these problems typically solved? Can one recommend a good example of this algorithm using Keras?
Edit: Do I need to modify the pseudo-code when using a network as the approximator? So that I simply minimize the MSE between the network's prediction and the reward R, for example?
I wondered how to choose the optimal action based on the currently learned weights of the network
You have three basic choices:
Run the network multiple times, once for each possible value of A' to go with the S' value that you are considering. Take the maximum value as the predicted optimum action (with probability 1-ε; otherwise choose randomly, for the ε-greedy policy typically used in SARSA).
Design the network to estimate all action values at once - i.e. to have |A(s)| outputs (perhaps padded to cover "impossible" actions that you need to filter out). This alters the gradient calculations slightly: zero gradient should be applied to the inactive outputs of the last layer (i.e. anything not matching the A of (S,A)). Again, just take the maximum valid output as the estimated optimum action. This can be more efficient than running the network multiple times, and it is also the approach used by the DQN Atari game-playing agent and AlphaGo's policy networks (a small sketch of this option follows this list).
Use a policy-gradient method, which works by using samples to estimate the gradient that would improve a policy estimator. See chapter 13 of Sutton and Barto's second edition of Reinforcement Learning: An Introduction for more details. Policy-gradient methods become attractive when there are large numbers of possible actions, and they can cope with continuous action spaces (by estimating the distribution function of the optimal policy - e.g. choosing the mean and standard deviation of a normal distribution, which you can sample from to take your action). You can also combine policy gradients with a state-value approach in actor-critic methods, which can be more efficient learners than pure policy-gradient approaches.
Note that if your action space is continuous, you don't have to use a policy-gradient method; you could just quantise the action. Also, in some cases, even when actions are in theory continuous, you may find the optimal policy involves only using extreme values (the classic mountain car example falls into this category: the only useful actions are maximum forward acceleration and maximum backwards acceleration).
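As a minimal illustration of the second option, here is a hedged Keras sketch of ε-greedy action selection with a network that outputs one value per discrete action; the state size, action count and layer sizes are assumptions for illustration:

```python
import numpy as np
from tensorflow import keras   # assumes TensorFlow/Keras; sizes below are illustrative

n_actions, state_dim = 4, 8    # assumed discrete action count and state dimension

# one output per discrete action, i.e. q̂(S, ·, w) in a single forward pass
q_net = keras.Sequential([
    keras.Input(shape=(state_dim,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(n_actions, activation="linear"),
])

def choose_action(state, epsilon=0.1):
    """ε-greedy over the network's action-value outputs."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    q_values = q_net.predict(state[np.newaxis, :], verbose=0)[0]
    return int(np.argmax(q_values))

# example call with a made-up state
action = choose_action(np.zeros(state_dim))
```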
Do I need to modify the pseudo-code when using a network as the approximator? So that I simply minimize the MSE between the network's prediction and the reward R, for example?
No. There is no separate loss function in the pseudocode, such as the MSE you would see used in supervised learning. The error term (often called the TD error) is given by the part in square brackets, and achieves a similar effect. Literally, the term ∇q̂(S,A,w) means the gradient of the estimator itself - not the gradient of any loss function.
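To make that concrete, here is a minimal sketch of one semi-gradient SARSA update with a linear approximator (the feature vectors and step size are made up). The weight update multiplies the TD error by the gradient of q̂ itself; with a neural network you get essentially the same update by regressing the network's output towards the fixed target R + γ·q̂(S',A',w) with an MSE loss, as long as no gradient flows through the target:

```python
import numpy as np

def q_hat(w, x):
    """Linear action-value estimate q̂(S,A,w) = w·x(S,A)."""
    return w @ x

def semi_gradient_sarsa_update(w, x_sa, r, x_next, alpha=0.01, gamma=0.99, terminal=False):
    """One episodic semi-gradient SARSA update: TD error times ∇q̂(S,A,w)."""
    target = r if terminal else r + gamma * q_hat(w, x_next)
    td_error = target - q_hat(w, x_sa)        # the part in square brackets
    return w + alpha * td_error * x_sa        # ∇q̂(S,A,w) = x(S,A) for a linear model

# toy usage with made-up feature vectors for (S,A) and (S',A')
w = np.zeros(4)
x_sa = np.array([1.0, 0.0, 0.5, 0.0])
x_next = np.array([0.0, 1.0, 0.0, 0.5])
w = semi_gradient_sarsa_update(w, x_sa, r=1.0, x_next=x_next)
```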
I have gone through neural networks and have understood the derivation of backpropagation almost perfectly (finally!). However, I have a small doubt.
We update all the weights simultaneously, so what is the guarantee that this leads to a smaller cost? If the weights were updated one by one, it would definitely lead to a lower cost, and it would be similar to linear regression. But if we update all the weights simultaneously, might we not overshoot the minimum?
Also, do we update the biases like we update the weights after each forward propagation and back propagation of each test case?
Lastly, I have started reading on RNN's. What are some good resources to understand BPTT in RNN's?
Yes, updating only one weight at a time could decrease the error value at every step, but it is usually infeasible to do such updates in practical neural networks. Most of today's architectures have on the order of 10^6 parameters, so a separate pass for every parameter would take enormously long. Moreover, because of the nature of backpropagation, you have to compute loads of intermediate derivatives in order to obtain the derivative with respect to any given parameter, so such an approach wastes a lot of computation.
But the phenomenon you mention was noticed a long time ago and there are ways of dealing with it. The two most common issues connected with it are:
Covariate shift: the error and weight updates of a given layer depend strongly on the output of the previous layer, so when you update that layer, the inputs seen by the next layer change. The most common way to deal with this problem right now is batch normalization.
Nonlinear functions vs. linear differentiation: it is easy to forget that differentiation is a linear operator, which can cause problems in gradient descent. The most counterintuitive example is the fact that if you multiply your input by a constant, every derivative is also multiplied by the same number. This can lead to many problems, but most recent training methods do a great job of dealing with it.
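Regarding the original worry about simultaneous updates: for a sufficiently small learning rate, a step along the negative gradient decreases the cost to first order even though every weight moves at once. A tiny self-contained example on a made-up quadratic cost:

```python
import numpy as np

# made-up quadratic cost: J(w) = 0.5 * w^T A w, with gradient A w
A = np.array([[3.0, 1.0], [1.0, 2.0]])
w = np.array([1.0, -2.0])

def cost(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

lr = 0.1
for step in range(5):
    w_new = w - lr * grad(w)           # all weights updated simultaneously
    print(step, cost(w), "->", cost(w_new))   # cost decreases at every step
    w = w_new
```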
About BPTT, I strongly recommend Geoffrey Hinton's course on ANNs, and especially this video.
I am attempting to train a neural network to control a simple entity in a simulated 2D environment, currently by using a genetic algorithm.
Perhaps due to lack of familiarity with the correct terms, my searches have not yielded much information on how to treat fitness and training in cases where all the following conditions hold:
There is no data available on correct outputs for given inputs.
A performance evaluation can only be made after an extended period of interaction with the environment (with continuous controller input/output invocation).
There is randomness inherent in the system.
Currently my approach is as follows:
The NN inputs are instantaneous sensor readings of the entity and environment state.
The outputs are instantaneous activation levels of its effectors, for example, a level of thrust for an actuator.
I generate a performance value by running the simulation for a given NN controller, either for a preset period of simulation time, or until some system state is reached. The performance value is then assigned as appropriate based on observations of behaviour/final state.
To prevent over-fitting, I repeat the above a number of times with different random generator seeds for the system, and assign a fitness using some metric such as average/lowest performance value.
This is done for every individual at every generation. Within a given generation, for fairness, each individual uses the same set of random seeds (a small sketch of this evaluation follows below).
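For what it's worth, here is a minimal sketch of that evaluation scheme; simulate_episode(genome, seed) is a hypothetical function standing in for one run of the simulator, and the averaging/seed handling mirrors the description above:

```python
import random

def evaluate_fitness(genome, seeds, simulate_episode):
    """Average performance over several seeded runs of the simulation.

    simulate_episode(genome, seed) is a hypothetical function that runs one
    simulation with the given controller weights and returns a performance value.
    """
    scores = [simulate_episode(genome, seed) for seed in seeds]
    return sum(scores) / len(scores)          # or min(scores) for worst-case fitness

def evaluate_generation(population, simulate_episode, runs_per_individual=5):
    """Every individual in a generation is scored on the same seed set for fairness."""
    seeds = [random.randrange(2**32) for _ in range(runs_per_individual)]
    return [evaluate_fitness(genome, seeds, simulate_episode) for genome in population]
```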
I have a couple of questions.
Is this a reasonable, standard approach to take for such a problem? Unsurprisingly it all adds up to a very computationally expensive process. I'm wondering if there are any methods to avoid having to rerun a simulation from scratch every time I produce a fitness value.
As stated, the same set of random seeds is used for the simulations of each individual in a generation. From one generation to the next, should this set remain static, or should it change? My instinct was to use different seeds each generation to further avoid over-fitting, and I assumed that doing so would not have an adverse effect on the selective pressure. However, from my results, I'm unsure about this.
It is a reasonable approach, but genetic algorithms are not known for being very fast/efficient. Try hill climbing and see if that is any faster (a minimal sketch is given below). There are numerous other optimization methods, but nothing is great if you assume the function is a black box that you can only sample from. Reinforcement learning might work.
Using random seeds should prevent overfitting, but it may not be necessary, depending on how representative a static test is of the average case and how easy it is to overfit.
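To try the hill-climbing suggestion with minimal effort, a sketch like the following can reuse the same fitness evaluation as the genetic algorithm; hill_climb and mutate are illustrative names, and the Gaussian mutation is just one reasonable choice:

```python
import random

def mutate(genome, sigma=0.05):
    """Illustrative mutation: add small Gaussian noise to each weight."""
    return [w + random.gauss(0.0, sigma) for w in genome]

def hill_climb(initial_genome, fitness, iterations=1000):
    """Simple stochastic hill climbing: keep a mutated genome only if its fitness
    does not decrease. `fitness` is assumed to be the (seed-averaged) evaluation
    already used for the genetic algorithm."""
    best, best_score = initial_genome, fitness(initial_genome)
    for _ in range(iterations):
        candidate = mutate(best)
        score = fitness(candidate)
        if score >= best_score:
            best, best_score = candidate, score
    return best, best_score
```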