I am learning about NEAT (NeuroEvolution of Augmenting Topologies) and am trying to implement it in C++. I have no idea what a good compatibility threshold would be. Can you recommend one, along with values for c1, c2 and c3? (See the distance function (δ) in the paper, page 13: http://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf)
The compatibility threshold, along with the coefficients c1, c2 and c3, should be chosen based on your problem and on the other parameters of your NEAT implementation. The larger your compatibility threshold, the fewer species you'll have. If your population size is small, that is probably what you want, since you don't want an already small population divided much further. If your population is very large, however, you can afford to have more species. Another thing to note is that, in general, c1 and c2 should be set to the same value, since there isn't any real difference in the way disjoint and excess genes behave. All that remains is deciding how much the weight differences between networks should factor into speciation, and in my experience that can only be tuned through trial and error.
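If I recall the paper correctly, its experiments use values around c1 = c2 = 1.0, c3 = 0.4 with a threshold of about 3.0, which is a reasonable starting point. Here is a small sketch of the compatibility test, written in MATLAB here but it translates line for line to C++; the example inputs are made up:

% Example inputs (hypothetical): 2 excess genes, 3 disjoint genes,
% average weight difference of matching genes 0.5, larger genome has 25 genes.
E = 2;  D = 3;  Wbar = 0.5;  N = 25;
c1 = 1.0;  c2 = 1.0;  c3 = 0.4;   % coefficient values reported in the paper's experiments
threshold = 3.0;                  % compatibility threshold; a common starting point, tune per problem
delta = c1*E/N + c2*D/N + c3*Wbar % distance function from the paper
sameSpecies = delta < threshold   % genomes below the threshold share a species

If you end up with too many species, raise the threshold (or lower c3); if you end up with too few, do the opposite.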
Related questions:
For example, how to minimize x^2 when x ranges from -inf to inf?
How to pick initial length of chromosomes?
How to determine initial population?
Can I do it in a guided way instead of initializing randomly?
Genetic algorithms normally have a fixed-length chromosome representation; if the length varies, we are talking more about Genetic Programming.
Anyway, I don't know if there is a rule for the initial length, but you should choose a size that allows individuals to give good initial results without being very large. You need to remember that GA/GP tend to suffer from bloat.
The initial population should be the largest you can use while still finishing your runs in a reasonable amount of time on the computational power available. In my experience, more is always better.
About the initialization, there are certainly several seeding techniques; this paper enumerates some of them, and you should be able to find more information about each: Nearest Neighbor (NN), Gene Bank (GB), Selective Initialization (SI), Sorted Population (SP).
For all the questions above, the best approach is to experiment a lot and figure out what works best for your problem and your implementation. For instance, in one of my systems the best solution usually sits around 70 nodes; after a lot of experimentation I found a sweet spot of 45-node initial individuals, which worked better than initializing at 30 or 60.
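For reference, the plain random baseline that those seeding techniques try to improve on is just a uniform draw within the gene bounds. A minimal sketch (the bounds and population size here are hypothetical placeholders):

% Uniform random initialization of fixed-length chromosomes.
popSize  = 100;                 % as large as your runtime budget allows
lb = [0 0 0];  ub = [10 5 1];   % hypothetical per-gene lower/upper bounds
numGenes = numel(lb);
pop = repmat(lb, popSize, 1) + rand(popSize, numGenes) .* repmat(ub - lb, popSize, 1);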
I would like to calculate a confidence interval along with my Degrees of Freedom (DoF) estimate in MATLAB. I am trying to run the following line of code:
[R, DoF, ciDOF] = copulafit('t', U); % fit the copula
The line without the "ciDOF" output takes between 1 and 3 hours to run with my data. I tried running the code with the "ciDOF" output several times, but the calculation seems to take very long (I stopped it after 8 hours). No error message is generated.
Does anyone have experience with this output and could kindly tell me how long I should expect the calculation to take (the size of my data is 167 x 19), and whether I have specified "ciDOF" correctly?
Many thanks for the help!
Carolin
If your data matrix U is of size 167 x 19, then what you are asking for is a copula fit over 19 dependent variables, i.e. a joint distribution in 19 dimensions.
This is almost certainly why it is taking so long: whether it is your intention or not, you are asking MATLAB to solve a minimization problem that takes 19 marginal distributions (the 167 x 1 columns of U, each assumed uniform) and comes up with the 19-variate joint distribution (the copula) that ties them together.
Most likely the long runtime is a limitation of the MATLAB implementation, which iterates through many independent computations and then tries to combine them to satisfy the joint distribution's conditions.
First and foremost -- and not to be insulting or insinuating -- you should definitely check that you really are trying to find a 19-variate copula. Also, just in case, make sure that your matrix U is oriented in the proper way, because if you have it transposed, you could be trying to ask for the solution to a 167-variate distribution.
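As a quick way to rule out the orientation issue, a small sanity check might look like this (assuming U is named as in the question and is meant to be observations by variables):

[n, d] = size(U);               % expecting 167 observations x 19 variables
fprintf('%d observations, %d variables\n', n, d);
assert(n > d, 'U may be transposed: expected more rows (observations) than columns.');
% copula data must already be uniform marginals on (0,1)
assert(all(U(:) > 0 & U(:) < 1), 'U must lie strictly in (0,1); transform the marginals first (e.g. with ecdf or ksdensity).');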
But, if this is what you are actually trying to do, there is not really an easy way to predict how long it will take or how long it should take. Even with multiple dimensions, if your marginals are simple or uniform already, that would greatly reduce the copula computation. But, really, there is no way to tell.
Although this may seem like a cop-out, you may actually have better luck switching from MATLAB to R, especially if you have a lot of multivariate data; you will probably find a lot more functionality in R than in MATLAB. R is freely available and comes with a graphical user interface (GUI), in case you aren't comfortable with command-line programming.
There are many more sources, but here is one PDF lecture on computing copula-fits in R:
http://faculty.washington.edu/ezivot/econ589/copulasPowerpoint.pdf
I am trying to use a kNN classifier to perform some supervised learning. In order to find the best number of neighbors k for kNN, I used cross validation. For example, the following code loads one of MATLAB's standard data sets and runs the cross validation, plotting various k values against the cross-validation error:
load ionosphere;
[N,D] = size(X)
resp = unique(Y)
rng(8000,'twister') % for reproducibility
K = round(logspace(0,log10(N),10)); % number of neighbors
cvloss = zeros(numel(K),1);
for k=1:numel(K)
knn = ClassificationKNN.fit(X,Y,...
'NumNeighbors',K(k),'CrossVal','On');
cvloss(k) = kfoldLoss(knn);
end
figure; % Plot the error versus k
plot(K,cvloss);
xlabel('Number of nearest neighbors');
ylabel('10 fold classification error');
title('k-NN classification');
The resulting plot of cross-validation error versus k shows the following.
The best k in this case is k=2 (it was not an exhaustive search). From the figure, we can see that the cross-validation error goes up dramatically for k>50. It reaches a large error and becomes stable for k>100.
My question is what is the maximum k we should test in this kind of cross validation framework?
For example, there are two classes in the 'ionosphere' data. One class labeled as 'g' and one labeled as 'b'. There are 351 instances in total. For 'g' there are 225 cases and for 'b' there are 126 cases.
In the code above, the largest k tested is k=351. But should we only test from 1 up to 126, or only up to 225? Is there a relation between the number of cases per class and the maximum k? Thanks. A.
The best way to choose a parameter in a classification problem is to choose it from domain knowledge, and what you are doing is certainly not that. If your data is small enough that you can run many classifications with different parameter values, you can do that, but to be convincing you need to show that the parameter you chose was not picked arbitrarily; you need to explain the behavior of the plot you have drawn.
In this case the error curve is increasing, so you can argue that the smallest k (here 2) is the best choice.
In most cases you will not choose k larger than about 20, but there is no proof of that, and you need to run the classification until you can justify your choice.
You don't want k to be too large (i.e. too close to the number of examples), because then the k-neighborhood of each query example covers a large fraction of the space, so the prediction depends less and less on the actual location of the query and more on the overall statistics. This explains why the performance is not good for large k: your classifier essentially always chooses 'g', and gets it wrong 126/351 ≈ 36% of the time, as you see in the plot.
Theory suggests that k needs to grow as the number of labeled examples grow, but sub-linearly.
When you have lots of training data, you want k to be large because you want a good estimate of the likelihood that a point near the query gets each label. This lets you approximate the maximum a posteriori decision rule (which is optimal, assuming you know the actual distribution).
So here are some practical tips:
Get more data if you can. Then run the experiment again.
Focus on small values of k. My bet is that k=3 is better than k=2. Usually for binary classification k is at least 3, and usually an odd number (to avoid ties).
The fact that k=2 looks better is surprising: the only case in which k=1 differs from k=2 is when the two nearest neighbors have different labels, and in that case the decision is made either randomly or arbitrarily (e.g. always choose 'g'). It depends on the implementation of the kNN algorithm. My guess is that in the algorithm you are using the tie-breaking is fixed and chooses 'g', which just happens to be more likely overall. If you switch the roles of the labels you will probably see that k=1 is better than k=2.
It would be interesting to see the plot for small values of k (e.g. 1 to 20); a sketch for that is included after the references below.
References:
nearest neighbor classification
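Following that last suggestion, here is a hedged sketch that reruns the same cross-validation only for k = 1 to 20, reusing the data and seed from the question (fitcknn is the newer name for ClassificationKNN.fit and should behave the same here):

load ionosphere;
rng(8000,'twister');                 % same seed as in the question
Ksmall = 1:20;                       % only small neighborhood sizes
cvlossSmall = zeros(numel(Ksmall),1);
for k = 1:numel(Ksmall)
    mdl = fitcknn(X, Y, 'NumNeighbors', Ksmall(k), 'CrossVal', 'on');
    cvlossSmall(k) = kfoldLoss(mdl); % 10-fold error by default
end
figure;
plot(Ksmall, cvlossSmall, '-o');
xlabel('Number of nearest neighbors');
ylabel('10-fold classification error');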
Increasing the number of neighbors taken into account during classification eventually turns your classifier into a majority-class vote. You only need to check the ratio of your classes to see that it equals the error rate at large k.
Since you are using cross validation, the k that corresponds to the minimum of your error rate is what you should select; in this case it is 3, if I am not mistaken.
Keep in mind that the choice of the cross-validation parameter introduces bias into your selection of k. A more elaborate analysis would be needed there, but your 10 folds should be fine for this case.
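The class-ratio check mentioned above is essentially a one-liner on the same ionosphere data as the question:

load ionosphere;
fractionB = mean(strcmp(Y, 'b'))   % ~126/351, roughly the error rate seen at large k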
I have read many tutorials and papers and I understand the concept of a Genetic Algorithm, but I am having trouble implementing the problem in MATLAB.
In summary, I have:
A chromosome containing three genes [ a b c ] with each gene constrained by some different limits.
Objective function to be evaluated to find the best solution
What I did:
Generated random values of a, b and c for, say, a population of 20 solutions, i.e.
[a1 b1 c1] [a2 b2 c2]…..[a20 b20 c20]
For each solution, I evaluated the objective function and ranked the solutions from best to worst.
Difficulties I faced:
Now, why should we go for crossover and mutation? Is the best solution I found not enough?
I know the concept of doing crossover (generating a random number, probability, etc.), but which parents, and how many of them, will be selected for crossover or mutation?
Should I do the crossover for the entire 20 solutions (parents) or only two of them?
Generally a Genetic Algorithm is used to find a good solution to a problem with a huge search space, where finding an absolute optimum is either very difficult or impossible. Obviously I don't know the range of your values, but since you have only three genes it's likely that a good solution will be found by a Genetic Algorithm (or an even simpler search strategy) without any additional operators. Selection and crossover are usually carried out on all chromosomes in the population (although it's not uncommon to carry some of the best from each generation forward as-is). The general idea is that fitter chromosomes are more likely to be selected and to undergo crossover with each other.
Mutation is usually used to stop the Genetic Algorithm from prematurely converging on a non-optimal solution. You should analyse the results without mutation to see if it's needed. Mutation is usually applied to the entire population, at every generation, but with a very small probability; giving every gene a 0.05% chance of mutating isn't uncommon. You want a small chance of mutation, without it completely overriding the results of selection and crossover.
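To make the selection/crossover/mutation flow concrete, here is a minimal sketch of one generation for 3-gene chromosomes; the objective objfun, the initial population, and every rate in it are placeholders rather than recommended settings:

popSize = 20;  numGenes = 3;
pop = rand(popSize, numGenes);                 % placeholder initial population
objfun = @(x) sum(x.^2, 2);                    % hypothetical objective (minimized)

fitness = objfun(pop);                         % evaluate every solution
[~, order] = sort(fitness);                    % best (smallest) first
newPop = pop(order(1:2), :);                   % elitism: carry the two best forward unchanged

while size(newPop, 1) < popSize
    % tournament selection of size 2: the fitter of two random picks wins
    pick = @() order(min(randi(popSize, 1, 2)));
    p1 = pop(pick(), :);
    p2 = pop(pick(), :);
    alpha = rand(1, numGenes);                 % arithmetic (blend) crossover
    child = alpha .* p1 + (1 - alpha) .* p2;
    mutate = rand(1, numGenes) < 0.05;         % small per-gene mutation probability
    child(mutate) = child(mutate) + 0.1 * randn(1, sum(mutate));
    newPop(end + 1, :) = child;                %#ok<AGROW>
end
pop = newPop;                                  % next generation

The point of repeating this loop is that the best individual of a random initial population is rarely anywhere near the optimum, which is why stopping after the initial ranking is not enough.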
As has been suggested, I'd do a little more general background reading on Genetic Algorithms to get a better understanding of the underlying concepts.
Sharing a bit of advice from the book 'Practical Neural Network Recipes in C++': it is a good idea to have a significantly larger population for your first epoch, since then you're likely to include features which will contribute to an acceptable solution. Later epochs, which can have smaller populations, will then tune, combine, or discard these favourable features.
And Handbook-Multiparent-Eiben seems to indicate that four parents are better than two. However, bed manufacturers have not caught on to this yet and seem to only produce single and double beds.
I have 3 sets of data: xdata, ydata and error_ydata.
I need to fit this data to an equation of the form:
y_fit = c1*sin((2*pi*x_data)/c2 - c3) + c4
where c1 to c4 are constants, the parameters to find.
I've tried several MATLAB functions like fittype or lsqcurvefit, but they require very close initial estimates for the 4 constants to work. The point was to find these constants regardless of the initial estimates you give.
Any idea?
Thank you in advance.
My best regards
Sorry, but the fact is that nonlinear estimation requires at least decent starting values. If you can't be bothered to supply them, then expect random crapola for results at least some of the time.
Do those tools require VERY close estimates? Hardly so IMHO, but the definition of "very" is a highly subjective one. Perhaps you need to learn more about optimization and the tools that you will use. Once you do, you will start to know how to make them work better. A workman who lacks understanding of their tools should expect to get hurt on a frequent basis.
You might do some reading. Here is one place to start.
There ARE some tools out there that allow a reduction of the problem using a partitioned least squares approach; fminspleas is one. (You can also find pleas in the optimization tips and tricks file.) But in order to use that tool, you will need to learn something about its estimation methodology, understanding how it splits the parameters into two classes. Again, understand your tools.
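As one way to act on that advice, here is a hedged sketch that derives rough starting values from the data itself (xdata and ydata as named in the question; everything else is an assumption) and then hands them to lsqcurvefit:

% Model from the question: y = c1*sin(2*pi*x/c2 - c3) + c4
model = @(c, x) c(1)*sin(2*pi*x./c(2) - c(3)) + c(4);

c4_0 = mean(ydata);                        % offset ~ mean of the data
c1_0 = (max(ydata) - min(ydata))/2;        % amplitude ~ half the range
% crude period estimate from the dominant FFT component
% (assumes roughly uniform sampling of xdata)
n  = numel(ydata);
dt = mean(diff(xdata));
Yf = abs(fft(ydata - c4_0));
[~, idx] = max(Yf(2:floor(n/2)));          % skip the DC term
c2_0 = n*dt/idx;                           % period of the dominant component
c3_0 = 0;                                  % no better guess for the phase

c0 = [c1_0, c2_0, c3_0, c4_0];
cfit = lsqcurvefit(model, c0, xdata, ydata);

With a reasonable period guess from the FFT, the phase and offset starting values matter much less, and the solver usually converges without trouble. If you want to weight the fit by error_ydata, you would need something like lsqnonlin with a custom weighted residual instead.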