Real numbers (constants) in genetic programming

I can't figure out how a genetically programmed A.I. can determine when there should be a constant in the final equation. If I take the formula F(m) = ma, i.e. F(m) = m*9.8, how can the A.I. know what the real number 9.8 actually is? I understand that instead of putting the final number in the binary tree, you can put a symbol that stands for a constant and then later calculate or guess its value in some way.
Thank you

Given a predefined set of constants (part of the terminal set), they'll be combined to form new constants (using a tree representation, any sub-tree with only numeric constants as leaves can itself be thought of as a new numeric constant).
Even with a single constant (c) the system will create:
the 1.0 constant (constant divided by itself: c / c);
the 2.0 constant (1.0 + 1.0 i.e. c / c + c / c);
the 0.5 constant (1.0 / 2.0 i.e. c / c / (c / c + c / c));
many constants will be created this way (if you are lucky... 9.8).
Sometimes special terminals named "ephemeral random constant" (Koza) are used. For each ephemeral in the initial population, a random number in a specified range is generated. Then these random constants are moved around and combined.
Anyway, even with the use of the ephemeral random constant, GP can be hard put to generate the right constants (Koza said "the finding of numeric constants is a skeleton in the GP closet").
So other techniques can be used during/after the evolution, e.g. numeric mutation, hill climbing...
These hybrid systems often have significant improvements in the success ratios (at least for regression problems).
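To make the ephemeral-random-constant and numeric-mutation ideas above concrete, here is a minimal MATLAB sketch; the range [-10, 10] and the mutation strength 0.1 are arbitrary assumptions, not taken from any particular GP package:
% Ephemeral random constants for the initial population (assumed range [-10, 10]).
lo = -10; hi = 10;
erc = lo + (hi - lo) * rand(1, 5);     % one random constant per ephemeral terminal
% Numeric mutation: perturb a constant already sitting in a candidate tree.
candidate = erc(1);
mutated = candidate + 0.1 * randn;     % small Gaussian tweak toward (hopefully) 9.8
% Crossover and repeated mutation let such constants drift toward useful values.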

What is the difference between defining a vector using linspace and defining a vector using steps?

I am trying to learn the basics of Matlab. I wanted to write a Matlab script in which I define a vector x with a step d of size (2*pi/1000), and then plot two sine functions of x: the first with a frequency of 1, and the second with a frequency of 10.3.
This is what I did:
d=(2*pi/1000);
x=-pi:d:pi;
first=sin(x);
second=sin(10.3*x);
plot(x,first,x,second);
My question: what is the difference between:
x=linspace(-pi,pi,1000);
and
d=(2*pi/1000);
x=-pi:d:pi;
? I am asking because I got confused: I thought they were both the same, but I think there is something wrong with my assumption.
Also, is there a more efficient way to write a sine function with a given frequency?
The main difference can be summarized as predefined size vs predefined step. And your example highlights it very well indeed (1000 elements vs 1001 elements).
The linspace function produces a fixed-length vector (the length being defined by the third input argument, which defaults to 100) whose lower and upper limits are set, respectively, by the first and the second input arguments. The correct step to use is computed internally by the function itself (step = (x2 - x1) / (n - 1)).
The colon operator defines a vector of elements whose values range between the specified lower and upper limits. The step, which is an optional parameter that defaults to 1, determines the vector's length: the length is given by the number of steps that can be taken from the lower limit towards the upper one. On a side note, in this MathWorks thread you can find a very interesting discussion concerning the behavior of the colon operator with respect to floating-point handling.
Another difference, related to the first one, is that linspace always includes the upper limit value, while the colon operator only includes it if the specified step allows it (0:5:14 = [0 5 10]).
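For instance, a quick check in the command window illustrates both differences (a minimal sketch using the question's own values):
x1 = linspace(-pi, pi, 1000);   % exactly 1000 elements; the endpoint pi is always included
d  = 2*pi/1000;
x2 = -pi:d:pi;                  % 1001 elements: 1000 steps of size d plus the starting point
numel(x1)                       % ans = 1000
numel(x2)                       % ans = 1001
0:5:14                          % ans = [0 5 10]; the upper limit 14 is never reached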
As a general rule, I prefer to use the former when I want to produce a vector of a predefined length (pretty obvious, isn't it?), and the latter when I need to create a sequence whose length has only marginal relevance (or no relevance at all).

Variable precision arithmetic for symbolic integral in Matlab

I am trying to calculate some integrals that use very high power exponents. An example equation is:
(-exp(-(x+sqrt(p)).^2)+exp(-(x-sqrt(p)).^2)).^2 ...
./( exp(-(x+sqrt(p)).^2)+exp(-(x-sqrt(p)).^2)) ...
/ (2*sqrt(pi))
where p is a constant (1000 being a typical value), and I need the integral for x = [-inf, inf]. If I use the integral function for numeric integration I get NaN as a result. I can avoid that if I set the limits of the integration to something like [-20,20] and a low p (<100), but ideally I need the full range.
I have also tried setting syms x and using int and vpa, but in this case vpa returns:
1.0 - 1.0*numeric::int((1125899906842624*(exp(-(x - 10*10^(1/2))^2) - exp(-(x + 10*10^(1/2))^2))^2)/(3991211251234741*(exp(-(x - 10*10^(1/2))^2) + exp(-(x + 10*10^(1/2))^2)))
without calculating a value. Again, if I set the limits of the integration to lower values I do get a result (also for low p), but I know that the result I get is wrong – e.g., if x=[-100,100] and p=1000, the result is >1, which must be wrong, as the integral should be asymptotic to 1 (or alternatively the codomain should be [0,1)).
Am I doing something wrong with vpa or is there another way to calculate high precision values for my integrals?
First, you're doing something that makes solving symbolic problems more difficult and less accurate. The variable pi is a floating-point value, not an exact symbolic representation of the fundamental constant. In Matlab symbolic math code, you should always use sym('pi'). You should do the same for any other special numeric values you use, e.g., sqrt(sym('2')) and exp(sym('1')), or they will get converted to an approximate rational fraction by default (the source of the strange large numbers you see in the code in your question). For further details, I recommend that you read through the documentation for the sym function.
Applying the above, here's a runnable example:
syms x;
p = 1000;
f = (-exp(-(x+sqrt(p)).^2)+exp(-(x-sqrt(p)).^2)).^2./(exp(-(x+sqrt(p)).^2)...
+exp(-(x-sqrt(p)).^2))/(2*sqrt(sym('pi')));
Now vpa(int(f,x,-100,100)) and vpa(int(f,x,-1e3,1e3)) return exactly 1.0 (to 32 digits of precision, see below).
Unfortunately, vpa(int(f,x,-Inf,Inf)) does not return an answer, but a call to the underlying MuPAD function numeric::int. As I explain in this answer, this is what can happen when int cannot obtain a result. Normally, it should try to evaluate the integral numerically, but your function appears to be ill-defined at ±∞, resulting in divide-by-zero issues that the variable precision quadrature methods can't handle well. You can evaluate the integral at wider bounds by increasing the variable precision using the digits function (just remember to set digits back to the default of 32 when done). Setting digits(128) allowed me to evaluate vpa(int(f,x,-1e4,1e4)). You can also more efficiently evaluate your integral over a wider range via 2*vpa(int(f,x,0,1e4)) at lower effective digits settings.
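Putting those steps together, a minimal sketch of the workflow just described (the bound 1e4 and the precision 128 are simply the values mentioned above, not general recommendations):
digits(128);                         % raise the variable precision
wide = vpa(int(f, x, -1e4, 1e4));    % now evaluates instead of returning numeric::int
half = 2*vpa(int(f, x, 0, 1e4));     % same value by symmetry of f, usually cheaper
digits(32);                          % restore the default precision when done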
If your goal is to see exactly how much less than one p = 1000 corresponds to, you can use something like vpa(1-2*int(f,x,0,1e4)). At digits(128), this returns
0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000086457415971094118490438229708839420392402555445545519907545198837816908450303280444030703989603548138797600750757834260181259102
Applying double to this shows that it is approximately 8.6e-89.

Spearman's rank correlation significance

I'm calculating Spearman's rank correlation in matlab with the following code:
[RHO,PVAL] = corr(x,y,'Type','Spearman');
RHO =
0.7211
PVAL =
4.9473e-04
and then with different variables
[RHO,PVAL] = corr(x2,y2,'Type','Spearman');
RHO =
0.3277
PVAL =
0.0060
How do you categorize these as p < 0.05, p < 0.01, p < 0.001, etc.? Commonly in scientific journals these p-values are reported against thresholds like the ones I've listed and not as one exact number. Would both of these be p < 0.01? When deciding whether a correlation is significant at a specific level, do you always look for the smallest threshold, i.e. if PVAL = 0.0005, both p > 0.05 and p > 0.001 would be correct here; do we simply write the lowest, i.e. p > 0.001?
As Martin Dinov wrote, this is at least partially a matter of journal policy. But, as long as there is no explicit journal convention against it, I would recommend always reporting the actual p-value, in this case in the form p = 4.9·10^-4 and p = 0.006, respectively. You can then proceed to say that the effect you found is statistically significant, usually based on comparison with a previously chosen significance level, typically 0.05, unless you need to correct for multiple comparisons.
The reason is that the commonly used significance levels are purely a matter of convention. Saying only that p is below one conventional threshold withholds valuable information from the reader, which she might use to make up her own mind about the result – and this truncation is not even justified by a relevant saving of print space.
You should also, of course, report the value of the correlation coefficient itself (which in this case doubles as a test statistic and an effect size) as well as the sample size.
At least for the field of psychology, these are official recommendations:
Hypothesis tests. It is hard to imagine a situation in which a dichotomous accept-reject decision is better than reporting an actual p value or, better still, a confidence interval.
…
Effect sizes. Always present effect sizes for primary outcomes. If the units of measurement are meaningful on a practical level (e.g., number of cigarettes smoked per day), then we usually prefer an unstandardized measure (regression coefficient or mean difference) to a standardized measure (r or d).
L. Wilkinson and the Task Force on Statistical Inference, "Statistical Methods in Psychology Journals. Guidelines and Explanations"
You mean pval is < 0.05 and also < 0.001, and not >. In general, you do want to show that it is smaller than the smallest significance level (alpha) threshold that you can. So yes, it is best to say for the second example that the p-value is < 0.001. Depending on the journal convention, it may be preferable to put the actual p-value in (so, for the first example, 4.9473e-04) or just that it's < some suitable alpha (0.001 for the first case).
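If you do need the conventional threshold label, a small sketch could look like the following (the threshold list is just the usual convention and the helper is hypothetical, not something corr provides):
% Map a p-value to the conventional significance label (hypothetical helper).
pval = 4.9473e-04;
thresholds = [0.001 0.01 0.05];
idx = find(pval < thresholds, 1);           % smallest conventional threshold pval falls below
if isempty(idx)
    label = sprintf('p = %.3f (n.s.)', pval);
else
    label = sprintf('p < %g', thresholds(idx));
end
disp(label)                                 % prints 'p < 0.001' for this example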

Fixed point arithmetic

I'm currently using Microchip's Fixed Point Library, but I think this applies to most fixed point libraries. It supports Q15 and Q15.16 types, respectively 16-bit and 32-bit data.
One thing I noticed is that it does not include add, subtract, multiply or divide functions.
How am I supposed to do these? Is it as simple as just adding/subtracting/multiplying/dividing them together using integer math? I can see addition and subtraction working, but multiplying or dividing wouldn't take care of the fractional part...?
The Microchip library includes functions for adding and subtracting that deal with underflow/overflow (_Q15add and _Q15sub).
Multiplication can be implemented as an assembly function (I think the code is good - this is from memory).
C calling prototype is:
extern _Q15 Q15mpy(_Q15 a, _Q15 b);
The routine (placed in a .s source file in your project) is:
.global _Q15mpy
_Q15mpy:
mul.ss w0, w1, w2 ; signed multiply of the two parameters, 32-bit result in w3:w2 (w3 = high word)
SL w2, w2 ; shift the low word left; its most significant bit goes into the carry
RLC w3, w0 ; rotate the high word left through the carry; Q15 result lands in W0
return ; return value in W0
.end
Remember to include libq.h
This routine does a left-shift of one bit rather than a right-shift of 15 bits on the result. There are no overflow concerns because Q15 numbers always have a magnitude <= 1.
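To see numerically what the routine computes, here is a minimal MATLAB sketch of a Q15 multiply done with plain integer arithmetic; it is conceptually equivalent to the assembly above, not part of the Microchip library:
a = int32(16384);                     % 0.5 in Q15 (0.5 * 2^15)
b = int32(16384);                     % 0.5 in Q15
product = a * b;                      % Q30 intermediate: the binary point has moved
q15 = bitshift(product, -15);         % rescale back to Q15 (the shift the assembly performs)
disp(double(q15) / 2^15)              % prints 0.25, i.e. 0.5 * 0.5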
It turns out that all basic arithmetic functions are performed by using the native operators, due to how the numbers are represented. For example, divide uses the / operator and multiply uses the * operator, and these compile to simple 32-bit divides and multiplies.

Problem with arithmetic using logarithms to avoid numerical underflow (take 2)

I have two lists of fractions;
say A = [ 1/212, 5/212, 3/212, ... ]
and B = [ 4/143, 7/143, 2/143, ... ].
If we define A' = a[0] * a[1] * a[2] * ... and B' = b[0] * b[1] * b[2] * ...
I want to calculate a normalised value of A' and B'
ie specifically the values of A' / (A'+B') and B' / (A'+B')
My trouble is A and B are both quite long and each value is small, so calculating the product causes numerical underflow very quickly...
I understand turning the product into a sum through logarithms can help me determine which of A' or B' is greater
ie max( log(a[0])+log(a[1])+..., log(b[0])+log(b[1])+... )
and that using logs I can calculate the value of A' / B', but how do I do A' / (A'+B')?
My best bet to date is to keep the number representations as fractions, i.e. A = [ [1,212], [5,212], [3,212], ... ], and implement my own arithmetic, but it's getting clumsy and I have a feeling there is a (simple) way with logarithms that I'm just missing...
The numerators for A and B don't come from a sequence. They might as well be random for the purpose of this question. If it helps the denominators for all values in A are the same, as are all the denominators for B.
Any ideas most welcome!
( ps. I asked a similar question 24 hours ago regarding the ratio A'/B' but it was actually the wrong question to ask. I'm actually after A'/(A'+B'). Sorry, my mistake. )
I see a few ways here.
First of all you can notice that
A' / (A'+B') = 1 / (1 + B'/A')
and you know how to calculate B'/A' with logarithms.
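For example, a minimal MATLAB sketch of this first approach, using the example fractions from the question (the same few lines work in any language with log and exp):
A = [1/212, 5/212, 3/212];
B = [4/143, 7/143, 2/143];
logA = sum(log(A));                   % log(A') without ever forming the underflowing product
logB = sum(log(B));
normA = 1 / (1 + exp(logB - logA));   % A' / (A' + B')
normB = 1 - normA;                    % B' / (A' + B')
% Note: exp(logB - logA) can itself overflow or underflow if the lists differ wildly,
% but in that case normA is effectively 0 or 1 anyway.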
Another way is to implement your own rational arithmetic, but you don't need to go far with it. Since you know that the denominators are the same for the whole array, it immediately gives you
numerator(A') = numerator(a[0]) * numerator(a[1]) ...
denominator(A') = denominator(a[0]) ^ A.length
All you now need to do is to sum A' and B', which is easy, and then multiply A' and 1/(A'+B'), which is also easy. The hardest part here is to normalize the resulting value, which is done with a gcd computation (repeated modulo, i.e. the Euclidean algorithm) and is straightforward.
Alternatively, since you are most likely using some popular scripting language, note that most of them have classes for rational arithmetic built in; Python and Ruby have them for sure.
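The answer mentions Python and Ruby; in MATLAB the same idea can be sketched with exact symbolic rationals (a minimal sketch using the question's example fractions, which assumes the Symbolic Math Toolbox):
A = sym([1 5 3]) / 212;          % exact rationals, no floating-point underflow
B = sym([4 7 2]) / 143;
Ap = prod(A);                    % A' as an exact fraction
Bp = prod(B);
normA = Ap / (Ap + Bp);          % exact A' / (A' + B')
double(normA)                    % convert to double only at the very end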
My best bet to date is to keep the number representations as fractions and implement my own arithmetic but it's getting clumsy
What language are you using? If you can overload operators, it should be really easy to make up a Fraction class that you can treat as a number pretty much everywhere.
As an example, determining whether a fraction A / B is larger than C / D is basically comparing whether A * D is larger than B * C.
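As a tiny illustration of that comparison rule (assuming positive denominators; the values are made up):
A = 3; B = 7; C = 2; D = 5;
isLarger = (A*D > B*C);          % 3/7 > 2/5 because 15 > 14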
Both A and B have the same denominator in each fraction you mention. Is that true for every term in the list? If so, why don't you factor that out when you calculate the product? The denominator will simply be X^n, where X is that common value and n is the number of terms in the list.
If you do that, you'll have the opposite problem: overflow in the numerator. You know that it can't be larger than max(X)^n, where max(X) is the maximum value in the numerator and n is the number of terms in the list. If you can calculate that, you can see if your computer will have a problem. You can't put 10 pounds of anything in a 5 pound bag.
Unfortunately, the properties of logarithms limit you to the following simplifications:
log(x*y) = log(x) + log(y)
and
log(x/y) = log(x) - log(y)
So you're stuck with:
log(A'/(A'+B')) = log(A') - log(A'+B')
since there is no rule that breaks the logarithm of the sum A'+B' into log(A') and log(B').
If you're using a language that supports infinite precision numbers (e.g., Java BigDecimal) it might make your life a little easier. But there's still a good argument for doing some thinking before you compute. Why use brute force when you can be elegant?
Well ... if you know A'/(A'+B'), then B'/(A'+B') should be one minus that. I personally would not use logarithms. I would use the actual fractions. I would also use some sort of BigInt class to represent the numerator and denominator. Which language are you using? Python can be a good fit.