Orange Bayes algorithm with continuous features

I have a two-class Bayes classification problem with four continuous features. I'm trying to partially reproduce the naive Bayes algorithm that Orange uses for calculating probabilities, but I haven't been able to obtain the same values that Orange outputs.
Data set size: 150 (class0: 88 and class1: 62)
I use the following algorithm:
p(class0 | X1, X2, X3, X4) = L0 / (L0 + L1)
p(class1 | X1, X2, X3, X4) = L1 / (L0 + L1)
where L0 and L1 are likelihoods
L0 = prior_class0 * product( p(Xi|class0) )
L1 = prior_class1 * product( p(Xi|class1) )
prior_class0 and prior_class1 are Laplacian estimators
prior_class0 = (88 + 1) / (150 + 2)
prior_class1 = (62 + 1) / (150 + 2)
Orange uses LOESS for calculating conditional probabilities (I guess it's not necessary to reproduce that). For this data set it outputs 49 support points for each class, available in the Python object classifier.conditional_distributions. By using linear interpolation between the surrounding points for Xi, I can calculate p(Xi|class0) and p(Xi|class1).
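For reference, here is a minimal Python sketch of the computation described above (the (xs, ps) layout of the support points is my assumption for illustration; Orange's actual structures may differ):

import numpy as np

# Laplace-corrected priors for the 150-instance data set
prior0 = (88 + 1) / (150 + 2)
prior1 = (62 + 1) / (150 + 2)

def posterior_class0(x, points0, points1):
    # points0[i] / points1[i]: (xs, ps) pairs holding the 49 LOESS support
    # points for feature i, taken from classifier.conditional_distributions
    L0, L1 = prior0, prior1
    for i, xi in enumerate(x):
        L0 *= np.interp(xi, *points0[i])  # linear interpolation -> p(Xi|class0)
        L1 *= np.interp(xi, *points1[i])  # linear interpolation -> p(Xi|class1)
    return L0 / (L0 + L1)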
1) Can anyone comment on the Orange Bayes algorithm with continuous features?
2) Or can anyone offer technical advice on how to set up a compiler/IDE so that I could debug the Orange C++ code and inspect some intermediate results from the functions in orange/source/orange/bayes.cpp?

Orange uses a slightly different formula that, according to Kononenko, gives the same result but allows for better interpretability and m-estimation of probabilities. Instead of product( p(Xi|class0) ) it computes product( p(class0|Xi) / p(class0) ). I don't think this should affect your computation, but you can check. The code that computes those probabilities is at https://github.com/biolab/orange/blob/master/source/orange/bayes.cpp#L169. Note that it does it for all classes in parallel.
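To see why the two forms agree: by Bayes' rule, p(class0|Xi)/p(class0) = p(Xi|class0)/p(Xi), so the two products differ only by the class-independent factor product( p(Xi) ), which cancels when normalizing over classes. A quick numerical check in Python (the per-feature likelihoods below are made up purely for the check):

import numpy as np

priors = np.array([89/152, 63/152])
p_xi_given_c = np.array([[0.30, 0.10],   # p(X1|class0), p(X1|class1) -- made up
                         [0.20, 0.40]])  # p(X2|class0), p(X2|class1) -- made up

# standard form: p(c) * prod_i p(Xi|c)
L = priors * p_xi_given_c.prod(axis=0)

# Orange's form: p(c) * prod_i p(c|Xi)/p(c)
p_c_given_xi = priors * p_xi_given_c                      # Bayes' rule per feature
p_c_given_xi /= p_c_given_xi.sum(axis=1, keepdims=True)   # normalize over classes
L_orange = priors * (p_c_given_xi / priors).prod(axis=0)

assert np.allclose(L / L.sum(), L_orange / L_orange.sum())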
The other piece of the code you're interested in is the computation of probabilities from LOESS density estimates. It's at https://github.com/biolab/orange/blob/master/source/orange/estimateprob.cpp#L307. Note that most operations there are on vectors, e.g. all variables in *result *= (x-x1)/(x2-x1); are actually vectors.
As for debugging: I wrote this code (many years ago and, I'm somewhat ashamed to admit, in a terrible coding style) with Visual Studio. I forget the version and can't check, since I no longer use Windows. But I never really debugged Orange on any other OS.
If you load the project and build a debug version, you'll also have to build a debug version of Python. This is actually simple (see the instructions in the Python source code); the problem is that you'd then have to build debug versions of any other binary libraries you use as well (e.g. numpy). A simpler way is to build a release version of Orange but switch the debug info flags on. This way you can use standard Python and libraries.

Related

Is nearest centroid classifier really inefficient?

I am currently reading "Introduction to Machine Learning" by Ethem Alpaydin, where I came across nearest centroid classifiers and tried to implement one. I believe I have implemented the classifier correctly, but I am only getting 68% accuracy. So, is the nearest centroid classifier itself inefficient, or is there some error in my implementation (below)?
The data set contains 1372 data points, each with 4 features, and there are 2 output classes.
My MATLAB implementation:
DATA = load("-ascii", "data.txt");
#DATA is 1372x5 matrix with 762 data points of class 0 and 610 data points of class 1
#there are 4 features of each data point
X = DATA(:,1:4); #matrix to store all features
X0 = DATA(1:762,1:4); #matrix to store the features of class 0
X1 = DATA(763:1372,1:4); #matrix to store the features of class 1
X0 = X0(1:610,:); #to make sure both datasets have same size for prior probability to be equal
Y = DATA(:,5); # to store outputs
mean0 = sum(X0)/610; #mean of features of class 0
mean1 = sum(X1)/610; #mean of features of class 1
count = 0;
for i = 1:1372
  pre = 0;
  cost1 = X(i,:)*(mean0'); #dot product of the data point with the class-0 mean
  cost2 = X(i,:)*(mean1'); #dot product of the data point with the class-1 mean
  if (cost1<cost2)
    pre = 1;
  end
  if pre == Y(i)
    count = count+1; #counts the number of correctly predicted values
  end
end
disp("accuracy");
disp((count/1372)*100); #prints the accuracy as a percentage
There are at least a few things here:
You are using the dot product to measure similarity in the input space; this is almost never valid. The only justification for using the dot product would be the assumption that all your data points have the same norm, or that the norm does not matter (nearly never true). Try using Euclidean distance instead; even though it is very naive, it should be significantly better (see the sketch at the end of this answer).
Is it an inefficient classifier? That depends on the definition of efficiency. It is an extremely simple and fast one, but in terms of predictive power it is extremely weak. In fact, it is worse than Naive Bayes, which is itself already considered a "toy" model.
There is also something wrong with the code:
X0 = DATA(1:762,1:4); #matrix to store the features of class 0
X1 = DATA(763:1372,1:4); #matrix to store the features of class 1
X0 = X0(1:610,:); #to make sure both datasets have same size for prior probability to be equal
Once you subsample X0, you have 1220 training samples, yet later during "testing" you test on both the training data and the "missing" elements of X0; this does not really make sense from a probabilistic perspective. First, you should never test accuracy on the training set (it overestimates the true accuracy). Second, subsampling your training data does not equalize the priors in a method like this one; you are simply degrading the quality of your centroid estimate, nothing else. These kinds of techniques (sub-/over-sampling) equalize priors for models that actually model priors. Your method does not (it is basically a generative model with an assumed prior of 1/2), so nothing good can come of it.
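For reference, a minimal nearest-centroid sketch in Python/NumPy using Euclidean distance and a held-out test set (the file name and column layout come from the question; the 1000/372 split is an arbitrary choice for illustration):

import numpy as np

data = np.loadtxt("data.txt")              # 1372 x 5: four features + label
X, y = data[:, :4], data[:, 4].astype(int)

# hold out a test set instead of evaluating on the training data
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
train, test = idx[:1000], idx[1000:]

# one centroid per class, estimated on the training split only
centroids = np.array([X[train][y[train] == c].mean(axis=0) for c in (0, 1)])

# assign each test point to the class of the nearest (Euclidean) centroid
d = np.linalg.norm(X[test][:, None, :] - centroids[None, :, :], axis=2)
pred = d.argmin(axis=1)
print("accuracy:", (pred == y[test]).mean() * 100)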

How to include shadowing and fading in a wireless channel model?

I am trying to model a wireless channel in MATLAB with the following parameters:
Multipath Fading: Exponential distribution with unit mean
Shadowing: Log-normal distribution with standard deviation 8 dB
Path-loss exponent: 2.4
Path-loss constant: 30
How should I express shadowing and fading in the channel model in dB?
I tried using the log-normal and exponential distributions in MATLAB to generate random numbers with the given parameters, but I am not sure whether that is correct.
Can anyone help me?
(There is a similar question by Sjaffry, but it doesn't have any answers, and because I don't have enough reputation to comment on that topic, I asked my own question.)
More Information:
I know that:
g_i,j = 10*log10(K) - 10*log10(B) - 10*log10(T) - 10*a*log10(L_i,j)
where g_i,j is the channel gain, B is the fading gain, T is the shadowing gain, L_i,j is the distance between i and j, and K is the path-loss constant.
I wrote this code in matlab:
k = 30;
a = 2.4;
T = 8; % dB
Distance = Dist([i_x, i_y], [j_x,j_y]);
G_dB = 10*log10(k) - 10*log10(exprnd(1)) - 10*log10(random('logn', 0 , (10^(T/10)))) -10 * a * log10(Distance);
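For comparison, here is a minimal Python sketch of the quoted formula under one common convention, in which "log-normal shadowing with an 8 dB standard deviation" means the shadowing term is Gaussian directly in the dB domain (this reading is an assumption; conventions differ, and it is not what the MATLAB line above does):

import numpy as np

rng = np.random.default_rng()

K, a = 30.0, 2.4        # path-loss constant and exponent
sigma_db = 8.0          # shadowing standard deviation in dB
d = 300.0               # example distance in metres

B = rng.exponential(1.0)           # multipath fading gain, unit mean
T_db = rng.normal(0.0, sigma_db)   # shadowing gain already in dB

# channel gain in dB, following the quoted formula
# (T_db stands in for 10*log10(T) under the convention above)
g_db = 10*np.log10(K) - 10*np.log10(B) - T_db - 10*a*np.log10(d)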
Should the channel gain values (for distances of about 300 m) be greater than one or less than one?

Efficient size choice for SciPy Discrete Sine Transform

I noticed that SciPy has an implementation of the Discrete Sine Transform, and I was comparing it to the one in MATLAB. The MATLAB documentation notes that for best performance, the size of the inputs should be 2^p - 1, presumably for a divide-and-conquer strategy. Is this also true of the SciPy implementation?
Although this question is old, I happen to have just run some tests and then stumbled upon it.
The answer is yes. Internally, SciPy seems to convert the array to size M = 2*(N+1).
Ideally, M = 2^i for some integer i; therefore N should follow N = 2^i - 1. The following plot shows how timings scale with FFT size:
[Plot: timing vs. FFT size, log-log scale. Green: N = 2^i; blue: N = 2^i + 1; orange: N = 2^i - 1.]
Note that the orange line is much smoother, indicating no unexpected memory overhead.
UPDATE
After digging some more into the documentation of scipy.fftpack, I found that the above answer is only partly true. According to the documentation, "SciPy's FFTPACK has efficient functions for radix {2, 3, 4, 5}". This means that instead of only handling arrays of size M = 2^i efficiently, it can handle any M = 2^i * 3^j * 5^k (4 not being prime). The optimum for scipy.fftpack.dst (or dct) is then N = M - 1. Finding those numbers can be a little awkward, but luckily there's a function for that, too!
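Presumably the helper alluded to is scipy.fftpack.next_fast_len (an assumption on my part, since the answer doesn't name it); a sketch of how it would be used:

import numpy as np
from scipy.fftpack import dst, next_fast_len

n_desired = 1000
# choose N = M - 1 with M a fast (5-smooth) length, per the reasoning above
n_fast = next_fast_len(n_desired + 1) - 1

x = np.random.rand(n_fast)
y = dst(x)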
Please note that the plot above is on a log-log scale, so speedups of 40x or so are not uncommon. Choosing a fast size can therefore make your calculations orders of magnitude faster! (I found this out the hard way.)

Regarding Time scale issue in Netlogo

I am a new user of NetLogo. I have a system of reactions (converted to ordinary differential equations) which can be solved using Matlab. I want to develop the same model in NetLogo, for comparison with the Matlab results. I am confused about time/ticks, because NetLogo uses "ticks" to increment time, whereas Matlab uses time in seconds. How do I convert my Matlab seconds to a number of ticks? Can anyone help me write the code? The model is:
A + B ---> C (with rate constant k1 = 1e-6)
2A+ C ---> D (with rate constant k2 = 3e-7)
A + E ---> F (with rate constant k3 = 2e-5)
Initial values are A = B = C = 500, D = E = F = 10
Initial time t=0 sec and final time t=6 sec
First, a general comment: NetLogo is intended for agent-based modelling (ABM), in which multiple entities with different characteristics interact in some way. ABM is not really an appropriate methodology for solving ODEs. If your goal is simply to build your model in something other than Matlab for comparison, rather than specifically requiring NetLogo, I can recommend Vensim as more appropriate. Having said that, you can build the model you want in NetLogo; it is just very awkward.
NetLogo handles time discretely rather than continuously. You can have any number of ticks per second (I would suggest 10, so that your final time of 6 seconds becomes 60 ticks). You will need to convert your equations into a discrete form, so your rates would be something like k1-discrete = k1 / 10 (see the sketch below). You may have precision problems with very small numbers.
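For illustration, here is a minimal Python sketch of the tick-based update this implies, assuming mass-action kinetics for the three reactions (that kinetics assumption is mine; in NetLogo the same update would run once per tick inside go):

k1, k2, k3 = 1e-6, 3e-7, 2e-5
ticks_per_second = 10                     # the suggestion above
dt = 1.0 / ticks_per_second
A, B, C, D, E, F = 500.0, 500.0, 500.0, 10.0, 10.0, 10.0

for tick in range(6 * ticks_per_second):  # 6 simulated seconds
    r1 = k1 * A * B                       # A + B  -> C
    r2 = k2 * A**2 * C                    # 2A + C -> D
    r3 = k3 * A * E                       # A + E  -> F
    A += dt * (-r1 - 2*r2 - r3)
    B += dt * (-r1)
    C += dt * (r1 - r2)
    D += dt * r2
    E += dt * (-r3)
    F += dt * r3

print(A, B, C, D, E, F)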

Solving Algebraic Equations Programmatically [closed]

Closed. This question is off-topic and is not currently accepting answers. Closed 13 years ago.
I have six parametric equations using 18 (not actually 26) different variables, 6 of which are unknown.
I could sit down with a couple of pads of paper and work out what the equations for each of the unknowns are, but is there a simple programmatic solution (I'm thinking of Matlab) that will spit out the six equations I'm looking for?
EDIT:
Shame this has been closed, but I guess I can see why. In case anyone is still interested, the equations are (I believe) non-linear:
r11^2 = (l_x1*s_x + m_x)^2 + (l_y1*s_y + m_y)^2
r12^2 = (l_x2*s_x + m_x)^2 + (l_y2*s_y + m_y)^2
r13^2 = (l_x3*s_x + m_x)^2 + (l_y3*s_y + m_y)^2
r21^2 = (l_x1*s_x + m_x - t_x)^2 + (l_y1*s_y + m_y - t_y)^2
r22^2 = (l_x2*s_x + m_x - t_x)^2 + (l_y2*s_y + m_y - t_y)^2
r23^2 = (l_x3*s_x + m_x - t_x)^2 + (l_y3*s_y + m_y - t_y)^2
(Squared the r's, good spot @gnovice!)
where I need to find t_x, t_y, m_x, m_y, s_x and s_y.
Why am I calculating these? There are two points, p1 (at 0,0) and p2 (at t_x,t_y). For each of three coordinates (l_x,l_y){1,2,3} I know the distances (r1 & r2) to that point from p1 and p2, but in a different coordinate system. The variables s_x and s_y define how much I'd need to scale one set of coordinates to get to the other, and m_x and m_y how much I'd need to translate (with t_x and t_y being a way to account for rotation differences between the two systems).
Oh, and I forgot to mention: I also know that the point (l_x,l_y) is below the higher of p1 and p2, i.e. l_y < max(0,t_y), as well as l_y > 0 and l_y < t_y.
It does seem specific enough that I might have to just get my pad out and work it through mathematically!
If you have the Symbolic Toolbox, you can use the SOLVE function. For example:
>> solve('x^2 + y^2 = z^2','z') %# Solve for the symbolic variable z
ans =
(x^2 + y^2)^(1/2)
-(x^2 + y^2)^(1/2)
You can also solve a system of N equations for N variables. Here's an example with 2 equations, 2 unknowns to solve for (x and y), and 6 parameters (a through f):
>> S = solve('a*x + b*y = c','d*x - e*y = f','x','y')
>> S.x
ans =
(b*f + c*e)/(a*e + b*d)
>> S.y
ans =
-(a*f - c*d)/(a*e + b*d)
Are they linear? If so, you can use principles of linear algebra to set up a 6x6 matrix that represents the system of equations and solve it with any standard routine.
If they are not linear, then you need to use numerical analysis methods.
As I recall from many years ago, you then create a system of linear approximations to the non-linear equations and solve that linear system over and over again iteratively, feeding the answers back into the inputs each time, until some error metric gets sufficiently small to indicate you have reached the solution. It's obviously done with a computer, and I'm sure there are numerical analysis packages that will do this for you. However, since an arbitrary system of non-linear equations can have almost any type and level of complexity, such packages generally can't construct the linear approximations for you (except perhaps in the most straightforward standard cases), and you will have to do that part manually.
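That iterate-on-linearizations procedure is essentially Newton's method for systems; here is a minimal sketch in Python (the example system at the bottom is made up purely for illustration):

import numpy as np

def newton(f, jac, x0, tol=1e-10, max_iter=50):
    # repeatedly solve the local linear approximation f(x) + J(x)*dx = 0,
    # feeding each answer back in until the step size is sufficiently small
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        dx = np.linalg.solve(jac(x), -f(x))
        x = x + dx
        if np.linalg.norm(dx) < tol:
            break
    return x

# made-up example: x^2 + y^2 = 4 and x*y = 1
f = lambda u: np.array([u[0]**2 + u[1]**2 - 4, u[0]*u[1] - 1])
jac = lambda u: np.array([[2*u[0], 2*u[1]], [u[1], u[0]]])
print(newton(f, jac, [2.0, 0.5]))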
Yes there is (assuming these are linear equations): you do this by creating a matrix equation which is equivalent to your 6 linear equations. For example, if you had the two equations:
6x + 12y = 9
7x - 8y = 14
This could be equivalently represented as:
|6 12| |x|   |9 |
|7 -8| |y| = |14|
(where the two matrices are multiplied together). Matlab can then solve this for the solution vector (x, y).
I don't have matlab installed, so I'm afraid I'm going to have to leave the details up to you :-)
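To fill in those details, here is the same example in Python/NumPy (an illustration in a different language; in Matlab the idiomatic equivalent would be the backslash operator, A\b):

import numpy as np

A = np.array([[6.0, 12.0],
              [7.0, -8.0]])
b = np.array([9.0, 14.0])

x, y = np.linalg.solve(A, b)   # LU-based solve; preferred over inverting A
print(x, y)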
As mentioned above, the answer will depend on whether your equations are linear or nonlinear. For linear systems, you can set up a simple matrix equation (but don't use matrix inversion; use LU decomposition, assuming your system is well-conditioned).
For non-linear systems, you'll need to use a more advanced solver, most likely some variation on Newton's method. Essentially you'll give Matlab your six equations and ask it to simultaneously solve for the root (zero) of all of them. There are several caveats and complications that come into play when dealing with non-linear systems, one of which is the need for an initial guess that assigns each of your six unknown variables a value close to the true solution. Without a good initial guess, the solver may take a long time to find a solution, or may not converge to a solution at all, even if one exists.
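For illustration, the same kind of simultaneous root-finding is available in SciPy as scipy.optimize.fsolve (a hybrid Newton-type solver); the two-equation system below is made up, but it shows where the initial guess enters:

import numpy as np
from scipy.optimize import fsolve

def residuals(u):
    x, y = u
    # made-up example system: x^2 + y = 4 and x*y = 3, written as residuals
    return [x**2 + y - 4.0, x*y - 3.0]

guess = [0.9, 2.8]                 # a good initial guess matters, as noted
print(fsolve(residuals, guess))    # finds the nearby root, approx. (1, 3)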
Decades ago, MIT developed MACSYMA, a symbolic algebra system for just this kind of thing. MIT sold MACSYMA to Symbolics, which has pretty well folded, dried up, and blown away. However, because of the miracle of military funding, an early version of MACSYMA was required to be released to the government. THAT version was subsequently released under the GPL, and is continuing to be maintained, under the name MAXIMA.
See http://maxima.sourceforge.net/ for more information.