How to update weights on Scala-Graph?

I'm using scala-graph for some graph computations in Scala, and I can't seem to figure out one simple thing: how do I update the weight of an edge?
Let's say I have:
import scalax.collection.Graph
import scalax.collection.GraphPredef._
import scalax.collection.edge.Implicits._ // for the ~> and % weighted-edge syntax
val g = Graph(1~>2 % 1, 2~>3 % 1, 1~>3 % 3)
and now I'd like to create g2 which will be the same as g but with 1~>2 % 2. How do I do that?

There doesn't seem to be any native method to update the weight of an edge. What you can do instead is remove the edge and add a new one with a different weight:
scala> g - 1~>3 % 3 + 1~>3 % 1337
res = Graph(1, 2, 3, 1~>2 % 1, 1~>3 % 1337, 2~>3 % 1)
Edit: Note that the weight of the edge that is being removed, 1~>3 % <weight>, can have any value, since edges aren't identified by their weight.
See this thread for more details.
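To avoid repeating that dance, a minimal helper can wrap it. This is a sketch of my own (not part of scala-graph's API), assuming the 1.x WDiEdge type and the edge Implicits used above; since edges aren't identified by their weight, the removal can use any placeholder weight:
import scalax.collection.Graph
import scalax.collection.edge.WDiEdge
import scalax.collection.edge.Implicits._

// Hypothetical helper: returns a copy of g with `from ~> to` reweighted to w.
// The `% 0` on removal is a placeholder; the edge matches regardless of weight.
def withWeight(g: Graph[Int, WDiEdge], from: Int, to: Int, w: Long): Graph[Int, WDiEdge] =
  g - (from ~> to % 0) + (from ~> to % w)

val g2 = withWeight(g, 1, 2, 2) // Graph(1, 2, 3, 1~>2 % 2, 1~>3 % 3, 2~>3 % 1)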

Minimize difference between indicator variables in Matlab

I'm new to Matlab and want to write a program that chooses the value of a parameter P to minimize the difference between two vectors, where each vector is a variable in a dataset. The first vector (call it A) is a predetermined vector of 1s and 0s. The second vector (call it B) has each of its entries determined by an indicator function that depends on the value of the parameter P and on other variables in the dataset. For instance, let C be a third variable in the dataset, so
A = [1, 0, 0, 1, 0]
B = [x, y, z, u, v]
where x = 1 if (C[1]+10)^0.5 - P > (C[1])^0.5 and otherwise x = 0, and similarly, y = 1 if (C[2]+10)^0.5 - P > (C[2])^0.5 and otherwise y = 0, and so on.
I'm not really sure where to start with the code, except that it might be useful to use the fminsearch command. Any suggestions?
Edit: I changed the above by raising to a power, which is closer to the actual example that I have. I'm also providing a complete example in response to a comment:
Let A be as above, and let C = [10, 1, 100, 1000, 1]. Then my goal with the Matlab code would be to choose a value of P to minimize the differences between the coordinates of the vectors A and B, where B[1] = 1 if (10+10)^0.5 - P > (10)^0.5 and otherwise B[1] = 0, and similarly B[2] = 1 if (1+10)^0.5 - P > (1)^0.5 and otherwise B[2] = 0, etc. So I want to choose P to maximize the likelihood that A[1] = B[1], A[2] = B[2], etc.
I have the following setup in Matlab, where ds is the name of my dataset:
ds.B = zeros(size(ds,1),1); % empty vector to fill
for i = 1:size(ds,1)
    if ((ds.C(i) + 10)^(0.5) - P > (ds.C(i))^(0.5))
        ds.B(i) = 1;
    else
        ds.B(i) = 0;
    end
end
Now I want to choose the value of P to minimize the difference between A and B. How can I do this?
EDIT: I'm also wondering how to do this when the inequality is something like (C[i]+10)^0.5 - P*D[i] > (C[i])^0.5, where D is another variable in my dataset. Now P is a scalar being multiplied rather than just added. This seems more complicated since I can't solve for P exactly. How can I solve the problem in this case?
EDIT 1: It seems fminbnd() isn't optimal, likely due to the stairstep nature of the indicator function. I've updated to test the midpoints of all the regions between indicator function flips, plus endpoints.
EDIT 2: Updated to include dataset D as a coefficient of P.
If you can package your distance calculation up in a single function based on P, you can then search for its minimum.
arraySize = 1000;
ds.A = double(rand([arraySize,1]) > 0.5);
ds.C = rand(size(ds.A));
ds.D = rand(size(ds.A));
B = @(P)double((ds.C+10).^0.5 - P.*ds.D > ds.C.^0.5);
costFcn = @(P)sqrt(sum((ds.A-B(P)).^2));
% Solving the equation (C+10)^0.5 - P*D = C^0.5 for P, and sorting the results
BCrossingPoints = sort(((ds.C+10).^0.5-ds.C.^0.5)./ds.D);
% Taking the average of each crossing point with its neighbors
BMidpoints = (BCrossingPoints(1:end-1)+BCrossingPoints(2:end))/2;
% Appending endpoints onto the midpoints
PsToTest = [BCrossingPoints(1)-0.1; BMidpoints; BCrossingPoints(end)+0.1];
% Calculate the distance from A to B at each P to test
costResult = arrayfun(costFcn,PsToTest);
% Find the minimum cost
[~,lowestCostIndex] = min(costResult);
% Find the optimum P
optimumP = PsToTest(lowestCostIndex);
ds.B = B(optimumP);
semilogx(PsToTest,costResult)
xlabel('P')
ylabel('Distance from A to B')
1.- x is assumed to be real and positive only, because for x<0 complex values show up.
Since no comment is made in the question, it seems reasonable to assume x real and x>0 only.
As requested, P 'the parameter' is a scalar. It turns out P only has 2 significant states, >0 or <0; let's see why.
2.- The following lines generate kind-of-random A and C.
Then a sweep of p is carried out and distances d1 and d2 are calculated.
d1 is the Euclidean distance, and d2 is the absolute value of the difference between A and B after converting both from binary to decimal:
N=10
% A=[1 0 0 1 0]
A=randi([0 1],1,N);
% C=[10 1 1e2 1e3 1]
C=randi([0 1e3],1,N)
p=[-1e4:1:1e4]; % parameter to optimize
B=zeros(1,numel(A));
d1=zeros(1,numel(p)); % Euclidean distance
d2=zeros(1,numel(p)); % difference distance
for k1=1:1:numel(p)
    B=(C+10).^.5-p(k1)>C.^.5;
    d1(k1)=(sum((B-A).^2))^.5;
    d2(k1)=abs(sum(A.*2.^[numel(A)-1:-1:0])-sum(B.*2.^[numel(A)-1:-1:0]));
end
figure;
plot(p,d1)
grid on
xlabel('p');title('d1')
figure
plot(p,d2)
grid on
xlabel('p');title('d2')
The only degree of freedom to optimise seems to be the sign of P, regardless of the value of |P|.
3.- f(p,x) has either no root, or just one root, depending upon p
The threshold function is: B(k)=1 if f(x)>0, else B(k)=0, where
f(p,x)=(x+10)^.5-p-x^.5
Now
(x+10).^.5-p>x.^.5 is the same as (x+10).^.5-x.^.5>p
There is a range of p for which f(p,x)=0 has no (real) root at all.
For the particular case p=0, the curves (x+10).^.5 and x.^.5 do not intersect (the gap only closes as x approaches Inf):
x=linspace(0,100,1e3); % x range chosen for illustration (not specified in the original)
figure;plot(x,(x+10).^.5,x,x.^.5);grid on
y2=diff((x+10).^.5-x.^.5)
figure;plot(x(2:end),y2);
grid on;xlabel('x')
title('y2=diff((x+10).^.5-x.^.5)')
This means that the condition f(x)>0 is always true, holding all bits of B at 1. With B all 1s, d(A,B) turns into d(A,1), a constant.
However, for large enough p the condition f(x)>0 is always false, keeping all bits of B at 0.
In this case the cost function d(A,B) turns into d(A,0), which is determined by A itself.
4.- P as a vector
The optimization gains degrees of freedom if, instead of a scalar, P is considered as a vector.
For a given x there's a value of p that switches B(k) between 0 and 1: any value of p above this threshold keeps B(k)=0, and any value below it keeps B(k)=1.
Equivalently, inverting f(x), B(k)=1 when
g(p)=(10-p^2)^2/(4*p^2)>x
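For reference, the inversion is just solving f(p,x)=0 for x by squaring twice:
(x+10)^.5 - x^.5 = p
(x+10)^.5 = p + x^.5
x + 10 = p^2 + 2*p*x^.5 + x
10 - p^2 = 2*p*x^.5
x = (10-p^2)^2/(4*p^2)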
Values of x below this threshold bring B(k) to 1, so choosing p per element allows each bit of B to be flipped to the value of the corresponding element of A.
Therefore, it's convenient to consider P as a vector, not a scalar, and:
for all, or as many as possible, elements of C to meet c(k)<(10-p^2)^2/(4*p^2) in order to get B=A, or to
minimize d(A,B)
5.- roots of f(p,x)
syms t positive
p=[-1000:.1:1000];
zp=NaN*ones(1,numel(p));
sol=zeros(1,numel(p));
for k1=1:1:numel(p)
    p(k1)
    eq1=(t+10)^.5-p(k1)-t^.5-p(k1)==0;
    s1=solve(eq1,t);
    if ~isempty(s1)
        zp(k1)=s1;
    end
end
nzp=~isnan(zp);
zp(nzp)
returns
ans =
  620.0100  151.2900   64.5344   34.2225   20.2500   12.7211
    8.2451    5.4056    3.5260    2.2500    1.3753    0.7803
    0.3882    0.1488    0.0278

How to run an exponential decay mixed model?

I am not familiar with nonlinear regression and would appreciate some help with running an exponential decay model in R. Please see the graph for what the data look like. My hunch is that an exponential model might be a good choice. I have one fixed effect and one random effect: y ~ x + (1|random factor). How do I get starting values for the exponential model in R (please assume that I know nothing about nonlinear regression)? How do I subsequently run a nonlinear model with these starting values? Could anyone please help me with the logic as well as the R code?
As I am not familiar with nonlinear regression, I haven't been able to attempt it in R.
[figure: raw plot of the data]
The correct syntax will depend on your experimental design and model but I hope to give you a general idea on how to get started.
We begin by generating some data that should match the type of data you are working with. You had mentioned a fixed factor and a random one. Here, the fixed factor is represented by the variable treatment and the random factor is represented by the variable grouping_factor.
library(nlraa)
library(nlme)
library(ggplot2)
## Setting this seed should allow you to reach the same result as me
set.seed(3232333)
example_data <- expand.grid(treatment = c("A", "B"),
                            grouping_factor = c('1', '2', '3'),
                            replication = c(1, 2, 3),
                            xvar = 1:15)
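As a quick sanity check (my addition, not part of the original answer), the design above has 2 treatments x 3 grouping levels x 3 replications x 15 x-values:
nrow(example_data) ## 270, matching "Number of Observations" in the model output below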
The next step is to create some "observations". Here, we use an exponential function y = a*exp(c*x) and some random noise to create the data. Also, we add a constant to treatment A just to create some treatment differences.
example_data$y <- ave(example_data$xvar,
                      example_data[, c('treatment', 'replication', 'grouping_factor')],
                      FUN = function(x) {expf(x = x, a = 10, c = -0.3) + rnorm(1, 0, 0.6)})
example_data$y[example_data$treatment == 'A'] <- example_data$y[example_data$treatment == 'A'] + 0.8
All right, now we start fitting the model.
## Create a grouped data frame
exampleG <- groupedData(y ~ xvar | grouping_factor, data = example_data)
## Fit a separate model to each grouped level
fitL <- nlsList(y ~ SSexpf(xvar, a, c), data = exampleG)
## Fit the mixed-effects model, starting from the nlsList fits
## (this step is implied by the original, which refers to fit1 below)
fit1 <- nlme(fitL)
## Grab the coefficients of the general model
fxf <- fixed.effects(fit1)
## Add treatment as a fixed effect. Also, use the coefficients from the previous
## regression model as starting values.
fit2 <- update(fit1, fixed = a + c ~ treatment,
               start = c(fxf[1], 0,
                         fxf[2], 0))
Looking at the model output, it will give you information like the following:
Nonlinear mixed-effects model fit by maximum likelihood
  Model: y ~ SSexpf(xvar, a, c)
  Data: exampleG
       AIC      BIC    logLik
  475.8632 504.6506 -229.9316

Random effects:
 Formula: list(a ~ 1, c ~ 1)
 Level: grouping_factor
 Structure: General positive-definite, Log-Cholesky parametrization
              StdDev       Corr
a.(Intercept) 3.254827e-04 a.(In)
c.(Intercept) 1.248580e-06 0
Residual      5.670317e-01

Fixed effects: a + c ~ treatment
                  Value Std.Error  DF   t-value p-value
a.(Intercept)  9.634383 0.2189967 264  43.99329  0.0000
a.treatmentB   0.353342 0.3621573 264   0.97566  0.3301
c.(Intercept) -0.204848 0.0060642 264 -33.77976  0.0000
c.treatmentB  -0.092138 0.0120463 264  -7.64867  0.0000
 Correlation:
              a.(In) a.trtB c.(In)
a.treatmentB  -0.605
c.(Intercept) -0.785  0.475
c.treatmentB   0.395 -0.792 -0.503

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max
-1.93208903 -0.34340037  0.04767133  0.78924247  1.95516431

Number of Observations: 270
Number of Groups: 3
Then, if you wanted to visualize the model fit, you could do the following.
## Here we store the model predictions for visualization purposes
predictionsDf <- cbind(example_data,
predict_nlme(fit2, interval = 'conf'))
## Here we make a graph to check it out
ggplot()+
geom_ribbon(data = predictionsDf,
aes( x = xvar , ymin = Q2.5, ymax = Q97.5, fill = treatment),
color = NA, alpha = 0.3)+
geom_point(data = example_data, aes( x = xvar, y = y, col = treatment))+
geom_line(data = predictionsDf, aes(x = xvar, y = Estimate, col = treatment), size = 1.1)
This shows the model fit.
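If you also want the parameter estimates and their uncertainty on the console, here is a minimal sketch (my addition, assuming fit2 as fitted above):
## Fixed-effect estimates: a and c for the reference treatment, plus treatment-B offsets
fixef(fit2)
## Approximate confidence intervals for the fixed effects
intervals(fit2, which = "fixed")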

Why is my logical mask not working on a 2D matrix in matlab properly?

X(100,371)
%% contains 100 datapoints for 371 variables
I want to keep only the data which lie between mean-standard deviation and mean+standard deviation.
This is how I am proceeding:
mx=mean(X); sx=std(X);
%% generate mean, std
%% this gives mx(1,371) and sx(1,371)
mx=double(repmat(mx,100,1));
%% this fills a matrix with the same datapoints, 100 times
sx=double(repmat(sx,100,1));
%% this gives mx(100,371) and sx(100,371)
g=X>mx-sx & X<mx+sx;
%% this creates a logical mask g(100,371) filled with 1s and 0s
test(g)=X(g);
%% this should give me test(100,371), but I get
%% test(1,37100), which is wrong as it doesn't maintain
%% the shape of X
test=reshape(test,100,371)
%% but when I compare this to my original matrix
%% X(100,371) I hardly see a difference (datapoints
%% in test are still outside the range I want).
What am I doing wrong?
There is just a small syntax issue with the line
test(g) = X(g);
When MATLAB executes X(g), it returns all the elements of X for which g is 1. At the assignment, since test does not exist yet, test(g) creates a test variable just big enough to be indexed by g, which is 1x37100, and then assigns all the elements at the right places. Long story short, before the assignment you can add something like:
test = zeros(size(X));
While we are at it, you could use bsxfun to get the logical indexing without having to do repmat:
g = bsxfun(@gt,X,mx - sx) & bsxfun(@lt,X,mx + sx)
In R2016b or later there is implicit expansion, so neither repmat nor bsxfun is needed:
g = X > mx - sx & X < mx + sx
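Putting it all together, a minimal sketch of the corrected flow (my consolidation of the answer above; swap zeros for NaN if you want excluded points marked rather than zeroed):
mx = repmat(mean(X), size(X,1), 1);
sx = repmat(std(X), size(X,1), 1);
g = X > mx - sx & X < mx + sx;  % logical mask, same shape as X
test = zeros(size(X));          % preallocate so the shape of X is preserved
test(g) = X(g);                 % out-of-range points stay 0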

How to display the similarity between classes?

I have handwritten samples from two writers. I am using a feature extractor to extract features from both.
I want to display the similarity between the classes. As to show how identical both are and how difficult it can be for a classifier to classify them correctly.
I have read papers that use PCA to demonstrate this. I tried PCA, but I don't think I'm implementing it correctly. I'm using this to display the similarity:
[COEFF,SCORE] = princomp(features_extracted);
plot(COEFF,'.')
But for every class and every sample I get exactly the same plot. I mean they should be similar not exactly the same. What am I doing wrong?
You will struggle to show anything significant with only 10 samples per class, and over 4000 features.
Nevertheless, the following code will calculate PCA and show the relationship between the first two principal components (the components that contain 'most' variance).
% Truly indistinguishable data
dummy_data = randn(20, 4000);
% Uncomment this to make the data distinguishable
%dummy_data(1:10, :) = dummy_data(1:10, :) - 0.5;
% Normalise the data - this isn't technically required for the dummy data
% above, but is included for completeness.
dummy_data_normalised = dummy_data;
for f = 1:size(dummy_data_normalised, 2)
    dummy_data_normalised(:, f) = dummy_data_normalised(:, f) - nanmean(dummy_data_normalised(:, f));
    dummy_data_normalised(:, f) = dummy_data_normalised(:, f) / nanstd(dummy_data_normalised(:, f));
end
% Generate vector of 10 0's and 10 1's
class_labels = reshape(repmat([0 1], 10, 1), 20, 1);
% Perform PCA
pca_coeffs = pca(dummy_data_normalised);
% Calculate transformed data
dummy_data_pca = dummy_data_normalised * pca_coeffs;
figure;
hold on;
for class = unique(class_labels)'
    % Plot first two components of each class
    scatter(dummy_data_pca(class_labels == class, 1), dummy_data_pca(class_labels == class, 2), 'filled')
end
legend(strcat({'Class '},int2str(unique(class_labels)))')
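As a side note (my addition, not part of the original answer): pca can also return the projected scores directly as its second output, which makes the manual multiplication above unnecessary since the data are already centred:
[pca_coeffs, dummy_data_pca] = pca(dummy_data_normalised);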
For indistinguishable data, this will show a scatter plot similar to the following:
Clearly it is not possible to draw a separation boundary between the two classes.
If you uncomment the fifth line to make the data distinguishable, then the plot will instead come out as follows:
However, to repeat what I wrote in my comment, PCA does not necessarily find the components that give the best separation. It is an unsupervised method and only finds the components with the largest variance. In some applications, this is also the components that give good separation. With only 10 samples per class, you will not be able to demonstrate anything statistically significant. Also have a look at this question for more details on PCA and the number of samples per class.
EDIT: This also extends naturally to having more classes:
number_of_classes = 10;
samples_per_class = 20;
% Truly indistinguishable data
dummy_data = randn(number_of_classes * samples_per_class, 4000);
% Make the data distinguishable
for i = 1:number_of_classes
    idx = (((i - 1) * samples_per_class) + 1):(i * samples_per_class);
    dummy_data(idx, :) = dummy_data(idx, :) - (0.5 * (i - 1));
end
% Normalise the data
dummy_data_normalised = dummy_data;
for f = 1:size(dummy_data_normalised, 2)
    dummy_data_normalised(:, f) = dummy_data_normalised(:, f) - nanmean(dummy_data_normalised(:, f));
    dummy_data_normalised(:, f) = dummy_data_normalised(:, f) / nanstd(dummy_data_normalised(:, f));
end
% Generate vector of classes (1 to number_of_classes)
class_labels = reshape(repmat(1:number_of_classes, samples_per_class, 1), number_of_classes * samples_per_class, 1);
% Perform PCA
pca_coeffs = pca(dummy_data_normalised);
% Calculate transformed data
dummy_data_pca = dummy_data_normalised * pca_coeffs;
figure;
hold on;
for class = unique(class_labels)'
    % Plot first two components of each class
    scatter(dummy_data_pca(class_labels == class, 1), dummy_data_pca(class_labels == class, 2), 'filled')
end
legend(strcat({'Class '},int2str(unique(class_labels)))')

How do I calculate conditional probabilities from data

I'm implementing naive Bayes in Matlab, and it was all going well until I got to the conditional probabilities. I know the formula for the conditional, p(A|B) = p(A and B)/p(B), but when I have to estimate it from data I'm lost. The data is:
1,0,3,0,?,0,2,2,2,1,1,1,1,3,2,2,1,2,2,0,2,2,2,2,1,2,2,2,3,2,1,1,1,3,3,2,2,1,2,2,2,1,2,2,2,2,2,2,2,2,2,2,1,1,1,2,2
1,0,3,3,1,0,3,1,3,1,1,1,1,1,3,3,1,2,2,0,0,2,2,2,1,2,1,3,2,3,1,1,1,3,3,2,2,2,1,2,2,2,1,2,2,1,2,2,2,2,2,2,2,2,1,2,2
1,0,3,3,2,0,3,3,3,1,1,1,0,3,3,3,1,2,1,0,0,2,2,2,1,2,2,3,2,3,1,3,3,3,1,2,2,1,2,2,2,1,2,2,1,2,2,2,2,2,2,2,2,2,2,1,2
1,0,2,3,2,1,3,3,3,1,2,1,0,3,3,1,1,2,2,0,0,2,2,2,2,1,3,2,3,3,1,3,3,3,1,1,1,1,2,2,2,2,1,2,2,2,1,2,2,2,2,2,2,2,2,2,2
1,0,3,2,1,1,3,3,3,2,2,2,1,1,2,2,2,2,2,0,0,2,2,2,1,1,2,3,2,2,1,1,1,3,2,1,2,2,1,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,1,2,2
1,0,3,3,2,0,3,3,3,1,2,2,0,3,3,3,2,2,1,0,0,1,2,2,2,1,3,3,1,2,2,3,3,3,2,1,2,2,1,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,1,2
1,0,3,2,1,0,3,3,3,1,2,1,2,3,3,3,3,2,2,0,0,2,2,2,2,1,3,2,2,2,2,3,3,3,2,1,1,2,2,1,2,1,2,2,2,2,1,2,2,2,2,1,2,2,2,1,2
1,0,2,2,1,0,3,1,3,3,3,3,2,1,3,3,1,2,2,0,0,1,1,2,1,2,1,3,2,1,1,3,3,3,2,2,1,2,1,2,2,1,2,2,2,1,2,2,2,1,2,2,2,2,1,2,2
1,0,3,1,1,0,3,1,3,1,1,1,3,2,3,3,1,2,2,0,0,2,2,2,1,2,1,2,1,1,1,3,3,3,3,2,2,1,2,2,2,1,2,2,1,2,2,2,2,2,2,2,2,2,1,2,2
2,0,2,3,2,0,2,2,2,1,2,2,2,2,2,2,1,2,2,2,2,2,2,1,3,2,3,3,3,3,3,3,3,3,2,1,2,1,2,2,2,2,2,2,2,2,2,2,2,2,1,3,2,1,1,2,2
2,0,2,2,0,0,3,2,3,1,1,3,1,3,1,1,2,2,2,0,2,1,1,2,1,1,2,2,2,2,1,3,3,3,1,2,2,1,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
2,0,2,3,2,0,1,2,1,1,2,1,0,1,2,2,1,2,1,0,2,2,2,2,1,2,1,2,2,3,1,3,3,3,1,2,2,1,2,2,2,2,1,2,2,2,2,2,2,2,2,2,1,1,2,2,1
2,0,2,1,1,0,1,2,2,1,2,1,1,2,2,2,1,2,2,0,2,2,2,2,1,2,1,3,2,2,1,1,1,1,1,2,2,1,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2
2,0,2,2,1,1,2,3,3,1,1,1,1,2,2,2,1,2,2,0,1,2,2,2,1,2,1,2,2,2,1,1,1,3,2,1,1,2,1,2,2,2,1,2,2,1,2,2,2,2,2,2,1,1,1,2,2
2,1,3,0,?,1,1,2,2,1,1,1,1,2,1,1,1,2,2,0,2,2,2,2,1,2,2,2,2,2,3,3,3,3,1,1,2,1,2,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,1,2,1
2,0,3,2,2,1,2,2,2,1,1,2,1,2,3,3,2,2,2,0,1,2,2,2,1,2,3,2,2,1,2,2,2,3,1,3,2,1,2,2,2,1,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2
2,0,3,2,2,0,1,1,3,1,1,1,0,1,3,3,1,2,2,0,2,2,2,2,1,1,2,2,2,2,1,3,3,3,3,3,1,2,2,1,2,1,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2
2,0,2,1,1,0,2,1,3,1,1,1,0,3,1,3,1,2,2,0,0,1,2,2,3,3,3,2,2,2,1,3,3,3,1,1,1,2,1,2,2,2,1,2,1,2,2,2,2,2,2,2,1,1,1,2,2
2,0,2,0,?,0,2,3,3,3,2,1,0,2,2,1,1,1,2,0,0,2,1,2,1,2,3,2,2,3,1,3,3,3,2,1,1,2,1,2,2,2,3,2,2,2,2,2,2,2,2,2,2,2,2,1,2
2,0,1,2,1,0,3,3,3,1,2,2,1,1,3,3,1,2,2,0,0,2,2,2,1,2,1,3,2,3,1,1,1,3,1,1,2,2,1,2,2,2,1,2,2,2,2,2,2,2,2,2,1,1,2,2,1
2,0,2,0,?,1,3,3,3,1,2,1,1,3,3,3,1,2,2,0,0,2,2,2,2,1,1,2,3,2,1,1,1,3,1,3,1,1,2,2,2,1,2,2,1,2,2,2,2,2,2,1,2,2,1,2,2
2,0,3,3,2,0,2,1,3,1,1,3,3,3,3,3,1,2,2,0,0,2,2,1,1,2,2,3,3,3,3,3,3,3,2,2,2,1,2,1,2,1,2,2,2,2,2,2,2,1,2,2,2,2,2,1,2
3,0,2,3,1,1,2,2,1,1,1,1,1,1,2,2,1,2,2,2,2,1,2,1,1,1,1,2,2,3,1,3,3,3,1,1,1,3,1,3,3,3,3,3,3,3,3,3,3,3,3,1,3,3,2,2,1
3,0,2,3,1,1,1,2,1,1,1,2,1,1,1,2,2,1,1,1,2,1,2,1,1,2,2,2,2,2,1,3,3,3,2,2,2,3,3,1,1,2,2,3,2,2,2,2,2,2,2,2,2,2,2,2,1
3,0,3,3,1,0,3,3,1,1,1,2,1,1,2,2,2,2,2,2,2,1,1,1,1,1,2,2,2,2,3,3,3,3,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,2,2,2,2,1
3,0,2,3,2,0,1,2,2,1,2,1,2,1,1,1,2,1,2,2,1,2,1,2,2,1,3,2,1,1,2,2,2,2,1,1,2,2,?,2,1,1,1,2,2,2,1,2,2,2,1,3,1,2,2,1,2
3,0,2,2,2,0,2,1,2,1,1,1,0,2,2,3,1,2,2,2,2,2,2,2,3,3,3,2,2,1,2,2,2,2,3,1,2,2,2,2,1,2,1,1,2,2,1,2,2,2,2,2,2,2,1,2,1
3,0,2,2,1,0,2,2,2,1,1,2,0,2,2,2,1,2,2,2,2,2,2,2,1,2,1,3,3,3,1,3,3,2,2,3,1,2,1,3,2,2,3,2,2,2,3,3,3,2,2,3,2,2,2,2,1
3,0,3,2,2,0,2,2,2,1,1,2,0,2,2,2,1,2,2,2,2,2,2,1,1,2,2,2,2,2,2,1,1,1,2,1,1,3,1,3,3,3,2,3,2,2,2,2,2,2,3,1,2,2,2,2,2
3,0,2,1,1,0,2,2,1,1,1,1,0,1,1,1,2,1,2,0,2,1,1,1,1,1,2,2,1,2,1,3,3,3,1,1,3,3,3,2,3,1,2,2,3,3,2,2,2,3,2,2,2,2,2,2,1
3,0,2,3,2,1,2,2,3,1,1,2,1,2,2,2,1,2,2,0,2,2,2,1,1,2,2,2,2,2,1,2,2,3,2,2,2,1,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2
3,0,2,3,1,0,2,3,3,1,1,1,1,2,2,2,1,2,2,0,2,2,2,2,1,2,1,2,2,2,1,1,1,1,1,2,2,1,2,2,2,1,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2
The classes are in the first column, from 1 to 3. The ? values I will replace with the mean of the column. The prior of each class can be obtained by counting: (rows in the class)/(total rows). That part is simple, but the conditionals?
        ---class---
       /   /     \   \
     x1  x2 ... x_i  xn
Bayes p(c|x) = p(x|c)p(c)/p(x). Thanks.
EDIT: What I think I need is someone who can explain the process of getting the conditionals from data, in plain terms ('with apples') if possible, and, if I need to build a CPT, give me pointers on how to do it. I'm mostly a programmer.
This is brute-force code, just to be exactly sure we are talking about the same problem.
% 1st step: Write the 32-by-56 matrix (excluding the column of classes),
% replacing "?" with NaNs. Name it "data".
% Check the values in data:
unique(data(~isnan(data)))
% These are 0, 1, 2 and 3
% 2nd step: Find the mean of each variable without considering the NaN values
data_mean = nanmean(data);
% 3rd step: replace missing values with the sample mean of the variable
data_new = data;
for hh = 1:56
    inds = isnan(data(:, hh));
    data_new(inds, hh) = data_mean(hh);
end
% Only NaN values have been replaced:
find(isnan(data(:)))         % indices of NaN values in data
find(data_new(:) ~= data(:)) % indices of data_new different from data
% 4th step: compute the probability of each outcome conditional on each class,
% as counts within the class divided by the class size
% (rows with NaN count toward the class size but not toward any outcome)
n = [0, 9, 22, 32]; % cumulative row counts delimiting the classes
probs = zeros(56, 3, 4);
for hh = 1:56        % for each variable
    for ii = 1:3     % for each class
        inds = (n(ii)+1):n(ii+1);
        for jj = 1:4 % for each outcome
            probs(hh, ii, jj) = sum(data(inds, hh) == jj-1) / numel(inds);
        end
    end
end
% The conditional probabilities of each outcome given the class, for the
% first variable, are
squeeze(probs(1, :, :))
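To connect this back to the Bayes formula in the question, here is a hedged usage sketch (my addition, with variable names as above; note that a zero entry in probs gives log(0) = -Inf, which in practice is handled with smoothing, e.g. Laplace):
% Hypothetical sketch: score one new observation x (1-by-56, outcomes 0..3)
priors = [9, 13, 10] / 32;   % class counts from the data above
logpost = log(priors);
for ii = 1:3
    for hh = 1:56
        logpost(ii) = logpost(ii) + log(probs(hh, ii, x(hh)+1));
    end
end
[~, predicted_class] = max(logpost); % naive Bayes decision rule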