"bnlearn" usage in R for prediction of discrete variables - prediction

I am working with "bnlearn" in R to construct a probabilistic network over discrete variables, and I need to predict one variable from the constructed BN. Taking the iris data as an example, I wrote the following R code, referring to another post on CrossValidated:
> data(iris)
> set.seed(1)
> tr_idx <- sample(seq_len(nrow(iris)), size = 100)
> tr <- iris[tr_idx,]
> te <- iris[-tr_idx,]
> res <- hc(tr)
> fit <- bn.fit(res, tr)
> pred <- predict(fit, "Species", te)
> cb <- cbind(pred, te[, "Species"])
> accuracy(f = pred, x = te[, "Species"])
Is this right? My questions are:
1. Do I have to discretise each variable?
2. If my data include both discrete and continuous variables, which BN structure learning function should I use?
3. The accuracy function did not work in the above example; I got the error below:
accuracy(f = pred, x = te[, "Species"])
Error in accuracy(f = pred, x = te[, "Species"]) : First argument should be a forecast object or a time series.
Would you please help with this? Many thanks in advance.
Regards,
Xuemei
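A note on the error in question 3: the accuracy() being picked up here appears to come from the forecast package, which expects a forecast object or a time series rather than a vector of class predictions. For a classification task, the accuracy can be computed directly; a minimal sketch, assuming pred and te from the code above:
## Confusion table of predicted vs. observed classes
table(predicted = pred, observed = te$Species)
## Proportion of correct predictions
mean(pred == te$Species)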

Related

How to run an exponential decay mixed model?

I am not familiar with nonlinear regression and would appreciate some help with running an exponential decay model in R. Please see the graph for what the data look like. My hunch is that an exponential model might be a good choice. I have one fixed effect and one random effect: y ~ x + (1|random factor). How do I get starting values for the exponential model (please assume that I know nothing about nonlinear regression) in R? How do I subsequently run a nonlinear model with these starting values? Could anyone please help me with the logic as well as the R code?
As I am not familiar with nonlinear regression, I haven't been able to attempt it in R.
[raw plot of the data]
The correct syntax will depend on your experimental design and model, but I hope to give you a general idea of how to get started.
We begin by generating some data that should match the type of data you are working with. You had mentioned a fixed factor and a random one. Here, the fixed factor is represented by the variable treatment and the random factor is represented by the variable grouping_factor.
library(nlraa)
library(nlme)
library(ggplot2)
## Setting this seed should allow you to reach the same result as me
set.seed(3232333)
example_data <- expand.grid(treatment = c("A", "B"),
                            grouping_factor = c('1', '2', '3'),
                            replication = c(1, 2, 3),
                            xvar = 1:15)
The next step is to create some "observations". Here, we use an exponential function y = a * exp(c * x) plus some random noise to create the data. We also add a constant to treatment A just to create a treatment difference.
example_data$y <- ave(example_data$xvar,
                      example_data[, c('treatment', 'replication', 'grouping_factor')],
                      FUN = function(x) {
                        expf(x = x, a = 10, c = -0.3) + rnorm(1, 0, 0.6)
                      })
example_data$y[example_data$treatment == 'A'] <-
  example_data$y[example_data$treatment == 'A'] + 0.8
All right, now we start fitting the model.
## Create a grouped data frame
exampleG <- groupedData(y ~ xvar|grouping_factor, data = example_data)
## Fit a separate model to each grouping level
fitL <- nlsList(y ~ SSexpf(xvar, a, c), data = exampleG)
## Promote the list of fits to a nonlinear mixed-effects model
fit1 <- nlme(fitL)
## Grab the fixed-effect coefficients of the general model
fxf <- fixed.effects(fit1)
## Add treatment as a fixed effect, using the coefficients from the
## previous model as starting values
fit2 <- update(fit1, fixed = a + c ~ treatment,
               start = c(fxf[1], 0,
                         fxf[2], 0))
Looking at the model output, you will see information like the following:
Nonlinear mixed-effects model fit by maximum likelihood
Model: y ~ SSexpf(xvar, a, c)
Data: exampleG
AIC BIC logLik
475.8632 504.6506 -229.9316
Random effects:
Formula: list(a ~ 1, c ~ 1)
Level: grouping_factor
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
a.(Intercept) 3.254827e-04 a.(In)
c.(Intercept) 1.248580e-06 0
Residual 5.670317e-01
Fixed effects: a + c ~ treatment
Value Std.Error DF t-value p-value
a.(Intercept) 9.634383 0.2189967 264 43.99329 0.0000
a.treatmentB 0.353342 0.3621573 264 0.97566 0.3301
c.(Intercept) -0.204848 0.0060642 264 -33.77976 0.0000
c.treatmentB -0.092138 0.0120463 264 -7.64867 0.0000
Correlation:
a.(In) a.trtB c.(In)
a.treatmentB -0.605
c.(Intercept) -0.785 0.475
c.treatmentB 0.395 -0.792 -0.503
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-1.93208903 -0.34340037 0.04767133 0.78924247 1.95516431
Number of Observations: 270
Number of Groups: 3
Then, if you wanted to visualize the model fit, you could do the following.
## Store the model predictions for visualization purposes
predictionsDf <- cbind(example_data,
                       predict_nlme(fit2, interval = 'conf'))
## Make a graph to check the fit
ggplot() +
  geom_ribbon(data = predictionsDf,
              aes(x = xvar, ymin = Q2.5, ymax = Q97.5, fill = treatment),
              color = NA, alpha = 0.3) +
  geom_point(data = example_data, aes(x = xvar, y = y, col = treatment)) +
  geom_line(data = predictionsDf, aes(x = xvar, y = Estimate, col = treatment),
            size = 1.1)
This shows the model fit.
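If you would rather derive starting values by hand instead of relying on the SSexpf self-starter, one common trick (a sketch, not part of the workflow above) is to log-linearize y = a * exp(c * x), since log(y) = log(a) + c * x:
## Fit a linear model on the log scale (only positive y values can be logged)
lm0 <- lm(log(y) ~ xvar, data = subset(example_data, y > 0))
## Back-transform the intercept to recover 'a'; the slope estimates 'c'
start_vals <- list(a = exp(unname(coef(lm0)[1])), c = unname(coef(lm0)[2]))
## These can be passed to nls(), e.g.
## nls(y ~ a * exp(c * xvar), data = example_data, start = start_vals)
Such rough values are usually close enough for the optimizer to converge.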

Why does `libsvm` in Matlab give me all-1 predictions?

I use SVM in R and Matlab with the same dataset.
My R code works fine and gives me reasonable predictions.
matdat <- readMat(con = "data.mat")
svm.model <- svm(x = matdat$normalize.X, y = matdat$Yt)
pred <- predict(svm.model, newdata = matdat$normalize.X)
pred <- sapply(pred, function(x){ifelse(x > 0, 1, -1)})
sum(pred == matdat$Yt)/length(matdat$Yt)
But my Matlab code predicts all 1s on the training data.
load('data.mat')
model2 = svmtrain(Yt, normalize_X,'-s 3 -c 1 -t 2 -p 0.1');
[predicted_label,accuracy, decision_values] = svmpredict(Yt, normalize_X, model2);
I have checked the default parameters of svm {e1071}, which in my opinion agree with the Matlab version.
I use the e1071 package, version 1.6-7, in R, and the latest libsvm from the official page.
So, what can I do to find the reason? Any ideas?
==== Update ====
Before feeding the data to libsvm in Matlab, I applied mapstd to normalize the data, which is done automatically in R. Then I got the same trained model in both R and Matlab.
In Matlab you use the -s 3 option, which selects regression (epsilon-SVR), not classification.
As a starting point, don't assume anything about default parameters; just specify the parameters explicitly in both R and Matlab.
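For example, on the R side you could spell out the type, kernel, and cost rather than relying on defaults (a sketch assuming the same matdat object as above):
library(e1071)
## C-classification with an RBF kernel and cost 1, stated explicitly;
## coercing the labels to a factor makes svm() do classification
svm.model <- svm(x = matdat$normalize.X,
                 y = factor(matdat$Yt),
                 type = "C-classification",
                 kernel = "radial",
                 cost = 1)
pred <- predict(svm.model, newdata = matdat$normalize.X)
The Matlab counterpart would then be '-s 0 -t 2 -c 1' (C-SVC with an RBF kernel) instead of '-s 3'.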

How can we find percentile or quantile of gamma distribution in MATLAB?

Suppose that we have this gamma distribution in MATLAB:
I want the part of the distribution with higher density (an x-axis range). How can I extract this in MATLAB? I fitted the distribution using the histfit function.
P.S. My code:
figure;
histfit(Data,20,'gamma');
[phat, pci] = gamfit(Data);
phat =
11.3360 4.2276
pci =
8.4434 3.1281
15.2196 5.7136
When you fit a gamma distribution to your data with [phat, pci] = gamfit(Data);, phat contains the MLE parameters.
You can plug this into gaminv:
x = gaminv(p, phat(1), phat(2));
where p is a vector of percentages, e.g. p = [.2, .8].
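For comparison, the same quantiles can be computed in R (a sketch using the fitted values from the question; note that MATLAB's gamfit returns shape and scale, so qgamma needs scale = rather than its default rate):
## Shape and scale as estimated by gamfit above
phat <- c(11.3360, 4.2276)
## 20th and 80th percentiles of the fitted gamma distribution
qgamma(c(0.2, 0.8), shape = phat[1], scale = phat[2])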
I always base my code on the following, which I wrote once and now alter where necessary. Maybe you will find it useful as well.
table2.10 <- cbind(lower = c(0, 2.5, 7.5, 12.5, 17.5, 22.5, 32.5,
                             47.5, 67.5, 87.5, 125, 225, 300),
                   upper = c(2.5, 7.5, 12.5, 17.5, 22.5, 32.5, 47.5,
                             67.5, 87.5, 125, 225, 300, Inf),
                   freq = c(41, 48, 24, 18, 15, 14, 16, 12, 6, 11, 5, 4, 3))
## Log-likelihood for the interval-censored gamma counts
loglik <- function(p, d) {
  upper <- d[, 2]
  lower <- d[, 1]
  n <- d[, 3]
  ll <- n * log(ifelse(upper < Inf, pgamma(upper, p[1], p[2]), 1) -
                pgamma(lower, p[1], p[2]))
  sum(ll)
}
p0 <- c(alpha = 0.47, beta = 0.014)
m <- optim(p0, loglik, hessian = TRUE, control = list(fnscale = -1),
           d = table2.10)
theta <- qgamma(0.95, m$par[1], m$par[2])
theta
One can also create a 95% confidence interval by using the delta method. To do so, we need to differentiate the quantile function of the claims, F_X^{-1}(0.95; α, β), with respect to α and β.
p <- m$par
eps <- c(1e-5, 1e-6, 1e-7)
d.alpha <- 0 * eps
d.beta <- 0 * eps
## Central finite differences of the 95% quantile w.r.t. alpha and beta
for (i in 1:3) {
  d.alpha[i] <- (qgamma(0.95, p[1] + eps[i], p[2]) -
                 qgamma(0.95, p[1] - eps[i], p[2])) / (2 * eps[i])
  d.beta[i] <- (qgamma(0.95, p[1], p[2] + eps[i]) -
                qgamma(0.95, p[1], p[2] - eps[i])) / (2 * eps[i])
}
d.alpha
d.beta
## Delta method: variance of the quantile estimate and a 95% interval
var.p <- solve(-m$hessian)
var.q95 <- t(c(d.alpha[2], d.beta[2])) %*% var.p %*% c(d.alpha[2], d.beta[2])
qgamma(0.95, p[1], p[2]) + qnorm(c(0.025, 0.975)) * sqrt(c(var.q95))
It is even possible to use the parametric bootstrap on the estimates for α and β to get B = 10,000 different estimates of the 95th percentile of the loss distribution, and to use these estimates to construct a 95% confidence interval.
library(mvtnorm)
B <- 10000
q.b <- rep(NA, B)
## Draw parameter values from the asymptotic normal distribution of the MLE
for (b in 1:B) {
  p.b <- rmvnorm(1, p, var.p)
  if (!any(p.b < 0)) q.b[b] <- qgamma(0.95, p.b[1], p.b[2])
}
## na.rm drops the rare draws with negative parameters, which are skipped above
quantile(q.b, c(0.025, 0.975), na.rm = TRUE)
To do the nonparametric bootstrap, we first 'expand' the data to reflect each individual observation. Then we sample with replacement from the line numbers, calculate the frequency table, estimate the model, and compute its 95th percentile.
line.numbers <- rep(1:13, table2.10[, "freq"])
q.b <- rep(NA, B)
table2.10b <- table2.10
## Resample observations, rebuild the frequency table, refit the model,
## and recompute the 95th percentile
for (b in 1:B) {
  line.numbers.b <- sample(line.numbers, size = 217, replace = TRUE)
  table2.10b[, "freq"] <- table(factor(line.numbers.b, levels = 1:13))
  m.b <- optim(m$par, loglik, hessian = TRUE, control = list(fnscale = -1),
               d = table2.10b)
  q.b[b] <- qgamma(0.95, m.b$par[1], m.b$par[2])
}
q.npb <- q.b
quantile(q.b, c(0.025, 0.975))

K-fold cross validation

How do I perform k-fold cross validation on a data set, say X?
I have gone through the Matlab site and tried this for a data set X.
The following is my code for 10-fold cross validation on X.
c= cvcrossvalidate(X,'kFold',10);
This creates an object c, but how do I access its different parts and use them to test my classifier? I have not been able to understand it even after going through various texts.
Follow this:
C = crossvalind('Kfold', X_label, 10);
for i = 1:10
    Test = (C == i);
    Train = ~Test;
    SVMStruct = svmtrain(X(Train,:), X_label(Train,:));
    Result = svmclassify(SVMStruct, X(Test,:));
end
X_label = your data labelling.
X = your data set.
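For readers working in R rather than Matlab, the same manual k-fold pattern looks roughly like this (a sketch using svm from the e1071 package, assuming X is a data frame of features and X_label a factor of labels):
library(e1071)
set.seed(1)
k <- 10
## Randomly assign each row to one of k folds
folds <- sample(rep(1:k, length.out = nrow(X)))
for (i in 1:k) {
  test <- folds == i
  model <- svm(X[!test, ], X_label[!test])   # train on the other k-1 folds
  preds <- predict(model, X[test, ])         # classify the held-out fold
  print(mean(preds == X_label[test]))        # accuracy on fold i
}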

How to make a graph from function output in matlab

I'm completely lost with MATLAB functions, so here is the case:
Let's assume I have SUM = 0 and a constant probability P that the user gives me. I have to compare this constant P with M other random probabilities (the user also gives M); if P is larger I add 1 to SUM, and if P is smaller I add -1 to SUM. At the end I want to print the graph of the process on the screen.
So far I have managed to make only one stage with this code:
function [result] = ex1(p)
    if (rand >= p)
        result = 1;
    else
        result = -1;
    end
(it's like M = 1)
How do you suggest I modify this code to make it work the way I described (including getting a graph)?
Or maybe I'm getting the logic wrong? The question says I get 1 with probability P and -1 with probability (1-P), and the SUM is accumulated the same way.
Many thanks
I'm not sure how you obtain your input, but this should get you on your way:
p = 0.5; % Constant probability
m = 10;
randoms = rand(m,1) % Random probabilities
results = ones(m,1);
idx = find(randoms < p)
results(idx) = -1;
plot(cumsum(results))
[Resulting plot for m = 1000]
You can do it like this:
p = 0.25; % example data
M = 20; % example data
random = rand(M,1); % generate values
y = cumsum(2*(random>=p)-1); % compute cumulative sum of +1/-1
plot(y) % do the plot
The important function here is cumsum, which computes the cumulative sum of the sequence of +1/-1 values generated by 2*(random>=p)-1.
[Example graph with p = 0.5, M = 2000]
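For comparison, the same random walk in R (a sketch; runif plays the role of Matlab's rand, and type = "l" draws a connected line):
p <- 0.5    # constant probability given by the user
M <- 2000   # number of comparisons
## +1 when the random draw is >= p, -1 otherwise, accumulated as a walk
y <- cumsum(2 * (runif(M) >= p) - 1)
plot(y, type = "l")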