OneR WEKA - wrong prediction? - classification

I am trying to rank attributes by their predictive power by using OneR in WEKA iteratively: at every run I remove the chosen attribute to see which attribute is the next best.
I have done this for all my attributes, and some (3 out of 10) get 'ranked' higher than others although they have a lower percentage of correct predictions, a smaller average ROC Area and less compact rules.
As I understand it, OneR just looks at the frequency table of each attribute against the class values, so it shouldn't care whether I take other attributes out or not... but I am probably missing something.
Would anyone have an idea?

As an alternative you can use the OneR package (available on CRAN; more information here: OneR - Establishing a New Baseline for Machine Learning Classification Models).
With the option verbose = TRUE you get the accuracy of all attributes, e.g.:
> library(OneR)
> example(OneR)
OneR> data <- optbin(iris)
OneR> model <- OneR(data, verbose = TRUE)
    Attribute    Accuracy
1 * Petal.Width  96%
2   Petal.Length 95.33%
3   Sepal.Length 74.67%
4   Sepal.Width  55.33%
---
Chosen attribute due to accuracy
and ties method (if applicable): '*'
OneR> summary(model)
Rules:
If Petal.Width = (0.0976,0.791] then Species = setosa
If Petal.Width = (0.791,1.63] then Species = versicolor
If Petal.Width = (1.63,2.5] then Species = virginica
Accuracy:
144 of 150 instances classified correctly (96%)
Contingency table:
            Petal.Width
Species      (0.0976,0.791] (0.791,1.63] (1.63,2.5] Sum
  setosa               * 50            0          0  50
  versicolor              0         * 48          2  50
  virginica               0            4       * 46  50
  Sum                    50           52         48 150
---
Maximum in each column: '*'
Pearson's Chi-squared test:
X-squared = 266.35, df = 4, p-value < 2.2e-16
(full disclosure: I am the author of this package and I would be very interested in the results you get)

The WEKA OneR classifier simply keeps the single best rule it finds, and ties go to the attribute it encounters first. In its source code, the rule selection reads:
// if this attribute is the best so far, replace the rule
if (noRule || r.m_correct > m_rule.m_correct) {
    m_rule = r;
}
Thus it should be possible (in 1R generally, not just in this implementation) for one attribute to 'block' another, with the blocked attribute only surfacing once the blocking one has been removed in your iterative process.
Say attributes 1, 2 and 3 come out best in 50%, 30% and 20% of the cases respectively, and attribute 3 is the runner-up in every case where attribute 1 is best.
Then, once attribute 1 is left out, attribute 3 wins with 70%, even though attribute 2 ranked as "better" than 3 when all three were compared together.
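As a minimal R sketch of that counting argument (the 50/30/20 win rates and the assumption that attribute 3 is always runner-up to attribute 1 are the hypothetical numbers from above, not taken from any real data):
set.seed(1)
runs <- 10000
# hypothetical winner of each case: attribute 1 wins 50%, 2 wins 30%, 3 wins 20%
winner <- sample(1:3, runs, replace = TRUE, prob = c(0.5, 0.3, 0.2))
prop.table(table(winner))        # with all attributes present: 2 is chosen more often than 3
# assume attribute 3 is the runner-up whenever attribute 1 wins,
# so removing attribute 1 hands all of its wins to attribute 3
winner_no1 <- ifelse(winner == 1, 3, winner)
prop.table(table(winner_no1))    # without attribute 1: 3 now wins about 70% of the time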

Related

95% Confidence Interval for a weighted column in KDB+/Q

I have a table like the following where each row corresponds to an execution:
table:([]name:`account1`account1`account1`account2`account2`account1`account1`account1`account1`account2;
Pnl:13.7,13.2,74.1,57.8,29.9,15.9,7.8,-50.4,2.3,-16.2;
markouts:.01,.002,-.003,-.02,.004,.001,-.008,.04,.011,.09;
notional:1370,6600,-24700,-2890,7475,15900,-975,-1260,210,-180)
I'd like to create a 95% confidence interval of Pnl for `account1. The problem is that Pnl is the product of markouts and notional values, so it's weighted and the mean wouldn't be a simple mean. I'm pretty sure the standard deviation calculation would also be a bit different from normal.
Is there a way to still do this in KDB? I'm not really sure how to go about this. Any advice is greatly appreciated!
Statistics isn't my strong point, but most of this can be done with a few keywords for the standard calculation:
q)select { avg[x] + -1 1* 1.960*sdev[x]%sqrt count x } Pnl by name from table
name    | Pnl
--------| ------------------
account1| -15.90856 37.76571
account2| -18.45611 66.12278
https://code.kx.com/q/ref/avg/#avg
https://code.kx.com/q/ref/sqrt/
https://code.kx.com/q/ref/dev/#sdev
As shown in the kx reference, the sdev calculation is defined as follows, which you could use as a base for your own version to suit what you want/expect.
{sqrt var[x]*count[x]%-1+count x}
There is also wavg if you want to do weighted average:
https://code.kx.com/q/ref/avg/#wavg
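For reference, wavg itself works like this (toy numbers, not taken from the table above):
q)2 3 wavg 10 20        / (2*10 + 3*20) % (2+3)
16f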
Edit: Assuming this can work by substituting in weighted calculations, here's a weighted sdev I've thrown together, wsdev:
table:update weight:2 6 3 5 2 4 5 6 7 3 from table;
wsdev:{[w;v] sqrt (sum ( (v-wavg[w;v]) xexp 2) *w)%-1+sum w }
// substituting avg and sdev above
w95CI:{[w;v] wavg[w;v] + -1 1* 1.960*wsdev[w;v]%sqrt count v };
select w95CI[weight;Pnl] by name from table
name    | Pnl
--------| ------------------
account1| -19.70731 28.47701
account2| -8.201463 68.24146

Possible to write Typology into dataset?

I am working with TraMineR and I am new to R and TraMineR.
Actually I made a typology of a life course dataset with TraMineR and the cluster library in R.
(used this guide: http://traminer.unige.ch/preview-typology.shtml)
Now I have the cases sorted into different types (4 types in all).
I want to do a deeper analysis of a certain type, but for that I need to know which cases (I have case numbers) belong to which type.
Is it possible to write the type a case is sorted into back into the dataset itself as a new variable? Or is there another way?
In the example of the referenced guide, the Type is obtained as follows, using optimal matching distances with substitution costs based on transition probabilities:
library(TraMineR)
data(mvad)
mvad.seq <- seqdef(mvad, 17:86)
dist.om1 <- seqdist(mvad.seq, method = "OM", indel = 1, sm = "TRATE")
library(cluster)
clusterward1 <- agnes(dist.om1, diss = TRUE, method = "ward")
cl1.4 <- cutree(clusterward1, k = 4)
cl1.4 is a vector with the cluster membership of the sequences, in the order corresponding to the mvad dataset. (It could be convenient to transform it into a factor; see the short sketch after the output below.) Therefore, you can simply add this variable as an additional column to the dataset. For instance, if we want to name this new column Type:
mvad$Type <- cl1.4
tail(mvad[,c("id","Type")]) ## id and Type of the last 6 sequences
## id Type
## 707 707 3
## 708 708 3
## 709 709 4
## 710 710 2
## 711 711 1
## 712 712 4
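If you would rather store the membership as a labelled factor than as an integer code, here is a minimal sketch; the type labels are arbitrary and chosen purely for illustration:
mvad$Type <- factor(cl1.4, labels = paste("Type", 1:4))
table(mvad$Type)   # number of cases per type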

Detect contiguous numbers - MATLAB

I coded a program that creates a bunch of binary numbers like this:
out = [0,1,1,0,1,1,1,0,0,0,1,0];
I want to check for the existence of nine 1s in a row in the out above; for example, when we have this in our output:
out_2 = [0,0,0,0,1,1,1,1,1,1,1,1,1];
or
out_3 = [0,0,0,1,1,1,1,0,0,1,0,1,1,1,1,1,1,1,1,1,0,0,0,1,1,0];
in these cases the condition variable should be set to 1. We don't know the exact position where the run of ones starts in the out variable; it is random. I only want to detect whether such a run of repeated ones exists in the variable above (one occurrence or more).
PS:
We are searching for a general answer that can find runs of other repeated numbers as well (not only 1, and not only for binary data; this is just an example).
You can use convolution to solve such r-contiguous detection cases.
Case #1 : To find contiguous 1s in a binary array -
check = any(conv(double(input_arr),ones(r,1))>=r)
Sample run -
input_arr =
0 0 0 0 1 1 1 1 1 1 1 1 1
r =
9
check =
1
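Tying this back to the question's variables, a minimal sketch (out_3 and the name condition come from the question; r = 9 because nine contiguous 1s are wanted):
out_3 = [0,0,0,1,1,1,1,0,0,1,0,1,1,1,1,1,1,1,1,1,0,0,0,1,1,0];
r = 9;
condition = any(conv(double(out_3), ones(1,r)) >= r)   % condition is 1 here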
Case #2 : For detecting any number as contiguous, you could modify it a bit, like so -
check = any(conv(double(diff(input_arr)==0),ones(1,r-1))>=r-1)
Sample run -
input_arr =
3 5 2 4 4 4 5 5 2 2
r =
3
check =
1
To save Stack Overflow from further duplicates, feel free to also look into these related problems:
Fast r-contiguous matching (based on location similarities).
r-contiguous matching, MATLAB.

Reshaping and merging simulations in Stata

I have a dataset, which consists of 1000 simulations. The output of each simulation is saved as a row of data. There are variables alpha, beta and simulationid.
Here's a sample dataset:
simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
I want to estimate a new value - let's call it new - which depends on alpha and beta as well as different levels of two other variables which we'll call risk and price. Values of risk range from 0 to 100, price from 0 to 500 in steps of 5.
What I want to achieve is a dataset that consists of values representing the probability that (across the simulations) new is greater than 0 for combinations of risk and price.
I can achieve this using the code below. However, the reshape process takes more hours than I'd like. And it seems to me to be something that could be completed a lot quicker.
So, my question is either:
i) is there an efficient way to generate multiple datasets from a single row of data without multiple reshapes, or
ii) am I going about this in totally the wrong way?
set maxvar 15000
/* Input sample data */
input simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
end
forvalues risk = 0(1)100 {
    forvalues price = 0(5)500 {
        gen new_r`risk'_p`price' = `price' * (`risk'/200) * beta - alpha
        gen probnew_r`risk'_p`price' = 0
        replace probnew_r`risk'_p`price' = 1 if new_r`risk'_p`price' > 0
        sum probnew_r`risk'_p`price', mean
        gen mnew_r`risk'_p`price' = r(mean)
        drop new_r`risk'_p`price' probnew_r`risk'_p`price'
    }
}
drop if simulationid > 1
save simresults.dta, replace
forvalues risk = 0(1)100 {
    clear
    use simresults.dta
    reshape long mnew_r`risk'_p, i(simulationid) j(price)
    keep simulation price mnew_r`risk'_p
    rename mnew_r`risk'_p risk`risk'
    save risk`risk'.dta, replace
}
clear
use risk0.dta
forvalues risk = 1(1)100 {
    merge m:m price using risk`risk'.dta, nogen
    save merged.dta, replace
}
Here's a start on your problem.
So far as I can see, you don't need more than one dataset.
The various reshapes and merges just rearrange what was first generated and that can be done within one dataset.
The code here in the first instance is for just one pair of values of alpha and beta. To simulate 1000 such pairs, you would need 1000 times more observations, i.e. about 10 million, which is not usually a problem, and you would need to loop over the alphas and betas. But the loop can be tacit; we'll get to that.
This code has been run and is legal. It's limited to one alpha, beta pair.
clear
input simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
end
local N = 101 * 101
set obs `N'
egen risk = seq(), block(101)
replace risk = risk - 1
egen price = seq(), from(0) to(100)
replace price = 5 * price
gen result = (price * (risk/200)* beta[1] - alpha[1]) > 0
bysort price risk: gen mean = sum(result)
by price risk: replace mean = mean[_N]/_N
Assuming now that you first read in 1000 values, here is a sketch of how to get the whole thing. This code has not been tested. That is, your dataset starts with 1000 observations; you then enlarge it to 10 million or so, and get your results. The tricksy part is using an expression for the subscript to ensure that each block of results is for a distinct alpha, beta pair. That's not compulsory; you could do it in a loop, but then you would need to generate outside the loop and replace within it.
local N = 101 * 101 * 1000
set obs `N'
egen risk = seq(), block(101)
replace risk = risk - 1
egen price = seq(), from(0) to(100)
replace price = 5 * price
egen sim = seq(), block(10201)
gen result = (price * (risk/200)* beta[ceil(_n/10201)] - alpha[ceil(_n/10201)]) > 0
bysort sim price risk: gen mean = sum(result)
by sim price risk: replace mean = mean[_N]/_N
Other devices used: egen to set up in blocks; getting the mean without repeated calls to summarize; using a true-or-false expression directly.
NB: I haven't tried to understand what you are doing, but it seems to me that the price-risk-simulation conditions define single values, so calculating a mean looks redundant. But perhaps that is in the code because you wish to add further detail to the code once you have it working.
NB2: This seems a purely deterministic calculation. Not sure that you need this code at all.
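If what you ultimately want is a single probability per risk-price combination across the 1000 simulations, a hedged follow-up to the untested sketch above would be to average result over the simulations rather than within each sim-price-risk cell, for example:
* assumes the enlarged dataset with result, risk and price from the sketch above
collapse (mean) prob=result, by(risk price)
* prob is now the share of simulations in which new > 0 for each risk/price pair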

Change the class of columns in a data frame

First of all, excuse me if I make any mistakes, but English is not a language I use very often.
I have a data frame with numbers. A small part of the data frame is this:
nominal ordinal
2       2
2       1
2       1
2       2
So, I want to use the Gower distance function on these numbers.
Here (http://rgm2.lab.nig.ac.jp/RGM2/R_man-2.9.0/library/StatMatch/man/gower.dist.html) it says that in order to use gower.dist, all nominal variables must be of class "factor" and all ordinal variables of class "ordered".
By default, all the columns are of class "integer" and mode "numeric". In order to change the class of the columns, I use these commands:
DF = read.table("clipboard", header = TRUE, sep = "\t")
# I select all the cells and copy them to the clipboard.
# Then R, with this command, reads the data from there.
MyHeader = names(DF)  # save the headers of the data frame to a temporary vector
for (i in 1:length(DF)) {
    if (MyHeader[[i]] == "nominal") DF[[i]] = as.factor(DF[[i]])
}
for (i in 1:length(DF)) {
    if (MyHeader[[i]] == "ordinal") DF[[i]] = as.ordered(DF[[i]])
}
The first for/if loop changes the class from integer to factor, which is what I want, but the second changes the class of ordinal variables to: "ordered" "factor".
I need to change all the columns with the header "ordinal" to "ordered", as the gower.dist function says.
Thanks in advance,
B.T.
What you are doing is fine, if perhaps a little inelegant.
With your ordered factor, you have something like:
> foo <- as.ordered(1:10)
> foo
[1] 1 2 3 4 5 6 7 8 9 10
Levels: 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9 < 10
> class(foo)
[1] "ordered" "factor"
Notice that it has two classes, indicating that it is an ordered factor and that it is a factor:
> is.ordered(as.ordered(1:10))
[1] TRUE
> is.factor(as.ordered(1:10))
[1] TRUE
In some sense, you might like to think of foo as an ordered factor that also inherits from the factor class. In particular, if there isn't a specific method that handles ordered factors but there is a method for factors, R will use the factor method. As far as R is concerned, an ordered factor is an object with classes "ordered" and "factor", and this is exactly what your function for Gower's distance requires.
You could easily do this with:
DF$nominal <- as.factor(DF$nominal)
DF$ordinal <- as.ordered(DF$ordinal)
which gives you a data frame with the correct structure. If you work with data frames, please stay away from [[]] unless you know very well what you're doing. Take Dirk's advice and check Owen's R Guide as well; you definitely need it.
If I do the conversion as shown above, gower.dist() works perfectly fine. On a side note, Gower's distance can easily be calculated using the daisy() function as well:
DF <- data.frame(
    ordinal = c(1, 2, 3, 1, 2, 1),
    nominal = c(2, 2, 2, 2, 2, 2)
)
DF$nominal <- as.factor(DF$nominal)
DF$ordinal <- as.ordered(DF$ordinal)
library(cluster)
daisy(DF,metric="gower")
library(StatMatch)
gower.dist(DF)
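If the real data has several nominal and several ordinal columns (as the question's loops suggest), a more compact alternative to the explicit for-loops is to convert them in one pass. This is a hedged sketch: it assumes the column names start with "nominal" or "ordinal", and relies on read.table making duplicated column names unique (nominal, nominal.1, ...):
nom <- grepl("^nominal", names(DF))
ord <- grepl("^ordinal", names(DF))
DF[nom] <- lapply(DF[nom], as.factor)
DF[ord] <- lapply(DF[ord], as.ordered)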