Orange3 concatenate tables with different targets - orange

I have two input data files to use in Orange: one is the training set (with targets "A", "B" and "C") and the other holds the unknown samples (with targets "D" and "E", so that I can identify the unknown samples in the scatter plot of the first two principal components).
I applied PCA to the training dataset and, through a Python script, reapplied the PCA transformation to the unknown samples dataset. However, the result has a ? as the target value for every entry in the unknown samples set.
I tried to merge the training and unknown samples sets with the merge table widget, and apparently it does the same: all samples in the training set are correct, but the unknown samples have ? as targets.
The only way I managed to get this running properly is to put the unknown samples and the training set in the same input file, which is impractical.
Is there any way to fix this?
Please note that I have tried to change domain.class_var and the target values directly on the transformed unknown samples, but that also alters the domain of the training dataset. Apparently, when the new table is created it just holds a reference to the domain of the original training data after PCA.

I managed it by converting the data into NumPy arrays, concatenating them, and converting the result back to a table.
Here is the code if anyone is interested:
import numpy
from Orange.data.table import Table
from Orange.data import Domain, DiscreteVariable, ContinuousVariable
# in_object: PCA-transformed training table; in_data: table with the unknown samples
trnsfrmd_knwn_data = numpy.array(in_object)
# Converting into the training domain leaves the class column undefined (shown as "?")
trnsfrmd_unkwn_data = numpy.array(Table(in_object.domain, in_data))
# Shift the unknown samples' own class indices past the training class values
ndx = len(in_object.domain.class_var.values)
trnsfrmd_unkwn_data[:, -1] = in_data.Y + ndx
# Build a fresh domain whose class variable lists both sets of targets
targets = in_object.domain.class_var.values + in_data.domain.class_var.values
dm = Domain([ContinuousVariable(x.name) for x in in_object.domain.attributes], DiscreteVariable('region', values=targets))
out_data = Table.from_numpy(dm, numpy.append(trnsfrmd_knwn_data, trnsfrmd_unkwn_data, axis=0))

Related

Target Encoding in Python Model

I built a model in Python that uses target encoding. I used a dataset with 25000 rows, which is split into training and test sets, and the model works fine. However, I now want to run the model on completely fresh data, say a single row of data in an Excel file. I need to know the code for this and would really appreciate some help, as I am somewhat new to Python. Below is the part of the code that creates the training and test sets from the 25000 rows, trains the model on the training set and predicts on the test set. What I need is the code that runs this target-encoded model to predict on fresh data. If I need to post more code for greater clarity, please let me know.
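from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Hold out 20% of the rows as the test set
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2)
rf = RandomForestClassifier(n_jobs=-1)
rf.fit(train_x.values, train_y.values)
# Predictions on the training and the test rows
pred_train = rf.predict(train_x.values)
pred = rf.predict(test_x.values)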
Thanks
You might want to have a look at the comment section in this notebook:
here
"After we apply target encoding on the train data and target, we can have the result for one category, e.g. column A has a, b, c. Then we calculate the mean of each a, b, c in column A and apply it to the test data. We then apply it to test using the pd.merge function."
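The idea in that quote can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the column names "A" and "y", the helper names and the toy data are all made up, and falling back to the global mean for unseen categories is an assumption rather than something the quote prescribes.

```python
import pandas as pd

def fit_target_encoding(train, column, target):
    """Learn the per-category target means and the global mean from the training data."""
    means = train.groupby(column)[target].mean().rename(column + "_enc").reset_index()
    return means, train[target].mean()

def apply_target_encoding(frame, column, means, global_mean):
    """Attach the learned means with pd.merge; unseen categories get the global mean."""
    merged = frame.merge(means, on=column, how="left")
    merged[column + "_enc"] = merged[column + "_enc"].fillna(global_mean)
    return merged

# Toy training data (hypothetical): categorical column A and binary target y
train = pd.DataFrame({"A": ["a", "a", "b", "b", "c"], "y": [1, 0, 1, 1, 0]})
means, global_mean = fit_target_encoding(train, "A", "y")

# Fresh rows: "b" was seen in training, "z" was not
fresh = pd.DataFrame({"A": ["b", "z"]})
encoded = apply_target_encoding(fresh, "A", means, global_mean)
print(encoded)
```

The key point is that the means are learned once from the training data and then reused unchanged for any new row. A fresh row could presumably be loaded with pd.read_excel, encoded the same way, and the encoded columns passed to rf.predict.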

Gaussian mixture model in pyspark

I have gone through https://spark.apache.org/docs/latest/mllib-clustering.html on fitting a GMM in PySpark. I have carried out the same operation successfully in Python, but after several attempts I am unable to run it in PySpark.
My questions are as follows:
1. The link mentioned above, and another example of fitting a GMM in PySpark that I checked, take a txt file with no column headings. I have a csv with 17 columns. The code is:
data = sc.textFile("..path/mydata.csv")
parsedData = data.map(lambda line: array([float(x) for x in line.strip().split(' ')]))
This worked, but when I try to fit GaussianMixture.train with a specified number of components, it does not work.
If the data used in the examples has no column headings, how can I judge which column comes from which distribution and how the pattern changes?
How can I get a heat map from this, so that whenever new data comes in I can use my trained model's heat map to judge the distribution pattern of the new test data and point out the mismatches?
Thanks.
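One thing that may be worth checking (this is an assumption, since the question does not say what the failure was): a csv is normally comma-separated and may carry a header row, whereas the snippet above splits each line on a space. A plain-Python sketch of parsing such lines while keeping the column names, with made-up column names and values:

```python
def parse_csv_line(line, sep=","):
    """Turn one line of text into a list of floats."""
    return [float(x) for x in line.strip().split(sep)]

# Hypothetical file content: a header line followed by numeric rows
lines = ["f1,f2,f3", "1.0,2.0,3.0", "4.5,5.5,6.5"]

header = lines[0].strip().split(",")
rows = [parse_csv_line(line) for line in lines[1:]]

# Map each column name to its values, so columns stay identifiable
columns = dict(zip(header, zip(*rows)))
print(columns["f2"])
```

In PySpark, the same parse_csv_line function could be passed to data.map once the header line has been filtered out.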

How to use `crossval` in matlab for a Leave one Out Validation method

I have been reading the documentation here and here, but it's really unclear to me and I don't see how to use crossval in practice to do a leave-one-out cross-validation.
vals = crossval(fun,X)
vals = crossval(fun,X,Y,...)
mse = crossval('mse',X,y,'Predfun',predfun)
mcr = crossval('mcr',X,y,'Predfun',predfun)
val = crossval(criterion,X1,X2,...,y,'Predfun',predfun)
vals = crossval(...,'name',value)
I really don't understand the fun part.
I have estimated the chlorophyll rate with different indices. Then I did a linear regression between those indices and the chlorophyll rate measured in the field. Now I want to validate them. One of my estimates is a column with 22 entries, so I want to use 21 of them for training and 1 for testing, and loop 22 times so that every entry has been used as the test once.
But I don't know where I should put the regression model. If my regression is Y=aX+b,
do I reuse the a and b calculated before for the training part, or do I fit a new linear regression on the training part and then see what the test value is with that?
I am not sure I fully understood how to build a leave-one-out model.
Then I want to evaluate the test results by calculating the RMSE (and maybe the R²).
How do I code that using crossval?
I saw the answer to the question here, but I don't have access to the crossvalind function with my license.
Well, I finally figured it out. This is my script:
First I loaded my data and the linear regression function
X=indicesCha_without_Cloud(:,3);
y=Cha_g_m2t_without_Cloud(:,3);
testval=@(XTRAIN,ytrain,XTEST)Linear_regression_indices( XTRAIN,ytrain,XTEST);
where in my case fun (in the MathWorks help) is testval, and Linear_regression_indices is a very simple function:
function yfit = Linear_regression_indices(XTRAIN, ytrain, XTEST)
% Fit a degree-1 polynomial on the training fold and evaluate it on the test fold
yfit = polyval(polyfit(XTRAIN, ytrain, 1), XTEST);
end
There are two ways to do it and they both give the same result:
the first simply uses the crossval function
cvMse = crossval('mse',X,y,'predfun',testval,'leaveout',1);
this runs as many folds as there are data points, using one data point as XTEST each time
the second uses cvpartition
c = cvpartition(n,'LeaveOut') creates a random partition for leave-one-out cross validation on n observations. Leave-one-out is a special case of 'KFold', in which the number of folds equals the number of observations. link
c = cvpartition(y,'LeaveOut');
cvMse2=crossval('mse',X,y,'predfun',testval,'partition',c);
then the RMSE can be easily calculated
RMSE=sqrt(cvMse);
RMSE2=sqrt(cvMse2);
This gives my answer; in my case RMSE = 0.3548
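For readers without MATLAB, the same leave-one-out idea can be sketched in plain Python. This is only an illustration with made-up data, and the helper names are hypothetical. It refits a and b on the n-1 training points in every fold, which also addresses the question of whether to reuse the coefficients fitted on the whole dataset: the model is refitted each time.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def leave_one_out_rmse(xs, ys):
    """Refit the line n times, each time predicting the single held-out point."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)  # new a and b for every fold
        squared_errors.append((ys[i] - (a * xs[i] + b)) ** 2)
    # Mean of the squared held-out errors, then the square root
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up data, roughly following y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(leave_one_out_rmse(xs, ys))
```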

Partitioning dataset into train, test and validate subset (Matlab)

I'm new to MATLAB and I'm trying to achieve the following:
I have one dataset of 7000+ entries. The goal is to train a classification tree (fitctree) on this data. I've separated the data into a matrix of observations (predictors) and a matrix of classes (class). To partition the data I'm using cvpartition. Everything works fine up to this point.
Problem: I want to create three subsets of the data: one training set, one validation set and one test set. I want to train the tree on the training set and validate its performance on the validation set. After tweaking the parameters, I want to run the final test on the test partition.
To partition the data I tried creating a cvpartition, which works, e.g.
cvpart = cvpartition(class, 'k', 10);
and then performing another cvpartition on that test set, separating it into another two sets:
cvpart2 = cvpartition(cvpart.TestSize, 'k', 10);
Sadly, when validating the performance of the tree this doesn't seem to work. When I skip the second cvpartition and validate the performance on the test set of cvpart, the model performs perfectly.
Update: after days I found that it seems to work when used this way:
cvpart2 = cvpartition(cvpart.TrainSize, 'k', 10);
Would anyone care to explain why it works this way, but not when using the test set?
Hope you guys can help me out;)
Kind regards.
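Independently of cvpartition, the three-subset layout described in the question can be sketched as a split of row indices. This is only an illustration in Python: the function name, the fractions and the seed are assumptions, and unlike a partition built from the class labels this simple version does not stratify by class.

```python
import random

def three_way_split(n, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the row indices 0..n-1 and cut them into train/validation/test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(n * test_fraction)
    n_val = int(n * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(7000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Each row index lands in exactly one subset, so the tree can be fitted on the training rows, tuned against the validation rows, and evaluated once on the untouched test rows.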

Text Categorization datasets for MATLAB

I am looking for a reliable dataset for text categorization tasks in MATLAB format.
I want to run some experiments and don't want to spend too much time preprocessing the text and creating feature vectors. I need something ready-made that I can plug into my algorithm. I found MATLAB files for the Reuters dataset here: link text
Everything is ready there, but I want to use a subset of it. In this file, "fea" contains the feature vectors for each document. However, it does not seem to be a normal matrix. For example, I want to select the top 1000 documents in "fea". If you download it and load it into MATLAB you will see what I mean.
So, if possible, I need a solution for the above-mentioned dataset or any alternative datasets.
Thanks in advance.
It is stored as a sparse matrix. Extract the first 1000 documents (rows) and, if you have enough memory, convert it to a full dense matrix:
load Reuters21578.mat
TF = full( fea(1:1000,:) );
Let's check the variables we have:
>> whos
  Name          Size            Bytes        Class     Attributes
  TF            1000x18933      151464000    double
  fea           8293x18933      4749196      double    sparse
  gnd           8293x1          66344        double
  testIdx       2347x1          18776        double
  trainIdx      5946x1          47568        double
so you can see TF is now about 150MB.
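The size follows from the matrix dimensions at 8 bytes per double. A small Python check of the arithmetic, using only the numbers from the whos listing above, also shows why it helps to take the subset before converting to dense:

```python
BYTES_PER_DOUBLE = 8
n_terms = 18933  # number of columns in fea

subset_bytes = 1000 * n_terms * BYTES_PER_DOUBLE  # dense size of the first 1000 rows
full_bytes = 8293 * n_terms * BYTES_PER_DOUBLE    # dense size of all 8293 rows

print(subset_bytes)  # 151464000
print(full_bytes)    # 1256090952
```

So converting the whole of fea to dense would take roughly 1.26 GB, compared with about 4.7 MB in sparse form.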
Other than that, the rest is self-explanatory:
fea: term-frequency matrix, rows are documents, columns are terms
gnd: category of each document, where numel(unique(gnd)) == 65
trainIdx/testIdx: split of instances (documents) for classification purposes, contains indices of rows, used as: tr = fea(trainIdx,:); tt = fea(testIdx,:);