Target Encoding in Python Model - encoding

I made a model in python and this uses target encoding. I used a dataset with 25000 rows and this gets divided into training and test data sets. The model is really working fine. However, I now want to run the model on totally fresh data - say just one row of data in an excel file. I need to know the code for it and will really appreciate it if someone can help. I am somewhat new to python. Here is the part of code I have written to create the training and test data sets from 25000 rows and train the model on training and predict on the test. However, I need the code that runs this model that uses target encoding to predict fresh data. If I need to post more code for greater clarity please let me know.
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2)
rf = RandomForestClassifier(n_jobs=-1)
rf.fit(train_x.values, train_y.values)
pred_train = rf.predict(train_x.values)
pred = rf.predict(test_x.values)
Thanks

You might want to have have a look the comment section in this notebook-
here
"After we apply target encoding on the train data and target. We can have the result for one category like column A has a,b,c. Then we calculate the mean of each a,b,c in column A and apply it to the test data.We then apply it to test using the pd.merge function."

Related

Using Weka NaiveBayes with Matlab

I created a NaiveBayes model in Weka. I exported the model to disk. I now want to inject this model into MATLAB 2018, so that I can check how it performs via some data that I am receiving.
I load my model in MATLAB, by stating something like this:
loadedModel = weka.core.SerializationHelper.read('myweka.model');
I then create a Weka Instance object, and let it contain this data:
instance = infrequent,low,high,medium-high,high,medium,medium,low,low
If I run these two commands:
loadedModel.distributionForInstance(instance)
loadedModel.classifyInstance(instance)
I see the following output:
0.0001
0.9999
1
This is odd to me because if I observe the same record in WEKA ui, I see the same instance with probabilities 0.993 and 0.007, classified as '2'. (I can load the same model multiple times from disk in WEKA, and reproduce this behavior, which is correct) After further investigation, I noticed that regardless of the sequence of attributes my Instance object has, I always get the same probability output and the same classification by invoking the model via MATLAB.
There are some posts on the net that share the same problem, like these:
Always getting the same output
Weka - Classifier returns the same distribution for any input
However, the recommended solution to call 'instance.setClassMissing()' did not solve my issue. Is there anything I am missing, or can try to do in order to further troubleshoot the issue?
Does your test instance has same structure as your train set? If not, you need yo provide the same structure.
Weka indexes nominal attributes and stores the indices internally. So the nominal attributes order in train file is important. For example if your attribute is mapped as low=>0, high=>1 in training, you need to map them like this in your test set. Usually this is achieved by serializing the train header with the model.
Sample code for creating train header:
Instances trainHeader = new Instances(instances, 0);
trainHeader.setClassIndex(instances.classIndex());
When creating a new instance set its dataset:
Instance instance = ...
instance.setDataset(trainHeader);

Gaussian mixture model in pyspark

I have gone through the link https://spark.apache.org/docs/latest/mllib-clustering.html regarding fitting a GMM in pyspark. I have carried out the same operation successfully in python, but after several iteration, I am unable to run in pyspark.
The questions i have are as follow;
1. The link mentioned above & another example of fitting GMM in pyspark I checked, takes a txt file with no column headings. I have a csv with 17 columns. The code is,
data = sc.textFile("..path/mydata.csv")
parsedData = data.map(lambda line: array([float(x) for x in line.strip().split(' ')]))
This worked, but when i am trying to fit GaussianMixture.train specifying some components, It is not working.
If the data used in the examples have no column headings, how can I judge which column is coming from which distribution & how the change in pattern is appearing?
How can I get heat-map from here so that whenever a new data comes in, I will use my trained model's heat-map to judge the distribution pattern of my new test data & can point out the mis-matches.
Thanks.

Orange3 concatenate tables with different targets

I have two input datafiles to use in orange, one corresponds to the train set (with targets "A", "B" and "C") and the other to the unknown samples ( with targets "D" and "E" to be able to identify the unknown samples in the scatterplot of the two first principal components).
I have applied PCA to the train dataset and through a python script i have reapplied the PCA transformation to the test dataset, however the result have a ? in the target value for all entries in the unknown samples set.
I have tried to merge the train and unknown samples sets with the merge table widget, and apparently it does the same, all samples in train are correct, but the unknown samples have ? as targets.
The only way i managed to have this running properly is to have unknown samples and train set on the same input file. Which is not practical for obvious reasons.
Is there any way to fix this?
Please note that i have tried to change the domain.class_var and the target value directly on the transformed unknown samples, but it also alters the domain of the train dataset. Apparently when the new table is created it just have a reference to the domain of the original train data after PCA.
I have managed it by converting the data into numpy arrays concatenate them and then back to table.
Here is the code if anyone is interested:
import numpy
from Orange.data.table import Table
from Orange.data import Domain, DiscreteVariable, ContinuousVariable
trnsfrmd_knwn_data = numpy.array(in_object)
trnsfrmd_unkwn_data = numpy.array(Table(in_object.domain,in_data))
ndx = list(set(trnsfrmd_knwn_data[:,len(trnsfrmd_knwn_data[0])-1].tolist()))[-1] + 1
trnsfrmd_unkwn_data[:,len(trnsfrmd_knwn_data[0])-1] = numpy.array([i for i in range(0, len(trnsfrmd_unkwn_data))]) + ndx
targets = in_object.domain.class_var.values + in_data.domain.class_var.values
dm = Domain([ContinuousVariable(x.name) for x in in_object.domain.attributes], DiscreteVariable('region', values=targets))
out_data = Table.from_numpy(dm, numpy.append(trnsfrmd_knwn_data,trnsfrmd_unkwn_data,axis=0))

Partitioning dataset into train, test and validate subset (Matlab)

Im new to using Matlab and i'm trying to achieve the following situation:
I've one dataset of 7000+ entries. The goal is to train a classification tree (fitctree) on this data. I've seperated the data into a matrix with observations (predictors) and a matrix with classes(class). To partition the data i'm using cvpartition. Everything works fine up until this point.
Problem: I want to create three subsets with data: 1 training set, 1 validation set and 1 testing set. I want to train the tree using the training set, and validate its performance using the validation set. After tweeking the parameters I want to run the final test on the test data partition.
To partition the data I tried creating a cvpartition, which works, e.g.
cvpart = cvpartition(class, 'k', 10);
and then performing another cvpartition on that testing set, seperating this into another two sets:
cvpart2 = cvpartition(cvpart.TestSize, 'k', 10);
Sadly, when validating the performance of the tree this doesn't seem to work. When I skip the seond cvpartition, and validate the performance on the test set of cvpart, the model performs perfectly.
Update: after days I found that it seems to work when using it in this way:
cvpart2 = cvpartition(cvpart.TrainSize, 'k', 10);
Anyone care to explain me why it does work in this way, but not when using the test set?
Hope you guys can help me out;)
Kind regards.

How to use Decision Tree Classification Matlab?

I have data in form of rows and columns where rows represent a record and column represents its attributes.
I also have the labels (classes) for those records.
I know about decision trees concept and I would like to use matlab for classification of unseen records using decision trees.
How can this be done? I followed this link but its not giving me correct output-
Decision Tree in Matlab
Essentially I want to construct a decision tree based on training data and then predict the labels of my testing data using that tree. Can someone please give me a good and working example for this ?
I used following code to achieve it. And it is working correctly
function DecisionTreeClassifier(trainingFile, testingFile, labelsFile, outputFile)
training = csvread(trainingFile);
labels = csvread(labelsFile);
testing = csvread(testingFile);
tree = ClassificationTree.fit(training,labels)
prediction = predict(tree, testing)
csvwrite(outputFile, prediction)
ClassificationTree.fit will be removed in a future release. Use fitctree instead.