What is the best approach to deal with continuous and nominal attributes to predict a target? - naivebayes

I tried binning the continuous attribute, but the accuracy was lower than when I used just the one nominal attribute to predict, and I'm wondering why. I used ComplementNB as the model because my raw data is imbalanced.
I'd appreciate any suggestions.
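
The binning approach I tried looks roughly like this (a minimal sketch; X_cont and y are placeholders for my continuous columns and target):

    from sklearn.naive_bayes import ComplementNB
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import KBinsDiscretizer

    # Bin each continuous column into 5 quantile bins, then one-hot encode
    # the bins so ComplementNB sees non-negative, count-like features.
    model = make_pipeline(
        KBinsDiscretizer(n_bins=5, encode="onehot", strategy="quantile"),
        ComplementNB(),
    )
    model.fit(X_cont, y)  # X_cont, y are hypothetical placeholders

Could the number of bins or the binning strategy be what hurts the accuracy?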


Is duplicating data a valid way to fix bias?

I’m reading a paper in the area of engineering. They have a labelled dataset which is biased. There are many more instances labelled A than B. They want to train a classifier to predict the A or B label based on some inputs (states).
The authors say:
To artificially remedy this problem, random replicas of the B states are incorporated into the dataset to even out the lot.
I don't know much about data analytics, but this doesn't sound very valid to me. Is it?
This type of data is normally called imbalanced data. What the author said is right: one way to deal with imbalanced data is to duplicate minority-class instances to even out the classes (though rather than adding replicas purely at random, it is better to look at the data patterns when adding data). There are many algorithms and methods to deal with imbalanced classification; this thread might help:
https://datascience.stackexchange.com/questions/24392/why-we-need-to-handle-data-imbalance
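
For illustration, here is a minimal sketch of random oversampling with the imbalanced-learn library; X and y are hypothetical placeholders for a feature matrix and label vector with many more A than B labels, and SMOTE (from the same library) is the pattern-aware alternative mentioned above:

    from collections import Counter
    from imblearn.over_sampling import RandomOverSampler

    # Replicate minority-class rows at random until the classes are even.
    ros = RandomOverSampler(random_state=42)
    X_res, y_res = ros.fit_resample(X, y)  # X, y are hypothetical placeholders
    print(Counter(y), "->", Counter(y_res))

Whichever method you use, resample only the training split, never the validation or test data, or your evaluation will be biased.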

How to improve a TensorFlow object detection model?

I need to recognize captchas for a project. I did this using the object_detection API provided by TensorFlow.
I also added 500 captcha samples, labelling the images as XML with LabelImg and then converting them to TFRecord.
Besides that, I used "faster_rcnn_inception_v2_coco_2018_01_28".
The problem is that the model's accuracy is very low.
My questions are:
Can the problem be solved by increasing the amount of training data?
Should I change my algorithm?
How effective would using YOLOv3 be instead of the object detection API provided by TensorFlow?
Q. Can the problem be solved by increasing the amount of training data?
A. That depends on how much more data you can get. I don't think simply increasing the amount of training data is a good approach on its own.
Consider fine-tuning an existing trained model to detect your object class. If you fine-tune, be careful with class-label assignment, because existing trained models such as YOLOv3, Faster R-CNN, etc. have no "captcha" label in their training datasets.
I recommend referring to this website, which can help you fine-tune the model.
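As a rough illustration of what fine-tuning involves with the TensorFlow Object Detection API, these are the parts of pipeline.config you would typically edit; the paths and the class count below are hypothetical placeholders for your own files:

    model {
      faster_rcnn {
        num_classes: 36  # hypothetical: one class per captcha character
        # ... other model settings elided ...
      }
    }
    train_config {
      # Start from the pre-trained COCO checkpoint instead of random weights.
      fine_tune_checkpoint: "faster_rcnn_inception_v2_coco_2018_01_28/model.ckpt"
      # ... other training settings elided ...
    }
    train_input_reader {
      label_map_path: "annotations/label_map.pbtxt"  # hypothetical path
      tf_record_input_reader { input_path: "annotations/train.record" }
    }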
Q. Should I change my algorithm?
A. Do as you wish.
Q. How effective would using YOLOv3 be instead of the object detection API provided by TensorFlow?
A. In my opinion, the two models are much the same if you don't need to consider inference time.

How to deal with data when making a decision tree

I am trying to make a decision tree for a dataset I got from Kaggle.
Since I don't have any experience with real-life datasets, I have no idea how to deal with cleaning, integrating, and scaling the data (mainly scaling).
For example, let's say I have a feature with real-number values. I want to turn that feature into something like categorical data by binning the values into a specific number of groups (for building the decision tree).
In this case, I have no idea how many groups are reasonable for the purposes of a decision tree.
I am sure it depends on the distribution of the feature and the number of unique values in the target, but I don't know how to find a good guess by looking at them.
My best guess is to divide the feature's values into about as many groups as there are unique values of the target. (I don't even know if this makes sense.)
When I learned this at school, every feature already came with 2-5 categories, so I never had to worry about it, but real life is totally different from school.
Please help me out.
For a decision tree, numerical data can stay numerical, and categorical data should be in dummy-variable style. No scaling is needed for the numerical columns.
To process categorical data, use one-hot encoding. Before one-hot encoding, make sure each category occurs reasonably often (>= 5% of rows); otherwise, group the rare categories together.
Also consider other models. Decision trees are good, but they are old school and easy to overfit.
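
For example, a minimal sketch with pandas, assuming a hypothetical DataFrame df with a categorical column "city"; categories below the 5% threshold are folded into an "other" bucket before one-hot encoding:

    import pandas as pd

    # df is a hypothetical DataFrame with a categorical column "city".
    freq = df["city"].value_counts(normalize=True)
    rare = freq[freq < 0.05].index  # categories under the 5% threshold
    df["city"] = df["city"].where(~df["city"].isin(rare), "other")
    df = pd.get_dummies(df, columns=["city"])  # one-hot encode the column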
You can use decision tree regressors, which eliminate the need for stratifying real numbers into categories: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
When you do this, it will help to scale the input data to zero mean and unit variance; this helps prevent any large-magnitude inputs from dominating the model.
That being said, a decision tree may not be the best option. Try an SVM or an ANN, or (most likely) some ensemble of many models (or even just a random forest).
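
As a minimal sketch of that setup with scikit-learn (X_train and y_train are hypothetical placeholders for your feature matrix and real-valued target):

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeRegressor

    # Standardize to zero mean / unit variance, then fit the tree directly
    # on the real-valued features, with no binning step needed.
    model = make_pipeline(StandardScaler(), DecisionTreeRegressor(max_depth=5))
    model.fit(X_train, y_train)  # hypothetical training data

(Strictly speaking, tree splits are invariant to monotonic scaling, so the scaler mainly matters if you switch to the SVM or ANN suggested above.)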

Is Cross Validation enough to ensure that there is no Overfitting in a classification algorithm?

I have a data set with 45 observations of one class and 55 observations of another class. Moreover, I am using 4 different features, which were previously chosen using a feature-selection filter, though the results of that procedure were somewhat strange.
On the other hand, I am using cross-validation and getting good accuracy results (75% to 85%) from different classifiers, since I'm using the Classification Learner app in MATLAB. Does this ensure that there is no overfitting, or might there still be a chance of it? How can I make sure there is no overfitting?
That really depends on the training data set you have available. If the data available to you isn't representative enough, you will not get a good model regardless of the methods you use for training and validation.
With that in mind, if you are sure your data is representative (has the same distribution of values for any subset of "important" attributes as the global set of all data), then cross-validation is good enough to rely on.
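
One practical check, sketched here in Python with scikit-learn rather than MATLAB (X, y and the SVC classifier are hypothetical stand-ins for your own data and model), is to compare training and validation scores; a large gap suggests overfitting even when the cross-validated accuracy looks good:

    from sklearn.model_selection import cross_validate
    from sklearn.svm import SVC

    # X, y are hypothetical placeholders for your 100 observations and labels.
    scores = cross_validate(SVC(), X, y, cv=5, return_train_score=True)
    print("train accuracy:", scores["train_score"].mean())
    print("validation accuracy:", scores["test_score"].mean())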

Why do we need cross-validation in the multiSVM method for image classification?

I am new to image classification and currently working on the SVM (support vector machine) method for classifying four groups of images with the multisvm function. In my algorithm, the training and testing data are randomly selected each time, and the performance varies every time. Someone suggested doing cross-validation, but I did not understand why we need cross-validation and what its main purpose is. My actual data set consists of a 28×40000 training matrix and a 17×40000 testing matrix. How do I do cross-validation with this data set? Help me, thanks in advance.
Cross-validation is used to select your model. The out-of-sample error can be estimated from your validation error, so you would like to select the model with the least validation error. Here the model refers to the features you want to use and, more importantly, the gamma and C in your SVM. After cross-validation, you will use the selected gamma and C (those with the least average validation error) to train on the whole training data.
You may also want to estimate the performance of your features and parameters to avoid both high bias and high variance. Whether your model suffers from underfitting or overfitting can be observed from both the in-sample error and the validation error.
Typically, 10-fold cross-validation is used.
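
As a sketch of that gamma/C selection in Python with scikit-learn (multisvm is a MATLAB function, so X_train and y_train here are hypothetical placeholders for your 28×40000 training matrix and its labels):

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Try each (C, gamma) pair with cross-validation and keep the best pair.
    # With only ~7 samples per class, 5 folds is safer than the usual 10.
    param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X_train, y_train)  # hypothetical training data
    print(search.best_params_, search.best_score_)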
I'm not familiar with multiSVM, but you may want to check out libSVM; it is a popular, free SVM library with support for a number of different programming languages.
Here they describe cross-validation briefly. It is a way to avoid overfitting the model by breaking the training data up into subgroups; in this way you can find a model (defined by a set of parameters) that fits all the subgroups well.
For example, in the following picture they plot the validation accuracy contours over the gamma and C values that parameterize the model. From this contour plot you can tell that the heuristically optimal values (of those tested) are the ones that give an accuracy closer to 84 than to 81.
Refer to this link for more detailed information on cross-validation.
You always need to cross-validate your experiments in order to guarantee a correct scientific approach. For instance, if you don't cross-validate, the results you report (such as accuracy) might be highly biased by your particular test set. In an extreme case, your training step might have fit the data very weakly while your test step happened to go very well. This applies to ALL machine learning and optimization experiments, not only SVMs.
To avoid such problems, just divide your initial dataset in two (for instance), then train on the first set and test on the second, and repeat the process inversely, training on the second and testing on the first. This will make any biases in the data visible to you. As someone suggested, you can take the division further: 10-fold cross-validation means dividing your data set into 10 parts, then training on 9 and testing on 1, repeating the process until you have tested on all parts.
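
That 10-fold procedure looks roughly like this in Python with scikit-learn (X and y are hypothetical arrays standing in for your images and labels):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.svm import SVC

    # Split into 10 folds; train on 9, test on the held-out 1, then rotate.
    kf = KFold(n_splits=10, shuffle=True, random_state=0)
    accuracies = []
    for train_idx, test_idx in kf.split(X):  # X, y are hypothetical
        clf = SVC().fit(X[train_idx], y[train_idx])
        accuracies.append(clf.score(X[test_idx], y[test_idx]))
    print("mean accuracy:", np.mean(accuracies))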