Is duplicating data a valid way to fix bias? - classification

I’m reading a paper in the area of engineering. They have a labelled dataset which is biased. There are many more instances labelled A than B. They want to train a classifier to predict the A or B label based on some inputs (states).
The authors say:
To artificially remedy this problem, random replicas of the B states are incorporated into the dataset to even out the lot.
I don’t know much about data analytics, but this doesn’t sound very valid to me. Is it?

This kind of data set is normally called imbalanced data. What the authors did is a standard way of dealing with class imbalance: duplicate (oversample) minority-class examples until the classes are evened out. Instead of duplicating purely at random, it is often better to look at the data patterns and add examples accordingly (for instance, synthetic-sampling methods such as SMOTE). There are many algorithms and methods for dealing with imbalanced classification; the following question might help you:
https://datascience.stackexchange.com/questions/24392/why-we-need-to-handle-data-imbalance
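As a concrete illustration of what the paper describes, here is a minimal sketch of random oversampling in Python using scikit-learn's resample. The array names and the 'A'/'B' labels are placeholders, not from the paper:

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced dataset: many 'A' states, few 'B' states (placeholder data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array(['A'] * 90 + ['B'] * 10)

# Randomly replicate minority-class rows until the classes are even.
X_minority, y_minority = X[y == 'B'], y[y == 'B']
X_extra, y_extra = resample(
    X_minority, y_minority,
    replace=True,                                   # sample with replacement -> duplicates
    n_samples=(y == 'A').sum() - (y == 'B').sum(),  # how many copies are missing
    random_state=0,
)

X_balanced = np.vstack([X, X_extra])
y_balanced = np.concatenate([y, y_extra])
print(np.unique(y_balanced, return_counts=True))  # ('A', 'B') -> (90, 90)
```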

Related

How to improve a TensorFlow object detection model?

I need to recognize captchas for a project. I did this using the object_detection API provided by TensorFlow.
I also added 500 captcha samples by annotating the images as XML with LabelImg and then converting them to TFRecord.
Besides that, I used "faster_rcnn_inception_v2_coco_2018_01_28".
The problem is that the accuracy of the model is very low.
My questions are:
Can the problem be solved by increasing the amount of training data?
Should I change my algorithm?
How effective would it be to use YOLOv3 instead of the object detection API provided by TensorFlow?
Q. Can the problem be solved by increasing the amount of training data?
A. That would depend on how much more data you can get. I don't think that simply increasing the amount of training data is a good approach on its own.
Consider fine-tuning an existing trained model to detect your object class. If you want to fine-tune a model, you need to be careful with class label assignment, because existing trained models such as YOLOv3, Faster R-CNN, etc. have no "captcha" label in their training datasets.
I recommend that you refer to this website, which can help you fine-tune the model.
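If it helps, here is a rough sketch of the general fine-tuning idea using Keras transfer learning. This is not the TF Object Detection API workflow from the question; the backbone, input size, class count, and the train_ds/val_ds datasets are purely illustrative assumptions:

```python
import tensorflow as tf

# Load a model pre-trained on ImageNet, without its classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pre-trained features first

# Attach a new head for our own classes (e.g. captcha characters).
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(36, activation="softmax"),  # 36 = illustrative class count
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(train_ds, validation_data=val_ds, epochs=10)  # assumed tf.data datasets
```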
Q. Should I change my algorithm?
A. Do as you wish.
Q. How effective would it be to use YOLOv3 instead of the object detection API provided by TensorFlow?
A. In my opinion, the two models perform much the same if you don't need to consider inference time.

How to deal with data when making a decision tree

I am trying to make a decision tree for a dataset I got from Kaggle.
Since I don't have any experience with real-life datasets, I have no idea how to handle cleaning, integrating, and scaling the data (mainly scaling).
For example, let's say I have a feature with real-valued numbers. I want to turn that feature into something like categorical data by binning it into a specific number of groups (for making the decision tree).
In this case, I have no idea how many groups is reasonable for the purposes of a decision tree.
I am sure it depends on the distribution of the data for the feature and the number of unique values in the target, but I don't know how to make a good guess by looking at the distribution and the target.
My best guess is to divide the feature into roughly the same number of groups as the number of unique values of the target. (I don't even know if this makes sense.)
When I learned this at school, every feature already came with 2-5 categories, so I didn't have to worry about it, but real life is totally different from school.
Please help me out.
For a decision tree, numerical data can stay numerical, and categorical data should be in dummy-variable style. No scaling is needed for numerical columns.
To process categorical data, use one-hot encoding. Before one-hot encoding, make sure each category level occurs reasonably often (>= 5% of rows); otherwise, group the rare levels together.
Also consider other models. Decision trees are good, but they are old school and easy to overfit.
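A minimal sketch of that workflow in Python; the column names and values are made up for illustration:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy frame: one numerical column and one categorical column (placeholder names).
df = pd.DataFrame({
    "age": [22, 35, 58, 41, 29, 63],
    "city": ["NY", "LA", "NY", "SF", "LA", "NY"],
    "label": [0, 1, 1, 0, 0, 1],
})

# Numerical columns stay as they are; categorical columns become dummies.
X = pd.get_dummies(df[["age", "city"]], columns=["city"])
y = df["label"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)
print(tree.predict(X))
```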
You can use a decision tree regressor, which eliminates the need for binning real numbers into categories: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
When you do this, it can help to scale the input data to zero mean and unit variance; this helps prevent any large-valued inputs from dominating the model.
That being said, a decision tree may not be the best option. Try an SVM or an ANN, or (most likely) some ensemble of many models (or even just a random forest).
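For completeness, a small sketch of the regressor variant working directly on continuous features (toy synthetic data, just to show that no binning is required):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Continuous feature and target: no need to bin the inputs into categories.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

reg = DecisionTreeRegressor(max_depth=4, random_state=0)
reg.fit(X, y)
print(reg.predict([[2.5], [7.0]]))
```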

Is Cross Validation enough to ensure that there is no Overfitting in a classification algorithm?

I have a data set with 45 observations for one class and 55 observations for another class. Moreover, I am using 4 different features which were previously chosen by a feature selection filter, though the results of that procedure were somewhat strange.
On the other hand, I am using cross-validation and getting good accuracy results (75% to 85%) from different classifiers, since I'm using the classificationLearner in Matlab. Does this ensure that there is no overfitting? Or might there still be a chance of it? How can I make sure that there is no overfitting?
That really depends on the training data set you have available. If the data available to you isn't representative enough, you will not get a good model regardless of the methods you use for training and validation.
With that in mind, if you are sure your data is representative (it has the same distribution of values for any subset of "important" attributes as the global set of all data), then cross-validation is good enough to rely on.
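The question uses Matlab's classificationLearner, but the same check is easy to sketch in Python with scikit-learn. The data below is a random stand-in with the same shape as in the question (100 samples, 4 features), so the scores themselves are meaningless; the point is only the mechanics of cross-validation:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the 100-sample, 4-feature data set in the question.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 45 + [1] * 55)

# 5-fold cross-validation: each fold is scored on data the model did not see.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
# A large gap between training accuracy and these held-out scores would point to overfitting.
```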

Neural network preprocessing

I'm working on a school project about data prediction with a neural network. I have my data normalized, and I have three inputs and one output.
My questions are:
What is the difference between the training data and the test data? (Is the training data supposed to be the input data and the test data the output data?)
What is the testing rate? Is it any random number, or is there a rule for finding it?
What is the training error?
And my final question: after training, I remember something about an error measure. I'm not quite sure, but do I need to find the error of my prediction, and how do I find it?
I know my questions might not be clear, but I'm just confused and tried to explain it as much as I can.
Answering in a school spirit: suppose you are given 10 solved exercises to study. You study them, and then the teacher tests you on those exact exercises. You do well on the test. However, there is an important question: why did you do well? Did you really understand the exercises, or did you just memorize them? And how can the teacher know?
There is only one way: the teacher must test you on a set of similar but different exercises. If you also do well on those, you have developed a feel for the subject and are able to generalize the knowledge you acquired. If not, you probably memorized them without understanding a thing, and that kind of knowledge is useless.
The same happens with neural networks. You use one set of patterns (the training set) to train them. But to check whether they are able to generalize, you have to test them on a different set of patterns (the test set) without the network knowing the correct answers. Ideally, you should see only a small difference in performance between the two sets; that is good generalization ability.
So, both the training and test sets are inputs, not outputs. The only difference is when you use them: the training set during training, and the test set after it. The training/test rate is the percentage you got correct on the training/test set respectively. The training/test error is the complement, that is, the percentage you got wrong.
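In code, the split described above looks roughly like this. It is a sketch with made-up data and scikit-learn's MLPClassifier standing in for "a neural network"; the three-input, one-output shape just mirrors the question:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy data: 3 inputs and 1 binary output, echoing the question's setup.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out a test set the network never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0)
net.fit(X_train, y_train)

train_error = 1 - net.score(X_train, y_train)  # training error
test_error = 1 - net.score(X_test, y_test)     # test error: the generalization check
print(train_error, test_error)
```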
I know this reply might come late, but I will just complement the previous answer by saying that in supervised learning both the training set and the test set consist of input-output pairs. Structurally they are exactly the same: a set of inputs and their corresponding outputs (or labels). There is no difference in structure between the two.
As blue_note said, they are just used on different occasions: one during training and one after it.

K-means Analysis on the KDD Cup 99 Dataset

What kind of knowledge/inference can be drawn from a k-means clustering analysis of the KDD Cup 99 dataset?
We plotted some graphs using Matlab; they look like this:
Experiment 1: Plot of dst_host_count vs serror_rate
Experiment 2: Plot of srv_count vs srv_serror_rate
Experiment 3: Plot of count vs serror_rate
I just extracted some features from the KDD Cup data set and plotted them.
The main problem I am facing is that, due to a lack of domain knowledge, I can't determine what inference can be drawn from these graphs. Another problem: if I have chosen the wrong axes, then what would be the correct features to choose?
I have very little time to complete this, so I don't understand the background very well.
Any help with interpreting these graphs would be appreciated.
What kind of unsupervised learning can be done using this data and these plots?
Just to give you some domain knowledge: the KDD Cup data set contains information about different aspects of network connections. Each sample contains 'connection duration', 'protocol used', 'source/destination byte size' and many other features that describe one connection. Now, some of these connections are malicious. The malicious samples have their own unique 'fingerprint' (a unique combination of feature values) that separates them from the good ones.
What kind of knowledge/inference can be drawn from a k-means clustering analysis of the KDD Cup 99 dataset?
You can try k-means clustering to initially separate the normal and the bad connections. Also, the bad connections themselves fall into 4 main categories, so you can try k = 5, where one cluster will capture the good connections and the other 4 will capture the 4 malicious categories. Look at the first section of the tasks page for details.
You can also check whether some dimensions in your data set are highly correlated. If so, you can use something like PCA to reduce the number of dimensions. Look at the full list of features. After PCA, your data will have a simpler representation (with fewer dimensions) and might give better performance.
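A rough sketch of both ideas (k = 5 clustering after an optional PCA step) in Python with scikit-learn. The feature matrix below is a random placeholder, since loading and encoding the actual KDD Cup 99 file is not shown here:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Placeholder: X should be your already-loaded numeric KDD Cup 99 feature matrix.
X = np.random.default_rng(0).normal(size=(1000, 38))

X_scaled = StandardScaler().fit_transform(X)

# Optionally collapse correlated dimensions before clustering.
X_reduced = PCA(n_components=10).fit_transform(X_scaled)

# One cluster for normal traffic, four for the main attack categories.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_reduced)
print(np.bincount(kmeans.labels_))  # how many samples fell into each cluster
```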
What should be the correct chosen feature?
This is hard to tell. The data is currently very high-dimensional, so I don't think visualizing 2 or 3 of the dimensions in a graph will give you a good heuristic for which dimensions to choose. I would suggest the following (a rough code sketch follows this list):
Use all the dimensions for training and testing the model. This will give you a measure of the best performance.
Then try removing one dimension at a time and see how much the performance is affected. For example, suppose you remove the dimension 'srv_serror_rate' from your data and the model's performance comes out almost the same. Then you know this dimension is not giving you any important information about the problem at hand.
Repeat step two until you can't find any dimension that can be removed without hurting performance.
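A sketch of that ablation loop in Python. The data, the choice of a random forest as the model, and the cross-validation scoring are all assumptions made just to keep the example self-contained:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholders: X (n_samples x n_features) and y stand in for your labelled KDD Cup data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = rng.integers(0, 2, size=500)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Step 1: baseline performance with all dimensions.
baseline = cross_val_score(clf, X, y, cv=5).mean()

# Step 2: drop one dimension at a time and compare against the baseline.
for i in range(X.shape[1]):
    X_drop = np.delete(X, i, axis=1)
    score = cross_val_score(clf, X_drop, y, cv=5).mean()
    print(f"without feature {i}: {score:.3f} (baseline {baseline:.3f})")

# Step 3: permanently remove dimensions whose removal does not hurt performance,
# then repeat until every remaining dimension matters.
```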