Validation close to training accuracy in a cross validation but model is bad

This is a question from an exam:
Your task was to build a model that predicts whether an e-mail is spam or not spam
for a mail service. You had access to a huge data set of more than 10 Million mails,
where 99% of the mails were not spam. Your model always achieved an accuracy of
99% on both your validation and training set using 100-fold cross-validation. The
owner of the mail service was very impressed by the high accuracy and your model
is used to filter real spam mails. However, users of the mail service complain that
spam mails are not filtered at all. What is the most likely reason for this?
What is a possible solution?
So I think the reason might be that the model always says that an email is not spam. But I am lost on the second part (maybe the first as well).

Yes, the problem is that the training dataset is very imbalanced.
If the model predicts not-spam for every email, it still ends up with 99% accuracy.
There are 3 ways to handle this:
1- Balance the dataset by undersampling (downsampling) the majority class, which is the "not-spam" emails.
2- Use class weights. See the link below:
https://www.analyticsvidhya.com/blog/2020/10/improve-class-imbalance-class-weights/
3- The most important approach is to use the following performance metrics instead of accuracy:
Precision: True Positives / All Predicted Positives
Recall: True Positives / All Actual Positives
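To make the point about metrics concrete, here is a minimal Python sketch. It is not from the exam: the 1,000-mail toy set and all function names are made up for illustration. It scores a "model" that always answers not-spam, and also computes the common "balanced" class-weight heuristic, n_samples / (n_classes * class_count), as one possible starting point for option 2.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the spam label."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 by convention when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {label: len(labels) / (len(counts) * count)
            for label, count in counts.items()}


# Hypothetical toy set: 1,000 mails, 1% spam (label 1), 99% not spam (label 0).
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # a "model" that always says not-spam

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = safe_ratio(tp, tp + fp)  # no predicted positives at all here
recall = safe_ratio(tp, tp + fn)
weights = balanced_class_weights(y_true)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
print(weights)
```

Accuracy comes out at 0.99 while recall is 0, which is the situation described in the exam question. The class weights make each spam mail count far more than each not-spam mail during training.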

Related

Average result of 50 Netlogo Simulation_Agent Based Simulation

I run an infectious disease spread model similar to the "VIRUS" model in the Models Library, changing the "infectiousness".
I did 20 runs each for infectiousness values of 98%, 95%, and 93%, and the maximum infected count was 74.05, 73, and 78.9 respectively (the peak was at tick 38 for all 3 infectiousness values).
[I took the average of the infected count for each tick and took the maximum of these averages as the "maximum infected".]
I was expecting the maximum infected count to decrease when the infectiousness is reduced, but it didn't. As I understand it, this happens because I considered the average values over the simulation runs. (It is as if I were considering a new simulation run with the average infected count for each tick.)
I want to be able to say that I am considering all 20 simulation runs. Is there a way to do that other than the averaging I used?

In the Models Library Virus model, with the other parameters at their default settings and those high infectiousness values, what I see when I run the model is a periodic variation in the numbers of the three classes of person. Look at the plot in the lower left corner, and you'll see this. What is happening, I believe, is this:
1. When there are many healthy, non-immune people, that means that there are many people who can get infected, so the number of infected people goes up, and the number of healthy people goes down.
2. Soon after that, the number of sick, infectious people goes down, because they either die or become immune.
3. Since there are now more immune people, and fewer infectious people, the number of non-immune healthy people grows; they are reproducing. (See "How it works" in the Info tab.) But now we have returned to the situation in step 1, ... so the cycle continues.
If your model is sufficiently similar to the Models Library Virus model, I'd bet that this is part of what's happening. If you don't have a plot window like the Virus model, I recommend adding it.
Also, you didn't say how many ticks you are running the model for. If you run it for a short number of ticks, you won't notice the periodic behavior, but that doesn't mean it hasn't begun.
What this all means is that increasing infectiousness wouldn't necessarily increase the maximum number infected: a faster rate of infection means that the number of individuals who can be infected drops faster. I'm not sure that the maximum number infected over the whole run is an interesting number, with this model and a high infectiousness value. It depends on what you are trying to understand.
One of the great things about NetLogo and some other ABM systems is that you can watch the system evolve over time, using various tools such as plots, monitors, etc. as well as just looking at the agents move around or change states over time. This can help you understand what is going on in a way that a single number like an average won't. Then you can use this insight to figure out a more informative way of measuring what is happening.
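As one possible way to keep every run in view, here is a small Python sketch. It assumes the per-tick infected counts of each run have been exported from NetLogo into lists; the numbers and function names below are made up. It compares the peak of the averaged curve (the measure described in the question) with the peak of each individual run, summarised by its mean and standard deviation.

```python
from statistics import mean, stdev


def peak_of_mean_curve(runs):
    """Average the runs tick by tick, then take the maximum of that curve."""
    per_tick_mean = [mean(values_at_tick) for values_at_tick in zip(*runs)]
    return max(per_tick_mean)


def per_run_peaks(runs):
    """Take the maximum infected count of each run separately."""
    return [max(run) for run in runs]


# Hypothetical infected counts: 3 runs, 4 ticks each.
runs = [
    [1, 5, 9, 4],
    [1, 2, 6, 10],
    [1, 8, 3, 2],
]

averaged_peak = peak_of_mean_curve(runs)
peaks = per_run_peaks(runs)

print("peak of the averaged curve:", averaged_peak)
print("per-run peaks:", peaks)
print("mean and spread of per-run peaks:", mean(peaks), stdev(peaks))
```

When runs peak at different ticks, the averaged curve flattens those peaks, so its maximum can sit well below the typical per-run peak. Reporting the per-run peaks together with their spread uses all runs and also shows how much of the difference between infectiousness settings may be run-to-run noise.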
Another model where you can see a similar kind of periodic pattern is Wolf-Sheep Predation. I recommend looking at that. It may be easier to understand the pattern. (If you are interested in mathematical models of this kind of phenomenon, look up Lotka-Volterra models.)
(Real virus transmission can be more complicated, because a person (or other animal) is a kind of big "island" where viruses can reproduce quickly. If they reproduce too quickly, this can kill the host, and prevent further transmission of the virus. Sometimes a virus that reproduces more slowly can harm more people, because there is time for them to infect others. This blog post by Elliott Sober gives a relatively simple mathematical introduction to some of the issues involved, but his simple mathematical models don't take into account all of the complications involved in real virus transmission.)
EDIT: You added a comment Lawan, saying that you are interested in modeling COVID-19 transmission. This paper, Variation and multilevel selection of SARS‐CoV‐2 by Blackstone, Blackstone, and Berg, suggests that some of the dynamics that I mentioned in the preceding remarks might be characteristic of COVID-19 transmission. That paper is about six months old now, and it offered some speculations based on limited information. There's probably more known now, but this might suggest avenues for further investigation.
If you're interested, you might also consider asking general questions about virus transmission on the Biology Stackexchange site.

What to do with not enough training data?

My problem is that I don't have enough training data for my NN. It is trying to predict the result of a soccer game given the previous games, which I would say is a regression task.
The training data are results of soccer games of the last 15 seasons (which are about 4500 games). Getting to new data would be hard and would take a lot of time.
What should I do now?
Is it good to duplicate the data?
Should I input randomized data? (Maybe noise but I'm not quite sure what that is)
If there is no way of creating more data,
I should probably turn up the learning rate right? (I have it sitting at 0.01 and the momentum at 0.9)
I am using mini batches of 32 training samples each. Since I don't have a lot of training data, I don't have a lot of mini batches. Should I stop using them?

To start from the beginning: this is a very theoretical question that is not directly related to programming. In future, I recommend posting such questions over at the Data Science Stackexchange.
To go into your problem: 4500 samples is not as bad as it sounds, depending on the exact task at hand. Are you trying to predict the match results (i.e. which team is the winner?), are you looking for more specific predictions (across a lot of different, specific teams)?
If you can make sure that you have a reasonable amount of data per class, you can work with even fewer samples than you have. Simply duplicating the data will not help you much, since you are very likely to just overfit on the samples you are seeing, without much of an improvement; or rather, you will get the same results as training over a longer period (since essentially you see every sample twice per epoch, instead of once).
Again, what usually happens after long training periods is overfitting, so nothing gained here.
Your second suggestion is generally called data augmentation. Instead of simply copying samples, you alter them enough to make them look "different" to the network. But be careful! Data augmentation works well for some inputs, like images, since the change in input is significant enough to not represent the same sample, but still contains meaningful information about the class (a horizontally mirrored image of a cat still shows a "valid cat", unlike a vertically mirrored image, which is more unrealistic in the real world).
Essentially, it depends on your input features whether it makes sense to add noise. If you change the results of the previous game, even a minor change in input (adding/subtracting one goal at random) can significantly change the prediction you make.
If you slightly perturb ELO scores by a random number, on the other hand, the input value will not be too different, but "different enough" to be used as a novel example.
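A minimal Python sketch of that kind of noise-based augmentation follows. The feature names (home_elo, away_elo, home_goals_last), the size of the shift, and the helper functions are all assumptions for illustration; the question does not say which features the network actually uses. Only the rating-like features are perturbed, while goal counts and the target are left untouched.

```python
import random


def jitter_ratings(sample, rating_keys, max_shift, rng):
    """Copy one sample and shift only its rating-like features by a small amount."""
    noisy = dict(sample)
    for key in rating_keys:
        noisy[key] = sample[key] + rng.uniform(-max_shift, max_shift)
    return noisy


def augment(samples, rating_keys, copies, max_shift, seed=0):
    """Return the original samples plus `copies` jittered variants of each."""
    rng = random.Random(seed)
    augmented = list(samples)
    for sample in samples:
        for _ in range(copies):
            augmented.append(jitter_ratings(sample, rating_keys, max_shift, rng))
    return augmented


# One hypothetical game record; "result" is the training target.
games = [
    {"home_elo": 1500.0, "away_elo": 1450.0, "home_goals_last": 2, "result": 1},
]

augmented = augment(games, ["home_elo", "away_elo"], copies=3, max_shift=10.0)
for row in augmented:
    print(row)
```

Whether a shift of 10 rating points is "small" depends on the rating scale in use, so that value would need tuning. Augmented copies should only go into the training set, never into the validation or test data.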
Turning up the learning rate is not a good idea, since you are essentially just letting the network converge more towards the specific samples. On the contrary, I would argue that the current learning rate is still too high, and you should certainly not increase it.
Regarding mini batches, I think I have referenced this a million times now, but always consider smaller minibatches. From a theoretical point of view, you are more likely to converge to a local minimum.

How to train a neural network to detect presence of a pattern?

The question phrasing is vague - and I'm happy to change it based on feedback. But, I am trying to train a neural network to detect fraudulent transactions on a website. I have a lot of parameters as inputs (time of day, country of origin, number of visits in the past month, number of visits from unique IP's in the past month, number of transactions, average transaction size, etc, etc). Altogether, perhaps over 100 inputs. The inputs have been normalized and sanitized and they form a time series. Historically, I can look at my data and identify that a transaction was fraudulent of Type A or of Type B or not fraudulent. My training set can be large (in the thousands or tens of thousands of points).
Ultimately, I want an indicator: Fraud of Type A, Fraud of Type B or Not Fraud. Generally, fraudulent transactions tend to fit a pattern. I can't exactly identify the pattern (that's why I'm using a NN). However, not fraudulent transactions can be of any type of pattern. So it seems strange to identify things into 3 buckets when the third bucket is "other".
If this were a switch / case, it would be something like:
switch transactionData
when transactionData.transaction_count < 0.2 && ....
FRAUD_A
when transactionData.transaction_count > 0.5 && ....
FRAUD_B
else
NOT_FRAUD
Obviously, these are simplified cases, but my problem runs into how to properly train for the else case. Do I get three types of data (fraud_a, fraud_b and not_fraud) and train them? Or is there another way to train for other?

It is usually perfectly OK to have an OTHER (NOT FRAUD) class along with the ones you are interested in. But I understand your concern. Basically, it's the job of the NN to learn the "case/switch", and in most cases it will learn the right one, assuming that most samples belong to the NOT FRAUD class. In some pathological cases classifiers can learn a different idea, e.g. that everything is FRAUD A unless proven otherwise. You usually can't control this directly, but it can be changed by creating better features and some other tricks. For now, proceed with what you have and see what happens.
One thing you can do is to train two classifiers: one (FRAUD/NOT FRAUD), and then, if fraud is detected, feed the data into a second two-class classifier (FRAUD A/FRAUD B). Sometimes (but not always) this works better.
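The two-classifier idea can be sketched in Python as below. This only shows how the two stages are chained; the two toy functions stand in for trained models and are purely hypothetical (their thresholds borrow the simplified numbers from the question's switch example).

```python
from typing import Callable, Sequence

Features = Sequence[float]


def make_cascade(is_fraud: Callable[[Features], bool],
                 fraud_type: Callable[[Features], str]) -> Callable[[Features], str]:
    """Chain a FRAUD/NOT FRAUD stage with a FRAUD A/FRAUD B stage."""
    def classify(features: Features) -> str:
        if not is_fraud(features):
            return "NOT_FRAUD"
        # Only transactions flagged by stage one reach the second model.
        return fraud_type(features)
    return classify


# Toy stand-ins for the two trained classifiers (assumptions, not real models).
def toy_is_fraud(features: Features) -> bool:
    transaction_count = features[0]
    return transaction_count < 0.2 or transaction_count > 0.5


def toy_fraud_type(features: Features) -> str:
    return "FRAUD_A" if features[0] < 0.2 else "FRAUD_B"


classify = make_cascade(toy_is_fraud, toy_fraud_type)
for transaction in ([0.1], [0.3], [0.9]):
    print(transaction, classify(transaction))
```

With this layout, the second model is trained only on fraudulent examples, so it never has to model the open-ended "other" class.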

Newbie to Neural Networks

Just starting to play around with Neural Networks for fun after playing with some basic linear regression. I am an English teacher so don't have a math background and trying to read a book on this stuff is way over my head. I thought this would be a better avenue to get some basic questions answered (even though I suspect there is no easy answer). Just looking for some general guidance put in layman's terms. I am using a trial version of an Excel Add-In called NEURO XL. I apologize if these questions are too "elementary."
My first project is related to predicting a student's Verbal score on the SAT based on a number of test scores, GPA, practice exam scores, etc. as well as some qualitative data (gender: M=1, F=0; took SAT prep class: Y=1, N=0; plays varsity sports: Y=1, N=0).
In total, I have 21 variables that I would like to feed into the network, with the output being the actual score (200-800).
I have 9000 records of data spanning many years/students. Here are my questions:
1. How many records of the 9000 should I use to train the network?
1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
2. If I split the data into an even number, say 9x1000 (or however many) and created a network for each one, then tested the results of each of these 9 on the other 8 sets to see which had the lowest MSE across the samples, would this be a valid way to "choose" the best network if I wanted to predict the scores for my incoming students (not included in this data at all)?
3. Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
4. I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
E.g.
750-800 = 10
700-740 = 9
etc.
Is there any benefit to doing this or should I just go ahead and try to predict the exact score?
5. What if ALL I cared about was whether or not the score was above or below 600. Would I then just make the output 0(below 600) or 1(above 600)?
5a. I read somewhere that it's not good to use 0 and 1, but instead 0.1 and 0.9 - why is that?
5b. What about -1(below 600), 0(exactly 600), 1(above 600), would this work?
5c. Would the network always output -1, 0, 1 - or would it output fractions that I would then have to roundup or rounddown to finalize the prediction?
6. Once I have found the "best" network from Question #3, would I then play around with the different parameters (number of epochs, number of neurons in hidden layer, momentum, learning rate, etc.) to optimize this further?
6a. What about the Activation Function? Will Log-sigmoid do the trick or should I try the other options my software has as well (threshold, hyperbolic tangent, zero-based log-sigmoid).
6b. What is the difference between log-sigmoid and zero-based log-sigmoid?
Thanks!

First a little bit of meta content about the question itself (and not about the answers to your questions).
I have to laugh a little that you say 'I apologize if these questions are too "elementary."' and then proceed to ask the single most thorough and well thought out question I've seen as someone's first post on SO.
I wouldn't be too worried that you'll have people looking down their noses at you for asking this stuff.
This is a pretty big question in terms of the depth and range of knowledge required, especially the statistical knowledge needed and familiarity with Neural Networks.
You may want to try breaking this up into several questions distributed across the different StackExchange sites.
Off the top of my head, some of it definitely belongs on the statistics StackExchange, Cross Validated: https://stats.stackexchange.com/
You might also want to try out https://datascience.stackexchange.com/ , a beta site specifically targeting machine learning and related areas.
That said, there is some of this that I think I can help to answer.
Anything I haven't answered is something I don't feel qualified to help you with.
Question 1
How many records of the 9000 should I use to train the network? 1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
Randomizing the selection of training data is probably not a good idea.
Keep in mind that truly random data includes clusters.
A random selection of students could happen to consist solely of those who scored above a 30 on the ACT exams, which could potentially result in a bias in your result.
Likewise, if you only select students whose SAT scores were below 700, the classifier you build won't have any capacity to distinguish between a student expected to score 720 and a student expected to score 780 -- they'll look the same to the classifier because it was trained without the relevant information.
You want to ensure a representative sample of your different inputs and your different outputs.
Because you're dealing with input variables that may be correlated, you shouldn't try to do anything too complex in selecting this data, or you could mistakenly introduce another bias in your inputs.
Namely, you don't want to select a training data set that consists largely of outliers.
I would recommend trying to ensure that your inputs cover all possible values for all of the variables you are observing, and all possible results for the output (the SAT scores), without constraining how these requirements are satisfied.
I'm sure there are algorithms out there designed to do exactly this, but I don't know them myself -- possibly a good question in and of itself for Cross Validated.
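One technique that matches this description is stratified sampling: group the records by a coarse band of the output score and draw the same fraction from every band. The Python sketch below is only an illustration under assumptions; the record layout, the 100-point band width, and the function names are made up, and NEURO XL may or may not offer something equivalent.

```python
import random
from collections import defaultdict


def stratified_split(records, key, train_fraction, seed=0):
    """Split records so every stratum contributes the same fraction to training."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    train, held_out = [], []
    for members in strata.values():
        rng.shuffle(members)  # random choice *within* each stratum
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        held_out.extend(members[cut:])
    return train, held_out


def score_band(record, width=100):
    """Map a score to a coarse band, e.g. 640 -> 6 (hypothetical banding)."""
    return record["sat_verbal"] // width


# Hypothetical records: scores 200, 210, ..., 790 (60 records, 10 per band).
records = [{"sat_verbal": score} for score in range(200, 800, 10)]
train, held_out = stratified_split(records, score_band, train_fraction=0.8)

print(len(train), len(held_out))
print(sorted({score_band(r) for r in train}))
```

The same function can be applied with a key built from an input variable instead of the output score, although stratifying on many variables at once quickly produces strata that are too small to split.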
Question 3
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
My understanding is that this is not recommended as the input to a Neural Network, but I may be wrong.
The convergence of the network should handle this for you.
Every node in the network will assign a weight to its inputs, multiply them by their weights, and sum those products as a core part of its computation.
That means that every node in the network is searching for some coefficients for each of their inputs.
To do this, all inputs will be converted to numeric values -- so conditions like gender will be translated into "0=MALE,1=FEMALE" or something similar.
For example, a node's metric might look like this at a given point in time:
2*ACT_SCORE + 0*GENDER + (-5)*VARSITY_SPORTS ...
The coefficients for each value are exactly what the network is searching for as it converges.
If you change the scale of a value, like ACT_SCORE, you just change the scale of the coefficient that will be found by the reciprocal of that scaling factor.
The result should still be the same.
There are other concerns in terms of accuracy (computers have limited capacity to represent small fractions) and speed that may enter this, but not being familiar with NEURO XL, I can't say whether or not they apply for this technology.
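The reciprocal-scaling argument can be checked on the simplest possible case, a single linear unit fitted exactly by least squares. The Python sketch below uses made-up ACT and SAT numbers and a hypothetical fit_line helper. It only demonstrates that the best-fitting coefficient shrinks by the scaling factor while the predictions stay the same; it says nothing about how quickly an iteratively trained network reaches that solution, which can still depend on the input scale.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = w * x + b with a single input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    w = covariance / variance
    b = mean_y - w * mean_x
    return w, b


# Made-up data that follows sat = 2 * act + 400 exactly.
act = [18.0, 22.0, 27.0, 31.0, 35.0]
sat = [2 * a + 400 for a in act]

w, b = fit_line(act, sat)

# Rescale the input by a factor of 10 and fit again.
act_scaled = [10 * a for a in act]
w_scaled, b_scaled = fit_line(act_scaled, sat)

print("original input:", w, b)
print("input x10:     ", w_scaled, b_scaled)
```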
Question 4
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
This will reduce accuracy, although you should converge to a solution much faster with fewer possible outputs (scores).
Neural Networks actually describe very high-dimensional functions in their input variables.
If you reduce the granularity of that function's output space, you essentially state that you don't care about local minima and maxima in that function, especially around the borders between your output scores.
As a result, you are sacrificing information that may be an essential component of the "true" function that you are searching for.
I hope this has been helpful, but you really should break this question down into its many components and ask them separately on different sites -- potentially some of them do belong here on StackOverflow as well.

How to use KNN to classify data in MATLAB?

I'm having problems in understanding how K-NN classification works in MATLAB.
Here's the problem, I have a large dataset (65 features for over 1500 subjects) and its respective classes' label (0 or 1).
According to what's been explained to me, I have to divide the data into training, test and validation subsets to perform supervised training on the data, and classify it via K-NN.
First of all, what's the best ratio to divide the 3 subgroups (1/3 of the size of the dataset each?).
I've looked into ClassificationKNN/fitcknn functions, as well as the crossval function (ideally to divide data), but I'm really not sure how to use them.
To sum up, I wanted to
- divide data into 3 groups
- "train" the KNN (I know it's not a method that requires training, but the equivalent to training) with the training subset
- classify the test subset and get its classification error/performance
- what's the point of having a validation test?
I hope you can help me, thank you in advance
EDIT: I think I was able to do it, but, if that's not asking too much, could you see if I missed something? This is my code, for a random case:
nfeats=60;ninds=1000;
trainRatio=0.8;valRatio=.1;testRatio=.1;
kmax=100; %for instance...
data=randi(100,nfeats,ninds);   % random features, one column per subject
class=randi(2,1,ninds);         % random class labels (1 or 2)
[trainInd,valInd,testInd] = dividerand(ninds,trainRatio,valRatio,testRatio);
train=data(:,trainInd);
test=data(:,testInd);
val=data(:,valInd);
train_class=class(:,trainInd);
test_class=class(:,testInd);
val_class=class(:,valInd);
precisionmax=0;
koptimal=0;
% choose k by its accuracy on the validation subset
for know=1:kmax
    %is it the same thing use knnclassify or fitcknn+predict??
    predicted_class = knnclassify(val', train', train_class',know);
    mdl = fitcknn(train',train_class','NumNeighbors',know) ;
    label = predict(mdl,val');
    consistency=sum(label==val_class')/length(val_class);
    if consistency>precisionmax
        precisionmax=consistency;
        koptimal=know;
    end
end
% refit with the best k (not the last loop value) and score the test subset once
mdl_final = fitcknn(train',train_class','NumNeighbors',koptimal) ;
label_final = predict(mdl_final,test');
consistency_final=sum(label_final==test_class')/length(test_class);
Thank you very much for all your help

For your 1st question "what's the best ratio to divide the 3 subgroups" there are only rules of thumb:
The amount of training data is most important: the more the better.
Thus, make it as big as possible, and definitely bigger than the test or validation data.
Test and validation data have a similar function, so it is convenient to give them the same amount of data. It is important that each has enough data to reveal over-adaptation (overfitting), so they should be picked fully at random from the data set.
Consequently, a 50/25/25 or 60/20/20 partitioning is quite common. But if your total amount of data is small in relation to the total number of weights of your chosen topology (e.g. 10 weights in your net and only 200 cases in the data), then 70/15/15 or even 80/10/10 might be better choices.
Concerning your 2nd question "what's the point of having a validation test?":
Typically, you train the chosen model on your training data and then estimate the "success" by applying the trained model to unseen data - the validation set.
If you were to stop all efforts to improve accuracy at this point, you indeed would not need three partitions of your data. But typically, you feel that you can improve the success of your model by e.g. changing the number of weights or hidden layers or ... and now a big loop starts to run with many iterations:
1) change weights and topology, 2) train, 3) validate, not satisfied, goto 1)
The long-term effect of this loop is that you increasingly adapt your model to the validation data, so the results get better not because you improved your topology so intelligently, but because you unconsciously learned the properties of the validation set and how to cope with them.
Now, the final and only valid accuracy of your neural net is estimated on really unseen data: the test set. This is done only once and is also useful to reveal over-adaptation. To prevent any adaptation to the test set, you are not allowed to start a second, even bigger loop at this point!
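The whole train / validate / test procedure can be sketched in plain Python as follows. Everything here is an assumption for illustration: the toy two-cluster data, the 60/20/20 split, the tiny k-NN implementation and the candidate values of k. The point is only the order of operations: k is chosen on the validation set, and the test set is scored exactly once at the end.

```python
import random
from collections import Counter


def split_indices(n, train_ratio, val_ratio, seed=0):
    """Shuffle 0..n-1 and cut it into train / validation / test index lists."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_train = round(n * train_ratio)
    n_val = round(n * val_ratio)
    return indices[:n_train], indices[n_train:n_train + n_val], indices[n_train + n_val:]


def knn_predict(train_x, train_y, point, k):
    """Majority vote among the k training points closest to `point`."""
    def squared_distance(i):
        return sum((a - b) ** 2 for a, b in zip(train_x[i], point))
    nearest = sorted(range(len(train_x)), key=squared_distance)[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def knn_accuracy(train_x, train_y, xs, ys, k):
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(xs, ys))
    return hits / len(ys)


def take(indices, sequence):
    return [sequence[i] for i in indices]


# Toy data: two well-separated 2-D clusters with 100 points each.
rng = random.Random(1)
data = [[rng.gauss(c, 1.0), rng.gauss(c, 1.0)] for c in (0.0, 3.0) for _ in range(100)]
labels = [0] * 100 + [1] * 100

train_i, val_i, test_i = split_indices(len(data), train_ratio=0.6, val_ratio=0.2)
train_x, train_y = take(train_i, data), take(train_i, labels)
val_x, val_y = take(val_i, data), take(val_i, labels)
test_x, test_y = take(test_i, data), take(test_i, labels)

# The tuning loop: only the validation set is used to pick k.
best_k = max(range(1, 16, 2),
             key=lambda k: knn_accuracy(train_x, train_y, val_x, val_y, k))

# The test set is touched exactly once, after k has been fixed.
test_accuracy = knn_accuracy(train_x, train_y, test_x, test_y, best_k)
print("best k:", best_k, "test accuracy:", test_accuracy)
```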
