What to do when one feature has very large importance/weight? - classification

I am new to data science and am currently trying to predict customer churn for a company that offers subscription-based bookings management software. Its customers are gyms.
I have a small, unbalanced dataset of historical data (False: 670, True: 230) with 2 numerical predictors: age (days since subscription) and number of active days in the last month (days on which a customer, i.e. a gym, had bookings), and 1 categorical predictor: logo (boolean, whether a customer uploaded a logo in the software).
The predictors have the following negative correlations with churn:
logo: 0.65
num_active_days_last_month: 0.40
age: 0.3
Feature importances look similar, with logo having the most weight.
When I predict, the model (logistic regression) classifies customers without a logo as churners, even though they are quite active.
For example, the following two customers have almost the same predicted probability of churning:
Customer 1:
logo: True
num_active_days_last_month: 1
age: 30 days
Customer 2:
logo: False
num_active_days_last_month: 22
age: 250 days
I understand that this is what the model learned from the dataset, but it just doesn't make sense to me that something like logo is assigned such a strong importance.
Is there any way to handle this without completely excluding logo from the predictors? Maybe somehow decrease its importance instead?
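For reference, here is a minimal sketch of the kind of setup I have in mind (assuming scikit-learn; the column names are placeholders). Standardizing the numeric inputs, balancing the classes, and using stronger L2 regularization (a smaller C) is one way I could shrink the logo coefficient without dropping the feature:

    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Placeholder column names for illustration.
    num_cols = ["age", "num_active_days_last_month"]
    cat_cols = ["logo"]  # already 0/1, passed through unscaled

    pre = ColumnTransformer(
        [("num", StandardScaler(), num_cols)],
        remainder="passthrough",  # keeps the logo column as-is
    )

    # class_weight compensates for the 670/230 imbalance; a smaller C means
    # stronger L2 regularization, which shrinks all coefficients, logo included.
    model = Pipeline([
        ("pre", pre),
        ("clf", LogisticRegression(class_weight="balanced", C=0.1)),
    ])

    # df would be the historical dataset with a boolean "churn" column:
    # model.fit(df[num_cols + cat_cols], df["churn"])
    # print(model.named_steps["clf"].coef_)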
Thank you in advance for any help or suggestions.

Related

Interesting results from LSTM RNN: lagged results for train and validation data

As an introduction to RNNs/LSTMs (stateless), I'm training a model on sequences of 200 days of previous data (X), including things like daily price change, daily volume change, etc. For the labels (Y) I have the % price change from the current price to the price in 4 months. Basically I want to estimate the market direction, not be 100% accurate. But I'm getting some odd results...
When I then test my model with the training data, I notice the output from the model is a perfect fit when compared to the actual data; it just lags by exactly 4 months:
When I shift the data by 4 months, you can see it's a perfect fit.
I can obviously understand why the training data would be a very close fit as it has seen it all during training - but why the 4-month lag?
It does the same thing with the validation data (note the area I highlighted with the red box for future reference):
Time-shifted:
It's not as close-fitting as the training data, as you'd expect, but still too close for my liking - I just don't think it can be this accurate (see the little blip in the red rectangle as an example). I think the model is acting as a naive predictor; I just can't work out how or why it could possibly be doing that.
To generate this output from the validation data, I input a sequence of 200 timesteps, but there's nothing in the data sequence that says what the % price change will be in 4 months - it's entirely disconnected, so how is it so accurate? The 4-month lag is obviously another indicator that something's not right here; I don't know how to explain it, but I suspect the two are linked.
I tried to explain the observation based on some general underlying concepts:
If you don't provide a time-lagged X input dataset (lagged by t-k, where k is the number of time steps), then you are essentially feeding the LSTM today's closing price to predict today's closing price during training. The model will overfit and behave exactly as if the answer were already known (data leakage).
If Y is the predicted percentage change (i.e. X * (1 + Y%) = the price 4 months in the future), then the value predicted at the present date is really just the future price discounted by Y%. Since Y% is small, the predicted series ends up looking like the actual price shifted by 4 months.
Okay, I realised my error; the way I was using the model to generate the forecast line was naive. For every date in the graph above, I was getting an output from the model and then applying the forecasted % change to the actual price for that date - that gives the predicted price in 4 months' time.
Given that markets usually only move within a margin of 0-3% (plus or minus) over a 4-month period, my forecast was always going to closely mirror the current price, just with a 4-month lag.
At every date the predicted output was being re-based on the actual price, so the model line could never deviate far from the actual; it would be essentially the same curve, within a margin of 0-3% (plus or minus).
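A minimal numpy sketch of the mistake (the price series and the model's predictions are synthetic stand-ins, just for illustration): re-basing each prediction on that date's actual price and plotting it at the forecast date reproduces the actual price with a 4-month lag.

    import numpy as np

    np.random.seed(0)
    horizon = 84                                   # roughly 4 months of trading days
    price = 100 * np.cumprod(1 + np.random.normal(0, 0.01, 1000))

    # Stand-in for the model output: small predicted % changes, since markets
    # rarely move more than a few percent over the horizon.
    pred_pct = np.random.normal(0, 0.015, len(price) - horizon)

    # What I was plotting: apply the predicted % change to the actual price at
    # each date, and place the result `horizon` days ahead on the time axis.
    forecast_value = price[:-horizon] * (1 + pred_pct)

    # Because |pred_pct| is small, forecast_value is almost exactly the actual
    # price at the forecast origin, i.e. the actual series lagged by `horizon`.
    lagged_actual = price[:-horizon]
    actual_at_forecast_date = price[horizon:]
    print(np.corrcoef(forecast_value, lagged_actual)[0, 1])            # ~= 1.0
    print(np.corrcoef(forecast_value, actual_at_forecast_date)[0, 1])  # noticeably lower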
Really, the graph isn't important, and it doesn't reflect the way I'll use the output anyway, so I'm going to ditch trying to get a visual representation, and concentrate on trying to find different metrics that lower the validation loss.

How to train a neural network to detect presence of a pattern?

The question phrasing is vague - and I'm happy to change it based on feedback. But I am trying to train a neural network to detect fraudulent transactions on a website. I have a lot of parameters as inputs (time of day, country of origin, number of visits in the past month, number of visits from unique IPs in the past month, number of transactions, average transaction size, etc.). Altogether, perhaps over 100 inputs. The inputs have been normalized and sanitized, and they form a time series. Historically, I can look at my data and identify that a transaction was fraudulent of Type A, fraudulent of Type B, or not fraudulent. My training set can be large (in the thousands or tens of thousands of points).
Ultimately, I want an indicator: Fraud of Type A, Fraud of Type B, or Not Fraud. Generally, fraudulent transactions tend to fit a pattern. I can't exactly identify the pattern (that's why I'm using a NN). However, non-fraudulent transactions can follow any pattern at all. So it seems strange to sort things into 3 buckets when the third bucket is really "other".
If this were a switch / case, it would be something like:
switch transactionData
when transactionData.transaction_count < 0.2 && ....
FRAUD_A
when transactionData.transaction_count > 0.5 && ....
FRAUD_B
else
NOT_FRAUD
Obviously, these are simplified cases, but my problem comes down to how to properly train for the else case. Do I gather three types of data (fraud_a, fraud_b and not_fraud) and train on them? Or is there another way to train for "other"?
It is usually perfectly OK to have an OTHER (NOT FRAUD) class alongside the classes you are interested in. But I understand your concern. Basically, it is the NN's job to learn the "switch/case", and in most cases it will learn the right one, assuming that most samples belong to the NOT FRAUD class. In some pathological cases classifiers can learn a different idea, e.g. that everything is FRAUD A unless proven otherwise. You usually can't control this directly, but it can be influenced by creating better features and some other tricks. For now, proceed with what you have and see what happens.
One thing you can do is train two classifiers: one for FRAUD/NOT FRAUD, and then, if fraud is detected, feed the data into a second two-class classifier for FRAUD A/FRAUD B. Sometimes (but not always) this works better.
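A rough sketch of that two-stage idea, assuming scikit-learn; the classifier choice, the helper names (fit_two_stage, predict_two_stage), and the label strings are placeholders, not a recommendation of a specific model:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def fit_two_stage(X, y):
        """Stage 1: fraud vs not fraud. Stage 2: fraud type, trained on fraud rows only."""
        is_fraud = y != "NOT_FRAUD"
        stage1 = MLPClassifier(max_iter=1000).fit(X, is_fraud)
        stage2 = MLPClassifier(max_iter=1000).fit(X[is_fraud], y[is_fraud])
        return stage1, stage2

    def predict_two_stage(stage1, stage2, X):
        out = np.full(len(X), "NOT_FRAUD", dtype=object)
        fraud_mask = stage1.predict(X).astype(bool)
        if fraud_mask.any():
            out[fraud_mask] = stage2.predict(X[fraud_mask])
        return out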

Average Effectiveness of Variables in Binary Logistic Model

I have constructed a binary logistic model to check the effect of various variables on the probability that a consumer buys. I have 5 different brands, and in the model I have 5 price variables that are specific to each brand (an interaction between the brand dummy and price). So my output looks like this:
                Coefficient   P-value
Price_Brand_A   0.25          0.02
Price_Brand_B   0.50          0.01
Price_Brand_C   0.10          0.09
Price_Brand_D   0.40          0.15
Price_Brand_E   0.65          0.02
What I would like to ask is whether it is correct to say something about the overall effect of price, rather than about the specific brands. For instance, would it be correct to take the average of the coefficients and say that the average effect of price is 0.38? Or is there some statistical procedure I should follow to report the overall effect of price? And would the same apply to the p-values?
I am working with SPSS and I am new to modelling, so any help would be appreciated.
Many Thanks
If you test an interaction hypothesis, you have to include a number of terms in your model. In this case, you would have to include:
Base effect of price
Base effects of brands (dummies)
Interaction effects of brand dummies * price.
Since you have 5 brands, you will have to include 4 of the 5 dummy variables. The dummy you leave out will be your reference category. The same goes for the interaction terms. In this case, the base effect of price will be the effect of price for the reference brand. The base effects of the dummies will be the differences between brands if the price were 0. The interaction effects can be interpreted in two different ways. One way is to say that an interaction term is the additional price effect of one brand compared to the reference brand. The other way is to say that the interaction effect is the additional difference between that brand and the reference brand when price increases by one.
If you want to know what the average effect of price is, why would you include interaction terms? In that case, I would leave out the interactions in a first model, and then include the interactions in a second model to show that the average effect of price is not accurate once you look at the effect for each brand.
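For what it's worth, here is a sketch of those two specifications in Python/statsmodels (you are on SPSS, so treat this purely as a translation of the model formulas; the data below is a tiny synthetic stand-in, not your data):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic stand-in for the purchase data (placeholder only).
    rng = np.random.default_rng(0)
    n = 1000
    df = pd.DataFrame({
        "brand": rng.choice(list("ABCDE"), n),
        "price": rng.uniform(1, 10, n),
    })
    df["buy"] = (rng.random(n) < 1 / (1 + np.exp(-(0.3 * df["price"] - 2)))).astype(int)

    # Model 1: a single, average price effect plus brand dummies.
    m1 = smf.logit("buy ~ price + C(brand)", data=df).fit(disp=0)

    # Model 2: brand-specific price effects via brand x price interactions.
    # "price" is now the effect for the reference brand (A); each
    # "price:C(brand)[T.X]" term is the additional price effect for brand X.
    m2 = smf.logit("buy ~ price * C(brand)", data=df).fit(disp=0)

    print(m1.params)
    print(m2.params)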
Maybe you could post some more output? I suspect you got more out of it than you included in your question.
Good luck!

Newbie to Neural Networks

Just starting to play around with neural networks for fun after playing with some basic linear regression. I am an English teacher, so I don't have a math background, and trying to read a book on this stuff is way over my head. I thought this would be a better avenue to get some basic questions answered (even though I suspect there is no easy answer). I am just looking for some general guidance put in layman's terms. I am using a trial version of an Excel add-in called NEURO XL. I apologize if these questions are too "elementary."
My first project is related to predicting a student's Verbal score on the SAT based on a number of test scores, GPA, practice exam scores, etc. as well as some qualitative data (gender: M=1, F=0; took SAT prep class: Y=1, N=0; plays varsity sports: Y=1, N=0).
In total, I have 21 variables that I would like to feed into the network, with the output being the actual score (200-800).
I have 9000 records of data spanning many years/students. Here are my questions:
How many records of the 9000 should I use to train the network?
1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
If I split the data into an even number, say 9x1000 (or however many) and created a network for each one, then tested the results of each of these 9 on the other 8 sets to see which had the lowest MSE across the samples, would this be a valid way to "choose" the best network if I wanted to predict the scores for my incoming students (not included in this data at all)?
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
E.g.
750-800 = 10
700-740 = 9
etc.
Is there any benefit to doing this or should I just go ahead and try to predict the exact score?
What if ALL I cared about was whether or not the score was above or below 600. Would I then just make the output 0(below 600) or 1(above 600)?
5a. I read somewhere that it's not good to use 0 and 1, but instead 0.1 and 0.9 - why is that?
5b. What about -1(below 600), 0(exactly 600), 1(above 600), would this work?
5c. Would the network always output -1, 0, 1 - or would it output fractions that I would then have to roundup or rounddown to finalize the prediction?
Once I have found the "best" network from Question #3, would I then play around with the different parameters (number of epochs, number of neurons in hidden layer, momentum, learning rate, etc.) to optimize this further?
6a. What about the activation function? Will log-sigmoid do the trick, or should I try the other options my software has as well (threshold, hyperbolic tangent, zero-based log-sigmoid)?
6b. What is the difference between log-sigmoid and zero-based log-sigmoid?
Thanks!
First a little bit of meta content about the question itself (and not about the answers to your questions).
I have to laugh a little that you say 'I apologize if these questions are too "elementary"' and then proceed to ask the single most thorough and well-thought-out question I've seen as someone's first post on SO.
I wouldn't be too worried that you'll have people looking down their noses at you for asking this stuff.
This is a pretty big question in terms of the depth and range of knowledge required, especially the statistical knowledge needed and familiarity with Neural Networks.
You may want to try breaking this up into several questions distributed across the different StackExchange sites.
Off the top of my head, some of it definitely belongs on the statistics StackExchange, Cross Validated: https://stats.stackexchange.com/
You might also want to try out https://datascience.stackexchange.com/ , a beta site specifically targeting machine learning and related areas.
That said, there is some of this that I think I can help to answer.
Anything I haven't answered is something I don't feel qualified to help you with.
Question 1
How many records of the 9000 should I use to train the network? 1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
Randomizing the selection of training data is probably not a good idea.
Keep in mind that truly random data includes clusters.
A random selection of students could happen to consist solely of those who scored above a 30 on the ACT exams, which could potentially result in a bias in your result.
Likewise, if you only select students whose SAT scores were below 700, the classifier you build won't have any capacity to distinguish between a student expected to score 720 and a student expected to score 780 -- they'll look the same to the classifier because it was trained without the relevant information.
You want to ensure a representative sample of your different inputs and your different outputs.
Because you're dealing with input variables that may be correlated, you shouldn't try to do anything too complex in selecting this data, or you could mistakenly introduce another bias in your inputs.
Namely, you don't want to select a training data set that consists largely of outliers.
I would recommend trying to ensure that your inputs cover all possible values for all of the variables you are observing, and all possible results for the output (the SAT scores), without constraining how these requirements are satisfied.
I'm sure there are algorithms out there designed to do exactly this, but I don't know them myself -- possibly a good question in and of itself for Cross Validated.
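As one concrete illustration (a sketch, not a prescription): scikit-learn can stratify a train/test split on a binned version of the output score, which at least guarantees coverage of the score range without hand-picking records. The data and bin edges below are made up:

    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(9000, 21))         # 21 input variables (synthetic)
    y = rng.integers(200, 801, size=9000)   # SAT verbal scores (synthetic)

    # Bin the continuous score so stratification has discrete groups to balance.
    score_bins = np.digitize(y, bins=[400, 500, 600, 700])

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=score_bins, random_state=0
    )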
Question 3
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
My understanding is that this is not recommended as the input to a neural network, but I may be wrong.
The convergence of the network should handle this for you.
Every node in the network will assign a weight to its inputs, multiply them by their weights, and sum those products as a core part of its computation.
That means that every node in the network is searching for a set of coefficients, one for each of its inputs.
To do this, all inputs will be converted to numeric values -- so conditions like gender will be translated into "0=MALE,1=FEMALE" or something similar.
For example, a node's metric might look like this at a given point in time:
2*ACT_SCORE + 0*GENDER + (-5)*VARSITY_SPORTS ...
The coefficients for each value are exactly what the network is searching for as it converges.
If you change the scale of a value, like ACT_SCORE, you just change the scale of the coefficient that will be found by the reciprocal of that scaling factor.
The result should still be the same.
There are other concerns in terms of accuracy (computers have limited capacity to represent small fractions) and speed that may enter this, but not being familiar with NEURO XL, I can't say whether or not they apply for this technology.
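A toy illustration of that rescaling point with an ordinary linear fit (a sketch; nothing here is NEURO XL specific): dividing an input by 10 simply multiplies the learned coefficient by 10, so the fitted relationship is unchanged.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    act_score = rng.uniform(1, 36, size=(200, 1))
    y = 2.0 * act_score[:, 0] + rng.normal(0, 0.1, 200)

    coef_raw = LinearRegression().fit(act_score, y).coef_[0]
    coef_scaled = LinearRegression().fit(act_score / 10, y).coef_[0]

    print(coef_raw)     # close to 2.0
    print(coef_scaled)  # close to 20.0 -- the coefficient absorbs the 1/10 scaling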
Question 4
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
This will reduce accuracy, although you should converge to a solution much faster with fewer possible outputs (scores).
Neural Networks actually describe very high-dimensional functions in their input variables.
If you reduce the granularity of that function's output space, you essentially state that you don't care about local minima and maxima in that function, especially around the borders between your output scores.
As a result, you are sacrificing information that may be an essential component of the "true" function that you are searching for.
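If you do decide to try the bucketed version anyway, the binning itself is a one-liner; a sketch with made-up bin edges (10 buckets of 60 points each across 200-800):

    import pandas as pd

    scores = pd.Series([480, 520, 610, 745, 790])
    # labels=False gives bucket indices 0-9; +1 maps them to buckets 1-10.
    buckets = pd.cut(scores, bins=range(200, 801, 60), labels=False) + 1
    print(buckets.tolist())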
I hope this has been helpful, but you really should break this question down into its many components and ask them separately on different sites -- potentially some of them do belong here on StackOverflow as well.

How to interpret the Microsoft Neural Network

I've created many types of neural networks with Microsoft's data tools, but I still struggle with the interpretation.
My neural network output gives some weird numbers. For the output attribute "Total Spent" it contains ranges with negative numbers (e.g. -384 to -58) even though the data set I use contains no negative numbers. What do these ranges actually mean, and how am I supposed to interpret them correctly?
I am using a data set of customer purchases, and I group by CustomerID. In the data set, some customers purchase thousands of dollars' worth of goods and others spend only fifty dollars.
Thank you in advance for the help.