Average Effectiveness of Variables in Binary Logistic Model

Average Effectiveness of Variables in Binary Logistic Model - average

I have constructed a binary logistic model to check the effect of various variables on the probability of a consumer to buy. I have 5 different brands, and in the model I have 5 price variables which are specific to the brand (interaction between brand dummy and price). So my output looks like this:
Coefficient P-value
Price_Brand_A 0.25 0.02
Price_Brand_B 0.50 0.01
Price_Brand_C 0.10 0.09
Price_Brand_D 0.40 0.15
Price_Brand_E 0.65 0.02
What I would like to ask, is if it is correct to say something about the overall effect of price, and not specifically about the brands. For instance would it be correct to take the average of the coefficients and say that the average effect of price is equal to 0.38? Or is there some statistical procedure I should follow to report the overall effect of price? Also would the same apply for the P-value?
I am working with spss and I am new at modelling so any help would be appreciated.
Many Thanks

If you test an interaction hypothesis, you have to include a number of terms in your model. In this case, you would have to include:
Base effect of price
Base effects of brands (dummies)
Interaction effects of brand dummies * price.
Since you have 5 brands, you will have to include 4 of the 5 dummy variables. The dummy you leave out will be your reference category. The same goes for the interaction terms. In this case, the base effect of price will be the effect of price for the reference category of brands. The base of dummies will be the difference between brands if the price would be 0. The interaction effects can be interpreted in two different ways. On way is to say that the effect of an interaction term would be the additional price effect of one brand, compared to the reference category of brands. The other way is to say that the interaction effect is the additional difference between the brand and the reference brand if price increases by one.
If you want to know what the average effect of price is, why would you include interaction terms? In that case, I would leave out the interactions in a first model, and then include the interactions to show that the average effect of price is not accurate if you would look at the effect for each brand.
Maybe you could post some more output? I think you got more out of it than you posted in your question?
Good luck!

Related

Interesting results from LSTM RNN : lagged results for train and validation data

As an introduction to RNN/LSTM (stateless) I'm training a model with sequences of 200 days of previous data (X), including things like daily price change, daily volume change, etc and for the labels/Y I have the % price change from current price to that in 4 months. Basically I want to estimate the market direction, not to be 100% accurate. But I'm getting some odd results...
When I then test my model with the training data, I notice the output from the model is a perfect fit when compared to the actual data, it just lags by exactly 4 months:
When I shift the data by 4 months, you can see it's a perfect fit.
I can obviously understand why the training data would be a very close fit as it has seen it all during training - but why the 4 months lag?
It does the same thing with the validation data (note the area I highlighted with the red box for future reference):
Time-shifted:
It's not as close-fitting as the training data, as you'd expect, but still too close for my liking - I just don't think it can be this accurate (see the little blip in the red rectangle as an example). I think the model is acting as a naive predictor, I just can't work out how/why it's possibly doing it.
To generate this output from the validation data, I input a sequence of 200 timesteps, but there's nothing in the data sequence that says what the %price change will be in 4 months - it's entirely disconnected, so how is it so accurate? The 4-month lag is obviously another indicator that something's not right here, I don't know how to explain that, but I suspect the two are linked.

I tried to explain the observation based on some general underlying concept:
If you don't provide a time-lagged X input dataset (lagged t-k where k is the time steps), then basically you will be feeding the LSTM with like today's closing price to predict the same today's closing price..in the training stage. The model will (over fit) and behave Exactly as the answer is known already (data leakage)
If the Y is the predicted percentage change (ie. X * (1 + Y%) = 4 months future price), the present value Yvalue predicted really is just the future discounted by the Y%
so the predicted value will have 4 months shift

Okay, I realised my error; the way I was using the model to generate the forecast line was naive. For every date in the graph above, I was getting an output from the model, and then apply the forecasted % change to the actual price for that date - that would give predicted price in 4 months' time.
Given the markets usually only move within a margin of 0-3% (plus or minus) over a 4 month period, that would mean my forecasts was always going to closely mirror the current price, just with a 4 month lag.
So at every date the predicted output was being re-based, so the model line would never deviate far from the actual; it'd be the same, but within a margin of 0-3% (plus or minus).
Really, the graph isn't important, and it doesn't reflect the way I'll use the output anyway, so I'm going to ditch trying to get a visual representation, and concentrate on trying to find different metrics that lower the validation loss.

Building propensity score for a cluster

I am working on an exercise to build influencer score for each user in my data set. What that means is that a user with higher engagement should get higher score and vice versa. however, i have many different type of engagement variables and i am not sure which one should weight higher.
so, i first did a cluster analysis to divide users into different group based on engagement activity using 5 different types of engagement. Based on this, i found that one of the cluster has high level of engagement across all the different types of engagement variables. This is the group i am interested in. however, it is possible that the group size i get may be smaller than the number of users i want to use in future. so, i want to now use these clusters and create a propensity score.
e.g. in the cluster analysis, say i get 5 clusters c1, c2,c3,c4,c5 and c5 is my cluster of interest. so, i give all users in c5 a value of 1 (= influencer) and i give all users in c1 to c4 a value of 0 (= not influencer). now, i use this binary variable and build a logistic regression model (using same engagement variables as used for clustering) to get propensity for everyone to an influencer. this way, i can change the threshold to reduce or increase the numbers of users i want to select.
Now, the issue i am running in is that one of the engagement variable is able to predict influencer very well and hence my propensity scores are very close to either 1 or 0 which defeats the purpose of why i wanted the propensity score in the first place.
S0, 2 questions -
1) is this approach of building a unsupervised classification and then using this to build supervised classification a sound approach of what i am trying to do?
2) how do i reduce the contribution from the variable that predicts influencer really well to ensure that i get much more smoother curve instead of values near 0 or 1. i don't want to remove this variable from the model as this is important from business perspective.

Newbie to Neural Networks

Just starting to play around with Neural Networks for fun after playing with some basic linear regression. I am an English teacher so don't have a math background and trying to read a book on this stuff is way over my head. I thought this would be a better avenue to get some basic questions answered (even though I suspect there is no easy answer). Just looking for some general guidance put in layman's terms. I am using a trial version of an Excel Add-In called NEURO XL. I apologize if these questions are too "elementary."
My first project is related to predicting a student's Verbal score on the SAT based on a number of test scores, GPA, practice exam scores, etc. as well as some qualitative data (gender: M=1, F=0; took SAT prep class: Y=1, N=0; plays varsity sports: Y=1, N=0).
In total, I have 21 variables that I would like to feed into the network, with the output being the actual score (200-800).
I have 9000 records of data spanning many years/students. Here are my questions:
How many records of the 9000 should I use to train the network?
1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
If I split the data into an even number, say 9x1000 (or however many) and created a network for each one, then tested the results of each of these 9 on the other 8 sets to see which had the lowest MSE across the samples, would this be a valid way to "choose" the best network if I wanted to predict the scores for my incoming students (not included in this data at all)?
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
E.g.
750-800 = 10
700-740 = 9
etc.
Is there any benefit to doing this or should I just go ahead and try to predict the exact score?
What if ALL I cared about was whether or not the score was above or below 600. Would I then just make the output 0(below 600) or 1(above 600)?
5a. I read somewhere that it's not good to use 0 and 1, but instead 0.1 and 0.9 - why is that?
5b. What about -1(below 600), 0(exactly 600), 1(above 600), would this work?
5c. Would the network always output -1, 0, 1 - or would it output fractions that I would then have to roundup or rounddown to finalize the prediction?
Once I have found the "best" network from Question #3, would I then play around with the different parameters (number of epochs, number of neurons in hidden layer, momentum, learning rate, etc.) to optimize this further?
6a. What about the Activation Function? Will Log-sigmoid do the trick or should I try the other options my software has as well (threshold, hyperbolic tangent, zero-based log-sigmoid).
6b. What is the difference between log-sigmoid and zero-based log-sigmoid?
Thanks!

First a little bit of meta content about the question itself (and not about the answers to your questions).
I have to laugh a little that you say 'I apologize if these questions are too "elementary."' and then proceed to ask the single most thorough and well thought out question I've seen as someone's first post on SO.
I wouldn't be too worried that you'll have people looking down their noses at you for asking this stuff.
This is a pretty big question in terms of the depth and range of knowledge required, especially the statistical knowledge needed and familiarity with Neural Networks.
You may want to try breaking this up into several questions distributed across the different StackExchange sites.
Off the top of my head, some of it definitely belongs on the statistics StackExchange, Cross Validated: https://stats.stackexchange.com/
You might also want to try out https://datascience.stackexchange.com/ , a beta site specifically targeting machine learning and related areas.
That said, there is some of this that I think I can help to answer.
Anything I haven't answered is something I don't feel qualified to help you with.
Question 1
How many records of the 9000 should I use to train the network? 1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
Randomizing the selection of training data is probably not a good idea.
Keep in mind that truly random data includes clusters.
A random selection of students could happen to consist solely of those who scored above a 30 on the ACT exams, which could potentially result in a bias in your result.
Likewise, if you only select students whose SAT scores were below 700, the classifier you build won't have any capacity to distinguish between a student expected to score 720 and a student expected to score 780 -- they'll look the same to the classifier because it was trained without the relevant information.
You want to ensure a representative sample of your different inputs and your different outputs.
Because you're dealing with input variables that may be correlated, you shouldn't try to do anything too complex in selecting this data, or you could mistakenly introduce another bias in your inputs.
Namely, you don't want to select a training data set that consists largely of outliers.
I would recommend trying to ensure that your inputs cover all possible values for all of the variables you are observing, and all possible results for the output (the SAT scores), without constraining how these requirements are satisfied.
I'm sure there are algorithms out there designed to do exactly this, but I don't know them myself -- possibly a good question in and of itself for Cross Validated.
Question 3
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
My understanding is that this is not recommended as the input to a Nerual Network, but I may be wrong.
The convergence of the network should handle this for you.
Every node in the network will assign a weight to its inputs, multiply them by their weights, and sum those products as a core part of its computation.
That means that every node in the network is searching for some coefficients for each of their inputs.
To do this, all inputs will be converted to numeric values -- so conditions like gender will be translated into "0=MALE,1=FEMALE" or something similar.
For example, a node's metric might look like this at a given point in time:
2*ACT_SCORE + 0*GENDER + (-5)*VARISTY_SPORTS ...
The coefficients for each values are exactly what the network is searching for as it converges.
If you change the scale of a value, like ACT_SCORE, you just change the scale of the coefficient that will be found by the reciporical of that scaling factor.
The result should still be the same.
There are other concerns in terms of accuracy (computers have limited capacity to represent small fractions) and speed that may enter this, but not being familiar with NEURO XL, I can't say whether or not they apply for this technology.
Question 4
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
This will reduce accuracy, although you should converge to a solution much faster with fewer possible outputs (scores).
Neural Networks actually describe very high-dimensional functions in their input variables.
If you reduce the granularity of that function's output space, you essentially state that you don't care about local minima and maxima in that function, especially around the borders between your output scores.
As a result, you are sacrificing information that may be an essential component of the "true" function that you are searching for.
I hope this has been helpful, but you really should break this question down into its many components and ask them separately on different sites -- potentially some of them do belong here on StackOverflow as well.

Shannon's Entropy measure in Decision Trees

Why is Shannon's Entropy measure used in Decision Tree branching?
Entropy(S) = - p(+)log( p(+) ) - p(-)log( p(-) )
I know it is a measure of the no. of bits needed to encode information; the more uniform the distribution, the more the entropy. But I don't see why it is so frequently applied in creating decision trees (choosing a branch point).

Because you want to ask the question that will give you the most information. The goal is to minimize the number of decisions/questions/branches in the tree, so you start with the question that will give you the most information and then use the following questions to fill in the details.

For the sake of decision trees, forget about the number of bits and just focus on the formula itself. Consider a binary (+/-) classification task where you have an equal number of + and - examples in your training data. Initially, the entropy will be 1 since p(+) = p(-) = 0.5. You want to split the data on an attribute that most decreases the entropy (i.e., makes the distribution of classes least random). If you choose an attribute, A1, that is completely unrelated to the classes, then the entropy will still be 1 after splitting the data by the values of A1, so there is no reduction in entropy. Now suppose another attribute, A2, perfectly separates the classes (e.g., the class is always + for A2="yes" and always - for A2="no". In this case, the entropy is zero, which is the ideal case.
In practical cases, attributes don't typically perfectly categorize the data (the entropy is greater than zero). So you choose the attribute that "best" categorizes the data (provides the greatest reduction in entropy). Once the data are separated in this manner, another attribute is selected for each of the branches from the first split in a similar manner to further reduce the entropy along that branch. This process is continued to construct the tree.

You seem to have an understanding of the math behind the method, but here is a simple example that might give you some intuition behind why this method is used: Imagine you are in a classroom that is occupied by 100 students. Each student is sitting at a desk, and the desks are organized such there are 10 rows and 10 columns. 1 out of the 100 students has a prize that you can have, but you must guess which student it is to get the prize. The catch is that everytime you guess, the prize is decremented in value. You could start by asking each student individually whether or not they have the prize. However, initially, you only have a 1/100 chance of guessing correctly, and it is likely that by the time you find the prize it will be worthless (think of every guess as a branch in your decision tree). Instead, you could ask broad questions that dramatically reduce the search space with each question. For example "Is the student somewhere in rows 1 though 5?" Whether the answer is "Yes" or "No" you have reduced the number of potential branches in your tree by half.

Simulating sports matches in online game

In an online manager game (like Hattrick), I want to simulate matches between two teams.
A team consists of 11 players. Every player has a strength value between 1 and 100. I take these strength values of the defensive players for each team and calculate the average. That's the defensive quality of a team. Then I take the strengths of the offensive players and I get the offensive quality.
For each attack, I do the following:
$offFactor = ($attackerTeam_offensive-$defenderTeam_defensive)/max($attackerTeam_offensive, $defenderTeam_defensive);
$defFactor = ($defenderTeam_defensive-$attackerTeam_offensive)/max($defenderTeam_defensive, $attackerTeam_offensive);
At the moment, I don't know why I divide it by the higher one of both values. But this formula should give you a factor for the quality of offense and defense which is needed later.
Then I have nested conditional statements for each event which could happen. E.g.: Does the attacking team get a scoring chance?
if ((mt_rand((-10+$offAdditionalFactor-$defAdditionalFactor), 10)/10)+$offFactor >= 0)
{ ... // the attack succeeds
These additional factors could be tactical values for example.
Do you think this is a good way of calculating a game? My users say that they aren't satisfied with the quality of the simulations. How can I improve them? Do you have different approaches which could give better results? Or do you think that my approach is good and I only need to adjust the values in the conditional statements and experiment a bit?
I hope you can help me. Thanks in advance!

Here is a way I would do it.
Offensive/Defensive Quality
First lets work out the average strength of the entire team:
Team.Strength = SUM(Players.Strength) / 11
Now we want to split out side in two, and work out the average for our defensive players, and our offensive players.]
Defense.Strength = SUM(Defensive_Players.Strength)/Defensive_Players.Count
Offense.Strength = SUM(Offense_Players.Strength)/Offense_Players.Count
Now, we have three values. The first, out Team average, is going to be used to calculate our odds of winning. The other two, are going to calculate our odds of defending and our odds of scoring.
A team with a high offensive average is going to have more chances, a team with a high defense is going to have more chance at saving.
Now if we have to teams, lets call them A and B.
Team A, have an average of 80, An offensive score of 85 and a defensive score of 60.
Team B, have an average of 70, An offensive score of 50 and a defensive score of 80.
Now, based on the average. Team A, should have a better chance at winning. But by how much?
Scoring and Saving
Lets work out how many times goals Team A should score:
A.Goals = (A.Offensive / B.Defensive) + RAND()
= (85/80) + 0.8;
= 1.666
I have assumed the random value adds anything between -1 and +1, although you can adjust this.
As we can see, the formula indicates team A should score 1.6 goals. we can either round this up/down. Or give team A 1, and calculate if the other one is allowed (random chance).
Now for Team B
B.Goals = (B.Offensive / A.Defensive) + RAND()
= (50/60) + 0.2;
= 1.03
So we have A scoring 1 and B scoring 1. But remember, we want to weight this in A's favour, because, overall, they are the better team.
So what is the chance A will win?
Chance A Will Win = (A.Average / B.Average)
= 80 / 70
= 1.14
So we can see the odds are 14% (.14) in favor of A winning the match. We can use this value to see if there is any change in the final score:
if Rand() <= 0.14 then Final Score = A 2 - 1 B Otherwise A 1 - 1 B
If our random number was 0.8, then the match is a draw.
Rounding Up and Further Thoughts
You will definitely want to play around with the values. Remember, game mechanics are very hard to get right. Talk to your players, ask them why they are dissatisfied. Are there teams always losing? Are the simulations always stagnant? etc.
The above outline is deeply affected by the randomness of the selection. You will want to normalise it so the chances of a team scoring an extra 5 goals is very very rare. But a little randomness is a great way to add some variety to the game.
There are ways to edit this method as well. For example instead of the number of goals, you could use the Goal figure as the number of scoring chances, and then have another function that worked out the number of goals based on other factors (i.e. choose a random striker, and use that players individual stats, and the goalies, to work out if there is a goal)
I hope this helps.

The most basic tactical decision in football is picking formation, which is a set of three numbers which assigns the 10 outfield players to defence, midfield and attack, respectively, e.g. 4/4/2.
If you use average player strength, you don't merely lose that tactic, you have it going backwards: the strongest defence is one with a single very good player, giving him any help will make it more likely the other team score. If you have one player with a rating of 10, the average is 10. Add another with rating 8, and the average drops (to 9). But assigning more people to defence should make it stronger, not weaker.
So first thing, you want to make everything be based on the total, not the average. The ratio between the totals is a good scale-independent way of determining which teams is stronger and by how much. Ratios tend to be better than differences, because they work in a predictable way with teams of any range of strengths. You can set up a combat results table that says how many goals are scored (per game, per half, per move, or whatever).
The next tactical choice is whether it is better to have one exceptional player, or several good ones. You can make that matter that by setting up scenarios that represent things that happen in game, e.g. a 1 on 1, a corner, or a long ball. The players involved in a scenario are first randomly chosen, then the result of the scenario is rolled for. One result can be that another scenario starts (midfield pass leads to cross leads to header chance).
The final step, which would bring you pretty much up to the level of actual football manager games, is to give players more than one type of strength rating, e.g., heading, passing, shooting, and so on. Then you use the strength rating appropriate to the scenario they are in.

The division in your example is probably a bad idea, because it changes the scale of the output variable depending on which side is better. Generally when comparing two quantities you either want interval data (subtract one from the other) or ratio data (divide one by the other) but not both.
A better approach in this case would be to simply divide the offensive score by the defensive score. If both are equal, the result will be 1. If the attacker is better than the defender, it will be greater than 1, and if the defender is stronger, it will be less than one. These are easy numbers to work with.
Also, instead of averaging the whole team, average parts of the team depending on the formations or tactics used. This will allow teams to choose to play offensively or defensively and see the pros and cons of this.
And write yourself some better random number generation functions. One that returns floating point values between -1 and 1 and one that works from 0 to 1, for starters. Use these in your calculations and you can avoid all those confusing 10s everywhere!

You might also want to ask the users what about the simulation they don't like. It's possible that, rather than seeing the final outcome of the game, they want to know how many times their team had an opportunity to attack but the defense regained control. So instead of
"Your team wins 2-1"
They want to see match highlights:
"Your team wins 2-1:
- scored at minute 15,
- other team took control and went for tried for a goal at minute 30,
but the shoot was intercepted,
- we took control again and $PLAYER1 scored a beautiful goal!
... etc
You can use something like what Jamie suggests for a starting point, choose the times at random, and maybe pick who scored the goal based on a weighted sampling of the offensive players (i.e. a player with a higher score gets a higher chance of being the one who scored). You can have fun and add random low-probability events like a red card on a player, someone injuring themselves, streakers across the field...

The average should be the number of players... using the max means if you have 3 player teams:
[4 4 4]
[7 4 1]
The second one would be considered weaker. Is that what you want? I think you would rather do something like:
(Total Scores / Total Players) + (Max Score / Total Players), so in the above example it would make the second team slightly better.
I guess it depends on how you feel the teams should be balanced.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse