Summary/Cross Tab Using Multiple Variables + Percentage Change Columns - crosstab

I am trying to use gtsummary to count the number of times someone engaged in an action (a; a binary yes/no variable) in a given year (b; a continuous variable ranging from 2002-2020) by various demographic factors (c-z; i.e., race, income, educational attainment) for complex survey data. Is there any way to do this in gtsummary? Furthermore, is there any way to use gtsummary to generate columns that would provide the percentage change (in absolute and relative terms) between two years for a given demographic factor (i.e., what is the percentage change between 2006 and 2020 in the number of times someone engaged in action "a" for black/white/Hispanic/mixed-race participants)?
So far, I'm seeing that the tbl_cross function can handle up to two variables, and tbl_svysummary seems equipped for more general summary statistics (i.e., counting the number of black/white/Hispanic people by whether they engaged in action "a" or not) rather than the more granular question I was wondering about.
Any guidance you have here would be much appreciated (and I totally understand if this is beyond the scope of the package)! Thank you as always for your awesome work with gtsummary.

Related

Circular System, how to get numbers back into stock 1

I am creating a system dynamics and agent-based model for my dissertation.
Numbers generated through the different flows must be added back to the start to continue through the process.
For example, numbers flow from a parameter to stock 1, which goes through a flow process at a specific rate to stock 2. From stock 2, there is another flow process based on a particular rate to stock 3. The numbers from stock 3 need to go back into stock 1 to repeat the process.
Methods I have tried have been adding flows, links, and changing the initial value of stock 1.
Any help or suggestions are greatly appreciated!
Updated:
Added screenshots.
I think it is because of the difference between the two flows, e.g. a net -9 resulting from the difference between flow and flow3, as shown in the screenshot.
Screenshots:
Graph of Stock 1
Model as a whole
In system dynamics, if you want to have a circular system (a feedback loop), it needs to contain at least one stock inside the loop, which means there is at least one delay in the feedback loop.
I will explain your model.
The stock has an outflow of 20 (flow) and an inflow of 11 (flow3); this produces a net outflow of 9/timeUnit.
This is what you see in the graph, and it doesn't even matter whether your system is circular or not: that stock will lose 9/timeUnit forever.
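To see why, here is a minimal Python sketch of the same arithmetic. Only the two rates come from your description; the initial stock value is made up for illustration.

```python
# Minimal Euler-style simulation of a single stock with a constant inflow and outflow.
# The rates (inflow = 11, outflow = 20) follow the description of the model above;
# the initial value of 100 is invented for illustration.
initial_stock = 100
inflow_rate = 11    # flow3
outflow_rate = 20   # flow
dt = 1              # one time unit per step

stock = initial_stock
for t in range(5):
    stock += (inflow_rate - outflow_rate) * dt  # net change: -9 per time unit
    print(f"t={t + 1}: stock = {stock}")
# Prints 91, 82, 73, 64, 55 -- the stock drains by 9/timeUnit no matter where
# the outflow is routed afterwards.
```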
Your wording is very strange when you say "numbers are generated and numbers are flowing"; it's not numbers that flow through your system, and you can't really say "I have 3 numbers per minute flowing into a pool of numbers".
In your model, the "numbers" are definitely going back to the initial stock, but system dynamics is like water flowing; it is not a discrete paradigm, so you will not see the same "numbers" going back, because system dynamics doesn't differentiate which individual "numbers" are flowing.
It is already awkward to have to use the word "numbers" just to stay consistent with your question.
So in order to give you a better answer, you will need to specify:
what behavior you see here
what behavior you expect, and how it differs from what you see
what you would need to see in the system in order to say "yes, my system is working exactly as expected"
It would also help if you let us know what your system represents, and if you use names that represent what is actually flowing through the system (instead of "numbers", "stock", and "flow"), because otherwise any explanation becomes confusing.

Modeler question: Is there a function in SPSS for multiple 'if' statements? Forecasting dates

I am trying to build a forecast for interest expense for floating debt in my company.
I have been given a set of ResetDates which help me match a given rate based on when the ResetDate is.
I have been successful in forecasting one period, but I need a much longer set of periods to satisfy my requirements.
I've tried derive nodes and nested if statements as well as filler nodes.
I am given this data to work with; I can only look at one ResetDate ahead.
Here you will find the data I used: Columns A/B/C/D are what I'm given; Column E (the fifth column from left to right) is what I want to derive as my output.
I want to use 'InterestPayDate' and derive the following:
if it's later than 'NextReset', then add 90 days to 'NextReset' to create 'NextReset2'.
That is as far as I can get. Where my problem lies is that I then want to look at 'NextReset2' and derive:
if 'InterestPayDate' is later than 'NextReset2', then add 90 days to 'NextReset2'; if it's earlier than 'NextReset2', keep the current value of 'NextReset2'.
Output should look like Column E here
Not sure if I need to dig deeper into the logical functions; in all honesty, I've just picked up SPSS and I am really trying to learn. Hopefully, you can point me in the right direction.
Thank you.
After computing the first NextReset2, you need to use a Filler node like the one below to change the value of the field.
You might need more than one identical node like this - one for each potential 90-day period by which you are looking to extend the NextReset2 date. In your sample data, you will need at least two Filler nodes to get the correct value of NextReset2 for the last of the records.
There might be a more elegant way to do it, but this will work and it's easy enough to make copies of a node and string them together like this.
Please also see a sample IBM SPSS Modeler stream showing this approach here and using your sample data.
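In case it helps to see the logic outside of Modeler, here is a hedged Python sketch of what the chained Filler nodes compute. The column names follow your sample data, the dates are invented for illustration, and each pass through the while loop corresponds to one Filler node.

```python
from datetime import timedelta

import pandas as pd

# Invented rows for illustration; in Modeler the same rule is applied per record.
df = pd.DataFrame({
    "InterestPayDate": pd.to_datetime(["2021-03-15", "2021-06-15", "2021-09-15"]),
    "NextReset": pd.to_datetime(["2021-01-10", "2021-01-10", "2021-01-10"]),
})

def roll_forward(pay_date, reset_date, step_days=90):
    """Add 90 days to the reset date until it is no longer before the pay date."""
    next_reset2 = reset_date
    while pay_date > next_reset2:
        next_reset2 += timedelta(days=step_days)
    return next_reset2

df["NextReset2"] = [
    roll_forward(p, r) for p, r in zip(df["InterestPayDate"], df["NextReset"])
]
print(df)
```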

Newbie to Neural Networks

Just starting to play around with Neural Networks for fun after playing with some basic linear regression. I am an English teacher, so I don't have a math background, and trying to read a book on this stuff is way over my head. I thought this would be a better avenue to get some basic questions answered (even though I suspect there is no easy answer). I'm just looking for some general guidance put in layman's terms. I am using a trial version of an Excel add-in called NEURO XL. I apologize if these questions are too "elementary."
My first project is related to predicting a student's Verbal score on the SAT based on a number of test scores, GPA, practice exam scores, etc. as well as some qualitative data (gender: M=1, F=0; took SAT prep class: Y=1, N=0; plays varsity sports: Y=1, N=0).
In total, I have 21 variables that I would like to feed into the network, with the output being the actual score (200-800).
I have 9000 records of data spanning many years/students. Here are my questions:
How many records of the 9000 should I use to train the network?
1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
If I split the data into an even number, say 9x1000 (or however many), created a network for each one, and then tested the results of each of these 9 on the other 8 sets to see which had the lowest MSE across the samples, would this be a valid way to "choose" the best network if I wanted to predict the scores for my incoming students (not included in this data at all)?
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
E.g.
750-800 = 10
700-740 = 9
etc.
Is there any benefit to doing this or should I just go ahead and try to predict the exact score?
What if ALL I cared about was whether or not the score was above or below 600? Would I then just make the output 0 (below 600) or 1 (above 600)?
5a. I read somewhere that it's not good to use 0 and 1, but instead 0.1 and 0.9 - why is that?
5b. What about -1 (below 600), 0 (exactly 600), 1 (above 600) - would this work?
5c. Would the network always output -1, 0, 1 - or would it output fractions that I would then have to round up or round down to finalize the prediction?
Once I have found the "best" network from Question #3, would I then play around with the different parameters (number of epochs, number of neurons in hidden layer, momentum, learning rate, etc.) to optimize this further?
6a. What about the activation function? Will log-sigmoid do the trick, or should I try the other options my software has as well (threshold, hyperbolic tangent, zero-based log-sigmoid)?
6b. What is the difference between log-sigmoid and zero-based log-sigmoid?
Thanks!
First a little bit of meta content about the question itself (and not about the answers to your questions).
I have to laugh a little that you say 'I apologize if these questions are too "elementary."' and then proceed to ask the single most thorough and well thought out question I've seen as someone's first post on SO.
I wouldn't be too worried that you'll have people looking down their noses at you for asking this stuff.
This is a pretty big question in terms of the depth and range of knowledge required, especially the statistical knowledge needed and familiarity with Neural Networks.
You may want to try breaking this up into several questions distributed across the different StackExchange sites.
Off the top of my head, some of it definitely belongs on the statistics StackExchange, Cross Validated: https://stats.stackexchange.com/
You might also want to try out https://datascience.stackexchange.com/ , a beta site specifically targeting machine learning and related areas.
That said, there is some of this that I think I can help to answer.
Anything I haven't answered is something I don't feel qualified to help you with.
Question 1
How many records of the 9000 should I use to train the network? 1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
Randomizing the selection of training data is probably not a good idea.
Keep in mind that truly random data includes clusters.
A random selection of students could happen to consist solely of those who scored above a 30 on the ACT exams, which could potentially result in a bias in your result.
Likewise, if you only select students whose SAT scores were below 700, the classifier you build won't have any capacity to distinguish between a student expected to score 720 and a student expected to score 780 -- they'll look the same to the classifier because it was trained without the relevant information.
You want to ensure a representative sample of your different inputs and your different outputs.
Because you're dealing with input variables that may be correlated, you shouldn't try to do anything too complex in selecting this data, or you could mistakenly introduce another bias in your inputs.
Namely, you don't want to select a training data set that consists largely of outliers.
I would recommend trying to ensure that your inputs cover all possible values for all of the variables you are observing, and all possible results for the output (the SAT scores), without constraining how these requirements are satisfied.
I'm sure there are algorithms out there designed to do exactly this, but I don't know them myself -- possibly a good question in and of itself for Cross Validated.
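One simple approximation of that idea is to stratify the train/test split on binned output scores. Here is a minimal sketch, assuming pandas and scikit-learn; the file name and column name are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical: the 9000 records in a CSV with a "sat_verbal" output column.
df = pd.read_csv("students.csv")

# Bin the continuous score so the split can be stratified on it, keeping the
# mix of score ranges similar in the training and held-out sets.
score_bins = pd.cut(
    df["sat_verbal"],
    bins=[200, 400, 500, 600, 700, 800],
    include_lowest=True,
)

train_df, test_df = train_test_split(
    df,
    test_size=0.2,        # e.g. hold out 20% for testing
    stratify=score_bins,  # preserve the distribution of score ranges
    random_state=42,
)
```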
Question 3
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
My understanding is that this is not recommended as the input to a neural network, but I may be wrong.
The convergence of the network should handle this for you.
Every node in the network will assign a weight to its inputs, multiply them by their weights, and sum those products as a core part of its computation.
That means that every node in the network is searching for some coefficients for each of their inputs.
To do this, all inputs will be converted to numeric values -- so conditions like gender will be translated into "0=MALE,1=FEMALE" or something similar.
For example, a node's metric might look like this at a given point in time:
2*ACT_SCORE + 0*GENDER + (-5)*VARSITY_SPORTS ...
The coefficients for each value are exactly what the network is searching for as it converges.
If you change the scale of a value, like ACT_SCORE, you just change the scale of the coefficient that will be found, by the reciprocal of that scaling factor.
The result should still be the same.
There are other concerns in terms of accuracy (computers have limited capacity to represent small fractions) and speed that may enter this, but not being familiar with NEURO XL, I can't say whether or not they apply for this technology.
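To make the scaling point concrete, here is a small sketch that uses an ordinary least-squares fit as a stand-in for a single linear node; the data is made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake data: one input on a 1-36 scale (an ACT-style score) plus noise.
act_score = rng.uniform(1, 36, size=200)
target = 15 * act_score + rng.normal(0, 10, size=200)

def fit_weight(x, y):
    """Least-squares weight for a single linear node with no bias term."""
    solution, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)
    return solution[0]

w_original = fit_weight(act_score, target)
w_rescaled = fit_weight(act_score / 36.0, target)  # same input on a 0-1 scale

print(w_original)               # roughly 15
print(w_rescaled)               # roughly 15 * 36: the weight absorbs the scale
print(w_rescaled / w_original)  # roughly 36, the reciprocal of the 1/36 rescaling
```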
Question 4
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
This will reduce accuracy, although you should converge to a solution much faster with fewer possible outputs (scores).
Neural Networks actually describe very high-dimensional functions in their input variables.
If you reduce the granularity of that function's output space, you essentially state that you don't care about local minima and maxima in that function, especially around the borders between your output scores.
As a result, you are sacrificing information that may be an essential component of the "true" function that you are searching for.
I hope this has been helpful, but you really should break this question down into its many components and ask them separately on different sites -- potentially some of them do belong here on StackOverflow as well.

Shannon's Entropy measure in Decision Trees

Why is Shannon's Entropy measure used in Decision Tree branching?
Entropy(S) = - p(+)log( p(+) ) - p(-)log( p(-) )
I know it is a measure of the number of bits needed to encode information; the more uniform the distribution, the higher the entropy. But I don't see why it is so frequently applied when creating decision trees (choosing a branch point).
Because you want to ask the question that will give you the most information. The goal is to minimize the number of decisions/questions/branches in the tree, so you start with the question that will give you the most information and then use the following questions to fill in the details.
For the sake of decision trees, forget about the number of bits and just focus on the formula itself. Consider a binary (+/-) classification task where you have an equal number of + and - examples in your training data. Initially, the entropy will be 1 since p(+) = p(-) = 0.5. You want to split the data on an attribute that most decreases the entropy (i.e., makes the distribution of classes least random). If you choose an attribute, A1, that is completely unrelated to the classes, then the entropy will still be 1 after splitting the data by the values of A1, so there is no reduction in entropy. Now suppose another attribute, A2, perfectly separates the classes (e.g., the class is always + for A2="yes" and always - for A2="no"). In this case, the entropy is zero, which is the ideal case.
In practical cases, attributes don't typically perfectly categorize the data (the entropy is greater than zero). So you choose the attribute that "best" categorizes the data (provides the greatest reduction in entropy). Once the data are separated in this manner, another attribute is selected for each of the branches from the first split in a similar manner to further reduce the entropy along that branch. This process is continued to construct the tree.
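As a minimal sketch of those two steps (computing entropy and picking the attribute with the greatest reduction), here is a small Python example using the toy attributes described above:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(labels, attribute_values):
    """Entropy reduction from splitting the labels by an attribute's values."""
    total = len(labels)
    split_entropy = 0.0
    for value in set(attribute_values):
        subset = [lab for lab, val in zip(labels, attribute_values) if val == value]
        split_entropy += (len(subset) / total) * entropy(subset)
    return entropy(labels) - split_entropy

labels = ["+", "+", "-", "-"]    # balanced classes, so entropy is 1 bit
a1 = ["x", "y", "x", "y"]        # unrelated attribute: no reduction in entropy
a2 = ["yes", "yes", "no", "no"]  # perfectly separating attribute

print(entropy(labels))               # 1.0
print(information_gain(labels, a1))  # 0.0
print(information_gain(labels, a2))  # 1.0
```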
You seem to have an understanding of the math behind the method, but here is a simple example that might give you some intuition behind why this method is used: Imagine you are in a classroom occupied by 100 students. Each student is sitting at a desk, and the desks are organized such that there are 10 rows and 10 columns. 1 out of the 100 students has a prize that you can have, but you must guess which student it is to get the prize. The catch is that every time you guess, the prize is decremented in value. You could start by asking each student individually whether or not they have the prize. However, initially, you only have a 1/100 chance of guessing correctly, and it is likely that by the time you find the prize it will be worthless (think of every guess as a branch in your decision tree). Instead, you could ask broad questions that dramatically reduce the search space with each question. For example, "Is the student somewhere in rows 1 through 5?" Whether the answer is "Yes" or "No", you have reduced the number of potential branches in your tree by half.

How would you identify a user's most active sections?

On a website, everything is tagged with keywords assigned by the staff (it's not a community driven site, due to its nature). I am able to determine which tags a user is most active in (or, what tags they view the most). However, I'm not sure how to choose the list. A few options present themselves, but they don't seem right to me.
Take the top n (or m < n if they have fewer than n viewed tags) tags
Take the top n tags where n is a percentage of the total tags viewed
Take the top n tags with m views where n and m are percentages of total tags viewed and total page views
Take all of the tags, regardless of views
The goal is to identify what is most interesting to the user and show them other things that they might be interested in, with respect to the tags that are assigned to the content.
You could look at machine learning algorithms to evaluate the effectiveness of your choice.
For instance: http://en.wikipedia.org/wiki/Supervised_learning#Approaches_and_algorithms - approaches like nearest neighbour and naive Bayes could help you improve your suggestions.
This is overkill for just suggesting "Would you like to look at this too?", but it's an interesting approach to providing better tie-ins. It would, however, require some method of figuring out whether or not your users value your suggestions (e.g. "I like this!" links, or log analysis based on time spent on links, etc.).
A simple solution is to try several reports and check which one is more informative. The nature of your site and your data may mean that some reports are unexpectedly useful and some are not. If a report gets a 'flat' area chart, for example, look for something else.
Even better, give the consumers of the reports a choice and the ability to provide feedback. Tune the reports based on what they are really looking for.
P.S. I would go for the "Take the top n tags with m views where n and m are percentages of total tags viewed and total page views" report first.
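For what it's worth, here is a minimal sketch of that report. The thresholds and the shape of the per-user view data are assumptions to be tuned against the feedback you collect.

```python
from collections import Counter

def top_tags(view_counts, tag_share=0.2, view_share=0.01):
    """
    Pick a user's most active tags:
    - keep at most `tag_share` (a percentage) of their distinct viewed tags,
    - and only tags accounting for at least `view_share` of their total views.
    """
    total_views = sum(view_counts.values())
    max_tags = max(1, int(len(view_counts) * tag_share))
    ranked = Counter(view_counts).most_common(max_tags)
    return [tag for tag, views in ranked if views / total_views >= view_share]

# Hypothetical per-user view counts keyed by tag.
user_views = {"python": 40, "sql": 25, "css": 3, "regex": 2, "git": 1}
print(top_tags(user_views, tag_share=0.4))  # ['python', 'sql']
```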