optimizing my dataset for regression analysis over time - merge

I have a question about optimizing my dataset. I have a dataset of around 5,000 individuals that I follow over time (with around 50 variables). However, I have duplicate cases, for example: ID:1 Year:1, ID:1 Year:2, ID:1 Year:2. To make sure that no duplicate years are present without losing data, I used CASESTOVARS.
However, now that I have used CASESTOVARS with an index, I have many variables. I have to change those variables so that when I revert the dataset back with VARSTOCASES, I no longer have these duplicate years!
How do I do this?
(Extra info: I used CASESTOVARS so that I did not lose information, as the duplicate cases are only partially the same: some variables match and some do not.)
Kind regards,
Instinct
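For what it's worth, outside SPSS the same collapse can be sketched quickly: group the long file by ID and Year and keep the first non-missing value of every variable. A minimal Python/pandas sketch (the variable names and the "first non-missing value wins" rule are assumptions about the data, not taken from the post):

    import pandas as pd

    # Hypothetical long-format data with duplicate ID/Year rows that only
    # partially overlap (each duplicate fills in different variables).
    df = pd.DataFrame({
        "ID":     [1, 1, 1, 2],
        "Year":   [1, 2, 2, 1],
        "income": [100, None, 120, 90],
        "health": [3, 4, None, 2],
    })

    # Collapse duplicates: within each ID/Year pair keep the first
    # non-missing value of every variable (assumes duplicates never hold
    # conflicting non-missing values for the same variable).
    deduped = df.groupby(["ID", "Year"], as_index=False).first()
    print(deduped)

This "aggregate within ID and Year" idea is what the CASESTOVARS/VARSTOCASES round trip is emulating, so it may help to check first whether any duplicate pair really carries conflicting values before collapsing.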

tbl_regression (gtsummary): ordering covariate levels and processing time

Originally in my df, I had my BMI in numeric format (1-5), which I recoded (underweight to obese), converted to a factor, and then chose a specific reference level using relevel (Normal, originally 3). Then I ran a logistic regression: y ~ BMI + other covariates. My questions are the following:
1 - When I plug my logistic model into tbl_regression, the levels come out in an undesired order (underweight, obese 1, obese 2, overweight). Is there a way to rearrange the levels the way I want (underweight, overweight, obese 1, obese 2)?
2 - I used tbl_regression on a small dataset, which went fine. My new model, however, is based on 3 million observations and 13 variables (the database is 1 GB). This time tbl_regression takes about an hour to process and output the table, which does not seem normal since I have a fast laptop. Is there a way to make this more efficient? I tried keeping only the model while using tbl_regression and removing the database, but it is still hellishly long. I tried with the trial data and it was fine.
1 - I recommend using contrasts() to set the reference level; the relevel() function just moves a factor level to the first position. There are examples here: Is there a way to relevel a variable in gtsummary after generating the beautiful table?
2 - I suspect that with such a large model, the confidence interval calculation is what is slowing you down. If you see a big difference between the computation times of summary() or broom::tidy() (with the CI calculation) and tbl_regression(), please create an illustrative example (that anyone can run locally) and it can be looked into further.

Summary/Cross Tab Using Multiple Variables + Percentage Change Columns

I am trying to use gtsummary to count the number of times someone engaged in an action (a; binary variable, yes/no) in a given year (b; continuous variable, ranging from 2002-2020) by various demographic factors (c-z; i.e. race, income, educational attainment) for complex survey data. Is there any way to do this in gtsummary? Furthermore, is there any way to use gtsummary to generate columns that provide the percentage change (in absolute and relative terms) between two years for a given demographic factor (i.e. what is the percentage change between 2006 and 2020 in the number of times someone engaged in action "a" for (black/white/hispanic/mixed race) participants)?
So far, I see that the tbl_cross function can handle up to two variables, and tbl_svysummary seems equipped for more general summary statistics (i.e. counting the number of (black/white/hispanic) people by whether they engaged in action "a" or not), not the more granular question I was wondering about.
Any guidance you have here would be much appreciated (and totally understand if this is beyond the scope of the package)! Thank you as always for your awesome work with gtsummary.
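Setting the gtsummary presentation aside for a moment, the underlying calculation is a grouped count per year followed by a difference between two year columns. A minimal pandas sketch of that part (the column names are hypothetical and complex-survey weights are ignored here, so treat it only as an illustration of the arithmetic):

    import pandas as pd

    # Hypothetical extract: action "a" (0/1), year "b", and one demographic factor.
    df = pd.DataFrame({
        "a":    [1, 0, 1, 1, 1, 0],
        "b":    [2006, 2006, 2006, 2020, 2020, 2020],
        "race": ["black", "black", "white", "black", "white", "white"],
    })

    # Count of "yes" responses per race and year.
    counts = df.groupby(["race", "b"])["a"].sum().unstack("b")

    # Absolute and relative change between the two years of interest.
    counts["abs_change"] = counts[2020] - counts[2006]
    counts["rel_change"] = counts["abs_change"] / counts[2006]
    print(counts)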

Usage of CP-SAT to forecast 3 million Boolean variables

Dear all,
I want to understand whether I am using the CP-SAT solver improperly. Basically, my code automatically creates a model by reading a CSV dataset. My code creates a model.NewBoolVar() for each record of the dataset, multiplied by the number of possible decisions to be taken by the optimization problem.
For example, if I have a dataset with 1 million records and I have to decide between 3 options, the model will contain 3 million Boolean variables. The combination of those 3 million Booleans is the solution to my optimization problem.
Currently, after about 100K variables the program becomes unstable and Python crashes. Do you think that I'm trying to use CP-SAT improperly? Do you have experience with this kind of volume?
Thank you very much.
Cheers
You are aware that this is an NP-hard problem.
Thus, potentially, you are creating a search space of size 2^3,000,000.
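For context, a minimal sketch of the modelling pattern the question describes, written against the OR-Tools CP-SAT Python API (the exactly-one constraint and the scaled-down sizes are illustrative assumptions, not taken from the original code):

    from ortools.sat.python import cp_model

    NUM_RECORDS = 1_000   # scaled down from the 1 million records in the question
    NUM_OPTIONS = 3

    model = cp_model.CpModel()

    # One Boolean per (record, option): this is the pattern that grows to
    # records * options variables (3 million in the question).
    choice = [
        [model.NewBoolVar(f"x_{r}_{o}") for o in range(NUM_OPTIONS)]
        for r in range(NUM_RECORDS)
    ]

    # Assumed constraint: each record picks exactly one option.
    for r in range(NUM_RECORDS):
        model.Add(sum(choice[r]) == 1)

    solver = cp_model.CpSolver()
    solver.parameters.max_time_in_seconds = 30.0
    status = solver.Solve(model)
    print(solver.StatusName(status))

Even before the solver runs, materialising millions of NewBoolVar() objects in Python can consume a lot of memory, which would be consistent with the instability reported around the 100K-variable mark.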

Modeler question: Is there a function in SPSS for multiple 'if' statements? Forecasting dates

I am trying to build a forecast for interest expense for floating debt in my company.
I have been given a set of ResetDates which help me match a given rate based on when the ResetDate is.
I have been successful in forecasting one period, but I need a much longer set of periods to satisfy my requirements.
I've tried Derive nodes with nested if statements, as well as Filler nodes.
Given the data I have to work with, I can only look one ResetDate ahead.
Here you will find the data I used: Columns A/B/C/D are what I'm given; Column E (the 5th column from left to right) is what I want to derive as my output.
I want to use 'InterestPayDate' and derive the following:
if it is more than 'NextReset', then add 90 days to 'NextReset' to create 'NextReset2'.
That is as far as I can get. Where my problem lies is that I then want to look at 'NextReset2' and derive:
if 'InterestPayDate' is more than 'NextReset2', then add 90 days to 'NextReset2'; if it is less than 'NextReset2', keep the current value of 'NextReset2'.
Output should look like Column E here
Not sure if I need to dig deeper into the logical functions, in all honesty, I've just picked up SPSS and I am really trying to learn. Hopefully, you can point me in the right direction.
Thank you.
After computing the first NextReset2, you need to use a Filler node like the one below to change the value of the field.
You might need more than one identical node like this: one for each potential 90-day period by which you are looking to extend the NextReset2 date. In your sample data, you will need at least two Filler nodes to get the correct value of NextReset2 for the last of the records.
There might be a more elegant way to do it, but this will work and it's easy enough to make copies of a node and string them together like this.
Please also see a sample IBM SPSS Modeler stream here, which shows this approach using your sample data.
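For clarity, the chain of Filler nodes emulates, one step at a time, a simple loop that keeps pushing the reset date forward by 90 days until it is no longer before the pay date. A minimal Python sketch of that logic, using the column names from the question (the example dates are hypothetical):

    from datetime import date, timedelta

    def roll_next_reset(interest_pay_date, next_reset, step_days=90):
        # Advance the reset date in 90-day steps until it is no longer
        # before InterestPayDate; each pass of the loop corresponds to
        # one Filler node in the Modeler stream.
        next_reset2 = next_reset
        while interest_pay_date > next_reset2:
            next_reset2 += timedelta(days=step_days)
        return next_reset2

    # Hypothetical row: InterestPayDate 2018-07-15, NextReset 2018-01-10.
    print(roll_next_reset(date(2018, 7, 15), date(2018, 1, 10)))  # 2018-10-07

The number of Filler nodes you need corresponds to the maximum number of times this loop can run for any record in your data.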

PIG - filtering groupby by the contents of the group

I am new to Pig, and I am wondering if I can do any inter-group filtering easily with it.
I have some data grouped by userid and some timestamps. I want to take only the groups that have two consecutive timestamps that are less than 30 minutes apart. Is this easy to express in Pig?
Thanks a lot!
The cleanest way to do this would be to write a UDF. The function would take a bag of timestamps as input, order them, and compute the minimum difference between timestamps. You could then filter your data based on the output of this UDF.
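As an illustration, the core of such a UDF could look like the following Python (Jython) sketch. The function name, the output schema, and the assumption that the timestamps arrive as a bag of single-field tuples expressed in seconds are all hypothetical details to adapt to your data:

    from pig_util import outputSchema

    @outputSchema("min_gap:long")
    def min_consecutive_gap(timestamps):
        # The bag arrives as a list of tuples; take the first field of each.
        ts = sorted(t[0] for t in timestamps)
        if len(ts) < 2:
            return None
        # Smallest difference between consecutive timestamps after sorting.
        return min(b - a for a, b in zip(ts, ts[1:]))

You would then GROUP your data by userid, pass the bag of timestamps to the UDF, and FILTER on the result being below 1800 (30 minutes, assuming the timestamps are in seconds).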
It is possible to do this in pure Pig Latin, if you really want to, although it involves more temporary data and map-reduce cycles, which means it may not be worth it. This would involve FLATTENing the bag of timestamps twice to get its cross-product, creating an indicator variable for any pairs of timestamps separated by less than 30 minutes, and then summing this variable for each user. Any user with a sum greater than zero has the property you desire.
Give it a go, and if you run into any specific issues, post another question outlining exactly where you're stuck.