Using CP-SAT with 3 million boolean variables - or-tools

Hi all,
I want to understand whether I am using the CP-SAT solver improperly. My code automatically builds a model by reading a CSV dataset: it calls model.NewBoolVar() for each record of the dataset, multiplied by the number of possible decisions to be taken by the optimization problem...
For example, if I have a dataset with 1 million records and I have to decide between 3 options, the model will contain 3 million boolean variables. The combination of those 3 million booleans is the solution to my optimization problem.
Currently, beyond 100K variables the program becomes unstable and Python crashes. Do you think I'm using CP-SAT improperly? Do you have experience with these kinds of volumes?
Thank you very much.
Cheers

You are aware that this is an NP-hard problem.
Thus, potentially, you are creating a search tree of size 2^3,000,000.

Related

tbl_regression (gtsummary) ordering covariables levels and processing time

Originally in my df, I had BMI in numeric format (1-5), which I recoded (underweight to obese), factored, and chose a specific reference for using relevel() (Normal, originally 3). Then I ran a logistic regression: y ~ BMI + other covariates. My questions are the following:
1 - When I plug my logistic model into tbl_regression, the levels come out in an undesired order (underweight, obese 1, obese 2, overweight). Is there a way to rearrange the levels the way I want (underweight, overweight, obese 1, obese 2)?
2 - I used tbl_regression on a small dataset, which went OK. My new model, however, is based on 3M observations and 13 variables (the database is 1 GB). This time tbl_regression takes about an hour to process and output the table, which is not normal since I have a fast laptop. Is there a way to make this more efficient? I tried keeping only the model while using tbl_regression and removed the database, but it is still hellishly slow. I tried with the trial data and it was OK.
1 - I recommend using contrasts() to set the reference level. The relevel() function just moves a factor level to the first position. Examples here: Is there a way to relevel a variable in gtsummary after generating the beautiful table?
2 - I suspect with such a large model, the confidence interval calculation is what is slowing you down. If you see a big difference in the computation times of summary() and broom::tidy() with the CI calculation compared to tbl_regression(), please create an illustrative example (that anyone can run locally) and it can be looked into further.

How to tackle skewness and output file size in Apache Spark

I am facing a skewness problem when trying to join 2 datasets. One data partition (on the column I am performing the join on) is far more skewed than the rest, and because of this one of the final output part files is 40 times larger than the other output part files.
I am using Scala and Apache Spark for my calculations, and the file format used is Parquet.
So I am looking for 2 solutions:
First, how can I tackle the skewness? Processing the skewed data takes a lot of time. (For the skewed data I have tried broadcasting, but it did not help.)
Second, how can I make all the final output part files stay within a 256 MB range? I have tried the property spark.sql.files.maxPartitionBytes=268435456 but it is not making any difference.
Thanks,
Skewness is a common problem when dealing with data.
To handle it, there is a technique called salting.
First, you may check out this video by Ted Malaska to get the intuition behind salting.
Second, examine his repository on this topic.
I think each skewness issue has its own best method of solving it.
Hope these materials help you.
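For intuition, the salting trick can be sketched without Spark at all. This pure-Python toy (the keys, values, and NUM_SALTS are made up for illustration) shows the two halves of the technique: append a random salt to the skewed side's join key, and explode the small side once per possible salt so every salted key still finds its match:

```python
import random

# Pure-Python illustration of salting (not actual Spark code): a skewed
# join key is split into NUM_SALTS sub-keys so work spreads across partitions.
NUM_SALTS = 4

# Big, skewed side: almost every row shares the key "hot".
big = [("hot", i) for i in range(100)] + [("cold", 0)]
# Small side to be joined.
small = [("hot", "H"), ("cold", "C")]

# 1. Salt the big side: append a random suffix to each key.
salted_big = [(f"{k}#{random.randrange(NUM_SALTS)}", v) for k, v in big]

# 2. Explode the small side: replicate each row once per possible salt,
#    so every salted big-side key still has a matching partner.
salted_small = [(f"{k}#{s}", v) for k, v in small for s in range(NUM_SALTS)]

# 3. Join on the salted key (a dict lookup stands in for the join here),
#    then strip the salt to recover the original key.
lookup = dict(salted_small)
joined = [(k.split("#")[0], v, lookup[k]) for k, v in salted_big]
```

In Spark the same idea is a `concat` of the key with a random integer column on the large side, a `explode` of a salt-range array on the small side, and an ordinary join on the salted column; the result is identical to the unsalted join but the hot key is spread over NUM_SALTS partitions.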

optimizing my dataset for regression analysis over time

I have a question about optimizing my dataset. I have a dataset of around 5000 individuals that I follow over time (with around 50 variables). However, I have duplicate cases, for example: ID:1 Year:1; ID:1 Year:2; ID:1 Year:2. Therefore, to make sure no duplicate years are present, I used casestovars so as not to lose data.
However, now that I have used casestovars with an index, I have many variables. I have to change those variables so that when I revert the dataset back with varstocases I no longer have these duplicate years!
How do I do this?
(Extra info: I used casestovars so that I did not lose information, as the duplicate cases are only partially the same (some variables match, some do not).)
Kind regards,
Instinct
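The merge being described (collapsing cases that share ID and Year while keeping each variable's non-missing value, so nothing is lost) can be sketched in plain Python rather than SPSS; the variable names and values below are made up for illustration:

```python
# Rough sketch (not SPSS) of collapsing duplicate (ID, Year) cases:
# rows that agree on the key are merged, taking the first non-missing
# value per variable, so no information is lost before reshaping.
records = [
    {"ID": 1, "Year": 2, "bmi": 24.0, "income": None},
    {"ID": 1, "Year": 2, "bmi": None, "income": 50000},
    {"ID": 1, "Year": 1, "bmi": 23.5, "income": 48000},
]

merged = {}
for row in records:
    key = (row["ID"], row["Year"])
    acc = merged.setdefault(key, {})
    for var, value in row.items():
        # Keep the first non-missing value seen for each variable.
        if acc.get(var) is None and value is not None:
            acc[var] = value

collapsed = sorted(merged.values(), key=lambda r: (r["ID"], r["Year"]))
```

In SPSS terms this corresponds to aggregating the partial duplicates down to one case per ID/Year before (or instead of) the casestovars/varstocases round trip.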

Barriers to translation stage in Modelica?

Some general Modelica advice?
We've built a model with ~2000 equations and three vectors of input from measured data. Using OpenModelica, attempts at simulation have begun to hang in the translation stage (which runs for hours where it used to take less than a minute) and now I regularly "lose connection to omc.exe." Is there perhaps something cumulative occurring that's degrading translation/compilation performance?
In general, are there any good rules of thumb for keeping simulations lighter and faster? I realize that, depending on the couplings, additional equations could be exponentially increasing the size of the resulting system of equations - could this be a problem?
Thanks for your thoughts!
It shouldn't take that long. Seems like a bug.
You can report this bug here:
https://trac.openmodelica.org/OpenModelica (New Ticket).
If your model is public you can post it there, if not you can contact the OpenModelica team privately.
I did some cleaning in the code and got the part that repeats 12x (the module) down to ~180 equations; in the process I reduced the size of my input vectors (and also a 2D look-up table the module refers to) considerably - they're both down to a few hundred values. It's working now - simulations run in reasonable time, a few minutes each.
Since all these tables were defined within Modelica functions (as you pointed out, Mr. Tiller), perhaps shrinking them helped to improve the performance. I had assumed that all that data just got spread out in a memory array without going through any real processing, but maybe that's not the case... time to learn more about what's going on under the hood in this environment (as always).
Thanks for the help!

Chi-square type-1 error

I have a question about the chi-square test.
I have two between-subjects factors, each with two levels (so 4 conditions). Furthermore, I have one dependent variable (qualitative), also consisting of two levels.
Now I want to make pairwise comparisons (so I have 6 chi-square tests in total). Is there any way I can control type-1 errors? In the literature I saw that interactions are often tested with a chi-square test. Is this the way to do it, and if so, how do I do it?
I can work with both SPSS and Matlab.
Thanks in advance!
Niels
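One standard way to control the family-wise type-1 error across the six pairwise tests is a Bonferroni-style correction of the p-values; the Holm step-down variant is uniformly more powerful than plain Bonferroni. A minimal sketch in Python (the six p-values are made up; in practice they would come from the chi-square tests run in SPSS or Matlab):

```python
# Holm-Bonferroni adjustment for six pairwise chi-square p-values
# (made-up numbers for illustration).
def holm_adjust(pvalues):
    """Return Holm step-down adjusted p-values, in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply by the number of remaining hypotheses; enforce
        # monotonicity so adjusted p-values never decrease down the list.
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted

pvals = [0.004, 0.030, 0.012, 0.200, 0.045, 0.001]
adj = holm_adjust(pvals)
significant = [p < 0.05 for p in adj]
```

Comparing each adjusted p-value to 0.05 keeps the family-wise error rate at 5% across all six comparisons; the same correction is available out of the box in SPSS and in Matlab toolboxes.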