tbl_regression (gtsummary): ordering covariate levels and processing time

Originally in my df, I had my BMI in numeric format (1-5), which I recoded (underweight to obese), converted to a factor, and for which I chose a specific reference level using relevel() (Normal, originally 3). Then I ran a logistic regression: y ~ BMI + other covariates. My questions are the following:
1- When I plug my logistic model into tbl_regression, the levels come out in an undesired order (underweight, obese 1, obese 2, overweight). Is there a way to rearrange the levels the way I want (underweight, overweight, obese 1, obese 2)?
2- I used tbl_regression on a small data set, which went fine. My new model, however, is based on 3M observations and 13 variables (the database is 1 GB). This time tbl_regression takes about an hour to process and output the table, which doesn't seem normal since I have a fast laptop. Is there a way to make this more efficient? I tried keeping only the model while using tbl_regression and removing the database, but it is still hellishly long. I tried with the trial data and it was fine.

1 - I recommend using contrasts() to set the reference level. The relevel() function just moves a factor level to the first position. Examples are in the answer to "Is there a way to relevel a variable in gtsummary after generating the beautiful table?"
2 - I suspect with such a large model, the confidence interval calculation is what is slowing you down. If you see a big difference in the computation times of summary() and broom::tidy() with the CI calculation compared to tbl_regression(), please create an illustrative example (that anyone can run locally) and it can be looked into further.

Related

Usage of CP-SAT to solve for 3 million boolean variables

Dear all,
I want to understand whether I am using the CP-SAT solver improperly. Basically, my code automatically creates a model by reading a CSV dataset. It creates a model.NewBoolVar() for each record of the dataset, multiplied by the number of possible decisions to be taken in the optimization problem...
For example, if I have a dataset with 1 million records and I have to decide between 3 options, the model will contain 3 million boolean variables. The combination of these 3 million booleans is the solution to my optimization problem.
Currently, beyond about 100K variables the program becomes unstable and Python crashes. Do you think I'm using CP-SAT improperly? Do you have experience with these kinds of volumes?
Thank you very much.
Cheers
You are aware that this is an NP-hard problem.
Thus, potentially, you are creating a search tree of size 2^3,000,000.
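For reference, here is a minimal sketch (Python, using the ortools CP-SAT API) of the kind of model being described, assuming a hypothetical one-decision-per-record structure; the record and option counts are placeholders and the real objective/constraints are omitted. It only illustrates how quickly the variable count grows.

from ortools.sat.python import cp_model

# Hypothetical sketch: one Boolean per (record, option) pair, with exactly
# one option chosen per record. With 1M records and 3 options this would
# mean 3M Boolean variables; the counts below are kept small on purpose.
NUM_RECORDS = 1_000   # the question describes ~1,000,000
NUM_OPTIONS = 3

model = cp_model.CpModel()
choice = {}
for r in range(NUM_RECORDS):
    row = [model.NewBoolVar(f"x_{r}_{o}") for o in range(NUM_OPTIONS)]
    for o, var in enumerate(row):
        choice[r, o] = var
    model.Add(sum(row) == 1)  # each record takes exactly one decision

# The real objective and coupling constraints would go here.
solver = cp_model.CpSolver()
solver.parameters.max_time_in_seconds = 30
status = solver.Solve(model)
print(solver.StatusName(status), len(choice), "variables")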

I cannot reproduce the results with k-means in Orange

I've tried to reproduce the same results with the same flow, and I don't understand why the results are different each time.
To describe the situation: I have a file with 192 instances and 37 features, and in all cases I select the same columns and preprocess by Median and StdDev. PCA is computed with 7 principal components. The following step is to run the k-means algorithm (k between 2 and 8) on this 'new' dataset. The scatter plot shows the results for k=5.
I attached different images with my flows.
Image1: original flow
The first one is the original flow (highlighted in yellow), which I would like to reproduce without the rest of the options (the second image).
Image2: flows repeated
However, when I tried to do it, I saw that the results are different (the third image). Of course the colors themselves don't determine the differences, but the clusters are different. In addition, the Silhouette Scores differ between the flows.
Image3: results of the different flows
K-means is initialized with k-means++, and my question is whether I can "control" this, or whether the initialization is always random. I have seen in other programs an option called 'seed' which is used to make an experiment repeatable, but I didn't see this option, or anything similar, here.
I wonder whether it is possible to always obtain the same results with the same flow (using k-means).
It seems that the issue happens because no random seed is set in the k-means widget. So the initialization is different each time you repeat the experiment, and because of the nature of your data, the method converges differently. Could you please report your issue to the Orange3 issue tracker?
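(Orange's k-means widget is configured through the GUI, so as a plain illustration of why a fixed seed makes the results repeatable, here is a small sketch using scikit-learn rather than Orange; the data are synthetic.)

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data; the point is only the effect of the seed.
X, _ = make_blobs(n_samples=192, n_features=7, random_state=0)

km1 = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=42).fit(X)
km2 = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=42).fit(X)
print((km1.labels_ == km2.labels_).all())  # True: same seed, identical clustering

km3 = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=7).fit(X)
# A different seed can converge to a different local optimum, which is what
# appears to be happening between the repeated Orange flows.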

Barriers to translation stage in Modelica?

Some general Modelica advice?
We've built a model with ~2000 equations and three vectors of input from measured data. Using OpenModelica, attempts at simulation have begun to hang in the translation stage (which runs for hours where it used to take less than a minute) and now I regularly "lose connection to omc.exe." Is there perhaps something cumulative occurring that's degrading translation/compilation performance?
In general, are there any good rules of thumb for keeping simulations lighter and faster? I realize that, depending on the couplings, additional equations could be exponentially increasing the size of the resulting system of equations - could this be a problem?
Thanks for your thoughts!
It shouldn't take that long. Seems like a bug.
You can report this bug here:
https://trac.openmodelica.org/OpenModelica (New Ticket).
If your model is public you can post it there, if not you can contact the OpenModelica team privately.
I did some cleaning in the code and got the part that repeats 12x (the module) down to ~180 equations. In the process I reduced the size of my input vectors (and also a 2D look-up table the module refers to) by quite a bit; they're both down to a few hundred values. It's working now: simulations run in reasonable time, a few minutes each.
Since all these tables were defined within Modelica functions (as you pointed out, Mr. Tiller), perhaps shrinking them helped to improve the performance. I had assumed that all that data just got laid out in a memory array, without going through any real processing, but maybe that's not the case... time to learn more about what's going on under the hood in this environment (as always).
Thanks for the help!

Mahout K-means has different behavior based on the number of mapping tasks

I experience a strange situation when running Mahout K-means:
Using a pre-selected set of initial centroids, I run k-means on a SequenceFile generated by lucene.vector. The run is for testing purposes, so the file is small (around 10 MB, ~10,000 vectors).
When k-means is executed with a single mapper (the default, considering the Hadoop split size, which in my cluster is 128 MB), it reaches a given clustering result in 2 iterations (Case A).
However, I wanted to test whether there would be any improvement/deterioration in the algorithm's execution speed by firing up more mapping tasks (the Hadoop cluster has 6 nodes in total).
I therefore set the -Dmapred.max.split.size parameter to 5242880 bytes in order to make Mahout fire 2 mapping tasks (Case B).
I did indeed succeed in starting two mappers, but the strange thing was that the job finished after 5 iterations instead of 2, and that even at the first assignment of points to clusters, the mappers made different choices compared to the single-mapper execution. What I mean is that, after close inspection of the clusterDump for the first iteration in both cases, I found that in Case B some points were not assigned to their closest cluster.
Could this behavior be explained by the existing Mahout k-means implementation?
From a quick look at the sources, I see two problems with the Mahout k-means implementation.
First of all, the way the S0, S1, S2 statistics are kept is probably not numerically stable for large data sets. Oh, and since k-means actually does not even use S2, it is also unnecessarily slow. I bet a good implementation can beat this version of k-means by a factor of 2-5 at least.
For small data sets split onto multiple machines, there seems to be an error in the way they compute their means. Ouch. This will be amplified if the reducer is applied to more than one input, in particular when the partitions are small. To be more specific, the cluster mean apparently is initialized with the previous mean instead of the 0 vector. Now if you reduce 't' copies of it, the resulting vector will be off by 't' times the previous mean.
Initialization of AbstractCluster:
setS1(center.like());
Update of the mean:
getS1().assign(x, Functions.PLUS);
Merge of multiple copies of a cluster:
setS1(getS1().plus(cl.getS1()));
Finalization to new center:
setCenter(getS1().divide(getS0()));
So with this approach, the center will be offset from its proper value by the previous center times t/n, where t is the number of splits and n the number of objects.
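A quick numeric check of that claim (a toy sketch in Python, not Mahout code; the data and the previous center are made up):

import numpy as np

# Each of t partitions initializes its running sum S1 with the previous
# center instead of the zero vector, then adds its own points (the bug
# described above). Merging the partitions then over-counts the previous
# center t times.
prev_center = np.array([10.0, 10.0])
data = np.random.rand(100, 2)      # n = 100 toy points
t = 4                              # number of splits / mappers
splits = np.array_split(data, t)

s0 = len(data)
s1 = sum(prev_center + part.sum(axis=0) for part in splits)  # buggy merge
buggy_center = s1 / s0

true_center = data.mean(axis=0)
print(buggy_center - true_center)  # approximately prev_center * t / n
print(prev_center * t / s0)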
To fix the numerical instability (which arises whenever the data set is not centered on the 0 vector), I recommend replacing the S1 statistic with the true mean rather than S0*mean. Both S1 and S2 can be incrementally updated at little cost using the incremental mean formula, which AFAICT was used in the original "k-means" publication by MacQueen (which is actually an online k-means, while this is Lloyd-style batch iteration). Well, for an incremental k-means you obviously need an updatable mean vector anyway... I believe the formula was also discussed by Knuth in his essential books. I'm surprised that Mahout does not seem to use it. It's fairly cheap (just a few CPU instructions more and no additional data, so it all happens in the CPU cache line) and gives you extra precision when you are dealing with large data sets.
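For what it's worth, here is a minimal sketch of the incremental mean update being referred to (illustrative Python, not Mahout's implementation):

import numpy as np

# Incremental (online) mean: mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n.
# The running value stays on the scale of the data itself, instead of a
# growing sum that is divided by the count only at the end.
def incremental_mean(points):
    mean = np.zeros(points.shape[1])
    for n, x in enumerate(points, start=1):
        mean += (x - mean) / n
    return mean

data = np.random.rand(10_000, 3) + 1e6   # data far from the origin
print(incremental_mean(data))
print(data.sum(axis=0) / len(data))      # sum-then-divide, for comparison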

Prediction/delay forecasting using machine learning?

I have a set of data covering the past 5 years: approximately 7,000 rows with features that are binary {yes/no} or multi-class {product A, B, C}, about 20+ features in total.
I am trying to make a program (or a one-time analysis project) to determine (predict) the product ship date (shipping delay in days) based on this historical data. I have two columns: one indicating when a product was planned to be shipped and another indicating when it was actually shipped.
I'm wondering how I can make a prediction program that determines, based on the historical data, when a newly entered product can be expected to ship. I don't care about getting a specific date; even a program that can tell me the number of delay days to add would be enough...
I took an ML class a while back, but I wasn't sure how to start something like this. Any advice? The closest thing to this I can think of is an image recognition assignment using a neural network, but that was too easy; here I have to deal with a date instead of black/white pixels. I used MATLAB back in the day (I still know how to use it), but I just downloaded the Weka data mining tool.
I was thinking of a neural network, but I'm not sure how to set it up so that my program gives me the expected delay time (number of days/months) from the input ship date.
Basically,
I want to input (size = 5, prod = A, ..., expected ship date = Jan 1st)
and have the program return the number of days to add as a delay onto my expected ship date, given the historical trends...
Would appreciate any help on how to start something like this the correct/easiest/best way... Thanks in advance.
If you use Weka, get your input/label data into the ARFF format and then try out the different regressors (this is a regression problem, after all). To avoid having to do too much programming quite yet (if you are just in an exploratory phase), use the Weka Experimenter, which has a GUI for trying out a whole bunch of regressors on your dataset.
Then, when you find one that does something expected and you want to do some more data analysis using MATLAB, you can use a Weka/MATLAB interface.
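(If you later want to script the same idea outside Weka, a rough equivalent in Python with pandas/scikit-learn might look like the sketch below; the file and column names are hypothetical.)

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical columns: "planned_ship" and "actual_ship" dates plus the
# binary/categorical product features. The regression target is the delay in days.
df = pd.read_csv("shipments.csv", parse_dates=["planned_ship", "actual_ship"])
df["delay_days"] = (df["actual_ship"] - df["planned_ship"]).dt.days

X = pd.get_dummies(df.drop(columns=["planned_ship", "actual_ship", "delay_days"]))
y = df["delay_days"]

model = RandomForestRegressor(n_estimators=200, random_state=0)
print(cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error"))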