Lcmm and pooling multiple imputed datasets - imputation

I am working in R. I have 5 imputed datasets whose results I would like to pool. My original analysis uses jointlcmm, whose syntax is similar to lcmm. Does anyone know how I can pool the results from jointlcmm or lcmm to produce one result?


Using CP-SAT with 3 million boolean variables

I want to understand whether I am using the CP-SAT solver improperly. My code automatically builds a model by reading a dataset from a CSV file. It creates one model.NewBoolVar() for each record of the dataset, multiplied by the number of possible decisions the optimization problem can take...
For example, if the dataset has 1 million records and I have to decide between 3 options, the model will contain 3 million boolean variables. The combination of these 3 million booleans is the solution to my optimization problem.
Currently, beyond 100K variables the program becomes unstable and Python crashes. Do you think I am misusing CP-SAT? Do you have experience with this kind of volume?
Thank you very much.
You are aware that this is an NP-hard problem?
Thus, potentially, you are creating a search tree of size 2^3000000 (one binary branch per boolean variable).
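To put a number on that, here is a rough back-of-the-envelope sketch in plain Python. It only counts raw assignments and does not touch the solver. The "exactly one option per record" case is my assumption about the model (the question does not show the constraints), and the function name is made up for illustration.

```python
import math

def log10_assignments(n_records, n_options, exactly_one=False):
    """Return log10 of the number of raw boolean assignments.

    exactly_one=False: every boolean is free, so 2^(n_records * n_options).
    exactly_one=True:  assumed constraint that each record picks exactly
                       one option, so n_options^n_records.
    """
    if exactly_one:
        return n_records * math.log10(n_options)
    return n_records * n_options * math.log10(2)

# 1 million records, 3 options each -> 3 million booleans
free = log10_assignments(1_000_000, 3)
one_hot = log10_assignments(1_000_000, 3, exactly_one=True)
print(f"unconstrained: about 10^{free:.0f} assignments")
print(f"one option per record: about 10^{one_hot:.0f} assignments")
```

Either way the worst case is far beyond enumeration, so whether the solver copes depends on how much the constraints and the objective let it prune, not on the raw variable count alone.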

Trying to get feature importance in Random Forest (PySpark)

I have customer data with close to 15k columns.
I'm trying to run RF on the data to reduce the number of columns and then run other ML algorithms on the result.
I am able to run RF in PySpark but am unable to extract the feature importance of the variables.
Does anyone have a clue about this, or know of another technique that would help me reduce the 15k variables to some 200-odd variables?
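One possible direction, as a sketch only: assuming the model was fitted with the DataFrame-based pyspark.ml API (not pyspark.mllib), the fitted random forest model exposes a featureImportances vector. The ranking step itself is plain Python; top_k_features and feature_names are names I made up, and feature_names is assumed to be in the same order as the columns assembled into the feature vector.

```python
def top_k_features(importances, feature_names, k):
    """Pair each feature name with its importance and keep the k largest.

    importances:   sequence of floats, one per feature, e.g. obtained as
                   model.featureImportances.toArray() from a fitted
                   pyspark.ml random forest model (assumption).
    feature_names: column names in the same order as the feature vector.
    """
    ranked = sorted(zip(feature_names, importances),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy illustration with made-up numbers
names = ["age", "income", "visits", "tenure"]
scores = [0.10, 0.55, 0.05, 0.30]
for name, score in top_k_features(scores, names, k=2):
    print(name, score)
```

With 15k columns you would call it with k=200 and keep only the returned column names for the downstream models.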

Is Matlab incorrect for mnrfit?

It seems Matlab is giving incorrect results for multinomial logistic regression.
In their example documentation using Fisher's Iris dataset [link], they give coefficients for the model which can be used on the same data set itself to get the modeled probabilities.
load fisheriris
sp = categorical(species);
[B,dev,stats] = mnrfit(meas,sp);
However, none of the expected value aggregates match the population aggregates, which is a requirement for a MaxEnt classifier (see slide 35 [here], or Eq. 14 [here], or Agresti, "Categorical Data Analysis", p. 298, etc.).
For example, with PHAT holding the modeled probabilities (e.g. PHAT = mnrval(B,meas)),
>> sum(PHAT)
>> 49.9828 49.8715 50.1456
should all equal 50 (the population values), and likewise for the other aggregations.
If the parameters
B=[36.9450 42.6378
12.2641 2.4653
14.4401 6.6809
-30.5885 -9.4294
-39.3232 -18.2862]
were used instead then all aggregated sufficient statistics match.
Additionally, it seems odd that Matlab solves this by evaluating likelihoods, which can produce a warning such as
Warning: Maximum likelihood estimation did not converge. Iteration
limit exceeded. You may need to merge categories to increase observed
whereas the only requirement, which follows from the MLE conditions, is that the expected values match; no likelihood evaluation is needed.
It would be a nice feature if, instead of the true classes, we could optionally supply just the aggregate information.
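To illustrate the matching condition on something small, here is a plain-Python sketch (not Matlab, and not how mnrfit works internally). It fits a 3-class multinomial logit with one feature by simple gradient ascent on a made-up, non-separable toy data set, then compares the summed probabilities with the class counts. The data, step size and iteration count are all my own choices.

```python
import math

def softmax_probs(params, x):
    """Class probabilities for one sample; the last class is the reference (score 0)."""
    scores = [a + b * x for a, b in params] + [0.0]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fit(xs, ys, n_classes, lr=0.05, iters=10000):
    """Gradient ascent on the log-likelihood; the gradient is observed minus expected."""
    params = [[0.0, 0.0] for _ in range(n_classes - 1)]
    for _ in range(iters):
        grad = [[0.0, 0.0] for _ in range(n_classes - 1)]
        for x, y in zip(xs, ys):
            p = softmax_probs(params, x)
            for k in range(n_classes - 1):
                resid = (1.0 if y == k else 0.0) - p[k]
                grad[k][0] += resid        # intercept: class count minus expected count
                grad[k][1] += resid * x    # slope: feature sum minus expected feature sum
        for k in range(n_classes - 1):
            params[k][0] += lr * grad[k][0]
            params[k][1] += lr * grad[k][1]
    return params

def expected_counts(params, xs, n_classes):
    """Column sums of the fitted probabilities (the analogue of sum(PHAT))."""
    totals = [0.0] * n_classes
    for x in xs:
        for k, pk in enumerate(softmax_probs(params, x)):
            totals[k] += pk
    return totals

# Toy data: 4 samples per class, with overlapping (non-separable) classes
xs = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, -1.5, 0.2, 1.5, -0.3, 0.8]
ys = [0, 0, 1, 0, 1, 1, 2, 0, 2, 2, 1, 2]
params = fit(xs, ys, n_classes=3)
print(expected_counts(params, xs, 3))  # each entry should be close to 4
```

Because the intercept gradient is exactly "class count minus summed probability", any fit that has actually reached the maximum must reproduce the class counts, which is the check that fails for the mnrfit output above.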
I submitted a technical error report on the MathWorks website. Their reply:
Hello [----],
I am writing in reference to your Technical Support Case #01820504
regarding 'mnrfit'.
Thanks a lot for your patience and reporting this issue. This appears
to be unexpected behavior. It appears to be related to an existing
issue we have in our records, that "mnrfit" does not give correct
maximum likelihood estimates in certain cases. Since the "mnrfit"
function is not finding the maximum likelihood estimates for the
coefficients, we calculated the actual MLEs. When we use these
estimates, we get the desired result of all 50s in this case.
The issue is that, for this particular dataset in our example, the
classes can be separated perfectly. This means that the logistic
function, in order to get exact zero or one probabilities, needs to
have infinite coefficients. The "mnrfit" function carries out an
iterative procedure with the coefficients getting larger, but it stops
at a point where the results have the issue that you have found. We
certainly agree that "mnrfit" could be made to do better. Our
development team is working on it.
At this stage, I am not able to suggest a workaround other than to
write a custom implementation as my colleague and I had tried. For
now, I will be closing this request as I have already forwarded it to
our records. However, if you have any additional questions related to
this case, please do not hesitate to reach me.
MathWorks Technical Support Department

Mahout K-means has different behavior based on the number of mapping tasks

I experience a strange situation when running Mahout K-means:
Using a pre-selected set of initial centroids, I run K-means on a SequenceFile generated by lucene.vector. The run is for testing purposes, so the file is small (around 10MB, roughly 10000 vectors).
When K-means is executed with a single mapper (the default considering the Hadoop split size which in my cluster is 128MB), it reaches a given clustering result in 2 iterations (Case A).
However, I wanted to test if there would be any improvement/deterioration in the algorithm's execution speed by firing more mapping tasks (the Hadoop cluster has in total 6 nodes).
I therefore set the -Dmapred.max.split.size parameter to 5242880 bytes, in order to make mahout fire 2 mapping tasks (Case B).
I indeed succeeded in starting two mappers, but the strange thing was that the job finished after 5 iterations instead of 2, and that even at the first assignment of points to clusters, the mappers made different choices compared to the single-mapper execution. What I mean is that, after close inspection of the clusterDump output of the first iteration in both cases, I found that in Case B some points were not assigned to their closest cluster.
Could this behavior be justified by the existing K-means Mahout implementation?
From a quick look at the sources, I see two problems with the Mahout k-means implementation.
First of all, the way the S0, S1, S2 statistics are kept is probably not numerically stable for large data sets. Oh, and since k-means actually does not even use S2, it is also unnecessarily slow. I bet a good implementation can beat this version of k-means by a factor of 2-5 at least.
For small data sets split onto multiple machines, there seems to be an error in the way they compute their means. Ouch. This will amplify if the reducer is applied to more than one input, in particular when the partitions are small. To be more verbose, the cluster mean apparently is initialized with the previous mean instead of the 0 vector. Now if you reduce 't' copies of it, the resulting vector will be off by 't' times the previous mean.
Initialization of AbstractCluster:
Update of the mean:
getS1().assign(x, Functions.PLUS);
Merge of multiple copies of a cluster:
Finalization to new center:
So with this approach, the center will be offset from the proper value by the previous center times t / n where t is the number of splits, and n the number of objects.
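The effect is easy to reproduce in a toy simulation. The Python below is not Mahout code; it only mimics the behavior described above (each partial sum S1 starting from the previous center instead of 0), in one dimension, with made-up function names.

```python
def center_with_suspected_bug(partitions, prev_center):
    """Merge per-partition (S0, S1) where each S1 starts at prev_center, not 0."""
    s0, s1 = 0, 0.0
    for part in partitions:
        part_s0 = 0
        part_s1 = prev_center  # suspected bug: should start at 0.0
        for x in part:
            part_s0 += 1
            part_s1 += x
        s0 += part_s0          # merge step: add up the partial statistics
        s1 += part_s1
    return s1 / s0             # finalization: new center = S1 / S0

def true_mean(partitions):
    points = [x for part in partitions for x in part]
    return sum(points) / len(points)

points = [1.0, 2.0, 3.0, 4.0]
prev = 10.0
one_split = [points]
two_splits = [points[:2], points[2:]]
print(center_with_suspected_bug(one_split, prev) - true_mean(one_split))    # 1 * prev / 4
print(center_with_suspected_bug(two_splits, prev) - true_mean(two_splits))  # 2 * prev / 4
```

The offset grows linearly with the number of splits, which would explain why the two-mapper run assigns points differently from the single-mapper run.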
To fix the numerical instability (which arises whenever the data set is not centered on the 0 vector), I recommend replacing the S1 statistic by the true mean, not S0*mean. Both S1 and S2 can be incrementally updated at little cost using the incremental mean formula which AFAICT was used in the original "k-means" publication by MacQueen (which actually is an online kmeans, while this is Lloyd style batch iterations). Well, for an incremental k-means you obviously need the updatable mean vector anyway... I believe the formula was also discussed by Knuth in his essential books. I'm surprised that Mahout does not seem to use it. It's fairly cheap (just a few CPU instructions more, no additional data, so it all happens in the CPU cache line) and gives you extra precision when you are dealing with large data sets.
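For reference, a minimal Python sketch of the incremental mean update, plus a way to merge two partial means as a reducer would need to. The function names are mine, and this covers the mean only, not S2.

```python
def update_mean(count, mean, x):
    """Fold one point x into a running mean: mean += (x - mean) / count."""
    count += 1
    mean = [m + (xi - m) / count for m, xi in zip(mean, x)]
    return count, mean

def merge_means(count_a, mean_a, count_b, mean_b):
    """Combine two partial (count, mean) pairs without forming large sums."""
    total = count_a + count_b
    if total == 0:
        return 0, list(mean_a)
    merged = [ma + (mb - ma) * count_b / total for ma, mb in zip(mean_a, mean_b)]
    return total, merged

# Data far from the origin, where a plain running sum loses precision first
points = [[1e9 + 1, 2.0], [1e9 + 2, 4.0], [1e9 + 3, 6.0], [1e9 + 4, 8.0]]

count, mean = 0, [0.0, 0.0]
for p in points:
    count, mean = update_mean(count, mean, p)
print(count, mean)
```

The update only ever adds a small correction to the current mean, so the stored values stay on the scale of the data instead of growing with the number of points.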

Algorithm for returning similar documents represented in Vector space model

I have a DB containing tf-idf vectors of about 30,000 documents.
I would like to return for a given document a set of similar documents - about 4 or so.
I thought about implementing a K-Means (clustering algorithm) on the data (with cosine similarity), but I don't know whether it's the best choice because of many uncertainties: I'm not sure what to put in my initial clusters, I don't know how many clusters to create, I fear the clusters will be too unbalanced, I'm not sure the results quality will be good, etc.
Any advice and help from experienced users will be greatly appreciated.
Thank you,
I would like to return for a given document a set of similar documents - about 4 or so.
Then don't do k-means. Just return the four closest documents by tf-idf similarity, as any search engine would do. You can implement this as a k-nearest neighbor search, or more easily by installing a search engine library and using the initial document as a query. Lucene comes to mind.
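If you want to roll the nearest-neighbor search yourself, a brute-force version is only a few lines. This is a sketch in plain Python that assumes each tf-idf vector is stored as a sparse dict mapping term to weight; the function names and the toy documents are made up.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(query_id, vectors, k=4):
    """Return the ids of the k documents most similar to vectors[query_id]."""
    query = vectors[query_id]
    scored = ((cosine(query, vec), doc_id)
              for doc_id, vec in vectors.items() if doc_id != query_id)
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

docs = {
    "a": {"cat": 1.0, "dog": 1.0},
    "b": {"cat": 1.0, "dog": 0.9},
    "c": {"cat": 1.0},
    "d": {"fish": 1.0},
    "e": {"dog": 1.0, "fish": 1.0},
}
print(most_similar("a", docs, k=2))
```

For 30,000 documents this is one linear scan per query, and heapq.nlargest keeps only the best k candidates instead of sorting everything.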
If I understand, you
read 30k records from a bigger db to a cache file / to memory
cosine similarity, 10 terms * 30k records -> best 4.
Can you estimate the runtimes of these phases separately?
read or cache: how often will this be done,
how big are the 30k vectors all together?
10 * 30k multiply-adds: in your C / Java / ... code, or in some opaque DB?
In C or Java, that should take < 1 second.
In general, make some back-of-the-envelope estimates
before getting fancy.
(By the way,
I find best-4 faster and simpler in straight-up c than std::partial_sort; ymmv.)