Cluster analysis with nominal, ordinal and metric data - cluster-analysis

I have a data set with nominal, ordinal and metric variables.
I want to perform a cluster analysis.
Since I have mixed scales, it seems that k-modes clustering is the most appropriate way to explore the data.
Or does anyone have a better approach in mind? I am thankful for any advice!

It's not enough to just make the program run.
It needs to answer the right question. K-means, k-medians, k-medoids, k-modes: each optimizes a different function. Math won't tell you which function is the best for you. That is the question you need to answer: which function solves your problem?
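To make that trade-off concrete, here is a toy sketch (not a library implementation, and the data is made up) of what k-modes optimizes: modes instead of means, and simple matching distance instead of Euclidean distance. It assumes at least k distinct records.

```python
from collections import Counter

def matching_distance(a, b):
    """Number of attributes on which two records disagree (simple matching)."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(records, k, n_iter=10):
    """Toy k-modes: assign each record to the nearest mode, then
    recompute each mode as the per-attribute majority value."""
    # Naive initialization: the first k distinct records.
    modes = []
    for r in records:
        rl = list(r)
        if rl not in modes:
            modes.append(rl)
        if len(modes) == k:
            break
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for r in records:
            j = min(range(k), key=lambda i: matching_distance(r, modes[i]))
            clusters[j].append(r)
        for i, members in enumerate(clusters):
            if members:
                modes[i] = [Counter(col).most_common(1)[0][0]
                            for col in zip(*members)]
    return modes, clusters

records = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
modes, clusters = k_modes(records, 2)
```

Replacing `matching_distance` with Euclidean distance and the majority vote with a mean would turn this into k-means, which is exactly why the choice of function matters: it changes what "similar" means for your mixed-scale data.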

Related

Can you feed the OR-tools solver external data in between the different solutions it finds?

I'm trying to solve a complex variant of a min-SAT problem. So far in the process I have two subproblems, both producing solution values that need to be considered in the objective function. However, I solve only one of the two subproblems with the OR-tools cp_model module. The other is solved by an external algorithm. Now, ideally I would do the following:
1. the cp-solver finds a solution to the first subproblem,
2. pause the solver,
3. solve the second subproblem with an external algorithm, taking as argument the solution found by the cp-solver,
4. feed the result of the external algorithm back to the cp-solver,
5. the cp-solver now considers as the objective value the sum of the solution it itself found to the first subproblem and the solution found by the external algorithm,
6. the cp-solver goes to the next iteration and repeats steps 1-6 for a new assignment.
So my question is: is there any functionality in Google OR-tools that lets me do something like steps 1-6, where the solver runs in cooperation with external algorithms and is fed values accordingly? I'm new to using this module, so I'm unaware of what terms I could search for on Google to find what I need. Thanks a lot, my friends. Best regards, 30centimeter.
In the CP-SAT solver, solve() is stateless and a black box.
The only thing you can do is modify the model and re-solve.

Using Gurobi to run a MIQP: how can I improve time performance?

I am using Gurobi to run a MIQP (Mixed Integer Quadratic Programming) with linear constraints in Matlab. The solver is very slow and I would like your help to understand whether I can do something about it.
These are the lines I use to set up and launch the problem:
clear model;
clear params;
model.A=[Aineq; Aeq];        % stacked inequality and equality constraint matrix
model.rhs=[bineq; beq];      % right-hand sides
model.sense=[repmat('<', size(Aineq,1),1); repmat('=', size(Aeq,1),1)];  % constraint senses
model.Q=Q;                   % quadratic objective term
model.obj=c;                 % linear objective term
model.vtype=type;            % variable types
model.lb=total_lb;
model.ub=total_ub;
params.MIPGap=10^(-1);       % relative MIP gap tolerance
result=gurobi(model,params);
This is a screenshot of the output in the Matlab window.
Question 1: This is the first time I am trying to run a MIQP, and I would like your advice on what I can do to improve performance. Let me tell you what I have tried so far:
I cheated by imposing params.MIPGap=10^(-1). This shortens the node-exploration phase. What are the cons of doing this?
I have big-M coefficients, and I have tightened them to the smallest possible values.
I have tried setting params.ScaleFlag=2; params.ObjScale=2, but it makes things slower.
I have changed params.method, but it does not seem to help (unless you have some specific recommendation).
I have increased params.Threads, but it does not seem to help.
Question 2 (minor): Why do I get a negative objective in the root simplex log? How can the objective function be negative?
Without the full model at hand, there is not much advice to give. Tight big-M formulations are important, but you said you have already checked them. Sometimes splitting them up might help, but this is a complex field.
What might give great benefits for some problems is the Gurobi parameter tuning tool. Try exporting your model and feeding it to the tuning tool. It automatically tries different combinations of the hundreds of tuning parameters and might give some nice results.
Regarding the question about negative objectives in the simplex logs, I can think of a couple of possible explanations. First, note that the negative objective values occur in the presence of dual infeasibilities in the dual simplex run. In such a case, I'm not sure exactly what the primal objective values correspond to. Second, if you have a MIQP with products of binaries in the objective, Gurobi may convexify the objective in a way that makes it possible for a negative objective to appear in the reformulated model even when the original model must have a nonnegative objective in any feasible solution.

How to decide whether to split a cluster or not?

I am given a cluster. How can I decide whether splitting the cluster into two parts is better than keeping the original cluster?
I have tried using k-means with k = 2 and am stuck again. Is it better to split or not to split?
Edit: Well, I don't get the downvotes... A little explanation would be helpful for improving the question :D
The literature proposes different metrics, e.g.,
Bayesian Information Criterion
Akaike Information Criterion
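As a sketch, a BIC-based split decision might look like the following. This is a simplified 1-D Gaussian version: the parameter counts (2 for one component, 4 for two) ignore mixture weights, and the left/right split would come from, e.g., the k = 2 k-means run mentioned in the question.

```python
import math

def gaussian_loglik(xs):
    """Maximized log-likelihood of a 1-D Gaussian fitted to xs (MLE variance)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    var = max(var, 1e-12)  # guard against zero variance
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def bic(n, loglik, n_params):
    """BIC = k*ln(n) - 2*ln(L); lower is better."""
    return n_params * math.log(n) - 2 * loglik

def should_split(xs, left, right):
    """True if modeling xs as two Gaussians (one per sub-cluster)
    has lower BIC than a single Gaussian."""
    one = bic(len(xs), gaussian_loglik(xs), 2)
    two = bic(len(xs), gaussian_loglik(left) + gaussian_loglik(right), 4)
    return two < one

xs = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
print(should_split(xs, xs[:3], xs[3:]))  # two well-separated groups -> True
```

The extra parameters of the two-component model are penalized by the k*ln(n) term, so the split is accepted only when it improves the fit enough to pay for them.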

Maximum Likelihood, Matlab

I'm writing code that performs maximum likelihood estimation (MLE). At each step, I compute the gradient at one point and then move along it to another point. But I have trouble determining the magnitude of the move. How do I determine the best step size for good convergence? Can you also give me advice on how to avoid other pitfalls, such as the presence of several maxima?
Regarding the presence of several maxima: this issue occurs when the function being maximized is not concave. It can be partially addressed by multi-start optimization, which essentially means that you run the optimization multiple times in order to find as many maxima as possible and then select the 'highest' maximum among them. Note that this does not guarantee global optimality, as the global optimum might be hard to reach (i.e. the local optima may have larger domains of attraction).
Regarding the optimal step size for convergence: you might want to look at backtracking line search. A short explanation of it can be found in the answer to this question.
We might be able to give you more specific help if you could give us some code to look at, as jkalden already pointed out.
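The backtracking (Armijo) line search mentioned above can be sketched in a few lines. This illustration is in Python rather than Matlab, and uses a hypothetical 1-D example, the exponential log-likelihood, whose MLE is known in closed form (lambda = n / sum(x)) so the result can be checked:

```python
import math

# Hypothetical data; the exponential MLE is n / sum(data).
data = [0.5, 1.2, 0.8, 2.0, 1.5]
n, s = len(data), sum(data)

def loglik(lam):
    """Exponential log-likelihood: n*log(lam) - lam*sum(x)."""
    return n * math.log(lam) - lam * s

def grad(lam):
    return n / lam - s

def ascend(lam, alpha0=1.0, beta=0.5, c=1e-4, tol=1e-8, max_iter=200):
    """Gradient ascent with backtracking (Armijo) line search."""
    for _ in range(max_iter):
        g = grad(lam)
        if abs(g) < tol:
            break
        step = alpha0
        # Shrink the step until the Armijo sufficient-increase condition
        # holds; also keep lam positive, since log(lam) needs lam > 0.
        while (lam + step * g <= 0
               or loglik(lam + step * g) < loglik(lam) + c * step * g * g):
            step *= beta
        lam = lam + step * g
    return lam

lam_hat = ascend(0.1)
```

The idea is that the step size is not fixed: each iteration starts optimistic (alpha0) and halves the step until the objective actually increases by a sufficient amount, which is what gives the method its robust convergence.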

How to cluster categorical variables?

What's the most appropriate family of Machine Learning algorithms for clustering categorical data? Let's assume that we have the following dataset:
V1 V2 V3 V4
"v1a" "v2b" "v3b" "v4c"
"v1b" "v2f" "v3a" "v4c"
"v1a" "v2e" "v3b" "v4c"
Is there any way to cluster them somehow? I am particularly interested in doing so through Apache Mahout. Any hint / idea is highly appreciated.
The question that you need to answer first is:
What is a cluster?
Obviously, many of the existing cluster definitions (e.g. points connected by steps of Euclidean distance less than epsilon) will not be useful.
There are tricks to vectorize such data so that you can still run k-means on it.
But more often than not, the results will be useless, because people did not first consider what they are doing.
So first try to find out what you want to do, then look for tools to do that.
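For completeness, the "tricks to vectorize such data" mentioned above usually mean one-hot encoding: one binary column per (attribute, value) pair. A minimal sketch (the `one_hot` helper is illustrative, not a Mahout API):

```python
def one_hot(rows):
    """One-hot encode categorical rows: one 0/1 column per (attribute, value)."""
    n_cols = len(rows[0])
    # Collect the distinct values seen in each column, in a stable order.
    values = [sorted({r[c] for r in rows}) for c in range(n_cols)]
    encoded = []
    for r in rows:
        vec = []
        for c in range(n_cols):
            vec.extend(1 if r[c] == v else 0 for v in values[c])
        encoded.append(vec)
    return encoded, values

rows = [("v1a", "v2b"), ("v1b", "v2f"), ("v1a", "v2e")]
X, values = one_hot(rows)
```

Note that the squared Euclidean distance between two such encoded rows is exactly twice the number of mismatched attributes, so k-means on this encoding implicitly optimizes a matching-style criterion, which is why the warning above still applies: decide first whether that notion of "cluster" answers your question.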