XMeans ELKI fails at every third input file

I'm trying to cluster image data (stored in 100 separate CSV files) with ELKI's XMeans algorithm. It works well for the first two files, but then the algorithm hangs forever while processing the third file. The problem seems to occur at roughly every third file: when I start the loop over all files at the fourth file, it works for the fourth and fifth files but not for the sixth. The same goes for the 9th and 11th files... but maybe that's coincidence.
My XMeans call looks like this:
DatabaseConnection dbc = new ArrayAdapterDatabaseConnection(data);
Database db = new StaticArrayDatabase(dbc, null);
db.initialize();
Relation<NumberVector> rel = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD);
DBIDRange ids = (DBIDRange) rel.getDBIDs();
SquaredEuclideanDistanceFunction dist = SquaredEuclideanDistanceFunction.STATIC;
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(RandomFactory.DEFAULT);
KMeansInitialization initializer = new FirstKInitialMeans();
PredefinedInitialMeans splitInitializer = new PredefinedInitialMeans(data);
KMeansQualityMeasure informationCriterion = new WithinClusterMeanDistanceQualityMeasure();
RandomFactory random = new RandomFactory(123);
KMeans<NumberVector, KMeansModel> innerKMeans = new KMeansHamerly<>(dist, 50, 1, init, true);
XMeans<NumberVector, KMeansModel> xm = new XMeans<>(dist, 5, 50, 1, innerKMeans, initializer, splitInitializer, informationCriterion, random);
Clustering<KMeansModel> c = xm.run(db, rel);
I'm not too sure about these four lines, so maybe that's why it works for some files and not for others:
KMeansInitialization initializer = new FirstKInitialMeans();
PredefinedInitialMeans splitInitializer = new PredefinedInitialMeans(data);
KMeansQualityMeasure informationCriterion = new WithinClusterMeanDistanceQualityMeasure();
RandomFactory random = new RandomFactory(123);
data is just a double[][] which contains the data from the input files.
Any help would be very appreciated!

Please use the Parameterization API to configure X-means.
Because of the nested k-means, it is very easy to configure things badly.
The initializer of the inner k-means class must be set to this:
PredefinedInitialMeans splitInitializer = new PredefinedInitialMeans((double[][]) null);
KMeans<NumberVector, KMeansModel> innerKMeans = new KMeansHamerly<>(dist, 50, 1, splitInitializer, true);
because otherwise X-means currently cannot control the initialization of the inner algorithm. I will remove this parameter, and have XMeans set the initializer of the inner algorithm.
Without a stack trace (as mentioned by @Anony-Mousse) it is hard to say what is happening. My best guess is that this meta-algorithm (an algorithm that runs another algorithm!) is not correctly configured and maybe chooses bad initial values?
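A rough sketch of what that could look like (illustration only, assuming ELKI 0.7.x; the exact option constants should be checked against your version's parameter documentation):
// Sketch: let ELKI construct and wire X-means from parameters instead of calling constructors by hand.
ListParameterization params = new ListParameterization();
params.addParameter(KMeans.K_ID, 50); // maximum k; k_min, seed, inner k-means etc. can be added the same way
XMeans<NumberVector, KMeansModel> xm = ClassGenericsUtil.parameterizeOrAbort(XMeans.class, params);
Clustering<KMeansModel> c = xm.run(db, rel);
Everything not set explicitly (inner k-means, split initializer, quality measure) is then filled in with consistent defaults, which avoids exactly the kind of misconfiguration described above.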

Ironpython script in Ansys Customization tool

I'm a beginner in Python and I'm working with the Ansys Customization Tool (ACT) to add my own extension.
Is there a direct way to fill a file with every node's coordinates after deformation?
hopefully in 3 rows or columns: x, y, z
So far I have only found the GetNodeValue object, but it only gives me the displacement, and I need the deformed coordinates for the entire model.
My first idea was to add the displacements to the initial coordinates but I didn't manage to do it.
Many thanks for your help
Lara
APDL Snippet
Add an APDL Snippet in the solution part of the tree:
/prep7
UPGEOM,1,1,1,file,rst ! adds the displacements to the nodal coordinates.
cdwrite,geom,nodesAndElements,geom ! Writes node and element data to nodesAndElements.geom
I'm not sure if you can work with the output format from cdwrite, but this is the quickest solution I can think of.
If you want to automate this, you have to insert the command snippet via:
solution = ExtAPI.DataModel.Project.Model.Analyses[0].Solution
fullPath = "path//to//snippet"
snippet = solution.AddCommandSnippet()
snippet.ImportTextFile(fullPath)
ACT
If you want to stay in ACT it could be done like this:
global nodeResults
import units
analysis = ExtAPI.DataModel.Project.Model.Analyses[0]
mesh = analysis.MeshData
# Get nodes
allNodes = mesh.Nodes
# get the result data
reader = analysis.GetResultsData()
# get the deformation result
myDeformation = reader.GetResult("U")
nodeResultsTemp = []
result_unit = myDeformation.GetComponentInfo("X").Unit
for node in allNodes:
    # get node deformation and convert values in meter
    deformationNode1 = myDeformation.GetNodeValues(node.Id)
    deformationNode1[0] = units.ConvertUnit(deformationNode1[0],result_unit,"m","Length")
    deformationNode1[1] = units.ConvertUnit(deformationNode1[1],result_unit,"m","Length")
    deformationNode1[2] = units.ConvertUnit(deformationNode1[2],result_unit,"m","Length")
    # add node coordinates (in meter) to the displacement
    mesh_unit = mesh.Unit
    node1 = mesh.NodeById(node.Id)
    node1CoorX = units.ConvertUnit(node1.X,mesh_unit,"m","Length")
    node1CoorY = units.ConvertUnit(node1.Y,mesh_unit,"m","Length")
    node1CoorZ = units.ConvertUnit(node1.Z,mesh_unit,"m","Length")
    deformationNode1[0] = deformationNode1[0]+node1CoorX
    deformationNode1[1] = deformationNode1[1]+node1CoorY
    deformationNode1[2] = deformationNode1[2]+node1CoorZ
    nodeResultsTemp.append([node1.X,node1.Y,node1.Z,deformationNode1[0],deformationNode1[1],deformationNode1[2]])
nodeResults = nodeResultsTemp
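If you then want the x, y, z file from the question, nodeResults can be written out with plain Python afterwards (a small sketch; the output path and the choice of columns are just examples):
# Hypothetical follow-up: write the deformed coordinates (in meters) to a CSV file
outFile = open("C:\\temp\\deformedNodes.csv", "w")
outFile.write("x,y,z\n")
for row in nodeResults:
    # row = [x0, y0, z0, xDeformed, yDeformed, zDeformed]
    outFile.write("%s,%s,%s\n" % (row[3], row[4], row[5]))
outFile.close()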

assigning a destination frame to a prediction in h2o.ai

I like h2o.ai for machine learning using R.
https://cran.r-project.org/web/packages/h2o/h2o.pdf
I like random forests, but I'm making a few thousand predictions in a loop.
Each call to h2o.predict leaves another auto-named prediction frame behind, and that is spamming up my memory.
I can't afford to keep them all. I'm making my very nice computer work very hard, and it doesn't have the capacity to hold all the balls in the air at once.
If I could assign a destination frame name to the prediction then each new one would overwrite the old ones.
How do I assign a destination frame name when I am performing "h2o.predict" on an object?
Things that I have tried that did not work:
h2o.predict(object = rf.hex, newdata = test.hex, predictions_frame = "predict.hex")
h2o.predict(object = rf.hex, newdata = test.hex, destination_frame = "predict.hex")
h2o.predict(object = rf.hex, newdata = test.hex, model_id = "predict.hex")
There is no way that I am aware of.
But as an alternative, inside your loop, you could call h2o.rm() on the return value from h2o.predict(). It is worth calling h2o.gc() as well. Something like:
for(data in alldata){
  # ... prepare newdata
  p = h2o.predict(model, newdata)
  # ... do something with p here
  h2o.rm(p)
  h2o.rm(newdata) # If also not needed any more
  h2o.gc()
}
Aside: you said "I'm making a few thousand predictions in a loop". Assuming they were all against the same model, remember you can batch them up, and give all thousand predictions in a single newdata dataframe. One call to h2o.predict() with 1000 entries is much more efficient than making 1000 h2o.predict() calls, for one newdata entry at a time.
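A rough sketch of that batching idea (the frame names are just placeholders):
# Combine the small test frames once (or import all rows in one go)...
big.hex <- h2o.rbind(test1.hex, test2.hex, test3.hex)
# ...then make a single prediction call for all of them
p <- h2o.predict(object = rf.hex, newdata = big.hex)
# If per-file results are needed, slice p back apart by row ranges afterwards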

What is the correct svmlight input format in Mallet?

I am using Mallet with the SVMLight input format to do classification using the NaiveBayes classifier, but I get a NumberFormatException. I'm wondering how I can use string features when using SVMLight. As I read in the guideline [1], the features can also be strings.
Can anyone tell me what is wrong with my code or input?
Here is my code:
public void trainMalletNaiveBayes() throws Exception {
ArrayList<Pipe> pipes = new ArrayList<Pipe>();
pipes.add(new SvmLight2FeatureVectorAndLabel());
pipes.add(new PrintInputAndTarget());
SerialPipes pipe = new SerialPipes(pipes);
//prepare training instances
InstanceList trainingInstanceList = new InstanceList(pipe);
trainingInstanceList.addThruPipe(new CsvIterator(new FileReader("/tmp/featureFiles_svm.csv"), "^(\\S*)[\\s,]*(.*)$", 2, 1, -1));
//prepare test instances
InstanceList testingInstanceList = new InstanceList(pipe);
testingInstanceList.addThruPipe(new CsvIterator(new FileReader("/tmp/test_set.csv"), "^(\\S*)[\\s,]*(.*)$", 2, 1, -1));
ClassifierTrainer trainer = new NaiveBayesTrainer();
Classifier classifier = trainer.train(trainingInstanceList);
And here are the first three lines of my input file:
No f1:NP f2:NN f3:1 f4:1 f5:0 f6:0 f7:0 f8:0.0 f9:1 f10:true f11:false f12:false f13:false f14:false f15:ROOT f16:NN f17:NOTHING
No f1:NP f2:NN f3:8 f4:4 f5:0 f6:0 f7:1 f8:4.127134385045092 f9:8 f10:true f11:false f12:false f13:false f14:false f15:ROOT f16:DT f17:NOTHING
Yes f1:NP f2:NN f3:4 f4:3 f5:0 f6:0 f7:0 f8:0.0 f9:4 f10:true f11:false f12:false f13:false f14:false f15:NP f16:DT f17:NN
The first column is the label of the instance and the rest of the data includes the features and their values. For example, NN shows the POS of the head word of a phrase.
However, I get the exception for the NN (NumberFormatException: For input string: "NN"). I'm wondering why it doesn't have any problem with the NP that comes before it, but stops at the NN.
All features need to have numeric values. For booleans you can use true=1 and false=0. You would also have to convert categorical features, e.g. replace f1:NP with an indicator feature such as f1_NP:1.
The reason it's not dying on the NP is that the SvmLight2FeatureVectorAndLabel class is expecting to parse an entire line (label and data), but the code is reading the file with a CsvIterator that is splitting off the first element as a label.
The classify.tui.SvmLight2Vectors class uses this code for an iterator:
new SelectiveFileLineIterator (fileReader, "^\\s*#.+")
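For example, the first line of the input above could be rewritten with numeric values only (the indicator feature names are just one possible encoding):
No f1_NP:1 f2_NN:1 f3:1 f4:1 f5:0 f6:0 f7:0 f8:0.0 f9:1 f10:1 f11:0 f12:0 f13:0 f14:0 f15_ROOT:1 f16_NN:1 f17_NOTHING:1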

Is there a way to change directory in Modelica/Dymola automatically?

I have the following problem:
I have over 20 different models which I want to simulate one after another but I want to change the simulation directory each time.
Right now I'm manually changing directory after each simulation (from ./ModelOne to ./ModelTwo) and I'd like to know if there's a way to change it automatically when I initialize or translate the new model.
Regards
Nev
The best way is to write a script, I think:
pathOfSave = {"E:\\work\\modelica\\SimulationResult\\Model1\\","E:\\work\\modelica\\SimulationResult\\Model2\\"};
nbSim = 2;
nbOfPoi = 500; // number of output points per simulation (example value)
pathOfMod = {"MyModel.",
             "MyModel."};
modelsToSimulate = {"Model1",
                    "Model2"};
//If equdistant=true: ensure that the same number of data points is written in all result files
//store variables at events is disabled.
experimentSetupOutput(equdistant=false, events=false);
for i in 1:nbSim loop
//delete the result file if it already exists
Modelica.Utilities.Files.removeFile(pathOfSave[i] + modelsToSimulate[i] + ".mat");
//translate models
translateModel(pathOfMod[i]+modelsToSimulate[i]);
// simulate
simulateModel(
pathOfMod[i]+modelsToSimulate[i],
method="dassl",
stopTime=186350,
numberOfIntervals=nbOfPoi,
resultFile=pathOfSave[i] + modelsToSimulate[i]);
end for;
You can also put the command cd("mynewpath") in the initial algorithm section, if you want it to be attached to the model.
model example
Real variable;
protected
parameter String currDir = Modelica.Utilities.System.getWorkDirectory();
initial algorithm
cd("C:\\Users\\xxx\\Documents\\Dymola\\MyModelFolder");
equation
variable = time;
when terminal() then
cd(currDir);
end when;
end example;
In any case, you can find all Dymola commands in manual one, under the section "builtin commands".
I hope this helps,
Marco

Does a function cache exist in Matlab?

In Python we have lru_cache as a function wrapper. Add it to your function and the function will only be evaluated once per different input argument.
Example (from Python docs):
@lru_cache(maxsize=None)
def fib(n):
if n < 2:
return n
return fib(n-1) + fib(n-2)
>>> [fib(n) for n in range(16)]
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610]
>>> fib.cache_info()
CacheInfo(hits=28, misses=16, maxsize=None, currsize=16)
I wonder whether a similar thing exists in Matlab? At the moment I am using cache files, like so:
function result = fib(n)
% FIB example like the Python example. Don't implement it like that!
cachefile = ['fib_', num2str(n), '.mat'];
try
load(cachefile);
catch e
if n < 2
result = n;
else
result = fib(n-1) + fib(n-2);
end
save(cachefile, 'result');
end
end
The problem I have with doing it this way, is that if I change my function, I need to delete the cachefile.
Is there a way to do this with Matlab realising when I changed the function and the cache has become invalidated?
Since MATLAB R2017a this is available:
https://nl.mathworks.com/help/matlab/ref/memoizedfunction.html
f = memoize(@sin)
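For example, applied to the fib function from the question (a minimal sketch for R2017a or newer):
fibm = memoize(@fib);  % wrap the existing fib function in a MemoizedFunction
fibm(30);              % first call: computed and stored in the cache
fibm(30);              % second call: answered from the cache
s = fibm.stats;        % cache statistics, roughly comparable to fib.cache_info()
% Note: recursive calls inside fib still go to the original function,
% so only the top-level calls are cached.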
I've created something like this for my own personal use: a CACHE class. (I haven't documented the code yet though.) It appears to be more flexible than Python's lru_cache (I wasn't aware of that, thanks) in that it has several methods for adjusting exactly what gets cached (to save memory) and how the comparisons are made. It could still use some refinement (@Daniel's suggestion to use the containers.Map class is a good one – though it would limit compatibility with old Matlab versions). The code is on GitHub so you're welcome to fork and improve it.
Here is a basic example of how it can be used:
function Output1 = CacheDemo(Input1,Input2)
persistent DEMO_CACHE
if isempty(DEMO_CACHE)
% Initialize cache object on first run
CACHE_SIZE = 10; % Number of input/output patterns to cache
DEMO_CACHE = CACHE(CACHE_SIZE,Input1,Input2);
CACHE_IDX = 1;
else
% Check if input pattern corresponds something stored in cache
% If not, return next available CACHE_IDX
CACHE_IDX = DEMO_CACHE.IN([],Input1,Input2);
if ~isempty(CACHE_IDX) && DEMO_CACHE.OUT(CACHE_IDX) > 0
[~,Output1] = DEMO_CACHE.OUT(CACHE_IDX);
return;
end
end
% Perform computation
Output1 = rand(Input1,Input2);
% Save output to cache CACHE_IDX
DEMO_CACHE.OUT(CACHE_IDX,Output1);
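Called like a normal function, repeated calls with the same inputs are then served from the cache:
A = CacheDemo(3,4);   % first call: computes rand(3,4) and stores it
B = CacheDemo(3,4);   % second call: returned from the cache
isequal(A,B)          % true, because the cached result is reused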
I created this class to cache the results from time-consuming stochastic simulations and have since used it to good effect in a few other places. If there is interest, I might be willing to spend some time documenting the code sooner as opposed to later. It would be nice if there was a way to limit memory use as well (a big consideration in my own applications), but getting the size of arbitrary Matlab datatypes is not trivial. I like your idea of caching to a file, which might be a good idea for larger data. Also, it might be nice to create a "lite" version that does what Python's lru_cache does.