ELKI Kmeans clustering Task failed error for high dimensional data - cluster-analysis

I have 60,000 documents which I processed in gensim, giving a 60000*300 matrix. I exported this as a CSV file. When I import it into the ELKI environment and run k-means clustering, I get the error below.
Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types: DBID DoubleVector,variable,mindim=266,maxdim=300 LabelList
at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:81)
at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:105)
at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:112)
at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
at [...]
Below are the ELKI settings I used.

The error (which took me a bit to understand when I saw it the first time) says that your data has the "shape"
variable,mindim=266,maxdim=300
That is, some lines have only 266 columns while others have 300. This may be a file format issue, for example due to NaN, missing values, or similar bad characters.
You get this error if you try to run an algorithm like k-means that assumes the data comes from an R^d vector space (that is the NumberVector,field requirement), because the input data does not meet this requirement.
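A quick way to confirm this is to count the columns per row of the exported file; a minimal sketch (the file name is taken from the question, adjust the expected width to your data):

from collections import Counter

# Tally the number of comma-separated columns in each row of the export.
widths = Counter()
with open('expo22.csv') as f:
    for lineno, line in enumerate(f, 1):
        n = len(line.rstrip('\n').split(','))
        widths[n] += 1
        if n != 300:
            print('line', lineno, 'has', n, 'columns')

print(widths)  # something like {300: 59990, 266: 10} would explain mindim=266,maxdim=300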

This sounds strange, but I found a workaround for this issue by opening the exported CSV file, doing Save As, and saving it again as a CSV file. While the size of the original file is 437MB, the second file is 163MB. I had used the numpy function np.savetxt for saving the doc2vec vectors, so it seems to be a Python issue rather than an ELKI issue.
Edit: the solution above is not reliable. Instead, I exported the doc2vec output created with the gensim library, explicitly setting the format of the values to %1.22e while exporting, i.e. the values are written in exponential notation with 22 digits of precision. Below is the entire code.
import numpy as np

textVect = model.docvecs.doctag_syn0  # the doc2vec vectors from the trained gensim model
np.savetxt(r'D:\Backup\expo22.csv', textVect, delimiter=',', fmt='%1.22e')  # raw string avoids backslash escapes in the path
The CSV file created this way runs without any issue in the ELKI environment.
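If NaN values are the underlying cause, it may also be worth checking the matrix before exporting; a minimal sketch, assuming the textVect array from above:

import numpy as np

# ELKI cannot build a fixed-dimensional vector field from rows containing bad values.
assert not np.isnan(textVect).any(), 'doc2vec matrix contains NaN values'
assert not np.isinf(textVect).any(), 'doc2vec matrix contains infinite values'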

Related

Best Practice to Store Simulation Results

Dear AnyLogic community,
I am struggling to find the right approach for storing my simulation results. I have created datasets that keep track of every value I am interested in. They live in Main (see below).
My aim is to do a parameter variation experiment. In every run, I change the value of p_nDrones (see below).
After the experiment, I would like to store all the datasets in one excel sheet.
However, when I do the parameter variation experiment and afterwards check the log of the dataset (datasets_log), the changed values do not even show up (2 is the value I set in the normal simulation).
Now my question. Do I need to create another type of dataset if I want to track the values that are produced in the experiments? Why are they not stored after executing the experiment?
I would really appreciate it if someone could share the best way to set up this export of experiment results. I would like to store the whole time series for every dataset.
Thank you!
The best option would be to write the outputs to an external file at the end of each model run.
If you want to use Excel, you can; it does have a nice excelFile.writeDataSet() function, although I personally would not advise it.
I would rather write the data to a text file, as you have much more control over the writing and the file itself, it is thread-safe, and the output is usable on many more platforms than Microsoft Excel.
See my example below:
Set up parameters of type TextFile in your model that you will write the data to at the end of the run. Here I used the model's "On destroy" code to write out the data from the datasets.
Here you can immediately see the benefit of using the text file! You can add the number of drones we are simulating (or scenario name or any other parameter) in a column, whereas with Excel this would be a pain...
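A minimal sketch of what that "On destroy" code could look like; ds_throughput, p_nDrones and f_output (the TextFile parameter) are assumed names, not from the original model:

// Model "On destroy" code: dump the dataset, with the scenario parameter as an extra column.
for (int i = 0; i < ds_throughput.size(); i++) {
    f_output.println(p_nDrones + "\t" + ds_throughput.getX(i) + "\t" + ds_throughput.getY(i));
}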
Now you can pass your specific text file to the model to use by adding it to the parameter variation page, providing it to the model through the parameters.
You will see that I also set up some headers for the text file in the "Initial experiment setup" part, and then at the very end of the experiment I close the text files in the "After experiment" section so that they can be used.
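As a sketch with the same assumed names, the experiment-side code could be:

// "Initial experiment setup": write the header row once.
f_output.println("nDrones\ttime\tthroughput");

// "After experiment": close the file so other programs can open it.
f_output.close();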
Here is the result if you simply right-click on the text files and open them in Excel. (Excel will always have a purpose, even if it is just to open text files ;-) )

data processing of Dymola's result during simulation

I am working on a complex Modelica model that contains a large set of data, and I need the simulation to keep going until I terminate the simulation process, maybe even for days, so the .mat file could get very large and I am having trouble with the data processing. So I'd like to ask if there are any methods that allow me to:
1. Output the data I need at a fixed interval during the simulation, rather than using the .mat file after the simulation. I am considering using the Modelica.Utilities.Streams.print function to print the data I need into a CSV file, but I would have to write a huge amount of code to print every variable I need, so I think there should be a better solution.
2. Delete the .mat file at a fixed interval, so the .mat file stored on my PC doesn't get too large and doesn't affect the normal simulation in Dymola.
A long time ago I wrote a small C program that runs the Dymola executable with two threads, one of which is responsible for terminating the whole simulation after an input time limit is exceeded. I used the executable of this C program within the standard m-files provided by Dymola. I think with some hacking one would be able to meet the mentioned requirements.
Have a look at https://github.com/Mathemodica/dymmat; however, I need to warn that the associated m-files were for a particular type of model and the software has not been maintained for a long time. However, the idea of the C program would be reproducible.
I didn't fully test this, so please think of this more like "source of inspiration" than a full answer:
In Section "4.3.6 Saving periodic snapshots during simulation" of the Dymola 2021 Release Notes you'll find a description to do the following:
The simulator can be instructed to print the simulation result file “dsfinal.txt” snapshots during simulation.
This can be done periodically using the Simulation Setup options "Complete result snapshots", but I think for your case it could be more useful to trigger it from the model using the function Dymola.Simulation.TriggerResultSnapshot(). A simple example is given as well:
when x > 0 then
Dymola.Simulation.TriggerResultSnapshot();
end when;
Also one property of this function could help, as it by default creates multiple files without overwriting them:
By default, a time stamp is added to the snapshot file name, e.g.: “dsfinal_0.1.txt”.
The format of the created dsfinal_[TIMESTAMP].txt is a bit overwhelming at first, as it contains all the information for initializing the model, but everything you need should be in there...
So some effort is shifted to the post processing, as you will likely need to read multiple files, but I think this is an acceptable trade-off.
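As a starting point for that post-processing, a minimal sketch (assuming the default dsfinal_<time>.txt naming in the working directory):

import glob
import re

# Collect the snapshot files and sort them by the time stamp in the file name.
snapshots = glob.glob('dsfinal_*.txt')

def stamp(path):
    m = re.search(r'dsfinal_([0-9.eE+-]+)\.txt', path)
    return float(m.group(1)) if m else float('inf')

for path in sorted(snapshots, key=stamp):
    print(path)  # parse each snapshot here; the file format is documented in the Dymola manual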

Writing different experiment output run to different cells in a sheet (Excel file)

I am running a model with different input values simultaneously, and it produces different output on each run. I am trying to write code that will get AnyLogic to write each experiment run's output to different cells in an Excel sheet, i.e. throughput vs. time. I am using a dataset. I am wondering if there is any script or hint that can help solve the issue?
Currently I am using the following commands. They keep overwriting the output using the same cells.
Out_excelFile1.setCellValue("Sink1 Out",2,2,2);
Out_excelFile1.writeDataSet(Sink1_D,2,3,2);
Best if you actually use the built-in database for outputs and only write to Excel at the end of all runs, tbh.
But in your case, you need to change the row number by your replication/iteration number. Use getCurrentIteration() or getCurrentReplication() in your "after simulation run" or "after replication" or "after iteration" experiment code sections to get this right.
Then, it would look something like Out_excelFile1.setCellValue("Sink1 Out",2,getCurrentIteration(),2);
(Details depend on your actual implementation, check the help for further info on replications, iterations and those functions)
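As a sketch (the names are from the question, the offsets are illustrative), the "After iteration" code could look like:

// "After iteration" experiment code: shift the target row per iteration so runs don't overwrite each other.
// The step (here 100) must be large enough to fit one dataset.
int row = 2 + 100 * getCurrentIteration();
Out_excelFile1.setCellValue("Sink1 Out", 2, row, 2);
Out_excelFile1.writeDataSet(Sink1_D, 2, row + 1, 2);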

Importing data from postgres to cytoscape

I have been trying to load some GIS data from a PostGIS database into Cytoscape 3.6. I am trying to get some inDegree and outDegree values, and I have used the SIF file format.
As long as the data is written out in the following format
source_point\tinteracts with\ttarget_point
Cytoscape is happy to read it.
I am just wondering if there is any way of including my own metric for the cost of getting between source_point and target_point.
Sure! There are several ways to read text files into Cytoscape -- SIF is just one of them. I would create a file that looks like SIF, but is actually a more complete text file:
Source\tTarget\tScore
source_point\ttarget_point\t1.1
...
Then use "File->Import Network->File", choose your source and target, and leave score as an edge attribute. You can have as many attributes on each line as you want, and can even mix edge attributes, source node attributes, and target node attributes.
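Since the edges already live in PostGIS, generating that file could look like this; a minimal sketch, where the table and column names (edges, source_point, target_point, cost) and the connection string are assumptions:

import csv
import psycopg2  # any PostgreSQL client library would do

conn = psycopg2.connect('dbname=gis user=postgres')
with conn.cursor() as cur, open('network.txt', 'w', newline='') as f:
    w = csv.writer(f, delimiter='\t')
    w.writerow(['Source', 'Target', 'Score'])  # header row matching the example above
    cur.execute('SELECT source_point, target_point, cost FROM edges')
    for row in cur:
        w.writerow(row)
conn.close()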
-- scooter

Datastage is not reading the correct dataset

I have a DataStage job that compares a previous dataset with a current one and loads the values into another dataset. The problem is that it does not find the correct previous dataset and fails to see the physical file.
This happens to only some of the files, and it is not a parameter problem, as all the parameters are project-level and set up correctly.
Is there a possibility that it is being overwritten? What type of error do you get? Was the dataset created correctly? Is the associated file system full? There are many possibilities for what could be happening.