ELKI: How to Specify Feature Columns of CSV for K-Means

I am trying to run K-Means using ELKI MiniGUI. I have a CSV dataset of 15 features (columns) and a label column. I would like to do multiple runs of K-Means with different combinations of the feature columns.
Is there anywhere in the MiniGUI where I can specify the indices of the columns I would like to use for clustering?
If not, what is the simplest way to achieve this by changing/extending ELKI in Java?

This is obviously easily achievable with Java code, or simply by preprocessing the data as necessary: generate the 10 variants, then launch ELKI via the command line.
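For example, a minimal preprocessing sketch in Python/pandas (the file names and feature subsets below are hypothetical) that writes one CSV per column combination:

import pandas as pd

# Hypothetical input: 15 feature columns plus a label column.
data = pd.read_csv("data.csv")
label = "label"  # assumed name of the label column

# Each entry is one combination of feature columns to cluster on.
subsets = [
    ["f1", "f2", "f3"],
    ["f4", "f5", "f6", "f7"],
]

for i, cols in enumerate(subsets):
    # Write one CSV per feature subset (plus the label),
    # then point ELKI at each file in turn.
    data[cols + [label]].to_csv("variant_{}.csv".format(i), index=False)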
But there is a filter to select columns: NumberVectorFeatureSelectionFilter. To only use columns 0,1,2 (in the numeric part; labels are treated separately at this point; this is a vector transformation):
-dbc.filter transform.NumberVectorFeatureSelectionFilter
-projectionfilter.selectedattributes 0,1,2
The filter could be extended using our newer IntRangeParameter to allow for specifications such as 1..3,5..8; but this has not been implemented yet.

Related

How to write matrix to multiple ranges in excel spreadsheet?

I would like to add multiple matrices to multiple ranges within one spreadsheet. Since there are thousands of matrices, I do not want to call writematrix/xlswrite multiple times, for reasons of computing time. Is there another possibility out there?
MATLAB's documentation has nothing to offer; I wish I could just do:
xlswrite(filename,cell_array_with_matrices,'Sheetname',cell_array_with_ranges);

Equivalent of sklearn's StratifiedGroupKFold for PySpark?

I have a dataframe for single-label binary classification with some class imbalance and I want to make a train-test split. Some observations are members of groups in the data that should only appear in either the test split or train split but not both.
Outside of PySpark, I could use StratifiedGroupKFold from sklearn. What is the easiest way to achieve the same effect with PySpark?
I looked at the sampleBy method from PySpark, but I'm not sure how to use it while keeping the groups separate.
Documentation links:
StratifiedGroupKFold
sampleBy
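For reference, this is roughly what the sklearn approach looks like outside Spark (the toy arrays and the choice of n_splits below are made up for illustration):

import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

# Toy data: X features, y binary labels, groups are group ids.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 1, 0, 0, 1, 0, 0])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])

# Take the first fold of a 2-fold split as a 50/50 train/test split;
# every group lands entirely in train or entirely in test.
sgkf = StratifiedGroupKFold(n_splits=2, shuffle=True, random_state=42)
train_idx, test_idx = next(sgkf.split(X, y, groups))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]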

Some questions about the train_test_split() function

I am currently trying to use scikit-learn's LinearRegression() model to describe the relationship between two variables X and Y. Given a dataset with 8 columns and 1000 rows, I want to split this dataset into training and test sets using train_test_split.
My question: what is the difference between train_test_split(dataset, test_size, random_state=int) and train_test_split(dataset, test_size)? Does the second one (without setting random_state) give me a different test set and training set each time I re-run my program? Does the first one give me the same test set and training set every time I re-run my program? And what is the difference between setting, for example, random_state=42 versus random_state=43?
In scikit-learn, train_test_split splits your input data into two sets: (i) train and (ii) test. It has an argument random_state, which controls the randomness of the split.
If the argument is not given, the split is seeded differently on each run, so you will get a different train/test split every time you re-run your program.
If you want to measure the performance of your regression on the same data under different splits, you can vary random_state: each value gives a different pseudo-random split of the initial data, and each value is equally valid (random_state=42 and random_state=43 are simply two different fixed seeds). To reproduce a result later on the same data, pass the same random_state value you used before.
This is also useful for cross-validation in machine learning.
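A short illustration of the difference (toy arrays, made up for this example):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy features
y = np.arange(10)                 # toy labels

# No random_state: typically a different split on every run.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)

# Fixed random_state: exactly the same split on every run.
X_tr42, X_te42, y_tr42, y_te42 = train_test_split(X, y, test_size=0.2, random_state=42)

# A different seed is just as reproducible, but yields a different split than 42.
X_tr43, X_te43, y_tr43, y_te43 = train_test_split(X, y, test_size=0.2, random_state=43)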

Running k-medoids algorithm in ELKI

I am trying to run ELKI to implement k-medoids (for k=3) on a dataset in the form of an arff file (using the ARFFParser in ELKI).
The dataset has 7 dimensions, but the clustering results I obtain show clustering only along a single dimension, and only for 3 of the attributes, ignoring the rest.
Could anyone help with how I can obtain a clustering visualization for all dimensions?
ELKI is mostly used with numerical data.
Currently, ELKI does not have a "mixed" data type, unfortunately.
The ARFF parser will split your data set into multiple relations:
a 1-dimensional numerical relation containing age
a LabelList relation storing sex and region
a 1-dimensional numerical relation containing salary
a LabelList relation storing married
a 1-dimensional numerical relation storing children
a LabelList relation storing car
Apparently it has messed up the relation labels, though. But other than that, this approach works perfectly well with arff data sets that consist of numerical data plus a class label, for example, which is the use case this parser was written for. It is well-defined and consistent behaviour, though not what you expected it to do.
The algorithm then ran on the first relation it could work with, i.e. age only.
So here is what you need to do:
Implement an efficient data type for storing mixed type data.
Modify the ARFF parser to produce a single relation of mixed type data.
Implement a distance function for this type, because the lack of a mixed type data representation means we do not have a distance to go with it either.
Choose this new distance function in k-Medoids.
Share the code, so others do not have to do this again. ;-)
Alternatively, you could write a script to encode your data as a numerical data set; then it will work fine. But in my opinion, the results of one-hot encoding etc. are usually not very convincing.
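For the encoding route, a minimal sketch with pandas (assuming the arff data has been exported to CSV; the file and column names are assumptions based on the attributes listed above):

import pandas as pd

# Assumed file and column names.
df = pd.read_csv("data.csv")
numeric = df[["age", "salary", "children"]]

# One-hot encode the categorical attributes; the result is purely numeric
# and can be written back out as a plain CSV for ELKI to read.
categorical = pd.get_dummies(df[["sex", "region", "married", "car"]].astype(str))
pd.concat([numeric, categorical], axis=1).to_csv("encoded.csv", index=False)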

Cluster text documents in database

I have 20,000 text files loaded into a PostgreSQL database, one file per row, all stored in a table named docs with columns doc_id and doc_content.
I know that there are approximately 8 types of documents. Here are my questions:
How can I find these groups?
Are there some similarity, dissimilarity measures I can use?
Is there some implementation of longest common substring in PostgreSQL?
Are there some extensions for text mining in PostgreSQL? (I've found only Tsearch, but this seems to be last updated in 2007)
I could probably use something like LIKE '%%' or SIMILAR TO, but there might be a better approach.
You should use full text search, which is part of PostgreSQL 9.x core (aka Tsearch2).
For some kind of measure of longest common substring (or similarity if you will), you might be able to use levenshtein() function - part of fuzzystrmatch extension.
You can use a clustering technique such as K-Means or Hierarchical Clustering.
Yes, you can use the cosine similarity between documents, looking at binary term occurrence, term counts, term frequencies, or TF-IDF weights.
As for a longest-common-substring implementation in PostgreSQL, I don't know about that one.
Not sure about dedicated text-mining extensions, but you could use R or RapidMiner to do the data mining against your database.
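To illustrate the first two points, a rough Python sketch (connection details are placeholders; it pulls the documents from the docs table, builds TF-IDF vectors, and clusters them with K-Means, k=8 to match the expected number of document types):

import psycopg2
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder connection string; table and column names are from the question.
conn = psycopg2.connect("dbname=mydb user=me")
cur = conn.cursor()
cur.execute("SELECT doc_id, doc_content FROM docs")
rows = cur.fetchall()
doc_ids = [r[0] for r in rows]
texts = [r[1] for r in rows]

# TF-IDF vectors plus K-Means with k=8.
tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
vectors = tfidf.fit_transform(texts)
labels = KMeans(n_clusters=8, random_state=42, n_init=10).fit_predict(vectors)
# labels[i] is the cluster assigned to doc_ids[i].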