Self-organizing map understanding - neural-network

I have a self-organizing map created with SOM_PAK-3.1, shown here.
If I have three different types of elements, and they are different, why are the elements not in different parts of the map? Why are "A", "B" and "C" in so many cases together in the same hexagon? Why are "B" and "C" never alone in a hexagon?
Thanks in advance!

I feel this is a normal result for a SOM. The unsupervised SOM algorithm is not aware of the element labels. The neurons learn the input vectors according to the distance metric, and the elements are then placed as labels at their best-matching neurons.
One possible reason for two elements appearing on the same node is if they have the same values for each of the features. Otherwise, they have different values for each feature, but the values still seem similar according to the distance metric.
The spatial resolution can be increased by increasing the map size. This may allow the classes to become separable. However, the trade-off is that the statistical significance of each neuron goes down when it is associated with fewer data points. So I would suggest trying different map sizes to find the one that is appropriate for your data set and goals.
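For what it's worth, the labelling behaviour is easy to reproduce outside SOM_PAK. Here is a minimal sketch using the Python MiniSom library (not SOM_PAK itself); the data, labels, map size and training length are all placeholders:

```python
# Sketch with the MiniSom library (pip install minisom), not SOM_PAK itself.
# Shows how several labelled samples can end up on the same best-matching node.
from collections import defaultdict
import numpy as np
from minisom import MiniSom

# Hypothetical data: rows are feature vectors, labels are the element types.
data = np.random.rand(300, 5)                # 300 samples, 5 features (placeholder)
labels = np.random.choice(list("ABC"), 300)

som = MiniSom(10, 10, 5, sigma=1.0, learning_rate=0.5)  # 10x10 hexagon-like grid
som.random_weights_init(data)
som.train_random(data, 1000)

# Place each label on its best-matching unit; similar vectors share a node
# regardless of their label, which is why A, B and C can co-occur.
node_labels = defaultdict(list)
for x, lab in zip(data, labels):
    node_labels[som.winner(x)].append(lab)
print({node: sorted(set(labs)) for node, labs in list(node_labels.items())[:5]})
```

Any samples whose feature vectors are close under the distance metric land on the same winner node, whatever their label, which is exactly the A/B/C co-occurrence you are seeing.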
Actually, I was just reading about this exact point: p. 19 of Kohonen's book "MATLAB Implementations and Applications of the Self-Organizing Map", available at http://docs.unigrafia.fi/publications/kohonen_teuvo/. It covers the MATLAB SOM Toolbox that was created after SOM_PAK. The book only briefly covers SOM_PAK, but I believe the theory in it would help.

Related

Putative correspondences

I am trying to implement the algorithm for estimating the fundamental matrix between two images using RANSAC. So far I have found the interest points using Harris corner detection. I am stuck at computing the putative correspondences from these interest points. I don't want to use a MATLAB toolbox for that; I would like to learn how corresponding points are extracted from two images and how this is implemented. I have read about block matching but have not completely understood the concept. Any samples and guidelines would help me understand this problem better.
Thanks in advance.
There are many ways to search for corresponding interest points, but they are usually based on describing each interest point using the characteristics of the image around it and then, for each point in one image, comparing the characteristics of its surroundings to those of the surroundings of the interest points in the other image.
Now assume you've decided to consider only a square region (a block) around each point of interest, containing the intensity values of the image around that point. You can now compare these blocks and match those that are close to each other. The problem becomes how to define "close" or, in other words, how to define the distance metric you will use to compare the blocks.
There are many approaches, for example, you could use the sum-of-absolute-differences between two blocks, which means you could subtract two blocks, take the absolute value of the resulting block, and then sum all values in this resulting block, obtaining a scalar value which represents how close these blocks are. If this distance is less than a given threshold, you can consider the two blocks a match. This is basically what block matching does.
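For concreteness, here is a hedged sketch of that SAD-based block matching in Python/NumPy; the image arrays, the corner lists, the block half-size and the threshold are all placeholders you would substitute with your own Harris output:

```python
# Minimal sketch of sum-of-absolute-differences (SAD) block matching.
# `img1`, `img2` are 2-D grayscale arrays; `points1`, `points2` are lists of
# (row, col) Harris corners you have already detected (assumed inputs).
import numpy as np

def get_block(img, r, c, half=7):
    """Return a (2*half+1) x (2*half+1) patch, or None near the border."""
    if r - half < 0 or c - half < 0 or r + half + 1 > img.shape[0] or c + half + 1 > img.shape[1]:
        return None
    return img[r - half:r + half + 1, c - half:c + half + 1].astype(np.float64)

def match_points(img1, points1, img2, points2, half=7, threshold=1000.0):
    """For each corner in image 1, find the corner in image 2 with the lowest SAD."""
    matches = []
    for (r1, c1) in points1:
        b1 = get_block(img1, r1, c1, half)
        if b1 is None:
            continue
        best_sad, best_pt = np.inf, None
        for (r2, c2) in points2:
            b2 = get_block(img2, r2, c2, half)
            if b2 is None:
                continue
            sad = np.abs(b1 - b2).sum()        # sum of absolute differences
            if sad < best_sad:
                best_sad, best_pt = sad, (r2, c2)
        if best_pt is not None and best_sad < threshold:
            matches.append(((r1, c1), best_pt, best_sad))
    return matches
```

The resulting putative correspondences can then be fed to your RANSAC loop for the fundamental matrix; the threshold only controls how permissive the initial matching is.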
Similarly, you could define other types of regions to describe your points of interest, for example by changing their shapes, sizes, orientations, etc., and create more complex descriptors for these points, which might capture more distinguishable characteristics (highly desirable if you intend to match them later).
If you want to learn more about the topic, I think this presentation can get you started:
http://courses.cs.washington.edu/courses/cse455/09wi/Lects/lect6.pdf

K-means Analysis on the KDD Cup 99 Dataset

What kind of knowledge/inference can be drawn from a k-means clustering analysis of the KDD Cup 99 dataset?
We plotted some graphs using MATLAB; they look like this:
Experiment 1: Plot of dst_host_count vs serror_rate
Experiment 2: Plot of srv_count vs srv_serror_rate
Experiment 3: Plot of count vs serror_rate
I just extracted some features from the KDD Cup dataset and plotted them.
The main problem I am facing is that, due to a lack of domain knowledge, I can't determine what inference can be drawn from these graphs. Also, if I have chosen the wrong axes, what should be the correct features to choose?
I have very little time to complete this, so I don't understand the background very well.
Any help interpreting these graphs would be appreciated.
What kind of unsupervised learning can be done using this data and these plots?
Just to give you some domain knowledge: the KDD Cup dataset contains information about different aspects of network connections. Each sample contains 'connection duration', 'protocol used', 'source/destination byte size' and many other features that describe one connection. Now, some of these connections are malicious. The malicious samples have their own unique 'fingerprint' (a unique combination of feature values) that separates them from the good ones.
What kind of knowledge/inference can be drawn from a k-means clustering analysis of the KDD Cup 99 dataset?
You can try k-means clustering to initially separate the normal and bad connections. Also, the bad connections themselves fall into 4 main categories. So you can try k = 5, where one cluster captures the good connections and the other 4 capture the 4 types of malicious ones. Look at the first section of the tasks page for details.
You can also check whether some dimensions in your data set are highly correlated. If so, you can use something like PCA to reduce the number of dimensions. Look at the full list of features. After PCA, your data will have a simpler representation (with fewer dimensions) and might give better performance.
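To make both suggestions concrete, here is a hedged sketch with scikit-learn; the random placeholder matrix, the 95% variance threshold and k = 5 are assumptions you would replace with your own extracted features and choices:

```python
# Sketch: scale -> PCA -> k-means with k = 5, using scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Placeholder matrix standing in for your already-extracted numeric KDD columns
# (count, serror_rate, srv_count, ...); load your real features here instead.
rng = np.random.default_rng(0)
X = rng.random((1000, 8))

X_scaled = StandardScaler().fit_transform(X)                 # columns live on very different scales
X_reduced = PCA(n_components=0.95).fit_transform(X_scaled)   # keep 95% of the variance

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_reduced)  # good + 4 attack types
print(np.bincount(km.labels_))                               # samples per cluster
```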
What should be the correct features to choose?
This is hard to tell. The data is currently very high-dimensional, so I don't think visualizing 2 or 3 of the dimensions in a graph will give you a good heuristic for which dimensions to choose. I would suggest the following (see the sketch after this list):
1. Use all the dimensions for training and testing the model. This gives you a measure of the best achievable performance.
2. Then try removing one dimension at a time and see how much the performance is affected. For example, if you remove the dimension 'srv_serror_rate' from your data and the model's performance stays almost the same, then you know this dimension is not giving you any important information about the problem at hand.
3. Repeat step 2 until you can't find any dimension that can be removed without hurting performance.
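A hedged sketch of that ablation loop; the placeholder data, the logistic-regression model and the cross-validated accuracy are only stand-ins for your own feature matrix, labels, model and metric:

```python
# Sketch of the leave-one-feature-out ablation described above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for your extracted KDD features and labels.
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = rng.integers(0, 2, 200)
feature_names = ["count", "serror_rate", "srv_count",
                 "srv_serror_rate", "dst_host_count", "duration"]

def score(features, target):
    """Cross-validated performance of a simple stand-in model."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, features, target, cv=3).mean()

baseline = score(X, y)
for i, name in enumerate(feature_names):
    drop = baseline - score(np.delete(X, i, axis=1), y)
    print(f"{name}: performance drop {drop:.4f}")  # ~0 means the feature adds little
```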

MANOVA - huge matrices

First, sorry for the tag "ANOVA"; this is about MANOVA (yet to become a tag...).
From the tutorials I found, all the examples use small matrices; following them is not feasible for big ones, as is the case in many studies.
I have 2 matrices for my 14 sampling points: 1 for the organism IDs (4,493 IDs) and the other for the chemical profile (190 variables).
The 2 matrices were correlated by Spearman correlation and, based on that correlation, split into 4 clusters (k-means on squared Euclidean distances), with the IDs in rows and the chemical profile in columns.
The differences among them are somewhat clear, but to show this in a more robust way I want to perform a MANOVA on the differences between and within the clusters - that is a key factor for the conclusion, of course.
The problem is that, after 8 hours of trying, I could not even get the data into a format acceptable for the analysis.
The tutorials I found are designed for very few variables, and even when I think I have overcome that, the program says that my matrices can't be compared because of their difference in length.
Each cluster has its own set of IDs, all sharing the same set of variables.
What should I do?
Thanks in advance.
Diogo Ogawa
If you have missing values in your data (which practically all data sets seem to contain), you can either remove those observations or create a model that accommodates them. Use the first approach only if something about your methodology gives you conviction that those observations are genuinely different. Most of the time, it is better to run the model including the observations with missing values. In that case, use a general linear model instead of a balanced ANOVA model; the balanced model will struggle with the missing data.
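If you end up doing this in Python rather than your current program, a minimal sketch with statsmodels' MANOVA could look like the following; the placeholder frame, the column layout (one row per ID, a 'cluster' factor, chemical variables as the remaining columns) and the dropna step are assumptions, not your actual data:

```python
# Hedged sketch: MANOVA with cluster membership as the factor, via statsmodels.
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Placeholder frame: one row per organism ID, a few chemical variables and a
# 'cluster' factor; substitute your 4493 x 190 data here.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((60, 4)), columns=["chem1", "chem2", "chem3", "chem4"])
df["cluster"] = rng.integers(1, 5, 60).astype(str)

chem_cols = [c for c in df.columns if c != "cluster"]
df = df.dropna(subset=chem_cols)               # or model the missingness instead

formula = " + ".join(chem_cols) + " ~ cluster"
print(MANOVA.from_formula(formula, data=df).mv_test())  # Wilks, Pillai, ...
```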

Clustering on non-numeric dimensions

I recently started working on clustering and k-means algorithm and was trying to come up with a good use case and solve it.
I have the following data about the items sold in different cities.
Item City
Item1 New York
Item2 Charlotte
Item1 San Francisco
...
I would like to cluster the data based on the variables city and item to find groups of cities that might have similar patterns for the items sold. The problem is that the k-means implementation I use does not accept non-numeric input. Any idea how I should proceed to find a meaningful solution?
Thanks
SV
Clustering requires a distance definition. A cluster is only a cluster if the items are "closer" according to some distance function. The closer they are, the more likely they belong to the same cluster.
In your case, you can try to cluster based on various data related to the cities, like their geographical coordinates or demographic information, and see whether the clusters overlap across the various cases!
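One concrete (and assumed) way to get numeric input for k-means from your item/city table is to build a city-by-item count matrix first; the sample rows and k = 2 below are placeholders:

```python
# Sketch: turn (Item, City) sales records into a numeric city-by-item count
# matrix, then cluster the cities with k-means.
import pandas as pd
from sklearn.cluster import KMeans

# Placeholder rows standing in for your real sales records.
sales = pd.DataFrame({
    "Item": ["Item1", "Item2", "Item1", "Item3", "Item2", "Item1"],
    "City": ["New York", "Charlotte", "San Francisco",
             "New York", "San Francisco", "Charlotte"],
})

counts = pd.crosstab(sales["City"], sales["Item"])  # one row per city, one column per item
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(counts.values)
for city, label in zip(counts.index, km.labels_):
    print(city, "->", label)
```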
In order for k-means to produce usable results, the means must be meaningful.
Even if you were to use e.g. binary vectors, k-means on them would not make much sense, IMHO.
Probably the best use case to get started with k-means is color quantization. Take a picture, and use the RGB values of every pixel as 3d vectors. Then run k-means with k as the desired number of colors. The color centers are your final palette, and every pixel will be mapped to the closest center for color reduction.
The reasons why this works well with k-means are twofold:
the mean actually makes sense for finding the mean color of multiple pixels
the axes R, G and B have a similar meaning and scale, so there is no bias
If you want to go a step beyond, try doing the same in e.g. HSB space; you'll run into difficulties if you want it to be really good, because the hue value is cyclic, which is inconsistent with the mean. Assuming hue is measured in degrees from 0 to 360, the "mean" hue of 1 and 359 is not 180 degrees, but 0. So on this kind of data, k-means results will be suboptimal.
See e.g. https://en.wikipedia.org/wiki/Color_quantization for details as well as the two dozen k-means questions here with respect to sparse and binary data.
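If you want to try the colour-quantization exercise, a short sketch with Pillow and scikit-learn might look like this; the random placeholder image and k = 16 are arbitrary choices, not a prescribed setup:

```python
# Sketch of k-means colour quantization: every pixel is a 3-D RGB vector,
# the cluster centres become the palette.
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Placeholder image: swap in Image.open("your_picture.jpg").convert("RGB") instead.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

pixels = img.reshape(-1, 3).astype(np.float64)      # every pixel as a 3-D RGB vector
km = KMeans(n_clusters=16, n_init=4, random_state=0).fit(pixels)

palette = km.cluster_centers_                       # the 16 "mean colours"
quantized = palette[km.labels_].reshape(img.shape).astype(np.uint8)
Image.fromarray(quantized).save("quantized_16_colours.png")
```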
You may still need to represent your data abstractly in numerical form. This may help:
http://www.analyticbridge.com/forum/topics/clustering-with-non-numeric?commentId=2004291%3AComment%3A40805
Try to re-analyze the problem and see whether there is any relationship that you can take advantage of and represent in numerical form.
I worked on a project where I had to represent colors by their RGB values. It worked pretty well.
Hope this helps.

Clustering: a training dataset of variable data dimensions

I have a dataset of n data points, where each data point is represented by a set of extracted features. Generally, clustering algorithms require all input data to have the same dimensions (the same number of features), that is, the input data X is an n*d matrix of n data points, each of which has d features.
In my case, I've previously extracted some features from my data, but the number of extracted features per data point is likely to differ (I mean, I have a dataset X where the data points do not have the same number of features).
Is there any way to adapt them so that I can cluster them using common clustering algorithms that require data of the same dimensions?
Thanks
It sounds like what you have is a 'sparse' data set. There are generally two options (a sketch of the first follows the list):
1. Reduce the dimensionality of the input data set using multi-dimensional scaling techniques, for example sparse SVD (e.g. the Lanczos algorithm) or sparse PCA. Then apply traditional clustering to the dense, lower-dimensional outputs.
2. Directly apply a sparse clustering algorithm, such as sparse k-means. Note that you can probably find a PDF of this paper if you look hard enough online (try scholar.google.com).
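A minimal sketch of option 1, assuming you can assemble your features into a scipy sparse matrix with zeros for absent features; the shapes, density and parameter values below are placeholders:

```python
# Sketch: truncated SVD (Lanczos-style reduction) on a sparse matrix,
# then k-means on the dense low-dimensional output.
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Placeholder sparse data: 500 objects, 1000 possible features, ~1% present.
X_sparse = sparse_random(500, 1000, density=0.01, random_state=0).tocsr()

X_dense = TruncatedSVD(n_components=50, random_state=0).fit_transform(X_sparse)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_dense)
print(labels[:20])
```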
[Updated after problem clarification]
In the problem, a handwritten word is analyzed visually for connected components (lines). For each component, a fixed number of multi-dimensional features is extracted. We need to cluster the words, each of which may have one or more connected components.
Suggested solution:
Classify the connected components first, into 1000(*) unique component classifications. Then classify the words against the classified components they contain (a sparse problem described above).
*Note, the exact number of component classifications you choose doesn't really matter as long as it's high enough as the MDS analysis will reduce them to the essential 'orthogonal' classifications.
There are also clustering algorithms, such as DBSCAN, that in fact do not care about the form of your data. All this algorithm needs is a distance function. So if you can specify a distance function for your features, then you can use DBSCAN (or OPTICS, an extension of DBSCAN that doesn't need the epsilon parameter).
So the key question here is how you want to compare your features. This doesn't have much to do with clustering and is highly domain dependent. If your features are e.g. word occurrences, cosine distance is a good choice (using 0s for non-present features). But if you e.g. have a set of SIFT keypoints extracted from a picture, there is no obvious way to relate the different features to each other efficiently, as there is no order to the features (so one cannot simply compare the first keypoint of one image with the first keypoint of another, and so on). A possible approach here is to derive another - uniform - set of features. Typically, bag-of-words features are used for such a situation. For images, this is also known as visual words. Essentially, you first cluster the sub-features to obtain a limited vocabulary. Then you can assign each of the original objects a "text" composed of these "words" and use a distance function such as cosine distance on them.
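To make the DBSCAN route concrete, here is a hedged sketch that clusters variable-length feature sets through a precomputed distance matrix; `my_distance`, the sample data, `eps` and `min_samples` are placeholders for your own domain-specific comparison and tuning:

```python
# Sketch: DBSCAN over a custom distance function via a precomputed distance matrix.
import numpy as np
from sklearn.cluster import DBSCAN

def my_distance(a, b):
    # Placeholder: substitute your own comparison of two variable-length
    # feature sets; it must return a non-negative float.
    return abs(len(a) - len(b))

# Placeholder objects: each has a different number of 4-D sub-features.
rng = np.random.default_rng(0)
objects = [rng.random((rng.integers(3, 10), 4)) for _ in range(50)]

n = len(objects)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = my_distance(objects[i], objects[j])

labels = DBSCAN(eps=1.5, min_samples=3, metric="precomputed").fit_predict(D)
print(labels)   # -1 marks noise points
```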
I see two options here:
1. Restrict yourself to those features for which all your data points have a value.
2. See if you can generate sensible default values for the missing features (a small sketch of this follows below).
However, if possible, you should probably resample all your data points so that they all have values for all features.
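A small sketch of the second option, assuming you can align the features into one matrix and mark the missing entries with NaN; the tiny example matrix, the mean strategy and k = 2 are placeholders:

```python
# Sketch: fill missing feature values with a per-feature default (here the mean),
# then cluster the completed matrix.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans

# Placeholder data: NaN marks features that were not extracted for a point.
X = np.array([[1.0, 2.0, np.nan],
              [0.5, np.nan, 3.0],
              [1.2, 2.1, 2.9]])

X_filled = SimpleImputer(strategy="mean").fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_filled)
print(labels)
```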