Get information about the selected group of neighbors when using the KNN algorithm - deployment

I am trying to develop a KNN algorithm to find the 5-6 closest data points and use the mean or median of that peer group for the prediction.
Economically, it is very important to know which peer group the algorithm selected.
Is it possible to get information about the nearest neighbours the algorithm used for the prediction in a given case?
Thanks for all your help!
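
One way to get at this (a minimal sketch, assuming MATLAB's knnsearch from the Statistics and Machine Learning Toolbox and made-up variable names) is to query the neighbour indices yourself and then aggregate the peer group's targets:

% Xtrain: n-by-p matrix of training observations, ytrain: n-by-1 target values
% xquery: 1-by-p observation to predict (hypothetical names)
k = 5;                                            % size of the peer group
[idx, dist] = knnsearch(Xtrain, xquery, 'K', k);  % indices and distances of the k nearest neighbours
peerGroup  = Xtrain(idx, :);                      % the selected neighbours themselves
prediction = mean(ytrain(idx));                   % or median(ytrain(idx))

The idx vector is exactly the information about the selected peer group, so it can be logged or inspected for every prediction.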

Related

MATLAB - Connect different arrays/matrices into a bigger matrix

I have heard that it is possible to combine different matrices into a bigger matrix in MATLAB. I have tried searching around but haven't found anything yet, so I thought I'd ask here.
I have some matrices I want to combine into a bigger one: one matrix of random arrival times into a city, another matrix of random departures from the city, and other matrices with different types of electric cars, including driving range and battery status.
Is it possible to do this, and if so, how? I'm not asking for code; links to references where I can learn this would be much appreciated :) Thank you.
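The relevant documentation topics are matrix concatenation, cat, horzcat and vertcat in the MATLAB documentation. As a quick illustration only (sizes and variable names are made up):

arrivals     = rand(100, 1);             % 100 random arrival times into the city
departures   = rand(100, 1);             % 100 random departure times from the city
drivingRange = 150 + 100*rand(100, 1);   % driving range per car
battery      = rand(100, 1);             % battery status per car

% Horizontal concatenation requires the same number of rows.
cars = [arrivals, departures, drivingRange, battery];   % 100-by-4

% Vertical concatenation requires the same number of columns.
allCars = [cars; cars];                                 % 200-by-4, e.g. two car types stacked

cat(1, ...) and cat(2, ...) are the general forms of [ ; ] and [ , ].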

Minimum amount of data for an item-based collaborative filter

I'm working on a recommendation engine which uses an item-based collaborative filter to create recommendations for restaurants. Each restaurant has reviews with a rating from 1-5.
Every recommendation algorithm struggles with the data sparsity issue, so I have been looking for solutions to calculate a correct correlation.
I'm using an adjusted cosine similarity between restaurants.
When you want to compute a similarity between restaurants, you need users who have rated both restaurants. But what would be the minimum of users who have rated both restaurants to get a correct correlation?
From testing, I have discovered that requiring only 1 user who has rated both restaurants results in bad similarities (obviously); it is often exactly -1 or 1. So I have increased it to 2 users who have rated both restaurants, which gave me better similarities. I just find it difficult to determine whether this similarity is good enough. Is there a method to test the accuracy of this similarity, or are there guidelines on what the minimum should be?
The short answer is a parameter sweep: try several values of "minimum users who have rated both restaurants" and measure the outcomes. With more users, you'll get a better sense of the similarity between items (restaurants). But your similarity information will be sparser. That is, you'll focus on the more popular items and be less able to recommend items in the long tail. This means you'll always have a tradeoff, and you should measure everything that will allow you to make the tradeoff. For instance, measure predictive accuracy (e.g., RMSE) as well as the number of items possible to recommend.
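To make the threshold concrete, here is a sketch (not from the answer above; the ratings matrix R and the 0-means-unrated convention are assumptions) of an adjusted cosine similarity that simply refuses to return a score below the minimum number of co-raters:

function s = adjCosineSim(R, i, j, minCoRaters)
% Adjusted cosine similarity between restaurants i and j.
% R is a users-by-restaurants rating matrix with 0 meaning "not rated".
% minCoRaters is the threshold to sweep over.
    rated     = R > 0;
    userMeans = sum(R, 2) ./ max(sum(rated, 2), 1);  % each user's mean over rated items
    co        = find(rated(:, i) & rated(:, j));     % users who rated both restaurants
    if numel(co) < minCoRaters
        s = NaN;                                     % too few co-raters: no similarity
        return;
    end
    a = R(co, i) - userMeans(co);
    b = R(co, j) - userMeans(co);
    s = (a' * b) / (norm(a) * norm(b) + eps);        % eps guards against zero vectors
end

Sweeping minCoRaters over, say, 1 to 10 and recording RMSE plus the share of restaurant pairs with a non-NaN similarity gives exactly the coverage-versus-accuracy tradeoff described above.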
If your item space becomes too sparse, you may want to find other ways to do item-item similarity beyond user ratings. For instance, you can use content-based filtering methods to include information about each restaurants' cuisine, then create an intermediate step to learn each user's cuisine preferences. That will allow you to do recommendations even when you don't have item-item similarity scores.

Dimensionality reduction for high dimensional sparse data before clustering or spherical k-means?

I am trying to build my first recommender system, where I create a user feature space and then cluster users into different groups. Then, for the recommendation to work for a particular user, I first find the cluster to which the user belongs and then recommend entities (items) in which his/her nearest neighbours showed interest. The data I am working on is high dimensional and sparse. Before implementing the above approach, there are a few questions whose answers might help me adopt a better approach.
As my data is high dimensional and sparse, should I go for dimensionality reduction and then apply clustering, or should I go for an algorithm like spherical k-means which works on sparse, high dimensional data?
How should I find the nearest neighbours after creating the clusters of users? (Which distance measure should I use, given that I have read that Euclidean distance is not a good measure for high dimensional data?)
It's not obvious that clustering is the right algorithm here. Clustering is great for data exploration and analysis, but not always for prediction. If your end product is based around the concept of "groups of like users" and the items they share, then go ahead with clustering and simply present a ranked list of items that each user's cluster has consumed (or a weighted average rating, if you have preference information).
You might try standard recommender algorithms that work in sparse high-dimensional situations, such as item-item collaborative filtering or sparse SVD.
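For the sparse SVD route, a minimal sketch in MATLAB (the index vectors, counts and the choice of 50 factors are all made up):

% Build the user-item matrix as a sparse matrix.
A = sparse(userIdx, itemIdx, ratingValues, nUsers, nItems);

k = 50;                      % number of latent dimensions to keep
[U, S, V] = svds(A, k);      % truncated SVD of a sparse matrix

userFactors = U * S;         % nUsers-by-k dense user representation
itemFactors = V;             % nItems-by-k dense item representation

% Neighbours can now be found in the low-dimensional space, where a cosine
% distance tends to behave better than Euclidean distance:
[idx, d] = knnsearch(itemFactors, itemFactors(1, :), 'K', 10, 'Distance', 'cosine');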

Determining cluster membership in SOM (Self Organizing Map) for time series data

I am also working on a project that requires clustering of time series data. I am using the SOM Toolbox for MATLAB for the clustering and am stuck with the following problem:
"How can we determine which data belongs to which cluster?" SOM randomly chooses data sample from dataset and finds BMU for each data sample. As far as I know, data sample identifier is not regarded as dimension of data in SOM algorithm. If it is the case then how can we track the samples? I don't think that som_bmus solves this issue. Any idea how you do it without changing any functions included in SOM toolbox?
y = vec2ind(output)
will give you the index of the winning neuron for each output generated by MATLAB. With this information, you can see which input data belongs to which neuron.
Just use the above line directly in your script; it will do the rest.
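To make the tracking explicit (a small addition to the line above; 'output' is assumed to be the network's output matrix with one column per presented input sample):

bmuPerSample = vec2ind(output);              % 1-by-nSamples: winning neuron index per sample
samplesInNeuron5 = find(bmuPerSample == 5);  % positions (columns) of the inputs mapped to neuron 5

Because the column order matches the order in which the samples were presented, the sample identifiers never need to be part of the data itself.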
I know this is an old topic, but it may still be useful for others.
Is your question about determining what should be a cluster, or about which data belongs to which neuron? If it is the latter, I believe GulshanS has answered correctly; but the question of how you determine what is a cluster and what is not is still unanswered. You can do this with neighbour connections: dark regions mostly show 'walls', which act as break lines between clusters.
Cluster analysis is something different from what SOM does. SOM determines connections and assigns BMUs, which end up on a pre-determined grid of neurons. Multiple data inputs can belong to one neuron, and multiple neurons can belong to one cluster, but clusters are not the output of SOM.

What kind of analysis to use in SPSS for finding out groups/grouping?

My research question is about elderly people, and I have to find underlying groups. The data come from a questionnaire. I have thought about cluster analysis, but the thing is that I would like to examine perceived health and which things affect perceived health, e.g. what kinds of groups of elderly people rank their health as bad.
I have some 30 questions I would like to check with the analysis, to see if for example widows have better or worse health than the average. I also have weights in my data so I need to use complex samples.
How can I use an already existing function, or what analysis should I use?
The key challenge you have to solve first is to specify a similarity measure. Once you can measure similarity, various clustering algorithms become available.
But questionnaire data doesn't make a very good vector space, so you can't just use Euclidean distance.
If you want to generate clusters using SPSS, standard options include k-means, hierarchical cluster analysis, or two-step clustering. I have some general notes on cluster analysis in SPSS here; see from slide 34.
If you want to see if widows differ in their health, then you need to form a measure of health and compare means on that measure between widows and non-widows (presumably using a between groups t-test). If you have 30 questions related to health, then you may want to do a factor analysis to see how the items group together.
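For reference, the between groups t-test mentioned above boils down to the following statistic (Welch form, ignoring the survey weights for simplicity; with complex samples SPSS adjusts the standard errors for the design):

$$ t = \frac{\bar{x}_{\text{widows}} - \bar{x}_{\text{non-widows}}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} $$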
If you are trying to develop a general model of what predicts perceived health, then there is a wide range of modelling options available. Multiple regression would be an obvious starting point. If you have many potential predictors, then you have a lot of choices regarding whether you will test particular models or take a more data-driven model building approach.
More generally, it sounds like you need to clarify the aims of your analyses and the particular hypotheses that you want to test.