Compare multiple groups - linear-regression

I'd like to predict discussion group performance from members' social network centralities and knowledge contribution. Hierarchical linear regression in SPSS doesn't seem to distinguish between groups (there are 11 groups). Any advice on choosing a statistical method is appreciated!

Related

Classification with multiple features?

I have:
1) 2 groups of subjects (controls and cancer patients)
2) a set of features for each subject.
I want to find the feature, or the combination of features, that discriminates best between the two groups.
I started by evaluating AUC, then tried some k-means clustering, but I don't know how to combine features for the classification.
Thank you
I suggest you use some method of feature importance evaluation. There are many different ways to test the importance of features. In my opinion, an easy one to start with is a Random Forest classifier. This model has "built-in" feature importance evaluation during training, based on out-of-bag error. Tree-based classifiers also have to evaluate the information gain from each feature when choosing splits during training.
You can also test feature importance by checking how the model score changes when you modify the data set, e.g. using a backward elimination strategy.
You can also use PCA or statistical tests. Finally, you can look for dependencies between features and remove from your data the features that do not provide enough information.
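As a small illustration of the "statistical tests" route, here is a minimal sketch that ranks each feature on its own by AUC (which is equivalent to the Mann-Whitney U statistic scaled to 0..1). It is a univariate screening step only, not the Random Forest importance described above, and the feature names and values are made up for the example.

```python
def auc(scores, labels):
    """Probability that a random positive scores above a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_features(rows, labels, names):
    """Score every feature column by AUC and sort from most to least discriminating."""
    ranked = []
    for j, name in enumerate(names):
        a = auc([row[j] for row in rows], labels)
        # A feature that is *lower* in patients is just as useful, so ignore direction.
        ranked.append((name, max(a, 1.0 - a)))
    return sorted(ranked, key=lambda t: t[1], reverse=True)


# Hypothetical toy data: label 0 = control, 1 = patient.
rows = [
    [1.0, 5.0], [2.0, 3.0], [1.5, 4.0],
    [3.0, 4.5], [4.0, 3.5], [3.5, 5.5],
]
labels = [0, 0, 0, 1, 1, 1]
names = ["marker_a", "marker_b"]  # hypothetical feature names

ranking = rank_features(rows, labels, names)
for name, score in ranking:
    print(name, round(score, 3))
```

Features that look weak alone can still be useful in combination, so a ranking like this is a starting point before trying a multivariate model.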

What is the type or family of recsys algorithms for recommending similar users based on their interests?

I am learning about recommendation systems from a Coursera MOOC. I see there are three major types of filtering methods (in the introductory course).
a. Content-based filtering
b. Item-Item collaborative filtering
c. User-User collaborative filtering
Having understood this, I am not sure where recommending similar users based on interests/preferences belongs. For example, suppose I have a User->TopicsOfInterest0..n relation. I want to recommend other similar users based on their respective TopicsOfInterest (vector).
I'm not sure that these three types are an exhaustive classification of all recommender systems.
In fact, any matrix-factorization-based algorithm (SVD, etc.) is both item-based and user-based at the same time, but the TopicsOfInterest (factors) are inferred automatically by the algorithm. For example, Apache Spark includes an implementation of the alternating least squares (ALS) algorithm. Spark's API has the userFeatures method, which returns (roughly) a matrix describing each user's affinity for each latent feature.
The only thing left to do is to compute the set of users most similar to a given one (e.g. find the vectors that are closest to a given one by cosine similarity).
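That last step could look roughly like the sketch below. The user names and vectors are invented for illustration; in practice they would be whatever per-user vectors you have (explicit TopicsOfInterest weights or latent factors from a factorization).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(target, vectors, k=2):
    """Return the k users whose vectors are closest to the target user's vector."""
    scores = [
        (other, cosine(vectors[target], vec))
        for other, vec in vectors.items()
        if other != target
    ]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:k]


# Hypothetical users with a weight per topic of interest.
user_vectors = {
    "alice": [1.0, 0.0, 1.0],
    "bob": [1.0, 0.0, 0.9],
    "carol": [0.0, 1.0, 0.0],
    "dave": [0.5, 0.5, 0.5],
}

neighbours = most_similar("alice", user_vectors, k=2)
print(neighbours)
```

This brute-force scan compares the target with every other user, which is fine for small data; with many users you would want an approximate nearest-neighbour index instead.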

Collaborative Filtering Algorithm

If I have the following users with the following ratings for movies they watched:
User1 Movie1-5 Movie2-4
User2 Movie2-5 Movie2-3 Movie3-4
User3 Movie1-4 Movie2-4 Movie4-4
How would I use collaborative filtering to suggest movie3 to user1 and how do I calculate the probability of user1 giving movie3 a 4 or better?
There are a few different ways of generating recommendations using collaborative filtering. I'll explain the user-based and item-based methods, which are the most commonly used in recommendation algorithms.
User-based collaborative filtering
This basically calculates a similarity between users. The similarity can be a Pearson correlation or a cosine similarity. Other similarity measures exist, but these are the most widely used. This article gives a good explanation on how to calculate this.
User-based filtering does come with a few challenges. First is the data sparsity issue, which occurs when there are a lot of movies with only a few reviews each. This makes it difficult to calculate a correlation between users. This Wikipedia page explains more about this.
Second is the scalability issue. When you have millions of users with thousands of movies, the performance of calculating correlations between users is going to drop tremendously.
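To make the user-based idea concrete, here is a minimal sketch using Pearson correlation and a mean-centred weighted average on the ratings from the question. One assumption to flag: User2's row lists Movie2 twice, and the sketch reads the first entry as Movie1 purely so the example has something to work with.

```python
import math

ratings = {
    "User1": {"Movie1": 5, "Movie2": 4},
    # Assumption: the question's "Movie2-5 Movie2-3" is read here as Movie1=5, Movie2=3.
    "User2": {"Movie1": 5, "Movie2": 3, "Movie3": 4},
    "User3": {"Movie1": 4, "Movie2": 4, "Movie4": 4},
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pearson(a, b):
    """Pearson correlation over the movies both users rated (0.0 if undefined)."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    mean_a = mean(a[m] for m in common)
    mean_b = mean(b[m] for m in common)
    num = sum((a[m] - mean_a) * (b[m] - mean_b) for m in common)
    den = math.sqrt(sum((a[m] - mean_a) ** 2 for m in common)) * math.sqrt(
        sum((b[m] - mean_b) ** 2 for m in common)
    )
    return num / den if den else 0.0


def predict(user, movie):
    """User's mean rating plus the similarity-weighted deviations of users who saw the movie."""
    base = mean(ratings[user].values())
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or movie not in theirs:
            continue
        sim = pearson(ratings[user], theirs)
        num += sim * (theirs[movie] - mean(theirs.values()))
        den += abs(sim)
    return base + num / den if den else base


prediction = predict("User1", "Movie3")
print(prediction)
```

Note that this gives a point estimate of the rating, not a probability, and with only two co-rated movies a Pearson correlation can only be +1, -1 or undefined, which is the sparsity problem in miniature.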
Item-based collaborative filtering
This method differs from user-based filtering because it calculates a similarity between movies instead of users. You can then use this similarity to predict a rating for a user. I have found that this presentation explains it very well.
Item-based filters have outperformed user-based filters. They suffer from the same issues, though to a lesser extent.
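A rough sketch of the item-based variant, again on the question's ratings: similarity is computed between movies (cosine over the users who rated both), and the prediction is a similarity-weighted average of the user's own ratings. As before, reading User2's first entry as Movie1 is an assumption made only for the example.

```python
import math

ratings = {
    "User1": {"Movie1": 5, "Movie2": 4},
    # Assumption: the question's duplicated Movie2 entry is read as Movie1=5, Movie2=3.
    "User2": {"Movie1": 5, "Movie2": 3, "Movie3": 4},
    "User3": {"Movie1": 4, "Movie2": 4, "Movie4": 4},
}


def item_similarity(item_a, item_b):
    """Cosine similarity between two movies over the users who rated both."""
    pairs = [
        (r[item_a], r[item_b])
        for r in ratings.values()
        if item_a in r and item_b in r
    ]
    if not pairs:
        return 0.0
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(user, movie):
    """Weighted average of the user's own ratings, weighted by similarity to the target movie."""
    num = den = 0.0
    for seen, rating in ratings[user].items():
        sim = item_similarity(movie, seen)
        num += sim * rating
        den += abs(sim)
    return num / den if den else None


prediction = predict("User1", "Movie3")
print(prediction)
```

With a single co-rating user every cosine comes out as 1.0, so the result is just the plain average of User1's ratings. That is a limitation of the tiny data set rather than of the method.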
Content-based filtering
Seeing your data, it's going to be difficult to generate recommendations because you have too little data from users. I would suggest using a content-based filter until you have enough data to use collaborative filtering methods. It's a very simple method which basically looks at the user's profile and compares it to certain tags of a movie. This page explains it in more detail.
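A content-based filter along those lines can be sketched as a tag-overlap score. Everything below (the tags, the profile, and Movie5) is hypothetical, and Jaccard overlap is just one simple way to compare a profile with a movie's tags.

```python
def jaccard(a, b):
    """Share of tags two sets have in common (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(profile, catalogue, k=2):
    """Rank movies by how well their tags overlap with the user's profile tags."""
    scored = [(title, jaccard(profile, tags)) for title, tags in catalogue.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]


# Hypothetical tags; a real system would derive them from genre, cast, keywords, etc.
movie_tags = {
    "Movie3": {"sci-fi", "action"},
    "Movie4": {"romance", "drama"},
    "Movie5": {"sci-fi", "drama"},
}
user_profile = {"sci-fi", "action", "thriller"}

recommendations = recommend(user_profile, movie_tags, k=2)
print(recommendations)
```

Because it needs no ratings from other users, this kind of filter works from the first day and can be replaced or combined with collaborative filtering once enough ratings exist.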
I hope this answered some of your questions!
You can calculate similarity either between users or between items. Two easy similarity measures are cosine similarity and Pearson similarity.
This GFG page explains the user-based approach, with an example of finding similarity among users and then making predictions for items they haven't watched yet.

What should be my input in an ANN?

I am getting confused about the input data set. I am studying artificial neural networks, and my goal is to use historical data (I have stock data for the last 10 years) to predict stock values in the future (for example 2015). So, what is my input? For example, I have an Excel sheet with the columns [Column1-Date | Column2-High | Column3-Low | Column4-Opening | Column5-Closing].
By profession I am a quant and I am currently pursuing a master's degree in Computer Science. There are many considerations when selecting financial input for a neural network, including:
Select indicators which are positively correlated to returns.
Indicators are independent variables which have predictive power on the dependent variable (stock returns). Common popular indicators include technical indicators derived from price and volume data, fundamental indicators about the underlying company or asset, and quantitative indicators such as descriptive statistics or even model parameters. If you have many indicators, you can narrow them down using correlation analysis, best subset, or principal component analysis.
Pre-process the indicators for use in Neural Networks
Neural networks work by connecting perceptrons together. Each perceptron contains an activation function e.g. the sigmoid function or tanh. Most activation functions have an active range. For the sigmoid function this is between -sqrt(3) and +sqrt(3). What this means is that you should normalize your data to within the active range and seriously consider removing outliers.
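As a rough sketch of that pre-processing, the snippet below rescales a price series into the range quoted above and then turns it into (input window, next value) pairs, which is one common way to frame "what is my input" for a time series. The prices, the window length and the helper names are all made up for the example.

```python
import math


def scale_to_range(values, low, high):
    """Linearly map values so that min -> low and max -> high."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [(low + high) / 2.0 for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


def make_windows(series, window):
    """Each input is `window` consecutive values; the target is the value that follows."""
    inputs, targets = [], []
    for i in range(len(series) - window):
        inputs.append(series[i:i + window])
        targets.append(series[i + window])
    return inputs, targets


# Hypothetical closing prices standing in for the Column5-Closing data.
closing = [10.0, 11.0, 12.0, 11.5, 13.0, 14.0]

bound = math.sqrt(3)  # the active range quoted in the answer above
scaled = scale_to_range(closing, -bound, bound)
X, y = make_windows(scaled, window=3)

print(len(X), "training examples")
print(X[0], "->", y[0])
```

In a real setup the minimum and maximum used for scaling should come from the training period only, otherwise information from the future leaks into the inputs.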
There are many other potential issues with using Neural Networks. I wrote an article a while back which identified ten issues, including the ones mentioned here. Feel free to check it out.

What kind of analysis to use in SPSS for finding out groups/grouping?

My research question is about elderly people and I have to find underlying groups. The data comes from a questionnaire. I have thought about cluster analysis, but what I really want to study is perceived health and the things that affect it, e.g. what kinds of groups of elderly people rate their health as bad.
I have some 30 questions I would like to check with the analysis, to see if for example widows have better or worse health than the average. I also have weights in my data so I need to use complex samples.
How can I use an already existing function, or what analysis should I use?
The key challenge you have to solve first is to specify a similarity measure. Once you can measure similarity, various clustering algorithms become available.
But questionnaire data doesn't make a very good vector space, so you can't just use Euclidean distance.
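One option for mixed questionnaire items is a Gower-style dissimilarity, which is a different technique from plain Euclidean distance: numeric or ordinal answers contribute a range-normalised absolute difference, and categorical answers contribute 0 for a match and 1 for a mismatch. A minimal sketch, with made-up respondents and item ranges:

```python
def gower(a, b, kinds, ranges):
    """Average per-item dissimilarity in [0, 1] for mixed numeric/categorical answers."""
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "num":
            # Numeric or ordinal item: absolute difference scaled by the item's range.
            total += abs(x - y) / rng if rng else 0.0
        else:
            # Categorical item: simple match / mismatch.
            total += 0.0 if x == y else 1.0
    return total / len(kinds)


# Hypothetical items: perceived health (1..5), marital status, age (range of 40 years).
kinds = ["num", "cat", "num"]
ranges = [4, None, 40]

respondent_1 = [2, "widowed", 70]
respondent_2 = [4, "married", 80]

d = gower(respondent_1, respondent_2, kinds, ranges)
print(round(d, 4))
```

A pairwise matrix of such dissimilarities can then be passed to clustering methods that accept precomputed distances, such as hierarchical clustering.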
If you want to generate clusters using SPSS, standard options include: k-means, hierarchical cluster analysis, or 2-step. I have some general notes on cluster analysis in SPSS here. See from slide 34.
If you want to see if widows differ in their health, then you need to form a measure of health and compare means on that measure between widows and non-widows (presumably using a between groups t-test). If you have 30 questions related to health, then you may want to do a factor analysis to see how the items group together.
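To show what the between-groups comparison computes, here is a minimal sketch of a two-sample t statistic. It uses Welch's version (which does not assume equal variances) rather than the classic pooled test, the scores are invented, and it ignores the survey weights mentioned in the question, which would need a complex-samples procedure.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom for two samples."""
    var_a = variance(a) / len(a)  # squared standard error of each group mean
    var_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )
    return t, df


# Hypothetical perceived-health scores (higher = better health).
widows = [2.0, 3.0, 3.0, 4.0, 3.0]
others = [3.0, 4.0, 4.0, 5.0, 4.0]

t, df = welch_t(widows, others)
print(round(t, 3), round(df, 1))
```

The p-value then comes from the t distribution with `df` degrees of freedom, which SPSS reports directly in its independent-samples output.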
If you are trying to develop a general model of what predicts perceived health, then there is a wide range of modelling options available. Multiple regression would be an obvious starting point. If you have many potential predictors, then you have a lot of choices regarding whether you are going to test particular models or take a more data-driven model-building approach.
More generally, it sounds like you need to clarify the aims of your analyses and the particular hypotheses that you want to test.