How do I identify the correct clustering algorithm for the available data? - cluster-analysis

I have sample data on flight routes: the number of searches for each route, the gross profit of the route, and the number of transactions for the route. I want to bucket flight routes that show similar characteristics based on the variables mentioned above. What are the steps to settle on a particular clustering algorithm?
Below is sample data which I would like to cluster.
Route Clicks Impressions CPC Share of Voice Gross-Profit Number of Transactions Conversions
AAE-ALG 2 25 0.22 $4.00 2 1
AAE-CGK 5 40 0.21 $6.00 1 1
AAE-FCO 1 25 0.25 $13.00 4 1
AAE-IST 8 58 0.30 $18.00 3 2
AAE-MOW 22 100 0.11 $1.00 6 5
AAE-ORN 11 70 0.21 $22.00 3 2
AAE-ORY 8 40 0.18 $3.00 4 4

To me this looks like an N-dimensional clustering problem where N is the number of features: N = 8 (Route, Clicks, Impressions, CPC, Share of Voice, Gross-Profit, Number of Transactions, Conversions).
I think if you preprocess the feature values so that a distance between them is meaningful, you can apply k-means to cluster your data.
E.g. a route can be represented by the physical distance between its airports, dA; the difference between two such distances then serves as the distance between two routes: d = ABS(dA - dA').
Don't forget to scale your features.
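As a minimal sketch of the scale-then-cluster idea in Python (NumPy only; the column identities are an assumption, since the question's header and data rows don't line up exactly):

```python
import numpy as np

# The six numeric fields of each sample row, as printed in the question
# (which field maps to which header name is an assumption).
X = np.array([
    [ 2,  25, 0.22,  4.0, 2, 1],   # AAE-ALG
    [ 5,  40, 0.21,  6.0, 1, 1],   # AAE-CGK
    [ 1,  25, 0.25, 13.0, 4, 1],   # AAE-FCO
    [ 8,  58, 0.30, 18.0, 3, 2],   # AAE-IST
    [22, 100, 0.11,  1.0, 6, 5],   # AAE-MOW
    [11,  70, 0.21, 22.0, 3, 2],   # AAE-ORN
    [ 8,  40, 0.18,  3.0, 4, 4],   # AAE-ORY
])

# Standardize: k-means uses Euclidean distance, so large-scale features
# (Impressions) would otherwise dominate small-scale ones (CPC).
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

def kmeans(Xs, k=2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = Xs[rng.choice(len(Xs), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        d = ((Xs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = Xs[labels == j].mean(axis=0)
    return labels, centroids

labels, centroids = kmeans(Xs, k=2)
```

With only 7 routes the value of k is a guess; in practice you would pick it by inspecting inertia or silhouette scores over several values.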

Related

Mixed-effects linear regression model using multiple independent measurements

I am trying to implement a linear mixed-effects (LME) regression model for an X-ray imaging quality metric, "CNR" (contrast-to-noise ratio), which I measured for various tube potentials (kV) and filtration materials (Filter). CNR was measured for 3 consecutive slices, so I also have a standard deviation of the CNR from these independent measurements. I am wondering how I can incorporate these multiple independent measurements into my analysis. A representation of the data for a single measurement and my first attempt using fitlme are shown below. I tried looking at online resources but could not find an answer to my specific questions.
kV=[80 90 100 80 90 100 80 90 100]';
Filter={'Al','Al','Al','Cu','Cu','Cu','Ti','Ti','Ti'}';
CNR=[10 9 8 10.1 8.9 7.9 7 6 5]';
T=table(kV,Filter,CNR);
kV Filter CNR
___ ______ ___
80 'Al' 10
90 'Al' 9
100 'Al' 8
80 'Cu' 10.1
90 'Cu' 8.9
100 'Cu' 7.9
80 'Ti' 7
90 'Ti' 6
100 'Ti' 5
OUTPUT
Linear mixed-effects model fit by ML
Model information:
Number of observations 9
Fixed effects coefficients 4
Random effects coefficients 0
Covariance parameters 1
Formula:
CNR ~ 1 + kV + Filter
Model fit statistics:
AIC BIC LogLikelihood Deviance
-19.442 -18.456 14.721 -29.442
Fixed effects coefficients (95% CIs):
Name Estimate SE pValue
'(Intercept)' 18.3 0.17533 1.5308e-09
'kV' -0.10333 0.0019245 4.2372e-08
'Filter_Cu' -0.033333 0.03849 -0.86603
'Filter_Ti' -3 0.03849 -77.942
Random effects covariance parameters (95% CIs):
Group: Error
Name Estimate Lower Upper
'Res Std' 0.04714 0.0297 0.074821
Questions/issues with the current implementation:
How is the fixed-effects coefficient for '(Intercept)' with P=1.53E-9 interpreted?
I only included fixed effects. Should the standard deviation of the ROI measurements somehow be incorporated into the random effects as well?
How do I incorporate the three independent measurements of CNR for the three consecutive slices at a given kV/filter combination? Should I just add more rows to the table "T"? This would result in a total of 27 observations.
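As a sketch of the "add more rows" idea in Python (the per-slice CNR values are placeholders, since the question only reports one CNR per kV/Filter cell):

```python
# Placeholder data: each measured CNR is repeated for the three slices;
# in practice each slice would contribute its own measured value.
kV     = [80, 90, 100, 80, 90, 100, 80, 90, 100]
Filter = ['Al', 'Al', 'Al', 'Cu', 'Cu', 'Cu', 'Ti', 'Ti', 'Ti']
CNR    = [10, 9, 8, 10.1, 8.9, 7.9, 7, 6, 5]

rows = []
for kv, f, cnr in zip(kV, Filter, CNR):
    for s in (1, 2, 3):                      # three consecutive slices
        rows.append({'kV': kv, 'Filter': f, 'Slice': s, 'CNR': cnr})

# 9 kV/Filter cells x 3 slices = 27 observations
```

With an explicit identifier for each kV/Filter cell in the 27-row table, fitlme could then fit a formula such as 'CNR ~ 1 + kV + Filter + (1|Cell)'; whether such a random intercept is appropriate depends on whether the slices are genuine replicates, which the question leaves open.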

Calculating group means with own group excluded in MATLAB

To state the issue generically: I need to create group means that exclude the own group's observations before calculating the mean.
As an example, say I have firms, products and product characteristics. Each firm (f=1,...,F) produces several products (i=1,...,I). I would like to create a group mean for a certain characteristic of product i of firm f, using all products of all firms but excluding firm f's own product observations.
So I could have a dataset like this:
firm prod width
1 1 30
1 2 10
1 3 20
2 1 25
2 2 15
2 4 40
3 2 10
3 4 35
To reproduce the table:
firm=[1,1,1,2,2,2,3,3];
prod=[1,2,3,1,2,4,2,4];
width=[30,10,20,25,15,40,10,35];
x=[firm' prod' width'];
Then I want to estimate a mean which will use values of all products of all other firms, that is excluding all firm 1 products. In this case, my grouping is at the firm level. (This mean is to be used as an instrumental variable for the width of all products in firm 1.)
So, the mean that I should find is: (25+15+40+10+35)/5=25
Then repeat the process for other firms.
firm prod width mean_desired
1 1 30 25
1 2 10 25
1 3 20 25
2 1 25 21
2 2 15 21
2 4 40 21
3 2 10 23.33
3 4 35 23.33
I guess my biggest difficulty is to exclude the own firm values.
This question is related to this page here: Calculating group mean/medians in MATLAB where group ID is in a separate column. But here, we do not exclude the own group.
p.s.: just out of curiosity if anyone works in economics, I am actually trying to construct Hausman or BLP instruments.
Here's a way that avoids loops, but may be memory-expensive. Let x denote your three-column data matrix.
m = bsxfun(@ne, x(:,1).', unique(x(:,1))); % or m = ~sparse(x(:,1), 1:size(x,1), true);
result = m*x(:,3);
result = result./sum(m,2);
This creates a zero-one matrix m such that each row of m, multiplied by the width column of x (second line of code), gives the sum over the other groups. m is built by comparing each entry in the firm column of x with the unique values of that column (first line). Dividing by the respective count of the other groups' observations (third line) then gives the desired result.
If you need the results repeated as per the original firm column, use result(x(:,1))
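The same leave-own-group-out mean can be sketched in NumPy (a translation of the idea, not the author's code): the grand total minus the own firm's total, divided by the remaining observation count.

```python
import numpy as np

# Columns of the example data: firm and width (prod is not needed here)
firm  = np.array([1, 1, 1, 2, 2, 2, 3, 3])
width = np.array([30, 10, 20, 25, 15, 40, 10, 35])

# Leave-own-group-out mean per firm:
# (grand sum - own firm's sum) / (grand count - own firm's count)
groups    = np.unique(firm)
own_sum   = np.array([width[firm == f].sum() for f in groups])
own_count = np.array([(firm == f).sum() for f in groups])
loo_mean  = (width.sum() - own_sum) / (len(width) - own_count)

# One value per observation, like result(x(:,1)) in the MATLAB answer
mean_per_obs = loo_mean[np.searchsorted(groups, firm)]
```

For firm 1 this reproduces the desired value (25+15+40+10+35)/5 = 25 from the question.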

How to compute & plot Equal Error Rate (EER) from FAR/FRR values using matlab

I have the following FAR/FRR values. I want to compute the EER and then plot it in MATLAB.
FAR FRR
19.64 20
21.29 18.61
24.92 17.08
19.14 20.28
17.99 21.39
16.83 23.47
15.35 26.39
13.20 29.17
7.92 42.92
3.96 60.56
1.82 84.31
1.65 98.33
26.07 16.39
29.04 13.13
34.49 9.31
40.76 6.81
50.33 5.42
66.83 1.67
82.51 0.28
Is there any MATLAB function available to do this? Can somebody explain this to me? Thanks.
Let me try to answer your question.
1) For your data, the EER can be the mean/max/min of [19.64, 20].
1.1) The idea of the EER is to measure a system's performance against other systems (the lower the better) by finding the point where the False Alarm Rate (FAR) and the False Reject Rate (FRR, or miss rate) are equal — or, if never exactly equal, at least nearly equal, i.e. at minimum distance.
Referring to your data, [19.64, 20] gives the minimum distance, so it can be used as the EER; you can take the mean, max or min of these two values. However, since the EER is meant for comparing systems, make sure the other systems use the same method (mean/max/min) to pick their EER value.
The difference among mean/max/min can be ignored when there is a large amount of data; some speaker-verification tasks have on the order of 100k trials.
2) To understand the EER, it is best to compute it yourself. Here is how.
You need to know two things:
A) The system score for each test case (trial)
B) The true/false label for each trial
After you have A and B, create [trial, score, true/false] triples and sort them by score. Then loop over the scores, e.g. from min to max. At each step, treat that score as the threshold and compute the FAR and FRR. After the loop, find the FAR and FRR with (nearly) "equal" values.
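A minimal Python sketch of this threshold sweep (assuming similarity scores, where higher means more likely genuine; for distance scores the comparisons flip):

```python
import numpy as np

def eer_from_scores(scores, labels):
    """Sweep each observed score as the threshold and return the point
    where FAR and FRR are closest. Assumes similarity scores: a trial
    is accepted when its score >= threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)   # True = genuine trial
    best = None
    for t in np.sort(scores):
        far = np.mean(scores[~labels] >= t)   # impostors accepted
        frr = np.mean(scores[labels] < t)     # genuines rejected
        if best is None or abs(far - frr) < abs(best[0] - best[1]):
            best = (far, frr)
    return (best[0] + best[1]) / 2            # mean of the closest pair

# Perfectly separable toy scores: the EER is 0
eer = eer_from_scores([0.9, 0.8, 0.7, 0.4, 0.3, 0.2], [1, 1, 1, 0, 0, 0])
```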
For the code you can refer to my pyeer.py, function processDataTable2:
https://github.com/StevenLOL/Research_speech_speaker_verification_nist_sre2010/blob/master/SRE2010/sid/pyeer.py
This function was written for the NIST SRE 2010 evaluation.
3) There are other measures similar to the EER, such as minDCF, which only changes the weights given to FAR and FRR. You can refer to the "Performance Measure" section of http://www.nist.gov/itl/iad/mig/sre10results.cfm
4) You can also refer to the package https://sites.google.com/site/bosaristoolkit/ and to DETware_v2.1.tar.gz at http://www.itl.nist.gov/iad/mig/tools/ for computing and plotting the EER in MATLAB.
Plotting in DETware_v2.1:
Pmiss=1:50;Pfa=50:-1:1;
Plot_DET(Pmiss/100.0,Pfa/100.0,'r')
FAR(t) and FRR(t) are parameterized by the threshold t. They are cumulative distributions, so they should be monotonic in t. Your data as listed is not monotonic, so if it is indeed FAR and FRR, the measurements were not recorded in threshold order. For the sake of clarity, we can sort them:
FAR FRR
1 1.65 98.33
2 1.82 84.31
3 3.96 60.56
4 7.92 42.92
5 13.2 29.17
6 15.35 26.39
7 16.83 23.47
8 17.99 21.39
9 19.14 20.28
10 19.64 20
11 21.29 18.61
12 24.92 17.08
13 26.07 16.39
14 29.04 13.13
15 34.49 9.31
16 40.76 6.81
17 50.33 5.42
18 66.83 1.67
19 82.51 0.28
This is for increasing FAR, which assumes a distance score; if you have a similarity score, FAR would be sorted in decreasing order.
Loop over the FAR values until FAR becomes larger than FRR, which occurs at row 11. Then interpolate the crossover value between rows 10 and 11. This is your equal error rate.
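The interpolation step can be made concrete with the numbers from rows 10 and 11 of the sorted table:

```python
# Rows 10 and 11 of the sorted table: FRR is above FAR at row 10 and
# below it at row 11, so the two curves cross between these rows.
far_lo, frr_lo = 19.64, 20.00
far_hi, frr_hi = 21.29, 18.61

# Linear interpolation: find where FAR - FRR changes sign.
gap_lo = frr_lo - far_lo                 # 0.36 (FRR still above)
gap_hi = far_hi - frr_hi                 # 2.68 (FAR now above)
frac   = gap_lo / (gap_lo + gap_hi)      # fractional position of the crossing
eer    = far_lo + frac * (far_hi - far_lo)
# eer comes out just under 19.84
```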

Create a new variable in Tableau

I am new to Tableau and trying to get myself oriented to this system. I am an R user and typically work with wide data formats, so getting things wrangled into the proper long format has been tricky. Here is my current problem.
Assume I have a data file that is structured as such
ID Disorder Value
1 A 0
1 B 1
1 C 0
2 A 1
2 B 1
2 C 1
3 A 0
3 B 0
3 C 0
What I would like to do is to combine the variables, so that the presence of a set of disorders is used to build summary variables. For example, how could I go about producing the output below? The sum is the number of people with at least one disorder in the set, and the percentage is that number divided by the total number of people.
Disorders Sum Percentage
A 1 33.3
B 2 66.6
C 1 33.3
AB 2 66.6
BC 2 66.6
AC 1 33.3
ABC 2 66.6
The approach really depends on how flexible it has to be. Ultimately, a wide data source with your Disorder values as columns would make this easier. You will still need to blend the results against a data scaffold that contains the combinations of codes you want in order to make this work in Tableau. If this needs to scale, you'll want to do the transformation work using custom SQL or another ETL tool such as Alteryx. I posted a solution to this question for you over on the Tableau forum, where I can upload files: http://community.tableausoftware.com/message/316168
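Outside Tableau, the target table can be reproduced from the long data with a short script. A Python sketch (reading "presence of a set of disorders" as having any disorder in the set, which matches the expected output, though the percentages round to 66.7 rather than the question's truncated 66.6):

```python
from itertools import combinations

# Long-format data from the question: (ID, Disorder, Value)
records = [
    (1, 'A', 0), (1, 'B', 1), (1, 'C', 0),
    (2, 'A', 1), (2, 'B', 1), (2, 'C', 1),
    (3, 'A', 0), (3, 'B', 0), (3, 'C', 0),
]

# Pivot to wide: the set of disorders each person has.
has = {}
for pid, disorder, value in records:
    if value:
        has.setdefault(pid, set()).add(disorder)

people = sorted({pid for pid, _, _ in records})

# For each combination, count people having ANY disorder in the set.
results = {}
for r in (1, 2, 3):
    for combo in combinations('ABC', r):
        n = sum(1 for p in people if has.get(p, set()) & set(combo))
        results[''.join(combo)] = (n, round(100 * n / len(people), 1))
```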

Setting up an ANN to classify Tic-Tac-Toe End-Games

I'm having a hard time setting up a neural network to classify Tic-Tac-Toe board states (final or intermediate) as "X wins", "O wins" or "Tie".
I will describe my current solution and results. Any advice is appreciated.
* DATA SET *
Dataset = 958 possible end-games + 958 random games = 1916 board states
(random games might be incomplete but are all legal, i.e. they never have both players winning simultaneously).
Training set = 1600 random sample of Dataset
Test set = remaining 316 cases
In my current pseudo-random development scenario the dataset has the following characteristics.
Training set:
- 527 wins for "X"
- 264 wins for "O"
- 809 ties
Test set:
- 104 wins for "X"
- 56 wins for "O"
- 156 ties
* Modeling *
Input Layer: 18 input neurons where each one corresponds to a board position and player. Therefore,
the board (B=blank):
x x o
o x B
B o x
is encoded as:
1 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 1 0
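A small Python helper makes this encoding concrete (two inputs per cell: (1,0) for x, (0,1) for o, (0,0) for blank):

```python
def encode_board(cells):
    """Map 9 cells ('x', 'o' or 'B' for blank) to 18 input values:
    (1, 0) for x, (0, 1) for o, (0, 0) for blank."""
    code = {'x': [1, 0], 'o': [0, 1], 'B': [0, 0]}
    bits = []
    for c in cells:
        bits += code[c]
    return bits

# The example board above, read row by row
board = ['x', 'x', 'o', 'o', 'x', 'B', 'B', 'o', 'x']
encoded = encode_board(board)
# encoded == [1,0, 1,0, 0,1, 0,1, 1,0, 0,0, 0,0, 0,1, 1,0]
```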
Output Layer: 3 output neurons which correspond to each outcome (X wins, O wins, Tie).
* Architecture *
Based on: http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment2.tar.gz
1 Single Hidden Layer
Hidden Layer activation function: Logistic
Output Layer activation function: Softmax
Error function: Cross-Entropy
* Results *
No combination of parameters seems to achieve 100% correct classification rate. Some examples:
NHidden LRate InitW MaxEpoch Epochs FMom Errors TestErrors
8 0.0025 0.01 10000 4500 0.8 0 7
16 0.0025 0.01 10000 2800 0.8 0 5
16 0.0025 0.1 5000 1000 0.8 0 4
16 0.0025 0.5 5000 5000 0.8 3 5
16 0.0025 0.25 5000 1000 0.8 0 5
16 0.005 0.25 5000 1000 0.9 10 5
16 0.005 0.25 5000 5000 0.8 15 5
16 0.0025 0.25 5000 1000 0.8 0 5
32 0.0025 0.25 5000 1500 0.8 0 5
32 0.0025 0.5 5000 600 0.9 0 5
8 0.0025 0.25 5000 3500 0.8 0 5
Important - I am aware that the following could be improved:
- The dataset characteristics (source and quantities of training and test cases) aren't the best.
- An alternative problem modeling (encoding of input/output neurons) might be more suitable.
- A better network architecture (number of hidden layers, activation/error functions, etc.) might exist.
Assuming that my current choices in these regards, even if not optimal, should not prevent the system from reaching a 100% correct classification rate, I would like to focus on other possible issues.
In other words, considering the simplicity of the game, this dataset/modeling/architecture should suffice; so what am I doing wrong regarding the parameters?
I do not have much experience with ANN and my main question is the following:
Using 16 hidden neurons, the ANN could learn to associate each hidden unit with "a certain player winning in a certain way":
(3 rows + 3 columns + 2 diagonals) * 2 players = 16 ways to win.
In this setting, an "optimal" set of weights is fairly straightforward: each hidden unit has "greater" connection weights from 3 of the input units (corresponding to a row, column or diagonal for one player) and a "greater" connection weight to one of the output units (corresponding to a win by that player).
No matter what I do, I cannot decrease the number of test errors, as the above table shows.
Any advice is appreciated.
You are doing everything right; you are simply trying to tackle a difficult problem here, namely generalizing from some examples of tic-tac-toe configurations to all others.
Unfortunately, the simple neural network you use perceives neither the spatial structure of the input (neighborhood) nor its symmetries. So in order to drive the test error to zero, you can either:
increase the size of the dataset to include most (or all) possible configurations -- which the network will then be able to simply memorize, as indicated by the zero training error in most of your setups;
choose a different problem, where there is more structure to generalize from;
use a network architecture that can capture symmetry (e.g. through weight-sharing) and/or spatial relations of the inputs (e.g. different features). Convolutional networks are just one example of this.