Calculation of price elasticities with mlogit (mixed logit) - effects

I estimated a mixed logit model with the mlogit package. My data has the following structure:
long format
individual id ("id")
alternatives A, B, C, and D, representing four car types ("alternative")
choice situation ("card")
I estimated a mixed logit with random parameters for price, fuelcost, and range.
Here is my code:
# Read data such that it fits the choice experiment
CAR_P <- mlogit.data(Car_choices, choice = "choice", shape = "long", alt.var = "alternative", chid.var = "card", id.var = "id")
# estimation of the mixed logit
CAR_model_mixed7 <- mlogit(choice ~ price + fuelcost + yearlycost + range + co2 + chargehighway + chargehome | mini + small + large + premium + luxe + age + female + long_distance + ccar + income_fitted + no_charging,
                           CAR_P, rpar = c(price = "n", fuelcost = "n", range = "n"), R = 2000, halton = NA, panel = TRUE, reflevel = "A")
summary(CAR_model_mixed7)
Now I need to estimate a price elasticity, but this doesn't work.
A screenshot of my data is added below.
The code I entered to estimate the price elasticity is the following:
# 1. compute a data.frame containing the mean values of the covariates in the sample
z <- with(CAR_P, data.frame(price = tapply(price, idx(CAR_model_mixed7, 2), mean),
                            fuelcost = tapply(fuelcost, idx(CAR_model_mixed7, 2), mean),
                            yearlycost = tapply(yearlycost, idx(CAR_model_mixed7, 2), mean),
                            range = tapply(range, idx(CAR_model_mixed7, 2), mean),
                            co2 = tapply(co2, idx(CAR_model_mixed7, 2), mean),
                            chargehighway = tapply(chargehighway, idx(CAR_model_mixed7, 2), mean),
                            chargehome = tapply(chargehome, idx(CAR_model_mixed7, 2), mean),
                            mini = mean(mini),
                            small = mean(small),
                            large = mean(large),
                            premium = mean(premium),
                            luxe = mean(luxe),
                            age = mean(age),
                            female = mean(female),
                            long_distance = mean(long_distance),
                            ccar = mean(ccar),
                            income_fitted = mean(income_fitted),
                            no_charging = mean(no_charging)))
# 2. compute the price elasticity
effects(CAR_model_mixed7, covariate = "price", type = "rr", data = z)
However, I get the error message:
Error in mlogit(formula = choice ~ price + fuelcost + yearlycost + :
no individual index
Does anyone know what I'm doing wrong and how to solve this?

Related

number of items to replace is not a multiple of replacement length, MonteCarlo Stochastic Process R error

set.seed(123)
SimpleEulerApproximation = function(T, x, a, b, delta){
  numberofSteps = T/delta
  TimeSteps = rep(numberofSteps, 1)
  Y = rep(numberofSteps, 1)
  Y[1] = x
  for (i in 1:numberofSteps){
    TimeSteps[i] = 0 + i*delta
  }
  for (j in 2:numberofSteps){
    Y[j] = Y[j-1] + a*Y[j-1]*delta + b*Y[j-1]*rnorm(1, 0, sqrt(delta))
  }
  ## plot(TimeSteps, Y, type = "l")
}
SimpleEulerApproximation(1, 20, -0.01, 0.25, 0.001)
set.seed(123)
MultipleEulerApproximation = function(T, x, a, b, delta, numberofTrajectories){
  numberofSteps = round(T/delta)
  TimeSteps = rep(numberofSteps, 1)
  Y = rep(numberofSteps, rep(numberofTrajectories))
  Y = data.matrix(Y)
  for (i in 1:numberofTrajectories){
    Y[,i] = SimpleEulerApproximation(T, x, a, b, delta)
  }
  for (i in 1:numberofSteps){
    TimeSteps[i] = 0 + i*delta
  }
  AverageTrajectory = rep(numberofSteps, 1)
  for (i in 1:numberofSteps){
    AverageTrajectory[i] = mean(Y[i,])
  }
  ## plot(TimeSteps, AverageTrajectory)
}
MultipleEulerApproximation(1, 52, 0.12, 0.30, 0.0001, 10000)
MonteCarloSimulation = function(T, x, r, sigma, K, delta, numberofTrajectories){
  Y = MultipleEulerApproximation(T, x, r, sigma, delta, numberofTrajectories)
  lastStep = round(T/delta)
  max(Y[lastStep,] - K, 0)
  size(Y)
  price = 1/numberofTrajectories * sum(max(Y[lastStep,] - K, 0))*exp(-r*T)
}
MonteCarloSimulation(0.25, 52, 0.12, 0.3, 50, 0.0001, 10000)
When I run the code for MultipleEulerApproximation, I get a "replacement has length zero" error. Can someone help me with this? Much appreciated.
The first chunk is a simple Euler approximation for the stochastic differential equation dX_t = -0.1 X_t dt + 0.25 X_t dB_t, X_0 = 20, over the time interval [0, 1] with time step size Δ = 0.001.
The second chunk of code is MultipleEulerApproximation; that is where the error occurs.
The third chunk is for calculating the European call option price using projections.

Multi-objective optimization example Pyomo

Any example for multi-objective optimization in Pyomo?
I am trying to minimize four objectives (nonlinear) and I would like to use Pyomo and Ipopt. I also have access to Gurobi.
I want to see even a very simple example where we try to optimize two or more objectives (one minimization and one maximization) over a list of decision variables (not just one dimension, but maybe a vector).
The Pyomo book that I have (https://link.springer.com/content/pdf/10.1007%2F978-3-319-58821-6.pdf) does not provide a single clue.
With Pyomo you have to implement it yourself. I am doing it right now. The best method is the augmented epsilon-constraint method: it is always efficient and always finds the global Pareto optimum. The best example is here:
Effective implementation of the epsilon-constraint method in Multi-Objective Mathematical Programming problems, Mavrotas, G, 2009.
Edit: here I programmed the example from the paper above in Pyomo:
It will first maximize for f1, then for f2. Then it will apply the normal epsilon-constraint and plot the inefficient Pareto front, and then it will apply the augmented epsilon-constraint, which finally is the method to go with!
from pyomo.environ import *
import matplotlib.pyplot as plt
# max f1 = X1
# max f2 = 3 X1 + 4 X2
# st  X1 <= 20
#     X2 <= 40
#     5 X1 + 4 X2 <= 200
model = ConcreteModel()
model.X1 = Var(within=NonNegativeReals)
model.X2 = Var(within=NonNegativeReals)
model.C1 = Constraint(expr = model.X1 <= 20)
model.C2 = Constraint(expr = model.X2 <= 40)
model.C3 = Constraint(expr = 5 * model.X1 + 4 * model.X2 <= 200)
model.f1 = Var()
model.f2 = Var()
model.C_f1 = Constraint(expr= model.f1 == model.X1)
model.C_f2 = Constraint(expr= model.f2 == 3 * model.X1 + 4 * model.X2)
model.O_f1 = Objective(expr= model.f1 , sense=maximize)
model.O_f2 = Objective(expr= model.f2 , sense=maximize)
model.O_f2.deactivate()
solver = SolverFactory('cplex')
solver.solve(model);
print( '( X1 , X2 ) = ( ' + str(value(model.X1)) + ' , ' + str(value(model.X2)) + ' )')
print( 'f1 = ' + str(value(model.f1)) )
print( 'f2 = ' + str(value(model.f2)) )
f2_min = value(model.f2)
# max f2
model.O_f2.activate()
model.O_f1.deactivate()
solver = SolverFactory('cplex')
solver.solve(model);
print( '( X1 , X2 ) = ( ' + str(value(model.X1)) + ' , ' + str(value(model.X2)) + ' )')
print( 'f1 = ' + str(value(model.f1)) )
print( 'f2 = ' + str(value(model.f2)) )
f2_max = value(model.f2)
# apply the normal epsilon-constraint
model.O_f1.activate()
model.O_f2.deactivate()
model.e = Param(initialize=0, mutable=True)
model.C_epsilon = Constraint(expr = model.f2 == model.e)
solver.solve(model);
print('Each iteration will fix f2 at some value between f2_min and f2_max, i.e. in [' + str(f2_min) + ', ' + str(f2_max) + ']')
n = 4
step = int((f2_max - f2_min) / n)
steps = list(range(int(f2_min),int(f2_max),step)) + [f2_max]
x1_l = []
x2_l = []
for i in steps:
    model.e = i
    solver.solve(model)
    x1_l.append(value(model.X1))
    x2_l.append(value(model.X2))
plt.plot(x1_l,x2_l,'o-.');
plt.title('inefficient Pareto-front');
plt.grid(True);
# apply the augmented epsilon-constraint
# max  f1 + delta * s
# s.t. f2 - s = e
model.del_component(model.O_f1)
model.del_component(model.O_f2)
model.del_component(model.C_epsilon)
model.delta = Param(initialize=0.00001)
model.s = Var(within=NonNegativeReals)
model.O_f1 = Objective(expr = model.f1 + model.delta * model.s, sense=maximize)
model.C_e = Constraint(expr = model.f2 - model.s == model.e)
x1_l = []
x2_l = []
for i in range(160,190,6):
    model.e = i
    solver.solve(model)
    x1_l.append(value(model.X1))
    x2_l.append(value(model.X2))
plt.plot(x1_l,x2_l,'o-.');
plt.title('efficient Pareto-front');
plt.grid(True);
Disclaimer: I am the main developer of pymoo, a multi-objective optimization framework in Python.
You might want to consider other frameworks in Python that have a focus on multi-objective optimization. For instance, in pymoo the definition of the rather simple test problem mentioned above is more or less straightforward. You can find an implementation of it below; the scatter plots at the end of the code show the results in the design and objective spaces.
pymoo is well documented and provides a getting started guide that demonstrates defining your own optimization problem, obtaining a set of near-optimal solutions and analyzing it: https://pymoo.org/getting_started.html
The focus of the framework is anything related to multi-objective optimization including visualization and decision making.
import matplotlib.pyplot as plt
import numpy as np
from pymoo.algorithms.nsga2 import NSGA2
from pymoo.model.problem import Problem
from pymoo.optimize import minimize
from pymoo.visualization.scatter import Scatter
class MyProblem(Problem):

    def __init__(self):
        """
        max f1 = X1
        max f2 = 3 X1 + 4 X2
        st  X1 <= 20
            X2 <= 40
            5 X1 + 4 X2 <= 200
        """
        super().__init__(n_var=2,
                         n_obj=2,
                         n_constr=1,
                         xl=np.array([0, 0]),
                         xu=np.array([20, 40]))

    def _evaluate(self, x, out, *args, **kwargs):
        # define both objectives
        f1 = x[:, 0]
        f2 = 3 * x[:, 0] + 4 * x[:, 1]

        # we have to negate the objectives because by default we assume minimization
        f1, f2 = -f1, -f2

        # define the constraint as a less-than-or-equal-to-zero constraint
        g1 = 5 * x[:, 0] + 4 * x[:, 1] - 200

        out["F"] = np.column_stack([f1, f2])
        out["G"] = g1
problem = MyProblem()
algorithm = NSGA2()
res = minimize(problem,
               algorithm,
               ('n_gen', 200),
               seed=1,
               verbose=True)
print(res.X)
print(res.F)
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
Scatter(fig=fig, ax=ax1, title="Design Space").add(res.X, color="blue").do()
Scatter(fig=fig, ax=ax2, title="Objective Space").add(res.F, color="red").do()
plt.show()
To my knowledge, while Pyomo supports the expression of models with multiple objectives, it does not yet have automatic model transformations to generate common multi-objective optimization formulations for you.
That said, you can still create these formulations yourself. Take a look at epsilon-constraint, 1-norm, and infinity norm for some ideas.
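To make the 1-norm idea concrete, here is a minimal weighted-sum scalarization sketch of the toy model from the answer above. This is an illustration, not Pyomo's own multi-objective API; the solver name 'glpk' is an assumption, so substitute whatever you have installed (cplex, gurobi, ipopt):
from pyomo.environ import (ConcreteModel, Var, Objective, Constraint,
                           NonNegativeReals, SolverFactory, maximize, value)

model = ConcreteModel()
model.X1 = Var(within=NonNegativeReals, bounds=(0, 20))
model.X2 = Var(within=NonNegativeReals, bounds=(0, 40))
model.C3 = Constraint(expr=5 * model.X1 + 4 * model.X2 <= 200)

# Sweep the weight w over [0, 1]; each solve of the weighted sum of the
# two objectives yields one supported point of the Pareto front.
solver = SolverFactory('glpk')  # assumption: use any LP solver you have
for w in [0.0, 0.25, 0.5, 0.75, 1.0]:
    model.obj = Objective(expr=w * model.X1 + (1 - w) * (3 * model.X1 + 4 * model.X2),
                          sense=maximize)
    solver.solve(model)
    print(w, value(model.X1), value(model.X2))
    model.del_component(model.obj)  # remove so it can be rebuilt with the next weight
Note that a plain weighted sum only finds supported (convex-hull) Pareto points, which is one reason the augmented epsilon-constraint method described above is usually preferred.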

The method of characteristics for two dimensional advection equation

Given the following function to solve the two-dimensional advection equation in a rectangle:
function qnew = SemiLagrAdvect(u,v,q,qS,qN,qW,qE)

global N M
global dx dy
global dt Re

u = reshape(u,N,M);
v = reshape(v,N,M);
q = reshape(q,N,M);

%...embedding
qq = zeros(N+2,M+2);
qq(2:N+1,2:M+1) = q;

%...set the ghost values (four edges)
qq(1,2:M+1) = 2*qW - qq(2,2:M+1);
qq(N+2,2:M+1) = 2*qE - qq(N+1,2:M+1);
qq(2:N+1,1) = 2*qS - qq(2:N+1,2);
qq(2:N+1,M+2) = 2*qN - qq(2:N+1,M+1);

%...set the ghost values (four corners)
qq(1,1) = -qq(2,2);
qq(N+2,1) = -qq(N+1,2);
qq(N+2,M+2) = -qq(N+1,M+1);
qq(1,M+2) = -qq(2,M+1);

q1 = qq(2:N+1,2:M+1);
q2p = qq(3:N+2,2:M+1);
q2m = qq(1:N,2:M+1);
q3p = qq(2:N+1,3:M+2);
q3m = qq(2:N+1,1:M);
q4pp = qq(3:N+2,3:M+2);
q4mm = qq(1:N,1:M);
q4pm = qq(3:N+2,1:M);
q4mp = qq(1:N,3:M+2);

xi = -u*dt/dx;
eta = -v*dt/dy;

Q2 = q2p.*(xi>0) + q2m.*(xi<0);
Q3 = q3p.*(eta>0) + q3m.*(eta<0);
Q4 = q4pp.*((xi>0) & (eta>0)) + q4mm.*((xi<0) & (eta<0)) + ...
     q4pm.*((xi>0) & (eta<0)) + q4mp.*((xi<0) & (eta>0));

qnew = (1-abs(xi)).*(1-abs(eta)).*q1 + ...
       abs(xi).*(1-abs(eta)).*Q2 + ...
       abs(eta).*(1-abs(xi)).*Q3 + ...
       abs(xi).*abs(eta).*Q4;

qnew = qnew(:);
Having only elementary knowledge of MATLAB, how can I modify it to solve the equation in a composite domain?
You need to use continuity on the interface; the details depend on your method.
In the method of characteristics, the interface is part of the initial curve.

find optimum values of model iteratively

Given that I have a model that can be expressed as:
y = a + b*st + c*d2
where st is a smoothed version of some data, and a, b, and c are unknown model coefficients. An iterative process should be used to find the best values for a, b, and c, as well as an additional parameter alpha, shown below.
Here, I show an example using some data that I have. I'll only show a small fraction of the data here, to give an idea of what I have:
17.1003710350253 16.7250000000000 681.521316544969
17.0325989276234 18.0540000000000 676.656460644882
17.0113862864815 16.2460000000000 671.738125420192
16.8744356336601 15.1580000000000 666.767363772145
16.5537077980594 12.8830000000000 661.739644621949
16.0646524243248 10.4710000000000 656.656219934146
15.5904357723302 9.35000000000000 651.523986525985
15.2894427136087 12.4580000000000 646.344231349275
15.1181450512182 9.68700000000000 641.118300709434
15.0074128442766 10.4080000000000 635.847600747838
14.9330905954828 11.5330000000000 630.533597865332
14.8201069920058 10.6830000000000 625.177819082427
16.3126863409751 15.9610000000000 619.781852331734
16.2700386755872 16.3580000000000 614.347346678083
15.8072873786912 10.8300000000000 608.876012461843
15.3788908036751 7.55000000000000 603.369621360944
15.0694302370038 13.1960000000000 597.830006367160
14.6313314652840 8.36200000000000 592.259061672302
14.2479738025295 9.03000000000000 586.658742460043
13.8147156115234 5.29100000000000 581.031064599264
13.5384821473624 7.22100000000000 575.378104234926
13.3603543306796 8.22900000000000 569.701997272687
13.2469020140965 9.07300000000000 564.004938753678
13.2064193251406 12.0920000000000 558.289182116093
13.1513460035983 12.2040000000000 552.557038340513
12.8747853506079 4.46200000000000 546.810874976187
12.5948999131388 4.61200000000000 541.053115045791
12.3969691298003 6.83300000000000 535.286235826545
12.1145822760120 2.43800000000000 529.512767505944
11.9541188991626 2.46700000000000 523.735291710730
11.7457790927936 4.15000000000000 517.956439908176
11.5202981254529 4.47000000000000 512.178891679167
11.2824263926694 2.62100000000000 506.405372863054
11.0981930749608 2.50000000000000 500.638653574697
10.8686514170776 1.66300000000000 494.881546094641
10.7122053911554 1.68800000000000 489.136902633882
10.6255883267131 2.48800000000000 483.407612975178
10.4979083986908 4.65800000000000 477.696601993434
10.3598092538338 4.81700000000000 472.006827058220
10.1929490084608 2.46700000000000 466.341275322034
10.1367069580204 2.36700000000000 460.702960898512
10.0194072271384 4.87800000000000 455.094921935306
9.88627023967911 3.53700000000000 449.520217586971
9.69091601129389 0.417000000000000 443.981924893704
9.48684595125235 -0.567000000000000 438.483135572389
9.30742664359900 0.892000000000000 433.026952726910
9.18283037670750 1.50000000000000 427.616487485241
9.02385722622626 1.75800000000000 422.254855571341
8.90355705229410 2.46700000000000 416.945173820367
8.76138912769045 1.99200000000000 411.690556646207
8.61299614111510 0.463000000000000 406.494112470755
8.56293606861698 6.55000000000000 401.358940124780
8.47831879772002 4.65000000000000 396.288125230599
8.42736865902327 6.45000000000000 391.284736577104
8.26325535934842 -1.37900000000000 386.351822497948
8.14547793724500 1.37900000000000 381.492407263967
8.00075641792910 -1.03700000000000 376.709487501030
7.83932517791044 -1.66700000000000 372.006028644665
7.68389447250257 -4.12900000000000 367.384961442799
7.63402151555169 -2.57900000000000 362.849178517935
The results that follow probably won't be meaningful, as the full data would be needed (but this is an example). Using this data, I have tried to solve the problem iteratively by
y = d(:,1);
d1 = d(:,2);
d2 = d(:,3);
alpha_o = linspace(0.01,1,10);
a = linspace(0.01,1,10);
b = linspace(0.01,1,10);
c = linspace(0.01,1,10);
defining different values for a, b, and c, as well as another term alpha, which is used in the model. I am now going to find every possible combination of these parameters and see which combination provides the best fit to the data:
% every possible combination of values
xx = combvec(alpha_o,a,b,c);

% loop through each possible combination of values
for j = 1:size(xx,2)
    alpha_o = xx(1,j);
    a_o = xx(2,j);
    b_o = xx(3,j);
    c_o = xx(4,j);
    st = d1(1);
    for i = 2:length(d1)
        st(i) = alpha_o.*d1(i) + (1-alpha_o).*st(i-1);
    end
    st = st(:);
    y_pred = a_o + (b_o*st) + (c_o*d2);
    mae(j) = nanmean(abs(y - y_pred));
end
I can then re-run the model using these optimum values:
[id1,id2] = min(mae);
alpha_opt = xx(:,id2);

st = d1(1);
for i = 2:length(d1)
    st(i) = alpha_opt(1).*d1(i) + (1-alpha_opt(1)).*st(i-1);
end
st = st(:);

y_pred = alpha_opt(2) + (alpha_opt(3)*st) + (alpha_opt(4)*d2);
mae_final = nanmean(abs(y - y_pred));
However, to reach a final answer I would need to increase the number of initial guesses to more than 10 for each variable, and this will take a long time to run. Therefore, I am wondering if there is a better method for what I am trying to do here? Any advice is appreciated.
Here are some thoughts: if you could decrease the amount of computation within each pass of the for loop, you could possibly speed it up. One possible way is to look for factors common to every iteration and move them outside the loop.
If you look at the iteration (writing a for the smoothing parameter alpha_o), you'll see
st(1) = d1(1)
st(2) = a*d1(2) + (1-a)*st(1) = a*d1(2) + (1-a)*d1(1)
st(3) = a*d1(3) + (1-a)*st(2) = a*d1(3) + a*(1-a)*d1(2) + (1-a)^2*d1(1)
st(n) = a*d1(n) + a*(1-a)*d1(n-1) + a*(1-a)^2*d1(n-2) + ... + (1-a)^(n-1)*d1(1)
This means st can be calculated by multiplying these two matrices (here I use n = 4 to illustrate the concept) and summing along the first dimension:
temp1 = [ 0      0        0          a          ;
          0      0        a          a*(1-a)    ;
          0      a        a*(1-a)    a*(1-a)^2  ;
          1      (1-a)    (1-a)^2    (1-a)^3    ;]

temp2 = [ 0      0        0          d1(4)      ;
          0      0        d1(3)      d1(3)      ;
          0      d1(2)    d1(2)      d1(2)      ;
          d1(1)  d1(1)    d1(1)      d1(1)      ;]
st = sum(temp1.*temp2,1)
Here is code that utilizes this concept: the computation has been moved out of the inner for loop, and only assignment is left.
alpha_o = linspace(0.01,1,10);
xx = nchoosek(alpha_o, 4);   % note: unordered combinations only; use combvec as in the question for the full grid
n = size(d1,1);

matrix_d1 = zeros(n, n);
d2 = d2';   % make d2 a row vector, the same orientation as st
for ii = 1:n
    matrix_d1(n-ii+1:n, ii) = d1(ii:-1:1);   % reversed order, matching temp2 above
end

mae = zeros(1, size(xx,1));   % pre-allocation improves speed
matrix_alpha = zeros(n, n);
for j = 1:size(xx,1)
    alpha_o = xx(j,1);
    temp = (power(1-alpha_o, 0:n-1)*alpha_o)';
    matrix_alpha(n,:) = power(1-alpha_o, 0:n-1);
    for ii = 2:n
        matrix_alpha(n-ii+1:n-1, ii) = temp(1:ii-1);
    end
    st = sum(matrix_d1.*matrix_alpha, 1);
    y_pred = xx(j,2) + xx(j,3)*st + xx(j,4)*d2;
    mae(j) = nanmean(abs(y - y_pred'));   % transpose y_pred back to a column to match y
end
Then:
[~, idx] = min(mae);
alpha_opt = xx(idx,:);

temp = (power(1-alpha_opt(1), 0:n-1)*alpha_opt(1))';
matrix_alpha = zeros(n, n);
matrix_alpha(n,:) = power(1-alpha_opt(1), 0:n-1);
for ii = 2:n
    matrix_alpha(n-ii+1:n-1, ii) = temp(1:ii-1);
end
st = sum(matrix_d1.*matrix_alpha, 1);

y_pred = alpha_opt(2) + (alpha_opt(3)*st) + (alpha_opt(4)*d2);
mae_final = nanmean(abs(y - y_pred'));
Let me know if this helps!

How can we measure the similarity distance between categorical data?

Example:
Gender: Male, Female
Numerical values: [0 - 100], [200 - 300]
Strings: professionals, beginners, etc.
Thanks in advance.
There are different ways to do this. One of the simplest would be as follows.
1) Assign a numeric value to each property so that the order of the values matches the meaning behind the property, where possible. It is important to order property values from lower to higher if the property can be measured. If that is not possible and the property is categorical (like gender or profession), just assign a number to each possible value.
P1 - Gender
-------------------
0 - Male
1 - Female
P2 - Experience
-----------
0 - Beginner
5 - Average
10 - Professional
P3 - Age
-----------
[0 - 100]
P4 - Body height, cm
-----------
[50 - 250]
2) For each property, find a scale factor and an offset so that all property values fall in the same chosen range, say [0, 100]:
Sx = 100 / (Px max - Px min)
Ox = -Px min
In sample provided you would get:
S1 = 100
O1 = 0
S2 = 10
O2 = 0
S3 = 1
O3 = 0
S4 = 0.5
O4 = -50
3) Now you can create a vector containing all the property values:
V = (S1 * (P1 + O1), S2 * (P2 + O2), S3 * (P3 + O3), S4 * (P4 + O4))
In the sample provided it would be:
V = (100 * P1, 10 * P2, P3, 0.5 * (P4 - 50))
4) Now you can compare two vectors V1 and V2 by subtracting one from the other. The length of the resulting vector tells you how different they are:
delta = |V1 - V2|
Vectors are subtracted dimension by dimension; the vector length is the square root of the sum of the squared dimensions.
Imagine we have 3 persons:
John
P1 = 0 (male)
P2 = 0 (beginner)
P3 = 20 (20 years old)
P4 = 190 (body height is 190 cm)
Kevin
P1 = 0 (male)
P2 = 10 (professional)
P3 = 25 (25 years old)
P4 = 186 (body height is 186 cm)
Lea
P1 = 1 (female)
P2 = 10 (professional)
P3 = 40 (40 years old)
P4 = 178 (body height is 178 cm)
Vectors would be:
J = (100 * 0, 10 * 0, 20, 0.5 * (190 - 50)) = (0, 0, 20, 70)
K = (100 * 0, 10 * 10, 25, 0.5 * (186 - 50)) = (0, 100, 25, 68)
L = (100 * 1, 10 * 10, 40, 0.5 * (178 - 50)) = (100, 100, 40, 64)
To determine the distances, we subtract the vectors:
delta JK = |J - K| =
= |(0 - 0, 0 - 100, 20 - 25, 70 - 68)| =
= |(0, -100, -5, 2)| =
= SQRT(0^2 + (-100)^2 + (-5)^2 + 2^2) =
= SQRT(10000 + 25 + 4) =
= 100.14
delta KL = |K - L| =
= |(0 - 100, 100 - 100, 25 - 40, 68 - 64)| =
= |(-100, 0, -15, 4)| =
= SQRT((-100)^2 + 0^2 + (-15)^2 + 4^2) =
= SQRT(10000 + 225 + 16) =
= 101.20
delta LJ = |L - J| =
= |(100 - 0, 100 - 0, 40 - 20, 64 - 70)| =
= |(100, 100, 20, -6)| =
= SQRT(100^2 + 100^2 + 20^2 + (-6)^2) =
= SQRT(10000 + 10000 + 400 + 36) =
= 142.95
From this you can see that John and Kevin are more similar than any other pair, as their delta is the smallest.
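If you want to reproduce the arithmetic above programmatically, here is a minimal Python sketch; the helper to_vector is purely illustrative and just applies the scale factors and offsets derived in step 2:
import numpy as np

# Illustrative helper: scale each property into the common [0, 100] range.
def to_vector(gender, experience, age, height_cm):
    return np.array([100 * gender,             # P1: S1 = 100, O1 = 0
                     10 * experience,          # P2: S2 = 10,  O2 = 0
                     age,                      # P3: already in [0, 100]
                     0.5 * (height_cm - 50)])  # P4: S4 = 0.5, O4 = -50

john = to_vector(0, 0, 20, 190)
kevin = to_vector(0, 10, 25, 186)
lea = to_vector(1, 10, 40, 178)

# The Euclidean norms reproduce the deltas computed by hand above.
print(round(np.linalg.norm(john - kevin), 2))  # 100.14
print(round(np.linalg.norm(kevin - lea), 2))   # 101.2
print(round(np.linalg.norm(lea - john), 2))    # 142.95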
There are a number of measures for finding similarity between categorical data. The following paper briefly discusses these measures:
https://conservancy.umn.edu/bitstream/handle/11299/215736/07-022.pdf?sequence=1&isAllowed=y
If you're trying to do this in R, there's a package named 'nomclust', which has all these similarity measures readily available.
Hope this helps!
If you are using Python, there is a recent library that helps in finding the proximity matrix based on similarity measures such as Eskin, overlap, IOF, OF, Lin, Lin1, etc.
After obtaining the proximity matrix, we can go on to clustering using hierarchical cluster analysis; a generic sketch follows the link below.
Check this link to the library named "Categorical_similarity_measures":
https://pypi.org/project/Categorical-similarity-measures/0.4/
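Leaving that package's exact API aside, here is a generic sketch of the workflow it supports: compute a simple-matching (overlap) distance matrix by hand and feed it to SciPy's hierarchical clustering. All data and names here are illustrative, not taken from the library above:
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy categorical data: one row per observation.
data = np.array([['male',   'beginner',     'young'],
                 ['male',   'professional', 'young'],
                 ['female', 'professional', 'adult']])

# Simple-matching distance: the fraction of attributes that differ.
n = len(data)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = np.mean(data[i] != data[j])

# Condense the square matrix and cluster with average linkage.
Z = linkage(squareform(dist), method='average')
print(fcluster(Z, t=2, criterion='maxclust'))  # cluster labels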
Just a thought: we can also apply the Euclidean distance between two variables to find a drift value. If it is 0, there is no drift (the variables are similar); otherwise they differ. The vectors should be sorted and of the same length before the calculation.
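A minimal sketch of that thought, assuming two numeric samples already trimmed to the same length:
import numpy as np

# Sort both samples, then use the Euclidean distance as a drift score:
# 0 means no drift; larger values mean the samples differ more.
a = np.sort(np.array([1.0, 3.0, 2.0]))
b = np.sort(np.array([2.0, 1.0, 3.0]))
print(np.linalg.norm(a - b))  # 0.0 -> no drift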