How do I calculate the closeness between points/clusters using Euclidean distance? - cluster-analysis

Consider a market basket database containing the following 4 transactions over
items 1, 2, 3, 4, 5 and 6.
(a) {1, 2, 3, 5},
(b) {2, 3, 4, 5},
(c) {1, 4}, and
(d) {6}.
The transactions can be viewed as points with boolean (0/1) attributes corresponding to the items 1, 2, 3, 4, 5 and 6. The four points thus become
(1,1,1,0,1,0),
(0,1,1,1,1,0),
(1,0,0,1,0,0),
(0,0,0,0,0,1).
Using Euclidean distance to measure the closeness between points/clusters, how do I calculate
d(1,2)=?
d(1,3)=?
d(1,4)=?
d(2,3)=?
d(2,4)=?
d(3,4)=?
(They say d(3,4) = sqrt(3). Is that right, or is something missing from the question?)

The Euclidean distance is defined in your case as:
d(i, j) = sqrt( Sum_{k=1..6} (i_k - j_k)^2 )
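where i_k is the k-th attribute (0 or 1) of the i-th point and Sum denotes the summation over k = 1..6.
So you iterate over the six attributes, square each difference, add the squares up and take the square root. For example, points 3 and 4 differ in items 1, 4 and 6, so d(3,4) = sqrt(1 + 1 + 1) = sqrt(3), which matches the value you were given.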
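As a minimal sketch of that iteration, the following Python snippet computes all six pairwise distances for the four points in the question. The names `points`, `euclidean` and `distances` are chosen here for illustration and do not come from the original exercise.

```python
import math
from itertools import combinations

# The four transactions from the question, encoded as 0/1 vectors over items 1..6.
points = {
    1: (1, 1, 1, 0, 1, 0),
    2: (0, 1, 1, 1, 1, 0),
    3: (1, 0, 0, 1, 0, 0),
    4: (0, 0, 0, 0, 0, 1),
}

def euclidean(p, q):
    """Square root of the sum of squared attribute differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Distance for every unordered pair (i, j) with i < j.
distances = {
    (i, j): euclidean(points[i], points[j])
    for i, j in combinations(sorted(points), 2)
}

for (i, j), d in distances.items():
    print(f"d({i},{j}) = {d:.4f}")
```

Because the attributes are boolean, each squared difference is 1 exactly where the two transactions disagree, so the distances come out as d(1,2) = sqrt(2), d(1,3) = 2, d(1,4) = sqrt(5), d(2,3) = 2, d(2,4) = sqrt(5) and d(3,4) = sqrt(3).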

Related

Strategy to get the 'best' number

Given a vector of numbers (size = n), we want to find the number that is the 'best'.
The criterion for the best number is that its frequency must be > 50% of the total size of the vector.
Assume there will always be exactly one best number.
How would you solve this with O(1) space complexity and O(n) time complexity?
e.g. input: [1, 1, 1, 3, 3, 3, 3]
ans: 3, because its frequency (4) is greater than 50% of 7 (= 3.5)
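One well-known technique that fits these constraints is the Boyer-Moore majority vote. The sketch below is not taken from the original thread; the function name `majority_element` is made up for illustration, and it relies on the stated guarantee that a number with frequency > 50% always exists (without that guarantee, a second pass would be needed to verify the candidate).

```python
def majority_element(nums):
    """Boyer-Moore majority vote: O(n) time, O(1) extra space.

    Assumes some value occurs in more than half of the positions.
    """
    candidate = None
    count = 0
    for x in nums:
        if count == 0:
            # No surviving candidate: adopt the current value.
            candidate = x
            count = 1
        elif x == candidate:
            count += 1
        else:
            # A different value cancels one vote for the candidate.
            count -= 1
    return candidate

best = majority_element([1, 1, 1, 3, 3, 3, 3])
print(best)
```

The idea is that every non-majority value can cancel at most one occurrence of the majority value, so the majority value is the one left standing at the end.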

Spark method for subtracting 2 vectors

I am using Scala Spark. I have a dataframe with 2 columns, each containing a Vector of the same cardinality/size. I want to find the (absolute) distance between each pair of corresponding elements of the 2 Vectors and put the results in a vector in another column of the dataframe.
Example: [1, 3, 5, -2] - [-2, 5, 0, 1] = [3, 2, 5, 3]
I found the sqdist method, which gives me the sum of the squared distances between 2 Vectors, but how do I get the individual distance for each element of the vector?

Find the largest index of the minimum in Matlab

I have an array of positive numbers and there are some duplicates. I want to find the largest index of the minimum value.
For example, if a=[2, 3, 1, 1, 4, 1, 3, 2, 1, 5, 5] then [v, i] = min(a) returns i=3, however I want i=9.
Using find and min:
A = [2, 3, 1, 1, 4, 1, 3, 2, 1, 5, 5];
minA = min(A);                  % minimum value of A
maxIndex = max(find(A==minA));  % largest index at which that minimum occurs
min gets the minimum value, and find returns the indices of the values that meet the condition A==minA. max then returns the largest of those indices.
Here's a different idea, which only requires one function, sort:
[~,y] = sort(a,'descend');
i = y(end)
ans =
9
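You can use imregionalmin (from the Image Processing Toolbox) as well, with time complexity O(n). Note that imregionalmin marks regional (local) minima, so this gives the right answer only when the last regional minimum is also the global minimum, as it is for this A: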
largestMinIndex = find(imregionalmin(A),1,'last');
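For comparison, the same result can be obtained in a single pass without sorting or a toolbox. The following is a small Python sketch of that idea, not a translation of any answer above; `last_index_of_min` is a name invented here, and Python indices are 0-based, so the Matlab index 9 corresponds to 8.

```python
def last_index_of_min(a):
    """Return the largest 0-based index at which the minimum of a occurs."""
    if not a:
        raise ValueError("empty sequence")
    best_index = 0
    for i in range(1, len(a)):
        # '<=' (rather than '<') moves the index forward on ties,
        # so the last occurrence of the minimum wins.
        if a[i] <= a[best_index]:
            best_index = i
    return best_index

a = [2, 3, 1, 1, 4, 1, 3, 2, 1, 5, 5]
idx = last_index_of_min(a)
print(idx + 1)  # add 1 to get the 1-based (Matlab-style) index
```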

How to access elements of a matrix based on values of a vector

So say I have the below matrix
[1, 2, 3,
4, 5, 6,
7, 8, 9]
And I have a vector [1,3]
I want to access the 1st and 3rd row which would return
[1,2,3
7,8,9]
I need to be able to scale this up to about 1000 rows being grabbed based on values in the vector.
If A is your matrix and v your vector of indices, you just have to do A(v,:), which selects all columns of the rows listed in v.

How to get a regular sampled matrix in Scilab

I'm trying to program a function (or, even better, use one if it already exists) in Scilab that calculates regularly timed samples of values.
I.e.: I have a vector 'values' which contains the value of a signal at different times. These times are in the vector 'times'. So at time times(N), the signal has value values(N).
At the moment the times are not regular, so the variables 'times' and 'values' can look like:
times = [0, 2, 6, 8, 14]
values= [5, 9, 10, 1, 6]
This represents that the signal had value 5 from second 0 to second 2. Value 9 from second 2 to second 6, etc.
Therefore, if I want to calculate the signal's average value, I cannot just calculate the average of the vector 'values'. This is because, for example, the signal can stay at the same value for a long time, but there will be only one entry for it in the vector.
One option is to use the deltaT of each interval to calculate the mean, but I will also need to perform other calculations (average, etc.).
Another option is to create a function that, given a deltaT, samples the time and values vectors to produce an equally spaced time vector and the corresponding values. For example, with deltaT=2 and the previous vectors,
[sampledTime, sampledValues] = regularSample(times, values, 2)
sampledTime = [0, 2, 4, 6, 8, 10, 12, 14]
sampledValues = [5, 9, 9, 10, 1, 1, 1, 6]
This is easy if deltaT is small enough to fit exactly with all the times. If the deltaT is bigger, then the average of values or some approximation must be done...
Is there anything already done in Scilab?
How can this function be programmed?
Thanks a lot!
PS: I don't know if this is the correct forum to post scilab questions, so any pointer would also be useful.
If you like to implement it yourself, you can use a weighted sum.
times = [0, 2, 6, 8, 14]
values = [5, 9, 10, 1, 6]
weightedSum = 0
highestIndex = length(times)
for i=1:(highestIndex-1)
    // Get the amount of time a certain value contributed
    deltaTime = times(i+1) - times(i);
    // Add the weighted amount to the total weighted sum
    weightedSum = weightedSum + deltaTime * values(i);
end
totalTimeDelta = times($) - times(1);
average = weightedSum / totalTimeDelta
printf( "Result is %f", average )
Or, if you want functionally equivalent but more compact (and less readable) code:
timeDeltas = diff(times)
sum(timeDeltas.*values(1:$-1))/sum(timeDeltas)
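The answer above covers the time-weighted average (72 / 14, about 5.14, for the example data) but not the regularSample function asked about in the question. Below is a minimal Python sketch of that resampling, assuming the signal is piecewise constant (each value holds until the next timestamp) and that 'times' is sorted in increasing order. The names `regular_sample` and `delta_t` are illustrative, and the sketch simply picks the value in effect at each sample instant; it does not average when delta_t is larger than the original intervals.

```python
from bisect import bisect_right

def regular_sample(times, values, delta_t):
    """Resample a piecewise-constant signal on a regular grid (hold last value)."""
    # Number of grid points that fit between the first and last timestamp.
    n_points = int((times[-1] - times[0]) // delta_t) + 1
    sampled_times = []
    sampled_values = []
    for k in range(n_points):
        # Compute t from k instead of accumulating, to avoid float drift.
        t = times[0] + k * delta_t
        # Index of the last original sample whose time is <= t.
        idx = bisect_right(times, t) - 1
        sampled_times.append(t)
        sampled_values.append(values[idx])
    return sampled_times, sampled_values

times = [0, 2, 6, 8, 14]
values = [5, 9, 10, 1, 6]
sampled_time, sampled_values = regular_sample(times, values, 2)
print(sampled_time)
print(sampled_values)
```

With deltaT = 2 this reproduces the vectors given in the question. The same logic should carry over to a Scilab loop over the regular time grid.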