500000x2 array, find rows meeting specific requirements of 1st and 2nd column, MATLAB - matlab

I'm facing a dead end here..
I have collected a huge amount of data and I have isolated only the information that I'm interested in, into a 500K x 2 array of pairs.
1st column contains an ID of, let's say, an Access Point.
2nd column contains a string.
There might be multiple occurrences of an ID in the 1st column, and there can be anything written in the 2nd column. Remember, those are pairs in each row.
What I need to find in those 500K pairs:
I want to find all the IDs, or even the rows, that have 'hello' written in the 2nd column, AND as an additional requirement, there must be more than 2 occurrences of this 'pair'.
Even better, I want to save how many times this happens, if it happens more than 2 times.
so for example:
col1 (IDs): [ 1, 2, 6, 2, 1, 2, 3, 1]
col2 (str): [ 'hello', 'go', 'hello', 'piz', 'hello', 'da', 'mn', 'hello']
so the data that I am asking for is:
[ 1, 3 ], which means ID=1, 3 occurrences of ID=1 with str='hello'

I benchmarked this to see whether it can handle 500,000 rows in a reasonable time.
First, generate some test data (about 60 MB in total):
V = 1+round(rand(5E5,1).*1E4);
H = cell(1,length(V));
for ct = 1:length(H)
switch floor(rand(1)*10)
case 0
H{ct} = 'hello';
case 1
H{ct} = 'go';
case 2
H{ct} = 'piz';
case 3
H{ct} = 'da';
case 4
H{ct} = 'mn';
case 5
H{ct} = 'ds';
case 6
H{ct} = 'wf';
case 7
H{ct} = 'sf';
case 8
H{ct} = 'as';
case 9
H{ct} = 'sg';
end
end
The analysis
tic
a=ismember(H,{'hello'});
M = accumarray(V(a),1);
idx = find(M>2); % IDs with more than 2 occurrences, as the question asks
result = [idx,M(idx)];
toc
Elapsed time is 0.011699 seconds.
Alternative method with a loop
tic
M=zeros(max(V),1);
for ct = 1:length(H)
if strcmp(H{ct},'hello')
M(V(ct))=M(V(ct))+1;
end
end
idx = find(M>2); % IDs with more than 2 occurrences, as the question asks
result1 = [idx,M(idx)];
toc
Elapsed time is 0.192560 seconds.
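As a quick sanity check, the same accumarray idea can be run on the small example from the question. This is only a sketch: the variable names are reused from the benchmark, H is reshaped to a column so that the logical mask lines up with V, and the threshold is written as "more than 2" to match the stated requirement.

```matlab
% Example data from the question
V = [1; 2; 6; 2; 1; 2; 3; 1];
H = {'hello','go','hello','piz','hello','da','mn','hello'};

a = strcmp(H(:), 'hello');   % logical column: true where the string is 'hello'
M = accumarray(V(a), 1);     % M(id) = number of 'hello' rows for that ID
idx = find(M > 2);           % IDs with more than 2 occurrences
result = [idx, M(idx)];      % each row is [ID, count]
```

For this data, result holds the single row [1, 3], which is the answer the question expects.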

There are many possible solutions. Here is one: use a map structure. The key set of the map contains the IDs (those for which "hello" appears in the second column), and the value set contains the number of occurrences.
Run over the second column. When you find "hello", check whether the corresponding ID is already a key in the map. If it is, add 1 to the value associated with that key. Otherwise, add a new pair (key, value) = (the ID, 1).
When finished, remove all the pairs from the map whose values are less than or equal to 2. The remaining map is what you are looking for.
Matlab map: https://es.mathworks.com/help/matlab/map-containers.html
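The steps above could be sketched roughly as follows with containers.Map. The names ids, strs and counts are made up for illustration, and the data is the small example from the question rather than the real 500K rows.

```matlab
% Example data from the question (hypothetical variable names)
ids  = [1 2 6 2 1 2 3 1];
strs = {'hello','go','hello','piz','hello','da','mn','hello'};

% Map from ID (double) to number of 'hello' occurrences (double)
counts = containers.Map('KeyType', 'double', 'ValueType', 'double');
for k = 1:numel(ids)
    if strcmp(strs{k}, 'hello')
        if isKey(counts, ids(k))
            counts(ids(k)) = counts(ids(k)) + 1;  % ID seen before: increment
        else
            counts(ids(k)) = 1;                   % first occurrence of this ID
        end
    end
end

% Drop every ID that does not occur more than 2 times
allKeys = keys(counts);
for k = 1:numel(allKeys)
    if counts(allKeys{k}) <= 2
        remove(counts, allKeys{k});
    end
end
```

After the loops, counts should contain only the key 1 with the value 3. On large inputs a per-row loop like this is likely to be slower than a vectorized approach, so it is mainly useful when the IDs are not small positive integers.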

Related

delete range of rows of a cell array under certain condition, MATLAB

I have a very large cell array containing a lot of measurements. In general the measurements are in the range of 3 to 15 meters. My problem is that some of the data sets don't cover this range, so they are invalid, and I want to remove those ranges of rows from my cell array.
Here is what I have tried (in summary):
ind_cond = find(strcmp('Machine',A{:,1}));
A = table2cell(A);
for i = 1:(length(ind_cond)-1);
cond = ismember(A(ind_cond(i):ind_cond(i+1),11),'15');
if cond == 0
A(ind_cond(i):ind_cond(i+1),11) = [];
end
end
So first I search for the word 'Machine' because this is in all the headers so I can have the total number of measurements. Then I try to find the string '15' (I convert this later to num) on the range of the measurements, and if there is no '15' I want to delete that range of rows from the array.
I get the following error:
"A null assignment can have only one non-colon index"
Many thanks
EDIT:
Here is a picture of how the data looks (I don't know how to upload it; it is a .csv file, sorry).
Column 11 is the important one; it holds the data that I'm interested in. The problem is, for example, that some data sets (there are a lot of them, from 0.25 to 17 meters) are incomplete because they don't have the value '15', so I want to delete the entire data set in that case.
My first attempt was something like this:
for i = 1:(length(ind_cond)-1);
if ind_cond(i+1,1)- ind_cond(i,1) < 30 ;
A(ind_cond(i):ind_cond(i+1),:) = [];
end
end
It works well, but it doesn't delete all the problematic data, since I have one (1) very large data set that doesn't have '15', and the condition above can't eliminate it.
The picture "What i want to delete" shows an example of the problematic data; I want to delete all of that data.
[Image: Overview of data]
[Image: What i want to delete]
If the intent is to remove the cells that don't have the string '15', you can do the following:
A = [{'TEST'} {'Machine'} ; ...
{'test1'} {'3'}; ...
{'test2'} {'7'}; ...
{'test3'} {'16'}; ...
{'test4'} {'15'} ; ...
{'test5'} {'1'}; ...
{'test6'} {'8'}];
machine_cell = A(:,2);
% keep only the rows whose second column contains '15'
new_A = A(contains(machine_cell,'15'),:);
The new cell array will be:
>> new_A =
1×2 cell array
{'test4'} {'15'}
To do the opposite and keep all rows that don't contain '15', just negate contains:
new_A = A(~contains(machine_cell,'15'),:);
>> new_A =
6×2 cell array
{'TEST' } {'Machine'}
{'test1'} {'3' }
{'test2'} {'7' }
{'test3'} {'16' }
{'test5'} {'1' }
{'test6'} {'8' }
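On the error in the question: "A null assignment can have only one non-colon index" appears because A(rows,11) = [] tries to delete single cells out of the middle of the array, whereas whole rows have to be deleted with A(rows,:) = []. The following is a minimal sketch of removing every data set that has no '15', assuming each data set starts with a 'Machine' header row in column 1 and the values sit in column 2. The tiny array here is made up; the real data uses column 11.

```matlab
% Hypothetical data: two data sets, each starting with a 'Machine' header row
A = {'Machine', '';  'a', '3';  'a', '15'; ...
     'Machine', '';  'b', '7';  'b', '16'};

starts = find(strcmp(A(:,1), 'Machine'));  % first row of each data set
stops  = [starts(2:end) - 1; size(A, 1)];  % last row of each data set
keep   = true(size(A, 1), 1);

for k = 1:numel(starts)
    rows = starts(k):stops(k);
    % strcmp is an exact match; contains would also match '150' or '115'
    if ~any(strcmp(A(rows, 2), '15'))
        keep(rows) = false;                % mark the whole data set
    end
end

A = A(keep, :);                            % remove all marked rows at once
```

Building a logical mask first and indexing once at the end avoids the row indices shifting while rows are being deleted inside the loop.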

Optimal String comparison method swift

What is the best algorithm to use to get a percentage similarity between two strings. I have been using Levenshtein so far, but it's not sufficient. Levenshtein gives me the number of differences, and then I have to try and compute that into a similarity by doing:
100 - (no.differences/no.characters_in_scnd_string * 100)
For example, if I test how similar "ab" is to "abc", I get around 66% similarity, which makes sense, as "ab" is 2/3 similar to "abc".
The problem I encounter, is when I test "abcabc" to "abc", I get a similarity of 100%, as "abc" is entirely present in "abcabc". However, I want the answer to be 50%, because 50% of "abcabc" is the same as "abc"...
I hope this makes some sense... The second string is constant, and I want to test the similarity of different strings to that string. By similar, I mean that "cat dog" and "dog cat" should have an extremely high similarity despite the difference in word order.
Any ideas?
This library implements the Damerau–Levenshtein and Levenshtein distance algorithms.
You can check whether the StringMetric algorithms have what you need:
https://github.com/autozimu/StringMetric.swift
Using the Levenshtein algorithm with input:
case1 - distance(abcabc, abc)
case2 - distance(cat dog, dog cat)
Output is:
distance(abcabc, abc) = 3 // what is ok, if you count percent from `abcabc`
distance(cat dog, dog cat) = 6 // should be 0
So in the case of abcabc and abc we get 3, which is 50% of the length of the longer word abcabc (6 characters), exactly what you want to achieve.
For the second case with cats and dogs, my suggestion is to split the strings into words, compare all possible combinations of them, and choose the smallest result.
UPDATE:
The second case I will describe with pseudo code, because I'm not very familiar with Swift.
get(cat dog) and split to array of words ('cat', 'dog') // array1
get(dog cat) and split to array of words ('dog', 'cat') // array2
var minValue = 0
for every i-th element of `array1`
    var temp = maxIntegerValue // smallest distance(i, j) found so far
    index = 0 // index of the element of array2 that gave temp
    for every j-th element of `array2`
        if (distance(i, j) < temp)
            temp = distance(i, j)
            index = j
    // here we have found the smallest distance(i, j) value of i in 'array2'
    // now we should delete the matched element from 'array2'
    delete element at index from array2
    // add temp to minValue
    minValue = minValue + temp
Workflow will be like this:
After the first iteration of the outer for statement (for the value 'cat' from array1) we get 0, because the elements at i = 0 and j = 1 are identical. Then the element at j = 1 is removed from array2, after which array2 contains only the element dog.
In the second iteration of the outer for statement (for the value 'dog' from array1) we also get 0, because it is identical to dog from array2.
At least you now have an idea of how to deal with your problem. How exactly you implement it is up to you; you will probably use a different data structure.

Reference to non-existent field 'd'

My mat file contains 40,000 rows and two columns. I have to read it line by line
and then get values of last column in a single row.
Following is my code:
for v = 1:40000
firstRowB = data.d(v,:)
if(firstRowB(1,2)==1)
count1=count1+1;
end
if(firstRowB(1,2)==2)
count2=count2+1;
end
end
firstRowB gets the row, the code checks whether the last column equals 1 or 2, and then it increases the value of the respective count by 1.
But I keep getting this error:
Reference to non-existent field 'd'.
You could use vectorization (it is always convenient especially in Matlab). Taking advantage of the fact that true is one and false is zero, if you just want to count you can do :
count1 = sum ( data.d(:, 2) == 1 ) ;
count2 = sum (data.d(:,2) == 2 ) ;
in fact in general you could define :
getNumberOfElementsInLastColEqualTo = @(numb) sum(data.d(:,end) == numb);
counts = arrayfun(getNumberOfElementsInLastColEqualTo, [1 2]);
Hope this helps.
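As for the error itself: "Reference to non-existent field 'd'" means that the struct data has no field called d, which usually happens when the variable stored in the .mat file has a different name. A minimal sketch of how one might check this follows. The file is a temporary one created just for the example, and the variable name m is made up.

```matlab
% Hypothetical file and variable names, for illustration only
m = [10 1; 20 2; 30 1];
fname = [tempname '.mat'];
save(fname, 'm');               % the file stores a variable called 'm', not 'd'

data  = load(fname);            % load returns a struct, one field per variable
names = fieldnames(data);       % lists what is actually in the file: {'m'}

if isfield(data, 'd')
    values = data.d;
else
    values = data.(names{1});   % dynamic field name: use the first variable
end

count1 = sum(values(:, end) == 1);
count2 = sum(values(:, end) == 2);
delete(fname);                  % clean up the temporary file
```

Calling fieldnames(data) (or whos('-file', fname)) on the real file should show which name to use in place of d.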

Remove rows from a matrix

I have the array "A" with values:
101 101
0 0
61.6320000000000 0.725754779522671
73.7000000000000 0.830301150185882
78.2800000000000 0.490917508345341
81.2640000000000 0.602561200211232
82.6880000000000 0.435568593909153
And I wish to remove this first row and retain the shape of the array (2 columns), thus creating the array
0 0
61.6320000000000 0.725754779522671
73.7000000000000 0.830301150185882
78.2800000000000 0.490917508345341
81.2640000000000 0.602561200211232
82.6880000000000 0.435568593909153
I have used A = A(A~=101); , which removes the values as required - however it packs the array down to one column.
The best way is:
A = A(2:end, :)
But you can also do
A(1,:) = []
however it is slightly less efficient (see Deleting matrix elements by = [] vs reassigning matrix)
If you are looking to delete rows that equal a certain number try
A = A(A(:,1)~=101,:)
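Use all or any to control when a row is deleted. With all, as below, a row is kept only if none of its columns equals your value, so a row is deleted if any column equals it; replacing all with any deletes a row only if every column equals it: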
A = A(all(A~=101,2),:)
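The difference between the two variants is easiest to see on a small made-up matrix. This is only an illustrative sketch; the variable names are arbitrary.

```matlab
% Hypothetical matrix: row 1 is all 101, row 3 has 101 in one column only
A = [101 101; 0 0; 101 5];

% Keep rows where ALL entries differ from 101 (drop a row if any entry is 101)
noneEqual = A(all(A ~= 101, 2), :);

% Keep rows where ANY entry differs from 101 (drop a row only if all entries are 101)
notAllEqual = A(any(A ~= 101, 2), :);
```

Here noneEqual keeps only the row [0 0], while notAllEqual keeps [0 0] and [101 5]. Both results still have 2 columns, because the logical mask selects whole rows.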

Using SUM and UNIQUE to count occurrences of value within subset of a matrix

So, presume a matrix like so:
20 2
20 2
30 2
30 1
40 1
40 1
I want to count the number of times 1 occurs for each unique value of column 1. I could do this the long way by [sum(x(1:2,2)==1)] for each value, but I think this would be the perfect use for the UNIQUE function. How could I fix it so that I could get an output like this:
20 0
30 1
40 2
Sorry if the solution seems obvious, my grasp of loops is very poor.
Indeed unique is a good option:
u=unique(x(:,1))
res=arrayfun(@(y)length(x(x(:,1)==y & x(:,2)==1)),u)
Taking apart that last line:
arrayfun(fun,array) applies fun to each element of the array and returns the results in a new array.
Here the function is @(y)length(x(x(:,1)==y & x(:,2)==1)), which finds the length of the portion of x where the condition x(:,1)==y & x(:,2)==1 holds (this is called logical indexing). So for each of the unique elements, it counts the rows of x where the first column is that unique element and the second column is one.
Try this (as specified in this answer):
>> [c,~,d] = unique(a(a(:,2)==1))
c =
30
40
d =
1
2
2
>> counts = accumarray(d(:),1,[],@sum)
counts =
1
2
>> res = [c,counts]
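Note that filtering on column 2 first drops the value 20, which has no rows equal to 1. If the zero count is wanted as in the question's expected output, one possible variation is to group on all rows and sum a 0/1 flag instead. This is a sketch using the matrix from the question, not the approach of the linked answer.

```matlab
% Matrix from the question
x = [20 2; 20 2; 30 2; 30 1; 40 1; 40 1];

[u, ~, g] = unique(x(:,1));          % g(k) is the group index of row k in u
isOne  = double(x(:,2) == 1);        % 1 where column 2 equals 1, else 0
counts = accumarray(g, isOne);       % sum the flags within each group
res    = [u, counts];
```

For this input, res is [20 0; 30 1; 40 2].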
Suppose you have an array of various integers in 'array'.
The tabulate function (from the Statistics and Machine Learning Toolbox) sorts the unique values and counts the occurrences.
table = tabulate(array)
Look for the counts of your unique values in column 2 of the result.