What is the best algorithm for getting a percentage similarity between two strings? I have been using Levenshtein so far, but it's not sufficient. Levenshtein gives me the number of differences, and then I have to convert that into a similarity by doing:
100 - (no.differences/no.characters_in_scnd_string * 100)
For example, if I test how similar "ab" is to "abc", I get around 66% similarity, which makes sense, as "ab" is 2/3 similar to "abc".
The problem I encounter is when I test "abcabc" against "abc": I get a similarity of 100%, as "abc" is entirely present in "abcabc". However, I want the answer to be 50%, because 50% of "abcabc" is the same as "abc"...
I hope this makes some sense... The second string is constant, and I want to test the similarity of different strings to that string. By similar, I mean "cat dog" and "dog cat" have an extremely high similarity despite the difference in word order.
Any ideas?
This is an implementation of the Damerau–Levenshtein distance and Levenshtein distance algorithms.
You can check StringMetric; its algorithms have what you need:
https://github.com/autozimu/StringMetric.swift
Using the Levenshtein algorithm with input:
case1 - distance(abcabc, abc)
case2 - distance(cat dog, dog cat)
Output is:
distance(abcabc, abc) = 3 // which is OK if you compute the percentage from `abcabc`
distance(cat dog, dog cat) = 6 // should be 0
So in the case of abcabc and abc we get 3, which is 50% of the longer string abcabc, exactly what you want to achieve.
The second case, with cats and dogs: my suggestion is to split the strings into words, compare all possible combinations of them, and choose the smallest result.
UPDATE:
The second case I will describe with pseudo code, because I'm not very familiar with Swift.
get(cat dog) and split to array of words ('cat', 'dog') // array1
get(dog cat) and split to array of words ('dog', 'cat') // array2
var minValue = 0;
for every i-th element of `array1`
    var temp = maxIntegerValue // smallest 'distance(i, j)' found so far
    index = 0 // remember index of the smallest temp
    for every j-th element of `array2`
        if (distance(i, j) < temp)
            temp = distance(i, j)
            index = j
    // here we have found the smallest distance(i, j) value of i in 'array2'
    // now we should delete the matched element from 'array2'
    delete element at index from array2
    // add temp to minValue
    minValue = minValue + temp
Workflow will be like this:
After the first iteration of the outer for loop (for the value 'cat' in array1) we get 0, because element i = 0 of array1 and element j = 1 of array2 are identical. Then element j = 1 is removed from array2, leaving only dog.
In the second iteration of the outer for loop (for the value 'dog' in array1) we also get 0, because it is identical to dog in array2.
At least now you have an idea of how to deal with your problem. How exactly you implement it is up to you; you will probably use a different data structure.
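The pseudocode above could be sketched in Python roughly as follows. This is only an illustration, not the StringMetric library's API: the function names `levenshtein`, `word_order_distance` and `similarity_percent` are made up for this example, and counting unmatched words at their full length is an assumption the answer does not spell out.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_order_distance(s1: str, s2: str) -> int:
    """Greedy matching: each word of s1 takes its closest remaining word of s2."""
    words1, words2 = s1.split(), s2.split()
    total = 0
    for w1 in words1:
        if not words2:
            total += len(w1)  # assumption: an unmatched word counts in full
            continue
        best = min(range(len(words2)), key=lambda j: levenshtein(w1, words2[j]))
        total += levenshtein(w1, words2[best])
        del words2[best]  # a word of s2 can be matched only once
    total += sum(len(w) for w in words2)  # leftover words of s2 count in full
    return total


def similarity_percent(s1: str, s2: str) -> float:
    """Normalize the distance by the longer string's length."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(s1, s2) / longest)
```

With these definitions, `similarity_percent("abcabc", "abc")` comes out at 50.0 and `word_order_distance("cat dog", "dog cat")` at 0. The greedy matching is not guaranteed to find the best overall pairing of words; it simply follows the loop described above.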
I have a 1 x 400 array in which all valid element values are above 1500. However, some elements have values < 50; these are faulty measurements, and I would like to replace each of them in the main array with the mean of the elements before and after it.
For instance, element number 17 is below 50 so I want to take the mean of elements 16 and 18 and replace element 17 with the new mean.
Can someone help me, please? many thanks in advance.
No language is specified in the question, but for Python you could work with List Comprehension:
# array with 400 values, some of which are incorrect
arr = [...]
arr = [arr[i] if arr[i] >= 50 else (arr[i-1]+arr[i+1])/2 for i in range(len(arr))]
That is, if arr[i] is less than 50, it'll be replaced by the average value of the element before and after it. There are two issues with this approach.
If i is the first or last element, then one of the two values will be undefined, and no mean can be obtained. This can be fixed by just using the value of the available neighbour, as specified below
If two values in a row are very low, the leftmost one will use the rightmost one to calculate its value, which will result in a very low value. This is a problem that may not occur for you in practice, but it is an inherent result of the way you wish to recalculate values, and you might want to keep it in mind.
Improved version, keeping in mind the edge cases:
# don't alter the first and last item, even if they're low
arr = [arr[i] if arr[i] >= 50 or i == 0 or i+1 == len(arr) else (arr[i-1]+arr[i+1])/2 for i in range(len(arr))]
# replace the first and last element if needed
if arr[0] < 50:
    arr[0] = arr[1]
if arr[-1] < 50:
    arr[-1] = arr[-2]
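If runs of consecutive low values do occur, one possible variation is to average the nearest valid neighbours on each side instead of the immediate ones. The sketch below is an assumption about what you might want, not part of the approach above; the function name `replace_low_values` and the `threshold` parameter are made up for this example.

```python
def replace_low_values(values, threshold=50):
    """Return a copy in which each value below `threshold` is replaced by the
    mean of the nearest valid (>= threshold) values to its left and right."""
    result = list(values)
    n = len(values)
    for i, v in enumerate(values):
        if v >= threshold:
            continue
        # nearest valid value to the left, or None if there is none
        left = next((values[j] for j in range(i - 1, -1, -1)
                     if values[j] >= threshold), None)
        # nearest valid value to the right, or None if there is none
        right = next((values[j] for j in range(i + 1, n)
                      if values[j] >= threshold), None)
        neighbours = [x for x in (left, right) if x is not None]
        if neighbours:  # leave the value alone if no valid neighbour exists
            result[i] = sum(neighbours) / len(neighbours)
    return result
```

For example, `replace_low_values([1600, 10, 20, 1800])` replaces both low values with 1700.0, and a low first or last element simply takes the value of its single valid neighbour.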
I hope this answer was useful for you, even if you intend to use another language or framework than python.
I was asked this question in a HackerEarth test and I couldn't wrap my head around even forming the algorithm.
The question is -
Count the number of substrings of a string, such that any of their permutations is a palindrome.
So, for aab, the answer is 5 - a, a, b, aa and aab (which can be permuted to form aba).
I feel this is dynamic programming, but I can't find what kind of relations the subproblems might have.
Edit:
So I think the recursive relation might be
dp[i] = dp[i-1] + 1 if str[i] has already appeared before and
substring ending at i-1 has at most 2 characters with odd frequency
else dp[i] = dp[i-1]
No idea if this is right.
I can think of O(n^2) - traverse substrings of length > 1, from indexes (0, 1) up to (0, n-1), then from (1, n-1) down to (1, 3), then from (2, 3) up to (2, n-2), then from (3, n-2) down to (3, 5)...etc.
While traversing, maintain a map of current frequency for each character, as well as totals of the number of characters with odd counts and the number of characters with even counts. Update those on each iteration and add to the total count of palindromic permuted substrings if we are on a substring with (1) odd length and only one character with odd frequency, or (2) even length and no character with odd frequency.
(Add the string length for the count of single character palindromes.)
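A minimal Python sketch of this O(n^2) idea follows. It walks the substrings in plain start/end order rather than the zig-zag order described above, and it tracks only the set of characters with an odd count; the function name is hypothetical.

```python
def count_palindromic_permutation_substrings_quadratic(s: str) -> int:
    """Count substrings that have at least one palindromic permutation, in O(n^2)."""
    count = 0
    for start in range(len(s)):
        odd = set()  # characters with an odd count in s[start:end + 1]
        for end in range(start, len(s)):
            ch = s[end]
            if ch in odd:
                odd.remove(ch)  # count became even
            else:
                odd.add(ch)     # count became odd
            # odd length allows exactly one odd character, even length allows
            # none; both cases reduce to "at most one odd character"
            if len(odd) <= 1:
                count += 1
    return count
```

Single-character substrings are counted by the same loop, so the string length does not need to be added separately here.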
If I did not misunderstand your question, I tend to believe this is a math problem. Say the length of the string is n; then the answer should be n * (n+1) / 2, which is the sum 1 + 2 + ... + n, i.e. the n-th partial sum of the series 1 + 2 + 3 + 4 + ... See https://en.wikipedia.org/wiki/1_%2B_2_%2B_3_%2B_4_%2B_%E2%8B%AF
For example, string abcde, we can get substrings
a, b, c, d, e,
ab, bc, cd, de,
abc, bcd, cde,
abcd, bcde,
abcde .
You may find the answer from the way I listed the substrings.
So here is my solution that may help you.
You can get a list of every possible substring of the input by running a nested loop, and for every substring you have to check whether it can form a palindrome or not.
Now, how to check whether a string/substring can form a palindrome:
If more than one letter in the substring occurs an odd number of times, then it can't form a palindrome. Here is the code:
bool stringCanbeFormAPalindrome(string s)
{
    // number of letters with an odd count, and the per-letter counts
    int oddValues = 0, alphabet[26] = {0};
    for (int i = 0; i < s.length(); i++)
    {
        alphabet[s[i] - 'a']++;
    }
    for (int i = 0; i < 26; i++)
    {
        if (alphabet[i] % 2 == 1)
        {
            oddValues++;
            if (oddValues > 1) return false;
        }
    }
    return true;
}
Hope that helps.
You can do it easily in O(N) time (times the alphabet size) and O(N) space.
Notice that the only thing that determines whether some permutation of a substring is a palindrome is the parity of each character's count in it. So keep a bitmask of the parity of every character over the prefix seen so far. For any valid substring, the mask of the prefix before it can differ from the current mask in at most 1 bit, so iterate over which bit differs and add the corresponding count.
Here's C++ code (assuming unordered_map is O(1) per query):
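string s;
cin >> s;
int n = s.length();
long long ans = 0; // the count can exceed the range of int
unordered_map<int, int> um; // um[m] = number of prefixes with parity mask m
um[0] = 1; // the empty prefix
int mask = 0;
for (int i = 0; i < n; ++i) {
    mask ^= 1 << (s[i] - 'a');
    ans += um[mask]; // substrings with no odd-count letter
    for (int j = 25; j >= 0; --j) { // substrings where only letter j is odd
        ans += um[mask ^ (1 << j)];
    }
    um[mask]++;
}
cout << ans;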
Take care of integer overflow: the count can exceed the range of a 32-bit int.
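For comparison, here is the same prefix-parity idea as a short Python sketch. It assumes the input contains only lowercase letters a to z, as the C++ code does; the function name is made up for this example. Python integers do not overflow, so that concern does not arise here.

```python
def count_palindromic_permutation_substrings(s: str) -> int:
    """Count substrings with a palindromic permutation using prefix parity masks."""
    seen = {0: 1}  # seen[m] = number of prefixes whose parity mask is m
    mask = 0       # bit k is set when letter k has occurred an odd number of times
    total = 0
    for ch in s:
        mask ^= 1 << (ord(ch) - ord("a"))
        # earlier prefixes with the same mask: substring has no odd-count letter
        total += seen.get(mask, 0)
        # earlier prefixes differing in exactly one bit: exactly one odd-count letter
        for bit in range(26):
            total += seen.get(mask ^ (1 << bit), 0)
        seen[mask] = seen.get(mask, 0) + 1
    return total
```

For the string aab from the question this returns 5.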
I have a set of IDs associated with costs which is just a double value. IDs are integers and unique. Two IDs may have same costs. I stored them as:-
a=containers.Map('KeyType','uint32','ValueType','double');
a(1)=7.3
a(2)=8.4
a(3)=7.3
Now I want to find the minimum cost.
b=[];
c=values(a);
b=[b,c{:}];
cost_min=min(b);
Now I want to find all IDs associated with the minimum cost, i.e. 1 and 3 for the minimum cost 7.3. I can collect all the keys into an array and then do a for loop over this array. Is there a better way to do this entire thing in Matlab so that for loops are not required?
A sparse matrix can work as a hashmap; just do this:
a = sparse(1:3,1,[7.3 8.4 7.3])
find(a == min(nonzeros(a)))
There are methods which can be used on maps for this kind of operation:
http://se.mathworks.com/help/matlab/ref/containers.map-class.html
Finding the minimum value and the keys that hold it can be done something like this:
a=containers.Map('KeyType','uint32','ValueType','double');
a(1)=7.3;
a(3)=8.4;
a(4)=7.3;
minval = inf;
minkeys = -1;
for k = keys(a)
val = a.values(k);
val = val{1};
if (val < minval(1))
minkeys = k;
minval = val;
elseif (val == minval(1))
minkeys = [minkeys,k];
end
end
disp(minval);
disp(minkeys);
This is not efficient, though, and searching by value is clumsy for maps. That is not what they are intended for: maps are supposed to do efficient key lookup. If you are going to do a lot of key lookups and that is what takes time, then use a map. If you need to do a lot of value searches, I would recommend that you use a matrix (or two arrays) for this instead.
idx = [1;3;4];
val = [7.3,8.4,7.3];
minval = min(val);
minidx = idx(val==minval);
disp(minval);
disp(minidx);
There is also another post with an example showing how a sparse matrix can be used as a hashmap. Let the index become the key. A sparse matrix takes about 3 times the memory per non-zero element compared with an ordinary array, but a map uses more memory than an array as well.
As simple as in title. I have nx1 sized vector p. I'm interested in the maximum value of r = p/foo - floor(p/foo), with foo being a scalar, so I just call:
max_value = max(p/foo-floor(p/foo))
How can I get which value of p gave out max_value?
I thought about calling:
[max_value, max_index] = max(p/foo-floor(p/foo))
but soon I realised that max_index is pretty useless. I'm sorry for asking this; real beginner here.
Having broken the issue into pieces, I realized there's no unique correspondence between the values of p and the values in my derived vector p/foo-floor(p/foo), so there's a logical issue rather than a language one.
However, given my input data, I know that the solution is unique. How can I fix this?
I ended up doing:
result = p(p/foo-floor(p/foo) == max(p/foo-floor(p/foo)))
Looks terrible, so if you know any other way...
Once you have the index, use it:
result = p(max_index)
You can create a new vector with your, let's say, "transformed" values:
p2 = (p/foo-floor(p/foo))
and then just use find to find the max values on p2:
max_index = find(p2 == max(p2))
That will return the index or indices of p2 with the max value of that operation. Finally, just look up the original value in p:
p(max_index)
in 1 line, this is:
p(find((p/foo-floor(p/foo) == max((p/foo-floor(p/foo))))))
which is basically the same thing you did in the end :)
I want to count the number of set bits in a uint in Specman:
var x: uint;
gen x;
var x_set_bits: uint;
x_set_bits = ?;
What's the best way to do this?
One way I've seen is:
x_set_bits = pack(NULL, x).count(it == 1);
pack(NULL, x) converts x to a list of bits.
count acts on the list and counts all the elements for which the condition holds. In this case the condition is that the element equals 1, which comes out to the number of set bits.
I don't know Specman, but another way I've seen this done looks a bit cheesy, but tends to be efficient: keep a 256-element array in which each element holds the number of set bits in its own index value. For example (pseudocode):
bit_count = [0, 1, 1, 2, 1, ...]
Thus, bit_count[2] == 1, because the value 2, in binary, has a single "1" bit. Similarly, bit_count[255] == 8.
Then, break the uint into bytes, use the byte values to index into the bit_count array, and add the results. Pseudocode:
total = 0
for byte in list_of_bytes
    total = total + bit_count[byte]
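As a runnable illustration of the lookup-table idea (in Python, not Specman), a sketch might look like this. The names `BIT_COUNT` and `count_set_bits` are made up for this example, and the input is assumed to be a non-negative integer.

```python
# Lookup table: BIT_COUNT[v] is the number of set bits in the byte value v.
BIT_COUNT = [bin(value).count("1") for value in range(256)]


def count_set_bits(x: int) -> int:
    """Count set bits of a non-negative integer one byte at a time."""
    total = 0
    while x:
        total += BIT_COUNT[x & 0xFF]  # look up the lowest byte
        x >>= 8                       # move on to the next byte
    return total
```

Here the table is built once at import time, so each call costs one lookup per byte of the value.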
EDIT
This issue shows up in the book Beautiful Code, in the chapter by Henry S. Warren. Also, Matt Howells shows a C-language implementation that efficiently calculates a bit count. See this answer.