problems and inverted index and frequency - word

An inverted index (https://en.wikipedia.org/wiki/Inverted_index) is a data structure that supports full-text search. Given a collection of text documents, the inverted index lists each distinct word together with the documents in which it appears and how often it appears in each. The process of building the inverted index from the set of documents is called indexing.
Given the documents:
{d1: "I saw the cat on the mat",
d2: "I saw the dog on the mat",
d3: "I saw the cat and the rat sat on the mat"}
use MapReduce to generate an inverted index that has the following structure:
{'w1': [(docId1, numOccu1), …, (docIdN, numOccuN)],
'w2': […], …}
where w1, w2, ... are the distinct words that appear (I, saw, ...), docId are the document identifiers (docId1, docId2, ...) and numOccu is the number of times the word appears in that document.
You are asked to write the map and reduce functions in pseudocode and to describe the results of those functions.
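One possible way to approach the exercise, sketched in Python rather than pseudocode. It simulates the map, shuffle and reduce phases in a single process; the names `map_doc`, `reduce_word` and `build_inverted_index` are made up for this illustration, and counting the words inside the mapper (instead of emitting one pair per occurrence and summing in the reducer) is just one of several reasonable designs.

```python
from collections import Counter, defaultdict

DOCS = {
    "d1": "I saw the cat on the mat",
    "d2": "I saw the dog on the mat",
    "d3": "I saw the cat and the rat sat on the mat",
}

def map_doc(doc_id, text):
    """Map: emit one (word, (doc_id, count)) pair per distinct word of a document."""
    for word, count in Counter(text.split()).items():
        yield word, (doc_id, count)

def reduce_word(word, postings):
    """Reduce: gather all (doc_id, count) pairs of one word into a sorted list."""
    return word, sorted(postings)

def build_inverted_index(docs):
    # Shuffle phase: group the mapper output by key (the word).
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, posting in map_doc(doc_id, text):
            grouped[word].append(posting)
    # Reduce phase: one call per distinct word.
    return dict(reduce_word(word, postings) for word, postings in grouped.items())

index = build_inverted_index(DOCS)
print(index["the"])  # [('d1', 2), ('d2', 2), ('d3', 3)]
print(index["dog"])  # [('d2', 1)]
```

The map output for d1 would be pairs such as ('the', ('d1', 2)), and the reduce output for the key 'the' is the complete posting list of that word.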

If we instead index by (document number, word position within the document), the positional index is:
I (1,1); (2,1); (3,1)
saw (1,2); (2,2); (3,2)
the (1,3); (1,6); (2,3); (2,6); (3,3); (3,6); (3,10)
cat (1,4); (3,4)
on (1,5); (2,5); (3,9)
mat (1,7); (2,7); (3,11)
dog (2,4)
and (3,5)
rat (3,7)
sat (3,8)
The word "I" is in document 1 ("I saw the cat on the mat") at word position 1, so it has an entry (1,1). The word "cat" appears in documents 1 and 3, at the 4th position in each (positions are counted in words, not characters). The index may also carry weights, frequencies, or other indicators.
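The positional listing above can be reproduced with a few lines of Python. This is only a sketch: the function name `positional_index` is hypothetical, and words are split on whitespace with documents and positions numbered from 1, as in the listing.

```python
from collections import defaultdict

def positional_index(docs):
    """Map each word to a list of (document number, word position) pairs."""
    index = defaultdict(list)
    for doc_no, text in enumerate(docs, start=1):
        for position, word in enumerate(text.split(), start=1):
            index[word].append((doc_no, position))
    return dict(index)

docs = [
    "I saw the cat on the mat",
    "I saw the dog on the mat",
    "I saw the cat and the rat sat on the mat",
]
index = positional_index(docs)
print(index["cat"])  # [(1, 4), (3, 4)]
print(index["on"])   # [(1, 5), (2, 5), (3, 9)]
```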

Related

Scala Apriori algorithm: if the first element is the same, generate three-item sets from a list of two-item sets

I want this result:
if the input list is >> List(List("A","B"), List("A","C"), List("B","C"), List("B","D"))
the output should be >> List(List("A","B","C"), List("B","C","D"))
I think it should be done as follows >> all lists with the same first element are grouped, e.g. if the first element is "A" then the group will be ("A","B","A","C").distinct = ("A","B","C")
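The grouping idea described in the question can be sketched as follows. The sketch is in Python rather than Scala, the function name `merge_by_first` is made up, and it only performs the "group by first element, then take the distinct items" step: it does not do the candidate pruning that a full Apriori join would also need.

```python
def merge_by_first(item_sets):
    """Merge all item sets that share the same first element, keeping distinct items in order."""
    groups = {}  # first element -> merged list (dicts keep insertion order)
    for items in item_sets:
        merged = groups.setdefault(items[0], [])
        for item in items:
            if item not in merged:
                merged.append(item)
    return list(groups.values())

pairs = [["A", "B"], ["A", "C"], ["B", "C"], ["B", "D"]]
print(merge_by_first(pairs))  # [['A', 'B', 'C'], ['B', 'C', 'D']]
```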

Palindromic permutated substrings

I was asked this question in a HackerEarth test and I couldn't wrap my head around even forming the algorithm.
The question is -
Count the number of substrings of a string, such that any of their permutations is a palindrome.
So, for aab, the answer is 5 - a, a, b, aa and aab (which can be permuted to form aba).
I feel this is dynamic programming, but I can't find what kind of relations the subproblems might have.
Edit:
So I think the recursive relation might be
dp[i] = dp[i-1] + 1 if str[i] has already appeared before and
substring ending at i-1 has at most 2 characters with odd frequency
else dp[i] = dp[i-1]
No idea if this is right.
I can think of O(n^2) - traverse substrings of length > 1, from indexes (0, 1) up to (0, n-1), then from (1, n-1) down to (1, 3), then from (2, 3) up to (2, n-2), then from (3, n-2) down to (3, 5)...etc.
While traversing, maintain a map of current frequency for each character, as well as totals of the number of characters with odd counts and the number of characters with even counts. Update those on each iteration and add to the total count of palindromic permuted substrings if we are on a substring with (1) odd length and only one character with odd frequency, or (2) even length and no character with odd frequency.
(Add the string length for the count of single character palindromes.)
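A simplified sketch of the O(n^2) idea in Python. Instead of the zigzag traversal described above it uses a plain pair of nested loops over (start, end), and it keeps only the set of characters whose count is currently odd. Since the length of a substring has the same parity as its number of odd-count characters, the two conditions above collapse into "at most one character with odd frequency". The function name `count_quadratic` is hypothetical.

```python
def count_quadratic(s):
    """Count substrings of s that have some palindromic permutation, in O(n^2)."""
    total = 0
    for start in range(len(s)):
        odd_chars = set()  # characters with an odd count in s[start:end + 1]
        for end in range(start, len(s)):
            odd_chars ^= {s[end]}  # toggle the parity of the new character
            if len(odd_chars) <= 1:
                total += 1
    return total

print(count_quadratic("aab"))  # 5
```

Single-character substrings are counted by the loop itself here, so the string length does not have to be added separately.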
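If I did not misunderstand your question, I tend to believe this is a math problem. Say the length of the string is n; then the number of substrings is n * (n+1) / 2, the finite sum 1 + 2 + ... + n. See https://en.wikipedia.org/wiki/1_%2B_2_%2B_3_%2B_4_%2B_%E2%8B%AF Note that this counts every substring, so it equals the answer only when every substring can be permuted into a palindrome (for aab it gives 6, not 5).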
For example, string abcde, we can get substrings
a, b, c, d, e,
ab, bc, cd, de,
abc, bcd, cde,
abcd, bcde,
abcde .
You may find the answer from the way I listed the substrings.
So here is my solution that may help you.
You can get a list of every possible substring of the input by running a nested loop, and for every substring you have to check whether it can form a palindrome.
Now, how to check if a string/substring can form a palindrome:
If more than one letter occurs an odd number of times in the substring, then it can't form a palindrome. Here is the code:
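#include <string>
using namespace std;

// Returns true if some permutation of s is a palindrome.
// Assumes s contains only the lowercase letters 'a'..'z'.
bool stringCanbeFormAPalindrome(string s)
{
    int oddValues = 0, alphabet[26] = {0};  // both counters must start at zero
    for(int i = 0; i < s.length(); i++)
    {
        alphabet[s[i]-'a']++;
    }
    for(int i = 0; i < 26; i++)
    {
        if(alphabet[i] % 2 == 1)
        {
            oddValues++;
            if(oddValues > 1) return false;
        }
    }
    return true;
}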
Hope that helps.
You can do it easily in O(N) time (times the alphabet size) and O(N) space.
Notice that the only thing that decides whether a substring can be permuted into a palindrome is the parity of each character's count in it. So keep a bitmask holding the parity of every character over the prefix read so far. For any valid substring, the mask of the prefix before it differs from our current mask in at most 1 bit. So count the earlier prefixes with the same mask, then iterate over which bit is different and add the corresponding counts.
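Here's C++ code (assuming unordered_map is O(1) per query):
#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main(){
    string s;
    cin>>s;
    int n=s.length();
    long long ans=0;              // can reach n*(n+1)/2, too large for int
    unordered_map<int,int>um;     // parity mask of a prefix -> number of prefixes with that mask
    um[0]=1;                      // the empty prefix
    int mask=0;
    for(int i=0;i<n;++i){
        mask^=1<<(s[i]-'a');      // toggle the parity bit of the current letter
        ans+=um[mask];            // substrings in which every letter count is even
        for(int j=25;j>=0;--j){   // 26 letters, bits 0..25
            ans+=um[mask^(1<<j)]; // substrings in which only letter j has an odd count
        }
        um[mask]++;
    }
    cout<<ans;
}
Take care of integer overflow: the count can exceed the range of int, which is why ans is a long long.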
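The same prefix-parity-mask technique, sketched in Python for comparison. It assumes the input contains only the lowercase letters a to z; the function name `count_linear` is made up. Python integers do not overflow, so the overflow caveat does not apply to this version, and `dict.get` is used so that lookups of absent masks do not insert new entries.

```python
def count_linear(s):
    """Count substrings of s that have some palindromic permutation, using prefix parity masks."""
    seen = {0: 1}  # parity mask of a prefix -> number of prefixes with that mask
    mask = 0
    total = 0
    for ch in s:
        mask ^= 1 << (ord(ch) - ord("a"))  # toggle the parity bit of this letter
        total += seen.get(mask, 0)          # substrings with all counts even
        for bit in range(26):               # substrings with exactly one odd count
            total += seen.get(mask ^ (1 << bit), 0)
        seen[mask] = seen.get(mask, 0) + 1
    return total

print(count_linear("aab"))  # 5
```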

Optimal String comparison method swift

What is the best algorithm to use to get a percentage similarity between two strings? I have been using Levenshtein so far, but it's not sufficient. Levenshtein gives me the number of differences, and then I have to try to convert that into a similarity by doing:
100 - (no.differences/no.characters_in_scnd_string * 100)
For example, if I test how similar "ab" is to "abc", I get around 66% similarity, which makes sense, as "ab" is 2/3 similar to "abc".
The problem I encounter, is when I test "abcabc" to "abc", I get a similarity of 100%, as "abc" is entirely present in "abcabc". However, I want the answer to be 50%, because 50% of "abcabc" is the same as "abc"...
I hope this makes some sense... The second string is constant, and I want to test the similarity of different strings to that string. By similar, I mean "cat dog" and "dog cat" have an extremely high similarity despite the difference in word order.
Any ideas?
This is an implementation of the Damerau–Levenshtein distance and Levenshtein distance algorithms.
You can check StringMetric; its algorithms may have what you need:
https://github.com/autozimu/StringMetric.swift
Using the Levenshtein algorithm with input:
case1 - distance(abcabc, abc)
case2 - distance(cat dog, dog cat)
Output is:
distance(abcabc, abc) = 3 // which is OK, if you compute the percentage from `abcabc`
distance(cat dog, dog cat) = 6 // should be 0
So in the case of abcabc and abc we get 3, which is 50% of the length of the longer string abcabc, exactly what you want to achieve.
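To make the percentage computation concrete, here is a minimal sketch in Python (not Swift). It normalises the Levenshtein distance by the length of the longer string, as suggested above, rather than by the second string; `levenshtein` and `similarity_percent` are names chosen for this illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]

def similarity_percent(a, b):
    """Similarity in percent, relative to the longer of the two strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)

print(levenshtein("abcabc", "abc"))         # 3
print(similarity_percent("abcabc", "abc"))  # 50.0
print(levenshtein("cat dog", "dog cat"))    # 6
```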
The second case with cats and dogs: my suggestion is to split the strings into words, compare all possible combinations of them, and choose the smallest result.
UPDATE:
The second case I will describe with pseudocode, because I'm not very familiar with Swift.
get(cat dog) and split to array of words ('cat' , 'dog') //array1
get(dog cat) and split to array of words ('dog' , 'cat') //array2
var minValue = 0;
for every i-th element of `array1`
    var temp = maxIntegerValue // smallest distance(i, j) found so far
    index = 0 // index of the element of `array2` that gives temp
    for every j-th element of `array2`
        if (distance(i, j) < temp)
            temp = distance(i, j)
            index = j
    // here we have found the smallest distance(i, j) value of i in 'array2'
    // now we should delete that element from 'array2'
    delete element at index from array2
    // add temp to minValue
    minValue = minValue + temp
The workflow will be like this:
After the first iteration of the outer for statement (for value 'cat' of array1) we get 0, because i = 0 and j = 1 are identical. Then j = 1 is removed from array2, after which array2 contains only the element dog.
In the second iteration of the outer for statement (for value 'dog' of array1) we also get 0, because it is identical to dog from array2.
At least now you have an idea of how to deal with your problem. How exactly you implement it is up to you; you will probably use a different data structure.
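The greedy word matching described above could look like this in Python (a sketch, not a Swift implementation). The names `levenshtein` and `wordwise_distance` are made up, and charging leftover unmatched words by their full length is an assumption of this sketch; the description above does not say what to do when the two strings have different numbers of words.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def wordwise_distance(text_a, text_b):
    """Greedily pair each word of text_a with its closest unused word of text_b."""
    remaining = text_b.split()
    total = 0
    for word in text_a.split():
        if not remaining:
            total += len(word)  # assumption: an unmatched word costs its length
            continue
        best = min(range(len(remaining)), key=lambda j: levenshtein(word, remaining[j]))
        total += levenshtein(word, remaining[best])
        del remaining[best]     # each word of text_b is used at most once
    total += sum(len(w) for w in remaining)  # same assumption for leftovers
    return total

print(wordwise_distance("cat dog", "dog cat"))  # 0
print(wordwise_distance("cat dog", "dog cot"))  # 1
```

Because the pairing is greedy, the result is not guaranteed to be the minimum over all possible pairings; it is only the simple strategy outlined in the pseudocode.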

500000x2 array, find rows meeting specific requirements of 1st and 2nd column, MATLAB

I'm facing a dead end here..
I have collected a huge amount of data and I have isolated only the information that I'm interested in, into a 500K x 2 array of pairs.
1st column contains an ID of, let's say, an Access Point.
2nd column contains a string.
There might be multiple occurrences of an ID in the 1st column, and there can be anything written in the 2nd column. Remember, those are pairs in each row.
What I need to find in those 500K pairs:
I want to find all the IDs, or even the rows, that have 'hello' written in the 2nd column, AND as an additional requirement, there must be more than 2 occurrences of this 'pair'.
Even better, I want to save how many times this happens, if it happens more than 2 times.
so for example:
col1 (IDs): [ 1, 2, 6, 2, 1, 2, 3, 1]
col2 (str): [ 'hello', 'go', 'hello', 'piz', 'hello', 'da', 'mn', 'hello']
So the data I am asking for is:
[ 1, 3 ] , which means ID=1, 3 occurrences of ID=1 with str='hello'
I tried to benchmark it to see if it could do 500,000 rows in a reasonable time.
Generate some test data (in total about 60MB):
V = 1+round(rand(5E5,1).*1E4);
H = cell(1,length(V));
for ct = 1:length(H)
    switch floor(rand(1)*10)
        case 0
            H{ct} = 'hello';
        case 1
            H{ct} = 'go';
        case 2
            H{ct} = 'piz';
        case 3
            H{ct} = 'da';
        case 4
            H{ct} = 'mn';
        case 5
            H{ct} = 'ds';
        case 6
            H{ct} = 'wf';
        case 7
            H{ct} = 'sf';
        case 8
            H{ct} = 'as';
        case 9
            H{ct} = 'sg';
    end
end
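The analysis:
tic
a=ismember(H,{'hello'});   % logical mask of the rows whose string is 'hello'
M = accumarray(V(a),1);    % M(id) = number of 'hello' rows for that ID
idx = find(M>2);           % IDs with more than 2 occurrences
result = [idx,M(idx)];
toc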
Elapsed time is 0.011699 seconds.
Alternative method with a loop:
tic
M=zeros(max(V),1);
for ct = 1:length(H)
    if strcmp(H{ct},'hello')
        M(V(ct))=M(V(ct))+1;
    end
end
idx = find(M>2);           % IDs with more than 2 occurrences
result1 = [idx,M(idx)];
toc
Elapsed time is 0.192560 seconds.
There are many possible solutions. Here is one: use a map structure. The key set of the map contains the IDs (where "hello" appears in the second column), and the value set contains the number of occurrences.
Run over the second column. When you find "hello", check if the corresponding ID is already a key in the map structure. If so, add 1 to the value associated with that key. Otherwise, add a new pair (key, value) = (the ID, 1).
When finished, remove from the map all the pairs whose values are less than or equal to 2. The remaining map is what you are looking for.
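The map-based procedure above can be sketched as follows. The sketch is in Python rather than MATLAB, with `collections.Counter` standing in for the map structure; the function name `count_matches` and its parameters are hypothetical.

```python
from collections import Counter

def count_matches(ids, strings, target="hello", more_than=2):
    """Return {id: count} for ids paired with `target` more than `more_than` times."""
    # Count the rows whose string equals the target, keyed by ID.
    counts = Counter(i for i, s in zip(ids, strings) if s == target)
    # Drop every pair whose count is less than or equal to the threshold.
    return {i: c for i, c in counts.items() if c > more_than}

ids = [1, 2, 6, 2, 1, 2, 3, 1]
strings = ["hello", "go", "hello", "piz", "hello", "da", "mn", "hello"]
print(count_matches(ids, strings))  # {1: 3}
```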
Matlab map: https://es.mathworks.com/help/matlab/map-containers.html

How to put numbers into an array and sorted by most frequent number in java

I was given this question on programming in java and was wondering what would be the best way of doing it.
The question was on the lines of:
From the numbers provided, how would you display the most frequent number in Java? The numbers were: 0, 3, 4, 1, 1, 3, 7, 9, 1
At first I thought they should be put in an array and sorted first, and then maybe I would have to go through them with a for loop. Am I on the right lines? Some examples would help greatly.
If the numbers are all fairly small, you can quickly get the most frequent value by creating an array to keep track of the count for each number. The algorithm would be:
Find the maximum value in your list
Create an integer array of size max + 1 (assuming all non-negative values) to store the counts for each value in your list
Loop through your list and increment the count at the index of each value
Scan through the count array and find the index with the highest value
This algorithm runs in O(n + max) time, which should be faster than sorting the list (O(n log n)) and finding the longest run of duplicate values. The tradeoff is that it takes up more memory if the values in your list are very large.
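The counting-array steps above can be sketched like this (in Python rather than Java, purely as an illustration). It assumes a non-empty list of small non-negative integers; the function names `value_counts`, `most_frequent` and `by_frequency` are made up.

```python
def value_counts(numbers):
    """counts[v] = how many times v occurs (non-negative integers only)."""
    counts = [0] * (max(numbers) + 1)
    for n in numbers:
        counts[n] += 1
    return counts

def most_frequent(numbers):
    """Index of the highest count; ties go to the smallest value."""
    counts = value_counts(numbers)
    return max(range(len(counts)), key=lambda v: counts[v])

def by_frequency(numbers):
    """Distinct values ordered from most to least frequent (stable for ties)."""
    counts = value_counts(numbers)
    present = [v for v in range(len(counts)) if counts[v] > 0]
    return sorted(present, key=lambda v: -counts[v])

numbers = [0, 3, 4, 1, 1, 3, 7, 9, 1]
print(most_frequent(numbers))  # 1
print(by_frequency(numbers))   # [1, 3, 0, 4, 7, 9]
```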
With Java 8, this can be implemented rather smoothly. If you're willing to use a third-party library like jOOλ, it could be done like this:
List<Integer> list = Arrays.asList(0, 3, 4, 1, 1, 3, 7, 9, 1);
System.out.println(
    Seq.seq(list)
       .grouped(i -> i, Agg.count())
       .sorted(Comparator.comparing(t -> -t.v2))
       .map(t -> t.v1)
       .toList());
(disclaimer, I work for the company behind jOOλ)
If you want to stick with the JDK 8 as your only dependency, the following code would be equivalent to the above:
System.out.println(
    list.stream()
        .collect(Collectors.groupingBy(i -> i, Collectors.counting()))
        .entrySet()
        .stream()
        .sorted(Comparator.comparing(e -> -e.getValue()))
        .map(e -> e.getKey())
        .collect(Collectors.toList()));
Both solutions yield:
[1, 3, 0, 4, 7, 9]