Finding non-adjacent subsequences in a string - string-search

Say I am searching in a string, for a subsequence, where the elements do not necessarily have to be adjacent, but have to occur within N characters. So,
search("abc","aaabbbccc",7) => True
search("abc","aabbcc",3) => False
I am looking for an efficient data structure / algorithm that will perform this comparison. I can think of a few approaches like searching for all valid combos of interior wildcards, like
search("abc",whatever,4) => "abc","a*bc","ab*c"
And using any of the multi-string search algorithms (probably Aho–Corasick), but I'm wondering if there is a better solution.

I have attached a python code sample that does what you want. It loops through the string to be searched and if the first letter of search string is found, a substring of length=max_length is created and sent to another function. This function simply moves through the substring trying to find all of the search string letters in order. If it finds them all then it returns True, otherwise False.
def check_substring(find_me, substr):
find_index = 0
for letter in substr:
if find_me[find_index] == letter:
find_index +=1
# if we reach the end of find_me, return true
if find_index >= len(find_me):
return True
return False
def check_string(find_me, look_here, max_len):
for index in range(len(look_here)):
if find_me[0] == look_here[index]:
if check_substring(find_me, look_here[index:index + max_len]):
return True
return False
fm = "abc"
lh = "aabbbccceee"
ml = 5
print check_string(fm, lh, ml)

Related

How to subtract strings that are not consecutive python

What I mean is if I have a string, "apwswe", and another string "appegwisbnwe", if we "subtract" the two strings together, which means "appegwisbnwe" - "apwswe", I want to get "pegibn". Is there a way to do this? BTW pegibn is the characters that they don't have in "common" with eachother.
Not exactly a thing of beauty, but this will get you there:
subtrahend = "apwswe"
minuend = list("appegwisbnwe")
for char in subtrahend:
if minuend.count(char):
minuend.remove(char)
difference = "".join(minuend)
print(difference)
pgibne
Possible alternatives to rhurwitz's solution:
input = "appegwisbnwe"
for char, occurrences in collections.Counter("apwswe"):
input = input.replace(char, '', occurrences)
this is quite simple and can be implemented as a straightforward functools.reduce expression but will rewrite the input string as many times as there are different characters in the filter.
A possibly more efficient alternative as it works in O(len(input) + len(filter)) rather than O(len(input)*len(uniq(filter))
input = "appegwisbnwe"
filter = collections.Counter("apwswe")
output = ''
for c in input:
if filter[c]:
filter[c] -= 1
else:
output += c

Trouble With Rhyming Algorithm Scala

I want to make a method that takes two List of strings representing the sounds(phonemes & vowels) of two words as parameters. The function of this method is to determine whether or not the words rhyme based on the two sounds.
Definition of a rhyme: The words rhymes if the last vowel(inclusive) and after are the same. Words will rhyme even if the last vowel sounds have different stress. Only vowels will have stress levels(numbers)
My approach so far is to reverse the list so that the sounds are in reverse order and then add everything from the start of the line to the first vowel(inclusive). Then compare the two list to see if they equal. Please apply basic code, Im only at elementary level of scala. Just finished learning program execution.
Ex1: two words GEE and NEE will rhyme because GEE sound (“JH”,”IY1”) becomes (”IY1”,”JH”) and NEE sound (“N”,”IY1) becomes (”IY1”, “N”) since they have the same vowel everything else after should not be considered any more.
Ex2: two words GEE and JEEP will not rhyme because GEE sound (“JH”,”IY1”) becomes (”IY1”,”JH”) and JEEP sound (“JH”,”IY1”,”P”) becomes (”P”,”IY1”,”JH”) since the first sound in GEE is a vowel it’s being compared to “P” and “IY1” in JEEP.
Ex3: two words HALF and GRAPH will rhyme because HALF sound(“HH”,”AE1”,”F”) becomes (“F”,”AE1”,”HH”) and GRAPH sound (“G”,”R”,”AE2”,”F”) become (“F”,”AE2”,”R”,”G”) in this case although the first vowel have different stress(numbers) we ignore the stress since the vowels are the same.
def isRhymeSounds(soundList1: List[String], soundList2: List[String]): Boolean={
val revSound1 = soundList1.reverse
val revSound2 = soundList2.reverse
var revSoundList1:List[String] = List()
var revSoundList2:List[String] = List()
for(sound1 <- revSound1) {
if(sound1.length >= 3) {
val editVowel1 = sound1.substring(0,2)
revSoundList1 = revSoundList1 :+ editVowel1
}
else {
revSoundList1 = revSoundList1 :+ sound1
}
}
for(sound2 <- revSound2) {
if(sound2.length >= 3) {
val editVowel2 = sound2.substring(0, 2)
revSoundList2 = revSoundList2 :+ editVowel2
}
else {
revSoundList2 = revSoundList2 :+ sound2
}
}
if(revSoundList1 == revSoundList2){
true
}
else{
false
}
}
I don't think reversing is necessary.
def isRhyme(sndA :List[String], sndB :List[String]) :Boolean = {
val vowel = "[AEIOUY]+".r
sndA.foldLeft(""){case (res, s) => vowel.findPrefixOf(s).getOrElse(res+s)} ==
sndB.foldLeft(""){case (res, s) => vowel.findPrefixOf(s).getOrElse(res+s)}
}
explanation
"[AEIOUY]+".r - This is a Regular Expression (that's the .r part) that means "a String of one or more of these characters." In other words, any combination of capital letter vowels.
findPrefixOf() - This returns the first part of the test string that matches the regular expression. So vowel.findPrefixOf("AY2") returns Some("AY") because the first two letters match the regular expression. And vowel.findPrefixOf("OFF") returns Some("O") because only the first letter matches the regular expression. But vowel.findPrefixOf("BAY") returns None because the string does not start with any of the specified characters.
getOrElse() - This unwraps an Option. So Some("AY").getOrElse("X") returns "AY", and Some("O").getOrElse("X") returns "O", but None.getOrElse("X") returns "X" because there's nothing inside a None value so we go with the OrElse default return value.
foldLeft()() - This takes a collection of elements and, starting from the "left", it "folds" them in on each other until a final result is obtained.
So, consider how List("HH", "AE1", "F", "JH", "IY1", "P") would be processed.
res s result
=== === ======
"" HH ""+HH //s doesn't match the RE so default res+s
HH AE1 AE //s does match the RE, use only the matching part
AE F AE+F //res+s
AEF JH AEF+JH //res+s
AEFJH IY1 IY //only the matching part
IY P IY+P //res+s
final result: "IYP"

ignore spaces and cases MATLAB

diary_file = tempname();
diary(diary_file);
myFun();
diary('off');
output = fileread(diary_file);
I would like to search a string from output, but also to ignore spaces and upper/lower cases. Here is an example for what's in output:
the test : passed
number : 4
found = 'thetest:passed'
a = strfind(output,found )
How could I ignore spaces and cases from output?
Assuming you are not too worried about accidentally matching something like: 'thetEst:passed' here is what you can do:
Remove all spaces and only compare lower case
found = 'With spaces'
found = lower(found(found ~= ' '))
This will return
found =
withspaces
Of course you would also need to do this with each line of output.
Another way:
regexpi(output(~isspace(output)), found, 'match')
if output is a single string, or
regexpi(regexprep(output,'\s',''), found, 'match')
for the more general case (either class(output) == 'cell' or 'char').
Advantages:
Fast.
robust (ALL whitespace (not just spaces) is removed)
more flexible (you can return starting/ending indices of the match, tokenize, etc.)
will return original case of the match in output
Disadvantages:
more typing
less obvious (more documentation required)
will return original case of the match in output (yes, there's two sides to that coin)
That last point in both lists is easily forced to lower or uppercase using lower() or upper(), but if you want same-case, it's a bit more involved:
C = regexpi(output(~isspace(output)), found, 'match');
if ~isempty(C)
C = found; end
for single string, or
C = regexpi(regexprep(output, '\s', ''), found, 'match')
C(~cellfun('isempty', C)) = {found}
for the more general case.
You can use lower to convert everything to lowercase to solve your case problem. However ignoring whitespace like you want is a little trickier. It looks like you want to keep some spaces but not all, in which case you should split the string by whitespace and compare substrings piecemeal.
I'd advertise using regex, e.g. like this:
a = regexpi(output, 'the\s*test\s*:\s*passed');
If you don't care about the position where the match occurs but only if there's a match at all, removing all whitespaces would be a brute force, and somewhat nasty, possibility:
a = strfind(strrrep(output, ' ',''), found);

find whether a string is substring of other string in SML NJ

In SML NJ, I want to find whether a string is substring of another string and find its index. Can any one help me with this?
The Substring.position function is the only one I can find in the basis library that seems to do string search. Unfortunately, the Substring module is kind of hard to use, so I wrote the following function to use it. Just pass two strings, and it will return an option: NONE if not found, or SOME of the index if it is found:
fun index (str, substr) = let
val (pref, suff) = Substring.position substr (Substring.full str)
val (s, i, n) = Substring.base suff
in
if i = size str then
NONE
else
SOME i
end;
Well you have all the substring functions, however if you want to also know the position of it, then the easiest is to do it yourself, with a linear scan.
Basically you want to explode both strings, and then compare the first character of the substring you want to find, with each character of the source string, incrementing a position counter each time you fail. When you find a match you move to the next char in the substring as well without moving the position counter. If the substring is "empty" (modeled when you are left with the empty list) you have matched it all and you can return the position index, however if the matching suddenly fail you have to return back to when you had the first match and skip a letter (incrementing the position counter) and start all over again.
Hope this helps you get started on doing this yourself.

Algorithm to get a list of all words that are anagrams of all substrings (scrabble)?

Eg if input string is helloworld I want the output to be like:
do
he
we
low
hell
hold
roll
well
word
hello
lower
world
...
all the way up to the longest word that is an anagram of a substring of helloworld. Like in Scrabble for example.
The input string can be any length, but rarely more than 16 chars.
I've done a search and come up with structures like a trie, but I am still unsure of how to actually do this.
The structure used to hold your dictionary of valid entries will have a huge impact on efficiency. Organize it as a tree, root being the singular zero letter "word", the empty string. Each child of root is a single first letter of a possible word, children of those being the second letter of a possible word, etc., with each node marked as to whether it actually forms a word or not.
Your tester function will be recursive. It starts with zero letters, finds from the tree of valid entries that "" isn't a word but it does have children, so you call your tester recursively with your start word (of no letters) appended with each available remaining letter from your input string (which is all of them at that point). Check each one-letter entry in tree, if valid make note; if children, re-call tester function appending each of remaining available letters, and so on.
So for example, if your input string is "helloworld", you're going to first call your recursive tester function with "", passing the remaining available letters "helloworld" as a 2nd parameter. Function sees that "" isn't a word, but child "h" does exist. So it calls itself with "h", and "elloworld". Function sees that "h" isn't a word, but child "e" exists. So it calls itself with "he" and "lloworld". Function sees that "e" is marked, so "he" is a word, take note. Further, child "l" exists, so next call is "hel" with "loworld". It will next find "hell", then "hello", then will have to back out and probably next find "hollow", before backing all the way out to the empty string again and then starting with "e" words next.
I couldn't resist my own implementation. It creates a dictionary by sorting all the letters alphabetically, and mapping them to the words that can be created from them. This is an O(n) start-up operation that eliminates the need to find all permutations. You could implement the dictionary as a trie in another language to attain faster speedups.
The "getAnagrams" command is also an O(n) operation which searches each word in the dictionary to see if it is a subset of the search. Doing getAnagrams("radiotelegraphically")" (a 20 letter word) took approximately 1 second on my laptop, and returned 1496 anagrams.
# Using the 38617 word dictionary at
# http://www.cs.umd.edu/class/fall2008/cmsc433/p5/Usr.Dict.Words.txt
# Usage: getAnagrams("helloworld")
def containsLetters(subword, word):
wordlen = len(word)
subwordlen = len(subword)
if subwordlen > wordlen:
return False
word = list(word)
for c in subword:
try:
index = word.index(c)
except ValueError:
return False
word.pop(index)
return True
def getAnagrams(word):
output = []
for key in mydict.iterkeys():
if containsLetters(key, word):
output.extend(mydict[key])
output.sort(key=len)
return output
f = open("dict.txt")
wordlist = f.readlines()
f.close()
mydict = {}
for word in wordlist:
word = word.rstrip()
temp = list(word)
temp.sort()
letters = ''.join(temp)
if letters in mydict:
mydict[letters].append(word)
else:
mydict[letters] = [word]
An example run:
>>> getAnagrams("helloworld")
>>> ['do', 'he', 'we', 're', 'oh', 'or', 'row', 'hew', 'her', 'hoe', 'woo', 'red', 'dew', 'led', 'doe', 'ode', 'low', 'owl', 'rod', 'old', 'how', 'who', 'rho', 'ore', 'roe', 'owe', 'woe', 'hero', 'wood', 'door', 'odor', 'hold', 'well', 'owed', 'dell', 'dole', 'lewd', 'weld', 'doer', 'redo', 'rode', 'howl', 'hole', 'hell', 'drew', 'word', 'roll', 'wore', 'wool','herd', 'held', 'lore', 'role', 'lord', 'doll', 'hood', 'whore', 'rowed', 'wooed', 'whorl', 'world', 'older', 'dowel', 'horde', 'droll', 'drool', 'dwell', 'holed', 'lower', 'hello', 'wooer', 'rodeo', 'whole', 'hollow', 'howler', 'rolled', 'howled', 'holder', 'hollowed']
The data structure you want is called a Directed Acyclic Word Graph (dawg), and it is described by Andrew Appel and Guy Jacobsen in their paper "The World's Fastest Scrabble Program" which unfortunately they have chosen not to make available free online. An ACM membership or a university library will get it for you.
I have implemented this data structure in at least two languages---it is simple, easy to implement, and very, very fast.
A simple-minded approach is to generate all the "substrings" and, for each of them, check whether it's an element of the set of acceptable words. E.g., in Python 2.6:
import itertools
import urllib
def words():
f = urllib.urlopen(
'http://www.cs.umd.edu/class/fall2008/cmsc433/p5/Usr.Dict.Words.txt')
allwords = set(w[:-1] for w in f)
f.close()
return allwords
def substrings(s):
for i in range(2, len(s)+1):
for p in itertools.permutations(s, i):
yield ''.join(p)
def main():
w = words()
print '%d words' % len(w)
ss = set(substrings('weep'))
print '%d substrings' % len(ss)
good = ss & w
print '%d good ones' % len(good)
sgood = sorted(good, key=lambda w:(len(w), w))
for aword in sgood:
print aword
main()
will emit:
38617 words
31 substrings
5 good ones
we
ewe
pew
wee
weep
Of course, as other responses pointed out, organizing your data purposefully can greatly speed-up your runtime -- although the best data organization for a fast anagram finder could well be different... but that will largely depend on the nature of your dictionary of allowed words (a few tens of thousands, like here -- or millions?). Hash-maps and "signatures" (based on sorting the letters in each word) should be considered, as well as tries &c.
What you want is an implementation of a power set.
Also look at Eric Lipparts blog, he blogged about this very thing a little while back
EDIT:
Here is an implementation I wrote of getting the powerset from a given string...
private IEnumerable<string> GetPowerSet(string letters)
{
char[] letterArray = letters.ToCharArray();
for (int i = 0; i < Math.Pow(2.0, letterArray.Length); i++)
{
StringBuilder sb = new StringBuilder();
for (int j = 0; j < letterArray.Length; j++)
{
int pos = Convert.ToInt32(Math.Pow(2.0, j));
if ((pos & i) == pos)
{
sb.Append(letterArray[j]);
}
}
yield return new string(sb.ToString().ToCharArray().OrderBy(c => c).ToArray());
}
}
This function gives me the powersets of chars that make up the passed in string, I then can use these as keys into a dictionary of anagrams...
Dictionary<string,IEnumerable<string>>
I created my dictionary of anagrams like so... (there are probably more efficient ways, but this was simple and plenty quick enough with the scrabble tournament word list)
wordlist = (from s in fileText.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries)
let k = new string(s.ToCharArray().OrderBy(c => c).ToArray())
group s by k).ToDictionary(o => o.Key, sl => sl.Select(a => a));
Like Tim J, Eric Lippert's blog posts where the first thing to come to my mind. I wanted to add that he wrote a follow-up about ways to improve the performance of his first attempt.
A nasality talisman for the sultana analyst
Santalic tailfans, part two
I believe the Ruby code in the answers to this question will also solve your problem.
I've been playing a lot of Wordfeud on my phone recently and was curious if I could come up with some code to give me a list of possible words. The following code takes your availble source letters (* for a wildcards) and an array with a master list of allowable words (TWL, SOWPODS, etc) and generates a list of matches. It does this by trying to build each word in the master list from your source letters.
I found this topic after writing my code, and it's definitely not as efficient as John Pirie's method or the DAWG algorithm, but it's still pretty quick.
public IList<string> Matches(string sourceLetters, string [] wordList)
{
sourceLetters = sourceLetters.ToUpper();
IList<string> matches = new List<string>();
foreach (string word in wordList)
{
if (WordCanBeBuiltFromSourceLetters(word, sourceLetters))
matches.Add(word);
}
return matches;
}
public bool WordCanBeBuiltFromSourceLetters(string targetWord, string sourceLetters)
{
string builtWord = "";
foreach (char letter in targetWord)
{
int pos = sourceLetters.IndexOf(letter);
if (pos >= 0)
{
builtWord += letter;
sourceLetters = sourceLetters.Remove(pos, 1);
continue;
}
// check for wildcard
pos = sourceLetters.IndexOf("*");
if (pos >= 0)
{
builtWord += letter;
sourceLetters = sourceLetters.Remove(pos, 1);
}
}
return string.Equals(builtWord, targetWord);
}