Algorithm to get a list of all words that are anagrams of all substrings (scrabble)? - substring

E.g., if the input string is helloworld, I want the output to be like:
do
he
we
low
hell
hold
roll
well
word
hello
lower
world
...
all the way up to the longest word that is an anagram of a substring of helloworld. Like in Scrabble for example.
The input string can be any length, but rarely more than 16 chars.
I've done a search and come up with structures like a trie, but I am still unsure of how to actually do this.

The structure used to hold your dictionary of valid entries will have a huge impact on efficiency. Organize it as a tree, root being the singular zero letter "word", the empty string. Each child of root is a single first letter of a possible word, children of those being the second letter of a possible word, etc., with each node marked as to whether it actually forms a word or not.
Your tester function will be recursive. It starts with zero letters and finds from the tree of valid entries that "" isn't a word but does have children, so it calls itself recursively with the start word (of no letters) extended by each available remaining letter from your input string (which is all of them at that point). For each one-letter entry, check the tree: if it is marked as a valid word, make a note of it; if it has children, call the tester function again, appending each of the remaining available letters, and so on.
So for example, if your input string is "helloworld", you're going to first call your recursive tester function with "", passing the remaining available letters "helloworld" as a 2nd parameter. Function sees that "" isn't a word, but child "h" does exist. So it calls itself with "h", and "elloworld". Function sees that "h" isn't a word, but child "e" exists. So it calls itself with "he" and "lloworld". Function sees that "e" is marked, so "he" is a word, take note. Further, child "l" exists, so next call is "hel" with "loworld". It will next find "hell", then "hello", then will have to back out and probably next find "hollow", before backing all the way out to the empty string again and then starting with "e" words next.
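The recursive walk described above can be sketched as follows. This is a minimal illustration, not the answerer's actual code: the nested-dict trie layout, the `END` marker, the function names, and the tiny `sample_words` list are all assumptions made for the example.

```python
END = "<end>"  # hypothetical marker key; longer than one char so it never clashes with a letter


def build_trie(words):
    """Build a nested-dict trie; a node containing END completes a valid word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def find_words(trie, letters):
    """Return every word in the trie that can be spelled from `letters`."""
    found = set()

    def walk(node, prefix, remaining):
        if END in node:
            found.add(prefix)
        # Try each distinct remaining letter once to avoid duplicate branches.
        for ch in set(remaining):
            child = node.get(ch)
            if child is not None:
                i = remaining.index(ch)
                walk(child, prefix + ch, remaining[:i] + remaining[i + 1:])

    walk(trie, "", letters)
    return sorted(found, key=lambda w: (len(w), w))


# Assumed toy dictionary; a real run would load a full word list instead.
sample_words = ["do", "he", "we", "low", "hell", "hold",
                "hello", "world", "hollow", "help"]
print(find_words(build_trie(sample_words), "helloworld"))
```

Because the walk only descends into children that exist in the trie, dead prefixes are abandoned immediately instead of being permuted further.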

I couldn't resist my own implementation. It builds a dictionary that maps each word's letters, sorted alphabetically, to the list of words that can be made from exactly those letters. This is an O(n) start-up operation (n being the number of words) that eliminates the need to generate permutations. You could implement the dictionary as a trie in another language to get further speedups.
The getAnagrams function is also an O(n) operation: it checks each key in the dictionary to see whether its letters are a subset of the search string's letters. Running getAnagrams("radiotelegraphically") (a 20-letter word) took approximately 1 second on my laptop and returned 1496 anagrams.
# Using the 38617 word dictionary at
# http://www.cs.umd.edu/class/fall2008/cmsc433/p5/Usr.Dict.Words.txt
# Usage: getAnagrams("helloworld")
def containsLetters(subword, word):
    # True if every letter of subword can be taken from word (respecting counts)
    wordlen = len(word)
    subwordlen = len(subword)
    if subwordlen > wordlen:
        return False
    word = list(word)
    for c in subword:
        try:
            index = word.index(c)
        except ValueError:
            return False
        word.pop(index)
    return True
def getAnagrams(word):
    output = []
    for key in mydict:  # each key is a sorted-letter signature
        if containsLetters(key, word):
            output.extend(mydict[key])
    output.sort(key=len)
    return output
f = open("dict.txt")
wordlist = f.readlines()
f.close()
# Map each sorted-letter signature to the words that share it
mydict = {}
for word in wordlist:
    word = word.rstrip()
    temp = list(word)
    temp.sort()
    letters = ''.join(temp)
    if letters in mydict:
        mydict[letters].append(word)
    else:
        mydict[letters] = [word]
An example run:
>>> getAnagrams("helloworld")
['do', 'he', 'we', 're', 'oh', 'or', 'row', 'hew', 'her', 'hoe', 'woo', 'red', 'dew', 'led', 'doe', 'ode', 'low', 'owl', 'rod', 'old', 'how', 'who', 'rho', 'ore', 'roe', 'owe', 'woe', 'hero', 'wood', 'door', 'odor', 'hold', 'well', 'owed', 'dell', 'dole', 'lewd', 'weld', 'doer', 'redo', 'rode', 'howl', 'hole', 'hell', 'drew', 'word', 'roll', 'wore', 'wool', 'herd', 'held', 'lore', 'role', 'lord', 'doll', 'hood', 'whore', 'rowed', 'wooed', 'whorl', 'world', 'older', 'dowel', 'horde', 'droll', 'drool', 'dwell', 'holed', 'lower', 'hello', 'wooer', 'rodeo', 'whole', 'hollow', 'howler', 'rolled', 'howled', 'holder', 'hollowed']

The data structure you want is called a Directed Acyclic Word Graph (DAWG), and it is described by Andrew Appel and Guy Jacobson in their paper "The World's Fastest Scrabble Program", which unfortunately they have chosen not to make freely available online. An ACM membership or a university library will get it for you.
I have implemented this data structure in at least two languages. It is simple, easy to implement, and very, very fast.

A simple-minded approach is to generate all the "substrings" and, for each of them, check whether it's an element of the set of acceptable words. E.g., in Python 2.6:
import itertools
import urllib
def words():
    f = urllib.urlopen(
        'http://www.cs.umd.edu/class/fall2008/cmsc433/p5/Usr.Dict.Words.txt')
    allwords = set(w[:-1] for w in f)
    f.close()
    return allwords
def substrings(s):
    for i in range(2, len(s)+1):
        for p in itertools.permutations(s, i):
            yield ''.join(p)
def main():
    w = words()
    print '%d words' % len(w)
    ss = set(substrings('weep'))
    print '%d substrings' % len(ss)
    good = ss & w
    print '%d good ones' % len(good)
    sgood = sorted(good, key=lambda w:(len(w), w))
    for aword in sgood:
        print aword
main()
will emit:
38617 words
31 substrings
5 good ones
we
ewe
pew
wee
weep
Of course, as other responses pointed out, organizing your data purposefully can greatly speed up your runtime, although the best data organization for a fast anagram finder will largely depend on the nature of your dictionary of allowed words (a few tens of thousands, like here, or millions?). Hash maps and "signatures" (based on sorting the letters in each word) should be considered, as well as tries, etc.

What you want is an implementation of a power set.
Also look at Eric Lippert's blog; he blogged about this very thing a little while back.
EDIT:
Here is an implementation I wrote of getting the powerset from a given string...
private IEnumerable<string> GetPowerSet(string letters)
{
    char[] letterArray = letters.ToCharArray();
    // Each value of i is a bitmask selecting one subset of the letters.
    int subsetCount = 1 << letterArray.Length;
    for (int i = 0; i < subsetCount; i++)
    {
        StringBuilder sb = new StringBuilder();
        for (int j = 0; j < letterArray.Length; j++)
        {
            int pos = 1 << j;
            if ((pos & i) == pos)
            {
                sb.Append(letterArray[j]);
            }
        }
        // Sort the chosen letters so the result can be used as an anagram key.
        yield return new string(sb.ToString().ToCharArray().OrderBy(c => c).ToArray());
    }
}
This function gives me the powersets of chars that make up the passed in string, I then can use these as keys into a dictionary of anagrams...
Dictionary<string,IEnumerable<string>>
I created my dictionary of anagrams like so... (there are probably more efficient ways, but this was simple and plenty quick enough with the scrabble tournament word list)
wordlist = (from s in fileText.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries)
            let k = new string(s.ToCharArray().OrderBy(c => c).ToArray())
            group s by k).ToDictionary(o => o.Key, sl => sl.Select(a => a));
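The same power-set-plus-signature idea can be sketched in Python for comparison. This is an illustrative sketch under assumptions: the function names, the `min_len` parameter, and the six-word sample list are invented for the example, and `itertools.combinations` stands in for the bitmask loop.

```python
from itertools import combinations


def build_signature_index(words):
    """Map each sorted-letter signature to the words that share it."""
    index = {}
    for word in words:
        index.setdefault("".join(sorted(word)), []).append(word)
    return index


def anagrams_of_subsets(letters, index, min_len=2):
    """Look up every distinct letter subset of `letters` in the index."""
    pool = sorted(letters)
    found = set()
    for size in range(min_len, len(pool) + 1):
        # Combinations of a sorted pool are themselves sorted, so each tuple
        # joins directly into a signature; set() drops repeated subsets.
        for signature in set(combinations(pool, size)):
            found.update(index.get("".join(signature), []))
    return sorted(found, key=lambda w: (len(w), w))


# Assumed toy word list; "peel" needs an "l" and should not be returned.
index = build_signature_index(["we", "ewe", "wee", "pew", "weep", "peel"])
print(anagrams_of_subsets("weep", index))
```

A 16-letter input has at most 65536 subsets, so the number of dictionary lookups stays small regardless of how large the word list is.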

Like Tim J, Eric Lippert's blog posts were the first thing to come to my mind. I wanted to add that he wrote a follow-up about ways to improve the performance of his first attempt.
A nasality talisman for the sultana analyst
Santalic tailfans, part two

I believe the Ruby code in the answers to this question will also solve your problem.

I've been playing a lot of Wordfeud on my phone recently and was curious whether I could come up with some code to give me a list of possible words. The following code takes your available source letters (* for a wildcard) and an array with a master list of allowable words (TWL, SOWPODS, etc.) and generates a list of matches. It does this by trying to build each word in the master list from your source letters.
I found this topic after writing my code, and it's definitely not as efficient as John Pirie's method or the DAWG algorithm, but it's still pretty quick.
public IList<string> Matches(string sourceLetters, string [] wordList)
{
    sourceLetters = sourceLetters.ToUpper();
    IList<string> matches = new List<string>();
    foreach (string word in wordList)
    {
        if (WordCanBeBuiltFromSourceLetters(word, sourceLetters))
            matches.Add(word);
    }
    return matches;
}
public bool WordCanBeBuiltFromSourceLetters(string targetWord, string sourceLetters)
{
    string builtWord = "";
    foreach (char letter in targetWord)
    {
        int pos = sourceLetters.IndexOf(letter);
        if (pos >= 0)
        {
            builtWord += letter;
            sourceLetters = sourceLetters.Remove(pos, 1);
            continue;
        }
        // check for wildcard
        pos = sourceLetters.IndexOf("*");
        if (pos >= 0)
        {
            builtWord += letter;
            sourceLetters = sourceLetters.Remove(pos, 1);
        }
    }
    return string.Equals(builtWord, targetWord);
}
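A counting-based variant of the same check can be sketched in Python. It is only an illustration: `can_build`, `matches`, and the `wildcard` parameter are hypothetical names, and the sketch assumes the target words themselves never contain the wildcard character. Instead of removing letters from a string one by one, it compares letter counts and lets wildcards cover any shortfall.

```python
from collections import Counter


def can_build(word, rack, wildcard="*"):
    """True if `word` can be spelled from `rack`, with wildcards filling gaps."""
    available = Counter(rack.upper())
    needed = Counter(word.upper())
    # Counter subtraction keeps only positive counts, i.e. the letters
    # the rack cannot supply on its own.
    missing = sum((needed - available).values())
    return missing <= available[wildcard]


def matches(rack, word_list):
    """Return the words from `word_list` that can be built from `rack`."""
    return [word for word in word_list if can_build(word, rack)]


print(matches("HEL*O", ["HELLO", "HOLE", "WORLD", "HELP"]))
```

With the rack "HEL*O", "HELLO" and "HELP" each need the wildcard for one letter, "HOLE" needs none, and "WORLD" is rejected because it is missing three letters.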

Related

Trouble With Rhyming Algorithm Scala

I want to write a method that takes two Lists of strings representing the sounds (phonemes and vowels) of two words as parameters. The method should determine whether the words rhyme based on those sounds.
Definition of a rhyme: two words rhyme if their sounds from the last vowel (inclusive) onward are the same. Words rhyme even if the last vowel sounds have different stress. Only vowels have stress levels (numbers).
My approach so far is to reverse each list so that the sounds are in reverse order, take everything from the start of the list up to the first vowel (inclusive), and then compare the two lists for equality. Please use basic code; I'm only at an elementary level of Scala and have just finished learning program execution.
Ex1: the words GEE and NEE rhyme. The GEE sounds ("JH","IY1") become ("IY1","JH") and the NEE sounds ("N","IY1") become ("IY1","N"). Since both start with the same vowel, nothing after it should be considered.
Ex2: the words GEE and JEEP do not rhyme. The GEE sounds ("JH","IY1") become ("IY1","JH") and the JEEP sounds ("JH","IY1","P") become ("P","IY1","JH"). The rhyme part of GEE is just the vowel "IY1", while for JEEP it is "P","IY1", so they differ.
Ex3: the words HALF and GRAPH rhyme. The HALF sounds ("HH","AE1","F") become ("F","AE1","HH") and the GRAPH sounds ("G","R","AE2","F") become ("F","AE2","R","G"). Although the vowels have different stress (numbers), we ignore the stress since the vowels are the same.
def isRhymeSounds(soundList1: List[String], soundList2: List[String]): Boolean={
  val revSound1 = soundList1.reverse
  val revSound2 = soundList2.reverse
  var revSoundList1:List[String] = List()
  var revSoundList2:List[String] = List()
  for(sound1 <- revSound1) {
    if(sound1.length >= 3) {
      val editVowel1 = sound1.substring(0,2)
      revSoundList1 = revSoundList1 :+ editVowel1
    }
    else {
      revSoundList1 = revSoundList1 :+ sound1
    }
  }
  for(sound2 <- revSound2) {
    if(sound2.length >= 3) {
      val editVowel2 = sound2.substring(0, 2)
      revSoundList2 = revSoundList2 :+ editVowel2
    }
    else {
      revSoundList2 = revSoundList2 :+ sound2
    }
  }
  if(revSoundList1 == revSoundList2){
    true
  }
  else{
    false
  }
}
I don't think reversing is necessary.
def isRhyme(sndA :List[String], sndB :List[String]) :Boolean = {
  val vowel = "[AEIOUY]+".r
  sndA.foldLeft(""){case (res, s) => vowel.findPrefixOf(s).getOrElse(res+s)} ==
  sndB.foldLeft(""){case (res, s) => vowel.findPrefixOf(s).getOrElse(res+s)}
}
Explanation:
"[AEIOUY]+".r - This is a Regular Expression (that's the .r part) that means "a String of one or more of these characters." In other words, any combination of capital letter vowels.
findPrefixOf() - This returns the first part of the test string that matches the regular expression. So vowel.findPrefixOf("AY2") returns Some("AY") because the first two letters match the regular expression. And vowel.findPrefixOf("OFF") returns Some("O") because only the first letter matches the regular expression. But vowel.findPrefixOf("BAY") returns None because the string does not start with any of the specified characters.
getOrElse() - This unwraps an Option. So Some("AY").getOrElse("X") returns "AY", and Some("O").getOrElse("X") returns "O", but None.getOrElse("X") returns "X" because there's nothing inside a None value so we go with the OrElse default return value.
foldLeft()() - This takes a collection of elements and, starting from the "left", it "folds" them in on each other until a final result is obtained.
So, consider how List("HH", "AE1", "F", "JH", "IY1", "P") would be processed.
res     s     result
===     ===   ======
""      HH    ""+HH    //s doesn't match the RE so default res+s
HH      AE1   AE       //s does match the RE, use only the matching part
AE      F     AE+F     //res+s
AEF     JH    AEF+JH   //res+s
AEFJH   IY1   IY       //only the matching part
IY      P     IY+P     //res+s
final result: "IYP"
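For comparison, the "keep everything from the last vowel onward" rule can be sketched in Python. This is not a translation of the Scala answer: the function names are hypothetical, and the sketch assumes a sound is a vowel exactly when it ends in a stress digit, which follows the question's statement that only vowels carry stress numbers.

```python
def rhyme_part(sounds):
    """Return the sounds from the last vowel onward, with its stress digit removed."""
    # Assumption: a sound is a vowel exactly when it ends in a stress digit.
    for i in range(len(sounds) - 1, -1, -1):
        if sounds[i][-1].isdigit():
            return [sounds[i][:-1]] + list(sounds[i + 1:])
    return list(sounds)  # no vowel at all: fall back to comparing every sound


def is_rhyme(sounds_a, sounds_b):
    """Two words rhyme when their rhyme parts are identical."""
    return rhyme_part(sounds_a) == rhyme_part(sounds_b)


print(is_rhyme(["JH", "IY1"], ["N", "IY1"]))                 # GEE / NEE
print(is_rhyme(["JH", "IY1"], ["JH", "IY1", "P"]))           # GEE / JEEP
print(is_rhyme(["HH", "AE1", "F"], ["G", "R", "AE2", "F"]))  # HALF / GRAPH
```

Keying on the stress digit rather than on the first letter avoids treating a consonant phoneme that happens to start with a vowel letter as a vowel.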

How do I find letters in words that are part of a string and remove them? (List comprehensions with if statements)

I'm trying to remove vowels from a string. Specifically, remove vowels from words that have more than 4 letters.
Here's my thought process:
(1) First, split the string into an array.
(2) Then, loop through the array and identify words that are more than 4 letters.
(3) Third, replace vowels with "".
(4) Lastly, join the array back into a string.
Problem: I don't think the code is looping through the array.
Can anyone find a solution?
def abbreviate_sentence(sent):
    split_string = sent.split()
    for word in split_string:
        if len(word) > 4:
            abbrev = word.replace("a", "").replace("e", "").replace("i", "").replace("o", "").replace("u", "")
    sentence = " ".join(abbrev)
    return sentence
print(abbreviate_sentence("follow the yellow brick road")) # => "fllw the yllw brck road"
I just figured out that the "abbrev = words.replace..." line was incomplete.
I changed it to:
abbrev = [words.replace("a", "").replace("e", "").replace("i", "").replace("o", "").replace("u", "") if len(words) > 4 else words for words in split_string]
I found the part of the solution here: Find and replace string values in list.
It is called a List Comprehension.
I also found List Comprehension with If Statement
The new lines of code look like:
def abbreviate_sentence(sent):
    split_string = sent.split()
    # The comprehension already visits every word, so no outer loop is needed.
    abbrev = [words.replace("a", "").replace("e", "").replace("i", "").replace("o", "").replace("u", "")
              if len(words) > 4 else words for words in split_string]
    sentence = " ".join(abbrev)
    return sentence
print(abbreviate_sentence("follow the yellow brick road")) # => "fllw the yllw brck road"
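A slightly more general variant is sketched below, under assumptions: the `min_len` and `vowels` parameters and the inner `abbreviate` helper are hypothetical additions, introduced so that uppercase vowels are handled and the length threshold is explicit.

```python
def abbreviate_sentence(sent, min_len=5, vowels="aeiouAEIOU"):
    """Drop vowels from every word that has at least `min_len` letters."""
    def abbreviate(word):
        if len(word) < min_len:
            return word  # short words are kept unchanged
        return "".join(ch for ch in word if ch not in vowels)

    return " ".join(abbreviate(word) for word in sent.split())


print(abbreviate_sentence("follow the yellow brick road"))  # fllw the yllw brck road
```

Filtering characters once replaces the chain of five `replace` calls, and an empty input simply yields an empty string.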

Adding specific letters to a string MATLAB

I'm a neuroscience/biomedical engineering major struggling with this whole MATLAB programming ordeal and so far, this website is the best teacher available to me right now. I am currently having trouble with one of my HW problems. What I need to do is take a phrase, find a specific word in it, then take a specific letter in it and increase that letter by the number indicated. In other words:
phrase = 'this homework is so hard'
word = 'so'
letter = 'o'
factor = 5
which should give me 'this homework is sooooo hard'
I got rid of my main error, though I really don't know how. I exited MATLAB, then got back into it. Lo and behold, it magically worked.
function[out1] = textStretch(phrase, word, letter, stretch)
    searchword= strfind(phrase, word);
    searchletter strfind(hotdog, letter); %Looks for the letter in the word
    add = (letter+stretch) %I was hoping this would take the letter and add to it, but that's not what it does
    replace= strrep(phrase, word, add) %This would theoretically take the phrase, find the word and put in the new letter
    out1 = replace
According to the teacher, the ones() function might be useful, and I have to concatenate strings, but if I can just find it in the string and replace it, why do I need to concatenate?
Since this is homework I won't write the whole thing out for you but you were on the right track with strfind.
a = strfind(phrase, word);
b = strfind(word, letter);
What does phrase(1:a) return? What does phrase(a+b:end) return?
Making some assumptions about why your teacher wants you to use ones:
What does num = double('o') return?
What does char(num) return? How about char([num num])?
You can concatenate strings like this:
out = [phrase(1:a),'ooooo',phrase(a+b:end)];
So all you really need to focus on is how to get a string which is letter repeated factor times.
If you wanted to use strrep instead you would need to give it the full word you are searching for and a copy of that word with the repeated letters in:
new_phrase = strrep(phrase, 'so', 'sooooo');
Again, the issue is how to get the 'sooooo' string.
See if this works for you -
phrase_split = regexp(phrase,'\s','Split'); %// Split into words as cells
wordr = cellstr(strrep(word,letter,letter(:,ones(1,factor))));%// Stretched word
phrase_split(strcmp(phrase_split,word)) = wordr;%//Put stretched word into place
out = strjoin(phrase_split) %// Output as the string cells joined together
Note: strjoin needs a recent MATLAB version, which if unavailable could be obtained from here.
Or you can just use a hack obtained from the m-file itself -
out = [repmat(sprintf(['%s', ' '], phrase_split{1:end-1}), ...
1, ~isscalar(phrase_split)), sprintf('%s', phrase_split{end})]
Sample run -
phrase =
this homework is so hard and so boring
word =
so
letter =
o
factor =
5
out =
this homework is sooooo hard and sooooo boring
So, just wrap the code into a function wrapper like this -
function out = textStretch(phrase, word, letter, factor)
Homework molded edit:
phrase = 'this homework is seriously hard'
word = 'seriously'
letter = 'r'
stretch = 6
out = phrase
stretched_word = letter(:,ones(1,stretch))
hotdog = strfind(phrase, word)
hotdog_st = strfind(word,letter)
start_ind = hotdog+hotdog_st-1
out(start_ind+stretch:end+stretch-1) = out(start_ind+1:end)
out(hotdog+hotdog_st-1:hotdog+hotdog_st-1+stretch-1) = stretched_word
Output -
out =
this homework is serrrrrriously hard
Again, use this syntax to convert it to a function -
function out = textStretch(phrase, word, letter, stretch)
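For readers following along outside MATLAB, the split, stretch, and join approach from the first code block of this answer can be sketched in Python. The function name `text_stretch` is hypothetical and this is only an illustration of the idea, not a MATLAB solution for the homework.

```python
def text_stretch(phrase, word, letter, factor):
    """Repeat `letter` `factor` times inside every whole-word match of `word`."""
    # Build the stretched word once, e.g. "so" -> "sooooo" for factor 5.
    stretched = word.replace(letter, letter * factor)
    # Split on single spaces so the original spacing is preserved on join.
    return " ".join(stretched if w == word else w for w in phrase.split(" "))


print(text_stretch("this homework is so hard and so boring", "so", "o", 5))
print(text_stretch("this homework is seriously hard", "seriously", "r", 6))
```

Only exact whole-word matches are replaced, so the "so" inside a longer word such as "also" would be left alone.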
Well Jessica, first of all, this is WRONG, but I am not here to give you the solution. Could you please just use it this way? This will surely run.
function main_script()
    phrase = 'this homework is so hard';
    word = 'so';
    letter = 'o';
    factor = 5;
    [flirty] = textStretchNEW(phrase, word, letter, factor)
end
function [flirty] = textStretchNEW(phrase, word, letter, stretch)
    hotdog = strfind(phrase, word);
    colddog = strfind(hotdog, letter);
    add = letter + stretch;
    hug = strrep(phrase, word, add);
    flirty = hug
end

Finding non-adjacent subsequences in a string

Say I am searching in a string, for a subsequence, where the elements do not necessarily have to be adjacent, but have to occur within N characters. So,
search("abc","aaabbbccc",7) => True
search("abc","aabbcc",3) => False
I am looking for an efficient data structure / algorithm that will perform this comparison. I can think of a few approaches like searching for all valid combos of interior wildcards, like
search("abc",whatever,4) => "abc","a*bc","ab*c"
And using any of the multi-string search algorithms (probably Aho–Corasick), but I'm wondering if there is a better solution.
I have attached a Python code sample that does what you want. It loops through the string to be searched, and whenever the first letter of the search string is found, a substring of length max_len is created and sent to another function. This function simply moves through the substring trying to find all of the search string letters in order. If it finds them all it returns True, otherwise False.
def check_substring(find_me, substr):
    find_index = 0
    for letter in substr:
        if find_me[find_index] == letter:
            find_index += 1
        # if we reach the end of find_me, return true
        if find_index >= len(find_me):
            return True
    return False
def check_string(find_me, look_here, max_len):
    for index in range(len(look_here)):
        if find_me[0] == look_here[index]:
            if check_substring(find_me, look_here[index:index + max_len]):
                return True
    return False
fm = "abc"
lh = "aabbbccceee"
ml = 5
print(check_string(fm, lh, ml))
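The same window idea can also be sketched with `str.find`, which jumps straight to the next candidate character instead of stepping through the window one letter at a time. This is a sketch under assumptions: the name `search` and its argument order are taken from the question's examples, and an empty pattern is treated as always matching.

```python
def search(pattern, text, max_len):
    """True if `pattern` is a subsequence of some window of `max_len` characters."""
    if not pattern:
        return True  # assumption: an empty pattern matches trivially
    start = text.find(pattern[0])
    while start != -1:
        end = start + max_len  # exclusive end of the window starting here
        pos = start
        for ch in pattern[1:]:
            # Greedily take the earliest occurrence that is still in the window.
            pos = text.find(ch, pos + 1, end)
            if pos == -1:
                break
        else:
            return True
        start = text.find(pattern[0], start + 1)
    return False


print(search("abc", "aaabbbccc", 7))
print(search("abc", "aabbcc", 3))
```

Taking the earliest possible match for each character is safe here: if any valid placement exists in a window, the greedy one finishes no later than it does.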

find whether a string is substring of other string in SML NJ

In SML/NJ, I want to find whether a string is a substring of another string and find its index. Can anyone help me with this?
The Substring.position function is the only one I can find in the basis library that seems to do string search. Unfortunately, the Substring module is kind of hard to use, so I wrote the following function to use it. Just pass two strings, and it will return an option: NONE if not found, or SOME of the index if it is found:
fun index (str, substr) = let
    val (pref, suff) = Substring.position substr (Substring.full str)
    val (s, i, n) = Substring.base suff
  in
    if i = size str then
      NONE
    else
      SOME i
  end;
Well, you have all the substring functions; however, if you also want to know the position of the match, the easiest way is to do it yourself with a linear scan.
Basically you want to explode both strings, and then compare the first character of the substring you want to find with each character of the source string, incrementing a position counter each time the comparison fails. When you find a match, you move to the next char in the substring as well, without moving the position counter. If the substring becomes "empty" (modeled by being left with the empty list), you have matched it all and can return the position index. However, if the matching suddenly fails, you have to go back to where you had the first match, skip a letter (incrementing the position counter), and start all over again.
Hope this helps you get started on doing this yourself.
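The linear scan described above can be sketched as follows. The sketch is in Python rather than SML to keep it short, the name `index_of` is hypothetical, and it uses indices where the description uses exploded character lists; `None` plays the role of NONE.

```python
def index_of(text, sub):
    """Return the first index at which `sub` occurs in `text`, or None."""
    # Each `start` is the position counter from the description.
    for start in range(len(text) - len(sub) + 1):
        matched = 0
        # Advance through `sub` while characters keep matching.
        while matched < len(sub) and text[start + matched] == sub[matched]:
            matched += 1
        if matched == len(sub):
            return start  # the whole substring was consumed
        # Otherwise fall through: restart one position further along.
    return None


print(index_of("hello world", "world"))
print(index_of("aaab", "aab"))
print(index_of("abc", "abd"))
```

This is the naive quadratic search; for the short strings typical of such exercises that cost is negligible.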