Trouble With Rhyming Algorithm Scala - scala

I want to write a method that takes two Lists of strings, representing the sounds (phonemes and vowels) of two words, as parameters. The method should determine whether or not the words rhyme based on those sounds.
Definition of a rhyme: two words rhyme if their sounds are the same from the last vowel (inclusive) onward. Words rhyme even if the last vowel sounds have different stress. Only vowels have stress levels (the trailing numbers).
My approach so far is to reverse each list so that the sounds are in reverse order, then collect everything from the start of the list up to the first vowel (inclusive), and finally compare the two lists to see if they are equal. Please keep the code basic; I'm only at an elementary level of Scala and have just finished learning program execution.
Ex1: the words GEE and NEE rhyme because the GEE sounds ("JH", "IY1") become ("IY1", "JH") and the NEE sounds ("N", "IY1") become ("IY1", "N"); since they start with the same vowel, everything after it should not be considered any more.
Ex2: the words GEE and JEEP do not rhyme because the GEE sounds ("JH", "IY1") become ("IY1", "JH") and the JEEP sounds ("JH", "IY1", "P") become ("P", "IY1", "JH"); reversed GEE stops at its first sound, the vowel "IY1", while reversed JEEP yields "P" and "IY1", so the two differ.
Ex3: the words HALF and GRAPH rhyme because the HALF sounds ("HH", "AE1", "F") become ("F", "AE1", "HH") and the GRAPH sounds ("G", "R", "AE2", "F") become ("F", "AE2", "R", "G"); although the first vowels have different stress (numbers), we ignore the stress since the vowels are the same.
def isRhymeSounds(soundList1: List[String], soundList2: List[String]): Boolean = {
  val revSound1 = soundList1.reverse
  val revSound2 = soundList2.reverse
  var revSoundList1: List[String] = List()
  var revSoundList2: List[String] = List()
  for (sound1 <- revSound1) {
    if (sound1.length >= 3) {
      val editVowel1 = sound1.substring(0, 2)
      revSoundList1 = revSoundList1 :+ editVowel1
    }
    else {
      revSoundList1 = revSoundList1 :+ sound1
    }
  }
  for (sound2 <- revSound2) {
    if (sound2.length >= 3) {
      val editVowel2 = sound2.substring(0, 2)
      revSoundList2 = revSoundList2 :+ editVowel2
    }
    else {
      revSoundList2 = revSoundList2 :+ sound2
    }
  }
  if (revSoundList1 == revSoundList2) {
    true
  }
  else {
    false
  }
}

I don't think reversing is necessary.
def isRhyme(sndA: List[String], sndB: List[String]): Boolean = {
  val vowel = "[AEIOUY]+".r
  sndA.foldLeft(""){case (res, s) => vowel.findPrefixOf(s).getOrElse(res+s)} ==
    sndB.foldLeft(""){case (res, s) => vowel.findPrefixOf(s).getOrElse(res+s)}
}
Explanation:
"[AEIOUY]+".r - This is a Regular Expression (that's the .r part) that means "a String of one or more of these characters." In other words, any combination of capital letter vowels.
findPrefixOf() - This returns the first part of the test string that matches the regular expression. So vowel.findPrefixOf("AY2") returns Some("AY") because the first two letters match the regular expression. And vowel.findPrefixOf("OFF") returns Some("O") because only the first letter matches the regular expression. But vowel.findPrefixOf("BAY") returns None because the string does not start with any of the specified characters.
getOrElse() - This unwraps an Option. So Some("AY").getOrElse("X") returns "AY", and Some("O").getOrElse("X") returns "O", but None.getOrElse("X") returns "X" because there's nothing inside a None value so we go with the OrElse default return value.
foldLeft()() - This takes a collection of elements and, starting from the "left", it "folds" them in on each other until a final result is obtained.
So, consider how List("HH", "AE1", "F", "JH", "IY1", "P") would be processed.
res     s     result
=====   ===   ======
""      HH    ""+HH    // s doesn't match the RE, so default to res+s
HH      AE1   AE       // s does match the RE, use only the matching part
AE      F     AE+F     // res+s
AEF     JH    AEF+JH   // res+s
AEFJH   IY1   IY       // only the matching part
IY      P     IY+P     // res+s
final result: "IYP"

Related

Anonymizing first_name, last_name and full_name columns by replacing them with pronounceable English words in a dataframe Spark Scala

I am trying to anonymize production data with human-readable replacements; this will not only mask the actual data but also give each record a callable identity for recognition.
Please help me anonymize dataframe columns such as firstname, lastname and fullname with other pronounceable English words in Scala:
It must convert a real world name into another real world name which is pronounceable and identifiable.
It must be possible to convert first name, last name and full name separately, such that full name = first name and last name separated by a space.
It should produce the same anonymized name for a given name on every iteration.
The target dataset will have more than a million distinct records.
I have tried iterating over a dictionary of nouns and adjectives to reach a combination of two pronounceable words, but it is not going to give me a million distinct combinations.
Code below:
def anonymizeString(s: Option[String]): Option[String] = {
  val AsciiUpperLetters = ('A' to 'Z').toList.filter(_.isLetter)
  val AsciiLowerLetters = ('a' to 'z').toList.filter(_.isLetter)
  val UtfLetters = (128.toChar to 256.toChar).toList.filter(_.isLetter)
  val Numbers = ('0' to '9')
  s match {
    //case None => None
    case _ =>
      val seed = scala.util.hashing.MurmurHash3.stringHash(s.get).abs
      val random = new scala.util.Random(seed)
      var r = ""
      for (c <- s.get) {
        if (Numbers.contains(c)) {
          r = r + (((random.nextInt.abs + c) % Numbers.size))
        } else if (AsciiUpperLetters.contains(c)) {
          r = r + AsciiUpperLetters(((random.nextInt.abs) % AsciiUpperLetters.size))
        } else if (AsciiLowerLetters.contains(c)) {
          r = r + AsciiLowerLetters(((random.nextInt.abs) % AsciiLowerLetters.size))
        } else if (UtfLetters.contains(c)) {
          r = r + UtfLetters(((random.nextInt.abs) % UtfLetters.size))
        } else {
          r = r + c
        }
      }
      Some(r)
  }
}
"it is not going to give me a million distinct combinations"
I am not sure why you say that. I just checked /usr/share/dict/words on my Mac, and it has 234,371 words. That allows for almost 55 billion combinations of two words.
So, just hash your string to an Int, take it modulo 234,371, and map to the respective entry from the dictionary.
Granted, some words in the dictionary don't look too much like names (though still much better than what you are doing at random) - e.g. "A" ... but even if you require the word to contain at least 5 characters, you'd have 227,918 words left – still more than enough.
Also please don't use "naked get" on Option ... It hurts my aesthetic feelings so much :(
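class Anonymizer(dict: IndexedSeq[String]) {
  def anonymize(s: Option[String]): Option[String] = s
    // hashCode can be negative, and % would then give a negative index;
    // floorMod always lands in 0 until dict.size
    .map(name => Math.floorMod(name.hashCode, dict.size))
    .map(dict)
}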

Scala program to replace words in alphabetical order within a string

I am learning Scala and have been trying to write a program that rearranges the characters of each word within a string into alphabetical order. For example, if the string is "Where are you", the program should change it to "Eehrw aer ouy". I searched on Google and found some examples, but I am not able to write an error-free program. I think I am far from having a working program.
def main(args:Array[String]){
val r = "Where are you"
val newstr = r.map(x=>(x,_) match {
case ' ' = ' '
case y => {
val newchar = (x.toByte).toChar
if newchar.toByte.toChar > (newchar + 1).toByte.toChar
x = newchar
else
x
}
})
}
The tricky part is restoring the original capitalization. Add punctuation to the mix and it turns into a fun little challenge.
val str = "Where, aRe yoU?"
val sortedLowerCase = str.toLowerCase.split("(?=\\W)").map(_.sorted).mkString
val capsAt = str.indices.filter(str(_).isUpper)
capsAt.foldLeft(sortedLowerCase)((s,x) => s.patch(x,Seq(s(x).toUpper),1))
// res0: String = Eehrw, aEr ouY?
Time spent studying the Standard Library will be richly rewarded.
r.split(" ").map(word => word.toLowerCase.sorted)
To keep the capital letters, instead of .toLowerCase.sorted, use .sortWith and implement the sort comparison function according to how you want to sort characters.
Let me expand on Ren's answer:
compare based on lowercase and then capitalize only if there's an uppercase letter
r.split(" ").map(word => word.sortWith(_.toLower < _.toLower))
.map(x => if (x.exists(_.isUpper)) x.toLowerCase.capitalize else x )

How to strip everything except digits from a string in Scala (quick one liners)

This is driving me nuts... there must be a way to strip out all non-digit characters (or perform other simple filtering) in a String.
Example: I want to turn a phone number ("+72 (93) 2342-7772" or "+1 310-777-2341") into a simple numeric String (not an Int), such as "729323427772" or "13107772341".
I tried "[\\d]+".r.findAllIn(phoneNumber), which returns an iterator of matches, and then I would have to recombine them into a String somehow... seems horribly wasteful.
I also came up with: phoneNumber.filter("0123456789".contains(_)) but that becomes tedious for other situations. For instance, removing all punctuation... I'm really after something that works with a regular expression so it has wider application than just filtering out digits.
Anyone have a fancy Scala one-liner for this that is more direct?
You can use filter, treating the string as a character sequence and testing the character with isDigit:
"+72 (93) 2342-7772".filter(_.isDigit) // res0: String = 729323427772
You can use replaceAll and Regex.
"+72 (93) 2342-7772".replaceAll("[^0-9]", "") // res1: String = 729323427772
Another approach, define the collection of valid characters, in this case
val d = '0' to '9'
and so for val a = "+72 (93) 2342-7772", filter on collection inclusion for instance with either of these,
for (c <- a if d.contains(c)) yield c
a.filter(d.contains)
a.collect{ case c if d.contains(c) => c }

Finding non-adjacent subsequences in a string

Say I am searching in a string, for a subsequence, where the elements do not necessarily have to be adjacent, but have to occur within N characters. So,
search("abc","aaabbbccc",7) => True
search("abc","aabbcc",3) => False
I am looking for an efficient data structure / algorithm that will perform this comparison. I can think of a few approaches like searching for all valid combos of interior wildcards, like
search("abc",whatever,4) => "abc","a*bc","ab*c"
And using any of the multi-string search algorithms (probably Aho–Corasick), but I'm wondering if there is a better solution.
I have attached a python code sample that does what you want. It loops through the string to be searched and if the first letter of search string is found, a substring of length=max_length is created and sent to another function. This function simply moves through the substring trying to find all of the search string letters in order. If it finds them all then it returns True, otherwise False.
def check_substring(find_me, substr):
    find_index = 0
    for letter in substr:
        if find_me[find_index] == letter:
            find_index += 1
            # if we reach the end of find_me, return true
            if find_index >= len(find_me):
                return True
    return False

def check_string(find_me, look_here, max_len):
    for index in range(len(look_here)):
        if find_me[0] == look_here[index]:
            if check_substring(find_me, look_here[index:index + max_len]):
                return True
    return False

fm = "abc"
lh = "aabbbccceee"
ml = 5
print(check_string(fm, lh, ml))
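As a variation on the answer above, here is a minimal sketch that computes the length of the shortest slice containing the pattern as a subsequence, so one pass answers the question for any N. The names shortest_window and search are my own, not from the question, and this is still the simple O(len(text) * len(text)) worst-case scan, not an optimized algorithm.

```python
def shortest_window(pattern, text):
    """Length of the shortest slice of text that contains pattern as a
    (not necessarily adjacent) subsequence, or None if there is none."""
    if not pattern:
        return 0
    best = None
    for start, ch in enumerate(text):
        if ch != pattern[0]:
            continue
        matched = 0
        # Greedy scan: taking each pattern letter as early as possible
        # gives the earliest possible end for this start position.
        for end in range(start, len(text)):
            if text[end] == pattern[matched]:
                matched += 1
                if matched == len(pattern):
                    length = end - start + 1
                    if best is None or length < best:
                        best = length
                    break
    return best


def search(pattern, text, max_len):
    """True if pattern fits, in order, inside some max_len characters of text."""
    window = shortest_window(pattern, text)
    return window is not None and window <= max_len


print(search("abc", "aaabbbccc", 7))  # the first example from the question
print(search("abc", "aabbcc", 3))     # the second example from the question
```

With the question's two examples this prints True and then False.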

Algorithm to get a list of all words that are anagrams of all substrings (scrabble)?

Eg if input string is helloworld I want the output to be like:
do
he
we
low
hell
hold
roll
well
word
hello
lower
world
...
all the way up to the longest word that is an anagram of a substring of helloworld. Like in Scrabble for example.
The input string can be any length, but rarely more than 16 chars.
I've done a search and come up with structures like a trie, but I am still unsure of how to actually do this.
The structure used to hold your dictionary of valid entries will have a huge impact on efficiency. Organize it as a tree, root being the singular zero letter "word", the empty string. Each child of root is a single first letter of a possible word, children of those being the second letter of a possible word, etc., with each node marked as to whether it actually forms a word or not.
Your tester function will be recursive. It starts with zero letters, finds from the tree of valid entries that "" isn't a word but it does have children, so you call your tester recursively with your start word (of no letters) appended with each available remaining letter from your input string (which is all of them at that point). Check each one-letter entry in tree, if valid make note; if children, re-call tester function appending each of remaining available letters, and so on.
So for example, if your input string is "helloworld", you're going to first call your recursive tester function with "", passing the remaining available letters "helloworld" as a 2nd parameter. Function sees that "" isn't a word, but child "h" does exist. So it calls itself with "h", and "elloworld". Function sees that "h" isn't a word, but child "e" exists. So it calls itself with "he" and "lloworld". Function sees that "e" is marked, so "he" is a word, take note. Further, child "l" exists, so next call is "hel" with "loworld". It will next find "hell", then "hello", then will have to back out and probably next find "hollow", before backing all the way out to the empty string again and then starting with "e" words next.
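The tree walk described above can be sketched roughly as follows in Python. This is only an illustration under my own assumptions: the tree is a nested dict, the empty-string key marks "this node completes a word", and build_trie and find_words are hypothetical names.

```python
END = ""  # marker key; never collides with a real letter, which is always 1 char


def build_trie(words):
    """Nested-dict tree: each key is the next letter of a possible word."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[END] = True  # mark that the path to this node spells a word
    return root


def find_words(trie, letters):
    """All dictionary words that can be spelled from the available letters."""
    found = set()

    def walk(node, prefix, remaining):
        if END in node:
            found.add(prefix)
        # Try each distinct remaining letter that has a child in the tree.
        for letter in set(remaining):
            child = node.get(letter)
            if child is not None:
                walk(child, prefix + letter, remaining.replace(letter, "", 1))

    walk(trie, "", letters)
    return found


trie = build_trie(["he", "hell", "hello", "help", "hollow", "world"])
print(sorted(find_words(trie, "helloworld"), key=lambda w: (len(w), w)))
```

Branches with no matching child are never visited, which is what keeps this far cheaper than generating every permutation of the input letters.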
I couldn't resist my own implementation. It creates a dictionary by sorting the letters of each word alphabetically and mapping the sorted letters to the words that can be created from them. This is an O(n) start-up operation that eliminates the need to find all permutations. You could implement the dictionary as a trie in another language to get further speedups.
The "getAnagrams" command is also an O(n) operation, which checks each key in the dictionary to see if its letters are a subset of the search word's letters. Doing getAnagrams("radiotelegraphically") (a 20-letter word) took approximately 1 second on my laptop, and returned 1496 anagrams.
# Using the 38617 word dictionary at
# http://www.cs.umd.edu/class/fall2008/cmsc433/p5/Usr.Dict.Words.txt
# Usage: getAnagrams("helloworld")
def containsLetters(subword, word):
    wordlen = len(word)
    subwordlen = len(subword)
    if subwordlen > wordlen:
        return False
    word = list(word)
    for c in subword:
        try:
            index = word.index(c)
        except ValueError:
            return False
        word.pop(index)
    return True

def getAnagrams(word):
    output = []
    # iterating a dict yields its keys (works in Python 2 and 3)
    for key in mydict:
        if containsLetters(key, word):
            output.extend(mydict[key])
    output.sort(key=len)
    return output

f = open("dict.txt")
wordlist = f.readlines()
f.close()
# map each sorted-letter signature to the words that share it
mydict = {}
for word in wordlist:
    word = word.rstrip()
    temp = list(word)
    temp.sort()
    letters = ''.join(temp)
    if letters in mydict:
        mydict[letters].append(word)
    else:
        mydict[letters] = [word]
An example run:
>>> getAnagrams("helloworld")
['do', 'he', 'we', 're', 'oh', 'or', 'row', 'hew', 'her', 'hoe', 'woo', 'red', 'dew', 'led', 'doe', 'ode', 'low', 'owl', 'rod', 'old', 'how', 'who', 'rho', 'ore', 'roe', 'owe', 'woe', 'hero', 'wood', 'door', 'odor', 'hold', 'well', 'owed', 'dell', 'dole', 'lewd', 'weld', 'doer', 'redo', 'rode', 'howl', 'hole', 'hell', 'drew', 'word', 'roll', 'wore', 'wool', 'herd', 'held', 'lore', 'role', 'lord', 'doll', 'hood', 'whore', 'rowed', 'wooed', 'whorl', 'world', 'older', 'dowel', 'horde', 'droll', 'drool', 'dwell', 'holed', 'lower', 'hello', 'wooer', 'rodeo', 'whole', 'hollow', 'howler', 'rolled', 'howled', 'holder', 'hollowed']
The data structure you want is called a Directed Acyclic Word Graph (DAWG), and it is described by Andrew Appel and Guy Jacobson in their paper "The World's Fastest Scrabble Program", which unfortunately they have chosen not to make available free online. An ACM membership or a university library will get it for you.
I have implemented this data structure in at least two languages---it is simple, easy to implement, and very, very fast.
A simple-minded approach is to generate all the "substrings" and, for each of them, check whether it's an element of the set of acceptable words. E.g., in Python 2.6:
import itertools
import urllib

def words():
    f = urllib.urlopen(
        'http://www.cs.umd.edu/class/fall2008/cmsc433/p5/Usr.Dict.Words.txt')
    allwords = set(w[:-1] for w in f)
    f.close()
    return allwords

def substrings(s):
    for i in range(2, len(s)+1):
        for p in itertools.permutations(s, i):
            yield ''.join(p)

def main():
    w = words()
    print '%d words' % len(w)
    ss = set(substrings('weep'))
    print '%d substrings' % len(ss)
    good = ss & w
    print '%d good ones' % len(good)
    sgood = sorted(good, key=lambda w:(len(w), w))
    for aword in sgood:
        print aword

main()
will emit:
38617 words
31 substrings
5 good ones
we
ewe
pew
wee
weep
Of course, as other responses pointed out, organizing your data purposefully can greatly speed-up your runtime -- although the best data organization for a fast anagram finder could well be different... but that will largely depend on the nature of your dictionary of allowed words (a few tens of thousands, like here -- or millions?). Hash-maps and "signatures" (based on sorting the letters in each word) should be considered, as well as tries &c.
What you want is an implementation of a power set.
Also look at Eric Lippert's blog; he blogged about this very thing a little while back.
EDIT:
Here is an implementation I wrote of getting the powerset from a given string...
private IEnumerable<string> GetPowerSet(string letters)
{
    char[] letterArray = letters.ToCharArray();
    for (int i = 0; i < Math.Pow(2.0, letterArray.Length); i++)
    {
        StringBuilder sb = new StringBuilder();
        for (int j = 0; j < letterArray.Length; j++)
        {
            int pos = Convert.ToInt32(Math.Pow(2.0, j));
            if ((pos & i) == pos)
            {
                sb.Append(letterArray[j]);
            }
        }
        yield return new string(sb.ToString().ToCharArray().OrderBy(c => c).ToArray());
    }
}
This function gives me the powersets of chars that make up the passed in string, I then can use these as keys into a dictionary of anagrams...
Dictionary<string,IEnumerable<string>>
I created my dictionary of anagrams like so... (there are probably more efficient ways, but this was simple and plenty quick enough with the scrabble tournament word list)
wordlist = (from s in fileText.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries)
            let k = new string(s.ToCharArray().OrderBy(c => c).ToArray())
            group s by k).ToDictionary(o => o.Key, sl => sl.Select(a => a));
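For comparison, the same power-set-plus-signature idea can be sketched in Python. This is my own rough translation, not the code above: build_signature_index and find_anagrams are hypothetical names, and the tiny word list is made up for the demo.

```python
from itertools import combinations


def build_signature_index(words):
    """Map each sorted-letter signature to the words that share it."""
    index = {}
    for word in words:
        index.setdefault("".join(sorted(word)), []).append(word)
    return index


def find_anagrams(letters, index):
    """Look up every non-empty subset of the letters in the signature index."""
    pool = sorted(letters)
    signatures = set()
    for size in range(1, len(pool) + 1):
        # combinations of a sorted pool are themselves in sorted order,
        # so each one is already a valid signature
        for combo in combinations(pool, size):
            signatures.add("".join(combo))
    found = []
    for sig in signatures:
        found.extend(index.get(sig, []))
    return sorted(found, key=lambda w: (len(w), w))


index = build_signature_index(["do", "we", "ewe", "pew", "wee", "weep"])
print(find_anagrams("weep", index))
```

A 16-letter input has at most 2^16 = 65536 subsets, so the lookups stay cheap at the sizes mentioned in the question; the cost doubles with every extra letter, though.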
Like Tim J, Eric Lippert's blog posts were the first thing to come to my mind. I wanted to add that he wrote a follow-up about ways to improve the performance of his first attempt.
A nasality talisman for the sultana analyst
Santalic tailfans, part two
I believe the Ruby code in the answers to this question will also solve your problem.
I've been playing a lot of Wordfeud on my phone recently and was curious if I could come up with some code to give me a list of possible words. The following code takes your available source letters (* for wildcards) and an array with a master list of allowable words (TWL, SOWPODS, etc.) and generates a list of matches. It does this by trying to build each word in the master list from your source letters.
I found this topic after writing my code, and it's definitely not as efficient as John Pirie's method or the DAWG algorithm, but it's still pretty quick.
public IList<string> Matches(string sourceLetters, string [] wordList)
{
    sourceLetters = sourceLetters.ToUpper();
    IList<string> matches = new List<string>();
    foreach (string word in wordList)
    {
        if (WordCanBeBuiltFromSourceLetters(word, sourceLetters))
            matches.Add(word);
    }
    return matches;
}

public bool WordCanBeBuiltFromSourceLetters(string targetWord, string sourceLetters)
{
    string builtWord = "";
    foreach (char letter in targetWord)
    {
        int pos = sourceLetters.IndexOf(letter);
        if (pos >= 0)
        {
            builtWord += letter;
            sourceLetters = sourceLetters.Remove(pos, 1);
            continue;
        }
        // check for wildcard
        pos = sourceLetters.IndexOf("*");
        if (pos >= 0)
        {
            builtWord += letter;
            sourceLetters = sourceLetters.Remove(pos, 1);
        }
    }
    return string.Equals(builtWord, targetWord);
}
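The same build-from-letters check can also be expressed with letter counts instead of string removal. Below is a minimal Python sketch under my own assumptions: can_build and matches are hypothetical names, and unlike the C# above it upper-cases the candidate word as well as the rack.

```python
from collections import Counter


def can_build(word, rack):
    """True if word can be spelled from the rack, where '*' is a wildcard."""
    available = Counter(rack.upper())
    wildcards = available.pop("*", 0)
    needed = Counter(word.upper())
    # Counter subtraction keeps only positive counts, i.e. the letters
    # the rack cannot supply on its own.
    missing = sum((needed - available).values())
    return missing <= wildcards


def matches(rack, word_list):
    """Filter the master word list down to words buildable from the rack."""
    return [word for word in word_list if can_build(word, rack)]


print(matches("W*RD", ["WORD", "WARD", "WORLD", "DREW"]))
```

Each word costs time proportional to its length, so scanning a full tournament list remains a linear pass like the original.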