Anonymizing first_name, last_name and full_name columns by replacing them with pronounceable English words in a Spark Scala dataframe

I am trying to anonymize production data with human-readable replacements - this will not only mask the actual data but also give each record a recognizable, pronounceable identity.
Please help me anonymize dataframe columns like firstname, lastname and fullname with other pronounceable English words in Scala:
It must convert a real-world name into another real-world name which is pronounceable and identifiable.
It must be possible to convert first name, last name and full name separately, such that full name = first name and last name separated by a space.
It should produce the same anonymized name for a given name on every iteration.
The target dataset will have more than a million distinct records.
I have tried iterating over a dictionary of nouns and adjectives to reach a combination of two pronounceable words, but it is not going to give me a million distinct combinations.
Code below:
def anonymizeString(s: Option[String]): Option[String] = {
  val AsciiUpperLetters = ('A' to 'Z').toList.filter(_.isLetter)
  val AsciiLowerLetters = ('a' to 'z').toList.filter(_.isLetter)
  val UtfLetters = (128.toChar to 256.toChar).toList.filter(_.isLetter)
  val Numbers = ('0' to '9')
  s match {
    //case None => None
    case _ =>
      val seed = scala.util.hashing.MurmurHash3.stringHash(s.get).abs
      val random = new scala.util.Random(seed)
      var r = ""
      for (c <- s.get) {
        if (Numbers.contains(c)) {
          r = r + ((random.nextInt.abs + c) % Numbers.size)
        } else if (AsciiUpperLetters.contains(c)) {
          r = r + AsciiUpperLetters(random.nextInt.abs % AsciiUpperLetters.size)
        } else if (AsciiLowerLetters.contains(c)) {
          r = r + AsciiLowerLetters(random.nextInt.abs % AsciiLowerLetters.size)
        } else if (UtfLetters.contains(c)) {
          r = r + UtfLetters(random.nextInt.abs % UtfLetters.size)
        } else {
          r = r + c
        }
      }
      Some(r)
  }
}

"it is not going to give me a million distinct combinations"
I am not sure why you say that. I just checked /usr/share/dict/words on my Mac, and it has 234,371 words. That allows for almost 55 billion combinations of two words.
So, just hash your string to an Int, take it modulo 234,371, and map to the respective entry from the dictionary.
Granted, some words in the dictionary don't look too much like names (though still much better than what you are doing at random) - e.g. "A" ... but even if you require the word to contain at least 5 characters, you'd have 227,918 words left – still more than enough.
Also please don't use "naked get" on Option ... It hurts my aesthetic feelings so much :(
class Anonymizer(dict: IndexedSeq[String]) {
  // floorMod keeps the index non-negative even when hashCode is negative
  def anonymize(s: Option[String]): Option[String] = s
    .map(x => Math.floorMod(x.hashCode, dict.size))
    .map(dict)
}
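For completeness, here is a rough sketch of how this idea could be wired into a Spark DataFrame. The word file path, the five-letter filter and the column names (first_name, last_name) are assumptions based on the discussion above, not a tested production recipe:
import org.apache.spark.sql.functions.{col, concat_ws, udf}
import scala.util.hashing.MurmurHash3

// Load a dictionary once on the driver; the captured IndexedSeq is shipped
// with the UDF closure to the executors.
val words: IndexedSeq[String] =
  scala.io.Source.fromFile("/usr/share/dict/words").getLines()
    .filter(_.length >= 5)
    .toIndexedSeq

// Deterministic: the same input name always hashes to the same dictionary word.
def fakeWord(name: String): String =
  words(Math.floorMod(MurmurHash3.stringHash(name.toLowerCase), words.size)).capitalize

val fakeName = udf((name: String) => Option(name).map(fakeWord).orNull)

val anonymized = df
  .withColumn("anon_first", fakeName(col("first_name")))
  .withColumn("anon_last", fakeName(col("last_name")))
  .withColumn("anon_full", concat_ws(" ", col("anon_first"), col("anon_last")))
Because the full name is rebuilt from the two anonymized parts, the requirement that full name = first name + " " + last name holds by construction.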

Related

Trouble With Rhyming Algorithm Scala

I want to make a method that takes two Lists of strings representing the sounds (phonemes and vowels) of two words as parameters. The purpose of this method is to determine whether or not the words rhyme based on the two sound lists.
Definition of a rhyme: the words rhyme if everything from the last vowel (inclusive) onward is the same. Words still rhyme even if the last vowel sounds have different stress. Only vowels have stress levels (numbers).
My approach so far is to reverse the lists so that the sounds are in reverse order and then add everything from the start of the list up to the first vowel (inclusive). Then I compare the two lists to see if they are equal. Please keep the code basic; I'm only at an elementary level of Scala and have just finished learning program execution.
Ex1: the two words GEE and NEE rhyme because the GEE sounds ("JH","IY1") become ("IY1","JH") and the NEE sounds ("N","IY1") become ("IY1","N"); since they have the same last vowel, everything after it should not be considered any more.
Ex2: the two words GEE and JEEP do not rhyme because the GEE sounds ("JH","IY1") become ("IY1","JH") and the JEEP sounds ("JH","IY1","P") become ("P","IY1","JH"); the first sound in reversed GEE is a vowel, but it gets compared to "P" in reversed JEEP.
Ex3: the two words HALF and GRAPH rhyme because the HALF sounds ("HH","AE1","F") become ("F","AE1","HH") and the GRAPH sounds ("G","R","AE2","F") become ("F","AE2","R","G"); in this case, although the last vowels have different stress (numbers), we ignore the stress since the vowels are the same.
def isRhymeSounds(soundList1: List[String], soundList2: List[String]): Boolean = {
  val revSound1 = soundList1.reverse
  val revSound2 = soundList2.reverse
  var revSoundList1: List[String] = List()
  var revSoundList2: List[String] = List()
  for (sound1 <- revSound1) {
    if (sound1.length >= 3) {
      val editVowel1 = sound1.substring(0, 2)
      revSoundList1 = revSoundList1 :+ editVowel1
    }
    else {
      revSoundList1 = revSoundList1 :+ sound1
    }
  }
  for (sound2 <- revSound2) {
    if (sound2.length >= 3) {
      val editVowel2 = sound2.substring(0, 2)
      revSoundList2 = revSoundList2 :+ editVowel2
    }
    else {
      revSoundList2 = revSoundList2 :+ sound2
    }
  }
  if (revSoundList1 == revSoundList2) {
    true
  }
  else {
    false
  }
}
I don't think reversing is necessary.
def isRhyme(sndA: List[String], sndB: List[String]): Boolean = {
  val vowel = "[AEIOUY]+".r
  sndA.foldLeft("") { case (res, s) => vowel.findPrefixOf(s).getOrElse(res + s) } ==
    sndB.foldLeft("") { case (res, s) => vowel.findPrefixOf(s).getOrElse(res + s) }
}
Explanation:
"[AEIOUY]+".r - This is a Regular Expression (that's the .r part) that means "a String of one or more of these characters." In other words, any combination of capital letter vowels.
findPrefixOf() - This returns the first part of the test string that matches the regular expression. So vowel.findPrefixOf("AY2") returns Some("AY") because the first two letters match the regular expression. And vowel.findPrefixOf("OFF") returns Some("O") because only the first letter matches the regular expression. But vowel.findPrefixOf("BAY") returns None because the string does not start with any of the specified characters.
getOrElse() - This unwraps an Option. So Some("AY").getOrElse("X") returns "AY", and Some("O").getOrElse("X") returns "O", but None.getOrElse("X") returns "X" because there's nothing inside a None value so we go with the OrElse default return value.
foldLeft()() - This takes a collection of elements and, starting from the "left", it "folds" them in on each other until a final result is obtained.
So, consider how List("HH", "AE1", "F", "JH", "IY1", "P") would be processed.
res    s    result
===    ===  ======
""     HH   ""+HH    // s doesn't match the RE, so default res+s
HH     AE1  AE       // s matches the RE, use only the matching part
AE     F    AE+F     // res+s
AEF    JH   AEF+JH   // res+s
AEFJH  IY1  IY       // only the matching part
IY     P    IY+P     // res+s

final result: "IYP"
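For reference, running isRhyme on the three examples from the question (phoneme lists taken directly from the question) gives the expected answers:
isRhyme(List("JH", "IY1"), List("N", "IY1"))                // true  (GEE / NEE)
isRhyme(List("JH", "IY1"), List("JH", "IY1", "P"))          // false (GEE / JEEP)
isRhyme(List("HH", "AE1", "F"), List("G", "R", "AE2", "F")) // true  (HALF / GRAPH)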

Scala program to rearrange the letters of each word within a string into alphabetical order

I am learning Scala and have been trying to create a program which should rearrange the characters of each word within a string into alphabetical order. For example, for the string "Where are you" the program should produce "Eehrw aer ouy". I searched Google and found some examples, but I am not able to write an error-free program. I think I am far from having a working program.
def main(args: Array[String]) {
  val r = "Where are you"
  val newstr = r.map(x => (x, _) match {
    case ' ' = ' '
    case y => {
      val newchar = (x.toByte).toChar
      if newchar.toByte.toChar > (newchar + 1).toByte.toChar
        x = newchar
      else
        x
    }
  })
}
The tricky part is restoring the original capitalization. Add punctuation to the mix and it turns into a fun little challenge.
val str = "Where, aRe yoU?"
val sortedLowerCase = str.toLowerCase.split("(?=\\W)").map(_.sorted).mkString
val capsAt = str.indices.filter(str(_).isUpper)
capsAt.foldLeft(sortedLowerCase)((s,x) => s.patch(x,Seq(s(x).toUpper),1))
// res0: String = Eehrw, aEr ouY?
Time spent studying the Standard Library will be richly rewarded.
r.split(" ").map(word => word.toLowerCase.sorted)
To keep the capital letters, instead of .toLowerCase.sorted, use .sortWith and implement the comparison function according to how you want to order the characters.
Let me expand on Ren's answer:
compare based on lowercase and then capitalize only if there's an uppercase letter
r.split(" ").map(word => word.sortWith(_.toLower < _.toLower))
.map(x => if (x.exists(_.isUpper)) x.toLowerCase.capitalize else x )
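Putting the two steps together on the string from the question, as a quick usage check:
val r = "Where are you"
r.split(" ")
  .map(word => word.sortWith(_.toLower < _.toLower))
  .map(x => if (x.exists(_.isUpper)) x.toLowerCase.capitalize else x)
  .mkString(" ")
// "Eehrw aer ouy"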

How to split a Scala string with a regular expression

I came up with a pattern like
val pattern = "(\\w+)\\|(.*)\\|\\[(.*)\\]\\|\"(.*)\"\\|\"(.*)\"\\|\\[(.*)\\]\\|\\[(.*)\\]\\|(.*)\\|\\[(.*)\\]\\|\\[(.*)\\]".r
and I have an original string
var str = """AuthLogout|vmlxapp21a|[13/Jan/2016:16:33:15 +0100]|"66.77.444.44 uid=XXXXX,ou=People,o=Bank,o=External,dc=xxxx,dc=com"|"abcd_123_portalweb_w "|[]|[41]||[]|[]"""
Then I apply the pattern to the string, but the result is always empty.
val items = pattern.findAllIn(str).toList
If I understand what you're trying to do, perhaps using a giant regex isn't the easiest way: You can split by | and get rid of the unwanted separators ([, ], ") using replaceAll:
val str = """AuthLogout|vmlxapp21a|[13/Jan/2016:16:33:15 +0100]|"66.77.444.44 uid=XXXXX,ou=People,o=Bank,o=External,dc=xxxx,dc=com"|"abcd_123_portalweb_w "|[]|[41]||[]|[]"""
val withoutBoundaries = str.replaceAll("[\"\\]\\[]","")
val result = withoutBoundaries.split("\\|")
result.foreach(println)
Which prints:
AuthLogout
vmlxapp21a
13/Jan/2016:16:33:15 +0100
66.77.444.44 uid=XXXXX,ou=People,o=Bank,o=External,dc=xxxx,dc=com
abcd_123_portalweb_w
41
If you do want to use a regex here, I'd create sub-regex vars representing the different text parts that you're after, to make this somewhat manageable:
val plain = "(.*)" // no boundary characters
val boxed = s"\\[$plain\\]" // same, encapsulated by square brackets
val quoted = '"' + plain + '"' // same, encapsulated by double quotes
// the whole thing, separated by pipes:
val r = s"$plain\\|$plain\\|$boxed\\|$quoted\\|$quoted\\|$boxed\\|$boxed\\|$plain\\|$boxed\\|$boxed".r
val result = r.findAllIn(str).toList // this list has one item, as expected.
Now, if you want to see what this regex looks like, here it is - but I don't recommend having this in your code...:
val r = """(.*)\|(.*)\|\[(.*)\]\|"(.*)"\|"(.*)"\|\[(.*)\]\|\[(.*)\]\|(.*)\|\[(.*)\]\|\[(.*)\]""".r

Algorithm to get a list of all words that are anagrams of all substrings (scrabble)?

E.g. if the input string is helloworld, I want the output to be like:
do
he
we
low
hell
hold
roll
well
word
hello
lower
world
...
all the way up to the longest word that is an anagram of a substring of helloworld. Like in Scrabble for example.
The input string can be any length, but rarely more than 16 chars.
I've done a search and come up with structures like a trie, but I am still unsure of how to actually do this.
The structure used to hold your dictionary of valid entries will have a huge impact on efficiency. Organize it as a tree, root being the singular zero letter "word", the empty string. Each child of root is a single first letter of a possible word, children of those being the second letter of a possible word, etc., with each node marked as to whether it actually forms a word or not.
Your tester function will be recursive. It starts with zero letters, finds from the tree of valid entries that "" isn't a word but does have children, so it calls itself recursively with your start word (of no letters) appended with each available remaining letter from your input string (which is all of them at that point). It checks each one-letter entry in the tree; if the entry is a valid word, it makes a note of it; if the entry has children, it calls the tester again, appending each of the remaining available letters, and so on.
So for example, if your input string is "helloworld", you're going to first call your recursive tester function with "", passing the remaining available letters "helloworld" as a 2nd parameter. Function sees that "" isn't a word, but child "h" does exist. So it calls itself with "h", and "elloworld". Function sees that "h" isn't a word, but child "e" exists. So it calls itself with "he" and "lloworld". Function sees that "e" is marked, so "he" is a word, take note. Further, child "l" exists, so next call is "hel" with "loworld". It will next find "hell", then "hello", then will have to back out and probably next find "hollow", before backing all the way out to the empty string again and then starting with "e" words next.
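To make the walk concrete, here is a rough Scala sketch of that recursive tester over a simple prefix tree. The TrieNode class and findWords function are illustrative names, and the word list in the usage comment is deliberately tiny:
// A minimal prefix-tree node: children keyed by letter, plus a flag marking
// whether the path down to this node spells a real word.
final case class TrieNode(children: Map[Char, TrieNode] = Map.empty, isWord: Boolean = false) {
  def insert(word: String): TrieNode =
    if (word.isEmpty) copy(isWord = true)
    else {
      val child = children.getOrElse(word.head, TrieNode())
      copy(children = children + (word.head -> child.insert(word.tail)))
    }
}

// Recursive tester: at each node, note the prefix if it is a real word, then
// try every distinct remaining letter that has a matching child.
def findWords(node: TrieNode, prefix: String, remaining: List[Char]): Set[String] = {
  val here = if (node.isWord && prefix.nonEmpty) Set(prefix) else Set.empty[String]
  here ++ remaining.distinct.flatMap { c =>
    node.children.get(c).toList.flatMap { child =>
      findWords(child, prefix + c, remaining.diff(List(c)))
    }
  }
}

// Usage sketch:
// val root = List("he", "hell", "hello", "low", "world").foldLeft(TrieNode())(_ insert _)
// findWords(root, "", "helloworld".toList)  // Set(he, hell, hello, low, world)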
I couldn't resist my own implementation. It creates a dictionary by sorting all the letters alphabetically, and mapping them to the words that can be created from them. This is an O(n) start-up operation that eliminates the need to find all permutations. You could implement the dictionary as a trie in another language to attain faster speedups.
The "getAnagrams" command is also an O(n) operation which searches each word in the dictionary to see if it is a subset of the search. Doing getAnagrams("radiotelegraphically")" (a 20 letter word) took approximately 1 second on my laptop, and returned 1496 anagrams.
# Using the 38617 word dictionary at
# http://www.cs.umd.edu/class/fall2008/cmsc433/p5/Usr.Dict.Words.txt
# Usage: getAnagrams("helloworld")
def containsLetters(subword, word):
    wordlen = len(word)
    subwordlen = len(subword)
    if subwordlen > wordlen:
        return False
    word = list(word)
    for c in subword:
        try:
            index = word.index(c)
        except ValueError:
            return False
        word.pop(index)
    return True

def getAnagrams(word):
    output = []
    for key in mydict.iterkeys():
        if containsLetters(key, word):
            output.extend(mydict[key])
    output.sort(key=len)
    return output

f = open("dict.txt")
wordlist = f.readlines()
f.close()

mydict = {}
for word in wordlist:
    word = word.rstrip()
    temp = list(word)
    temp.sort()
    letters = ''.join(temp)
    if letters in mydict:
        mydict[letters].append(word)
    else:
        mydict[letters] = [word]
An example run:
>>> getAnagrams("helloworld")
>>> ['do', 'he', 'we', 're', 'oh', 'or', 'row', 'hew', 'her', 'hoe', 'woo', 'red', 'dew', 'led', 'doe', 'ode', 'low', 'owl', 'rod', 'old', 'how', 'who', 'rho', 'ore', 'roe', 'owe', 'woe', 'hero', 'wood', 'door', 'odor', 'hold', 'well', 'owed', 'dell', 'dole', 'lewd', 'weld', 'doer', 'redo', 'rode', 'howl', 'hole', 'hell', 'drew', 'word', 'roll', 'wore', 'wool','herd', 'held', 'lore', 'role', 'lord', 'doll', 'hood', 'whore', 'rowed', 'wooed', 'whorl', 'world', 'older', 'dowel', 'horde', 'droll', 'drool', 'dwell', 'holed', 'lower', 'hello', 'wooer', 'rodeo', 'whole', 'hollow', 'howler', 'rolled', 'howled', 'holder', 'hollowed']
The data structure you want is called a Directed Acyclic Word Graph (DAWG), and it is described by Andrew Appel and Guy Jacobson in their paper "The World's Fastest Scrabble Program", which unfortunately they have chosen not to make available free online. An ACM membership or a university library will get it for you.
I have implemented this data structure in at least two languages---it is simple, easy to implement, and very, very fast.
A simple-minded approach is to generate all the "substrings" and, for each of them, check whether it's an element of the set of acceptable words. E.g., in Python 2.6:
import itertools
import urllib

def words():
    f = urllib.urlopen(
        'http://www.cs.umd.edu/class/fall2008/cmsc433/p5/Usr.Dict.Words.txt')
    allwords = set(w[:-1] for w in f)
    f.close()
    return allwords

def substrings(s):
    for i in range(2, len(s)+1):
        for p in itertools.permutations(s, i):
            yield ''.join(p)

def main():
    w = words()
    print '%d words' % len(w)
    ss = set(substrings('weep'))
    print '%d substrings' % len(ss)
    good = ss & w
    print '%d good ones' % len(good)
    sgood = sorted(good, key=lambda w: (len(w), w))
    for aword in sgood:
        print aword

main()
will emit:
38617 words
31 substrings
5 good ones
we
ewe
pew
wee
weep
Of course, as other responses pointed out, organizing your data purposefully can greatly speed up your runtime - although the best data organization for a fast anagram finder could well be different... but that will largely depend on the nature of your dictionary of allowed words (a few tens of thousands, like here - or millions?). Hash maps and "signatures" (based on sorting the letters in each word) should be considered, as well as tries, etc.
What you want is an implementation of a power set.
Also look at Eric Lippert's blog; he blogged about this very thing a little while back.
EDIT:
Here is an implementation I wrote of getting the powerset from a given string...
private IEnumerable<string> GetPowerSet(string letters)
{
    char[] letterArray = letters.ToCharArray();
    for (int i = 0; i < Math.Pow(2.0, letterArray.Length); i++)
    {
        StringBuilder sb = new StringBuilder();
        for (int j = 0; j < letterArray.Length; j++)
        {
            int pos = Convert.ToInt32(Math.Pow(2.0, j));
            if ((pos & i) == pos)
            {
                sb.Append(letterArray[j]);
            }
        }
        yield return new string(sb.ToString().ToCharArray().OrderBy(c => c).ToArray());
    }
}
This function gives me the powersets of chars that make up the passed in string, I then can use these as keys into a dictionary of anagrams...
Dictionary<string,IEnumerable<string>>
I created my dictionary of anagrams like so... (there are probably more efficient ways, but this was simple and plenty quick enough with the scrabble tournament word list)
wordlist = (from s in fileText.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries)
            let k = new string(s.ToCharArray().OrderBy(c => c).ToArray())
            group s by k).ToDictionary(o => o.Key, sl => sl.Select(a => a));
Like Tim J, Eric Lippert's blog posts were the first thing to come to my mind. I wanted to add that he wrote a follow-up about ways to improve the performance of his first attempt.
A nasality talisman for the sultana analyst
Santalic tailfans, part two
I believe the Ruby code in the answers to this question will also solve your problem.
I've been playing a lot of Wordfeud on my phone recently and was curious if I could come up with some code to give me a list of possible words. The following code takes your available source letters (* for a wildcard) and an array with a master list of allowable words (TWL, SOWPODS, etc.) and generates a list of matches. It does this by trying to build each word in the master list from your source letters.
I found this topic after writing my code, and it's definitely not as efficient as John Pirie's method or the DAWG algorithm, but it's still pretty quick.
public IList<string> Matches(string sourceLetters, string[] wordList)
{
    sourceLetters = sourceLetters.ToUpper();
    IList<string> matches = new List<string>();
    foreach (string word in wordList)
    {
        if (WordCanBeBuiltFromSourceLetters(word, sourceLetters))
            matches.Add(word);
    }
    return matches;
}

public bool WordCanBeBuiltFromSourceLetters(string targetWord, string sourceLetters)
{
    string builtWord = "";
    foreach (char letter in targetWord)
    {
        int pos = sourceLetters.IndexOf(letter);
        if (pos >= 0)
        {
            builtWord += letter;
            sourceLetters = sourceLetters.Remove(pos, 1);
            continue;
        }
        // check for wildcard
        pos = sourceLetters.IndexOf("*");
        if (pos >= 0)
        {
            builtWord += letter;
            sourceLetters = sourceLetters.Remove(pos, 1);
        }
    }
    return string.Equals(builtWord, targetWord);
}

How to extract character n-grams based on a large text

Given a large text file I want to extract the character n-grams using Apache Spark (do the task in parallel).
Example input (2 line text):
line 1: (Hello World, it)
line 2: (is a nice day)
Output n-grams:
Hel - ell - llo - lo_ - o_W - _Wo - Wor - orl - rld - ld, - d,_ - ,_i - _it - it_ - t_i - _is - ... and so on. So I want the return value to be an RDD[String], with each string containing an n-gram.
Notice that the newline is considered a white space in the output n-grams. I put each line in parentheses to be clear. Also, just to be clear, the string or text is not a single entry in an RDD; I read the file using the sc.textFile() method.
The main idea is to take all the lines within each partition and combine them into a long String. Next, we replace " " with "_" and call sliding on this string to create the trigrams for each partition in parallel.
Note: The resulting trigrams might not be 100% accurate, since we will miss a few trigrams at the beginning and the end of each partition. Given that each partition can be several million characters long, the loss in accuracy should be negligible. The main benefit here is that each partition can be executed in parallel.
Here is some toy data. Everything below can be executed in any Spark REPL:
scala> val data = sc.parallelize(Seq("Hello World, it","is a nice day"))
data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[12]
val trigrams = data.mapPartitions(_.toList.mkString(" ").replace(" ","_").sliding(3))
trigrams: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[14]
Here I will collect the trigrams to show what they look like (you might not want to do this if your dataset is massive):
scala> val asCollected = trigrams.collect
asCollected: Array[String] = Array(Hel, ell, llo, lo_, o_W, _Wo, Wor, orl, rld, ld,, d,_, ,_i, _it, is_, s_a, _a_, a_n, _ni, nic, ice, ce_, e_d, _da, day)
You could use a function like the following:
def n_gram(str:String, n:Int) = (str + " ").sliding(n)
I am assuming the newline has been stripped off when reading the line, so I've added a space to compensate for that. If, on the other hand, the newline is preserved, you could define it as:
def n_gram(str:String, n:Int) = str.replace('\n', ' ').sliding(n)
Using your example:
println(n_gram("Hello World, it", 3).map(_.replace(' ', '_')).mkString(" - "))
would return:
Hel - ell - llo - lo_ - o_W - _Wo - Wor - orl - rld - ld, - d,_ - ,_i - _it - it_
There may be shorter ways to do this. Assuming that the entire string (including the newline) is a single entry in an RDD, returning the following from flatMap should give you the result you want (see the usage sketch after the code).
val strings = text.foldLeft(("", List[String]())) {
  case ((s, l), c) =>
    if (s.length < 2) {
      val ns = s + c
      (ns, l)
    } else if (s.length == 2) {
      val ns = s + c
      (ns, ns :: l)
    } else {
      val ns = s.tail + c
      (ns, ns :: l)
    }
}._2
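One possible way to hook that fold into Spark, as described above: wrap it in a helper and call it from flatMap. The RDD contents and the replacement of spaces and newlines with underscores are assumptions chosen to mirror the question's expected output:
// Helper around the fold above; the accumulator prepends, so reverse at the
// end to get the trigrams back in reading order.
def trigramsOf(text: String): List[String] =
  text.foldLeft(("", List[String]())) {
    case ((s, l), c) =>
      if (s.length < 2) (s + c, l)
      else if (s.length == 2) { val ns = s + c; (ns, ns :: l) }
      else { val ns = s.tail + c; (ns, ns :: l) }
  }._2.reverse

val ngrams = sc.parallelize(Seq("Hello World, it\nis a nice day"))
  .flatMap(line => trigramsOf(line.replace('\n', ' ').replace(' ', '_')))
// ngrams.collect() starts with: Hel, ell, llo, lo_, o_W, ...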