How to extract character n-grams from a large text - Scala

Given a large text file I want to extract the character n-grams using Apache Spark (do the task in parallel).
Example input (2 line text):
line 1: (Hello World, it)
line 2: (is a nice day)
Output n-grams:
Hel - ell - llo - lo_ - o_W - _Wo - Wor - orl - rld - ld, - d,_ - ,_i - _it - it_ - t_i - _is - ... and so on. So I want the return value to be an RDD[String], with each string containing one n-gram.
Notice that the newline is treated as a whitespace character in the output n-grams. I put each line in parentheses to be clear. Also, just to be clear, the whole text is not a single entry in an RDD; I read the file using the sc.textFile() method.

The main idea is to take all the lines within each partition and combine them into a long String. Next, we replace " " with "_" and call sliding on this string to create the trigrams for each partition in parallel.
Note: The resulting trigrams might not be 100% accurate, since we will miss a few trigrams at the beginning and end of each partition. Given that each partition can be several million characters long, the loss in accuracy should be negligible. The main benefit here is that each partition can be processed in parallel.
Here is some toy data. Everything below can be executed in any Spark REPL:
scala> val data = sc.parallelize(Seq("Hello World, it","is a nice day"))
data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[12]
scala> val trigrams = data.mapPartitions(_.toList.mkString(" ").replace(" ","_").sliding(3))
trigrams: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[14]
Here I collect the trigrams to show what they look like (you might not want to do this if your dataset is massive):
scala> val asCollected = trigrams.collect
asCollected: Array[String] = Array(Hel, ell, llo, lo_, o_W, _Wo, Wor, orl, rld, ld,, d,_, ,_i, _it, is_, s_a, _a_, a_n, _ni, nic, ice, ce_, e_d, _da, day)

You could use a function like the following:
def n_gram(str:String, n:Int) = (str + " ").sliding(n)
I am assuming the newline has been stripped off when reading the line, so I've added a space to compensate for that. If, on the other hand, the newline is preserved, you could define it as:
def n_gram(str:String, n:Int) = str.replace('\n', ' ').sliding(n)
Using your example:
println(n_gram("Hello World, it", 3).map(_.replace(' ', '_')).mkString(" - "))
would return:
Hel - ell - llo - lo_ - o_W - _Wo - Wor - orl - rld - ld, - d,_ - ,_i - _it - it_
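Since the question reads the file with sc.textFile, here is a minimal sketch of wiring this function into Spark (the path is a placeholder; note that per-line processing drops any n-grams that would span two lines, which the appended space only partially compensates for):
// Sketch only: flatMap n_gram over the lines read by sc.textFile.
val lines = sc.textFile("input.txt")   // placeholder path
val ngrams = lines.flatMap(line => n_gram(line, 3).map(_.replace(' ', '_')))
// ngrams: RDD[String] with one n-gram per element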

There may be shorter ways to do this. Assuming that the entire string (including the newline) is a single entry in an RDD, returning the following from flatMap should give you the result you want.
val strings = text.foldLeft(("", List[String]())) {
  case ((s, l), c) =>
    if (s.length < 2) {
      val ns = s + c              // still filling the first window
      (ns, l)
    } else if (s.length == 2) {
      val ns = s + c              // first complete trigram
      (ns, ns :: l)
    } else {
      val ns = s.tail + c         // slide the window by one character
      (ns, ns :: l)
    }
}._2.reverse                      // the fold prepends, so reverse to restore order
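As a usage sketch (the helper name trigramsOf and the value textRdd are illustrative, not from the answer), the fold can be wrapped in a function, combined with the whitespace normalization from above, and applied through flatMap:
// Illustrative wrapper: normalize whitespace, then slide a 3-character window.
def trigramsOf(text: String): List[String] =
  text.replace('\n', ' ').replace(' ', '_')
    .foldLeft(("", List[String]())) { case ((s, l), c) =>
      val ns = if (s.length < 3) s + c else s.tail + c
      (ns, if (ns.length == 3) ns :: l else l)
    }._2.reverse

val allTrigrams = textRdd.flatMap(trigramsOf)   // textRdd: RDD[String]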

Related

Anonymizing first_name, last_name and full_name columns by replacing them with pronounceable English words in a dataframe - Spark Scala

I am trying to anonymize production data with human-readable replacements - this will not only mask the actual data but also give it a nameable identity for recognition.
Please help me anonymize dataframe columns like firstname, lastname, and fullname with other pronounceable English words in Scala:
It must convert a real-world name into another real-world name which is pronounceable and identifiable.
It must be possible to convert first name, last name and full name separately, such that full name = first name and last name separated by a space.
It should produce the same anonymized name for a given name on every iteration.
The target dataset will have more than a million distinct records.
I have tried iterating over a dictionary of nouns and adjectives to reach a combination of two pronounceable words, but it is not going to give me a million distinct combinations.
Code below:
def anonymizeString(s: Option[String]): Option[String] = {
  val AsciiUpperLetters = ('A' to 'Z').toList.filter(_.isLetter)
  val AsciiLowerLetters = ('a' to 'z').toList.filter(_.isLetter)
  val UtfLetters = (128.toChar to 256.toChar).toList.filter(_.isLetter)
  val Numbers = ('0' to '9')
  s match {
    //case None => None
    case _ =>
      val seed = scala.util.hashing.MurmurHash3.stringHash(s.get).abs
      val random = new scala.util.Random(seed)
      var r = ""
      for (c <- s.get) {
        if (Numbers.contains(c)) {
          r = r + (((random.nextInt.abs + c) % Numbers.size))
        } else if (AsciiUpperLetters.contains(c)) {
          r = r + AsciiUpperLetters(((random.nextInt.abs) % AsciiUpperLetters.size))
        } else if (AsciiLowerLetters.contains(c)) {
          r = r + AsciiLowerLetters(((random.nextInt.abs) % AsciiLowerLetters.size))
        } else if (UtfLetters.contains(c)) {
          r = r + UtfLetters(((random.nextInt.abs) % UtfLetters.size))
        } else {
          r = r + c
        }
      }
      Some(r)
  }
}
"it is not going to give me a million distinct combinations"
I am not sure why you say that. I just checked /usr/share/dict/words on my Mac, and it has 234,371 words. That allows for almost 55 billion combinations of two words.
So, just hash your string to an Int, take it modulo 234,371, and map to the respective entry from the dictionary.
Granted, some words in the dictionary don't look too much like names (though still much better than what you are doing at random) - e.g. "A" ... but even if you require the word to contain at least 5 characters, you'd have 227,918 words left – still more than enough.
Also please don't use "naked get" on Option ... It hurts my aesthetic feelings so much :(
class Anonymizer(dict: IndexedSeq[String]) {
  def anonymize(s: Option[String]): Option[String] = s
    .map(n => Math.floorMod(n.hashCode, dict.size))   // floorMod: hashCode can be negative
    .map(dict)
}
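Building on the two-word idea from above, a hedged sketch (the class name and the second hash seed are my own illustration, not from the answer) that derives a stable two-word pseudonym from two independent hashes:
class TwoWordAnonymizer(dict: IndexedSeq[String]) {
  import scala.util.hashing.MurmurHash3
  // Two hashes with different seeds give two stable dictionary indices,
  // so the same input name always maps to the same two-word pseudonym.
  def anonymize(s: Option[String]): Option[String] = s.map { name =>
    val i = Math.floorMod(MurmurHash3.stringHash(name), dict.size)
    val j = Math.floorMod(MurmurHash3.stringHash(name, 1), dict.size)
    s"${dict(i)} ${dict(j)}"
  }
}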

Trouble With Rhyming Algorithm Scala

I want to make a method that takes two Lists of strings representing the sounds (phonemes and vowels) of two words as parameters. The method should determine whether or not the words rhyme based on the two sound lists.
Definition of a rhyme: the words rhyme if everything from the last vowel (inclusive) onward is the same. Words will rhyme even if the last vowel sounds have different stress. Only vowels have stress levels (numbers).
My approach so far is to reverse each list so that the sounds are in reverse order, and then keep everything from the start of the reversed list up to the first vowel (inclusive). Then I compare the two lists to see if they are equal. Please keep the code basic; I'm only at an elementary level of Scala and just finished learning program execution.
Ex1: GEE and NEE will rhyme because the GEE sounds ("JH","IY1") become ("IY1","JH") and the NEE sounds ("N","IY1") become ("IY1","N"); since they have the same vowel, everything after it is no longer considered.
Ex2: GEE and JEEP will not rhyme because the GEE sounds ("JH","IY1") become ("IY1","JH") and the JEEP sounds ("JH","IY1","P") become ("P","IY1","JH"); since the first sound in reversed GEE is a vowel, it gets compared to "P" and then "IY1" in JEEP.
Ex3: HALF and GRAPH will rhyme because the HALF sounds ("HH","AE1","F") become ("F","AE1","HH") and the GRAPH sounds ("G","R","AE2","F") become ("F","AE2","R","G"); in this case, although the vowels have different stress (numbers), we ignore the stress since the vowels are the same.
def isRhymeSounds(soundList1: List[String], soundList2: List[String]): Boolean = {
  val revSound1 = soundList1.reverse
  val revSound2 = soundList2.reverse
  var revSoundList1: List[String] = List()
  var revSoundList2: List[String] = List()
  for (sound1 <- revSound1) {
    if (sound1.length >= 3) {
      val editVowel1 = sound1.substring(0, 2)
      revSoundList1 = revSoundList1 :+ editVowel1
    } else {
      revSoundList1 = revSoundList1 :+ sound1
    }
  }
  for (sound2 <- revSound2) {
    if (sound2.length >= 3) {
      val editVowel2 = sound2.substring(0, 2)
      revSoundList2 = revSoundList2 :+ editVowel2
    } else {
      revSoundList2 = revSoundList2 :+ sound2
    }
  }
  if (revSoundList1 == revSoundList2) {
    true
  } else {
    false
  }
}
I don't think reversing is necessary.
def isRhyme(sndA: List[String], sndB: List[String]): Boolean = {
  val vowel = "[AEIOUY]+".r
  sndA.foldLeft(""){ case (res, s) => vowel.findPrefixOf(s).getOrElse(res + s) } ==
    sndB.foldLeft(""){ case (res, s) => vowel.findPrefixOf(s).getOrElse(res + s) }
}
Explanation:
"[AEIOUY]+".r - This is a Regular Expression (that's the .r part) that means "a String of one or more of these characters." In other words, any combination of capital letter vowels.
findPrefixOf() - This returns the first part of the test string that matches the regular expression. So vowel.findPrefixOf("AY2") returns Some("AY") because the first two letters match the regular expression. And vowel.findPrefixOf("OFF") returns Some("O") because only the first letter matches the regular expression. But vowel.findPrefixOf("BAY") returns None because the string does not start with any of the specified characters.
getOrElse() - This unwraps an Option. So Some("AY").getOrElse("X") returns "AY", and Some("O").getOrElse("X") returns "O", but None.getOrElse("X") returns "X" because there's nothing inside a None value so we go with the OrElse default return value.
foldLeft()() - This takes a collection of elements and, starting from the "left", it "folds" them in on each other until a final result is obtained.
So, consider how List("HH", "AE1", "F", "JH", "IY1", "P") would be processed.
res    s    result
===    ===  ======
""     HH   ""+HH    // s doesn't match the RE, so default res+s
HH     AE1  AE       // s does match the RE, use only the matching part
AE     F    AE+F     // res+s
AEF    JH   AEF+JH   // res+s
AEFJH  IY1  IY       // only the matching part
IY     P    IY+P     // res+s
final result: "IYP"
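A quick check of isRhyme against the question's three examples:
isRhyme(List("JH", "IY1"), List("N", "IY1"))                 // true:  both fold to "IY"
isRhyme(List("JH", "IY1"), List("JH", "IY1", "P"))           // false: "IY" != "IYP"
isRhyme(List("HH", "AE1", "F"), List("G", "R", "AE2", "F"))  // true:  both fold to "AEF"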

Apply a text-preprocessing function to a dataframe column in scala spark

I want to create a function to handle the text preprocessing in a problem I am facing with text data. I am familiar with Python and pandas dataframes, and my usual way of solving the problem is to write a function and then use the pandas apply method to apply it to all the elements in a column. However, I don't know where to begin to accomplish this in Scala.
So, I created two functions to handle the replacements. The problem is that I don't know how to put more than one replace inside this method. I need to make about 20 replacements for three separate dataframes, so solving it this way would take me 60 lines of code. Is there a way to do all the replacements inside a single function and then apply it to all the elements in a dataframe column in Scala?
def removeSpecials: String => String = _.replaceAll("$", " ")
def removeSpecials2: String => String = _.replaceAll("?", " ")
val udf_removeSpecials = udf(removeSpecials)
val udf_removeSpecials2 = udf(removeSpecials2)
val consolidated2 = consolidated.withColumn("product_description", udf_removeSpecials($"product_description"))
val consolidated3 = consolidated2.withColumn("product_description", udf_removeSpecials2($"product_description"))
consolidated3.show()
Well, you can simply chain each replacement after the previous one. Note that replaceAll treats its first argument as a regular expression, and both $ and ? are regex metacharacters, so they must be escaped:
def removeSpecials: String => String = _.replaceAll("\\$", " ").replaceAll("\\?", " ")
But in this case, where the replacement character is the same, it is better to combine the patterns into a single regular expression and avoid multiple replaceAll passes:
def removeSpecials: String => String = _.replaceAll("\\$|\\?", " ")
Note that \\ is used as the escape character.
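For the ~20 replacements, one hedged sketch (the specials list and column name are illustrative) is to quote each literal character, OR them into one pattern, and apply it once as a UDF; Spark's built-in regexp_replace function would also avoid the UDF entirely:
import java.util.regex.Pattern
import org.apache.spark.sql.functions.udf

// Illustrative: extend this list to all ~20 characters you need to replace.
val specials = Seq("$", "?", "#", "%")
val pattern = specials.map(Pattern.quote).mkString("|")   // "\Q$\E|\Q?\E|..."
def removeSpecials: String => String = _.replaceAll(pattern, " ")

val udf_removeSpecials = udf(removeSpecials)
val cleaned = consolidated.withColumn("product_description",
  udf_removeSpecials($"product_description"))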

Scala program to replace words in alphabetical order within a string

I am learning Scala and have been trying to create a program that should rearrange the characters of each word within a string into alphabetical order. For example, if the string is "Where are you", the program should change it to "Eehrw aer ouy". I googled and found some examples, but I am not able to write an error-free program. I think I am far from having a working program.
def main(args: Array[String]) {
  val r = "Where are you"
  val newstr = r.map(x => (x, _) match {
    case ' ' = ' '
    case y => {
      val newchar = (x.toByte).toChar
      if newchar.toByte.toChar > (newchar + 1).toByte.toChar
        x = newchar
      else
        x
    }
  })
}
The tricky part is restoring the original capitalization. Add punctuation to the mix and it turns into a fun little challenge.
val str = "Where, aRe yoU?"
val sortedLowerCase = str.toLowerCase.split("(?=\\W)").map(_.sorted).mkString
val capsAt = str.indices.filter(str(_).isUpper)
capsAt.foldLeft(sortedLowerCase)((s,x) => s.patch(x,Seq(s(x).toUpper),1))
// res0: String = Eehrw, aEr ouY?
Time spent studying the Standard Library will be richly rewarded.
r.split(" ").map(word => word.toLowerCase.sorted)
To keep the capital letters, instead of .toLowerCase.sorted, use .sortWith and implement the comparison function according to how you want to sort characters.
Let me expand on Ren's answer:
Compare based on lowercase, and then capitalize only if there's an uppercase letter:
r.split(" ").map(word => word.sortWith(_.toLower < _.toLower))
.map(x => if (x.exists(_.isUpper)) x.toLowerCase.capitalize else x )
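Putting the two steps together with the question's input and joining the words back up:
val r = "Where are you"
val result = r.split(" ")
  .map(word => word.sortWith(_.toLower < _.toLower))
  .map(x => if (x.exists(_.isUpper)) x.toLowerCase.capitalize else x)
  .mkString(" ")
// result: String = Eehrw aer ouy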

How does reduceByKey work [duplicate]

This question already has answers here:
reduceByKey: How does it work internally?
(5 answers)
Closed 5 years ago.
I am doing some work with Scala and Spark (beginner programmer and poster). The goal is to map each request (line) to a pair (userid, 1) and then sum the hits.
Can anyone explain in more detail what is happening on the 1st and 3rd line and what the => in: line => line.split means?
Please excuse any errors in my post formatting as I am new to this website.
val userreqs = logs.map(line => line.split(' ')).
  map(words => (words(2),1)).
  reduceByKey((v1,v2) => v1 + v2)
Considering the hypothetical log below:
trans_id  amount  user_id
1         100     A001
2         200     A002
3         300     A001
4         200     A003
This is how the data is processed in Spark for each operation performed on the logs:
logs                              // RDD("1 100 A001", "2 200 A002", "3 300 A001", "4 200 A003")
  .map(line => line.split(' '))   // RDD(Array(1,100,A001), Array(2,200,A002), Array(3,300,A001), Array(4,200,A003))
  .map(words => (words(2),1))     // RDD((A001,1), (A002,1), (A001,1), (A003,1))
  .reduceByKey((v1,v2) => v1+v2)  // RDD((A001,2), (A002,1), (A003,1))
line.split(' ') splits a string into Array of String. "Hello World" => Array("Hello", "World")
reduceByKey(_+_) runs a reduce operation that groups the data by key; in this case it adds up all the values for each key. In the above example there were two occurrences of the user key A001, and the value associated with each occurrence was 1. These are reduced to the value 2 using the additive function (_ + _) provided to the reduceByKey method.
The easiest way to learn Spark and reduceByKey is to read the official documentation of PairRDDFunctions that says:
reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)] Merge the values for each key using an associative and commutative reduce function.
So it basically takes all the values per key and merges them into a single value, here the sum of all the values for that key.
Now, you may be asking yourself:
What is the key?
The key to understanding the key (pun intended) is to see how keys are generated, and that's the role of the line
map(words => (words(2),1)).
This is where you take the words array and turn it into a pair of the key (words(2)) and the value 1.
This is a classic map-reduce algorithm where you give 1 to all keys to reduce them in the following step.
In the end, after this map you'll have a series of key-value pairs as follows:
(hello, 1)
(world, 1)
(nice, 1)
(to, 1)
(see, 1)
(you, 1)
(again, 1)
(again, 1)
I repeated the last (again, 1) pair on purpose to show you that pairs can occur multiple times.
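To see what reduceByKey then does with those duplicates, here is a minimal spark-shell sketch (the pairs are the illustrative ones above; output ordering may vary):
scala> val pairs = sc.parallelize(Seq(("see", 1), ("you", 1), ("again", 1), ("again", 1)))
scala> pairs.reduceByKey(_ + _).collect
res0: Array[(String, Int)] = Array((see,1), (you,1), (again,2))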
The series is created using the RDD.map operator, which takes a function that splits a single line and tokenizes it into words.
logs.map(line => line.split(' ')).
It reads:
For every line in logs, split the line into tokens using a space as the separator.
I'd change this line to use a regex like \\s+ so that any whitespace character is treated as part of the separator.
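For example, a quick illustration of that regex variant (the sample string is mine):
scala> "hello   spark\tscala".split("\\s+")
res1: Array[String] = Array(hello, spark, scala)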
line.split(' ') splits each line on the space character, which returns an array of strings.
For example:
"hello spark scala".split(' ') gives [hello, spark, scala]
`reduceByKey((v1,v2) => v1 + v2)` is equivalent to `reduceByKey(_ + _)`
Here is how reduceByKey works https://i.stack.imgur.com/igmG3.gif and http://backtobazics.com/big-data/spark/apache-spark-reducebykey-example/
For the same key it keeps adding all the values.
Hope this helped!