Map letters to the words in which they appear - Scala

I am a newbie to Scala and I am trying to write a function that takes an input string and returns a map from letters to the words in which they appear.
For example given the input string "this is demo",
I would like the output map 't' -> ["this"], 'h' -> ["this"], 'i' -> ["this", "is"]... and so on.
I can write this code in a traditional way, but how can I write it in a functional way using Scala constructs like map, groupBy, flatMap, etc.?

"this is demo"
.split(" ")
.flatMap(w => w.map(c => c -> w))
.groupMap(_._1)(_._2)
// HashMap(e -> Array(demo), s -> Array(this, is), t -> Array(this), m -> Array(demo), i -> Array(this, is), h -> Array(this), o -> Array(demo), d -> Array(demo))
The first step consists of getting an array of tuples representing, for each character, which word it comes from. This can be achieved by first splitting the sentence into words and, for each character of each word, producing a tuple of the character and its word (.map(w => w.map(c => c -> w))). Since this gives us an array of arrays, we use a flatMap instead to flatten everything into a one-level array of tuples (producing Array((t,this), (h,this), (i,this), ...)).
The second step consists of grouping these (character, word) tuples by character and mapping the grouped values to the associated words. This can be achieved with groupMap, which groups the tuples by their first element (the character) and maps each grouped tuple to its second element (the word). If you're using an earlier Scala version (before 2.13), you'll have to replace groupMap with a combination of groupBy and mapValues: .groupBy(_._1).mapValues(_.map(_._2)).
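For reference, here is the full pre-2.13 pipeline spelled out (a sketch; on 2.13 mapValues is deprecated in favour of .view.mapValues, which is why groupMap is preferred there):

"this is demo"
.split(" ")
.flatMap(w => w.map(c => c -> w))
.groupBy(_._1)            // Map[Char, Array[(Char, String)]]
.mapValues(_.map(_._2))   // keep only the words for each character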

Here is another solution. You can replace the rough tokenizer I suggest with a real parser, such as Stanford CoreNLP Simple.
def tokenizeText(s: String): Array[String] =
  s.toLowerCase.split("\\W+")

val text: String = "Here you have your text. Set several sentences if you like."
val words = tokenizeText(text)
// All distinct letters, as one-character strings
val letters = words.mkString.toSet.mkString.split("")
// For each letter, the words that contain it
val twoDArray: Array[Array[String]] = letters.map(l => words.filter(w => w.contains(l)))
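Since the question asks for a map, one natural final step (my addition, not part of the original answer) is to zip the letters with their word arrays:

val letterToWords: Map[String, Array[String]] = letters.zip(twoDArray).toMap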

Related

Scala: split two words which aren't separated

I have a corpus with words like applefruit that are not separated by any delimiter, and I would like to split them. As this can be a non-linear problem, I would like to pass a custom dictionary and split only when a word from the dictionary is a substring of a word in the corpus.
If my dictionary has only apple and the corpus has the 3 words applefruit, applebananafruit, bananafruit, the output should look like: apple , fruit; apple , bananafruit; bananafruit.
Notice I am not splitting bananafruit; the goal is to make the process faster by splitting only on the text provided in the dictionary. I am using Scala 2.x.
You can use regular expressions with split:
scala> "foobarfoobazfoofoobatbat".split("(?<=foo)|(?=foo)")
res27: Array[String] = Array(foo, bar, foo, baz, foo, foo, batbat)
Or if your dictionary (and/or strings to split) has more than one word ...
val rx = wordList.map { w => s"(?<=$w)|(?=$w)" }.mkString("|")
val result: List[String] = toSplit.flatMap(_.split(rx))
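For example, with the dictionary and corpus from the question, this produces the expected output (a quick sketch):

val wordList = List("apple")
val toSplit = List("applefruit", "applebananafruit", "bananafruit")
val rx = wordList.map { w => s"(?<=$w)|(?=$w)" }.mkString("|")
val result: List[String] = toSplit.flatMap(_.split(rx))
// List(apple, fruit, apple, bananafruit, bananafruit)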
You could do a regex find and replace on the following pattern:
(?=apple)|(?<=apple)
and then replace with comma surrounded by spaces on both sides. We could try:
val input = "bananaapplefruit"
val output = input.replaceAll("(?=apple)|(?<=apple)", " , ")
println(output) // banana , apple , fruit

Scala combination function issue

I have an input file like this:
The Works of Shakespeare, by William Shakespeare
Language: English
and I want to use flatMap with the combinations method to get the K-V pairs per line.
This is what I do:
var pairs = input.flatMap { line =>
  line.split("[\\s*$&#/\"'\\,.:;?!\\[\\(){}<>~\\-_]+")
    .filter(_.matches("[A-Za-z]+"))
    .combinations(2)
    .toSeq
    .map { case array => array(0) -> array(1) }
}
I got 17 pairs after this, but two are missing: (by,shakespeare) and (william,shakespeare). I think there might be something wrong with the last word of the first sentence, but I don't know how to solve it. Can anyone tell me?
The combinations method will not give the same pair twice, even if the values are in the opposite order, so the pairs you are missing already appear in the result with their elements in the other order.
This code will create all ordered pairs of words in the text.
for {
  line <- input
  t <- line.split("""\W+""").tails if t.length > 1
  a = t.head
  b <- t.tail
} yield a -> b
Here is the description of the tails method:
Iterates over the tails of this traversable collection. The first value will be this traversable collection and the final one will be an empty traversable collection, with the intervening values the results of successive applications of tail.
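Running the comprehension on a tiny input shows the ordered pairs it yields (a sketch with made-up sample data, assuming input is a collection of lines):

val input = List("a b c")
val pairs = for {
  line <- input
  t <- line.split("""\W+""").tails if t.length > 1
  a = t.head
  b <- t.tail
} yield a -> b
// List((a,b), (a,c), (b,c))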

How to refer to a Spark RDD element multiple times using underscore notation?

How do you refer to a Spark RDD element multiple times using underscore notation?
For example, I need to convert an RDD[String] to an RDD[(String, Int)]. I can create an anonymous function with a named parameter, but I would like to do this using the underscore notation. How can I achieve this?
Sample code below.
val x = List("apple", "banana")
val rdd1 = sc.parallelize(x)
// Working
val rdd2 = rdd1.map(x => (x, x.length))
// Not working
val rdd3 = rdd1.map((_, _.length))
Why does the last line above not work?
An underscore in what is more commonly called the placeholder syntax is a marker for a single input parameter. It is handy for simple functions, but can get tricky to get right with two or more parameters.
You can find the definitive answer in the Scala language specification's Placeholder Syntax for Anonymous Functions:
An expression (of syntactic category Expr) may contain embedded underscore symbols _ at places where identifiers are legal. Such an expression represents an anonymous function where subsequent occurrences of underscores denote successive parameters.
Note that one underscore references one input parameter, two underscores are for two different input parameters and so on.
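A small illustration: a function literal with two underscores takes two parameters, one per underscore:

val add = (_: Int) + (_: Int)  // expands to (a: Int, b: Int) => a + b
add(1, 2)  // 3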
With that said, you cannot use the placeholder twice and expect that they'll reference the same input parameter. That's not how it works in Scala and hence the compiler error.
// Not working
val rdd3 = rdd1.map((_, _.length))
The above is equivalent to the following:
// Not working
val rdd3 = rdd1.map { (a: String, b: String) => (a, b.length) }
which is clearly incorrect as map expects a function of one input parameter.
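If you want to avoid naming the parameter entirely, one workaround (a sketch using the standard RDD API rather than the placeholder syntax) is keyBy, which pairs every element with itself before mapping the values:

val rdd3 = rdd1.keyBy(identity).mapValues(_.length)
// (apple,5), (banana,6)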

How to create a map from an RDD[String] using Scala?

My file is,
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
Here there are 7 rows and 5 columns (0, 1, 2, 3, 4).
I want the output as,
Map(0 -> Set("sunny","overcast","rainy"))
Map(1 -> Set("hot","mild","cool"))
Map(2 -> Set("high","normal"))
Map(3 -> Set("false","true"))
Map(4 -> Set("yes","no"))
The output must be of type Map[Int, Set[String]].
EDIT: Rewritten to present the map-reduce version first, as it's more suited to Spark
Since this is Spark, we're probably interested in parallelism/distribution. So we need to take care to enable that.
Splitting each string into words can be done in partitions. Getting the set of values used in each column is a bit more tricky - the naive approach of initialising a set then adding every value from every row is inherently serial/local, since there's only one set (per column) we're adding the value from each row to.
However, if we have the set for some part of the rows and the set for the rest, the answer is just the union of these sets. This suggests a reduce operation where we merge sets for some subset of the rows, then merge those and so on until we have a single set.
So, the algorithm:
- Split each row into an array of strings, then change this into an array of sets, each holding the single string value for that column. This can all be done with one map, and distributed.
- Reduce this using an operation that merges the sets for each column in turn. This can also be distributed.
- Turn the single row that results into a Map.
It's no coincidence that we do a map, then a reduce, which should remind you of something :)
Here's a one-liner that produces the single row:
val data = List(
"sunny,hot,high,FALSE,no",
"sunny,hot,high,TRUE,no",
"overcast,hot,high,FALSE,yes",
"rainy,mild,high,FALSE,yes",
"rainy,cool,normal,FALSE,yes",
"rainy,cool,normal,TRUE,no",
"overcast,cool,normal,TRUE,yes")
val row = data.map(_.split("\\W+").map(s=>Set(s)))
.reduce{(a, b) => (a zip b).map{case (l, r) => l ++ r}}
Converting it to a Map as the question asks:
val theMap = row.zipWithIndex.map(_.swap).toMap
- Zip the list with the index, since that's what we need as the key of the map.
- The elements of each tuple are unfortunately in the wrong order for .toMap, so swap them.
- Then we have a list of (key, value) pairs which .toMap will turn into the desired result.
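For reference, printing theMap for the sample data should give something like the following (entry order may vary, and note that FALSE/TRUE keep their original case):

println(theMap)
// Map(0 -> Set(sunny, overcast, rainy), 1 -> Set(hot, mild, cool), 2 -> Set(high, normal), 3 -> Set(FALSE, TRUE), 4 -> Set(no, yes))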
These don't need to change AT ALL to work with Spark. We just need to use an RDD instead of the List. Let's convert data into an RDD just to demo this:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc = new SparkContext(conf)
val rdd = sc.makeRDD(data)

val row = rdd.map(_.split("\\W+").map(s => Set(s)))
  .reduce { (a, b) => (a zip b).map { case (l, r) => l ++ r } }
(This can be converted into a Map as before)
An earlier one-liner works neatly (transpose is exactly what's needed here) but is very difficult to distribute (transpose inherently needs to visit every row):
data.map(_.split("\\W+")).transpose.map(_.toSet)
(Omitting the conversion to Map for clarity)
- Split each string into words.
- Transpose the result, so we have a list that has a list of the first words, then a list of the second words, etc.
- Convert each of those to a set.
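For completeness, converting this version to a Map uses the same zipWithIndex trick as before (a sketch, omitting Spark):

val theMap = data.map(_.split("\\W+")).transpose.map(_.toSet)
  .zipWithIndex.map(_.swap).toMap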
Maybe this does the trick:
val a = Array(
"sunny,hot,high,FALSE,no",
"sunny,hot,high,TRUE,no",
"overcast,hot,high,FALSE,yes",
"rainy,mild,high,FALSE,yes",
"rainy,cool,normal,FALSE,yes",
"rainy,cool,normal,TRUE,no",
"overcast,cool,normal,TRUE,yes")
// Use Int keys to match the requested Map[Int, Set[String]]
val b = new Array[Map[Int, Set[String]]](5)
for (i <- 0 to 4)
  b(i) = Map(i -> (for (s <- a) yield s.split(",")(i)).toSet)
println(b.mkString("\n"))
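This should print something close to the requested output (note that FALSE/TRUE keep their original case unless you lowercase the input first):

Map(0 -> Set(sunny, overcast, rainy))
Map(1 -> Set(hot, mild, cool))
Map(2 -> Set(high, normal))
Map(3 -> Set(FALSE, TRUE))
Map(4 -> Set(no, yes))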

What does `var @ _*` signify in Scala

I'm reviewing some Scala code trying to learn the language. Ran into a piece that looks like the following:
case x if x startsWith "+" =>
  val s: Seq[Char] = x
  s match {
    case Seq('+', rest @ _*) => r.subscribe(rest.toString){ m => }
  }
In this case, what exactly is rest @ _* doing? I understand this is a pattern match for a sequence, but I don't exactly understand what that second element in the pattern is supposed to do.
I was asked for more context, so I added the code block I found this in.
If you have come across _* before in the form of applying a Seq as varargs to some method/constructor, e.g.:
val myList = List(args: _*)
then this is the "unapply" (more specifically, search for "unapplySeq") version of this: take the sequence and convert back to a "varargs", then assign the result to rest.
x @ p matches the pattern p and binds the result of the whole match to x. This pattern matches a Seq containing '+' followed by any number (*) of unnamed elements (_) and binds rest to a Seq of those elements.
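A minimal runnable example (with a made-up input) shows the binding in action:

val s: Seq[Char] = "+topic"
s match {
  case Seq('+', rest @ _*) => println(rest.mkString)  // prints "topic"
  case _                   => println("no match")
}

Here rest is bound to the Seq[Char] following '+', and mkString turns it back into a String.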