I have a corpus with words like applefruit that are not separated by any delimiter, and I would like to split them. Since this can be a non-linear problem, I would like to pass a custom dictionary and split only when a word from the dictionary is a substring of a word in the corpus.
If my dictionary has only apple and the corpus has 3 words, applefruit, applebananafruit, bananafruit, the output should look like: apple, fruit; apple, bananafruit; bananafruit.
Notice I am not splitting bananafruit; the goal is to make the process faster by splitting only on the text provided in the dictionary. I am using Scala 2.x.
You can use regular expressions with split:
scala> "foobarfoobazfoofoobatbat".split("(?<=foo)|(?=foo)")
res27: Array[String] = Array(foo, bar, foo, baz, foo, foo, batbat)
Or if your dictionary (and/or strings to split) has more than one word ...
val rx = wordList.map { w => s"(?<=$w)|(?=$w)" }.mkString("|")
val result: List[String] = toSplit.flatMap(_.split(rx))
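For example, applied to the corpus from the question (a quick sketch; on Java 8 or newer the zero-width match at the start of a string does not produce a leading empty element):
val wordList = List("apple")
val toSplit = List("applefruit", "applebananafruit", "bananafruit")
val rx = wordList.map { w => s"(?<=$w)|(?=$w)" }.mkString("|")
toSplit.map(_.split(rx).toList)
// List(List(apple, fruit), List(apple, bananafruit), List(bananafruit))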
You could do a regex find and replace on the following pattern:
(?=apple)|(?<=apple)
and then replace with a comma surrounded by spaces on both sides. We could try:
val input = "bananaapplefruit"
val output = input.replaceAll("(?=apple)|(?<=apple)", " , ")
println(output) // banana , apple , fruit
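If you want the individual tokens rather than one delimited string, splitting on the same pattern should work too (a small sketch reusing the input above):
val tokens = input.split("(?=apple)|(?<=apple)").toList
// List(banana, apple, fruit)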
I have an input file like this:
The Works of Shakespeare, by William Shakespeare
Language: English
and I want to use flatMap with the combinations method to get the K-V pairs per line.
This is what I do:
var pairs = input.flatMap { line =>
  line.split("[\\s*$&#/\"'\\,.:;?!\\[\\(){}<>~\\-_]+")
    .filter(_.matches("[A-Za-z]+"))
    .combinations(2)
    .toSeq
    .map { case array => array(0) -> array(1) }
}
I got 17 pairs after this, but missed 2 of them: (by,shakespeare) and (william,shakespeare). I think there might be something wrong with the last word of the first sentence, but I don't know how to solve it. Can anyone tell me?
The combinations method will not give the same pair twice, even when the values appear in the opposite order. So the pairs you are missing already appear in your result in the other order.
This code will create all ordered pairs of words in the text.
for {
  line <- input
  t <- line.split("""\W+""").tails if t.length > 1
  a = t.head
  b <- t.tail
} yield a -> b
Here is the description of the tails method:
Iterates over the tails of this traversable collection. The first value will be this traversable collection and the final one will be an empty traversable collection, with the intervening values the results of successive applications of tail.
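For instance, run against just the fragment of the question's first line that was causing trouble (a small sketch, with a hypothetical input value):
val input = List("by William Shakespeare")
val pairs = for {
  line <- input
  t <- line.split("""\W+""").tails if t.length > 1
  a = t.head
  b <- t.tail
} yield a -> b
// pairs: List((by,William), (by,Shakespeare), (William,Shakespeare))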
How do I refer to a Spark RDD element multiple times using underscore notation?
For example, I need to convert an RDD[String] to an RDD[(String, Int)]. I can create the anonymous function using a named parameter, but I would like to do this using underscore notation. How can I achieve this?
Please find below the sample code.
val x = List("apple", "banana")
val rdd1 = sc.parallelize(x)
// Working
val rdd2 = rdd1.map(x => (x, x.length))
// Not working
val rdd3 = rdd1.map((_, _.length))
Why does the last line above not work?
An underscore, or (more properly) the placeholder syntax, is a marker for a single input parameter. It's nice for simple functions, but can get tricky to use correctly with two or more parameters.
You can find the definitive answer in the Scala language specification's Placeholder Syntax for Anonymous Functions:
An expression (of syntactic category Expr) may contain embedded underscore symbols _ at places where identifiers are legal. Such an expression represents an anonymous function where subsequent occurrences of underscores denote successive parameters.
Note that one underscore references one input parameter, two underscores reference two different input parameters, and so on.
That said, you cannot use the placeholder twice and expect both occurrences to reference the same input parameter. That's not how it works in Scala, hence the compiler error.
// Not working
val rdd3 = rdd1.map((_, _.length))
The above is equivalent to the following:
// Not working
val rdd3 = rdd1.map { (a: String, b: String) => (a, b.length) }
which is clearly incorrect as map expects a function of one input parameter.
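A rough sketch of how the expansion works, and two ways to get the intended result with a single element reference (rdd1 as in the question; pairs and pairs2 are just hypothetical names):
// each underscore stands for the next parameter of a new anonymous function
val add = (_: Int) + (_: Int)            // expands to (a: Int, b: Int) => a + b
// name the element once so it can be used twice
val pairs = rdd1.map(s => (s, s.length)) // RDD[(String, Int)]
// or pair the element with itself first, then a single placeholder is enough
val pairs2 = rdd1.keyBy(identity).mapValues(_.length)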
My file is:
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
Here there are 7 rows and 5 columns (0, 1, 2, 3, 4).
I want the output as:
Map(0 -> Set("sunny","overcast","rainy"))
Map(1 -> Set("hot","mild","cool"))
Map(2 -> Set("high","normal"))
Map(3 -> Set("false","true"))
Map(4 -> Set("yes","no"))
The output must be of type Map[Int, Set[String]].
EDIT: Rewritten to present the map-reduce version first, as it's more suited to Spark
Since this is Spark, we're probably interested in parallelism/distribution. So we need to take care to enable that.
Splitting each string into words can be done in partitions. Getting the set of values used in each column is a bit more tricky - the naive approach of initialising a set then adding every value from every row is inherently serial/local, since there's only one set (per column) we're adding the value from each row to.
However, if we have the set for some part of the rows and the set for the rest, the answer is just the union of these sets. This suggests a reduce operation where we merge sets for some subset of the rows, then merge those and so on until we have a single set.
So, the algorithm:
1. Split each row into an array of strings, then turn this into an array of sets, each containing the single string value for that column. This can all be done with one map, and distributed.
2. Reduce this using an operation that merges the sets for each column in turn. This can also be distributed.
3. Turn the single row that results into a Map.
It's no coincidence that we do a map, then a reduce, which should remind you of something :)
Here's a one-liner that produces the single row:
val data = List(
"sunny,hot,high,FALSE,no",
"sunny,hot,high,TRUE,no",
"overcast,hot,high,FALSE,yes",
"rainy,mild,high,FALSE,yes",
"rainy,cool,normal,FALSE,yes",
"rainy,cool,normal,TRUE,no",
"overcast,cool,normal,TRUE,yes")
val row = data.map(_.split("\\W+").map(s=>Set(s)))
.reduce{(a, b) => (a zip b).map{case (l, r) => l ++ r}}
Converting it to a Map as the question asks:
val theMap = row.zipWithIndex.map(_.swap).toMap
1. Zip the list with the index, since that's what we need as the key of the map.
2. The elements of each tuple are unfortunately in the wrong order for .toMap, so swap them.
3. Then we have a list of (key, value) pairs which .toMap will turn into the desired result.
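With the sample data above, the result should look like this (element order within each Set is not guaranteed):
theMap(0)  // Set(sunny, overcast, rainy)
theMap(3)  // Set(FALSE, TRUE)
theMap(4)  // Set(no, yes)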
These don't need to change AT ALL to work with Spark. We just need to use an RDD instead of the List. Let's convert data into an RDD just to demo this:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc = new SparkContext(conf)
val rdd = sc.makeRDD(data)
val row = rdd.map(_.split("\\W+").map(s => Set(s)))
  .reduce { (a, b) => (a zip b).map { case (l, r) => l ++ r } }
(This can be converted into a Map as before)
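Since reduce is an action that brings a plain local value (here an Array[Set[String]]) back to the driver, the conversion is literally the same line as before:
val theMap = row.zipWithIndex.map(_.swap).toMap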
An earlier one-liner works neatly (transpose is exactly what's needed here), but is very difficult to distribute (transpose inherently needs to visit every row):
data.map(_.split("\\W+")).transpose.map(_.toSet)
(Omitting the conversion to Map for clarity)
Split each string into words.
Transpose the result, so we have a list that has a list of the first words, then a list of the second words, etc.
Convert each of those to a set.
Maybe this does the trick:
val a = Array(
"sunny,hot,high,FALSE,no",
"sunny,hot,high,TRUE,no",
"overcast,hot,high,FALSE,yes",
"rainy,mild,high,FALSE,yes",
"rainy,cool,normal,FALSE,yes",
"rainy,cool,normal,TRUE,no",
"overcast,cool,normal,TRUE,yes")
// one Map per column: column index -> set of distinct values seen in that column
val b = new Array[Map[Int, Set[String]]](5)
for (i <- 0 to 4)
  b(i) = Map(i -> (Set() ++ (for (s <- a) yield s.split(",")(i))))
println(b.mkString("\n"))
I'm reviewing some Scala code trying to learn the language. Ran into a piece that looks like the following:
case x if x startsWith "+" =>
  val s: Seq[Char] = x
  s match {
    case Seq('+', rest @ _*) => r.subscribe(rest.toString) { m => }
  }
In this case, what exactly is rest @ _* doing? I understand this is a pattern match on a Seq, but I'm not exactly understanding what that second element in the pattern is supposed to do.
I was asked for more context, so I added the code block I found this in.
If you have come across _* before in the form of applying a Seq as varargs to some method/constructor, e.g.:
val myList = List(args: _*)
then this is the "unapply" counterpart (more specifically, look up "unapplySeq"): it takes the sequence, converts it back to "varargs", and binds the result to rest.
x @ p matches the pattern p and binds the result of the whole match to x. This pattern matches a Seq containing '+' followed by any number (*) of unnamed elements (_), and binds rest to a Seq of those elements.
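A minimal standalone sketch of the same binding (hypothetical input, just to show what rest captures):
"+topic".toSeq match {
  case Seq('+', rest @ _*) => println(rest.mkString)  // prints "topic"
  case _                   => println("no leading +")
}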