Difficulty in understanding variable assignment and function signature output - scala

Apologies for not being able to word the title better. I'm open to suggestions.
I'm trying to make an inverted index where, for each word, I produce a list of the articles that mention that word. Here's my code:
def makeInvertedIndex(words: List[String], rdd: RDD[Article]): RDD[(String, Iterable[Article])] = {
  val foo = rdd flatMap { article =>
    words.map { word =>
      (word, article)
    }.filter(pair => pair._2.mentionsWord(pair._1))
  }
  foo.groupByKey
}
The function above returns a type of RDD[(String, Iterable[Article])] as expected, but if I were to rewrite the function as below:
def makeInvertedIndex(words: List[String], rdd: RDD[Article]): RDD[(String, Iterable[Article])] = {
  rdd flatMap { article =>
    words.map { word =>
      (word, article)
    }.filter(pair => pair._2.mentionsWord(pair._1))
  }.groupByKey
}
I get an error where the signatures don't match. Is there something I'm missing here?
I would assume that the output types would be the same at first glance. Perhaps the .groupByKey in the bottom version is being applied as part of the flatMap?

rdd flatMap { ... }.groupByKey
is parsed as
rdd.flatMap({...}.groupByKey)
but you want
rdd.flatMap({...}).groupByKey
This is expected behavior: it's exactly what you want when you write, say,
1 + something.foo
but it gets a bit hard to follow when something is a large expression, like here.
I like to write
rdd.flatMap { ... }.groupByKey
which will work, but, in this case, I'd rather use a for:
(for {
  article <- rdd
  word <- words
  if article.mentionsWord(word)
} yield (word, article)).groupByKey
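Putting it together, the whole function could look like this (my sketch, assuming the same Article.mentionsWord method from the question):
def makeInvertedIndex(words: List[String], rdd: RDD[Article]): RDD[(String, Iterable[Article])] = {
  (for {
    article <- rdd
    word <- words
    if article.mentionsWord(word)
  } yield (word, article)).groupByKey
}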

How to transform Dataset[(String, Seq[String])] to Dataset[(String, String)]?

This is probably a simple problem, but I'm just beginning my adventure with Spark.
Problem: I'd like to get the structure shown under Expected result below. Right now I have the following structure:
title1, {word11, word12, word13 ...}
title2, {word12, word22, word23 ...}
The data is stored in a Dataset[(String, Seq[String])].
Expected result
I would like to get tuples of (word, title):
word11, {title1}
word12, {title1}
What I do
1. Make (title, Seq[word1, word2, word3])
docs.mapPartitions { iter =>
  iter.map {
    case (title, contents) => {
      val textToLemmas: Seq[String] = toText(....)
      (title, textToLemmas)
    }
  }
}
I tried using .map to transform my structure into tuples, but I couldn't get it to work.
I also tried to iterate through all the elements, but then I couldn't return the right type.
Thanks for any answer.
This should work:
val result = dataSet.flatMap { case (title, words) => words.map((_, title)) }
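For the sample data in the question, that flatMap expands each title's word list into (word, title) pairs. A small sketch of what I'd expect (assuming a SparkSession named spark with its implicits imported):
import spark.implicits._

val docs = Seq(("title1", Seq("word11", "word12")), ("title2", Seq("word12", "word22"))).toDS()
val result = docs.flatMap { case (title, words) => words.map((_, title)) }
result.collect()
// Array((word11,title1), (word12,title1), (word12,title2), (word22,title2))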
Another solution is to call the explode function, like this:
import org.apache.spark.sql.functions.{col, explode}
dataset.withColumn("_2", explode(col("_2"))).as[(String, String)]
Hope this helps you. Best regards.
I'm surprised no one offered a solution with Scala's for-comprehension (that gets "desugared" to flatMap and map as in Yuval Itzchakov's answer at compile time).
When you see a series of flatMap and map (possibly with filter), that's a job for Scala's for-comprehension.
So the following:
val result = dataSet.flatMap { case (title, words) => words.map((_, title)) }
is equivalent to the following:
val result = for {
  (title, words) <- dataSet
  w <- words
} yield (w, title)
After all, that's why we enjoy the flexibility of Scala, isn't it?

Does this specific exercise lend itself well to a 'functional style' design pattern?

Say we have an array of one-dimensional JavaScript objects in a file Array.json for which the key schema isn't known; that is, the keys aren't known until the file is read.
We then wish to output a CSV file whose header (first line) is the comma-delimited set of keys drawn from all of the objects.
Each subsequent line of the file should contain the comma-separated values corresponding to each of those keys.
Array.json
[
  {
    "abc": 123,
    "xy": "yz",
    "s12": 13
  },
  ...
  {
    "abc": 1,
    "s": 133
  }
]
A valid output:
abc,xy,s12,s
123,yz,13,
1,,,133
I'm teaching myself 'functional style' programming, but I suspect that this problem doesn't lend itself well to a functional solution.
I believe that this problem requires some state to be kept for the output header and that each subsequent line depends on that header.
I'm looking to solve the problem in a single pass. My goals are efficiency for a large data set, minimal traversals, and, if possible, parallelizability. If this isn't possible, can you give a proof or reasoning to explain why?
EDIT: Is there a way to solve the problem like this, functionally?
Say you pass through the array once, in some particular order. From the start the header set looks like abc,xy,s12 for the first object, with CSV entry 123,yz,13. Then on the next object we add an additional key to the header set, so abc,xy,s12,s would be the header and the CSV entry would be 1,,,133. In the end we wouldn't need to pass through the data set a second time; we could just append extra commas to the result set. This is one way we could approach a single pass.
Are there functional tools (functions) designed to solve problems like this, and what should I be considering? [By functional tools I mean monads, flatMap, filters, etc.] Alternatively, should I be considering things like Futures?
Currently I've been trying to approach this using Java 8, but am open to solutions in Scala, etc. Ideally I would like to determine whether Java 8's functional approach can solve the problem, since that's the language I'm currently working in.
Since the CSV output changes with every new line of input, you must hold it in memory before writing it out. If you count converting the internal representation of the CSV file (practically a Map[String, List[String]] that you must traverse to turn into text) into the output text format as another "pass" over the data, then it's not possible to do this in a single pass.
If, however, that is acceptable, then you can use a Stream to read a single item from your JSON file, merge it into the CSV representation, and repeat until the stream is empty.
Assuming that the internal representation of the CSV file is
trait CsvFile {
  def merge(line: Map[String, String]): CsvFile
}
and that you can represent a single item as
trait Item {
  def asMap: Map[String, String]
}
You can implement it using foldLeft:
def toCsv(items: Stream[Item]): CsvFile =
  items.foldLeft(CsvFile(Map()))((csv, item) => csv.merge(item.asMap))
or use recursion to get the same result
import scala.annotation.tailrec

@tailrec def toCsv(items: Stream[Item], prevCsv: CsvFile): CsvFile =
  items match {
    case Stream.Empty => prevCsv
    case item #:: rest =>
      val newCsv = prevCsv.merge(item.asMap)
      toCsv(rest, newCsv)
  }
Note: of course you don't have to create types for CsvFile or Item; you can use Map[String, List[String]] and Map[String, String] respectively.
UPDATE:
As more detail was requested for the CsvFile trait/class, here's an example implementation:
case class CsvFile(lines: Map[String, List[String]], rowCount: Int = 0) {
  def merge(line: Map[String, String]): CsvFile = {
    val orig = lines.withDefaultValue(List.fill(rowCount)(""))
    val current = line.withDefaultValue("")
    val newLines = (lines.keySet ++ line.keySet) map { k =>
      (k, orig(k) :+ current(k))
    }
    CsvFile(newLines.toMap, rowCount + 1)
  }
}
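As a rough usage sketch of my own (not part of the original answer), the folded CsvFile can then be rendered to CSV text, padding missing cells with empty strings:
val csv = CsvFile(Map())
  .merge(Map("abc" -> "123", "xy" -> "yz", "s12" -> "13"))
  .merge(Map("abc" -> "1", "s" -> "133"))

val header = csv.lines.keys.toList
val rows = (0 until csv.rowCount).map(i => header.map(k => csv.lines(k)(i)))
val text = (header.mkString(",") +: rows.map(_.mkString(","))).mkString("\n")
// e.g. abc,xy,s12,s / 123,yz,13, / 1,,,133 (column order depends on the Map's iteration order)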
This could be one approach:
val arr = Array(Map("abc" -> 123, "xy" -> "yz", "s12" -> 13), Map("abc" -> 1, "s" -> 133))
val keys = arr.flatMap(_.keys).distinct // get the distinct keys for header
arr.map(x => keys.map(y => x.getOrElse(y,""))) // get an array of rows
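To produce the CSV text from the question, one could then (as a sketch on my part, not in the original answer) prepend the header and join the rows:
val rows = arr.map(x => keys.map(y => x.getOrElse(y, "")))
val csv = (keys.mkString(",") +: rows.map(_.mkString(","))).mkString("\n")
// abc,xy,s12,s
// 123,yz,13,
// 1,,,133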
It's completely OK to have state in functional programming, but having mutable state, or mutating state in place, is not.
Functional programming advocates creating new, changed state instead of mutating state in place.
So it's OK to read and access state created in the program, as long as you are not mutating it or causing side effects.
Coming to the point:
val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list.map { inner => inner.map { case (k, v) => k}}.flatten
list.map { inner => inner.map { case (k, v) => v}}.flatten
REPL
scala> val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list: List[List[(String, String)]] = List(List((abc,123), (xy,yz)), List((abc,1)))
scala> list.map { inner => inner.map { case (k, v) => k}}.flatten
res1: List[String] = List(abc, xy, abc)
scala> list.map { inner => inner.map { case (k, v) => v}}.flatten
res2: List[String] = List(123, yz, 1)
or use flatMap instead of map and flatten
val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list.flatMap { inner => inner.map { case (k, v) => k}}
list.flatMap { inner => inner.map { case (k, v) => v}}
In functional programming, mutable state is not allowed, but immutable state/values are fine.
Assuming that you have read your JSON file into a value input: List[Map[String, String]], the code below will solve your problem:
val input = List(Map("abc" -> "123", "xy" -> "yz", "s12" -> "13"), Map("abc" -> "1", "s" -> "33"))
val keys = input.flatMap(_.keys).distinct // distinct keeps a stable column order (a Set would not guarantee one)
val values = input.map(kvs => keys.map(k => kvs.getOrElse(k, "")))
val result = keys.mkString(",") + "\n" + values.map(_.mkString(",")).mkString("\n")
// abc,xy,s12,s
// 123,yz,13,
// 1,,,33

Removing Try failures in collection using flatMap

I have a Map[String, String].
How can I simplify this expression using flatMap?
val carNumbers = carMap.keys.map(k => Try(k.stripPrefix("car_number_").toInt)).toList.filter(_.isSuccess)
Note: I want to remove the Failure/Success wrapper and just have a List[Int].
It looks like you just want to convert Try to Option:
for {
  key <- carMap.keys
  t <- Try(key.stripPrefix("car_number_").toInt).toOption
} yield t
This will result in an Iterable, and you can convert it to a list with the .toList method.
You can also go with a one-liner like this:
carMap.keys.flatMap(k => Try(k.stripPrefix("car_number_").toInt).toOption)
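For example (a sketch of my own, with made-up keys including one that doesn't parse):
import scala.util.Try

val carMap = Map("car_number_7" -> "a", "car_number_12" -> "b", "bogus" -> "c")
val carNumbers = carMap.keys
  .flatMap(k => Try(k.stripPrefix("car_number_").toInt).toOption)
  .toList
// List(7, 12) -- the "bogus" key fails .toInt and is simply dropped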
Consider using collect with a partial function:
import scala.util.{Success, Try}

carMap.keys
  .map(k => Try(k.stripPrefix("car_number_").toInt))
  .collect { case Success(num) => num }
This will return an Iterable[Int] with the values that could be stripped and converted to an Int (assuming this is what you were looking for).

Access a tuple inside a tuple for an anonymous map job in Spark

This post is essentially about how to build joint and marginal histograms from a (String, String) RDD. I posted the code that I eventually used below as the answer.
I have an RDD that contains a set of tuples of type (String, String). Since they aren't unique, I want to see how many times each (String, String) combination occurs, so I use countByValue like so:
val PairCount = Pairs.countByValue().toSeq
which gives me tuples of the form ((String, String), Long), where the Long is the number of times that the (String, String) tuple appeared.
These Strings can be repeated in different combinations, and I essentially want to run a word count on this PairCount variable, so I tried something like this to start:
PairCount.map(x => (x._1._1, x._2))
But the output this spits out is String1->1, String2->1, String3->1, etc.
How do I output a key-value pair from a map job in this case, where the key is one of the String values from the inner tuple and the value is the Long value from the outer tuple?
Update:
@vitalii gets me almost there. The answer gets me to a Seq[(String, Long)], but what I really need is to turn that into a map so that I can run reduceByKey on it afterwards. When I run
PairCount.flatMap { case ((x, y), n) => Seq(x -> n) }.toMap
for each unique x I get x -> 1.
For example, the above line of code generates mom->1, dad->1 even if the tuples out of the flatMap included (mom,30), (dad,59), (mom,2), (dad,14), in which case I would expect toMap to provide mom->30, dad->59, mom->2, dad->14. However, I'm new to Scala, so I might be misinterpreting the functionality.
How can I get the Tuple2 sequence converted to a map so that I can reduce on the map keys?
If I understand the question correctly, you need flatMap:
val pairCountRDD = pairs.countByValue() // RDD[((String, String), Int)]
val res: RDD[(String, Int)] = pairCountRDD.flatMap { case ((s1, s2), n) =>
  Seq(s1 -> n, s2 -> n)
}
Update: I didn't quite understand what your final goal is, but here are a few more examples that may help you. By the way, the code above is incorrect; I missed the fact that countByValue returns a Map, not an RDD:
val pairs = sc.parallelize(
  List(
    "mom" -> "dad", "dad" -> "granny", "foo" -> "bar", "foo" -> "baz", "foo" -> "foo"
  )
)
// don't use countByValue; if pairs is large you will run out of memory
val pairCountRDD = pairs.map(x => (x, 1)).reduceByKey(_ + _)
val wordCount = pairs.flatMap { case (a,b) => Seq(a -> 1, b ->1)}.reduceByKey(_ + _)
wordCount.take(10)
// count in how many pairs each word occurs, whether as key or value:
val wordPairCount = pairs.flatMap { case (a, b) =>
  if (a == b) {
    Seq(a -> 1)
  } else {
    Seq(a -> 1, b -> 1)
  }
}.reduceByKey(_ + _)
wordPairCount.take(10)
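For the sample pairs above, the two counts differ for "foo" (a manual walkthrough on my part, not output I have run):
// wordCount counts every occurrence, and foo appears three times on the left and once on the right:
//   foo -> 4, dad -> 2, mom -> 1, granny -> 1, bar -> 1, baz -> 1
// wordPairCount counts pairs containing the word, so ("foo","foo") contributes only 1:
//   foo -> 3, dad -> 2, mom -> 1, granny -> 1, bar -> 1, baz -> 1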
To get the histograms for the (String, String) RDD I used this code:
val Hist_X = histogram.map(x => (x._1 -> 1.0)).reduceByKey(_ + _).collect().toMap
val Hist_Y = histogram.map(x => (x._2 -> 1.0)).reduceByKey(_ + _).collect().toMap
val Hist_XY = histogram.map(x => (x -> 1.0)).reduceByKey(_ + _)
where histogram was the (String,String) RDD

How to get a result from Enumerator/Iteratee?

I am using Play 2 and ReactiveMongo to fetch a result from MongoDB. Each item of the result needs to be transformed to add some metadata. Afterwards I need to apply some sorting to it.
To deal with the transformation step I use enumerate():
def ideasEnumerator = collection.find(query)
  .options(QueryOpts(skipN = page))
  .sort(Json.obj(sortField -> -1))
  .cursor[Idea]
  .enumerate()
Then I create an Iteratee as follows:
val processIdeas: Iteratee[Idea, Unit] =
  Iteratee.foreach[Idea] { idea =>
    resolveCrossLinks(idea) flatMap { idea =>
      addMetaInfo(idea.copy(history = None))
    }
  }
Finally I feed the Iteratee:
ideasEnumerator(processIdeas)
And now I'm stuck. Every example I've seen does some println inside foreach, but doesn't seem to care about a final result.
So when all documents are returned and transformed, how do I get a Sequence, a List, or some other datatype I can work with further?
Change the signature of your Iteratee from Iteratee[Idea, Unit] to Iteratee[Idea, Seq[A]], where A is your element type. Basically, the first type parameter of Iteratee is the input type and the second is the output type; in your case you gave the output type as Unit.
Take a look at the code below. It may not compile, but it gives you the basic usage.
ideasEnumerator.run(
  Iteratee.fold(List.empty[MyObject]) { (accumulator, next) =>
    accumulator + resolveCrossLinks(next) flatMap { next =>
      addMetaInfo(next.copy(history = None))
    }
  }
) // returns Future[List[MyObject]]
As you can see, an Iteratee is simply a state machine. Just extract that Iteratee part and assign it to a val:
val iteratee = Iteratee.fold(List.empty[MyObject]) { (accumulator, next) =>
  accumulator + resolveCrossLinks(next) flatMap { next =>
    addMetaInfo(next.copy(history = None))
  }
}
and feel free to use it wherever you need to convert from your Idea to a List[MyObject].
With the help of your answers I ended up with
val processIdeas: Iteratee[Idea, Future[Vector[Idea]]] =
  Iteratee.fold(Future(Vector.empty[Idea])) { (accumulator: Future[Vector[Idea]], next: Idea) =>
    resolveCrossLinks(next) flatMap { next =>
      addMetaInfo(next.copy(history = None))
    } flatMap (ideaWithMeta => accumulator map (acc => acc :+ ideaWithMeta))
  }
val ideas = collection.find(query)
  .options(QueryOpts(page, perPage))
  .sort(Json.obj(sortField -> -1))
  .cursor[Idea]
  .enumerate(perPage).run(processIdeas)
This later needs an ideas.flatMap(identity) to flatten the resulting Future of Futures, but I'm fine with that, and everything looks idiomatic and elegant, I think.
The performance gained compared to creating a list and iterating over it afterwards is negligible, though.
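One possible simplification (a sketch of my own, assuming Play's Iteratee.foldM, which folds with a Future-returning step) avoids the nested Future entirely:
import play.api.libs.iteratee.Iteratee
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// foldM takes a (state, element) => Future[state] step, so the accumulator stays a
// plain Vector[Idea] and run() yields a single Future[Vector[Idea]] with no flattening.
val processIdeas: Iteratee[Idea, Vector[Idea]] =
  Iteratee.foldM(Vector.empty[Idea]) { (acc, next: Idea) =>
    resolveCrossLinks(next)
      .flatMap(idea => addMetaInfo(idea.copy(history = None)))
      .map(acc :+ _)
  }

val ideas: Future[Vector[Idea]] = ideasEnumerator.run(processIdeas)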