How to transform Dataset[(String, Seq[String])] to Dataset[(String, String)]? - scala

This is probably a simple problem, but I'm just beginning my adventure with Spark.
Problem: I'd like to get the following structure (see Expected result) in Spark. Currently I have this structure:
title1, {word11, word12, word13 ...}
title2, {word12, word22, word23 ...}
Data are stored in Dataset[(String, Seq[String])]
Expected result
I would like to get tuples of (word, title):
word11, {title1}
word12, {title1}
What I did
1. Build (title, Seq[word1, word2, word3]):
docs.mapPartitions { iter =>
iter.map {
case (title, contents) => {
val textToLemmas: Seq[String] = toText(....)
(title, textToLemmas)
}
}
}
I tried to use .map to transform my structure into tuples, but couldn't get it to work.
I also tried to iterate through all the elements, but then I couldn't return the right type.
Thanks for any answers.

This should work:
val result = dataSet.flatMap { case (title, words) => words.map((_, title)) }
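The same inversion can be tried without a Spark session, since flatMap has the same shape on plain Scala collections. This is only a sketch: docs and pairs are illustrative names, and the sample data is taken from the question.

```scala
// Sample data in the same shape as the question's Dataset[(String, Seq[String])].
val docs: Seq[(String, Seq[String])] = Seq(
  "title1" -> Seq("word11", "word12", "word13"),
  "title2" -> Seq("word12", "word22", "word23")
)

// For every (title, words) pair, emit one (word, title) pair per word.
val pairs: Seq[(String, String)] =
  docs.flatMap { case (title, words) => words.map(word => (word, title)) }
```

Each title contributes one output row per word, so two titles with three words each yield six pairs.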

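Another solution is to call the explode function, which takes a Column rather than a column name:
import org.apache.spark.sql.functions.{col, explode}
// needs import spark.implicits._ for the (String, String) encoder; the columns stay in (title, word) order
dataset.withColumn("_2", explode(col("_2"))).as[(String, String)]
Hope this helps. Best regards.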

I'm surprised no one offered a solution with Scala's for-comprehension (that gets "desugared" to flatMap and map as in Yuval Itzchakov's answer at compile time).
When you see a series of flatMap and map (possibly with filter) that's Scala's for-comprehension.
So the following:
val result = dataSet.flatMap { case (title, words) => words.map((_, title)) }
is equivalent to the following:
val result = for {
(title, words) <- dataSet
w <- words
} yield (w, title)
After all, that's why we enjoy flexibility of Scala, isn't it?
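To illustrate the desugaring on plain Scala collections, here is a small sketch with a guard added, which is the "possibly with filter" case. The names docs, viaFor and viaCalls and the guard condition are made up for the example.

```scala
val docs: Seq[(String, Seq[String])] = Seq(
  "title1" -> Seq("word11", "word12"),
  "title2" -> Seq("word12", "word22")
)

// For-comprehension with two generators and a guard.
val viaFor: Seq[(String, String)] = for {
  (title, words) <- docs
  w <- words
  if w.endsWith("2")
} yield (w, title)

// Hand-written chain: flatMap for the outer generator,
// withFilter for the guard, map for the yield.
val viaCalls: Seq[(String, String)] = docs.flatMap { case (title, words) =>
  words.withFilter(_.endsWith("2")).map(w => (w, title))
}
```

Both values contain the same pairs in the same order.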

Related

Difficulty in understanding variable assignment and function signature output

Apologies for not being able to word the title better. I'm open to suggestions.
I'm trying to make an inverted index where, for each word, I produce a list of the articles that mention it. Here's my code:
def makeInvertedIndex(words: List[String], rdd: RDD[Article]): RDD[(String, Iterable[Article])] = {
  val foo = rdd flatMap { article =>
    words.map { word =>
      (word, article)
    }.filter(pair => pair._2.mentionsWord(pair._1))
  }
  foo.groupByKey
}
The function above returns a type of RDD[(String, Iterable[Article])] as expected, but if I were to rewrite the function as below:
def makeInvertedIndex(words: List[String], rdd: RDD[Article]): RDD[(String, Iterable[Article])] = {
  rdd flatMap { article =>
    words.map { word =>
      (word, article)
    }.filter(pair => pair._2.mentionsWord(pair._1))
  }.groupByKey
}
I get an error saying the signatures don't match. Is there something I'm missing here?
At first glance I would assume the output types are the same. Perhaps the .groupByKey in the second version is being applied as part of the flatMap?
rdd flatMap { ... }.groupByKey
is parsed as
rdd.flatMap({...}.groupByKey)
but you want
rdd.flatMap({...}).groupByKey
This is expected behavior: you want this when you do, say
1 + something.foo
but it gets a bit hard to follow when something is a large expression, like here.
I like to write
rdd.flatMap { ... }.groupByKey
which will work, but, in this case, I'd rather use a for:
(for {
article <- rdd
word <- words
if article.mentionsWord(word)
} yield (word, article)
).groupByKey
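The parsing rule can be reproduced without Spark. The sketch below uses plain Lists, with groupBy standing in for groupByKey. The sample words and articles are invented, and contains is a stand-in for mentionsWord.

```scala
val words = List("spark", "scala")
val articles = List("spark and scala", "scala only")

// Dot notation: .groupBy is applied to the result of flatMap.
val grouped: Map[String, List[(String, String)]] =
  articles.flatMap { article =>
    words.filter(word => article.contains(word)).map(word => (word, article))
  }.groupBy(_._1)

// Keep only the articles in each group, giving word -> articles.
val index: Map[String, List[String]] =
  grouped.map { case (word, pairs) => (word, pairs.map(_._2)) }

// With infix notation the selection binds to the block instead:
//   articles flatMap { ... }.groupBy(_._1)
// is parsed as
//   articles.flatMap({ ... }.groupBy(_._1))
```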

Get Future objects from Future Options in Scala

I am new to Scala from Java so the functional programming thing is still a bit difficult for me to understand. I have a project in Play framework. I need to query the database to get rows with ids and display them in a html template.
Here is my code
def search(query: String) = Action.async{ request =>
val result = SearchEngine.searchResult(query)
val docs = result.map(DocumentService.getDocumentByID(_).map(doc => doc))
val futures = Future.sequence(docs)
futures.map{documents =>
Ok(views.html.results(documents.flatten))
}
}
getDocumentByID returns a Future[Option[Document]], but my results template takes an Array[Document], so I have tried, to no avail, to transform the Future[Option[Document]] values into an Array[Document].
The current code I have is the closest I have been, but it still does not compile. This is the error:
Error:(36, -1) Play 2 Compiler:
found : Array[scala.concurrent.Future[Option[models.Document]]]
required: M[scala.concurrent.Future[A]]
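Convert the Array to a List before sequencing (Future.sequence needs a collection such as List, not an Array), and collect only the Somes from the Future returned by getDocumentByID:
val docs = result.map { res =>
  val f: Future[Option[Document]] = DocumentService.getDocumentByID(res)
  // note: Future.collect fails the future with NoSuchElementException when the Option is None
  f.collect { case Some(doc) => doc }
}.toList
val futures = Future.sequence(docs) // docs was converted from an Array to a List on the previous line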
General suggestion
Do not use Arrays. Arrays are mutable and they do not grow dynamically.
So it is advisable to avoid using Array in concurrent/parallel code.
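As an alternative sketch, the Options can be kept until all futures have completed and flattened afterwards, so a missing document is simply dropped instead of failing its future. The Document case class and the getDocumentByID stub below are hypothetical stand-ins for the real model and DocumentService, and Await is used only to make the example observable.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical model and lookup: only even ids "exist".
final case class Document(id: Int)

def getDocumentByID(id: Int): Future[Option[Document]] =
  Future.successful(if (id % 2 == 0) Some(Document(id)) else None)

val ids: Array[Int] = Array(1, 2, 3, 4)

// Array -> List so Future.sequence applies, then drop the Nones.
val documents: Future[List[Document]] =
  Future.sequence(ids.toList.map(getDocumentByID)).map(_.flatten)

// Blocking here is for demonstration only; in a Play action you would
// map over the future, as in the question.
val found: List[Document] = Await.result(documents, 5.seconds)
```

If the template really needs an Array[Document], calling .toArray on the final list is enough.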

Removing Try failures in collection using flatMap

I have a Map[String, String]
How can I simplify this expression using flatMap?
val carNumbers = carMap.keys.map(k => Try(k.stripPrefix("car_number_").toInt)).toList.filter(_.isSuccess)
Note: I want to remove the Failure/Success wrapper and just have a List[Int].
It looks like you just want to convert Try to Option:
for {
key <- carMap.keys
t <- Try(key.stripPrefix("car_number_").toInt).toOption
} yield t
This results in an Iterable, which you can convert to a list with the .toList method.
You can also write it as a one-liner:
carMap.keys.flatMap(k => Try(k.stripPrefix("car_number_").toInt).toOption)
Consider mapping to Try first and then using collect() with a partial function:
carMap.keys
  .map(k => Try(k.stripPrefix("car_number_").toInt))
  .collect { case Success(num) => num } // needs import scala.util.{Try, Success}
This will return an Iterable[Int] with the values that could be stripped and converted to an Int (assuming this is what you were looking for).
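For reference, a self-contained sketch of the flatMap variant with made-up sample data. The keys and values in carMap are invented for the example.

```scala
import scala.util.Try

// Invented sample map: two keys parse to an Int, two do not.
val carMap: Map[String, String] = Map(
  "car_number_12" -> "red",
  "car_number_x"  -> "blue",
  "car_number_7"  -> "green",
  "owner"         -> "alice"
)

// Try(...).toOption turns a failed parse into None, and flatMap drops the Nones.
val carNumbers: List[Int] =
  carMap.keys.toList.flatMap(k => Try(k.stripPrefix("car_number_").toInt).toOption)
```

The result contains only 12 and 7. The order follows the map's iteration order, so sort it if order matters.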

How to get a result from Enumerator/Iteratee?

I am using play2 and reactivemongo to fetch a result from mongodb. Each item of the result needs to be transformed to add some metadata. Afterwards I need to apply some sorting to it.
To deal with the transformation step I use enumerate():
def ideasEnumerator = collection.find(query)
.options(QueryOpts(skipN = page))
.sort(Json.obj(sortField -> -1))
.cursor[Idea]
.enumerate()
Then I create an Iteratee as follows:
val processIdeas: Iteratee[Idea, Unit] =
Iteratee.foreach[Idea] { idea =>
resolveCrossLinks(idea) flatMap { idea =>
addMetaInfo(idea.copy(history = None))
}
}
Finally I feed the Iteratee:
ideasEnumerator(processIdeas)
And now I'm stuck. Every example I saw does some println inside foreach, but seems not to care about a final result.
So when all documents are returned and transformed how do I get a Sequence, a List or some other datatype I can further deal with?
Change the signature of your Iteratee from Iteratee[Idea, Unit] to Iteratee[Idea, Seq[A]], where A is the type of the elements you want to collect. The first type parameter of Iteratee is the input type and the second is the output type; in your case you gave the output type as Unit.
Take a look at the code below. It may not compile, but it shows the basic usage.
ideasEnumerator.run(
Iteratee.fold(List.empty[MyObject]) { (accumulator, next) =>
accumulator + resolveCrossLinks(next) flatMap { next =>
addMetaInfo(next.copy(history = None))
}
}
) // returns Future[List[MyObject]]
As you can see, an Iteratee is simply a state machine. Just extract that Iteratee part and assign it to a val:
val iteratee = Iteratee.fold(List.empty[MyObject]) { (accumulator, next) =>
accumulator + resolveCrossLinks(next) flatMap { next =>
addMetaInfo(next.copy(history = None))
}
}
and feel free to use it wherever you need to convert from your Idea to List[MyObject].
With the help of your answers I ended up with:
val processIdeas: Iteratee[Idea, Future[Vector[Idea]]] =
Iteratee.fold(Future(Vector.empty[Idea])) { (accumulator: Future[Vector[Idea]], next:Idea) =>
resolveCrossLinks(next) flatMap { next =>
addMetaInfo(next.copy(history = None))
} flatMap (ideaWithMeta => accumulator map (acc => acc :+ ideaWithMeta))
}
val ideas = collection.find(query)
.options(QueryOpts(page, perPage))
.sort(Json.obj(sortField -> -1))
.cursor[Idea]
.enumerate(perPage).run(processIdeas)
This later needs an ideas.flatMap(identity) to flatten the resulting Future of Future, but I'm fine with that, and I think everything looks idiomatic and elegant.
The performance gain compared to creating a list and iterating over it afterwards is negligible, though.
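The accumulator pattern used in that fold can be tried with the standard library alone. The sketch below is an analogue rather than Play Iteratee code: a plain List and foldLeft replace the Enumerator and Iteratee.fold, and addMeta is a hypothetical stand-in for resolveCrossLinks followed by addMetaInfo.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical asynchronous transformation of one item.
def addMeta(idea: String): Future[String] =
  Future.successful(idea + " [meta]")

val ideas = List("a", "b", "c")

// Fold into a Future[Vector[...]]: transform the next item, then append it
// to the accumulated vector once both futures have completed.
val processed: Future[Vector[String]] =
  ideas.foldLeft(Future.successful(Vector.empty[String])) { (accumulator, next) =>
    addMeta(next).flatMap(withMeta => accumulator.map(acc => acc :+ withMeta))
  }

// Blocking is for demonstration only.
val result: Vector[String] = Await.result(processed, 5.seconds)
```

Because each step appends to the previous accumulator, the output keeps the input order.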

Iterate Over a tuple

I need to implement a generic method that takes a tuple and returns a Map
Example :
val tuple=((1,2),(("A","B"),("C",3)),4)
I have been trying to break this tuple into a list :
val list=tuple.productIterator.toList
Scala>list: List[Any] = List((1,2), ((A,B),(C,3)), 4)
But this returns a List[Any].
I am now trying to find out how to iterate over a tuple such as:
((1,2),(("A","B"),("C",3)),4)
in order to loop over each element: 1, 2, "A", "B", and so on. How could I do this kind of iteration over the tuple?
What about this?
def flatProduct(t: Product): Iterator[Any] = t.productIterator.flatMap {
case p: Product => flatProduct(p)
case x => Iterator(x)
}
val tuple = ((1,2),(("A","B"),("C",3)),4)
flatProduct(tuple).mkString(",") // 1,2,A,B,C,3,4
OK, the Any problem remains, but that's due to the return type of productIterator.
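One thing to keep in mind: every case class is a Product, including Some and the list cons cell ::, so a match on Product recurses into those as well. If only tuples should be flattened, a variant along these lines may help. It is a sketch: isTuple and flatTuple are illustrative names, and the class-name check is a heuristic that assumes the standard scala.TupleN classes.

```scala
// Heuristic: standard tuples are instances of scala.Tuple1 .. scala.Tuple22
// (including specialized variants such as scala.Tuple2$mcII$sp).
def isTuple(p: Product): Boolean =
  p.getClass.getName.startsWith("scala.Tuple")

// Like flatProduct, but only recurses into tuples.
def flatTuple(t: Product): Iterator[Any] = t.productIterator.flatMap {
  case p: Product if isTuple(p) => flatTuple(p)
  case x                        => Iterator(x)
}

val nested = ((1, 2), (("A", "B"), ("C", 3)), 4)
val flat: List[Any] = flatTuple(nested).toList

// Some(2) is a Product but not a tuple, so it is kept as a single element.
val keepsOption: List[Any] = flatTuple((1, Some(2))).toList
```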
Instead of tuples, use Shapeless data structures like HList. You can have generic processing, and also don't lose type information.
The only problem is that documentation isn't very comprehensive.
// foreach rather than map: Iterator.map is lazy, so the println calls would never run
tuple.productIterator foreach {
  case (a, b) => println((a, b))
  case a => println(a)
}
This works for me. transform is a tuple consisting of DataFrames:
def apply_function(a: DataFrame) = a.write.format("parquet").save("..." + a + ".parquet")
transform.productIterator.map(_.asInstanceOf[DataFrame]).foreach(a => apply_function(a))