Mapping RDD to function does not invoke the function - scala

I am using Scala Spark API. In my code, I have an RDD of the following structure:
Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])]
I need to process (perform validations on and modify values of) the second element of each tuple in the RDD. I am using the map function to do that:
myRDD.map(line => mappingFunction(line))
Unfortunately, the mappingFunction is not invoked. This is the code of the mapping function:
def mappingFunction(line: Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])]): Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])] = {
  println("Inside mappingFunction")
  line
}
When my program ends, there are no printed messages in the stdout.
In order to investigate the problem, I implemented a code snippet that worked:
val x = List.range(1, 10)
val mappedX = x.map(i => callInt(i))
And the following mapping function was invoked:
def callInt(i: Int) = {
  println("Inside callInt")
}
Please assist in getting the RDD mapping function mappingFunction invoked. Thank you.

x is a List, so there is no laziness there; that's why your function is invoked even though you never call an action.
myRDD is an RDD. RDDs are lazy, which means your transformations (map, flatMap, filter) are not actually executed until they are needed.
That means you are not running your map function until you perform an action. An action is an operation that triggers the preceding operations (called transformations) to be executed.
Some examples of actions are collect and count.
If you do this:
myRDD.map(line => mappingFunction(line)).count()
You'll see your prints. There is no problem with your code at all; you just need to take into account the lazy nature of RDDs.
There is a good answer about this topic here.
Also, you can find more info and a whole list of transformations and actions here.
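As a minimal sketch of this laziness (assuming a SparkContext named sc, as in spark-shell; note that in cluster mode the prints go to the executors' stdout rather than the driver's):

val rdd = sc.parallelize(Seq("a", "b"))

// Transformation only: nothing runs yet, Spark just records the lineage.
val mapped = rdd.map { s =>
  println("inside map: " + s)
  s.toUpperCase
}

// Action: this triggers the map above, and the prints appear.
mapped.count()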

Related

What does code block mean in a scala anonymous function in Spark?

I am new to Scala and don't understand what a code block means in an anonymous function. Here is some example code:
def makeIndex(langs: List[String], rdd: RDD[WikipediaArticle]): RDD[(String, Iterable[WikipediaArticle])] = {
  val articles_Languages = rdd.flatMap(article => {
    langs.filter(lang => article.mentionsLanguage(lang))
      .map(lang => (lang, article))
  })
  articles_Languages.groupByKey
}
Does it mean that a WikipediaArticle object is transformed from its original form into a list of tuples (lang, article), then flattened, and, by calling groupByKey, turned into an RDD[(String, Iterable[WikipediaArticle])]?
Does it mean that I can write any code inside a {} block as long as the final line inside the block returns the object I want? Is that how this example iterates over langs for each article?
map and flatMap are higher-order functions: they receive a function as a parameter, and you can call them in several ways. You can pass a method you have defined, an anonymous function inside () if it is a single expression, or one inside {} if you need more lines of code.
And yes, you can pass whatever you want as long as you follow the required signature, meaning the input and output types have to match it.
In the case of map the signature is A => B, so you can transform your A into anything you want.
E.g. for an RDD of Int:
rdd.map(x => x + 1)
The x => x + 1 is an anonymous function called by map (or some other method).
Instead of declaring the input and output types with a def, you let Scala infer them; in this case the result type Int is inferred.
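As a sketch of the three calling styles mentioned above (nums is an assumed RDD[Int], not from the original question):

def addOne(x: Int): Int = x + 1

nums.map(addOne)        // a method you have defined
nums.map(x => x + 1)    // a one-line anonymous function inside ()
nums.map { x =>         // a block inside {}: any code, the last expression is the result
  val incremented = x + 1
  incremented
}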

Slick: update List in db

My Postgres table has a userId column and a second column in which I store a List[String]. I wrote a working method that updates this list with the union of a new list and the old list:
def update(userId: Long, unknownWords: List[String]) = db.run {
  for {
    y <- lists.filter(_.userId === userId).result
    words = y.map(_.unknownWords).flatMap(_.union(unknownWords)).distinct.toList
    x <- lists.filter(_.userId === userId).map(_.unknownWords).update(words)
  } yield x
}
Is there any way to write this better? And maybe the question is pretty dumb, but I don't quite understand why I should apply .result on the first line of the for expression; the filter().map() chain on the 3rd line works fine without it. Is there something wrong with the types?
Why .result
The reason you need to apply .result is to do with the difference between queries (Query type) and actions (DBIO) in Slick.
By itself, the lists.filter line is a query. However, the third line (the update) is an action. If you left the .result off, your for comprehension would have a type mismatch between a Query and a DBIO (action).
Because you're going to db.run the result of the for comprehension, the comprehension needs to produce a DBIO action rather than a query. In other words, putting a .result there is the right thing to do, because you're constructing an action to run against the database (namely, fetching some data for the user).
You're then going to run another action later to update the database. So in all, you're using for to combine two actions (two runnable SQL expressions) into a single DBIO. That's the x you yield, which is executed by db.run.
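To make the query/action distinction concrete, a minimal sketch reusing the names from the question:

val userLists = lists.filter(_.userId === userId)  // a Query: describes a SELECT, not runnable on its own
val fetchAction = userLists.result                 // a DBIO action: something db.run can execute
db.run(fetchAction)                                // runs the SELECT, returning a Future of the rows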
Better?
This is working for you, and that's just fine.
There's a small amount of duplication: you might spot that the query on the first line is very similar to the update query. You could abstract it out into a value:
val userLists = lists.filter(_.userId === userId)
That's a query. In fact, you could go a step further and modify the query to just select the unknownWords column:
val userUnknownWords = lists.filter(_.userId === userId).map(_.unknownWords)
I've not tried to compile this but that would make your code something like:
def update(userId: Long, unknownWords: List[String]) = {
  val userUnknownWords = lists.filter(_.userId === userId).map(_.unknownWords)
  db.run {
    for {
      y <- userUnknownWords.result
      words = y.flatMap(_.union(unknownWords)).distinct.toList
      x <- userUnknownWords.update(words)
    } yield x
  }
}
Given that you're composing two actions (a select and an update), you could use flatMap on the DBIO in place of the for comprehension. You might find it clearer. Or not. But here's an example...
The function you pass to flatMap must return another action. That is, flatMap is a way to sequence actions; in particular, it lets you use the value fetched from the database in the next action.
So you could replace the for comprehension with:
val action: DBIO[Int] =
  userUnknownWords.result.flatMap { currentWords =>
    userUnknownWords.update(
      currentWords.flatMap(_.union(unknownWords)).distinct.toList
    )
  }
(Again, apologies for not compiling the above: I don't have the details of the types, but hopefully this will give a flavour for how the code could work).
The final action is the one you can pass to db.run. It returns the number of rows changed.
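For completeness, a sketch of running it (assuming the usual scala.concurrent imports are available; error handling omitted):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val rowsChanged: Future[Int] = db.run(action)
rowsChanged.foreach(n => println(s"updated $n row(s)"))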

Losing types on sequencing Futures

I'm trying to do this:
case class ConversationData(members: Seq[ConversationMemberModel], messages: Seq[MessageModel])
val membersFuture: Future[Seq[ConversationMemberModel]] = ConversationMemberPersistence.searchByConversationId(conversationId)
val messagesFuture: Future[Seq[MessageModel]] = MessagePersistence.searchByConversationId(conversationId)
Future.sequence(List(membersFuture, messagesFuture)).map { result =>
  // some magic here
  self ! ConversationData(members, messages)
}
But when I sequence the two futures, the compiler loses the types: it says the type of result is List[Seq[Product with Serializable]]. At the beginning I expected to do something like
Future.sequence(List(membersFuture, messagesFuture)).map{ members, messages => ...
But it looks like sequencing futures doesn't work like this... I also tried using a collect inside the map, but I got similar errors.
Thanks for your help
When using Future.sequence, the assumption is that the underlying types produced by the multiple Futures are the same (or extend from the same parent type). With sequence, you basically invert a Seq of Futures for a particular type to a single Future for a Seq of that particular type. A concrete example is probably more illustrative of that point:
val f1: Future[Foo] = ...
val f2: Future[Foo] = ...
val f3: Future[Foo] = ...
val futures: List[Future[Foo]] = List(f1, f2, f3)
val aggregateFuture: Future[List[Foo]] = Future.sequence(futures)
So you can see that I went from a List of Future[Foo] to a single Future wrapping a List[Foo]. You use this when you already have a bunch of Futures for results of the same type (or base type) and you want to aggregate all of the results for the next processing step. The sequence method produces a new Future that won't be completed until all of the aggregated Futures are done, and it will then contain the aggregated results of all of those Futures. This works especially well when you have an indeterminate or variable number of Futures to process.
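In your case, though, the two element types differ, so the compiler infers their least upper bound. A sketch of what happens, with hypothetical stand-ins for your models:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

case class Member(name: String)   // stand-in for ConversationMemberModel
case class Message(text: String)  // stand-in for MessageModel

val fMembers: Future[Seq[Member]] = Future.successful(Seq(Member("a")))
val fMessages: Future[Seq[Message]] = Future.successful(Seq(Message("hi")))

// Case classes extend Product with Serializable, so the least upper bound of
// Seq[Member] and Seq[Message] is Seq[Product with Serializable], and the result
// is Future[List[Seq[Product with Serializable]]]: the concrete types are lost.
val mixed = Future.sequence(List(fMembers, fMessages))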
For your case, it seems that you have a fixed number of Futures to handle. As @Zoltan suggested, a simple for comprehension is probably a better fit here, because the number of Futures is known. So solving your problem like so:
for {
  members <- membersFuture
  messages <- messagesFuture
} {
  self ! ConversationData(members, messages)
}
is probably the best way to go for this specific example.
What are you trying to achieve with the sequence call? I'd just use a for-comprehension instead:
val membersFuture: Future[Seq[ConversationMemberModel]] = ConversationMemberPersistence.searchByConversationId(conversationId)
val messagesFuture: Future[Seq[MessageModel]] = MessagePersistence.searchByConversationId(conversationId)
for {
  members <- membersFuture
  messages <- messagesFuture
} yield (self ! ConversationData(members, messages))
Note that it is important that you declare the two futures outside the for-comprehension, because otherwise your messagesFuture wouldn't be submitted until the membersFuture is completed.
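For contrast, a sketch of the sequential version that note warns about:

// Runs sequentially: the messages query only starts after the members query
// completes, because the second Future is created inside the first one's flatMap.
for {
  members <- ConversationMemberPersistence.searchByConversationId(conversationId)
  messages <- MessagePersistence.searchByConversationId(conversationId)
} yield (self ! ConversationData(members, messages))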
You could also use zip:
membersFuture.zip(messagesFuture).map {
  case (members, messages) => self ! ConversationData(members, messages)
}
but I'd prefer the for-comprehension.

Why in an RDD, map gives NotSerializableException while foreach doesn't?

I understand the basic difference between map and foreach (lazy vs. eager), and I also understand why this code snippet
sc.makeRDD(Seq("a", "b")).map(s => new java.io.ByteArrayInputStream(s.getBytes)).collect
should give
java.io.NotSerializableException: java.io.ByteArrayInputStream
So I thought the following code snippet should as well:
sc.makeRDD(Seq("a", "b")).foreach(s => {
  val is = new java.io.ByteArrayInputStream(s.getBytes)
  println("is = " + is)
})
But this code runs fine. Why so?
Actually, the fundamental difference between map and foreach is not the evaluation strategy. Let's take a look at the signatures (I've omitted the implicit part of map for brevity):
def map[U](f: (T) ⇒ U): RDD[U]
def foreach(f: (T) ⇒ Unit): Unit
map takes a function from T to U, applies it to each element of the existing RDD[T], and returns an RDD[U]. To allow operations like shuffling, U has to be serializable.
foreach takes a function from T to Unit (which is analogous to Java's void) and by itself returns nothing. The results stay local on each executor; there is no network traffic involved, so there is no need for serialization. Unlike map, foreach should be used when you want some kind of side effect, like in your previous question.
On a side note, these two anonymous functions are actually different. The one you use in map is a function:
(s: String) => java.io.ByteArrayInputStream
and the one you use in foreach is:
(s: String) => Unit
If you use the second function with map, your code will compile, although the result would be far from what you want (an RDD[Unit]).
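A sketch of that pitfall, reusing the question's snippet:

// Compiles, but the inferred type is RDD[Unit]: the streams are discarded.
val units = sc.makeRDD(Seq("a", "b")).map { s =>
  val is = new java.io.ByteArrayInputStream(s.getBytes)
  println("is = " + is)
}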
The collect call after map is causing the issue.
Below are the results of my testing in spark-shell.
The call below passes, as no map output has to be shipped anywhere (count only returns a number to the driver):
sc.makeRDD(1 to 1000, 1).map(_ => {NullWritable.get}).count
The calls below fail, because the map output has to be serialized and sent back to the driver:
sc.makeRDD(1 to 1000, 1).map(_ => {NullWritable.get}).first
sc.makeRDD(1 to 1000, 1).map(_ => {NullWritable.get}).collect
Repartitioning forces the data to be shuffled between nodes, which fails:
sc.makeRDD(1 to 1000, 1).map(_ => {NullWritable.get}).repartition(2).saveAsTextFile("/tmp/NWRepart")
Without the repartition, the call below passes:
sc.makeRDD(1 to 1000, 1).map(_ => {NullWritable.get}).saveAsTextFile("/tmp/NW")
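Not from the answers above, but a common workaround is to keep only serializable values in the RDD and build the non-serializable object at the point of use. A sketch:

// Keep serializable Array[Byte] in the RDD; construct the stream on the driver
// after collect, so the stream itself is never shipped over the network.
val bytesRdd = sc.makeRDD(Seq("a", "b")).map(_.getBytes)
bytesRdd.collect().foreach { bytes =>
  val is = new java.io.ByteArrayInputStream(bytes)
  println("is = " + is)
}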

Spark's RDD.map() will not execute unless the item inside RDD is visited

I'm not quite sure how Scala and Spark work; maybe I wrote the code in the wrong way.
What I want to achieve is, for a given Seq[(String, Int)], to assign a random item from v._2.path to each _._2.
To do that, I implemented a method and call it on the next line:
def getVerticesWithFeatureSeq(graph: Graph[WikiVertex, WikiEdge.Value]): RDD[(VertexId, WikiVertex)] = {
  graph.vertices.map(v => {
    // For each token in the sequence, assign an article to it based on its path (root to this node)
    println(v._1 + " before " + v._2.featureSequence)
    v._2.featureSequence = v._2.featureSequence.map(f => (f._1, v._2.path.apply(new scala.util.Random().nextInt(v._2.path.size))))
    println(v._1 + " after " + v._2.featureSequence)
    (v._1, v._2)
  })
}
val dt = getVerticesWithFeatureSeq(wikiGraph)
When I execute it, I expected the println calls to print something, but they didn't.
If I add another line of code:
dt.foreach(println)
the println inside map prints correctly.
Is there some laziness in Spark's code execution? That is, if nothing accesses a result, is the computation deferred or even skipped?
Is graph.vertices an RDD? That would explain your issue, since Spark transformations are lazy and nothing executes until an action runs (foreach, in your case):
val dt = getVerticesWithFeatureSeq(wikiGraph) //no result is computed yet, map transformation is 'recorded'
dt.foreach(println) //foreach action requires a result, this triggers the computation
RDDs remember the transformations applied to them and only compute them when an action requires a result to be returned to the driver program.
You can check http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations for further details and a list of available transformations and actions.