How can we avoid mapPartitions-related issues? - scala

val counts = parsed.mapPartitions(iter => {
  iter.flatMap(point => {
    println("points" + point)
    point.indices.map(i => (i, point(i)))
  })
}).countByValue()
val count = parsed.mapPartitions(iter => {
  iter.flatMap(point => {
    println("pointsssss" + point.deep)
    point.indices.map(i => (i, point(i)))
  })
}).countByValue()
When I execute count.foreach(println), I also get output from counts. How can I avoid this problem?

The reason you see both sets of print output is that countByValue is an action, not a transformation, so it triggers evaluation of its RDD; both counts and count end in countByValue, so both pipelines run. From the docs:
def countByValue(): Map[T, Long]
Return the count of each unique value in this RDD as a map of (value, count) pairs. The final combine step happens locally on the master, equivalent to running a single reduce task.
Your next line, count.foreach(println), therefore runs outside of Spark, over a plain Scala collection, on the master node.
If this is not the behavior you want, check your logic; I suspect you actually want countByKey() instead (also an action).
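If the goal is simply to avoid triggering the first pipeline at all, keep only the computation you actually need and let a single action drive it. A minimal sketch (assuming parsed is an RDD of indexable collections such as Array[Double], and dropping the println side effects):
val count = parsed
  .mapPartitions(_.flatMap(point => point.indices.map(i => (i, point(i)))))
  .countByValue()        // the only action; returns a local Map to the driver

count.foreach(println)   // plain Scala iteration on the driver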

Related

Given distinct events, with some mapping to the same ID, how can I find the most recent event for each ID and sum the results using Algebird?

I have been struggling for days to find a solution and so I'm hoping someone with more Algebird experience can help!
I have a stream of events I'm aggregating using Algebird, where each event represents an attempt to perform some task. Consider the following data structure to represent each attempt:
case class TaskAttempt(
  taskId: String,
  time: Int,
  `type`: String,   // `type` is a reserved word in Scala, hence the backticks
  value: Long,
  valueUnit: String
)
I am aggregating these attempts from a stream, and there is no guarantee that an attempt to perform a task will succeed. In the case that an attempt fails, I expect additional attempts for the same task. The aggregation I'm trying to build does the following:
Collect only the most recent attempt (based on the TaskAttempt.time field) for each task ID. Assume larger values for TaskAttempt.time mean the event happened more recently. All previous events for each task will be ignored.
Sum the TaskAttempt.value field from the TaskAttempt instances collected in step 1 into a Map(type -> Map(valueUnit -> valueSum)). This means that in the end, all values from each most recent task attempt will be summed if their type and valueUnit fields are equal.
I was hoping to accomplish the above using something like the following, but .flatMap() cannot be called on an Algebird Preparer after calling .reduce() because the latter returns a MonoidAggregator rather than a Preparer. Regardless, here is some non-working code to further show what I'd like to accomplish:
Preparer[TaskAttempt]
  // Aggregate attempts into Sets
  .flatMap { attempt =>
    for {
      a <- attempt
    } yield Set(a)
  }
  // Reduce by grouping TaskAttempts by taskId and then collecting the
  // attempt with the largest value for time for each taskId
  .reduce { (l1: Set[TaskAttempt], l2: Set[TaskAttempt]) =>
    (l1 ++ l2)
      .groupBy(_.taskId)
      .map { case (_, attempts) => attempts.maxBy(_.time) }
      .toSet
  }
  // Map the remaining filtered attempts to the required Map
  .flatMap { attempt =>
    for {
      value <- attempt.value
    } yield Map(
      attempt.`type` -> Map(attempt.valueUnit -> value)
    )
  }
  .sum
Ultimately, I must provide the framework I'm using for the stream aggregation (an internal tool built on top of Twitter's Summingbird) with a MonoidAggregator[TaskAttempt, Map[String, Map[String, Long]], Map[String, Map[String, Long]]] that aggregates the data as described above. How can I accomplish this? Any other ideas for how I could make this work?
I decided that rather than attempting to dedupe, I should avoid the need to dedupe altogether. I did this by adding additional "negative" task attempts to the topic which negate failed "positive" task attempts that come before them in the stream. By doing this, I can sum all of the events in the stream without worry of double counting due to the presence of multiple attempts for a single task.
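A minimal sketch of that compensating-event idea (attempt1 and attempt2 are hypothetical TaskAttempt instances; the nested-map addition comes from Algebird's standard Map and Long monoids):
import com.twitter.algebird._

type Agg = Map[String, Map[String, Long]]

// A failed attempt is later followed by a "negative" event carrying the negated
// value, so a plain monoid sum over the whole stream cancels it out.
def toAgg(a: TaskAttempt, negate: Boolean = false): Agg = {
  val v = if (negate) -a.value else a.value
  Map(a.`type` -> Map(a.valueUnit -> v))
}

val events: Seq[Agg] = Seq(
  toAgg(attempt1),                 // original attempt that later turned out to fail
  toAgg(attempt1, negate = true),  // compensating negative event
  toAgg(attempt2)                  // successful retry
)
val total: Agg = Monoid.sum(events)  // no double counting: the failed attempt cancels out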

Spark Scala: mapPartitions in this use case

I have been reading a lot about the differences between map and mapPartitions. Still I have some doubts.
The thing is, after reading I decided to replace the map calls with mapPartitions in my code because apparently mapPartitions is faster than map.
My question is whether my decision is right in scenarios like the following (comments show the previous code):
val reducedRdd = rdd.mapPartitions(partition => partition.map(r => (r.id, r)))
//val reducedRdd = rdd.map(r => (r.id, r))
.reduceByKey((r1, r2) => r1.combineElem(r2))
// .map(e => e._2)
.mapPartitions(partition => partition.map(e => e._2))
Am I thinking about this right? Thanks!
In your case, mapPartitions should not make any difference.
mapPartitions vs map
mapPartitions is useful when we have some common computation that we want to do once per partition. Example:
rdd.mapPartitions { partition =>
  val complicatedRowConverter = <SOME-COSTLY-COMPUTATION>
  partition.map { row =>
    (row.id, complicatedRowConverter(row))
  }
}
In the above example, we create a complicatedRowConverter function which is derived from some costly computation. This function is the same for the entire RDD partition, so we don't need to recreate it again and again. The other way to do the same thing would be:
rdd.map { row =>
  val complicatedRowConverter = <SOME-COSTLY-COMPUTATION>
  (row.id, complicatedRowConverter(row))
}
This will be slow because we unnecessarily run val complicatedRowConverter = <SOME-COSTLY-COMPUTATION> for every row.
In your case, you don't have any per-partition precomputation or anything else. Inside mapPartitions you simply iterate over each row and map it to (row.id, row).
So mapPartitions won't give any benefit here and you can use a simple map.
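In other words, the plain-map version from your commented-out code is just as efficient and reads more directly:
val reducedRdd = rdd
  .map(r => (r.id, r))
  .reduceByKey((r1, r2) => r1.combineElem(r2))
  .map(_._2)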
tl;dr mapPartitions will be faster in this case.
Why
Consider a function:
def someFunc(row: Row): Row = {
  // do some processing on row
  // return new row
}
Say we are processing 1 million records.
map
We will end up calling someFunc 1 million times.
That is basically 1 million virtual function calls plus other kernel data structures created for the processing.
mapPartitions
We would write this as:
rdd.mapPartitions { partIter =>
  partIter.map { row =>
    // do some processing on row
    // return new row
  }
}
No virtual function calls or context switches here.
Hence mapPartitions will be faster.
Also, as mentioned in #moriarity007's answer, we need to factor in the object creation overhead involved with the operation when deciding which operator to use.
Additionally, I would recommend using DataFrame transformations and actions for the processing, where the computation can be even faster, since Spark Catalyst optimizes your code and also takes advantage of code generation.
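A hedged sketch of the same keyed reduction in the DataFrame API, assuming the rows are case class instances with an id field and that combineElem can be expressed as a built-in aggregate (a sum over a hypothetical value column stands in for it here):
import org.apache.spark.sql.functions._
import spark.implicits._               // `spark` is the active SparkSession

val reducedDf = rdd.toDF()             // RDD of case classes -> DataFrame
  .groupBy($"id")
  .agg(sum($"value").as("value"))      // Catalyst-optimized, codegen-friendly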

Streamline results in mapPartitions (Spark)

Is there a way to return partial results in mapPartitions() ?
Currently I use it like this:
myRDD.mapPartitions { iter: Iterator[InputType] =>
  val additionalData = <some costly init operation>
  val results = ArrayBuffer[OutputType]()
  for (input <- iter) results += transform(input, additionalData)
  results.iterator
}
But of course if a partition is too big, the results buffer will cause an OOM exception.
So my question: is there a way to emit partial results every once in a while so as to avoid any OOM?
I want to stick to mapPartitions because I initialize a costly object (e.g. get the value of a big broadcast variable) before processing the input, and I don't want to do that for every record as I would with map.
If additionalData doesn't access the iterator, you can just map:
myRDD.mapPartitions { iter: Iterator[InputType] =>
  val additionalData = ???
  iter.map(input => transform(input, additionalData))
}
Since iter.map is lazy, each output element is produced only as Spark consumes it, so the partition's results are never materialized in memory all at once.
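If the transformation is naturally batch-oriented, a lazily batched variant keeps memory bounded as well (batchTransform and the batch size of 1000 are hypothetical here):
myRDD.mapPartitions { iter: Iterator[InputType] =>
  val additionalData = ???                  // costly init, once per partition
  iter.grouped(1000).flatMap { batch =>     // grouped is lazy on Iterators
    batchTransform(batch, additionalData)   // hypothetical: returns a Seq[OutputType]
  }
}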

Scala Spark not returning value outside loop [duplicate]

I am new to Scala and Spark and would like some help in understanding why the below code isn't producing my desired outcome.
I am comparing two tables
My desired output schema is:
case class DiscrepancyData(fieldKey:String, fieldName:String, val1:String, val2:String, valExpected:String)
When I run the below code step by step manually, I actually end up with my desired outcome, which is a List[DiscrepancyData] completely populated with my desired output. However, I must be missing something in the code below, because it returns an empty list (before this code gets called there is other code involved in reading tables from Hive, mapping, grouping, filtering, etc.):
val compareCols = Set("year", "nominal", "adjusted_for_inflation", "average_private_nonsupervisory_wage")
val key = "year"
def compare(table: RDD[(String, Iterable[Row])]): List[DiscrepancyData] = {
  var discs: ListBuffer[DiscrepancyData] = ListBuffer()

  def compareFields(fieldOne: String, fieldTwo: String, colName: String, row1: Row, row2: Row): DiscrepancyData = {
    if (fieldOne != fieldTwo) {
      DiscrepancyData(
        row1.getAs(key).toString,     //fieldKey
        colName,                      //fieldName
        row1.getAs(colName).toString, //table1Value
        row2.getAs(colName).toString, //table2Value
        row2.getAs(colName).toString) //expectedValue
    }
    else null
  }

  def comparison() {
    for (row <- table) {
      var elem1 = row._2.head      // gets the first element in the iterable
      var elem2 = row._2.tail.head // gets the second element in the iterable
      for (col <- compareCols) {
        var value1 = elem1.getAs(col).toString
        var value2 = elem2.getAs(col).toString
        var disc = compareFields(value1, value2, col, elem1, elem2)
        if (disc != null) discs += disc
      }
    }
  }

  comparison()
  discs.toList
}
I'm calling the above function as such:
var outcome = compare(groupedFiltered)
Here is the data in groupedFiltered:
(1991,CompactBuffer([1991,7.14,5.72,39%], [1991,4.14,5.72,39%]))
(1997,CompactBuffer([1997,4.88,5.86,39%], [1997,3.88,5.86,39%]))
(1999,CompactBuffer([1999,5.15,5.96,39%], [1999,5.15,5.97,38%]))
(1947,CompactBuffer([1947,0.9,2.94,35%], [1947,0.4,2.94,35%]))
(1980,CompactBuffer([1980,3.1,6.88,45%], [1980,3.1,6.88,48%]))
(1981,CompactBuffer([1981,3.15,6.8,45%], [1981,3.35,6.8,45%]))
The table schema for groupedFiltered:
(year String,
nominal Double,
adjusted_for_inflation Double,
average_private_nonsupervisory_wage String)
Spark is a distributed computing engine. Besides "what the code is doing", as in classic single-node computing, with Spark we also need to consider "where the code is running".
Let's inspect a simplified version of the expression above:
val records: RDD[List[String]] = ??? // whatever data
val list = mutable.ListBuffer[String]()
for {
  record <- records
  entry <- record
} { list += entry }
The Scala for-comprehension makes this expression look like a natural local computation, but in reality the RDD operation is serialized and "shipped" to the executors, where the inner operation is executed locally. We can rewrite the above like this:
records.foreach { record =>   // RDD.foreach => serializes the closure and executes it remotely
  record.foreach { entry =>   // record.foreach => local operation on the record collection
    list += entry             // this mutable list is updated in each executor but never sent back to the driver; all updates are lost
  }
}
Mutable objects are in general a no-go in distributed computing. Imagine that one executor adds a record and another one removes it: what's the correct result? Or that each executor arrives at a different value: which one is right?
To implement the operation above, we need to transform the data into our desired result.
I'd start by applying another best practice: do not use null as a return value. I also moved the row operations into the function. Let's rewrite the comparison operation with this in mind:
def compareFields(colName: String, row1: Row, row2: Row): Option[DiscrepancyData] = {
  val key = "year"
  val v1 = row1.getAs(colName).toString
  val v2 = row2.getAs(colName).toString
  if (v1 != v2) {
    Some(DiscrepancyData(
      row1.getAs(key).toString, //fieldKey
      colName,                  //fieldName
      v1,                       //table1Value
      v2,                       //table2Value
      v2))                      //expectedValue
  } else None
}
Now, we can rewrite the computation of discrepancies as a transformation of the initial table data:
val discrepancies = table.flatMap { case (_, rows) =>
  val (row1, row2) = (rows.head, rows.tail.head)
  compareCols.flatMap(col => compareFields(col, row1, row2))
}
We can also use the for-comprehension notation, now that we understand where things are running:
val discrepancies = for {
  (_, rows) <- table
  col <- compareCols
  dis <- compareFields(col, rows.head, rows.tail.head)
} yield dis
Note that discrepancies is of type RDD[DiscrepancyData]. If we want to get the actual values to the driver we need to:
val materializedDiscrepancies = discrepancies.collect()
Iterating through an RDD and updating a mutable structure defined outside the loop is a Spark anti-pattern.
Imagine this RDD being spread over 200 machines. How can these machines update the same Buffer? They cannot. Each JVM sees its own discs: ListBuffer[DiscrepancyData]. At the end, your result will not be what you expect.
To conclude, this is perfectly valid (though not idiomatic) Scala code, but it is not valid Spark code. If you replace the RDD with an Array it will work as expected.
Try to have a more functional implementation along these lines:
val finalRDD: RDD[DiscrepancyData] = table.map(???).filter(???)
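One hedged way to make that sketch concrete, reusing the Option-returning compareFields from the other answer:
val finalRDD: RDD[DiscrepancyData] = table.flatMap { case (_, rows) =>
  val (row1, row2) = (rows.head, rows.tail.head)   // the two rows being compared
  compareCols.flatMap(col => compareFields(col, row1, row2))
}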

How to get a result from Enumerator/Iteratee?

I am using play2 and reactivemongo to fetch a result from mongodb. Each item of the result needs to be transformed to add some metadata. Afterwards I need to apply some sorting to it.
To deal with the transformation step I use enumerate():
def ideasEnumerator = collection.find(query)
  .options(QueryOpts(skipN = page))
  .sort(Json.obj(sortField -> -1))
  .cursor[Idea]
  .enumerate()
Then I create an Iteratee as follows:
val processIdeas: Iteratee[Idea, Unit] =
  Iteratee.foreach[Idea] { idea =>
    resolveCrossLinks(idea) flatMap { idea =>
      addMetaInfo(idea.copy(history = None))
    }
  }
Finally I feed the Iteratee:
ideasEnumerator(processIdeas)
And now I'm stuck. Every example I saw does some println inside foreach, but seems not to care about a final result.
So when all documents are returned and transformed how do I get a Sequence, a List or some other datatype I can further deal with?
Change the signature of your Iteratee from Iteratee[Idea, Unit] to Iteratee[Idea, Seq[A]], where A is your element type. Basically, the first type parameter of Iteratee is the input type and the second is the output type. In your case you made the output type Unit.
Take a look at the code below. It may not compile, but it gives you the basic usage.
ideasEnumerator.run(
  Iteratee.fold(List.empty[MyObject]) { (accumulator, next) =>
    accumulator + resolveCrossLinks(next) flatMap { next =>
      addMetaInfo(next.copy(history = None))
    }
  }
) // returns Future[List[MyObject]]
As you can see, an Iteratee is simply a state machine. Just extract the Iteratee part and assign it to a val:
val iteratee = Iteratee.fold(List.empty[MyObject]) { (accumulator, next) =>
  accumulator + resolveCrossLinks(next) flatMap { next =>
    addMetaInfo(next.copy(history = None))
  }
}
and feel free to use it wherever you need to convert from your Ideas to a List[MyObject].
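For example (a sketch, keeping the answer's hypothetical MyObject type), running the extracted iteratee against the enumerator yields the final fold result as a Future:
val result: Future[List[MyObject]] = ideasEnumerator.run(iteratee)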
With the help of your answers I ended up with:
val processIdeas: Iteratee[Idea, Future[Vector[Idea]]] =
  Iteratee.fold(Future(Vector.empty[Idea])) { (accumulator: Future[Vector[Idea]], next: Idea) =>
    resolveCrossLinks(next) flatMap { next =>
      addMetaInfo(next.copy(history = None))
    } flatMap (ideaWithMeta => accumulator map (acc => acc :+ ideaWithMeta))
  }
val ideas = collection.find(query)
  .options(QueryOpts(page, perPage))
  .sort(Json.obj(sortField -> -1))
  .cursor[Idea]
  .enumerate(perPage)
  .run(processIdeas)
This later needs an ideas.flatMap(identity) to flatten the returned Future of a Future, but I'm fine with that, and everything looks idiomatic and elegant, I think.
The performance gained compared to building a list and iterating over it afterwards is negligible, though.