Spark: calling a function inside of mapPartitionsWithIndex - scala

I'm getting very strange results with the following code.
I only want to take each partition's data and iterate over it X times.
Here is how I call my function for each partition:
val myRDDResult = myRDD.mapPartitionsWithIndex( myFunction(_, _, limit), preservesPartitioning = true)
And the function is:
private def myFunction (partitionIndex: Long,
partitionData: Iterator[Array[(LabeledPoint,Int,Int)]], limit: Int): Iterator[String] = {
var newData = ArrayBuffer[String]()
if (partitionData.nonEmpty){
val partDataMap = partitionData.next.map{ case (lp, _, neighId) => (lp, neighId) }.toMap
var newString:String = ""
for {
(k1,_) <- partDataMap
i <- 0 to limit
_ = {
// ... some code to generate the content for `newString`
newData.+=(newString)
}
} yield ()
}
newData.iterator
}
Here are some values obtained:
partitionData   limit   newData   newData_expected
1640            250     411138    410000 (1640*250)
16256           27      288820    438912
I don't know if I'm misunderstanding some concept in my code.
I've also tried replacing the for-comprehension with this idea: partDataMap.map{elem=> for (i <- 0 to limit){....}}
Any suggestions?

First, sorry because I downvoted/upvoted (click error) your question, and since I didn't cancel it within 10 minutes, SO kept it upvoted.
Regarding your code, I think your expected results are wrong, because I took the same code as you, simplified it a little, and instead of receiving 410000 elements I got 411640. Maybe I copied something incorrectly or ignored some detail, but the code giving 411640 looks like:
val limit = 250
val partitionData: Iterator[Array[Int]] = Seq((1 to 1640).toArray).toIterator
var newData = ArrayBuffer[String]()
if (partitionData.nonEmpty){
val partDataMap = partitionData.next.map{ nr => nr.toString }
for {
value <- partDataMap
i <- 0 to limit
_ = {
newData.+=(s"${value}_${i}")
}
} yield ()
}
println(s"new buffer=${newData}")
println(s"Buffer size = ${newData.size}")
Now, to answer your question about why the mapPartitionsWithIndex results differ from your expectations: IMO it's because of your conversion from the Array to a Map. If your array contains duplicated keys, each key is counted only once, which could explain why in both cases you receive fewer results than expected (taking 411640 as the correct expected number, since 0 to limit is inclusive, so each key contributes limit + 1 = 251 strings and 1640 * 251 = 411640). To be sure of that, you can compare partDataMap.size with partitionData.next.size.
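For example, a quick sketch of that deduplication (with made-up keys, not your LabeledPoint data):
val partData = Array((1, "a"), (2, "b"), (1, "c"))
partData.size        // 3
partData.toMap       // Map(1 -> c, 2 -> b): the duplicated key 1 survives only once
partData.toMap.size  // 2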

Related

Cleaner way to find all indices of same value in Scala

I have a text file like so:
NameOne,2,3,3
NameTwo,1,0,2
I want to find the indices of the max value in each line in Scala. So the output of this would be
NameOne,1,2
NameTwo,2
I'm currently using the function below to do this, but I can't seem to find a simple way to do it without a for loop, and I'm wondering if there is a better method out there.
def findIndices(movieRatings: String): (String) = {
val tokens = movieRatings.split(",", -1)
val movie = tokens(0)
val ratings = tokens.slice(1, tokens.size)
val max = ratings.max
var indices = ArrayBuffer[Int]()
for (i<-0 until ratings.length) {
if (ratings(i) == max) {
indices += (i+1)
}
}
return movie + "," + indices.mkString(",")
}
This function is called as so:
val output = textFile.map(findIndices).saveAsTextFile(args(1))
Just starting to learn Scala so any advice would help!
You can zipWithIndex and use filter:
ratings.zipWithIndex
  .filter { case (value, _) => value == max }
  .map { case (_, index) => index }
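For example, on the ratings from your first line (a quick sketch, keeping 0-based indices):
val ratings = Array("2", "3", "3")
val max = ratings.max
val indices = ratings.zipWithIndex
  .filter { case (value, _) => value == max }
  .map { case (_, index) => index }
// indices: Array(1, 2)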
I noticed that your code doesn't actually produce the expected result from your example input. I'm going to assume that the example given is the correct result.
def findIndices(movieRatings :String) :String = {
val Array(movie, ratings @ _*) = movieRatings.split(",", -1)
val mx = ratings.maxOption //Scala 2.13.x
ratings.indices
.filter(x => mx.contains(ratings(x)))
.mkString(s"$movie,",",","")
}
Note that this doesn't address some of the shortcomings of your algorithm:
No comma allowed in movie name.
Only works for ratings from 0 to 9. No spaces allowed.
testing:
List("AA"
,"BB,"
,"CC,5"
,"DD,2,5"
,"EE,2,5, 9,11,5"
,"a,b,2,7").map(findIndices)
//res0: List[String] = List(AA, <-no ratings
// , BB,0 <-comma, no ratings
// , CC,0 <-one rating
// , DD,1 <-two ratings
// , EE,1,4 <-" 9" and "11" under valued
// , a,0 <-comma in name error
// )

How to append a string to a list or array in a for loop in scala?

var RetainKeyList: mutable.Seq[String] = new scala.collection.mutable.ListBuffer[String]()
for(element<-ts_rdd)
{
var elem1 = element._1
var kpssSignificance: Double = 0.05
var dOpt: Option[Int] = (0 to 2).find
{
diff =>
var testTs = differencesOfOrderD(element._2, diff)
var (stat, criticalValues) = kpsstest(testTs, "c")
stat < criticalValues(kpssSignificance)
}
var d = dOpt match
{
case Some(v) => v
case None => 300000
}
if(d.equals(300000))
{
println("Bad Key: " + elem1)
RetainKeyList += elem1
}
}
Hi all,
I created an empty mutable ListBuffer, var RetainKeyList: mutable.Seq[String] = new scala.collection.mutable.ListBuffer[String](), and I am trying to add the string elem1 to it in a for loop.
When I try to compile the code it hangs with no error message, but if I remove the line RetainKeyList += elem1 I am able to print all of the elem1 strings properly.
What am I doing wrong here? Is there a cleaner way to collect all of the elem1 strings generated in the for loop?
Long story short: your code runs in a distributed environment, so the local collection is never modified. Someone asks this question every week; please, if you do not understand the implications of distributed computing, do not use a distributed framework like Spark.
You are also abusing mutability everywhere, and mutability and a distributed environment do not play nicely together.
Anyway, here is a better way to solve your problem.
val retainKeysRdd = ts_rdd.map {
case (elem1, elem2) =>
val kpssSignificance = 0.05d
val dOpt = (0 to 2).find { diff =>
val testTs = differencesOfOrderD(elem2, diff)
val (stat, criticalValues) = kpsstest(testTs, "c")
stat < criticalValues(kpssSignificance)
}
(elem1 -> dOpt)
} collect {
case (key, None) => key
}
This returns an RDD with the retained keys. If you are really sure you need this as a local collection and that it won't blow up your memory, you can do this:
val retainKeysList = retainKeysRdd.collect().toList

Is there any way to replace nested For loop with Higher order methods in scala

I have a MutableList and want to sum pairs of its rows element-wise and replace rows with other values based on some criteria. The code below works fine for me, but I want to ask whether there is any way to get rid of the nested for loops, since the for loops slow down performance. I want to use Scala higher-order methods instead of the nested for loop. I tried the foldLeft() higher-order method to replace a single for loop, but I could not make it replace the nested one.
def func(nVect : Int , nDim : Int) : Unit = {
var Vector = MutableList.fill(nVect,nDimn)(math.random)
var V1Res =0.0
var V2Res =0.0
var V3Res =0.0
for(i<- 0 to nVect -1) {
for (j <- i +1 to nVect -1) {
var resultant = Vector(i).zip(Vector(j)).map{case (x,y) => x + y}
V1Res = choice(Vector(i))
V2Res = choice(Vector(j))
V3Res = choice(resultant)
if(V3Res > V1Res){
Vector(i) = res
}
if(V3Res > V2Res){
Vector(j) = res
}
}
}
}
There are no "for loops" in this code; the for statements are already converted to foreach calls by the compiler, so it is already using higher-order methods. These foreach calls could be written out explicitly, but it would make no difference to the performance.
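For illustration, this is roughly what the compiler turns the two nested for statements into (a sketch of the desugaring, not code you need to write):
(0 to nVect - 1).foreach { i =>
  (i + 1 to nVect - 1).foreach { j =>
    // ... body of the inner loop ...
  }
}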
Making the code compile and then cleaning it up gives this:
def func(nVect: Int, nDim: Int): Unit = {
val vector = Array.fill(nVect, nDim)(math.random)
for {
i <- 0 until nVect
j <- i + 1 until nVect
} {
val res = vector(i).zip(vector(j)).map { case (x, y) => x + y }
val v1Res = choice(vector(i))
val v2Res = choice(vector(j))
val v3Res = choice(res)
if (v3Res > v1Res) {
vector(i) = res
}
if (v3Res > v2Res) {
vector(j) = res
}
}
}
Note that using a single for does not make any difference to the result, it just looks better!
At this point it gets difficult to make further improvements. The only parallelism possible is with the inner map call, but vectorising this is almost certainly a better option. If choice is expensive then the results could be cached, but this cache needs to be updated when vector is updated.
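Here is a minimal sketch of that caching idea, assuming choice is a pure function from a row (Array[Double]) to Double; this is an illustration, not a drop-in replacement:
// one cached choice value per row, recomputed only when that row is replaced
val choiceCache = Array.tabulate(nVect)(i => choice(vector(i)))
for {
  i <- 0 until nVect
  j <- i + 1 until nVect
} {
  val res = vector(i).zip(vector(j)).map { case (x, y) => x + y }
  val v3Res = choice(res)
  if (v3Res > choiceCache(i)) { vector(i) = res; choiceCache(i) = v3Res }
  if (v3Res > choiceCache(j)) { vector(j) = res; choiceCache(j) = v3Res }
}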
If the choice could be done in a second pass after all the cross-sums have been calculated then it would be much more parallelisable, but clearly that would also change the results.

Scala Futures with DB

I'm writing code in Scala/Play with Anorm/Postgres for match generation based on user profiles. The following code works, but I've commented out the section that is causing problems, the while loop. I noticed while running it that the first 3 Futures seem to work synchronously, but the problem comes when I'm retrieving the count of rows in the table in the fourth step.
The fourth step returns the count before the above inserts actually happened. As far as I can tell, steps 1-3 are being queued up for Postgres synchronously, but the call to retrieve the count seems to return BEFORE the first 3 steps complete, which makes no sense to me. If the first 3 steps get queued up in the correct order, why wouldn't the fourth step wait to return the count until after the inserts happen?
When I uncomment the while loop, the match generation and insert functions are called until memory runs out, as the count returned is continually below the desired threshold.
I know the format itself is subpar, but my question is not about how to write the most elegant scala code, but merely how to get it to work for now.
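To illustrate the behaviour I seem to be hitting, here is a stripped-down sketch (not my actual DB code) where the statement after onComplete runs before the future finishes:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val insert = Future { Thread.sleep(1000); "insert done" }   // stands in for my inserts
insert.onComplete(result => println(result))                // only registers a callback
println("the count query runs here, before the insert completes")
My actual code: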
def matchGeneration(email:String,itNum:Int) = {
var currentIterationNumber = itNum
var numberOfMatches = MatchData.numberOfCurrentMatches(email)
while(numberOfMatches < 150){
Thread.sleep(25000)//delay while loop execution time
generateUsers(email) onComplete {
case(s) => {
print(s">>>>>>>>>>>>>>>>>>>>>>>>>>>STEP 1")
Thread.sleep(5000)//Time for initial user generation to take place
genDemoMatches(email, currentIterationNumber) onComplete {
case (s) => {
print(s">>>>>>>>>>>>>>>>>>>>>>>>>>>STEP 2")
genIntMatches(email,currentIterationNumber) onComplete {
case(s) => {
print(s">>>>>>>>>>>>>>>>>>>>>>>>>>>STEP 3")
genSchoolWorkMatches(email,currentIterationNumber) onComplete {
case(s) => {
Thread.sleep(10000)
print(s">>>>>>>>>>>>>>>>>>>>>>>>>>>STEP 4")
incrementNumberOfMatches(email) onComplete {
case(s) => {
currentIterationNumber+=1
println(s"current number of matches: $numberOfMatches")
println(s"current Iteration: $currentIterationNumber")
}
}
}
}
}
}
}
}
}
}
//}
}
The match functions are defined as Futures, such as:
def genSchoolWorkMatches(email:String,currentIterationNumber:Int):Future[Unit]=
Future(genUsersFromSchoolWorkData(email, currentIterationNumber))
genUsersFromSchoolWorkData(email:String) follows the same form as the other two. It is a function that first gets all the school/work fields that a user has filled out in their profile (SELECT major FROM school_work WHERE email='$email') and then generates a dummyUser that has one of those fields in common with the user identified by email. It would take about 30-40 lines of code to paste the function here, so I can explain it further if need be.
I have edited my code; the only way I have found so far to get this to work is by hacking it with Thread.sleep(). I think the problem may lie with Anorm, as my Future logic did work as I expected, but there is an inconsistency between when the writes occur and what the read returns. The numberOfCurrentMatches(email:String) function returns the number of matches; it is a simple SELECT count(email) FROM table WHERE email='$email'. The problem is that sometimes after inserting 23 matches the count comes back as 0, and then after a second iteration it returns 46. I assumed that onComplete() would bind to the underlying Anorm call defined with DB.withConnection(), but apparently it may be too far removed to accomplish this. I am not really sure at this point what to research further to get around this problem, other than writing a separate sort of supervisor function that returns at a value closer to 150.
UPDATE
Thanks to the advice of users here, and after trying to understand Scala's documentation at this link: Scala Futures and Promises
I have updated my code to be a bit more readable and scala-esque:
def genMatchOfTypes(email:String,iterationNumber:Int) = {
genDemoMatches(email,iterationNumber)
genIntMatches(email,iterationNumber)
genSchoolWorkMatches(email,iterationNumber)
}
def matchGeneration(email:String) = {
var currentIterationNumber = 0
var numberOfMatches = MatchData.numberOfCurrentMatches(email)
while (numberOfMatches < 150) {
println(s"current number of matches: $numberOfMatches")
Thread.sleep(30000)
generateUsers(email)
.flatMap(users => genMatchOfTypes(email,currentIterationNumber))
.flatMap(matches => incrementNumberOfMatches(email))
.map{
result =>
currentIterationNumber += 1
println(s"current Iteration2: $currentIterationNumber")
numberOfMatches = MatchData.numberOfCurrentMatches(email)
println(s"current number of matches2: $numberOfMatches")
}
}
}
I am still heavily dependent on the Thread.sleep(30000) to provide enough time to run through the while loop before it tries to loop back again. It's still an unwieldy hack. When I uncomment the Thread.sleep(),
my output in bash looks like this:
users for match generation createdcurrent number of matches: 0
[error] c.MatchDataController - here is the list: jnkj
[error] c.MatchDataController - here is the list: hbhjbjjnkjn
current number of matches: 0
current number of matches: 0
current number of matches: 0
current number of matches: 0
current number of matches: 0
This of course is a truncated output. It runs like this over and over until I get errors about too many open files and the JVM/play server crashes entirely.
One solution is to use Future.traverse for a known iteration count.
Assuming the following definitions:
object MatchData {
def numberOfCurrentMatches(email: String) = ???
}
def generateUsers(email: String): Future[Unit] = ???
def incrementNumberOfMatches(email: String): Future[Int] = ???
def genDemoMatches(email: String, it: Int): Future[Unit] = ???
def genIntMatches(email: String, it: Int): Future[Unit] = ???
def genSchoolWorkMatches(email: String, it: Int): Future[Unit] = ???
You can write code like
def matchGeneration(email: String, itNum: Int) = {
val numberOfMatches = MatchData.numberOfCurrentMatches(email)
Future.traverse(Stream.range(itNum, 150 - numberOfMatches + itNum)) { currentIterationNumber => for {
_ <- generateUsers(email)
_ = print(s">>>>>>>>>>>>>>>>>>>>>>>>>>>STEP 1")
_ <- genDemoMatches(email, currentIterationNumber)
_ = print(s">>>>>>>>>>>>>>>>>>>>>>>>>>>STEP 2")
_ <- genIntMatches(email, currentIterationNumber)
_ = print(s">>>>>>>>>>>>>>>>>>>>>>>>>>>STEP 3")
_ <- genSchoolWorkMatches(email, currentIterationNumber)
_ = Thread.sleep(15000)
_ = print(s">>>>>>>>>>>>>>>>>>>>>>>>>>>STEP 4")
numberOfMatches <- incrementNumberOfMatches(email)
_ = println(s"current number of matches: $numberOfMatches")
_ = println(s"current Iteration: $currentIterationNumber")
} yield ()
}
}
Update
If you need to check some condition on each iteration, one way is to use the monadic tools from the scalaz library. It has a Monad instance for scala.Future, so we can read "monadic" as "asynchronous" whenever we want to.
For example, StreamT.unfoldM can create a conditional monadic (asynchronous) loop; even if we don't need the elements of the resulting collection, we can still use it just for the iteration.
Let's first define your combined generation step:
def generateAll(email: String, iterationNumber: Int): Future[Unit] = for {
_ <- generateUsers(email)
_ <- genDemoMatches(email, iterationNumber)
_ <- genIntMatches(email, iterationNumber)
_ <- genSchoolWorkMatches(email, iterationNumber)
} yield ()
Then the iteration step:
def generateStep(email: String, limit: Int)(iterationNumber: Int): Future[Option[(Unit, Int)]] =
if (MatchData.numberOfCurrentMatches(email) >= limit) Future(None)
else for {
_ <- generateAll(email, iterationNumber)
_ <- incrementNumberOfMatches(email)
next = iterationNumber + 1
} yield Some((), next)
Now our resulting function simplifies to
import scalaz._
import scalaz.std.scalaFuture._
def matchGeneration(email: String, itNum: Int): Future[Unit] =
StreamT.unfoldM(0)(generateStep(email, 150) _).toStream.map(_.force: Unit)
It looks like the synchronous method MatchData.numberOfCurrentMatches is reacting to your asynchronous modifications inside incrementNumberOfMatches. Note that in general this could lead to disastrous results, and you probably need to move that state inside some actor or something like that.

Spark job slow on cluster than standalone

I have this piece of code which works fine in standalone mode but runs slowly on a cluster of 4 slaves (8 cores, 30 GB memory) at AWS.
For a file of 10000 entries:
Standalone: 257s
AWS, 4 slaves: 369s
def tabHash(nb:Int, dim:Int) = {
var tabHash0 = Array(Array(0.0)).tail
for( ind <- 0 to nb-1) {
var vechash1 = Array(0.0).tail
for( ind <- 0 to dim-1) {
val nG = Random.nextGaussian
vechash1 = vechash1 :+ nG
}
tabHash0 = tabHash0 :+ vechash1
}
tabHash0
}
def hashmin3(x:Vector, w:Double, b:Double, tabHash1:Array[Array[Double]]) = {
var tabHash0 = Array(0.0).tail
val x1 = x.toArray
for( ind <- 0 to tabHash1.size-1) {
var sum = 0.0
for( ind2 <- 0 to x1.size-1) {
sum = sum + (x1(ind2)*tabHash1(ind)(ind2))
}
tabHash0 = tabHash0 :+ (sum+b)/w
}
tabHash0
}
def pow2(tab1:Array[Double], tab2:Array[Double]) = {
var sum = 0.0
for( ind <- 0 to tab1.size-1) {
sum = sum - Math.pow(tab1(ind)-tab2(ind),2)
}
sum
}
val w = ww
val b = Random.nextDouble * w
val tabHash2 = tabHash(nbseg,dim)
var rdd_0 = parsedData.map(x => (x.get_id,(x.get_vector,hashmin3(x.get_vector,w,b,tabHash2)))).cache
var rdd_Yet = rdd_0
for( ind <- 1 to maxIterForYstar ) {
var rdd_dist = rdd_Yet.cartesian(rdd_0).flatMap{ case (x,y) => Some((x._1,(y._2._1,pow2(x._2._2,y._2._2))))}//.coalesce(64)
var rdd_knn = rdd_dist.topByKey(k)(Ordering[(Double)].on(x=>x._2))
var rdd_bary = rdd_knn.map(x=> (x._1,Vectors.dense(bary(x._2,k))))
rdd_Yet = rdd_bary.map(x=>(x._1,(x._2,hashmin3(x._2,w,b,tabHash2))))
}
I tried to broadcast some variables
val w = sc.broadcast(ww.toDouble)
val b = sc.broadcast(Random.nextDouble * ww)
val tabHash2 = sc.broadcast(tabHash(nbseg,dim))
Without any effect.
I know it's not the bary function, because I tried another version of this code without hashmin3 which works fine with 4 slaves and worse with 8 slaves, but that is a topic for another question.
Bad code, especially for distributed and large computations. I can't quickly tell what the root of the problem is, but you should rewrite this code anyway.
Array is terrible for universal and sharable data. It is mutable and requires contiguous memory allocation (the latter may be a problem even if you have enough memory). Better to use Vector (or sometimes List). Never use arrays, really.
var vechash1 = Array(0.0).tail You create a collection with one element, then call a function to get an empty collection. If it's rare, don't worry about the performance, but it's ugly! Use var vechash1: Array[Double] = Array(), var vechash1: Vector[Double] = Vector(), or var vechash1 = Vector.empty[Double].
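As a side note on the appends in your loops (a small sketch, not your code): every :+ on an Array copies the whole array, while Vector appends share structure, so building a collection by appending in a loop is much cheaper with Vector:
var arr = Array.empty[Double]
(1 to 5).foreach(i => arr = arr :+ i.toDouble)   // copies arr on every append

var vec = Vector.empty[Double]
(1 to 5).foreach(i => vec = vec :+ i.toDouble)   // effectively constant-time append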
def tabHash(nb:Int, dim:Int) = Always set the return type of a function when it's unclear. The power of Scala is its rich type system, and it's very helpful to have compile-time checks (of what you actually get as a result, not what you imagine you get!). This matters a lot when dealing with huge data, because compile-time checks save you time and money. It also makes the code easier to read later: def tabHash(nb:Int, dim:Int): Vector[Vector[Double]] =
def hashmin3(x: Vector, a typo? It doesn't compile without a type parameter.
The first function, more compactly:
def tabHash(nb:Int, dim:Int): Vector[Vector[Double]] = {
(0 to nb-1).map {_ =>
(0 to dim - 1).map(_ => Random.nextGaussian()).toVector
}.toVector
}
The second function is ((x*M) + scalar_b)/scalar_w. It may be more efficient to use a library that is specifically optimized for working with matrices.
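For example, with a linear algebra library such as Breeze (my suggestion, not something the original code uses), hashmin3 collapses to one matrix-vector product plus scalar operations:
import breeze.linalg.{DenseMatrix, DenseVector}

// tabHash1 as an nbseg x dim matrix, x as a vector of length dim
def hashmin3(x: DenseVector[Double], w: Double, b: Double,
             tabHash1: DenseMatrix[Double]): DenseVector[Double] =
  (tabHash1 * x).map(s => (s + b) / w)   // one BLAS-backed product instead of nested loops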
The third (I suspect a mistake here with the sign of the computation, if this is meant to be a squared error):
def pow2(tab1:Vector[Double], tab2:Vector[Double]): Double =
  -tab1.zip(tab2).map{ case (t1, t2) => Math.pow(t1 - t2, 2) }.sum  // keeps the original's negated sum
var rdd_Yet = rdd_0 The cached RDD is reassigned inside the loop, so the cached storage is useless.
The last loop is hard to analyse. I think it must be simplified.