Functional Programming way to calculate something like a rolling sum - scala

Let's say I have a list of numerics:
val list = List(4,12,3,6,9)
For every element in the list, I need to find the rolling sum, i,e. the final output should be:
List(4, 16, 19, 25, 34)
Is there any transformation that allows us to take as input two elements of the list (the current and the previous) and compute based on both?
Something like map(initial)((curr,prev) => curr+prev)
I want to achieve this without maintaining any shared global state.
EDIT: I would like to be able to do the same kinds of computation on RDDs.

You may use scanLeft
list.scanLeft(0)(_ + _).tail

The cumSum method below should work for any RDD[N], where N has an implicit Numeric[N] available, e.g. Int, Long, BigInt, Double, etc.
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
def cumSum[N : Numeric : ClassTag](rdd: RDD[N]): RDD[N] = {
val num = implicitly[Numeric[N]]
val nPartitions = rdd.partitions.length
val partitionCumSums = rdd.mapPartitionsWithIndex((index, iter) =>
if (index == nPartitions - 1) Iterator.empty
else Iterator.single(iter.foldLeft(num.zero)(num.plus))
).collect
.scanLeft(num.zero)(num.plus)
rdd.mapPartitionsWithIndex((index, iter) =>
if (iter.isEmpty) iter
else {
val start = num.plus(partitionCumSums(index), iter.next)
iter.scanLeft(start)(num.plus)
}
)
}
It should be fairly straightforward to generalize this method to any associative binary operator with a "zero" (i.e. any monoid.) It is the associativity that is key for the parallelization. Without this associativity you're generally going to be stuck with running through the entries of the RDD in a serial fashion.

I don't know what functitonalities are supported by spark RDD, so I am not sure if this satisfies your conditions, because I don't know if zipWithIndex is supported (if the answer is not helpful, please let me know by a comment and I will delete my answer):
list.zipWithIndex.map{x => list.take(x._2+1).sum}
This code works for me, it sums up the elements. It gets the index of the list element, and then adds the corresponding n first elements in the list (notice the +1, since the zipWithIndex starts with 0).
When printing it, I get the following:
List(4, 16, 19, 25, 34)

Related

Order Spark RDD based on ordering in another RDD

I have an RDD with strings like this (ordered in a specific way):
["A","B","C","D"]
And another RDD with lists like this:
["C","B","F","K"],
["B","A","Z","M"],
["X","T","D","C"]
I would like to order the elements in each list in the second RDD based on the order in which they appear in the first RDD. The order of the elements that do not appear in the first list is not of concern.
From the above example, I would like to get an RDD like this:
["B","C","F","K"],
["A","B","Z","M"],
["C","D","X","T"]
I know I am supposed to use a broadcast variable to broadcast the first RDD as I process each list in the second RDD. But I am very new to Spark/Scala (and functional programming in general) so I am not sure how to do this.
I am assuming that the first RDD is small since you talk about broadcasting it. In that case you are right, broadcasting the ordering is a good way to solve your problem.
// generating data
val ordering_rdd = sc.parallelize(Seq("A","B","C","D"))
val other_rdd = sc.parallelize(Seq(
Seq("C","B","F","K"),
Seq("B","A","Z","M"),
Seq("X","T","D","C")
))
// let's start by collecting the ordering onto the driver
val ordering = ordering_rdd.collect()
// Let's broadcast the list:
val ordering_br = sc.broadcast(ordering)
// Finally, let's use the ordering to sort your records:
val result = other_rdd
.map( _.sortBy(x => {
val index = ordering_br.value.indexOf(x)
if(index == -1) Int.MaxValue else index
}))
Note that indexOf returns -1 if the element is not found in the list. If we leave it as is, all non-found elements would end up at the beginning. I understand that you want them at the end so I relpace -1 by some big number.
Printing the result:
scala> result.collect().foreach(println)
List(B, C, F, K)
List(A, B, Z, M)
List(C, D, X, T)

How to sort a list in scala

I am a newbie in scala and I need to sort a very large list with 40000 integers.
The operation is performed many times. So performance is very important.
What is the best method for sorting?
You can sort the list with List.sortWith() by providing a relevant function literal. For example, the following code prints all elements of sorted list which contains all elements of the initial list in alphabetical order of the first character lowercased:
val initial = List("doodle", "Cons", "bible", "Army")
val sorted = initial.sortWith((s: String, t: String)
=> s.charAt(0).toLower < t.charAt(0).toLower)
println(sorted)
Much shorter version will be the following with Scala's type inference:
val initial = List("doodle", "Cons", "bible", "Army")
val sorted = initial.sortWith((s, t) => s.charAt(0).toLower < t.charAt(0).toLower)
println(sorted)
For integers there is List.sorted, just use this:
val list = List(4, 3, 2, 1)
val sortedList = list.sorted
println(sortedList)
just check the docs
List has several methods for sorting. myList.sorted works for types with already defined order (like Int or String and others). myList.sortWith and myList.sortBy receive a function that helps defining the order
Also, first link on google for scala List sort: http://alvinalexander.com/scala/how-sort-scala-sequences-seq-list-array-buffer-vector-ordering-ordered
you can use List(1 to 400000).sorted

How can I idiomatically "remove" a single element from a list in Scala and close the gap?

Lists are immutable in Scala, so I'm trying to figure out how I can "remove" - really, create a new collection - that element and then close the gap created in the list. This sounds to me like it would be a great place to use map, but I don't know how to get started in this instance.
Courses is a list of strings. I need this loop because I actually have several lists that I will need to remove the element at that index from (I'm using multiple lists to store data associated across lists, and I'm doing this by simply ensuring that the indices will always correspond across lists).
for (i <- 0 until courses.length){
if (input == courses(i) {
//I need a map call on each list here to remove that element
//this element is not guaranteed to be at the front or the end of the list
}
}
}
Let me add some detail to the problem. I have four lists that are associated with each other by index; one list stores the course names, one stores the time the class begins in a simple int format (ie 130), one stores either "am" or "pm", and one stores the days of the classes by int (so "MWF" evals to 1, "TR" evals to 2, etc). I don't know if having multiple this is the best or the "right" way to solve this problem, but these are all the tools I have (first-year comp sci student that hasn't programmed seriously since I was 16). I'm writing a function to remove the corresponding element from each lists, and all I know is that 1) the indices correspond and 2) the user inputs the course name. How can I remove the corresponding element from each list using filterNot? I don't think I know enough about each list to use higher order functions on them.
This is the use case of filter:
scala> List(1,2,3,4,5)
res0: List[Int] = List(1, 2, 3, 4, 5)
scala> res0.filter(_ != 2)
res1: List[Int] = List(1, 3, 4, 5)
You want to use map when you are transforming all the elements of a list.
To answer your question directly, I think you're looking for patch, for instance to remove element with index 2 ("c"):
List("a","b","c","d").patch(2, Nil, 1) // List(a, b, d)
where Nil is what we're replacing it with, and 1 is the number of characters to replace.
But, if you do this:
I have four lists that are associated with each other by index; one
list stores the course names, one stores the time the class begins in
a simple int format (ie 130), one stores either "am" or "pm", and one
stores the days of the classes by int
you're going to have a bad time. I suggest you use a case class:
case class Course(name: String, time: Int, ampm: String, day: Int)
and then store them in a Set[Course]. (Storing time and days as Ints isn't a great idea either - have a look at java.util.Calendar instead.)
First a few sidenotes:
List is not an index-based structure. All index-oriented operations on it take linear time. For index-oriented algorithms Vector is a much better candidate. In fact if your algorithm requires indexes it's a sure sign that you're really not exposing Scala's functional capabilities.
map serves for transforming a collection of items "A" to the same collection of items "B" using a passed in transformer function from a single "A" to single "B". It cannot change the number of resulting elements. Probably you've confused map with fold or reduce.
To answer on your updated question
Okay, here's a functional solution, which works effectively on lists:
val (resultCourses, resultTimeList, resultAmOrPmList, resultDateList)
= (courses, timeList, amOrPmList, dateList)
.zipped
.filterNot(_._1 == input)
.unzip4
But there's a catch. I actually came to be quite astonished to find out that functions used in this solution, which are so basic for functional languages, were not present in the standard Scala library. Scala has them for 2 and 3-ary tuples, but not the others.
To solve that you'll need to have the following implicit extensions imported.
implicit class Tuple4Zipped
[ A, B, C, D ]
( val t : (Iterable[A], Iterable[B], Iterable[C], Iterable[D]) )
extends AnyVal
{
def zipped
= t._1.toStream
.zip(t._2).zip(t._3).zip(t._4)
.map{ case (((a, b), c), d) => (a, b, c, d) }
}
implicit class IterableUnzip4
[ A, B, C, D ]
( val ts : Iterable[(A, B, C, D)] )
extends AnyVal
{
def unzip4
= ts.foldRight((List[A](), List[B](), List[C](), List[D]()))(
(a, z) => (a._1 +: z._1, a._2 +: z._2, a._3 +: z._3, a._4 +: z._4)
)
}
This implementation requires Scala 2.10 as it utilizes the new effective Value Classes feature for pimping the existing types.
I have actually included these in a small extensions library called SExt, after depending your project on which you'll be able to have them by simply adding an import sext._ statement.
Of course, if you want you can just compose these functions directly into the solution:
val (resultCourses, resultTimeList, resultAmOrPmList, resultDateList)
= courses.toStream
.zip(timeList).zip(amOrPmList).zip(dateList)
.map{ case (((a, b), c), d) => (a, b, c, d) }
.filterNot(_._1 == input)
.foldRight((List[A](), List[B](), List[C](), List[D]()))(
(a, z) => (a._1 +: z._1, a._2 +: z._2, a._3 +: z._3, a._4 +: z._4)
)
Removing and filtering List elements
In Scala you can filter the list to remove elements.
scala> val courses = List("Artificial Intelligence", "Programming Languages", "Compilers", "Networks", "Databases")
courses: List[java.lang.String] = List(Artificial Intelligence, Programming Languages, Compilers, Networks, Databases)
Let's remove a couple of classes:
courses.filterNot(p => p == "Compilers" || p == "Databases")
You can also use remove but it's deprecated in favor of filter or filterNot.
If you want to remove by an index you can associate each element in the list with an ordered index using zipWithIndex. So, courses.zipWithIndex becomes:
List[(java.lang.String, Int)] = List((Artificial Intelligence,0), (Programming Languages,1), (Compilers,2), (Networks,3), (Databases,4))
To remove the second element from this you can refer to index in the Tuple with courses.filterNot(_._2 == 1) which gives the list:
res8: List[(java.lang.String, Int)] = List((Artificial Intelligence,0), (Compilers,2), (Networks,3), (Databases,4))
Lastly, another tool is to use indexWhere to find the index of an arbitrary element.
courses.indexWhere(_ contains "Languages")
res9: Int = 1
Re your update
I'm writing a function to remove the corresponding element from each
lists, and all I know is that 1) the indices correspond and 2) the
user inputs the course name. How can I remove the corresponding
element from each list using filterNot?
Similar to Nikita's update you have to "merge" the elements of each list. So courses, meridiems, days, and times need to be put into a Tuple or class to hold the related elements. Then you can filter on an element of the Tuple or a field of the class.
Combining corresponding elements into a Tuple looks as follows with this sample data:
val courses = List(Artificial Intelligence, Programming Languages, Compilers, Networks, Databases)
val meridiems = List(am, pm, am, pm, am)
val times = List(100, 1200, 0100, 0900, 0800)
val days = List(MWF, TTH, MW, MWF, MTWTHF)
Combine them with zip:
courses zip days zip times zip meridiems
val zipped = List[(((java.lang.String, java.lang.String), java.lang.String), java.lang.String)] = List((((Artificial Intelligence,MWF),100),am), (((Programming Languages,TTH),1200),pm), (((Compilers,MW),0100),am), (((Networks,MWF),0900),pm), (((Databases,MTWTHF),0800),am))
This abomination flattens the nested Tuples to a Tuple. There are better ways.
zipped.map(x => (x._1._1._1, x._1._1._2, x._1._2, x._2)).toList
A nice list of tuples to work with.
List[(java.lang.String, java.lang.String, java.lang.String, java.lang.String)] = List((Artificial Intelligence,MWF,100,am), (Programming Languages,TTH,1200,pm), (Compilers,MW,0100,am), (Networks,MWF,0900,pm), (Databases,MTWTHF,0800,am))
Finally we can filter based on course name using filterNot. e.g. filterNot(_._1 == "Networks")
List[(java.lang.String, java.lang.String, java.lang.String, java.lang.String)] = List((Artificial Intelligence,MWF,100,am), (Programming Languages,TTH,1200,pm), (Compilers,MW,0100,am), (Databases,MTWTHF,0800,am))
The answer I am about to give might be overstepping what you have been taught so far in your course, so if that is the case I apologise.
Firstly, you are right to question whether you should have four lists - fundamentally, it sounds like what you need is an object which represents a course:
/**
* Represents a course.
* #param name the human-readable descriptor for the course
* #param time the time of day as an integer equivalent to
* 12 hour time, i.e. 1130
* #param meridiem the half of the day that the time corresponds
* to: either "am" or "pm"
* #param days an encoding of the days of the week the classes runs.
*/
case class Course(name : String, timeOfDay : Int, meridiem : String, days : Int)
with which you may define an individual course
val cs101 =
Course("CS101 - Introduction to Object-Functional Programming",
1000, "am", 1)
There are better ways to define this type (better representations of 12-hour time, a clearer way to represent the days of the week, etc), but I won't deviate from your original problem statement.
Given this, you would have a single list of courses:
val courses = List(cs101, cs402, bio101, phil101)
And if you wanted to find and remove all courses that matched a given name, you would write:
val courseToRemove = "PHIL101 - Philosophy of Beard Ownership"
courses.filterNot(course => course.name == courseToRemove)
Equivalently, using the underscore syntactic sugar in Scala for function literals:
courses.filterNot(_.name == courseToRemove)
If there was the risk that more than one course might have the same name (or that you are filtering based on some partial criteria using a regular expression or prefix match) and that you only want to remove the first occurrence, then you could define your own function to do that:
def removeFirst(courses : List[Course], courseToRemove : String) : List[Course] =
courses match {
case Nil => Nil
case head :: tail if head == courseToRemove => tail
case head :: tail => head :: removeFirst(tail)
}
Use the ListBuffer is a mutable List like a java list
var l = scala.collection.mutable.ListBuffer("a","b" ,"c")
print(l) //ListBuffer(a, b, c)
l.remove(0)
print(l) //ListBuffer(b, c)

Carry on information about previous computations

I'm new to functional programming, so some problems seems harder to solve using functional approach.
Let's say I have a list of numbers, like 1 to 10.000, and I want to get the items of the list which sums up to at most a number n (let's say 100). So, it would get the numbers until their sum is greater than 100.
In imperative programming, it's trivial to solve this problem, because I can keep a variable in each interaction, and stop once the objective is met.
But how can I do the same in functional programming? Since the sum function operates on completed lists, and I still don't have the completed list, how can I 'carry on' the computation?
If sum was lazily computed, I could write something like that:
(1 to 10000).sum.takeWhile(_ < 100)
P.S.:Even though any answer will be appreciated, I'd like one that doesn't compute the sum each time, since obviously the imperative version will be much more optimal regarding speed.
Edit:
I know that I can "convert" the imperative loop approach to a functional recursive function. I'm more interested in finding if one of the existing library functions can provide a way for me not to write one each time I need something.
Use Stream.
scala> val ss = Stream.from(1).take(10000)
ss: scala.collection.immutable.Stream[Int] = Stream(1, ?)
scala> ss.scanLeft(0)(_ + _)
res60: scala.collection.immutable.Stream[Int] = Stream(0, ?)
scala> res60.takeWhile(_ < 100).last
res61: Int = 91
EDIT:
Obtaining components is not very tricky either. This is how you can do it:
scala> ss.scanLeft((0, Vector.empty[Int])) { case ((sum, compo), cur) => (sum + cur, compo :+ cur) }
res62: scala.collection.immutable.Stream[(Int, scala.collection.immutable.Vector[Int])] = Stream((0,Vector()), ?)
scala> res62.takeWhile(_._1 < 100).last
res63: (Int, scala.collection.immutable.Vector[Int]) = (91,Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13))
The second part of the tuple is your desired result.
As should be obvious, in this case, building a vector is wasteful. Instead we can only store the last number from the stream that contributed to sum.
scala> ss.scanLeft(0)(_ + _).zipWithIndex
res64: scala.collection.immutable.Stream[(Int, Int)] = Stream((0,0), ?)
scala> res64.takeWhile(_._1 < 100).last._2
res65: Int = 13
The way I would do this is with recursion. On each call, add the next number. Your base case is when the sum is greater than 100, at which point you return all the way up the stack. You'll need a helper function to do the actual recursion, but that's no big deal.
This isn't hard using "functional" methods either.
Using recursion, rather than maintaining your state in a local variable that you mutate, you keep it in parameters and return values.
So, to return the longest initial part of a list whose sum is at most N:
If the list is empty, you're done; return the empty list.
If the head of the list is greater than N, you're done; return the empty list.
Otherwise, let H be the head of the list.
All we need now is the initial part of the tail of the list whose sum is at most N - H, then we can "cons" H onto that list, and we're done.
We can compute this recursively using the same procedure as we have used this far, so it's an easy step.
A simple pseudocode solution:
sum_to (n, ls) = if isEmpty ls or n < (head ls)
then Nil
else (head ls) :: sum_to (n - head ls, tail ls)
sum_to(100, some_list)
All sequence operations which require only one pass through the sequence can be implemented using folds our reduce like it is sometimes called.
I find myself using folds very often since I became used to functional programming
so here odd one possible approach
Use an empty collection as initial value and fold according to this strategy
Given the processed collection and the new value check if their sum is low enough and if then spend the value to the collection else do nothing
that solution is not very efficient but I want to emphasize the following
map fold filter zip etc are the way to get accustomed to functional programming try to use them as much as possible instead of loping constructs or recursive functions your code will be more declarative and functional

Scala: what is the most appropriate data structure for sorted subsets?

Given a large collection (let's call it 'a') of elements of type T (say, a Vector or List) and an evaluation function 'f' (say, (T) => Double) I would like to derive from 'a' a result collection 'b' that contains the N elements of 'a' that result in the highest value under f. The collection 'a' may contain duplicates. It is not sorted.
Maybe leaving the question of parallelizability (map/reduce etc.) aside for a moment, what would be the appropriate Scala data structure for compiling the result collection 'b'? Thanks for any pointers / ideas.
Notes:
(1) I guess my use case can be most concisely expressed as
val a = Vector( 9,2,6,1,7,5,2,6,9 ) // just an example
val f : (Int)=>Double = (n)=>n // evaluation function
val b = a.sortBy( f ).take( N ) // sort, then clip
except that I do not want to sort the entire set.
(2) one option might be an iteration over 'a' that fills a TreeSet with 'manual' size bounding (reject anything worse than the worst item in the set, don't let the set grow beyond N). However, I would like to retain duplicates present in the original set in the result set, and so this may not work.
(3) if a sorted multi-set is the right data structure, is there a Scala implementation of this? Or a binary-sorted Vector or Array, if the result set is reasonably small?
You can use a priority queue:
def firstK[A](xs: Seq[A], k: Int)(implicit ord: Ordering[A]) = {
val q = new scala.collection.mutable.PriorityQueue[A]()(ord.reverse)
val (before, after) = xs.splitAt(k)
q ++= before
after.foreach(x => q += ord.max(x, q.dequeue))
q.dequeueAll
}
We fill the queue with the first k elements and then compare each additional element to the head of the queue, swapping as necessary. This works as expected and retains duplicates:
scala> firstK(Vector(9, 2, 6, 1, 7, 5, 2, 6, 9), 4)
res14: scala.collection.mutable.Buffer[Int] = ArrayBuffer(6, 7, 9, 9)
And it doesn't sort the complete list. I've got an Ordering in this implementation, but adapting it to use an evaluation function would be pretty trivial.