Calculate average and remove from list in Scala - scala

I am on my first week with Scala and struggling with the way the code is written in this language.
I am trying to write a function that determines the average number in a list and removes the values below that average. For example:
I have this list:
List[(String, Int)] = List(("A", 1), ("B", 1), ("C", 3), ("D", 2), ("E", 4))
The result should be 2.2.
So the function should also remove the entries ("A", 1), ("B", 1) and ("D", 2) because they are below the average.
Can anyone help me?

You can calculate the average of the second elements of the list of tuples, you don't need to do the sum yourself because Scala has a builtin function for that. First we need to transform the list of tuples to a list of Int values, we can do that using the map function as shown below
val average = list.map(_._2).sum/list.size.toDouble
Now you have the average, you can filter your list based on its value
val newList = list.filter(_._2 < average)
Note that we didn't remove anything from the list, we created a new one with filtered elements

val average = list.map(_._2).sum / list.size.toDouble
list.filter(p => p._2 >= average)
You need to cast to Double, else average would be cast to an Int and be imprecise. The filter only keeps the element greater than the average.

You can do:
val sum = list.map(_._2).sum
val avg: Double = sum / list.size.toDouble
val filtered = list.filter(_._2 > avg)
Note this is traversing the list twice, once for summing and once for filtering. Another thing to note is that Scala List[T] is immutable. When you filter, you're creating a new List object with the filtered data.

Related

How to deconstruct sort method for length of strings (descending) in Scala?

I came across this piece of code that sorts a collection of strings in descending order by length:
words.sortBy(x => -x.length)
Can someone help me to understand what the purpose of - in front of x is and deconstruct this code piece by piece? Is it standing for 'reverse' operation? I know it's integer operation but I'm having a hard time figuring out how the algorithm works in the background.
Also can this be deemed a bubble sort?
If you have:
val collection: SomeCollection[A]
val keyToSortBy A => B
when you do:
collection.sortBy(keyToSortBy)
what happens is that Scala will look for Ordering[B] in its implicit scope (read about implicits if you aren't familiar with them yet), and it will use methods provided by this interrface to compare elemetns by sorting algorithm.
sortBy will use Ordering[X] to sort things in ascending order (think Comparator if you know Java). For Ordering[Int] it's just increasing order of integers, for Ordering[String] you have a lexical order of Strings.
What - does here is taking the value before passing it to the algorithm sorting by Int and negating it. It would be easier if you see some example:
List("a", "bb", "ccc").sortBy(word => word.length)
// imagine that what it does is:
// - building a collection of pairs ("a", 1), ("bb", 2), ("ccc", 3)
// ( (value from collection, what function returned for that value) )
// - sorting by the second element of pair
// using normal Int comparison to get ascending result
// - take only the first element of each pair: ("a", 1), ("bb", 2), ("ccc", 3)
List("a", "bb", "ccc") // result
If we put - there, what Ordering would get to compate would be different:
List("a", "bb", "ccc").sortBy(word => -word.length)
// - building a collection of pairs ("a", -1), ("bb", -2), ("ccc", -3)
// - sorting by the second element of pair - notice that all are negative now!!!
// using normal Int comparison to get ascending result
// - take only the first element of each pair: ("ccc", -3), ("bb", -2), ("a", -1)
List("ccc", "bb", "a") // result

how to find all possible combinations between tuples without duplicates scala

suppose I have list of tuples:
val a = ListBuffer((1, 5), (6, 7))
Update: Elements in a are assumed to be distinct inside each of the tuples2, in other words, it can be for example (1,4) (1,5) but not (1,1) (2,2).
I want to generate results of all combinations of ListBuffer a between these two tuples but without duplication. The result will look like:
ListBuffer[(1,5,6), (1,5,7), (6,7,1), (6,7,5)]
Update: elements in result tuple3 are also distinct. tuples them selves are also distinct, means as long as (6,7,1) is present, then (1,7,6) should not be in the result tuple3.
If, for example val a = ListBuffer((1, 4), (1, 5)) then the result output should be ListBuffer[(1,4,5)] in which (1,4,1) and (1,5,1) are discarded
How can I do that in Scala?
Note: I just gave an example. Usually the val a has tens of scala.Tuple2
If the individual elements are unique, as you've commented, then you should be able to flatten everything (un-tuple), get the desired combinations(), and re-tuple.
updated
val a = collection.mutable.ListBuffer((1, 4), (1, 5))
a.flatMap(t => Seq(t._1, t._2)) //un-tuple
.distinct //no duplicates
.combinations(3) //unique sets of 3
.map{case Seq(x,y,z) => (x,y,z)} //re-tuple
.toList //if you don't want an iterator

Perform a nested for loop with RDD.map() in Scala

I'm rather new to Spark and Scala and have a Java background. I have done some programming in haskell, so not completely new to functional programming.
I'm trying to accomplish some form of a nested for-loop. I have a RDD which I want to manipulate based on every two elements in the RDD. The pseudo code (java-like) would look like this:
// some RDD named rdd is available before this
List list = new ArrayList();
for(int i = 0; i < rdd.length; i++){
list.add(rdd.get(i)._1);
for(int j = 0; j < rdd.length; j++){
if(rdd.get(i)._1 == rdd.get(j)._1){
list.add(rdd.get(j)._1);
}
}
}
// Then now let ._1 of the rdd be this list
My scala solution (that does not work) looks like this:
val aggregatedTransactions = joinedTransactions.map( f => {
var list = List[Any](f._2._1)
val filtered = joinedTransactions.filter(t => f._1 == t._1)
for(i <- filtered){
list ::= i._2._1
}
(f._1, list, f._2._2)
})
I'm trying to achieve to put item _2._1 into a list if ._1 of both items is equal.
I am aware that i cannot do any filter or map function within another map function. I've read that you could achieve something like this with a join, but I don't see how I could actually get these items into a list or any structure that can be used as list.
How do you achieve an effect like this with RDDs?
Assuming your input has the form RDD[(A, (A, B))] for some types A, B, and that the expected result should have the form RDD[A] - not a List (because we want to keep data distributed) - this would seem to do what you need:
rdd.join(rdd.values).keys
Details:
It's hard to understand the exact input and expected output, as the data structure (type) of neither is explicitly stated, and the requirement is not well explained by the code example. So I'll make some assumptions and hope that it will help with your specific case.
For the full example, I'll assume:
Input RDD has type RDD[(Int, (Int, Int))]
Expected output has the form RDD[Int], and would contain a lot of duplicates - if the original RDD has the "key" X multiple times, each match (in ._2._1) would appear once per occurrence of X as a key
If that's the case we're trying to solve - this join would solve it:
// Some sample data, assuming all ints
val rdd = sc.parallelize(Seq(
(1, (1, 5)),
(1, (2, 5)),
(2, (1, 5)),
(3, (4, 5))
))
// joining the original RDD with an RDD of the "values" -
// so the joined RDD will have "._2._1" as key
// then we get the keys only, because they equal the values anyway
val result: RDD[Int] = rdd.join(rdd.values).keys
// result is a key-value RDD with the original keys as keys, and a list of matching _2._1
println(result.collect.toList) // List(1, 1, 1, 1, 2)

Split sorted Scala Sequence/Array according to gaps between elements [duplicate]

What is the most elegant way of grouping a list of values into groups based on their neighbor values?
The wider context I have is having a list of lines, that need to be grouped into paragraphs. I want to be able to say that if the vertical difference between two lines is lower than threshold, they are in the same paragraph.
I ended up solving this problem differently, but I'm wondering about the correct solution here.
case class Box(y: Int)
val list = List(Box(y=1), Box(y=2), Box(y=5))
def group(list: List[Box], threshold: Int): List[List[Box]] = ???
val grouped = group(list, 2)
> List(List(Box(y=1), Box(y=2)), List(Box(y=5)))
I have looked at groupBy(), but that can only work with one element at a time. I have also tried an approach that involved pre-computing differences using sliding(), but then it becomes awkward to retrieve the elements from the original collection.
It's a one liner. Generalising types left as an exercise for the reader.
Using ints and absolute difference rather than lines and spacing to avoid clutter.
val zs = List(1,2,4,8,9,10,15,16)
def closeEnough(a:Int, b:Int) = (Math.abs(b -a) <= 2)
zs.drop(1).foldLeft(List(List(zs.head)))
((acc, e)=> if (closeEnough(e, acc.head.head))
(e::acc.head)::acc.tail
else
List(e)::acc)
.map(_.reverse)
.reverse
// List(List(1, 2, 4), List(8, 9, 10), List(15, 16))
Or a two liner for a slight efficiency gain
val ys = zs.reverse
ys.drop(1).foldLeft(List(List(ys.head)))
((acc, e)=> if (closeEnough(e, acc.head.head))
(e::acc.head)::acc.tail
else
List(e)::acc)
// List(List(1, 2, 4), List(8, 9, 10), List(15, 16))

Seq with maximal elements

I have a Seq and function Int => Int. What I need to achieve is to take from original Seq only thoose elements that would be equal to the maximum of the resulting sequence (the one, I'll have after applying given function):
def mapper:Int=>Int= x=>x*x
val s= Seq( -2,-2,2,2 )
val themax= s.map(mapper).max
s.filter( mapper(_)==themax)
But this seems wasteful, since it has to map twice (once for the filter, other for the maximum).
Is there a better way to do this? (without using a cycle, hopefully)
EDIT
The code has since been edited; in the original this was the filter line: s.filter( mapper(_)==s.map(mapper).max). As om-nom-nom has pointed out, this evaluates `s.map(mapper).max each (filter) iteration, leading to quadratic complexity.
Here is a solution that does the mapping only once and using the `foldLeft' function:
The principle is to go through the seq and for each mapped element if it is greater than all mapped before then begin a new sequence with it, otherwise if it is equal return the list of all maximums and the new mapped max. Finally if it is less then return the previously computed Seq of maximums.
def getMaxElems1(s:Seq[Int])(mapper:Int=>Int):Seq[Int] = s.foldLeft(Seq[(Int,Int)]())((res, elem) => {
val e2 = mapper(elem)
if(res.isEmpty || e2>res.head._2)
Seq((elem,e2))
else if (e2==res.head._2)
res++Seq((elem,e2))
else res
}).map(_._1) // keep only original elements
// test with your list
scala> getMaxElems1(s)(mapper)
res14: Seq[Int] = List(-2, -2, 2, 2)
//test with a list containing also non maximal elements
scala> getMaxElems1(Seq(-1, 2,0, -2, 1,-2))(mapper)
res15: Seq[Int] = List(2, -2, -2)
Remark: About complexity
The algorithm I present above has a complexity of O(N) for a list with N elements. However:
the operation of mapping all elements is of complexity O(N)
the operation of computing the max is of complexity O(N)
the operation of zipping is of complexity O(N)
the operation of filtering the list according to the max is also of complexity O(N)
the operation of mapping all elements is of complexity O(M), with M the number of final elements
So, finally the algorithm you presented in your question has the same complexity (quality) than my answer's one, moreover the solution you present is more clear than mine. So, even if the 'foldLeft' is more powerful, for this operation I would recommend your idea, but with zipping original list and computing the map only once (especially if your map is more complicated than a simple square). Here is the solution computed with the help of *scala_newbie* in question/chat/comments.
def getMaxElems2(s:Seq[Int])(mapper:Int=>Int):Seq[Int] = {
val mappedS = s.map(mapper) //map done only once
val m = mappedS.max // find the max
s.zip(mappedS).filter(_._2==themax).unzip._1
}
// test with your list
scala> getMaxElems2(s)(mapper)
res16: Seq[Int] = List(-2, -2, 2, 2)
//test with a list containing also non maximal elements
scala> getMaxElems2(Seq(-1, 2,0, -2, 1,-2))(mapper)
res17: Seq[Int] = List(2, -2, -2)