How to remove duplicates from a list then sort by most frequent - scala

I have a list with assorted keywords that may repeat. I need to generate a list of distinct keywords, sorted by how frequently they appeared in the original list.
What would be the idiomatic Scala for that? Here is a working but ugly implementation:
val keys = List("c","a","b","b","a","a")
keys.groupBy(p => p).toList.sortWith( (a,b) => a._2.size > b._2.size ).map(_._1)
// List("a","b","c")

Shorter version:
keys.distinct.sortBy(keys count _.==).reverse
That is not particularly efficient, however, since it runs count over the whole list once per distinct key. The groupBy version ought to perform better, though it can be improved:
keys.groupBy(identity).toSeq.sortBy(-_._2.size).map(_._1)
One can also get rid of the reverse in the first version by declaring an Ordering:
val ord = Ordering by (keys count (_: String).==)
keys.distinct.sorted(ord.reverse)
Note that reverse in this version just produces a new Ordering that works in the opposite manner of the original. This version also suggests a way to get better performance:
val freq = collection.mutable.Map.empty[String, Int] withDefaultValue 0
keys foreach (k => freq(k) += 1)
val ord = Ordering by freq
keys.distinct.sorted(ord.reverse)

Nothing wrong with that implementation that comments can't fix!
Seriously, break it down a bit and describe what & why you're taking each step.
Not as "concise" perhaps, but the purpose of concise code in Scala is to make code more readable. When concise code is not clear, it's time to back up, break it up (introduce well-named local variables), and comment.
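As a rough illustration of that advice, the question's one-liner could be split into named, commented steps along these lines. The names KeywordRanking, byFrequency, occurrences and mostFrequentFirst are made up for this sketch, and keywords with equal counts come out in no particular order.

```scala
object KeywordRanking {
  // Hypothetical helper: distinct keywords, most frequent first.
  def byFrequency(keys: List[String]): List[String] = {
    // Group equal keywords together: keyword -> all of its occurrences.
    val occurrences: Map[String, List[String]] = keys.groupBy(identity)

    // Order the groups by descending size (ties are in no particular order).
    val mostFrequentFirst = occurrences.toList.sortBy { case (_, hits) => -hits.size }

    // Keep only the keyword of each group.
    mostFrequentFirst.map { case (keyword, _) => keyword }
  }
}
```

The behaviour is the same as the original expression; only the intermediate results have names now.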

Here's my take, don't know if it's less "ugly":
scala> keys.groupBy(p => p).values.toList.sortBy(_.size).reverse.map(_.head)
res39: List[String] = List(a, b, c)

fold version:
val keys = List("c","a","b","b","a","a")
val keysCounts =
(Map.empty[String, Int] /: keys) { case (counts, k) =>
counts updated (k, (counts getOrElse (k, 0)) + 1)
}
keysCounts.toList sortBy { case (_, count) => -count } map { case (w, _) => w }

Perhaps,
val mapCount = keys.map(x => (x,keys.count(_ == x))).distinct
// mapCount : List[(java.lang.String, Int)] = List((c,1), (a,3), (b,2))
val sortedList = mapCount.sortWith(_._2 > _._2).map(_._1)
// sortedList : List[java.lang.String] = List(a, b, c)

How about:
keys.distinct.sorted
Newbie didn't read the question carefully. Let me try again:
keys.foldLeft (Map[String,Int]()) { (counts, elem) => counts + (elem -> (counts.getOrElse(elem, 0) - 1))}
.toList.sortBy(_._2).map(_._1)
Could use a mutable Map if you prefer. Negative frequency counts are stored in the map. If that bothers you, you can make them positive and negate the sortBy argument.
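For what it's worth, here is a sketch of that variant: a mutable Map holding positive counts, with the negation moved into the sortBy argument. PositiveCounts and sortByFreq are names invented for this example.

```scala
import scala.collection.mutable

object PositiveCounts {
  // Hypothetical helper: count with a mutable Map, then sort by descending count.
  def sortByFreq(keys: List[String]): List[String] = {
    // Missing keys read as 0, so the first increment stores 1.
    val counts = mutable.Map.empty[String, Int].withDefaultValue(0)
    keys.foreach(k => counts(k) += 1)

    // Counts are positive here, so negate them to get most-frequent-first.
    counts.toList.sortBy { case (_, n) => -n }.map { case (k, _) => k }
  }
}
```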

Just a little change from @Daniel's 4th version, which may perform better:
scala> def sortByFreq[T](xs: List[T]): List[T] = {
| val freq = collection.mutable.Map.empty[T, Int] withDefaultValue 0
| xs foreach (k => freq(k) -= 1)
| xs.distinct sortBy freq
| }
sortByFreq: [T](xs: List[T])List[T]
scala> sortByFreq(keys)
res2: List[String] = List(a, b, c)

My preferred versions would be:
Most canonical / expressive?
keys.groupBy(identity).toList.map{ case (k,v) => (-v.size,k) }.sorted.map(_._2)
Shortest and probably most efficient?
keys.groupBy(identity).toList.sortBy(-_._2.size).map(_._1)
Straightforward
keys.groupBy(identity).values.toList.sortBy(-_.size).map(_.head)

Related

Getting the mode from an RDD

I would like to get the mode (the most common number) from an rdd using Spark + Scala.
I can get it by doing the following, but I think there could be a better way to calculate this. Most importantly, if more than one value has the same number of repetitions, I need to return all of them.
Let's see my example code:
val l = List(3,4,4,3,3,7,7,7,9)
val rdd = spark.sparkContext.parallelize(l)
val grouped = rdd.map (e => (e, 1)).groupBy(_._1).map(e=> (e._1, e._2.size))
val maxRep = grouped.collect().maxBy(_._2)._2
val mode = grouped.filter(e => e._2 == maxRep).map(e => e._1).collect
And the result is right:
Array[Int] = Array(3, 7)
but is there a better way to do this? I mean considering the performance because the original RDD would be much bigger than this.
This should work and be a little bit more efficient.
(only if you are sure the total number of elements is small)
val counted = rdd.countByValue()
val max = counted.valuesIterator.max
val maxElements = counted.collect { case (k, v) if (v == max) => k }
If there could be many elements, consider this alternative which is memory safe.
val counted = rdd.map(x => (x, 1L)).reduceByKey(_ + _).cache()
val max = counted.values.max
val maxElements = counted.map { case (k, v) => (v, k) }.lookup(max)
How about getting the max key-value pair from a double groupBy? This works even better for bigger data sizes.
rdd.groupBy(identity).mapValues(_.size).groupBy(_._2).max
// res1: (Int, Iterable[(Int, Int)]) = (3,CompactBuffer((3,3), (7,3)))
To get the element
rdd.groupBy(identity).mapValues(_.size).groupBy(_._2).max._2.map(_._1)
// res4: Iterable[Int] = List(3, 7)
The first groupBy, together with mapValues, turns the data into (element, count) pairs of type (Int, Int). The second groupBy groups those pairs by count, giving (count, Iterable((element, count))). Finally, max returns the pair with the largest key, which is the highest count.
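To see the shape of the double groupBy without a Spark cluster, here is the same idea on a plain Scala List. This is only an analogue for illustration, not Spark code: ModeSketch and modes are invented names, and the intermediate results are ordinary Maps rather than RDDs.

```scala
object ModeSketch {
  // Hypothetical helper: every value that reaches the highest count.
  def modes(xs: List[Int]): List[Int] =
    if (xs.isEmpty) Nil
    else {
      // First groupBy: element -> number of occurrences.
      val counts: Map[Int, Int] = xs.groupBy(identity).map { case (x, occ) => (x, occ.size) }

      // Second groupBy: count -> all (element, count) entries with that count.
      val byCount = counts.groupBy { case (_, n) => n }

      // The entry with the largest key holds the most frequent elements.
      val (_, winners) = byCount.maxBy { case (n, _) => n }
      winners.keys.toList.sorted
    }
}
```

With the list from the question, ModeSketch.modes(List(3,4,4,3,3,7,7,7,9)) gives List(3, 7).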

Scala - access collection members within map or flatMap

Suppose that I use a sequence of various maps and/or flatMaps to generate a sequence of collections. Is it possible to access information about the "current" collection from within any of those methods? For example, without knowing anything specific about the functions used in the previous maps or flatMaps, and without using any intermediate declarations, how can I get the maximum value (or length, or first element, etc.) of the collection upon which the last map acts?
List(1, 2, 3)
.flatMap(x => f(x) /* some unknown function */)
.map(x => x + ??? /* what is the max element of the collection? */)
Edit for clarification:
In the example, I'm not looking for the max (or whatever) of the initial List. I'm looking for the max of the collection after the flatMap has been applied.
By "without using any intermediate declarations" I mean that I do not want to use any temporary collections en route to the final result. So, the example by Steve Waldman below, while giving the desired result, is not what I am seeking. (I include this condition mostly for aesthetic reasons.)
Edit for clarification, part 2:
The ideal solution would be some magic keyword or syntactic sugar that lets me reference the current collection:
List(1, 2, 3)
.flatMap(x => f(x))
.map(x => x + theCurrentList.max)
I'm prepared to accept the fact, however, that this simply is not possible.
Maybe just define the list as a val, so you can name it? I don't know of any facility built into map(...) or flatMap(...) that would help.
val myList = List(1, 2, 3)
myList
.flatMap(x => f(x) /* some unknown function */)
.map(x => x + myList.max /* what is the max element of the List? */)
Update: By this approach at least, if you have multiple transformations and want to see the transformed version, you'd have to name that. You could get away with
val myList = List(1, 2, 3).flatMap(x => f(x) /* some unknown function */)
myList.map(x => x + myList.max /* what is the max element of the List? */)
Or, if there will be multiple transformations, get in the habit of naming the stages.
val rawList = List(1, 2, 3)
val smordified = rawList.flatMap(x => f(x) /* some unknown function */)
val maxified = smordified.map(x => x + smordified.max /* what is the max element of the List? */)
maxified
Update 2: Watch it work in the REPL even with heterogeneous types:
scala> def f( x : Int ) : Vector[Double] = Vector(x * math.random, x * math.random )
f: (x: Int)Vector[Double]
scala> val rawList = List(1, 2, 3)
rawList: List[Int] = List(1, 2, 3)
scala> val smordified = rawList.flatMap(x => f(x) /* some unknown function */)
smordified: List[Double] = List(0.40730853571901315, 0.15151641399798665, 1.5305929709857609, 0.35211231420067435, 0.644241939254793, 0.15530230501048903)
scala> val maxified = smordified.map(x => x + smordified.max /* what is the max element of the List? */)
maxified: List[Double] = List(1.937901506704774, 1.6821093849837476, 3.0611859419715217, 1.8827052851864352, 2.1748349102405538, 1.6858952759962498)
scala> maxified
res3: List[Double] = List(1.937901506704774, 1.6821093849837476, 3.0611859419715217, 1.8827052851864352, 2.1748349102405538, 1.6858952759962498)
It is possible, but not pretty, and not likely something you want if you are doing it for "aesthetic reasons."
import scala.math.max
def f(x: Int): Seq[Int] = ???
List(1, 2, 3).
flatMap(x => f(x) /* some unknown function */).
foldRight((List[Int](),List[Int]())) {
case (x, (xs, Nil)) => ((x :: xs), List.fill(xs.size + 1)(x))
case (x, (xs, xMax :: _)) => ((x :: xs), List.fill(xs.size + 1)(max(x, xMax)))
}.
zipped.
map {
case (x, xMax) => x + xMax
}
// Or alternately, a slightly more efficient version using Streams.
List(1, 2, 3).
flatMap(x => f(x) /* some unknown function */).
foldRight((List[Int](),Stream[Int]())) {
case (x, (xs, Stream())) =>
((x :: xs), Stream.continually(x))
case (x, (xs, curXMax #:: _)) =>
val newXMax = max(x, curXMax)
((x :: xs), Stream.continually(newXMax))
}.
zipped.
map {
case (x, xMax) => x + xMax
}
Seriously though, I just took this on to see if I could do it. While the code didn't turn out as bad as I expected, I still don't think it's particularly readable. I'd discourage using this over something similar to Steve Waldman's answer. Sometimes, it's simply better to just introduce a val, rather than being dogmatic about it.
You could define a mapWithSelf (resp. flatMapWithSelf) operation along these lines and add it as an implicit enrichment to the collection. For List it might look like:
// Scala 2.13 APIs
object Enrichments {
implicit class WithSelfOps[A](val lst: List[A]) extends AnyVal {
def mapWithSelf[B](f: (A, List[A]) => B): List[B] =
lst.map(f(_, lst))
def flatMapWithSelf[B](f: (A, List[A]) => IterableOnce[B]): List[B] =
lst.flatMap(f(_, lst))
}
}
The enrichment basically fixes the value of the collection before the operation and threads it through. It should be possible to generify this (at least for the strict collections), though it would look a little different in 2.12 vs. 2.13+.
Usage would look like
import Enrichments._
val someF: Int => IterableOnce[Int] = ???
List(1, 2, 3)
.flatMap(someF)
.mapWithSelf { (x, lst) =>
x + lst.max
}
So at the usage site, it's aesthetically pleasant. Note that if you're computing something which traverses the list, you'll be traversing the list every time (leading to a quadratic runtime). You can get around that with some mutability or by just saving the intermediate list after the flatMap.
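One possible way to keep the chained look while avoiding the repeated traversal is a tiny enrichment that hands the whole collection to a block exactly once. The sketch below assumes a hypothetical let method (LetSyntax, LetOps, LetDemo and the stand-in f are all invented for illustration); Scala 2.13 and later ship a similar pipe method in scala.util.chaining.

```scala
object LetSyntax {
  // Hypothetical enrichment: passes the receiver to a function exactly once.
  implicit class LetOps[A](self: A) {
    def let[B](f: A => B): B = f(self)
  }
}

object LetDemo {
  import LetSyntax._

  // Stand-in for the question's unknown function f.
  def f(x: Int): List[Int] = List(x, x * 10)

  val result: List[Int] =
    List(1, 2, 3)
      .flatMap(f)
      .let { xs =>
        val m = xs.max // computed once, not once per element
        xs.map(_ + m)
      }
}
```

Here the flatMap result is List(1, 10, 2, 20, 3, 30), so LetDemo.result is List(31, 40, 32, 50, 33, 60). Note that xs.max still throws on an empty collection.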
One somewhat-simple way of referencing prior output within the current map/collect operation is to use a named reference outside the map, then reference it from within the map block:
var prevOutput = ... // starting value of whatever is referenced within the map
myValues.map { v =>
prevOutput = ... // expression that references prior `prevOutput`
prevOutput // return above computed value for the map to collect
}
This draws attention to the fact that we're referencing prior elements while building the new sequence.
This would be more messy, though, if you wanted to reference arbitrarily previous values, not just the previous one.

Getting the largest integer value from many lists

I want to get the maximum integer value of a bunch of lists.
How can I do this? Keep in mind, some of the lists may be empty.
I tried something but I was getting:
java.lang.UnsupportedOperationException: empty.max
So I just have a bunch of lists like:
val l1 = List.empty
val l2 = List(1,2,3)
val l3 = List(4,5,6)
val l4 = List(10)
I am doing this currently:
(l1 ++ l2 ++ l3).max
The max may not exist if all the lists are empty, so we can model the result as an Option[Int].
Here's a simple way of doing it:
val max: Option[Int] = List(l1, l2, l3, l4).flatten match {
case Nil => None
case list => Some(list.max)
}
Performing an operation on a List if not empty is a common use case, so there's an ad-hoc combinator that you can use alternatively, reduceOption, as suggested by Jean Logeart's answer:
If you're into one-liners, you can do:
val max: Option[Int] = List(l1, l2, l3, l4).flatten.reduceOption(_ max _)
although I would prefer the first (more verbose) solution, as I personally find it easier to read.
If instead you want to have a default result, you can fold over the flattened List starting with your default:
val max: Int = List(l1, l2, l3, l4).flatten.foldLeft(0)(_ max _) // 0 or any default
or alternatively, just prepend a 0 to your original solution
val max = (0 :: l1 ++ l2 ++ l3).max
If all the lists can be empty:
val max: Option[Int] = Seq(l1, l2, l3, l4).flatten.reduceOption(_ max _)
Pretty much all of the other answers are using flatten or flatMap to create an intermediate list. If all of your lists are quite large, that's needless memory overhead. My solution uses iterators to avoid the extra allocation in the middle.
val list = List(l1, l2, l3, l4)
val max = list.iterator.flatMap(_.iterator).reduceOption(_ max _)
As pointed out in a comment, the .flatMap(_.iterator) can actually be replaced by a flatten. Since it's being called on an iterator, the result is another iterator, rather than a complete list.
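Spelled out, the flatten variant might look like this. MaxOfLists and maxOfAll are names made up for the sketch.

```scala
object MaxOfLists {
  // Hypothetical helper: flatten lazily through an iterator, then reduce.
  def maxOfAll(lists: List[List[Int]]): Option[Int] =
    lists.iterator.flatten.reduceOption(_ max _)
}
```

Because flatten is called on an iterator, no intermediate flattened list is built, and the result is None when every list is empty.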
If you are running into an exception where ALL of the lists are empty, then this will solve that:
(0 :: l1 ++ l2 ++ l3).max
Assuming you just want to default to 0 if they are all empty.
Here is a way you can use Options and a try/catch to find the max:
scala> val l = List(List.empty, List(1,2,3), List(4,5,6), List(10))
l: List[List[Int]] = List(List(), List(1, 2, 3), List(4, 5, 6), List(10))
scala> l.flatMap(x => try{ Some(x.max) } catch {case _ => None}).max
res0: Int = 10
In light of the comments below: don't use exceptions for control flow. I would recommend using Gabriele Petronella's solution.
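If you still want the "max of each list, then max of those" shape, it can be written without exceptions by using reduceOption at both levels. This is just a sketch; MaxWithoutExceptions and maxOfAll are invented names.

```scala
object MaxWithoutExceptions {
  // Hypothetical helper: per-list maximum as an Option, then the maximum of those.
  def maxOfAll(lists: List[List[Int]]): Option[Int] =
    lists
      .flatMap(_.reduceOption(_ max _)) // empty lists contribute nothing
      .reduceOption(_ max _)            // None if every list was empty
}
```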

Scala - increasing prefix of a sequence

I was wondering what the most elegant way of getting the increasing prefix of a given sequence is. My idea is as follows, but it is neither purely functional nor elegant:
val sequence = Seq(1,2,3,1,2,3,4,5,6)
var currentElement = sequence.head - 1
val increasingPrefix = sequence.takeWhile(e =>
if (e > currentElement) {
currentElement = e
true
} else
false)
The result of the above is:
List(1,2,3)
You can take your solution, @Samlik, and effectively zip in the currentElement variable, but then map it out when you're done with it.
sequence.take(1) ++ sequence.zip(sequence.drop(1)).
takeWhile({case (a, b) => a < b}).map({case (a, b) => b})
Also works with infinite sequences:
val sequence = Seq(1, 2, 3).toStream ++ Stream.from(1)
sequence is now an infinite Stream, but we can peek at the first 10 items:
scala> sequence.take(10).toList
res: List[Int] = List(1, 2, 3, 1, 2, 3, 4, 5, 6, 7)
Now, using the above snippet:
val prefix = sequence.take(1) ++ sequence.zip(sequence.drop(1)).
takeWhile({case (a, b) => a < b}).map({case (a, b) => b})
Again, prefix is a Stream, but not infinite.
scala> prefix.toList
res: List[Int] = List(1, 2, 3)
N.b.: This does not handle the cases when sequence is empty, or when the prefix is also infinite.
If by elegant you mean concise and self-explanatory, it's probably something like the following:
sequence.inits.dropWhile(xs => xs != xs.sorted).next
inits gives us an iterator that returns the prefixes longest-first. We drop all the ones that aren't sorted and take the next one.
If you don't want to do all that sorting, you can write something like this:
sequence.scanLeft(Some(Int.MinValue): Option[Int]) {
case (Some(last), i) if i > last => Some(i)
case _ => None
}.tail.flatten
If the performance of this operation is really important, though (it probably isn't), you'll want to use something more imperative, since this solution still traverses the entire collection (twice).
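A more imperative version that stops at the first non-increasing element could be sketched like this. IncreasingPrefix and increasingPrefix are names chosen for the example, and it assumes a strictly increasing prefix of Ints, as in the question.

```scala
object IncreasingPrefix {
  // Hypothetical helper: walks the input once and stops as soon as the run ends.
  def increasingPrefix(xs: Seq[Int]): List[Int] = {
    val it = xs.iterator
    val out = List.newBuilder[Int]
    var growing = true
    var first = true
    var last = 0 // only read after the first element has been stored
    while (growing && it.hasNext) {
      val x = it.next()
      if (first || x > last) {
        out += x
        last = x
        first = false
      } else {
        growing = false // first non-increasing element: stop traversing
      }
    }
    out.result()
  }
}
```

For Seq(1,2,3,1,2,3,4,5,6) this returns List(1, 2, 3), and an empty input gives an empty list.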
And, another way to skin the cat:
val sequence = Seq(1,2,3,1,2,3,4,5,6)
sequence.head :: sequence
.sliding(2)
.takeWhile{case List(a,b) => a < b}
.map(_(1)).toList
// List[Int] = List(1, 2, 3)
I will interpret elegance as the solution that most closely resembles the way we humans think about the problem although an extremely efficient algorithm could also be a form of elegance.
val sequence = List(1,2,3,2,3,45,5)
val increasingPrefix = takeWhile(sequence, _ < _)
I believe this code snippet captures the way most of us probably think about the solution to this problem.
This of course requires defining takeWhile:
/**
 * Takes elements from a sequence by applying a predicate over two elements at a time.
 * @param xs The list to take elements from
 * @param f The predicate that operates over two adjacent elements at a time
 * @return For a non-empty input, a sequence with at least one element, as
 * the first element is assumed to satisfy the predicate because there is no previous
 * element to provide the predicate with.
 */
def takeWhile(xs: Seq[Int], f: (Int, Int) => Boolean): Seq[Int] = {
  // function that operates over tuples and returns true when the predicate does not hold
  val not = f.tupled.andThen(!_)
  // Pair each element with its successor: (x0, x1), (x1, x2), ...
  val twos = xs.zip(xs.drop(1))
  val indexOfBreak = twos.indexWhere(not)
  // twos has one less element than xs, so we need to compensate for that.
  // An intuition is the fact that this function should always return the first element of
  // a non-empty list. No break at all (-1) means the whole sequence qualifies.
  if (indexOfBreak < 0) xs else xs.take(indexOfBreak + 1)
}

Scala filter on a list by index

I wanted to write it functionally, and the best I could do was:
list.zipWithIndex.filter((tt:Tuple2[Thing,Int])=>(tt._2%3==0)).unzip._1
to get elements 0, 3, 6,...
Is there a more readable Scala idiom for this?
If efficiency is not an issue, you could do the following:
list.grouped(3).map(_.head)
Note that this constructs intermediate lists.
Alternatively you can use a for-comprehension:
for {
(x,i) <- list.zipWithIndex
if i % 3 == 0
} yield x
This is of course almost identical to your original solution, just written differently.
My last alternative for you is the use of collect on the zipped list:
list.zipWithIndex.collect {
case (x,i) if i % 3 == 0 => x
}
Not much clearer, but still:
xs.indices.collect { case i if i % 3 == 0 => xs(i) }
A nice, functional solution, without creating temporary vectors, lists, and so on:
def everyNth[T](xs: List[T], n:Int): List[T] = xs match {
case hd::tl => hd::everyNth(tl.drop(n-1), n)
case Nil => Nil
}
Clojure has a take-nth function that does what you want, but I was surprised to find that there's not an equivalent method in Scala. You could code up a similar recursive solution based off the Clojure code, or you could read this blog post:
Scala collections: Filtering each n-th element
The author actually has a nice graph at the end showing the relative performance of each of his solutions.
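In case it helps, a recursive Scala counterpart of take-nth might look something like the sketch below. It is not taken from the blog post or from Clojure's source; TakeNth and takeNth are names invented here, and the accumulator is there only to keep the recursion stack-safe.

```scala
import scala.annotation.tailrec

object TakeNth {
  // Hypothetical counterpart of Clojure's take-nth: elements 0, n, 2n, ...
  def takeNth[A](xs: List[A], n: Int): List[A] = {
    require(n > 0, "n must be positive")

    @tailrec
    def loop(rest: List[A], acc: List[A]): List[A] = rest match {
      case Nil     => acc.reverse
      case x :: tl => loop(tl.drop(n - 1), x :: acc) // keep x, skip the next n - 1
    }

    loop(xs, Nil)
  }
}
```

For example, TakeNth.takeNth(List(10,9,8,7,6,5,4,3,2,1,0), 3) yields List(10, 7, 4, 1).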
I would do it the way one would in Octave, the mathematical program.
val indices = 0 until n by 3 // Range 0,3,6,9 ...
and then I needed some way to select the indices from a collection. Obviously I had to have a collection with random-access O(1). Like Array or Vector. For example here I use Vector. To wrap the access into a nice DSL I'd add an implicit class:
implicit class VectorEnrichedWithIndices[T](v:Vector[T]) {
def apply(indices:TraversableOnce[Int]):Vector[T] = {
// some implementation
indices.toVector.map(v)
}
}
The usage would look like:
val vector = list.toVector
val every3rdElement = vector(0 until vector.size by 3)
Ah, how about this?
val l = List(10,9,8,7,6,5,4,3,2,1,0)
for (i <- (0 to l.size - 1 by 3).toList) yield l(i)
//res0: List[Int] = List(10, 7, 4, 1)
which can be made more general by
def seqByN[A](xs: Seq[A], n: Int): Seq[A] = for (i <- 0 to xs.size - 1 by n) yield xs(i)
scala> seqByN(List(10,9,8,7,6,5,4,3,2,1,0), 3)
res1: Seq[Int] = Vector(10,7,4,1)
scala> seqByN(List(10,9,8,7,6,5,4,3,2,1,0), 3).toList
res2: List[Int] = List(10, 7, 4, 1)
scala> seqByN(List[Int](), 3)
res1: Seq[Int] = Vector()
But by functional do you mean only using the various List combinator functions? Otherwise, are Streams functional enough?
def fromByN[A](xs: List[A], n: Int): Stream[A] = if (xs.isEmpty) Stream.empty else
xs.head #:: fromByN(xs drop n, n)
scala> fromByN(List(10,9,8,7,6,5,4,3,2,1,0), 3).toList
res17: List[Int] = List(10, 7, 4, 1)