How to aggregateByKey with custom class for frequency distribution? - scala

I am trying to create a frequency distribution.
My data is in the following pattern (ColumnIndex, (Value, countOfValue)) of type (Int, (Any, Long)). For instance, (1, (A, 10)) means for column index 1, there are 10 A's.
My goal is to get the top 100 values for all my index's or Keys.
Right away I can make it less compute intensive for my workload by doing an initial filter:
val freqNumDist = numRDD.filter(x => x._2._2 > 1)
Now I found an interesting example of a class, here which seems to fit my use case:
class TopNList (val maxSize:Int) extends Serializable {
val topNCountsForColumnArray = new mutable.ArrayBuffer[(Any, Long)]
var lowestColumnCountIndex:Int = -1
var lowestValue = Long.MaxValue
def add(newValue:Any, newCount:Long): Unit = {
if (topNCountsForColumnArray.length < maxSize -1) {
topNCountsForColumnArray += ((newValue, newCount))
} else if (topNCountsForColumnArray.length == maxSize) {
updateLowestValue
} else {
if (newCount > lowestValue) {
topNCountsForColumnArray.insert(lowestColumnCountIndex, (newValue, newCount))
updateLowestValue
}
}
}
def updateLowestValue: Unit = {
var index = 0
topNCountsForColumnArray.foreach{ r =>
if (r._2 < lowestValue) {
lowestValue = r._2
lowestColumnCountIndex = index
}
index+=1
}
}
}
So Now What I was thinking was putting together an aggregateByKey to use this class in order to get my top 100 values! The problem is that I am unsure of how to use this class in aggregateByKey in order to accomplish this goal.
val initFreq:TopNList = new TopNList(100)
def freqSeq(u: (TopNList), v:(Double, Long)) = (
u.add(v._1, v._2)
)
def freqComb(u1: TopNList, u2: TopNList) = (
u2.topNCountsForColumnArray.foreach(r => u1.add(r._1, r._2))
)
val freqNumDist = numRDD.filter(x => x._2._2 > 1).aggregateByKey(initFreq)(freqSeq, freqComb)
The obvious problem is that nothing is returned by the functions I am using. So I am wondering how to modify this class or do I need to think about this in a whole new light and just cherry pick some of the functions out of this class and add them to the functions I am using for the aggregateByKey?
I'm either thinking about classes wrong or the entire aggregateByKey or both!

Your projections implementations (freqSeq, freqComb) return Unit while you expect them to return TopNList
If intentially keep the style of your solution, the relevant impl should be
def freqSeq(u: TopNList, v:(Any, Long)) : TopNList = {
u.add(v._1, v._2) // operation gives void result (Unit)
u // this one of TopNList type
}
def freqComb(u1: TopNList, u2: TopNList) : TopNList = {
u2.topNCountsForColumnArray.foreach (r => u1.add (r._1, r._2) )
u1
}
Just take a look on aggregateByKey signature of PairRDDFunctions, what does it expect for
def aggregateByKey[U](zeroValue : U)(seqOp : scala.Function2[U, V, U], combOp : scala.Function2[U, U, U])(implicit evidence$3 : scala.reflect.ClassTag[U]) : org.apache.spark.rdd.RDD[scala.Tuple2[K, U]] = { /* compiled code */ }

Related

How to dynamically provide N codecs to process fields as a VectorCodec for a record of binary fields that do not contain size bytes

Considering this function in Decoder:
final def decodeCollect[F[_], A](dec: Decoder[A], limit: Option[Int])(buffer: BitVector)(implicit cbf: Factory[A, F[A]]): Attempt[DecodeResult[F[A]]] = {
What I really need is dec: Vector[Decoder[A]], like this:
final def decodeCollect[F[_], A](dec: Vector[Decoder[A]], limit: Option[Int])(buffer: BitVector)(implicit cbf: Factory[A, F[A]]): Attempt[DecodeResult[F[A]]] = {
to process a binary format that has fields that are not self describing. Early in the file are description records, and from these come field sizes that have to be applied later in data records. So I want to build up a list of decoders and apply it N times, where N is the number of decoders.
I could write a new function modeled on decodeCollect, but it takes an implicit Factory, so I probably would have to compile the scodec library and add it.
Is there a simpler approach using what exists in the scodec library? Either a way to deal with the factory or a different approach?
I finally hacked a solution in the codec codebase. Now that that door is open, I'll add whatever I need until I succeed.
final def decodeNCollect[F[_], A](dec: Vector[Decoder[A]])(buffer: BitVector)(implicit cbf: Factory[A, F[A]]): Attempt[DecodeResult[F[A]]] = {
val bldr = cbf.newBuilder
var remaining = buffer
var count = 0
val maxCount = dec.length
var error: Option[Err] = None
while (count < maxCount && remaining.nonEmpty) {
dec(count).decode(remaining) match {
case Attempt.Successful(DecodeResult(value, rest)) =>
bldr += value
count += 1
remaining = rest
case Attempt.Failure(err) =>
error = Some(err.pushContext(count.toString))
remaining = BitVector.empty
}
}
Attempt.fromErrOption(error, DecodeResult(bldr.result, remaining))
}
final def encodeNSeq[A](encs: Vector[Encoder[A]])(seq: collection.immutable.Seq[A]): Attempt[BitVector] = {
if (encs.length != seq.length)
return Attempt.failure(Err("encodeNSeq: length of coders and items does not match"))
val buf = new collection.mutable.ArrayBuffer[BitVector](seq.size)
((seq zip (0 until encs.length)): Seq[(A, Int)]) foreach { case (a, i) =>
encs(i).encode(a) match {
case Attempt.Successful(aa) => buf += aa
case Attempt.Failure(err) => return Attempt.failure(err.pushContext(buf.size.toString))
}
}
def merge(offset: Int, size: Int): BitVector = size match {
case 0 => BitVector.empty
case 1 => buf(offset)
case n =>
val half = size / 2
merge(offset, half) ++ merge(offset + half, half + (if (size % 2 == 0) 0 else 1))
}
Attempt.successful(merge(0, buf.size))
}
private[codecs] final class VectorNCodec[A](codecs: Vector[Codec[A]]) extends Codec[Vector[A]] {
def sizeBound = SizeBound(0, Some(codecs.length.toLong))
def encode(vector: Vector[A]) = Encoder.encodeNSeq(codecs)(vector)
def decode(buffer: BitVector) =
Decoder.decodeNCollect[Vector, A](codecs)(buffer)
override def toString = s"vector($codecs)"
}
def vectorOf[A](valueCodecs: Vector[Codec[A]]): Codec[Vector[A]] =
provide(valueCodecs.length).
flatZip { count => new VectorNCodec(valueCodecs) }.
narrow[Vector[A]]({ case (cnt, xs) =>
if (xs.size == cnt) Attempt.successful(xs)
else Attempt.failure(Err(s"Insufficient number of elements: decoded ${xs.size} but should have decoded $cnt"))
}, xs => (xs.size, xs)).
withToString(s"vectorOf($valueCodecs)")

IndexedSeq-based equivalent of Stream?

I have a lazily-calculated sequence of objects, where the lazy calculation depends only on the index (not the previous items) and some constant parameters (p:Bar below). I'm currently using a Stream, however computing the stream.init is typically wasteful.
However, I really like that using Stream[Foo] = ... gets me out of implementing a cache, and has very light declaration syntax while still providing all the sugar (like stream(n) gets element n). Then again, I could just be using the wrong declaration:
class FooSrcCache(p:Bar) {
val src : Stream[FooSrc] = {
def error() : FooSrc = FooSrc(0,p)
def loop(i: Int): Stream[FooSrc] = {
FooSrc(i,p) #:: loop(i + 1)
}
error() #:: loop(1)
}
def apply(max: Int) = src(max)
}
Is there a Stream-comparable base Scala class, that is indexed instead of linear?
PagedSeq should do the job for you:
class FooSrcCache(p:Bar) {
private def fill(buf: Array[FooSrc], start: Int, end: Int) = {
for (i <- start until end) {
buf(i) = FooSrc(i,p)
}
end - start
}
val src = new PagedSeq[FooSrc](fill _)
def apply(max: Int) = src(max)
}
Note that this might calculate FooSrc with higher indices than you requested.

How to add 'Array[Ordered[Any]]' as a method parameter

Below is an implementation of Selection sort written in Scala.
The line ss.sort(arr) causes this error :
type mismatch; found : Array[String] required: Array[Ordered[Any]]
Since the type Ordered is inherited by StringOps should this type not be inferred ?
How can I add the array of Strings to sort() method ?
Here is the complete code :
object SelectionSortTest {
def main(args: Array[String]){
val arr = Array("Hello","World")
val ss = new SelectionSort()
ss.sort(arr)
}
}
class SelectionSort {
def sort(a : Array[Ordered[Any]]) = {
var N = a.length
for (i <- 0 until N) {
var min = i
for(j <- i + 1 until N){
if( less(a(j) , a(min))){
min = j
}
exchange(a , i , min)
}
}
}
def less(v : Ordered[Any] , w : Ordered[Any]) = {
v.compareTo(w) < 0
}
def exchange(a : Array[Ordered[Any]] , i : Integer , j : Integer) = {
var swap : Ordered[Any] = a(i)
a(i) = a(j)
a(j) = swap
}
}
Array is invariant. You cannot use an Array[A] as an Array[B] even if A is subtype of B. See here why: Why are Arrays invariant, but Lists covariant?
Neither is Ordered, so your implementation of less will not work either.
You should make your implementation generic the following way:
object SelectionSortTest {
def main(args: Array[String]){
val arr = Array("Hello","World")
val ss = new SelectionSort()
ss.sort(arr)
}
}
class SelectionSort {
def sort[T <% Ordered[T]](a : Array[T]) = {
var N = a.length
for (i <- 0 until N) {
var min = i
for(j <- i + 1 until N){
if(a(j) < a(min)){ // call less directly on Ordered[T]
min = j
}
exchange(a , i , min)
}
}
}
def exchange[T](a : Array[T] , i : Integer , j : Integer) = {
var swap = a(i)
a(i) = a(j)
a(j) = swap
}
}
The somewhat bizarre statement T <% Ordered[T] means "any type T that can be implicitly converted to Ordered[T]". This ensures that you can still use the less-than operator.
See this for details:
What are Scala context and view bounds?
The answer by #gzm0 (with some very nice links) suggests Ordered. I'm going to complement with an answer covering Ordering, which provides equivalent functionality without imposing on your classes as much.
Let's adjust the sort method to accept an array of type 'T' for which an Ordering implicit instance is defined.
def sort[T : Ordering](a: Array[T]) = {
val ord = implicitly[Ordering[T]]
import ord._ // now comparison operations such as '<' are available for 'T'
// ...
if (a(j) < a(min))
// ...
}
The [T : Ordering] and implicitly[Ordering[T]] combo is equivalent to an implicit parameter of type Ordering[T]:
def sort[T](a: Array[T])(implicit ord: Ordering[T]) = {
import ord._
// ...
}
Why is this useful?
Imagine you are provided with a case class Account(balance: Int) by some third party. You can now add an Ordering for it like so:
// somewhere in scope
implicit val accountOrdering = new Ordering[Account] {
def compare(x: Account, y: Account) = x.balance - y.balance
}
// or, more simply
implicit val accountOrdering: Ordering[Account] = Ordering by (_.balance)
As long as that instance is in scope, you should be able to use sort(accounts).
If you want to use some different ordering, you can also provide it explicitly, like so: sort(accounts)(otherOrdering).
Note that this isn't very different from providing an implicit conversion to Ordering (at least not within the context of this question).
Even though, when coding Scala, I'm used to prefer functional programming style (via combinators or recursion) over imperative style (via variables and iterations), THIS TIME, for this specific problem, old school imperative nested loops result in simpler and performant code.
I don't think falling back to imperative style is a mistake for certain classes of problems, such as sorting algorithms which usually transform the input buffer (more like a procedure) rather than resulting to a new sorted collection.
Here it is my solution:
package bitspoke.algo
import scala.math.Ordered
import scala.collection.mutable.Buffer
abstract class Sorter[T <% Ordered[T]] {
// algorithm provided by subclasses
def sort(buffer : Buffer[T]) : Unit
// check if the buffer is sorted
def sorted(buffer : Buffer[T]) = buffer.isEmpty || buffer.view.zip(buffer.tail).forall { t => t._2 > t._1 }
// swap elements in buffer
def swap(buffer : Buffer[T], i:Int, j:Int) {
val temp = buffer(i)
buffer(i) = buffer(j)
buffer(j) = temp
}
}
class SelectionSorter[T <% Ordered[T]] extends Sorter[T] {
def sort(buffer : Buffer[T]) : Unit = {
for (i <- 0 until buffer.length) {
var min = i
for (j <- i until buffer.length) {
if (buffer(j) < buffer(min))
min = j
}
swap(buffer, i, min)
}
}
}
As you can see, to achieve parametric polymorphism, rather than using java.lang.Comparable, I preferred scala.math.Ordered and Scala View Bounds rather than Upper Bounds. That's certainly works thanks to Scala Implicit Conversions of primitive types to Rich Wrappers.
You can write a client program as follows:
import bitspoke.algo._
import scala.collection.mutable._
val sorter = new SelectionSorter[Int]
val buffer = ArrayBuffer(3, 0, 4, 2, 1)
sorter.sort(buffer)
assert(sorter.sorted(buffer))

workaround for prepending to a LinkedHashMap in Scala?

I have a LinkedHashMap which I've been using in a typical way: adding new key-value
pairs to the end, and accessing them in order of insertion. However, now I have a
special case where I need to add pairs to the "head" of the map. I think there's
some functionality inside the LinkedHashMap source for doing this, but it has private
accessibility.
I have a solution where I create a new map, add the pair, then add all the old mappings.
In Java syntax:
newMap.put(newKey, newValue)
newMap.putAll(this.map)
this.map = newMap
It works. But the problem here is that I then need to make my main data structure
(this.map) a var rather than a val.
Can anyone think of a nicer solution? Note that I definitely need the fast lookup
functionality provided by a Map collection. The performance of a prepending is not
such a big deal.
More generally, as a Scala developer how hard would you fight to avoid a var
in a case like this, assuming there's no foreseeable need for concurrency?
Would you create your own version of LinkedHashMap? Looks like a hassle frankly.
This will work but is not especially nice either:
import scala.collection.mutable.LinkedHashMap
def prepend[K,V](map: LinkedHashMap[K,V], kv: (K, V)) = {
val copy = map.toMap
map.clear
map += kv
map ++= copy
}
val map = LinkedHashMap('b -> 2)
prepend(map, 'a -> 1)
map == LinkedHashMap('a -> 1, 'b -> 2)
Have you taken a look at the code of LinkedHashMap? The class has a field firstEntry, and just by taking a quick peek at updateLinkedEntries, it should be relatively easy to create a subclass of LinkedHashMap which only adds a new method prepend and updateLinkedEntriesPrepend resulting in the behavior you need, e.g. (not tested):
private def updateLinkedEntriesPrepend(e: Entry) {
if (firstEntry == null) { firstEntry = e; lastEntry = e }
else {
val oldFirstEntry = firstEntry
firstEntry = e
firstEntry.later = oldFirstEntry
oldFirstEntry.earlier = e
}
}
Here is a sample implementation I threw together real quick (that is, not thoroughly tested!):
class MyLinkedHashMap[A, B] extends LinkedHashMap[A,B] {
def prepend(key: A, value: B): Option[B] = {
val e = findEntry(key)
if (e == null) {
val e = new Entry(key, value)
addEntry(e)
updateLinkedEntriesPrepend(e)
None
} else {
// The key already exists, so we might as well call LinkedHashMap#put
put(key, value)
}
}
private def updateLinkedEntriesPrepend(e: Entry) {
if (firstEntry == null) { firstEntry = e; lastEntry = e }
else {
val oldFirstEntry = firstEntry
firstEntry = e
firstEntry.later = oldFirstEntry
oldFirstEntry.earlier = firstEntry
}
}
}
Tested like this:
object Main {
def main(args:Array[String]) {
val x = new MyLinkedHashMap[String, Int]();
x.prepend("foo", 5)
x.prepend("bar", 6)
x.prepend("olol", 12)
x.foreach(x => println("x:" + x._1 + " y: " + x._2 ));
}
}
Which, on Scala 2.9.0 (yeah, need to update) results in
x:olol y: 12
x:bar y: 6
x:foo y: 5
A quick benchmark shows order of magnitude in performance difference between the extended built-in class and the "map rewrite" approach (I used the code from Debilski's answer in "ExternalMethod" and mine in "BuiltIn"):
benchmark length us linear runtime
ExternalMethod 10 1218.44 =
ExternalMethod 100 1250.28 =
ExternalMethod 1000 19453.59 =
ExternalMethod 10000 349297.25 ==============================
BuiltIn 10 3.10 =
BuiltIn 100 2.48 =
BuiltIn 1000 2.38 =
BuiltIn 10000 3.28 =
The benchmark code:
def timeExternalMethod(reps: Int) = {
var r = reps
while(r > 0) {
for(i <- 1 to 100) prepend(map, (i, i))
r -= 1
}
}
def timeBuiltIn(reps: Int) = {
var r = reps
while(r > 0) {
for(i <- 1 to 100) map.prepend(i, i)
r -= 1
}
}
Using a scala benchmarking template.

Selection Sort Generic type implementation

I worked my way implementing a recursive version of selection and quick sort,i am trying to modify the code in a way that it can sort a list of any generic type , i want to assume that the generic type supplied can be converted to Comparable at runtime.
Does anyone have a link ,code or tutorial on how to do this please
I am trying to modify this particular code
'def main (args:Array[String]){
val l = List(2,4,5,6,8)
print(quickSort(l))
}
def quickSort(x:List[Int]):List[Int]={
x match{
case xh::xt =>
{
val (first,pivot,second) = partition(x)
quickSort (first):::(pivot :: quickSort(second))
}
case Nil => {x}
}
}
def partition (x:List[Int])=
{
val pivot =x.head
var first:List[Int]=List ()
var second : List[Int]=List ()
val fun=(i:Int)=> {
if (i<pivot)
first=i::first
else
second=i::second
}
x.tail.foreach(fun)
(first,pivot,second)
}
enter code here
def main (args:Array[String]){
val l = List(2,4,5,6,8)
print(quickSort(l))
}
def quickSort(x:List[Int]):List[Int]={
x match{
case xh::xt =>
{
val (first,pivot,second) = partition(x)
quickSort (first):::(pivot :: quickSort(second))
}
case Nil => {x}
}
}
def partition (x:List[Int])=
{
val pivot =x.head
var first:List[Int]=List ()
var second : List[Int]=List ()
val fun=(i:Int)=> {
if (i<pivot)
first=i::first
else
second=i::second
}
x.tail.foreach(fun)
(first,pivot,second)
} '
Language: SCALA
In Scala, Java Comparator is replaced by Ordering (quite similar but comes with more useful methods). They are implemented for several types (primitives, strings, bigDecimals, etc.) and you can provide your own implementations.
You can then use scala implicit to ask the compiler to pick the correct one for you:
def sort[A]( lst: List[A] )( implicit ord: Ordering[A] ) = {
...
}
If you are using a predefined ordering, just call:
sort( myLst )
and the compiler will infer the second argument. If you want to declare your own ordering, use the keyword implicit in the declaration. For instance:
implicit val fooOrdering = new Ordering[Foo] {
def compare( f1: Foo, f2: Foo ) = {...}
}
and it will be implicitly use if you try to sort a List of Foo.
If you have several implementations for the same type, you can also explicitly pass the correct ordering object:
sort( myFooLst )( fooOrdering )
More info in this post.
For Quicksort, I'll modify an example from the "Scala By Example" book to make it more generic.
class Quicksort[A <% Ordered[A]] {
def sort(a:ArraySeq[A]): ArraySeq[A] =
if (a.length < 2) a
else {
val pivot = a(a.length / 2)
sort (a filter (pivot >)) ++ (a filter (pivot == )) ++
sort (a filter(pivot <))
}
}
Test with Int
scala> val quicksort = new Quicksort[Int]
quicksort: Quicksort[Int] = Quicksort#38ceb62f
scala> val a = ArraySeq(5, 3, 2, 2, 1, 1, 9, 39 ,219)
a: scala.collection.mutable.ArraySeq[Int] = ArraySeq(5, 3, 2, 2, 1, 1, 9, 39, 21
9)
scala> quicksort.sort(a).foreach(n=> (print(n), print (" " )))
1 1 2 2 3 5 9 39 219
Test with a custom class implementing Ordered
scala> case class Meh(x: Int, y:Int) extends Ordered[Meh] {
| def compare(that: Meh) = (x + y).compare(that.x + that.y)
| }
defined class Meh
scala> val q2 = new Quicksort[Meh]
q2: Quicksort[Meh] = Quicksort#7677ce29
scala> val a3 = ArraySeq(Meh(1,1), Meh(12,1), Meh(0,1), Meh(2,2))
a3: scala.collection.mutable.ArraySeq[Meh] = ArraySeq(Meh(1,1), Meh(12,1), Meh(0
,1), Meh(2,2))
scala> q2.sort(a3)
res7: scala.collection.mutable.ArraySeq[Meh] = ArraySeq(Meh(0,1), Meh(1,1), Meh(
2,2), Meh(12,1))
Even though, when coding Scala, I'm used to prefer functional programming style (via combinators or recursion) over imperative style (via variables and iterations), THIS TIME, for this specific problem, old school imperative nested loops result in simpler code for the reader. I don't think falling back to imperative style is a mistake for certain classes of problems (such as sorting algorithms which usually transform the input buffer (like a procedure) rather than resulting to a new sorted one
Here it is my solution:
package bitspoke.algo
import scala.math.Ordered
import scala.collection.mutable.Buffer
abstract class Sorter[T <% Ordered[T]] {
// algorithm provided by subclasses
def sort(buffer : Buffer[T]) : Unit
// check if the buffer is sorted
def sorted(buffer : Buffer[T]) = buffer.isEmpty || buffer.view.zip(buffer.tail).forall { t => t._2 > t._1 }
// swap elements in buffer
def swap(buffer : Buffer[T], i:Int, j:Int) {
val temp = buffer(i)
buffer(i) = buffer(j)
buffer(j) = temp
}
}
class SelectionSorter[T <% Ordered[T]] extends Sorter[T] {
def sort(buffer : Buffer[T]) : Unit = {
for (i <- 0 until buffer.length) {
var min = i
for (j <- i until buffer.length) {
if (buffer(j) < buffer(min))
min = j
}
swap(buffer, i, min)
}
}
}
As you can see, rather than using java.lang.Comparable, I preferred scala.math.Ordered and Scala View Bounds rather than Upper Bounds. That's certainly works thanks to many Scala Implicit Conversions of primitive types to Rich Wrappers.
You can write a client program as follows:
import bitspoke.algo._
import scala.collection.mutable._
val sorter = new SelectionSorter[Int]
val buffer = ArrayBuffer(3, 0, 4, 2, 1)
sorter.sort(buffer)
assert(sorter.sorted(buffer))