I am using this implementation of a bounded priority queue, https://gist.github.com/ryanlecompte/5746241, and it works perfectly. However, I want this queue to not contain any element with the same ordering 'ord'. How can I achieve this?
I tried to update the maybeReplaceLowest function by giving the function lteq instead of lt.
private def maybeReplaceLowest(a: A) {
if (ord.lteq(a, head)) {
dequeue()
super.+=(a)
}
}
But I think it does not work because the element which has the same ord as the new element might not be at the head. What could be a quick workaround for this problem?
Many thanks.
Scala PriorityQueue is implemented using Array. Which is very inefficient for lookup operation, which you need to do for every insert (to check if the element already exists in the queue.
For TreeSet is ideal for lookup and ordered storage.
But since this is a sealed class (scala-2.12.8) I created composite class
import scala.collection.mutable
class PrioritySet[A](val maxSize:Int)(implicit ordering:Ordering[A]) {
var set = mutable.TreeSet[A]()(ordering)
private def removeAdditional(): this.type = {
val additionalElements = set.size - maxSize
if( additionalElements > 0) { set = set.drop(additionalElements) }
this
}
def +(elem: A): this.type = {
set += elem
removeAdditional()
}
def ++=(elem: A, elements: A*) : this.type = {
set += elem ++= elements
removeAdditional()
}
override def toString = this.getClass.getName + set.mkString("(", ", ", ")")
}
This can be used as
>>> val set = new PrioritySet[Int](4)
>>> set ++=(1000,3,42,2,5, 1,2,4, 100)
>>> println(set)
PrioritySet(5, 42, 100, 1000)
Related
I am trying to create a frequency distribution.
My data is in the following pattern (ColumnIndex, (Value, countOfValue)) of type (Int, (Any, Long)). For instance, (1, (A, 10)) means for column index 1, there are 10 A's.
My goal is to get the top 100 values for all my index's or Keys.
Right away I can make it less compute intensive for my workload by doing an initial filter:
val freqNumDist = numRDD.filter(x => x._2._2 > 1)
Now I found an interesting example of a class, here which seems to fit my use case:
class TopNList (val maxSize:Int) extends Serializable {
val topNCountsForColumnArray = new mutable.ArrayBuffer[(Any, Long)]
var lowestColumnCountIndex:Int = -1
var lowestValue = Long.MaxValue
def add(newValue:Any, newCount:Long): Unit = {
if (topNCountsForColumnArray.length < maxSize -1) {
topNCountsForColumnArray += ((newValue, newCount))
} else if (topNCountsForColumnArray.length == maxSize) {
updateLowestValue
} else {
if (newCount > lowestValue) {
topNCountsForColumnArray.insert(lowestColumnCountIndex, (newValue, newCount))
updateLowestValue
}
}
}
def updateLowestValue: Unit = {
var index = 0
topNCountsForColumnArray.foreach{ r =>
if (r._2 < lowestValue) {
lowestValue = r._2
lowestColumnCountIndex = index
}
index+=1
}
}
}
So Now What I was thinking was putting together an aggregateByKey to use this class in order to get my top 100 values! The problem is that I am unsure of how to use this class in aggregateByKey in order to accomplish this goal.
val initFreq:TopNList = new TopNList(100)
def freqSeq(u: (TopNList), v:(Double, Long)) = (
u.add(v._1, v._2)
)
def freqComb(u1: TopNList, u2: TopNList) = (
u2.topNCountsForColumnArray.foreach(r => u1.add(r._1, r._2))
)
val freqNumDist = numRDD.filter(x => x._2._2 > 1).aggregateByKey(initFreq)(freqSeq, freqComb)
The obvious problem is that nothing is returned by the functions I am using. So I am wondering how to modify this class or do I need to think about this in a whole new light and just cherry pick some of the functions out of this class and add them to the functions I am using for the aggregateByKey?
I'm either thinking about classes wrong or the entire aggregateByKey or both!
Your projections implementations (freqSeq, freqComb) return Unit while you expect them to return TopNList
If intentially keep the style of your solution, the relevant impl should be
def freqSeq(u: TopNList, v:(Any, Long)) : TopNList = {
u.add(v._1, v._2) // operation gives void result (Unit)
u // this one of TopNList type
}
def freqComb(u1: TopNList, u2: TopNList) : TopNList = {
u2.topNCountsForColumnArray.foreach (r => u1.add (r._1, r._2) )
u1
}
Just take a look on aggregateByKey signature of PairRDDFunctions, what does it expect for
def aggregateByKey[U](zeroValue : U)(seqOp : scala.Function2[U, V, U], combOp : scala.Function2[U, U, U])(implicit evidence$3 : scala.reflect.ClassTag[U]) : org.apache.spark.rdd.RDD[scala.Tuple2[K, U]] = { /* compiled code */ }
I have a lazily-calculated sequence of objects, where the lazy calculation depends only on the index (not the previous items) and some constant parameters (p:Bar below). I'm currently using a Stream, however computing the stream.init is typically wasteful.
However, I really like that using Stream[Foo] = ... gets me out of implementing a cache, and has very light declaration syntax while still providing all the sugar (like stream(n) gets element n). Then again, I could just be using the wrong declaration:
class FooSrcCache(p:Bar) {
val src : Stream[FooSrc] = {
def error() : FooSrc = FooSrc(0,p)
def loop(i: Int): Stream[FooSrc] = {
FooSrc(i,p) #:: loop(i + 1)
}
error() #:: loop(1)
}
def apply(max: Int) = src(max)
}
Is there a Stream-comparable base Scala class, that is indexed instead of linear?
PagedSeq should do the job for you:
class FooSrcCache(p:Bar) {
private def fill(buf: Array[FooSrc], start: Int, end: Int) = {
for (i <- start until end) {
buf(i) = FooSrc(i,p)
}
end - start
}
val src = new PagedSeq[FooSrc](fill _)
def apply(max: Int) = src(max)
}
Note that this might calculate FooSrc with higher indices than you requested.
Below is an implementation of Selection sort written in Scala.
The line ss.sort(arr) causes this error :
type mismatch; found : Array[String] required: Array[Ordered[Any]]
Since the type Ordered is inherited by StringOps should this type not be inferred ?
How can I add the array of Strings to sort() method ?
Here is the complete code :
object SelectionSortTest {
def main(args: Array[String]){
val arr = Array("Hello","World")
val ss = new SelectionSort()
ss.sort(arr)
}
}
class SelectionSort {
def sort(a : Array[Ordered[Any]]) = {
var N = a.length
for (i <- 0 until N) {
var min = i
for(j <- i + 1 until N){
if( less(a(j) , a(min))){
min = j
}
exchange(a , i , min)
}
}
}
def less(v : Ordered[Any] , w : Ordered[Any]) = {
v.compareTo(w) < 0
}
def exchange(a : Array[Ordered[Any]] , i : Integer , j : Integer) = {
var swap : Ordered[Any] = a(i)
a(i) = a(j)
a(j) = swap
}
}
Array is invariant. You cannot use an Array[A] as an Array[B] even if A is subtype of B. See here why: Why are Arrays invariant, but Lists covariant?
Neither is Ordered, so your implementation of less will not work either.
You should make your implementation generic the following way:
object SelectionSortTest {
def main(args: Array[String]){
val arr = Array("Hello","World")
val ss = new SelectionSort()
ss.sort(arr)
}
}
class SelectionSort {
def sort[T <% Ordered[T]](a : Array[T]) = {
var N = a.length
for (i <- 0 until N) {
var min = i
for(j <- i + 1 until N){
if(a(j) < a(min)){ // call less directly on Ordered[T]
min = j
}
exchange(a , i , min)
}
}
}
def exchange[T](a : Array[T] , i : Integer , j : Integer) = {
var swap = a(i)
a(i) = a(j)
a(j) = swap
}
}
The somewhat bizarre statement T <% Ordered[T] means "any type T that can be implicitly converted to Ordered[T]". This ensures that you can still use the less-than operator.
See this for details:
What are Scala context and view bounds?
The answer by #gzm0 (with some very nice links) suggests Ordered. I'm going to complement with an answer covering Ordering, which provides equivalent functionality without imposing on your classes as much.
Let's adjust the sort method to accept an array of type 'T' for which an Ordering implicit instance is defined.
def sort[T : Ordering](a: Array[T]) = {
val ord = implicitly[Ordering[T]]
import ord._ // now comparison operations such as '<' are available for 'T'
// ...
if (a(j) < a(min))
// ...
}
The [T : Ordering] and implicitly[Ordering[T]] combo is equivalent to an implicit parameter of type Ordering[T]:
def sort[T](a: Array[T])(implicit ord: Ordering[T]) = {
import ord._
// ...
}
Why is this useful?
Imagine you are provided with a case class Account(balance: Int) by some third party. You can now add an Ordering for it like so:
// somewhere in scope
implicit val accountOrdering = new Ordering[Account] {
def compare(x: Account, y: Account) = x.balance - y.balance
}
// or, more simply
implicit val accountOrdering: Ordering[Account] = Ordering by (_.balance)
As long as that instance is in scope, you should be able to use sort(accounts).
If you want to use some different ordering, you can also provide it explicitly, like so: sort(accounts)(otherOrdering).
Note that this isn't very different from providing an implicit conversion to Ordering (at least not within the context of this question).
Even though, when coding Scala, I'm used to prefer functional programming style (via combinators or recursion) over imperative style (via variables and iterations), THIS TIME, for this specific problem, old school imperative nested loops result in simpler and performant code.
I don't think falling back to imperative style is a mistake for certain classes of problems, such as sorting algorithms which usually transform the input buffer (more like a procedure) rather than resulting to a new sorted collection.
Here it is my solution:
package bitspoke.algo
import scala.math.Ordered
import scala.collection.mutable.Buffer
abstract class Sorter[T <% Ordered[T]] {
// algorithm provided by subclasses
def sort(buffer : Buffer[T]) : Unit
// check if the buffer is sorted
def sorted(buffer : Buffer[T]) = buffer.isEmpty || buffer.view.zip(buffer.tail).forall { t => t._2 > t._1 }
// swap elements in buffer
def swap(buffer : Buffer[T], i:Int, j:Int) {
val temp = buffer(i)
buffer(i) = buffer(j)
buffer(j) = temp
}
}
class SelectionSorter[T <% Ordered[T]] extends Sorter[T] {
def sort(buffer : Buffer[T]) : Unit = {
for (i <- 0 until buffer.length) {
var min = i
for (j <- i until buffer.length) {
if (buffer(j) < buffer(min))
min = j
}
swap(buffer, i, min)
}
}
}
As you can see, to achieve parametric polymorphism, rather than using java.lang.Comparable, I preferred scala.math.Ordered and Scala View Bounds rather than Upper Bounds. That's certainly works thanks to Scala Implicit Conversions of primitive types to Rich Wrappers.
You can write a client program as follows:
import bitspoke.algo._
import scala.collection.mutable._
val sorter = new SelectionSorter[Int]
val buffer = ArrayBuffer(3, 0, 4, 2, 1)
sorter.sort(buffer)
assert(sorter.sorted(buffer))
I have a function that calculates the left and right node values for some collection of treeNodes given a simple node.id, node.parentId association. It's very simple and works well enough...but, well, I am wondering if there is a more idiomatic approach. Specifically is there a way to track the left/right values without using some externally tracked value but still keep the tasty recursion.
/*
* A tree node
*/
case class TreeNode(val id:String, val parentId: String){
var left: Int = 0
var right: Int = 0
}
/*
* a method to compute the left/right node values
*/
def walktree(node: TreeNode) = {
/*
* increment state for the inner function
*/
var c = 0
/*
* A method to set the increment state
*/
def increment = { c+=1; c } // poo
/*
* the tasty inner method
* treeNodes is a List[TreeNode]
*/
def walk(node: TreeNode): Unit = {
node.left = increment
/*
* recurse on all direct descendants
*/
treeNodes filter( _.parentId == node.id) foreach (walk(_))
node.right = increment
}
walk(node)
}
walktree(someRootNode)
Edit -
The list of nodes is taken from a database. Pulling the nodes into a proper tree would take too much time. I am pulling a flat list into memory and all I have is an association via node id's as pertains to parents and children.
Adding left/right node values allows me to get a snapshop of all children (and childrens children) with a single SQL query.
The calculation needs to run very quickly in order to maintain data integrity should parent-child associations change (which they do very frequently).
In addition to using the awesome Scala collections I've also boosted speed by using parallel processing for some pre/post filtering on the tree nodes. I wanted to find a more idiomatic way of tracking the left/right node values. After looking at the answer from #dhg it got even better. Using groupBy instead of a filter turns the algorithm (mostly?) linear instead of quadtratic!
val treeNodeMap = treeNodes.groupBy(_.parentId).withDefaultValue(Nil)
def walktree(node: TreeNode) = {
def walk(node: TreeNode, counter: Int): Int = {
node.left = counter
node.right =
treeNodeMap(node.id)
.foldLeft(counter+1) {
(result, curnode) => walk(curnode, result) + 1
}
node.right
}
walk(node,1)
}
Your code appears to be calculating an in-order traversal numbering.
I think what you want to make your code better is a fold that carries the current value downward and passes the updated value upward. Note that it might also be worth it to do a treeNodes.groupBy(_.parentId) before walktree to prevent you from calling treeNodes.filter(...) every time you call walk.
val treeNodes = List(TreeNode("1","0"),TreeNode("2","1"),TreeNode("3","1"))
val treeNodeMap = treeNodes.groupBy(_.parentId).withDefaultValue(Nil)
def walktree2(node: TreeNode) = {
def walk(node: TreeNode, c: Int): Int = {
node.left = c
val newC =
treeNodeMap(node.id) // get the children without filtering
.foldLeft(c+1)((c, child) => walk(child, c) + 1)
node.right = newC
newC
}
walk(node, 1)
}
And it produces the same result:
scala> walktree2(TreeNode("0","-1"))
scala> treeNodes.map(n => "(%s,%s)".format(n.left,n.right))
res32: List[String] = List((2,7), (3,4), (5,6))
That said, I would completely rewrite your code as follows:
case class TreeNode( // class is now immutable; `walktree` returns a new tree
id: String,
value: Int, // value to be set during `walktree`
left: Option[TreeNode], // recursively-defined structure
right: Option[TreeNode]) // makes traversal much simpler
def walktree(node: TreeNode) = {
def walk(nodeOption: Option[TreeNode], c: Int): (Option[TreeNode], Int) = {
nodeOption match {
case None => (None, c) // if this child doesn't exist, do nothing
case Some(node) => // if this child exists, recursively walk
val (newLeft, cLeft) = walk(node.left, c) // walk the left side
val newC = cLeft + 1 // update the value
val (newRight, cRight) = walk(node.right, newC) // walk the right side
(Some(TreeNode(node.id, newC, newLeft, newRight)), cRight)
}
}
walk(Some(node), 0)._1
}
Then you can use it like this:
walktree(
TreeNode("1", -1,
Some(TreeNode("2", -1,
Some(TreeNode("3", -1, None, None)),
Some(TreeNode("4", -1, None, None)))),
Some(TreeNode("5", -1, None, None))))
To produce:
Some(TreeNode(1,4,
Some(TreeNode(2,2,
Some(TreeNode(3,1,None,None)),
Some(TreeNode(4,3,None,None)))),
Some(TreeNode(5,5,None,None))))
If I get your algorithm correctly:
def walktree(node: TreeNode, c: Int): Int = {
node.left = c
val c2 = treeNodes.filter(_.parentId == node.id).foldLeft(c + 1) {
(cur, n) => walktree(n, cur)
}
node.right = c2 + 1
c2 + 2
}
walktree(new TreeNode("", ""), 0)
Off-by-one errors are likely to occur.
Few random thoughts (better suited for http://codereview.stackexchange.com):
try posting that compiles... We have to guess that is a sequence of TreeNode:
val is implicit for case classes:
case class TreeNode(val id: String, val parentId: String) {
Avoid explicit = and Unit for Unit functions:
def walktree(node: TreeNode) = {
def walk(node: TreeNode): Unit = {
Methods with side-effects should have ():
def increment = {c += 1; c}
This is terribly slow, consider storing list of children in the actual node:
treeNodes filter (_.parentId == node.id) foreach (walk(_))
More concice syntax would be treeNodes foreach walk:
treeNodes foreach (walk(_))
I have very large Iterators that I want to split into pieces. I have a predicate that looks at an item and returns true if it is the start of a new piece. I need the pieces to be Iterators, because even the pieces will not fit into memory. There are so many pieces that I would be wary of a recursive solution blowing out your stack. The situation is similar to this question, but I need Iterators instead of Lists, and the "sentinels" (items for which the predicate is true) occur (and should be included) at the beginning of a piece. The resulting iterators will only be used in order, though some may not be used at all, and they should only use O(1) memory. I imagine this means they should all share the same underlying iterator. Performance is important.
If I were to take a stab at a function signature, it would be this:
def groupby[T](iter: Iterator[T])(startsGroup: T => Boolean): Iterator[Iterator[T]] = ...
I would have loved to use takeWhile, but it loses the last element. I investigated span, but it buffers results. My current best idea involves BufferedIterator, but maybe there is a better way.
You'll know you've got it right because something like this doesn't crash your JVM:
groupby((1 to Int.MaxValue).iterator)(_ % (Int.MaxValue / 2) == 0).foreach(group => println(group.sum))
groupby((1 to Int.MaxValue).iterator)(_ % 10 == 0).foreach(group => println(group.sum))
You have an inherent problem. Iterable implies that you can get multiple iterators. Iterator implies that you can only pass through once. That means that your Iterable[Iterable[T]] should be able to produce Iterator[Iterable[T]]s. But when this returns an element--an Iterable[T]--and that asks for multiple iterators, the underlying single iterator can't comply without either caching the results of the list (too big) or calling the original iterable and going through absolutely everything again (very inefficient).
So, while you could do this, I think you should conceive of your problem in a different way.
If you could start with a Seq instead, you could grab subsets as ranges.
If you already know how you want to use your iterable, you could write a method
def process[T](source: Iterable[T])(starts: T => Boolean)(handlers: T => Unit *)
which increments through the set of handlers each time starts fires off a "true". If there's any way you can do your processing in one sweep, something like this is the way to go. (Your handlers will have to save state via mutable data structures or variables, however.)
If you can permit iteration on the outer list to break the inner list, you could have an Iterable[Iterator[T]] with the additional constraint that once you iterate to a later sub-iterator, all previous sub-iterators are invalid.
Here's a solution of the last type (from Iterator[T] to Iterator[Iterator[T]]; one can wrap this to make the outer layers Iterable instead).
class GroupedBy[T](source: Iterator[T])(starts: T => Boolean)
extends Iterator[Iterator[T]] {
private val underlying = source
private var saved: T = _
private var cached = false
private var starting = false
private def cacheNext() {
saved = underlying.next
starting = starts(saved)
cached = true
}
private def oops() { throw new java.util.NoSuchElementException("empty iterator") }
// Comment the next line if you do NOT want the first element to always start a group
if (underlying.hasNext) { cacheNext(); starting = true }
def hasNext = {
while (!(cached && starting) && underlying.hasNext) cacheNext()
cached && starting
}
def next = {
if (!(cached && starting) && !hasNext) oops()
starting = false
new Iterator[T] {
var presumablyMore = true
def hasNext = {
if (!cached && !starting && underlying.hasNext && presumablyMore) cacheNext()
presumablyMore = cached && !starting
presumablyMore
}
def next = {
if (presumablyMore && (cached || hasNext)) {
cached = false
saved
}
else oops()
}
}
}
}
Here's my solution using BufferedIterator. It doesn't let you skip iterators correctly, but it's fairly simple and functional. The first element(s) go into a group even if !startsGroup(first).
def groupby[T](iter: Iterator[T])(startsGroup: T => Boolean): Iterator[Iterator[T]] =
new Iterator[Iterator[T]] {
val base = iter.buffered
override def hasNext = base.hasNext
override def next() = Iterator(base.next()) ++ new Iterator[T] {
override def hasNext = base.hasNext && !startsGroup(base.head)
override def next() = if (hasNext) base.next() else Iterator.empty.next()
}
}
Update: Keeping a little state lets you skip iterators and prevent people from messing with previous ones:
def groupby[T](iter: Iterator[T])(startsGroup: T => Boolean): Iterator[Iterator[T]] =
new Iterator[Iterator[T]] {
val base = iter.buffered
var prev: Iterator[T] = Iterator.empty
override def hasNext = base.hasNext
override def next() = {
while (prev.hasNext) prev.next() // Exhaust previous iterator; take* and drop* do NOT always work!! (Jira SI-5002?)
prev = Iterator(base.next()) ++ new Iterator[T] {
var hasMore = true
override def hasNext = { hasMore = hasMore && base.hasNext && !startsGroup(base.head) ; hasMore }
override def next() = if (hasNext) base.next() else Iterator.empty.next()
}
prev
}
}
If you are looking at memory constraints then the following will work. You can only use it if your underlying iterable object supports views. This implementation will iterate over the Iterable and then generate IterableViews which can then be iterated over. This implementation does not care if the very first element tests as a start group since it will be regardless.
def groupby[T](iter: Iterable[T])(startsGroup: T => Boolean): Iterable[Iterable[T]] = new Iterable[Iterable[T]] {
def iterator = new Iterator[Iterable[T]] {
val i = iter.iterator
var index = 0
var nextView: IterableView[T, Iterable[T]] = getNextView()
private def getNextView() = {
val start = index
var hitStartGroup = false
while ( i.hasNext && ! hitStartGroup ) {
val next = i.next()
index += 1
hitStartGroup = ( index > 1 && startsGroup( next ) )
}
if ( hitStartGroup ) {
if ( start == 0 ) iter.view( start, index - 1 )
else iter.view( start - 1, index - 1 )
} else { // hit end
if ( start == index ) null
else if ( start == 0 ) iter.view( start, index )
else iter.view( start - 1, index )
}
}
def hasNext = nextView != null
def next() = {
if ( nextView != null ) {
val next = nextView
nextView = getNextView()
next
} else null
}
}
}
You can maintain low memory foot-print by using Streams. Use result.toIterator, if you an iterator again.
With streams, there's no mutable state, only a single conditional and it's nearly as concise as Jay Hacker's solution.
def batchBy[A,B](iter: Iterator[A])(f: A => B): Stream[(B, Iterator[A])] = {
val base = iter.buffered
val empty = Stream.empty[(B, Iterator[A])]
def getBatch(key: B) = {
Iterator(base.next()) ++ new Iterator[A] {
def hasNext: Boolean = base.hasNext && (f(base.head) == key)
def next(): A = base.next()
}
}
def next(skipList: Option[Iterator[A]] = None): Stream[(B, Iterator[A])] = {
skipList.foreach{_.foreach{_=>}}
if (base.isEmpty) empty
else {
val key = f(base.head)
val batch = getBatch(key)
Stream.cons((key, batch), next(Some(batch)))
}
}
next()
}
I ran the tests:
scala> batchBy((1 to Int.MaxValue).iterator)(_ % (Int.MaxValue / 2) == 0)
.foreach{case(_,group) => println(group.sum)}
-1610612735
1073741823
-536870909
2147483646
2147483647
The second test prints too much to paste to Stack Overflow.
import scala.collection.mutable.ArrayBuffer
object GroupingIterator {
/**
* Create a new GroupingIterator with a grouping predicate.
*
* #param it The original iterator
* #param p Predicate controlling the grouping
* #tparam A Type of elements iterated
* #return A new GroupingIterator
*/
def apply[A](it: Iterator[A])(p: (A, IndexedSeq[A]) => Boolean): GroupingIterator[A] =
new GroupingIterator(it)(p)
}
/**
* Group elements in sequences of contiguous elements that satisfy a predicate. The predicate
* tests each single potential next element of the group with the help of the elements grouped so far.
* If it returns true, the potential next element is added to the group, otherwise
* a new group is started with the potential next element as first element
*
* #param self The original iterator
* #param p Predicate controlling the grouping
* #tparam A Type of elements iterated
*/
class GroupingIterator[+A](self: Iterator[A])(p: (A, IndexedSeq[A]) => Boolean) extends Iterator[IndexedSeq[A]] {
private[this] val source = self.buffered
private[this] val buffer: ArrayBuffer[A] = ArrayBuffer()
def hasNext: Boolean = source.hasNext
def next(): IndexedSeq[A] = {
if (hasNext)
nextGroup()
else
Iterator.empty.next()
}
private[this] def nextGroup(): IndexedSeq[A] = {
assert(source.hasNext)
buffer.clear()
buffer += source.next
while (source.hasNext && p(source.head, buffer)) {
buffer += source.next
}
buffer.toIndexedSeq
}
}