IndexedSeq-based equivalent of Stream? - scala

I have a lazily-calculated sequence of objects, where the lazy calculation depends only on the index (not the previous items) and some constant parameters (p:Bar below). I'm currently using a Stream; however, computing stream.init (every element before the one I actually want) is typically wasteful.
However, I really like that using Stream[Foo] = ... gets me out of implementing a cache, and has very light declaration syntax while still providing all the sugar (like stream(n) gets element n). Then again, I could just be using the wrong declaration:
class FooSrcCache(p:Bar) {
val src : Stream[FooSrc] = {
def error() : FooSrc = FooSrc(0,p)
def loop(i: Int): Stream[FooSrc] = {
FooSrc(i,p) #:: loop(i + 1)
}
error() #:: loop(1)
}
def apply(max: Int) = src(max)
}
Is there a Stream-comparable base Scala class that is indexed instead of linear?

PagedSeq should do the job for you:
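import scala.collection.immutable.PagedSeq
class FooSrcCache(p:Bar) {
// PagedSeq calls fill(buf, start, len): write up to len elements into buf,
// beginning at offset start, and return how many were written.
// start is an offset into the current page, not a global index.
private var next = 0 // global index of the next element to produce
private def fill(buf: Array[FooSrc], start: Int, len: Int) = {
for (i <- start until start + len) {
buf(i) = FooSrc(next,p)
next += 1
}
len
}
val src = new PagedSeq[FooSrc](fill _)
def apply(max: Int) = src(max)
}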
Note that PagedSeq fills its buffer a page at a time, so this might calculate FooSrc with higher indices than you requested.
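If computing extra elements is not acceptable, a plain memoizing map is another option. The following is only a minimal sketch under stated assumptions: Bar and FooSrc are hypothetical stand-ins for the question's types, IndexedFooCache and its computed counter are invented for illustration, and the cache is not thread-safe.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for the question's types.
case class Bar(scale: Int)
case class FooSrc(i: Int, p: Bar)

// Computes each element on first access only, keyed by index.
// Nothing before or after the requested index is evaluated.
class IndexedFooCache(p: Bar) {
  private val cache = mutable.HashMap.empty[Int, FooSrc]
  var computed = 0 // counts real computations, for illustration only

  def apply(i: Int): FooSrc =
    cache.getOrElseUpdate(i, { computed += 1; FooSrc(i, p) })
}

val fooCache = new IndexedFooCache(Bar(2))
val first = fooCache(1000)  // computed now
val second = fooCache(1000) // served from the cache
```

Unlike Stream, this keeps the cache(n) call syntax without materialising elements 0 to n-1, at the cost of writing the cache by hand.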

Cats Writer Vector is empty

I wrote this simple program in my attempt to learn how Cats Writer works
import cats.data.Writer
import cats.syntax.applicative._
import cats.syntax.writer._
import cats.instances.vector._
object WriterTest extends App {
type Logged2[A] = Writer[Vector[String], A]
Vector("started the program").tell
val output1 = calculate1(10)
val foo = new Foo()
val output2 = foo.calculate2(20)
val (log, sum) = (output1 + output2).pure[Logged2].run
println(log)
println(sum)
def calculate1(x : Int) : Int = {
Vector("came inside calculate1").tell
val output = 10 + x
Vector(s"Calculated value ${output}").tell
output
}
}
class Foo {
def calculate2(x: Int) : Int = {
Vector("came inside calculate 2").tell
val output = 10 + x
Vector(s"calculated ${output}").tell
output
}
}
The program works and the output is
> run-main WriterTest
[info] Compiling 1 Scala source to /Users/Cats/target/scala-2.11/classes...
[info] Running WriterTest
Vector()
50
[success] Total time: 1 s, completed Jan 21, 2017 8:14:19 AM
But why is the vector empty? Shouldn't it contain all the strings on which I used the "tell" method?
When you call tell on your Vectors, each call creates a Writer[Vector[String], Unit]. However, you never actually do anything with those Writers; you just discard them. Further, you call pure to create your final Writer, which simply creates a Writer with an empty Vector. You have to combine the writers together in a chain that carries your value and messages along.
type Logged[A] = Writer[Vector[String], A]
val (log, sum) = (for {
_ <- Vector("started the program").tell
output1 <- calculate1(10)
foo = new Foo()
output2 <- foo.calculate2(20)
} yield output1 + output2).run
def calculate1(x: Int): Logged[Int] = for {
_ <- Vector("came inside calculate1").tell
output = 10 + x
_ <- Vector(s"Calculated value ${output}").tell
} yield output
class Foo {
def calculate2(x: Int): Logged[Int] = for {
_ <- Vector("came inside calculate2").tell
output = 10 + x
_ <- Vector(s"calculated ${output}").tell
} yield output
}
Note the use of for-comprehension notation. The definition of calculate1 is really
def calculate1(x: Int): Logged[Int] = Vector("came inside calculate1").tell.flatMap { _ =>
val output = 10 + x
Vector(s"Calculated value ${output}").tell.map { _ => output }
}
flatMap is the monadic bind operation: it takes a monadic value (in this case a Writer) and a function that produces the next one, and joins the two into a new Writer. In this case, it makes a Writer containing the concatenation of the logs and the value of the one on the right.
Note how there are no side effects. There is no global state by which Writer can remember all your calls to tell. You instead make many Writers and join them together with flatMap to get one big one at the end.
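To make the log concatenation concrete without depending on Cats, here is a minimal sketch of a hand-rolled writer. MiniWriter is a hypothetical, simplified stand-in specialised to Vector[String] logs; it is not how cats.data.Writer is implemented, only an illustration of the same idea.

```scala
// Hypothetical, simplified stand-in for a Writer with a Vector[String] log.
final case class MiniWriter[A](log: Vector[String], value: A) {
  def map[B](f: A => B): MiniWriter[B] = MiniWriter(log, f(value))

  // Run the next step, then append its log to ours and keep its value.
  def flatMap[B](f: A => MiniWriter[B]): MiniWriter[B] = {
    val next = f(value)
    MiniWriter(log ++ next.log, next.value)
  }
}

object MiniWriter {
  def tell(msg: String): MiniWriter[Unit] = MiniWriter(Vector(msg), ())
  def pure[A](a: A): MiniWriter[A] = MiniWriter(Vector.empty, a)
}

def calculate1(x: Int): MiniWriter[Int] = {
  val output = 10 + x
  for {
    _ <- MiniWriter.tell("came inside calculate1")
    _ <- MiniWriter.tell(s"Calculated value $output")
  } yield output
}

// A tell whose result is dropped contributes nothing, as in the question.
def discarding(x: Int): MiniWriter[Int] = {
  MiniWriter.tell("this message is lost")
  MiniWriter.pure(10 + x)
}
```

Here calculate1(10) carries both messages in its log, while discarding(10) ends up with an empty log because the Writer returned by tell is never chained.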
The problem with your example code is that you're not using the result of the tell method.
If you take a look at its signature, you'll see this:
final class WriterIdSyntax[A](val a: A) extends AnyVal {
def tell: Writer[A, Unit] = Writer(a, ())
}
It is clear that tell returns a Writer[A, Unit], which is immediately discarded because you didn't assign it to a value.
The proper way to use a Writer (and any monad in Scala) is through its flatMap method. It would look similar to this:
println(
Vector("started the program").tell.flatMap { _ =>
15.pure[Logged2].flatMap { i =>
Writer(Vector("ended program"), i)
}
}
)
The code above, when executed, will give you this:
WriterT((Vector(started the program, ended program),15))
As you can see, both messages and the int are stored in the result.
Now this is a bit ugly, and Scala actually provides a better way to do this: for-comprehensions. A for-comprehension is a bit of syntactic sugar that allows us to write the same code this way:
println(
for {
_ <- Vector("started the program").tell
i <- 15.pure[Logged2]
_ <- Vector("ended program").tell
} yield i
)
Now, going back to your example, I would recommend changing the return type of calculate1 and calculate2 to Writer[Vector[String], Int] and then making your application compile using what I wrote above.

How to aggregateByKey with custom class for frequency distribution?

I am trying to create a frequency distribution.
My data is in the following pattern (ColumnIndex, (Value, countOfValue)) of type (Int, (Any, Long)). For instance, (1, (A, 10)) means for column index 1, there are 10 A's.
My goal is to get the top 100 values for each of my column indexes (keys).
Right away, I can make the workload less compute-intensive with an initial filter:
val freqNumDist = numRDD.filter(x => x._2._2 > 1)
I found an interesting example of a class that seems to fit my use case:
class TopNList (val maxSize:Int) extends Serializable {
val topNCountsForColumnArray = new mutable.ArrayBuffer[(Any, Long)]
var lowestColumnCountIndex:Int = -1
var lowestValue = Long.MaxValue
def add(newValue:Any, newCount:Long): Unit = {
if (topNCountsForColumnArray.length < maxSize -1) {
topNCountsForColumnArray += ((newValue, newCount))
} else if (topNCountsForColumnArray.length == maxSize) {
updateLowestValue
} else {
if (newCount > lowestValue) {
topNCountsForColumnArray.insert(lowestColumnCountIndex, (newValue, newCount))
updateLowestValue
}
}
}
def updateLowestValue: Unit = {
var index = 0
topNCountsForColumnArray.foreach{ r =>
if (r._2 < lowestValue) {
lowestValue = r._2
lowestColumnCountIndex = index
}
index+=1
}
}
}
So now I was thinking of putting together an aggregateByKey call that uses this class to get my top 100 values. The problem is that I am unsure how to use this class in aggregateByKey to accomplish this goal.
val initFreq:TopNList = new TopNList(100)
def freqSeq(u: (TopNList), v:(Double, Long)) = (
u.add(v._1, v._2)
)
def freqComb(u1: TopNList, u2: TopNList) = (
u2.topNCountsForColumnArray.foreach(r => u1.add(r._1, r._2))
)
val freqNumDist = numRDD.filter(x => x._2._2 > 1).aggregateByKey(initFreq)(freqSeq, freqComb)
The obvious problem is that nothing is returned by the functions I am using. So I am wondering how to modify this class or do I need to think about this in a whole new light and just cherry pick some of the functions out of this class and add them to the functions I am using for the aggregateByKey?
I'm either thinking about classes wrong or the entire aggregateByKey or both!
Your aggregation functions (freqSeq, freqComb) return Unit, while aggregateByKey expects them to return TopNList.
If you intentionally keep the style of your solution, the relevant implementation should be:
def freqSeq(u: TopNList, v:(Any, Long)) : TopNList = {
u.add(v._1, v._2) // operation gives void result (Unit)
u // this one of TopNList type
}
def freqComb(u1: TopNList, u2: TopNList) : TopNList = {
u2.topNCountsForColumnArray.foreach (r => u1.add (r._1, r._2) )
u1
}
Take a look at the signature of aggregateByKey in PairRDDFunctions to see what it expects:
def aggregateByKey[U](zeroValue : U)(seqOp : scala.Function2[U, V, U], combOp : scala.Function2[U, U, U])(implicit evidence$3 : scala.reflect.ClassTag[U]) : org.apache.spark.rdd.RDD[scala.Tuple2[K, U]] = { /* compiled code */ }
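The shape of seqOp: (U, V) => U and combOp: (U, U) => U can be tried out without Spark by folding over plain collections. The sketch below is an illustration under stated assumptions: TopN is a hypothetical immutable accumulator (not the TopNList class from the question), and the two lists merely simulate two partitions holding values for one key.

```scala
// Hypothetical accumulator: keeps at most maxSize (value, count) pairs
// with the highest counts, sorted by descending count.
final case class TopN(maxSize: Int, items: Vector[(Any, Long)] = Vector.empty) {
  def add(value: Any, count: Long): TopN =
    copy(items = (items :+ ((value, count))).sortBy(-_._2).take(maxSize))

  def merge(other: TopN): TopN =
    other.items.foldLeft(this)((acc, vc) => acc.add(vc._1, vc._2))
}

// Same shapes as aggregateByKey's two functions: both return the accumulator.
def seqOp(acc: TopN, v: (Any, Long)): TopN = acc.add(v._1, v._2)
def combOp(a: TopN, b: TopN): TopN = a.merge(b)

// Simulate two partitions for a single key.
val partition1 = List[(Any, Long)](("A", 10L), ("B", 3L))
val partition2 = List[(Any, Long)](("C", 7L), ("D", 1L))
val zero = TopN(2)

val top2 = combOp(partition1.foldLeft(zero)(seqOp), partition2.foldLeft(zero)(seqOp))
```

With a real pair RDD, the same two functions would be passed as numRDD.aggregateByKey(zero)(seqOp, combOp); returning a new accumulator instead of mutating one sidesteps the Unit problem entirely.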

Incrementing 'i' in scala for loop by differing amounts depending on circumstance

I want to write a for loop in scala, but the counter should get incremented by more than one (the amount is variable) in some special cases.
You can do this with a combination of a filter and an external var. Here is an example:
var nextValidVal = 0
for (i <- 0 to 99; if i >= nextValidVal) {
var amountToSkip = 0
// Whatever this loop is for
nextValidVal = if (amountToSkip > 0) i + amountToSkip + 1 else nextValidVal
}
So in the main body of your loop, you can set amountToSkip to n according to your conditions. The next n values of i's sequence will be skipped.
If your sequence is pulled from some other kind of sequence, you could do it like this
var skip = 0
for (o <- someCollection if { val res = skip == 0; skip = if (!res) skip - 1 else 0; res } ) {
// Do stuff
}
If you set skip to a positive value n in the body of the loop, the next n elements of the sequence will be skipped.
Of course, this is terribly imperative and side-effecty. I would look for other ways to do this wherever possible, by mapping, filtering, or folding the original sequence.
You could implement your own stream to reflect step, for example:
import scala.collection.immutable.Stream
import ForStream._
object Test {
def main(args: Array[String]): Unit = {
val range = 0 to 20 by 1 withVariableStep; // in case you like definition through range
//val range = ForStream(0,20,1) // direct definition
for (i<- range) {
println(s"i=$i")
range.step = range.step + 1
}
}
}
object ForStream{
implicit def toForStream(range: Range): ForStream = new ForStreamMaster(range.start, range.end,range.step)
def apply(head:Int, end:Int, step:Int) = new ForStreamMaster(head, end,step)
}
abstract class ForStream(override val head: Int, val end: Int, var step: Int) extends Stream[Int] {
override val tailDefined = false
override val isEmpty = head > end
def withVariableStep = this
}
class ForStreamMaster(_head: Int, _end: Int, _Step: Int) extends ForStream(_head, _end,_Step){
override def tail = if (isEmpty) Stream.Empty else new ForStreamSlave(head + step, end, step, this)
}
class ForStreamSlave(_head: Int, _end: Int, _step: Int, val master: ForStream) extends ForStream(_head, _end,_step){
override def tail = if (isEmpty) Stream.Empty else new ForStreamSlave(head + master.step, end, master.step, master)
}
This prints:
i=0
i=2
i=5
i=9
i=14
i=20
You can define a ForStream from a Range with implicits, or define it directly. But be careful:
You are not iterating over a Range anymore!
Stream should be immutable, but step is mutable!
Also, as @om-nom-nom noted, this might be better implemented with recursion.
Why not use the do-while loop?
var x = 0;
do{
...something
if(condition){change x to something else}
else{something else}
x+=1
}while(some condition for x)
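The recursive alternative mentioned in this thread could look roughly like the sketch below. It is an illustration, not anyone's posted solution: visit and its step parameter are hypothetical names, and the rule used in the example (advance by 3 after an even index, by 1 otherwise) is made up.

```scala
import scala.annotation.tailrec

// Visits indices from start to end (inclusive). After each index i,
// step(i) decides how far to advance; the advance is at least 1.
def visit(start: Int, end: Int)(step: Int => Int): List[Int] = {
  @tailrec
  def loop(i: Int, acc: List[Int]): List[Int] =
    if (i > end) acc.reverse
    else loop(i + math.max(1, step(i)), i :: acc)
  loop(start, Nil)
}

// Made-up rule: jump by 3 after an even index, by 1 after an odd one.
val visited = visit(0, 20)(i => if (i % 2 == 0) 3 else 1)
// visited == List(0, 3, 4, 7, 8, 11, 12, 15, 16, 19, 20)
```

No var is needed: the next index is passed as an argument to the following iteration.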

How to add 'Array[Ordered[Any]]' as a method parameter

Below is an implementation of Selection sort written in Scala.
The line ss.sort(arr) causes this error :
type mismatch; found : Array[String] required: Array[Ordered[Any]]
Since Ordered is inherited by StringOps, should this type not be inferred?
How can I pass the array of Strings to the sort() method?
Here is the complete code :
object SelectionSortTest {
def main(args: Array[String]){
val arr = Array("Hello","World")
val ss = new SelectionSort()
ss.sort(arr)
}
}
class SelectionSort {
def sort(a : Array[Ordered[Any]]) = {
var N = a.length
for (i <- 0 until N) {
var min = i
for(j <- i + 1 until N){
if( less(a(j) , a(min))){
min = j
}
}
exchange(a , i , min) // swap once per pass, after the minimum is known
}
}
def less(v : Ordered[Any] , w : Ordered[Any]) = {
v.compareTo(w) < 0
}
def exchange(a : Array[Ordered[Any]] , i : Integer , j : Integer) = {
var swap : Ordered[Any] = a(i)
a(i) = a(j)
a(j) = swap
}
}
Array is invariant. You cannot use an Array[A] as an Array[B] even if A is a subtype of B. For the reasons, see: Why are Arrays invariant, but Lists covariant?
Ordered is invariant as well, so your implementation of less will not work either.
You should make your implementation generic in the following way:
object SelectionSortTest {
def main(args: Array[String]){
val arr = Array("Hello","World")
val ss = new SelectionSort()
ss.sort(arr)
}
}
class SelectionSort {
def sort[T <% Ordered[T]](a : Array[T]) = {
var N = a.length
for (i <- 0 until N) {
var min = i
for(j <- i + 1 until N){
if(a(j) < a(min)){ // call less directly on Ordered[T]
min = j
}
}
exchange(a , i , min) // swap once per pass, after the minimum is known
}
}
def exchange[T](a : Array[T] , i : Integer , j : Integer) = {
var swap = a(i)
a(i) = a(j)
a(j) = swap
}
}
The somewhat bizarre statement T <% Ordered[T] means "any type T that can be implicitly converted to Ordered[T]". This ensures that you can still use the less-than operator.
See this for details:
What are Scala context and view bounds?
The answer by @gzm0 (with some very nice links) suggests Ordered. I'm going to complement it with an answer covering Ordering, which provides equivalent functionality without imposing as much on your classes.
Let's adjust the sort method to accept an array of type 'T' for which an Ordering implicit instance is defined.
def sort[T : Ordering](a: Array[T]) = {
val ord = implicitly[Ordering[T]]
import ord._ // now comparison operations such as '<' are available for 'T'
// ...
if (a(j) < a(min))
// ...
}
The [T : Ordering] and implicitly[Ordering[T]] combo is equivalent to an implicit parameter of type Ordering[T]:
def sort[T](a: Array[T])(implicit ord: Ordering[T]) = {
import ord._
// ...
}
Why is this useful?
Imagine you are provided with a case class Account(balance: Int) by some third party. You can now add an Ordering for it like so:
// somewhere in scope
implicit val accountOrdering = new Ordering[Account] {
def compare(x: Account, y: Account) = x.balance - y.balance
}
// or, more simply
implicit val accountOrdering: Ordering[Account] = Ordering by (_.balance)
As long as that instance is in scope, you should be able to use sort(accounts).
If you want to use some different ordering, you can also provide it explicitly, like so: sort(accounts)(otherOrdering).
Note that this isn't very different from providing an implicit conversion to Ordered (at least not within the context of this question).
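Putting the pieces of this answer together, a complete version might look like the following sketch. The function name selectionSort and the sample data are assumptions made for illustration; Account is the hypothetical third-party class described above.

```scala
// Hypothetical third-party class, as described in the text.
case class Account(balance: Int)

// In-place selection sort for any T with an Ordering in implicit scope.
def selectionSort[T: Ordering](a: Array[T]): Unit = {
  val ord = implicitly[Ordering[T]]
  for (i <- 0 until a.length) {
    var min = i
    for (j <- i + 1 until a.length)
      if (ord.lt(a(j), a(min))) min = j
    // Swap once per pass, after the minimum of the remainder is known.
    val tmp = a(i); a(i) = a(min); a(min) = tmp
  }
}

implicit val accountOrdering: Ordering[Account] = Ordering.by(_.balance)

val words = Array("World", "Hello")
selectionSort(words) // uses the built-in Ordering[String]

val accounts = Array(Account(30), Account(10), Account(20))
selectionSort(accounts) // uses accountOrdering

// An explicit Ordering overrides the implicit one.
val descending = Array(Account(30), Account(10), Account(20))
selectionSort(descending)(accountOrdering.reverse)
```

The same method sorts Strings and Accounts without either type having to extend Ordered.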
When coding Scala, I usually prefer a functional style (combinators or recursion) over an imperative style (variables and iteration). For this specific problem, however, old-school imperative nested loops result in simpler and faster code.
I don't think falling back to imperative style is a mistake for certain classes of problems, such as sorting algorithms, which usually transform the input buffer in place (more like a procedure) rather than producing a new sorted collection.
Here is my solution:
package bitspoke.algo
import scala.math.Ordered
import scala.collection.mutable.Buffer
abstract class Sorter[T <% Ordered[T]] {
// algorithm provided by subclasses
def sort(buffer : Buffer[T]) : Unit
// check if the buffer is sorted
def sorted(buffer : Buffer[T]) = buffer.isEmpty || buffer.view.zip(buffer.tail).forall { t => t._2 >= t._1 } // >= so that duplicates count as sorted
// swap elements in buffer
def swap(buffer : Buffer[T], i:Int, j:Int) {
val temp = buffer(i)
buffer(i) = buffer(j)
buffer(j) = temp
}
}
class SelectionSorter[T <% Ordered[T]] extends Sorter[T] {
def sort(buffer : Buffer[T]) : Unit = {
for (i <- 0 until buffer.length) {
var min = i
for (j <- i until buffer.length) {
if (buffer(j) < buffer(min))
min = j
}
swap(buffer, i, min)
}
}
}
As you can see, to achieve parametric polymorphism I preferred scala.math.Ordered over java.lang.Comparable, and Scala view bounds over upper bounds. This works thanks to Scala's implicit conversions of primitive types to rich wrappers.
You can write a client program as follows:
import bitspoke.algo._
import scala.collection.mutable._
val sorter = new SelectionSorter[Int]
val buffer = ArrayBuffer(3, 0, 4, 2, 1)
sorter.sort(buffer)
assert(sorter.sorted(buffer))

How can I make this method more Scalalicious

I have a function that calculates the left and right node values for some collection of treeNodes given a simple node.id, node.parentId association. It's very simple and works well enough...but, well, I am wondering if there is a more idiomatic approach. Specifically is there a way to track the left/right values without using some externally tracked value but still keep the tasty recursion.
/*
* A tree node
*/
case class TreeNode(val id:String, val parentId: String){
var left: Int = 0
var right: Int = 0
}
/*
* a method to compute the left/right node values
*/
def walktree(node: TreeNode) = {
/*
* increment state for the inner function
*/
var c = 0
/*
* A method to set the increment state
*/
def increment = { c+=1; c } // poo
/*
* the tasty inner method
* treeNodes is a List[TreeNode]
*/
def walk(node: TreeNode): Unit = {
node.left = increment
/*
* recurse on all direct descendants
*/
treeNodes filter( _.parentId == node.id) foreach (walk(_))
node.right = increment
}
walk(node)
}
walktree(someRootNode)
Edit -
The list of nodes is taken from a database. Pulling the nodes into a proper tree would take too much time. I am pulling a flat list into memory and all I have is an association via node id's as pertains to parents and children.
Adding left/right node values allows me to get a snapshot of all children (and children's children) with a single SQL query.
The calculation needs to run very quickly in order to maintain data integrity should parent-child associations change (which they do very frequently).
In addition to using the awesome Scala collections, I've also boosted speed by using parallel processing for some pre/post filtering on the tree nodes. I wanted to find a more idiomatic way of tracking the left/right node values. After looking at the answer from @dhg it got even better. Using groupBy instead of filter turns the algorithm (mostly?) linear instead of quadratic!
val treeNodeMap = treeNodes.groupBy(_.parentId).withDefaultValue(Nil)
def walktree(node: TreeNode) = {
def walk(node: TreeNode, counter: Int): Int = {
node.left = counter
node.right =
treeNodeMap(node.id)
.foldLeft(counter+1) {
(result, curnode) => walk(curnode, result) + 1
}
node.right
}
walk(node,1)
}
Your code appears to be calculating an in-order traversal numbering.
I think what you want to make your code better is a fold that carries the current value downward and passes the updated value upward. Note that it might also be worth it to do a treeNodes.groupBy(_.parentId) before walktree to prevent you from calling treeNodes.filter(...) every time you call walk.
val treeNodes = List(TreeNode("1","0"),TreeNode("2","1"),TreeNode("3","1"))
val treeNodeMap = treeNodes.groupBy(_.parentId).withDefaultValue(Nil)
def walktree2(node: TreeNode) = {
def walk(node: TreeNode, c: Int): Int = {
node.left = c
val newC =
treeNodeMap(node.id) // get the children without filtering
.foldLeft(c+1)((c, child) => walk(child, c) + 1)
node.right = newC
newC
}
walk(node, 1)
}
And it produces the same result:
scala> walktree2(TreeNode("0","-1"))
scala> treeNodes.map(n => "(%s,%s)".format(n.left,n.right))
res32: List[String] = List((2,7), (3,4), (5,6))
That said, I would completely rewrite your code as follows:
case class TreeNode( // class is now immutable; `walktree` returns a new tree
id: String,
value: Int, // value to be set during `walktree`
left: Option[TreeNode], // recursively-defined structure
right: Option[TreeNode]) // makes traversal much simpler
def walktree(node: TreeNode) = {
def walk(nodeOption: Option[TreeNode], c: Int): (Option[TreeNode], Int) = {
nodeOption match {
case None => (None, c) // if this child doesn't exist, do nothing
case Some(node) => // if this child exists, recursively walk
val (newLeft, cLeft) = walk(node.left, c) // walk the left side
val newC = cLeft + 1 // update the value
val (newRight, cRight) = walk(node.right, newC) // walk the right side
(Some(TreeNode(node.id, newC, newLeft, newRight)), cRight)
}
}
walk(Some(node), 0)._1
}
Then you can use it like this:
walktree(
TreeNode("1", -1,
Some(TreeNode("2", -1,
Some(TreeNode("3", -1, None, None)),
Some(TreeNode("4", -1, None, None)))),
Some(TreeNode("5", -1, None, None))))
To produce:
Some(TreeNode(1,4,
Some(TreeNode(2,2,
Some(TreeNode(3,1,None,None)),
Some(TreeNode(4,3,None,None)))),
Some(TreeNode(5,5,None,None))))
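A middle ground between the two versions above keeps the flat id/parentId list but returns the numbering as a map instead of mutating the nodes. This is only a sketch under stated assumptions: Node and number are hypothetical names, and the numbering follows the same counting rule as walktree2.

```scala
// Hypothetical immutable node: no mutable left/right fields.
case class Node(id: String, parentId: String)

// Returns id -> (left, right) for the root and all of its descendants.
def number(nodes: List[Node], rootId: String): Map[String, (Int, Int)] = {
  val children = nodes.groupBy(_.parentId).withDefaultValue(Nil)

  // Returns the updated map and the right value assigned to id.
  def walk(id: String, left: Int,
           acc: Map[String, (Int, Int)]): (Map[String, (Int, Int)], Int) = {
    val (acc2, right) = children(id).foldLeft((acc, left + 1)) {
      case ((m, c), child) =>
        val (m2, r) = walk(child.id, c, m)
        (m2, r + 1) // the next sibling starts after this child's right value
    }
    (acc2.updated(id, (left, right)), right)
  }

  walk(rootId, 1, Map.empty)._1
}

val flat = List(Node("1", "0"), Node("2", "1"), Node("3", "1"))
val numbering = number(flat, "0")
```

For the same three nodes as in the walktree2 example, nodes "1", "2" and "3" receive (2,7), (3,4) and (5,6), and the counter is threaded through the fold rather than held in a var.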
If I get your algorithm correctly:
def walktree(node: TreeNode, c: Int): Int = {
node.left = c
val c2 = treeNodes.filter(_.parentId == node.id).foldLeft(c + 1) {
(cur, n) => walktree(n, cur)
}
node.right = c2 + 1
c2 + 2
}
walktree(new TreeNode("", ""), 0)
Off-by-one errors are likely to occur.
A few random thoughts (better suited for http://codereview.stackexchange.com):
Try posting code that compiles. We have to guess that treeNodes is a sequence of TreeNode.
val is implicit for case classes:
case class TreeNode(val id: String, val parentId: String) {
Avoid explicit = and Unit for Unit functions:
def walktree(node: TreeNode) = {
def walk(node: TreeNode): Unit = {
Methods with side-effects should have ():
def increment = {c += 1; c}
This is terribly slow, consider storing list of children in the actual node:
treeNodes filter (_.parentId == node.id) foreach (walk(_))
More concise syntax would be treeNodes foreach walk:
treeNodes foreach (walk(_))