How to reason about stack safety in Scala Cats / fs2?

Here is a piece of code from the documentation for fs2. The function go is recursive. The question is: how do we know that it is stack safe, and how can we reason about whether any given function is stack safe?
import fs2._
// import fs2._

def tk[F[_], O](n: Long): Pipe[F, O, O] = {
  def go(s: Stream[F, O], n: Long): Pull[F, O, Unit] = {
    s.pull.uncons.flatMap {
      case Some((hd, tl)) =>
        hd.size match {
          case m if m <= n => Pull.output(hd) >> go(tl, n - m)
          case m => Pull.output(hd.take(n.toInt)) >> Pull.done
        }
      case None => Pull.done
    }
  }
  in => go(in, n).stream
}
// tk: [F[_], O](n: Long)fs2.Pipe[F,O,O]

Stream(1,2,3,4).through(tk(2)).toList
// res33: List[Int] = List(1, 2)
Would it also be stack safe if we call go from another method?
def tk[F[_], O](n: Long): Pipe[F, O, O] = {
  def go(s: Stream[F, O], n: Long): Pull[F, O, Unit] = {
    s.pull.uncons.flatMap {
      case Some((hd, tl)) =>
        hd.size match {
          case m if m <= n => otherMethod(...)
          case m => Pull.output(hd.take(n.toInt)) >> Pull.done
        }
      case None => Pull.done
    }
  }
  def otherMethod(...) = {
    Pull.output(hd) >> go(tl, n - m)
  }
  in => go(in, n).stream
}

My previous answer here gives some background information that might be useful. The basic idea is that some effect types have flatMap implementations that support stack-safe recursion directly—you can nest flatMap calls either explicitly or through recursion as deeply as you want and you won't overflow the stack.
For some effect types it's not possible for flatMap to be stack-safe, because of the semantics of the effect. In other cases it may be possible to write a stack-safe flatMap, but the implementers might have decided not to because of performance or other considerations.
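For a concrete example of the first kind, cats' Eval is trampolined, so recursing through its flatMap is safe. A minimal sketch, assuming only that cats is on the classpath:

import cats.Eval

// Build a million nested flatMap layers; Eval reifies the recursion
// and runs it in a loop, so forcing the value doesn't grow the stack.
def countdown(n: Int): Eval[Int] =
  if (n <= 0) Eval.now(0)
  else Eval.now(n).flatMap(_ => countdown(n - 1))

countdown(1000000).value // 0, no StackOverflowError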
Unfortunately there's no standard (or even conventional) way to know whether the flatMap for a given type is stack-safe. Cats does include a tailRecM operation that should provide stack-safe monadic recursion for any lawful monadic effect type, and sometimes looking at a tailRecM implementation that's known to be lawful can provide some hints about whether a flatMap is stack-safe. In the case of Pull it looks like this:
def tailRecM[A, B](a: A)(f: A => Pull[F, O, Either[A, B]]) =
  f(a).flatMap {
    case Left(a)  => tailRecM(a)(f)
    case Right(b) => Pull.pure(b)
  }
This tailRecM is just recursing through flatMap, and we know that Pull's Monad instance is lawful, which is pretty good evidence that Pull's flatMap is stack-safe. The one complicating factor here is that the instance for Pull has an ApplicativeError constraint on F that Pull's flatMap doesn't, but in this case that doesn't change anything.
So the tk implementation here is stack-safe because flatMap on Pull is stack-safe, and we know that from looking at its tailRecM implementation. (If we dug a little deeper we could figure out that flatMap is stack-safe because Pull is essentially a wrapper for FreeC, which is trampolined.)
It probably wouldn't be terribly hard to rewrite tk in terms of tailRecM, although we'd have to add the otherwise unnecessary ApplicativeError constraint. I'm guessing the authors of the documentation chose not to do that for clarity, and because they knew Pull's flatMap is fine.
Update: here's a fairly mechanical tailRecM translation:
import cats.ApplicativeError
import fs2._

def tk[F[_], O](n: Long)(implicit F: ApplicativeError[F, Throwable]): Pipe[F, O, O] =
  in => Pull.syncInstance[F, O].tailRecM((in, n)) {
    case (s, n) => s.pull.uncons.flatMap {
      case Some((hd, tl)) =>
        hd.size match {
          case m if m <= n => Pull.output(hd).as(Left((tl, n - m)))
          case m => Pull.output(hd.take(n.toInt)).as(Right(()))
        }
      case None => Pull.pure(Right(()))
    }
  }.stream
Note that there's no explicit recursion.
The answer to your second question depends on what the other method looks like, but in the case of your specific example, >> will just result in more flatMap layers, so it should be fine.
To address your question more generally, this whole topic is a confusing mess in Scala. You shouldn't have to dig into implementations like we did above just to know whether a type supports stack-safe monadic recursion or not. Better conventions around documentation would be a help here, but unfortunately we're not doing a very good job of that. You could always use tailRecM to be "safe" (which is what you'll want to do when the F[_] is generic, anyway), but even then you're trusting that the Monad implementation is lawful.
To sum up: it's a bad situation all around, and in sensitive situations you should definitely write your own tests to verify that implementations like this are stack-safe.
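Such a test can be as simple as forcing a deeply nested chain of flatMaps. A hedged sketch (Pull.output1 and Stream#toList exist in fs2 1.x, but check against the version you're using):

import fs2.{Pull, Pure}

// Build n nested flatMap layers and force them; if Pull's flatMap
// were not stack-safe, this would throw a StackOverflowError.
def deep(n: Int): Pull[Pure, Int, Unit] =
  if (n <= 0) Pull.done
  else Pull.output1(n).flatMap(_ => deep(n - 1))

assert(deep(100000).stream.toList.size == 100000)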

Related

Implementing functor map for class-tagged arguments only

I have the following data structure:
import scala.reflect.ClassTag

class MyDaSt[A] {
  def map[B: ClassTag](f: A => B) = //...
}
I'd like to implement a Functor instance for it so that I can use ad-hoc polymorphism. The obvious attempt would be as follows:
implicit val mydastFunctor: Functor[MyDaSt] = new Functor[MyDaSt] {
  override def map[A, B](fa: MyDaSt[A])(f: A => B): MyDaSt[B] = fa.map(f) // compile error
}
It obviously does not compile because we did not provide an implicit ClassTag[B]. But would it be possible to use map only with functions f: A => B for which a ClassTag[B] is available, and get a compile error otherwise? I mean something like this:
def someFun[A, B, C[_]: Functor](cc: C[A], f: A => B) = cc.map(f)

val f: Int => Int = //...
val v: MyDaSt[Int] = //...
someFun(v, f) // fine, ClassTag[Int] exists and is in scope
I cannot change its implementation in any way, but I can create wrappers (which do not look helpful, though) or use inheritance. I'm free to use shapeless of any version.
I currently think that shapeless is the way to go in such a case...
I'll expand on what the comments touched on:
Functor
cats.Functor describes an endofunctor in the category of Scala types - that is, you should be able to map with a function A => B where A and B can be any Scala types.
What you have is a mathematical functor, but in a different, smaller category of types that have a ClassTag. These general functors are somewhat uncommon - I think for stdlib types, only SortedSet can be a functor on a category of ordered things - so it's fairly unexplored territory in Scala FP right now, only rumored somewhat in Scalaz 8.
Cats does not have any tools for abstracting over such things, so you won't get any utility methods or ecosystem support. You can use the answer linked by @DmytroMitin if you want to roll your own.
Coyoneda
Coyoneda can make an endofunctor on Scala types from any type constructor F[_]. The idea is simple (see the sketch below):
- keep some initial value F[Initial]
- keep a function Initial => A
- to map with A => B, you don't touch the initial value; you simply compose the functions to get Initial => B
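A minimal sketch of that idea (a hypothetical simplified Coyoneda, not cats' actual encoding, which hides the Initial type existentially):

// Keep the original value and a (composed) function; map never
// touches fi, it only composes k with the new function.
final case class Coyo[F[_], I, A](fi: F[I], k: I => A) {
  def map[B](f: A => B): Coyo[F, I, B] = Coyo(fi, k andThen f)
}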
You can lift any F[A] into cats.free.Coyoneda[F, A]. The question is how to get F[A] out.
If F is a cats.Functor, then it is totally natural to use its native map, and, in fact, there will be no difference in the result between using Coyoneda and using F directly, thanks to the functor law (x.map(f).map(g) <-> x.map(f andThen g)).
In your case, it's not a cats.Functor. But you can tear cats.free.Coyoneda apart and delegate to your own map:
def coy[A](fa: MyDaSt[A]): Coyoneda[MyDaSt, A] = Coyoneda.lift(fa)

def unCoy[A: ClassTag](fa: Coyoneda[MyDaSt, A]): MyDaSt[A] =
  fa.fi.map(fa.k) // fi is the initial value, k is the composed function
Which will let you use functions expecting cats.Functor:
def generic[F[_]: Functor, A: Show](fa: F[A]): F[String] = fa.map(_.show)

unCoy(generic(coy(v))) // ok, though cumbersome; needs -Ypartial-unification on Scala prior to 2.13
(runnable example on scastie)
An obvious limitation is that you need a ClassTag[A] at any spot where you want to call unCoy - even if you did not need one to create the instance of MyDaSt[A] in the first place.
The less obvious one is that you no longer automatically have the guarantee that there are no behavioral differences. Whether that's okay depends on what your map does - e.g. if it just allocates some Arrays, it shouldn't cause issues.

Scala - is the function map pattern matching or iteration?

I have spent weeks trying to understand the idea behind "lifting" in Scala.
Originally, it came from an example related to Chapter 4 of the book "Functional Programming in Scala".
Then I found the topic "How map work on Options in Scala?"
The selected answer specifies that:
def map[B](f: A => B): Option[B] = this match (let's consider this as (*))
So, from the above code, I assume that the function map is built on the match construct. Hence, the mechanism behind map is a kind of pattern matching that selects between the cases Some and None.
Then I created the examples below, using map on Seq, Option, and Map (let's consider these examples as (**)).
Example 1: map for Seq
val xs = Seq(1, 2, 3)
xs.map(println)
Example 2: map for Option
val a:Option[Int] = Some(5)
a.map(println)
val b:Option[Int] = None
b.map(println)
Example 3: map for Map
val capitals = Map("France" -> "Paris", "Japan" -> "Tokyo")
capitals.map(println)
From (*) and (**), I could not tell whether map is pattern matching or iteration, or both.
Thank you for helping me to understand this.
@Jwvh provided a more programming-based answer, but I want to dig a little bit deeper.
I certainly appreciate you trying to understand how things work in Scala; however, if you really want to dig that deep, I am afraid you will need to obtain some basic knowledge of category theory, since there is no "idea behind lifting in Scala", just the "idea behind lifting".
This is also why functions like map can be very confusing. Inherently, programmers are taught map etc. as operations on collections, whereas they are actually operations that come with functors and natural transformations (map is normally referred to as fmap in category theory and also in Haskell).
Before I move on, the short answer is: it is pattern matching in the examples you gave, and in some of them it is both. map is defined specifically for each case; the only condition is that it maintain functoriality.
Attention: I will not be defining every single term below, since I would need to write a book to build up to some of the following definitions; interested readers are welcome to research them on their own. You should be able to get some basic understanding by following the types.
Let's consider these as functors; the definition will be something along these lines:
In (very, very) short, we consider types as objects in the category of our language. The functions between these types (type constructors) are morphisms between types in this category. The set of these transformations is called an endofunctor (it takes us from the category of Scala and drops us back in the category of Scala). Functors have to have a polymorphic map function ("polymorphic" actually has a whole different (extra) definition in category theory) that will take some object A and, through some type constructor, turn it into object B.
// assuming a minimal Functor typeclass in scope, e.g.
// trait Functor[F[_]] { def map[A, B](fa: F[A])(f: A => B): F[B] }
implicit val option: Functor[Option] = new Functor[Option] {
  override def map[A, B](optA: Option[A])(f: (A) => B): Option[B] = optA match {
    case Some(a) => Some(f(a))
    case _ => None
  }
}
implicit val seq: Functor[Seq] = new Functor[Seq] {
  override def map[A, B](sA: Seq[A])(f: (A) => B): Seq[B] = sA match {
    case a :: tail => f(a) +: map(tail)(f)
    case Nil => Nil
  }
}
As you can see in the second case, there is a little bit of both (more of a recursion than iteration but still).
Now, before the internet blows up on me: you can't normally pattern match on Seq in Scala like this. It works here because the default Seq is a List. I just provided this example because it is simpler to understand; the real definition is something along those lines.
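A tiny usage sketch of the instances above (assuming the minimal Functor typeclass sketched earlier):

option.map(Some(3))(_ + 1)   // Some(4)
seq.map(Seq(1, 2, 3))(_ * 2) // Seq(2, 4, 6)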
Now hold on a second. If you look at these types, you see that they also have flatMap defined on them. This means they are something more special than plain functors. They are monads. So beyond satisfying functoriality, they obey the monadic laws.
It turns out Monad has a different kind of meaning in core Scala; more on that here: What exactly makes Option a monad in Scala?
But again, very, very briefly: this means that we are now in a category where the endofunctors from our previous category are the objects, and the mappings between them (natural transformations) are the morphisms. This is slightly more accurate, because if you think about it, when you take a type and transform it, you take (carry over) all of its internal type constructors (2-cell or internal morphisms) with it; you do not take only the bare idea of a type without its functions.
// assuming a minimal Monad typeclass with flatMap and pure
implicit val optionMonad: Monad[Option] = new Monad[Option] {
  override def flatMap[A, B](optA: Option[A])(f: (A) => Option[B]): Option[B] = optA match {
    case Some(a) => f(a)
    case _ => None
  }
  def pure[A](a: A): Option[A] = Some(a)
  // You can define map using pure and flatMap
}
implicit val seqMonad: Monad[Seq] = new Monad[Seq] {
  override def flatMap[A, B](sA: Seq[A])(f: (A) => Seq[B]): Seq[B] = sA match {
    case x :: xs => f(x) ++ flatMap(xs)(f)
    case Nil => Nil
  }
  override def pure[A](a: A): Seq[A] = Seq(a)
  // Same warning as above; also, you can implement map with the two functions above
}
One thing you can always count on is map having a pattern match (or some if statement) in it. Why?
In order to satisfy the identity laws, we need to have some sort of "base case", a unit object, and in many cases (such as lists) those types are going to be what we call either a product or a coproduct.
Hopefully, this did not confuse you further. I wish I could get into every detail of this but it would simply take pages, I highly recommend getting into categories to fully understand where these come from.
From the ScalaDocs page we can see that the signature of the Standard Library map() method is a little different.
def map[B](f: (A) => B): Seq[B]
So the Standard Library map() is the means to transition from a collection of elements of type A to the same kind of collection but with elements of type B. (A and B might be the same type. They aren't required to be different.)
So, yes, it iterates through the collection applying function f() to each element A to create each new element B. And function f() might use pattern matching in its code, but it doesn't have to.
Now consider a.map(println). Every element of a is sent to println which returns Unit. So if a is List[Int] then the result of a.map(println) is List[Unit], which isn't terribly useful.
When all we want is the side effect of sending information to StdOut then we use foreach() which doesn't create a new collection: a.foreach(println)
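For instance:

val a = List(1, 2, 3)
a.map(println)     // prints 1 2 3, but also builds the useless List((), (), ())
a.foreach(println) // prints 1 2 3 and just returns Unit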
The map function for Option isn't about pattern matching. The match/case used in the link you referred to is just one of many ways to define the function. It could have been defined using if/else; in fact, that's how it's defined in the Scala 2.13 source of class Option:
sealed abstract class Option[+A] extends IterableOnce[A] with Product with Serializable {
  self =>
  ...
  final def map[B](f: A => B): Option[B] =
    if (isEmpty) None else Some(f(this.get))
  ...
}
If you view Option like a "collection" of either one element (Some(x)) or no elements (None), it might be easier to see the resemblance of how map transforms an Option versus, say, a List:
val f: Int => Int = _ + 1
List(42).map(f)
// res1: List[Int] = List(43)
List.empty[Int].map(f)
// res2: List[Int] = List()
Some(42).map(f)
// res3: Option[Int] = Some(43)
None.map(f)
// res4: Option[Int] = None

List implementation of foldLeft in Scala

Scala's foldLeft implementation is:
def foldLeft[B](z: B)(op: (B, A) => B): B = {
  var result = z
  this foreach (x => result = op(result, x))
  result
}
Why don't the Scala developers use something like tail recursion, or something else like this (it's just an example):
def foldLeft[T](start: T, myList: List[T])(f: (T, T) => T): T = {
  @annotation.tailrec
  def foldRec(accum: T, list: List[T]): T = {
    list match {
      case Nil => accum
      case head :: tail => foldRec(f(accum, head), tail)
    }
  }
  foldRec(start, myList)
}
Could it be done this way? Why or why not?
"Why not replace this simple three-line piece of code with this less simple seven-line piece of code that does the same thing?"
Um. That's why.
(If you are asking about performance, then one would need benchmarks of both solutions and an indication that the non-closure version was significantly faster.)
According to this answer, Scala does support tail-recursion optimization, but it looks like it wasn't there from the beginning, and it might still not work in every case, so that specific implementation might be a leftover.
That said, Scala is multi-paradigm and I don't think it strives for purity in terms of its functional programming, so I wouldn't be surprised if they went for the most practical or convenient approach.
Besides being simpler, the imperative solution is also far more general. As you may have noticed, foldLeft is implemented in TraversableOnce and depends only on the foreach method. Thus, by extending Traversable and implementing foreach, which is probably the simplest method to implement on any collection, you get all of these wonderful methods.
The declarative implementation, on the other hand, is tied to the structure of List and is very specific: it depends on Nil and ::.
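To illustrate the generality, here is a toy sketch with a hypothetical one-method collection: anything that provides foreach gets the foreach-based foldLeft for free, with no pattern matching on its structure.

// Hypothetical collection: foreach is the only primitive we implement.
final class Repeated[A](a: A, times: Int) {
  def foreach(f: A => Unit): Unit = (1 to times).foreach(_ => f(a))

  // The TraversableOnce-style foldLeft, verbatim from above.
  def foldLeft[B](z: B)(op: (B, A) => B): B = {
    var result = z
    this.foreach(x => result = op(result, x))
    result
  }
}

new Repeated("ha", 3).foldLeft("")(_ + _) // "hahaha"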

Monadic fold with State monad in constant space (heap and stack)?

Is it possible to perform a fold in the State monad in constant stack and heap space? Or is a different functional technique a better fit to my problem?
The next sections describe the problem and a motivating use case. I'm using Scala, but solutions in Haskell are welcome too.
Fold in the State Monad Fills the Heap
Assume Scalaz 7. Consider a monadic fold in the State monad. To avoid stack overflows, we'll trampoline the fold.
import scalaz._
import Scalaz._
import scalaz.std.iterable._
import Free.Trampoline

type TrampolinedState[S, B] = StateT[Trampoline, S, B] // monad type constructor
type S = Int // state is an integer
type M[B] = TrampolinedState[S, B] // our trampolined state monad
type R = Int // or some other monoid

val col: Iterable[R] = largeIterableofRs() // defined elsewhere

val (count, sum): (S, R) = col.foldLeftM[M, R](Monoid[R].zero) {
  (acc: R, x: R) => StateT[Trampoline, S, R] {
    s: S => Trampoline.done {
      (s + 1, Monoid[R].append(acc, x))
    }
  }
} run 0 run

// In Scalaz 7, foldLeftM is implemented in terms of foldRight, which in turn
// is a reversed.foldLeft. This pulls the whole collection into memory and kills
// the heap. Ignore this heap overflow. We could reimplement foldLeftM to avoid
// this overflow or use a foldRightM instead.
// Our real issue is the heap used by the unexecuted State mobits.
For a large collection col, this will fill the heap.
I believe that during the fold, a closure (a State mobit) is created for each value in the collection (the x: R parameter), filling the heap. None of those can be evaluated until run 0 is executed, providing the initial state.
Can this O(n) heap usage be avoided?
More specifically, can the initial state be provided before the fold so that the State monad can execute during each bind, rather than nesting closures for later evaluation?
Or can the fold be constructed such that it is executed lazily after the State monad is run? In this way, the next x: R closure would not be created until after the previous ones have been evaluated and made suitable for garbage collection.
Or is there a better functional paradigm for this sort of work?
Example Application
But perhaps I'm using the wrong tool for the job. The evolution of an example use case follows. Am I wandering down the wrong path here?
Consider reservoir sampling, i.e., picking in one pass a uniform random k items from a collection too large to fit in memory. In Scala, such a function might be
def sample[A](col: TraversableOnce[A])(k: Int): Vector[A]
and if pimped into the TraversableOnce type could be used like this
val tenRandomInts = (Int.MinValue to Int.MaxValue) sample 10
The work done by sample is essentially a fold:
def sample[A](col: Traversable[A])(k: Int): Vector[A] = {
  col.foldLeft(Vector[A]()){ update(k)(_: Vector[A], _: A) }
}
However, update is stateful; it depends on n, the number of items already seen. (It also depends on an RNG, but for simplicity I assume that is global and stateful. The techniques used to handle n would extend trivially.). So how to handle this state?
The impure solution is simple and runs with constant stack and heap.
/* Impure version of the update function */
def update[A](k: Int) = new Function2[Vector[A], A, Vector[A]] {
  var n = 0
  def apply(sample: Vector[A], x: A): Vector[A] = {
    n += 1
    algorithmR(k, n, sample, x)
  }
}

def algorithmR[A](k: Int, n: Int, sample: Vector[A], x: A): Vector[A] = {
  if (sample.size < k) {
    sample :+ x // must keep first k elements
  } else {
    val r = rand.nextInt(n) + 1 // for simplicity, rand is global/stateful
    if (r <= k)
      sample.updated(r - 1, x) // sample is 0-indexed
    else
      sample
  }
}
But what about a purely functional solution? update must take n as an additional parameter and return the new value along with the updated sample. We could include n in the implicit state, the fold accumulator, e.g.,
(col.foldLeft ((0, Vector())) (update(k)(_: (Int, Vector[A]), _: A)))._2
But that obscures the intent; we only really intend to accumulate the sample vector. This problem seems ready made for the State monad and a monadic left fold. Let's try again.
We'll use Scalaz 7, with these imports
import scalaz._
import Scalaz._
import scalaz.std.iterable._
and operate over an Iterable[A], since Scalaz doesn't support monadic folding of a Traversable.
sample is now defined
// sample using the State monad
def sample[A](col: Iterable[A])(k: Int): Vector[A] = {
  type M[B] = State[Int, B]
  // foldLeftM is implemented using foldRight, which must reverse `col`, blowing
  // the heap for large `col`. Ignore this issue for now.
  // foldLeftM could be implemented differently or we could switch to
  // foldRightM, implemented using foldLeft.
  col.foldLeftM[M, Vector[A]](Vector())(update(k)(_: Vector[A], _: A)) eval 0
}
where update is
// update using the State monad
def update[A](k: Int) = {
  (acc: Vector[A], x: A) => State[Int, Vector[A]] {
    n => (n + 1, algorithmR(k, n + 1, acc, x)) // algR same as in the impure solution
  }
}
Unfortunately, this blows the stack on a large collection.
So let's trampoline it. sample is now
// sample using the trampolined State monad
def sample[A](col: Iterable[A])(k: Int): Vector[A] = {
  import Free.Trampoline
  type TrampolinedState[S, B] = StateT[Trampoline, S, B]
  type M[B] = TrampolinedState[Int, B]
  // Same caveat about foldLeftM using foldRight and blowing the heap
  // applies here. Ignore for now. This solution blows the heap anyway;
  // let's fix that issue first.
  col.foldLeftM[M, Vector[A]](Vector())(update(k)(_: Vector[A], _: A)) eval 0 run
}
where update is
// update using the trampolined State monad
def update[A](k: Int) = {
  (acc: Vector[A], x: A) => StateT[Trampoline, Int, Vector[A]] {
    n => Trampoline.done { (n + 1, algorithmR(k, n + 1, acc, x)) }
  }
}
This fixes the stack overflow, but it still blows the heap for very large collections (or very small heaps). One anonymous function per value in the collection is created during the fold (I believe to close over each x: A parameter), consuming the heap before the trampoline is even run. (FWIW, the State version has this issue too; the stack overflow just surfaces first, with smaller collections.)
"Our real issue is the heap used by the unexecuted State mobits."
No, it is not. The real issue is that the collection doesn't fit in memory and that foldLeftM and foldRightM force the entire collection. A side effect of the impure solution is that you are freeing memory as you go. In the "purely functional" solution, you're not doing that anywhere.
Your use of Iterable ignores a crucial detail: what kind of collection col actually is, how its elements are created and how they are expected to be discarded. And so, necessarily, does foldLeftM on Iterable. It is likely too strict, and you are forcing the entire collection into memory. For example, if it is a Stream, then as long as you are holding on to col all the elements forced so far will be in memory. If it's some other kind of lazy Iterable that doesn't memoize its elements, then the fold is still too strict.
I tried your first example with an EphemeralStream and did not see any significant heap pressure, even though it will clearly have the same "unexecuted State mobits". The difference is that an EphemeralStream's elements are weakly referenced and its foldRight doesn't force the entire stream.
I suspect that if you used Foldable.foldr, then you would not see the problematic behaviour since it folds with a function that is lazy in its second argument. When you call the fold, you want it to return a suspension that looks something like this immediately:
Suspend(() => head |+| tail.foldRightM(...))
When the trampoline resumes the first suspension and runs up to the next suspension, all of the allocations between suspensions will become available to be freed by the garbage collector.
Try the following:
def foldM[M[_]: Monad, A, B](a: A, bs: Iterable[B])(f: (A, B) => M[A]): M[A] =
  if (bs.isEmpty) Monad[M].point(a)
  else Monad[M].bind(f(a, bs.head))(fax => foldM(fax, bs.tail)(f))

val MS = StateT.stateTMonadState[Int, Trampoline]
import MS._

foldM[M, R, Int](Monoid[R].zero, col) {
  (x, r) => modify(_ + 1) map (_ => Monoid[R].append(x, r))
} run 0 run
This will run in constant heap for a trampolined monad M, but will overflow the stack for a non-trampolined monad.
But the real problem is that Iterable is not a good abstraction for data that are too large to fit in memory. Sure, you can write an imperative side-effecty program where you explicitly discard elements after each iteration or use a lazy right fold. That works well until you want to compose that program with another one. And I'm assuming that the whole reason you're investigating doing this in a State monad to begin with is to gain compositionality.
So what can you do? Here are some options:
- Make use of Reducer, Monoid, and composition thereof, then run in an imperative explicitly-freeing loop (or a trampolined lazy right fold) as the last step, after which composition is not possible or expected.
- Use Iteratee composition and monadic Enumerators to feed them.
- Write compositional stream transducers with Scalaz-Stream.
The last of these options is the one that I would use and recommend in the general case.
Using State, or any similar monad, isn't a good approach to the problem.
Using State is condemned to blow the stack/heap on large collections. Consider a value x: State[A, B] constructed from a large collection (for example by folding over it). Then x can be evaluated on different values of the initial state A, yielding different results. So x needs to retain all the information contained in the collection. In a pure setting, x can't forget any information (that would be the only way to avoid blowing the stack/heap), so anything that is computed remains in memory until the whole monadic value is freed, which happens only after the result is evaluated. So the memory consumption of x is proportional to the size of the collection.
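A toy illustration of the point, assuming scalaz 7's State (the numbers are incidental):

import scalaz._, Scalaz._

val xs = List(1, 2, 3)

// Fold the whole collection into a single State value.
val s: State[Int, Int] =
  xs.foldLeft(State.state[Int, Int](0)) { (acc, x) =>
    acc.flatMap(sum => State(n => (n + 1, sum + x)))
  }

// s can still be run with any initial state, so it must retain
// everything derived from the collection until it is actually run.
s.run(0)  // (3, 6)
s.run(10) // (13, 6)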
I believe a fitting approach to this problem is to use functional iteratees/pipes/conduits. This concept (referred to by these three names) was invented to process large collections of data with constant memory consumption, and to describe such processes using simple combinators.
I tried to use Scalaz' Iteratees, but it seems this part isn't mature yet; it suffers from stack overflows just as State does (or perhaps I'm not using it right; the code is available here, if anybody is interested).
However, it was simple using my (still a bit experimental) scala-conduit library (disclaimer: I'm the author):
import conduit._
import conduit.Pipe._

object Run extends App {
  // Define a sampling function as a sink: it consumes
  // data of type `A` and produces a vector of samples.
  def sampleI[A](k: Int): Sink[A, Vector[A]] =
    sampleI[A](k, 0, Vector())

  // Create a sampling sink with a given state. It requests
  // a value from the upstream conduit. If there is one,
  // update the state and continue (the first argument to `requestF`).
  // If not, return the current sample (the second argument).
  // The `Finalizer` part isn't important for our problem.
  private def sampleI[A](k: Int, n: Int, sample: Vector[A]): Sink[A, Vector[A]] =
    requestF((x: A) => sampleI(k, n + 1, algorithmR(k, n + 1, sample, x)),
             (_: Any) => sample)(Finalizer.empty)

  // The sampling algorithm copied from the question.
  val rand = new scala.util.Random()

  def algorithmR[A](k: Int, n: Int, sample: Vector[A], x: A): Vector[A] = {
    if (sample.size < k) {
      sample :+ x // must keep first k elements
    } else {
      val r = rand.nextInt(n) + 1 // for simplicity, rand is global/stateful
      if (r <= k)
        sample.updated(r - 1, x) // sample is 0-indexed
      else
        sample
    }
  }

  // Construct an iterable of all `Short` values, pipe it into our sampling
  // function, and run the combined pipe.
  {
    print(runPipe(Util.fromIterable(Short.MinValue to Short.MaxValue) >->
      sampleI(10)))
  }
}
Update: it'd be possible to solve the problem using State, but we'd need to implement a custom fold specifically for State that knows how to run in constant space:
import scala.collection._
import scala.language.higherKinds
import scalaz._
import Scalaz._
import scalaz.std.iterable._

object Run extends App {
  // Folds in a state monad over a foldable
  def stateFold[F[_], E, S, A](xs: F[E],
                               f: (A, E) => State[S, A],
                               z: A)(implicit F: Foldable[F]): State[S, A] =
    State[S, A]((s: S) => F.foldLeft[E, (S, A)](xs, (s, z))((p, x) => f(p._2, x)(p._1)))

  // Sample a lazy collection view
  def sampleS[F[_], A](k: Int, xs: F[A])(implicit F: Foldable[F]): State[Int, Vector[A]] =
    stateFold[F, A, Int, Vector[A]](xs, update(k), Vector())

  // update using the State monad
  def update[A](k: Int) = {
    (acc: Vector[A], x: A) => State[Int, Vector[A]] {
      n => (n + 1, algorithmR(k, n + 1, acc, x)) // algR same as in the impure solution
    }
  }

  def algorithmR[A](k: Int, n: Int, sample: Vector[A], x: A): Vector[A] = ...

  {
    print(sampleS(10, (Short.MinValue to Short.MaxValue)).eval(0))
  }
}

Explain Traverse[List] implementation in scalaz-seven

I'm trying to understand the traverseImpl implementation in scalaz-seven:
def traverseImpl[F[_], A, B](l: List[A])(f: A => F[B])(implicit F: Applicative[F]) = {
  DList.fromList(l).foldr(F.point(List[B]())) {
    (a, fbs) => F.map2(f(a), fbs)(_ :: _)
  }
}
Can someone explain how the List interacts with the Applicative? Ultimately, I'd like to be able to implement other instances for Traverse.
An applicative lets you apply a function in a context to a value in a context. So, for instance, you can apply some((i: Int) => i + 1) to some(3) and get some(4). Let's forget that for now; I'll come back to it later.
List has two representations: it's either Nil or head :: tail. You may be used to folding over it using foldLeft, but there is another way to fold over it:
def foldr[A, B](l: List[A], acc0: B, f: (A, B) => B): B = l match {
  case Nil => acc0
  case x :: xs => f(x, foldr(xs, acc0, f))
}
Given List(1, 2) we fold over the list applying the function starting from the right side - even though we really deconstruct the list from the left side!
f(1, f(2, Nil))
This can be used to compute the length of a list. Given List(1, 2):
foldr(List(1, 2), 0, (i: Int, acc: Int) => 1 + acc)
// returns 2
This can also be used to create another list:
foldr[Int, List[Int]](List(1, 2), List[Int](), _ :: _)
//List[Int] = List(1, 2)
So given an empty list and the :: function, we were able to create another list. What if our elements are in some context? If our context is an applicative, then we can still apply our elements and :: in that context. Continuing with List(1, 2) and Option as our applicative: we start with some(List[Int]()) and we want to apply the :: function in the Option context. This is what F.map2 does. It takes two values in their Option context, puts the provided function of two arguments into the Option context, and applies them together.
So outside the context we have (2, Nil) => 2 :: Nil
In context we have: (Some(2), Some(Nil)) => Some(2 :: Nil)
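Concretely, a small sketch (assuming scalaz 7, where Apply provides apply2):

import scalaz._, Scalaz._

// Apply the `::` function to two values inside the Option context:
Apply[Option].apply2(Option(2), Option(List.empty[Int]))(_ :: _)
// Some(List(2))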
Going back to the original question:
// do a foldr
DList.fromList(l).foldr(F.point(List[B]())) {
  // starting with an empty list in its applicative context F.point(List[B]())
  (a, fbs) => F.map2(f(a), fbs)(_ :: _)
  // apply the `::` function to the two values in the context
}
I am not sure why DList is used here. What I see is that it uses trampolines, so hopefully that makes this implementation work without blowing the stack, but I have not tried it, so I don't know.
The interesting part about implementing the right fold like this is that I think it gives you an approach to implement traverse for algebraic data types using catamorphisms.
For instance given:
trait Tree[+A]
case object Leaf extends Tree[Nothing]
case class Node[A](a: A, left: Tree[A], right: Tree[A]) extends Tree[A]
Fold would be defined like this (which is really following the same approach as for List):
def fold[A, B](tree: Tree[A], valueForLeaf: B, functionForNode: (A, B, B) => B): B = {
  tree match {
    case Leaf => valueForLeaf
    case Node(a, left, right) => functionForNode(a,
      fold(left, valueForLeaf, functionForNode),
      fold(right, valueForLeaf, functionForNode)
    )
  }
}
And traverse would use that fold with F.point(Leaf) and apply it to Node.apply. Though there is no F.map3, so it may be a bit cumbersome.
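A hedged, untested sketch of that idea, pairing the two subtree results with tuple2 so that apply2 can stand in for the missing map3 (assuming scalaz 7's Applicative):

def traverseTree[F[_], A, B](tree: Tree[A])(f: A => F[B])(
    implicit F: Applicative[F]): F[Tree[B]] =
  fold[A, F[Tree[B]]](
    tree,
    F.point(Leaf: Tree[B]),
    // Combine the effect of the node's value with the effects of both
    // subtrees, then rebuild the node inside the context.
    (a, left, right) =>
      F.apply2(f(a), F.tuple2(left, right)) {
        case (b, (l, r)) => Node(b, l, r): Tree[B]
      }
  )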
This is not something that is so easy to grasp. I recommend reading the article linked at the beginning of my blog post on the subject.
I also did a presentation on the subject during the last Functional Programming meeting in Sydney and you can find the slides here.
If I can try to explain in a few words: traverse is going to traverse each element of the list one by one, eventually re-constructing the list (_ :: _) but accumulating/executing some kind of "effects" as given by the F Applicative. If F is State, it keeps track of some state. If F is the applicative corresponding to a Monoid, it aggregates some kind of measure for each element of the list.
The main interaction between the list and the applicative is the map2 application, where it receives an F[B] element and attaches it to the other F[List[B]] elements, by definition of F as an Applicative and using the List constructor :: as the specific function to apply.
From there you can see that implementing other instances of Traverse is only about applying the data constructors of the data structure you want to traverse. If you have a look at the linked PowerPoint presentation, you'll see some slides on a binary tree traversal.
List#foldRight blows the stack for large lists. Try this in a REPL:
List.range(0, 10000).foldRight(())((a, b) => ())
Typically, you can reverse the list, use foldLeft, then reverse the result to avoid this problem. But with traverse we really have to process the elements in the correct order, to make sure that the effect is treated correctly. DList is a convenient way to do this, by virtue of trampolining.
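For reference, the reverse-then-foldLeft trick mentioned above looks like this (a small sketch; it's only appropriate when right-to-left effect order doesn't matter):

// Stack-safe foldRight via reverse + foldLeft.
def foldRightSafe[A, B](xs: List[A], z: B)(f: (A, B) => B): B =
  xs.reverse.foldLeft(z)((acc, a) => f(a, acc))

foldRightSafe(List.range(0, 100000), 0)((_, acc) => acc + 1) // 100000, no stack overflow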
In the end, these tests must pass:
https://github.com/scalaz/scalaz/blob/scalaz-seven/tests/src/test/scala/scalaz/TraverseTest.scala#L13
https://github.com/scalaz/scalaz/blob/scalaz-seven/tests/src/test/scala/scalaz/std/ListTest.scala#L11
https://github.com/scalaz/scalaz/blob/scalaz-seven/core/src/main/scala/scalaz/Traverse.scala#L76