How to convert `fs2.Stream[IO, T]` to `Iterator[T]` in Scala

Need to fill in the methods next() and hasNext and preserve laziness:
new Iterator[T] {
  val stream: fs2.Stream[IO, T] = ...
  def hasNext: Boolean = ???
  def next(): T = ???
}
But I cannot figure out how on earth to do this from a fs2.Stream? All the methods on a Stream (or on the "compiled" thing) are fairly useless.
If this is simply impossible to do in a reasonable amount of code, then that itself is a satisfactory answer and we will just rip out fs2.Stream from the codebase - just want to check first!

fs2.Stream, while similar in concept to Iterator, cannot be converted to one while preserving laziness. I'll try to elaborate on why...
Both represent a pull-based series of items, but the way in which they represent that series and implement the laziness differs too much.
As you already know, Iterator represents its pull in terms of the next() and hasNext methods, both of which are synchronous and blocking. To consume the iterator and return a value, you can directly call those methods e.g. in a loop, or use one of its many convenience methods.
fs2.Stream supports two capabilities that make it incompatible with that interface:
cats.effect.Resource can be included in the construction of a Stream. For example, you could construct a fs2.Stream[IO, Byte] representing the contents of a file. When consuming that stream, even if you abort early or do some strange flatMap, the underlying Resource is honored and your file handle is guaranteed to be closed. If you tried to do the same thing with an Iterator, the "abort early" case would pose problems, forcing you to do something like Iterator[Byte] with Closeable and make the caller responsible for calling .close() on it, or some other pattern.
Evaluation of "effects". In this context, effects are types like IO or Future, where the process of obtaining the value may perform some possibly-asynchronous action, and may perform side-effects. Asynchrony poses a problem when trying to force the process into a synchronous interface, since it forces you to block your current thread to wait for the asynchronous answer, which can cause deadlocks if you aren't careful. Libraries like cats-effect strongly discourage you from calling methods like unsafeRunSync.
fs2.Stream does allow for some special cases that prevent the inclusion of Resource and Effects, via its Pure type alias which you can use in place of IO. That gets you access to Stream.PureOps, but that only gets you methods that consume the whole stream by building a collection; the laziness you want to preserve would be lost.
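For illustration, here's roughly what that looks like (a small sketch; the exact API surface varies a bit between fs2 versions):
import fs2.{Pure, Stream}
// A pure stream (no effect type) can be consumed directly into a collection,
// but doing so materializes everything - the laziness is gone.
val pure: Stream[Pure, Int] = Stream(1, 2, 3)
val xs: List[Int] = pure.toList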
Side note: you can convert an Iterator to a Stream.
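A sketch of that direction (the exact signature of Stream.fromIterator has changed between fs2 versions; recent versions take a chunk size):
import cats.effect.IO
import fs2.Stream
val it: Iterator[Int] = Iterator(1, 2, 3)
// Wraps the iterator in a stream, pulling elements in chunks of the given size.
val s: Stream[IO, Int] = Stream.fromIterator[IO](it, chunkSize = 16)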
The only way to "convert" a Stream to an Iterator is to consume it to some collection type via e.g. .compile.toList, which would get you an IO[List[T]], then .map(_.iterator) on that to get an IO[Iterator[T]]. But ultimately that doesn't fit what you're asking for, since it forces you to consume the stream to a buffer, breaking laziness.
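In code, that buffering approach looks something like this:
import cats.effect.IO
import fs2.Stream
val s: Stream[IO, Int] = Stream(1, 2, 3).covary[IO]
// Forces the whole stream into a List first, so laziness is lost.
val itIO: IO[Iterator[Int]] = s.compile.toList.map(_.iterator)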
@Dima mentioned the "XY Problem", which was poorly-received since they didn't really elaborate (initially) on the incompatibility, but they're right. It would be helpful to know why you're trying to make a Stream-to-Iterator conversion, in case there's some other approach that would serve your overall goal instead.

Related

What are some best practices to mix async libraries with sync code in Scala?

I'm working on Scala code where a 3rd-party library returns a Future[Boolean], while I need to consume that future in my Scala code, which is written in a fully synchronous manner.
Currently, I'm doing Await.result on the 3rd-party lib operation to ensure it returns just a boolean. Is there a better way to handle this? My Scala code needs a boolean value for further operations.
As Luis noted in the comments, in general there's no alternative to Awaiting on the Future.
That said, you may have some choice about where to Await.
For instance, if you have code like
val result = Await.result(someFuture, Duration.Inf)
f(result)
It may be more useful to run f in Future land with
Await.result(someFuture.map(f), Duration.Inf)
If f happens to block, it may be worth either wrapping f in blocking, or explicitly using an ExecutionContext for the map that can cope with many of its threads being blocked (e.g. one that can have more threads than cores).
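A minimal sketch of that shape (thirdPartyCheck and f are made-up stand-ins for the 3rd-party call and your follow-up work):
import scala.concurrent.{Await, ExecutionContext, Future, blocking}
import scala.concurrent.duration.Duration
implicit val ec: ExecutionContext = ExecutionContext.global
def thirdPartyCheck(): Future[Boolean] = Future(true)
def f(ok: Boolean): String = if (ok) "proceed" else "abort"
// Keep f in Future land, wrap it in blocking in case it blocks,
// and Await only once, at the edge.
val result: String =
  Await.result(thirdPartyCheck().map(ok => blocking(f(ok))), Duration.Inf)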
In general, you'll want to push Awaits as far toward the outermost edge of your code as you can, even shifting that edge outward if possible.

Tagless-final effect propagation

The tagless-final pattern lets us write pure functional programs which are explicit about the effects they require.
However, scaling this pattern might become challenging. I'll try to demonstrate this with an example. Imagine a simple program that reads records from the database and prints them to the console. We will require some custom typeclasses Database and Console, in addition to Monad from cats/scalaz in order to compose them:
def main[F[_]: Monad: Console: Database]: F[Unit] =
  read[F].flatMap(Console[F].print)
def read[F[_]: Functor: Database]: F[List[String]] =
  Database[F].read.map(_.map(recordToString))
The problem starts when I want to add a new effect to a function in the inner layers. For example, I want my read function to log a message if no records were found:
def read[F[_]: Monad: Database: Logger]: F[List[String]] =
  Database[F].read.flatMap {
    case Nil     => Logger[F].log("no records found") *> List.empty[String].pure[F]
    case records => records.map(recordToString).pure[F]
  }
But now, I have to add the Logger constraint to all the callers of read up the chain. In this contrived example it's just main, but imagine this is several layers down in a complicated real-world application.
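Concretely, main's signature has to grow as well, even though main itself never logs (a sketch based on the code above):
def main[F[_]: Monad: Console: Database: Logger]: F[Unit] =
  read[F].flatMap(Console[F].print)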
We can look at this issue in two ways:
We can say it's a good thing that we're explicit about our effects, and we know exactly which effects are needed by each layer
We can also say that this leaks implementation details - main doesn't care about logging, it just needs the result of read. Also, in real applications you see really long chains of effects in the top layers. It feels like a code-smell, but I can't put my finger on what other approach I can take.
Would love to get your insights on this.
Thanks.
We can also say that this leaks implementation details - main doesn't
care about logging, it just needs the result of read. Also, in real
applications you see really long chains of effects in the top layers.
It feels like a code-smell, but I can't put my finger on what other
approach I can take.
I actually believe the contrary is true. One of the key promises of pure FP is equational reasoning as a means of deriving the method implementation from its signature. If read needs a logging effect in order to do its business, then by all means it should be declaratively expressed in the signature. Another advantage of being explicit about your effects is the fact that when they start to accumulate, perhaps we need to rethink what this specific method is doing and split it up into smaller components? Or should this effect really be used here?
It is true that effects stack up, but as @TravisBrown mentioned in the comments, it is usually the highest place in the call stack that has to "suffer the consequence" of actually providing all the implicit evidence for the entire call tree.

Why does Future have side effects?

I am reading the book FPiS and on page 107 the author says:
We should note that Future doesn’t have a purely functional interface.
This is part of the reason why we don’t want users of our library to
deal with Future directly. But importantly, even though methods on
Future rely on side effects, our entire Par API remains pure. It’s
only after the user calls run and the implementation receives an
ExecutorService that we expose the Future machinery. Our users
therefore program to a pure interface whose implementation
nevertheless relies on effects at the end of the day. But since our
API remains pure, these effects aren’t side effects.
Why doesn't Future have a purely functional interface?
The problem is that creating a Future that induces a side-effect is in itself also a side-effect, due to Future's eager nature.
This breaks referential transparency. I.e. if you create a Future that only prints to the console, the future will be run immediately and run the side-effect without you asking it to.
An example:
for {
  x <- Future { println("Foo") }
  y <- Future { println("Foo") }
} yield ()
This results in "Foo" being printed twice. Now if Future were referentially transparent, we should be able to get the same result in the non-inlined version below:
val printFuture = Future { println("Foo") }
for {
  x <- printFuture
  y <- printFuture
} yield ()
However, this instead prints "Foo" only once and, even more problematically, it prints it whether or not you include the for-expression at all.
With referentially transparent expressions we should be able to inline any expression without changing the semantics of the program. Future cannot guarantee this, therefore it breaks referential transparency and is inherently effectful.
A basic premise of FP is referential transparency. In other words, avoiding side effects.
What's a side effect? From Wikipedia:
In computer science, a function or expression is said to have a side effect if it modifies some state outside its scope or has an observable interaction with its calling functions or the outside world. (Except, by convention, returning a value: returning a value has an effect on the calling function, but this is usually not considered as a side effect.)
And what is a Scala future? From the documentation page:
A Future is a placeholder object for a value that may not yet exist.
So a future can transition from a not-yet-existing-value to an existing-value without any interaction from or with the rest of the program, and, as you quoted: "methods on Future rely on side effects."
It would appear that Scala futures do not maintain referential transparency.
As far as I know, Future runs its computation automatically when it's created. Even if it lacks side-effects in its nested computation, it still breaks the flatMap composition rule, because it changes state over time:
someFuture.flatMap(Future(_)) == someFuture // can be false
Equality implementation questions aside, we can have a race condition here: the new Future immediately starts running, and for a tiny window its isCompleted can differ from someFuture's if someFuture is already done.
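A small sketch of that window (timings are illustrative only):
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
val someFuture = Future { Thread.sleep(50); 42 }
Thread.sleep(100) // let someFuture finish
val roundTripped = someFuture.flatMap(Future(_))
someFuture.isCompleted   // true
roundTripped.isCompleted // may still be false for a brief moment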
In order to be pure w.r.t. the effect it represents, Future should defer its computation and run it only when explicitly asked to, as in the case of Par (or scalaz's Task).
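For contrast, here's a minimal sketch using cats-effect 3's IO, which defers in exactly that way (cats-effect isn't part of this answer; scalaz's Task behaves similarly):
import cats.effect.IO
import cats.effect.unsafe.implicits.global
val printIO = IO(println("Foo"))
val program = for {
  _ <- printIO
  _ <- printIO
} yield ()
// Nothing has printed so far; the effect runs only when explicitly asked for.
program.unsafeRunSync() // prints "Foo" twice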
To complement the other points and explain the relationship between referential transparency (a requirement) and side-effects (mutation that might break this requirement), here is a somewhat simplistic but pragmatic view of what's happening:
A newly created Future immediately submits a Callable task into your pool's queue. Given that the queue is a mutable collection, this is basically a side-effect.
Any subscription (from onComplete to map) does the same, plus uses an additional mutable collection of subscribers per Callable.
Btw, subscriptions are not only in violation of the Monad laws as noted by @P.Frolov (for flatMap) - the Functor law f.map(identity) == f is broken too, especially in light of the fact that the newly created Future (by map) isn't equivalent to the original - it has its own separate subscriptions and Callable.
This "fire and subscribe" allows you to do stuff like:
val f = Future { ... }
val f2 = f.map(...)
val f3 = f.map(...) // twice or more
Every line of this code produces a side-effect that might potentially break referential transparency and, as many have mentioned, actually does.
The reason many authors prefer the term "referential transparency" is probably that, from a low-level perspective, we always perform some side-effects; however, only a subset of those (usually the more high-level ones) actually makes your code "non-functional".
In the case of Futures, breaking referential transparency is most disruptive because it also leads to non-determinism:
val f1 = Future {
  println("1")
}
val f2 = Future {
  println("2")
}
// "1" and "2" may be printed in either order, since the two futures run concurrently.
It gets worse when this is combined with Monads, including the for-comprehension cases mentioned by @Luka Jacobowitz. In practice, monads are used not only to flatten-merge compatible containers, but also to guarantee a [con]sequential relation. This is probably because, even in abstract algebra, Monads generalize over consequence operators, meant as a general characterization of the notion of deduction.
This simply means that it's hard to reason about non-deterministic logic - even harder than about merely non-referentially-transparent stuff:
Analyzing logs produced by Futures, or even worse by actors, is hell. No matter how many labels and how much thread-local propagation you have, everything breaks eventually.
Non-deterministic (aka "sometimes appearing") bugs are the most annoying and stay in production for years(!) - even extensive high-load testing (including performance tests) doesn't always catch them.
So, even in the absence of other criteria, code that is easier to reason about is essentially more functional, and Futures often lead to code that isn't.
P.S. In conclusion, if your project is open to scalaz/cats/monix/fs2 and so on, it's better to use Tasks/Streams/Iteratees. Those libraries introduce some risk of overdesign, of course; however, IMO it's better to spend time simplifying incomprehensible scalaz code than debugging an incomprehensible bug.

Traversing lists and streams with a function returning a future

Introduction
Scala's Future (new in 2.10 and now 2.9.3) is an applicative functor, which means that if we have a traversable type F, we can take an F[A] and a function A => Future[B] and turn them into a Future[F[B]].
This operation is available in the standard library as Future.traverse. Scalaz 7 also provides a more general traverse that we can use here if we import the applicative functor instance for Future from the scalaz-contrib library.
These two traverse methods behave differently in the case of streams. The standard library traversal consumes the stream before returning, while Scalaz's returns the future immediately:
import scala.concurrent._
import ExecutionContext.Implicits.global
// Hangs.
val standardRes = Future.traverse(Stream.from(1))(future(_))
// Returns immediately.
val scalazRes = Stream.from(1).traverse(future(_))
There's also another difference, as Leif Warner observes here. The standard library's traverse starts all of the asynchronous operations immediately, while Scalaz's starts the first, waits for it to complete, starts the second, waits for it, and so on.
Different behavior for streams
It's pretty easy to show this second difference by writing a function that will sleep for a few seconds for the first value in the stream:
def howLong(i: Int) = if (i == 1) 10000 else 0
import scalaz._, Scalaz._
import scalaz.contrib.std._
def toFuture(i: Int)(implicit ec: ExecutionContext) = future {
  printf("Starting %d!\n", i)
  Thread.sleep(howLong(i))
  printf("Done %d!\n", i)
  i
}
Now Future.traverse(Stream(1, 2))(toFuture) will print the following:
Starting 1!
Starting 2!
Done 2!
Done 1!
And the Scalaz version (Stream(1, 2).traverse(toFuture)):
Starting 1!
Done 1!
Starting 2!
Done 2!
Which probably isn't what we want here.
And for lists?
Strangely enough the two traversals behave the same in this respect on lists—Scalaz's doesn't wait for one future to complete before starting the next.
Another future
Scalaz also includes its own concurrent package with its own implementation of futures. We can use the same kind of setup as above:
import scalaz.concurrent.{ Future => FutureZ, _ }
def toFutureZ(i: Int) = FutureZ {
printf("Starting %d!\n", i)
Thread.sleep(howLong(i))
printf("Done %d!\n", i)
i
}
And then we get the behavior of Scalaz on streams for lists as well as streams:
Starting 1!
Done 1!
Starting 2!
Done 2!
Perhaps less surprisingly, traversing an infinite stream still returns immediately.
Question
At this point we really need a table to summarize, but a list will have to do:
Streams with standard library traversal: consume before returning; don't wait for each future.
Streams with Scalaz traversal: return immediately; do wait for each future to complete.
Scalaz futures with streams: return immediately; do wait for each future to complete.
And:
Lists with standard library traversal: don't wait.
Lists with Scalaz traversal: don't wait.
Scalaz futures with lists: do wait for each future to complete.
Does this make any sense? Is there a "correct" behavior for this operation on lists and streams? Is there some reason that the "most asynchronous" behavior—i.e., don't consume the collection before returning, and don't wait for each future to complete before moving on to the next—isn't represented here?
I cannot answer all of it, but I'll try on some parts:
Is there some reason that the "most asynchronous" behavior—i.e., don't
consume the collection before returning, and don't wait for each
future to complete before moving on to the next—isn't represented
here?
If you have dependent calculations and a limited number of threads, you can experience deadlocks. For example, if you have two futures depending on a third one (all three in the list of futures) and only two threads, you can get into a situation where the first two futures block both threads and the third one never gets executed. (Of course, if your pool size is one, i.e. you execute one calculation after the other, you can get similar situations.)
To solve this, you need one thread per future, without any limitation. This works for small lists of futures, but not for big ones. So if you run everything in parallel, you will get a situation where small examples run fine in all cases and bigger ones deadlock. (Example: developer tests run fine, production deadlocks.)
Is there a "correct" behavior for this operation on lists and streams?
I think it is impossible with futures. If you know something more about the dependencies, or when you know for sure that the calculations will not block, a more concurrent solution might be possible. But executing lists of futures looks to me "broken by design". The best solution seems to be one that will already fail with deadlocks for small examples (i.e. execute one Future after the other).
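For reference, "execute one Future after the other" can be written as a left fold, so each future is started only after the previous one completes (a sketch; traverseSequential is a made-up name):
import scala.concurrent.{ExecutionContext, Future}
def traverseSequential[A, B](xs: List[A])(f: A => Future[B])(
    implicit ec: ExecutionContext): Future[List[B]] =
  xs.foldLeft(Future.successful(List.empty[B])) { (acc, a) =>
    acc.flatMap(bs => f(a).map(bs :+ _)) // f(a) starts only once acc has finished
  }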
Scalaz futures with lists: do wait for each future to complete.
I think scalaz uses for comprehensions internally for traversal. With for comprehensions, it is not guaranteed that the calculations are independent. So I guess Scalaz is doing the right thing here with for comprehensions: doing one calculation after the other. In the case of futures, this will always work, given you have unlimited threads in your operating system.
So in other words: you are just seeing an artifact of how for comprehensions (must) work.
I hope this makes some sense.
If I understand the question correctly, I think it really comes down to the semantics of streams vs lists.
Traversing a list does what we'd expect from the docs:
Transforms a TraversableOnce[A] into a Future[TraversableOnce[B]] using the provided function A => Future[B]. This is useful for performing a parallel map. For example, to apply a function to all items of a list in parallel:
With streams, it's up to the developer to decide how they want it to work, because it depends on more knowledge of the stream than the compiler has (streams can be infinite, but the type system doesn't know about it). If my stream is reading lines from a file, I want to consume it first, since chaining futures line by line wouldn't actually parallelize things. In this case, I would want the parallel approach.
On the other hand, if my stream is an infinite list generating sequential integers and hunting for the first prime greater than some large number, it would be impossible to consume the stream first in one sweep (the chained Future approach would be required, and we'd probably want to run over batches from the stream).
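A hedged sketch of that "batches from the stream" idea: pull the (possibly infinite) stream in fixed-size groups and only fire the futures within each group in parallel (traverseInBatches and batchSize are illustrative names, not part of any library):
import scala.concurrent.{ExecutionContext, Future}
def traverseInBatches[A, B](xs: Stream[A], batchSize: Int)(f: A => Future[B])(
    implicit ec: ExecutionContext): Iterator[Future[Stream[B]]] =
  xs.grouped(batchSize).map(batch => Future.traverse(batch)(f)) // each batch is pulled lazily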
Rather than trying to figure out a canonical way to handle this, I wonder if there are missing types that would help make the different cases more explicit.

Scala toString: parenthesize or not?

I'd like this thread to be some kind of summary of pros/cons for overriding and calling toString with or without empty parentheses, because this thing still confuses me sometimes, even though I've been into Scala for quite a while.
So which one is preferable over the other? Comments from Scala geeks, officials and OCD paranoids are highly appreciated.
Pros to toString:
seems to be an obvious and natural choice at first glance;
most cases are trivial and just construct Strings on the fly without ever modifying internal state;
another common case is to delegate method call to the wrapped abstraction:
override def toString = underlying.toString
Pros to toString():
it is definitely not an "accessor-like" name (that's what the IntelliJ IDEA inspector complains about every once in a while);
might imply some CPU or I/O work (in cases where counting every System.arrayCopy call is crucial to performance);
even might imply some mutable state changing (consider an example when first toString call is expensive, so it is cached internally to yield quicker calls in future).
So what's the best practice? Am I still missing something?
Update: this question is related specifically to toString, which is defined on every JVM object, so I was hoping to find the best practice, if one exists.
Here's what Programming In Scala (section 10.3) has to say:
The recommended convention is to use a parameterless method whenever
there are no parameters and the method accesses mutable state only by
reading fields of the containing object (in particular, it does not
change mutable state). This convention supports the uniform access
principle, which says that client code should not be affected by a
decision to implement an attribute as a field or method.
Here's what the (unofficial) Scala Style Guide (page 18) has to say:
Scala allows the omission of parentheses on methods of arity-0 (no
arguments):
reply()
// is the same as
reply
However, this syntax
should only be used when the method in question has no side-effects
(purely-functional). In other words, it would be acceptable to omit
parentheses when calling queue.size, but not when calling println().
This convention mirrors the method declaration convention given above.
The latter does not mention the Uniform Access Principle.
If your toString method can be implemented as a val, the underlying state is immutable. If, however, your class is mutable, toString might not always yield the same result (e.g. for StringBuffer). So Programming In Scala implies that we should use toString() in two different situations:
1) When its value is mutable
2) When there are side-effects
Personally I think it's more common and more consistent to ignore the first of these. In practice toString will almost never have side-effects. So (unless it does), always use toString and ignore the Uniform Access Principle (following the Style Guide): keep parentheses to denote side-effects, rather than mutability.
Yes, you are missing something: Semantics.
If you have a method that simply gives back a value, you shouldn't use parens. The reason is that this blurs the line between vals and defs, satisfying the Uniform Access Principle. E.g. consider the size method for collections. For fixed-sized vectors or arrays this can be just a val, other collections may need to calculate it.
The usage of empty parens should be limited to methods which perform some kind of side effect, e.g. println(), or a method that increases an internal counter, or a method that resets a connection etc.
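To make the convention concrete, a tiny made-up sketch:
class Counter {
  private var n = 0
  def increment(): Unit = n += 1          // side-effecting: keep the parentheses
  def current: Int = n                    // pure accessor: no parentheses
  override def toString = s"Counter($n)"  // overrides the parameterless toString
}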
I would recommend always using toString. Regarding your third "pro" to toString():
Might imply some mutable state changing (consider an example when first toString call is expensive, so it is cached internally to yield quicker calls in future).
First of all, toString generally shouldn't be an expensive operation. But suppose it is expensive, and suppose you do choose to cache the result internally. Even in that case, I'd say use toString, as long as the result of toString is always the same for a given state of the object (disregarding the state of the toString cache).
The only reason I would not recommend using toString without parens is if you have a code profiler/analyzer that makes assumptions based on the presence or absence of parens. In that case, follow the conventions set forth by said profiler. Also, if your toString is that complicated, consider renaming it to something else, like expensiveToString. It is unofficially expected that toString be a straightforward, simple function in most cases.
Not much argumentation in this answer, but GenTraversableOnce alone declares the following defs without parentheses:
toArray
toBuffer
toIndexedSeq
toIterable
toIterator
toList
toMap
toSeq
toSet
toStream
toTraversable