I'm trying to implement a scalaz-stream channel that accumulates statistics about the events it receives and, once complete, emits the final statistics.
To give a concrete, simplified example: imagine that you have a Process[Task, String] where each string is a word. I'd like to have a Channel[Task, String, (String, Int)] that, when applied to that initial process, would drain it, count the number of times each word occurs, and emit that.
I realise this is trivial through a fold:
input.foldMap(w => Map(w -> 1))
.flatMap(m => Process.emitAll(m.toSeq))
.maximumBy(_._2)
What I'm trying to write is a collection of standard accumulators that I can then just pipe my processes through - rather than explicitly fold, say, I'd write:
input.through(wordFrequency)
.maximumBy(_._2)
I'm at a bit of a loss though - I can't work out how to do so without sharing state. Writing a Sink that accumulate to a Map[String, Int] is fairly simple, but there's no way to get the final state of the map and emit it once the sink has terminated.
Related
I am using the broadcast pattern to connect two streams and read data from one to another. The code looks like this
case class Broadcast extends BroadCastProcessFunction[MyObject,(String,Double), MyObject]{
override def processBroadcastElement(in2: (String, Double),
context: BroadcastProcessFunction[MyObject, (String, Double), MyObject]#Context,
collector:Collector[MyObject]):Unit={
context.getBroadcastState(broadcastStateDescriptor).put(in2._1,in2._2)
}
override def processElement(obj: MyObject,
readOnlyContext:BroadCastProcessFunction[MyObject, (String,Double),
MyObject]#ReadOnlyContext, collector: Collector[MyObject]):Unit={
val theValue = readOnlyContext.getBroadccastState(broadcastStateDesriptor).get(obj.prop)
//If I print the context of the state here sometimes it is empty.
out.collect(MyObject(new, properties, go, here))
}
}
The state descriptor:
val broadcastStateDescriptor: MapStateDescriptor[String, Double) = new MapStateDescriptor[String, Double]("name_for_this", classOf[String], classOf[Double])
My execution code looks like this.
val streamA :DataStream[MyObject] = ...
val streamB :DataStream[(String,Double)] = ...
val broadcastedStream = streamB.broadcast(broadcastStateDescriptor)
streamA.connect(streamB).process(new Broadcast)
The problem is in the processElement function the state sometimes is empty and sometimes not. The state should always contain data since I am constantly streaming from a file that I know it has data. I do not understand why it is flushing the state and I cannot get the data.
I tried adding some printing in the processBroadcastElement before and after putting the data to the state and the result is the following
0 - 1
1 - 2
2 - 3
.. all the way to 48 where it resets back to 0
UPDATE:
something that I noticed is when I decrease the value of the timeout of the streaming execution context, the results are a bit better. when I increase it then the map is always empty.
env.setBufferTimeout(1) //better results
env.setBufferTimeout(200) //worse result (default is 100)
Whenever two streams are connected in Flink, you have no control over the timing with which Flink will deliver events from the two streams to your user function. So, for example, if there is an event available to process from streamA, and an event available to process from streamB, either one might be processed next. You cannot expect the broadcastedStream to somehow take precedence over the other stream.
There are various strategies you might employ to cope with this race between the two streams, depending on your requirements. For example, you could use a KeyedBroadcastProcessFunction and use its applyToKeyedState method to iterate over all existing keyed state whenever a new broadcast event arrives.
As David mentioned the job could be restarting. I disabled the checkpoints so I could see any possible exception thrown instead of flink silently failing and restarting the job.
It turned out that there was an error while trying to parse the file. So the job kept restarting thus the state was empty and flink kept consuming the stream over and over again.
While trying to become familiar with FS2, I came across a nifty recursive implementation using the Scala collections' Stream, and thought I'd have a go at trying it in FS2:
import fs2.{Pure, Stream}
val fibs: Stream[Pure, Int] = Stream[Pure, Int](0) ++ fibs.fold[Int](1)(_ + _)
println(fibs take 10 toList) // This will hang
What is the reason this hangs in FS2, and what is the best way to get a similar, working solution?
Your issue is that Stream.fold consumes all elements of the stream, producing a single final value from the fold. Note that it only emits one element.
The recursive stream only terminates when 10 elements have been emitted (this is specified by take 10). Since this stream is not productive enough, fold continues to add values without stopping.
The simplest way to fix this is to use a combinator that emits the partial results from the fold; this is scan.
Also, FS2 can infer most of the types in this code, so you don't necessarily need as many type annotations.
The following implementation should work fine:
import fs2.{Pure, Stream}
val fibs: Stream[Pure, Int] = Stream(0) ++ fibs.scan(1)(_ + _)
println(fibs take 10 toList)
I have an Iterator[InputStream] which i map over to retrieve the individual results:
val streams: Iterator[InputStream[CustomType]] = retrieveStreams()
val results: Iterator[MyResultType] = streams flatMap (c => transformToResult(c))
This works as expected, meaning I can retrieve values of type MyResultType from the results iterator. The only problem I have is that the individual InputStreams are never being closed. Is there any way to do this?
There is no magic way to close it, or at least to guarantee that it will get closed. Thus you have to close each stream yourself. Take a look at the Loan Pattern which makes it less error prone: Loaner Pattern in Scala.
In your case you don't have a single resource to release but rather a collection of resources, so adjust your custom loan pattern accordingly.
Since you are dealing with Iterator you might have unlimited supply of InputStreams, in that case your transformToResult function would have to close the stream or something else at the element level.
It could look something like this:
val streams: Iterator[InputStream[CustomType]] = retrieveStreams()
val results: Iterator[MyResultType] =
streams flatMap (c => yourLoaner(c)(transformToResult))
I am looking through Rng sources to see how they generate a stream of random values.
def stream[AA >: A](s: Size): Rng[EphemeralStream[AA]] =
list(s) map (EphemeralStream(_: _*))
First of all I am confused with this signature. I need a generator of an infinite stream that yields random values lazily. So I do not need the size argument.
I am confused with the implementation too. The generator seems to create a list and then convert it to the stream. However the whole point of the stream is to be lazy. What am I missing ?
I don't know about Stream, but we can do something like this explicitly with scalaz-stream Process:
def infiniteRandomProcess: Process[Rng, Int] =
Process.await(Rng.int)({
i => Process.emit(i) ++ infiniteRandomProcess
})
Then you transform the lazy Process until you get something you're happy to run (or runLog or so on), and you end up with an Rng value (or perhaps a monad transformer that combines this with e.g. Task).
The API might allow converting this to a Stream; there's a lot of it and I tend to stay around the shallow end.
In Scala there is a Stream class that is very much like an iterator. The topic Difference between Iterator and Stream in Scala? offers some insights into the similarities and differences between the two.
Seeing how to use a stream is pretty simple but I don't have very many common use-cases where I would use a stream instead of other artifacts.
The ideas I have right now:
If you need to make use of an infinite series. But this does not seem like a common use-case to me so it doesn't fit my criteria. (Please correct me if it is common and I just have a blind spot)
If you have a series of data where each element needs to be computed but that you may want to reuse several times. This is weak because I could just load it into a list which is conceptually easier to follow for a large subset of the developer population.
Perhaps there is a large set of data or a computationally expensive series and there is a high probability that the items you need will not require visiting all of the elements. But in this case an Iterator would be a good match unless you need to do several searches, in that case you could use a list as well even if it would be slightly less efficient.
There is a complex series of data that needs to be reused. Again a list could be used here. Although in this case both cases would be equally difficult to use and a Stream would be a better fit since not all elements need to be loaded. But again not that common... or is it?
So have I missed any big uses? Or is it a developer preference for the most part?
Thanks
The main difference between a Stream and an Iterator is that the latter is mutable and "one-shot", so to speak, while the former is not. Iterator has a better memory footprint than Stream, but the fact that it is mutable can be inconvenient.
Take this classic prime number generator, for instance:
def primeStream(s: Stream[Int]): Stream[Int] =
Stream.cons(s.head, primeStream(s.tail filter { _ % s.head != 0 }))
val primes = primeStream(Stream.from(2))
It can be easily be written with an Iterator as well, but an Iterator won't keep the primes computed so far.
So, one important aspect of a Stream is that you can pass it to other functions without having it duplicated first, or having to generate it again and again.
As for expensive computations/infinite lists, these things can be done with Iterator as well. Infinite lists are actually quite useful -- you just don't know it because you didn't have it, so you have seen algorithms that are more complex than strictly necessary just to deal with enforced finite sizes.
In addition to Daniel's answer, keep in mind that Stream is useful for short-circuiting evaluations. For example, suppose I have a huge set of functions that take String and return Option[String], and I want to keep executing them until one of them works:
val stringOps = List(
(s:String) => if (s.length>10) Some(s.length.toString) else None ,
(s:String) => if (s.length==0) Some("empty") else None ,
(s:String) => if (s.indexOf(" ")>=0) Some(s.trim) else None
);
Well, I certainly don't want to execute the entire list, and there isn't any handy method on List that says, "treat these as functions and execute them until one of them returns something other than None". What to do? Perhaps this:
def transform(input: String, ops: List[String=>Option[String]]) = {
ops.toStream.map( _(input) ).find(_ isDefined).getOrElse(None)
}
This takes a list and treats it as a Stream (which doesn't actually evaluate anything), then defines a new Stream that is a result of applying the functions (but that doesn't evaluate anything either yet), then searches for the first one which is defined--and here, magically, it looks back and realizes it has to apply the map, and get the right data from the original list--and then unwraps it from Option[Option[String]] to Option[String] using getOrElse.
Here's an example:
scala> transform("This is a really long string",stringOps)
res0: Option[String] = Some(28)
scala> transform("",stringOps)
res1: Option[String] = Some(empty)
scala> transform(" hi ",stringOps)
res2: Option[String] = Some(hi)
scala> transform("no-match",stringOps)
res3: Option[String] = None
But does it work? If we put a println into our functions so we can tell if they're called, we get
val stringOps = List(
(s:String) => {println("1"); if (s.length>10) Some(s.length.toString) else None },
(s:String) => {println("2"); if (s.length==0) Some("empty") else None },
(s:String) => {println("3"); if (s.indexOf(" ")>=0) Some(s.trim) else None }
);
// (transform is the same)
scala> transform("This is a really long string",stringOps)
1
res0: Option[String] = Some(28)
scala> transform("no-match",stringOps)
1
2
3
res1: Option[String] = None
(This is with Scala 2.8; 2.7's implementation will sometimes overshoot by one, unfortunately. And note that you do accumulate a long list of None as your failures accrue, but presumably this is inexpensive compared to your true computation here.)
I could imagine, that if you poll some device in real time, a Stream is more convenient.
Think of an GPS tracker, which returns the actual position if you ask it. You can't precompute the location where you will be in 5 minutes. You might use it for a few minutes only to actualize a path in OpenStreetMap or you might use it for an expedition over six months in a desert or the rain forest.
Or a digital thermometer or other kinds of sensors which repeatedly return new data, as long as the hardware is alive and turned on - a log file filter could be another example.
Stream is to Iterator as immutable.List is to mutable.List. Favouring immutability prevents a class of bugs, occasionally at the cost of performance.
scalac itself isn't immune to these problems: http://article.gmane.org/gmane.comp.lang.scala.internals/2831
As Daniel points out, favouring laziness over strictness can simplify algorithms and make it easier to compose them.