Collector in Flink. What does it do? - scala

I'm learning Flink, and one of the things that confuses me is the use of an object called Collector, for example in the flatMap function. What is this Collector and its collect method? And why, for example, does a map function not need to emit its results explicitly through it?
Some examples of using Collector in the flatMap function can be seen here:
https://www.programcreek.com/scala/org.apache.flink.util.Collector
Also, when I look for where the Collector fits in the Flink architecture, I can't find any diagram that maps it.

Flink passes a Collector to any user function that has the possibility of emitting an arbitrary number of stream elements. A map function doesn't use a Collector because it performs a one-to-one transformation, with the return value of the map function being the output. A flatMap, on the other hand, can emit zero, one, or many stream elements for each event, which makes the Collector a convenient way to accommodate this.
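To make the contrast concrete, here is a minimal sketch (the input stream lines is hypothetical): map returns exactly one value per input, while the flatMap variant receives a Collector and may call collect any number of times.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

val env = StreamExecutionEnvironment.getExecutionEnvironment
val lines: DataStream[String] = env.fromElements("to be", "", "or not to be")

// map: exactly one output per input -- the return value is the emitted element
val lengths: DataStream[Int] = lines.map(_.length)

// flatMap: zero, one, or many outputs per input, each emitted via the Collector
val words: DataStream[String] = lines.flatMap { (line: String, out: Collector[String]) =>
  line.split(" ").filter(_.nonEmpty).foreach(out.collect)
}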

As you know, if you want one input element to produce N outputs in the data stream, you can use the Collector to emit the output data in flatMap; conversely, map produces one-to-one data, so it doesn't need one. In a sense, Collector is used widely inside Flink. Take a look at org.apache.flink.streaming.api.operators.Output (which extends Collector) or org.apache.flink.runtime.operators.shipping.OutputCollector: they are used to collect records and emit them to a writer, and so on. collect is called whenever data needs to be written.
Examples (not necessarily accurate):
The Scala source code has three definitions of flatMap. Let's take a look at the first one.
/**
 * Creates a new DataStream by applying the given function to every element and flattening
 * the results.
 */
def flatMap[R: TypeInformation](fun: (T, Collector[R]) => Unit): DataStream[R] = {
  if (fun == null) {
    throw new NullPointerException("FlatMap function must not be null.")
  }
  val cleanFun = clean(fun)
  val flatMapper = new FlatMapFunction[T, R] {
    def flatMap(in: T, out: Collector[R]) { cleanFun(in, out) }
  }
  flatMap(flatMapper)
}
Examples of using this method are as follows:
text.flatMap((input: String, out: Collector[String]) => {
  input.split(" ").foreach(out.collect)
})
With this overload, we need to emit the data manually through the Collector.
Then let's take a look at the second definition in the source code:
/**
 * Creates a new DataStream by applying the given function to every element and flattening
 * the results.
 */
def flatMap[R: TypeInformation](fun: T => TraversableOnce[R]): DataStream[R] = {
  if (fun == null) {
    throw new NullPointerException("FlatMap function must not be null.")
  }
  val cleanFun = clean(fun)
  val flatMapper = new FlatMapFunction[T, R] {
    def flatMap(in: T, out: Collector[R]) { cleanFun(in) foreach out.collect }
  }
  flatMap(flatMapper)
}
Instead of using a Collector to collect the output, here we return a collection directly, and Flink flattens it for us. Because the function must return a TraversableOnce, we always have to return a collection, even an empty one; otherwise we cannot match the signature of the function.
text.flatMap(input => {
  if (input.size > 15) {
    input.split(" ")
  } else {
    Seq.empty
  }
})
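As a side note (a small sketch, relying on the standard implicit conversion from Option to Iterable): with that conversion an Option satisfies the TraversableOnce signature, so this overload also gives a tidy way to emit zero or one element per input.

// Emits the line itself when non-empty, and nothing otherwise
text.flatMap(line => Option(line).filter(_.nonEmpty))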
You can find many similar places: almost anywhere records are emitted, you will see a Collector.

Related

Scala - how to read method signatures

I am new to Scala and trying to understand the following method:
def method1 = {
  val key = "k1"
  val value = "v1"
  basicSetup() { (a, b, c) =>
    val json = s"""{"field1":"$value"}"""
    someMethodTest.send(a, b, json, c)
  } { (record, avroObject, schema) =>
    if (avroObject.get("field1").toString != value) {
      failure("failed")
    } else {
      success
    }
  }
}
So far I have worked with simple methods and I understand a plain call and return, but here there seems to be a lot bundled together.
I need help understanding how to read it, starting from the basicSetup line (just the general flow, signature, and return).
For example, why are there two blocks of code here: basicSetup() { ... } { ... } (and how is it executed?)
private def basicSetup()
    (run: (Producer, String, Schema) => Unit)
    (verify: (ProducerRecord[String, Array[Byte]], GenericRecord, Schema) => Result) = {
  ...
  ...
}
Thanks
It would be helpful to look at the definition of basicSetup, but it looks like a method with three parameter groups, the last two of which are themselves functions (making basicSetup a higher-order function).
The first parameter group is empty ().
The second and third are two "closures" or blocks of code or anonymous functions.
You could rewrite this as
// give names to these blocks
def doSomethingWithABC(a:A, b:B, c:C) = ???
def doSomethingWithAvro(record: R, avro: O, schema: S) = ???
basicSetup()(doSomethingWithABC)(doSomethingWithAvro)
Why are there 2 blocks of code?
This is syntactic sugar to make function calls (especially higher-order function calls) look more like "built-in" constructs, so you can roll your own control-flow methods. The keyword here is DSL.
These two blocks are parameters to basicSetup. They can appear as bare blocks (without any parameter parentheses) to make the call more concise (and natural, once you get used to it); see the sketch below.
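Here is a self-contained illustration with hypothetical names, showing how a method with multiple parameter groups can be declared and then called with brace blocks:

// A higher-order method with three parameter groups, shaped like basicSetup
def setupLike()(run: String => Unit)(verify: String => Boolean): Boolean = {
  run("produce")
  verify("record")
}

// Called with explicit parentheses:
setupLike()(s => println(s))(r => r.nonEmpty)

// Or with brace blocks, which reads more like a built-in control structure:
setupLike() { s =>
  println(s)
} { r =>
  r.nonEmpty
}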
Update (now that we have the signature):
private def basicSetup()
    (run: (Producer, String, Schema) => Unit)
    (verify: (ProducerRecord[String, Array[Byte]], GenericRecord, Schema) => Result) = {
Indeed. The function takes three parameter groups.
The first one is actually empty, so you just call it with (). But it could have parameters, even optional ones, perhaps for configuration.
The second one is your "callback" to "run" (after this basic setup has been completed). It itself is a function that will be called with three parameters, a Producer, a String and a Schema.
The third one is your code to "verify" the results of all that. It looks at three parameters and returns a Result (presumably indicating that all is good or what went wrong).

Consume a paginated Iterable[T] as if it were a continuous Iterable[T]

I'm using a service that returns me paginated resources. It exposes one single call, which is defined by the following interface:
trait Service {
  def getPage(pageSize: Int, pageCursor: String): AsyncPage[Resource]
}
The getPage function returns an AsyncPage[T] object, which is implemented like this:
/**
 * A page of contents that are retrieved asynchronously from their origin
 *
 * @param content        The resource object
 * @param nextPageCursor The token representing the next page, or empty if no more pages to consume
 * @tparam T             The type of resource within the page
 */
case class AsyncPage[T](
  val content: Future[Iterable[T]],
  val nextPageCursor: Future[String]
)
The contents of the page are retrieved asynchronously from whichever storage system the service uses.
Because of the needs of my application, I don't really care about pages. I'd like to write something that lets me consume the service's resources as if they were a single Iterable[T].
However, I want to maintain the laziness of the service: I don't want to request more pages than necessary, meaning I don't want to request the next page until I have consumed all the elements of the previous one.
Whenever I have consumed the whole Iterable[T] of one page, I want the code to request the following page via the getPage(...) function, passing as the pageCursor parameter the previous page's nextPageCursor.
Can you guide me on how to achieve that?
Well, if you don't mind blocking, you can do something like this:
import scala.collection.AbstractIterator
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration

class FutureIter[+P](fu: => Future[Iterator[P]]) extends AbstractIterator[P] {
  lazy val iter = Await.result(fu, Duration.Inf)  // blocks until the page has arrived
  def hasNext = iter.hasNext
  def next = iter.next
}

def fold[T](fs: Stream[Future[Iterator[T]]]): Iterator[T] = fs match {
  case hd #:: tail => new FutureIter(hd) ++ fold(tail)
  case _           => Iterator.empty
}

val pages = Stream
  .iterate(getPage(size, "")) { page =>
    getPage(size, Await.result(page.nextPageCursor, Duration.Inf))
  }
  .map(_.content.map(_.iterator))

val result: Iterator[T] = fold(pages)
This will block before the first page, and at the end of each subsequent page to load the next batch. I don't think there is a way to do this without blocking, because you can't tell where the page ends until the future is satisfied.
Also, note that the iterator this code produces is infinite, because you didn't mention any criterion for when to stop looking for more pages. You can tack a .takeWhile call onto pages to correct that.
You may also want to replace Stream with Iterator so that pages you are done with are discarded immediately, rather than being cached. I just used Stream because it makes fold a little nicer (you can't match on iterators; you'd have to use an ugly if (it.hasNext) ... instead).
BTW, fold looks like it is recursive, but it actually is not: ++ is lazy, so the fold(tail) piece will not be executed until you have iterated all the way through the left-hand-side - well after the outer fold is off the stack.
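A tiny demonstration of that laziness (hypothetical page function, with println marking when a page is actually built):

def page(n: Int): Iterator[Int] = {
  println(s"loading page $n")
  Iterator(n * 10, n * 10 + 1)
}

val it = page(1) ++ page(2)   // prints "loading page 1" only: the right-hand side is by-name
it.take(2).foreach(println)   // consumes page 1; "loading page 2" has still not been printed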
Since you mentioned Akka, you could create a Source[T], which can sort of be thought of as an async Iterable[T]:
// Needs an implicit ExecutionContext for the for-comprehension over Futures
Source.unfoldAsync[String, Iterable[T]](startPageCursor) { cursor =>
  val page = getPage(pageSize, cursor)
  for {
    nextCursor <- page.nextPageCursor
    it         <- page.content
  } yield Some((nextCursor, it))
}.mapConcat(_.toList)  // flatten each page into individual elements
This is much cleaner and completely non-blocking. But it is up to your use case if this is suitable.
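For completeness, a sketch of consuming such a Source, assuming a recent Akka version where an implicit ActorSystem provides the materializer (the resources value stands in for the Source built above):

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

implicit val system: ActorSystem = ActorSystem("pager")

// `resources` would be the Source[Resource, NotUsed] built with unfoldAsync above
val resources: Source[Resource, NotUsed] = ???

// Pull at most 100 elements; pages are only fetched as demand arrives downstream
val firstHundred: Future[Seq[Resource]] = resources.take(100).runWith(Sink.seq)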

Inspection error in scala method / play framework / rest

I'm still learning scala so this might be a question with an easy answer, but I've been stuck on writing a single method over and over for almost a day, unable to get this code to compile.
I'm playing with the Play Framework and a reactive mongo template to learn how Scala and Play work.
I have a controller with a few methods, endpoints for a REST service.
The issue is about the following method, which accepts a list of JSON objects and updates those objects using the reactive mongo driver. The class has one member, citiesFuture, which is of type Future[JSONCollection].
The original class code which I'm adding this method to can be found here for context: CityController on github
def updateAll() = Action.async(parse.json) { request =>
  Json.fromJson[List[City]](request.body) match {
    case JsSuccess(givenCities, _) =>
      citiesFuture onComplete[Future[Result]] { cities =>
        val updateFutures: List[Future[UpdateWriteResult]] = for {
          city <- givenCities
        } yield cities.get.update(City.getUniqueQuery(city), Json.obj("$set" -> city))
        val promise: Promise[Result] = Promise[Result] {
          Future.sequence(updateFutures) onComplete[Result] {
            case s @ Success(_) =>
              var count = 0
              for {
                updateWriteResult <- s.value
              } yield count += updateWriteResult.n
              promise success Ok(s"Updated $count cities")
            case Failure(_) =>
              promise success InternalServerError("Error updating cities")
          }
        }
        promise.future
      }
    case JsError(errors) =>
      Future.successful(BadRequest("Could not build a city from the json provided. " + Errors.show(errors)))
  }
}
I've managed to get this far with a lot of trial and error, and I'm starting to understand how some of the mechanics of Scala and Futures work, I think :) I think I'm close, but my IDE still gives me a single inspection error at the closing curly brace just above the line promise.future.
The error reads: Expression of type Unit doesn't conform to expected type Nothing.
I've checked the expected return values for the Promise and onComplete code blocks, but I don't believe they expect Nothing as a return type.
Could somebody please explain what I'm missing? Also, I'm sure this can be done better, so let me know if you have any tips I can learn from!
You're kind of on the right track, but as @cchantep said, once you're operating in Future-land it would be very unusual to need to create your own Future via a Promise.
In addition, it's actually quite unusual to see onComplete being used - idiomatic Scala generally favors the "higher-level" abstraction of mapping over Futures. I'll attempt to demonstrate how I'd write your function in a Play controller:
Firstly, the "endpoint" just takes care of one thing - interfacing with the outside world - i.e. the JSON-parsing part. If everything converts OK, it calls a private method (performUpdateAll) that actually does the work:
def updateAll() = Action.async(parse.json) { request =>
  Json.fromJson[List[City]](request.body) match {
    case JsSuccess(givenCities, _) =>
      performUpdateAll(givenCities)
    case JsError(errors) =>
      Future.successful(BadRequest("Could not build a city from the json provided. "))
  }
}
Next, we have the private function that performs the update of multiple cities. Again, trying to abide by the Single Responsibility Principle (in a functional sense: one function should do one thing), I've extracted out updateCity, which knows how to update exactly one city and returns a Future[UpdateWriteResult] (a sketch follows the code below). A nice side-effect of this is code reuse; you may find you'll be able to use such a function elsewhere.
private def performUpdateAll(givenCities: List[City]): Future[Result] = {
  val updateFutures = givenCities.map { city =>
    updateCity(city)
  }
  Future.sequence(updateFutures).map { listOfResults =>
    if (listOfResults.forall(_.ok)) {
      val count = listOfResults.map(_.n).sum
      Ok(s"Updated $count cities")
    } else {
      InternalServerError("Error updating cities")
    }
  }
}
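The answer leaves updateCity out; a minimal hypothetical sketch, reusing the citiesFuture: Future[JSONCollection] member from the question (and assuming an implicit ExecutionContext is in scope), might look like this:

// Hypothetical: update exactly one city and surface the driver's write result
private def updateCity(city: City): Future[UpdateWriteResult] =
  citiesFuture.flatMap { cities =>
    cities.update(City.getUniqueQuery(city), Json.obj("$set" -> city))
  }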
As far as I can tell, this will work in exactly the same way as you intended yours to work. But by using Future.map instead of its lower-level counterpart Future.onComplete and matching on Success and Failure you get much more succinct code where (in my opinion) it's much easier to see the intent because there's less boilerplate around it.
We still check that every update worked, with this:
if (listOfResults.forall(_.ok))
which I would argue reads pretty well - all the results have to be OK!
The other little trick I did to tidy up was replace your "counting" logic which used a mutable variable, with a one-liner:
var count = 0
for {
  updateWriteResult <- s.value
} yield count += updateWriteResult.n
Becomes:
val count = listOfResults.map(_.n).sum
i.e. convert the list of results to a list of integers (the n in the UpdateWriteResult) and then use the built-in sum function available on lists to do the rest.

Converting thunk to sequence upon iteration

I have a server API that returns a list of things, and does so in chunks of, let's say, 25 items at a time. With every response, we get a list of items, and a "token" that we can use for the following server call to return the next 25, and so on.
Please note that we're using a client library that has been written in stodgy old mutable Java, and doesn't lend itself nicely to all of Scala's functional compositional patterns.
I'm looking for a way to return a lazily evaluated sequence of all server items, by doing a server call with the latest token whenever the local list of items has been exhausted. What I have so far is:
def fetchFromServer(uglyStateObject: StateObject): Seq[Thing] = {
  val results = server.call(uglyStateObject)
  uglyStateObject.update(results.token())
  results.asScala.toList ++ (if (results.moreAvailable())
    fetchFromServer(uglyStateObject)
  else
    List())
}
However, this function does eager evaluation. What I'm looking for is to have ++ concatenate a "strict sequence" and a "lazy sequence", where a thunk will be used to retrieve the next set of items from the server. In effect, I want something like this:
results.asScala.toList ++ Seq.lazy(() => fetchFromServer(uglyStateObject))
Except I don't know what to use in place of Seq.lazy.
Things I've seen so far:
SeqView, but I've seen comments that it shouldn't be used because it re-evaluates all the time?
Streams, but it seems like that abstraction is meant to generate one element at a time, whereas I want to produce a bunch of elements at a time.
What should I use?
I also suggest you take a look at scalaz-stream. Here is a small example of how it might look:
import scalaz.stream._
import scalaz.concurrent.Task

// Returns updated state + fetched data
def fetchFromServer(uglyStateObject: StateObject): (StateObject, Seq[Thing]) = ???

// Initial state
val init: StateObject = new StateObject

val p: Process[Task, Thing] = Process.repeatEval[Task, Seq[Thing]] {
  var state = init
  Task(fetchFromServer(state)) map {
    case (s, seq) =>
      state = s
      seq
  }
} flatMap Process.emitAll
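To actually consume it you run the process; for instance (a sketch against scalaz-stream's Task-based API, where runLog collects the emitted values):

// Take the first 100 things; runLog yields a Task[IndexedSeq[Thing]],
// and .run executes it, blocking until the process terminates
val firstHundred: IndexedSeq[Thing] = p.take(100).runLog.run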
As a matter of fact, in the meantime I found a slightly different answer that I consider more readable (indeed using Streams):
def fetchFromServer(uglyStateObject: StateObject): Stream[Thing] = {
  val results = server.call(uglyStateObject)
  uglyStateObject.update(results.token())
  results.asScala.toStream #::: (if (results.moreAvailable())
    fetchFromServer(uglyStateObject)
  else
    Stream.empty)
}
Thanks everyone for the help!

Is there a FIFO stream in Scala?

I'm looking for a FIFO stream in Scala, i.e., something that provides the functionality of
immutable.Stream (a stream that can be finite and memorizes the elements that have already been read)
mutable.Queue (which allows for added elements to the FIFO)
The stream should be closable and should block access to the next element until the element has been added or the stream has been closed.
Actually I'm a bit surprised that the collection library does not (seem to) include such a data structure, since it is IMO a quite classical one.
My questions:
1) Did I overlook something? Is there already a class providing this functionality?
2) OK, if it's not included in the collection library then it might be just a trivial combination of existing collection classes. However, I tried to write that trivial code, and my implementation still looks quite complex for such a simple problem. Is there a simpler solution for such a FifoStream?
import java.io.Closeable
import scala.collection.mutable.Queue

class FifoStream[T] extends Closeable {
  val queue = new Queue[Option[T]]

  lazy val stream = nextStreamElem

  private def nextStreamElem: Stream[T] = next() match {
    case Some(elem) => Stream.cons(elem, nextStreamElem)
    case None       => Stream.empty
  }

  /** Returns next element in the queue (may wait for it to be inserted). */
  private def next() = {
    queue.synchronized {
      while (queue.isEmpty) queue.wait()  // `while` guards against spurious wakeups
      queue.dequeue()
    }
  }

  /** Adds new elements to this stream. */
  def enqueue(elems: T*) {
    queue.synchronized {
      queue.enqueue(elems.map { Some(_) }: _*)
      queue.notify()
    }
  }

  /** Closes this stream. */
  def close() {
    queue.synchronized {
      queue.enqueue(None)
      queue.notify()
    }
  }
}
Paradigmatic's solution (slightly modified)
Thanks for your suggestions. I slightly modified paradigmatic's solution so that toStream returns an immutable stream (allows for repeatable reads) so that it fits my needs. Just for completeness, here is the code:
import collection.JavaConversions._
import java.util.concurrent.{BlockingQueue, LinkedBlockingQueue}

class FIFOStream[A](private val queue: BlockingQueue[Option[A]] = new LinkedBlockingQueue[Option[A]]()) {
  lazy val toStream: Stream[A] = queue2stream

  private def queue2stream: Stream[A] = queue.take() match {
    case Some(a) => Stream.cons(a, queue2stream)
    case None    => Stream.empty
  }

  def close()         = queue add None
  def enqueue(as: A*) = queue addAll as.map(Some(_))
}
In Scala, streams are "functional iterators". People expect them to be pure (no side effects) and immutable. In your case, every time you iterate over the stream you modify the queue (so it's not pure). This can create a lot of misunderstanding, because iterating over the same stream twice will give two different results.
That being said, you should rather use Java's BlockingQueues than roll your own implementation. They are considered well implemented in terms of safety and performance. Here is the cleanest code I can think of (using your approach):
import java.util.concurrent.{BlockingQueue, LinkedBlockingQueue}
import scala.collection.JavaConversions._

class FIFOStream[A](private val queue: BlockingQueue[Option[A]]) {
  def toStream: Stream[A] = queue.take() match {
    case Some(a) => Stream.cons(a, toStream)
    case None    => Stream.empty
  }

  def close()         = queue add None
  def enqueue(as: A*) = queue addAll as.map(Some(_))
}

object FIFOStream {
  // Returns a FIFOStream backed by a LinkedBlockingQueue
  def apply[A]() = new FIFOStream[A](new LinkedBlockingQueue[Option[A]]())
}
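A quick usage sketch (hypothetical): one thread produces, while the consumer blocks on the stream until elements arrive or close() ends it:

val fifo = FIFOStream[Int]()

// Producer thread: enqueue a few elements, then close the stream
new Thread(new Runnable {
  def run(): Unit = {
    fifo.enqueue(1, 2, 3)
    fifo.close()
  }
}).start()

// Consumer: blocks for each element; the stream ends when None is dequeued
fifo.toStream.foreach(println)   // prints 1, 2, 3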
I'm assuming you're looking for something like java.util.concurrent.BlockingQueue?
Akka has a BoundedBlockingQueue implementation of this interface. There are of course the implementations available in java.util.concurrent.
You might also consider using Akka's actors for whatever it is you are doing. Use Actors to be notified or pushed a new event or message instead of pulling.
1) It seems you're looking for a dataflow stream seen in languages like Oz, which supports the producer-consumer pattern. Such a collection is not available in the collections API, but you could always create one yourself.
2) The dataflow stream relies on the concept of single-assignment variables (they don't have to be initialized at the point of declaration, and reading them prior to initialization blocks):
val x: Int
startThread {
  println(x)
}
println("The other thread waits for the x to be assigned")
x = 1
It would be straightforward to implement such a stream if single-assignment (or dataflow) variables were supported in the language (see the link). Since they are not a part of Scala, you have to use the wait-synchronized-notify pattern just like you did.
Concurrent queues from Java can be used to achieve that as well, as the other user suggested.