Rich function on Flink join, Scala API

I'm struggling with Flink and Scala.
I have a join transformation over a DataSet that pretty much works, but I want to turn it into a RichFunction, so that I can access a broadcasted set:
val newBoard: DataSet[Cell] = board.rightOuterJoin(neighbours)
  .where("coords").equalTo("cellCoords") {
    (cell, neighbours) => {
      // Do some rich function things, like
      // override the open method so I can get
      // the broadcasted set
    }
  }.withBroadcastSet(board, "aliveCells")
I have been looking all over the documentation, but I can't find any example of a RichJoinFunction being used in Scala. I only find examples of rich functions used in map or filter, but the syntax is different for the join transformation (the function goes between parentheses vs. between curly braces).

You can use a RichJoinFunction with the Scala DataSet API as follows:
val newBoard: DataSet[Cell] = board.rightOuterJoin(neighbours)
  .where("coords").equalTo("cellCoords")
  .apply(new YourJoinFunction())
  .withBroadcastSet(board, "aliveCells")
class YourJoinFunction extends RichJoinFunction[IN1, IN2, Cell] {
  override def join(first: IN1, second: IN2): Cell = {
    // Do some rich function things here; for example, you can also
    // override open() to read the broadcast set from the runtime context
  }
}
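For the broadcast set specifically, one possible shape is to override open() and read the variable from the runtime context. This is only a sketch: `Neighbour` stands in for the element type of your `neighbours` data set, and the join logic is left as a placeholder.
import org.apache.flink.api.common.functions.RichJoinFunction
import org.apache.flink.configuration.Configuration
import scala.collection.JavaConverters._

class AliveCellsJoin extends RichJoinFunction[Cell, Neighbour, Cell] {
  private var aliveCells: Set[Cell] = _

  override def open(parameters: Configuration): Unit = {
    // Broadcast variables are exposed as a java.util.List via the runtime context
    aliveCells = getRuntimeContext
      .getBroadcastVariable[Cell]("aliveCells")
      .asScala
      .toSet
  }

  override def join(cell: Cell, neighbour: Neighbour): Cell = {
    // Placeholder: combine cell, neighbour and aliveCells into the output Cell
    cell
  }
}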

Related

Scala How to create an Array of defs/Objects and then call those defs in a foreach?

I have a bunch of Scala objects whose defs do a bunch of processing:
Foo\CatProcessing (def processing)
Foo\DogProcessing (def processing)
Foo\BirdProcessing (def processing)
Then I have my main def that calls each Foo object's processing def, passing in common parameter values and such.
I am trying to put all the objects into an Array or List, and then do a 'foreach' to loop through the list, passing in the parameter values, i.e.
foreach(object in objList){
  object.Processing(parameters)
}
Coming from C#, I could do this via binders or the like, so how would I manage this in Scala?
for (obj <- objList) {
  obj.processing(parameters) // `object` is a reserved keyword in Scala
}
or
objList.foreach(obj => obj.processing(parameters))
They are actually the same thing, the former being "syntactic sugar" for the latter.
In the second case, you can bind the only parameter of the anonymous function passed to the foreach function to _, resulting in the following
objList.foreach(_.processing(parameters))
for comprehensions in Scala can be quite expressive and go beyond simple iteration; if you're curious, you can read more about them here.
Since you are coming from C#, if by any chance you have had any exposure to LINQ you will find yourself at home with the Scala Collection API. The official documentation is quite extensive in this regard and you can read more about it here.
As came up in the comments following my reply, the objects you want to iterate over also need to have a common type that exposes the processing method.
Alternatively, Scala allows you to use structural typing, but that relies on runtime reflection and it's unlikely to be something you really need or want in this case.
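For reference, a structural-type version would look roughly like this (note the reflectiveCalls import, which flags the runtime reflection involved):
import scala.language.reflectiveCalls

// Accepts anything that merely has a no-argument `processing` method
def runAll(objs: List[{ def processing(): Unit }]): Unit =
  objs.foreach(_.processing())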
The more idiomatic route is a common trait for your objects, as in the following example:
trait Processing {
  def processing(): Unit
}

final class CatProcessing extends Processing {
  def processing(): Unit = println("cat")
}

final class DogProcessing extends Processing {
  def processing(): Unit = println("dog")
}

final class BirdProcessing extends Processing {
  def processing(): Unit = println("bird")
}

val cat = new CatProcessing
val dog = new DogProcessing
val bird = new BirdProcessing

for (process <- List(cat, dog, bird)) {
  process.processing()
}
You can run the code above and play around with it here on Scastie.
Using a Map instead, you can do it as such (I wonder whether this works with other kinds of collections too):
val test = Map("foobar" -> new CatProcessing)
test.values.foreach(p => p.processing())

HList(DValue[A], DValue[B]) to HList(A, B) at library level?

I'm building a data binding library, which has 3 fundamental classes:
trait DValue[+T] {
  def get: T
}

class DField[T] extends DValue[T] {
  // allow writes + notifying observers
}

class DFunction[T](deps: DValue[_]*)(compute: => T) extends DValue[T] {
  def get = compute // internally compute will use values in deps
}
However, in this approach, the definition of DFunction is not quite robust - it requires the user of DFunction to make sure all DValues used in compute are put into the 'deps' list. So I want the user to be able to do something like this:
val dvCount: DValue[Int] = DField(3)
val dvElement: DValue[String] = DField("Hello")

val myFunction = DFunction(dvCount, dvElement) { (count, element) => // compiler knows their types
  Range(0, count).map(_ => element).toSeq
}
As you can see, when I'm constructing 'myFunction', the referenced fields and their usage are clearly mapped.
I feel an HList might let me provide something at the library level that would allow this, but I cannot figure out how. Would this be possible with HList, or is there something else that would help achieve this?
shapeless.ops.hlist.Mapper allows you to do this with a Poly function.
Unfortunately the documentation on it isn't great; you might need to do some source diving to see how to use it.
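A rough, untested sketch of the idea, assuming shapeless 2.x and the DValue trait from the question:
import shapeless._

// A polymorphic function that maps every DValue[T] in an HList to its plain T
object extractValue extends Poly1 {
  implicit def caseDValue[T]: Case.Aux[DValue[T], T] = at[DValue[T]](_.get)
}

val inputs = dvCount :: dvElement :: HNil // DValue[Int] :: DValue[String] :: HNil
val values = inputs.map(extractValue)     // Int :: String :: HNil, resolved via ops.hlist.Mapper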

Collector in Flink. What does it do?

I'm learning Flink and one of the things that confuses me is the use of an object called Collector, for example in the flatMap function. What is this Collector and its collect method? And why, for example, does a map function not need to pass results by explicitly using it?
Here are some examples of using Collector in the flatMap function:
https://www.programcreek.com/scala/org.apache.flink.util.Collector
Also, if I search for where the Collector fits in the Flink architecture, I don't find any diagram with that mapping.
Flink passes a Collector to any user function that has the possibility of emitting an arbitrary number of stream elements. A map function doesn't use a Collector because it performs a one-to-one transformation, with the return value of the map function being the output. A flatMap, on the other hand, can emit zero, one, or many stream elements for each event, which makes the Collector a convenient way to accommodate this.
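For instance, a class-based flat map function receives the Collector as a parameter and may call collect any number of times per input. A minimal sketch (not taken from the question):
import org.apache.flink.api.common.functions.FlatMapFunction
import org.apache.flink.util.Collector

// Emits zero or more words per input line through the Collector
class Tokenizer extends FlatMapFunction[String, String] {
  override def flatMap(line: String, out: Collector[String]): Unit = {
    line.split(" ").filter(_.nonEmpty).foreach(out.collect)
  }
}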
As you know, if you want one input to produce N outputs in the data stream, you can use the Collector to encapsulate the output data in flatMap. Map, on the contrary, usually produces one-to-one data, so it doesn't need it. Internally, Collector is used in many places: take a look at org.apache.flink.streaming.api.operators.Output (which extends Collector) and org.apache.flink.runtime.operators.shipping.OutputCollector; they are used to collect records and emit them to writers, and so on. collect is called whenever data needs to be written.
Examples (not necessarily accurate):
There are three definitions of flatMap in the Scala DataStream source code. Let's take a look at the first one.
/**
 * Creates a new DataStream by applying the given function to every element and flattening
 * the results.
 */
def flatMap[R: TypeInformation](fun: (T, Collector[R]) => Unit): DataStream[R] = {
  if (fun == null) {
    throw new NullPointerException("FlatMap function must not be null.")
  }
  val cleanFun = clean(fun)
  val flatMapper = new FlatMapFunction[T, R] {
    def flatMap(in: T, out: Collector[R]) { cleanFun(in, out) }
  }
  flatMap(flatMapper)
}
Examples of using this method are as follows:
text.flatMap((input: String, out: Collector[String]) => {
  input.split(" ").foreach(out.collect)
})
In this variant, we need to emit the data manually through the Collector.
Then let's take a look at the second definition in the source code:
/**
 * Creates a new DataStream by applying the given function to every element and flattening
 * the results.
 */
def flatMap[R: TypeInformation](fun: T => TraversableOnce[R]): DataStream[R] = {
  if (fun == null) {
    throw new NullPointerException("FlatMap function must not be null.")
  }
  val cleanFun = clean(fun)
  val flatMapper = new FlatMapFunction[T, R] {
    def flatMap(in: T, out: Collector[R]) { cleanFun(in) foreach out.collect }
  }
  flatMap(flatMapper)
}
Instead of using the Collector to collect the output, here we return a collection directly, and Flink flattens it for us. Using TraversableOnce also means we must return a collection in any case, even if it is an empty one; otherwise we cannot match the function's signature.
text.flatMap(input => {
  if (input.size > 15) {
    input.split(" ").toSeq
  } else {
    Seq.empty
  }
})
You will find many similar places: wherever records are emitted, you will almost always see a Collector.

Algorithm mixing

I have a class that extends Iterator and models a complex algorithm (MyAlgorithm1). Thus, the algorithm can advance step by step through the next method.
class MyAlgorithm1(val c: Set) extends Iterator[Step] {
  override def next(): Step = {
    /* ... */
  }
  /* ... */
}
Now I want to apply a different algorithm (MyAlgorithm2) in each pass of the first algorithm. The iterations of algorithms 1 and 2 should be interleaved.
class MyAlgorithm2(val c: Set) { /* ... */ }
How can I do this in the best way? Perhaps with some trait?
UPDATE:
MyAlgorithm2 receives a set and transforms it. So does MyAlgorithm1, but it is more complex and needs to run step by step. The idea is to run one step of MyAlgorithm1 and then run MyAlgorithm2, then the next step the same, and so on. Essentially, MyAlgorithm2 simplifies the set and may be useful to simplify the work of MyAlgorithm1.
As described, the problem can be solved with either inheritance or a trait. For instance:
class MyAlgorithm1(val c: Set) extends Iterator[Step] {
  protected var current = Step(c)
  override def next(): Step = {
    current = process(current)
    current
  }
  override def hasNext: Boolean = !current.set.isEmpty
  private def process(s: Step): Step = s
}

class MyAlgorithm2(c: Set) extends MyAlgorithm1(c) {
  override def next(): Step = {
    super.next()
    current = process(current)
    current
  }
  private def process(s: Step): Step = s
}
With traits you could do something with abstract override, but designing it so that the result of the simplification gets fed back into the first algorithm may be harder.
But let me suggest that you are approaching the problem in the wrong way.
Instead of creating a class for the algorithm extending an iterator, you could define your algorithm like this:
class MyAlgorithm1 extends Function1[Step, Step] {
  def apply(s: Step): Step = s
}

class MyAlgorithm2 extends Function1[Step, Step] {
  def apply(s: Step): Step = s
}
The iterator then could be much more easily defined:
Iterator.iterate(Step(set))((new MyAlgorithm1) andThen (new MyAlgorithm2)).takeWhile(_.set.nonEmpty)
Extending Iterator is probably more work than you actually need to do. Let's roll back a bit.
You've got some stateful object of type MyAlgorithm1
val alg1 = new MyAlgorithm1(args)
Now you wish to repeatedly call some function on it, which will alter its state and return some value. This is best modeled not by having your object implement Iterator, but rather by creating a new object that handles the iteration. Probably the easiest one in the Scala standard library is Stream. Here's an object which creates a stream of results from your algorithm.
val alg1Stream:Stream[Step] = Stream.continually(alg1.next())
Now if you wanted to repeatedly get results from that stream, it would be as easy as
for (step <- alg1Stream) {
  // do something
}
or equivalently
alg1Stream.foreach { step =>
  // do something
}
Now assume we've also encapsulated MyAlgorithm2 as a stream:
val alg2 = new MyAlgorithm2(args)
val alg2Stream: Stream[Step] = Stream.continually(alg2.next())
Then we just need some way to interleave streams, and then we could say
for (step <- interleave(alg1Stream, alg2Stream)) {
  // do something
}
Sadly, a quick glance through the standard library reveals no Stream-interleaving function. It's easy enough to write one:
def interleave[A](stream1: Stream[A], stream2: Stream[A]): Stream[A] = {
  var str1 = stream1
  var str2 = stream2
  var streamToUse = 1
  Stream.continually {
    if (streamToUse == 1) {
      streamToUse = 2
      val out = str1.head
      str1 = str1.tail
      out
    } else {
      streamToUse = 1
      val out = str2.head
      str2 = str2.tail
      out
    }
  }
}
That constructs a stream that repeatedly alternates between two streams, fetching the next result from the appropriate one and then setting up its state for the next fetch. Note that this interleave only works for infinite streams, and we'd need a cleverer one to handle streams that can end, but that's fine for the sake of the problem.
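As an aside, for infinite streams the same interleaving could also be sketched without mutable state, using zip and flatMap:
// Alternative sketch: alternate elements of two (infinite) streams
def interleave2[A](s1: Stream[A], s2: Stream[A]): Stream[A] =
  s1.zip(s2).flatMap { case (a, b) => Stream(a, b) }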
I have a class that extends Iterator and model an complex algorithm (MyAlgorithm1).
Well, stop for a moment there. An algorithm is not an iterator, so it doesn't make sense for it to extend Iterator.
It seems as if you might want to use a fold or a map instead, depending on exactly what it is that you want to do. This is a common pattern in functional programming: You generate a list/sequence/stream of something and then run a function on each element. If you want to run two functions on each element, you can either compose the functions or run another map.
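For example, if each pass is modelled as a plain Step => Step function (the names and the Step definition below are only illustrative), the two algorithms compose directly and can be applied with map or threaded through with a fold:
// Illustrative sketch: both passes as plain functions over a Step
case class Step(set: Set[Int])

val advance:  Step => Step = s => Step(s.set.map(_ + 1))         // stands in for MyAlgorithm1
val simplify: Step => Step = s => Step(s.set.filter(_ % 2 == 0)) // stands in for MyAlgorithm2

// Run both passes over every element of a sequence of steps...
val steps  = List(Step(Set(1, 2, 3)), Step(Set(4, 5)))
val mapped = steps.map(advance andThen simplify)

// ...or thread a single Step through a list of passes with a fold
val passes = List(advance, simplify, advance, simplify)
val result = passes.foldLeft(Step(Set(1, 2, 3)))((s, pass) => pass(s))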

Can I transform this asynchronous Java network API into a monadic representation (or something else idiomatic)?

I've been given a Java API for connecting to and communicating over a proprietary bus using a callback-based style. I'm currently implementing a proof-of-concept application in Scala, and I'm trying to work out how I might produce a slightly more idiomatic Scala interface.
A typical (simplified) application might look something like this in Java:
DataType type = new DataType();
BusConnector con = new BusConnector();
con.waitForData(type.getClass()).addListener(new IListener<DataType>() {
    public void onEvent(DataType t) {
        // some stuff happens in here, and then we need some more data
        con.waitForData(anotherType.getClass()).addListener(new IListener<AnotherType>() {
            public void onEvent(AnotherType t) {
                // we do more stuff in here, and so on
            }
        });
    }
});
// now we've got the behaviours set up we call
con.start();
In Scala I can obviously define an implicit conversion from (T => Unit) into an IListener, which certainly makes things a bit simpler to read:
implicit def func2Ilistener[T](f: (T => Unit)): IListener[T] = new IListener[T] {
  def onEvent(t: T) = f(t)
}
val con = new BusConnector
con.waitForData(DataType.getClass).addListener( (d: DataType) => {
  // some stuff, then another wait for stuff
  con.waitForData(OtherType.getClass).addListener( (o: OtherType) => {
    // etc
  })
})
Looking at this reminded me of both scalaz promises and f# async workflows.
My question is this:
Can I convert this into either a for comprehension or something similarly idiomatic? (I feel like this should map to actors reasonably well too.)
Ideally I'd like to see something like:
for (
  d <- con.waitForData(DataType.getClass);
  val _ = doSomethingWith(d);
  o <- con.waitForData(OtherType.getClass)
  // etc
)
If you want to use a for comprehension for this, I'd recommend looking at the Scala Language Specification for how for comprehensions are expanded to map, flatMap, etc. This will give you some clues about how this structure relates to what you've already got (with nested calls to addListener). You can then add an implicit conversion from the return type of the waitForData call to a new type with the appropriate map, flatMap, etc methods that delegate to addListener.
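A bare-bones version of such a wrapper might look like this (the names are made up for the sketch; the conversion from the addListener-style type would be analogous to the one shown in the update below):
// Hypothetical wrapper exposing foreach/map/flatMap over a callback registration
class WaitOps[T](register: (T => Unit) => Unit) {
  def foreach(f: T => Unit): Unit = register(f)
  def map[U](f: T => U): WaitOps[U] =
    new WaitOps[U](k => register(t => k(f(t))))
  def flatMap[U](f: T => WaitOps[U]): WaitOps[U] =
    new WaitOps[U](k => register(t => f(t).foreach(k)))
}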
Update
I think you can use scala.Responder[T] from the standard library:
Assuming the class with the addListener is called Dispatcher[T]:
trait Dispatcher[T] {
  def addListener(listener: IListener[T]): Unit
}

trait IListener[T] {
  def onEvent(t: T): Unit
}
implicit def dispatcher2Responder[T](d: Dispatcher[T]): Responder[T] = new Responder[T] {
  def respond(k: T => Unit) = d.addListener(new IListener[T] {
    def onEvent(t: T) = k(t)
  })
}
You can then use this as requested
for (
  d <- con.waitForData(DataType.getClass);
  val _ = doSomethingWith(d);
  o <- con.waitForData(OtherType.getClass)
  // etc
) ()
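For reference, since this is an imperative for comprehension (no yield), the compiler rewrites it into nested foreach calls, which through the Responder boil down to the same nested listener registrations as the original Java-style code, roughly:
// Roughly what the for comprehension above expands to
con.waitForData(DataType.getClass).foreach { d =>
  doSomethingWith(d)
  con.waitForData(OtherType.getClass).foreach { o =>
    // etc
  }
}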
See the Scala wiki and this presentation on using Responder[T] for a Comet chat application.
I have very little Scala experience, but if I were implementing something like this I'd look to leverage the actor mechanism rather than using callback listener classes. Actors were made for asynchronous communication; they nicely separate those different parts of your app for you. You can also have them send messages to multiple listeners.
We'll have to wait for a "real" Scala programmer to flesh this idea out, though. ;)