Flink Scala - Extending WindowFunction

I am trying to figure out how to write my own WindowFunction, but I am having issues and cannot figure out why. The problem is with the apply function: it does not recognize MyWindowFunction as a valid input, so the code does not compile. The data I am streaming contains (timestamp, x, y), where x and y are 0 and 1 for testing. extractTupleWithoutTs simply returns a tuple (x, y). I have run the code with simple sum and reduce functions successfully. Grateful for any help :) Using Flink 1.3.
Imports:
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
Rest of the code:
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val text = env.socketTextStream("localhost", 9999).assignTimestampsAndWatermarks(new TsExtractor)
val tuple = text.map( str => extractTupleWithoutTs(str))
val counts = tuple.keyBy(0).timeWindow(Time.seconds(5)).apply(new MyWindowFunction())
counts.print()
env.execute("Window Stream")
MyWindowFunction, which is basically copy-pasted from an example with the types changed:
class MyWindowFunction extends WindowFunction[(Int, Int), Int, Int, TimeWindow] {
  def apply(key: Int, window: TimeWindow, input: Iterable[(Int, Int)], out: Collector[Int]): Unit = {
    var count = 0
    for (in <- input) {
      count = count + 1
    }
    out.collect(count)
  }
}

The problem is the third type parameter of the WindowFunction, i.e., the type of the key. The key is declared with an index in the keyBy method (keyBy(0)), so its type cannot be determined at compile time. The same problem arises if you declare the key as a string, i.e., keyBy("f0").
There are two options to resolve this (both are sketched below):
Use a KeySelector function in keyBy to extract the key (something like keyBy(_._1)). The return type of the KeySelector function is known at compile time, so you can use a correctly typed WindowFunction with an Int key.
Change the third type parameter of the WindowFunction to org.apache.flink.api.java.tuple.Tuple, i.e., WindowFunction[(Int, Int), Int, org.apache.flink.api.java.tuple.Tuple, TimeWindow]. Tuple is a generic holder for the keys extracted by keyBy; in your case it will be an org.apache.flink.api.java.tuple.Tuple1. In WindowFunction.apply() you can cast Tuple to Tuple1 and access the key field via Tuple1.f0.
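For illustration, here is a rough sketch of both options against the code above (MyTupleKeyWindowFunction is just a name chosen for the sketch; untested against Flink 1.3, so treat it as a sketch rather than a verified fix):
// Option 1: use a KeySelector so the key type (Int) is known at compile time;
// MyWindowFunction can then keep its [(Int, Int), Int, Int, TimeWindow] signature.
val counts1 = tuple
  .keyBy(_._1)
  .timeWindow(Time.seconds(5))
  .apply(new MyWindowFunction())

// Option 2: keep keyBy(0) and declare the key as the generic Tuple type.
import org.apache.flink.api.java.tuple.{Tuple, Tuple1}

class MyTupleKeyWindowFunction extends WindowFunction[(Int, Int), Int, Tuple, TimeWindow] {
  def apply(key: Tuple, window: TimeWindow, input: Iterable[(Int, Int)], out: Collector[Int]): Unit = {
    val intKey = key.asInstanceOf[Tuple1[Int]].f0   // the key extracted by keyBy(0)
    out.collect(input.size)                         // emit the element count per key and window
  }
}

val counts2 = tuple
  .keyBy(0)
  .timeWindow(Time.seconds(5))
  .apply(new MyTupleKeyWindowFunction())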

Related

Apache Spark - Is there a problem with passing parameters to custom Aggregator constructor?

I created an Aggregator to be used as a UDAF, which uses three columns of a DataFrame to calculate its result, but it also needs two other parameters that are common to every row. Initially, I defined the input type like this (simplifying unnecessary details):
case class In(a: Long, b: Double, c: Double, d: Long, e: Double)
class MyUDAF extends Aggregator[In, Buf, Long] {
...
}
and passed those extra parameters using lit from org.apache.spark.sql.functions:
val myudaf = udaf(new MyUDAF, ExpressionEncoder[In])
val df: DataFrame = _ // suppose there's an actual DataFrame here
df.withColumn("result", myudaf(col("a"), col("b"), col("c"), lit(100L), lit(10.0)))
It worked perfectly, but I didn't like this approach of passing those two parameters as columns, since I had to keep them inside the reduction buffers (the merge method takes only buffers as parameters). So I decided to include them in the MyUDAF constructor, and use it like this:
case class In(a: Long, b: Double, c: Double)
class MyUDAF(d: Long, e: Double) extends Aggregator[In, Buf, Long] {
...
}
val myudaf = udaf(new MyUDAF(100L, 10.0), ExpressionEncoder[In])
val df: DataFrame = _
df.withColumn("result", myudaf(col("a"), col("b"), col("c")))
This also worked perfectly in local tests. But I'm new to Spark, so I don't know if this practice can lead to errors. Unfortunately, I currently don't have access to more machines to create a cluster and check for myself whether something goes wrong in a more complex scenario. So the question is: could using data other than what is contained in the input rows and buffers (such as values from the constructor) cause any problems, errors or side effects? Is my second approach OK?

Shapeless lenses usage with a string definition

I would like to use Shapeless lenses to access the value of a case class field by a String definition.
I know this code works.
case class Test(id: String, calc: Long)
val instance = Test("123232", 3434L)
val lens = lens[Test] >> 'id
val valueOfFieldId = lens.get(instance)
But what I am trying to do is:
val fieldName = "id"
val lens = lens[Test] >> fieldName.witness
//I typed .witness because it was expecting a witness (if I am not wrong)
val valueOfFieldId = lens.get(instance)
But with this code, I am getting this error.
Could not find implicit value for parameter mkLens: shapeless.MkFieldLens[A$A148.this.Test,A$A148.this.str.type]
def get$$instance$$lll = lll;/* ###worksheet### generated $$end$$ */ lazy val lens = lens[Test] >> str.witness
Is it possible to get the value of case class field with a String definition?
Thanks.
You are supposed to use a Symbol ('id) here rather than a String ("id").
Creating a Symbol from a String, i.e.
Symbol(fieldName)
is a runtime operation, while Shapeless operates at compile time.
Why can't you use symbols?
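For completeness, a minimal sketch of the compile-time version, reusing the question's Test class (the lens value is renamed idLens here just to avoid shadowing):
import shapeless.lens

case class Test(id: String, calc: Long)

val instance = Test("123232", 3434L)
val idLens = lens[Test] >> 'id              // 'id is a Symbol literal known at compile time
val valueOfFieldId = idLens.get(instance)   // "123232"
Because 'id is a literal, the MkFieldLens implicit can be resolved at compile time; a String held in a runtime variable cannot drive that resolution.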

Pass case class to Spark UDF

I have a Scala 2.11 function which creates a case class instance from a Map, based on the provided class type.
def createCaseClass[T: TypeTag, A](someMap: Map[String, A]): T = {
  val rMirror = runtimeMirror(getClass.getClassLoader)
  val myClass = typeOf[T].typeSymbol.asClass
  val cMirror = rMirror.reflectClass(myClass)
  // The primary constructor is the first one
  val ctor = typeOf[T].decl(termNames.CONSTRUCTOR).asTerm.alternatives.head.asMethod
  val argList = ctor.paramLists.flatten.map(param => someMap(param.name.toString))
  cMirror.reflectConstructor(ctor)(argList: _*).asInstanceOf[T]
}
I'm trying to use this in the context of a Spark DataFrame as a UDF. However, I'm not sure what's the best way to pass the case class. The approach below doesn't seem to work.
def myUDF[T: TypeTag] = udf { (inMap: Map[String, Long]) =>
  createCaseClass[T](inMap)
}
I'm looking for something like this:
case class MyType(c1: String, c2: Long)
val myUDF = udf{(MyType, inMap) => createCaseClass[MyType](inMap)}
Thoughts and suggestions on how to resolve this are appreciated.
However, I'm not sure what's the best way to pass the case class
It is not possible to use case classes as arguments for user defined functions. SQL StructTypes are mapped to dynamically typed (for lack of a better word) Row objects.
If you want to operate on statically typed objects, please use a statically typed Dataset (see the sketch below).
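As a rough, hypothetical sketch of that route (MyType and its fields come from the question; the SparkSession setup and sample data are assumptions):
import org.apache.spark.sql.SparkSession

case class MyType(c1: String, c2: Long)

val spark = SparkSession.builder().master("local[*]").appName("example").getOrCreate()
import spark.implicits._

// With a typed Dataset the case class is available directly in map/flatMap,
// so no UDF and no runtime reflection are needed.
val ds = Seq(MyType("a", 1L), MyType("b", 2L)).toDS()
val result = ds.map(t => t.copy(c2 = t.c2 + 1))
result.show()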
From trial and error I have learned that whatever data structure is stored in a DataFrame or Dataset is represented using org.apache.spark.sql.types.
You can see this with:
df.schema.toString
Basic types like Int and Double are stored as:
StructField(fieldname,IntegerType,true), StructField(fieldname,DoubleType,true)
Complex types like case classes are transformed into a combination of nested types:
StructType(StructField(..), StructField(..), StructType(..))
Sample code:
case class range(min:Double,max:Double)
org.apache.spark.sql.Encoders.product[range].schema
//Output:
org.apache.spark.sql.types.StructType = StructType(StructField(min,DoubleType,false), StructField(max,DoubleType,false))
The UDF parameter type in this case is Row, or Seq[Row] when you store an array of case classes.
A basic debugging technique is to print the schema to a string:
val myUdf = udf( (r: Row) => r.schema.toString )
then, to see what happened:
df.take(1).foreach(println)
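A hypothetical usage sketch of such a Row-based UDF (the column names min and max echo the range example above; building the DataFrame from tuples and packing the columns into a struct are assumptions for the sketch):
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, struct, udf}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((0.0, 1.0), (2.0, 3.5)).toDF("min", "max")

val myUdf = udf( (r: Row) => r.schema.toString )

// pack the plain columns into a struct so the UDF receives them as a single Row
df.withColumn("nestedSchema", myUdf(struct(col("min"), col("max"))))
  .select("nestedSchema")
  .show(false)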

Unable to run spark map function that reads a Tuple RDD and returns a Tuple RDD

I have a requirement to generate a paired RDD from another paired RDD. Basically, I am trying to write a map function that does the following.
RDD[Polygon,HashSet[Point]] => RDD[Polygon,Integer]
Here is the code I have written:
A Scala function that iterates over the HashSet and adds up a value from each "Point" object:
def outCountPerCell( jr: Tuple2[Polygon,HashSet[Point]] ) : Tuple2[Polygon,Integer] = {
  val setIter = jr._2.iterator()
  var outageCnt: Int = 0
  while (setIter.hasNext()) {
    outageCnt += setIter.next().getCoordinate().getOrdinate(2).toInt
  }
  return Tuple2(jr._1, Integer.valueOf(outageCnt))
}
Applying the function to the paired RDD throws an error:
scala> val mappedJoinResult = joinResult.map((t: Tuple2[Polygon,HashSet[Point]]) => outCountPerCell(t))
<console>:82: error: type mismatch;
found   : ((com.vividsolutions.jts.geom.Polygon, java.util.HashSet[com.vividsolutions.jts.geom.Point])) => (com.vividsolutions.jts.geom.Polygon, Integer)
required: org.apache.spark.api.java.function.Function[(com.vividsolutions.jts.geom.Polygon, java.util.HashSet[com.vividsolutions.jts.geom.Point]),?]
       val mappedJoinResult = joinResult.map((t: Tuple2[Polygon,HashSet[Point]]) => outCountPerCell(t))
Can someone take a look and see what I am missing, or share any example code that uses a custom function inside a map() operation?
The problem here is that joinResult is a JavaPairRDD from the Java API. This data structure's map expects Java-style lambdas (Function), which are not (at least trivially) interchangeable with Scala lambdas.
So there are two solutions: convert the given method into a Java Function to pass to map (a sketch of this appears at the end of this answer), or simply use the Scala RDD as the developers intended:
Setup Dummy Data
Here I create some stand-in classes and make a Java RDD with a structure similar to the OP's (the snippets assume import scala.collection.JavaConverters._ and import org.apache.spark.api.java.JavaPairRDD are already in scope):
scala> case class Polygon(name: String)
defined class Polygon
scala> case class Point(ordinate: Int)
defined class Point
scala> :pa
// Entering paste mode (ctrl-D to finish)
/* More idiomatic method */
def outCountPerCell( jr: (Polygon,java.util.HashSet[Point])) : (Polygon, Integer) =
{
val count = jr._2.asScala.map(_.ordinate).sum
(jr._1, count)
}
// Exiting paste mode, now interpreting.
outCountPerCell: (jr: (Polygon, java.util.HashSet[Point]))(Polygon, Integer)
scala> val hs = new java.util.HashSet[Point]()
hs: java.util.HashSet[Point] = []
scala> hs.add(Point(2))
res13: Boolean = true
scala> hs.add(Point(3))
res14: Boolean = true
scala> val javaRDD = new JavaPairRDD(sc.parallelize(Seq((Polygon("a"), hs))))
javaRDD: org.apache.spark.api.java.JavaPairRDD[Polygon,java.util.HashSet[Point]] = org.apache.spark.api.java.JavaPairRDD@14fc37a
Use Scala RDD
The underlying Scala RDD can be retrieved from the Java RDD by using .rdd:
scala> javaRDD.rdd.map(outCountPerCell).foreach(println)
(Polygon(a),5)
Even better, use mapValues with Scala RDD
Since only the second part of each tuple changes, this problem can be solved cleanly with .mapValues:
scala> javaRDD.rdd.mapValues(_.asScala.map(_.ordinate).sum).foreach(println)
(Polygon(a),5)
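For reference, a rough sketch of the other option mentioned above, wrapping the method in a Java-API Function (same stand-in classes and outCountPerCell as above; untested, so treat it as a sketch):
import org.apache.spark.api.java.function.{Function => JFunction}
import scala.collection.JavaConverters._

val javaFn = new JFunction[(Polygon, java.util.HashSet[Point]), (Polygon, Integer)] {
  override def call(t: (Polygon, java.util.HashSet[Point])): (Polygon, Integer) = outCountPerCell(t)
}

val counted = javaRDD.map(javaFn)            // JavaRDD[(Polygon, Integer)]
counted.collect().asScala.foreach(println)   // (Polygon(a),5)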

Scala: a template for function to accept only a certain arity and a certain output?

I have a class where all of its functions have the same arity and the same output type. (Why? Each function is a separate processor that is applied to a Spark DataFrame and yields another DataFrame.)
So, the class looks like this:
class Processors {
  def p1(df: DataFrame): DataFrame = {...}
  def p2(df: DataFrame): DataFrame = {...}
  def p3(df: DataFrame): DataFrame = {...}
  ...
}
I then apply all the methods to a given DataFrame by mapping over Processors.getClass.getMethods, which allows me to add more processors without changing anything else in the code.
What I'd like to do is define a template for the methods under Processors that restricts all of them to accepting a single DataFrame and returning a DataFrame. Is there a way to do this?
Restricting what kind of functions can be added to a "list" is possible by using an appropriately typed container class, instead of a generic class, to hold the restricted methods. The container of restricted methods can then be part of some new class or object, or part of the main program.
What you lose by using a container (e.g. a Map with String keys and restricted values) to hold specific kinds of functions is compile-time checking of the method names: a call to trilpe instead of triple is only caught at runtime.
The restriction of a function to take a type T and return that same type T can be defined as a type F[T] using Function1 from the Scala standard library. Function1[A,B] allows any single-parameter function with input type A and output type B, but we want these input/output types to be the same, so:
type F[T] = Function1[T,T]
For a container, I will demonstrate scala.collection.mutable.ListMap[String, F[T]], assuming the following requirements:
string names reference the functions (doThis, doThat instead of 1, 2, 3...)
functions can be added to the list later (mutable)
though you could choose some other mutable or immutable collection class (e.g. Vector[F[T]] if you only want to number the methods) and still benefit from restricting what kind of functions future developers can include in the container.
An abstract type can be defined as:
type TaskMap[T] = ListMap[String, F[T]]
For your specific application you would then instantiate this as:
val Processors: TaskMap[DataFrame] = ListMap(
  "p1" -> ((df: DataFrame) => {...code for p1 goes here...}),
  "p2" -> ((df: DataFrame) => {...code for p2 goes here...}),
  "p3" -> ((df: DataFrame) => {...code for p3 goes here...})
)
and then to call one of these functions you use
Processors("p2")(someDF)
For simplicity of demonstration, let's forget about DataFrames for a moment and consider whether this scheme works with integers.
Consider the short program below. The collection "myTasks" can only contain functions from Int to Int. All of the lines below have been tested in the Scala interpreter (v2.11.6), so you can follow along line by line.
import scala.collection.mutable.ListMap
type F[T] = Function1[T,T]
type TaskMap[T] = ListMap[String, F[T]]
val myTasks: TaskMap[Int] = ListMap(
  "negate" -> ((x: Int) => (-x)),
  "triple" -> ((x: Int) => (3 * x))
)
we can add a new function to the container that adds 7 and name it "add7"
myTasks += ( "add7" -> ((x:Int)=>(x+7)) )
and the scala interpreter responds with:
res0: myTasks.type = Map(add7 -> <function1>, negate -> <function1>, triple -> <function1>)
But we can't add a function named "half" because it would return a Double, and a Double is not an Int, so it triggers a type error:
myTasks += ( "half" -> ((x:Int)=>(0.5*x)) )
Here we get this error message:
scala> myTasks += ( "half" -> ((x:Int)=>(0.5*x)) )
<console>:12: error: type mismatch;
found : Double
required: Int
myTasks += ( "half" -> ((x:Int)=>(0.5*x)) )
^
In a compiled application, this would be found at compile time.
Calling the functions stored this way is a bit more verbose for single calls, but can be very convenient.
Suppose we want to call "triple" on 10.
We can't write
triple(10)
<console>:9: error: not found: value triple
Instead it is
myTasks("triple")(10)
res4: Int = 30
Where this notation becomes more useful is if you have a list of tasks to perform but only want to allow tasks listed in myTasks.
Suppose we want to run all the tasks on the input data "10"
myTasks mapValues { _ apply 10 }
res9: scala.collection.Map[String,Int] =
Map(add7 -> 17, negate -> -10, triple -> 30)
Suppose we want to triple, then add7, then negate
If each result is desired separately, as above, that becomes:
List("triple","add7","negate") map myTasks.apply map { _ apply 10 }
res11: List[Int] = List(30, 17, -10)
But "triple, then add 7, then negate" could also be describing a series of steps to do 10, i.e. we want -((3*10)+7)" and scala can do that too
val myProgram = List("triple","add7","negate")
myProgram map myTasks.apply reduceLeft { _ andThen _ } apply 10
res12: Int = -37
This opens the door to writing an interpreter for your own customizable set of tasks, because we can also write
val magic = myProgram map myTasks.apply reduceLeft { _ andThen _ }
and magic is then a function from Int to Int that can take arbitrary Ints or otherwise do work as a function should.
scala> magic(1)
res14: Int = -10
scala> magic(2)
res15: Int = -13
scala> magic(3)
res16: Int = -16
scala> List(10,20,30) map magic
res17: List[Int] = List(-37, -67, -97)
Is this what you mean?
class Processors {
  type Template = DataFrame => DataFrame

  val p1: Template = ...
  val p2: Template = ...
  val p3: Template = ...

  def applyAll(df: DataFrame): DataFrame =
    p1(p2(p3(df)))
}