Find count in WindowedStream - Flink - Scala

I am pretty new to the world of streams and I am facing some issues on my first try.
More specifically, I am trying to implement a count and groupBy functionality in a sliding window using Flink.
I've done it on a normal DataStream, but I am not able to make it work on a WindowedStream.
Do you have any suggestions on how I can do it?
val parsedStream: DataStream[(String, Response)] = stream
  .mapWith(_.decodeOption[Response])
  .filter(_.isDefined)
  .map { record =>
    (
      s"${record.get.group.group_country}, ${record.get.group.group_state}, ${record.get.group.group_city}",
      record.get
    )
  }
val result: DataStream[((String, Response), Int)] = parsedStream
  .map((_, 1))
  .keyBy(_._1._1)
  .sum(1)
// The output of result is
// ((us, GA, Atlanta,Response()), 14)
// ((us, SA, Atlanta,Response()), 4)
result
  .keyBy(_._1._1)
  .timeWindow(Time.seconds(5))
  // the following part doesn't compile
  .apply(
    new WindowFunction[(String, Int), (String, Int), String, TimeWindow] {
      def apply(
        key: Tuple,
        window: TimeWindow,
        values: Iterable[(String, Response)],
        out: Collector[(String, Int)]
      ) {}
    }
  )
Compilation Error:
overloaded method value apply with alternatives:
[R](function: (String, org.apache.flink.streaming.api.windowing.windows.TimeWindow, Iterable[((String, com.flink.Response), Int)], org.apache.flink.util.Collector[R]) => Unit)(implicit evidence$28: org.apache.flink.api.common.typeinfo.TypeInformation[R])org.apache.flink.streaming.api.scala.DataStream[R] <and>
[R](function: org.apache.flink.streaming.api.scala.function.WindowFunction[((String, com.flink.Response), Int),R,String,org.apache.flink.streaming.api.windowing.windows.TimeWindow])(implicit evidence$27: org.apache.flink.api.common.typeinfo.TypeInformation[R])org.apache.flink.streaming.api.scala.DataStream[R]
cannot be applied to (org.apache.flink.streaming.api.functions.windowing.WindowFunction[((String, com.flink.Response), Int),(String, com.flink.Response),String,org.apache.flink.streaming.api.windowing.windows.TimeWindow]{def apply(key: String,window: org.apache.flink.streaming.api.windowing.windows.TimeWindow,input: Iterable[((String, com.flink.Response), Int)],out: org.apache.flink.util.Collector[(String, com.flink.Response)]): Unit})
.apply(

This is a simpler example that we can work on:
val source: DataStream[(JsonField, Int)] = env.fromElements(("hello", 1), ("hello", 2))
val window2 = source
  .keyBy(0)
  .timeWindow(Time.minutes(1))
  .apply(new WindowFunction[(JsonField, Int), Int, String, TimeWindow] {})

I have tried your code and found the errors; it seems that you have an error when declaring the types for your WindowFunction.
The documentation says that the expected types for WindowFunction are WindowFunction[IN, OUT, KEY, W <: Window]. Now, if you take a look at your code, IN is the type of the data stream that you are calculating windows on. The type of that stream is ((String, Response), Int), not (String, Int) as declared in the code.
If you change the part that is not compiling to:
.apply(new WindowFunction[((String, Response), Int), (String, Response), String, TimeWindow] {
  override def apply(key: String, window: TimeWindow, input: Iterable[((String, Response), Int)], out: Collector[(String, Response)]): Unit = ???
})
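If the end goal is the per-key count itself, here is a minimal sketch of one way to do it (an assumption on my part, not the asker's exact code): it counts directly on parsedStream per window instead of windowing the already-summed result, and emits (key, count) pairs, so OUT is (String, Int). For a sliding window you would pass a slide interval as the second argument to timeWindow.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

val windowedCounts: DataStream[(String, Int)] = parsedStream
  .keyBy(_._1)
  .timeWindow(Time.seconds(5))
  .apply(new WindowFunction[(String, Response), (String, Int), String, TimeWindow] {
    override def apply(key: String,
                       window: TimeWindow,
                       input: Iterable[(String, Response)],
                       out: Collector[(String, Int)]): Unit = {
      // Emit one (key, count) pair per key and window
      out.collect((key, input.size))
    }
  })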
EDIT: As for the second example, the error occurs for the same general reason. When you use keyBy with a tuple, there are two possible functions to use: keyBy(fields: Int*), which uses integers to access fields of the tuple by the indexes provided (this is what you have used), and keyBy(fun: T => K), where you provide a function to extract the key that will be used.
But there is one important difference between those functions: the first returns the key as a Java Tuple, while the other returns the key with its exact type.
So basically, if you change String to Tuple in your simplified example, it should compile cleanly.
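To make the difference concrete, here is a sketch of both variants for the simplified example (assuming JsonField is just an alias for String, as the fromElements call suggests); note how the KEY type parameter changes with the keyBy overload:

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// keyBy(0): the key is exposed as a Java Tuple
val byIndex = source
  .keyBy(0)
  .timeWindow(Time.minutes(1))
  .apply(new WindowFunction[(JsonField, Int), Int, Tuple, TimeWindow] {
    override def apply(key: Tuple, window: TimeWindow,
                       input: Iterable[(JsonField, Int)], out: Collector[Int]): Unit =
      out.collect(input.map(_._2).sum)
  })

// keyBy(_._1): the key keeps its exact type
val byExtractor = source
  .keyBy(_._1)
  .timeWindow(Time.minutes(1))
  .apply(new WindowFunction[(JsonField, Int), Int, JsonField, TimeWindow] {
    override def apply(key: JsonField, window: TimeWindow,
                       input: Iterable[(JsonField, Int)], out: Collector[Int]): Unit =
      out.collect(input.map(_._2).sum)
  })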

Related

Flink Cogroup - value map is not a member of Object

I am trying to run the example Scala code for the CoGroup function provided on the Flink website, but it throws the error "value map is not a member of Object".
Here is my code:
val iVals: DataSet[(String, Int)] = env.fromCollection(Seq(("a",1),("b",2),("c",3)))
val dVals: DataSet[(String, Int)] = env.fromCollection(Seq(("a",11),("b",22)))
val output = iVals.coGroup(dVals).where(0).equalTo(0) {
  (iVals, dVals, out: Collector[Double]) =>
    val ints = iVals map { _._2 } toSet

    for (dVal <- dVals) {
      for (i <- ints) {
        out.collect(dVal._2 * i)
      }
    }
}
output.print()
I don't know what causes the error, or whether there is some library I am missing to import. Thanks.
Have you tried adding the type annotations for iVals and dVals? It seems that Scala is inferring the type Object, hence the error. (Why, I don't know).
What I mean is:
(iVals: Iterator[(String, Int)], dVals: Iterator[(String, Int)], out: Collector[Double]) =>
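For completeness, a sketch of the full coGroup with the parameter types spelled out, using the same iVals and dVals as above (whether the annotations are strictly needed depends on the Flink and Scala versions in use):

val output: DataSet[Double] = iVals.coGroup(dVals).where(0).equalTo(0) {
  (iVals: Iterator[(String, Int)], dVals: Iterator[(String, Int)], out: Collector[Double]) =>
    // With explicit types, map and _._2 resolve against (String, Int) instead of Object
    val ints = iVals.map(_._2).toSet
    for (dVal <- dVals; i <- ints) {
      out.collect((dVal._2 * i).toDouble)
    }
}
output.print()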

How to use math.sqrt for DStream[(Double,Double)]?

For the streaming data DStream[(Double, Double)], how do I estimate the root mean squared error? See my code below. The line math.sqrt(summse) is where I have a problem (the code does not compile):
def calculateRMSE(output: DStream[(Double, Double)], n: DStream[Long]): Double = {
  val summse = output.foreachRDD { rdd =>
    rdd.map {
      case pair: (Double, Double) =>
        val err = math.abs(pair._1 - pair._2)
        err * err
    }.reduce(_ + _)
  }
  math.sqrt(summse)
}
UPDATE:
The code doesn't compile: Cannot resolve reference sqrt with such signature. Expected: Double, Actual: Unit
The method foreachRDD(...) returns Unit, so that is expected. According to the docs, the result is written back to this (the output) DStream. I guess that's what you'll have to apply sqrt to.
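One way to handle this, sketched below, is to compute the RMSE per micro-batch inside the foreachRDD closure; the printRMSE name and the println sink are placeholders for wherever the value should actually go:

import org.apache.spark.streaming.dstream.DStream

def printRMSE(output: DStream[(Double, Double)]): Unit = {
  output.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      val n = rdd.count()
      val sumSquaredError = rdd.map { case (expected, actual) =>
        val err = expected - actual
        err * err
      }.reduce(_ + _)
      // sqrt is applied here, to a plain Double, not to the DStream itself
      println(s"RMSE for this batch: ${math.sqrt(sumSquaredError / n)}")
    }
  }
}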

Flatten an RDD - Nested lists in value of key value pair

It took me awhile to figure this out and I wanted to share my solution. Improvements are definitely welcome.
References: Flattening a Scala Map in an RDD, Spark Flatten Seq by reversing groupby, (i.e. repeat header for each sequence in it)
I have an RDD of the form: RDD[(Int, List[(String, List[(String, Int, Float)])])]
Key: Int
Value: List[(String, List[(String, Int, Float)])]
With a goal of flattening to: RDD[(Int, String, String, Int, Float)]
binHostCountByDate.foreach(println)
Gives the example:
(516361, List((2013-07-15, List((s2.rf.ru,1,0.5), (s1.rf.ru,1,0.5))), (2013-08-15, List((p.secure.com,1,1.0)))))
The final RDD should give the following
(516361,2013-07-15,s2.rf.ru,1,0.5)
(516361,2013-07-15,s1.rf.ru,1,0.5)
(516361,2013-08-15,p.secure.com,1,1.0)
It's a simple one-liner (and with destructuring in the for-comprehension we can use better names than _1, _2._1, etc., which makes it easier to be sure we're getting the right result).
// Use an outer list in place of an RDD for test purposes
val t = List((516361,
  List(("2013-07-15", List(("s2.rf.ru",1,0.5), ("s1.rf.ru",1,0.5))),
       ("2013-08-15", List(("p.secure.com",1,1.0))))))
t flatMap {case (k, xs) => for ((d, ys) <- xs; (dom, a, b) <- ys) yield (k, d, dom, a, b)}
//> res0: List[(Int, String, String, Int, Double)] =
//     List((516361,2013-07-15,s2.rf.ru,1,0.5),
//          (516361,2013-07-15,s1.rf.ru,1,0.5),
//          (516361,2013-08-15,p.secure.com,1,1.0))
My approach is as follows:
I flatten the first key value pair. This "removes" the first list.
val binHostCountForDate = binHostCountByDate.flatMapValues(identity)
Gives me an RDD of the form: RDD[(Int, (String, List[(String, Int, Float)]))]
binHostCountForDate.foreach(println)
(516361,(2013-07-15,List((s2.rf.ru,1,0.5), (s1.rf.ru,1,0.5))))
(516361,(2013-08-15,List((p.secure.com,1,1.0))))
Now I map the first two items into a tuple, creating a new key, and keep the second tuple as the value. Then I apply the same procedure as above to flatten the new key value pair.
val binDataRemapKey = binHostCountForDate.map(f =>((f._1, f._2._1), f._2._2)).flatMapValues(identity)
This gives the flattened RDD: RDD[((Int, String), (String, Int, Float))]
If this form is fine then we are done, but we can go one step further and remove the nested tuples to get the final form we were originally looking for.
val binData = binDataRemapKey.map(f => (f._1._1, f._1._2, f._2._1, f._2._2, f._2._3))
This gives us the final form of: RDD[(Int, String, String, Int, Float)]
We now have a flattened RDD that has preserved the parents of each list.
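For reference, the one-liner from the first answer translates directly to the RDD, which avoids the intermediate re-keying steps (assuming binHostCountByDate has the type shown at the top of the question):

import org.apache.spark.rdd.RDD

val binData: RDD[(Int, String, String, Int, Float)] =
  binHostCountByDate.flatMap { case (key, dates) =>
    // One output row per (date, host) combination, keeping the parent key
    for ((date, hosts) <- dates; (host, count, share) <- hosts)
      yield (key, date, host, count, share)
  }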

Spark Streaming groupByKey and updateStateByKey implementation

I am trying to run stateful Spark Streaming computations over (fake) Apache web server logs read from Kafka. The goal is to "sessionize" the web traffic similar to this blog post.
The only difference is that I want to "sessionize" each page the IP hits, instead of the entire session. I was able to do this reading from a file of fake web traffic using Spark in batch mode, but now I want to do it in a streaming context.
Log files are read from Kafka and parsed into K/V pairs of (String, (String, Long, Long)) or
(IP, (requestPage, time, time)).
I then call groupByKey() on this K/V pair. In batch mode, this would produce a:
(String, CollectionBuffer((String, Long, Long), ...)) or
(IP, CollectionBuffer((requestPage, time, time), ...))
In a StreamingContext, it produces a:
(String, ArrayBuffer((String, Long, Long), ...)) like so:
(183.196.254.131,ArrayBuffer((/test.php,1418849762000,1418849762000)))
However, as the next microbatch (DStream) arrives, this information is discarded.
Ultimately what I want is for that ArrayBuffer to fill up over time as a given IP continues to interact and to run some computations on its data to "sessionize" the page time.
I believe the operator to make that happen is "updateStateByKey." I'm having some trouble with this operator (I'm new to both Spark & Scala);
any help is appreciated.
Thus far:
val grouped = ipTimeStamp.groupByKey().updateStateByKey(updateGroupByKey)
def updateGroupByKey(
  a: Seq[(String, ArrayBuffer[(String, Long, Long)])],
  b: Option[(String, ArrayBuffer[(String, Long, Long)])]
): Option[(String, ArrayBuffer[(String, Long, Long)])] = {
}
I think you are looking for something like this:
def updateGroupByKey(
  newValues: Seq[(String, ArrayBuffer[(String, Long, Long)])],
  currentValue: Option[(String, ArrayBuffer[(String, Long, Long)])]
): Option[(String, ArrayBuffer[(String, Long, Long)])] = {
  // Collect the new buffers and prepend the buffer held in the current state, if any
  val buffs: Seq[ArrayBuffer[(String, Long, Long)]] = for (v <- newValues) yield v._2
  val buffs2 = if (currentValue.isEmpty) buffs else currentValue.get._2 +: buffs
  // Fold everything back into a single buffer, keeping the key
  if (buffs2.isEmpty) None else {
    val key = if (currentValue.isEmpty) newValues(0)._1 else currentValue.get._1
    Some((key, buffs2.foldLeft(new ArrayBuffer[(String, Long, Long)])((acc, b) => acc ++ b)))
  }
}
Gabor's answer got me started down the right path, but here is an answer that produces the expected output.
First, for the output I want:
(100.40.49.235,List((/,1418934075000,1418934075000), (/,1418934105000,1418934105000), (/contactus.html,1418934174000,1418934174000)))
I don't need groupByKey(). updateStateByKey already accumulates the values into a Seq, so the addition of groupByKey is unnecessary (and expensive). Spark users strongly suggest not using groupByKey.
Here is the code that worked:
def updateValues(newValues: Seq[(String, Long, Long)],
                 currentValue: Option[Seq[(String, Long, Long)]]
                ): Option[Seq[(String, Long, Long)]] = {
  Some(currentValue.getOrElse(Seq.empty) ++ newValues)
}
val grouped = ipTimeStamp.updateStateByKey(updateValues)
Here updateStateByKey is passed a function (updateValues) that takes the accumulation of values over time (newValues) as well as an Option for the current value in the stream (currentValue). It then returns the combination of these. getOrElse is required because currentValue may occasionally be empty. Credit to https://twitter.com/granturing for the correct code.
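One practical note: updateStateByKey only works when a checkpoint directory is configured on the StreamingContext. A minimal wiring sketch (the app name, batch interval, and checkpoint path below are placeholders, not values from the question):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("sessionize").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/spark-checkpoint") // required for stateful operations; path is a placeholder

// ipTimeStamp would be the parsed Kafka DStream[(String, (String, Long, Long))] from the question
// val grouped = ipTimeStamp.updateStateByKey(updateValues)
// grouped.print()

ssc.start()
ssc.awaitTermination()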

Retrieving a list of tuples with the Scala anorm stream API

When reading the Play! docs, I found a way to parse the result into a List[(String, String)].
Like this:
// Create an SQL query
val selectCountries = SQL("Select * from Country")
// Transform the resulting Stream[Row] as a List[(String,String)]
val countries = selectCountries().map(row =>
  row[String]("code") -> row[String]("name")
).toList
I want to do this, but my tuple will contain more data.
I'm doing like this:
val getObjects = SQL("SELECT a, b, c, d, e, f, g FROM table")
val objects = getObjects().map(row =>
  row[Long]("a") -> row[Long]("b") -> row[Long]("c") -> row[String]("d") ->
    row[Long]("e") -> row[Long]("f") -> row[String]("g")
).toList
Each tuple I get will be in this format, which of course is what I'm asking for in the code above:
((((((Long, Long), Long), String), Long), Long), String)
But I want this:
(Long, Long, Long, String, Long, Long, String)
What I'm asking is how I should parse the result to generate a tuple like the last one above. I want to do what they do in the documentation for List[(String, String)], but with more data.
Thanks
You are getting ((((((Long, Long), Long), String), Long), Long), String) because of ->: each call to it wraps two elements into a pair. So with each -> you get a tuple, then you take this tuple and make a new one, and so on. You need to replace the arrows with commas and wrap everything in ():
val objects = getObjects().map(row =>
  (row[Long]("a"), row[Long]("b"), ..., row[String]("g"))
).toList
But remember that tuples currently can have no more than 22 elements in them.
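Spelled out for the seven columns named in the question, a sketch of the full mapping would look like this:

val objects: List[(Long, Long, Long, String, Long, Long, String)] =
  getObjects().map { row =>
    (row[Long]("a"), row[Long]("b"), row[Long]("c"), row[String]("d"),
     row[Long]("e"), row[Long]("f"), row[String]("g"))
  }.toList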