Spark Streaming groupByKey and updateStateByKey implementation - scala

I am trying to run stateful Spark Streaming computations over (fake) apache web server logs read from Kafka. The goal is to "sessionize" the web traffic, similar to this blog post.
The only difference is that I want to "sessionize" each page the IP hits, instead of the entire session. I was able to do this by reading from a file of fake web traffic using Spark in batch mode, but now I want to do it in a streaming context.
Log files are read from Kafka and parsed into K/V pairs of (String, (String, Long, Long)) or
(IP, (requestPage, time, time)).
I then call groupByKey() on this K/V pair. In batch mode, this would produce a:
(String, CollectionBuffer((String, Long, Long), ...)) or
(IP, CollectionBuffer((requestPage, time, time), ...))
In a StreamingContext, it produces a:
(String, ArrayBuffer((String, Long, Long), ...)), like so:
(183.196.254.131,ArrayBuffer((/test.php,1418849762000,1418849762000)))
However, as the next microbatch (DStream) arrives, this information is discarded.
Ultimately what I want is for that ArrayBuffer to fill up over time as a given IP continues to interact and to run some computations on its data to "sessionize" the page time.
I believe the operator to make that happen is "updateStateByKey." I'm having some trouble with this operator (I'm new to both Spark & Scala);
any help is appreciated.
Thus far:
val grouped = ipTimeStamp.groupByKey().updateStateByKey(updateGroupByKey)
def updateGroupByKey(
    a: Seq[(String, ArrayBuffer[(String, Long, Long)])],
    b: Option[(String, ArrayBuffer[(String, Long, Long)])]
  ): Option[(String, ArrayBuffer[(String, Long, Long)])] = {
}

I think you are looking for something like this:
def updateGroupByKey(
    newValues: Seq[(String, ArrayBuffer[(String, Long, Long)])],
    currentValue: Option[(String, ArrayBuffer[(String, Long, Long)])]
  ): Option[(String, ArrayBuffer[(String, Long, Long)])] = {
  // Collect the buffers carried by the new values
  val buffs: Seq[ArrayBuffer[(String, Long, Long)]] = for (v <- newValues) yield v._2
  // Prepend the buffer already held in the state, if any
  // (+: is used here because :: only exists on List, not on Seq)
  val buffs2 = if (currentValue.isEmpty) buffs else currentValue.get._2 +: buffs
  // Merge everything back into a single buffer for this key
  if (buffs2.isEmpty) None else {
    val key = if (currentValue.isEmpty) newValues(0)._1 else currentValue.get._1
    Some((key, buffs2.foldLeft(new ArrayBuffer[(String, Long, Long)])((acc, b) => acc ++ b)))
  }
}

Gabor's answer got me started down the right path, but here is an answer that produces the expected output.
First, for the output I want:
(100.40.49.235,List((/,1418934075000,1418934075000), (/,1418934105000,1418934105000), (/contactus.html,1418934174000,1418934174000)))
I don't need groupByKey(). updateStateByKey already accumulates the values into a Seq, so adding groupByKey is unnecessary (and expensive). Spark users strongly suggest avoiding groupByKey where possible.
Here is the code that worked:
def updateValues(newValues: Seq[(String, Long, Long)],
                 currentValue: Option[Seq[(String, Long, Long)]]
                ): Option[Seq[(String, Long, Long)]] = {
  Some(currentValue.getOrElse(Seq.empty) ++ newValues)
}
val grouped = ipTimeStamp.updateStateByKey(updateValues)
Here updateStateByKey is passed a function (updateValues) that receives the new values arriving in each batch (newValues) as well as an Option holding the state accumulated so far for that key (currentValue). It then returns the combination of these. getOrElse is required because currentValue may occasionally be empty, e.g. the first time a key is seen. Credit to https://twitter.com/granturing for the correct code.
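One practical note: updateStateByKey needs a checkpoint directory so Spark can persist the per-key state between batches. A minimal wiring sketch, assuming ssc is the StreamingContext and ipTimeStamp is the parsed Kafka DStream from above (the checkpoint path is a placeholder):

ssc.checkpoint("/tmp/sessionize-checkpoint")   // placeholder path; use HDFS/S3 in a real deployment

val sessions = ipTimeStamp.updateStateByKey(updateValues)
sessions.print()   // inspect the accumulating per-IP sessions each batch

ssc.start()
ssc.awaitTermination()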

Related

Find count in WindowedStream - Flink

I am pretty new to the world of streams and I am facing some issues in my first try.
More specifically, I am trying to implement a count and groupBy functionality in a sliding window using Flink.
I've done it on a normal DataStream but I am not able to make it work on a WindowedStream.
Do you have any suggestion on how I can do it?
val parsedStream: DataStream[(String, Response)] = stream
  .mapWith(_.decodeOption[Response])
  .filter(_.isDefined)
  .map { record =>
    (
      s"${record.get.group.group_country}, ${record.get.group.group_state}, ${record.get.group.group_city}",
      record.get
    )
  }

val result: DataStream[((String, Response), Int)] = parsedStream
  .map((_, 1))
  .keyBy(_._1._1)
  .sum(1)

// The output of result is
// ((us, GA, Atlanta,Response()), 14)
// ((us, SA, Atlanta,Response()), 4)

result
  .keyBy(_._1._1)
  .timeWindow(Time.seconds(5))
  // the following part doesn't compile
  .apply(
    new WindowFunction[(String, Int), (String, Int), String, TimeWindow] {
      def apply(
          key: Tuple,
          window: TimeWindow,
          values: Iterable[(String, Response)],
          out: Collector[(String, Int)]
      ) {}
    }
  )
Compilation Error:
overloaded method value apply with alternatives:
[R](function: (String, org.apache.flink.streaming.api.windowing.windows.TimeWindow, Iterable[((String, com.flink.Response), Int)], org.apache.flink.util.Collector[R]) => Unit)(implicit evidence$28: org.apache.flink.api.common.typeinfo.TypeInformation[R])org.apache.flink.streaming.api.scala.DataStream[R] <and>
[R](function: org.apache.flink.streaming.api.scala.function.WindowFunction[((String, com.flink.Response), Int),R,String,org.apache.flink.streaming.api.windowing.windows.TimeWindow])(implicit evidence$27: org.apache.flink.api.common.typeinfo.TypeInformation[R])org.apache.flink.streaming.api.scala.DataStream[R]
cannot be applied to (org.apache.flink.streaming.api.functions.windowing.WindowFunction[((String, com.flink.Response), Int),(String, com.flink.Response),String,org.apache.flink.streaming.api.windowing.windows.TimeWindow]{def apply(key: String,window: org.apache.flink.streaming.api.windowing.windows.TimeWindow,input: Iterable[((String, com.flink.Response), Int)],out: org.apache.flink.util.Collector[(String, com.flink.Response)]): Unit})
.apply(
Here is a simpler example that we can work on:
val source: DataStream[(JsonField, Int)] = env.fromElements(("hello", 1), ("hello", 2))

val window2 = source
  .keyBy(0)
  .timeWindow(Time.minutes(1))
  .apply(new WindowFunction[(JsonField, Int), Int, String, TimeWindow] {})
I have tried your code and found the error; it seems you have declared the wrong types for your WindowFunction.
The documentation says that the expected type parameters of WindowFunction are WindowFunction[IN, OUT, KEY, W <: Window]. Now, if you take a look at your code, IN is the type of the data stream that you are computing windows on. The type of that stream is ((String, Response), Int), not (String, Int) as declared in the code.
If you change the part that is not compiling to:
.apply(new WindowFunction[((String, Response), Int), (String, Response), String, TimeWindow] {
override def apply(key: String, window: TimeWindow, input: Iterable[((String, Response), Int)], out: Collector[(String, Response)]): Unit = ???
})
EDIT: The error in the second example has the same root cause. When you use keyBy with a tuple, there are two overloads you can choose from: keyBy(fields: Int*), which accesses tuple fields by the indices provided (this is what you used), and keyBy(fun: T => K), where you provide a function that extracts the key.
There is one important difference between them: the index-based version returns the key as a generic Java Tuple, while the function-based version returns the key with its exact type.
So basically, if you change String to Tuple in your simplified example, it should compile cleanly.
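Alternatively, here is a rough sketch of the function-based variant, which keeps the key's exact type so the WindowFunction can declare KEY = JsonField instead of Tuple. It assumes JsonField is a String-like type with TypeInformation available via the usual org.apache.flink.streaming.api.scala._ import; the summing body is just an illustration:

val window2 = source
  .keyBy(_._1)                       // function-based keyBy: the key keeps its exact type
  .timeWindow(Time.minutes(1))
  .apply(new WindowFunction[(JsonField, Int), Int, JsonField, TimeWindow] {
    override def apply(key: JsonField,
                       window: TimeWindow,
                       input: Iterable[(JsonField, Int)],
                       out: Collector[Int]): Unit =
      out.collect(input.map(_._2).sum)   // e.g. total count per key and window
  })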

How to extract timed-out sessions using mapWithState

I am updating my code to switch from updateStateByKey to mapWithState in order to get users' sessions based on a time-out of 2 minutes (2 is used for testing purposes only). Each session should aggregate all the streaming data (JSON strings) received within that session before the time-out.
This was my old code:
val membersSessions = stream.map[(String, (Long, Long, List[String]))](eventRecord => {
  val parsed = Utils.parseJSON(eventRecord)
  val member_id = parsed.getOrElse("member_id", "")
  val timestamp = parsed.getOrElse("timestamp", "").toLong
  // The timestamp is returned twice because the first one will be used as the start time
  // and the second one as the end time
  (member_id, (timestamp, timestamp, List(eventRecord)))
})

val latestSessionInfo = membersSessions.map[(String, (Long, Long, Long, List[String]))](a => {
  // transform to (member_id, (time, time, counter, events within session))
  (a._1, (a._2._1, a._2._2, 1, a._2._3))
}).reduceByKey((a, b) => {
  // transform to (member_id, (lowestStartTime, maxFinishTime, sumOfCounter, events within session))
  (Math.min(a._1, b._1), Math.max(a._2, b._2), a._3 + b._3, a._4 ++ b._4)
}).updateStateByKey(Utils.updateState)
The problems of updateStateByKey are nicely explained here. One of the key reasons why I decided to use mapWithState is because updateStateByKey was unable to return finished sessions (the ones that have timed out) for further processing.
This is my first attempt to transform the old code to the new version:
val spec = StateSpec.function(updateState _).timeout(Minutes(1))

val latestSessionInfo = membersSessions.map[(String, (Long, Long, Long, List[String]))](a => {
  // transform to (member_id, (time, time, counter, events within session))
  (a._1, (a._2._1, a._2._2, 1, a._2._3))
})

val userSessionSnapshots = latestSessionInfo.mapWithState(spec).snapshotStream()
I slightly misunderstand what the content of updateState should be, because as far as I understand the time-out should not be calculated manually (it was previously done in my function Utils.updateState), and .snapshotStream should return the timed-out sessions.
Assuming you're always waiting on a timeout of 2 minutes, you can make your mapWithState stream only output data once its timeout is triggered.
What would this mean for your code? It means you now need to watch for the timeout instead of outputting the tuple on every iteration. I would imagine your mapWithState function will look something along the lines of:
def updateState(key: String,
                value: Option[(Long, Long, Long, List[String])],
                state: State[(Long, Long, Long, List[String])]): Option[(Long, Long, Long, List[String])] = {

  // Merge two partial sessions: earliest start, latest end, summed counter, concatenated events
  def reduce(first: (Long, Long, Long, List[String]), second: (Long, Long, Long, List[String])) = {
    (Math.min(first._1, second._1), Math.max(first._2, second._2), first._3 + second._3, first._4 ++ second._4)
  }

  value match {
    case Some(currentValue) =>
      // New data for this key: fold it into the existing state and emit nothing yet
      val result = state
        .getOption()
        .map(currentState => reduce(currentState, currentValue))
        .getOrElse(currentValue)
      state.update(result)
      None
    case _ if state.isTimingOut() =>
      // Timeout fired for this key: emit the accumulated session downstream
      state.getOption()
    case _ =>
      // Defensive default; with mapWithState this branch should not normally be reached
      None
  }
}
This way, you only output something externally to the stream once the state has timed out; otherwise you keep aggregating inside the state.
This means that your Spark DStream graph can filter out all values which aren't defined, and only keep those which are:
latestSessionInfo
.mapWithState(spec)
.filter(_.isDefined)
After filter, you'll only have states which have timed out.
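From there, a rough sketch of how the finished sessions might be consumed (the .map(_.get) and the foreachRDD body are illustrative additions, not part of the original answer):

val timedOutSessions = latestSessionInfo
  .mapWithState(spec)
  .filter(_.isDefined)
  .map(_.get)   // (lowestStartTime, maxFinishTime, counter, events within session)

timedOutSessions.foreachRDD { rdd =>
  // e.g. persist or publish each finished session
  rdd.foreach(session => println(s"finished session: $session"))
}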

Flatten an RDD - Nested lists in value of key value pair

It took me a while to figure this out and I wanted to share my solution. Improvements are definitely welcome.
References: Flattening a Scala Map in an RDD, Spark Flatten Seq by reversing groupby, (i.e. repeat header for each sequence in it)
I have an RDD of the form: RDD[(Int, List[(String, List[(String, Int, Float)])])]
Key: Int
Value: List[(String, List[(String, Int, Float)])]
With a goal of flattening to: RDD[(Int, String, String, Int, Float)]
binHostCountByDate.foreach(println)
Gives the example:
(516361, List((2013-07-15, List((s2.rf.ru,1,0.5), (s1.rf.ru,1,0.5))), (2013-08-15, List((p.secure.com,1,1.0)))))
The final RDD should give the following
(516361,2013-07-15,s2.rf.ru,1,0.5)
(516361,2013-07-15,s1.rf.ru,1,0.5)
(516361,2013-08-15,p.secure.com,1,1.0)
It's a simple one-liner (and with destructuring in the for-comprehension we can use better names than _1, _2._1, etc., which makes it easier to be sure we're getting the right result):
// Use an outer List in place of an RDD for test purposes
val t = List((516361,
  List(("2013-07-15", List(("s2.rf.ru", 1, 0.5), ("s1.rf.ru", 1, 0.5))),
       ("2013-08-15", List(("p.secure.com", 1, 1.0))))))

t flatMap { case (k, xs) => for ((d, ys) <- xs; (dom, a, b) <- ys) yield (k, d, dom, a, b) }
//> res0: List[(Int, String, String, Int, Double)] =
//   List((516361,2013-07-15,s2.rf.ru,1,0.5),
//        (516361,2013-07-15,s1.rf.ru,1,0.5),
//        (516361,2013-08-15,p.secure.com,1,1.0))
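The same for-comprehension should apply unchanged to the RDD itself; a sketch, assuming the binHostCountByDate RDD from the question:

val flattened = binHostCountByDate.flatMap { case (k, xs) =>
  for ((d, ys) <- xs; (dom, a, b) <- ys) yield (k, d, dom, a, b)
}
// flattened: RDD[(Int, String, String, Int, Float)]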
My approach is as follows:
I flatten the first key value pair. This "removes" the first list.
val binHostCountForDate = binHostCountByDate.flatMapValues(identity)
Gives me an RDD of the form: RDD[(Int, (String, List[(String, Int, Float)]))]
binHostCountForDate.foreach(println)
(516361,(2013-07-15,List((s2.rf.ru,1,0.5), (s1.rf.ru,1,0.5))))
(516361,(2013-08-15,List((p.secure.com,1,1.0))))
Now I map the first two items into a tuple, creating a new key, and keep the second tuple as the value. Then I apply the same procedure as above to flatten the new key value pair.
val binDataRemapKey = binHostCountForDate.map(f =>((f._1, f._2._1), f._2._2)).flatMapValues(identity)
This gives the flattened RDD: RDD[((Int, String), (String, Int, Float))]
If this form is fine then we are done, but we can go one step further and remove the nested tuples to get the final form we were originally looking for.
val binData = binDataRemapKey.map(f => (f._1._1, f._1._2, f._2._1, f._2._2, f._2._3))
This gives us the final form of: RDD[(Int, String, String, Int, Float)]
We now have a flattened RDD that has preserved the parents of each list.
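For reference, a compact end-to-end sketch of the same chain, assuming a SparkContext sc and the single example record from above:

val binHostCountByDate = sc.parallelize(Seq(
  (516361, List(
    ("2013-07-15", List(("s2.rf.ru", 1, 0.5f), ("s1.rf.ru", 1, 0.5f))),
    ("2013-08-15", List(("p.secure.com", 1, 1.0f)))))))

val binData = binHostCountByDate
  .flatMapValues(identity)                                  // peel off the outer date list
  .map { case (id, (date, hosts)) => ((id, date), hosts) }  // re-key on (id, date)
  .flatMapValues(identity)                                  // peel off the inner host list
  .map { case ((id, date), (host, n, frac)) => (id, date, host, n, frac) }

binData.collect().foreach(println)
// (516361,2013-07-15,s2.rf.ru,1,0.5)
// (516361,2013-07-15,s1.rf.ru,1,0.5)
// (516361,2013-08-15,p.secure.com,1,1.0)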

Retrieving a list of tuples with the Scala anorm stream API

When I read the Play! docs I find a way to parse the result to a List[(String, String)], like this:
// Create an SQL query
val selectCountries = SQL("Select * from Country")
// Transform the resulting Stream[Row] as a List[(String,String)]
val countries = selectCountries().map(row =>
row[String]("code") -> row[String]("name")
).toList
I want to do this, but my tuple will contain more data.
I'm doing like this:
val getObjects = SQL("SELECT a, b, c, d, e, f, g FROM table")
val objects = getObjects().map(row =>
  row[Long]("a") -> row[Long]("b") -> row[Long]("c") -> row[String]("d") ->
    row[Long]("e") -> row[Long]("f") -> row[String]("g")
).toList
Each tuple I get will be in this format; of course, that's what I'm asking for in the code above:
((((((Long, Long), Long), String), Long), Long), String)
But I want this:
(Long, Long, Long, String, Long, Long, String)
What I'm asking is how I should parse the result to generate a tuple like the last one above. I want to do it like they do in the documentation for List[(String, String)], but with more data.
Thanks
You are getting ((((((Long, Long), Long), String), Long), Long), String) because of ->: each call to it wraps two elements into a pair. So with each -> you got a tuple, then you took that tuple and made a new pair with it, and so on. You need to replace the arrows with commas and wrap the whole thing in parentheses:
val objects = getObjects().map(row =>
  (row[Long]("a"), row[Long]("b"), ..., row[String]("g"))
).toList
But remember that tuples currently can have no more than 22 elements.
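For completeness, the full mapping with all seven columns from the query above written out might look like this (a sketch; column names a through g as in the question):

val objects: List[(Long, Long, Long, String, Long, Long, String)] =
  getObjects().map(row =>
    (row[Long]("a"), row[Long]("b"), row[Long]("c"), row[String]("d"),
     row[Long]("e"), row[Long]("f"), row[String]("g"))
  ).toList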

Why is headOption faster

I made a change to some code and it got 4.5x faster. I'm wondering why. It used to be essentially:
def doThing(queue: Queue[(String, String)]): Queue[(String, String)] = queue match {
  case Queue((thing, stuff), _*) => doThing(queue.tail)
  case _ => queue
}
and I changed it to this to get a huge speed boost:
def doThing(queue: Queue[(String, String)]): Queue[(String, String)] = queue.headOption match {
  case Some((thing, stuff)) => doThing(queue.tail)
  case _ => queue
}
What does _* do and why is it so expensive compared to headOption?
My guess, after running scalac with -Xprint:all: at the end of the patmat phase, in the queue match { case Queue((thing, stuff), _*) => doThing(queue.tail) } example, I see the following methods being called (edited for brevity):
val o9 = scala.collection.immutable.Queue.unapplySeq[(String, String)](x1);
if (o9.isEmpty.unary_!)
if (o9.get.!=(null).&&(o9.get.lengthCompare(1).>=(0)))
{
val p2: (String, String) = o9.get.apply(0);
val p3: Seq[(String, String)] = o9.get.drop(1);
So lengthCompare compares the length of the collection in a possibly optimized way. For Queue, it creates an iterator and iterates once, so that should be reasonably fast. On the other hand, drop(1) also constructs an iterator, skips one element and adds the rest of the elements to the result queue, so that is linear in the size of the collection.
The headOption example is more straightforward: it checks whether the queue is empty (two comparisons), and if not returns a Some(head), which then just has its _1 and _2 assigned to thing and stuff. So no iterators are created and nothing is linear in the length of the collection.
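For reference, headOption in the standard collections is essentially the following (paraphrased; the exact definition may differ slightly between Scala versions):

def headOption: Option[A] = if (isEmpty) None else Some(head)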
There should be no significant difference between your code samples.
case Queue((thing, stuff), _*) is actually translated by the compiler into a call to the head (apply(0)) method. You can use scalac -Xprint:patmat to investigate this:
<synthetic> val p2: (String, String) = o9.get.apply(0);
if (p2.ne(null))
matchEnd6(doThing(queue.tail))
The cost of head and the cost of headOption are almost the same.
The methods head, tail and dequeue can cause a reverse of the internal List of the Queue (with cost O(n)). In both of your code samples there will be at most 2 reverse calls.
You should use dequeue like this to get at most a single reverse call:
def doThing(queue: Queue[(String, String)]): Queue[(String, String)] =
  if (queue.isEmpty) queue
  else queue.dequeue match {
    case (e, q) => doThing(q)
  }
You could also replace (thing, stuff) with _. In this case the compiler will generate only a call to lengthCompare, without head or tail:
if (o9.get != null && o9.get.lengthCompare(1) >= 0)
_* is a sequence wildcard that matches all remaining elements, so what you are doing in the first version is deconstructing the Queue into a pair of Strings plus an appropriate number of further pairs of Strings, i.e. you are deconstructing the whole Queue even though you only care about the first element.
If you just remove the asterisk, giving
def doThing(queue: Queue[(String, String)]): Queue[(String, String)] = queue match {
  case Queue((thing, stuff), _) => doThing(queue.tail)
  case _ => queue
}
then you are only deconstructing the Queue into a pair of Strings, and a remainder (which thus does not need to be fully deconstructed). This should run in comparable time to your second version (haven't timed it myself, though).
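If you want to sanity-check the difference yourself, here is a rough timing sketch (not a rigorous benchmark; doThingMatch and doThingHeadOption are hypothetical renames of the question's two versions):

import scala.collection.immutable.Queue

// Crude wall-clock timer; ignores JIT warm-up and other benchmarking concerns.
def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
  result
}

val queue = Queue.fill(10000)(("thing", "stuff"))   // size chosen arbitrarily
time("pattern match with _*")(doThingMatch(queue))
time("headOption")(doThingHeadOption(queue))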