How to extract timed-out sessions using mapWithState - scala

I am updating my code to switch from updateStateByKey to mapWithState in order to get users' sessions based on a time-out of 2 minutes (2 is used for testing purposes only). Each session should aggregate all the streaming data (JSON strings) that arrives within the session before the time-out.
This was my old code:
val membersSessions = stream.map[(String, (Long, Long, List[String]))](eventRecord => {
  val parsed = Utils.parseJSON(eventRecord)
  val member_id = parsed.getOrElse("member_id", "")
  val timestamp = parsed.getOrElse("timestamp", "").toLong
  //The timestamp is returned twice because the first one will be used as the start time and the second one as the end time
  (member_id, (timestamp, timestamp, List(eventRecord)))
})

val latestSessionInfo = membersSessions.map[(String, (Long, Long, Long, List[String]))](a => {
  //transform to (member_id, (time, time, counter, events within session))
  (a._1, (a._2._1, a._2._2, 1, a._2._3))
}).reduceByKey((a, b) => {
  //transform to (member_id, (lowestStartTime, MaxFinishTime, sumOfCounter, events within session))
  (Math.min(a._1, b._1), Math.max(a._2, b._2), a._3 + b._3, a._4 ++ b._4)
}).updateStateByKey(Utils.updateState)
The problems of updateStateByKey are nicely explained here. One of the key reasons why I decided to use mapWithState is because updateStateByKey was unable to return finished sessions (the ones that have timed out) for further processing.
This is my first attempt to transform the old code to the new version:
val spec = StateSpec.function(updateState _).timeout(Minutes(1))
val latestSessionInfo = membersSessions.map[(String, (Long, Long, Long, List[String]))](a => {
  //transform to (member_id, (time, time, counter, events within session))
  (a._1, (a._2._1, a._2._2, 1, a._2._3))
})
val userSessionSnapshots = latestSessionInfo.mapWithState(spec).snapshotStream()
I don't quite understand what the content of updateState should be, because as far as I understand the time-out should no longer be calculated manually (it was previously done in my function Utils.updateState), and .snapshotStream should return the timed-out sessions.

Assuming you're always waiting on a timeout of 2 minutes, you can make your mapWithState stream only output data once its timeout is triggered.
What would this mean for your code? It means that you now need to watch for the timeout instead of outputting the tuple on every iteration. I would imagine your mapWithState function will look something along the lines of:
def updateState(key: String,
                value: Option[(Long, Long, Long, List[String])],
                state: State[(Long, Long, Long, List[String])]): Option[(Long, Long, Long, List[String])] = {

  def reduce(first: (Long, Long, Long, List[String]), second: (Long, Long, Long, List[String])) =
    (Math.min(first._1, second._1), Math.max(first._2, second._2), first._3 + second._3, first._4 ++ second._4)

  value match {
    case Some(currentValue) =>
      // New data for this key: merge it into the accumulated state, emit nothing yet
      val result = state
        .getOption()
        .map(currentState => reduce(currentState, currentValue))
        .getOrElse(currentValue)
      state.update(result)
      None
    case _ if state.isTimingOut() =>
      // Timeout fired for this key: emit the accumulated session downstream
      state.getOption()
    case _ =>
      None
  }
}
This way, you only output something to the stream when the state has timed out; otherwise, you keep aggregating it inside the state.
This means that your Spark DStream graph can filter out all the values which aren't defined, and only keep those which are:
latestSessionInfo
  .mapWithState(spec)
  .filter(_.isDefined)
After filter, you'll only have states which have timed out.
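From there, a minimal sketch of what consuming the timed-out sessions could look like, assuming the spec and updateState shown above; the println is just a stand-in for whatever further processing you need:

latestSessionInfo
  .mapWithState(spec)
  .filter(_.isDefined)
  .map(_.get) // (lowestStartTime, maxFinishTime, eventCount, events) of each expired session
  .foreachRDD { rdd =>
    rdd.foreach(session => println(s"finished session: $session"))
  }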

Looking for an alternate form of windowAll() that keeps data on the same node for aggregations

I have a highly parallelized aggregation with a lot of keys that I am running across multiple nodes. I then want to do a summary aggregation across all values, similar to the code below:
val myStream = sourceStream
  .keyBy(0)
  .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
  .reduce(_ + _)

myStream.addSink(new OtherSink)

val summaryStream = myStream
  .map(Row.fromOtherRow(_))
  // parallelism is 1 by definition
  .windowAll(TumblingProcessingTimeWindows.of(Time.minutes(5)))
  .reduce(_ + _)
  .addSink(new RowSink)
This works fine, but I notice the node that ends up doing the windowAll() gets a tremendous amount of inbound network traffic as well as a significant spike on that node's CPU. This is obviously because all of the data is being aggregated together and the parallelism is '1'.
Are there any current or planned provisions in Flink for a two-tier summary aggregation that would keep the data on each node and pre-aggregate it before sending the results to a second tier for the final aggregation? Here is some pseudo code for what I would have hoped to find:
val myStream = sourceStream
  .keyBy(0)
  .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
  .reduce(_ + _)

myStream.addSink(new OtherSink)

val summaryStream = myStream
  .map(Row.fromOtherRow(_))
  // parallelism would be at the default for the env
  .windowLocal(TumblingProcessingTimeWindows.of(Time.minutes(5)))
  .reduce(_ + _)
  // parallelism is 1 by definition
  .windowAll(TumblingProcessingTimeWindows.of(Time.minutes(5)))
  .reduce(_ + _)
  .addSink(new RowSink)
I called it 'windowLocal()', but I am sure there could be a better name. It would be non-keyed, just like windowAll(). The key benefit is that it would reduce the network, CPU, and memory hit that windowAll() takes, by distributing the pre-aggregation across all of the nodes you are running. I currently have to allocate more resources to my nodes to accommodate this summarization.
If this can be accomplished in some other way with the current version, I would love to hear about it. I have already thought about using a random value as the key for the second tier (see the sketch below), but I believe that would result in a full rebalance of the data, so it solves my CPU and memory issue but not the network one. I am looking for something in the same vein as rescale(), where the data stays local to the task manager or the slot.
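For reference, a rough sketch of that random-key idea, reusing the Row type and Row.fromOtherRow from above; the choice of 16 synthetic keys is arbitrary, the two window stages are only loosely aligned because both use processing time, and every record still gets shuffled once:

import java.util.concurrent.ThreadLocalRandom

val preAggregated = myStream
  .map(Row.fromOtherRow(_))
  // spread rows over a small, fixed set of synthetic keys
  .map(row => (ThreadLocalRandom.current().nextInt(16), row))
  .keyBy(_._1)
  .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
  .reduce((a, b) => (a._1, a._2 + b._2))
  .map(_._2)

// the final windowAll now sees at most 16 pre-aggregated rows per window
preAggregated
  .windowAll(TumblingProcessingTimeWindows.of(Time.minutes(5)))
  .reduce(_ + _)
  .addSink(new RowSink)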
Incremental Window Aggregation with FoldFunction
The following example shows how an incremental FoldFunction can be combined with a WindowFunction to extract the number of events in the window and return also the key and end time of the window.
val input: DataStream[SensorReading] = ...

input
  .keyBy(<key selector>)
  .timeWindow(<window assigner>)
  .fold (
    ("", 0L, 0),
    (acc: (String, Long, Int), r: SensorReading) => { ("", 0L, acc._3 + 1) },
    ( key: String,
      window: TimeWindow,
      counts: Iterable[(String, Long, Int)],
      out: Collector[(String, Long, Int)] ) =>
      {
        val count = counts.iterator.next()
        out.collect((key, window.getEnd, count._3))
      }
  )
Incremental Window Aggregation with ReduceFunction
The following example shows how an incremental ReduceFunction can be combined with a WindowFunction to return the smallest event in a window along with the start time of the window.
val input: DataStream[SensorReading] = ...

input
  .keyBy(<key selector>)
  .timeWindow(<window assigner>)
  .reduce(
    (r1: SensorReading, r2: SensorReading) => { if (r1.value > r2.value) r2 else r1 },
    ( key: String,
      window: TimeWindow,
      minReadings: Iterable[SensorReading],
      out: Collector[(Long, SensorReading)] ) =>
      {
        val min = minReadings.iterator.next()
        out.collect((window.getStart, min))
      }
  )
If you want more, see the Flink windowing documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/windows.html

How to log flow rate in Akka Stream?

I have an Akka Stream application with a single flow/graph. I want to measure the flow rate at the source and log it every 5 seconds, like 'received 3 messages in the last 5 seconds'. I tried:
someOtherFlow
  .groupedWithin(Integer.MAX_VALUE, 5 seconds)
  .runForeach(seq =>
    log.debug(s"received ${seq.length} messages in the last 5 seconds")
  )
but it only produces output when there are messages; there is no empty list when there are 0 messages. I want the 0s as well. Is this possible?
You could try something like
src
  .conflateWithSeed(_ ⇒ 1) { case (acc, _) ⇒ acc + 1 }
  .zip(Source.tick(5.seconds, 5.seconds, NotUsed))
  .map(_._1)
which should batch your elements until the tick releases them. This is inspired by an example in the docs.
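For example, a minimal sketch of wiring this up to log the count, assuming the src, log, and implicit materializer from the question's context; note that an interval with no incoming elements still produces no output here, since conflateWithSeed only emits after at least one element has arrived:

src
  .conflateWithSeed(_ => 1) { case (acc, _) => acc + 1 }
  .zip(Source.tick(5.seconds, 5.seconds, NotUsed))
  .map(_._1)
  .runForeach(count => log.debug(s"received $count messages in the last 5 seconds"))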
On a different note, if you need this for monitoring purposes, you could leverage a third-party tool such as Kamon.
A sample of Akka Stream logging:
import akka.{Done, NotUsed}
import akka.actor.ActorSystem
import akka.event.{Logging, LoggingAdapter}
import akka.stream.{ActorMaterializer, Attributes}
import akka.stream.scaladsl.{Sink, Source}

import scala.concurrent.{ExecutionContextExecutor, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Random, Success}

implicit val system: ActorSystem = ActorSystem("StreamLoggingActorSystem")
implicit val materializer: ActorMaterializer = ActorMaterializer()
implicit val adapter: LoggingAdapter = Logging(system, "customLogger")
implicit val ec: ExecutionContextExecutor = system.dispatcher

def randomInt = Random.nextInt()

val source = Source.repeat(NotUsed).map(_ ⇒ randomInt)

val logger = source
  .groupedWithin(Integer.MAX_VALUE, 5.seconds)
  .log(s"in the last 5 seconds number of messages received : ", _.size)
  .withAttributes(
    Attributes.logLevels(
      onElement = Logging.WarningLevel,
      onFinish = Logging.InfoLevel,
      onFailure = Logging.DebugLevel
    )
  )

val sink = Sink.ignore

val result: Future[Done] = logger.runWith(sink)

result.onComplete {
  case Success(_) =>
    println("end of stream")
  case Failure(_) =>
    println("stream ended with failure")
}
source code is here.
Extending Stefano's answer a little, I created the following flows:
def flowRate[T](metric: T => Int = (_: T) => 1,
                outputDelay: FiniteDuration = 1.second): Flow[T, Double, NotUsed] =
  Flow[T]
    .conflateWithSeed(metric(_)) { case (acc, x) ⇒ acc + metric(x) }
    .zip(Source.tick(outputDelay, outputDelay, NotUsed))
    .map(_._1.toDouble / outputDelay.toUnit(SECONDS))

def printFlowRate[T](name: String,
                     metric: T => Int = (_: T) => 1,
                     outputDelay: FiniteDuration = 1.second): Flow[T, T, NotUsed] =
  Flow[T]
    .alsoTo(flowRate[T](metric, outputDelay)
      .to(Sink.foreach(r => log.info(s"Rate($name): $r"))))
The first converts the flow into a rate per second. You can supply a metric which gives a value to each object passing through. Say you want to measure the rate of characters in a flow of strings; then you could pass _.length. The second parameter is the delay between flow rate reports (it defaults to one second).
The second flow can be used inline to print the flow rate for debugging purposes, without modifying the values passing through the stream, e.g.
stringFlow
  .via(printFlowRate[String]("Char rate", _.length, 10.seconds))
  .map(_.toLowerCase) // still a string
  ...
which will show, every 10 seconds, the average rate (per second) of characters.
N.B. The above flowRate would, however, lag one outputDelay period behind, because the zip consumes from the conflate and then waits for a tick (which can easily be verified by putting a log stage after the conflateWithSeed). To obtain a non-lagging flow rate, one could duplicate the tick, forcing the zip to consume a second, fresh element from the conflate, and then aggregate both ticks, i.e.:
Flow[T]
  .conflateWithSeed(metric(_)) { case (acc, x) => acc + metric(x) }
  .zip(Source.tick(outputDelay, outputDelay, NotUsed)
    .mapConcat(_ => Seq(NotUsed, NotUsed))
  )
  .grouped(2)
  .map {
    case Seq((a, _), (b, _)) => a + b
  }
  .map(_.toDouble / outputDelay.toUnit(SECONDS))

Anyone had issues with subtraction of a long within an RDD

I am having an issue with the subtraction of a long within an RDD to filter out items in the RDD that are within a certain time range.
So my code filters an RDD of Auction case classes against successfulAuctions, a list of (Long, Int, String) tuples:
auctions.filter(it => relevantAuctions(it, successfulAuctions))
Each successful auction is made up of a timestamp: Long, an itemID: Int, and a direction: String (BUY/SELL).
The relevantAuctions function basically uses tail recursion to find the auctions in a time range for the exact item and direction.
@tailrec
def relevantAuctions(auction: Auction, successfulAuctions: List[(Long, String, String)]): Boolean =
  successfulAuctions match {
    case sample :: xs => if (isRelevantAuction(auction, sample)) true else relevantAuctions(auction, xs)
    case Nil => false
  }
This then feeds into another method, called in the if statement, which checks that the sample's timestamp is within a 10 ms range and that the item ID and direction are the same.
def isRelevantAuction(auction: Auction, successfulAuction: (Long, String, String)): Boolean = {
  (successfulAuction.timestampNanos - auction.timestampNanos) >= 0 &&
  (successfulAuction.timestampNanos - auction.timestampNanos) < 10000000L &&
  auction.itemID == successfulAuction.itemID &&
  auction.direction == successfulAuction.direction
}
I am having issues where the range check is not entirely working: the timestamps I am receiving back are not within the required range, although the item ID and direction checks seem to be working correctly.
The results I am getting are as follows: when I have a timestamp of 1431651108749267459 for the successful auction, I am receiving other auctions with a time GREATER than this, where it should be less.
The auctions I am receiving have timestamps of:
1431651108749326603
1431651108749330732
1431651108749537901
Has anyone experienced this phenomenon?
Thanks!
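For what it's worth, the comparison can be checked in isolation outside Spark. Below is a minimal sketch using the timestamps quoted above; the Auction case class shape, item ID 1, and direction "BUY" are made-up stand-ins for the real data:

// hypothetical stand-ins for the real types, just to exercise the comparison
case class Auction(timestampNanos: Long, itemID: Int, direction: String)

def isRelevantAuction(auction: Auction, successfulAuction: (Long, Int, String)): Boolean =
  (successfulAuction._1 - auction.timestampNanos) >= 0 &&
  (successfulAuction._1 - auction.timestampNanos) < 10000000L &&
  auction.itemID == successfulAuction._2 &&
  auction.direction == successfulAuction._3

val successful = (1431651108749267459L, 1, "BUY")

Seq(1431651108749326603L, 1431651108749330732L, 1431651108749537901L)
  .map(ts => Auction(ts, 1, "BUY"))
  .foreach(a => println(s"${a.timestampNanos} -> ${isRelevantAuction(a, successful)}"))
// all three print false: each timestamp is greater than the successful auction's,
// so the (>= 0) condition fails and the Long subtraction itself behaves as expected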

Spark Streaming groupByKey and updateStateByKey implementation

I am trying to run stateful Spark Streaming computations over (fake) Apache web server logs read from Kafka. The goal is to "sessionize" the web traffic, similar to this blog post.
The only difference is that I want to "sessionize" each page the IP hits, instead of the entire session. I was able to do this reading from a file of fake web traffic using Spark in batch mode, but now I want to do it in a streaming context.
Log files are read from Kafka and parsed into K/V pairs of (String, (String, Long, Long)) or
(IP, (requestPage, time, time)).
I then call groupByKey() on this K/V pair. In batch mode, this would produce a:
(String, CollectionBuffer((String, Long, Long), ...)) or
(IP, CollectionBuffer((requestPage, time, time), ...))
In a StreamingContext, it produces a:
(String, ArrayBuffer((String, Long, Long), ...)), like so:
(183.196.254.131,ArrayBuffer((/test.php,1418849762000,1418849762000)))
However, as the next microbatch (DStream) arrives, this information is discarded.
Ultimately what I want is for that ArrayBuffer to fill up over time as a given IP continues to interact and to run some computations on its data to "sessionize" the page time.
I believe the operator to make that happen is "updateStateByKey." I'm having some trouble with this operator (I'm new to both Spark & Scala);
any help is appreciated.
Thus far:
val grouped = ipTimeStamp.groupByKey().updateStateByKey(updateGroupByKey)

def updateGroupByKey(
    a: Seq[(String, ArrayBuffer[(String, Long, Long)])],
    b: Option[(String, ArrayBuffer[(String, Long, Long)])]
): Option[(String, ArrayBuffer[(String, Long, Long)])] = {
}
I think you are looking for something like this:
def updateGroupByKey(
    newValues: Seq[(String, ArrayBuffer[(String, Long, Long)])],
    currentValue: Option[(String, ArrayBuffer[(String, Long, Long)])]
): Option[(String, ArrayBuffer[(String, Long, Long)])] = {
  //Collect the values
  val buffs: Seq[ArrayBuffer[(String, Long, Long)]] = for (v <- newValues) yield v._2
  val buffs2 = if (currentValue.isEmpty) buffs else currentValue.get._2 +: buffs
  //Convert state to buffer
  if (buffs2.isEmpty) None
  else {
    val key = if (currentValue.isEmpty) newValues(0)._1 else currentValue.get._1
    Some((key, buffs2.foldLeft(new ArrayBuffer[(String, Long, Long)])((v, a) => v ++ a)))
  }
}
Gabor's answer got me started down the right path, but here is an answer that produces the expected output.
First, for the output I want:
(100.40.49.235,List((/,1418934075000,1418934075000), (/,1418934105000,1418934105000), (/contactus.html,1418934174000,1418934174000)))
I don't need groupByKey(). updateStateByKey already accumulates the values into a Seq, so the addition of groupByKey is unnecessary (and expensive). Spark users strongly suggest not using groupByKey.
Here is the code that worked:
def updateValues(newValues: Seq[(String, Long, Long)],
                 currentValue: Option[Seq[(String, Long, Long)]]
                ): Option[Seq[(String, Long, Long)]] = {
  Some(currentValue.getOrElse(Seq.empty) ++ newValues)
}

val grouped = ipTimeStamp.updateStateByKey(updateValues)
Here updateStateByKey is passed a function (updateValues) that receives the new values arriving in the current batch (newValues) as well as an Option holding the state accumulated so far (currentValue), and it returns the combination of the two. getOrElse is required as currentValue may occasionally be empty. Credit to https://twitter.com/granturing for the correct code.
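For completeness, a rough sketch of how this might be wired into the streaming job described in the question; ssc, lines, parseLogLine, and the checkpoint path are assumed/placeholder names, and note that updateStateByKey requires checkpointing to be enabled:

import org.apache.spark.streaming.dstream.DStream

// assumes an existing StreamingContext `ssc` and a DStream[String] `lines` read from Kafka
ssc.checkpoint("/tmp/sessionize-checkpoint") // placeholder path; required for stateful operations

// parseLogLine stands in for the existing parsing step that turns a raw log line
// into (IP, (requestPage, time, time))
val ipTimeStamp: DStream[(String, (String, Long, Long))] = lines.map(parseLogLine)

val sessions = ipTimeStamp.updateStateByKey(updateValues)

// prints e.g. (100.40.49.235,List((/,1418934075000,1418934075000), ...))
sessions.print()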

Bucketed Sink in scalaz-stream

I am trying to make a sink that would write a stream to bucketed files: when a particular condition (time, size of file, etc.) is reached, the current output stream is closed and a new one is opened for a new bucket file.
I checked how the different sinks were created in the io object, but there aren't many examples, so I tried to follow how resource and chunkW were written. I ended up with the following bit of code, where for simplicity buckets are just represented by an Int for now, but would eventually be some kind of output stream.
val buckets: Channel[Task, String, Int] = {

  //recursion to step through the stream
  def go(step: Task[String => Task[Int]]): Process[Task, String => Task[Int]] = {

    // Emit the value and repeat
    def next(msg: String => Task[Int]) =
      Process.emit(msg) ++
        go(step)

    Process.await[Task, String => Task[Int], String => Task[Int]](step)(
      next
      , Process.halt // TODO ???
      , Process.halt) // TODO ???
  }

  //starting bucket
  val acquire: Task[Int] = Task.delay {
    val startBuck = nextBucket(0)
    println(s"opening bucket $startBuck")
    startBuck
  }

  //the write step
  def step(os: Int): Task[String => Task[Int]] =
    Task.now((msg: String) => Task.delay {
      write(os, msg)
      val newBuck = nextBucket(os)
      if (newBuck != os) {
        println(s"closing bucket $os")
        println(s"opening bucket $newBuck")
      }
      newBuck
    })

  //start the Channel
  Process.await(acquire)(
    buck => go(step(buck))
    , Process.halt, Process.halt)
}
def write(bucket: Int, msg: String) { println(s"$bucket\t$msg") }
def nextBucket(b: Int) = b+1
There are a number of issues with this:
step is passed the bucket once at the start, and this never changes during the recursion. I am not sure how, in the recursive go, to create a new step task that uses the bucket (Int) returned by the previous task, as I have to provide a String to get to that task.
the fallback and cleanup arguments of the await calls do not receive the result of rcv (if there is one). In the io.resource function this works fine, as the resource is fixed; however, in my case the resource might change at any step. How would I go about passing the reference to the currently open bucket to these callbacks?
Well, one of the options (time) may be to use a simple go on the sink. This one is time-based, essentially reopening the file every hour:
val metronome = Process.awakeEvery(1.hour).map(_ => true)

def writeFileSink(file: String): Sink[Task, ByteVector] = ???

def timeBasedSink(prefix: String) = {
  def go(index: Int): Sink[Task, ByteVector] =
    metronome.wye(writeFileSink(prefix + "_" + index))(wye.interrupt) ++ go(index + 1)

  go(0)
}
For the other options (e.g. bytes written) you can use a similar technique: keep a signal of the bytes written and combine it with the Sink.
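A minimal usage sketch under the same assumptions (the source of ByteVector chunks, the output path prefix, and the real writeFileSink implementation are placeholders):

// hypothetical stream of chunks to be written
val data: Process[Task, ByteVector] = ???

// route the stream through the time-based, bucketing sink and run it
data.to(timeBasedSink("/tmp/bucketed/output")).run.run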