Given distinct events, with some mapping to the same ID, how can I find the most recent event for each ID and sum the results using Algebird?

I have been struggling for days to find a solution and so I'm hoping someone with more Algebird experience can help!
I have a stream of events I'm aggregating using Algebird, where each event represents an attempt to perform some task. Consider the following data structure to represent each attempt:
case class TaskAttempt(
  taskId: String,
  time: Int,
  `type`: String, // backticks needed: `type` is a reserved word in Scala
  value: Long,
  valueUnit: String
)
I am aggregating these attempts from a stream, and there is no guarantee that an attempt to perform a task will succeed. In the case that an attempt fails, I expect additional attempts for the same task. The aggregation I'm trying to build does the following:
Collect only the most recent attempt (based on the TaskAttempt.time field) for each task ID. Assume larger values for TaskAttempt.time mean the event happened more recently. All previous events for each task will be ignored.
Sum the TaskAttempt.value field from the TaskAttempt instances collected in step 1 into a Map(type -> Map(valueUnit -> valueSum)). This means that in the end, all values from each most recent task attempt will be summed if their type and valueUnit fields are equal.
I was hoping to accomplish the above using something like the following, but .flatMap() cannot be called on an Algebird Preparer after calling .reduce() because the latter returns a MonoidAggregator rather than a Preparer. Regardless, here is some non-working code to further show what I'd like to accomplish:
Preparer[TaskAttempt]
  // Wrap each attempt in a single-element Set
  .flatMap(attempt => Set(attempt))
  // Reduce by grouping TaskAttempts by taskId and then collecting the
  // attempt with the largest value for time for each taskId
  .reduce { (l1: Set[TaskAttempt], l2: Set[TaskAttempt]) =>
    (l1 ++ l2)
      .groupBy(_.taskId)
      .map { case (_, attempts) => attempts.maxBy(_.time) }
      .toSet
  }
  // Map the remaining filtered attempts to the required Map -- but this
  // .flatMap does not exist, because .reduce has already returned an
  // aggregator rather than a Preparer
  .flatMap { attempt =>
    Map(attempt.`type` -> Map(attempt.valueUnit -> attempt.value))
  }
  .sum
Ultimately, I must provide the framework I'm using for the stream aggregation (an internal tool built on top of Twitter's Summingbird) with a MonoidAggregator[TaskAttempt, Map[String, Map[String, Long]], Map[String, Map[String, Long]]] that aggregates the data as described above. How can I accomplish this? Any other ideas for how I could make this work?

I decided that rather than attempting to dedupe, I should avoid the need to dedupe altogether. I did this by adding additional "negative" task attempts to the topic which negate failed "positive" task attempts that come before them in the stream. By doing this, I can sum all of the events in the stream without worry of double counting due to the presence of multiple attempts for a single task.
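For completeness, here is a sketch of what a direct "keep the latest attempt per task" aggregator could look like (my own construction, untested; note that its intermediate type is Map[String, TaskAttempt] rather than the nested Map required by the signature above, since the dedupe state has to flow through the monoid):

import com.twitter.algebird._

// Semigroup that keeps the attempt with the larger time when two attempts
// collide on the same taskId key
implicit val latestAttempt: Semigroup[TaskAttempt] =
  Semigroup.from((l, r) => if (l.time >= r.time) l else r)

val aggregator: MonoidAggregator[TaskAttempt, Map[String, TaskAttempt], Map[String, Map[String, Long]]] =
  Aggregator
    // Lift each attempt into a one-entry Map; the map monoid resolves key
    // collisions with the Semigroup above, i.e. keeps the most recent attempt
    .prepareMonoid { a: TaskAttempt => Map(a.taskId -> a) }
    // After summing, project the surviving attempts into the nested sum Map
    .andThenPresent { latestByTask =>
      Monoid.sum(
        latestByTask.values.map(a => Map(a.`type` -> Map(a.valueUnit -> a.value)))
      )
    }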

Related

Scala Spark not returning value outside loop

I am new to Scala and Spark and would like some help in understanding why the below code isn't producing my desired outcome.
I am comparing two tables.
My desired output schema is:
case class DiscrepancyData(fieldKey:String, fieldName:String, val1:String, val2:String, valExpected:String)
When I run the below code step by step manually, I actually end up with my desired outcome: a List[DiscrepancyData] completely populated with my desired output. However, I must be missing something in the code below, because it returns an empty list (before this code gets called, other code reads the tables from HIVE, maps, groups, filters, etc.):
val compareCols = Set("year", "nominal", "adjusted_for_inflation", "average_private_nonsupervisory_wage")
val key = "year"

def compare(table: RDD[(String, Iterable[Row])]): List[DiscrepancyData] = {
  var discs: ListBuffer[DiscrepancyData] = ListBuffer()

  def compareFields(fieldOne: String, fieldTwo: String, colName: String, row1: Row, row2: Row): DiscrepancyData = {
    if (fieldOne != fieldTwo) {
      DiscrepancyData(
        row1.getAs(key).toString,     // fieldKey
        colName,                      // fieldName
        row1.getAs(colName).toString, // table1Value
        row2.getAs(colName).toString, // table2Value
        row2.getAs(colName).toString) // expectedValue
    }
    else null
  }

  def comparison() {
    for (row <- table) {
      var elem1 = row._2.head      // gets the first element in the iterable
      var elem2 = row._2.tail.head // gets the second element in the iterable
      for (col <- compareCols) {
        var value1 = elem1.getAs(col).toString
        var value2 = elem2.getAs(col).toString
        var disc = compareFields(value1, value2, col, elem1, elem2)
        if (disc != null) discs += disc
      }
    }
  }

  comparison()
  discs.toList
}
I'm calling the above function as such:
var outcome = compare(groupedFiltered)
Here is the data in groupedFiltered:
(1991,CompactBuffer([1991,7.14,5.72,39%], [1991,4.14,5.72,39%]))
(1997,CompactBuffer([1997,4.88,5.86,39%], [1997,3.88,5.86,39%]))
(1999,CompactBuffer([1999,5.15,5.96,39%], [1999,5.15,5.97,38%]))
(1947,CompactBuffer([1947,0.9,2.94,35%], [1947,0.4,2.94,35%]))
(1980,CompactBuffer([1980,3.1,6.88,45%], [1980,3.1,6.88,48%]))
(1981,CompactBuffer([1981,3.15,6.8,45%], [1981,3.35,6.8,45%]))
The table schema for groupedFiltered:
(year String,
nominal Double,
adjusted_for_inflation Double,
average_private_nonsupervisory_wage String)
Spark is a distributed computing engine. In addition to "what the code is doing", as in classic single-node computing, with Spark we also need to consider "where the code is running".
Let's inspect a simplified version of the expression above:
val records: RDD[List[String]] = ??? // whatever data
val list = mutable.ListBuffer[String]()

for {
  record <- records
  entry  <- record
} { list += entry }
The Scala for-comprehension makes this expression look like a natural local computation, but in reality the RDD operations are serialized and "shipped" to executors, where the inner operation is executed locally. We can rewrite the above like this:
records.foreach { record =>  // RDD.foreach => serializes closure and executes remotely
  record.foreach { entry =>  // record.foreach => local operation on the record collection
    list += entry            // this mutable buffer is updated in each executor but never
  }                          // sent back to the driver -- all updates are lost
}
Mutable objects are in general a no-go in distributed computing. Imagine that one executor adds a record and another one removes it: what's the correct result? Or that each executor arrives at a different value: which is the right one?
To implement the operation above, we need to transform the data into our desired result.
I'd start by applying another best practice: do not use null as a return value. I also moved the row ops into the function. Let's rewrite the comparison operation with this in mind:
def compareFields(colName: String, row1: Row, row2: Row): Option[DiscrepancyData] = {
  val key = "year"
  val v1 = row1.getAs(colName).toString
  val v2 = row2.getAs(colName).toString
  if (v1 != v2) {
    Some(DiscrepancyData(
      row1.getAs(key).toString, // fieldKey
      colName,                  // fieldName
      v1,                       // table1Value
      v2,                       // table2Value
      v2))                      // expectedValue
  } else None
}
Now, we can rewrite the computation of discrepancies as a transformation of the initial table data:
val discrepancies = table.flatMap { case (key, rows) =>
  val r1 = rows.head      // first of the two rows to compare
  val r2 = rows.tail.head // second of the two
  compareCols.flatMap(col => compareFields(col, r1, r2))
}
We can also use the for-comprehension notation, now that we understand where things are running:
val discrepancies = for {
  (key, rows) <- table
  col         <- compareCols
  dis         <- compareFields(col, rows.head, rows.tail.head)
} yield dis
Note that discrepancies is of type RDD[DiscrepancyData]. If we want to get the actual values back to the driver we need to:
val materializedDiscrepancies = discrepancies.collect()
Iterating through an RDD and updating a mutable structure defined outside the loop is a Spark anti-pattern.
Imagine this RDD being spread over 200 machines. How can these machines update the same Buffer? They cannot. Each JVM will see its own discs: ListBuffer[DiscrepancyData]. At the end, your result will not be what you expect.
To conclude, this is perfectly valid (though not idiomatic) Scala code, but not valid Spark code. If you replace the RDD with an Array it will work as expected.
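For instance, as a quick local sketch (hypothetical data, not from the original answer):

import scala.collection.mutable.ListBuffer

// A plain local collection in place of the RDD
val records: Array[List[String]] = Array(List("a", "b"), List("c"))

val list = ListBuffer[String]()
records.foreach(record => record.foreach(entry => list += entry))
// list now contains "a", "b", "c": every update happened in this one JVM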
Try to have a more functional implementation along these lines:
val finalRDD: RDD[DiscrepancyData] = table.map(???).filter(???)

Sink for line-by-line file IO with backpressure

I have a file processing job that currently uses akka actors with manually managed backpressure to handle the processing pipeline, but I've never been able to successfully manage the backpressure at the input file reading stage.
This job takes an input file and groups lines by an ID number present at the start of each line; once it hits a line with a new ID number, it pushes the grouped lines to a processing actor via a message, then continues with the new ID number, all the way to the end of the file.
This seems like it would be a good use case for Akka Streams, using the File as a sink, but I'm still not sure of three things:
1) How can I read the file line by line?
2) How can I group by the ID present on every line? I currently use very imperative processing for this, and I don't think I'll have the same ability in a stream pipeline.
3) How can I apply backpressure, such that I don't keep reading lines into memory faster than I can process the data downstream?
Akka Streams' groupBy is one approach, but groupBy has a maxSubstreams parameter which would require you to know the maximum number of IDs up front. So the solution below uses scan to identify same-ID blocks, and splitWhen to split them into substreams:
import java.io.File
import akka.Done
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{FileIO, Framing}
import akka.util.ByteString
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object Main extends App {
  implicit val system = ActorSystem("system")
  implicit val materializer = ActorMaterializer()

  def extractId(s: String) = {
    val a = s.split(",")
    a(0) -> a(1)
  }

  val file = new File("/tmp/example.csv")

  private val lineByLineSource = FileIO.fromFile(file)
    .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 1024))
    .map(_.utf8String)

  val future: Future[Done] = lineByLineSource
    .map(extractId)
    .scan((false, "", ""))((l, r) => (l._2 != r._1, r._1, r._2)) // flag start-of-ID-range
    .drop(1)          // drop the scan primer
    .splitWhen(_._1)  // new substream at each start-of-range
    .fold(("", Seq[String]()))((l, r) => (r._2, l._2 ++ Seq(r._3)))
    .concatSubstreams
    .runForeach(println)

  private val reply = Await.result(future, 10.seconds)
  println(s"Received $reply")
  Await.ready(system.terminate(), 10.seconds)
}
extractId splits lines into id -> data tuples. scan prepends each id -> data tuple with a start-of-ID-range flag. The drop drops the primer element produced by scan. splitWhen starts a new substream for each start-of-range. fold collects each substream into a list and removes the start-of-ID-range boolean, so that each substream produces a single element. In place of the fold you probably want a custom SubFlow which processes the stream of rows for a single ID and emits some result for the ID range. concatSubstreams merges the per-ID-range substreams produced by splitWhen back into a single stream that's printed by runForeach.
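As a sketch of that idea, a hypothetical drop-in replacement for the fold that emits (id, lineCount) per ID range instead of collecting the lines:

// after .splitWhen(_._1), elements are (startOfRange, id, line) tuples
.fold(("", 0)) { case ((_, count), (_, id, _)) => (id, count + 1) }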
Run with:
$ cat /tmp/example.csv
ID1,some input
ID1,some more input
ID1,last of ID1
ID2,one line of ID2
ID3,2nd before eof
ID3,eof
Output is:
(ID1,List(some input, some more input, last of ID1))
(ID2,List(one line of ID2))
(ID3,List(2nd before eof, eof))
It appears that the easiest way to add "back pressure" to your system without introducing huge modifications is to simply change the mailbox type of the Actor that consumes the input groups to a BoundedMailbox with a high mailbox-push-timeout-time:
bounded-mailbox {
  mailbox-type = "akka.dispatch.BoundedDequeBasedMailbox"
  mailbox-capacity = 1
  mailbox-push-timeout-time = 1h
}
val actor = system.actorOf(Props(classOf[InputGroupsConsumingActor]).withMailbox("bounded-mailbox"))
Create an iterator from your file, then a grouped (by ID) iterator from that iterator. Then just cycle through the data, sending groups to the consuming Actor. Note that the send will block in this case, when the Actor's mailbox gets full.
def iterGroupBy[A, K](iter: Iterator[A])(keyFun: A => K): Iterator[Seq[A]] = {
  def rec(s: Stream[A]): Stream[Seq[A]] =
    if (s.isEmpty) Stream.empty
    else s.span(keyFun(s.head) == keyFun(_)) match {
      case (prefix, suffix) => prefix.toList #:: rec(suffix)
    }
  rec(iter.toStream).toIterator
}
val lines = Source.fromFile("input.file").getLines()

iterGroupBy(lines) { l => l.headOption }.foreach { lines: Seq[String] =>
  actor.tell(lines, ActorRef.noSender)
}
That's it!
You probably want to move the file-reading to a separate thread, as it's going to block. Also, by adjusting mailbox-capacity you can regulate the amount of memory consumed. But if reading a batch from the file is always faster than processing it, it seems reasonable to keep the capacity small, like 1 or 2.
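As a sketch of that (my addition, reusing lines, iterGroupBy and actor from the snippet above):

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// A dedicated single-thread context, so the blocking sends don't starve
// the default dispatcher (assuming one file reader is enough)
val fileReaderEc = ExecutionContext.fromExecutorService(Executors.newSingleThreadExecutor())

Future {
  iterGroupBy(lines) { l => l.headOption }.foreach { group =>
    actor.tell(group, ActorRef.noSender) // blocks when the bounded mailbox is full
  }
}(fileReaderEc)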
upd: iterGroupBy is implemented with Stream, and tested not to produce a StackOverflowError.

How can we avoid mapPartitions-related issues?

val counts = parsed.mapPartitions(iter => {
  iter.flatMap(point => {
    println("points" + point)
    point.indices.map(i => (i, point(i)))
  })
}).countByValue()
val count = parsed.mapPartitions(iter => {
  iter.flatMap(point => {
    println("pointsssss" + point.deep)
    point.indices.map(i => (i, point(i)))
  })
}).countByValue()
When I execute count.foreach(println), I also get output from counts. How can I avoid this problem?
The reason you see both print statements is that countByValue is itself an action, not a transformation, and it triggers evaluation of the RDD (in this case, of both of them). From the docs:
def countByValue(): Map[T, Long]
Return the count of each unique value in this RDD as a map of (value, count) pairs. The final combine step happens locally on the master, equivalent to running a single reduce task.
Your next call, count.foreach(println), thus happens outside of Spark, on a normal Scala collection, on the master node.
Check your logic if this is not the behavior you want; I suspect that you want countByKey() instead (also an action).
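For illustration, a sketch of that suggestion (assuming parsed is an RDD of indexed sequences, as in the question; the mapPartitions wrapper is not needed for this):

// Counts elements per index key alone, rather than per (index, value) pair.
// countByKey, like countByValue, is an action and triggers evaluation.
val perIndex: scala.collection.Map[Int, Long] =
  parsed
    .flatMap(point => point.indices.map(i => (i, point(i))))
    .countByKey()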

How to use accumulator to count the records that have no matching items in leftOuterJoin?

Spark accumulators are a great way to get useful information about an operation over an RDD.
My problem is the following: I want to perform a join between two datasets, called, e.g., events and items (the events are unique and refer to items, and both are keyed by item_id, which is the primary key for items).
What works is this:
val joinedRDD = events.leftOuterJoin(items)
One possible way to know how many events did not have matching items is to write:
val numMissingItems = joinedRDD.map(x => if (x._2._2.isDefined) 0 else 1).sum
My question is: is there a way to obtain this count with an accumulator? I don't want to go through the RDD just to do the count.
Indeed, you could use the cogroup signature and then perform the logic of leftOuterJoin yourself, incrementing the accumulator in the no-match case. However, it's important to note that since this is a transformation, it is possible (for example, if a task fails or is recomputed) that your accumulator may over-count the number of records, although generally not by a lot. It's up to you whether that is acceptable.
Answering based on @Holden's answer, at the request of @Francis Toth:
This is based on spark's leftOuterJoin, where the only addition is the part missingRightRecordsAcc += 1.
Function definition:
import scala.reflect.ClassTag
import org.apache.spark.Accumulator
import org.apache.spark.rdd.{PairRDDFunctions, RDD}

object JoinerWithAccumulation {
  def leftOuterJoinWithAccumulator[K: ClassTag, V, W](left: PairRDDFunctions[K, V],
                                                      right: RDD[(K, W)],
                                                      missingRightRecordsAcc: Accumulator[Int])
      : RDD[(K, (V, Option[W]))] = {
    left.cogroup(right).flatMapValues { pair =>
      if (pair._2.isEmpty) {
        // no matching right-side record: count it and emit (v, None)
        pair._1.iterator.map { v => missingRightRecordsAcc += 1; (v, None) }
      } else {
        for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w))
      }
    }
  }
}
Usage:
val events = sc.textFile("...").parse...keyBy(_.getItemId)
val items = sc.textFile("...").parse...keyBy(_.getId)
val acc = sc.accumulator(0)

val joined = JoinerWithAccumulation.leftOuterJoinWithAccumulator(events, items, acc)

println(acc.value)    // 0, since no action has been performed on the RDD 'joined' yet
println(joined.count) // = events.count; this triggers an action
println(acc.value)    // = number of records in 'joined' without a matching record in 'items'
(The hardest part was getting the function definition right, with the ClassTag etc.)

Cats: can a List of State Monads "fail fast" on the <...>.sequence method?

Let's say we have a list of states and we want to sequence them:
import cats.data.State
import cats.instances.list._
import cats.syntax.traverse._

trait MachineState
case object ContinueRunning extends MachineState
case object StopRunning extends MachineState

case class Machine(candy: Int)

val addCandy: Int => State[Machine, MachineState] = amount =>
  State[Machine, MachineState] { machine =>
    val newCandyAmount = machine.candy + amount
    if (newCandyAmount > 10)
      (machine, StopRunning)
    else
      (machine.copy(newCandyAmount), ContinueRunning)
  }

List(addCandy(1),
     addCandy(2),
     addCandy(5),
     addCandy(10),
     addCandy(20),
     addCandy(50)).sequence.run(Machine(0)).value
The result would be:
(Machine(8), List(ContinueRunning, ContinueRunning, ContinueRunning, StopRunning, StopRunning, StopRunning))
It's obvious that the last 3 steps are redundant. Is there a way to make this sequence stop early? Here, when StopRunning gets returned, I would like to stop. For example, a list of Eithers would fail fast and stop the sequence early if needed (because it acts like a monad).
For the record, I do know that it is possible to simply write a tail recursion that checks each state as it is run and stops once some condition is satisfied. I just want to know if there is a more elegant way of doing this. The recursion solution seems like a lot of boilerplate to me; am I wrong?
Thank you! :)
There are two things that need to be understood here.
The first is what is actually happening:
State takes some state value, threads it between many composed calls, and in the process produces some output value as well
in your case, Machine is the state threaded between calls, while MachineState is the output of a single operation
sequence (usually) takes a collection (here: List) of some parametric type (here: State[Machine, _]) and turns the nesting inside out (List[State[Machine, _]] -> State[Machine, List[_]], where _ is the gap you fill with your type)
the result is that you thread the state (Machine(0)) through all the functions, while combining the output of each of them (MachineState) into a list of outputs
// ammonite
// to better see how many times things are being run
# {
  val addCandy: Int => State[Machine, MachineState] = amount =>
    State[Machine, MachineState] { machine =>
      val newCandyAmount = machine.candy + amount
      println("new attempt with " + machine + " and " + amount)
      if (newCandyAmount > 10)
        (machine, StopRunning)
      else
        (machine.copy(newCandyAmount), ContinueRunning)
    }
  }
addCandy: Int => State[Machine, MachineState] = ammonite.$sess.cmd24$$$Lambda$2669/1733815710#25c887ca
# List(addCandy(1),
       addCandy(2),
       addCandy(5),
       addCandy(10),
       addCandy(20),
       addCandy(50)).sequence.run(Machine(0)).value
new attempt with Machine(0) and 1
new attempt with Machine(1) and 2
new attempt with Machine(3) and 5
new attempt with Machine(8) and 10
new attempt with Machine(8) and 20
new attempt with Machine(8) and 50
res25: (Machine, List[MachineState]) = (Machine(8), List(ContinueRunning, ContinueRunning, ContinueRunning, StopRunning, StopRunning, StopRunning))
In other words, if what you want is circuit breaking, then .sequence might not be what you want.
As a matter of fact, you probably want something else: to combine a list of A => (A, B) functions into one function which skips the next computation if the result so far is StopRunning (in your code, nothing tells the library what the circuit-break condition is nor how it should be applied). I would suggest doing it explicitly with some other combinator, e.g.:
# {
  List(addCandy(1),
       addCandy(2),
       addCandy(5),
       addCandy(10),
       addCandy(20),
       addCandy(50))
    .reduce { (a, b) =>
      a.flatMap {
        // flatMap and map use MachineState -- the second parameter
        // is the result, after all! We pattern match on it to decide
        // whether to proceed with the computation or stop it
        case ContinueRunning => b                       // runs the next computation
        case StopRunning     => State.pure(StopRunning) // returns the current result unmodified
      }
    }
    .run(Machine(0))
    .value
  }
new attempt with Machine(0) and 1
new attempt with Machine(1) and 2
new attempt with Machine(3) and 5
new attempt with Machine(8) and 10
res23: (Machine, MachineState) = (Machine(8), StopRunning)
This eliminates the need to run the code inside the later addCandy calls, but you cannot really get rid of the code that combines the states together, so this reduce logic will be applied at runtime n-1 times (where n is the size of your list), and that cannot be helped.
BTW, if you take a closer look at Either you will find that it also computes n results and only then combines them, so while it looks like it's circuit breaking, in fact it isn't. sequence combines the results of "parallel" computations but won't interrupt them if any of them failed.
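To illustrate that last point with a small sketch (my addition): by the time sequence sees the Either values, every computation has already run; sequence merely selects the first Left.

import cats.instances.either._
import cats.instances.list._
import cats.syntax.traverse._

def step(i: Int): Either[String, Int] = {
  println(s"running step $i")
  if (i > 2) Left(s"failed at $i") else Right(i)
}

// All four steps print here, while the list is being built:
val steps: List[Either[String, Int]] = List(1, 2, 3, 4).map(step)

// sequence then just picks the first Left -- no work is interrupted:
val result = steps.sequence // Left("failed at 3")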