How to avoid broadcasting a large lookup table in Spark (Scala)

Can you help me avoid broadcasting a large lookup table? I have a table with measurements:
Measurement Value
x1 5.1
x2 8.9
x1 9.1
x3 4.4
x2 2.1
...
And a list of pairs:
P1 P2
x1 x2
x2 x3
...
The task is to get all values for both elements of every pair and feed them into a magic function. Here is how I solved it, by broadcasting the large table with the measurements:
import org.apache.spark.sql.functions.{col, udf}
case class Measurement(measurement: String, value: Double)
case class Candidate(c1: String, c2: String)
val measurements = Seq(Measurement("x1", 5.1), Measurement("x2", 8.9),
Measurement("x1", 9.1), Measurement("x3", 4.4))
val candidates = Seq(Candidate("x1", "x2"), Candidate("x2", "x3"))
// create data frames
val dfm = sqc.createDataFrame(measurements)
val dfc = sqc.createDataFrame(candidates)
// broadcast lookup table
val lookup = sc.broadcast(dfm.rdd.map(r => (r(0), r(1))).collect())
// udf: run magic test with every candidate
val magic: ((String, String) => Double) = (c1: String, c2: String) => {
val lt = lookup.value
val c1v = lt.filter(_._1 == c1).map(_._2).map(_.asInstanceOf[Double])
val c2v = lt.filter(_._1 == c2).map(_._2).map(_.asInstanceOf[Double])
new Foo().magic(c1v, c2v)
}
val sq1 = udf(magic)
val dfks = dfc.withColumn("magic", sq1(col("c1"), col("c2")))
As you can guess, I'm not too happy with this solution. For every pair I filter the lookup table twice, which is neither fast nor elegant. I'm using Spark 1.6.1.
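For comparison, here is a minimal sketch (not from the original post) that pre-groups the measurements before broadcasting, so each UDF call does two hash lookups instead of two linear scans; the table is still broadcast, though:
// sketch: broadcast a Map[measurement -> values] instead of a flat array,
// assuming Foo.magic accepts the same collections as before
val lookupMap = sc.broadcast(
  dfm.rdd
    .map(r => (r.getString(0), r.getDouble(1)))
    .collect()
    .groupBy(_._1)
    .map { case (m, pairs) => m -> pairs.map(_._2) }
)
val magicMap: ((String, String) => Double) = (c1: String, c2: String) => {
  val lt = lookupMap.value
  new Foo().magic(lt.getOrElse(c1, Array.empty[Double]), lt.getOrElse(c2, Array.empty[Double]))
}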

An alternative would be to use RDDs and a join. I'm not sure which is better in terms of performance, though.
case class Measurement(measurement: String, value: Double)
case class Candidate(c1: String, c2: String)
val measurements = Seq(Measurement("x1", 5.1), Measurement("x2", 8.9),
Measurement("x1", 9.1), Measurement("x3", 4.4))
val candidates = Seq(Candidate("x1", "x2"), Candidate("x2", "x3"))
val rdm = sc.parallelize(measurements).map(r => (r.measurement, r.value)).groupByKey().cache()
val rdc = sc.parallelize(candidates).map(r => (r.c1, r.c2)).cache()
val firstColJoin = rdc.join(rdm).values
val secondColJoin = firstColJoin.join(rdm).values
secondColJoin.map { case (c1v, c2v) => new Foo().magic(c1v, c2v) }

Thank you for all the comments. I read them, did some research, and studied zero323's posts.
My current solution uses two joins and a UserDefinedAggregateFunction:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
object GroupValues extends UserDefinedAggregateFunction {
def inputSchema = new StructType().add("x", DoubleType)
def bufferSchema = new StructType().add("buff", ArrayType(DoubleType))
def dataType = ArrayType(DoubleType)
def deterministic = true
def initialize(buffer: MutableAggregationBuffer) = {
buffer.update(0, ArrayBuffer.empty[Double])
}
def update(buffer: MutableAggregationBuffer, input: Row) = {
if (!input.isNullAt(0))
buffer.update(0, buffer.getSeq[Double](0) :+ input.getDouble(0))
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
buffer1.update(0, buffer1.getSeq[Double](0) ++ buffer2.getSeq[Double](0))
}
def evaluate(buffer: Row) = buffer.getSeq[Double](0)
}
// join data for candidate one
val j1 = dfc.join(dfm, dfc("c1") === dfm("measurement"))
// aggregate all c1 values to an array
val j1v = j1.groupBy(col("c1"), col("c2")).agg(GroupValues(col("value"))
.alias("c1-values"))
// join data for candidate two
val j2 = j1v.join(dfm, j1v("c2") === dfm("measurement"))
// aggregate all c2 values to an array
val j2v = j2.groupBy(col("c1"), col("c2"), col("c1-values"))
.agg(GroupValues(col("value")).alias("c2-values"))
The next step would be to use collect_list instead of the UserDefinedAggregateFunction.
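For reference, a minimal sketch of that next step, assuming the same dfc/dfm frames as above (in Spark 1.6 collect_list requires a HiveContext; from Spark 2.0 on it is available on a plain SparkSession):
import org.apache.spark.sql.functions.{col, collect_list}
// same two joins as above, but aggregating with the built-in collect_list
val cl1 = dfc.join(dfm, dfc("c1") === dfm("measurement"))
  .groupBy(col("c1"), col("c2"))
  .agg(collect_list(col("value")).alias("c1-values"))
val cl2 = cl1.join(dfm, cl1("c2") === dfm("measurement"))
  .groupBy(col("c1"), col("c2"), col("c1-values"))
  .agg(collect_list(col("value")).alias("c2-values"))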


Write this function in an FP way

I wrote this function to compute the distance from a source node to each reachable node. Nodes are strings and neighbours(node) returns the node's neighbours.
It works fine, but I'd like something more in the spirit of Scala.
How could I rewrite it? (maybe with recursion?)
def distanceMap(source: String): Map[String, Int] = {
val distances = scala.collection.mutable.Map.empty[String, Int]
var batch = List(source)
var newBatch: List[String] = Nil
val seen = scala.collection.mutable.Set(source)
var distance = 1
while (!batch.isEmpty) {
newBatch = batch.flatMap(neighbours(_)).filterNot(seen(_))
for (neighbour <- newBatch) {
seen.add(neighbour)
distances(neighbour) = distance
}
batch = newBatch
distance += 1
}
distances.toMap
}
Try recursion with accumulators and a helper function:
def distanceMap(source: String): Map[String, Int] = {
@scala.annotation.tailrec
def loop(
distances: Map[String, Int],
batch: List[String],
seen: Set[String],
distance: Int
): Map[String, Int] =
if (batch.isEmpty) distances
else {
val newBatch = batch.flatMap(neighbours).filterNot(seen(_))
loop(
distances ++ newBatch.map(neighbour => neighbour -> distance),
newBatch,
seen ++ newBatch,
distance + 1
)
}
loop(Map(), List(source), Set(source), 1)
}
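A quick way to exercise it, with a hypothetical neighbours backed by a small adjacency map (the graph below is made up purely for illustration):
// hypothetical adjacency map standing in for the real neighbours function
val graph = Map(
  "a" -> List("b", "c"),
  "b" -> List("a", "d"),
  "c" -> List("a"),
  "d" -> List("b")
)
def neighbours(node: String): List[String] = graph.getOrElse(node, Nil)

distanceMap("a") // Map(b -> 1, c -> 1, d -> 2)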

Generate a list of case class instances with a non-repeating Int field

I want to generate a List of a class which contains several fields. One of them is of type Int and it must not repeat. Could you help me write the code?
I tried the following:
import org.scalacheck.{Arbitrary, Gen}
import org.scalacheck.Arbitrary.arbitrary
case class Person(name: String, age: Int)
implicit val genPerson: Gen[Person] =
for {
name <- arbitrary[String]
age <- Gen.posNum[Int]
} yield Person(name, age)
implicit val genListOfPerson: Gen[scala.List[Person]] = Gen.listOfN(3, genPerson)
The problem is that I get Person instances with equal ages.
If you're requiring that no two Persons in the generated list have the same age, you can do something like this:
implicit def IntsArb: Arbitrary[Int] = Arbitrary(Gen.choose[Int](0, Int.MaxValue))
implicit val StringArb: Arbitrary[String] = Arbitrary(Gen.listOfN(5, Gen.alphaChar).map(_.mkString))
implicit val PersonGen = Arbitrary(Gen.resultOf(Person.apply _))
implicit val PersonsGen: Arbitrary[List[Person]] =
Arbitrary(Gen.listOfN(3, PersonGen.arbitrary).map { persons =>
val grouped: Map[Int, List[Person]] = persons.groupBy(_.age)
grouped.values.map(_.head).toList // .head is safe because groupBy produces non-empty groups
})
Note that this will return a List with no duplicate ages but there's no guarantee that the list will have size 3 (it is guaranteed that the list will be nonempty, with size at most 3).
If having a list of size 3 is important, at the risk of generation failing if the "dice are against you", you can have something like:
def uniqueAges(persons: List[Person], target: Int): Gen[List[Person]] = {
val grouped: Map[Int, List[Person]] = persons.groupBy(_.age)
val uniquelyAged = grouped.values.map(_.head).toList
val n = uniquelyAged.size
if (n == target) Gen.const(uniquelyAged)
else {
val existingAges = grouped.keySet
val genPerson = PersonGen.arbitrary.retryUntil { p => !existingAges(p.age) }
Gen.listOfN(target - n, genPerson)
.flatMap(l => uniqueAges(l, target - n))
.map(_ ++ uniquelyAged)
}
}
implicit val PersonsGen: Arbitrary[List[Person]] =
Arbitrary(Gen.listOfN(3, PersonGen.arbitrary).flatMap(l => uniqueAges(l, 3)))
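A hedged sketch of how the resulting Arbitrary might be exercised in a property (the property itself is made up for illustration):
import org.scalacheck.Prop.forAll
// with PersonsGen in implicit scope, every generated list should have distinct ages
val noDuplicateAges = forAll { persons: List[Person] =>
  persons.map(_.age).distinct.size == persons.size
}
noDuplicateAges.check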
You can do it as follows (with ages drawn from the full non-negative Int range, duplicate ages are extremely unlikely, though not strictly impossible):
implicit def IntsArb: Arbitrary[Int] = Arbitrary(Gen.choose[Int](0, Int.MaxValue))
implicit val StringArb: Arbitrary[String] = Arbitrary(Gen.listOfN(5, Gen.alphaChar).map(_.mkString))
implicit val PersonGen = Arbitrary(Gen.resultOf(Person.apply _))
implicit val PersonsGen: Arbitrary[List[Person]] = Arbitrary(Gen.listOfN(3, PersonGen.arbitrary))

Calculate a mode for multiple columns

I would like to calculate the mode for multiple columns at the same time in Spark and use these calculated values to impute missing values in a DataFrame. I found how to calculate e.g. a mean, but I think a mode is more complex.
Here is the mean calculation:
val multiple_mean = df.na.fill(df.columns.zip(
df.select(intVars.map(mean(_)): _*).first.toSeq
).toMap)
I am able to calculate a mode in a brute-force way:
var list = ArrayBuffer.empty[Float]
for(column <- df.columns){
list += df.select(column).groupBy(col(column)).count().orderBy(desc("count")).first.toSeq(0).asInstanceOf[Float]
}
val multiple_mode = df.na.fill(df.columns.zip(list.toSeq).toMap)
Which way would be the best in terms of performance?
Thank you for any help.
You could use a UserDefinedAggregateFunction. The code below was tested on Spark 1.6.2.
First, create a class which extends UserDefinedAggregateFunction.
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
class ModeUDAF extends UserDefinedAggregateFunction{
override def dataType: DataType = StringType
override def inputSchema: StructType = new StructType().add("input", StringType)
override def deterministic: Boolean = true
override def bufferSchema: StructType = new StructType().add("mode", MapType(StringType, LongType))
override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = Map.empty[Any, Long]
}
// update: increment the count of the incoming value in the buffer map
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val buff0 = buffer.getMap[Any, Long](0)
val inp = input.get(0)
buffer(0) = buff0.updated(inp, buff0.getOrElse(inp, 0L) + 1L)
}
// merge: combine two partial count maps, summing the counts per key
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
val mp1 = buffer1.getMap[Any, Long](0)
val mp2 = buffer2.getMap[Any, Long](0)
buffer1(0) = mp1 ++ mp2.map { case (k, v) => k -> (v + mp1.getOrElse(k, 0L)) }
}
// evaluate: return the value with the highest count
override def evaluate(buffer: Row): Any = {
lazy val st = buffer.getMap[Any, Long](0).toStream
val mode = st.foldLeft(st.head){case (e, s) => if (s._2 > e._2) s else e}
mode._1
}
}
Afterwards you can use it with your DataFrame in the following manner:
val modeColumnList = List("some", "column", "names") // or df.columns.toList
val modeAgg = new ModeUDAF()
val aggCols = modeColumnList.map(c => modeAgg(df(c)))
val aggregatedModeDF = df.agg(aggCols.head, aggCols.tail: _*)
aggregatedModeDF.show()
You could also use .collect on the final DataFrame to bring the result into a Scala data structure.
Note: the performance of this solution depends on the cardinality of the input columns.
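To close the loop with the imputation goal from the question, a hedged sketch of feeding the aggregated modes back into na.fill; it assumes the single aggregated row holds one mode per column, in modeColumnList order:
// sketch: collect the single row of modes and use it to impute missing values
val modeRow = aggregatedModeDF.first()
val modeMap: Map[String, Any] = modeColumnList.zip(modeRow.toSeq).toMap
val imputedDF = df.na.fill(modeMap)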

Filtering a collection based on an arbitrary number of options

How can the following Scala function be refactored to use idiomatic best practices?
def getFilteredList(ids: Seq[Int],
idsMustBeInThisListIfItExists: Option[Seq[Int]],
idsMustAlsoBeInThisListIfItExists: Option[Seq[Int]]): Seq[Int] = {
var output = ids
if (idsMustBeInThisListIfItExists.isDefined) {
output = output.intersect(idsMustBeInThisListIfItExists.get)
}
if (idsMustAlsoBeInThisListIfItExists.isDefined) {
output = output.intersect(idsMustAlsoBeInThisListIfItExists.get)
}
output
}
Expected IO:
val ids = Seq(1,2,3,4,5)
val output1 = getFilteredList(ids, None, Some(Seq(3,5))) // 3, 5
val output2 = getFilteredList(ids, None, None) // 1,2,3,4,5
val output3 = getFilteredList(ids, Some(Seq(1,2)), None) // 1,2
val output4 = getFilteredList(ids, Some(Seq(1)), Some(Seq(5))) // 1,5
Thank you for your time.
Here's a simple way to do this:
implicit class SeqAugmenter[T](val seq: Seq[T]) extends AnyVal {
def intersect(opt: Option[Seq[T]]): Seq[T] = {
opt.fold(seq)(seq intersect _)
}
}
def getFilteredList(ids: Seq[Int],
idsMustBeInThisListIfItExists: Option[Seq[Int]],
idsMustAlsoBeInThisListIfItExists: Option[Seq[Int]]
): Seq[Int] = {
ids intersect
idsMustBeInThisListIfItExists intersect
idsMustAlsoBeInThisListIfItExists
}
Yet another way without for comprehensions and implicits:
def getFilteredList(ids: Seq[Int],
idsMustBeInThisListIfItExists: Option[Seq[Int]],
idsMustAlsoBeInThisListIfItExists: Option[Seq[Int]]): Seq[Int] = {
val output1 = ids.intersect(idsMustBeInThisListIfItExists.getOrElse(ids))
val output2 = output1.intersect(idsMustAlsoBeInThisListIfItExists.getOrElse(output1))
output2
}
Another similar way, without implicits.
def getFilteredList[A](ids: Seq[A],
idsMustBeInThisListIfItExists: Option[Seq[A]],
idsMustAlsoBeInThisListIfItExists: Option[Seq[A]]): Seq[A] = {
val a = intersect(Some(ids), idsMustBeInThisListIfItExists)(ids)
val b = intersect(Some(a), idsMustAlsoBeInThisListIfItExists)(a)
b
}
def intersect[A](ma: Option[Seq[A]], mb: Option[Seq[A]])(default: Seq[A]) = {
(for {
a <- ma
b <- mb
} yield {
a.intersect(b)
}).getOrElse(default)
}
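For completeness, one more sketch (not from the original answers) that folds over the optional filters, which scales to any number of them:
// flatten drops the Nones, then each remaining list narrows the result
def getFilteredList(ids: Seq[Int], filters: Option[Seq[Int]]*): Seq[Int] =
  filters.flatten.foldLeft(ids)(_ intersect _)

getFilteredList(Seq(1, 2, 3, 4, 5), None, Some(Seq(3, 5))) // Seq(3, 5)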

Akka Streams custom merge

I'm new to akka-streams and not sure how to approach this problem.
I have 3 source streams that are sorted by a sequence ID. I want to group together the values which have the same ID. Values in each stream may be missing or duplicated. If one stream is a faster producer than the rest, it should get backpressured.
case class A(id: Int)
case class B(id: Int)
case class C(id: Int)
case class Merged(as: List[A], bs: List[B], cs: List[C])
import akka.stream._
import akka.stream.scaladsl._
val as = Source(List(A(1), A(2), A(3), A(4), A(5)))
val bs = Source(List(B(1), B(2), B(3), B(4), B(5)))
val cs = Source(List(C(1), C(1), C(3), C(4)))
val merged = ???
// value 1: Merged(List(A(1)), List(B(1)), List(C(1), C(1)))
// value 2: Merged(List(A(2)), List(B(2)), Nil)
// value 3: Merged(List(A(3)), List(B(3)), List(C(3)))
// value 4: Merged(List(A(4)), List(B(4)), List(C(4)))
// value 5: Merged(List(A(5)), List(B(5)), Nil)
// (end of stream)
This question is old, but I was trying to find a solution for it and only found hints at the Lightbend forum, not a working use case, so I decided to implement my example and post it here.
I created 3 sources, sourceA, sourceB, and sourceC, which emit events at different speeds using .throttle(). Then I created a RunnableGraph where I merge the sources using Merge and send the output to my WindowGroupEventFlow, a Flow I implemented based on a sliding window over a fixed number of events. This is the graph:
sourceA ~> mergeShape.in(0)
sourceB ~> mergeShape.in(1)
sourceC ~> mergeShape.in(2)
mergeShape.out ~> windowFlowShape ~> sinkShape
The classes that I am using on the sources are these:
object Domain {
sealed abstract class Z(val id: Int, val value: String)
case class A(override val id: Int, override val value: String = "A") extends Z(id, value)
case class B(override val id: Int, override val value: String = "B") extends Z(id, value)
case class C(override val id: Int, override val value: String = "C") extends Z(id, value)
case class ABC(override val id: Int, override val value: String) extends Z(id, value)
}
and this is the WindowGroupEventFlow Flow that I created to group the events:
import akka.stream._
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}
import scala.collection.mutable

// step 0: define the shape
class WindowGroupEventFlow(maxBatchSize: Int) extends GraphStage[FlowShape[Domain.Z, Domain.Z]] {
// step 1: define the ports and the component-specific members
val in = Inlet[Domain.Z]("WindowGroupEventFlow.in")
val out = Outlet[Domain.Z]("WindowGroupEventFlow.out")
// step 3: create the logic
override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = new GraphStageLogic(shape) {
// mutable state
val batch = new mutable.Queue[Domain.Z]
var count = 0
// step 4: define the handlers and implement the grouping logic
setHandler(in, new InHandler {
override def onPush(): Unit = {
try {
val nextElement = grab(in)
batch.enqueue(nextElement)
count += 1
// If window finished we have to dequeue all elements
if (count >= maxBatchSize) {
println("************ window finished - dequeuing elements ************")
var result = Map[Int, Domain.Z]()
val list = batch.dequeueAll(_ => true).to[collection.immutable.Iterable]
list.foreach { tuple =>
if (result.contains(tuple.id)) {
val abc = result.get(tuple.id)
val value = abc.get.value + tuple.value
val z: Domain.Z = Domain.ABC(tuple.id, value)
result += (tuple.id -> z)
} else {
val z: Domain.Z = Domain.ABC(tuple.id, tuple.value)
result += (tuple.id -> z)
}
}
val finalResult: collection.immutable.Iterable[Domain.Z] = result.map(p => p._2)
emitMultiple(out, finalResult)
count = 0
} else {
pull(in) // send demand upstream signal, asking for another element
}
} catch {
case e: Throwable => failStage(e)
}
}
})
setHandler(out, new OutHandler {
override def onPull(): Unit = {
pull(in)
}
})
}
// step 2: construct a new shape
override def shape: FlowShape[Domain.Z, Domain.Z] = FlowShape[Domain.Z, Domain.Z](in, out)
}
and this is how I run everything:
import akka.actor.ActorSystem
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Flow, GraphDSL, Merge, RunnableGraph, Sink, Source}
import scala.concurrent.duration._

object WindowGroupEventFlow {
def main(args: Array[String]): Unit = {
run()
}
def run() = {
implicit val system = ActorSystem("WindowGroupEventFlow")
import Domain._
val sourceA = Source(List(A(1), A(2), A(3), A(1), A(2), A(3), A(1), A(2), A(3), A(1))).throttle(3, 1 second)
val sourceB = Source(List(B(1), B(2), B(1), B(2), B(1), B(2), B(1), B(2), B(1), B(2))).throttle(2, 1 second)
val sourceC = Source(List(C(1), C(2), C(3), C(4))).throttle(1, 1 second)
// Step 1 - setting up the fundamental for a stream graph
val windowRunnableGraph = RunnableGraph.fromGraph(
GraphDSL.create() { implicit builder =>
import GraphDSL.Implicits._
// Step 2 - create shapes
val mergeShape = builder.add(Merge[Domain.Z](3))
val windowEventFlow = Flow.fromGraph(new WindowGroupEventFlow(5))
val windowFlowShape = builder.add(windowEventFlow)
val sinkShape = builder.add(Sink.foreach[Domain.Z](x => println(s"sink: $x")))
// Step 3 - tying up the components
sourceA ~> mergeShape.in(0)
sourceB ~> mergeShape.in(1)
sourceC ~> mergeShape.in(2)
mergeShape.out ~> windowFlowShape ~> sinkShape
// Step 4 - return the shape
ClosedShape
}
)
// run the graph and materialize it
val graph = windowRunnableGraph.run()
}
}
You can see in the output how the elements with the same ID are grouped:
sink: ABC(1,ABC)
sink: ABC(2,AB)
************ window finished - dequeuing elements ************
sink: ABC(3,A)
sink: ABC(1,BA)
sink: ABC(2,CA)
************ window finished - dequeuing elements ************
sink: ABC(2,B)
sink: ABC(3,AC)
sink: ABC(1,BA)
************ window finished - dequeuing elements ************
sink: ABC(2,AB)
sink: ABC(3,A)
sink: ABC(1,BA)