Parallel FP Growth in Spark

Parallel FP Growth in Spark - scala

I am trying to understand the "add" and "extract" methods of the FPTree class:
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala).
What is the purpose of 'summaries' variable?
where is the Group list?
I assume it is the following, am I correct:
val numParts = if (numPartitions > 0) numPartitions else data.partitions.length
val partitioner = new HashPartitioner(numParts)
What will 'summaries contain for 3 transactions of {a,b,c} , {a,b} , {b,c} where all are frequent?
def add(t: Iterable[T], count: Long = 1L): FPTree[T] = {
require(count > 0)
var curr = root
curr.count += count
t.foreach { item =>
val summary = summaries.getOrElseUpdate(item, new Summary)
summary.count += count
val child = curr.children.getOrElseUpdate(item, {
val newNode = new Node(curr)
newNode.item = item
summary.nodes += newNode
newNode
})
child.count += count
curr = child
}
this
}
def extract(
minCount: Long,
validateSuffix: T => Boolean = _ => true): Iterator[(List[T], Long)] = {
summaries.iterator.flatMap { case (item, summary) =>
if (validateSuffix(item) && summary.count >= minCount) {
Iterator.single((item :: Nil, summary.count)) ++
project(item).extract(minCount).map { case (t, c) =>
(item :: t, c)
}
} else {
Iterator.empty
}
}
}

After a bit experiments, it is pretty straight forward:
1+2) The partition is indeed the Group representative.
It is also how the conditional transactions calculated:
private def genCondTransactions[Item: ClassTag](
transaction: Array[Item],
itemToRank: Map[Item, Int],
partitioner: Partitioner): mutable.Map[Int, Array[Int]] = {
val output = mutable.Map.empty[Int, Array[Int]]
// Filter the basket by frequent items pattern and sort their ranks.
val filtered = transaction.flatMap(itemToRank.get)
ju.Arrays.sort(filtered)
val n = filtered.length
var i = n - 1
while (i >= 0) {
val item = filtered(i)
val part = partitioner.getPartition(item)
if (!output.contains(part)) {
output(part) = filtered.slice(0, i + 1)
}
i -= 1
}
output
}
The summaries is just a helper to save the count of items in transaction
The extract/project will generate the FIS by using up/down recursion and dependent FP-Trees (project), while checking summaries if traversal that path is needed.
summaries of node 'a' will have {b:2,c:1} and children of node 'a' are 'b' and 'c'.

Related

Mapping over a collection that might return multiple values or a single value

I'm currently mapping over a collection for validation, and I need to return back a single or multiple validation errors:
val errors: Seq[Option[ProductErrors]] = products.map {
if(....) Some(ProductError(...))
else if(...) Some(ProductError(..))
else None
}
errors.flatten
So currently I am returning an Option[ProductError] per map iteration, but in some cases I need to return multiple ProductError's, how can I acheive this?
e.g.
if(...) {
val p1 = Some(ProductError(...))
val p2 = Some(ProductError(....))
}

case class ProductErrors(msg: String = "anything")
val products = (1 to 10).toList
def convert(p: Int): Seq[ProductErrors] = {
if (p < 5) Seq(ProductErrors("less than 5"))
else if (p < 8 && p % 2 == 1) Seq(ProductErrors("element is odd"), ProductErrors("less than 8"))
else Seq()
}
val errors = products.map(convert)
// errors.flatten.size
// val res8: Int = 8
// you can just use flatMap here
products.flatMap(convert).size // 8

Need to merge the Map and List with the same dates

I have a Map where key = LocalDateTime and value = Group
def someGroup(/.../): List[Group] = {
someCode.map {
/.../
}.map(group => (group.completedDt, group)).toMap
/.../
}
And there is also List [Group], where Group (completedDt: LocalDateTime, cost: Int), in which always cost = 0
An example of what I have:
map: [(2021-04-01T00:00:00.000, 500), (2021-04-03T00:00:00.000, 1000), (2021-04-05T00:00:00.000, 750)]
list: ((2021-04-01T00:00:00.000, 0),(2021-04-02T00:00:00.000, 0),(2021-04-03T00:00:00.000, 0),(2021-04-04T00:00:00.000, 0),(2021-04-05T00:00:00.000, 0))
The expected result is:
list ((2021-04-01T00:00:00.000, 500),(2021-04-02T00:00:00.000, 0),(2021-04-03T00:00:00.000, 1000),(2021-04-04T00:00:00.000, 0),(2021-04-05T00:00:00.000, 750))
Thanks in advance!

Assuming that if there's a time appearing in both that you want to combine the costs:
type Group = (LocalDateTime, Int) // completedDt, cost
val groupMap: Map[LocalDateTime, Group] = ???
val groupList: List[Group] = ???
val combined =
groupList.foldLeft(groupMap) { (acc, group) =>
val completedDt = group._1
if (acc.contains(completedDt)) {
val nv = completedDt -> (acc(completedDt)._2 + group._2)
acc.updated(completedDt, nv)
} else acc + (completedDt -> group)
}.values.toList.sortBy(_._1) // You might need to define an Ordering[LocalDateTime]
The notation in your question leads me to think Group is just a pair, not a case class. It's also worth noting that I'm not sure what having the map be Map[LocalDateTime, Group] vs. Map[LocalDateTime, Int] (and thus by definition a collection of Group) buys you.
EDIT: if you have a general collection of collections of Group, you can
val groupLists: List[List[Group]] = ???
groupList.foldLeft(Map.empty[LocalDateTime, Group]) { (acc, lst) =>
lst.foldLeft(acc) { (m, group) =>
val completedDt = group._1
if (m.contains(completedDt)) {
val nv = completedDt -> (acc(completedDt)._2 + group._2)
m.updated(completedDt, nv)
} else m + (completedDt -> group)
}
}.values.toList.sortBy(_._2)

How to dynamically provide N codecs to process fields as a VectorCodec for a record of binary fields that do not contain size bytes

Considering this function in Decoder:
final def decodeCollect[F[_], A](dec: Decoder[A], limit: Option[Int])(buffer: BitVector)(implicit cbf: Factory[A, F[A]]): Attempt[DecodeResult[F[A]]] = {
What I really need is dec: Vector[Decoder[A]], like this:
final def decodeCollect[F[_], A](dec: Vector[Decoder[A]], limit: Option[Int])(buffer: BitVector)(implicit cbf: Factory[A, F[A]]): Attempt[DecodeResult[F[A]]] = {
to process a binary format that has fields that are not self describing. Early in the file are description records, and from these come field sizes that have to be applied later in data records. So I want to build up a list of decoders and apply it N times, where N is the number of decoders.
I could write a new function modeled on decodeCollect, but it takes an implicit Factory, so I probably would have to compile the scodec library and add it.
Is there a simpler approach using what exists in the scodec library? Either a way to deal with the factory or a different approach?

I finally hacked a solution in the codec codebase. Now that that door is open, I'll add whatever I need until I succeed.
final def decodeNCollect[F[_], A](dec: Vector[Decoder[A]])(buffer: BitVector)(implicit cbf: Factory[A, F[A]]): Attempt[DecodeResult[F[A]]] = {
val bldr = cbf.newBuilder
var remaining = buffer
var count = 0
val maxCount = dec.length
var error: Option[Err] = None
while (count < maxCount && remaining.nonEmpty) {
dec(count).decode(remaining) match {
case Attempt.Successful(DecodeResult(value, rest)) =>
bldr += value
count += 1
remaining = rest
case Attempt.Failure(err) =>
error = Some(err.pushContext(count.toString))
remaining = BitVector.empty
}
}
Attempt.fromErrOption(error, DecodeResult(bldr.result, remaining))
}
final def encodeNSeq[A](encs: Vector[Encoder[A]])(seq: collection.immutable.Seq[A]): Attempt[BitVector] = {
if (encs.length != seq.length)
return Attempt.failure(Err("encodeNSeq: length of coders and items does not match"))
val buf = new collection.mutable.ArrayBuffer[BitVector](seq.size)
((seq zip (0 until encs.length)): Seq[(A, Int)]) foreach { case (a, i) =>
encs(i).encode(a) match {
case Attempt.Successful(aa) => buf += aa
case Attempt.Failure(err) => return Attempt.failure(err.pushContext(buf.size.toString))
}
}
def merge(offset: Int, size: Int): BitVector = size match {
case 0 => BitVector.empty
case 1 => buf(offset)
case n =>
val half = size / 2
merge(offset, half) ++ merge(offset + half, half + (if (size % 2 == 0) 0 else 1))
}
Attempt.successful(merge(0, buf.size))
}
private[codecs] final class VectorNCodec[A](codecs: Vector[Codec[A]]) extends Codec[Vector[A]] {
def sizeBound = SizeBound(0, Some(codecs.length.toLong))
def encode(vector: Vector[A]) = Encoder.encodeNSeq(codecs)(vector)
def decode(buffer: BitVector) =
Decoder.decodeNCollect[Vector, A](codecs)(buffer)
override def toString = s"vector($codecs)"
}
def vectorOf[A](valueCodecs: Vector[Codec[A]]): Codec[Vector[A]] =
provide(valueCodecs.length).
flatZip { count => new VectorNCodec(valueCodecs) }.
narrow[Vector[A]]({ case (cnt, xs) =>
if (xs.size == cnt) Attempt.successful(xs)
else Attempt.failure(Err(s"Insufficient number of elements: decoded ${xs.size} but should have decoded $cnt"))
}, xs => (xs.size, xs)).
withToString(s"vectorOf($valueCodecs)")

Scala - Recursive method is return different values

I have implemented a calculation to obtain the node score of each nodes.
The formula to obtain the value is:
The children list can not be empty or a flag must be true;
The iterative way works pretty well:
class TreeManager {
def scoreNo(nodes:List[Node]): List[(String, Double)] = {
nodes.headOption.map(node => {
val ranking = node.key.toString -> scoreNode(Some(node)) :: scoreNo(nodes.tail)
ranking ::: scoreNo(node.children)
}).getOrElse(Nil)
}
def scoreNode(node:Option[Node], score:Double = 0, depth:Int = 0):Double = {
node.map(n => {
var nodeScore = score
for(child <- n.children){
if(!child.children.isEmpty || child.hasInvitedSomeone == Some(true)){
nodeScore = scoreNode(Some(child), (nodeScore + scala.math.pow(0.5, depth)), depth+1)
}
}
nodeScore
}).getOrElse(score)
}
}
But after i've refactored this piece of code to use recursion, the results are totally wrong:
class TreeManager {
def scoreRecursive(nodes:List[Node]): List[(Int, Double)] = {
def scoreRec(nodes:List[Node], score:Double = 0, depth:Int = 0): Double = nodes match {
case Nil => score
case n =>
if(!n.head.children.isEmpty || n.head.hasInvitedSomeone == Some(true)){
score + scoreRec(n.tail, score + scala.math.pow(0.5, depth), depth + 1)
} else {
score
}
}
nodes.headOption.map(node => {
val ranking = node.key -> scoreRec(node.children) :: scoreRecursive(nodes.tail)
ranking ::: scoreRecursive(node.children)
}).getOrElse(Nil).sortWith(_._2 > _._2)
}
}
The Node is an object of a tree and it's represented by the following class:
case class Node(key:Int,
children:List[Node] = Nil,
hasInvitedSomeone:Option[Boolean] = Some(false))
And here is the part that i'm running to check results:
object Main {
def main(bla:Array[String]) = {
val xx = new TreeManager
val values = List(
Node(10, List(Node(11, List(Node(13))),
Node(12,
List(
Node(14, List(
Node(15, List(Node(18))), Node(17, hasInvitedSomeone = Some(true)),
Node(16, List(Node(19, List(Node(20)))),
hasInvitedSomeone = Some(true))),
hasInvitedSomeone = Some(true))),
hasInvitedSomeone = Some(true))),
hasInvitedSomeone = Some(true)))
val resIterative = xx.scoreNo(values)
//val resRecursive = xx.scoreRec(values)
println("a")
}
}
The iterative way is working because i've checked it but i didn't get why recursive return wrong values.
Any idea?
Thank in advance.

The recursive version never recurses on children of the nodes, just on the tail. Whereas the iterative version correctly both recurse on the children and iterate on the tail.
You'll notice your "iterative" version is also recursive btw.

Maximum Length for scala queue

I'm curious if Scala has some gem hidden in its collection classes that I can use. Basically I'm looking for something like a FIFO queue, but that has an upper-limit on its size such that when the limit is hit, the oldest (first) element is removed from the queue. I've implemented this myself in Java in the past, but I'd rather use something standard if possible.

An often preferable alternative to subclassing is the (unfortunately named) "pimp my library" pattern. You can use it to add an enqueueFinite method to Queue, like so:
import scala.collection.immutable.Queue
class FiniteQueue[A](q: Queue[A]) {
def enqueueFinite[B >: A](elem: B, maxSize: Int): Queue[B] = {
var ret = q.enqueue(elem)
while (ret.size > maxSize) { ret = ret.dequeue._2 }
ret
}
}
implicit def queue2finitequeue[A](q: Queue[A]) = new FiniteQueue[A](q)
Whenever queue2finitequeue is in scope, you can treat Queue objects as though they have the enqueueFinite method:
val maxSize = 3
val q1 = Queue(1, 2, 3)
val q2 = q1.enqueueFinite(5, maxSize)
val q3 = q2.map(_+1)
val q4 = q3.enqueueFinite(7, maxSize)
The advantage of this approach over subclassing is that enqueueFinite is available to all Queues, including those that are constructed via operations like enqueue, map, ++, etc.
Update: As Dylan says in the comments, enqueueFinite needs also to take a parameter for the maximum queue size, and drop elements as necessary. I updated the code.

Why don't you just subclass a FIFO queue? Something like this should work: (pseudocode follows...)
class Limited(limit:Int) extends FIFO {
override def enqueue() = {
if (size >= limit) {
//remove oldest element
}
super.enqueue()
}
}

Here is an immutable solution:
class FixedSizeFifo[T](val limit: Int)
( private val out: List[T], private val in: List[T] )
extends Traversable[T] {
override def size = in.size + out.size
def :+( t: T ) = {
val (nextOut,nextIn) = if (size == limit) {
if( out.nonEmpty) {
( out.tail, t::in )
} else {
( in.reverse.tail, List(t) )
}
} else ( out, t::in )
new FixedSizeFifo( limit )( nextOut, nextIn )
}
private lazy val deq = {
if( out.isEmpty ) {
val revIn = in.reverse
( revIn.head, new FixedSizeFifo( limit )( revIn.tail, List() ) )
} else {
( out.head, new FixedSizeFifo( limit )( out.tail, in ) )
}
}
override lazy val head = deq._1
override lazy val tail = deq._2
def foreach[U]( f: T => U ) = ( out ::: in.reverse ) foreach f
}
object FixedSizeFifo {
def apply[T]( limit: Int ) = new FixedSizeFifo[T]( limit )(List(),List())
}
An example:
val fifo = FixedSizeFifo[Int](3) :+ 1 :+ 2 :+ 3 :+ 4 :+ 5 :+ 6
println( fifo ) //prints: FixedSizeFifo(4, 5, 6)
println( fifo.head ) //prints: 4
println( fifo.tail :+ 7 :+8 ) //prints: FixedSizeFifo(6, 7, 8)

This is the approach I toke with extending Scala's standard mutable.Queue class.
class LimitedQueue[A](maxSize: Int) extends mutable.Queue[A] {
override def +=(elem: A): this.type = {
if (length >= maxSize) dequeue()
appendElem(elem);
this
}
}
And simple use-case
var q2 = new LimitedQueue[Int](2)
q2 += 1
q2 += 2
q2 += 3
q2 += 4
q2 += 5
q2.foreach { n =>
println(n)
}
You'll get only 4 and 5 in the console as the old elements were dequeued beforehand.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Parallel FP Growth in Spark - scala

Related

Mapping over a collection that might return multiple values or a single value

Need to merge the Map and List with the same dates

How to dynamically provide N codecs to process fields as a VectorCodec for a record of binary fields that do not contain size bytes

Scala - Recursive method is return different values

Maximum Length for scala queue

Categories

Resources