Class state gets lost between function calls in Flink - Scala

I have this class:
case class IDADiscretizer(
  nAttrs: Int,
  nBins: Int = 5,
  s: Int = 5) extends Serializable {

  private[this] val log = LoggerFactory.getLogger(this.getClass)
  private[this] val V = Vector.tabulate(nAttrs)(i => new IntervalHeapWrapper(nBins, i))
  private[this] val randomReservoir = SamplingUtils.reservoirSample((1 to s).toList.iterator, 1)

  def updateSamples(v: LabeledVector): Vector[IntervalHeapWrapper] = {
    val attrs = v.vector.map(_._2)
    val label = v.label
    // TODO: Check for missing values
    attrs
      .zipWithIndex
      .foreach {
        case (attr, i) =>
          if (V(i).getNbSamples < s) {
            V(i) insertValue attr // insert
          } else {
            if (randomReservoir(0) <= s / (i + 1)) {
              //val randVal = Random nextInt s
              //V(i) replace (randVal, attr)
              V(i) insertValue attr
            }
          }
      }
    V
  }

  /**
   * Return the cutpoints for the discretization
   */
  def cutPoints: Vector[Vector[Double]] = V map (_.getBoundaries.toVector)

  def discretize(data: DataSet[LabeledVector]): (DataSet[Vector[IntervalHeapWrapper]], Vector[Vector[Double]]) = {
    val r = data map (x => updateSamples(x))
    val c = cutPoints
    (r, c)
  }
}
Using Flink, I would like to get the cut points after the call to discretize, but it seems the information stored in V gets lost. Do I have to use Broadcast like in this question? Is there a better way to access the state of the class?
I've tried to call cutPoints in two ways. One is:
def discretize(data: DataSet[LabeledVector]) = data map (x => updateSamples(x))
Then, called from outside:
val a = IDADiscretizer(nAttrs = 4)
val r = a.discretize(dataSet)
r.print
val cuts = a.cutPoints
Here, cuts is empty, so I tried to compute the discretization as well as the cut points inside discretize:
def discretize(data: DataSet[LabeledVector]) = {
  val r = data map (x => updateSamples(x))
  val c = cutPoints
  (r, c)
}
And use it like this:
val a = IDADiscretizer(nAttrs = 4)
val (d, c) = a.discretize(dataSet)
c foreach println
But the same happens.
Finally, I've also tried to make V completely public:
val V = Vector.tabulate(nAttrs)(i => new IntervalHeapWrapper(nBins, i))
Still empty
What am I doing wrong?
Related questions:
Keep keyed state across multiple transformations
Flink State backend keys atomicy and distribution
Flink: does state access across stream?
Flink: Sharing state in CoFlatMapFunction
Answer
Thanks to @TillRohrmann, what I finally did was:
private[this] def computeCutPoints(x: LabeledVector) = {
  val attrs = x.vector.map(_._2)
  val label = x.label
  attrs
    .zipWithIndex
    .foldLeft(V) {
      case (iv, (v, i)) =>
        iv(i) insertValue v
        iv
    }
}

/**
 * Return the cutpoints for the discretization
 */
def cutPoints(data: DataSet[LabeledVector]): Seq[Seq[Double]] =
  data.map(computeCutPoints _)
    .collect
    .last.map(_.getBoundaries.toVector)

def discretize(data: DataSet[LabeledVector]): DataSet[LabeledVector] =
  data.map(updateSamples _)
And then use it like this:
val a = IDADiscretizer(nAttrs = 4)
val d = a.discretize(dataSet)
val cuts = a.cutPoints(dataSet)
d.print
cuts foreach println
I do not know if it is the best way, but at least it is working now.

The way Flink works is that the user defines operators/user-defined functions which operate on input data coming from a source function. In order to execute a program, the user code is sent to the Flink cluster where it is executed. The results of the computation have to be output to some storage system via a sink function.
Due to this, it is not possible to mix local and distributed computations easily, as you are trying with your solution. What discretize does is define a map operator which transforms the input DataSet data. This operation will be executed once you call ExecutionEnvironment#execute or DataSet#print, for example. Now the user code and the definition of IDADiscretizer are sent to the cluster where they are instantiated. Flink will update the values in an instance of IDADiscretizer which is not the same instance as the one you have on the client.
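If the goal is simply to have the boundaries available on the client, one option consistent with this model is to make them part of the DataSet and fetch them with collect(), which triggers execution and ships the results back to the client. A rough, untested sketch, reusing the question's second discretize variant (the one returning DataSet[Vector[IntervalHeapWrapper]]) and assuming import org.apache.flink.api.scala._ is in scope:
val a = IDADiscretizer(nAttrs = 4)
val cutsOnClient: Seq[Vector[Vector[Double]]] =
  a.discretize(dataSet)                    // runs on the cluster
    .map(_.map(_.getBoundaries.toVector))  // extract the boundaries there as well
    .collect()                             // execute the job and return the results to the client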

Related

Scala count number of times function returns each value, functionally

I want to count up the number of times that a function f returns each value in its range (0 to f_max, inclusive) when applied to a given list l, and return the result as an array, in Scala.
Currently, I accomplish this as follows:
def count(l: List): Array[Int] = {
  val arr = new Array[Int](f_max + 1)
  l.foreach {
    el => arr(f(el)) += 1
  }
  return arr
}
So arr(n) is the number of times that f returns n when applied to each element of l. This works; however, it is imperative in style, and I am wondering if there is a clean way to do this purely functionally.
Thank you
How about a more general approach:
def count[InType, ResultType](l: Seq[InType], f: InType => ResultType): Map[ResultType, Int] = {
  l.view                         // create a view so we don't create new collections after each step
    .map(f)                      // apply your function to every item in the original sequence
    .groupBy(x => x)             // group the returned values
    .map(x => x._1 -> x._2.size) // count returned values
}
val f = (i:Int) => i
count(Seq(1,2,3,4,5,6,6,6,4,2), f)
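For reference, with the identity function above this yields the count of each distinct value, i.e. (entry order unspecified):
// Map(1 -> 1, 2 -> 2, 3 -> 1, 4 -> 2, 5 -> 1, 6 -> 3)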
A direct functional translation of the original, if you want to keep the indexed result, is a foldLeft over an immutable Vector:
l.foldLeft(Vector.fill(f_max + 1)(0)) { (acc, el) =>
  val result = f(el)
  acc.updated(result, acc(result) + 1)
}
Alternatively, a good balance of performance and external purity would be:
def count(l: List[???]): Vector[Int] = {
  val arr = l.foldLeft(Array.fill(f_max + 1)(0)) { (acc, el) =>
    val result = f(el)
    acc(result) += 1
    acc // the fold must return the accumulator, not Unit
  }
  arr.toVector
}

Scala: Safe access to index in a List[DataFrame]

I receive a List[DataFrame] and I want to store each df in a variable. Some values always exist in the list:
val routes = dataframes(0)
val stops = dataframes(1)
But other ones may also come, so the size of the list is variable.
How can I safely access an index of the list that may be out of bounds? I thought that with Some() and handling the result it would work:
val fare_attributes: Option[DataFrame] = Some(dataframes(10))

fare_attributes match {
  case Some(fare) =>
    upload()
    println("fare_attributes uploaded")
  case None =>
    println("No fare_attributes found")
}
But I receive: java.lang.IndexOutOfBoundsException: 2
You can use .lift on your list:
val fare_attributes : Option[DataFrame] = dataframes.lift(10)
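The original match then works as intended; a small sketch reusing the question's upload():
dataframes.lift(10) match {
  case Some(fare) =>
    upload()
    println("fare_attributes uploaded")
  case None =>
    println("No fare_attributes found")
}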
I think you will have to rely on checking the length of the list before accessing the indexed value. You may want to implement a wrapper function for it, so that it isn't done repeatedly.
You can be a bit "elegant" about it by currying with two parameter lists, so that your code stays concise. Here's a sample which you may improve:
def safeList(list: List[Int])(index: Int): Int = {
  if (index < list.length) list(index)
  else 0
}
val x = List(1, 2 ,3 )
val y = safeList(x)(_)
val a = y(0) // returns 1
val b = y(1) // returns 2
val c = y(4) // returns 0
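The same "safe access with a default" pattern can also be built on lift, which avoids the manual length check; an illustrative one-liner:
val z = (i: Int) => x.lift(i).getOrElse(0)
z(1) // returns 2
z(4) // returns 0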

How to aggregateByKey with custom class for frequency distribution?

I am trying to create a frequency distribution.
My data is in the following pattern (ColumnIndex, (Value, countOfValue)) of type (Int, (Any, Long)). For instance, (1, (A, 10)) means for column index 1, there are 10 A's.
My goal is to get the top 100 values for all my indexes, or keys.
Right away I can make it less compute-intensive for my workload by doing an initial filter:
val freqNumDist = numRDD.filter(x => x._2._2 > 1)
Now I found an interesting example of a class here, which seems to fit my use case:
class TopNList(val maxSize: Int) extends Serializable {
  val topNCountsForColumnArray = new mutable.ArrayBuffer[(Any, Long)]
  var lowestColumnCountIndex: Int = -1
  var lowestValue = Long.MaxValue

  def add(newValue: Any, newCount: Long): Unit = {
    if (topNCountsForColumnArray.length < maxSize - 1) {
      topNCountsForColumnArray += ((newValue, newCount))
    } else if (topNCountsForColumnArray.length == maxSize) {
      updateLowestValue
    } else {
      if (newCount > lowestValue) {
        topNCountsForColumnArray.insert(lowestColumnCountIndex, (newValue, newCount))
        updateLowestValue
      }
    }
  }

  def updateLowestValue: Unit = {
    var index = 0
    topNCountsForColumnArray.foreach { r =>
      if (r._2 < lowestValue) {
        lowestValue = r._2
        lowestColumnCountIndex = index
      }
      index += 1
    }
  }
}
So now what I was thinking was putting together an aggregateByKey that uses this class in order to get my top 100 values. The problem is that I am unsure how to use this class in aggregateByKey in order to accomplish this goal.
val initFreq: TopNList = new TopNList(100)

def freqSeq(u: TopNList, v: (Double, Long)) = (
  u.add(v._1, v._2)
)

def freqComb(u1: TopNList, u2: TopNList) = (
  u2.topNCountsForColumnArray.foreach(r => u1.add(r._1, r._2))
)

val freqNumDist = numRDD.filter(x => x._2._2 > 1).aggregateByKey(initFreq)(freqSeq, freqComb)
The obvious problem is that nothing is returned by the functions I am using. So I am wondering how to modify this class, or do I need to think about this in a whole new light and just cherry-pick some of the functions out of this class and add them to the functions I am using for the aggregateByKey?
I'm either thinking about classes wrong or the entire aggregateByKey or both!
Your projection implementations (freqSeq, freqComb) return Unit, while aggregateByKey expects them to return TopNList.
If you intentionally keep the style of your solution, the relevant implementation should be:
def freqSeq(u: TopNList, v: (Any, Long)): TopNList = {
  u.add(v._1, v._2) // operation gives void result (Unit)
  u                 // this one is of type TopNList
}

def freqComb(u1: TopNList, u2: TopNList): TopNList = {
  u2.topNCountsForColumnArray.foreach(r => u1.add(r._1, r._2))
  u1
}
Just take a look at the aggregateByKey signature in PairRDDFunctions to see what it expects:
def aggregateByKey[U](zeroValue : U)(seqOp : scala.Function2[U, V, U], combOp : scala.Function2[U, U, U])(implicit evidence$3 : scala.reflect.ClassTag[U]) : org.apache.spark.rdd.RDD[scala.Tuple2[K, U]] = { /* compiled code */ }
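With the corrected seqOp/combOp above, wiring things together looks roughly like this (an untested sketch reusing the names from the question; passing new TopNList(100) directly as the zero value is fine because Spark serializes it and each task works on its own copy):
val freqNumDist = numRDD
  .filter(x => x._2._2 > 1)                              // keep values seen more than once
  .aggregateByKey(new TopNList(100))(freqSeq, freqComb)  // RDD[(Int, TopNList)]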

Union-Find (or Disjoint Set) data structure in Scala

I am looking for an existing implementation of a union-find or disjoint-set data structure in Scala before I attempt to roll my own, as the optimisations look somewhat complicated.
I mean this kind of thing - where the two operations union and find are optimised.
Does anybody know of anything existing? I've obviously tried googling around.
I had written one for myself some time back which I believe performs decently. Unlike other implementations, the find is O(1) and union is O(log(n)). If you have a lot more union operations than find, then this might not be very useful. I hope you find it useful:
package week2

import scala.collection.immutable.HashSet
import scala.collection.immutable.HashMap

/**
 * Union-Find implementation.
 * Find is O(1)
 * Union is O(log(n))
 * Implementation is using a HashTable. Each wrap has a set which maintains the elements in that wrap.
 * When 2 wraps are unioned, both sets are merged. O(log(n)) operation
 * A HashMap is also maintained to find the Wrap associated with each node. O(log(n)) operation in maintaining it.
 *
 * If the input array is null at any index, it is ignored
 */
class UnionFind[T](all: Array[T]) {
  private var dataStruc = new HashMap[T, Wrap]
  for (a <- all if (a != null))
    dataStruc = dataStruc + (a -> new Wrap(a))

  var timeU = 0L
  var timeF = 0L

  /**
   * The number of Unions
   */
  private var size = dataStruc.size

  /**
   * Unions the set containing a and b
   */
  def union(a: T, b: T): Wrap = {
    val st = System.currentTimeMillis()
    val first: Wrap = dataStruc.get(a).get
    val second: Wrap = dataStruc.get(b).get
    if (first.contains(b) || second.contains(a))
      first
    else {
      // below is to merge smaller with bigger rather than other way around
      val firstIsBig = (first.set.size > second.set.size)
      val ans = if (firstIsBig) {
        first.set = first.set ++ second.set
        second.set.foreach(a => {
          dataStruc = dataStruc - a
          dataStruc = dataStruc + (a -> first)
        })
        first
      } else {
        second.set = second.set ++ first.set
        first.set.foreach(a => {
          dataStruc = dataStruc - a
          dataStruc = dataStruc + (a -> second)
        })
        second
      }
      timeU = timeU + (System.currentTimeMillis() - st)
      size = size - 1
      ans
    }
  }

  /**
   * true if they are in the same set, false if not
   */
  def find(a: T, b: T): Boolean = {
    val st = System.currentTimeMillis()
    val ans = dataStruc.get(a).get.contains(b)
    timeF = timeF + (System.currentTimeMillis() - st)
    ans
  }

  def sizeUnion: Int = size

  class Wrap(e: T) {
    var set = new HashSet[T]
    set = set + e

    def add(elem: T) {
      set = set + elem
    }

    def contains(elem: T): Boolean = set.contains(elem)
  }
}
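A quick usage example for the class above:
val uf = new UnionFind(Array("a", "b", "c", "d"))
uf.union("a", "b")
uf.find("a", "b") // true
uf.find("a", "c") // false
uf.sizeUnion      // 3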
Here is a simple, short and somewhat efficient mutable implementation of UnionFind:
import scala.collection.mutable
class UnionFind[T]:
  private val map = new mutable.HashMap[T, mutable.HashSet[T]]
  private var size = 0

  def distinct = size

  def addFresh(a: T): Unit =
    assert(!map.contains(a))
    val set = new mutable.HashSet[T]
    set += a
    map(a) = set
    size += 1

  def setEqual(a: T, b: T): Unit =
    val ma = map(a)
    val mb = map(b)
    if !ma.contains(b) then
      // redirect the elements of the smaller set to the bigger set
      if ma.size > mb.size then
        ma ++= mb
        mb.foreach { x => map(x) = ma }
      else
        mb ++= ma
        ma.foreach { x => map(x) = mb }
      size = size - 1

  def isEqual(a: T, b: T): Boolean =
    map(a).contains(b)
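For illustration, usage of this one (Scala 3):
val uf = new UnionFind[Int]
(1 to 4).foreach(uf.addFresh)
uf.setEqual(1, 2)
uf.isEqual(1, 2) // true
uf.distinct      // 3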
Remarks:
An immutable implementation of UnionFind can be useful when rollback, backtracking, or proofs are necessary
A mutable implementation can avoid garbage collection for speedup
One could also consider a persistent data structure: it works like an immutable implementation, but internally uses some mutable state for speed
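For comparison, the optimisations the question alludes to (union by rank plus path compression) are short enough to sketch directly. This is an illustrative array-based version over integer indices, not a drop-in replacement for the implementations above:
class DisjointSet(n: Int) {
  private val parent = Array.tabulate(n)(identity) // each element starts as its own root
  private val rank   = Array.fill(n)(0)

  def find(x: Int): Int = {
    if (parent(x) != x) parent(x) = find(parent(x)) // path compression
    parent(x)
  }

  def union(x: Int, y: Int): Unit = {
    val rx = find(x)
    val ry = find(y)
    if (rx != ry) {
      if (rank(rx) < rank(ry)) parent(rx) = ry  // attach the shallower tree below the deeper one
      else if (rank(rx) > rank(ry)) parent(ry) = rx
      else { parent(ry) = rx; rank(rx) += 1 }   // equal ranks: pick one root and bump its rank
    }
  }
}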

workaround for prepending to a LinkedHashMap in Scala?

I have a LinkedHashMap which I've been using in a typical way: adding new key-value pairs to the end, and accessing them in order of insertion. However, now I have a special case where I need to add pairs to the "head" of the map. I think there's some functionality inside the LinkedHashMap source for doing this, but it has private accessibility.
I have a solution where I create a new map, add the pair, then add all the old mappings.
In Java syntax:
newMap.put(newKey, newValue)
newMap.putAll(this.map)
this.map = newMap
It works. But the problem here is that I then need to make my main data structure (this.map) a var rather than a val.
Can anyone think of a nicer solution? Note that I definitely need the fast lookup functionality provided by a Map collection. The performance of prepending is not such a big deal.
More generally, as a Scala developer how hard would you fight to avoid a var in a case like this, assuming there's no foreseeable need for concurrency? Would you create your own version of LinkedHashMap? Looks like a hassle, frankly.
This will work but is not especially nice either:
import scala.collection.mutable.LinkedHashMap
def prepend[K, V](map: LinkedHashMap[K, V], kv: (K, V)) = {
  val copy = map.toMap
  map.clear
  map += kv
  map ++= copy
}
val map = LinkedHashMap('b -> 2)
prepend(map, 'a -> 1)
map == LinkedHashMap('a -> 1, 'b -> 2)
Have you taken a look at the code of LinkedHashMap? The class has a field firstEntry, and just by taking a quick peek at updateLinkedEntries, it should be relatively easy to create a subclass of LinkedHashMap which only adds a new method prepend and a helper updateLinkedEntriesPrepend, resulting in the behavior you need, e.g. (not tested):
private def updateLinkedEntriesPrepend(e: Entry) {
  if (firstEntry == null) { firstEntry = e; lastEntry = e }
  else {
    val oldFirstEntry = firstEntry
    firstEntry = e
    firstEntry.later = oldFirstEntry
    oldFirstEntry.earlier = e
  }
}
Here is a sample implementation I threw together real quick (that is, not thoroughly tested!):
class MyLinkedHashMap[A, B] extends LinkedHashMap[A, B] {

  def prepend(key: A, value: B): Option[B] = {
    val e = findEntry(key)
    if (e == null) {
      val e = new Entry(key, value)
      addEntry(e)
      updateLinkedEntriesPrepend(e)
      None
    } else {
      // The key already exists, so we might as well call LinkedHashMap#put
      put(key, value)
    }
  }

  private def updateLinkedEntriesPrepend(e: Entry) {
    if (firstEntry == null) { firstEntry = e; lastEntry = e }
    else {
      val oldFirstEntry = firstEntry
      firstEntry = e
      firstEntry.later = oldFirstEntry
      oldFirstEntry.earlier = firstEntry
    }
  }
}
}
Tested like this:
object Main {
  def main(args: Array[String]) {
    val x = new MyLinkedHashMap[String, Int]()
    x.prepend("foo", 5)
    x.prepend("bar", 6)
    x.prepend("olol", 12)
    x.foreach(x => println("x:" + x._1 + " y: " + x._2))
  }
}
Which, on Scala 2.9.0 (yeah, need to update), results in:
x:olol y: 12
x:bar y: 6
x:foo y: 5
A quick benchmark shows an order-of-magnitude performance difference between the extended built-in class and the "map rewrite" approach (I used the code from Debilski's answer as "ExternalMethod" and mine as "BuiltIn"):
benchmark       length        us linear runtime
ExternalMethod      10   1218.44 =
ExternalMethod     100   1250.28 =
ExternalMethod    1000  19453.59 =
ExternalMethod   10000 349297.25 ==============================
BuiltIn             10      3.10 =
BuiltIn            100      2.48 =
BuiltIn           1000      2.38 =
BuiltIn          10000      3.28 =
The benchmark code:
def timeExternalMethod(reps: Int) = {
  var r = reps
  while (r > 0) {
    for (i <- 1 to 100) prepend(map, (i, i))
    r -= 1
  }
}

def timeBuiltIn(reps: Int) = {
  var r = reps
  while (r > 0) {
    for (i <- 1 to 100) map.prepend(i, i)
    r -= 1
  }
}
Using a Scala benchmarking template.