Maximum enclosing circle algorithm with Spark - Scala

I'm new to Spark and Scala, and I'm trying to implement the maximum enclosing circle algorithm. For input I have a CSV file with id, x, y columns:
id x y
1 1 0
2 1 2
1 0 0
3 5 10
...
I need to find the maximum enclosing circle for each id.
I implemented a solution, but it is not optimal:
val data = csv
  .filter(_ != header)
  .map(_.split(","))
  .map(col => (col(0), Location(col(1).toInt, col(2).toInt)))

val idLoc = data.groupByKey()
val ids = idLoc.keys.collect().toList.par

ids.foreach {
  case id =>
    val locations = data.filter(_._1 == id).values.cache()
    val maxEnclCircle = findMaxEnclCircle(findEnclosedPoints(locations, eps))
}
def findMaxEnclCircle(centroids: RDD[(Location, Long)]): Location = {
  centroids.max()(new Ordering[(Location, Long)]() {
    override def compare(x: (Location, Long), y: (Location, Long)): Int =
      Ordering[Long].compare(x._2, y._2)
  })._1
}

def findEnclosedPoints(locations: RDD[Location], eps: Double): RDD[(Location, Long)] = {
  locations.cartesian(locations).filter {
    case (a, b) => isEnclosed(a, b, eps)
  }.map { case (a, b) => (a, 1L) }.reduceByKey(_ + _)
}
As you can see, I keep a list of ids in memory. How can I improve the code to get rid of it?
Thanks.

The main problem with your code above is that you are not using the cluster! When you do collect(), all your data is sent to the single master node and all computations are done there. For efficiency, use aggregateByKey() to shuffle all the points with the same id to the same executor, then do the computation there.
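A rough sketch of that idea is below. It reuses the question's own Location, isEnclosed and eps, and the per-id circle search is only an assumed stand-in for the real computation, not a tested implementation:

val maxCirclePerId = data
  .aggregateByKey(List.empty[Location])(
    (acc, loc) => loc :: acc,   // fold one point into this id's partial list
    (a, b) => a ::: b)          // merge partial lists from different partitions
  .mapValues { locations =>
    // Purely local, per-id computation that runs on the executor:
    // for every candidate centre, count how many points it encloses.
    val enclosedCounts = for {
      a <- locations
      b <- locations
      if isEnclosed(a, b, eps)
    } yield a
    enclosedCounts.groupBy(identity).maxBy(_._2.size)._1
  }

Nothing is collected to the driver here except whatever you decide to do with maxCirclePerId afterwards.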


How to dynamically provide N codecs to process fields as a VectorCodec for a record of binary fields that do not contain size bytes

Consider this function in Decoder:
final def decodeCollect[F[_], A](dec: Decoder[A], limit: Option[Int])(buffer: BitVector)(implicit cbf: Factory[A, F[A]]): Attempt[DecodeResult[F[A]]] = {
What I really need is dec: Vector[Decoder[A]], like this:
final def decodeCollect[F[_], A](dec: Vector[Decoder[A]], limit: Option[Int])(buffer: BitVector)(implicit cbf: Factory[A, F[A]]): Attempt[DecodeResult[F[A]]] = {
to process a binary format that has fields that are not self-describing. Early in the file there are description records, and from these come field sizes that have to be applied later to data records. So I want to build up a list of decoders and apply each one in turn, where N is the number of decoders.
I could write a new function modeled on decodeCollect, but it takes an implicit Factory, so I would probably have to compile the scodec library and add it there.
Is there a simpler approach using what exists in the scodec library? Either a way to deal with the factory or a different approach?
I finally hacked a solution into the scodec codebase. Now that that door is open, I'll add whatever I need until I succeed.
final def decodeNCollect[F[_], A](dec: Vector[Decoder[A]])(buffer: BitVector)(implicit cbf: Factory[A, F[A]]): Attempt[DecodeResult[F[A]]] = {
  val bldr = cbf.newBuilder
  var remaining = buffer
  var count = 0
  val maxCount = dec.length
  var error: Option[Err] = None
  while (count < maxCount && remaining.nonEmpty) {
    dec(count).decode(remaining) match {
      case Attempt.Successful(DecodeResult(value, rest)) =>
        bldr += value
        count += 1
        remaining = rest
      case Attempt.Failure(err) =>
        error = Some(err.pushContext(count.toString))
        remaining = BitVector.empty
    }
  }
  Attempt.fromErrOption(error, DecodeResult(bldr.result, remaining))
}
final def encodeNSeq[A](encs: Vector[Encoder[A]])(seq: collection.immutable.Seq[A]): Attempt[BitVector] = {
  if (encs.length != seq.length)
    return Attempt.failure(Err("encodeNSeq: length of coders and items does not match"))
  val buf = new collection.mutable.ArrayBuffer[BitVector](seq.size)
  ((seq zip (0 until encs.length)): Seq[(A, Int)]) foreach { case (a, i) =>
    encs(i).encode(a) match {
      case Attempt.Successful(aa) => buf += aa
      case Attempt.Failure(err)   => return Attempt.failure(err.pushContext(buf.size.toString))
    }
  }
  def merge(offset: Int, size: Int): BitVector = size match {
    case 0 => BitVector.empty
    case 1 => buf(offset)
    case n =>
      val half = size / 2
      merge(offset, half) ++ merge(offset + half, half + (if (size % 2 == 0) 0 else 1))
  }
  Attempt.successful(merge(0, buf.size))
}
private[codecs] final class VectorNCodec[A](codecs: Vector[Codec[A]]) extends Codec[Vector[A]] {
  def sizeBound = SizeBound(0, Some(codecs.length.toLong))
  def encode(vector: Vector[A]) = Encoder.encodeNSeq(codecs)(vector)
  def decode(buffer: BitVector) =
    Decoder.decodeNCollect[Vector, A](codecs)(buffer)
  override def toString = s"vector($codecs)"
}
def vectorOf[A](valueCodecs: Vector[Codec[A]]): Codec[Vector[A]] =
  provide(valueCodecs.length).
    flatZip { count => new VectorNCodec(valueCodecs) }.
    narrow[Vector[A]]({ case (cnt, xs) =>
      if (xs.size == cnt) Attempt.successful(xs)
      else Attempt.failure(Err(s"Insufficient number of elements: decoded ${xs.size} but should have decoded $cnt"))
    }, xs => (xs.size, xs)).
    withToString(s"vectorOf($valueCodecs)")
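For illustration, here is a hypothetical usage sketch of the vectorOf defined above. The field widths and the hex input are invented, and each field is read with the standard scodec.codecs.uint codec:

import scodec.Codec
import scodec.bits.BitVector
import scodec.codecs.uint

// Hypothetical: bit widths taken from earlier description records
val fieldWidths = Vector(8, 8, 16)
val fieldCodecs: Vector[Codec[Int]] = fieldWidths.map(uint)

// Decode one data record whose fields are not self-describing
val record: Codec[Vector[Int]] = vectorOf(fieldCodecs)
val decoded = record.decode(BitVector.fromValidHex("ff0abbcc"))
// decoded: Successful(DecodeResult(Vector(255, 10, 48076), BitVector(empty)))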

Class state gets lost between function calls in Flink

I have this class:
case class IDADiscretizer(
  nAttrs: Int,
  nBins: Int = 5,
  s: Int = 5) extends Serializable {

  private[this] val log = LoggerFactory.getLogger(this.getClass)
  private[this] val V = Vector.tabulate(nAttrs)(i => new IntervalHeapWrapper(nBins, i))
  private[this] val randomReservoir = SamplingUtils.reservoirSample((1 to s).toList.iterator, 1)

  def updateSamples(v: LabeledVector): Vector[IntervalHeapWrapper] = {
    val attrs = v.vector.map(_._2)
    val label = v.label
    // TODO: Check for missing values
    attrs
      .zipWithIndex
      .foreach {
        case (attr, i) =>
          if (V(i).getNbSamples < s) {
            V(i) insertValue attr // insert
          } else {
            if (randomReservoir(0) <= s / (i + 1)) {
              //val randVal = Random nextInt s
              //V(i) replace (randVal, attr)
              V(i) insertValue attr
            }
          }
      }
    V
  }

  /**
   * Return the cutpoints for the discretization
   */
  def cutPoints: Vector[Vector[Double]] = V map (_.getBoundaries.toVector)

  def discretize(data: DataSet[LabeledVector]): (DataSet[Vector[IntervalHeapWrapper]], Vector[Vector[Double]]) = {
    val r = data map (x => updateSamples(x))
    val c = cutPoints
    (r, c)
  }
}
Using Flink, I would like to get the cutpoints after the call to discretize, but it seems the information stored in V gets lost. Do I have to use Broadcast like in this question? Is there a better way to access the state of the class?
I've tried to call cutPoints in two ways. One of them is:
def discretize(data: DataSet[LabeledVector]) = data map (x => updateSamples(x))
Then, called from outside:
val a = IDADiscretizer(nAttrs = 4)
val r = a.discretize(dataSet)
r.print
val cuts = a.cutPoints
Here, cuts is empty so I tried to compute the discretization as well as the cutpoints inside discretize:
def discretize(data: DataSet[LabeledVector]) = {
  val r = data map (x => updateSamples(x))
  val c = cutPoints
  (r, c)
}
And use it like this:
val a = IDADiscretizer(nAttrs = 4)
val (d, c) = a.discretize(dataSet)
c foreach println
But the same thing happens.
Finally, I've also tried to make V completely public:
val V = Vector.tabulate(nAttrs)(i => new IntervalHeapWrapper(nBins, i))
It is still empty.
What am I doing wrong?
Related questions:
Keep keyed state across multiple transformations
Flink State backend keys atomicy and distribution
Flink: does state access across stream?
Flink: Sharing state in CoFlatMapFunction
Answer
Thanks to @TillRohrmann, what I finally did was:
private[this] def computeCutPoints(x: LabeledVector) = {
  val attrs = x.vector.map(_._2)
  val label = x.label
  attrs
    .zipWithIndex
    .foldLeft(V) {
      case (iv, (v, i)) =>
        iv(i) insertValue v
        iv
    }
}

/**
 * Return the cutpoints for the discretization
 */
def cutPoints(data: DataSet[LabeledVector]): Seq[Seq[Double]] =
  data.map(computeCutPoints _)
    .collect
    .last.map(_.getBoundaries.toVector)

def discretize(data: DataSet[LabeledVector]): DataSet[LabeledVector] =
  data.map(updateSamples _)
And then use it like this:
val a = IDADiscretizer(nAttrs = 4)
val d = a.discretize(dataSet)
val cuts = a.cutPoints(dataSet)
d.print
cuts foreach println
I do not know if it is the best way, but at least it is working now.
The way Flink works is that the user defines operators/user-defined functions which operate on input data coming from a source function. In order to execute a program, the user code is sent to the Flink cluster where it is executed. The results of the computation have to be output to some storage system via a sink function.
Due to this, it is not possible to easily mix local and distributed computations as you are trying to do with your solution. What discretize does is define a map operator which transforms the input DataSet data. This operation will be executed once you call ExecutionEnvironment#execute or DataSet#print, for example. At that point the user code and the definition of IDADiscretizer are sent to the cluster, where they are instantiated. Flink will update the values in an instance of IDADiscretizer which is not the same instance as the one you have on the client.
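A minimal standalone sketch of that behaviour (an assumed example, not the question's classes): a driver-side variable mutated inside a map function is only mutated in the serialized copy that runs on the cluster, so the driver never sees the update.

import org.apache.flink.api.scala._

object LocalVsClusterState {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    var seen = 0                                // lives in the driver JVM
    val mapped = env.fromElements(1, 2, 3)
      .map { x => seen += 1; x }                // mutates a serialized copy of `seen`
    mapped.print()                              // triggers execution on the cluster
    println(s"seen = $seen")                    // still 0 on the driver
  }
}

This is why the cutpoints have to be computed inside the dataflow (as in the workaround above) or shipped back through a sink or collect, rather than read from the client-side instance.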

Scala count number of times function returns each value, functionally

I want to count up the number of times that a function f returns each value in its range (0 to f_max, inclusive) when applied to a given list l, and return the result as an array, in Scala.
Currently, I accomplish this as follows:
def count(l: List): Array[Int] = {
  val arr = new Array[Int](f_max + 1)
  l.foreach {
    el => arr(f(el)) += 1
  }
  return arr
}
So arr(n) is the number of times that f returns n when applied to each element of l. This works; however, it is imperative in style, and I am wondering if there is a clean way to do this purely functionally.
Thank you
How about a more general approach:
def count[InType, ResultType](l: Seq[InType], f: InType => ResultType): Map[ResultType, Int] = {
  l.view                          // create a view so we don't create new collections after each step
    .map(f)                       // apply your function to every item in the original sequence
    .groupBy(x => x)              // group the returned values
    .map(x => x._1 -> x._2.size)  // count returned values
}
val f = (i:Int) => i
count(Seq(1,2,3,4,5,6,6,6,4,2), f)
Another option is a purely functional fold over an immutable Vector:
l.foldLeft(Vector.fill(f_max + 1)(0)) { (acc, el) =>
  val result = f(el)
  acc.updated(result, acc(result) + 1)
}
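For example, with a hypothetical f and f_max (neither is defined in the question, so both are assumptions here):

val f: Int => Int = _ % 3   // hypothetical function with range 0..2
val f_max = 2

val counts = List(1, 2, 3, 4, 5, 6).foldLeft(Vector.fill(f_max + 1)(0)) { (acc, el) =>
  val result = f(el)
  acc.updated(result, acc(result) + 1)
}
// counts == Vector(2, 2, 2): f hits each of 0, 1 and 2 twice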
Alternatively, a good balance of performance and external purity would be:
def count(l: List[???]): Vector[Int] = {
  val arr = l.foldLeft(Array.fill(f_max + 1)(0)) { (acc, el) =>
    val result = f(el)
    acc(result) += 1
    acc  // return the accumulator so foldLeft keeps threading the same array
  }
  arr.toVector
}

Scala: Safe access to index in a List[DataFrame]

I receive a List[DataFrame] and I want to store each df in a variable. Some values always exist in the list:
val routes = dataframes(0)
val stops = dataframes(1)
But other ones may also come, so the size of the list is variable.
How can I safely access an index of the list that may be out of bounds? I thought that wrapping the access in Some() and handling the result would work:
val fare_attributes: Option[DataFrame] = Some(dataframes(10))

fare_attributes match {
  case Some(fare) =>
    upload()
    println("fare_attributes uploaded")
  case None => println("No fare_attributes found")
}
But I receive: java.lang.IndexOutOfBoundsException: 2
You can use .lift on your list:
val fare_attributes : Option[DataFrame] = dataframes.lift(10)
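For example, a sketch that reuses the question's match (upload() is the question's own call and is assumed to be in scope):

val fare_attributes: Option[DataFrame] = dataframes.lift(10)   // None if index 10 does not exist

fare_attributes match {
  case Some(fare) =>
    upload()
    println("fare_attributes uploaded")
  case None =>
    println("No fare_attributes found")
}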
I think you will have to rely on checking the length of the list before accessing the indexed value. You may want to implement some wrapper function to do so, so that it isn't done repeatedly.
You can be a bit more elegant about it by currying with two parameter lists, so that your code is a bit more concise. Here's a sample which you may improve:
def safeList(list: List[Int])(index: Int): Int = {
  if (index < list.length) list(index)
  else 0
}

val x = List(1, 2, 3)
val y = safeList(x)(_)

val a = y(0) // returns 1
val b = y(1) // returns 2
val c = y(4) // returns 0

Finite Growable Array in Scala

I would like to be able to grow an Array-like structure up to a maximum size, after which the oldest (1st) element would be dropped off the structure every time a new element is added. I don't know what the best way to do this is, but one way would be to extend the ArrayBuffer class, and override the += operator so that if the maximum size has been reached, the first element is dropped every time a new one is added. I haven't figured out how to properly extend collections yet. What I have so far is:
class FiniteGrowableArray[A](maxLength: Int) extends scala.collection.mutable.ArrayBuffer[A] {
  override def +=(elem: A): <insert some return type here> = {
    // append element
    if (length > maxLength) remove(0)
    <returned collection>
  }
}
Can someone suggest a better path and/or help me along this one? NOTE: I will need to arbitrarily access elements within the structure multiple times in between the += operations.
Thanks
As others have discussed, you want a ring buffer. However, you also have to decide whether you actually want all of the collections methods, and if so, what happens when you filter a ring buffer of maximum size N: does it keep its maximum size, or what?
If you're okay with merely being able to view your ring buffer as part of the collections hierarchy (but don't want to use collections efficiently to generate new ring buffers) then you can just:
class RingBuffer[T: ClassManifest](maxsize: Int) {
  private[this] val buffer = new Array[T](maxsize+1)
  private[this] var i0,i1 = 0
  private[this] def i0up = { i0 += 1; if (i0>=buffer.length) i0 -= buffer.length }
  private[this] def i0dn = { i0 -= 1; if (i0<0) i0 += buffer.length }
  private[this] def i1up = { i1 += 1; if (i1>=buffer.length) i1 -= buffer.length }
  private[this] def i1dn = { i1 -= 1; if (i1<0) i1 += buffer.length }
  private[this] def me = this
  def apply(i: Int) = {
    val j = i+i0
    if (j >= buffer.length) buffer(j-buffer.length) else buffer(j)
  }
  def size = if (i1<i0) buffer.length+i1-i0 else i1-i0
  def :+(t: T) = {
    buffer(i1) = t
    i1up; if (i1==i0) i0up
    this
  }
  def +:(t: T) = {
    i0dn; if (i0==i1) i1dn
    buffer(i0) = t
    this
  }
  def popt = {
    if (i1==i0) throw new java.util.NoSuchElementException
    i1dn; buffer(i1)
  }
  def poph = {
    if (i1==i0) throw new java.util.NoSuchElementException
    val ans = buffer(i0); i0up; ans
  }
  def seqView = new IndexedSeq[T] {
    def apply(i: Int) = me(i)
    def length = me.size
  }
}
Now you can use this directly, and you can jump out to IndexedSeq when needed:
val r = new RingBuffer[Int](4)
r :+ 7 :+ 9 :+ 2
r.seqView.mkString(" ") // Prints 7 9 2
r.popt // Returns 2
r.poph // Returns 7
r :+ 6 :+ 5 :+ 4 :+ 3
r.seqView.mkString(" ") // Prints 6 5 4 3 -- 9 fell off the end
0 +: 1 +: 2 +: r
r.seqView.mkString(" ") // Prints 0 1 2 6 -- added to front; 3,4,5 fell off
r.seqView.filter(_>1) // Vector(2,6)
and if you want to put things back into a ring buffer, you can
class RingBufferImplicit[T: ClassManifest](ts: Traversable[T]) {
  def ring(maxsize: Int) = {
    val rb = new RingBuffer[T](maxsize)
    ts.foreach(rb :+ _)
    rb
  }
}

implicit def traversable2ringbuffer[T: ClassManifest](ts: Traversable[T]) = {
  new RingBufferImplicit(ts)
}
and then you can do things like
val rr = List(1,2,3,4,5).ring(4)
rr.seqView.mkString(" ") // Prints 2 3 4 5