How do I sum up multiple fields of a class? - scala

I have a class Dimensions(Int, Int, Int) and a Shape(String name), put into a Tuple(Shape, Dimensions)
My dataset is:
(Cube, Dimensions(5,5,5))
(Sphere, Dimensions(5,10,15))
(Cube, Dimensions(3,3,3))
I need to return this:
(Cube, Dimensions(8,8,8))
(Sphere, Dimensions(5,10,15))
where I group by the name of the shape and then sum up all of the dimension values. Currently I am able to map into a (Name, Int, Int, Int), but I am unsure how to wrap it back into a Dimensions object.
data.map(_._2.map(x => (x.length,x.width,x.height)))
Any help would be appreciated

Assuming there are no special cases and you have an RDD, you just need aggregateByKey.
import org.apache.spark.rdd.RDD

case class Dimensions(i1: Int, i2: Int, i3: Int)
val initialRdd: RDD[(Shape, Dimensions)] = ???

// Field-wise sum of two Dimensions values.
def combineDimensions(dimensions1: Dimensions, dimensions2: Dimensions): Dimensions =
  Dimensions(
    dimensions1.i1 + dimensions2.i1,
    dimensions1.i2 + dimensions2.i2,
    dimensions1.i3 + dimensions2.i3
  )

val finalRdd: RDD[(Shape, Dimensions)] =
  initialRdd
    .aggregateByKey(Dimensions(0, 0, 0))(
      // Fold each value into the per-partition accumulator.
      { case (accDimensions, dimensions) =>
        combineDimensions(accDimensions, dimensions)
      },
      // Merge the accumulators of two partitions.
      { case (partitionDimensions1, partitionDimensions2) =>
        combineDimensions(partitionDimensions1, partitionDimensions2)
      }
    )
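The combine logic can be checked without a cluster. The following is a minimal sketch on plain Scala collections, assuming the field names length, width and height from the question and using a String in place of Shape; the object name DimensionsSumSketch is made up for illustration.

```scala
object DimensionsSumSketch {
  // Field names taken from the question; Shape is reduced to its name (an assumption).
  case class Dimensions(length: Int, width: Int, height: Int)

  // Field-wise sum of two Dimensions values.
  def combine(a: Dimensions, b: Dimensions): Dimensions =
    Dimensions(a.length + b.length, a.width + b.width, a.height + b.height)

  val data: List[(String, Dimensions)] = List(
    "Cube"   -> Dimensions(5, 5, 5),
    "Sphere" -> Dimensions(5, 10, 15),
    "Cube"   -> Dimensions(3, 3, 3)
  )

  // Group by shape name, then reduce each group's values with combine.
  val summed: Map[String, Dimensions] =
    data.groupBy(_._1).map { case (name, pairs) =>
      name -> pairs.map(_._2).reduce(combine)
    }
}
```

Here summed maps Cube to Dimensions(8,8,8) and Sphere to Dimensions(5,10,15), which matches the expected result in the question.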

Related

Scala - how to make the SortedSet with custom ordering hold multiple different objects that have the same value by which we sort?

As mentioned in the title, I have a SortedSet with a custom ordering. The set holds objects of class Edge (representing an edge in a graph). Each Edge has a cost associated with it as well as its start and end points.
case class Edge(firstId : Int, secondId : Int, cost : Int) {}
My ordering for SortedSet of edges looks like this (it's for the A* algorithm) :
object Ord {
val edgeCostOrdering: Ordering[Edge] = Ordering.by { edge : Edge =>
if (edge.secondId == goalId) graphRepresentation.calculateStraightLineCost(edge.firstId, goalId) else edge.cost + graphRepresentation.calculateStraightLineCost(edge.secondId, goalId)
}
}
However, after I apply this ordering to the set and add edges that have different start/end points but the same cost, only the last edge encountered is retained in the set.
For example :
val testSet : SortedSet[Edge] = SortedSet[Edge]()(edgeOrder)
val testSet2 = testSet + Edge(1,4,2)
val testSet3 = testSet2 + Edge(3,2,2)
println(testSet3)
Only prints (3,2,2)
Aren't these distinct objects? They only share the same value for one field so shouldn't the Set be able to handle this?
Consider using a mutable.PriorityQueue instead; it can keep multiple elements that compare as equal under the ordering. Here is a simpler example where we order pairs by the second component:
import collection.mutable.PriorityQueue
implicit val twoOrd = math.Ordering.by{ (t: (Int, Int)) => t._2 }
val p = new PriorityQueue[(Int, Int)]()(twoOrd)
p += ((1, 2))
p += ((42, 2))
Even though both pairs are mapped to 2, and therefore have the same priority, the queue does not lose any elements:
p foreach println
(1,2)
(42,2)
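A SortedSet treats two elements as duplicates whenever its ordering compares them as equal, so edges with the same cost collapse into one. To retain all the distinct Edges with the same ordering cost value in the SortedSet, you can modify your Ordering.by's function to return a Tuple that includes the edge Ids as well: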
val edgeCostOrdering: Ordering[Edge] = Ordering.by { edge: Edge =>
val cost = if (edge.secondId == goalId) ... else ...
(cost, edge.firstId, edge.secondId)
}
A quick proof of concept below:
import scala.collection.immutable.SortedSet
case class Foo(a: Int, b: Int)
val fooOrdering: Ordering[Foo] = Ordering.by(_.b)
val ss = SortedSet(Foo(2, 2), Foo(2, 1), Foo(1, 2))(fooOrdering)
// ss: scala.collection.immutable.SortedSet[Foo] = TreeSet(Foo(2,1), Foo(1,2))
// Breaking ties on `a` means distinct Foos never compare as equal, so all three are kept.
val fooOrdering2: Ordering[Foo] = Ordering.by(foo => (foo.b, foo.a))
val ss2 = SortedSet(Foo(2, 2), Foo(2, 1), Foo(1, 2))(fooOrdering2)
// ss2: scala.collection.immutable.SortedSet[Foo] = TreeSet(Foo(2,1), Foo(1,2), Foo(2,2))
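The same tie-breaking idea can be sketched for the Edge class from the question. The priority function below is a hypothetical stand-in for the A* cost (it just returns edge.cost, since graphRepresentation and goalId are not shown), and the object name is made up.

```scala
import scala.collection.immutable.SortedSet

object EdgeOrderingSketch {
  case class Edge(firstId: Int, secondId: Int, cost: Int)

  // Hypothetical stand-in for the question's A* priority; here it is only the edge cost.
  def priority(edge: Edge): Int = edge.cost

  // Ties on priority are broken by the endpoint ids, so distinct edges never compare as equal.
  val edgeOrdering: Ordering[Edge] =
    Ordering.by((edge: Edge) => (priority(edge), edge.firstId, edge.secondId))

  // Both edges have cost 2, as in the question's test case.
  val edges: SortedSet[Edge] =
    SortedSet.empty[Edge](edgeOrdering) + Edge(1, 4, 2) + Edge(3, 2, 2)
}
```

With this ordering the set keeps both edges, iterating Edge(1,4,2) before Edge(3,2,2).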

How to sort a list of scala objects by sort order of other list?

I have the following two lists in Scala.
case class Parents(name: String, savings: Double)
case class Children(parentName: String, debt: Double)
val parentList:List[Parents] = List(Parents("Halls",1007D), Parents("Atticus",8000D), Parents("Aurilius",900D))
val childrenList:List[Children] = List(Children("Halls",9379.40D), Children("Atticus",9.48D), Children("Aurilius",1100.75D))
val sortedParentList:List[Parents] = parentList.sortBy(_.savings).reverse
// sortedParentList = List(Parents(Atticus,8000.0), Parents(Halls,1007.0), Parents(Aurilius,900.0))
Now my parentList is sorted by savings in decreasing order. I want my childrenList to be sorted so that it follows the parentList order,
i.e. the expected order is the following:
// sortedChildrenList = List(Children(Atticus,9.48D), Children(Halls,9379.40D), Children(Aurilius,1100.75D))
Well, if you know both lists are initially in the same order (you can always ensure that by sorting both by name), you can just sort them both in one go:
val (sortedParentList, sortedChildrenList) = (parentList zip childrenList)
  .sortBy(-_._1.savings)
  .unzip
Or you can define the ordering ahead of time, and use it to sort both lists:
val order = parentList.map(p => p.name -> -p.savings).toMap
val sortedParentList = parentList.sortBy(p => order(p.name))
val sortedChildrenList = childrenList.sortBy(c => order(c.parentName))
Or you can sort parents first (maybe, they are already sorted), and then define the order:
val order = sortedParentList.zipWithIndex.map { case(p, idx) => p.name -> idx }.toMap
val sortedChildrenList = childrenList.sortBy(c => order(c.parentName))
case class Parents(name: String, savings: Double)
case class Children(parentName: String, debt: Double)
val familiesList: List[(Parents, Children)] = List(
Parents("Halls",1007D) -> Children("Halls",9379.40D),
Parents("Atticus",8000D) -> Children("Atticus",9.48D),
Parents("Aurilius",900D) -> Children("Aurilius",1100.75D))
val (sortedParents, sortedChildren) = familiesList.sortBy {
case (parents, _) => -parents.savings
}.unzip
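The index-based approach can be put together as a runnable sketch with the question's data. The getOrElse fallback for children whose parent is missing is an assumption, since the question does not cover that case, and the object name is made up.

```scala
object SortByOtherListSketch {
  case class Parents(name: String, savings: Double)
  case class Children(parentName: String, debt: Double)

  val parentList = List(Parents("Halls", 1007d), Parents("Atticus", 8000d), Parents("Aurilius", 900d))
  val childrenList = List(Children("Halls", 9379.40), Children("Atticus", 9.48), Children("Aurilius", 1100.75))

  // Sort parents by savings, largest first.
  val sortedParentList: List[Parents] = parentList.sortBy(p => -p.savings)

  // Position of each parent name in the sorted list.
  val order: Map[String, Int] = sortedParentList.map(_.name).zipWithIndex.toMap

  // Children with an unknown parent sort last (an assumption, not part of the question).
  val sortedChildrenList: List[Children] =
    childrenList.sortBy(c => order.getOrElse(c.parentName, Int.MaxValue))
}
```

Unlike the zip-based variants, this does not require the two lists to start out in the same order or to have the same length.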

In Scala, is there a way to map over a collection while passing a value along fold-style?

In Scala, is there a way to map over a collection while passing a value along fold-style? Something like:
case class TxRecord(name: String, amount: Int)
case class TxSummary(name: String, amount: Int, balance: Int)
val txRecords: Seq[TxRecord] = txRecordService.getSortedTxRecordsOfUser("userId")
val txSummarys: Seq[TxSummary] = txRecords.foldMap(0)((sum, txRecord) =>
(sum + txRecord.amount, TxSummary(txRecord.name, txRecord.amount, sum + txRecord.amount)))
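One way to get this behaviour with the standard library is scanLeft, which folds like foldLeft but keeps every intermediate accumulator. This is a minimal sketch, not necessarily the only idiom; the sample records stand in for txRecordService, which is not shown, and the object name is made up.

```scala
object RunningBalanceSketch {
  case class TxRecord(name: String, amount: Int)
  case class TxSummary(name: String, amount: Int, balance: Int)

  // Hypothetical sample data standing in for txRecordService.getSortedTxRecordsOfUser.
  val txRecords: Seq[TxRecord] =
    Seq(TxRecord("deposit", 100), TxRecord("coffee", -5), TxRecord("rent", -60))

  // scanLeft emits the seed plus one accumulator per record; tail drops the seed.
  val txSummaries: Seq[TxSummary] =
    txRecords
      .scanLeft(TxSummary("", 0, 0)) { (prev, tx) =>
        TxSummary(tx.name, tx.amount, prev.balance + tx.amount)
      }
      .tail
}
```

The running balance is carried in the previous TxSummary, so no separate sum variable is needed; for the sample data the balances are 100, 95 and 35.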

Maximum enclosing circle algorithm with Spark

I'm new to Spark and Scala, and I'm trying to implement the maximum enclosing circle algorithm. As input I have a CSV file with id, x, y:
id x y
1 1 0
2 1 2
1 0 0
3 5 10
...
I need to find a max enclosing circle for each id.
I implemented a solution, but it is not optimal.
val data = csv
.filter(_ != header)
.map(_.split(","))
.map(col => (col(0), Location(col(2).toInt, col(3).toInt)))
val idLoc = data.groupByKey()
val ids = idLoc.keys.collect().toList.par
ids.foreach {
case id =>
val locations = data.filter(_._1 == id).values.cache()
val maxEnclCircle = findMaxEnclCircle(findEnclosedPoints(locations, eps))
}
def findMaxEnclCircle(centroids: RDD[(Location, Long)]): Location = {
centroids.max()(new Ordering[(Location, Long)]() {
override def compare(x: (Location, Long), y: (Location, Long)): Int =
Ordering[Long].compare(x._2, y._2)
})._1
}
def findEnclosedPoints(locations: RDD[Location], eps: Double): RDD[(Location, Long)] = {
locations.cartesian(locations).filter {
case (a, b) => isEnclosed(a, b, eps)
}.map { case (a, b) => (a, 1L) }.reduceByKey(_ + _)
}
As you can see, I keep a list of ids in memory. How can I improve the code to get rid of it?
Thanks.
The main problem with your code above is that you are not really using the cluster. When you call collect(), the collected data is sent to the single driver, and the loop over ids is then driven from there. For efficiency, use aggregateByKey() to shuffle all the points with the same id to the same executor, then do the computation there.
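The "group once by id, then compute per group" shape can be sketched on plain Scala collections. This is not Spark code: in Spark the grouping step would presumably be a by-key shuffle such as the aggregateByKey mentioned above. isEnclosed is a hypothetical stand-in (Euclidean distance at most eps), because the question does not show it, and the sample points are made up.

```scala
object PerIdCircleSketch {
  case class Location(x: Int, y: Int)

  // Hypothetical stand-in for the question's isEnclosed: b lies within distance eps of a.
  def isEnclosed(a: Location, b: Location, eps: Double): Boolean =
    math.hypot((a.x - b.x).toDouble, (a.y - b.y).toDouble) <= eps

  // For one id: the location with the most points (itself included) within eps.
  def bestCentre(locations: Seq[Location], eps: Double): Location =
    locations.maxBy(a => locations.count(b => isEnclosed(a, b, eps)))

  // Made-up sample data in the (id, location) shape used by the question.
  val data: Seq[(String, Location)] = Seq(
    "1" -> Location(1, 0),
    "2" -> Location(1, 2),
    "1" -> Location(0, 0),
    "1" -> Location(0, 1),
    "3" -> Location(5, 10)
  )

  // Group once by id, then compute per group; no list of ids is collected up front.
  def centres(eps: Double): Map[String, Location] =
    data.groupBy(_._1).map { case (id, pairs) => id -> bestCentre(pairs.map(_._2), eps) }
}
```

The per-group function works on a local collection, so the pairwise comparison replaces the cartesian call on an RDD and stays inside one group.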

Scala Ordinal Method Call Aliasing

In Spark SQL we have Row objects, which contain a list of fields that make up a row (think Seq[Any]). A Row has ordinal accessors such as .getInt(0) or .getString(2).
Say ordinal 0 = ID and ordinal 1 = Name. It becomes hard to remember what ordinal is what, making the code confusing.
Say for example I have the following code
def doStuff(row: Row) = {
//extract some items from the row into a tuple;
(row.getInt(0), row.getString(1)) //tuple of ID, Name
}
The question becomes how could I create aliases for these fields in a Row object?
I was thinking I could create methods which take an implicit Row object;
def id(implicit row: Row) = row.getInt(0)
def name(implicit row: Row) = row.getString(1)
I could then rewrite the above as;
def doStuff(implicit row: Row) = {
//extract some items from the row into a tuple;
(id, name) //tuple of ID, Name
}
Is there a better/neater approach?
You could implicitly add those accessor methods to row:
implicit class AppRow(val r: Row) extends AnyVal {
  def id: Int = r.getInt(0) // getInt returns an Int, not a String
  def name: String = r.getString(1)
}
Then use it as:
def doStuff(row: Row) = {
val value = (row.id, row.name)
}
Another option is to convert Row into a domain-specific case class, which IMHO leads to more readable code:
case class Employee(id: Int, name: String)
val yourRDD: SchemaRDD = ???
val employees: RDD[Employee] = yourRDD.map { row =>
Employee(row.getInt(0), row.getString(1))
}
def doStuff(e: Employee) = {
(e.name, e.id)
}