Scala: reduce a complex structure

I have the following case classes:
case class AdsWeight(ads: Seq[LimitedAd], finalWeight: Double)
case class LimitedAd(
  id: Long,
  tid: Long,
  mt: String,
  oe: String,
  bid: Double,
  score: Double,
  ts: Double
)
Given val records: Seq[AdsWeight], how can I replace the score of every LimitedAd with score * finalWeight and then concatenate all the LimitedAd instances into a single output sequence? For example, given
val ad1 = LimitedAd(1, 100, "mt1", "ot1", 0.1, 0.01, 0.001)
val ad2 = LimitedAd(2, 200, "mt2", "ot2", 0.2, 0.02, 0.002)
val ad3 = LimitedAd(3, 300, "mt3", "ot4", 0.3, 0.03, 0.003)
val ad4 = LimitedAd(4, 400, "mt4", "ot4", 0.4, 0.04, 0.004)
val ads1 = AdsWeight(Seq(ad1, ad2), 0.9)
val ads2 = AdsWeight(Seq(ad3, ad4), 0.8)
val records: Seq[AdsWeight] = Seq(ads1, ads2)
and get the output
(1, 100, "mt1", "ot1", 0.1, 0.009, 0.001), (2, 200, "mt2", "ot2", 0.2, 0.018, 0.002)
(3, 300, "mt3", "ot4", 0.3, 0.024, 0.003), (4, 400, "mt4", "ot4", 0.4, 0.032, 0.004)

scala> val res = records.flatMap(r => r.ads.map(ad => ad.copy(score = ad.score * r.finalWeight)))
val res: Seq[LimitedAd] = List(LimitedAd(1,100,mt1,ot1,0.1,0.009000000000000001,0.001), LimitedAd(2,200,mt2,ot2,0.2,0.018000000000000002,0.002), LimitedAd(3,300,mt3,ot4,0.3,0.024,0.003), LimitedAd(4,400,mt4,ot4,0.4,0.032,0.004))
scala> res.foreach(println)
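The same transformation can also be written as a for-comprehension, which some readers find easier to follow than a nested flatMap/map. This is only a minimal sketch that assumes the case classes from the question; the wrapper object WeightedAds and the method name rescale are made up for illustration.

```scala
object WeightedAds {
  case class LimitedAd(id: Long, tid: Long, mt: String, oe: String,
                       bid: Double, score: Double, ts: Double)
  case class AdsWeight(ads: Seq[LimitedAd], finalWeight: Double)

  /** Scales every ad's score by the weight of its group and flattens all groups. */
  def rescale(records: Seq[AdsWeight]): Seq[LimitedAd] =
    for {
      record <- records     // outer loop: one AdsWeight at a time
      ad     <- record.ads  // inner loop: every ad in that group
    } yield ad.copy(score = ad.score * record.finalWeight)
}
```

Multiplying Doubles can leave binary rounding noise, as the 0.009000000000000001 in the REPL output shows, so it is safer to compare the scores with a tolerance than with ==.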


Aggregating sum for RDD in Scala (Spark)

If I have a variable such as books: RDD[(String, Integer, Integer)], how do I merge the entries that share the same String (which could represent a title) and sum the two corresponding integers (which could represent pages and price)? For example, the input
[("book1", 20, 10),
("book2", 5, 10),
("book1", 100, 100)]
should become
[("book1", 120, 110),
("book2", 5, 10)]
With an RDD you can use reduceByKey.
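case class Book(name: String, i: Int, j: Int) {
  // Merge two records for the same title by summing both counters
  def +(b: Book): Book =
    if (name == b.name) Book(name, i + b.i, j + b.j)
    else throw new IllegalArgumentException(s"Cannot merge $name with ${b.name}")
}
val rdd = sc.parallelize(Seq(
  Book("book1", 20, 10),
  Book("book2", 5, 10),
  Book("book1", 100, 100)))
val aggRdd = rdd.map(book => (book.name, book)) // key each Book by its name
  .reduceByKey(_+_) // reduce calling our defined `+` function
  .map(_._2) // we don't need the tuple anymore, just get the Books
aggRdd.foreach(println)
// Book(book1,120,110)
// Book(book2,5,10)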
Just use Dataset:
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder.getOrCreate()
val rdd = spark.sparkContext.parallelize(Seq(
  ("book1", 20, 10), ("book2", 5, 10), ("book1", 100, 100)
))
// group by the title column and sum every numeric column
spark.createDataFrame(rdd).groupBy("_1").sum().show()
// +-----+-------+-------+
// |   _1|sum(_2)|sum(_3)|
// +-----+-------+-------+
// |book1|    120|    110|
// |book2|      5|     10|
// +-----+-------+-------+
Try converting it first to a key-tuple RDD and then performing a reduceByKey:
books.map(t => (t._1, (t._2, t._3)))
  .reduceByKey((acc, elem) => (acc._1 + elem._1, acc._2 + elem._2))
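To see what the map plus reduceByKey pipeline computes without starting a cluster, the same logic can be sketched on a plain Scala collection. This is not Spark code: groupBy followed by reduce is a local stand-in, and the names SumByKey and sumByTitle are made up for illustration.

```scala
object SumByKey {
  /** Local stand-in for map + reduceByKey: groups triples by title and sums both counters. */
  def sumByTitle(books: Seq[(String, Int, Int)]): Map[String, (Int, Int)] =
    books
      .groupBy(_._1) // title -> all rows with that title
      .map { case (title, rows) =>
        title -> rows
          .map(r => (r._2, r._3))                           // keep only the two counters
          .reduce((a, b) => (a._1 + b._1, a._2 + b._2))     // element-wise sum
      }
}
```

On a real RDD, reduceByKey combines values within each partition before shuffling, so the function passed to it has to be associative and commutative, as the element-wise sum here is.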

Spark dataframe to nested map

How can I convert a rather small Spark DataFrame (at most 300 MB) to a nested map in order to improve Spark's DAG? I believe this operation will be quicker than a join later on (a dynamically built Spark DAG is a lot slower than, and different from, a hard-coded DAG), because the transformed values were created during the train step of a custom estimator. Now I just want to apply them quickly during the predict step of the pipeline.
val inputSmall = Seq(
("A", 0.3, "B", 0.25),
("A", 0.3, "g", 0.4),
("d", 0.0, "f", 0.1),
("d", 0.0, "d", 0.7),
("A", 0.3, "d", 0.7),
("d", 0.0, "g", 0.4),
("c", 0.2, "B", 0.25)).toDF("column1", "transformedCol1", "column2", "transformedCol2")
This gives the wrong type of map
val inputToMap = inputSmall.collect.map(r => Map(inputSmall.columns.zip(r.toSeq): _*))
I would rather want something like:
Map[String, Map[String, Double]]("column1" -> Map("A" -> 0.3, "d" -> 0.0, ...), "column2" -> Map("B" -> 0.25, "g" -> 0.4, ...))
Edit: removed collect operation from final map
If you are using Spark 2+, here's a suggestion:
import org.apache.spark.sql.functions.map
val inputToMap = inputSmall.select(
  map($"column1", $"transformedCol1").as("column1"),
  map($"column2", $"transformedCol2").as("column2")
)
val cols = inputToMap.columns
val localData = inputToMap.collect
cols.map { colName =>
  colName -> localData.flatMap(_.getAs[Map[String, Double]](colName)).toMap
}.toMap
I'm not sure I follow the motivation, but I think this is the transformation that would get you the result you're after:
// collect from DF (by your assumption - it is small enough)
val data: Array[Row] = inputSmall.collect()
// Create the "column pairs" -
// can be replaced with hard-coded value: List(("column1", "transformedCol1"), ("column2", "transformedCol2"))
val columnPairs: List[(String, String)] = inputSmall.columns
  .grouped(2)
  .collect { case Array(k, v) => (k, v) }
  .toList
// for each pair, get data and group it by left-column's value, choosing first match
val result: Map[String, Map[String, Double]] = columnPairs
  .map { case (k, v) => k -> data.map(r => (r.getAs[String](k), r.getAs[Double](v))) }
  .toMap
  .mapValues(l => l.groupBy(_._1).map { case (c, l2) => l2.head })
result.foreach(println)
// prints:
// (column1,Map(A -> 0.3, d -> 0.0, c -> 0.2))
// (column2,Map(d -> 0.7, g -> 0.4, f -> 0.1, B -> 0.25))
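Once the rows have been collected to the driver, the remaining work is ordinary collection code. The sketch below shows that step on plain tuples instead of Spark Row objects, so it runs without Spark; the names NestedLookup, Row4, toNestedMap and lookup are hypothetical, and the column names are hard-coded to match the question.

```scala
object NestedLookup {
  // One collected row: (column1, transformedCol1, column2, transformedCol2)
  type Row4 = (String, Double, String, Double)

  /** Builds column name -> (original value -> transformed value). */
  def toNestedMap(rows: Seq[Row4]): Map[String, Map[String, Double]] = Map(
    "column1" -> rows.map(r => r._1 -> r._2).toMap, // toMap keeps the last duplicate key
    "column2" -> rows.map(r => r._3 -> r._4).toMap
  )

  /** Safe two-level lookup; None if the column or the key is unknown. */
  def lookup(nested: Map[String, Map[String, Double]], column: String, key: String): Option[Double] =
    nested.get(column).flatMap(_.get(key))
}
```

Because toMap keeps the last value for a repeated key, this only gives a well-defined result when every original value maps to a single transformed value, as in the sample data.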

How to evaluate binary key-value?

I am writing an external merge sort in Scala for large binary input files.
I generate the input with gensort and validate the output with valsort from this website:
I read 100 bytes at a time: the first 10 bytes are the Key (List[Byte]) and the remaining 90 bytes are the Value (List[Byte]).
After sorting, valsort reports that my output is wrong.
With ASCII input, however, my output is correct.
So how do I sort binary input correctly?
valsort says that my first unordered record is 56. Here is what I printed out:
50 --> Key(List(-128, -16, 5, -10, -83, 23, -107, -109, 42, -11))
51 --> Key(List(-128, -16, 5, -10, -83, 23, -107, -109, 42, -11))
52 --> Key(List(-128, -10, -10, 68, -94, 37, -103, 30, 90, 16))
53 --> Key(List(-128, -10, -10, 68, -94, 37, -103, 30, 90, 16))
54 --> Key(List(-128, -10, -10, 68, -94, 37, -103, 30, 90, 16))
55 --> Key(List(-128, -10, -10, 68, -94, 37, -103, 30, 90, 16))
56 --> Key(List(-128, 0, -27, -4, -82, -82, 121, -125, -22, 99))
57 --> Key(List(-128, 0, -27, -4, -82, -82, 121, -125, -22, 99))
58 --> Key(List(-128, 0, -27, -4, -82, -82, 121, -125, -22, 99))
59 --> Key(List(-128, 0, -27, -4, -82, -82, 121, -125, -22, 99))
60 --> Key(List(-128, 7, -65, 118, 121, -12, 48, 50, 59, -8))
61 --> Key(List(-128, 7, -65, 118, 121, -12, 48, 50, 59, -8))
62 --> Key(List(-128, 7, -65, 118, 121, -12, 48, 50, 59, -8))
This is my external sorting code:
package externalsorting
import java.io.{BufferedOutputStream, File, FileOutputStream}
import java.nio.channels.FileChannel
import java.util.Calendar
import scala.collection.mutable
import readInput._
import scala.collection.mutable.ListBuffer
/** Created by hminle on 12/5/2016. */
object ExternalSortingExample extends App{
val dir: String = "C:\\ShareUbuntu\\testMerge"
val listFile: List[File] = Utils.getListOfFiles(dir)
listFile foreach(x => println(x.getName))
var fileChannelsInput: List[(FileChannel, Boolean)] = listFile.map{input => (Utils.getFileChannelFromInput(input), false)}
val tempDir: String = dir + "/tmp/"
val tempDirFile: File = new File(tempDir)
val isSuccessful: Boolean = tempDirFile.mkdir()
if(isSuccessful) println("Create temp dir successfully")
else println("Create temp dir failed")
var fileNameCounter: Int = 0
val chunkSize = 100000
// Split big input files into small chunks
if(Utils.estimateAvailableMemory() > 400000){
val fileChannel = fileChannelsInput(0)._1
val (chunks, isEndOfFileChannel) = Utils.getChunkKeyAndValueBySize(chunkSize, fileChannel)
fileChannelsInput = fileChannelsInput.drop(1)
} else {
val sortedChunk: List[(Key, Value)] = Utils.getSortedChunk(chunks)
val fileName: String = tempDir + "partition-" + fileNameCounter
Utils.writePartition(fileName, sortedChunk)
fileNameCounter += 1
} else {
println(Thread.currentThread().getName +"There is not enough available free memory to continue processing" + Utils.estimateAvailableMemory())
val listTempFile: List[File] = Utils.getListOfFiles(tempDir)
val start = Calendar.getInstance().getTime
val tempFileChannels: List[FileChannel] = listTempFile.map(Utils.getFileChannelFromInput)
val binaryFileBuffers: List[BinaryFileBuffer] = tempFileChannels.map(BinaryFileBuffer(_))
binaryFileBuffers foreach(x => println(x.toString))
val pq1: ListBuffer[BinaryFileBuffer] = ListBuffer.empty
val outputDir: String = dir + "/mergedOutput"
val bos = new BufferedOutputStream(new FileOutputStream(outputDir))
// Start merging temporary files
while(pq1.length > 0){
val pq2 = pq1.toList.sortWith(_.head()._1 < _.head()._1)
val buffer: BinaryFileBuffer = pq2.head
val keyVal: (Key, Value) = buffer.pop()
val byteArray: Array[Byte] = Utils.flattenKeyValue(keyVal).toArray[Byte]
pq1 -= buffer
This is BinaryFileBuffer.scala --> which is just a wrapper
package externalsorting
import java.nio.channels.FileChannel
import readInput._
/** Created by hminle on 12/5/2016. */
object BinaryFileBuffer{
def apply(fileChannel: FileChannel): BinaryFileBuffer = {
val buffer: BinaryFileBuffer = new BinaryFileBuffer(fileChannel)
class BinaryFileBuffer(fileChannel: FileChannel) extends Ordered[BinaryFileBuffer] {
private var cache: Option[(Key, Value)] = _
def isEmpty(): Boolean = cache == None
def head(): (Key, Value) = cache.get
def pop(): (Key, Value) = {
val answer = head()
def reload(): Unit = {
this.cache = Utils.get100BytesKeyAndValue(fileChannel)
def close(): Unit = fileChannel.close()
def compare(that: BinaryFileBuffer): Int = {
This is my Utils.scala:
package externalsorting
import java.io.{BufferedOutputStream, File, FileOutputStream}
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.Paths
import readInput._
import scala.annotation.tailrec
import scala.collection.mutable.ListBuffer
/** Created by hminle on 12/5/2016. */
object Utils {
def getListOfFiles(dir: String): List[File] = {
val d = new File(dir)
if(d.exists() && d.isDirectory){
} else List[File]()
def get100BytesKeyAndValue(fileChannel: FileChannel): Option[(Key, Value)] = {
val size = 100
val buffer = ByteBuffer.allocate(size)
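val numOfByteRead = fileChannel.read(buffer)
buffer.flip() // switch the buffer from writing to reading before calling get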
if(numOfByteRead != -1){
val data: Array[Byte] = new Array[Byte](numOfByteRead)
buffer.get(data, 0, numOfByteRead)
val (key, value) = data.splitAt(10)
Some(Key(key.toList), Value(value.toList))
} else {
def getFileChannelFromInput(file: File): FileChannel = {
val fileChannel: FileChannel =
def estimateAvailableMemory(): Long = {
val runtime: Runtime = Runtime.getRuntime
val allocatedMemory: Long = runtime.totalMemory() - runtime.freeMemory()
val presFreeMemory: Long = runtime.maxMemory() - allocatedMemory
def writePartition(dir: String, keyValue: List[(Key, Value)]): Unit = {
val byteArray: Array[Byte] = flattenKeyValueList(keyValue).toArray[Byte]
val bos = new BufferedOutputStream(new FileOutputStream(dir))
def flattenKeyValueList(keyValue: List[(Key,Value)]): List[Byte] = {
keyValue flatten {
case (Key(keys), Value(values)) => keys:::values
def flattenKeyValue(keyVal: (Key, Value)): List[Byte] = {
def getChunkKeyAndValueBySize(size: Int, fileChannel: FileChannel): (List[(Key, Value)], Boolean) = {
val oneKeyValueSize = 100
val countMax = size / oneKeyValueSize
var isEndOfFileChannel: Boolean = false
var count = 0
val chunks: ListBuffer[(Key, Value)] = ListBuffer.empty
val keyValue = get100BytesKeyAndValue(fileChannel)
if(keyValue.isDefined) chunks.append(keyValue.get)
isEndOfFileChannel = !keyValue.isDefined
count += 1
}while(!isEndOfFileChannel && count < countMax)
(chunks.toList, isEndOfFileChannel)
def getSortedChunk(oneChunk: List[(Key, Value)]): List[(Key, Value)] = {
oneChunk.sortWith((_._1 < _._1))
How I define Key and Value:
case class Key(keys: List[Byte]) extends Ordered[Key] {
  def isEmpty(): Boolean = keys.isEmpty
  def compare(that: Key): Int = {
    compare_aux(this.keys, that.keys)
  }
  private def compare_aux(keys1: List[Byte], keys2: List[Byte]): Int = {
    (keys1, keys2) match {
      case (Nil, Nil) => 0
      case (list, Nil) => 1
      case (Nil, list) => -1
      case (hd1::tl1, hd2::tl2) => {
        if(hd1 > hd2) 1
        else if(hd1 < hd2) -1
        else compare_aux(tl1, tl2)
      }
    }
  }
}
case class Value(values: List[Byte])
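I've found the answer: binary and ASCII input have to be compared differently. The benchmark's documentation says:
In what order should the sorted file be?
For binary records (GraySort or MinuteSort), the 10-byte keys should be ordered as arrays of unsigned bytes. The memcmp() library routine can be used for this purpose.
A Scala Byte is signed (-128 to 127), so any key byte of 0x80 or above compares as negative and sorts too early. ASCII keys only contain bytes below 0x80, which is why the ASCII input sorted correctly. To sort binary input, I need to convert the signed bytes to unsigned values before comparing them.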
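One way to do that conversion is to mask each byte with 0xFF before comparing, which maps -128..127 onto 0..255. The sketch below is a possible replacement for the comparison logic, not the asker's final code; the object name UnsignedKeyOrdering and the method name compareUnsigned are made up for illustration.

```scala
import scala.annotation.tailrec

object UnsignedKeyOrdering {
  /** Lexicographic comparison that treats every byte as unsigned (0..255), like memcmp. */
  @tailrec
  def compareUnsigned(a: List[Byte], b: List[Byte]): Int = (a, b) match {
    case (Nil, Nil) => 0
    case (_, Nil)   => 1  // the longer key sorts after its own prefix
    case (Nil, _)   => -1
    case (h1 :: t1, h2 :: t2) =>
      // `& 0xFF` widens the byte to an Int and drops the sign extension
      val c = Integer.compare(h1 & 0xFF, h2 & 0xFF)
      if (c != 0) c else compareUnsigned(t1, t2)
  }

  /** The same comparison packaged as an Ordering, e.g. for `sorted`. */
  val ordering: Ordering[List[Byte]] = new Ordering[List[Byte]] {
    def compare(a: List[Byte], b: List[Byte]): Int = compareUnsigned(a, b)
  }
}
```

With the printed keys, records 55 and 56 differ in their second byte: -10 is 246 when read as unsigned, which is larger than 0, so record 56 has to come before record 55. That matches the position valsort reported. Inside the Key class, compare could delegate to a function like this instead of comparing the signed values directly.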

Scala groupBy of a tuple to calculate stock basis

I am working on an exercise to calculate stock basis given a list of stock purchases in the form of triples (ticker, qty, stock_price). I've got it working, but I would like to do the calculation part in a more functional way. Does anyone have a suggestion?
// input:
// List(("TSLA", 20, 200),
// ("TSLA", 20, 100),
// ("FB", 10, 100))
// output:
// List(("FB", (10, 100)),
// ("TSLA", (40, 150)))
def generateBasis(trades: Iterable[(String, Int, Int)]) = {
  val basises = trades groupBy(_._1) map {
    case (key, pairs) =>
      val quantity = pairs.map(_._2).toList
      val price = pairs.map(_._3).toList
      var totalPrice: Int = 0
      for (i <- quantity.indices) {
        totalPrice += quantity(i) * price(i)
      }
      key -> (quantity.sum, totalPrice / quantity.sum)
  }
  basises
}
This looks like this might work for you. (updated)
def generateBasis(trades: Iterable[(String, Int, Int)]) =
trades.groupBy(_._1).mapValues {
_.foldLeft((0,0)){case ((tq,tp),(_,q,p)) => (tq + q, tp + q * p)}
}.map{case (k, (q,p)) => (k,q,p/q)} // turn Map into tuples (triples)
I came up with the solution below. Thanks everyone for their input. I'd love to hear if anyone had a more elegant solution.
// input:
// List(("TSLA", 20, 200),
// ("TSLA", 10, 100),
// ("FB", 5, 50))
// output:
// List(("FB", (5, 50)),
// ("TSLA", (30, 166)))
def generateBasis(trades: Iterable[(String, Int, Int)]) = {
  val groupedTrades = (trades groupBy(_._1)) map {
    case (key, pairs) =>
      key -> (pairs.map(e => (e._2, e._3)))
  } // List((FB,List((5,50))), (TSLA,List((20,200), (10,100))))
  val costBasises = for {groupedTrade <- groupedTrades
    tradeCost = for {tup <- groupedTrade._2 // (qty, cost)
    } yield tup._1 * tup._2 // (trade_qty * trade_cost)
    tradeQuantity = for { tup <- groupedTrade._2
    } yield tup._1 // trade_qty
  } yield (groupedTrade._1, tradeQuantity.sum, tradeCost.sum / tradeQuantity.sum )
  costBasises.toList // List(("FB", (5, 50)),("TSLA", (30, 166)))
}
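If Scala 2.13 or later is available (an assumption, since the thread does not state a version), groupMapReduce does the grouping, projection and summing in a single pass. A minimal sketch; the wrapper object CostBasis is made up for illustration, and the result is sorted by ticker only to make the output order predictable.

```scala
object CostBasis {
  /** Returns (ticker, total quantity, quantity-weighted average price) per ticker. */
  def generateBasis(trades: Iterable[(String, Int, Int)]): List[(String, Int, Int)] =
    trades
      // key by ticker, project each trade to (qty, qty * price), then sum both parts
      .groupMapReduce(_._1) { case (_, qty, price) => (qty, qty * price) } {
        case ((q1, c1), (q2, c2)) => (q1 + q2, c1 + c2)
      }
      .map { case (ticker, (qty, cost)) => (ticker, qty, cost / qty) } // integer division
      .toList
      .sortBy(_._1)
}
```

As in the versions above, the average uses integer division, so 5000 / 30 is truncated to 166, and a ticker whose quantities sum to zero would throw an ArithmeticException.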

Find an instance by field value comparison in a Seq

What is the best practice for getting an instance out of a Seq?
case class Point(x: Int, y: Int)
val points: Seq[Point] = Seq(Point(1, 10), Point(2, 20), Point(3, 30))
I'd like to get the Point with the maximum y (in this case, Point(3, 30)).
What is the best way to do that?
The easiest way would be to use TraversableOnce.maxBy:
val points: Seq[Point] = Seq(Point(1, 10), Point(2, 20), Point(3, 30))
scala> points.maxBy(_.y)
res1: Point = Point(3,30)
@YuvalItzchakov's answer is correct, but here is another way to do it using Ordering:
val points: Seq[Point] = Seq(Point(1, 10), Point(2, 20), Point(3, 30))
// points: Seq[Point] = List(Point(1,10), Point(2,20), Point(3,30))
val order = Ordering.by((_: Point).y)
// order: scala.math.Ordering[Point] = scala.math.Ordering$$anon$9@5a2fa51f
val max_point = points.reduce(order.max)
// max_point: Point = Point(3,30)
points.max(order)
// Point = Point(3,30)
or with implicit Ordering:
implicit val pointOrdering = Ordering.by((_: Point).y)
points.max
// Point = Point(3,30)
Note: TraversableOnce.maxBy also uses an implicit Ordering.
Another way of doing it is by using foldLeft.
val points:Seq[Point] = Seq(Point(1,10),Point(2,20),Point(3,30))
points.foldLeft[Point](Point(0,0)){(z,f) =>if (f.y>z.y) f else z}
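Two edge cases are worth keeping in mind with the approaches above: maxBy and reduce throw on an empty Seq, and the foldLeft version returns its seed Point(0,0) when every y is negative, even though that point is not in the sequence. A minimal sketch that avoids both by returning an Option; the names MaxPoint and maxByY are made up for illustration.

```scala
object MaxPoint {
  case class Point(x: Int, y: Int)

  /** Returns the point with the largest y, or None for an empty sequence. */
  def maxByY(points: Seq[Point]): Option[Point] =
    // reduceOption needs no artificial seed, so only real elements can win
    points.reduceOption((best, p) => if (p.y > best.y) p else best)
}
```

On ties this keeps the first point with the largest y.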