Scala: reduce a complex structure

I have the following case classes
case class AdsWeight(ads: Seq[LimitedAd], finalWeight: Double)
case class LimitedAd(
  id: Long,
  tid: Long,
  mt: String,
  oe: String,
  bid: Double,
  score: Double,
  ts: Double
)
Now, given val records: Seq[AdsWeight], how can I replace the score in every LimitedAd with score * finalWeight, and then concatenate only the LimitedAds into the output?
E.g.,
val ad1 = LimitedAd(1, 100, "mt1", "ot1", 0.1, 0.01, 0.001)
val ad2 = LimitedAd(2, 200, "mt2", "ot2", 0.2, 0.02, 0.002)
val ad3 = LimitedAd(3, 300, "mt3", "ot3", 0.3, 0.03, 0.003)
val ad4 = LimitedAd(4, 400, "mt4", "ot4", 0.4, 0.04, 0.004)
val ads1 = AdsWeight(Seq(ad1, ad2), 0.9)
val ads2 = AdsWeight(Seq(ad3, ad4), 0.8)
val records: Seq[AdsWeight] = Seq(ads1, ads2)
and get the output
[
  (1, 100, "mt1", "ot1", 0.1, 0.009, 0.001),
  (2, 200, "mt2", "ot2", 0.2, 0.018, 0.002),
  (3, 300, "mt3", "ot3", 0.3, 0.024, 0.003),
  (4, 400, "mt4", "ot4", 0.4, 0.032, 0.004)
]

scala> val res = records.flatMap(r => r.ads.map(ad => ad.copy(score = ad.score * r.finalWeight)))
val res: Seq[LimitedAd] = List(LimitedAd(1,100,mt1,ot1,0.1,0.009000000000000001,0.001), LimitedAd(2,200,mt2,ot2,0.2,0.018000000000000002,0.002), LimitedAd(3,300,mt3,ot3,0.3,0.024,0.003), LimitedAd(4,400,mt4,ot4,0.4,0.032,0.004))
scala> res.foreach(println)
LimitedAd(1,100,mt1,ot1,0.1,0.009000000000000001,0.001)
LimitedAd(2,200,mt2,ot2,0.2,0.018000000000000002,0.002)
LimitedAd(3,300,mt3,ot3,0.3,0.024,0.003)
LimitedAd(4,400,mt4,ot4,0.4,0.032,0.004)
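Note that the long decimal tails in the REPL output (0.009000000000000001 and so on) are ordinary Double rounding artifacts, not a problem with the flatMap/copy approach. If the output needs clean decimals, one option is to round while copying; a minimal sketch, where the scale of 6 is an arbitrary choice:

val res = records.flatMap { r =>
  r.ads.map { ad =>
    // round the reweighted score to 6 decimal places to suppress Double noise
    val adjusted = BigDecimal(ad.score * r.finalWeight)
      .setScale(6, BigDecimal.RoundingMode.HALF_UP)
      .toDouble
    ad.copy(score = adjusted)
  }
}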

Related

Aggregating sum for RDD in Scala (Spark)

If I have a variable such as books: RDD[(String, Integer, Integer)], how do I merge tuples that share the same String (which could represent a title) and sum the corresponding two integers (which could represent pages and price)?
ex:
[("book1", 20, 10),
("book2", 5, 10),
("book1", 100, 100)]
becomes
[("book1", 120, 110),
("book2", 5, 10)]
With an RDD you can use reduceByKey.
case class Book(name: String, i: Int, j: Int) {
  def +(b: Book) =
    if (name == b.name) Book(name, i + b.i, j + b.j)
    else throw new IllegalArgumentException("Cannot add Books with different names")
}
val rdd = sc.parallelize(Seq(
  Book("book1", 20, 10),
  Book("book2", 5, 10),
  Book("book1", 100, 100)))

val aggRdd = rdd.map(book => (book.name, book))
  .reduceByKey(_ + _) // reduce calling our defined `+` function
  .map(_._2)          // we don't need the tuple anymore, just get the Books
aggRdd.foreach(println)
// Book(book1,120,110)
// Book(book2,5,10)
Or just use the Dataset/DataFrame API:
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder.getOrCreate()
val rdd = spark.sparkContext.parallelize(Seq(
  ("book1", 20, 10), ("book2", 5, 10), ("book1", 100, 100)
))
spark.createDataFrame(rdd).groupBy("_1").sum().show()
// +-----+-------+-------+
// |   _1|sum(_2)|sum(_3)|
// +-----+-------+-------+
// |book1|    120|    110|
// |book2|      5|     10|
// +-----+-------+-------+
Try converting it first to a key-tuple RDD and then performing a reduceByKey:
yourRDD.map(t => (t._1, (t._2, t._3)))
.reduceByKey((acc, elem) => (acc._1 + elem._1, acc._2 + elem._2))
Output:
(book2,(5,10))
(book1,(120,110))
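If you would rather stay typed end to end, the same aggregation can also be written with groupByKey/reduceGroups on a Dataset. A sketch assuming Spark 2+ (the field names pages and price are illustrative, and the Book case class has to be defined at top level so an Encoder can be derived):

import org.apache.spark.sql.{Dataset, SparkSession}

case class Book(name: String, pages: Int, price: Int)

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val books: Dataset[Book] = Seq(
  Book("book1", 20, 10), Book("book2", 5, 10), Book("book1", 100, 100)
).toDS()

val summed: Dataset[Book] = books
  .groupByKey(_.name)                                                   // key by title
  .reduceGroups((a, b) => Book(a.name, a.pages + b.pages, a.price + b.price))
  .map(_._2)                                                            // drop the key, keep the summed Book

summed.show()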

Spark dataframe to nested map

How can I convert a rather small DataFrame in Spark (max 300 MB) into a nested map, in order to improve Spark's DAG? I believe this operation will be quicker than a join later on (Spark dynamic DAG is a lot slower and different from hard coded DAG), since the transformed values were created during the train step of a custom estimator. Now I just want to apply them really quickly during the predict step of the pipeline.
val inputSmall = Seq(
  ("A", 0.3, "B", 0.25),
  ("A", 0.3, "g", 0.4),
  ("d", 0.0, "f", 0.1),
  ("d", 0.0, "d", 0.7),
  ("A", 0.3, "d", 0.7),
  ("d", 0.0, "g", 0.4),
  ("c", 0.2, "B", 0.25)
).toDF("column1", "transformedCol1", "column2", "transformedCol2")
This gives the wrong type of map
val inputToMap = inputSmall.collect.map(r => Map(inputSmall.columns.zip(r.toSeq):_*))
I would rather want something like:
Map[String, Map[String, Double]]("column1" -> Map("A" -> 0.3, "d" -> 0.0, ...), "column2" -> Map("B" -> 0.25, "g" -> 0.4, ...))
Edit: removed collect operation from final map
If you are using Spark 2+, here's a suggestion:
import org.apache.spark.sql.functions.map
import spark.implicits._ // for the $"..." column syntax

val inputToMap = inputSmall.select(
  map($"column1", $"transformedCol1").as("column1"),
  map($"column2", $"transformedCol2").as("column2")
)

val cols = inputToMap.columns
val localData = inputToMap.collect

cols.map { colName =>
  colName -> localData.flatMap(_.getAs[Map[String, Double]](colName)).toMap
}.toMap
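For the sample inputSmall above, the resulting structure should look like this (map ordering may differ):
Map(column1 -> Map(A -> 0.3, d -> 0.0, c -> 0.2), column2 -> Map(B -> 0.25, g -> 0.4, f -> 0.1, d -> 0.7))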
I'm not sure I follow the motivation, but I think this is the transformation that would get you the result you're after:
import org.apache.spark.sql.Row

// collect from DF (by your assumption - it is small enough)
val data: Array[Row] = inputSmall.collect()

// Create the "column pairs" -
// can be replaced with hard-coded value: List(("column1", "transformedCol1"), ("column2", "transformedCol2"))
val columnPairs: List[(String, String)] = inputSmall.columns
  .grouped(2)
  .collect { case Array(k, v) => (k, v) }
  .toList

// for each pair, get data and group it by left-column's value, choosing the first match
val result: Map[String, Map[String, Double]] = columnPairs
  .map { case (k, v) => k -> data.map(r => (r.getAs[String](k), r.getAs[Double](v))) }
  .toMap
  .mapValues(l => l.groupBy(_._1).map { case (c, l2) => l2.head })
result.foreach(println)
// prints:
// (column1,Map(A -> 0.3, d -> 0.0, c -> 0.2))
// (column2,Map(d -> 0.7, g -> 0.4, f -> 0.1, B -> 0.25))
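Either way, once the nested Map[String, Map[String, Double]] is on the driver, applying it during the predict step is just a lookup, for example inside a UDF. A minimal sketch (the names col1Lookup, transformCol1 and df are hypothetical, and 0.0 is an arbitrary default for values not seen during training):

import org.apache.spark.sql.functions.udf

// .toMap guards against lazy / non-serializable map views coming out of mapValues
val col1Lookup: Map[String, Double] = result("column1").toMap

val transformCol1 = udf { (v: String) => col1Lookup.getOrElse(v, 0.0) }

// hypothetical usage on a DataFrame `df` that has a "column1" column:
// df.withColumn("transformedCol1", transformCol1($"column1"))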

How to evaluate binary key-value?

I am writing an external merge sort in Scala for big binary input files.
I generate input using gensort and evaluate output using valsort from this website: http://www.ordinal.com/gensort.html
I read 100 bytes at a time: the first 10 bytes form the Key (List[Byte]) and the remaining 90 bytes the Value (List[Byte]).
After sorting, my output is evaluated by valsort, and it is wrong.
But when I use ASCII input, my output is right.
So I wonder how to sort binary inputs in the right way?
Valsort says that my first unordered record is record 56; here is what I printed out:
50 --> Key(List(-128, -16, 5, -10, -83, 23, -107, -109, 42, -11))
51 --> Key(List(-128, -16, 5, -10, -83, 23, -107, -109, 42, -11))
52 --> Key(List(-128, -10, -10, 68, -94, 37, -103, 30, 90, 16))
53 --> Key(List(-128, -10, -10, 68, -94, 37, -103, 30, 90, 16))
54 --> Key(List(-128, -10, -10, 68, -94, 37, -103, 30, 90, 16))
55 --> Key(List(-128, -10, -10, 68, -94, 37, -103, 30, 90, 16))
56 --> Key(List(-128, 0, -27, -4, -82, -82, 121, -125, -22, 99))
57 --> Key(List(-128, 0, -27, -4, -82, -82, 121, -125, -22, 99))
58 --> Key(List(-128, 0, -27, -4, -82, -82, 121, -125, -22, 99))
59 --> Key(List(-128, 0, -27, -4, -82, -82, 121, -125, -22, 99))
60 --> Key(List(-128, 7, -65, 118, 121, -12, 48, 50, 59, -8))
61 --> Key(List(-128, 7, -65, 118, 121, -12, 48, 50, 59, -8))
62 --> Key(List(-128, 7, -65, 118, 121, -12, 48, 50, 59, -8))
This is my external sorting code:
package externalsorting
import java.io.{BufferedOutputStream, File, FileOutputStream}
import java.nio.channels.FileChannel
import java.util.Calendar
import scala.collection.mutable
import readInput._
import scala.collection.mutable.ListBuffer
/**
* Created by hminle on 12/5/2016.
*/
object ExternalSortingExample extends App{
val dir: String = "C:\\ShareUbuntu\\testMerge"
val listFile: List[File] = Utils.getListOfFiles(dir)
listFile foreach(x => println(x.getName))
var fileChannelsInput: List[(FileChannel, Boolean)] = listFile.map{input => (Utils.getFileChannelFromInput(input), false)}
val tempDir: String = dir + "/tmp/"
val tempDirFile: File = new File(tempDir)
val isSuccessful: Boolean = tempDirFile.mkdir()
if(isSuccessful) println("Create temp dir successfully")
else println("Create temp dir failed")
var fileNameCounter: Int = 0
val chunkSize = 100000
// Split big input files into small chunks
while(!fileChannelsInput.isEmpty){
if(Utils.estimateAvailableMemory() > 400000){
val fileChannel = fileChannelsInput(0)._1
val (chunks, isEndOfFileChannel) = Utils.getChunkKeyAndValueBySize(chunkSize, fileChannel)
if(isEndOfFileChannel){
fileChannel.close()
fileChannelsInput = fileChannelsInput.drop(1)
} else {
val sortedChunk: List[(Key, Value)] = Utils.getSortedChunk(chunks)
val fileName: String = tempDir + "partition-" + fileNameCounter
Utils.writePartition(fileName, sortedChunk)
fileNameCounter += 1
}
} else {
println(Thread.currentThread().getName + ": there is not enough available free memory to continue processing: " + Utils.estimateAvailableMemory())
}
}
val listTempFile: List[File] = Utils.getListOfFiles(tempDir)
val start = Calendar.getInstance().getTime
val tempFileChannels: List[FileChannel] = listTempFile.map(Utils.getFileChannelFromInput(_))
val binaryFileBuffers: List[BinaryFileBuffer] = tempFileChannels.map(BinaryFileBuffer(_))
binaryFileBuffers foreach(x => println(x.toString))
val pq1: ListBuffer[BinaryFileBuffer] = ListBuffer.empty
binaryFileBuffers.filter(!_.isEmpty()).foreach(pq1.append(_))
val outputDir: String = dir + "/mergedOutput"
val bos = new BufferedOutputStream(new FileOutputStream(outputDir))
// Start merging temporary files
var count = 0
while(pq1.length > 0){
val pq2 = pq1.toList.sortWith(_.head()._1 < _.head()._1)
val buffer: BinaryFileBuffer = pq2.head
val keyVal: (Key, Value) = buffer.pop()
val byteArray: Array[Byte] = Utils.flattenKeyValue(keyVal).toArray[Byte]
Stream.continually(bos.write(byteArray))
if(buffer.isEmpty()){
buffer.close()
pq1 -= buffer
}
count+=1
}
bos.close()
}
This is BinaryFileBuffer.scala, which is just a wrapper:
package externalsorting
import java.nio.channels.FileChannel
import readInput._
/**
* Created by hminle on 12/5/2016.
*/
object BinaryFileBuffer{
def apply(fileChannel: FileChannel): BinaryFileBuffer = {
val buffer: BinaryFileBuffer = new BinaryFileBuffer(fileChannel)
buffer.reload()
buffer
}
}
class BinaryFileBuffer(fileChannel: FileChannel) extends Ordered[BinaryFileBuffer] {
private var cache: Option[(Key, Value)] = _
def isEmpty(): Boolean = cache == None
def head(): (Key, Value) = cache.get
def pop(): (Key, Value) = {
val answer = head()
reload()
answer
}
def reload(): Unit = {
this.cache = Utils.get100BytesKeyAndValue(fileChannel)
}
def close(): Unit = fileChannel.close()
def compare(that: BinaryFileBuffer): Int = {
this.head()._1.compare(that.head()._1)
}
}
This is my Utils.scala:
package externalsorting
import java.io.{BufferedOutputStream, File, FileOutputStream}
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.Paths
import readInput._
import scala.annotation.tailrec
import scala.collection.mutable.ListBuffer
/**
* Created by hminle on 12/5/2016.
*/
object Utils {
def getListOfFiles(dir: String): List[File] = {
val d = new File(dir)
if(d.exists() && d.isDirectory){
d.listFiles.filter(_.isFile).toList
} else List[File]()
}
def get100BytesKeyAndValue(fileChannel: FileChannel): Option[(Key, Value)] = {
val size = 100
val buffer = ByteBuffer.allocate(size)
buffer.clear()
val numOfByteRead = fileChannel.read(buffer)
buffer.flip()
if(numOfByteRead != -1){
val data: Array[Byte] = new Array[Byte](numOfByteRead)
buffer.get(data, 0, numOfByteRead)
val (key, value) = data.splitAt(10)
Some((Key(key.toList), Value(value.toList)))
} else {
None
}
}
def getFileChannelFromInput(file: File): FileChannel = {
val fileChannel: FileChannel = FileChannel.open(Paths.get(file.getPath))
fileChannel
}
def estimateAvailableMemory(): Long = {
System.gc()
val runtime: Runtime = Runtime.getRuntime
val allocatedMemory: Long = runtime.totalMemory() - runtime.freeMemory()
val presFreeMemory: Long = runtime.maxMemory() - allocatedMemory
presFreeMemory
}
def writePartition(dir: String, keyValue: List[(Key, Value)]): Unit = {
val byteArray: Array[Byte] = flattenKeyValueList(keyValue).toArray[Byte]
val bos = new BufferedOutputStream(new FileOutputStream(dir))
Stream.continually(bos.write(byteArray))
bos.close()
}
def flattenKeyValueList(keyValue: List[(Key,Value)]): List[Byte] = {
keyValue flatten {
case (Key(keys), Value(values)) => keys:::values
}
}
def flattenKeyValue(keyVal: (Key, Value)): List[Byte] = {
keyVal._1.keys:::keyVal._2.values
}
def getChunkKeyAndValueBySize(size: Int, fileChannel: FileChannel): (List[(Key, Value)], Boolean) = {
val oneKeyValueSize = 100
val countMax = size / oneKeyValueSize
var isEndOfFileChannel: Boolean = false
var count = 0
val chunks: ListBuffer[(Key, Value)] = ListBuffer.empty
do{
val keyValue = get100BytesKeyAndValue(fileChannel)
if(keyValue.isDefined) chunks.append(keyValue.get)
isEndOfFileChannel = !keyValue.isDefined
count += 1
}while(!isEndOfFileChannel && count < countMax)
(chunks.toList, isEndOfFileChannel)
}
def getSortedChunk(oneChunk: List[(Key, Value)]): List[(Key, Value)] = {
oneChunk.sortWith((_._1 < _._1))
}
}
How I define Key and Value:
case class Key(keys: List[Byte]) extends Ordered[Key] {
def isEmpty(): Boolean = keys.isEmpty
def compare(that: Key): Int = {
compare_aux(this.keys, that.keys)
}
private def compare_aux(keys1: List[Byte], keys2: List[Byte]): Int = {
(keys1, keys2) match {
case (Nil, Nil) => 0
case (list, Nil) => 1
case (Nil, list) => -1
case (hd1::tl1, hd2::tl2) => {
if(hd1 > hd2) 1
else if(hd1 < hd2) -1
else compare_aux(tl1, tl2)
}
}
}
}
case class Value(values: List[Byte])
I've found the answer: reading binary input and ASCII input are different.
In what order should the sorted file be?
For binary records (GraySort or MinuteSort), the 10-byte keys should be ordered as arrays of unsigned bytes. The memcmp() library routine can be used for this purpose.
To sort the binary input correctly, I need to compare the key bytes as unsigned values rather than as signed bytes.
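Concretely, one way to get that ordering is to compare each byte through its unsigned value. A sketch of how compare_aux could be adjusted (java.lang.Byte.toUnsignedInt requires Java 8+):

private def compare_aux(keys1: List[Byte], keys2: List[Byte]): Int = {
  (keys1, keys2) match {
    case (Nil, Nil) => 0
    case (_, Nil)   => 1
    case (Nil, _)   => -1
    case (hd1 :: tl1, hd2 :: tl2) =>
      // treat each byte as unsigned (0..255), the order memcmp() and valsort expect
      val u1 = java.lang.Byte.toUnsignedInt(hd1)
      val u2 = java.lang.Byte.toUnsignedInt(hd2)
      if (u1 > u2) 1
      else if (u1 < u2) -1
      else compare_aux(tl1, tl2)
  }
}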

Scala groupBy of a tuple to calculate stock basis

I am working on an exercise to calculate stock basis, given a list of stock purchases in the form of triples (ticker, qty, stock_price). I've got it working, but I would like to do the calculation part in a more functional way. Does anyone have an answer for this?
// input:
// List(("TSLA", 20, 200),
// ("TSLA", 20, 100),
// ("FB", 10, 100)
// output:
// List(("FB", (10, 100)),
// ("TSLA", (40, 150))))
def generateBasis(trades: Iterable[(String, Int, Int)]) = {
val basises = trades groupBy(_._1) map {
case (key, pairs) =>
val quantity = pairs.map(_._2).toList
val price = pairs.map(_._3).toList
var totalPrice: Int = 0
for (i <- quantity.indices) {
totalPrice += quantity(i) * price(i)
}
key -> (quantity.sum, totalPrice / quantity.sum)
}
basises
}
This looks like it might work for you. (updated)
def generateBasis(trades: Iterable[(String, Int, Int)]) =
  trades.groupBy(_._1).mapValues {
    _.foldLeft((0, 0)) { case ((tq, tp), (_, q, p)) => (tq + q, tp + q * p) }
  }.map { case (k, (q, p)) => (k, q, p / q) } // turn Map into tuples (triples)
I came up with the solution below. Thanks everyone for their input. I'd love to hear if anyone had a more elegant solution.
// input:
// List(("TSLA", 20, 200),
// ("TSLA", 10, 100),
// ("FB", 5, 50)
// output:
// List(("FB", (5, 50)),
// ("TSLA", (30, 166)))
def generateBasis(trades: Iterable[(String, Int, Int)]) = {
val groupedTrades = (trades groupBy(_._1)) map {
case (key, pairs) =>
key -> (pairs.map(e => (e._2, e._3)))
} // List((FB,List((5,50))), (TSLA,List((20,200), (10,100))))
val costBasises = for {groupedTrade <- groupedTrades
tradeCost = for {tup <- groupedTrade._2 // (qty, cost)
} yield tup._1 * tup._2 // (trade_qty * trade_cost)
tradeQuantity = for { tup <- groupedTrade._2
} yield tup._1 // trade_qty
} yield (groupedTrade._1, tradeQuantity.sum, tradeCost.sum / tradeQuantity.sum )
costBasises.toList // List(("FB", 5, 50), ("TSLA", 30, 166))
}
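For what it's worth, on Scala 2.13+ the grouping and the summing can be collapsed into a single groupMapReduce pass. A sketch, not one of the original answers, and it assumes each ticker's total quantity is non-zero so the division is safe:

def generateBasis(trades: Iterable[(String, Int, Int)]): List[(String, Int, Int)] =
  trades
    .groupMapReduce(_._1)(t => (t._2, t._2 * t._3)) {   // per trade: (qty, qty * price)
      case ((q1, c1), (q2, c2)) => (q1 + q2, c1 + c2)   // sum quantities and total costs
    }
    .map { case (ticker, (qty, cost)) => (ticker, qty, cost / qty) } // average cost basis
    .toList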

Find an instance by field value comparison in a Seq

What is the best practice to get an instance from a Seq?
case class Point(x: Int, y: Int)
val points: Seq[Point] = Seq(Point(1, 10), Point(2, 20), Point(3, 30))
I'd like to get the Point with the maximum y (in this case, Point(3, 30)).
What's the best way?
The easiest way would be to use TraversableOnce.maxBy:
val points: Seq[Point] = Seq(Point(1, 10), Point(2, 20), Point(3, 30))
scala> points.maxBy(_.y)
res1: Point = Point(3,30)
@YuvalItzchakov's answer is correct, but here is another way to do it, using Ordering:
val points: Seq[Point] = Seq(Point(1, 10), Point(2, 20), Point(3, 30))
// points: Seq[Point] = List(Point(1,10), Point(2,20), Point(3,30))
val order = Ordering.by((_: Point).y)
// order: scala.math.Ordering[Point] = scala.math.Ordering$$anon$9#5a2fa51f
val max_point = points.reduce(order.max)
// max_point: Point = Point(3,30)
or
points.max(order)
// Point = Point(3,30)
or with implicit Ordering:
{
implicit val pointOrdering = Ordering.by((_: Point).y)
points.max
}
// Point = Point(3,30)
Note: TraversableOnce.maxBy also uses an implicit Ordering. Reference.
Another way of doing it is by using foldLeft.
val points: Seq[Point] = Seq(Point(1, 10), Point(2, 20), Point(3, 30))
points.foldLeft[Point](Point(0, 0)) { (z, f) => if (f.y > z.y) f else z }
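One caveat with all of the above: maxBy, max and reduce throw on an empty Seq, and the foldLeft version silently returns the artificial Point(0, 0) seed. If the sequence can be empty, an Option-returning variant is safer; a sketch (maxByOption needs Scala 2.13, reduceOption also works on older versions):

val safeMax: Option[Point] = points.maxByOption(_.y)   // Scala 2.13+
// pre-2.13 equivalent:
val safeMax2: Option[Point] = points.reduceOption((a, b) => if (b.y > a.y) b else a)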