Change value inside Future(Success(Some(List(BN)))) - Scala

I have a case class:
case class BN[A](BNId: Long, score: Double, child: A)
and a DAO class, BnDAO:
class BnDAO(...) {
def readAsync(info : scala.Long)(implicit ec : scala.concurrent.ExecutionContext) : scala.concurrent.Future[scala.Option[scala.List[BN]]] = { /* compiled code */ }
}
In a function that calls BnDAO to retrieve BNs for a set of ids, I want to change each returned BN's score according to a Map[Long, Double] (infoId -> score).
Here is my current code:
val infoIds = scoreMap.keys.toSet
val futureBN = infoIds.map { id =>
  val curBN = interestsDAO.readAsync(id)
  // curBN is now Future(Success(Some(List(BN))))
  val curScore = scoreMap.get(id)
  // how to change curBN's score with curScore?
}
Thank you!!
Edit: Thank you for Allen's answer. Based on it, is it possible to create a Set of Future[Option[List[BN]]] and a Map of Long to List[Long] (infoId -> List[BNId]) within one traverse?

I was not sure exactly what you meant, so here are two variants; pick the one you want.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
// an implicit ExecutionContext is assumed to be in scope

val ids = scoreMap.keys.toVector
val futureBN = ids.map { id =>
  val curBNFut = interestsDAO.readAsync(id)
  curBNFut.map(opt => opt.map(l =>
    id -> l.map(e =>
      scoreMap.get(e.BNId).map(escore => e.copy(score = escore)).getOrElse(e)
    )
  ))
}
val mapResultFut = Future.sequence(futureBN).map(m =>
  m.flatMap(identity).toMap
)
val mapResult = Await.result(mapResultFut, 5.seconds)
val setFut = futureBN.map(fut => fut.map { opt => opt.map { case (id, l) => l } }).toSet
or
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
// an implicit ExecutionContext is assumed to be in scope

val ids = scoreMap.keys.toVector
val futureBN = ids.map { id =>
  val curBNFut = interestsDAO.readAsync(id)
  curBNFut.map(opt => opt.map(l =>
    id -> l.map(e =>
      scoreMap.get(id).map(escore => e.copy(score = escore)).getOrElse(e)
    )
  ))
}
val mapResultFut = Future.sequence(futureBN).map(m =>
  m.flatMap(identity).toMap
)
val mapResult = Await.result(mapResultFut, 5.seconds)
val setFut = futureBN.map(fut => fut.map { opt => opt.map { case (id, l) => l } }).toSet
The difference is in the argument passed to scoreMap.get: the first snippet looks up e.BNId, the second looks up the infoId id.
I am not sure which one you want.
Please test both and post a comment if anything is wrong.
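As for the follow-up about doing both in one traverse: here is a minimal, untested sketch using Future.traverse, assuming an implicit ExecutionContext is in scope and assuming a single Future that carries both results is acceptable instead of a Set of separate Futures:
import scala.concurrent.Future

val bothFut = Future.traverse(scoreMap.keys.toVector) { id =>
  interestsDAO.readAsync(id).map { opt =>
    // rescore the BNs for this infoId (empty list if the DAO returned None)
    val rescored = opt.getOrElse(Nil).map { bn =>
      scoreMap.get(id).map(s => bn.copy(score = s)).getOrElse(bn)
    }
    id -> rescored
  }
}.map { pairs =>
  val scoredById = pairs.toMap                                             // infoId -> rescored BNs
  val bnIdsById  = pairs.map { case (id, l) => id -> l.map(_.BNId) }.toMap // infoId -> List[BNId]
  (scoredById, bnIdsById)
}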

Related

Spark doesn't conform to expected type TraversableOnce

val num_idf_pairs = rescaledData.select("item", "features")
.rdd.map(x => {(x(0), x(1))})
val itemRdd = rescaledData.select("item", "features").where("item = 1")
.rdd.map(x => {(x(0), x(1))})
val b_num_idf_pairs = sparkSession.sparkContext.broadcast(num_idf_pairs.collect())
val sims = num_idf_pairs.flatMap {
case (key, value) =>
val sv1 = value.asInstanceOf[SV]
import breeze.linalg._
val valuesVector = new SparseVector[Double](sv1.indices, sv1.values, sv1.size)
itemRdd.map {
case (id2, idf2) =>
val sv2 = idf2.asInstanceOf[SV]
val xVector = new SparseVector[Double](sv2.indices, sv2.values, sv2.size)
val sim = valuesVector.dot(xVector) / (norm(valuesVector) * norm(xVector))
(id2.toString, key.toString, sim)
}
}
The error is: doesn't conform to expected type TraversableOnce.
When I modify it as follows:
val b_num_idf_pairs = sparkSession.sparkContext.broadcast(num_idf_pairs.collect())
val docSims = num_idf_pairs.flatMap {
case (id1, idf1) =>
val idfs = b_num_idf_pairs.value.filter(_._1 != id1)
val sv1 = idf1.asInstanceOf[SV]
import breeze.linalg._
val bsv1 = new SparseVector[Double](sv1.indices, sv1.values, sv1.size)
idfs.map {
case (id2, idf2) =>
val sv2 = idf2.asInstanceOf[SV]
val bsv2 = new SparseVector[Double](sv2.indices, sv2.values, sv2.size)
val cosSim = bsv1.dot(bsv2).asInstanceOf[Double] / (norm(bsv1) * norm(bsv2))
(id1.toString(), id2.toString(), cosSim)
}
}
it compiles, but it causes an OutOfMemoryException even though I set --executor-memory 4G.
The first snippet:
num_idf_pairs.flatMap {
...
itemRdd.map { ...}
}
is not only not valid Spark code (no nested transformations are allowed), but also, as you already know, won't type check, because RDD is not TraversableOnce.
The second snippet likely fails because the data you are trying to collect and broadcast is too large.
It looks like you are trying to compute all pairwise item similarities, so you'll need a Cartesian product; structure your code roughly like this:
num_idf_pairs
.cartesian(itemRdd)
.filter { case ((id1, idf1), (id2, idf2)) => id1 != id2 }
.map { case ((id1, idf1), (id2, idf2)) => {
val cosSim = ??? // Compute similarity
(id1.toString(), id2.toString(), cosSim)
}}
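For illustration only, a sketch of the full pipeline that reuses the Breeze-based cosine similarity from your own snippets, assuming SV is the Spark SparseVector alias you already use:
import breeze.linalg._

val sims = num_idf_pairs
  .cartesian(itemRdd)
  .filter { case ((id1, _), (id2, _)) => id1 != id2 }
  .map { case ((id1, idf1), (id2, idf2)) =>
    // convert both feature vectors to Breeze sparse vectors, as in the question
    val sv1 = idf1.asInstanceOf[SV]
    val sv2 = idf2.asInstanceOf[SV]
    val bsv1 = new SparseVector[Double](sv1.indices, sv1.values, sv1.size)
    val bsv2 = new SparseVector[Double](sv2.indices, sv2.values, sv2.size)
    val cosSim = bsv1.dot(bsv2) / (norm(bsv1) * norm(bsv2))
    (id1.toString, id2.toString, cosSim)
  }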

Iterate an RDD[A] where A contains List[B]

I have an object A which contains a List of B objects, and I want to get something from every B object (e.g. B.id), more or less like two nested foreach loops combined.
Example code:
rddA.flatMap(
a => a.listB.map(
b => (a.id, b.id)
)
)
This actually works; my problem is that I call a method that uses another RDD inside the map, which fails.
case class Car( id :Long, listWheels : List[Wheel])
case class Wheel( id : Long)
rddCars.flatMap(
a => a.listWheels.map(
b => (a.id, b.id, getBrandFromSerial(serialToBrandMap,b.id))
)
)
def getBrandFromSerial(serialToBrandMap: RDD[(Int, String)], id: Int): String = {
val a = serialToBrandMap.filter(_._1 == id)
val b = a.map(_._2).top(1)
b(0)
}
The expected result is an RDD[(Int, Int, String)] with the Car id, Wheel id and Wheel brand in a Tuple3.
EDIT: Sample input/output
Input:
val wheels1 = List(Wheel(1), Wheel(1), Wheel(2), Wheel(2))
val wheels2 = List(Wheel(3), Wheel(3), Wheel(2), Wheel(2))
val rddCars: RDD[Car] = sparkContext.parallelize(List(Car(1, wheels1), Car(2, wheels2)))
val serialToBrandList: List[(Int, String)] = List((1, "Brand1"), (2, "Brand2"), (3, "Brand3"))
val serialToBrandMap: RDD[(Int, String)] = sparkContext.parallelize(serialToBrandList)
Output:
(1,1,Brand1),(1,1,Brand1),(1,2,Brand2)....(2,3,Brand3) and so on
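A hypothetical sketch of one way to avoid using an RDD inside another RDD's transformation, assuming serialToBrandMap is small enough to collect to the driver (otherwise a broadcast variable or a join on the wheel id would be the usual alternatives):
// collect the small serial -> brand RDD into a plain Map on the driver
val brandLookup: Map[Int, String] = serialToBrandMap.collect().toMap

val result = rddCars.flatMap { car =>
  car.listWheels.map { wheel =>
    // Wheel.id is a Long while the sample lookup is keyed by Int, hence toInt
    (car.id, wheel.id, brandLookup.getOrElse(wheel.id.toInt, ""))
  }
}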

Splitting into new columns as many times as the separator between column values has occurred

I have a DataFrame where some columns contain multiple values, always separated by ^.
Input:
phone|contact|
ERN~58XXXXXX7~^EPN~5XXXXX551~|C~MXXX~MSO~^CAxxE~~~~~~3XXX5|
Desired output:
phone1|phone2|contact1|contact2|
ERN~5XXXXXXX7|EPN~58XXXX91551~|C~MXXXH~MSO~|CAxxE~~~~~~3XXX5|
How can this be achieved with a loop, given that the number of separators between column values is not constant?
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "|")
  .option("charset", "UTF-8")
  .load("test.txt")
val columnList = df.columns
// start every column with a split-count of 0
val xx = columnList.map(x => x -> 0).toMap
// for each row, record how many ^-separated values each column contains
val opMap = df.rdd.flatMap { row =>
  columnList.foldLeft(xx) { case (y, col) =>
    val s = row.getAs[String](col).split("\\^").length
    if (y(col) < s) y.updated(col, s) else y
  }.toList
}
// maximum number of values seen per column across all rows
val colMaxSizeMap = opMap.groupBy(x => x._1).map(x => x._2.toList.maxBy(x => x._2)).collect().toMap
// split each column and pad with empty strings up to the column's maximum size
val x = df.rdd.map { row =>
  val op = columnList.flatMap { y =>
    val values = row.getAs[String](y).split("\\^")
    values ++ List.fill(colMaxSizeMap(y) - values.size)("")
  }
  Row.fromSeq(op)
}
// one StructField per split position, named <originalName><index>
val structFieldList = columnList.flatMap { colName =>
  List.range(0, colMaxSizeMap(colName), 1).map { i =>
    StructField(s"$colName$i", StringType)
  }
}
val schema = StructType(structFieldList)
val da = spark.createDataFrame(x, schema)
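To sanity-check the result (purely illustrative), note that the generated column names follow the <originalName><index> pattern, e.g. phone0, phone1, contact0, contact1 for the sample above:
da.printSchema()
da.show(false)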

How to combine `count` and `sum` computations for the same source

There is a stream of integers:
val source = Source(List(1,2,3,4,5))
Is it possible to get the (count, sum) result from the source? For the above example it would be (5, 15).
I guess I should use flows and combine them:
val countFlow = Flow[Int].fold(0)((c, _) => c + 1)
val sumFlow = Flow[Int].fold(0)((s, e) => s + e)
How do I apply the above flows to the source? Or is there another way?
Final Total
The Flow that you presented is almost correct for getting a final value after the source is exhausted:
case class Data(sum : Int = 0, count : Int = 0)
val updateData : (Data, Int) => Data =
(data, i) => Data(data.sum + i, data.count + 1)
val zeroData = Data()
val countAndSum = Flow[Int].fold(zeroData)(updateData)
This Flow can then be combined with a Sink.head to get the final result:
val result: Future[Data] =
  source
    .via(countAndSum)
    .runWith(Sink.head[Data])
Intermediate Values
If you want a "running counter", e.g. you want all of the intermediate Data values, then you can use Flow.scan instead of fold:
val intermediateCountAndSum =
Flow[Int].scan(zeroData)(updateData)
And you can "drain" these Data values into a Sink.seq:
val intermediateResult: Future[Seq[Data]] =
  source
    .via(intermediateCountAndSum)
    .runWith(Sink.seq[Data])
Alternatively, you can broadcast the source into both of your original flows and zip their single results using the GraphDSL:
val graph = Source.fromGraph(GraphDSL.create() { implicit builder =>
  import GraphDSL.Implicits._
  val fanOut = builder.add(Broadcast[Int](2))
  val merge = builder.add(Zip[Int, Int])
  source ~> fanOut ~> countFlow ~> merge.in0
            fanOut ~> sumFlow ~> merge.in1
  SourceShape(merge.out)
})
graph.runWith(Sink.last)
You can simply gather all the elements into a single list and then map over it:
source
  .fold(List.empty[Int])(_ :+ _)
  .map(list => (list.length, list.reduceLeft(_ + _)))
I hope it's helpful.
case class Stats(sum: Int, count: Int) {
  def add(el: Int): Stats = this.copy(sum = sum + el, count = count + 1)
}
object Stats {
  def empty: Stats = Stats(0, 0)
}
val countFlow = Flow[Int].fold(Stats.empty)((stats, e) => stats add e)
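To actually run this fold-based flow and obtain the final Stats (a minimal sketch, assuming an implicit ActorSystem/Materializer is in scope):
val statsResult: Future[Stats] =
  source.via(countFlow).runWith(Sink.head)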

Chisel: Access to Module Parameters from Tester

How does one access the parameters used to construct a Module from inside the Tester that is testing it?
In the test below I am passing the parameters explicitly both to the Module and to the Tester. I would prefer not to have to pass them to the Tester but instead extract them from the module that was also passed in.
Also I am new to scala/chisel so any tips on bad techniques I'm using would be appreciated :).
import Chisel._
import math.pow
class TestA(dataWidth: Int, arrayLength: Int) extends Module {
val dataType = Bits(INPUT, width = dataWidth)
val arrayType = Vec(gen = dataType, n = arrayLength)
val io = new Bundle {
val i_valid = Bool(INPUT)
val i_data = dataType
val i_array = arrayType
val o_valid = Bool(OUTPUT)
val o_data = dataType.flip
val o_array = arrayType.flip
}
io.o_valid := io.i_valid
io.o_data := io.i_data
io.o_array := io.i_array
}
class TestATests(c: TestA, dataWidth: Int, arrayLength: Int) extends Tester(c) {
val maxData = pow(2, dataWidth).toInt
for (t <- 0 until 16) {
val i_valid = rnd.nextInt(2)
val i_data = rnd.nextInt(maxData)
val i_array = List.fill(arrayLength)(rnd.nextInt(maxData))
poke(c.io.i_valid, i_valid)
poke(c.io.i_data, i_data)
(c.io.i_array, i_array).zipped foreach {
(element,value) => poke(element, value)
}
expect(c.io.o_valid, i_valid)
expect(c.io.o_data, i_data)
(c.io.o_array, i_array).zipped foreach {
  (element, value) => expect(element, value)
}
step(1)
}
}
object TestAObject {
def main(args: Array[String]): Unit = {
val tutArgs = args.slice(0, args.length)
val dataWidth = 5
val arrayLength = 6
chiselMainTest(tutArgs, () => Module(
new TestA(dataWidth=dataWidth, arrayLength=arrayLength))){
c => new TestATests(c, dataWidth=dataWidth, arrayLength=arrayLength)
}
}
}
If you make the arguments dataWidth and arrayLength members of TestA you can just reference them. In Scala this can be accomplished by inserting val into the argument list:
class TestA(val dataWidth: Int, val arrayLength: Int) extends Module ...
Then you can reference them from the test as members with c.dataWidth or c.arrayLength.
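For illustration, a sketch of the Tester and main object after dropping the duplicated parameters (same logic as the code above, just reading the values from c, and assuming TestA now declares them as vals):
class TestATests(c: TestA) extends Tester(c) {
  val maxData = pow(2, c.dataWidth).toInt
  for (t <- 0 until 16) {
    val i_valid = rnd.nextInt(2)
    val i_data = rnd.nextInt(maxData)
    val i_array = List.fill(c.arrayLength)(rnd.nextInt(maxData))
    poke(c.io.i_valid, i_valid)
    poke(c.io.i_data, i_data)
    (c.io.i_array, i_array).zipped foreach { (element, value) => poke(element, value) }
    expect(c.io.o_valid, i_valid)
    expect(c.io.o_data, i_data)
    (c.io.o_array, i_array).zipped foreach { (element, value) => expect(element, value) }
    step(1)
  }
}

object TestAObject {
  def main(args: Array[String]): Unit = {
    chiselMainTest(args, () => Module(new TestA(dataWidth = 5, arrayLength = 6))) {
      c => new TestATests(c)
    }
  }
}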