I have an object A which contains a List of objects B, and I want to get something from every B object (e.g. B.id), more or less like two nested for-each loops combined.
Example code:
rddA.flatMap(
  a => a.listB.map(
    b => (a.id, b.id)
  )
)
This actually works; my problem is that I call a method that uses another map (over a second RDD), and that fails.
case class Car(id: Long, listWheels: List[Wheel])
case class Wheel(id: Long)

rddCars.flatMap(
  a => a.listWheels.map(
    b => (a.id, b.id, getBrandFromSerial(serialToBrandMap, b.id))
  )
)

def getBrandFromSerial(serialToBrandMap: RDD[(Int, String)], id: Int): String = {
  val a = serialToBrandMap.filter(_._1 == id)
  val b = a.map(_._2).top(1)
  b(0)
}
The expected result is an RDD[(Int, Int, String)] with the Car id, the Wheel id and the Wheel brand in a Tuple3.
EDIT: Sample input/output
Input:
val wheels1 = List(Wheel(1), Wheel(1), Wheel(2), Wheel(2))
val wheels2 = List(Wheel(3), Wheel(3), Wheel(2), Wheel(2))
val rddCars: RDD[Car] = sparkContext.parallelize(List(Car(1, wheels1), Car(2, wheels2)))
val serialToBrandList: List[(Int, String)] = List((1, "Brand1"), (2, "Brand2"), (3, "Brand3"))
val serialToBrandMap: RDD[(Int, String)] = sparkContext.parallelize(serialToBrandList)
Output:
(1,1,Brand1),(1,1,Brand1),(1,2,Brand2)....(2,3,Brand3) and so on
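The call to getBrandFromSerial fails because an RDD (serialToBrandMap) cannot be used inside another RDD's transformation: the closure runs on the executors, where nested RDD operations are not allowed. A minimal sketch of one common workaround, assuming serialToBrandMap is small enough to collect to the driver (or to broadcast), is to turn it into a plain Map first; note the ids are Long in the case classes, so the result here is (Long, Long, String):

// Sketch only: collect the small lookup RDD to a driver-side Map keyed by wheel serial (as Long).
val brandLookup: Map[Long, String] =
  serialToBrandMap.map { case (serial, brand) => (serial.toLong, brand) }.collect().toMap

val result: RDD[(Long, Long, String)] = rddCars.flatMap { car =>
  car.listWheels.map { wheel =>
    (car.id, wheel.id, brandLookup.getOrElse(wheel.id, "unknown"))
  }
}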
Related
I have a List[MyObject], with MyObject containing the fields field1, field2 and field3.
I'm looking for an efficient way of doing:
Tuple3(_.map(_.field1), _.map(_.field2), _.map(_.field3))
In Java I would do something like:
List<Field1Type> f1 = new ArrayList<Field1Type>();
List<Field2Type> f2 = new ArrayList<Field2Type>();
List<Field3Type> f3 = new ArrayList<Field3Type>();

for (MyObject mo : myObjects) {
    f1.add(mo.getField1());
    f2.add(mo.getField2());
    f3.add(mo.getField3());
}
I would like something more functional since I'm in Scala, but I can't put my finger on it.
Get 2/3 sub-groups with unzip/unzip3
Assuming the starting point:
val objects: Seq[MyObject] = ???
You can unzip to get all 3 sub-groups:
val (firsts, seconds, thirds) =
  objects
    .unzip3((o: MyObject) => (o.f1, o.f2, o.f3))
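For illustration, a minimal sketch with a hypothetical three-field MyObject (the field types here are made up, not from the question):

// Hypothetical example types, just to show what unzip3 returns.
case class MyObject(f1: String, f2: Int, f3: Double)

val objects: Seq[MyObject] = Seq(MyObject("a", 1, 1.0), MyObject("b", 2, 2.0))

val (f1s, f2s, f3s) =
  objects.unzip3((o: MyObject) => (o.f1, o.f2, o.f3))
// f1s == Seq("a", "b"), f2s == Seq(1, 2), f3s == Seq(1.0, 2.0)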
What if I have more than 3 relevant sub-groups?
If you really need more sub-groups you will have to implement your own unzipN; however, instead of working with Tuple22, I would personally use an adapter:
case class MyObjectsProjection(private val objs: Seq[MyObject]) {
  lazy val f1s: Seq[String] =
    objs.map(_.f1)
  lazy val f2s: Seq[String] =
    objs.map(_.f2)
  ...
  lazy val f22s: Seq[String] =
    objs.map(_.f22)
}
val objects: Seq[MyObject] = ???
val objsProjection = MyObjectsProjection(objects)

objsProjection.f1s
objsProjection.f2s
...
objsProjection.f22s
Notes:
Change MyObjectsProjection according to your needs.
This is from a Scala 2.12/2.11 vanilla perspective.
The following will decompose your objects into three lists:
case class MyObject[T,S,R](f1: T, f2: S, f3: R)
val myObjects: Seq[MyObject[Int, Double, String]] = ???
val (l1, l2, l3) = myObjects.foldLeft((List.empty[Int], List.empty[Double], List.empty[String]))((acc, nxt) => {
  (nxt.f1 :: acc._1, nxt.f2 :: acc._2, nxt.f3 :: acc._3)
})
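Note that because the fold prepends with ::, each list comes out in reverse input order. If the original order matters, one small sketch (not part of the answer above) is to reverse each list afterwards:

// Restore input order after the prepend-based fold.
val (f1s, f2s, f3s) = (l1.reverse, l2.reverse, l3.reverse)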
I have a class as
case class BN[A] (val BNId: Long, score: Double, child: A)
and a database as BnDAO
class BnDAO(...) {
def readAsync(info : scala.Long)(implicit ec : scala.concurrent.ExecutionContext) : scala.concurrent.Future[scala.Option[scala.List[BN]]] = { /* compiled code */ }
}
In the function where I call BnDAO to retrieve BNs for a set of ids, I want to change each result BN's score according to a Map[Long, Double] (infoId -> score).
Here is my current code
val infoIds = scoreMap.keys.toSet
val futureBN = infoIds.map { id =>
  val curBN = interestsDAO.readAsync(id)
  // curBN is now "Future(Success(Some(List(BN))))"
  val curScore = scoreMap.get(id)
  // how to change curBN's score with curScore?
}
Thank you!!
EDIT: Thank you for Allen's answer. Based on his answer, is it possible to create a Set of Future[Option[List[BN]]] and a Map of Long to List[Long] (infoId -> List[BNId]) within one traversal?
I was not sure what you meant, so I created two answers. Pick the one that you want.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

val ids = scoreMap.keys.toVector

val futureBN = ids.map { id =>
  val curBNFut = interestsDAO.readAsync(id)
  curBNFut.map(opt => opt.map(l =>
    id -> l.map(e =>
      scoreMap.get(e.BNId).map(escore => e.copy(score = escore)).getOrElse(e)
    )
  ))
}

val mapResultFut = Future.sequence(futureBN).map(m =>
  m.flatten.toMap
)
val mapResult = Await.result(mapResultFut, 5.seconds)

val setFut = futureBN.map(fut => fut.map { opt => opt.map { case (id, l) => l } }).toSet
or
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

val ids = scoreMap.keys.toVector

val futureBN = ids.map { id =>
  val curBNFut = interestsDAO.readAsync(id)
  curBNFut.map(opt => opt.map(l =>
    id -> l.map(e =>
      scoreMap.get(id).map(escore => e.copy(score = escore)).getOrElse(e)
    )
  ))
}

val mapResultFut = Future.sequence(futureBN).map(m =>
  m.flatten.toMap
)
val mapResult = Await.result(mapResultFut, 5.seconds)

val setFut = futureBN.map(fut => fut.map { opt => opt.map { case (id, l) => l } }).toSet
The difference is in the input to scoreMap.get: the first version looks up e.BNId, the second looks up the outer id.
I am not sure which one you want.
Post a comment if anything is wrong, and make sure to test the answer so that it does what you want.
I have an RDD, say sample_rdd, of type RDD[(String, String, Int)] with 3 columns id, item, count. Sample data:
id1|item1|1
id1|item2|3
id1|item3|4
id2|item1|3
id2|item4|2
I want to join each id against this lookup_rdd:
item1|0
item2|0
item3|0
item4|0
item5|0
The output should give me the following for id1 (an outer join with the lookup table):
item1|1
item2|3
item3|4
item4|0
item5|0
Similarly, for id2 I should get:
item1|3
item2|0
item3|0
item4|2
item5|0
Finally, the output for each id should have all the counts, together with the id:
id1,1,3,4,0,0
id2,3,0,0,2,0
IMPORTANT: this output should always be ordered according to the order in the lookup.
This is what I have tried:
val line = rdd_sample.map { case (id, item, count) => (id, (item, count)) }.map(row => (row._1, row._2)).groupByKey()
get(line).map(l => (l._1, l._2)).mapValues(item_count => lookup_rdd.leftOuterJoin(item_count))

def get(line: RDD[(String, Iterable[(String, Int)])]) = {
  for {
    (id, item_cnt) <- line
    i = item_cnt.map(tuple => (tuple._1, tuple._2))
  } yield (id, i)
}
Try the below. Run each step on your local console to understand what's happening in detail.
The idea is to zipWithIndex and form a sequence based on lookup_rdd:
(i1,0),(i2,1)..(i5,4) and (id1,0),(id2,1)
Desired index in the final result = [delta (the length of the lookup_rdd seq) * index of id1..id2] + index of i1..i5
So the base seq generated will be (0,(i1,id1)),(1,(i2,id1))...(8,(i4,id2)),(9,(i5,id2)),
and then, based on the key (i1,id1), reduce and calculate the count.
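The snippet below refers to arr and cart, which the answer does not define. Assuming they hold the question's sample data, they might look like this:

// Assumed sample data matching the question (the names arr and cart come from the answer's code).
val arr = List(("id1", "item1", 1), ("id1", "item2", 3), ("id1", "item3", 4),
               ("id2", "item1", 3), ("id2", "item4", 2))
val cart = List(("item1", 0), ("item2", 0), ("item3", 0), ("item4", 0), ("item5", 0))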
val res2 = sc.parallelize(arr) //sample_rdd
val res3 = sc.parallelize(cart) //lookup_rdd
val delta = res3.count
val res83 = res3.map(_._1).zipWithIndex.cartesian(res2.map(_._1).distinct.zipWithIndex).map(x => ((x._1._1, x._2._1), ((delta * x._2._2) + x._1._2, 0)))
val res86 = res2.map(x => ((x._2,x._1),x._3)).reduceByKey(_+_)
val res88 = res83.leftOuterJoin(res86)
val res91 = res88.map { x =>
  x._2._2 match {
    case Some(x1) => (x._2._1._1, (x._1, x._2._1._2 + x1))
    case None     => (x._2._1._1, (x._1, x._2._1._2))
  }
}
val res97 = res91.sortByKey(true).map(x => (x._2._1._2, List(x._2._2))).reduceByKey(_ ++ _)
res97.collect
// SOLUTION: Array((id1,List(1,3,4,0,0)),(id2,List(3,0,0,2,0)))
I'm trying to make my RDD into a pair RDD, but I'm having trouble with the pattern matching and I have no idea what I'm doing wrong.
val test = sc.textFile("neighborhood_test.csv");
val nhead0 = test.first;
val test_split = test.map(line => line.split("\t"));
val nhead = test_split.first;
val test_neigh0 = test.filter(line => line!= nhead0);
//test_neigh0.first = 3335 Dunlap Seattle
val test_neigh1 = test_neigh0.map(line => line.split("\t"));
//test_neigh1.first = Array[String] = Array(3335, Dunlap, Seattle)
val test_neigh = test_neigh1.map({case (id, neigh, city) => (id, (neigh, city))});
Gives error:
found : (T1, T2, T3)
required: String
val test_neigh = test_neigh0.map({case (id, neigh, city) => (id, (neigh, city))});
EDIT:
The input file is tab-separated and looks like this:
id neighbourhood city
3335 Dunlap Seattle
4291 Roosevelt Seattle
5682 South Delridge Seattle
As output I want a pair RDD with id as the key and (neigh, city) as the value.
Neither test_neigh0.first nor test_neigh1.first is a triple, so you cannot pattern match it as such.
The elements in test_neigh1 are Array[String]. Under the assumption that these arrays are all of length 3, you can pattern match against them as { case Array(id, neigh, city) => ...}.
To make sure that you won't get a matching error if one of the lines has more or fewer than 3 elements, you may collect on this pattern match instead of mapping on it.
val test_neigh: RDD[(String, (String, String))] = test_neigh1.collect {
  case Array(id, neigh, city) => (id, (neigh, city))
}
EDIT
The issues you describe in your comment are related to RDD[_] not being a usual collection (such as List, Array or Set). To avoid them, you might need to fetch the elements of the array without pattern matching:
val test_neigh: RDD[(String, (String, String))] = test_neigh0.map(line => {
  val arr = line.split("\t")
  (arr(0), (arr(1), arr(2)))
})
val baseRDD = sc.textFile("neighborhood_test.csv").filter { x => !x.contains("city") }
baseRDD.map { x =>
  val split = x.split("\t")
  (split(0), (split(1), split(2)))
}.groupByKey().foreach(println(_))
Result:
(3335,CompactBuffer((Dunlap,Seattle)))
(4291,CompactBuffer((Roosevelt,Seattle)))
(5682,CompactBuffer((South Delridge,Seattle)))
I'm trying to join two datasets based on two columns. It works when I use one column, but it fails with the error below:
:29: error: value join is not a member of org.apache.spark.rdd.RDD[(String, String, (String, String, String, String, Double))]
val finalFact = fact.join(dimensionWithSK).map { case(nk1,nk2, ((parts1,parts2,parts3,parts4,amount), (sk, prop1,prop2,prop3,prop4))) => (sk,amount) }
Code:
import org.apache.spark.rdd.RDD
def zipWithIndex[T](rdd: RDD[T]) = {
  val partitionSizes = rdd.mapPartitions(p => Iterator(p.length)).collect

  val ranges = partitionSizes.foldLeft(List((0, 0))) { case (accList, count) =>
    val start = accList.head._2
    val end = start + count
    (start, end) :: accList
  }.reverse.tail.toArray

  rdd.mapPartitionsWithIndex((index, partition) => {
    val start = ranges(index)._1
    val end = ranges(index)._2
    val indexes = Iterator.range(start, end)
    partition.zip(indexes)
  })
}
val dimension = sc.
  textFile("dimension.txt").
  map { line =>
    val parts = line.split("\t")
    (parts(0), parts(1), parts(2), parts(3), parts(4), parts(5))
  }

val dimensionWithSK =
  zipWithIndex(dimension).map { case ((nk1, nk2, prop3, prop4, prop5, prop6), idx) =>
    (nk1, nk2, (prop3, prop4, prop5, prop6, idx + nextSurrogateKey))
  }
val fact = sc.
  textFile("fact.txt").
  map { line =>
    val parts = line.split("\t")
    // we need to output (Naturalkey, (FactId, Amount)) in
    // order to be able to join with the dimension data.
    (parts(0), parts(1), (parts(2), parts(3), parts(4), parts(5), parts(6).toDouble))
  }
val finalFact = fact.join(dimensionWithSK).map { case(nk1,nk2, ((parts1,parts2,parts3,parts4,amount), (sk, prop1,prop2,prop3,prop4))) => (sk,amount) }
Could someone help me with this?
Thanks,
Sridhar
If you look at the signature of join it works on an RDD of pairs:
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
You have a triple. I guess you're trying to join on the first 2 elements of the tuple, so you need to map your triple to a pair, where the first element of the pair is itself a pair containing the first two elements of the triple, e.g. for any types V1 and V2:
val left: RDD[(String, String, V1)] = ???  // some rdd
val right: RDD[(String, String, V2)] = ??? // some rdd

left.map {
  case (key1, key2, value) => ((key1, key2), value)
}
.join(
  right.map {
    case (key1, key2, value) => ((key1, key2), value)
  })
This will give you an RDD of the form RDD[((String, String), (V1, V2))].
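Applied to the code in the question, a minimal sketch might look like the following (this assumes the fact and dimensionWithSK definitions above, where the surrogate key is the last element of the dimension value tuple):

// Sketch only: re-key both RDDs on the composite natural key, then join and extract (sk, amount).
val factByKey = fact.map { case (nk1, nk2, measures) => ((nk1, nk2), measures) }
val dimByKey  = dimensionWithSK.map { case (nk1, nk2, props) => ((nk1, nk2), props) }

val finalFact = factByKey.join(dimByKey).map {
  case ((nk1, nk2), ((p1, p2, p3, p4, amount), (prop3, prop4, prop5, prop6, sk))) =>
    (sk, amount)
}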
rdd1 schema:
field1, field2, field3, fieldX, .....
rdd2 schema:
field1, field2, field3, fieldY, .....

val joinResult = rdd1.join(rdd2,
  Seq("field1", "field2", "field3"), "outer")

joinResult schema:
field1, field2, field3, fieldX, fieldY, ......

(Note that this Seq-of-column-names variant of join is the DataFrame/Dataset API; plain RDDs have no column names, so for RDDs you need the pair-key approach shown above.)
val emp = sc.
  textFile("emp.txt").
  map { line =>
    val parts = line.split("\t")
    // key on (parts(0), parts(2)) so that we can join with emp_new below.
    ((parts(0), parts(2)), parts(1))
  }

val emp_new = sc.
  textFile("emp_new.txt").
  map { line =>
    val parts = line.split("\t")
    // same composite key as emp, so the two RDDs can be joined.
    ((parts(0), parts(2)), parts(1))
  }

val finalemp =
  emp_new.join(emp).
  map { case ((nk1, nk2), (parts1, val1)) => (nk1, parts1, val1) }