Spark - aggregateByKey Type mismatch error - scala

I am trying to find the problem behind this. I am trying to find the maximum Marks for each student using aggregateByKey.
val data = spark.sc.Seq(("R1","M",22),("R1","E",25),("R1","F",29),
("R2","M",20),("R2","E",32),("R2","F",52))
.toDF("Name","Subject","Marks")
def seqOp = (acc:Int,ele:(String,Int)) => if (acc>ele._2) acc else ele._2
def combOp =(acc:Int,acc1:Int) => if(acc>acc1) acc else acc1
val r = data.rdd.map{case(t1,t2,t3)=> (t1,(t2,t3))}.aggregateByKey(0)(seqOp,combOp)
I am getting an error that aggregateByKey accepts (Int, (Any, Any)) but the actual type is (Int, (String, Int)).

Your map function is incorrect since you have a Row as input, not a Tuple3.
Fix the last line with:
val r = data.rdd.map { r =>
  val t1 = r.getAs[String](0)
  val t2 = r.getAs[String](1)
  val t3 = r.getAs[Int](2)
  (t1, (t2, t3))
}.aggregateByKey(0)(seqOp, combOp)
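As a variant, the same extraction can be written by pattern matching on the Row directly. This is only a sketch and assumes the Marks column really is stored as Int, as in the sample data:

import org.apache.spark.sql.Row

val r = data.rdd
  .map { case Row(name: String, subject: String, marks: Int) => (name, (subject, marks)) }
  .aggregateByKey(0)(seqOp, combOp)   // keeps the maximum marks per name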

Related

Multiplication of "double" values in scala

I want to multiply two sparse matrices in Spark using Scala. I am passing these matrices as arguments and storing the result in another argument.
The matrices are text files where each matrix element is represented as: row, column, element.
I am not able to multiply two Double values in Scala.
object MultiplySpark {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Multiply")
    conf.setMaster("local[2]")
    val sc = new SparkContext(conf)

    val M = sc.textFile(args(0)).flatMap(entry => {
      val rec = entry.split(",")
      val row = rec(0).toInt
      val column = rec(1).toInt
      val value = rec(2).toDouble
      for {pointer <- 1 until rec.length} yield ((row, column), value)
    })
    val N = sc.textFile(args(0)).flatMap(entry => {
      val rec = entry.split(",")
      val row = rec(0).toInt
      val column = rec(1).toInt
      val value = rec(2).toDouble
      for {pointer <- 1 until rec.length} yield ((row, column), value)
    })

    val Mmap = M.map(e => (e._2, e))
    val Nmap = N.map(d => (d._2, d))

    val MNjoin = Mmap.join(Nmap).map { case (k, (e, d)) => e._2.toDouble + "," + d._2.toDouble }

    val result = MNjoin.reduceByKey((a, b) => a * b)
      .map(entry => {
        ((entry._1._1, entry._1._2), entry._2)
      })
      .reduceByKey((a, b) => a + b)

    result.saveAsTextFile(args(2))
    sc.stop()
  }
}
How can I multiply Double values in Scala?
Please note:
I tried a.toDouble * b.toDouble
The error is: value * is not a member of (Double, Double)
This reduceByKey would work if you had an RDD[((Int, Int), Double)] (or, more generally, an RDD[(SomeType, Double)]), but join gives you an RDD[((Int, Int), (Double, Double))]. So you are trying to multiply pairs (Double, Double), not Doubles.
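If the end goal is the usual sparse matrix product, a minimal sketch along these lines keeps the values as Doubles all the way through (assuming M and N are the RDD[((Int, Int), Double)] built above, i.e. entries of the form ((row, column), value)):

// Key M by its column index and N by its row index so the join pairs up
// exactly the entries that need to be multiplied, then sum over that index.
val mByCol = M.map { case ((i, k), v) => (k, (i, v)) }
val nByRow = N.map { case ((k, j), v) => (k, (j, v)) }

val product = mByCol.join(nByRow)                              // (k, ((i, mv), (j, nv)))
  .map { case (_, ((i, mv), (j, nv))) => ((i, j), mv * nv) }   // Double * Double compiles fine
  .reduceByKey(_ + _)                                          // sum the partial products for each cell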

Merging list of uneven length with default value for missing matches

I'm trying to pair up two lists in Scala where non-matching pairs should be replaced by a default value. This is what I have so far, but they all fall short in some way.
How do I create List((a,a),(b,empty),(c,c))?
case class Test(id: Option[Int] = None)
val empty = Test()
val a = Test(Some(1))
val b = Test(Some(2))
val c = Test(Some(3))
val cache = List(a,b,c)
val delta = List(a,c)
//Trial 1
val newCache1 = cache.zipAll(delta,empty,empty)
//Trial 2
val newCache2 = for {
  c <- cache
  d <- delta
  if c.id == d.id
} yield (c,d)
//Trial 3
val newCache3 = for {
  c <- cache
  d <- delta
} yield if (c.id == d.id) (c,d) else (c,empty)
Turn your delta into a map, then join them up.
val deltaMap: Map[Int, Test] =
  delta.flatMap(x => x.id.map(id => id -> x)).toMap
val newCache: Seq[(Test, Test)] = cache.map { c =>
  c -> c.id.flatMap(deltaMap.get).getOrElse(empty)
}
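For the sample data above this produces exactly the pairing the question asks for; b has no counterpart in delta, so it is paired with empty:

newCache.foreach(println)
// (Test(Some(1)),Test(Some(1)))
// (Test(Some(2)),Test(None))     <- b paired with the default value
// (Test(Some(3)),Test(Some(3)))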

How to copy matrix to column array

I'm trying to copy a column of a matrix into an array; I also want to make this matrix public.
Here's my code:
val years = Array.ofDim[String](1000, 1)
val bufferedSource = io.Source.fromFile("Top_1_000_Songs_To_Hear_Before_You_Die.csv")
val i=0;
//println("THEME, TITLE, ARTIST, YEAR, SPOTIFY_URL")
for (line <- bufferedSource.getLines) {
  val cols = line.split(",").map(_.trim)
  years(i) = cols(3)(i)
}
I want cols to be a global matrix and to copy column 3 into years, but because of the way I get cols I don't know how to define it.
There are three different problems in your attempt:
Your regexp will fail for this dataset. I suggest you change it to:
val regex = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))"
This will capture the blocks wrapped in double quotes but containing commas (courtesy of Luke Sheppard on regexr)
This val i=0; is not very Scala-ish / functional. We can replace it with a zipWithIndex in the for-comprehension:
for ((line, count) <- bufferedSource.getLines.zipWithIndex)
You can create the "global matrix" by extracting elements from each line (val Array(...)) and returning them as the value of the for-comprehension block (yield):
It looks like this:
for ((line, count) <- bufferedSource.getLines.zipWithIndex) yield {
val Array(theme,title,artist,year,spotify_url) = line....
...
(theme,title,artist,year,spotify_url)
}
And here is the complete solution:
val bufferedSource = io.Source.fromFile("/tmp/Top_1_000_Songs_To_Hear_Before_You_Die.csv")
val years = Array.ofDim[String](1000, 1)
val regex = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))"
val iteratorMatrix = for ((line, count) <- bufferedSource.getLines.zipWithIndex) yield {
  val Array(theme, title, artist, year, spotify_url) = line.split(regex, -1).map(_.trim)
  years(count) = Array(year)
  (theme, title, artist, year, spotify_url)
}
// will actually consume the iterator and fill in globalMatrix AND years
val globalMatrix = iteratorMatrix.toList
Here's a function that will get column col from the CSV. There is no error handling here for empty rows or other conditions; this is a proof of concept, so add your own error handling as you see fit.
val years = (fileName: String, col: Int) => scala.io.Source.fromFile(fileName)
  .getLines()
  .map(_.split(",")(col).trim())
Here's a suggestion if you are looking to keep the contents of the file in a map. Again, there's no error handling; this is just a proof of concept.
val yearColumn = 3
val fileName = "Top_1_000_Songs_To_Hear_Before_You_Die.csv"
def lines(file: String) = scala.io.Source.fromFile(file).getLines()
val mapRow = (row: String) => row.split(",").zipWithIndex.foldLeft(Map[Int, String]()) {
  case (acc, (v, idx)) => acc.updated(idx, v.trim())
}
def mapColumns = (values: Iterator[String]) =>
  values.zipWithIndex.foldLeft(Map[Int, Map[Int, String]]()) {
    case (acc, (v, idx)) => acc.updated(idx, mapRow(v))
  }
val parser = lines _ andThen mapColumns
val matrix = parser(fileName)
val years = matrix.flatMap(_.swap._1.get(yearColumn))
This will build a Map[Int,Map[Int, String]] which you can use elsewhere. The first index of the map is the row number and the index of the inner map is the column number. years is an Iterable[String] that contains the year values.
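For example (assuming the CSV has no header row, so row 0 is the first song), a single cell or the whole column can be read back like this:

val firstYear = matrix(0)(yearColumn)   // year value of the first row
val allYears  = years.toList            // all year values as a List[String]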
Consider adding contents to a collection at the same time as it is created, in contrast to allocating space first and then updating it; for instance like this:
val rawSongsInfo = io.Source.fromFile("Top_Songs.csv").getLines
val cols = for (rsi <- rawSongsInfo) yield rsi.split(",").map(_.trim)
val years = cols.map(_(3))

Finding values within broadcast variable

I want to join two sets by applying a broadcast variable. I am trying to implement the first suggestion from "Spark: what's the best strategy for joining a 2-tuple-key RDD with single-key RDD?".
val emp_newBC = sc.broadcast(emp_new.collectAsMap())
val joined = emp.mapPartitions({ iter =>
  val m = emp_newBC.value
  for {
    ((t, w)) <- iter
    if m.contains(t)
  } yield ((w + '-' + m.get(t).get), 1)
}, preservesPartitioning = true)
However, as mentioned here: "broadcast variable fails to take all data", I need to use collect() rather than collectAsMap(). I tried to adjust my code as below:
val emp_newBC = sc.broadcast(emp_new.collect())
val joined = emp.mapPartitions({ iter =>
  val m = emp_newBC.value
  for {
    ((t, w)) <- iter
    if m.contains(t)
    amk = m.indexOf(t)
  } yield ((w + '-' + emp_newBC.value(amk)), 1) //yield ((t, w), (m.get(t).get)) //((w + '-' + m.get(t).get),1)
}, preservesPartitioning = true)
But it seems m.contains(t) never matches anything. How can I remedy this?
Thanks in advance.
How about something like this?
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
val joined = emp.mapPartitions(iter => for {
  (k, v1) <- iter
  v2 <- emp_newBC.value.getOrElse(k, Iterable())
} yield (s"$v1-$v2", 1))
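As a quick sanity check with assumed sample data of the same shape (both RDDs keyed by the same String field t):

// If emp held ("t1", "w1") and ("t2", "w2"),
// and emp_new held ("t1", "x1") and ("t1", "x2"),
joined.collect()   // Array(("w1-x1", 1), ("w1-x2", 1)); "t2" has no match and is dropped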
Regarding your code... As far as I understand, emp_new is an RDD[(String, String)]. When it is collected you get an Array[(String, String)]. When you use
((t, w)) <- iter
t is a String, so m.contains(t) will always return false (the array's elements are tuples, not Strings).
Another problem I see is preservesPartitioning = true inside mapPartitions. There are a few possible scenarios:
emp is partitioned and you want joined to be partitioned as well. Since you change the key from t to some new value, partitioning cannot be preserved and the resulting RDD has to be repartitioned. If you use preservesPartitioning = true, the output RDD will end up with wrong partitions.
emp is partitioned but you don't need partitioning for joined. There is no reason to use preservesPartitioning.
emp is not partitioned. Setting preservesPartitioning has no effect.

RDD Product without repeat some tuples

I have the following RDD[String]:
TTT
SSS
AAA
and I am having problems getting the following tuples:
(TTT, SSS)
(TTT, AAA)
(SSS, AAA)
I was doing:
val res = input.cartesian(input).filter{ case (a,b) => a != b }
But the result is:
(TTT,SSS)
(TTT,AAA)
(SSS,TTT)
(SSS,AAA)
(AAA,TTT)
(AAA,SSS)
What is the best way to do that?
You could impose an ordering within the tuple to obtain the combinations:
val res = input.cartesian(input).filter{ case (a,b) => a < b }
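If the values cannot be ordered meaningfully, or duplicates are possible, a variant sketch is to index the elements first and compare the indices instead:

val indexed = input.zipWithIndex                      // (value, index)
val res = indexed.cartesian(indexed)
  .filter { case ((_, i), (_, j)) => i < j }          // keep each unordered pair exactly once
  .map { case ((a, _), (b, _)) => (a, b) }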