Finding average of values against key using RDD in Spark - scala

I have created RDD with first column is Key and rest of columns are values against that key. Every row has a unique key. I want to find average of values against every key. I created Key value pair and tried following code but it is not producing desired results. My code is here.
val rows = 10
val cols = 6
val partitions = 4
lazy val li1 = List.fill(rows,cols)(math.random)
lazy val li2 = (1 to rows).toList
lazy val li = (li1, li2).zipped.map(_ :: _)
val conf = new SparkConf().setAppName("First spark").setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(li,partitions)
val gr = rdd.map( x => (x(0) , x.drop(1)))
val gr1 = gr.values.reduce((x,y) => x.zip(y).map(x => x._1 +x._2 )).foldLeft(0)(_+_)
gr1.take(3).foreach(println)
I want result to be displayed like
1 => 1.1 ,
2 => 2.7
and so on for all keys

First I am not sure what this line is doing,
lazy val li = (li1, li2).zipped.map(_ :: _)
Instead, you could do this,
lazy val li = li2 zip li1
This will create the List of tuples of the type (Int, List[Double]).
And the solution to find the average values against keys could be as below,
rdd.map{ x => (x._1, x._2.fold(0.0)(_ + _)/x._2.length) }.collect.foreach(x => println(x._1+" => "+x._2))

Related

SPARK: sum of elements with this same indexes from RDD[Array[Int]] in spark-rdd

I have three files like:
file1: 1,2,3,4,5
6,7,8,9,10
file2: 11,12,13,14,15
16,17,18,19,20
file3: 21,22,23,24,25
26,27,28,29,30
I have to find the sum of rows from each file:
1+2+3+4+5 + 11+12+13+14+15 + 21+21+23+24+25
6+7+8+9+10 + 16+17+18+19+20 + 26+27+28+29+30
I have written following code in spark-scala to get the Array of sum of all the rows:
val filesRDD = sc.wholeTextFiles("path to folder\\numbers\\*")
// creating RDD[Array[String]]
val linesRDD = filesRDD.map(elem => elem._2.split("\\n"))
// creating RDD[Array[Array[Int]]]
val rdd1 = linesRDD.map(line => line.map(str => str.split(",").map(_.trim.toInt)))
// creating RDD[Array[Int]]
val rdd2 = rdd1.map(elem => elem.map(e => e.sum))
rdd2.collect.foreach(elem => println(elem.mkString(",")))
the output I am getting is:
15,40
65,90
115,140
What I want is to sum 15+65+115 and 40+90+140
Any help is appreciated!
PS:
the files can have different no. of lines like some with 3 lines other with 4 and there can be any no. of files.
I want to do this using rdds only not dataframes.
You can use reduce to sum up the arrays:
val result = rdd2.reduce((x,y) => (x,y).zipped.map(_ + _))
// result: Array[Int] = Array(195, 270)
and if the files are of different length (e.g. file 3 has only one line 21,22,23,24,25)
val result = rdd2.reduce((x,y) => x.zipAll(y,0,0).map{case (a, b) => a + b})

Scala -Spark - Iterate over joined pair RDD

I am trying to join 2 PairRDD in spark and not sure how to iterate over the result.
val input1 = sc.textFile(inputFile1)
val input2 = sc.textFile(inputFile2)
val pairs = input1.map(x => (x.split("\\|")(18),x))
val groupPairs = pairs.groupByKey()
val staPairs = input2.map(y => (y.split("\\|")(0),y))
val stagroupPairs = staPairs.groupByKey()
val finalJoined = groupPairs.leftOuterJoin(stagroupPairs)
finalJoined is of type finalJoined:
org.apache.spark.rdd.RDD[(String, (Iterable[String], Option[Iterable[String]]))]
When I do finalJoined.collect().foreach(println) I see the below output :
(key1,(CompactBuffer(val1a,val1b),Some(CompactBuffer(val1)))
(key2,(CompactBuffer(val2a,val2b),Some(CompactBuffer(val2)))
I would like the output to be
for key1
val1a+"|"+val1
val1b+"|"+val1
for key2
val2a+"|"+val2
avoid groupByKey step on both the rdds and perform join on directly pairs and starpairs..you will get the desired result.
For e.g,
val rdd1 = sc.parallelize(Array("key1,val1a","key1,val1b","key2,val2a","key2,val2b").toSeq)
val rdd2 = sc.parallelize(Array("key1,val1","key2,val2").toSeq)
val pairs= rdd1.map(_.split(",")).map(x => (x(0),x(1)))
val starPairs= rdd2.map(_.split(",")).map(x => (x(0),x(1)))
val res = pairs.join(starPairs)
res.foreach(println)
(key1,(val1a,val1))
(key1,(val1b,val1))
(key2,(val2a,val2))
(key2,(val2b,val2))

Spark - calculate max ocurrence per day-event

I have the following RDD[String]:
1:AAAAABAAAAABAAAAABAAABBB
2:BBAAAAAAAAAABBAAAAAAAAAA
3:BBBBBBBBAAAABBAAAAAAAAAA
The first number is supposed to be days and the following characters are events.
I have to calculate the day where each event has the maximum occurrence.
The expected result for this dataset should be:
{ "A" -> Day2 , "B" -> Day3 }
(A has repeated 10 times in day2 and b 10 times in day3)
I am splitting the original dataset
val foo = rdd.map(_.split(":")).map(x => (x(0), x(1).split("")) )
What could be the best implementation for count and aggregation?
Any help is appreciated.
This should do the trick:
import org.apache.spark.sql.functions._
val rdd = sqlContext.sparkContext.makeRDD(Seq(
"1:AAAAABAAAAABAAAAABAAABBB",
"2:BBAAAAAAAAAABBAAAAAAAAAA",
"3:BBBBBBBBAAAABBAAAAAAAAAA"
))
val keys = Seq("A", "B")
val seqOfMaps: RDD[(String, Map[String, Int])] = rdd.map{str =>
val split = str.split(":")
(s"Day${split.head}", split(1).groupBy(a => a.toString).mapValues(_.length))
}
keys.map{key => {
key -> seqOfMaps.mapValues(_.get(key).get).sortBy(a => -a._2).first._1
}}.toMap
The processing that need to be done consist in transforming the data into a rdd that is easy to apply on functions like: find the maximum for a list
I will try to explain step by step
I've used dummy data for "A" and "B" chars.
The foo rdd is the first step it will give you RDD[(String, Array[String])]
Let's extract each char for the Array[String]
val res3 = foo.map{case (d,s)=> (d, s.toList.groupBy(c => c).map{case (x, xs) => (x, xs.size)}.toList)}
(1,List((A,18), (B,6)))
(2,List((A,20), (B,4)))
(3,List((A,14), (B,10)))
Next we will flatMap over values to expand our rdd by char
res3.flatMapValues(list => list)
(3,(A,14))
(3,(B,10))
(1,(A,18))
(2,(A,20))
(2,(B,4))
(1,(B,6))
Rearrange the rdd in order to look better
res5.map{case (d, (s, c)) => (s, c, d)}
(A,20,2)
(B,4,2)
(A,18,1)
(B,6,1)
(A,14,3)
(B,10,3)
Now we are groupy by char
res7.groupBy(_._1)
(A,CompactBuffer((A,18,1), (A,20,2), (A,14,3)))
(B,CompactBuffer((B,6,1), (B,4,2), (B,10,3)))
Finally we are taking the maxium count for each row
res9.map{case (s, list) => (s, list.maxBy(_._2))}
(B,(B,10,3))
(A,(A,20,2))
Hope this help
Previous answers are good, but I prefer such solution:
val data = Seq(
"1:AAAAABAAAAABAAAAABAAABBB",
"2:BBAAAAAAAAAABBAAAAAAAAAA",
"3:BBBBBBBBAAAABBAAAAAAAAAA"
)
val initialRDD = sparkContext.parallelize(data)
// to tuples like (1,'A',18)
val charCountRDD = initialRDD.flatMap(s => {
val parts = s.split(":")
val charCount = parts(1).groupBy(i => i).mapValues(_.length)
charCount.map(i => (parts(0), i._1, i._2))
})
// group by character, and take max value from grouped collection
val result = charCountRDD.groupBy(i => i._2).map(k => k._2.maxBy(z => z._3))
result.foreach(println(_))
Result is:
(3,B,10)
(2,A,20)

Spark Scala: Split each line between multiple RDDs

I have a file on HDFS in the form of:
61,139,75
63,140,77
64,129,82
68,128,56
71,140,47
73,141,38
75,128,59
64,129,61
64,129,80
64,129,99
I create an RDD from it and and zip the elements with their index:
val data = sc.textFile("hdfs://localhost:54310/usrp/sample.txt")
val points = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val indexed = points.zipWithIndex()
val indexedData = indexed.map{case (value,index) => (index,value)}
Now I need to create rdd1 with the index and the first two elements of each line. Then need to create rdd2 with the index and third element of each row. I am new to Scala, can you please help me with how to do this ?
This does not work since y is not of type Vector but org.apache.spark.mllib.linalg.Vector
val rdd1 = indexedData.map{case (x,y) => (x,y.take(2))}
Basically how to get he first two elements of such a vector ?
Thanks.
You can make use of DenseVector's unapply method to get the underlying Array[Double] in your pattern-matching, and then call take/drop on the Array, re-wrapping it with a Vector:
val rdd1 = indexedData.map { case (i, DenseVector(arr)) => (i, Vectors.dense(arr.take(2))) }
val rdd2 = indexedData.map { case (i, DenseVector(arr)) => (i, Vectors.dense(arr.drop(2))) }
As you can see - this means the original DenseVector you created isn't really that useful, so if you're not going to use indexedData anywhere else, it might be better to create indexedData as a RDD[(Long, Array[Double])] in the first place:
val points = data.map(s => s.split(',').map(_.toDouble))
val indexedData: RDD[(Long, Array[Double])] = points.zipWithIndex().map(_.swap)
val rdd1 = indexedData.mapValues(arr => Vectors.dense(arr.take(2)))
val rdd2 = indexedData.mapValues(arr => Vectors.dense(arr.drop(2)))
Last tip: you probably want to call .cache() on indexedData before scanning it twice to createrdd1 and rdd2 - otherwise the file will be loaded and parsed twice.
You can achieve the above output by following the below steps:
Original Data:
indexedData.foreach(println)
(0,[61.0,139.0,75.0])
(1,[63.0,140.0,77.0])
(2,[64.0,129.0,82.0])
(3,[68.0,128.0,56.0])
(4,[71.0,140.0,47.0])
(5,[73.0,141.0,38.0])
(6,[75.0,128.0,59.0])
(7,[64.0,129.0,61.0])
(8,[64.0,129.0,80.0])
(9,[64.0,129.0,99.0])
RRD1 Data:
Having index along with first two elements of each line.
val rdd1 = indexedData.map{case (x,y) => (x, (y.toArray(0), y.toArray(1)))}
rdd1.foreach(println)
(0,(61.0,139.0))
(1,(63.0,140.0))
(2,(64.0,129.0))
(3,(68.0,128.0))
(4,(71.0,140.0))
(5,(73.0,141.0))
(6,(75.0,128.0))
(7,(64.0,129.0))
(8,(64.0,129.0))
(9,(64.0,129.0))
RRD2 Data:
Having index along with third element of row.
val rdd2 = indexedData.map{case (x,y) => (x, y.toArray(2))}
rdd2.foreach(println)
(0,75.0)
(1,77.0)
(2,82.0)
(3,56.0)
(4,47.0)
(5,38.0)
(6,59.0)
(7,61.0)
(8,80.0)
(9,99.0)

How to copy matrix to column array

I'm trying to copy a column of a matrix into an array, also I want to make this matrix public.
Heres my code:
val years = Array.ofDim[String](1000, 1)
val bufferedSource = io.Source.fromFile("Top_1_000_Songs_To_Hear_Before_You_Die.csv")
val i=0;
//println("THEME, TITLE, ARTIST, YEAR, SPOTIFY_URL")
for (line <- bufferedSource.getLines) {
val cols = line.split(",").map(_.trim)
years(i)=cols(3)(i)
}
I want the cols to be a global matrix and copy the column 3 to years, because of the method of that I get cols I dont know how to define it
There're three different problems in your attempt:
Your regexp will fail for this dataset. I suggest you change it to:
val regex = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))"
This will capture the blocks wrapped in double quotes but containing commas (courtesy of Luke Sheppard on regexr)
This val i=0; is not very scala-ish / functional. We can replace it by a zipWithIndex in the for comprehension:
for ((line, count) <- bufferedSource.getLines.zipWithIndex)
You can create the "global matrix" by extracting elements from each line (val Array (...)) and returning them as the value of the for-comprehension block (yield):
It looks like that:
for ((line, count) <- bufferedSource.getLines.zipWithIndex) yield {
val Array(theme,title,artist,year,spotify_url) = line....
...
(theme,title,artist,year,spotify_url)
}
And here is the complete solution:
val bufferedSource = io.Source.fromFile("/tmp/Top_1_000_Songs_To_Hear_Before_You_Die.csv")
val years = Array.ofDim[String](1000, 1)
val regex = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))"
val iteratorMatrix = for ((line, count) <- bufferedSource.getLines.zipWithIndex) yield {
val Array(theme,title,artist,year,spotify_url) = line.split(regex, -1).map(_.trim)
years(count) = Array(year)
(theme,title,artist,year,spotify_url)
}
// will actually consume the iterator and fill in globalMatrix AND years
val globalMatrix = iteratorMatrix.toList
Here's a function that will get the col column from the CSV. There is no error handling here for any empty row or other conditions. This is a proof of concept so add your own error handling as you see fit.
val years = (fileName: String, col: Int) => scala.io.Source.fromFile(fileName)
.getLines()
.map(_.split(",")(col).trim())
Here's a suggestion if you are looking to keep the contents of the file in a map. Again there's no error handling just proof of concept.
val yearColumn = 3
val fileName = "Top_1_000_Songs_To_Hear_Before_You_Die.csv"
def lines(file: String) = scala.io.Source.fromFile(file).getLines()
val mapRow = (row: String) => row.split(",").zipWithIndex.foldLeft(Map[Int, String]()){
case (acc, (v, idx)) => acc.updated(idx,v.trim())}
def mapColumns = (values: Iterator[String]) =>
values.zipWithIndex.foldLeft(Map[Int, Map[Int, String]]()){
case (acc, (v, idx)) => acc.updated(idx, mapRow(v))}
val parser = lines _ andThen mapColumns
val matrix = parser(fileName)
val years = matrix.flatMap(_.swap._1.get(yearColumn))
This will build a Map[Int,Map[Int, String]] which you can use elsewhere. The first index of the map is the row number and the index of the inner map is the column number. years is an Iterable[String] that contains the year values.
Consider adding contents to a collection at the same time as it is created, in contrast to allocate space first and then update it; for instance like this,
val rawSongsInfo = io.Source.fromFile("Top_Songs.csv").getLines
val cols = for (rsi <- rawSongsInfo) yield rsi.split(",").map(_.trim)
val years = cols.map(_(3))