Scala - Spark - Iterate over joined pair RDD

I am trying to join two pair RDDs in Spark and am not sure how to iterate over the result.
val input1 = sc.textFile(inputFile1)
val input2 = sc.textFile(inputFile2)
val pairs = input1.map(x => (x.split("\\|")(18),x))
val groupPairs = pairs.groupByKey()
val staPairs = input2.map(y => (y.split("\\|")(0),y))
val stagroupPairs = staPairs.groupByKey()
val finalJoined = groupPairs.leftOuterJoin(stagroupPairs)
finalJoined is of type:
org.apache.spark.rdd.RDD[(String, (Iterable[String], Option[Iterable[String]]))]
When I do finalJoined.collect().foreach(println) I see the output below:
(key1,(CompactBuffer(val1a, val1b),Some(CompactBuffer(val1))))
(key2,(CompactBuffer(val2a, val2b),Some(CompactBuffer(val2))))
I would like the output to be
for key1
val1a+"|"+val1
val1b+"|"+val1
for key2
val2a+"|"+val2

Avoid the groupByKey step on both RDDs and perform the join directly on pairs and staPairs; you will get the desired result.
For example:
val rdd1 = sc.parallelize(Array("key1,val1a","key1,val1b","key2,val2a","key2,val2b").toSeq)
val rdd2 = sc.parallelize(Array("key1,val1","key2,val2").toSeq)
val pairs= rdd1.map(_.split(",")).map(x => (x(0),x(1)))
val starPairs= rdd2.map(_.split(",")).map(x => (x(0),x(1)))
val res = pairs.join(starPairs)
res.foreach(println)
(key1,(val1a,val1))
(key1,(val1b,val1))
(key2,(val2a,val2))
(key2,(val2b,val2))
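Applied back to the original pipe-delimited inputs, a minimal sketch (assuming the same key positions, column 18 in input1 and column 0 in input2) that produces the requested "left|right" lines:
val pairs    = input1.map(x => (x.split("\\|")(18), x))
val staPairs = input2.map(y => (y.split("\\|")(0), y))
// the join pairs each left record with every matching right record, so no groupByKey is needed
val joined = pairs.join(staPairs)                              // RDD[(String, (String, String))]
joined.map { case (_, (left, right)) => left + "|" + right }
      .collect()
      .foreach(println)                                        // lines like val1a|val1, val1b|val1, ...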

Related

How to apply k-means to a Parquet file?

I want to apply k-means to my Parquet file, but an error appears.
The error:
java.lang.ArrayIndexOutOfBoundsException: 2
The code:
val Data = sqlContext.read.parquet("/usr/local/spark/dataset/norm")
val parsedData = Data.rdd.map(s => Vectors.dense(s.getDouble(1),s.getDouble(2))).cache()
import org.apache.spark.mllib.clustering.KMeans
val numClusters = 30
val numIteration = 1
val userClusterModel = KMeans.train(parsedData, numClusters, numIteration)
val userfeature1 = parsedData.first
val userCost = userClusterModel.computeCost(parsedData)
println("WSSSE for users: " + userCost)
How to solve this error?
I believe you are using https://spark.apache.org/docs/latest/mllib-clustering.html#k-means as a reference to build your K-Means model.
In the example
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
data is of type org.apache.spark.rdd.RDD. In your case, sqlContext.read.parquet returns a DataFrame, so you would have to convert the DataFrame to an RDD to perform that kind of row-level operation.
To convert from a DataFrame to an RDD you can use the sample below as a reference:
val rows: RDD[Row] = df.rdd
val parsedData = Data.rdd.map(s => Vectors.dense(s.getInt(0),s.getDouble(1))).cache()
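Putting the two pieces together, a minimal sketch of the corrected pipeline, assuming the Parquet file has at least two Double columns; the column indices (0 and 1) are placeholders, so check Data.printSchema() for your real layout:
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val Data = sqlContext.read.parquet("/usr/local/spark/dataset/norm")
Data.printSchema()                                   // confirm how many columns the rows actually have

// getDouble(2) on a row with only two columns is what raises ArrayIndexOutOfBoundsException: 2,
// so only reference indices that exist
val parsedData = Data.rdd
  .map(row => Vectors.dense(row.getDouble(0), row.getDouble(1)))
  .cache()

val userClusterModel = KMeans.train(parsedData, 30, 1)   // numClusters = 30, numIterations = 1, as in the question
println("WSSSE for users: " + userClusterModel.computeCost(parsedData))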

Finding average of values against key using RDD in Spark

I have created an RDD where the first column is the key and the rest of the columns are values against that key. Every row has a unique key. I want to find the average of the values for every key. I created a key-value pair RDD and tried the following code, but it is not producing the desired results. My code is here:
val rows = 10
val cols = 6
val partitions = 4
lazy val li1 = List.fill(rows,cols)(math.random)
lazy val li2 = (1 to rows).toList
lazy val li = (li1, li2).zipped.map(_ :: _)
val conf = new SparkConf().setAppName("First spark").setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(li,partitions)
val gr = rdd.map( x => (x(0) , x.drop(1)))
val gr1 = gr.values.reduce((x,y) => x.zip(y).map(x => x._1 +x._2 )).foldLeft(0)(_+_)
gr1.take(3).foreach(println)
I want the result to be displayed like
1 => 1.1
2 => 2.7
and so on for all keys
First, I am not sure what this line is doing:
lazy val li = (li1, li2).zipped.map(_ :: _)
Instead, you could do this:
lazy val li = li2 zip li1
This will create a List of tuples of type (Int, List[Double]).
And the solution to find the average of the values per key could be as below:
rdd.map{ x => (x._1, x._2.fold(0.0)(_ + _)/x._2.length) }.collect.foreach(x => println(x._1+" => "+x._2))
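For completeness, a self-contained sketch using the li2 zip li1 construction suggested above (same toy data as the question, so the averages will differ per run):
val rows = 10
val cols = 6
val partitions = 4
val li1 = List.fill(rows, cols)(math.random)       // values
val li2 = (1 to rows).toList                       // keys
val li  = li2 zip li1                              // List[(Int, List[Double])]

val rdd = sc.parallelize(li, partitions)
val averages = rdd.map { case (key, values) => (key, values.sum / values.length) }
averages.collect().foreach { case (k, avg) => println(k + " => " + avg) }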

Multiplication of Double values in Scala

I want to multiply two sparse matrices in Spark using Scala. I am passing these matrices as arguments and storing the result in another argument.
The matrices are text files where each matrix element is represented as: row, column, value.
I am not able to multiply two Double values in Scala.
object MultiplySpark {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Multiply")
    conf.setMaster("local[2]")
    val sc = new SparkContext(conf)

    val M = sc.textFile(args(0)).flatMap(entry => {
      val rec = entry.split(",")
      val row = rec(0).toInt
      val column = rec(1).toInt
      val value = rec(2).toDouble
      for { pointer <- 1 until rec.length } yield ((row, column), value)
    })
    val N = sc.textFile(args(0)).flatMap(entry => {
      val rec = entry.split(",")
      val row = rec(0).toInt
      val column = rec(1).toInt
      val value = rec(2).toDouble
      for { pointer <- 1 until rec.length } yield ((row, column), value)
    })

    val Mmap = M.map(e => (e._2, e))
    val Nmap = N.map(d => (d._2, d))

    val MNjoin = Mmap.join(Nmap).map { case (k, (e, d)) => e._2.toDouble + "," + d._2.toDouble }

    val result = MNjoin.reduceByKey((a, b) => a * b)
      .map(entry => {
        ((entry._1._1, entry._1._2), entry._2)
      })
      .reduceByKey((a, b) => a + b)

    result.saveAsTextFile(args(2))
    sc.stop()
  }
}
How can I multiply Double values in Scala?
Please note: I tried a.toDouble * b.toDouble
The error is: value * is not a member of (Double, Double)
This reduceByKey would work if you had an RDD[((Int, Int), Double)] (or, more generally, an RDD[(SomeType, Double)]), but join gives you an RDD[((Int, Int), (Double, Double))]. So you are trying to multiply pairs of type (Double, Double), not Doubles.
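For reference, a hedged sketch of the usual join-based sparse multiply, summing M(i,k) * N(k,j) over k. It assumes the second matrix really comes from args(1) (the question reads args(0) twice) and the same row,column,value input format:
// Parse each line into (row, column, value); the extra for-loop in the question is unnecessary
def parse(path: String) = sc.textFile(path).map { line =>
  val rec = line.split(",")
  (rec(0).toInt, rec(1).toInt, rec(2).toDouble)
}

val M = parse(args(0))                                    // entries (i, k, value)
val N = parse(args(1))                                    // entries (k, j, value) -- assumed path

// Key M by its column index and N by its row index so matching k's meet in the join
val Mbyk = M.map { case (i, k, v) => (k, (i, v)) }
val Nbyk = N.map { case (k, j, v) => (k, (j, v)) }

// Multiply the two Double components of the joined pair, then sum all products per output cell
val result = Mbyk.join(Nbyk)
  .map { case (_, ((i, mv), (j, nv))) => ((i, j), mv * nv) }
  .reduceByKey(_ + _)

result.saveAsTextFile(args(2))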

Spark refuses to zip RDDs [duplicate]

This question already has answers here:
Can only zip RDDs with same number of elements in each partition despite repartition
(3 answers)
Closed 6 years ago.
I get the following exception on the last line when running the code below with Spark:
org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition
val rdd1 = anRDD
val rdd2 = AnotherRDD
println(rdd1.count() == rdd2.count()) // prints true
val nparts = rdd1.getNumPartitions + rdd2.getNumPartitions
val rdd1Bis = rdd1.repartition(nparts) // Try to repartition (useless)
val rdd2Bis = rdd2.repartition(nparts)
val zipped = rdd1Bis.zip(rdd2Bis)
println(zipped.count())
What is wrong?
PS: it works if I collect rdd1 and rdd2 before zipping, but I need to keep them as RDDs.
A solution could be to zip with a join:
val rdd1Bis = rdd1.zipWithIndex.map((x) =>(x._2, x._1))
val rdd2Bis = rdd2.zipWithIndex.map((x) =>(x._2, x._1))
val zipped = rdd1Bis.join(rdd2Bis).map(x => x._2)
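One caveat with this approach: join does not preserve the original positional order. If the order matters, a small sketch that keeps the index as the key and sorts before dropping it:
val zippedInOrder = rdd1.zipWithIndex.map(_.swap)
  .join(rdd2.zipWithIndex.map(_.swap))
  .sortByKey()          // restore the original element order
  .values               // RDD of pairs, the same pairing zip would have produced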
It works, check this (please reply with which part fails for you):
val list1 = List("a","b","c","d")
val list2 = List("a","b","c","d")
val rdd1 = sc.parallelize(list1)
val rdd2 = sc.parallelize(list2)
Executing your code:
val nparts = rdd1.getNumPartitions + rdd2.getNumPartitions
val rdd1Bis = rdd1.repartition(nparts) // Try to repartition (useless)
val rdd2Bis = rdd2.repartition(nparts)
val zipped = rdd1Bis.zip(rdd2Bis)
Result:
println(zipped.count())
4
zipped.foreach(println)
(a,a)
(b,b)
(c,c)
(d,d)

How to change RowMatrix into Array in Spark or export it as a CSV?

I've got this code in Scala:
val mat: CoordinateMatrix = new CoordinateMatrix(data)
val rowMatrix: RowMatrix = mat.toRowMatrix()
val svd: SingularValueDecomposition[RowMatrix, Matrix] = rowMatrix.computeSVD(100, computeU = true)
val U: RowMatrix = svd.U // The U factor is a RowMatrix.
val S: Vector = svd.s // The singular values are stored in a local dense vector.
val V: Matrix = svd.V // The V factor is a local dense matrix.
val uArray: Array[Double] = U.toArray // doesn't work, because there is no toArray method on RowMatrix
val sArray: Array[Double] = S.toArray // works good
val vArray: Array[Double] = V.toArray // works good
How can I convert U into uArray or a similar type that could be written out to a CSV file?
That's a basic operation; here is what you have to do, considering that U is a RowMatrix:
val U = svd.U
rows is a RowMatrix method that gives you the matrix contents as an RDD[Vector], one entry per row.
You just need to call rows on your RowMatrix and map each Vector to an Array, which you then concatenate into a string, giving an RDD[String].
val rdd = U.rows.map( x => x.toArray.mkString(","))
All you have to do now is save the RDD:
rdd.saveAsTextFile(path)
It works:
import java.io.PrintWriter
import org.apache.spark.rdd.RDD

// Collects the RDD to the driver and writes one line per row of the matrix
def exportRowMatrix(matrix: RDD[String], fileName: String) = {
  val pw = new PrintWriter(fileName)
  matrix.collect().foreach(line => pw.println(line))
  pw.flush()
  pw.close()
}
val rdd = U.rows.map( x => x.toArray.mkString(","))
exportRowMatrix(rdd, "U.csv")
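If collecting to the driver is undesirable, coalescing before the save is a common alternative (a sketch, assuming U is small enough to fit in a single partition; "U_csv" is a placeholder output directory):
U.rows.map(_.toArray.mkString(","))
  .coalesce(1)                 // write a single part file instead of one per partition
  .saveAsTextFile("U_csv")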