Spark refuses to zip RDDs [duplicate] - scala

This question already has answers here:
Can only zip RDDs with same number of elements in each partition despite repartition
(3 answers)
I get the following exception on the last line when running the code below with Spark:
org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition
val rdd1 = anRDD
val rdd2 = AnotherRDD
println(rdd1.count() == rdd2.count()) // prints true
val nparts = rdd1.getNumPartitions + rdd2.getNumPartitions
val rdd1Bis = rdd1.repartition(nparts) // Try to repartition (useless)
val rdd2Bis = rdd2.repartition(nparts)
val zipped = rdd1Bis.zip(rdd2Bis)
println(zipped.count())
What is wrong?
PS: it works if I collect rdd1 and rdd2 before zipping, but I need to keep them as RDDs.

zip requires both RDDs to have the same number of elements in every partition, and repartition does not guarantee that, even when the total counts match. A solution could be to emulate the zip with zipWithIndex and a join:
val rdd1Bis = rdd1.zipWithIndex.map(x => (x._2, x._1))
val rdd2Bis = rdd2.zipWithIndex.map(x => (x._2, x._1))
val zipped = rdd1Bis.join(rdd2Bis).map(x => x._2)
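Note (an extension to the answer above, not part of the original): join does not preserve the original order. If order matters, one option is to sort on the zipWithIndex key before dropping it:
// Restore the original order by sorting on the index key, then drop the key
val zippedInOrder = rdd1Bis.join(rdd2Bis).sortByKey().values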

It works, check this (please reply with which part fails for you):
val list1 = List("a","b","c","d")
val list2 = List("a","b","c","d")
val rdd1 = sc.parallelize(list1)
val rdd2 = sc.parallelize(list2)
Executing your code:
val nparts = rdd1.getNumPartitions + rdd2.getNumPartitions
val rdd1Bis = rdd1.repartition(nparts) // Try to repartition (useless)
val rdd2Bis = rdd2.repartition(nparts)
val zipped = rdd1Bis.zip(rdd2Bis)
Result:
println(zipped.count())
4
zipped.foreach(println)
(a,a)
(b,b)
(c,c)
(d,d)

Related

How to apply k-means to a parquet file?

I want to apply k-means to my parquet file, but an error appears.
java.lang.ArrayIndexOutOfBoundsException: 2
Code:
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
val Data = sqlContext.read.parquet("/usr/local/spark/dataset/norm")
val parsedData = Data.rdd.map(s => Vectors.dense(s.getDouble(1),s.getDouble(2))).cache()
val numClusters = 30
val numIteration = 1
val userClusterModel = KMeans.train(parsedData, numClusters, numIteration)
val userfeature1 = parsedData.first
val userCost = userClusterModel.computeCost(parsedData)
println("WSSSE for users: " + userCost)
How to solve this error?
I believe you are using https://spark.apache.org/docs/latest/mllib-clustering.html#k-means as a reference to build your K-Means model.
In the example
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
data is of type org.apache.spark.rdd.RDD, whereas in your case sqlContext.read.parquet returns a DataFrame. Therefore you have to convert the DataFrame to an RDD to perform that kind of row-level operation.
To convert from a DataFrame to an RDD you can use the sample below as a reference:
val rows: RDD[Row] = df.rdd
val parsedData = Data.rdd.map(s => Vectors.dense(s.getInt(0),s.getDouble(1))).cache()
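Since the exception is an ArrayIndexOutOfBoundsException: 2, it is also worth confirming that the rows really have a column at each index you access. A minimal sketch for inspecting the data first (assuming Data is the DataFrame loaded above and every column is numeric):
// Inspect the schema and a few rows to see how many columns actually exist
Data.printSchema()
Data.show(5)
// Build vectors from all columns, assuming they are all Doubles
val parsedData = Data.rdd
  .map(row => Vectors.dense((0 until row.length).map(i => row.getDouble(i)).toArray))
  .cache()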

Finding average of values against key using RDD in Spark

I have created an RDD whose first column is a key and the remaining columns are values for that key. Every row has a unique key. I want to find the average of the values for every key. I created a key-value pair RDD and tried the following code, but it is not producing the desired results. My code is here:
val rows = 10
val cols = 6
val partitions = 4
lazy val li1 = List.fill(rows,cols)(math.random)
lazy val li2 = (1 to rows).toList
lazy val li = (li1, li2).zipped.map(_ :: _)
val conf = new SparkConf().setAppName("First spark").setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(li,partitions)
val gr = rdd.map( x => (x(0) , x.drop(1)))
val gr1 = gr.values.reduce((x,y) => x.zip(y).map(x => x._1 +x._2 )).foldLeft(0)(_+_)
gr1.take(3).foreach(println)
I want the result to be displayed like:
1 => 1.1 ,
2 => 2.7
and so on for all keys
First, I am not sure what this line is doing:
lazy val li = (li1, li2).zipped.map(_ :: _)
Instead, you could do this:
lazy val li = li2 zip li1
This will create a List of tuples of type (Int, List[Double]).
And the solution to find the average of the values for each key could be as below:
rdd.map { x => (x._1, x._2.fold(0.0)(_ + _) / x._2.length) }
  .collect
  .foreach(x => println(x._1 + " => " + x._2))
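Equivalently (a small variation, not from the original answer), assuming rdd was built from the zipped list suggested above so its values are plain List[Double]s, mapValues reads a bit more directly:
// Average each key's list of values without touching the keys
rdd.mapValues(values => values.sum / values.length)
  .collect()
  .foreach { case (k, avg) => println(s"$k => $avg") }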

Scala - Spark - Iterate over joined pair RDD

I am trying to join two PairRDDs in Spark and am not sure how to iterate over the result.
val input1 = sc.textFile(inputFile1)
val input2 = sc.textFile(inputFile2)
val pairs = input1.map(x => (x.split("\\|")(18),x))
val groupPairs = pairs.groupByKey()
val staPairs = input2.map(y => (y.split("\\|")(0),y))
val stagroupPairs = staPairs.groupByKey()
val finalJoined = groupPairs.leftOuterJoin(stagroupPairs)
finalJoined is of type:
org.apache.spark.rdd.RDD[(String, (Iterable[String], Option[Iterable[String]]))]
When I do finalJoined.collect().foreach(println) I see the output below:
(key1,(CompactBuffer(val1a,val1b),Some(CompactBuffer(val1))))
(key2,(CompactBuffer(val2a,val2b),Some(CompactBuffer(val2))))
I would like the output to be
for key1
val1a+"|"+val1
val1b+"|"+val1
for key2
val2a+"|"+val2
val2b+"|"+val2
Avoid the groupByKey step on both RDDs and perform the join directly on pairs and staPairs; you will get the desired result.
For example:
val rdd1 = sc.parallelize(Array("key1,val1a","key1,val1b","key2,val2a","key2,val2b").toSeq)
val rdd2 = sc.parallelize(Array("key1,val1","key2,val2").toSeq)
val pairs = rdd1.map(_.split(",")).map(x => (x(0),x(1)))
val starPairs = rdd2.map(_.split(",")).map(x => (x(0),x(1)))
val res = pairs.join(starPairs)
res.foreach(println)
(key1,(val1a,val1))
(key1,(val1b,val1))
(key2,(val2a,val2))
(key2,(val2b,val2))
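To get exactly the pipe-separated strings from the question, a small follow-up sketch (not part of the original answer) built on res:
// Format each joined pair as "leftValue|rightValue"
val formatted = res.map { case (_, (left, right)) => left + "|" + right }
formatted.foreach(println)
// prints (in some order): val1a|val1, val1b|val1, val2a|val2, val2b|val2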

Join per line two different RDDs in just one - Scala

I'm programming a K-means algorithm in Spark-Scala.
My model predicts which cluster each point is in.
Data
-6.59 -44.68
-35.73 39.93
47.54 -52.04
23.78 46.82
....
Load the data
val data = sc.textFile("/home/borja/flink/kmeans/points")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
Cluster the data using KMeans
val numClusters = 10
val numIterations = 100
val clusters = KMeans.train(parsedData, numClusters, numIterations)
Predict
val prediction = clusters.predict(parsedData)
However, I need to put the result and the points in the same file, in the following format:
[no title, numberOfCluster (1,2,3,..10), pointX, pointY]:
6 -6.59 -44.68
8 -35.73 39.93
10 47.54 -52.04
7 23.78 46.82
This is the input to a Python executable that prints the result nicely.
But my best effort has got just this
(you can check that the first numbers are wrong: 68, 384, ...):
var i = 0
val c = sc.parallelize(data.collect().map(x => {
  val tuple = (i, x)
  i += 1
  tuple
}))
i = 0
val c2 = sc.parallelize(prediction.collect().map(x => {
  val tuple = (i, x)
  i += 1
  tuple
}))
val result = c.join(c2)
result.take(5)
Result:
res94: Array[(Int, (String, Int))] = Array((68,(17.79 13.69,0)), (384,(-33.47 -4.87,8)), (440,(-4.75 -42.21,1)), (4,(-33.31 -13.11,6)), (324,(-39.04 -16.68,6)))
Thanks for your help! :)
I don't have a Spark cluster handy to test, but something like this should work:
val result = parsedData.map { v =>
  val cluster = clusters.predict(v)
  s"$cluster ${v(0)} ${v(1)}"
}
result.saveAsTextFile("/some/output/path")
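If you prefer to keep the separate prediction RDD from the question, another option (a sketch, not from the original answer) is to zip the points with their predictions; clusters.predict(parsedData) is a one-to-one map over parsedData, so the two RDDs line up partition by partition:
// Pair each point with its predicted cluster, then format one line per point
val zipped = parsedData.zip(clusters.predict(parsedData))
val lines = zipped.map { case (v, cluster) => s"$cluster ${v(0)} ${v(1)}" }
lines.saveAsTextFile("/some/output/path")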

Efficient countByValue of each column Spark Streaming

I want to find countByValue for each column in my data. I can find countByValue() for each column (e.g. 2 columns now) in a basic batch RDD as follows:
scala> val double = sc.textFile("double.csv")
scala> val counts = sc.parallelize((0 to 1).map(index => {
double.map(x=> { val token = x.split(",")
(math.round(token(index).toDouble))
}).countByValue()
}))
scala> counts.take(2)
res20: Array[scala.collection.Map[Long,Long]] = Array(Map(2 -> 5, 1 -> 5), Map(4 -> 5, 5 -> 5))
Now I want to do the same with DStreams. I have a windowedDStream and want to countByValue on each column. My data has 50 columns. I have done it as follows:
val windowedDStream = myDStream.window(Seconds(2), Seconds(2)).cache()
ssc.sparkContext.parallelize((0 to 49).map(index => {
  val counts = windowedDStream.map(x => { val token = x.split(",")
    (math.round(token(index).toDouble))
  }).countByValue()
  counts.print()
}))
val topCounts = counts.map . . . . will not work
I get correct results with this; the only issue is that I want to apply more operations on counts, and they are not available outside the map.
You misunderstand what parallelize does. You think that when you give it a Seq of two elements, those two elements will be calculated in parallel. That is not the case, and it would be impossible for it to be the case.
What parallelize actually does is it creates an RDD from the Seq that you provided.
To try to illuminate this, consider that this:
val countsRDD = sc.parallelize((0 to 1).map { index =>
  double.map { x =>
    val token = x.split(",")
    math.round(token(index).toDouble)
  }.countByValue()
})
is equivalent to this:
val counts = (0 to 1).map { index =>
  double.map { x =>
    val token = x.split(",")
    math.round(token(index).toDouble)
  }.countByValue()
}
val countsRDD = sc.parallelize(counts)
By the time parallelize runs, the work has already been performed. parallelize cannot retroactively make it so that the calculation happened in parallel.
The solution to your problem is to not use parallelize. It is entirely pointless.
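For example (a sketch of the kind of follow-up the question asks about, not part of the original answer), once counts is an ordinary local Seq of maps you can post-process it directly on the driver:
// counts: Seq[scala.collection.Map[Long, Long]], one map per column
// Top 3 most frequent values per column
val topCounts = counts.map(columnCounts => columnCounts.toSeq.sortBy(-_._2).take(3))
topCounts.zipWithIndex.foreach { case (top, column) =>
  println(s"column $column: ${top.mkString(", ")}")
}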