Shuffling elements of an RDD[List[Double]] in Spark - scala

In a program I am developing with Spark 2.3 in Scala, I have an RDD[List[Double]]. Every List[Double] has the same size. I can't figure out how to perform a transformation that, given the RDD
[1.0, 1.5, 4.0, 3.0],
[2.3, 5.6, 3.4, 9.0],
[4.5, 2.0, 1.0, 5.7]
transforms it into the RDD
[2.3, 2.0, 1.0, 3.0],
[1.0, 5.6, 4.0, 5.7],
[4.5, 1.5, 3.4, 9.0]
where the elements are shuffled across the lists while each keeps its position within the list.
For example, the first element of the first list moves to the first position of the second list, the second element of the first list moves to the second position of the third list, and so on.
Thanks a lot.

One approach to shuffling column-wise would be to break the dataset down into individual single-column DataFrames, shuffle each of them using orderBy(rand), and then piece them back together.
To join the shuffled DataFrames, RDD zipWithIndex is applied to each of them to create row-identifying ids. Note that monotonically_increasing_id won't cut it, as it doesn't guarantee generating the same list of ids needed for the final join. This makes the approach rather expensive due to the required conversions between RDD and DataFrame.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val rdd0 = sc.parallelize(Seq(
  List(1.0, 1.5, 4.0, 3.0),
  List(2.3, 5.6, 3.4, 9.0),
  List(4.5, 2.0, 1.0, 5.7)
))
// rdd0: org.apache.spark.rdd.RDD[List[Double]] = ...
val rdd = rdd0.map{ x => (x(0), x(1), x(2), x(3)) }
val df = rdd.toDF("c1", "c2", "c3", "c4")
val shuffledDFs = df.columns.filter(_.startsWith("c")).map{ c =>
  val subDF = df.select(c)
  val subRDD = subDF.orderBy(rand).rdd.zipWithIndex.map{
    case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
  }
  spark.createDataFrame( subRDD,
    StructType(subDF.schema.fields :+ StructField("idx", LongType, false))
  )
}
shuffledDFs.reduce( _.join(_, Seq("idx")) ).show
// +---+---+---+---+---+
// |idx| c1| c2| c3| c4|
// +---+---+---+---+---+
// | 0|2.3|2.0|4.0|9.0|
// | 1|1.0|5.6|3.4|3.0|
// | 2|4.5|1.5|1.0|5.7|
// +---+---+---+---+---+
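As a hedged alternative sketch (not part of the original answer) that avoids the RDD/DataFrame round trips, you can stay in the RDD API: shuffle each column independently with random sort keys, number the shuffled values with zipWithIndex, and join the columns back by row index. numCols, shuffledCols and shuffledRdd are illustrative names, and the list length is assumed to be known up front.
import scala.util.Random

val numCols = 4  // all lists are assumed to have this known length

// shuffle each column independently, then tag each shuffled value with a row index
val shuffledCols = (0 until numCols).map { i =>
  rdd0.map(row => (Random.nextDouble(), row(i)))  // attach a random sort key
      .sortByKey()
      .values
      .zipWithIndex
      .map { case (v, id) => (id, List(v)) }
}

// join the per-column RDDs on the row index and concatenate the singleton lists
val shuffledRdd = shuffledCols
  .reduce((a, b) => a.join(b).mapValues { case (l, r) => l ::: r })
  .values

shuffledRdd.collect.foreach(println)
// note: cache the keyed RDDs if recomputation (which would redraw the random keys) is a concern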

Related

scala spark UDF ClassCastException : WrappedArray$ofRef cannot be cast to [Lscala.Tuple2

So I perform the necessary imports, etc.:
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types._
import spark.implicits._
then define some lat/long points:
val london = (1.0, 1.0)
val suburbia = (2.0, 2.0)
val southampton = (3.0, 3.0)
val york = (4.0, 4.0)
I then create a Spark DataFrame like this and check that it works:
val exampleDF = Seq((List(london,suburbia),List(southampton,york)),
  (List(york,london),List(southampton,suburbia))).toDF("AR1","AR2")
exampleDF.show()
The DataFrame has the following schema:
DataFrame = [AR1: array<struct<_1:double,_2:double>>, AR2: array<struct<_1:double,_2:double>>]
I create a function to build the combinations of points:
// function to do what I want
val latlongexplode = (x: Array[(Double,Double)], y: Array[(Double,Double)]) => {
for (a <- x; b <-y) yield (a,b)
}
I check that the function works
latlongexplode(Array(london,york),Array(suburbia,southampton))
and it does. However, after I create a UDF out of this function
// declare function into a Spark UDF
val latlongexplodeUDF = udf(latlongexplode)
when I try to use it on the Spark DataFrame I created above, like this:
exampleDF.withColumn("latlongexplode", latlongexplodeUDF($"AR1",$"AR2")).show(false)
I get a really long stacktrace which basically boils down to:
java.lang.ClassCastException:
scala.collection.mutable.WrappedArray$ofRef cannot be cast to
[Lscala.Tuple2;
org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$3(ScalaUDF.scala:121)
org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1063)
org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:151)
org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:50)
org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:32)
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:273)
How can I get this UDF to work in Scala Spark? (I'm using 2.4 at the moment, if that helps.)
EDIT: it could be that the way I construct my example df has an issue, but my actual data is an array (of unknown size) of lat/long tuples in each column.
When working with struct types in a UDF, they are represented as Row objects, and array columns are represented as Seq. You also need to return structs as Rows, and you need to define an explicit schema for the returned struct.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val london = (1.0, 1.0)
val suburbia = (2.0, 2.0)
val southampton = (3.0, 3.0)
val york = (4.0, 4.0)
val exampleDF = Seq((List(london,suburbia),List(southampton,york)),
  (List(york,london),List(southampton,suburbia))).toDF("AR1","AR2")
exampleDF.show(false)
+------------------------+------------------------+
|AR1 |AR2 |
+------------------------+------------------------+
|[[1.0, 1.0], [2.0, 2.0]]|[[3.0, 3.0], [4.0, 4.0]]|
|[[4.0, 4.0], [1.0, 1.0]]|[[3.0, 3.0], [2.0, 2.0]]|
+------------------------+------------------------+
val latlongexplode = (x: Seq[Row], y: Seq[Row]) => {
  for (a <- x; b <- y) yield Row(a, b)
}
val udf_schema = ArrayType(
  StructType(Seq(
    StructField(
      "city1",
      StructType(Seq(
        StructField("lat", DoubleType),   // the input struct fields are Double
        StructField("long", DoubleType)
      ))
    ),
    StructField(
      "city2",
      StructType(Seq(
        StructField("lat", DoubleType),
        StructField("long", DoubleType)
      ))
    )
  ))
)
// include this line if you see errors like
// "You're using untyped Scala UDF, which does not have the input type information."
// spark.sql("set spark.sql.legacy.allowUntypedScalaUDF = true")
val latlongexplodeUDF = udf(latlongexplode, udf_schema)
val result = exampleDF.withColumn("latlongexplode", latlongexplodeUDF($"AR1",$"AR2"))
result.show(false)
+------------------------+------------------------+--------------------------------------------------------------------------------------------------------+
|AR1 |AR2 |latlongexplode |
+------------------------+------------------------+--------------------------------------------------------------------------------------------------------+
|[[1.0, 1.0], [2.0, 2.0]]|[[3.0, 3.0], [4.0, 4.0]]|[[[1.0, 1.0], [3.0, 3.0]], [[1.0, 1.0], [4.0, 4.0]], [[2.0, 2.0], [3.0, 3.0]], [[2.0, 2.0], [4.0, 4.0]]]|
|[[4.0, 4.0], [1.0, 1.0]]|[[3.0, 3.0], [2.0, 2.0]]|[[[4.0, 4.0], [3.0, 3.0]], [[4.0, 4.0], [2.0, 2.0]], [[1.0, 1.0], [3.0, 3.0]], [[1.0, 1.0], [2.0, 2.0]]]|
+------------------------+------------------------+--------------------------------------------------------------------------------------------------------+
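If you would rather not spell out the schema by hand, a hedged alternative sketch (an assumption on my part, not the answer's approach) is to keep the Seq[Row] inputs but return nested Scala tuples, so Spark 2.4 can derive the return schema from the tuple types; latlongexplodeTupleUDF is an illustrative name.
// the struct inputs still arrive as Row; the tuple return type gives Spark the schema
val latlongexplodeTupleUDF = udf { (x: Seq[Row], y: Seq[Row]) =>
  for (a <- x; b <- y) yield (
    (a.getDouble(0), a.getDouble(1)),  // first city as (lat, long)
    (b.getDouble(0), b.getDouble(1))   // second city as (lat, long)
  )
}
exampleDF.withColumn("latlongexplode", latlongexplodeTupleUDF($"AR1", $"AR2")).show(false)
Note that the nested fields then come out named _1/_2 rather than city1/city2, which may or may not matter for your use case.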

Combine two lists with one different element

I'm new to Scala and Spark and I don't know how to do this.
I have preprocessed a CSV file, resulting in an RDD that contains lists with this format:
List("2014-01-01T23:56:06.0", NaN, 1, NaN)
List("2014-01-01T23:56:06.0", NaN, NaN, 2)
All lists have the same number of elements.
What I want to do is combine the lists that have the same first element (the timestamp). For example, I want these two example lists to produce a single list with the following values:
List("2014-01-01T23:56:06.0", NaN, 1, 2)
Thanks for your help :)
The following can help you achieve your goal:
import spark.implicits._
import org.apache.spark.sql.functions.min

val input_rdd1 = spark.sparkContext.parallelize(List(("2014-01-01T23:56:06.0", "NaN", "1", "NaN")))
val input_rdd2 = spark.sparkContext.parallelize(List(("2014-01-01T23:56:06.0", "NaN", "NaN", "2")))
// added one more row for your data
val input_rdd3 = spark.sparkContext.parallelize(List(("2014-01-01T23:56:06.0", "2", "NaN", "NaN")))
val input_df1 = input_rdd1.toDF("col1", "col2", "col3", "col4")
val input_df2 = input_rdd2.toDF("col1", "col2", "col3", "col4")
val input_df3 = input_rdd3.toDF("col1", "col2", "col3", "col4")
// min picks the non-"NaN" value here because digit strings sort before the letter "N"
val output_df = input_df1.union(input_df2).union(input_df3)
  .groupBy($"col1")
  .agg(min($"col2").as("col2"), min($"col3").as("col3"), min($"col4").as("col4"))
output_df.show
output_df.show
output:
+--------------------+----+----+----+
| col1|col2|col3|col4|
+--------------------+----+----+----+
|2014-01-01T23:56:...| 2| 1| 2|
+--------------------+----+----+----+
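As a usage note (a hedged sketch, not in the original answer): if there are many columns, the min aggregations can be built programmatically instead of being listed one by one; unioned, aggs and output_df2 are illustrative names.
val unioned = input_df1.union(input_df2).union(input_df3)
// aggregate every column except the timestamp with min
val aggs = unioned.columns.filter(_ != "col1").map(c => min(unioned(c)).as(c))
val output_df2 = unioned.groupBy($"col1").agg(aggs.head, aggs.tail: _*)
output_df2.show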
If the lists' tail values are doubles, it can be implemented this way (as sachav suggests):
val original = spark.sparkContext.parallelize(
  Seq(
    List("2014-01-01T23:56:06.0", Double.NaN, 1.0, Double.NaN),
    List("2014-01-01T23:56:06.0", Double.NaN, Double.NaN, 2.0)
  )
)
val result = original
  .map(v => v.head -> v.tail)
  .reduceByKey(
    (acc, curr) => acc.zip(curr).map { case (left, right) =>
      if (left.asInstanceOf[Double].isNaN) right else left
    }
  )
  .map(v => v._1 :: v._2)
result.foreach(println)
Output is:
List(2014-01-01T23:56:06.0, NaN, 1.0, 2.0)

How to Sum a part of a list in RDD

I have an RDD of (key, List[Double]) pairs, and I would like to sum part of each list to get (key, element2 + element3). Given
(1, List(2.0, 3.0, 4.0, 5.0)), (2, List(1.0, -1.0, -2.0, -3.0))
the output should look like this:
(1, 7.0), (2, -3.0)
Thanks
You can map over the tuples and index into the list, keeping the key:
yourRddOfTuples.map { case (key, list) => (key, list(1) + list(2)) }
Update after your comment: convert it to a Vector first:
yourRddOfTuples.map { case (key, list) => val vs = list.toVector; (key, vs(1) + vs(2)) }
Or if you do not want to use conversions:
yourRddOfTuples.map { case (key, list) => (key, list.drop(1).take(2).sum) }
This keeps the key, skips the first element of the list (.drop(1)), takes the next two (.take(2)) (might be fewer if the list is shorter), and sums them (.sum).
You can map the key-list pair to obtain the 2nd and 3rd list elements as follows:
val rdd = sc.parallelize(Seq(
  (1, List(2.0, 3.0, 4.0, 5.0)),
  (2, List(1.0, -1.0, -2.0, -3.0))
))
rdd.map{ case (k, l) => (k, l(1) + l(2)) }.collect
// res1: Array[(Int, Double)] = Array((1,7.0), (2,-3.0))
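If the index range may vary, a small hedged generalization (sumSlice is a name introduced here for illustration) keeps the key and sums an arbitrary 0-based slice:
import org.apache.spark.rdd.RDD

// sums the list elements with indices in [from, until) for every key
def sumSlice(pairs: RDD[(Int, List[Double])], from: Int, until: Int): RDD[(Int, Double)] =
  pairs.mapValues(_.slice(from, until).sum)

sumSlice(rdd, 1, 3).collect
// Array((1,7.0), (2,-3.0))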

Scala, Spark: find element-wise average of N maps

I have N maps (Map[String, Double]) each having the same set of keys. Let's say something like the following:
val map1 = Map("elem1" -> 2.0, "elem2" -> 4.0, "elem3" -> 3.0)
val map2 = Map("elem1" -> 4.0, "elem2" -> 1.0, "elem3" -> 1.0)
val map3 = Map("elem1" -> 3.0, "elem2" -> 10.0, "elem3" -> 2.0)
I need to return a new map with the element-wise average of those input maps:
val resultMap = Map("elem1" -> 3.0, "elem2" -> 5.0, "elem3" -> 2.0)
What's the cleanest way to do that in Scala? Preferably without extra external libraries.
This all happens in Spark, so Spark-specific suggestions are also welcome.
One option is to convert all Maps to Seqs, union them to a single Seq, group by key and take the average of values:
val maps = Seq(map1, map2, map3)
maps.map(_.toSeq).reduce(_ ++ _).groupBy(_._1).mapValues(x => x.map(_._2).sum / x.length)
// res6: scala.collection.immutable.Map[String,Double] = Map(elem1 -> 3.0, elem3 -> 2.0, elem2 -> 5.0)
Since your question is tagged with apache-spark, you can get your desired output by combining the maps into an RDD[Map[String, Double]]:
scala> val rdd = sc.parallelize(Seq(Map("elem1"-> 2.0, "elem2"-> 4.0, "elem3"-> 3.0),Map("elem1"-> 4.0, "elem2"-> 1.0, "elem3"-> 1.0),Map("elem1"-> 3.0, "elem2"-> 10.0, "elem3"-> 2.0)))
rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[String,Double]] = ParallelCollectionRDD[1] at parallelize at <console>:24
Then you can use flatMap to flatten the map entries into individual pairs, group them by key, sum the grouped values, and divide by the number of grouped entries to get your desired output:
scala> rdd.flatMap(row => row).groupBy(kv => kv._1).mapValues(values => values.map(value => value._2).sum/values.size)
res0: org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionsRDD[5] at mapValues at <console>:27
scala> res0.foreach(println)
(elem2,5.0)
(elem3,2.0)
(elem1,3.0)
Hope the answer is helpful.
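For larger inputs, a hedged variant of the same idea (avgRdd is an illustrative name) replaces groupBy with reduceByKey over (sum, count) pairs, so values are combined per partition instead of being collected per key:
val avgRdd = rdd
  .flatMap(_.toSeq)                 // one (key, value) pair per map entry
  .mapValues(v => (v, 1))           // carry a running (sum, count)
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (s, c) => s / c }
avgRdd.collect
// Array((elem1,3.0), (elem2,5.0), (elem3,2.0)) in some order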

How to add an element to the end of a seq in scala?

I want to add an element to the end of a Seq in Scala, but it didn't work. Can somebody help? Thanks.
// assuming Vectors comes from org.apache.spark.ml.linalg (could also be mllib.linalg)
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(2.0, 4.0),
  Vectors.dense(3.0, 6.0)
)
data :+ Vectors.dense(4.0, 8.0) // didn't work
println(data)
Result shown:
println shows List([1.0,2.0], [2.0,4.0], [3.0,6.0])
Seq is an immutable structure. When you append a new element, a new collection is created and returned, but the val data still refers to the original one.
Try
val newData = data :+ Vectors.dense(4.0, 8.0)
println(newData)
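If you really need in-place appends, a hedged sketch with a mutable buffer (buf is an illustrative name, reusing the Vectors import assumed above) looks like this; otherwise prefer the immutable :+ shown above.
import scala.collection.mutable.ArrayBuffer

val buf = ArrayBuffer(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(2.0, 4.0),
  Vectors.dense(3.0, 6.0)
)
buf += Vectors.dense(4.0, 8.0)  // mutates buf in place
println(buf)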