I have an RDD of (key, value) pairs that I transformed into an RDD of (key, List(value1, value2, value3)) as follows.
val rddInit = sc.parallelize(List((1, 2), (1, 3), (2, 5), (2, 7), (3, 10)))
val rddReduced = rddInit.groupByKey.mapValues(_.toList)
rddReduced.take(3).foreach(println)
This code gives me the following RDD:
(1,List(2, 3))
(2,List(5, 7))
(3,List(10))
But now I would like to go back from the RDD I just computed (the rddReduced RDD) to rddInit.
My first guess is to perform some kind of cross product between the key and each element of the List, like this:
import scala.collection.mutable.ListBuffer

rddReduced.map {
  case (x, y) =>
    val myList: ListBuffer[(Int, Int)] = ListBuffer()
    for (element <- y) {
      myList += ((x, element))
    }
    myList.toList
}.flatMap(x => x).take(5).foreach(println)
With this code, I get the initial RDD as a result. But I don't think using a ListBuffer inside a Spark job is good practice. Is there any other way to solve this problem?
I'm surprised no one has offered a solution with Scala's for-comprehension (that gets "desugared" to flatMap and map at compile time).
I don't use this syntax very often, but when I do, I find it quite entertaining. Some people prefer a for-comprehension over a series of flatMap and map calls, especially for more complex transformations.
// that's what you ended up with after `groupByKey.mapValues`
val rddReduced: RDD[(Int, List[Int])] = ...
val r = for {
  (k, values) <- rddReduced
  v <- values
} yield (k, v)
scala> :type r
org.apache.spark.rdd.RDD[(Int, Int)]
scala> r.foreach(println)
(3,10)
(2,5)
(2,7)
(1,2)
(1,3)
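Under the hood the compiler rewrites the comprehension into flatMap and map, roughly like this (rEquivalent is just an illustrative name):
// roughly what the for-comprehension above desugars to
val rEquivalent = rddReduced.flatMap { case (k, values) => values.map(v => (k, v)) }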
// even nicer to our eyes
scala> r.toDF("key", "value").show
+---+-----+
|key|value|
+---+-----+
| 1| 2|
| 1| 3|
| 2| 5|
| 2| 7|
| 3| 10|
+---+-----+
After all, that's why we enjoy the flexibility of Scala, isn't it?
It's obviously not a good practice to use that kind of operation.
From what I learned in a Spark Summit course, you should use DataFrames and Datasets as much as possible; with them you benefit from a lot of optimizations from the Spark engine.
What you want to do is called explode, and it's performed by applying the explode method from the sql.functions package.
The solution would be something like this:
import spark.implicits._
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.functions.collect_list
val dfInit = sc.parallelize(List((1, 2), (1, 3), (2, 5), (2, 7), (3, 10))).toDF("x", "y")
val dfReduced = dfInit.groupBy("x").agg(collect_list("y") as "y")
val dfResult = dfReduced.withColumn("y", explode($"y"))
dfResult will contain the same data as dfInit.
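For example, dfResult.show should give back the original pairs (row order may differ):
dfResult.show
// +---+---+
// |  x|  y|
// +---+---+
// |  1|  2|
// |  1|  3|
// |  2|  5|
// |  2|  7|
// |  3| 10|
// +---+---+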
Here's one way to restore the grouped RDD back to original:
val rddRestored = rddReduced.flatMap {
  case (k, v) => v.map((k, _))
}
rddRestored.collect.foreach(println)
(1,2)
(1,3)
(2,5)
(2,7)
(3,10)
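For what it's worth, a slightly more concise equivalent uses flatMapValues, which pair RDDs provide (rddRestored2 is just an illustrative name):
val rddRestored2 = rddReduced.flatMapValues(identity)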
According to your question, I think this is what you want to do:
rddReduced.map { case (x, y) => y.map((x, _)) }.flatMap(identity).take(5).foreach(println)
After the groupByKey you get a list for each key, which you can then map over again.
I am new to Scala and Spark.
There are two RDDs like
RDD_A = (keyA,5), (keyB,10)
RDD_B = (keyA,3), (keyB,7)
How do I calculate RDD_A - RDD_B so that I get (keyA,2), (keyB,3)?
I tried subtract and subtractByKey, but I am unable to get output like the above.
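For reference, subtractByKey only removes pairs whose key appears in the other RDD (it does not subtract the values), so with the data above it yields an empty result; a minimal check:
val rddA = sc.parallelize(Seq(("keyA", 5), ("keyB", 10)))
val rddB = sc.parallelize(Seq(("keyA", 3), ("keyB", 7)))
rddA.subtractByKey(rddB).collect() // Array() -- both keys exist in rddB, so every pair is dropped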
Let's assume that each RDD has only one value per key:
val df = Seq(
  ("A", 5),
  ("B", 10)
).toDF("key", "value")

val df2 = Seq(
  ("A", 3),
  ("B", 7)
).toDF("key", "value")
You can merge these DataFrames using union and perform the computation via groupBy as follows:
import org.apache.spark.sql.functions._
df.union(df2)
  .groupBy("key")
  .agg(first("value").minus(last("value")).as("value"))
  .show()
will print:
+---+-----+
|key|value|
+---+-----+
| B| 3|
| A| 2|
+---+-----+
RDD solution for the question
Please see the inline code comments for the explanation.
import org.apache.spark.sql.SparkSession

object SubtractRDD {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().master("local[*]").getOrCreate() // Create Spark session

    val list1 = List(("keyA", 5), ("keyB", 10))
    val list2 = List(("keyA", 3), ("keyB", 7))

    val rdd1 = spark.sparkContext.parallelize(list1) // Convert list to RDD
    val rdd2 = spark.sparkContext.parallelize(list2)

    val result = rdd1.join(rdd2) // Inner join the RDDs on the key
      .map(x => (x._1, x._2._1 - x._2._2)) // Subtract the values for each joined key
      .collectAsMap() // Collect result as a Map

    println(result)
  }
}
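Running this locally should print something like the following (map ordering may vary):
Map(keyA -> 2, keyB -> 3)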
I am trying to find the explode function or its equivalent in plain Scala rather than Spark.
Using the explode function in Spark, I was able to flatten a row with multiple elements into multiple rows as below.
scala> import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.functions.explode
scala> val test = spark.read.json(spark.sparkContext.parallelize(Seq("""{"a":1,"b":[2,3]}""")))
scala> test.schema
res1: org.apache.spark.sql.types.StructType = StructType(StructField(a,LongType,true), StructField(b,ArrayType(LongType,true),true))
scala> test.show
+---+------+
| a| b|
+---+------+
| 1|[2, 3]|
+---+------+
scala> val flat = test.withColumn("b",explode($"b"))
flat: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint]
scala> flat.show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 1| 3|
+---+---+
Is there an explode-equivalent function in plain Scala without using Spark? Is there any way I can implement it myself if there is no explode function available in Scala?
A simple flatMap should help you in this case. I don't know the exact data structure you would like to work with in Scala, but let's take a slightly artificial example:
val l: List[(Int, List[Int])] = List(1 -> List(2, 3))
val result: List[(Int, Int)] = l.flatMap {
  case (a, b) => b.map(i => a -> i)
}
println(result)
which will produce the following result:
List((1,2), (1,3))
UPDATE
As suggested in the comment section by @jwvh, the same result may be achieved using a for-comprehension, hiding the explicit flatMap and map invocations:
val result2: List[(Int, Int)] = for((a, bList) <- l; b <- bList) yield a -> b
Hope this helps!
I have the following DataFrame:
+--------------------+
| values |
+--------------------+
|[[1,1,1],[3,2,4],[1,|
|[[1,1,2],[2,2,4],[1,|
|[[1,1,3],[4,2,4],[1,|
I want a column with the tail of the list. So far I know how to select the first element with
val df1 = df.select($"values".getItem(0))
but is there a method which would allow me to drop the first element?
A UDF with a simple size check seems to be the simplest solution:
val df = Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5))).toDF("c1", "c2")
def tail = udf( (s: Seq[Int]) => if (s.size > 1) s.tail else Seq.empty[Int] )
df.select($"c1", tail($"c2").as("c2tail")).show
// +---+------+
// | c1|c2tail|
// +---+------+
// | 1|[2, 3]|
// | 2| [5]|
// +---+------+
As per the suggestion in the comment section, a preferred solution would be to use the Spark built-in function slice:
df.select($"c1", slice($"c2", 2, Int.MaxValue).as("c2tail"))
I don't think a built-in operator exists for this.
But you can use UDFs, for example:
import org.apache.spark.sql.functions.{col, udf}
import collection.mutable.WrappedArray

def tailUdf = udf((array: WrappedArray[WrappedArray[Int]]) => array.tail)
df.select(tailUdf(col("values"))).show()
I am new to Spark and Scala, so I have no idea what this kind of problem is called (which makes searching for it pretty hard).
I have data of the following structure:
[(date1, (name1, 1)), (date1, (name1, 1)), (date1, (name2, 1)), (date2, (name3, 1))]
In some way, this has to be reduced/aggregated to:
[(date1, [(name1, 2), (name2, 1)]), (date2, [(name3, 1)])]
I know how to do reduceByKey on a list of key-value pairs, but this particular problem is a mystery to me.
Thanks in advance!
Using my own sample data, but here goes, step-wise:
val rdd1 = sc.makeRDD(Array( ("d1",("A",1)), ("d1",("A",1)), ("d1",("B",1)), ("d2",("E",1)) ),2)
val rdd2 = rdd1.map(x => ((x._1, x._2._1), x._2._2))
val rdd3 = rdd2.groupByKey
val rdd4 = rdd3.map {
  case (str, nums) => (str, nums.sum)
}
val rdd5 = rdd4.map(x => (x._1._1, (x._1._2, x._2))).groupByKey
rdd5.collect
returns:
res28: Array[(String, Iterable[(String, Int)])] = Array((d2,CompactBuffer((E,1))), (d1,CompactBuffer((A,2), (B,1))))
A better approach that avoids the first groupByKey is as follows:
val rdd1 = sc.makeRDD(Array( ("d1",("A",1)), ("d1",("A",1)), ("d1",("B",1)), ("d2",("E",1)) ),2)
val rdd2 = rdd1.map(x => ((x._1, x._2._1), x._2._2)) // Key is now (date, name); the value is the count
val rdd3 = rdd2.reduceByKey(_+_)
val rdd4 = rdd3.map(x => (x._1._1, (x._1._2, x._2))).groupByKey // Necessary Shuffle
rdd4.collect
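This should return the same grouping as the first approach, for example (ordering within and across the buffers may vary):
Array((d2,CompactBuffer((E,1))), (d1,CompactBuffer((A,2), (B,1))))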
As I stated in the comments, this can also be done with DataFrames for structured data (the RDD approach above should already be enough), so run the following:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val rddA = sc.makeRDD(Array( ("d1","A",1), ("d1","A",1), ("d1","B",1), ("d2","E",1) ),2)
val dfA = rddA.toDF("c1", "c2", "c3")
val dfB = dfA
  .groupBy("c1", "c2")
  .agg(sum("c3").alias("sum"))
dfB.show
returns:
+---+---+---+
| c1| c2|sum|
+---+---+---+
| d1| A| 2|
| d2| E| 1|
| d1| B| 1|
+---+---+---+
But you can do the following to approximate the CompactBuffer result above.
import org.apache.spark.sql.functions.{col, udf}
case class XY(x: String, y: Long)
val xyTuple = udf((x: String, y: Long) => XY(x, y))
val dfC = dfB
  .withColumn("xy", xyTuple(col("c2"), col("sum")))
  .drop("c2")
  .drop("sum")
dfC.printSchema
dfC.show
// Then ... this gives you the CompactBuffer answer but from a DF-perspective
val dfD = dfC.groupBy(col("c1")).agg(collect_list(col("xy")))
dfD.show
returns (some renaming and possibly sorting may still be required, as sketched after the table):
+---+----------------+
| c1|collect_list(xy)|
+---+----------------+
| d2| [[E, 1]]|
| d1|[[A, 2], [B, 1]]|
+---+----------------+
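A minimal sketch of that renaming and sorting (the column name is taken from the header above; dfE is just an illustrative name):
val dfE = dfD
  .withColumnRenamed("collect_list(xy)", "xyList")
  .orderBy("c1")
dfE.show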
I want to create essentially a sumproduct across columns in a Spark DataFrame. I have a DataFrame that looks like this:
id   val1  val2  val3  val4
123    10     5     7     5
I also have a Map that looks like:
val coefficients = Map("val1" -> 1, "val2" -> 2, "val3" -> 3, "val4" -> 4)
I want to take the value in each column of the DataFrame, multiply it by the corresponding value from the map, and return the result in a new column so essentially:
(10*1) + (5*2) + (7*3) + (5*4) = 61
I tried this:
val myDF1 = myDF.withColumn("mySum", {var a:Double = 0.0; for ((k,v) <- coefficients) a + (col(k).cast(DoubleType)*coefficients(k));a})
but got an error that the "+" method was overloaded. Even if I solved that, I'm not sure this would work. Any ideas? I could always dynamically build a SQL query as a text string and do it that way, but I was hoping for something a little more elegant.
Any ideas are appreciated.
The problem with your code is that you try to add a Column to a Double. cast(DoubleType) affects only the type of the stored values, not the type of the column itself. Since Double doesn't provide a method that accepts an org.apache.spark.sql.Column (such as +(x: Column): Column), everything fails.
To make it work you can for example do something like this:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}
val df = sc.parallelize(Seq(
(123, 10, 5, 7, 5), (456, 1, 1, 1, 1)
)).toDF("k", "val1", "val2", "val3", "val4")
val coefficients = Map("val1" -> 1, "val2" -> 2, "val3" -> 3, "val4" -> 4)
val dotProduct: Column = coefficients
  // To be explicit you can replace
  // col(k) * v with col(k) * lit(v)
  // but it is not required here
  // since we use the Column.* method, not Int.*
  .map { case (k, v) => col(k) * v } // * -> Column.*
  .reduce(_ + _)                     // + -> Column.+
df.withColumn("mySum", dotProduct).show
// +---+----+----+----+----+-----+
// | k|val1|val2|val3|val4|mySum|
// +---+----+----+----+----+-----+
// |123| 10| 5| 7| 5| 61|
// |456| 1| 1| 1| 1| 10|
// +---+----+----+----+----+-----+
It looks like the issue is that you aren't actually updating a in
for ((k, v) <- coefficients) a + ...
You probably meant a += ...
Also, some advice for cleaning up the block of code inside the withColumn call:
You don't need to call coefficients(k) because you've already got its value in v from for((k,v) <- coefficients)
Scala is pretty good at making one-liners, but it's kinda cheating if you have to put semicolons in that one line :P I'd suggest breaking up the sum calculation section into one line per expression.
The sum expression could be rewritten as a fold which avoids using a var (idiomatic Scala usually avoids vars), e.g.
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.DoubleType

val sumExpr = coefficients.foldLeft(lit(0.0)) {
  case (sumSoFar, (k, v)) => col(k).cast(DoubleType) * v + sumSoFar
}
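This expression can then be plugged into withColumn on the original DataFrame; a minimal sketch, with myDF and coefficients as in the question:
val myDF1 = myDF.withColumn("mySum", sumExpr)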
I'm not sure if this is possible through the DataFrame API since you are only able to work with columns and not any predefined closures (e.g. your parameter map).
I've outlined a way below using the underlying RDD of the DataFrame:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
// Initializing your input example.
val df1 = sc.parallelize(Seq((123, 10, 5, 7, 5))).toDF("id", "val1", "val2", "val3", "val4")
// Return column names as an array
val names = df1.columns
// Grab underlying RDD and zip elements with column names
val rdd1 = df1.rdd.map(row => (0 until row.length).map(row.getInt(_)).zip(names))
// Tack the accumulated total onto the existing row ("id" gets coefficient 0, so it doesn't contribute)
val rdd2 = rdd1.map { seq =>
  Row.fromSeq(seq.map(_._1) :+ seq.map { case (value: Int, name: String) => value * coefficients.getOrElse(name, 0) }.sum)
}
// Create output schema (with total)
val totalSchema = StructType(df1.schema.fields :+ StructField("total", IntegerType))
// Apply schema to create output dataframe
val df2 = sqlContext.createDataFrame(rdd2, totalSchema)
// Show output:
df2.show()
...
+---+----+----+----+----+-----+
| id|val1|val2|val3|val4|total|
+---+----+----+----+----+-----+
|123| 10| 5| 7| 5| 61|
+---+----+----+----+----+-----+