In my Spark Scala application I have an RDD with the following format:
(05/05/2020, (name, 1))
(05/05/2020, (name, 1))
(05/05/2020, (name2, 1))
...
(06/05/2020, (name, 1))
What I want to do is group these elements by date and sum the tuples that have the same "name" as key.
Expected Output:
(05/05/2020, List[(name, 2), (name2, 1)]),
(06/05/2020, List[(name, 1)])
...
In order to do that, I am currently using a groupByKey operation followed by some extra transformations to group the tuples by key and calculate the sum for those that share the same one.
For performance reasons, I would like to replace this groupByKey operation with a reduceByKey or an aggregateByKey in order to reduce the amount of data transferred over the network.
However, I can't get my head around how to do this. Both of these transformations take as a parameter a function that operates on the values (tuples in my case), so I don't see how I can group the tuples by key in order to calculate their sum.
Is it doable?
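For reference, the groupByKey version being described might look roughly like this (a sketch only, since the question doesn't show the actual code; rdd is assumed to be the RDD[(String, (String, Int))] shown above):
// Hypothetical sketch of the groupByKey approach described in the question
val grouped = rdd.groupByKey().mapValues { tuples =>
  tuples.groupBy(_._1).map { case (name, ts) => (name, ts.map(_._2).sum) }.toList
}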
Here's how you can merge your Tuples using reduceByKey:
/**
File /path/to/file1:
15/04/2010 name
15/04/2010 name
15/04/2010 name2
15/04/2010 name2
15/04/2010 name3
16/04/2010 name
16/04/2010 name
File /path/to/file2:
15/04/2010 name
15/04/2010 name3
**/
import org.apache.spark.rdd.RDD
val filePaths = Array("/path/to/file1", "/path/to/file2").mkString(",")
val rdd: RDD[(String, (String, Int))] = sc.textFile(filePaths).
map{ line =>
val pair = line.split("\\t", -1)
(pair(0), (pair(1), 1))
}
rdd.
map{ case (k, (n, v)) => (k, Map(n -> v)) }.
reduceByKey{ (acc, m) =>
acc ++ m.map{ case (n, v) => (n -> (acc.getOrElse(n, 0) + v)) }
}.
map(x => (x._1, x._2.toList)).
collect
// res1: Array[(String, List[(String, Int)])] = Array(
// (15/04/2010, List((name,3), (name2,2), (name3,2))), (16/04/2010, List((name,2)))
// )
Note that the initial mapping is needed because we want to merge the Tuples as elements of a Map, and reduceByKey for an RDD[(K, V)] requires the value type V to be the same before and after the transformation:
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
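By contrast, aggregateByKey lets the value type change from V to an accumulator type U, which is why the next answer can start from the tuples directly (signature shown simplified, without the implicit ClassTag):
def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]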
Yes, aggregateByKey can be used as follows:
import scala.collection.mutable.HashMap
def merge(map: HashMap[String, Int], element: (String, Int)) = {
if(map.contains(element._1)) map(element._1) += element._2 else map(element._1) = element._2
map
}
val input = sc.parallelize(List(("05/05/2020",("name",1)),("05/05/2020", ("name", 1)),("05/05/2020", ("name2", 1)),("06/05/2020", ("name", 1))))
val output = input.aggregateByKey(HashMap[String, Int]())({
//combining map & tuple
case (map, element) => merge(map, element)
}, {
// combining two maps
case (map1, map2) => {
val combined = (map1.keySet ++ map2.keySet).map { i=> (i,map1.getOrElse(i,0) + map2.getOrElse(i,0)) }.toMap
collection.mutable.HashMap(combined.toSeq: _*)
}
}).mapValues(_.toList)
credits: Best way to merge two maps and sum the values of same key?
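For the sample input above, collecting the result should give something like this (the ordering of dates, and of names within each list, may vary):
output.collect
// e.g. Array((06/05/2020,List((name,1))), (05/05/2020,List((name,2), (name2,1))))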
You could convert the RDD to a DataFrame and just use a groupBy with sum; here is one way to do it:
import org.apache.spark.sql.types._
val schema = StructType(StructField("date", StringType, false) :: StructField("name", StringType, false) :: StructField("value", IntegerType, false) :: Nil)
val rd = sc.parallelize(Seq(("05/05/2020", ("name", 1)),
("05/05/2020", ("name", 1)),
("05/05/2020", ("name2", 1)),
("06/05/2020", ("name", 1))))
val df = spark.createDataFrame(rd.map{ case (a, (b,c)) => Row(a,b,c)},schema)
df.show
+----------+-----+-----+
| date| name|value|
+----------+-----+-----+
|05/05/2020| name| 1|
|05/05/2020| name| 1|
|05/05/2020|name2| 1|
|06/05/2020| name| 1|
+----------+-----+-----+
val sumdf = df.groupBy("date","name").sum("value")
sumdf.show
+----------+-----+----------+
| date| name|sum(value)|
+----------+-----+----------+
|06/05/2020| name| 1|
|05/05/2020| name| 2|
|05/05/2020|name2| 1|
+----------+-----+----------+
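If you also need the per-date list format from the question, one way (a sketch building on the same df, using collect_list and struct) is:
import org.apache.spark.sql.functions.{collect_list, struct, sum}
val perDate = df
  .groupBy("date", "name")
  .agg(sum("value").alias("total"))
  .groupBy("date")
  .agg(collect_list(struct("name", "total")).alias("counts"))
perDate.show(false)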
Related
I am new to Spark and Scala, so I have no idea what this kind of problem is called (which makes searching for it pretty hard).
I have data of the following structure:
[(date1, (name1, 1)), (date1, (name1, 1)), (date1, (name2, 1)), (date2, (name3, 1))]
In some way, this has to be reduced/aggregated to:
[(date1, [(name1, 2), (name2, 1)]), (date2, [(name3, 1)])]
I know how to do reduceByKey on a list of key-value pairs, but this particular problem is a mystery to me.
Thanks in advance!
Using sample data equivalent to yours, here goes, step-wise:
val rdd1 = sc.makeRDD(Array( ("d1",("A",1)), ("d1",("A",1)), ("d1",("B",1)), ("d2",("E",1)) ),2)
val rdd2 = rdd1.map(x => ((x._1, x._2._1), x._2._2))
val rdd3 = rdd2.groupByKey
val rdd4 = rdd3.map{
case (str, nums) => (str, nums.sum)
}
val rdd5 = rdd4.map(x => (x._1._1, (x._1._2, x._2))).groupByKey
rdd5.collect
returns:
res28: Array[(String, Iterable[(String, Int)])] = Array((d2,CompactBuffer((E,1))), (d1,CompactBuffer((A,2), (B,1))))
A better approach that avoids the first groupByKey is as follows:
val rdd1 = sc.makeRDD(Array( ("d1",("A",1)), ("d1",("A",1)), ("d1",("B",1)), ("d2",("E",1)) ),2)
val rdd2 = rdd1.map(x => ((x._1, x._2._1), x._2._2)) // Key is now (date, name) and the value is just the count, as reduceByKey needs
val rdd3 = rdd2.reduceByKey(_+_)
val rdd4 = rdd3.map(x => (x._1._1, (x._1._2, x._2))).groupByKey // Necessary Shuffle
rdd4.collect
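For the same input this returns the same grouping as the first version (ordering may differ), e.g.:
// Array((d2,CompactBuffer((E,1))), (d1,CompactBuffer((A,2), (B,1))))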
The RDD approach above should be enough, but as I stated, this can also be done with DataFrames for structured data, so run the below:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val rddA = sc.makeRDD(Array( ("d1","A",1), ("d1","A",1), ("d1","B",1), ("d2","E",1) ),2)
val dfA = rddA.toDF("c1", "c2", "c3")
val dfB = dfA
.groupBy("c1", "c2")
.agg(sum("c3").alias("sum"))
dfB.show
returns:
+---+---+---+
| c1| c2|sum|
+---+---+---+
| d1| A| 2|
| d2| E| 1|
| d1| B| 1|
+---+---+---+
But you can do the following to approximate the CompactBuffer output shown above.
import org.apache.spark.sql.functions.{col, udf}
case class XY(x: String, y: Long)
val xyTuple = udf((x: String, y: Long) => XY(x, y))
val dfC = dfB
.withColumn("xy", xyTuple(col("c2"), col("sum")))
.drop("c2")
.drop("sum")
dfC.printSchema
dfC.show
// Then ... this gives you the CompactBuffer answer but from a DF-perspective
val dfD = dfC.groupBy(col("c1")).agg(collect_list(col("xy")))
dfD.show
returns (some renaming and possibly sorting still required; see the sketch after the output):
+---+----------------+
| c1|collect_list(xy)|
+---+----------------+
| d2| [[E, 1]]|
| d1|[[A, 2], [B, 1]]|
+---+----------------+
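The renaming and sorting mentioned above could, for example, look like this:
val dfE = dfD
  .withColumnRenamed("collect_list(xy)", "xys")
  .orderBy("c1")
dfE.show(false)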
I have a requirement to collapse the rows and produce a wrapped array; here are the original data and the expected result. This needs to be done in Spark Scala.
Original Data:
Column1 COlumn2 Units UnitsByDept
ABC BCD 3 [Dept1:1,Dept2:2]
ABC BCD 13 [Dept1:5,Dept3:8]
Expected Result:
ABC BCD 16 [Dept1:6,Dept2:2,Dept3:8]
It would probably be best to use DataFrame APIs for what you need. If you prefer using row-based functions like reduceByKey, here's one approach:
Convert the DataFrame to a PairRDD
Apply reduceByKey to sum up Units and aggregate UnitsByDept by Dept
Convert the resulting RDD back to a DataFrame:
Sample code below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
val df = Seq(
("ABC", "BCD", 3, Seq("Dept1:1", "Dept2:2")),
("ABC", "BCD", 13, Seq("Dept1:5", "Dept3:8"))
).toDF("Column1", "Column2", "Units", "UnitsByDept")
val rdd = df.rdd.
map{ case Row(c1: String, c2: String, u: Int, ubd: Seq[String]) =>
((c1, c2), (u, ubd))
}.
reduceByKey( (acc, t) => (acc._1 + t._1, acc._2 ++ t._2) ).
map{ case ((c1, c2), (u, ubd)) =>
val aggUBD = ubd.map(_.split(":")).map(arr => (arr(0), arr(1).toInt)).
groupBy(_._1).mapValues(_.map(_._2).sum).
map{ case (d, u) => d + ":" + u }.toSeq
( c1, c2, u, aggUBD)
}
rdd.collect
// res1: Array[(String, String, Int, Seq[String])] =
// Array((ABC,BCD,16,List(Dept3:8, Dept2:2, Dept1:6)))
val rowRDD = rdd.map{ case (c1: String, c2: String, u: Int, ubd: Seq[String]) =>
Row(c1, c2, u, ubd)
}
val dfResult = spark.createDataFrame(rowRDD, df.schema)
dfResult.show(false)
// +-------+-------+-----+---------------------------+
// |Column1|Column2|Units|UnitsByDept |
// +-------+-------+-----+---------------------------+
// |ABC |BCD |16 |[Dept3:8, Dept2:2, Dept1:6]|
// +-------+-------+-----+---------------------------+
I have a dataframe with 20 columns, and in these columns there is a value XX which I want to replace with an empty string. How do I achieve that in Scala? The withColumn function is for a single column, but I want to pass all 20 columns and replace values that have XX in the entire frame with an empty string. Can someone suggest a way?
Thanks
You can gather all the StringType columns in a list and use foldLeft to apply your removeXX UDF to each of the columns as follows:
val df = Seq(
(1, "aaXX", "bb"),
(2, "ccXX", "XXdd"),
(3, "ee", "fXXf")
).toDF("id", "desc1", "desc2")
import org.apache.spark.sql.types._
val stringColumns = df.schema.fields.collect{
case StructField(name, StringType, _, _) => name
}
val removeXX = udf( (s: String) =>
if (s == null) null else s.replaceAll("XX", "")
)
val dfResult = stringColumns.foldLeft( df )( (acc, c) =>
acc.withColumn( c, removeXX(df(c)) )
)
dfResult.show
+---+-----+-----+
| id|desc1|desc2|
+---+-----+-----+
| 1| aa| bb|
| 2| cc| dd|
| 3| ee| ff|
+---+-----+-----+
def clearValueContains(dataFrame: DataFrame,token :String,columnsToBeUpdated : List[String])={
columnsToBeUpdated.foldLeft(dataFrame){
(dataset ,columnName) =>
dataset.withColumn(columnName, when(col(columnName).contains(token), "").otherwise(col(columnName)))
}
}
You can use this function, passing "XX" as the token. The columnsToBeUpdated parameter is the list of columns in which you need to search for that token.
dataset.withColumn(columnName, when(col(columnName) === token, "").otherwise(col(columnName)))
You can use the above line instead to replace on an exact match only.
We can do it like this as well in Scala.
//Getting all columns
val columns: Seq[String] = df.columns
//Using DataFrameNaFunctions to achieve this.
val changedDF = df.na.replace(columns, Map("XX"-> ""))
Note that na.replace matches exact cell values only, so this replaces cells that are exactly "XX" rather than removing "XX" substrings. Hope this helps.
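If you do need substring removal (as in the sample data above with values like "aaXX"), a built-in alternative to the UDF approach in the first answer is regexp_replace folded over the string columns; a sketch:
import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.StringType
val stringCols = df.schema.fields.collect { case f if f.dataType == StringType => f.name }
val cleaned = stringCols.foldLeft(df)( (acc, c) =>
  acc.withColumn(c, regexp_replace(col(c), "XX", ""))
)
cleaned.show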
I am trying to merge multiple RDDs of strings into an RDD of Row in a specific order. I've tried to create a Map[String, RDD[Seq[String]]] (where the Seq contains only one element) and then merge them into an RDD[Row], but it doesn't seem to work (the content of RDD[Seq[String]] is lost). Does someone have any ideas?
val t1: StructType
val mapFields: Map[String, RDD[Seq[String]]]
var ordRDD: RDD[Seq[String]] = context.emptyRDD
t1.foreach(field => ordRDD = ordRDD ++ mapFields(field.name))
val rdd = ordRDD.map(line => Row.fromSeq(line))
EDIT :
Using the zip function led to a Spark exception, because my RDDs didn't have the same number of elements in each partition. I don't know how to make sure that they all have the same number of elements in each partition, so I've just zipped them with an index and then joined them in the right order using a ListMap. Maybe there is a trick to do this with the mapPartitions function, but I don't know the Spark API well enough yet.
val mapFields: Map[String, RDD[String]]
var ord: ListMap[String, RDD[String]] = ListMap()
t1.foreach(field => ord = ord ++ Map(field.name -> mapFields(field.name)))
// Note : zip = SparkException: Can only zip RDDs with same number of elements in each partition
//val rdd: RDD[Row] = ord.toSeq.map(_._2.map(s => Seq(s))).reduceLeft((rdd1, rdd2) => rdd1.zip(rdd2).map{ case (l1, l2) => l1 ++ l2 }).map(Row.fromSeq)
val zipRdd = ord.toSeq.map(_._2.map(s => Seq(s)).zipWithIndex().map{ case (d, i) => (i, d) })
val concatRdd = zipRdd.reduceLeft((rdd1, rdd2) => rdd1.join(rdd2).map{ case (i, (l1, l2)) => (i, l1 ++ l2)})
val rowRdd: RDD[Row] = concatRdd.map{ case (i, d) => Row.fromSeq(d) }
val df1 = spark.createDataFrame(rowRdd, t1)
The key here is using RDD.zip to "zip" the RDDs together (creating an RDD in which each record is the combination of the records with the same index in all RDDs):
import org.apache.spark.sql._
import org.apache.spark.sql.types._
// INPUT: Map does not preserve order (not the default implementation, at least) - using Seq
val rdds: Seq[(String, RDD[String])] = Seq(
"field1" -> sc.parallelize(Seq("a", "b", "c")),
"field2" -> sc.parallelize(Seq("1", "2", "3")),
"field3" -> sc.parallelize(Seq("Q", "W", "E"))
)
// Use RDD.zip to zip all RDDs together, then convert to Rows
val rowRdd: RDD[Row] = rdds
.map(_._2)
.map(_.map(s => Seq(s)))
.reduceLeft((rdd1, rdd2) => rdd1.zip(rdd2).map { case (l1, l2) => l1 ++ l2 })
.map(Row.fromSeq)
// Create schema using the column names:
val schema: StructType = StructType(rdds.map(_._1).map(name => StructField(name, StringType)))
// Create DataFrame:
val result: DataFrame = spark.createDataFrame(rowRdd, schema)
result.show
// +------+------+------+
// |field1|field2|field3|
// +------+------+------+
// | a| 1| Q|
// | b| 2| W|
// | c| 3| E|
// +------+------+------+
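Note that RDD.zip requires the RDDs to have the same number of partitions and the same number of elements in each partition, which is exactly the exception mentioned in the question's edit. If the total counts match but the partitioning doesn't, one blunt workaround (only reasonable for small data, and purely a sketch) is to coalesce each RDD to a single partition before zipping:
val rowRddAligned: RDD[Row] = rdds
  .map(_._2.coalesce(1).map(s => Seq(s)))
  .reduceLeft((rdd1, rdd2) => rdd1.zip(rdd2).map { case (l1, l2) => l1 ++ l2 })
  .map(Row.fromSeq)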
I want to rank user ids based on one field. For the same value of the field, the rank should be the same. The data is in a Hive table.
e.g.
user value
a 5
b 10
c 5
d 6
Rank
a - 1
c - 1
d - 3
b - 4
How can I do that?
It is possible to use the rank window function, either with the DataFrame API:
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.expressions.Window
val w = Window.orderBy($"value")
val df = sc.parallelize(Seq(
("a", 5), ("b", 10), ("c", 5), ("d", 6)
)).toDF("user", "value")
df.select($"user", rank.over(w).alias("rank")).show
// +----+----+
// |user|rank|
// +----+----+
// | a| 1|
// | c| 1|
// | d| 3|
// | b| 4|
// +----+----+
or raw SQL:
df.registerTempTable("df")
sqlContext.sql("SELECT user, RANK() OVER (ORDER BY value) AS rank FROM df").show
// +----+----+
// |user|rank|
// +----+----+
// | a| 1|
// | c| 1|
// | d| 3|
// | b| 4|
// +----+----+
but both are extremely inefficient: a window with an ORDER BY and no PARTITION BY moves all the data to a single partition.
You can also try to use the RDD API, but it is not exactly straightforward. First let's convert the DataFrame to an RDD:
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.RangePartitioner
val rdd: RDD[(Int, String)] = df.select($"value", $"user")
.map{ case Row(value: Int, user: String) => (value, user) }
val partitioner = new RangePartitioner(rdd.partitions.size, rdd)
val sorted = rdd.repartitionAndSortWithinPartitions(partitioner)
Next we have to compute ranks per partition:
def rank(iter: Iterator[(Int,String)]) = {
val zero = List((-1L, Integer.MIN_VALUE, "", 1L))
def f(acc: List[(Long,Int,String,Long)], x: (Int, String)) =
(acc.head, x) match {
case (
(prevRank: Long, prevValue: Int, _, offset: Long),
(currValue: Int, label: String)) => {
val newRank = if (prevValue == currValue) prevRank else prevRank + offset
val newOffset = if (prevValue == currValue) offset + 1L else 1L
(newRank, currValue, label, newOffset) :: acc
}
}
iter.foldLeft(zero)(f).reverse.drop(1).map{case (rank, _, label, _) =>
(rank, label)}.toIterator
}
val partRanks = sorted.mapPartitions(rank)
Then we need to compute an offset for each partition:
def getOffsets(sorted: RDD[(Int, String)]) = sorted
.mapPartitionsWithIndex((i: Int, iter: Iterator[(Int, String)]) =>
Iterator((i, iter.size)))
.collect
.foldLeft(List((-1, 0)))((acc: List[(Int, Int)], x: (Int, Int)) =>
(x._1, x._2 + acc.head._2) :: acc)
.toMap
val offsets = sc.broadcast(getOffsets(sorted))
and the final ranks:
def adjust(i: Int, iter: Iterator[(Long, String)]) =
iter.map{case (rank, label) => (rank + offsets.value(i - 1).toLong, label)}
val ranks = partRanks
.mapPartitionsWithIndex(adjust)
.map{case (i, label) => (1 + i , label)}
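Collecting the result should reproduce the ranking from the question (element order may vary):
ranks.collect
// e.g. Array((1,a), (1,c), (3,d), (4,b))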