Merge multiple RDD in a specific order - scala

I am trying to merge multiple RDDs of strings to a RDD of row in a specific order. I've tried to create a Map[String, RDD[Seq[String]]] (where the Seq contains only one element) and then merge them to a RDD[Row[String]], but it doesn't seems to work (content of RDD[Seq[String]] is lost).. Do someone have any ideas ?
val t1: StructType
val mapFields: Map[String, RDD[Seq[String]]]
var ordRDD: RDD[Seq[String]] = context.emptyRDD
t1.foreach(field => ordRDD = ordRDD ++ mapFiels(field.name))
val rdd = ordRDD.map(line => Row.fromSeq(line))
EDIT :
Using zip function lead to a spark exception, because my RDDs didn't have the same number of elements in each partition. I don't know how to make sure that they all have the same number of elements in each partition, so I've just zip them with index and then join them in good order using a ListMap. Maybe there is a trick to do with the mapPartitions function, but I don't know enough the Spark API yet.
val mapFields: Map[String, RDD[String]]
var ord: ListMap[String, RDD[String]] = ListMap()
t1.foreach(field => ord = ord ++ Map(field.name -> mapFields(field.name)))
// Note : zip = SparkException: Can only zip RDDs with same number of elements in each partition
//val rdd: RDD[Row] = ord.toSeq.map(_._2.map(s => Seq(s))).reduceLeft((rdd1, rdd2) => rdd1.zip(rdd2).map{ case (l1, l2) => l1 ++ l2 }).map(Row.fromSeq)
val zipRdd = ord.toSeq.map(_._2.map(s => Seq(s)).zipWithIndex().map{ case (d, i) => (i, d) })
val concatRdd = zipRdd.reduceLeft((rdd1, rdd2) => rdd1.join(rdd2).map{ case (i, (l1, l2)) => (i, l1 ++ l2)})
val rowRdd: RDD[Row] = concatRdd.map{ case (i, d) => Row.fromSeq(d) }
val df1 = spark.createDataFrame(rowRdd, t1)

The key here is using RDD.zip to "zip" the RDDs together (creating an RDD in which each record is the combination of records with same index in ell RDDs):
import org.apache.spark.sql._
import org.apache.spark.sql.types._
// INPUT: Map does not preserve order (not the defaul implementation, at least) - using Seq
val rdds: Seq[(String, RDD[String])] = Seq(
"field1" -> sc.parallelize(Seq("a", "b", "c")),
"field2" -> sc.parallelize(Seq("1", "2", "3")),
"field3" -> sc.parallelize(Seq("Q", "W", "E"))
)
// Use RDD.zip to zip all RDDs together, then convert to Rows
val rowRdd: RDD[Row] = rdds
.map(_._2)
.map(_.map(s => Seq(s)))
.reduceLeft((rdd1, rdd2) => rdd1.zip(rdd2).map { case (l1, l2) => l1 ++ l2 })
.map(Row.fromSeq)
// Create schema using the column names:
val schema: StructType = StructType(rdds.map(_._1).map(name => StructField(name, StringType)))
// Create DataFrame:
val result: DataFrame = spark.createDataFrame(rowRdd, schema)
result.show
// +------+------+------+
// |field1|field2|field3|
// +------+------+------+
// | a| 1| Q|
// | b| 2| W|
// | c| 3| E|
// +------+------+------+

Related

Scala RDD groupbykey without using groupbykey function

I am trying to get an RDD[(String, Iterable[String])] without using groupbykey. These are my tuples:
(Group 1, John)
(Group 2, Sam)
(Group 1, Mary)
(Group 3, Pam)
I need to get:
(Group 1, List(John, Mary)), (Group 2, List(Sam)), (Group 3, List(Pam))
without using groupby or groupbykeys function. How can I do this?
So if you want to use Spark APIs, you can use a windowing function over key, and try to collect values into a list:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
(Group 1, John),
(Group 2, Sam),
(Group 1, Mary),
(Group 3, Pam)
).toDF("key", "value")
val keyWindow = Window.partitionBy("key")
val result = df
.withColumn(
"values",
collect_list(col("value")).over(keyWindow)
)
.select("key", "values")
.distinct
result.show()
// here is the result:
+-------+------------+
| key| values|
+-------+------------+
|Group 1|[John, Mary]|
|Group 2| [Sam]|
|Group 3| [Pam]|
+-------+------------+
You can then convert this df to rdd easily!
Update:
So if you want to use plain scala, assuming the data is given as follows:
val data = Seq(
("Group 1", "John"),
("Group 2", "Sam"),
("Group 1", "Mary"),
("Group 3", "Pam")
)
You need to foldLeft over the data, with a state of Map[String, List[String]] (which basically means a mapping from each key/group to values), and update the state of your iteration (the Map thing)
val result = data.foldLeft(Map.empty[String, List[String]]) {
case (acc, (key, value)) =>
acc.updatedWith(key) {
case Some(values) => Some(value :: values)
case None => Some(value :: Nil)
}
}
Just to make things clearer, you can use .collect on a dataframe to collect the rows as an Array. And, to make an RDD from the resulting Map[String, List[String]], you can use spark.sparkContext.parallelize(result.toSeq).

Spark create a dataframe from multiple lists/arrays

So, I have 2 lists in Spark(scala). They both contain the same number of values. The first list a contains all strings and the second list b contains all Long's.
a: List[String] = List("a", "b", "c", "d")
b: List[Long] = List(17625182, 17625182, 1059731078, 100)
I also have a schema defined as follows:
val schema2=StructType(
Array(
StructField("check_name", StringType, true),
StructField("metric", DecimalType(38,0), true)
)
)
What is the best way to convert my lists to a single dataframe, that has schema schema2 and the columns are made from a and b respectively?
You can create an RDD[Row] and convert to Spark dataframe with the given schema:
val df = spark.createDataFrame(
sc.parallelize(a.zip(b).map(x => Row(x._1, BigDecimal(x._2)))),
schema2
)
df.show
+----------+----------+
|check_name| metric|
+----------+----------+
| a| 17625182|
| b| 17625182|
| c|1059731078|
| d| 100|
+----------+----------+
Using Dataset:
import spark.implicits._
case class Schema2(a: String, b: Long)
val el = (a zip b) map { case (a, b) => Schema2(a, b)}
val df = spark.createDataset(el).toDF()

How to extract efficiently multiple columns from a single string column RDD?

I have a file with 20+ columns of which I would like to extract a few. Until now, I have the following code. I'm sure there is a smart way to do it, but not able to get it working successfully. Any ideas?
mvnmdata is of type RDD[String]
val strpcols = mvnmdata.map(x => x.split('|')).map(x => (x(0),x(1),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12),x(13),x(14),x(15),x(16),x(17),x(18),x(19),x(20),x(21),x(22),x(23) ))```
The next solution provides an easy and scalable way to manage your column names and indices. It is based on a map which determines the column name/index relation. The map will also help us to handle both the index of the extracted column and its name.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructType, StructField}
val rdd = spark.sparkContext.parallelize(Seq(
"1|500|400|300",
"1|34|67|89",
"2|10|20|56",
"3|2|5|56",
"3|1|8|22"))
val dictColums = Map("c0" -> 0, "c2" -> 2)
// create schema from map keys
val schema = StructType(dictColums.keys.toSeq.map(StructField(_, StringType, true)))
val mappedRDD = rdd.map{line => line.split('|')}
.map{
cols => Row.fromSeq(dictColums.values.toSeq.map{cols(_)})
}
val df = spark.createDataFrame(mappedRDD, schema).show
//output
+---+---+
| c0| c2|
+---+---+
| 1|400|
| 1| 67|
| 2| 20|
| 3| 5|
| 3| 8|
+---+---+
First we declare dictColums in this example we will extract the cols "c0" -> 0 and "c2" -> 2
Next we create the schema from the keys of the map
The one map (which you already have) will split the line by |, the second one will create a Row containing the values that correspond to each item of dictColums.values
UPDATE:
You could also create a function from the above functionality in order to be able to reuse it multiple times:
import org.apache.spark.sql.DataFrame
def stringRddToDataFrame(colsMapping: Map[String, Int], rdd: RDD[String]) : DataFrame = {
val schema = StructType(colsMapping.keys.toSeq.map(StructField(_, StringType, true)))
val mappedRDD = rdd.map{line => line.split('|')}
.map{
cols => Row.fromSeq(colsMapping.values.toSeq.map{cols(_)})
}
spark.createDataFrame(mappedRDD, schema)
}
And then use it for your case:
val cols = Map("c0" -> 0, "c1" -> 1, "c5" -> 5, ... "c23" -> 23)
val df = stringRddToDataFrame(cols, rdd)
As below,if you don't want to write repeated x(i),you can process it in a loop. Example 1:
val strpcols = mvnmdata.map(x => x.split('|'))
.map(x =>{
val xbuffer = new ArrayBuffer[String]()
for (i <- Array(0,1,5,6...)){
xbuffer.append(x(i))
}
xbuffer
})
If you only want to define the index list with start&end and the numbers to be excluded, see Example 2 of below:
scala> (1 to 10).toSet
res8: scala.collection.immutable.Set[Int] = Set(5, 10, 1, 6, 9, 2, 7, 3, 8, 4)
scala> ((1 to 10).toSet -- Set(2,9)).toArray.sortBy(row=>row)
res9: Array[Int] = Array(1, 3, 4, 5, 6, 7, 8, 10)
The final code you want:
//define the function to process indexes
def getSpecIndexes(start:Int, end:Int, removedValueSet:Set[Int]):Array[Int] = {
((start to end).toSet -- removedValueSet).toArray.sortBy(row=>row)
}
val strpcols = mvnmdata.map(x => x.split('|'))
.map(x =>{
val xbuffer = new ArrayBuffer[String]()
//call the function
for (i <- getSpecIndexes(0,100,Set(3,4,5,6))){
xbuffer.append(x(i))
}
xbuffer
})

Reduce and sum tuples by key

In my Spark Scala application I have an RDD with the following format:
(05/05/2020, (name, 1))
(05/05/2020, (name, 1))
(05/05/2020, (name2, 1))
...
(06/05/2020, (name, 1))
What I want to do is group these elements by date and sum the tuples that have the same "name" as key.
Expected Output:
(05/05/2020, List[(name, 2), (name2, 1)]),
(06/05/2020, List[(name, 1)])
...
In order to do that, I am currently using a groupByKey operation and some extra transformations in order to group the tuples by key and calculate the sum for those that share the same one.
For performance reasons, I would like to replace this groupByKey operation with a reduceByKey or an aggregateByKey in order to reduce the amount of data transferred over the network.
However, I can't get my head around on how to do this. Both of these transformations take as parameter a function between the values (tuples in my case) so I can't see how I can group the tuples by key in order to calculate their sum.
Is it doable?
Here's how you can merge your Tuples using reduceByKey:
/**
File /path/to/file1:
15/04/2010 name
15/04/2010 name
15/04/2010 name2
15/04/2010 name2
15/04/2010 name3
16/04/2010 name
16/04/2010 name
File /path/to/file2:
15/04/2010 name
15/04/2010 name3
**/
import org.apache.spark.rdd.RDD
val filePaths = Array("/path/to/file1", "/path/to/file2").mkString(",")
val rdd: RDD[(String, (String, Int))] = sc.textFile(filePaths).
map{ line =>
val pair = line.split("\\t", -1)
(pair(0), (pair(1), 1))
}
rdd.
map{ case (k, (n, v)) => (k, Map(n -> v)) }.
reduceByKey{ (acc, m) =>
acc ++ m.map{ case (n, v) => (n -> (acc.getOrElse(n, 0) + v)) }
}.
map(x => (x._1, x._2.toList)).
collect
// res1: Array[(String, List[(String, Int)])] = Array(
// (15/04/2010, List((name,3), (name2,2), (name3,2))), (16/04/2010, List((name,2)))
// )
Note that the initial mapping is needed because we want to merge the Tuples as elements in a Map, and reduceByKey for RDD[K, V] requires the same data type V before and after the transformation:
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
Yes .aggeregateBykey() can be used as follows:
import scala.collection.mutable.HashMap
def merge(map: HashMap[String, Int], element: (String, Int)) = {
if(map.contains(element._1)) map(element._1) += element._2 else map(element._1) = element._2
map
}
val input = sc.parallelize(List(("05/05/2020",("name",1)),("05/05/2020", ("name", 1)),("05/05/2020", ("name2", 1)),("06/05/2020", ("name", 1))))
val output = input.aggregateByKey(HashMap[String, Int]())({
//combining map & tuple
case (map, element) => merge(map, element)
}, {
// combining two maps
case (map1, map2) => {
val combined = (map1.keySet ++ map2.keySet).map { i=> (i,map1.getOrElse(i,0) + map2.getOrElse(i,0)) }.toMap
collection.mutable.HashMap(combined.toSeq: _*)
}
}).mapValues(_.toList)
credits: Best way to merge two maps and sum the values of same key?
You could convert the RDD to a DataFrame and just use a groupBy with sum, here is one way to do it
import org.apache.spark.sql.types._
val schema = StructType(StructField("date", StringType, false) :: StructField("name", StringType, false) :: StructField("value", IntegerType, false) :: Nil)
val rd = sc.parallelize(Seq(("05/05/2020", ("name", 1)),
("05/05/2020", ("name", 1)),
("05/05/2020", ("name2", 1)),
("06/05/2020", ("name", 1))))
val df = spark.createDataFrame(rd.map{ case (a, (b,c)) => Row(a,b,c)},schema)
df.show
+----------+-----+-----+
| date| name|value|
+----------+-----+-----+
|05/05/2020| name| 1|
|05/05/2020| name| 1|
|05/05/2020|name2| 1|
|06/05/2020| name| 1|
+----------+-----+-----+
val sumdf = df.groupBy("date","name").sum("value")
sumdf.show
+----------+-----+----------+
| date| name|sum(value)|
+----------+-----+----------+
|06/05/2020| name| 1|
|05/05/2020| name| 2|
|05/05/2020|name2| 1|
+----------+-----+----------+

collapse the rows with flatmap or reducedbyKey

I got requirement to collapse the rows and have wrappedarray. here is original data and expected result. need to do it in spark scala.
Original Data:
Column1 COlumn2 Units UnitsByDept
ABC BCD 3 [Dept1:1,Dept2:2]
ABC BCD 13 [Dept1:5,Dept3:8]
Expected Result:
ABC BCD 16 [Dept1:6,Dept2:2,Dept3:8]
It would probably be best to use DataFrame APIs for what you need. If you prefer using row-based functions like reduceByKey, here's one approach:
Convert the DataFrame to a PairRDD
Apply reduceByKey to sum up Units and aggregate UnitsByDept by Dept
Convert the resulting RDD back to a DataFrame:
Sample code below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
val df = Seq(
("ABC", "BCD", 3, Seq("Dept1:1", "Dept2:2")),
("ABC", "BCD", 13, Seq("Dept1:5", "Dept3:8"))
).toDF("Column1", "Column2", "Units", "UnitsByDept")
val rdd = df.rdd.
map{ case Row(c1: String, c2: String, u: Int, ubd: Seq[String]) =>
((c1, c2), (u, ubd))
}.
reduceByKey( (acc, t) => (acc._1 + t._1, acc._2 ++ t._2) ).
map{ case ((c1, c2), (u, ubd)) =>
val aggUBD = ubd.map(_.split(":")).map(arr => (arr(0), arr(1).toInt)).
groupBy(_._1).mapValues(_.map(_._2).sum).
map{ case (d, u) => d + ":" + u }
( c1, c2, u, aggUBD)
}
rdd.collect
// res1: Array[(String, String, Int, scala.collection.immutable.Iterable[String])] =
// Array((ABC,BCD,16,List(Dept3:8, Dept2:2, Dept1:6)))
val rowRDD = rdd.map{ case (c1: String, c2: String, u: Int, ubd: Array[String]) =>
Row(c1, c2, u, ubd)
}
val dfResult = spark.createDataFrame(rowRDD, df.schema)
dfResult.show(false)
// +-------+-------+-----+---------------------------+
// |Column1|Column2|Units|UnitsByDept |
// +-------+-------+-----+---------------------------+
// |ABC |BCD |16 |[Dept3:8, Dept2:2, Dept1:6]|
// +-------+-------+-----+---------------------------+