In Apache Spark, I have a dataframe with one string column holding a date, but the leading zero is missing from single-digit months and days - scala

import org.apache.spark.sql.functions.regexp_replace
val df = spark.createDataFrame(Seq(
(1, "9/11/2020"),
(2, "10/11/2020"),
(3, "1/1/2020"),
(4, "12/7/2020"))).toDF("Id", "x4")
val newDf = df
.withColumn("x4New", regexp_replace(df("x4"), "(?:(\\d{2}))/(?:(\\d{1}))/(?:(\\d{4}))", "$1/0$2/$3"))
val newDf1 = newDf
.withColumn("x4New1", regexp_replace(df("x4"), "(?:(\\d{1}))/(?:(\\d{1}))/(?:(\\d{4}))", "0$1/0$2/$3"))
.withColumn("x4New2", regexp_replace(df("x4"), "(?:(\\d{1}))/(?:(\\d{2}))/(?:(\\d{4}))", "0$1/$2/$3"))
newDf1.show
Current output:
+---+----------+----------+-----------+-----------+
| Id| x4| x4New| x4New1| x4New2|
+---+----------+----------+-----------+-----------+
| 1| 9/11/2020| 9/11/2020| 9/11/2020| 09/11/2020|
| 2|10/11/2020|10/11/2020| 10/11/2020|100/11/2020|
| 3| 1/1/2020| 1/1/2020| 01/01/2020| 1/1/2020|
| 4| 12/7/2020|12/07/2020|102/07/2020| 12/7/2020|
+---+----------+----------+-----------+-----------+
Desired output: add a leading zero in front of the day or month when it is a single digit. I do not want to use a UDF for performance reasons.
+---+----------+----------+
| Id| x4| date |
+---+----------+----------+
| 1| 9/11/2020|09/11/2020|
| 2|10/11/2020|10/11/2020|
| 3| 1/1/2020|01/01/2020|
| 4| 12/7/2020|12/07/2020|
+---+----------+----------+

Use the built-in functions from_unixtime/unix_timestamp, date_format/to_timestamp, or to_date.
Example (in Spark 2.4):
import org.apache.spark.sql.functions._
//sample data
val df = spark.createDataFrame(Seq((1, "9/11/2020"),(2, "10/11/2020"),(3, "1/1/2020"), (4, "12/7/2020"))).toDF("Id", "x4")
//using from_unixtime
df.withColumn("date",from_unixtime(unix_timestamp(col("x4"),"MM/dd/yyyy"),"MM/dd/yyyy")).show()
//using date_format
df.withColumn("date",date_format(to_timestamp(col("x4"),"MM/dd/yyyy"),"MM/dd/yyyy")).show()
df.withColumn("date",date_format(to_date(col("x4"),"MM/dd/yyyy"),"MM/dd/yyyy")).show()
//+---+----------+----------+
//| Id| x4| date|
//+---+----------+----------+
//| 1| 9/11/2020|09/11/2020|
//| 2|10/11/2020|10/11/2020|
//| 3| 1/1/2020|01/01/2020|
//| 4| 12/7/2020|12/07/2020|
//+---+----------+----------+

Found a workaround; see if there is a better solution using one dataframe and no UDF.
import org.apache.spark.sql.functions.regexp_replace
val df = spark.createDataFrame(Seq(
(1, "9/11/2020"),
(2, "10/11/2020"),
(3, "1/1/2020"),
(4, "12/7/2020"))).toDF("Id", "x4")
val newDf = df.withColumn("x4New", regexp_replace(df("x4"), "(?:(\\b\\d{2}))/(?:(\\d))/(?:(\\d{4})\\b)", "$1/0$2/$3"))
val newDf1 = newDf.withColumn("x4New1", regexp_replace(newDf("x4New"), "(?:(\\b\\d{1}))/(?:(\\d))/(?:(\\d{4})\\b)", "0$1/$2/$3"))
val newDf2 = newDf1.withColumn("x4New2", regexp_replace(newDf1("x4New1"), "(?:(\\b\\d{1}))/(?:(\\d{2}))/(?:(\\d{4})\\b)", "0$1/$2/$3"))
val newDf3 = newDf2.withColumn("date", to_date(regexp_replace(newDf2("x4New2"), "(?:(\\b\\d{2}))/(?:(\\d{1}))/(?:(\\d{4})\\b)", "$1/0$2/$3"),"MM/dd/yyyy"))
val formatedDataDf = newDf3
.drop("x4New")
.drop("x4New1")
.drop("x4New2")
formatedDataDf.printSchema
formatedDataDf.show
The output looks as follows:
root
|-- Id: integer (nullable = false)
|-- x4: string (nullable = true)
|-- date: date (nullable = true)
+---+----------+----------+
| Id| x4| date|
+---+----------+----------+
| 1| 9/11/2020|2020-09-11|
| 2|10/11/2020|2020-10-11|
| 3| 1/1/2020|2020-01-01|
| 4| 12/7/2020|2020-12-07|
+---+----------+----------+
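As an aside, the three chained regexp_replace steps above can probably be collapsed into a single pattern that zero-pads any lone digit between word boundaries before parsing. A minimal sketch, assuming the same df as above:
import org.apache.spark.sql.functions.{col, regexp_replace, to_date}
// "\\b(\\d)\\b" matches a digit standing alone (e.g. the 9 in "9/11/2020"),
// so a single pass pads both month and day before to_date parses the string
val singlePass = df.withColumn("date",
  to_date(regexp_replace(col("x4"), "\\b(\\d)\\b", "0$1"), "MM/dd/yyyy"))
singlePass.show()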

Related

Spark Dataframes: Convert unix exponential numbers to string whole number to obtain timestamp

The Spark dataframe below has start_t and end_t in unix epoch format, but the values contain an exponential E.
+------+----------------+------------------+--------+----------+----------+-------+-----------+-----------+-----------+-------------+-------+---------------+----------------+
| alt_t| end_t|engine_fuel_rate_t| lat_t|left_max_t|left_min_t| lon_t|plm3_incl_t|right_max_t|right_min_t|road_class_u8|speed_t|sprung_weight_t| start_t|
+------+----------------+------------------+--------+----------+----------+-------+-----------+-----------+-----------+-------------+-------+---------------+----------------+
|1237.5|1.521956985733E9| 0|-27.7314| 0.0| 0.0|22.9552| 1.5| 0.0| 0.0| 0| 17.4| 198.0| 1.52195698056E9|
|1236.5|1.521956989922E9| 0|-27.7316| 0.0| 0.0|22.9552| -3.3| 0.0| 0.0| 0| 17.6| 156.1|1.521956985733E9|
|1234.5|1.521956995378E9| 0|-27.7318| 0.0| 0.0|22.9552| -2.7| 0.0| 0.0| 0| 11.9| 148.6|1.521956989922E9|
|1230.5|1.521957001498E9| 0| -27.732| 0.0| 0.0|22.9551| 2.3| 0.0| 0.0| 0| 13.2| 169.1|1.521956995378E9|
Since it is a double, it cannot be converted directly to a timestamp; it throws an error stating it needs to be a string.
+------+----------------+------------------+--------+----------+----------+-------+-----------+-----------+-----------+-------------+-------+---------------+-------+
| alt_t| end_t|engine_fuel_rate_t| lat_t|left_max_t|left_min_t| lon_t|plm3_incl_t|right_max_t|right_min_t|road_class_u8|speed_t|sprung_weight_t|start_t|
+------+----------------+------------------+--------+----------+----------+-------+-----------+-----------+-----------+-------------+-------+---------------+-------+
|1237.5|1.521956985733E9| 0|-27.7314| 0.0| 0.0|22.9552| 1.5| 0.0| 0.0| 0| 17.4| 198.0| null|
|1236.5|1.521956989922E9| 0|-27.7316| 0.0| 0.0|22.9552| -3.3| 0.0| 0.0| 0| 17.6| 156.1| null|
|1234.5|1.521956995378E9| 0|-27.7318| 0.0| 0.0|22.9552| -2.7| 0.0| 0.0| 0| 11.9| 148.6| null|
Therefore I used the following code:
%scala
val df2 = df.withColumn("start_t", df("start_t").cast("string"))
val df3 = df2.withColumn("end_t", df("end_t").cast("string"))
val filteredDF = df3.withColumn("start_t", unix_timestamp($"start_t", "yyyyMMddHHmmss").cast("timestamp"))
filteredDF.show()
I get null in start_t and think it is due to the E (exponential sign). I have tested it in pandas/Python; the dates are valid and produce results. I know there is a way to change this using precision.
I am trying to convert it to a timestamp in the format yyyy-MM-dd HH:mm:ss and have separate columns for just the time and the date.
Note: a similar question was posed but not answered: Scala Spark: Convert Double Column to Date Time Column in dataframe
Chain the casting from String -> Double -> Timestamp. The below works
scala> val df = Seq(("1237.5","1.521956985733E9"),("1236.5","1.521956989922E9"),("1234.5","1.521956995378E9"),("1230.5","1.521957001498E9")).toDF("alt_t","end_t")
df: org.apache.spark.sql.DataFrame = [alt_t: string, end_t: string]
scala> df.withColumn("end_t",'end_t.cast("double").cast("timestamp")).show(false)
+------+-----------------------+
|alt_t |end_t |
+------+-----------------------+
|1237.5|2018-03-25 01:49:45.733|
|1236.5|2018-03-25 01:49:49.922|
|1234.5|2018-03-25 01:49:55.378|
|1230.5|2018-03-25 01:50:01.498|
+------+-----------------------+
scala>
UPDATE1
scala> val df = Seq(("1237.5","1.521956985733E9"),("1236.5","1.521956989922E9"),("1234.5","1.521956995378E9"),("1230.5","1.521957001498E9")).toDF("alt_t","end_t").withColumn("end_t",'end_t.cast("double").cast("timestamp"))
df: org.apache.spark.sql.DataFrame = [alt_t: string, end_t: timestamp]
scala> df.printSchema
root
|-- alt_t: string (nullable = true)
|-- end_t: timestamp (nullable = true)
scala>
You should be able to cast a double to timestamp as shown below
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala>
| val df = Seq((1237.5,1.521956985733E9),
| (1236.5,1.521956989922E9),
| (1234.5,1.521956995378E9),
| (1230.5,1.521957001498E9)).toDF("alt_t","end_t")
df: org.apache.spark.sql.DataFrame = [alt_t: double, end_t: double]
scala>
scala> df.printSchema
root
|-- alt_t: double (nullable = false)
|-- end_t: double (nullable = false)
scala>
scala> df.withColumn("end_t",$"end_t".cast("timestamp")).show
+------+--------------------+
| alt_t| end_t|
+------+--------------------+
|1237.5|2018-03-25 05:49:...|
|1236.5|2018-03-25 05:49:...|
|1234.5|2018-03-25 05:49:...|
|1230.5|2018-03-25 05:50:...|
+------+--------------------+
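If separate date and time columns are also needed (as mentioned in the question), date_format on the casted timestamp is one option. A rough sketch, assuming the string-typed end_t dataframe from the first snippet:
import org.apache.spark.sql.functions.{col, date_format}
// cast string -> double -> timestamp, then format the date and time parts separately
val withParts = df
  .withColumn("end_ts", col("end_t").cast("double").cast("timestamp"))
  .withColumn("end_date", date_format(col("end_ts"), "yyyy-MM-dd"))
  .withColumn("end_time", date_format(col("end_ts"), "HH:mm:ss"))
withParts.show(false)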

drop duplicate words in long string using scala

I am curious to learn how to drop duplicate words within strings that are contained in a dataframe column. I would like to accomplish this using Scala.
By way of example, below you can find a dataframe I would like to transform.
dataframe:
val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
+----+-------+---+
|KEY1| KEY2| ID|
+----+-------+---+
| 66|a,b,c,a| 4|
| 67|a,f,g,t| 0|
| 70|b,b,b,d| 4|
+----+-------+---+
result:
+----+----------+---+
|KEY1| KEY2| ID|
+----+----------+---+
| 66| a, b, c| 4|
| 67|a, f, g, t| 0|
| 70| b, d| 4|
+----+----------+---+
Using PySpark I have used the following code to get the above result, but I could not rewrite it in Scala. Do you have any suggestions? Thank you in advance.
pyspark code:
# dataframe
l = [("66", "a,b,c,a", "4"),("67", "a,f,g,t", "0"),("70", "b,b,b,d", "4")]
#spark.createDataFrame(l).show()
df1 = spark.createDataFrame(l, ['KEY1', 'KEY2','ID'])
# function
import re
import numpy as np
# drop duplicates in a row
def drop_duplicates(row):
    # split the string by ',', drop duplicates and join back with ', '
    words = re.split(',', row)
    return ', '.join(np.unique(words))
# drop duplicates
from pyspark.sql.functions import udf
drop_duplicates_udf = udf(drop_duplicates)
dataset2 = df1.withColumn('KEY2', drop_duplicates_udf(df1.KEY2))
dataset2.show()
Dataframe solution
scala> val df = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
df: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> val distinct :String => String = _.split(",").toSet.mkString(",")
distinct: String => String = <function1>
scala> val distinct_id = udf (distinct)
distinct_id: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> df.select('key1,distinct_id('key2).as("distinct"),'id).show
+----+--------+---+
|key1|distinct| id|
+----+--------+---+
| 66| a,b,c| 4|
| 67| a,f,g,t| 0|
| 70| b,d| 4|
+----+--------+---+
scala>
There could be a more optimized solution but this could help you.
val rdd2 = dataset1.rdd.map(x => x(1).toString.split(",").distinct.mkString(", "))
// and then transform it to dataset
// or
val distinctUDF = spark.udf.register("distinctUDF", (s: String) => s.split(",").distinct.mkString(", "))
dataset1.createTempView("dataset1")
spark.sql("Select KEY1, distinctUDF(KEY2), ID from dataset1").show
import org.apache.spark.sql._
val dfUpdated = dataset1.rdd.map{
case Row(x: String, y: String,z:String) => (x,y.split(",").distinct.mkString(", "),z)
}.toDF(dataset1.columns:_*)
In spark-shell:
scala> val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
dataset1: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> dataset1.show
+----+-------+---+
|KEY1| KEY2| ID|
+----+-------+---+
| 66|a,b,c,a| 4|
| 67|a,f,g,t| 0|
| 70|b,b,b,d| 4|
+----+-------+---+
scala> val dfUpdated = dataset1.rdd.map{
case Row(x: String, y: String,z:String) => (x,y.split(",").distinct.mkString(", "),z)
}.toDF(dataset1.columns:_*)
dfUpdated: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> dfUpdated.show
+----+----------+---+
|KEY1| KEY2| ID|
+----+----------+---+
| 66| a, b, c| 4|
| 67|a, f, g, t| 0|
| 70| b, d| 4|
+----+----------+---+
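If you are on Spark 2.4 or later, the same result can most likely be obtained without any UDF or RDD round-trip by combining the built-in split, array_distinct and array_join functions. A minimal sketch against dataset1 from above:
import org.apache.spark.sql.functions.{array_distinct, array_join, col, split}
// split into an array, drop duplicate elements, join back with ", "
val deduped = dataset1.withColumn("KEY2",
  array_join(array_distinct(split(col("KEY2"), ",")), ", "))
deduped.show()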

delete constant columns spark having issue with timestamp column

Hi guys, I wrote this code to drop columns with constant values.
I start by computing the standard deviation and then drop the columns whose standard deviation equals zero, but I get the issue below when a column has a timestamp type. What should I do?
cannot resolve 'stddev_samp(time.1)' due to data type mismatch: argument 1 requires double type, however, 'time.1' is of timestamp type.;;
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
//val df = spark.range(1, 1000).withColumn("X2", lit(0)).toDF("X1","X2")
val df = spark.read.option("inferSchema", "true").option("header", "true").csv("C:/Users/mhattabi/Desktop/dataTestCsvFile/dataTest2.txt")
df.show(5)
//df.columns.map(p=>s"`${p}`")
//val aggs = df.columns.map(c => stddev(c).as(c))
val aggs = df.columns.map(p=>stddev(s"`${p}`").as(p))
val stddevs = df.select(aggs: _*)
val columnsToKeep: Seq[Column] = stddevs.first // Take first row
.toSeq // convert to Seq[Any]
.zip(df.columns) // zip with column names
.collect {
// keep only names where stddev != 0
case (s: Double, c) if s != 0.0 => col(c)
}
df.select(columnsToKeep: _*).show(5,false)
Using stddev
stddev is only defined on numeric columns. If you want to compute the standard deviation of a date column, you will need to convert it to a unix timestamp (a numeric value) first:
scala> var myDF = (0 to 10).map(x => (x, scala.util.Random.nextDouble)).toDF("id", "rand_double")
myDF: org.apache.spark.sql.DataFrame = [id: int, rand_double: double]
scala> myDF = myDF.withColumn("Date", current_date())
myDF: org.apache.spark.sql.DataFrame = [id: int, rand_double: double ... 1 more field]
scala> myDF.printSchema
root
|-- id: integer (nullable = false)
|-- rand_double: double (nullable = false)
|-- Date: date (nullable = false)
scala> myDF.show
+---+-------------------+----------+
| id| rand_double| Date|
+---+-------------------+----------+
| 0| 0.3786008989478248|2017-03-21|
| 1| 0.5968932024004612|2017-03-21|
| 2|0.05912760417456575|2017-03-21|
| 3|0.29974600653895667|2017-03-21|
| 4| 0.8448407414817856|2017-03-21|
| 5| 0.2049495659443249|2017-03-21|
| 6| 0.4184846380144779|2017-03-21|
| 7|0.21400484330739022|2017-03-21|
| 8| 0.9558142791013501|2017-03-21|
| 9|0.32530639391058036|2017-03-21|
| 10| 0.5100585655062743|2017-03-21|
+---+-------------------+----------+
scala> myDF = myDF.withColumn("Date", unix_timestamp($"Date"))
myDF: org.apache.spark.sql.DataFrame = [id: int, rand_double: double ... 1 more field]
scala> myDF.printSchema
root
|-- id: integer (nullable = false)
|-- rand_double: double (nullable = false)
|-- Date: long (nullable = true)
scala> myDF.show
+---+-------------------+----------+
| id| rand_double| Date|
+---+-------------------+----------+
| 0| 0.3786008989478248|1490072400|
| 1| 0.5968932024004612|1490072400|
| 2|0.05912760417456575|1490072400|
| 3|0.29974600653895667|1490072400|
| 4| 0.8448407414817856|1490072400|
| 5| 0.2049495659443249|1490072400|
| 6| 0.4184846380144779|1490072400|
| 7|0.21400484330739022|1490072400|
| 8| 0.9558142791013501|1490072400|
| 9|0.32530639391058036|1490072400|
| 10| 0.5100585655062743|1490072400|
+---+-------------------+----------+
At this point all of the columns are numeric so your code will run fine:
scala> :pa
// Entering paste mode (ctrl-D to finish)
val aggs = myDF.columns.map(p=>stddev(s"`${p}`").as(p))
val stddevs = myDF.select(aggs: _*)
val columnsToKeep: Seq[Column] = stddevs.first // Take first row
.toSeq // convert to Seq[Any]
.zip(myDF.columns) // zip with column names
.collect {
// keep only names where stddev != 0
case (s: Double, c) if s != 0.0 => col(c)
}
myDF.select(columnsToKeep: _*).show(false)
// Exiting paste mode, now interpreting.
+---+-------------------+
|id |rand_double |
+---+-------------------+
|0 |0.3786008989478248 |
|1 |0.5968932024004612 |
|2 |0.05912760417456575|
|3 |0.29974600653895667|
|4 |0.8448407414817856 |
|5 |0.2049495659443249 |
|6 |0.4184846380144779 |
|7 |0.21400484330739022|
|8 |0.9558142791013501 |
|9 |0.32530639391058036|
|10 |0.5100585655062743 |
+---+-------------------+
aggs: Array[org.apache.spark.sql.Column] = Array(stddev_samp(id) AS `id`, stddev_samp(rand_double) AS `rand_double`, stddev_samp(Date) AS `Date`)
stddevs: org.apache.spark.sql.DataFrame = [id: double, rand_double: double ... 1 more field]
columnsToKeep: Seq[org.apache.spark.sql.Column] = ArrayBuffer(id, rand_double)
Using countDistinct
All that being said, it would be more general to use countDistinct:
scala> val distCounts = myDF.select(myDF.columns.map(c => countDistinct(c) as c): _*).first.toSeq.zip(myDF.columns)
distCounts: Seq[(Any, String)] = ArrayBuffer((11,id), (11,rand_double), (1,Date))
scala> distCounts.foldLeft(myDF)((accDF, dc_col) => if (dc_col._1 == 1) accDF.drop(dc_col._2) else accDF).show
+---+-------------------+
| id| rand_double|
+---+-------------------+
| 0| 0.3786008989478248|
| 1| 0.5968932024004612|
| 2|0.05912760417456575|
| 3|0.29974600653895667|
| 4| 0.8448407414817856|
| 5| 0.2049495659443249|
| 6| 0.4184846380144779|
| 7|0.21400484330739022|
| 8| 0.9558142791013501|
| 9|0.32530639391058036|
| 10| 0.5100585655062743|
+---+-------------------+
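Another option, if you would rather not convert columns by hand, is to replace any date or timestamp columns with their epoch seconds up front, so the stddev-based logic from the question runs on an arbitrary schema. A hedged sketch, assuming df is the CSV dataframe loaded in the question:
import org.apache.spark.sql.functions.{col, unix_timestamp}
import org.apache.spark.sql.types.{DateType, TimestampType}
// swap date/timestamp columns for epoch seconds so stddev accepts every column
val numericSafe = df.schema.fields.foldLeft(df) { (acc, f) =>
  f.dataType match {
    case DateType | TimestampType => acc.withColumn(f.name, unix_timestamp(col(f.name)))
    case _ => acc
  }
}
After this, the original aggs/columnsToKeep code should run unchanged against numericSafe.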

Spark - Csv data split with scala

test.csv
name,key1,key2
A,1,2
B,1,3
C,4,3
I want to change this data like this (as dataset or rdd)
whatIwant.csv
name,key,newkeyname
A,1,KEYA
A,2,KEYB
B,1,KEYA
B,3,KEYB
C,4,KEYA
C,3,KEYB
I loaded data with read method.
val df = spark.read
.option("header", true)
.option("charset", "euc-kr")
.csv(csvFilePath)
I can load each dataset like (name, key1) or (name, key2), and union them by union, but want to do this in single spark session.
Any idea of this?
These do not work:
val df2 = df.select( df("TAG_NO"), df.map { x => (x.getAs[String]("MK_VNDRNM"), x.getAs[String]("WK_ORD_DT")) })
val df2 = df.select( df("TAG_NO"), Seq(df("TAG_NO"), df("WK_ORD_DT")))
This can be accomplished with explode and a udf:
scala> var df = Seq(("A", 1, 2), ("B", 1, 3), ("C", 4, 3)).toDF("name", "key1", "key2")
df: org.apache.spark.sql.DataFrame = [name: string, key1: int ... 1 more field]
scala> df.show
+----+----+----+
|name|key1|key2|
+----+----+----+
| A| 1| 2|
| B| 1| 3|
| C| 4| 3|
+----+----+----+
scala> val explodeUDF = udf((v1: String, v2: String) => Vector((v1, "Key1"), (v2, "Key2")))
explodeUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(StructType(StructField(_1,StringType,true), StructField(_2,StringType,true)),true),Some(List(StringType, StringType)))
scala> df = df.withColumn("TMP", explode(explodeUDF($"key1", $"key2"))).drop("key1", "key2")
df: org.apache.spark.sql.DataFrame = [name: string, TMP: struct<_1: string, _2: string>]
scala> df = df.withColumn("key", $"TMP".apply("_1")).withColumn("new key name", $"TMP".apply("_2"))
df: org.apache.spark.sql.DataFrame = [name: string, TMP: struct<_1: string, _2: string> ... 2 more fields]
scala> df = df.drop("TMP")
df: org.apache.spark.sql.DataFrame = [name: string, key: string ... 1 more field]
scala> df.show
+----+---+------------+
|name|key|new key name|
+----+---+------------+
| A| 1| Key1|
| A| 2| Key2|
| B| 1| Key1|
| B| 3| Key2|
| C| 4| Key1|
| C| 3| Key2|
+----+---+------------+
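For what it's worth, the same unpivot can probably be done without a UDF using the SQL stack function via selectExpr. A minimal sketch, assuming the original name/key1/key2 dataframe and taking the 'KEYA'/'KEYB' labels from the desired output:
val src = Seq(("A", 1, 2), ("B", 1, 3), ("C", 4, 3)).toDF("name", "key1", "key2")
// stack(2, ...) emits two rows per input row: (key1, 'KEYA') and (key2, 'KEYB')
val unpivoted = src.selectExpr("name", "stack(2, key1, 'KEYA', key2, 'KEYB') as (key, newkeyname)")
unpivoted.show()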

Aggregations in JDBCRDD or RDD

I'm brand new to Scala and Spark, and I'm trying to run a SQL query over SQL Server with Spark using JdbcRDD, and do some transformations on it with mappings and aggregations.
This is what I have: a table with n string columns and m numeric columns, like
"A", "A1",1,2
"A", "A1",4,3
"A", "A2",3,4
"B", "B1",6,7
...
...
What I'm looking for is to create a hierarchical structure, grouping the strings and aggregating the numeric columns, like
A
|->A1
|->(5,5)
|->A2
|->(3,4)
B
|->B1
|->(6,7)
I was able to create the hierarchy, but I'm not able to perform the aggregation on the list of numeric values.
If you're loading your data over JDBC I would simply use DataFrames:
import sqlContext.implicits._
import org.apache.spark.sql.functions.sum
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
val options: Map[String, String] = ???
val df: DataFrame = sqlContext.read
.format("jdbc")
.options(options)
.load()
.toDF("k1", "k2", "v1", "v2")
df.printSchema
// root
// |-- k1: string (nullable = true)
// |-- k2: string (nullable = true)
// |-- v1: integer (nullable = true)
// |-- v2: integer (nullable = true)
df.show
// +---+---+---+---+
// | k1| k2| v1| v2|
// +---+---+---+---+
// | A| A1| 1| 2|
// | A| A1| 4| 3|
// | A| A2| 3| 4|
// | B| B1| 6| 7|
// +---+---+---+---+
With input like the above, all you need is a basic aggregation:
df
.groupBy($"k1", $"k2")
.agg(sum($"v1").alias("v1"), sum($"v2").alias("v2")).show
// +---+---+---+---+
// | k1| k2| v1| v2|
// +---+---+---+---+
// | A| A1| 5| 5|
// | A| A2| 3| 4|
// | B| B1| 6| 7|
// +---+---+---+---+
If you have RDD like this:
val rdd: RDD[(String, String, Int, Int)] = ???
rdd.first
// (String, String, Int, Int) = (A,A1,1,2)
There is no reason to build a complex hierarchy. A simple pair RDD should be enough:
val aggregated: RDD[((String, String), breeze.linalg.Vector[Int])] = rdd
.map{case (k1, k2, v1, v2) => ((k1, k2), breeze.linalg.Vector(v1, v2))}
.reduceByKey(_ + _)
aggregated.first
// ((String, String), breeze.linalg.Vector[Int]) = ((A,A2),DenseVector(3, 4))
Keeping a hierarchical structure is inefficient, but you can group the above RDD like this:
aggregated.map{case ((k1, k2), v) => (k1, (k2, v))}.groupByKey
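If the nested A -> A1 -> (5,5) view is really wanted, one way (for small results only, since it collects to the driver) is to fold the aggregated pairs into nested maps. A sketch assuming the aggregated RDD above:
// collect the ((k1, k2), vector) pairs and nest them into Map[k1, Map[k2, vector]]
val hierarchy: Map[String, Map[String, breeze.linalg.Vector[Int]]] =
  aggregated
    .collect()
    .groupBy { case ((k1, _), _) => k1 }
    .mapValues(_.map { case ((_, k2), v) => k2 -> v }.toMap)
    .toMap
hierarchy.foreach { case (k1, children) => println(s"$k1 -> $children") }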