spark flatten records using a key column - scala

I am trying to implement logic to flatten records using the Spark/Scala API, and I was trying to use the map function.
Could you please help me with the easiest approach to solve this problem?
Assume that, for a given key, I need to have 3 process codes.
Input dataframe-->
Keycol|processcode
John |1
Mary |8
John |2
John |4
Mary |1
Mary |7
==============================
Output dataframe-->
Keycol|processcode1|processcode2|processcode3
John |1 |2 |4
Mary |8 |1 |7

Assuming same number of rows per Keycol, one approach would be to aggregate processcode into an array for each Keycol and expand out into individual columns:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("John", 1),
  ("Mary", 8),
  ("John", 2),
  ("John", 4),
  ("Mary", 1),
  ("Mary", 7)
).toDF("Keycol", "processcode")

// Collect all process codes per key into an array
val df2 = df.groupBy("Keycol").agg(collect_list("processcode").as("processcode"))

// Determine how many codes there are per key (assumed equal for every key)
val numCols = df2.select(size(col("processcode"))).as[Int].first

// Expand the array out into one column per element
val cols = (0 until numCols).map(i => col("processcode")(i))
df2.select(col("Keycol") +: cols: _*).show
+------+--------------+--------------+--------------+
|Keycol|processcode[0]|processcode[1]|processcode[2]|
+------+--------------+--------------+--------------+
| Mary| 8| 1| 7|
| John| 1| 2| 4|
+------+--------------+--------------+--------------+

A couple of alternative approaches.
SQL
df.createOrReplaceTempView("tbl")
val q = """
select keycol,
c[0] processcode1,
c[1] processcode2,
c[2] processcode3
from (select keycol, collect_list(processcode) c
from tbl
group by keycol) t0
"""
sql(q).show
Result
scala> sql(q).show
+------+------------+------------+------------+
|keycol|processcode1|processcode2|processcode3|
+------+------------+------------+------------+
| Mary| 1| 7| 8|
| John| 4| 1| 2|
+------+------------+------------+------------+
PairRDDFunctions (groupByKey) + mapPartitions
import org.apache.spark.sql.Row
val my_rdd = df
  .map { case Row(a1: String, a2: Int) => (a1, a2) }
  .rdd
  .groupByKey()
  .map(t => (t._1, t._2.toList))

def f(iter: Iterator[(String, List[Int])]): Iterator[Row] = {
  var res = List[Row]()
  while (iter.hasNext) {
    val (keycol: String, c: List[Int]) = iter.next
    res = res ::: List(Row(keycol, c(0), c(1), c(2)))
  }
  res.iterator
}

import org.apache.spark.sql.types.{StringType, IntegerType, StructField, StructType}

val schema = new StructType()
  .add(StructField("Keycol", StringType, true))
  .add(StructField("processcode1", IntegerType, true))
  .add(StructField("processcode2", IntegerType, true))
  .add(StructField("processcode3", IntegerType, true))
spark.createDataFrame(my_rdd.mapPartitions(f, true), schema).show
Result
scala> spark.createDataFrame(my_rdd.mapPartitions(f, true), schema).show
+------+------------+------------+------------+
|Keycol|processcode1|processcode2|processcode3|
+------+------------+------------+------------+
| Mary| 1| 7| 8|
| John| 4| 1| 2|
+------+------------+------------+------------+
Please keep in mind that in all cases the order of the process-code values in the output columns is undetermined unless it is explicitly specified.
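A minimal sketch of one way to pin the order down, assuming the source also carries an ordering column (a hypothetical dfWithSeq with columns Keycol, seq, processcode, where seq defines the desired order within each key): collect (seq, processcode) structs, sort them, and project the codes out in that fixed order.
import org.apache.spark.sql.functions._

// Hypothetical dfWithSeq: same data as df plus an explicit "seq" ordering column.
val orderedDf = dfWithSeq
  .groupBy("Keycol")
  .agg(sort_array(collect_list(struct(col("seq"), col("processcode")))).as("c"))
  .select(
    col("Keycol"),
    col("c")(0)("processcode").as("processcode1"),
    col("c")(1)("processcode").as("processcode2"),
    col("c")(2)("processcode").as("processcode3"))

orderedDf.show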


Extracting array elements and mapping them to case class

The following is the output I am getting after performing a groupByKey, mapGroups and then a joinWith operation on the case class Datasets:
+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_1 |_2 |
+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[IND0001,Christopher,Black] |null |
|[IND0002,Madeleine,Kerr] |[IND0002,WrappedArray([IND0002,ACC0155,323], [IND0002,ACC0262,60])] |
|[IND0003,Sarah,Skinner] |[IND0003,WrappedArray([IND0003,ACC0235,631], [IND0003,ACC0486,400], [IND0003,ACC0540,53])] |
|[IND0004,Rachel,Parsons] |[IND0004,WrappedArray([IND0004,ACC0116,965])] |
|[IND0005,Oliver,Johnston] |[IND0005,WrappedArray([IND0005,ACC0146,378], [IND0005,ACC0201,34], [IND0005,ACC0450,329])] |
|[IND0006,Carl,Metcalfe] |[IND0006,WrappedArray([IND0006,ACC0052,57], [IND0006,ACC0597,547])] |
The code is as follows:
val test = accountDS.groupByKey(_.customerId).mapGroups{ case (id, xs) => (id, xs.toSeq)}
test.show(false)
val newTest = customerDS.joinWith(test, customerDS("customerId") === test("_1"), "leftouter")
newTest.show(500,false)
Now I want to take the arrays and output them in a format as follows:
+----------+-----------+----------+---------------------------------------------------------------------+--------------+------------+-----------------+
|customerId|forename |surname |accounts |numberAccounts|totalBalance|averageBalance |
+----------+-----------+----------+---------------------------------------------------------------------+--------------+------------+-----------------+
|IND0001 |Christopher|Black |[] |0 |0 |0.0 |
|IND0002 |Madeleine |Kerr |[[IND0002,ACC0155,323], [IND0002,ACC0262,60]] |2 |383 |191.5 |
|IND0003 |Sarah |Skinner |[[IND0003,ACC0235,631], [IND0003,ACC0486,400], [IND0003,ACC0540,53]] |3 |1084 |361.3333333333333|
Note: I cannot use spark.sql.functions._ at all --> training academy rules :(
How do I get the above output which should be mapped to a case class as follows:
case class CustomerAccountOutput(
  customerId: String,
  forename: String,
  surname: String,
  // Accounts for this customer
  accounts: Seq[AccountData],
  // Statistics of the accounts
  numberAccounts: Int,
  totalBalance: Long,
  averageBalance: Double
)
I really need help with this. Stuck with this for weeks without a working solution.
Let's say you have the following DataFrame:
val sourceDf = Seq((1, Array("aa", "CA")), (2, Array("bb", "OH"))).toDF("id", "data_arr")
sourceDf.show()
// output:
+---+--------+
| id|data_arr|
+---+--------+
| 1|[aa, CA]|
| 2|[bb, OH]|
+---+--------+
and you want to convert it to the following schema:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val destSchema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("name", StringType, true),
  StructField("state", StringType, true)
))
You can do:
import scala.collection.mutable
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

val destDf: DataFrame = sourceDf
  .map { sourceRow =>
    Row(
      sourceRow(0),
      sourceRow.getAs[mutable.WrappedArray[String]](1)(0),
      sourceRow.getAs[mutable.WrappedArray[String]](1)(1))
  }(RowEncoder(destSchema))
destDf.show()
// output:
+---+----+-----+
| id|name|state|
+---+----+-----+
| 1| aa| CA|
| 2| bb| OH|
+---+----+-----+

Add values to a dataframe against some particular ID in Spark Scala

I have the following dataframe:
ID Name City
1 Ali swl
2 Sana lhr
3 Ahad khi
4 ABC fsd
And a list of values like (1,2,1):
val nums: List[Int] = List(1, 2, 1)
I want to add these values against ID == 3. So that DataFrame looks like:
ID Name City newCol newCol2 newCol3
1 Ali swl null null null
2 Sana lhr null null null
3 Ahad khi 1 2 1
4 ABC fsd null null null
I wonder if it is possible? Any help will be appreciated. Thanks
Yes, it's possible.
Use when for populating matched values and otherwise for the non-matched ones.
I have used zipWithIndex to make the column names unique.
Please check the code below.
scala> import org.apache.spark.sql.functions._
scala> val df = Seq((1,"Ali","swl"),(2,"Sana","lhr"),(3,"Ahad","khi"),(4,"ABC","fsd")).toDF("id","name","city") // Creating DataFrame with given sample data.
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> val nums = List(1,2,1) // List values.
nums: List[Int] = List(1, 2, 1)
scala> val filterData = List(3) // IDs for which the new columns should be populated
scala> spark.time{ nums.zipWithIndex.foldLeft(df)((df,c) => df.withColumn(s"newCol${c._2}",when($"id".isin(filterData:_*),c._1).otherwise(null))).show(false) } // Used zipWithIndex to make column names unique.
+---+----+----+-------+-------+-------+
|id |name|city|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
|1  |Ali |swl |null   |null   |null   |
|2  |Sana|lhr |null   |null   |null   |
|3  |Ahad|khi |1      |2      |1      |
|4  |ABC |fsd |null   |null   |null   |
+---+----+----+-------+-------+-------+
Time taken: 43 ms
scala>
Firstly, you can convert the list to a DataFrame with a single array column, and then "decompose" the array column into separate columns as follows:
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._
val numsDf =
Seq(nums)
.toDF("nums")
.select(nums.indices.map(i => col("nums")(i).alias(s"newCol$i")): _*)
After that you can use outer join for joining data to numsDf with ID == 3 condition as follows:
val resultDf = data.join(numsDf, data.col("ID") === lit(3), "outer")
resultDf.show() will print:
+---+----+----+-------+-------+-------+
| ID|Name|City|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
|  1| Ali| swl|   null|   null|   null|
|  2|Sana| lhr|   null|   null|   null|
|  3|Ahad| khi|      1|      2|      1|
|  4| ABC| fsd|   null|   null|   null|
+---+----+----+-------+-------+-------+
Make sure you have added the spark.sql.crossJoin.enabled = true option to the Spark session:
val spark = SparkSession.builder()
...
.config("spark.sql.crossJoin.enabled", value = true)
.getOrCreate()

How to split Comma-separated multiple columns into multiple rows?

I have a data-frame with N fields as mentioned below. The number of columns and the length of the values will vary.
Input Table:
+--------------+-----------+-----------------------+
|Date |Amount |Status |
+--------------+-----------+-----------------------+
|2019,2018,2017|100,200,300|IN,PRE,POST |
|2018 |73 |IN |
|2018,2017 |56,89 |IN,PRE |
+--------------+-----------+-----------------------+
I have to convert it into the below format with one sequence column.
Expected Output Table:
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019|100   |IN    |1       |
|2018|200   |PRE   |2       |
|2017|300   |POST  |3       |
|2018|73    |IN    |1       |
|2018|56    |IN    |1       |
|2017|89    |PRE   |2       |
+----+------+------+--------+
I have tried using explode, but explode only takes one array at a time.
var df = dataRefined.withColumn("TOT_OVRDUE_TYPE", explode(split($"TOT_OVRDUE_TYPE", "\\,"))).toDF
var df1 = df.withColumn("TOT_OD_TYPE_AMT", explode(split($"TOT_OD_TYPE_AMT", "\\,"))).show
Does someone know how I can do it? Thank you for your help.
Here is another approach using posexplode for each column and joining all produced dataframes into one:
import org.apache.spark.sql.functions.{posexplode, monotonically_increasing_id, col}
import spark.implicits._
val df = Seq(
(Seq("2019", "2018", "2017"), Seq("100", "200", "300"), Seq("IN", "PRE", "POST")),
(Seq("2018"), Seq("73"), Seq("IN")),
(Seq("2018", "2017"), Seq("56", "89"), Seq("IN", "PRE")))
.toDF("Date","Amount", "Status")
.withColumn("idx", monotonically_increasing_id)
df.columns.filter(_ != "idx").map{
c => df.select($"idx", posexplode(col(c))).withColumnRenamed("col", c)
}
.reduce((ds1, ds2) => ds1.join(ds2, Seq("idx", "pos")))
.select($"Date", $"Amount", $"Status", $"pos".plus(1).as("Sequence"))
.show
Output:
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
You can achieve this by using the built-in DataFrame functions arrays_zip, split and posexplode.
Explanation:
scala> val df = Seq(("2019,2018,2017","100,200,300","IN,PRE,POST"),
     |   ("2018","73","IN"),
     |   ("2018,2017","56,89","IN,PRE")).toDF("date","amount","status")
scala> :paste
// Split each string column on ',' to create arrays, zip the arrays together,
// and posexplode the zipped array: this yields a position p plus a struct colum
// holding one element from each array.
df.selectExpr("""posexplode(
      arrays_zip(
        split(date, ','),
        split(amount, ','),
        split(status, ','))) as (p, colum)""")
  .selectExpr("colum.`0` as Date",   // field 0 of the zipped struct is the date
              "colum.`1` as Amount",
              "colum.`2` as Status",
              "p+1 as Sequence")     // positions are zero-based, so add 1
  .show()
Result:
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
Yes, I personally also find explode a bit annoying, and in your case I would probably go with a flatMap instead:
import spark.implicits._
import org.apache.spark.sql.Row
val df = spark.sparkContext.parallelize(Seq((Seq(2019,2018,2017), Seq(100,200,300), Seq("IN","PRE","POST")),(Seq(2018), Seq(73), Seq("IN")),(Seq(2018,2017), Seq(56,89), Seq("IN","PRE")))).toDF()
val transformedDF = df
.flatMap{case Row(dates: Seq[Int], amounts: Seq[Int], statuses: Seq[String]) =>
dates.indices.map(index => (dates(index), amounts(index), statuses(index), index+1))}
.toDF("Date", "Amount", "Status", "Sequence")
Output:
transformedDF.show
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
Assuming the number of data elements in each column is the same for each row:
First, I recreated your DataFrame
import org.apache.spark.sql._
import scala.collection.mutable.ListBuffer
import spark.implicits._
val df = Seq(("2019,2018,2017", "100,200,300", "IN,PRE,POST"), ("2018", "73", "IN"),
("2018,2017", "56,89", "IN,PRE")).toDF("Date", "Amount", "Status")
Next, I split the rows and added a sequence value, then converted back to a DF:
val exploded = df.rdd.flatMap(row => {
  val buffer = new ListBuffer[(String, String, String, Int)]
  val dateSplit = row(0).toString.split("\\,", -1)
  val amountSplit = row(1).toString.split("\\,", -1)
  val statusSplit = row(2).toString.split("\\,", -1)
  val seqSize = dateSplit.size
  for (i <- 0 to seqSize - 1)
    buffer += Tuple4(dateSplit(i), amountSplit(i), statusSplit(i), i + 1)
  buffer.toList
}).toDF((df.columns :+ "Sequence"): _*)
I'm sure there are other ways to do it without first converting the DF to an RDD, but this will still result in a DF with the correct answer.
Let me know if you have any questions.
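For the record, here is a minimal DataFrame-only sketch of that idea (no RDD conversion), assuming Spark 2.4+ for arrays_zip and the same df as above:
import org.apache.spark.sql.functions._
import spark.implicits._

// Split each string column into an array, zip the arrays element-wise,
// then posexplode to get a zero-based position plus one struct per element.
val arrs = df.select(
  split($"Date", ",").as("Date"),
  split($"Amount", ",").as("Amount"),
  split($"Status", ",").as("Status"))

val noRddDf = arrs
  .select(posexplode(arrays_zip($"Date", $"Amount", $"Status")))
  .select(
    $"col"("Date").as("Date"),
    $"col"("Amount").as("Amount"),
    $"col"("Status").as("Status"),
    ($"pos" + 1).as("Sequence"))

noRddDf.show()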
I took advantage of transpose to zip all the sequences by position and then did a posexplode. The selects on the DataFrames are dynamic, to satisfy the condition in the question that the number of columns and the length of the values will vary.
import org.apache.spark.sql.functions._
val df = Seq(
("2019,2018,2017", "100,200,300", "IN,PRE,POST"),
("2018", "73", "IN"),
("2018,2017", "56,89", "IN,PRE")
).toDF("Date", "Amount", "Status")
df: org.apache.spark.sql.DataFrame = [Date: string, Amount: string ... 1 more field]
scala> df.show(false)
+--------------+-----------+-----------+
|Date |Amount |Status |
+--------------+-----------+-----------+
|2019,2018,2017|100,200,300|IN,PRE,POST|
|2018 |73 |IN |
|2018,2017 |56,89 |IN,PRE |
+--------------+-----------+-----------+
scala> def transposeSeqOfSeq[S](x:Seq[Seq[S]]): Seq[Seq[S]] = { x.transpose }
transposeSeqOfSeq: [S](x: Seq[Seq[S]])Seq[Seq[S]]
scala> val myUdf = udf { transposeSeqOfSeq[String] _}
myUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(ArrayType(StringType,true),true))))
scala> val df2 = df.select(df.columns.map(c => split(col(c), ",") as c): _*)
df2: org.apache.spark.sql.DataFrame = [Date: array<string>, Amount: array<string> ... 1 more field]
scala> df2.show(false)
+------------------+---------------+---------------+
|Date |Amount |Status |
+------------------+---------------+---------------+
|[2019, 2018, 2017]|[100, 200, 300]|[IN, PRE, POST]|
|[2018] |[73] |[IN] |
|[2018, 2017] |[56, 89] |[IN, PRE] |
+------------------+---------------+---------------+
scala> val df3 = df2.withColumn("allcols", array(df.columns.map(c => col(c)): _*))
df3: org.apache.spark.sql.DataFrame = [Date: array<string>, Amount: array<string> ... 2 more fields]
scala> df3.show(false)
+------------------+---------------+---------------+------------------------------------------------------+
|Date |Amount |Status |allcols |
+------------------+---------------+---------------+------------------------------------------------------+
|[2019, 2018, 2017]|[100, 200, 300]|[IN, PRE, POST]|[[2019, 2018, 2017], [100, 200, 300], [IN, PRE, POST]]|
|[2018] |[73] |[IN] |[[2018], [73], [IN]] |
|[2018, 2017] |[56, 89] |[IN, PRE] |[[2018, 2017], [56, 89], [IN, PRE]] |
+------------------+---------------+---------------+------------------------------------------------------+
scala> val df4 = df3.withColumn("ab", myUdf($"allcols")).select($"ab", posexplode($"ab"))
df4: org.apache.spark.sql.DataFrame = [ab: array<array<string>>, pos: int ... 1 more field]
scala> df4.show(false)
+------------------------------------------------------+---+-----------------+
|ab |pos|col |
+------------------------------------------------------+---+-----------------+
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|0 |[2019, 100, IN] |
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|1 |[2018, 200, PRE] |
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|2 |[2017, 300, POST]|
|[[2018, 73, IN]] |0 |[2018, 73, IN] |
|[[2018, 56, IN], [2017, 89, PRE]] |0 |[2018, 56, IN] |
|[[2018, 56, IN], [2017, 89, PRE]] |1 |[2017, 89, PRE] |
+------------------------------------------------------+---+-----------------+
scala> val selCols = (0 until df.columns.length).map(i => $"col".getItem(i).as(df.columns(i))) :+ ($"pos"+1).as("Sequence")
selCols: scala.collection.immutable.IndexedSeq[org.apache.spark.sql.Column] = Vector(col[0] AS `Date`, col[1] AS `Amount`, col[2] AS `Status`, (pos + 1) AS `Sequence`)
scala> df4.select(selCols:_*).show(false)
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019|100 |IN |1 |
|2018|200 |PRE |2 |
|2017|300 |POST |3 |
|2018|73 |IN |1 |
|2018|56 |IN |1 |
|2017|89 |PRE |2 |
+----+------+------+--------+
This is why I love the Spark core APIs: just with the help of map and flatMap you can handle many problems. Pass your df and an instance of SQLContext to the method below and it will give the desired result:
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

def reShapeDf(df: DataFrame, sqlContext: SQLContext): DataFrame = {
  // Pull the three comma-separated string columns out of each row
  val rdd = df.rdd.map(m => (m.getAs[String](0), m.getAs[String](1), m.getAs[String](2)))
  // Split each column and zip the pieces positionally, one output record per piece
  val rdd1 = rdd.flatMap(a => a._1.split(",").zip(a._2.split(",")).zip(a._3.split(",")))
  // Flatten the nested tuples ((a, b), c) into (a, b, c)
  val rdd2 = rdd1.map { case ((a, b), c) => (a, b, c) }
  sqlContext.createDataFrame(rdd2.map(m => Row.fromTuple(m)), df.schema)
}
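A hypothetical usage, assuming the three-string-column df built above from the sample data and the SQLContext available in spark-shell:
val reshaped = reShapeDf(df, spark.sqlContext)
reshaped.show(false)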

scala spark use expr to value inside a column

I need to add a new column to a dataframe with a boolean value, evaluating a column inside the dataframe. For example, I have a dataframe of
+----+----+----+----+----+-----------+----------------+
|colA|colB|colC|colD|colE|colPRODRTCE| colCOND|
+----+----+----+----+----+-----------+----------------+
| 1| 1| 1| 1| 3| 39|colA=1 && colB>0|
| 1| 1| 1| 1| 3| 45| colD=1|
| 1| 1| 1| 1| 3| 447|colA>8 && colC=1|
+----+----+----+----+----+-----------+----------------+
In my new column I need to evaluate if the expression of colCOND is true or false.
It's easy if you have something like this:
val df = List(
(1,1,1,1,3),
(2,2,3,4,4)
).toDF("colA", "colB", "colC", "colD", "colE")
val myExpression = "colA<colC"
import org.apache.spark.sql.functions.expr
df.withColumn("colRESULT",expr(myExpression)).show()
+----+----+----+----+----+---------+
|colA|colB|colC|colD|colE|colRESULT|
+----+----+----+----+----+---------+
| 1| 1| 1| 1| 3| false|
| 2| 2| 3| 4| 4| true|
+----+----+----+----+----+---------+
But I have to evaluate a different expression in each row, and that expression is inside the colCOND column.
I thought of creating a UDF over all the columns, but my real dataframe has a lot of columns. How can I do it?
Thanks to everyone
If && is changed to AND (so that each condition is a valid Spark SQL expression), you can try:
package spark
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK
object DataFrameLogicWithColumn extends App{
val spark = SparkSession.builder()
.master("local")
.appName("DataFrame-example")
.getOrCreate()
import spark.implicits._
val sourceDF = Seq((1,1,1,1,3,39,"colA=1 AND colB>0"),
(1,1,1,1,3,45,"colD=1"),
(1,1,1,1,3,447,"colA>8 AND colC=1")
).toDF("colA", "colB", "colC", "colD", "colE", "colPRODRTCE", "colCOND").persist(MEMORY_AND_DISK)
val exprs = sourceDF.select('colCOND).distinct().as[String].collect()
val d1 = exprs.map(i => {
val df = sourceDF.filter('colCOND.equalTo(i))
df.withColumn("colRESULT", expr(i))
})
val resultDF = d1.reduce(_ union _)
resultDF.show(false)
// +----+----+----+----+----+-----------+-----------------+---------+
// |colA|colB|colC|colD|colE|colPRODRTCE|colCOND |colRESULT|
// +----+----+----+----+----+-----------+-----------------+---------+
// |1 |1 |1 |1 |3 |39 |colA=1 AND colB>0|true |
// |1 |1 |1 |1 |3 |447 |colA>8 AND colC=1|false |
// |1 |1 |1 |1 |3 |45 |colD=1 |true |
// +----+----+----+----+----+-----------+-----------------+---------+
sourceDF.unpersist()
}
You can also try a Dataset:
case class c1 (colA: Int, colB: Int, colC: Int, colD: Int, colE: Int, colPRODRTCE: Int, colCOND: String)
case class cRes (colA: Int, colB: Int, colC: Int, colD: Int, colE: Int, colPRODRTCE: Int, colCOND: String, colResult: Boolean)
val sourceData = Seq(c1(1,1,1,1,3,39,"colA=1 AND colB>0"),
c1(1,1,1,1,3,45,"colD=1"),
c1(1,1,1,1,3,447,"colA>8 AND colC=1")
).toDS()
def f2(a: c1): Boolean = {
  // we would need to parse the value of colCOND here;
  // this version only handles the expressions it matches explicitly
  a.colCOND match {
    case "colA=1 AND colB>0" => a.colA == 1 && a.colB > 0
    case _ => false
  }
}
val res2 = sourceData
.map(i => cRes(i.colA, i.colB, i.colC, i.colD, i.colE, i.colPRODRTCE, i.colCOND,
f2(i)))
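A hypothetical extension of f2 covering the other sample expressions as well (a general solution would still need a real expression parser, or expr as in the first answer), plus a quick check of the result:
def f2All(a: c1): Boolean = a.colCOND match {
  case "colA=1 AND colB>0" => a.colA == 1 && a.colB > 0
  case "colD=1"            => a.colD == 1
  case "colA>8 AND colC=1" => a.colA > 8 && a.colC == 1
  case _                   => false // unknown expression: treated as not matched
}

val res3 = sourceData.map(i =>
  cRes(i.colA, i.colB, i.colC, i.colD, i.colE, i.colPRODRTCE, i.colCOND, f2All(i)))
res3.show(false)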

how to use achieve below requirement using spark RDD

I am really new to Spark. Would you please help with the requirement below?
I have the source file below. The first field is a name, the second field is a group id. I need to count how many groups each name belongs to, and list all the groups along with that count.
abc 1
abc 2
abc 3
xyz 1
xyz 3
def 2
def 4
lmn 6
I want to get output like the example below:
name dept count
abc 1,2,3 3
xyz 1,3 2
def 2,4 2
lmn 6 1
thanks in advance.
Assuming you have a CSV file, first create a DataFrame using the following steps.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

val members = sc.textFile("member").map(lines => lines.split(",")).map(a => Row(a(0), a(1)))
val rddStruct = new StructType(Array(StructField("name", StringType, nullable = true), StructField("depart", StringType, nullable = true)))
val df = sqlContext.createDataFrame(members, rddStruct)
To achieve the output, the following steps can be followed.
Apply a groupBy and collect all departments as a set:
val df2 = df.groupBy("Name").agg(collect_set("Depart").as("Depart"))
df2.show
+----+---------+
|Name|   Depart|
+----+---------+
| lmn|      [6]|
| def|   [2, 4]|
| abc|[1, 2, 3]|
| xyz|   [1, 3]|
+----+---------+
Then apply a size function on the Depart column to get the count.
val df3 = df2.withColumn("Count", size(df2("Depart")))
df3.show
+----+---------+-----+
|Name|   Depart|Count|
+----+---------+-----+
| lmn|      [6]|    1|
| def|   [2, 4]|    2|
| abc|[1, 2, 3]|    3|
| xyz|   [1, 3]|    2|
+----+---------+-----+
If the result should be sorted in descending order of count, you can apply an orderBy on the above output.
val df4 = df3.orderBy(desc("Count"))
df4.show
+----+---------+-----+
|Name|   Depart|Count|
+----+---------+-----+
| abc|[1, 2, 3]|    3|
| def|   [2, 4]|    2|
| xyz|   [1, 3]|    2|
| lmn|      [6]|    1|
+----+---------+-----+
You can read more about StructType in the Spark SQL documentation.
You can make it simple using RDD transformation:
scala> val rdd = sc.textFile("/data_test1")
scala> rdd.map(x => x.split(" ")).
map(x => (x(0), x(1))).
groupByKey().
map(x => (x._1, x._2.toSet.mkString(","),x._2.size)).
toDF("name", "dept", "count").show()
Output:
+----+-----+-----+
|name| dept|count|
+----+-----+-----+
| abc|1,2,3| 3|
| lmn| 6| 1|
| def| 2,4| 2|
| xyz| 1,3| 2|
+----+-----+-----+