Spark: Rows to Columns (like transpose or pivot) - scala

How can I transpose rows to columns using an RDD or DataFrame, without using pivot? My data looks like this:
SessionId,date,orig,dest,legind,nbr
1,9/20/16,abc0,xyz0,o,1
1,9/20/16,abc1,xyz1,o,2
1,9/20/16,abc2,xyz2,i,3
1,9/20/16,abc3,xyz3,i,4
So I want to generate a new schema like:
SessionId,date,orig1, orig2, orig3, orig4, dest1, dest2, dest3,dest4
1,9/20/16,abc0,abc1,null, null, xyz0,xyz1, null, null
The logic is:
if nbr is 1 and legind = o, then take the orig1 value (fetched from row 1) ...
if nbr is 3 and legind = i, then take the dest1 value (fetched from row 3)
So how do I transpose the rows to columns?
Any ideas would be greatly appreciated.
I tried the option below, but it just flattens everything into a single row:
val keys = List("SessionId")

val selectFirstValueOfNoneGroupedColumns = df.columns
  .filterNot(keys.toSet)
  .map(_ -> "first")
  .toMap

val grouped = df
  .groupBy(keys.head, keys.tail: _*)
  .agg(selectFirstValueOfNoneGroupedColumns)
  .show()

It is relatively simple if you use the pivot function. First, let's create a dataset like the one in your question:
import org.apache.spark.sql.functions.{concat, first, lit, when}
import spark.implicits._

val df = Seq(
  ("1", "9/20/16", "abc0", "xyz0", "o", "1"),
  ("1", "9/20/16", "abc1", "xyz1", "o", "2"),
  ("1", "9/20/16", "abc2", "xyz2", "i", "3"),
  ("1", "9/20/16", "abc3", "xyz3", "i", "4")
).toDF("SessionId", "date", "orig", "dest", "legind", "nbr")
Then define and attach helper columns:
// This will be the column name
val key = when($"legind" === "o", concat(lit("orig"), $"nbr"))
  .when($"legind" === "i", concat(lit("dest"), $"nbr"))

// This will be the value
val value = when($"legind" === "o", $"orig")  // if o, take orig
  .when($"legind" === "i", $"dest")           // if i, take dest

val withKV = df.withColumn("key", key).withColumn("value", value)
This will result in a DataFrame like this:
+---------+-------+----+----+------+---+-----+-----+
|SessionId|   date|orig|dest|legind|nbr|  key|value|
+---------+-------+----+----+------+---+-----+-----+
|        1|9/20/16|abc0|xyz0|     o|  1|orig1| abc0|
|        1|9/20/16|abc1|xyz1|     o|  2|orig2| abc1|
|        1|9/20/16|abc2|xyz2|     i|  3|dest3| xyz2|
|        1|9/20/16|abc3|xyz3|     i|  4|dest4| xyz3|
+---------+-------+----+----+------+---+-----+-----+
Next let's define a list of possible levels:
val levels = Seq("orig", "dest").flatMap(x => (1 to 4).map(y => s"$x$y"))
and finally pivot:
val result = withKV
  .groupBy($"sessionId", $"date")
  .pivot("key", levels)
  .agg(first($"value", ignoreNulls = true))

result.show()
And the result is:
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+
|sessionId|   date|orig1|orig2|orig3|orig4|dest1|dest2|dest3|dest4|
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+
|        1|9/20/16| abc0| abc1| null| null| null| null| xyz2| xyz3|
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+
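Since the question asks how to do this without pivot, here is a hedged sketch of the same reshape using conditional aggregation instead, reusing the withKV and levels values defined above (resultNoPivot is just an illustrative name):
// one conditional first() per level takes the place of the pivot;
// ignoreNulls = true keeps the single non-null value within each group
val aggCols = levels.map(l =>
  first(when($"key" === l, $"value"), ignoreNulls = true).as(l)
)

val resultNoPivot = withKV
  .groupBy($"SessionId", $"date")
  .agg(aggCols.head, aggCols.tail: _*)

resultNoPivot.show()
This should produce the same columns as the pivot version, at the cost of one aggregate expression per level.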

Related

spark expression rename the column list after aggregation

I have written the code below to group and aggregate the columns:
import org.apache.spark.sql.functions.col

val gmList = List("gc1", "gc2", "gc3")
val aList = List("val1", "val2", "val3", "val4", "val5")
val cype = "first"
val exprs = aList.map(_ -> cype).toMap

df.groupBy(gmList.map(col): _*).agg(exprs).show
but this creates columns with the aggregation name appended to every column, as shown below.
I want to alias that name, first(val1) -> val1, and I want to make this generic as part of exprs.
+---+---+---+-----------+-----------+-----------+-----------+-----------+
|gc1|gc2|gc3|first(val1)|first(val2)|first(val3)|first(val4)|first(val5)|
+---+---+---+-----------+-----------+-----------+-----------+-----------+
One approach would be to alias the aggregated columns to the original column names in a subsequent select. I would also suggest generalizing the single aggregate function (i.e. first) to a list of functions, as shown below:
import org.apache.spark.sql.functions._

val df = Seq(
  (1, 10, "a1", "a2", "a3"),
  (1, 10, "b1", "b2", "b3"),
  (2, 20, "c1", "c2", "c3"),
  (2, 30, "d1", "d2", "d3"),
  (2, 30, "e1", "e2", "e3")
).toDF("gc1", "gc2", "val1", "val2", "val3")

val gmList = List("gc1", "gc2")
val aList = List("val1", "val2", "val3")

// Populate with different aggregate methods for individual columns if necessary
val fList = List.fill(aList.size)("first")
val afPairs = aList.zip(fList)
// afPairs: List[(String, String)] = List((val1,first), (val2,first), (val3,first))

df.
  groupBy(gmList.map(col): _*).agg(afPairs.toMap).
  select(gmList.map(col) ::: afPairs.map{ case (v, f) => col(s"$f($v)").as(v) }: _*).
  show
// +---+---+----+----+----+
// |gc1|gc2|val1|val2|val3|
// +---+---+----+----+----+
// |  2| 20|  c1|  c2|  c3|
// |  1| 10|  a1|  a2|  a3|
// |  2| 30|  d1|  d2|  d3|
// +---+---+----+----+----+
You can slightly change the way you are generating the expression and use the function alias in there:
import org.apache.spark.sql.functions.{col, first}

val aList = List("val1", "val2", "val3", "val4", "val5")
val exprs = aList.map(c => first(col(c)).alias(c))

df.groupBy(gmList.map(col): _*).agg(exprs.head, exprs.tail: _*).show
Here's a more generic version that will work with any aggregate function and doesn't require naming your aggregate columns up front. Build your grouped df as you normally would, then use:
val colRegex = raw"^.+\((.*?)\)".r
val newCols = df.columns.map(c => col(c).as(colRegex.replaceAllIn(c, m => m.group(1))))
df.select(newCols: _*)
This will extract out only what is inside the parentheses, regardless of what aggregate function is called (e.g. first(val) -> val, sum(val) -> val, count(val) -> val, etc.).
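For example, a quick sketch of applying this rename to the grouped DataFrame built from the df, gmList and aList of the earlier example (the grouped name is just illustrative):
// aggregate with the string-based Map, then strip the "first(...)" wrappers via the regex
val grouped = df.groupBy(gmList.map(col): _*).agg(aList.map(_ -> "first").toMap)

val colRegex = raw"^.+\((.*?)\)".r
val newCols = grouped.columns.map(c => col(c).as(colRegex.replaceAllIn(c, m => m.group(1))))

grouped.select(newCols: _*).show
// columns come out as gc1, gc2, val1, val2, val3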

Calculating edit distance on successive rows of a Spark DataFrame

I have a data frame as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._

// some data...
val df = Seq(
  (1, "AA", "BB", ("AA", "BB")),
  (2, "AA", "BB", ("AA", "BB")),
  (3, "AB", "BB", ("AB", "BB"))
).toDF("id", "name", "surname", "array")

df.show()
and I am looking to calculate the edit distance between the 'array' column in successive rows. As an example, I want to calculate the edit distance between the 'array' entity in row 1 ("AA", "BB") and the 'array' entity in row 2 ("AA", "BB"). Here is the edit distance function I am using:
def editDist2[A](a: Iterable[A], b: Iterable[A]): Int = {
  val startRow = (0 to b.size).toList
  a.foldLeft(startRow) { (prevRow, aElem) =>
    (prevRow.zip(prevRow.tail).zip(b)).scanLeft(prevRow.head + 1) {
      case (left, ((diag, up), bElem)) =>
        val aGapScore = up + 1
        val bGapScore = left + 1
        val matchScore = diag + (if (aElem == bElem) 0 else 1)
        List(aGapScore, bGapScore, matchScore).min
    }
  }.last
}
I know I need to create a UDF for this function but can't seem to get it working. If I use the function as is, using Spark windowing to get at the previous row:
// creating window - ordered by ID
val window = Window.orderBy("id")

// using the window with the lag function to compare to the previous value in each column
df.withColumn("edit-d", editDist2(($"array"), lag("array", 1).over(window))).show()
I get the following error:
<console>:245: error: type mismatch;
 found   : org.apache.spark.sql.ColumnName
 required: Iterable[?]
       df.withColumn("edit-d", editDist2(($"array"), lag("array", 1).over(window))).show()
I figured out you can use Spark's own levenshtein function for this. That function takes two strings to compare, so it can't be used directly on the array column.
// creating window - ordered by ID
val window = Window.orderBy("id")

// using the window with the lag function to compare to the previous value in each column
df.withColumn(
  "edit-d",
  levenshtein($"name", lag("name", 1).over(window)) +
    levenshtein($"surname", lag("surname", 1).over(window))
).show()
giving the desired output:
+---+----+-------+--------+------+
| id|name|surname|   array|edit-d|
+---+----+-------+--------+------+
|  1|  AA|     BB|[AA, BB]|  null|
|  2|  AA|     BB|[AA, BB]|     0|
|  3|  AB|     BB|[AB, BB]|     1|
+---+----+-------+--------+------+
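If you do need the custom editDist2 on the array column itself, a hedged sketch (not part of the original answer) is to convert the struct to an array<string> and wrap editDist2 in a UDF; editDistUdf and the arr column are made-up names:
import org.apache.spark.sql.functions.{array, lag, udf}

// hypothetical UDF around editDist2; returns None (null) on the first row,
// where lag() has no previous value to compare against
val editDistUdf = udf { (a: Seq[String], b: Seq[String]) =>
  if (a == null || b == null) None else Some(editDist2(a, b))
}

val window = Window.orderBy("id")

df
  .withColumn("arr", array($"array._1", $"array._2"))  // struct -> array<string>
  .withColumn("edit-d", editDistUdf($"arr", lag($"arr", 1).over(window)))
  .drop("arr")
  .show()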

Finding size of distinct array column

I am using Scala and Spark to create a dataframe. Here's my code so far:
val df = transformedFlattenDF
  .groupBy($"market", $"city", $"carrier")
  .agg(
    count("*").alias("count"),
    min($"bandwidth").alias("bandwidth"),
    first($"network").alias("network"),
    concat_ws(",", collect_list($"carrierCode")).alias("carrierCode")
  )
  .withColumn("carrierCode", split($"carrierCode", ",").cast("array<string>"))
  .withColumn("Carrier Count", collect_set("carrierCode"))
The column carrierCode becomes an array column. The data is present as follows:
CarrierCode
1: [12,2,12]
2: [5,2,8]
3: [1,1,3]
I'd like to create a column that counts the number of distinct values in each array. I tried doing collect_set; however, it gives me an error saying grouping expressions sequence is empty. Is it possible to find the number of distinct values in each row's array? That way, in our same example, there could be a column like so:
Carrier Count
1: 2
2: 3
3: 2
collect_set is for aggregation, hence it should be applied within your groupBy-agg step:
val df = transformedFlattenDF.groupBy($"market", $"city", $"carrier").agg(
    count("*").alias("count"),
    min($"bandwidth").alias("bandwidth"),
    first($"network").alias("network"),
    concat_ws(",", collect_list($"carrierCode")).alias("carrierCode"),
    size(collect_set($"carrierCode")).as("carrier_count")  // <-- ADDED `collect_set`
  ).
  withColumn("carrierCode", split($"carrierCode", ",").cast("array<string>"))
If you don't want to change the existing groupBy-agg code, you can create a UDF like in the following example:
import org.apache.spark.sql.functions._

val codeDF = Seq(
  Array("12", "2", "12"),
  Array("5", "2", "8"),
  Array("1", "1", "3")
).toDF("carrier_code")

def distinctElemCount = udf( (a: Seq[String]) => a.toSet.size )

codeDF.withColumn("carrier_count", distinctElemCount($"carrier_code")).show
// +------------+-------------+
// |carrier_code|carrier_count|
// +------------+-------------+
// | [12, 2, 12]|            2|
// |   [5, 2, 8]|            3|
// |   [1, 1, 3]|            2|
// +------------+-------------+
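If you are on Spark 2.4 or later, the UDF can also be replaced with the built-in array_distinct (a sketch under that version assumption, applied to the same codeDF):
import org.apache.spark.sql.functions.{array_distinct, size}

// size(array_distinct(...)) gives the distinct element count per row
codeDF.withColumn("carrier_count", size(array_distinct($"carrier_code"))).show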
Without a UDF, using RDD conversion (and back to a DF), for posterity:
import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
  ("A", 2, 100, 2), ("F", 7, 100, 1), ("B", 10, 100, 100)
)).toDF("c1", "c2", "c3", "c4")

val x = df.select("c1", "c2", "c3", "c4").rdd.map(x => (x.get(0), List(x.get(1), x.get(2), x.get(3))))
val y = x.map { case (k, vL) => (k, vL.toSet.size) }
// Manipulate back to your DF, via conversion, join, what not.
y.collect
Returns:
res15: Array[(Any, Int)] = Array((A,2), (F,3), (B,2))
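The "manipulate back" step could look like the following sketch (the keys are cast to String here because Row.get returns Any; distinctDF is an illustrative name):
import spark.implicits._

// turn the (key, distinct-count) pairs back into a DataFrame and join onto the original df
val distinctDF = y.map { case (k, n) => (k.toString, n) }.toDF("c1", "distinct_count")
df.join(distinctDF, Seq("c1")).show()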
The solution above is better; this one is included more for posterity.
You can use a UDF and do it like this:
//Input
df.show
+-----------+
|CarrierCode|
+-----------+
|1:[12,2,12]|
|  2:[5,2,8]|
|  3:[1,1,3]|
+-----------+

// UDF: strip the brackets, split the values on "," and count the distinct ones
val countUDF = udf { (str: String) =>
  val strArr = str.split(":")
  strArr(0) + ":" + strArr(1).replaceAll("[\\[\\]]", "").split(",").distinct.length.toString
}

df.withColumn("Carrier Count", countUDF(col("CarrierCode"))).show

//Sample Output:
+-----------+-------------+
|CarrierCode|Carrier Count|
+-----------+-------------+
|1:[12,2,12]|          1:2|
|  2:[5,2,8]|          2:3|
|  3:[1,1,3]|          3:2|
+-----------+-------------+

Scala: How to add a column with the value of a changed field that was changed between two tables

I have two tables with the same schema (A and B) where every unique ID in table A also exists in table B in a 1 to 1 way. I want to add a column to table B with the name of the column whose value is different between the tables for each row. There is only one difference per row.
For example:
Table A:
{ "id1": 1,"id2": "a","name": "bob","state": "nj"}
{"id1": 2,"id2": "b","name": "sue","state": "ma"}
Table B:
{"id1": 1,"id2": "a","name": "bob","state": "fl"}
{"id1": 2,"id2": "b","name": "susan","state": "ma"}
After comparing them, I want Table B to look like this:
{"id1": 1,"id2": "a","name": "bob","state": "fl", "changed_field": "state"}
{"id1": 2,"id2": "b","name": "susan","state": "ma", "changed_field": "name"}
I can't find any functions that do this in Spark Scala's data frames. Is there something that I missed?
EDIT: I am working with hundreds to thousands of columns
Here's a way to achieve this without having to "spell-out" the columns, and without a UDF (only using built-in functions):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import spark.implicits._

// list of columns to compare
val comparableColumns = A.columns.tail  // without id

// create a Column that evaluates to the name of the first differing column:
val changedFieldCol: Column = comparableColumns.foldLeft(lit("")) {
  case (result, col) => when(
    result === "", when($"A.$col" =!= $"B.$col", lit(col)).otherwise(lit(""))
  ).otherwise(result)
}

// join by id1, add changedFieldCol, and then select only B's columns:
val result = A.as("A").join(B.as("B"), "id1")
  .withColumn("changed_field", changedFieldCol)
  .select("id1", comparableColumns.map(c => s"B.$c") :+ "changed_field": _*)

result.show(false)
// +---+---+-----+-----+-------------+
// |id1|id2|name |state|changed_field|
// +---+---+-----+-----+-------------+
// |1  |a  |bob  |fl   |state        |
// |2  |b  |susan|ma   |name         |
// +---+---+-----+-----+-------------+
You can compare the fields in a UDF which generates the appropriate string:
import org.apache.spark.sql.functions.udf
import spark.implicits._

val df_a = Seq(
  (1, "a", "bob", "nj"),
  (2, "b", "sue", "ma")
).toDF("id1", "id2", "name", "state")

val df_b = Seq(
  (1, "a", "bob", "fl"),
  (2, "b", "susane", "ma")
).toDF("id1", "id2", "name", "state")

val compareFields = udf((aName: String, aState: String, bName: String, bState: String) => {
  val changedState = if (aState != bState) Some("state") else None
  val changedName = if (aName != bName) Some("name") else None
  Seq(changedName, changedState).flatten.mkString(",")
})

df_b.as("b")
  .join(df_a.as("a"), Seq("id1", "id2"))
  .withColumn("changed_fields", compareFields($"a.name", $"a.state", $"b.name", $"b.state"))
  .select($"id1", $"id2", $"b.name", $"b.state", $"changed_fields")
  .show()
gives
+---+---+------+-----+--------------+
|id1|id2|  name|state|changed_fields|
+---+---+------+-----+--------------+
|  1|  a|   bob|   fl|         state|
|  2|  b|susane|   ma|          name|
+---+---+------+-----+--------------+
EDIT:
Here is a more generic version which compares all fields at once:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{struct, udf}

val compareFields = udf((a: Row, b: Row) => {
  assert(a.schema == b.schema)
  a.schema
    .indices
    .map(i => if (a.get(i) != b.get(i)) Some(a.schema(i).name) else None)
    .flatten
    .mkString(",")
})

df_b.as("b")
  .join(df_a.as("a"), $"a.id1" === $"b.id1" and $"a.id2" === $"b.id2")
  .withColumn("changed_fields", compareFields(struct($"a.*"), struct($"b.*")))
  .select($"b.id1", $"b.id2", $"b.name", $"b.state", $"changed_fields")
  .show()

How to set ignoreNulls flag for first function in agg with map of columns and aggregate functions?

I have a list of around 20-25 columns from a conf file and have to aggregate the first non-null value. I wrote a function that takes the column list and the agg expr read from the conf file.
I was able to get the first function working but couldn't find how to specify first with ignoreNulls set to true.
The code that I tried is:
import org.apache.spark.sql.DataFrame

def groupAndAggregate(df: DataFrame, cols: List[String], aggregateFun: Map[String, String]): DataFrame = {
  df.groupBy(cols.head, cols.tail: _*).agg(aggregateFun)
}

val df = sc.parallelize(Seq(
  (0, null, "1"),
  (1, "2", "2"),
  (0, "3", "3"),
  (0, "4", "4"),
  (1, "5", "5"),
  (1, "6", "6"),
  (1, "7", "7")
)).toDF("grp", "col1", "col2")

// first
groupAndAggregate(df, List("grp"), Map("col1" -> "first", "col2" -> "COUNT")).show()
+---+-----------+-----------+
|grp|first(col1)|count(col2)|
+---+-----------+-----------+
|  1|          2|          4|
|  0|       null|          3|
+---+-----------+-----------+
I need to get 3 as a result in place of null.
I am using Spark 2.1.0 and Scala 2.11
Edit 1:
If I use the following
import org.apache.spark.sql.functions.{first, count}
df.groupBy("grp").agg(first(df("col1"), ignoreNulls = true), count("col2")).show()
I get my desired result. Can we pass ignoreNulls = true for the first function in the Map?
I have been able to achieve this by creating a list of Columns and passing it to the agg function of groupBy. The earlier approach had an issue where I was not able to name the columns, as the agg function was not returning the columns in order in the output DF, so I have renamed the columns in the list itself.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, DataFrame}
import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer

def groupAndAggregate(df: DataFrame): DataFrame = {
  val list: ListBuffer[Column] = new ListBuffer[Column]()
  try {
    val columnFound = getAggColumns(df)  // function to return a java.util.Map[String, String]
    columnFound.entrySet().asScala.toList.foreach(field =>
      list += first(df(columnFound.getOrDefault(field.getKey, "")), ignoreNulls = true).as(field.getKey)
    )
    list += sum(df("col1")).as("watch_time")
    list += count("*").as("frequency")

    val groupColumns = getGroupColumns(df)  // function to return a List[String]
    val output = df.groupBy(groupColumns.head, groupColumns.tail: _*).agg(
      list.head, list.tail: _*
    )
    output
  } catch {
    case e: Exception =>
      e.printStackTrace()
      null
  }
}
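For reference, a more compact sketch of the same idea that builds the aggregate Columns straight from the question's Map; only the "first" and "COUNT" entries from the question are handled, everything else here is an assumption:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{count, first}

// translate each (column -> function name) pair into a Column expression,
// which is what lets us set ignoreNulls on first()
def groupAndAggregate(df: DataFrame, cols: List[String], aggregateFun: Map[String, String]): DataFrame = {
  val aggCols = aggregateFun.map {
    case (c, "first") => first(df(c), ignoreNulls = true).as(c)
    case (c, "COUNT") => count(df(c)).as(c)
  }.toList
  df.groupBy(cols.head, cols.tail: _*).agg(aggCols.head, aggCols.tail: _*)
}

groupAndAggregate(df, List("grp"), Map("col1" -> "first", "col2" -> "COUNT")).show()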
I think you should use the na operator and drop all the nulls before you do the aggregation.
na: DataFrameNaFunctions Returns a DataFrameNaFunctions for working with missing data.
drop(cols: Array[String]): DataFrame Returns a new DataFrame that drops rows containing any null or NaN values in the specified columns.
The code would then look as follows:
df.na.drop(Seq("col1")).groupBy(...).agg(first("col1"))
That will impact the count, so you'd have to compute the count separately.
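A hedged sketch of what computing the count separately could look like, running the two aggregates on different inputs and joining them back on the group key (column names reuse the question's example):
import org.apache.spark.sql.functions.{count, first}

// count over the full data, first non-null over the null-dropped data, then join
val counts = df.groupBy("grp").agg(count("col2").as("count_col2"))
val firsts = df.na.drop(Seq("col1")).groupBy("grp").agg(first("col1").as("first_col1"))

firsts.join(counts, Seq("grp")).show()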