create new columns from string column - scala

I have a DataFrame with a string column:
val df= Seq(("0003C32C-FC1D-482F-B543-3CBD7F0A0E36 0,8,1,799,300:3 0,6,1,330,300:1 2,6,1,15861:1 0,7,1,734,300:1 0,6,0,95,300:1 2,7,1,15861:1 0,8,0,134,300:3")).toDF("col_str")
+--------------------+
| col_str|
+--------------------+
|0003C32C-FC1D-482...|
+--------------------+
The string column consists of character sequences separated by whitespace. If a character sequence starts with 0, I want to return the second number and the last number of the sequence. The second number can be any number between 0 and 8. For the string above, this gives:
Array("8,3", "6,1", "7,1", "6,1", "7,1", "8,3")
I then want to transform the array of pairs into 9 columns, with the first number of the pair as the column and the second number as the value. If a number is missing, it will get a value of 0.
For example
val df= Seq(("0003C32C-FC1D-482F-B543-3CBD7F0A0E36 0,8,1,799,300:3 0,6,1,330,300:1 2,6,1,15861:1 0,7,1,734,300:1 0,6,0,95,300:1 2,7,1,15861:1 0,8,0,134,300:1", 0, 0, 0, 0, 0, 0, 1, 1, 3)).toDF("col_str", "col0", "col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8")
+--------------------+----+----+----+----+----+----+----+----+----+
| col_str|col0|col1|col2|col3|col4|col5|col6|col7|col8|
+--------------------+----+----+----+----+----+----+----+----+----+
|0003C32C-FC1D-482...| 0| 0| 0| 0| 0| 0| 1| 1| 3|
+--------------------+----+----+----+----+----+----+----+----+----+
The solution can be in either Scala or Python.

You can do the following (commented for clarity):
//string defining
val str = """0003C32C-FC1D-482F-B543-3CBD7F0A0E36 0,8,1,799,300:3 0,6,1,330,300:1 2,6,1,15861:1 0,7,1,734,300:1 0,6,0,95,300:1 2,7,1,15861:1 0,8,0,134,300:3"""
//string splitting with space
val splittedStr = str.split(" ")
//parsing the split string: for elements starting with 0, the second number becomes the key and the last number the value (for a repeated key the last occurrence wins)
val parsedStr = (List("col_str" -> splittedStr.head) ++ splittedStr.tail.filter(_.startsWith("0")).map { value => val splittedValue = value.split("[,:]"); ("col" + splittedValue(1)) -> splittedValue.last }).toMap
//expected header names
val expectedHeader = Seq("col_str", "col0", "col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8")
//populating 0 for the missing header names in the parsed string in above step
val missedHeaderWithValue = expectedHeader.diff(parsedStr.keys.toSeq).map((_->"0")).toMap
//combining both the maps
val expectedKeyValues = parsedStr ++ missedHeaderWithValue
//converting to a dataframe
Seq(expectedDF(expectedKeyValues(expectedHeader(0)), expectedKeyValues(expectedHeader(1)), expectedKeyValues(expectedHeader(2)), expectedKeyValues(expectedHeader(3)), expectedKeyValues(expectedHeader(4)), expectedKeyValues(expectedHeader(5)), expectedKeyValues(expectedHeader(6)), expectedKeyValues(expectedHeader(7)), expectedKeyValues(expectedHeader(8)), expectedKeyValues(expectedHeader(9))))
.toDF()
.show(false)
which should give you
+------------------------------------+----+----+----+----+----+----+----+----+----+
|col_str |col0|col1|col2|col3|col4|col5|col6|col7|col8|
+------------------------------------+----+----+----+----+----+----+----+----+----+
|0003C32C-FC1D-482F-B543-3CBD7F0A0E36|0 |0 |0 |0 |0 |0 |1 |1 |3 |
+------------------------------------+----+----+----+----+----+----+----+----+----+
Of course, you need the expectedDF case class defined outside the enclosing method (for example at the top level):
case class expectedDF(col_str: String, col0: String, col1: String, col2: String, col3: String, col4: String, col5: String, col6: String, col7: String, col8: String)
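The steps above handle a single hard-coded string. As a minimal sketch of the same parsing logic packaged as a reusable function on plain Scala collections (no Spark needed to run it), something like the following could work. The names `ColStrParser`, `valueColumns` and `parse` are hypothetical and not part of the original answer, and the rule that a repeated key keeps its last value is an assumption carried over from the `toMap` behaviour above.

```scala
object ColStrParser {
  // Column names col0..col8, in the order they should appear in the output.
  val valueColumns: Seq[String] = (0 to 8).map(i => s"col$i")

  // Parses one raw string into (id, values for col0..col8).
  // Tokens that do not start with "0" are ignored; missing columns default to "0".
  def parse(raw: String): (String, Seq[String]) = {
    val tokens = raw.trim.split("\\s+")
    val parsed: Map[String, String] = tokens.tail
      .filter(_.startsWith("0"))
      .map { token =>
        val parts = token.split("[,:]")
        s"col${parts(1)}" -> parts.last
      }
      .toMap // assumption: for a repeated key the last occurrence wins
    (tokens.head, valueColumns.map(c => parsed.getOrElse(c, "0")))
  }
}
```

A function of this shape could then be applied per row, for instance inside a UDF or a `map` over the string column, instead of building the case class by hand for one value.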

Related

Scala Spark: Flatten Array of Key/Value structs

I have an input dataframe which contains an array-typed column. Each entry in the array is a struct consisting of a key (one of about four values) and a value. I want to turn this into a dataframe with one column for each possible key, and nulls where that value is not in the array for that row. Keys are never duplicated in any of the arrays, but they may be out of order or missing.
So far the best I've got is
val wantedCols =df.columns
.filter(_ != arrayCol)
.filter(_ != "col")
val flattened = df
.select((wantedCols.map(col(_)) ++ Seq(explode(col(arrayCol)))):_*)
.groupBy(wantedCols.map(col(_)):_*)
.pivot("col.key")
.agg(first("col.value"))
This does exactly what I want, but it's hideous and I have no idea what the ramifications of grouping on every-column-but-one would be. What's the RIGHT way to do this?
EDIT: Example input/output:
case class testStruct(name : String, number : String)
val dfExampleInput = Seq(
(0, "KY", Seq(testStruct("A", "45"))),
(1, "OR", Seq(testStruct("A", "30"), testStruct("B", "10"))))
.toDF("index", "state", "entries")
.show
+-----+-----+------------------+
|index|state| entries|
+-----+-----+------------------+
| 0| KY| [[A, 45]]|
| 1| OR|[[A, 30], [B, 10]]|
+-----+-----+------------------+
val dfExampleOutput = Seq(
(0, "KY", "45", null),
(1, "OR", "30", "10"))
.toDF("index", "state", "A", "B")
.show
+-----+-----+---+----+
|index|state| A| B|
+-----+-----+---+----+
| 0| KY| 45|null|
| 1| OR| 30| 10|
+-----+-----+---+----+
FURTHER EDIT:
I submitted a solution myself (see below) that handles this well so long as you know the keys in advance (in my case I do.) If finding the keys is an issue, another answer holds code to handle that.
Without groupBy pivot agg first
Please check below code.
scala> val df = Seq((0, "KY", Seq(("A", "45"))),(1, "OR", Seq(("A", "30"),("B", "10")))).toDF("index", "state", "entries").withColumn("entries",$"entries".cast("array<struct<name:string,number:string>>"))
df: org.apache.spark.sql.DataFrame = [index: int, state: string ... 1 more field]
scala> df.printSchema
root
|-- index: integer (nullable = false)
|-- state: string (nullable = true)
|-- entries: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- number: string (nullable = true)
scala> df.show(false)
+-----+-----+------------------+
|index|state|entries |
+-----+-----+------------------+
|0 |KY |[[A, 45]] |
|1 |OR |[[A, 30], [B, 10]]|
+-----+-----+------------------+
scala> val finalDFColumns = df.select(explode($"entries").as("entries")).select("entries.*").select("name").distinct.map(_.getAs[String](0)).orderBy($"value".asc).collect.foldLeft(df.limit(0))((cdf,c) => cdf.withColumn(c,lit(null))).columns
finalDFColumns: Array[String] = Array(index, state, entries, A, B)
scala> val finalDF = df.select($"*" +: (0 until max).map(i => $"entries".getItem(i)("number").as(i.toString)): _*)
finalDF: org.apache.spark.sql.DataFrame = [index: int, state: string ... 3 more fields]
scala> finalDF.show(false)
+-----+-----+------------------+---+----+
|index|state|entries |0 |1 |
+-----+-----+------------------+---+----+
|0 |KY |[[A, 45]] |45 |null|
|1 |OR |[[A, 30], [B, 10]]|30 |10 |
+-----+-----+------------------+---+----+
scala> finalDF.printSchema
root
|-- index: integer (nullable = false)
|-- state: string (nullable = true)
|-- entries: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- number: string (nullable = true)
|-- 0: string (nullable = true)
|-- 1: string (nullable = true)
scala> finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf,column) => fdf.withColumnRenamed(column._1,column._2)).show(false)
+-----+-----+------------------+---+----+
|index|state|entries |A |B |
+-----+-----+------------------+---+----+
|0 |KY |[[A, 45]] |45 |null|
|1 |OR |[[A, 30], [B, 10]]|30 |10 |
+-----+-----+------------------+---+----+
Final Output
scala> finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf,column) => fdf.withColumnRenamed(column._1,column._2)).drop($"entries").show(false)
+-----+-----+---+----+
|index|state|A |B |
+-----+-----+---+----+
|0 |KY |45 |null|
|1 |OR |30 |10 |
+-----+-----+---+----+
I wouldn't worry too much about grouping by several columns, other than potentially making things confusing. In that vein, if there is a simpler, more maintainable way, go for it. Without example input/output, I'm not sure if this gets you where you're trying to go, but maybe it'll be of use:
Seq(Seq("k1" -> "v1", "k2" -> "v2")).toDS() // some basic input based on my understanding of your description
.select(explode($"value")) // flatten the array
.select("col.*") // de-nest the struct
.groupBy("_2") // one row per distinct value
.pivot("_1") // one column per distinct key
.count // or agg(first) if you want the value in each column
.show
+---+----+----+
| _2| k1| k2|
+---+----+----+
| v2|null| 1|
| v1| 1|null|
+---+----+----+
Based on what you've now said, I get the impression that there are many columns like "state" that aren't required for the aggregation, but need to be in the final result.
For reference, if you didn't need to pivot, you could add a struct column with all such fields nested within, then add it to your aggregation, e.g.: .agg(first($"myStruct"), first($"number")). The main advantage is only having actual key column(s) referenced in the groupBy. But when using pivot things get a little weird, so we'll set that option aside.
In this use case, the simplest way I could come up with involves splitting your dataframe and joining it back together after the aggregation using some rowkey. In this example I am assuming that "index" is suitable for that purpose:
val mehCols = dfExampleInput.columns.filter(_ != "entries").map(col)
val mehDF = dfExampleInput.select(mehCols:_*)
val aggDF = dfExampleInput
.select($"index", explode($"entries").as("entry"))
.select($"index", $"entry.*")
.groupBy("index")
.pivot("name")
.agg(first($"number"))
scala> mehDF.join(aggDF, Seq("index")).show
+-----+-----+---+----+
|index|state| A| B|
+-----+-----+---+----+
| 0| KY| 45|null|
| 1| OR| 30| 10|
+-----+-----+---+----+
I doubt you would see much of a difference in performance, if any. Maybe at the extremes, eg: very many meh columns, or very many pivot columns, or something like that, or maybe nothing at all. Personally, I would test both with decently-sized input, and if there wasn't a significant difference, use whichever one seemed easier to maintain.
Here is another way, based on the assumption that there are no duplicate keys in the entries column, i.e. Seq(testStruct("A", "30"), testStruct("A", "70"), testStruct("B", "10")) will cause an error. This solution combines the RDD and DataFrame APIs:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.types.StructType
case class testStruct(name : String, number : String)
val df = Seq(
(0, "KY", Seq(testStruct("A", "45"))),
(1, "OR", Seq(testStruct("A", "30"), testStruct("B", "10"))),
(2, "FL", Seq(testStruct("A", "30"), testStruct("B", "10"), testStruct("C", "20"))),
(3, "TX", Seq(testStruct("B", "60"), testStruct("A", "19"), testStruct("C", "40")))
)
.toDF("index", "state", "entries")
.cache
// get all possible keys from entries i.e Seq[A, B, C]
val finalCols = df.select(explode($"entries").as("entry"))
.select($"entry".getField("name").as("entry_name"))
.distinct
.collect
.map{_.getAs[String]("entry_name")}
.sorted // Attention: we need to retain the order of the columns
// 1. when generating row values and
// 2. when creating the schema
val rdd = df.rdd.map{ r =>
// transform the entries array into a map i.e Map(A -> 30, B -> 10)
val entriesMap = r.getSeq[Row](2).map{r => (r.getString(0), r.getString(1))}.toMap
// look up every expected key in the order of finalCols; keys missing from the current row become null
// Attention: iterating over finalCols (rather than over a merged Map, whose iteration order
// is not guaranteed once it holds more than a few entries) keeps the values aligned with the schema
val entryValues = finalCols.map{c => entriesMap.getOrElse(c, null)}
// concatenate the two first row values ["index", "state"] with the aligned entry values
val finalValues = Seq(r(0), r(1)) ++ entryValues
Row.fromSeq(finalValues)
}
val extraCols = finalCols.map{c => s"`${c}` STRING"}
val schema = StructType.fromDDL("`index` INT, `state` STRING," + extraCols.mkString(","))
val finalDf = spark.createDataFrame(rdd, schema)
finalDf.show
// +-----+-----+---+----+----+
// |index|state| A| B| C|
// +-----+-----+---+----+----+
// | 0| KY| 45|null|null|
// | 1| OR| 30| 10|null|
// | 2| FL| 30| 10| 20|
// | 3| TX| 19| 60| 40|
// +-----+-----+---+----+----+
Note: the solution requires one extra action to retrieve the unique keys. The distinct in that step shuffles only the key names, and the row transformation itself is a narrow map, so the full rows are not shuffled.
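The row-building step in this answer depends on the values coming out in the same order as the sorted key names used for the schema. A minimal sketch on plain Scala collections (no Spark) of one way to guarantee that alignment is to look each key up in key order rather than rely on the iteration order of a merged Map. `AlignEntries` and `align` are hypothetical names introduced only for this illustration.

```scala
object AlignEntries {
  // Returns one value per key, in the order of `keys`.
  // Keys that are missing from `entries` become null, matching the null cells in the output above.
  def align(keys: Seq[String], entries: Seq[(String, String)]): Seq[String] = {
    val lookup = entries.toMap
    keys.map(k => lookup.getOrElse(k, null))
  }
}
```

Because the result is driven by `keys`, it stays aligned with the schema no matter how many distinct keys exist or in which order a row lists them.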
I've worked out a solution myself:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}
def extractFromArray(colName : String, key : String, numKeys : Int, keyName : String) = {
val indexCols = (0 to numKeys-1).map(col(colName).getItem(_))
indexCols.foldLeft(lit(null))((innerCol : Column, indexCol : Column) =>
when(indexCol.isNotNull && (indexCol.getItem(keyName) === key), indexCol)
.otherwise(innerCol))
}
Example:
case class testStruct(name : String, number : String)
val df = Seq(
(0, "KY", Seq(testStruct("A", "45"))),
(1, "OR", Seq(testStruct("A", "30"), testStruct("B", "10"))),
(2, "FL", Seq(testStruct("A", "30"), testStruct("B", "10"), testStruct("C", "20"))),
(3, "TX", Seq(testStruct("B", "60"), testStruct("A", "19"), testStruct("C", "40")))
)
.toDF("index", "state", "entries")
.withColumn("A", extractFromArray("entries", "B", 3, "name"))
.show
which produces:
+-----+-----+--------------------+-------+
|index|state| entries| A|
+-----+-----+--------------------+-------+
| 0| KY| [[A, 45]]| null|
| 1| OR| [[A, 30], [B, 10]]|[B, 10]|
| 2| FL|[[A, 30], [B, 10]...|[B, 10]|
| 3| TX|[[B, 60], [A, 19]...|[B, 60]|
+-----+-----+--------------------+-------+
This solution is a little different from other answers:
It works on only a single key at a time
It requires the key name and number of keys be known in advance
It produces a column of structs, rather than doing the extra step of extracting specific values
It works as a simple column-to-column operation, rather than requiring transformations on the entire DF
It can be evaluated lazily
The first three issues can be handled by calling code, and leave it somewhat more flexible for cases where you already know the keys or where the structs contain additional values to extract.

Run a custom transformation on string columns

Suppose I have the following dataframe:
var df = Seq(
("2019-09-01", 0.1, 1, "0x0000000000000001", "0x00000001", "True"),
("2019-09-02", 0.2, 2, "0x0000000000000002", "0x00000002", "False"),
("2019-09-03", 0.3, 3, "0x0000000000000003", "0x00000003", "True")
).toDF("Timestamp", "Float", "Integer", "Hex1", "Hex2", "Bool")
I need to run a transformation on the string columns (in this example: Hex1, Hex2 and Bool) and convert them to a numeric value by using some custom logic.
The dataframes are generated by reading CSV files whose schema I don't know. All I know is that they contain a Timestamp column as the first column and then a variable number of columns which might be numeric (integers or doubles/floats) or these hex and boolean values.
I'm thinking this transformation would need to find all the string columns and for each one, run the transformation that will add a new column to the dataframe with the numerical representation of the string.
In this case, the hex values would be converted to their decimal representation. And the "True", "False" strings would be converted to 1 and 0 respectively.
Back to the simplified example, I should get a df like this:
|Timestamp |Float|Integer|Hex1 |Hex2 |Bool |
|-----------|-----|-------|------------------|----------|-----|
|2019-09-01 |0.1 |1 |1 |1 |1 |
|2019-09-02 |0.2 |2 |2 |2 |0 |
|2019-09-03 |0.3 |3 |3 |3 |1 |
With all columns numeric (integer, float or double) except for the Timestamp.
For your example, use the following functions:
conv(num: Column, fromBase: Int, toBase: Int): Column converts a number in a string column from one base to another; use it to turn the hex strings into decimal.
when(condition: Column, value: Any): Column evaluates a list of conditions and returns one of multiple possible result expressions.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
val s1 = df.
withColumn("Hex1", conv(col("Hex1").substr(lit(3), length(col("Hex1"))), 16, 10) cast IntegerType).
withColumn("Hex2", conv(col("Hex2").substr(lit(3), length(col("Hex2"))), 16, 10) cast IntegerType).
withColumn("Bool", when(col("Bool") === "True", 1)
.otherwise(0))
s1.show()
s1.printSchema()
Your problem definition asks for this to happen dynamically, which takes some extra work.
Create a mapping from each column to its data type. This can be abstracted out: you can keep the mapping in an external file and generate the list dynamically by reading that file.
val list = List(
("Hex", "Hex1"),
("Hex", "Hex2"),
("Bool", "Bool")
)
Create a converter using pattern matching:
object Helper {
def convert(columnDetail: (String, String)): Column = {
columnDetail._1 match {
case "Hex" => conv(col(columnDetail._2).substr(lit(3), length(col(columnDetail._2))), 16, 10) cast IntegerType
case "Bool" => when(col(columnDetail._2) === "True", 1).otherwise(0)
// your other case
}
}
}
You can add further cases, each with its appropriate implementation.
Final solution:
import spark.implicits._
var df = Seq(
("2019-09-01", 0.1, 1, "0x0000000000000001", "0x00000001", "True"),
("2019-09-02", 0.2, 2, "0x0000000000000002", "0x00000002", "False"),
("2019-09-03", 0.3, 3, "0x0000000000000003", "0x00000003", "True")
).toDF("Timestamp", "Float", "Integer", "Hex1", "Hex2", "Bool")
val list = List(
("Hex", "Hex1"),
("Hex", "Hex2"),
("Bool", "Bool")
)
val temp = list.foldLeft(df) { (tempDF, listValue) =>
tempDF.withColumn(listValue._2, Helper.convert(listValue))
}
temp.show(false)
temp.printSchema()
object Helper {
def convert(columnDetail: (String, String)): Column = {
columnDetail._1 match {
case "Hex" => conv(col(columnDetail._2).substr(lit(3), length(col(columnDetail._2))), 16, 10) cast IntegerType
case "Bool" => when(col(columnDetail._2) === "True", 1).otherwise(0)
// your other case
}
}
}
Result:
+----------+-----+-------+----+----+----+
|Timestamp |Float|Integer|Hex1|Hex2|Bool|
+----------+-----+-------+----+----+----+
|2019-09-01|0.1 |1 |1 |1 |1 |
|2019-09-02|0.2 |2 |2 |2 |0 |
|2019-09-03|0.3 |3 |3 |3 |1 |
+----------+-----+-------+----+----+----+
root
|-- Timestamp: string (nullable = true)
|-- Float: double (nullable = false)
|-- Integer: integer (nullable = false)
|-- Hex1: integer (nullable = true)
|-- Hex2: integer (nullable = true)
|-- Bool: integer (nullable = false)
Below is my Spark code to do this. I have used the conv function of Spark SQL: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.conv
Also, if you want to write logic that dynamically identifies all string columns at run time and converts them, that is possible only if you know exactly what kind of conversion each column needs.
var df = Seq(
("2019-09-01", 0.1, 1, "0x0000000000000001", "0x00000001", "True"),
("2019-09-02", 0.2, 2, "0x0000000000000002", "0x00000002", "False"),
("2019-09-03", 0.3, 3, "0x0000000000000003", "0x00000003", "True")
).toDF("Timestamp", "Float", "Integer", "Hex1", "Hex2", "Bool")
// df.show
df.createOrReplaceTempView("sourceTable")
val finalDF = spark.sql("""
select Timestamp,
Float,
Integer,
conv(substr(Hex1,3),16,10) as Hex1,
conv(substr(Hex2,3),16,10) as Hex2,
case when Bool = "True" then 1
when Bool = "False" then 0
else NULL
end as Bool
from sourceTable
""")
finalDF.show
Result :
+----------+-----+-------+----+----+----+
| Timestamp|Float|Integer|Hex1|Hex2|Bool|
+----------+-----+-------+----+----+----+
|2019-09-01| 0.1| 1| 1| 1| 1|
|2019-09-02| 0.2| 2| 2| 2| 0|
|2019-09-03| 0.3| 3| 3| 3| 1|
+----------+-----+-------+----+----+----+
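For the dynamic case raised in the question, where the column types are not known up front, one option is to decide per value what kind of string it is. The following is only a sketch of that value-level logic on plain Scala (no Spark); `StringToNumeric` and `convert` are hypothetical names, and the choice of `BigInt` is an assumption made because a 16-digit hex value can exceed the range of an `Int` or even a signed `Long`.

```scala
import scala.util.Try

object StringToNumeric {
  // Converts "0x..." hex strings and the literals "True"/"False" to a number.
  // Anything else (e.g. a timestamp string) yields None, so the caller can leave it untouched.
  def convert(s: String): Option[BigInt] = s.trim match {
    case hex if hex.toLowerCase.startsWith("0x") && hex.length > 2 =>
      Try(BigInt(hex.drop(2), 16)).toOption // None if the digits are not valid hex
    case "True"  => Some(BigInt(1))
    case "False" => Some(BigInt(0))
    case _       => None
  }
}
```

A helper like this could be wrapped in a UDF and applied to every column whose schema type is string. Whether that is preferable to the built-in `conv` and `when` expressions shown above depends on the data, since UDFs are opaque to Spark's optimizer.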

How can I take a column in a dataframe that is a Map type and create a string that is just the key/value of the Map column

I'm interested in taking a column in my dataframe called mapColumn
+-------------------+
| mapColumn |
+-------------------+
| Map(KEY -> VALUE) |
+-------------------+
and create a stringColumn that's just the key and value of the Map column where the value is "KEY,VALUE":
+-------------------+
| stringColumn |
+-------------------+
| KEY,VALUE |
+-------------------+
I have tried creating a UDF to pass this value like follows:
var getStringColumn = udf(mapToString _)
df.withColumn("stringColumn",
when(col(mapColumn).isNotNull,
getStringColumn(col(mapColumn)))
.otherwise(lit(null: String)))
def mapToString(row: Row): String = {
if (null == row || row.isNullAt(FirstItemIndex)) {
return null
}
return row.getValuesMap[Any](row.schema.fieldNames).mkString(",")
}
I keep getting the following error:
Failed to execute user defined function($anonfun$1: (map) => string)
Cause: java.lang.ClassCastException:scala.collection.immutable.Map$Map1 cannot be cast to org.apache.spark.sql.Row
There is no need for a UDF. One approach is to explode the Map column into flattened key & value columns and concat the key-value elements as Strings accordingly:
val df = Seq(
(10, Map((1, "a"), (2, "b"))),
(20, Map((3, "c")))
).toDF("id", "map")
df.
select($"id", explode($"map")).
withColumn("kv_string", concat($"key".cast("string"), lit(","), $"value")).
show
// +---+---+-----+---------+
// | id|key|value|kv_string|
// +---+---+-----+---------+
// | 10| 1| a| 1,a|
// | 10| 2| b| 2,b|
// | 20| 3| c| 3,c|
// +---+---+-----+---------+
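The ClassCastException in the question says that the UDF received a Scala Map rather than a Row, so if a function is preferred over explode, its parameter would need to be a Map. Below is a minimal sketch of such a function on plain Scala (no Spark); `MapToString` is a hypothetical wrapper name, and joining several entries with ";" is an assumption, since the question only shows a single-entry map.

```scala
object MapToString {
  // Renders each entry as "key,value"; several entries are joined with ";" (an assumed separator).
  // Returns null for a null map, mirroring the null handling in the question.
  def mapToString(m: scala.collection.Map[String, String]): String =
    if (m == null) null
    else m.map { case (k, v) => s"$k,$v" }.mkString(";")
}
```

Unlike the explode approach, this keeps one output row per input row, which matches the single stringColumn shown in the question.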

Scala concatenate Column of Array[String] into single Array[String]

I have a Spark Dataframe (Scala) with an id - (Int) and tokens - (array<string>) column:
id,tokens
0,["a","b","c"]
1,["a","b"]
...
Assuming I am able to retrieve the data via a SparkSession and casting to a case class:
case class Token(id: Int, tokens: Array[String])
After getting a Dataset[Token] object, how do I concatenate all the arrays of string tokens into a single Array[String] and subsequently perform a count to find the most frequently occurring strings?
Output:
a,2
b,2
c,1
...
You need to explode the token column & take the count after grouping by the individual tokens:
scala> val input = sc.parallelize(List(
(0, Array("a","b","c")),
(1, Array("a","b"))
)).toDF("id","token")
scala> input.withColumn("token_split",explode($"token"))
.groupBy($"token_split")
.agg(count($"id") as "count")
.orderBy($"count".desc)
.show
Output:
+-----------+-----+
|token_split|count|
+-----------+-----+
| b| 2|
| a| 2|
| c| 1|
+-----------+-----+
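The same flatten, group and count steps can be sketched on plain Scala collections, which may help when reasoning about what explode and groupBy do here. This is an illustration without Spark; `TokenCounts` and `count` are hypothetical names, and breaking ties alphabetically is an assumption added only to make the order deterministic.

```scala
object TokenCounts {
  final case class Token(id: Int, tokens: Array[String])

  // Flattens all token arrays into one sequence and counts occurrences,
  // most frequent first; ties are broken alphabetically.
  def count(rows: Seq[Token]): Seq[(String, Int)] =
    rows.flatMap(_.tokens.toSeq)
      .groupBy(identity)
      .map { case (token, occurrences) => token -> occurrences.size }
      .toSeq
      .sortBy { case (token, n) => (-n, token) }
}
```

On a typed Dataset[Token], the corresponding chain would be along the lines of `flatMap(_.tokens)` followed by `groupByKey(identity).count()`.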

Derive multiple columns from a single column in a Spark DataFrame

I have a DataFrame, let's call it DFA, with a huge parseable metadata string in a single column, ColmnA.
I would like to break this column, ColmnA, into multiple columns through a function, ClassXYZ = Func1(ColmnA). This function returns a class ClassXYZ with multiple variables, and each of these variables now has to be mapped to a new column, such as ColmnA1, ColmnA2 etc.
How would I do such a transformation from one DataFrame to another with these additional columns by calling Func1 just once, without having to repeat it to create each column?
It's easy to solve if I call this huge function every time I add a new column, but that is what I wish to avoid.
Kindly please advise with a working or pseudo code.
Thanks
Sanjay
Generally speaking, what you want is not directly possible. A UDF can return only a single column at a time. There are two different ways you can overcome this limitation:
Return a column of complex type. The most general solution is a StructType, but you can consider ArrayType or MapType as well.
import org.apache.spark.sql.functions.udf
val df = Seq(
(1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c")
).toDF("x", "y", "z")
case class Foobar(foo: Double, bar: Double)
val foobarUdf = udf((x: Long, y: Double, z: String) =>
Foobar(x * y, z.head.toInt * y))
val df1 = df.withColumn("foobar", foobarUdf($"x", $"y", $"z"))
df1.show
// +---+----+---+------------+
// | x| y| z| foobar|
// +---+----+---+------------+
// | 1| 3.0| a| [3.0,291.0]|
// | 2|-1.0| b|[-2.0,-98.0]|
// | 3| 0.0| c| [0.0,0.0]|
// +---+----+---+------------+
df1.printSchema
// root
// |-- x: long (nullable = false)
// |-- y: double (nullable = false)
// |-- z: string (nullable = true)
// |-- foobar: struct (nullable = true)
// | |-- foo: double (nullable = false)
// | |-- bar: double (nullable = false)
This can be easily flattened later but usually there is no need for that.
Switch to RDD, reshape and rebuild DF:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
def foobarFunc(x: Long, y: Double, z: String): Seq[Any] =
Seq(x * y, z.head.toInt * y)
val schema = StructType(df.schema.fields ++
Array(StructField("foo", DoubleType), StructField("bar", DoubleType)))
val rows = df.rdd.map(r => Row.fromSeq(
r.toSeq ++
foobarFunc(r.getAs[Long]("x"), r.getAs[Double]("y"), r.getAs[String]("z"))))
val df2 = sqlContext.createDataFrame(rows, schema)
df2.show
// +---+----+---+----+-----+
// | x| y| z| foo| bar|
// +---+----+---+----+-----+
// | 1| 3.0| a| 3.0|291.0|
// | 2|-1.0| b|-2.0|-98.0|
// | 3| 0.0| c| 0.0| 0.0|
// +---+----+---+----+-----+
Assume that after your function there will be a sequence of elements, giving an example as below:
val df = sc.parallelize(List(("Mike,1986,Toronto", 30), ("Andre,1980,Ottawa", 36), ("jill,1989,London", 27))).toDF("infoComb", "age")
df.show
+------------------+---+
| infoComb|age|
+------------------+---+
|Mike,1986,Toronto| 30|
| Andre,1980,Ottawa| 36|
| jill,1989,London| 27|
+------------------+---+
Now you can split the infoComb string and derive more columns from it:
df.select(expr("(split(infoComb, ','))[0]").cast("string").as("name"), expr("(split(infoComb, ','))[1]").cast("integer").as("yearOfBorn"), expr("(split(infoComb, ','))[2]").cast("string").as("city"), $"age").show
+-----+----------+-------+---+
| name|yearOfBorn| city|age|
+-----+----------+-------+---+
|Mike| 1986|Toronto| 30|
|Andre| 1980| Ottawa| 36|
| jill| 1989| London| 27|
+-----+----------+-------+---+
Hope this helps.
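The select above calls split once per derived column. Since the question asks for the parsing function to run only once per row, here is a minimal sketch, on plain Scala without Spark, of parsing the string a single time into a class that carries all the fields. `InfoParser`, `Info` and `parse` are hypothetical names; the three-field layout is taken from the infoComb example.

```scala
import scala.util.Try

object InfoParser {
  final case class Info(name: String, yearOfBirth: Int, city: String)

  // Splits the comma-separated string once and fills every field from that single split.
  // Returns None when there are not exactly three parts or the year is not a number.
  def parse(infoComb: String): Option[Info] =
    infoComb.split(",").map(_.trim) match {
      case Array(name, year, city) => Try(year.toInt).toOption.map(Info(name, _, city))
      case _                       => None
    }
}
```

A function of this shape is the kind of thing that could be wrapped in a UDF returning a struct, as in the StructType approach described earlier, so the expensive parsing happens once and the fields are selected afterwards.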
If your resulting columns will be of the same length as the original one, you can create brand new columns with the withColumn function by applying a UDF. After this you can drop your original column, e.g.:
val newDf = myDf.withColumn("newCol1", myFun(myDf("originalColumn")))
.withColumn("newCol2", myFun2(myDf("originalColumn")))
.drop(myDf("originalColumn"))
where myFun is a UDF defined like this:
def myFun= udf(
(originalColumnContent : String) => {
// do something with your original column content and return a new one
}
)
I opted to create a function to flatten one column and then just call it simultaneously with the udf.
First define this:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.types.StructType
implicit class DfOperations(df: DataFrame) {
def flattenColumn(col: String) = {
def addColumns(df: DataFrame, cols: Array[String]): DataFrame = {
if (cols.isEmpty) df
else addColumns(
df.withColumn(col + "_" + cols.head, df(col + "." + cols.head)),
cols.tail
)
}
val field = df.select(col).schema.fields(0)
val newCols = field.dataType.asInstanceOf[StructType].fields.map(x => x.name)
addColumns(df, newCols).drop(col)
}
def withColumnMany(colName: String, col: Column) = {
df.withColumn(colName, col).flattenColumn(colName)
}
}
Then usage is very simple:
case class MyClass(a: Int, b: Int)
val df = sc.parallelize(Seq(
(0),
(1)
)).toDF("x")
val f = udf((x: Int) => MyClass(x*2,x*3))
df.withColumnMany("test", f($"x")).show()
// +---+------+------+
// | x|test_a|test_b|
// +---+------+------+
// | 0| 0| 0|
// | 1| 2| 3|
// +---+------+------+
This can be easily achieved by using the pivot function:
df4.groupBy("year").pivot("course").sum("earnings").collect()