I am trying to create a DataFrame from an RDD.
First I create the RDD using the code below:
val account = sc.parallelize(Seq(
  (1, null, 2, "F"),
  (2, 2, 4, "F"),
  (3, 3, 6, "N"),
  (4, null, 8, "F")))
This works fine:
account: org.apache.spark.rdd.RDD[(Int, Any, Int, String)] = ParallelCollectionRDD[0] at parallelize at <console>:27
but when I try to create a DataFrame from the RDD using the code below
account.toDF("ACCT_ID", "M_CD", "C_CD", "IND")
I get the error below:
java.lang.UnsupportedOperationException: Schema for type Any is not supported
I have noticed that the error occurs only when I put a null value in the Seq.
Is there any way to include null values?
An alternative way, without using RDDs:
import spark.implicits._
val df = spark.createDataFrame(Seq(
  (1, None, 2, "F"),
  (2, Some(2), 4, "F"),
  (3, Some(3), 6, "N"),
  (4, None, 8, "F")
)).toDF("ACCT_ID", "M_CD", "C_CD", "IND")
df.show
+-------+----+----+---+
|ACCT_ID|M_CD|C_CD|IND|
+-------+----+----+---+
| 1|null| 2| F|
| 2| 2| 4| F|
| 3| 3| 6| N|
| 4|null| 8| F|
+-------+----+----+---+
df.printSchema
root
|-- ACCT_ID: integer (nullable = false)
|-- M_CD: integer (nullable = true)
|-- C_CD: integer (nullable = false)
|-- IND: string (nullable = true)
The problem is that Any is too general a type and Spark has no idea how to serialize it. You should explicitly provide a specific type, in your case Integer. Since null can't be assigned to primitive types in Scala, you can use java.lang.Integer instead. So try this:
val account = sc.parallelize(Seq(
  (1, null.asInstanceOf[Integer], 2, "F"),
  (2, new Integer(2), 4, "F"),
  (3, new Integer(3), 6, "N"),
  (4, null.asInstanceOf[Integer], 8, "F")))
Here is the output:
account: org.apache.spark.rdd.RDD[(Int, Integer, Int, String)] = ParallelCollectionRDD[0] at parallelize at <console>:24
And the corresponding DataFrame:
scala> val df = account.toDF("ACCT_ID", "M_CD", "C_CD", "IND")
df: org.apache.spark.sql.DataFrame = [ACCT_ID: int, M_CD: int ... 2 more fields]
scala> df.show
+-------+----+----+---+
|ACCT_ID|M_CD|C_CD|IND|
+-------+----+----+---+
| 1|null| 2| F|
| 2| 2| 4| F|
| 3| 3| 6| N|
| 4|null| 8| F|
+-------+----+----+---+
You can also consider a cleaner way to declare the null Integer value, for example:
object Constants {
  val NullInteger: java.lang.Integer = null
}
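With that in place, the sample data can be written without the asInstanceOf casts; for example (a sketch reusing the Constants object above):
val account = sc.parallelize(Seq(
  (1, Constants.NullInteger, 2, "F"),
  (2, Integer.valueOf(2), 4, "F"),
  (3, Integer.valueOf(3), 6, "N"),
  (4, Constants.NullInteger, 8, "F")))
account.toDF("ACCT_ID", "M_CD", "C_CD", "IND")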
I have an input dataframe which contains an array-typed column. Each entry in the array is a struct consisting of a key (one of about four values) and a value. I want to turn this into a dataframe with one column for each possible key, and nulls where that value is not in the array for that row. Keys are never duplicated in any of the arrays, but they may be out of order or missing.
So far the best I've got is
val wantedCols = df.columns
  .filter(_ != arrayCol)
  .filter(_ != "col")
val flattened = df
  .select((wantedCols.map(col(_)) ++ Seq(explode(col(arrayCol)))):_*)
  .groupBy(wantedCols.map(col(_)):_*)
  .pivot("col.key")
  .agg(first("col.value"))
This does exactly what I want, but it's hideous and I have no idea what the ramifications of grouping on every column but one would be. What's the RIGHT way to do this?
EDIT: Example input/output:
case class testStruct(name : String, number : String)
val dfExampleInput = Seq(
  (0, "KY", Seq(testStruct("A", "45"))),
  (1, "OR", Seq(testStruct("A", "30"), testStruct("B", "10"))))
  .toDF("index", "state", "entries")
dfExampleInput.show
+-----+-----+------------------+
|index|state| entries|
+-----+-----+------------------+
| 0| KY| [[A, 45]]|
| 1| OR|[[A, 30], [B, 10]]|
+-----+-----+------------------+
val dfExampleOutput = Seq(
  (0, "KY", "45", null),
  (1, "OR", "30", "10"))
  .toDF("index", "state", "A", "B")
dfExampleOutput.show
+-----+-----+---+----+
|index|state| A| B|
+-----+-----+---+----+
| 0| KY| 45|null|
| 1| OR| 30| 10|
+-----+-----+---+----+
FURTHER EDIT:
I submitted a solution myself (see below) that handles this well as long as you know the keys in advance (in my case, I do). If finding the keys is an issue, another answer holds code to handle that.
Without groupBy / pivot / agg / first:
Please check the code below.
scala> val df = Seq((0, "KY", Seq(("A", "45"))),(1, "OR", Seq(("A", "30"),("B", "10")))).toDF("index", "state", "entries").withColumn("entries",$"entries".cast("array<struct<name:string,number:string>>"))
df: org.apache.spark.sql.DataFrame = [index: int, state: string ... 1 more field]
scala> df.printSchema
root
|-- index: integer (nullable = false)
|-- state: string (nullable = true)
|-- entries: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- number: string (nullable = true)
scala> df.show(false)
+-----+-----+------------------+
|index|state|entries |
+-----+-----+------------------+
|0 |KY |[[A, 45]] |
|1 |OR |[[A, 30], [B, 10]]|
+-----+-----+------------------+
scala> val finalDFColumns = df.select(explode($"entries").as("entries")).select("entries.*").select("name").distinct.map(_.getAs[String](0)).orderBy($"value".asc).collect.foldLeft(df.limit(0))((cdf,c) => cdf.withColumn(c,lit(null))).columns
finalDFColumns: Array[String] = Array(index, state, entries, A, B)
scala> val max = df.select(size($"entries").as("n")).agg(org.apache.spark.sql.functions.max($"n")).head.getInt(0) // maximum number of entries in any row, needed below
max: Int = 2
scala> val finalDF = df.select($"*" +: (0 until max).map(i => $"entries".getItem(i)("number").as(i.toString)): _*)
finalDF: org.apache.spark.sql.DataFrame = [index: int, state: string ... 3 more fields]
scala> finalDF.show(false)
+-----+-----+------------------+---+----+
|index|state|entries |0 |1 |
+-----+-----+------------------+---+----+
|0 |KY |[[A, 45]] |45 |null|
|1 |OR |[[A, 30], [B, 10]]|30 |10 |
+-----+-----+------------------+---+----+
scala> finalDF.printSchema
root
|-- index: integer (nullable = false)
|-- state: string (nullable = true)
|-- entries: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- number: string (nullable = true)
|-- 0: string (nullable = true)
|-- 1: string (nullable = true)
scala> finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf,column) => fdf.withColumnRenamed(column._1,column._2)).show(false)
+-----+-----+------------------+---+----+
|index|state|entries |A |B |
+-----+-----+------------------+---+----+
|0 |KY |[[A, 45]] |45 |null|
|1 |OR |[[A, 30], [B, 10]]|30 |10 |
+-----+-----+------------------+---+----+
Final output:
scala> finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf,column) => fdf.withColumnRenamed(column._1,column._2)).drop($"entries").show(false)
+-----+-----+---+----+
|index|state|A |B |
+-----+-----+---+----+
|0 |KY |45 |null|
|1 |OR |30 |10 |
+-----+-----+---+----+
I wouldn't worry too much about grouping by several columns, other than potentially making things confusing. In that vein, if there is a simpler, more maintainable way, go for it. Without example input/output, I'm not sure if this gets you where you're trying to go, but maybe it'll be of use:
Seq(Seq("k1" -> "v1", "k2" -> "v2")).toDS() // some basic input based on my understanding of your description
.select(explode($"value")) // flatten the array
.select("col.*") // de-nest the struct
.groupBy("_2") // one row per distinct value
.pivot("_1") // one column per distinct key
.count // or agg(first) if you want the value in each column
.show
+---+----+----+
| _2| k1| k2|
+---+----+----+
| v2|null| 1|
| v1| 1|null|
+---+----+----+
Based on what you've now said, I get the impression that there are many columns like "state" that aren't required for the aggregation, but need to be in the final result.
For reference, if you didn't need to pivot, you could add a struct column with all such fields nested within, then add it to your aggregation, e.g. .agg(first($"myStruct"), first($"number")). The main advantage is having only the actual key column(s) referenced in the groupBy. But when using pivot things get a little weird, so we'll set that option aside.
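Purely for illustration, here is a rough sketch of that non-pivot idea against the example data above (the "meta" struct name is made up):
import org.apache.spark.sql.functions.{explode, first, struct}
val sketch = dfExampleInput
  .withColumn("meta", struct($"state")) // nest the pass-through columns
  .select($"index", $"meta", explode($"entries").as("entry"))
  .groupBy($"index") // only the real key appears in the groupBy
  .agg(first($"meta").as("meta"), first($"entry.number").as("number"))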
In this use case, the simplest way I could come up with involves splitting your dataframe and joining it back together after the aggregation using a row key. In this example I am assuming that "index" is suitable for that purpose:
val mehCols = dfExampleInput.columns.filter(_ != "entries").map(col)
val mehDF = dfExampleInput.select(mehCols:_*)
val aggDF = dfExampleInput
  .select($"index", explode($"entries").as("entry"))
  .select($"index", $"entry.*")
  .groupBy("index")
  .pivot("name")
  .agg(first($"number"))
scala> mehDF.join(aggDF, Seq("index")).show
+-----+-----+---+----+
|index|state| A| B|
+-----+-----+---+----+
| 0| KY| 45|null|
| 1| OR| 30| 10|
+-----+-----+---+----+
I doubt you would see much of a difference in performance, if any. Maybe at the extremes, e.g. very many meh columns, or very many pivot columns, or something like that, or maybe nothing at all. Personally, I would test both with decently-sized input, and if there wasn't a significant difference, use whichever one seemed easier to maintain.
Here is another way, based on the assumption that there are no duplicates in the entries column, i.e. Seq(testStruct("A", "30"), testStruct("A", "70"), testStruct("B", "10")) would cause an error. The following solution combines both the RDD and DataFrame APIs:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.types.StructType
case class testStruct(name : String, number : String)
val df = Seq(
  (0, "KY", Seq(testStruct("A", "45"))),
  (1, "OR", Seq(testStruct("A", "30"), testStruct("B", "10"))),
  (2, "FL", Seq(testStruct("A", "30"), testStruct("B", "10"), testStruct("C", "20"))),
  (3, "TX", Seq(testStruct("B", "60"), testStruct("A", "19"), testStruct("C", "40")))
)
  .toDF("index", "state", "entries")
  .cache
// get all possible keys from entries, i.e. Seq(A, B, C)
val finalCols = df.select(explode($"entries").as("entry"))
  .select($"entry".getField("name").as("entry_name"))
  .distinct
  .collect
  .map{_.getAs[String]("entry_name")}
  .sorted // Attention: we need to retain the order of the columns
          // 1. when generating row values and
          // 2. when creating the schema
val rdd = df.rdd.map{ r =>
  // transform the entries array into a map, i.e. Map(A -> 30, B -> 10)
  val entriesMap = r.getSeq[Row](2).map{r => (r.getString(0), r.getString(1))}.toMap
  // transform finalCols into a map with null values, i.e. Map(A -> null, B -> null, C -> null)
  val finalColsMap = finalCols.map{c => (c, null)}.toMap
  // replace the null values with those present in the current row by merging the two previous maps
  // Attention: this should retain the order of finalColsMap
  val merged = finalColsMap ++ entriesMap
  // concatenate the first two row values ["index", "state"] with the values from merged
  val finalValues = Seq(r(0), r(1)) ++ merged.values
  Row.fromSeq(finalValues)
}
val extraCols = finalCols.map{c => s"`${c}` STRING"}
val schema = StructType.fromDDL("`index` INT, `state` STRING," + extraCols.mkString(","))
val finalDf = spark.createDataFrame(rdd, schema)
finalDf.show
// +-----+-----+---+----+----+
// |index|state| A| B| C|
// +-----+-----+---+----+----+
// | 0| KY| 45|null|null|
// | 1| OR| 30| 10|null|
// | 2| FL| 30| 10| 20|
// | 3| TX| 19| 60| 40|
// +-----+-----+---+----+----+
Note: the solution requires one extra action to retrieve the unique keys, although it doesn't cause any shuffling since it is based on narrow transformations only.
I've worked out a solution myself:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}
def extractFromArray(colName : String, key : String, numKeys : Int, keyName : String) = {
  val indexCols = (0 to numKeys - 1).map(col(colName).getItem(_))
  indexCols.foldLeft(lit(null))((innerCol : Column, indexCol : Column) =>
    when(indexCol.isNotNull && (indexCol.getItem(keyName) === key), indexCol)
      .otherwise(innerCol))
}
Example:
case class testStruct(name : String, number : String)
val df = Seq(
  (0, "KY", Seq(testStruct("A", "45"))),
  (1, "OR", Seq(testStruct("A", "30"), testStruct("B", "10"))),
  (2, "FL", Seq(testStruct("A", "30"), testStruct("B", "10"), testStruct("C", "20"))),
  (3, "TX", Seq(testStruct("B", "60"), testStruct("A", "19"), testStruct("C", "40")))
)
  .toDF("index", "state", "entries")
df.withColumn("A", extractFromArray("entries", "B", 3, "name"))
  .show
which produces:
+-----+-----+--------------------+-------+
|index|state| entries| A|
+-----+-----+--------------------+-------+
| 0| KY| [[A, 45]]| null|
| 1| OR| [[A, 30], [B, 10]]|[B, 10]|
| 2| FL|[[A, 30], [B, 10]...|[B, 10]|
| 3| TX|[[B, 60], [A, 19]...|[B, 60]|
+-----+-----+--------------------+-------+
This solution is a little different from other answers:
It works on only a single key at a time
It requires the key name and number of keys be known in advance
It produces a column of structs, rather than doing the extra step of extracting specific values
It works as a simple column-to-column operation, rather than requiring transformations on the entire DF
It can be evaluated lazily
The first three points can be handled by the calling code, and they leave it somewhat more flexible for cases where you already know the keys or where the structs contain additional values to extract.
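For instance, the calling code might apply it once per known key and pull out just the number, roughly like this (a sketch assuming the keys and column names used above):
val keys = Seq("A", "B", "C")
val widened = keys.foldLeft(df)((acc, k) =>
  acc.withColumn(k, extractFromArray("entries", k, 3, "name").getField("number")))
widened.drop("entries").show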
import org.apache.spark.sql.functions.regexp_replace
val df = spark.createDataFrame(Seq(
(1, "9/11/2020"),
(2, "10/11/2020"),
(3, "1/1/2020"),
(4, "12/7/2020"))).toDF("Id", "x4")
val newDf = df
  .withColumn("x4New", regexp_replace(df("x4"), "(?:(\\d{2}))/(?:(\\d{1}))/(?:(\\d{4}))", "$1/0$2/$3"))
val newDf1 = newDf
  .withColumn("x4New1", regexp_replace(df("x4"), "(?:(\\d{1}))/(?:(\\d{1}))/(?:(\\d{4}))", "0$1/0$2/$3"))
  .withColumn("x4New2", regexp_replace(df("x4"), "(?:(\\d{1}))/(?:(\\d{2}))/(?:(\\d{4}))", "0$1/$2/$3"))
newDf1.show
The output now is:
+---+----------+----------+-----------+-----------+
| Id| x4| x4New| x4New1| x4New2|
+---+----------+----------+-----------+-----------+
| 1| 9/11/2020| 9/11/2020| 9/11/2020| 09/11/2020|
| 2|10/11/2020|10/11/2020| 10/11/2020|100/11/2020|
| 3| 1/1/2020| 1/1/2020| 01/01/2020| 1/1/2020|
| 4| 12/7/2020|12/07/2020|102/07/2020| 12/7/2020|
+---+----------+----------+-----------+-----------+
Desired output: add a leading zero in front of the day or month when it is a single digit. I do not want to use a UDF for performance reasons.
+---+----------+----------+
| Id| x4| date |
+---+----------+----------+
| 1| 9/11/2020|09/11/2020|
| 2|10/11/2020|10/11/2020|
| 3| 1/1/2020|01/01/2020|
| 4| 12/7/2020|12/07/2020|
+---+----------+----------+
Use the built-in functions from_unixtime/unix_timestamp, date_format/to_timestamp, or to_date.
Example (in Spark 2.4):
import org.apache.spark.sql.functions._
//sample data
val df = spark.createDataFrame(Seq((1, "9/11/2020"),(2, "10/11/2020"),(3, "1/1/2020"), (4, "12/7/2020"))).toDF("Id", "x4")
//using from_unixtime
df.withColumn("date",from_unixtime(unix_timestamp(col("x4"),"MM/dd/yyyy"),"MM/dd/yyyy")).show()
//using date_format
df.withColumn("date",date_format(to_timestamp(col("x4"),"MM/dd/yyyy"),"MM/dd/yyyy")).show()
df.withColumn("date",date_format(to_date(col("x4"),"MM/dd/yyyy"),"MM/dd/yyyy")).show()
//+---+----------+----------+
//| Id| x4| date|
//+---+----------+----------+
//| 1| 9/11/2020|09/11/2020|
//| 2|10/11/2020|10/11/2020|
//| 3| 1/1/2020|01/01/2020|
//| 4| 12/7/2020|12/07/2020|
//+---+----------+----------+
Found a workaround; see if there is a better solution using one DataFrame and no UDF.
import org.apache.spark.sql.functions.regexp_replace
val df = spark.createDataFrame(Seq(
(1, "9/11/2020"),
(2, "10/11/2020"),
(3, "1/1/2020"),
(4, "12/7/2020"))).toDF("Id", "x4")
val newDf = df.withColumn("x4New", regexp_replace(df("x4"), "(?:(\\b\\d{2}))/(?:(\\d))/(?:(\\d{4})\\b)", "$1/0$2/$3"))
val newDf1 = newDf.withColumn("x4New1", regexp_replace(newDf("x4New"), "(?:(\\b\\d{1}))/(?:(\\d))/(?:(\\d{4})\\b)", "0$1/$2/$3"))
val newDf2 = newDf1.withColumn("x4New2", regexp_replace(newDf1("x4New1"), "(?:(\\b\\d{1}))/(?:(\\d{2}))/(?:(\\d{4})\\b)", "0$1/$2/$3"))
val newDf3 = newDf2.withColumn("date", to_date(regexp_replace(newDf2("x4New2"), "(?:(\\b\\d{2}))/(?:(\\d{1}))/(?:(\\d{4})\\b)", "$1/0$2/$3"),"MM/dd/yyyy"))
val formatedDataDf = newDf3
  .drop("x4New")
  .drop("x4New1")
  .drop("x4New2")
formatedDataDf.printSchema
formatedDataDf.show
The output looks as follows:
root
|-- Id: integer (nullable = false)
|-- x4: string (nullable = true)
|-- date: date (nullable = true)
+---+----------+----------+
| Id| x4| date|
+---+----------+----------+
| 1| 9/11/2020|2020-09-11|
| 2|10/11/2020|2020-10-11|
| 3| 1/1/2020|2020-01-01|
| 4| 12/7/2020|2020-12-07|
+---+----------+----------+
I'm writing a unit test whose test data needs to contain some null values.
I tried putting the nulls straight into the tuples and I also tried using Options; neither worked.
Here is my code:
import sparkSession.implicits._
// Data set with null for even values
val sampleData = sparkSession.createDataset(Seq(
  (1, Some("Yes"), None),
  (2, None, None),
  (3, Some("Okay"), None),
  (4, None, None)))
  .toDF("id", "title", "value")
Stack trace:
None.type (of class scala.reflect.internal.Types$UniqueSingleType)
scala.MatchError: None.type (of class scala.reflect.internal.Types$UniqueSingleType)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:472)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:596)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:587)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:587)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:49)
You should use None: Option[String] instead of a bare None:
scala> val maybeString = None: Option[String]
maybeString: Option[String] = None
scala> val sampleData = spark.createDataset(Seq(
| (1, Some("Yes"), maybeString),
| (2, maybeString, maybeString),
| (3, Some("Okay"), maybeString),
| (4, maybeString, maybeString))).toDF("id", "title", "value")
sampleData: org.apache.spark.sql.DataFrame = [id: int, title: string ... 1 more field]
scala> sampleData.show
+---+-----+-----+
| id|title|value|
+---+-----+-----+
| 1| Yes| null|
| 2| null| null|
| 3| Okay| null|
| 4| null| null|
+---+-----+-----+
Or, if you're just dealing with Strings, you can use null.asInstanceOf[String]:
val df1 = sc.parallelize(Seq((1, "Yes", null.asInstanceOf[String]),
| (2, null.asInstanceOf[String], null.asInstanceOf[String]),
| (3, "Okay", null.asInstanceOf[String]),
| (4, null.asInstanceOf[String], null.asInstanceOf[String]))).toDF("id", "title", "value")
df1: org.apache.spark.sql.DataFrame = [id: int, title: string, value: string]
scala> df1.show
+---+-----+-----+
| id|title|value|
+---+-----+-----+
| 1| Yes| null|
| 2| null| null|
| 3| Okay| null|
| 4| null| null|
+---+-----+-----+
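Another small variant (not shown in the original answer) is to ascribe the element type of the Seq, which also lets the bare None values compile:
import sparkSession.implicits._
val sampleData = sparkSession.createDataset(Seq[(Int, Option[String], Option[String])](
  (1, Some("Yes"), None),
  (2, None, None),
  (3, Some("Okay"), None),
  (4, None, None)))
  .toDF("id", "title", "value")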
I am trying to aggregate multiple columns after a pivot in Scala Spark 2.0.1:
scala> val df = List((1, 2, 3, None), (1, 3, 4, Some(1))).toDF("a", "b", "c", "d")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 2 more fields]
scala> df.show
+---+---+---+----+
| a| b| c| d|
+---+---+---+----+
| 1| 2| 3|null|
| 1| 3| 4| 1|
+---+---+---+----+
scala> val pivoted = df.groupBy("a").pivot("b").agg(max("c"), max("d"))
pivoted: org.apache.spark.sql.DataFrame = [a: int, 2_max(`c`): int ... 3 more fields]
scala> pivoted.show
+---+----------+----------+----------+----------+
| a|2_max(`c`)|2_max(`d`)|3_max(`c`)|3_max(`d`)|
+---+----------+----------+----------+----------+
| 1| 3| null| 4| 1|
+---+----------+----------+----------+----------+
I am unable to select or rename those columns so far:
scala> pivoted.select("3_max(`d`)")
org.apache.spark.sql.AnalysisException: syntax error in attribute name: 3_max(`d`);
scala> pivoted.select("`3_max(`d`)`")
org.apache.spark.sql.AnalysisException: syntax error in attribute name: `3_max(`d`)`;
scala> pivoted.select("`3_max(d)`")
org.apache.spark.sql.AnalysisException: cannot resolve '`3_max(d)`' given input columns: [2_max(`c`), 3_max(`d`), a, 2_max(`d`), 3_max(`c`)];
There must be a simple trick here, any ideas? Thanks.
Seems like a bug; the backticks caused the problem. One fix here would be to remove the backticks from the column names:
val pivotedNewName = pivoted.columns.foldLeft(pivoted)((df, col) =>
  df.withColumnRenamed(col, col.replace("`", "")))
Now you can select by column names as normal:
pivotedNewName.select("2_max(c)").show
+--------+
|2_max(c)|
+--------+
| 3|
+--------+
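Another option, not from the original answer but worth noting: alias the aggregate expressions so the generated pivot column names never contain backticks in the first place (a sketch):
import org.apache.spark.sql.functions.max
val pivotedAliased = df.groupBy("a").pivot("b").agg(max("c").as("mc"), max("d").as("md"))
// columns come out as a, 2_mc, 2_md, 3_mc, 3_md and can be selected directly
pivotedAliased.select("3_md").show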
test.csv
name,key1,key2
A,1,2
B,1,3
C,4,3
I want to transform this data into the following (as a Dataset or RDD):
whatIwant.csv
name,key,newkeyname
A,1,KEYA
A,2,KEYB
B,1,KEYA
B,3,KEYB
C,4,KEYA
C,3,KEYB
I loaded the data with the read method:
val df = spark.read
  .option("header", true)
  .option("charset", "euc-kr")
  .csv(csvFilePath)
I can load the data as separate datasets like (name, key1) and (name, key2) and combine them with union, but I want to do this within a single Spark session.
Any ideas on how to do this?
These do not work:
val df2 = df.select( df("TAG_NO"), df.map { x => (x.getAs[String]("MK_VNDRNM"), x.getAs[String]("WK_ORD_DT")) })
val df2 = df.select( df("TAG_NO"), Seq(df("TAG_NO"), df("WK_ORD_DT")))
This can be accomplished with explode and a UDF:
scala> val df = Seq(("A", 1, 2), ("B", 1, 3), ("C", 4, 3)).toDF("name", "key1", "key2")
df: org.apache.spark.sql.DataFrame = [name: string, key1: int ... 1 more field]
scala> df.show
+----+----+----+
|name|key1|key2|
+----+----+----+
| A| 1| 2|
| B| 1| 3|
| C| 4| 3|
+----+----+----+
scala> val explodeUDF = udf((v1: String, v2: String) => Vector((v1, "Key1"), (v2, "Key2")))
explodeUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(StructType(StructField(_1,StringType,true), StructField(_2,StringType,true)),true),Some(List(StringType, StringType)))
scala> df = df.withColumn("TMP", explode(explodeUDF($"key1", $"key2"))).drop("key1", "key2")
df: org.apache.spark.sql.DataFrame = [name: string, TMP: struct<_1: string, _2: string>]
scala> df = df.withColumn("key", $"TMP".apply("_1")).withColumn("new key name", $"TMP".apply("_2"))
df: org.apache.spark.sql.DataFrame = [name: string, TMP: struct<_1: string, _2: string> ... 2 more fields]
scala> df = df.drop("TMP")
df: org.apache.spark.sql.DataFrame = [name: string, key: string ... 1 more field]
scala> df.show
+----+---+------------+
|name|key|new key name|
+----+---+------------+
| A| 1| Key1|
| A| 2| Key2|
| B| 1| Key1|
| B| 3| Key2|
| C| 4| Key1|
| C| 3| Key2|
+----+---+------------+
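For reference, and not part of the answer above, the same reshaping can also be done without a UDF by exploding an array of structs; a rough sketch (assumes spark.implicits._ is in scope):
import org.apache.spark.sql.functions.{array, explode, lit, struct}
val src = Seq(("A", 1, 2), ("B", 1, 3), ("C", 4, 3)).toDF("name", "key1", "key2")
val unpivoted = src
  .select($"name", explode(array(
    struct($"key1".as("key"), lit("KEYA").as("newkeyname")),
    struct($"key2".as("key"), lit("KEYB").as("newkeyname"))
  )).as("kv"))
  .select($"name", $"kv.key", $"kv.newkeyname")
unpivoted.show
This yields one row per (name, key) pair with the KEYA/KEYB labels, matching the whatIwant.csv layout.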