Update a dataframe struct column value in Spark Scala

I have a data frame with the following schema:
root
|-- col1: string (nullable = true)
|-- col2: integer (nullable = true)
|-- colStruct: struct (nullable = true)
| |-- subCol1: integer (nullable = true)
| |-- subCol2: string (nullable = true)
| |-- subCol3: integer (nullable = true)
How do I update the subCol1 and subCol3 column values using a UDF?

Access the nested columns with .(dot) notation.
Here's an example:
Data
case class Details(height: Integer, weight: Integer, sex: String) // height in cm, weight in lbs
case class Person(name: String, age: Integer, details: Details)

import spark.implicits._ // required for .toDF on a Seq of case classes

println("The following is our dataset")
val data = Seq(
  Person("Darth Vader", 80, Details(180, 200, "male")),
  Person("Luke Skywalker", 25, Details(185, 180, "male")),
  Person("Obi-Wan Kenobe", 50, Details(175, 175, "male")),
  Person("Princess Leia", 23, Details(165, 150, "female"))
).toDF.cache()
data.show(5, false)
println("The schema of our data is:")
data.printSchema()
/*
The following is our dataset
+--------------+---+------------------+
|name |age|details |
+--------------+---+------------------+
|Darth Vader |80 |{180, 200, male} |
|Luke Skywalker|25 |{185, 180, male} |
|Obi-Wan Kenobe|50 |{175, 175, male} |
|Princess Leia |23 |{165, 150, female}|
+--------------+---+------------------+
The schema of our data is:
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- details: struct (nullable = true)
| |-- height: integer (nullable = true)
| |-- weight: integer (nullable = true)
| |-- sex: string (nullable = true)
*/
Update Nested Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
// list out the columns you want to update using .(dot) notation
val allNestedColumnNamesToUpdate = Seq("details.height", "details.weight")
// list out all nested columns
val allNestedColumnNames = Seq("height", "weight", "sex")
// create your UDFs. Here we have created one for each integer nested column
val updateHeight = (value: Int) => { if (value < 180) 190 else 170 }
val updateWeight = (value: Int) => { if (value < 180) 190 else 170 }
// register UDFs
val updateHeightUDF = spark.udf.register("updateHeightUDF", updateHeight)
val updateWeightUDF = spark.udf.register("updateWeightUDF", updateWeight)
// Map the name of each nested column to update to its UDF
val columnNameToUpdateToUDFMap = Map(
  "details.height" -> updateHeightUDF,
  "details.weight" -> updateWeightUDF
)
val updatedDF = allNestedColumnNamesToUpdate.foldLeft(data) { (acc, columnNameToUpdate) =>
  val udf = columnNameToUpdateToUDFMap(columnNameToUpdate)
  val updatedStructColumns = allNestedColumnNames.map { x =>
    // rebuild every field of the struct, swapping in the UDF result for the field being updated
    if (s"details.$x" == columnNameToUpdate) udf(col(columnNameToUpdate)).as(x)
    else col(s"details.$x")
  }
  acc.withColumn("details", struct(updatedStructColumns: _*))
}
updatedDF.show()
/*
+--------------+---+------------------+
| name|age| details|
+--------------+---+------------------+
| Darth Vader| 80| {170, 170, male}|
|Luke Skywalker| 25| {170, 170, male}|
|Obi-Wan Kenobe| 50| {190, 190, male}|
| Princess Leia| 23|{190, 190, female}|
+--------------+---+------------------+
*/
Note: the use of UDFs is not recommended, as they are not visible to Spark's optimizer.
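If you are on Spark 3.1 or later (an assumption; the answer above deliberately shows the UDF approach), the same update can be expressed with the built-in Column.withField, which keeps the logic visible to the optimizer. A minimal sketch against the example data above:
import org.apache.spark.sql.functions._
// Rewrite the two integer fields in place; withField is available from Spark 3.1.
// Both conditions read the original values of details.height / details.weight.
val updatedNoUdf = data.withColumn(
  "details",
  col("details")
    .withField("height", when(col("details.height") < 180, lit(190)).otherwise(lit(170)))
    .withField("weight", when(col("details.weight") < 180, lit(190)).otherwise(lit(170)))
)
updatedNoUdf.show(false)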

Related

How to save output of multiple queries under single JSON file in appended mode using spark scala

I have 5 queries like below:
select * from table1
select * from table2
select * from table3
select * from table4
select * from table5
Now, what I want is to execute these queries sequentially and keep appending the output to a single JSON file. I wrote the code below, but it stores the output of each query in different part files instead of one.
Below is my code:
def store(jobEntity: JobDetails, jobRunId: Int): Unit = {
  UDFUtil.registerUdfFunctions()
  var outputTableName: String = null
  val jobQueryMap = jobEntity.jobQueryList.map(jobQuery => (jobQuery.sequenceId, jobQuery))
  val sortedQueries = scala.collection.immutable.TreeMap(jobQueryMap.toSeq: _*).toMap
  LOGGER.debug("sortedQueries ===>" + sortedQueries)
  try {
    outputTableName = jobEntity.destinationEntity
    var resultDF: DataFrame = null
    sortedQueries.values.foreach(jobQuery => {
      LOGGER.debug(s"jobQuery.query ===> ${jobQuery.query}")
      resultDF = SparkSession.builder.getOrCreate.sqlContext.sql(jobQuery.query)
      if (jobQuery.partitionColumn != null && !jobQuery.partitionColumn.trim.isEmpty) {
        resultDF = resultDF.repartition(jobQuery.partitionColumn.split(",").map(col): _*)
      }
      if (jobQuery.isKeepInMemory) {
        resultDF = resultDF.persist(StorageLevel.MEMORY_AND_DISK_SER)
      }
      if (jobQuery.isCheckpointEnabled) {
        val checkpointDir = ApplicationConfig.getAppConfig(JobConstants.CHECKPOINT_DIR)
        val fs = FileSystem.get(new Storage(JsonUtil.toMap[String](jobEntity.sourceConnection)).asHadoopConfig())
        val path = new Path(checkpointDir)
        if (!fs.exists(path)) {
          fs.mkdirs(path)
        }
        resultDF.explain(true)
        SparkSession.builder.getOrCreate.sparkContext.setCheckpointDir(checkpointDir)
        resultDF = resultDF.checkpoint
      }
      resultDF = {
        if (jobQuery.isBroadCast) {
          import org.apache.spark.sql.functions.broadcast
          broadcast(resultDF)
        } else
          resultDF
      }
      tempViewsList.+=(jobQuery.queryAliasName)
      resultDF.createOrReplaceTempView(jobQuery.queryAliasName)
      // resultDF.explain(true)
      val map: Map[String, String] = JsonUtil.toMap[String](jobEntity.sinkConnection)
      LOGGER.debug("sink details :: " + map)
      if (resultDF != null && !resultDF.take(1).isEmpty) {
        resultDF.show(false)
        val sinkDetails = new Storage(JsonUtil.toMap[String](jobEntity.sinkConnection))
        val path = sinkDetails.basePath + File.separator + jobEntity.destinationEntity
        println("path::: " + path)
        resultDF.repartition(1).write.mode(SaveMode.Append).json(path)
      }
    }
    )
Just ignore the other things (checkpointing, logging, auditing) that I am doing in this method along with the reading and writing.
Use the below example as a reference for your problem.
I have three tables with Json data (with different schema) as below:
table1 --> Personal Data Table
table2 --> Company Data Table
table3 --> Salary Data Table
I am reading these three tables one by one in sequential mode, as per your requirement, and doing a few transformations over the data (exploding the JSON array column) with the help of the list TableColList, which contains the array column name corresponding to each table, separated by a colon (":").
OutDFList is the list of all transformed DataFrames.
At the end, I am reducing all DataFrames from OutDFList into a single DataFrame and writing it into one JSON file.
Note: I have used join to reduce all the DataFrames; you can also use union (if they have the same columns) or whatever fits your requirement.
Check below code:
scala> spark.sql("select * from table1").printSchema
root
|-- Personal: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- DOB: string (nullable = true)
| | |-- EmpID: string (nullable = true)
| | |-- Name: string (nullable = true)
scala> spark.sql("select * from table2").printSchema
root
|-- Company: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- EmpID: string (nullable = true)
| | |-- JoinDate: string (nullable = true)
| | |-- Project: string (nullable = true)
scala> spark.sql("select * from table3").printSchema
root
|-- Salary: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- EmpID: string (nullable = true)
| | |-- Monthly: string (nullable = true)
| | |-- Yearly: string (nullable = true)
scala> val TableColList = List("table1:Personal", "table2:Company", "table3:Salary")
TableColList: List[String] = List(table1:Personal, table2:Company, table3:Salary)
scala> val OutDFList = TableColList.map{ X =>
| val table = X.split(":")(0)
| val arrayColumn = X.split(":")(1)
| val df = spark.sql(s"""SELECT * FROM """ + table).select(explode(col(arrayColumn)) as "data").select("data.*")
| df}
OutDFList: List[org.apache.spark.sql.DataFrame] = List([DOB: string, EmpID: string ... 1 more field], [EmpID: string, JoinDate: string ... 1 more field], [EmpID: string, Monthly: string ... 1 more field])
scala> val FinalOutDF = OutDFList.reduce((df1, df2) => df1.join(df2, "EmpID"))
FinalOutDF: org.apache.spark.sql.DataFrame = [EmpID: string, DOB: string ... 5 more fields]
scala> FinalOutDF.printSchema
root
|-- EmpID: string (nullable = true)
|-- DOB: string (nullable = true)
|-- Name: string (nullable = true)
|-- JoinDate: string (nullable = true)
|-- Project: string (nullable = true)
|-- Monthly: string (nullable = true)
|-- Yearly: string (nullable = true)
scala> FinalOutDF.write.json("/FinalJsonOut")
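Note that, as with the code in the question, this write will still produce one part file per partition of FinalOutDF. If the goal is a single part file, collapsing to one partition before writing should do it (a sketch; the path is just the example path from above):
// collapse to a single partition so only one part-xxxxx.json file is written
FinalOutDF.coalesce(1).write.mode("append").json("/FinalJsonOut")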
First things first, you need to union all the schemas:
import org.apache.spark.sql.functions._
val df1 = sc.parallelize(List(
  (42, 11),
  (43, 21)
)).toDF("foo", "bar")

val df2 = sc.parallelize(List(
  (44, true, 1.0),
  (45, false, 3.0)
)).toDF("foo", "foo0", "foo1")

val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = cols1 ++ cols2 // union of the two column sets

def expr(myCols: Set[String], allCols: Set[String]) = {
  allCols.toList.map {
    case x if myCols.contains(x) => col(x)
    case x => lit(null).as(x)
  }
}

// give the combined DataFrame its own name; reusing `total` here would not compile
val unionedDF = df1.select(expr(cols1, total): _*).unionAll(df2.select(expr(cols2, total): _*))
unionedDF.show()
And obviously save it to a single JSON file:
df.coalesce(1).write.mode("append").json("/some/path")
Update: if you are not using DataFrames, just go with plain SQL queries (writing to a single file remains the same: coalesce(1) or repartition(1)):
spark.sql(
"""
|SELECT id, name
|FROM (
| SELECT first.id, first.name FROM first
| UNION
| SELECT second.id, second.name FROM second
| ORDER BY second.name
| ) t
""".stripMargin).show()

Change to empty array if another column is false

I am trying to create a dataframe that returns an empty array for a nested struct type if another column is false. I created a dummy dataframe to illustrate my problem.
import spark.implicits._
import org.apache.spark.sql.functions._ // for struct and collect_list

val newDf = spark.createDataFrame(Seq(
  ("user1", "true", Some(8), Some("usd"), Some("tx1")),
  ("user1", "true", Some(9), Some("usd"), Some("tx2")),
  ("user2", "false", None, None, None)
)).toDF("userId", "flag", "amount", "currency", "transactionId")

val amountStruct = struct("amount", "currency").alias("amount")
val transactionStruct = struct("transactionId", "amount").alias("transactions")
val dataStruct = struct("flag", "transactions").alias("data")

val finalDf = newDf.
  withColumn("amount", amountStruct).
  withColumn("transactions", transactionStruct).
  select("userId", "flag", "transactions").
  groupBy("userId", "flag").
  agg(collect_list("transactions").alias("transactions")).
  withColumn("data", dataStruct).
  drop("transactions", "flag")
This is the output:
+------+--------------------+
|userId| data|
+------+--------------------+
| user2| [false, [[, [,]]]]|
| user1|[true, [[tx1, [8,...|
+------+--------------------+
and schema:
root
|-- userId: string (nullable = true)
|-- data: struct (nullable = false)
| |-- flag: string (nullable = true)
| |-- transactions: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- transactionId: string (nullable = true)
| | | |-- amount: struct (nullable = false)
| | | | |-- amount: integer (nullable = true)
| | | | |-- currency: string (nullable = true)
The output I want:
+------+--------------------+
|userId| data|
+------+--------------------+
| user2| [false, []] |
| user1|[true, [[tx1, [8,...|
+------+--------------------+
I've tried doing this before doing collect_list but no luck.
import org.apache.spark.sql.functions.typedLit
val emptyArray = typedLit(Array.empty[(String, Array[(Int, String)])])
testDf.withColumn("transactions", when($"flag" === "false", emptyArray).otherwise($"transactions")).show()
You were moments from victory. The approach with collect_list is the way to go; it just needs a little nudge.
TL;DR Solution
val newDf = spark
.createDataFrame(
Seq(
("user1", "true", Some(8), Some("usd"), Some("tx1")),
("user1", "true", Some(9), Some("usd"), Some("tx2")),
("user2", "false", None, None, None)
)
)
.toDF("userId", "flag", "amount", "currency", "transactionId")
val dataStruct = struct("flag", "transactions")
val finalDf2 = newDf
.groupBy("userId", "flag")
.agg(
collect_list(
when(
$"transactionId".isNotNull && $"amount".isNotNull && $"currency".isNotNull,
struct(
$"transactionId",
struct($"amount", $"currency").alias("amount")
)
)).alias("transactions"))
.withColumn("data", dataStruct)
.drop("transactions", "flag")
Explanation
SQL Aggregate Function Behavior
First of all, when it comes to behavior Spark follows SQL conventions. All the SQL aggregation functions (and collect_list is an aggregate function) ignore NULL inputs as if they were never there.
Let's take a look at how collect_list behaves:
Seq(
("a", Some(1)),
("a", Option.empty[Int]),
("a", Some(3)),
("b", Some(10)),
("b", Some(20)),
("b", Option.empty[Int])
)
.toDF("col1", "col2")
.groupBy($"col1")
.agg(collect_list($"col2") as "col2_list")
.show()
And the result is:
+----+---------+
|col1|col2_list|
+----+---------+
| b| [10, 20]|
| a| [1, 3]|
+----+---------+
Tracking Down Nullability
It looks like collect_list behaves properly. So the reason you are seeing those blanks in your output is that the column passed to collect_list is not nullable.
To prove it let's examine the schema of the DataFrame just before it gets aggregated:
newDf
.withColumn("amount", amountStruct)
.withColumn("transactions", transactionStruct)
.printSchema()
root
|-- userId: string (nullable = true)
|-- flag: string (nullable = true)
|-- amount: struct (nullable = false)
| |-- amount: integer (nullable = true)
| |-- currency: string (nullable = true)
|-- currency: string (nullable = true)
|-- transactionId: string (nullable = true)
|-- transactions: struct (nullable = false)
| |-- transactionId: string (nullable = true)
| |-- amount: struct (nullable = false)
| | |-- amount: integer (nullable = true)
| | |-- currency: string (nullable = true)
Note the transactions: struct (nullable = false) part. It proves the suspicion.
If we translate all the nested NULLables to Scala here's what you got:
case class Row(
transactions: Transactions,
// other fields
)
case class Transactions(
transactionId: Option[String],
amount: Option[Amount],
)
case class Amount(
amount: Option[Int],
currency: Option[String]
)
And here's what you want instead:
case class Row(
transactions: Option[Transactions], // this is optional now
// other fields
)
case class Transactions(
transactionId: String, // while this is not optional
amount: Amount, // neither is this
)
case class Amount(
amount: Int, // neither is this
currency: String // neither is this
)
Fixing the Nullability
Now the last step is simple. To make the column that is the input to collect_list "properly" nullable, you have to check the nullability of the amount, currency and transactionId columns.
The result will be NOT NULL if and only if all the input columns are NOT NULL.
You can use the same when API method to construct the result. The otherwise clause, if omitted, implicitly returns NULL, which is exactly what you need.
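As a quick check of the fixed pipeline (struct rendering and row order in show depend on your Spark version, so treat the shape below as approximate):
finalDf2.show(false)
/*
+------+------------------------------------------+
|userId|data                                      |
+------+------------------------------------------+
|user2 |[false, []]                               |
|user1 |[true, [[tx1, [8, usd]], [tx2, [9, usd]]]]|
+------+------------------------------------------+
*/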

Extracting values from Scala WrappedArray

I'm working with Apache Spark's ALS model, and the recommendForAllUsers method returns a dataframe with the schema
root
|-- user_id: integer (nullable = false)
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- item_id: integer (nullable = true)
| | |-- rating: float (nullable = true)
In practice, the recommendations are a WrappedArray like:
WrappedArray([636958,0.32910484], [995322,0.31974298], [1102140,0.30444127], [1160820,0.27908015], [1208899,0.26943958])
I'm trying to extract just the item_ids and return them as a 1D array. So the above example would be [636958,995322,1102140,1160820,1208899]
This is what's giving me trouble. So far I have:
val numberOfRecs = 20
val userRecs = model.recommendForAllUsers(numberOfRecs).cache()
val strippedScores = userRecs.rdd.map(row => {
val user_id = row.getInt(0)
val recs = row.getAs[Seq[Row]](1)
val item_ids = new Array[Int](numberOfRecs)
recs.toArray.foreach(x => {
item_ids :+ x.get(0)
})
item_ids
})
But this just returns [I@2f318251, and if I get the string value of it via mkString(","), it returns 0,0,0,0,0,0
Any thoughts on how I can extract the item_ids and return them as a separate, 1D array?
Found in the Spark ALSModel docs that recommendForAllUsers returns
"a DataFrame of (userCol: Int, recommendations), where recommendations
are stored as an array of (itemCol: Int, rating: Float) Rows"
(https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.recommendation.ALSModel)
By array, it means WrappedArray, so instead of trying to cast it to Seq[Row], I cast it to mutable.WrappedArray[Row]. I was then able to get each item_id like:
val userRecItems = userRecs.rdd.map(row => {
  val user_id = row.getInt(0)
  val recs = row.getAs[mutable.WrappedArray[Row]](1)
  for (rec <- recs) {
    val item_id = rec.getInt(0)
    userRecommendations += item_id
  }
})
where userRecommendations was a mutable ArrayBuffer
You can use a fully qualified name to access a structure element in the array:
scala> case class Recommendation(item_id: Int, rating: Float)
defined class Recommendation
scala> val userReqs = Seq(Array(Recommendation(636958,0.32910484f), Recommendation(995322,0.31974298f), Recommendation(1102140,0.30444127f), Recommendation(1160820,0.27908015f), Recommendation(1208899,0.26943958f))).toDF
userReqs: org.apache.spark.sql.DataFrame = [value: array<struct<item_id:int,rating:float>>]
scala> userReqs.printSchema
root
|-- value: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- item_id: integer (nullable = false)
| | |-- rating: float (nullable = false)
scala> userReqs.select("value.item_id").show(false)
+-------------------------------------------+
|item_id |
+-------------------------------------------+
|[636958, 995322, 1102140, 1160820, 1208899]|
+-------------------------------------------+
scala> val ids = userReqs.select("value.item_id").collect().flatMap(_.getAs[Seq[Int]](0))
ids: Array[Int] = Array(636958, 995322, 1102140, 1160820, 1208899)
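Applied to the recommendForAllUsers output from the question (using the column names from its schema), the same fully qualified name trick avoids the RDD round-trip entirely; a sketch:
// selecting recommendations.item_id on an array<struct<...>> column yields one array<int> per user
val itemIdsPerUser = userRecs.select($"user_id", $"recommendations.item_id".alias("item_ids"))
itemIdsPerUser.printSchema()
// root
//  |-- user_id: integer (nullable = false)
//  |-- item_ids: array (nullable = true)
//  |    |-- element: integer (containsNull = true)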

Remove Null from Array Columns in Dataframe in Scala with Spark (1.6)

I have a dataframe with a key column and a column which has an array of struct. The Schema looks like below.
root
|-- id: string (nullable = true)
|-- desc: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: long (nullable = false)
The array "desc" can have any number of null values. I would like to create a final dataframe with the array having none of the null values using spark 1.6:
An example would be:
Key . Value
1010 . [[George,21],null,[MARIE,13],null]
1023 . [null,[Watson,11],[John,35],null,[Kyle,33]]
I want the final dataframe as:
Key . Value
1010 . [[George,21],[MARIE,13]]
1023 . [[Watson,11],[John,35],[Kyle,33]]
I tried doing this with UDF and case class but got
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to....
Any help is greatly appreciated and I would prefer doing it without converting to RDDs if needed. Also I am new to spark and scala so thanks in advance!!!
Here is another version:
case class Person(name: String, age: Int)
root
|-- id: string (nullable = true)
|-- desc: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: integer (nullable = false)
+----+-----------------------------------------------+
|id |desc |
+----+-----------------------------------------------+
|1010|[[George,21], null, [MARIE,13], null] |
|1023|[[Watson,11], null, [John,35], null, [Kyle,33]]|
+----+-----------------------------------------------+
val filterOutNull = udf((xs: Seq[Row]) => {
  xs.flatMap {
    case null => Nil
    // convert the Row back to your specific struct:
    case Row(s: String, i: Int) => List(Person(s, i))
  }
})
val result = df.withColumn("filteredListDesc", filterOutNull($"desc"))
+----+-----------------------------------------------+-----------------------------------+
|id |desc |filteredListDesc |
+----+-----------------------------------------------+-----------------------------------+
|1010|[[George,21], null, [MARIE,13], null] |[[George,21], [MARIE,13]] |
|1023|[[Watson,11], null, [John,35], null, [Kyle,33]]|[[Watson,11], [John,35], [Kyle,33]]|
+----+-----------------------------------------------+-----------------------------------+
Given that the original dataframe has the following schema
root
|-- id: string (nullable = true)
|-- desc: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: long (nullable = false)
Defining a udf function to remove the null values from the array should work for you
import org.apache.spark.sql.functions._
def removeNull = udf((array: Seq[Row])=> array.filterNot(_ == null).map(x => element(x.getAs[String]("name"), x.getAs[Long]("age"))))
df.withColumn("desc", removeNull(col("desc")))
where element is a case class
case class element(name: String, age: Long)
and you should get
+----+-----------------------------------+
|id |desc |
+----+-----------------------------------+
|1010|[[George,21], [MARIE,13]] |
|1023|[[Watson,11], [John,35], [Kyle,33]]|
+----+-----------------------------------+
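For readers who are not tied to Spark 1.6: from Spark 2.4 onward the same cleanup can be done without a UDF, using the filter higher-order function (a sketch, not applicable under the asker's 1.6 constraint):
import org.apache.spark.sql.functions.expr
// keep only the non-null struct elements of the array; no UDF, so Catalyst can optimize it
val cleaned = df.withColumn("desc", expr("filter(desc, x -> x IS NOT NULL)"))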

Renaming column names of a DataFrame in Spark Scala

I am trying to convert all the headers / column names of a DataFrame in Spark Scala. As of now I have come up with the following code, which only replaces a single column name.
for (i <- 0 to origCols.length - 1) {
  df.withColumnRenamed(
    df.columns(i),
    df.columns(i).toLowerCase
  );
}
If the structure is flat:
val df = Seq((1L, "a", "foo", 3.0)).toDF
df.printSchema
// root
// |-- _1: long (nullable = false)
// |-- _2: string (nullable = true)
// |-- _3: string (nullable = true)
// |-- _4: double (nullable = false)
the simplest thing you can do is to use toDF method:
val newNames = Seq("id", "x1", "x2", "x3")
val dfRenamed = df.toDF(newNames: _*)
dfRenamed.printSchema
// root
// |-- id: long (nullable = false)
// |-- x1: string (nullable = true)
// |-- x2: string (nullable = true)
// |-- x3: double (nullable = false)
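Applied to the original question (lowercasing every header in one go), the same toDF trick is a one-liner; a sketch:
// rename all columns at once; note that withColumnRenamed called in a loop returns new
// DataFrames, which the question's code was discarding
val dfLower = df.toDF(df.columns.map(_.toLowerCase): _*)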
If you want to rename individual columns you can use either select with alias:
df.select($"_1".alias("x1"))
which can be easily generalized to multiple columns:
val lookup = Map("_1" -> "foo", "_3" -> "bar")
df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)
or withColumnRenamed:
df.withColumnRenamed("_1", "x1")
which can be used with foldLeft to rename multiple columns:
lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))
With nested structures (structs) one possible option is renaming by selecting a whole structure:
val nested = spark.read.json(sc.parallelize(Seq(
"""{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
)))
nested.printSchema
// root
// |-- foobar: struct (nullable = true)
// | |-- foo: struct (nullable = true)
// | | |-- bar: struct (nullable = true)
// | | | |-- first: double (nullable = true)
// | | | |-- second: double (nullable = true)
// |-- id: long (nullable = true)
@transient val foobarRenamed = struct(
  struct(
    struct(
      $"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.second".as("y")
    ).alias("point")
  ).alias("location")
).alias("record")
nested.select(foobarRenamed, $"id").printSchema
// root
// |-- record: struct (nullable = false)
// | |-- location: struct (nullable = false)
// | | |-- point: struct (nullable = false)
// | | | |-- x: double (nullable = true)
// | | | |-- y: double (nullable = true)
// |-- id: long (nullable = true)
Note that it may affect nullability metadata. Another possibility is to rename by casting:
nested.select($"foobar".cast(
"struct<location:struct<point:struct<x:double,y:double>>>"
).alias("record")).printSchema
// root
// |-- record: struct (nullable = true)
// | |-- location: struct (nullable = true)
// | | |-- point: struct (nullable = true)
// | | | |-- x: double (nullable = true)
// | | | |-- y: double (nullable = true)
or:
import org.apache.spark.sql.types._
nested.select($"foobar".cast(
StructType(Seq(
StructField("location", StructType(Seq(
StructField("point", StructType(Seq(
StructField("x", DoubleType), StructField("y", DoubleType)))))))))
).alias("record")).printSchema
// root
// |-- record: struct (nullable = true)
// | |-- location: struct (nullable = true)
// | | |-- point: struct (nullable = true)
// | | | |-- x: double (nullable = true)
// | | | |-- y: double (nullable = true)
For those of you interested in the PySpark version (the toDF approach works the same way in Scala):
merchants_df_renamed = merchants_df.toDF(
'merchant_id', 'category', 'subcategory', 'merchant')
merchants_df_renamed.printSchema()
Result:
root
|-- merchant_id: integer (nullable = true)
|-- category: string (nullable = true)
|-- subcategory: string (nullable = true)
|-- merchant: string (nullable = true)
def aliasAllColumns(t: DataFrame, p: String = "", s: String = ""): DataFrame = {
  t.select(t.columns.map { c => t.col(c).as(p + c + s) }: _*)
}
In case it isn't obvious, this adds a prefix and a suffix to each of the current column names. This can be useful when you have two tables with one or more columns having the same name, and you wish to join them but still be able to disambiguate the columns in the resultant table. It sure would be nice if there were a similar way to do this in "normal" SQL.
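A hypothetical usage, joining two DataFrames that share column names (dfA, dfB and the id column are made-up names for illustration):
// prefix every column on each side so nothing collides after the join
val left = aliasAllColumns(dfA, p = "a_")
val right = aliasAllColumns(dfB, p = "b_")
val joined = left.join(right, left("a_id") === right("b_id"))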
Suppose the dataframe df has 3 columns id1, name1, price1
and you wish to rename them to id2, name2, price2
val list = List("id2", "name2", "price2")
import spark.implicits._
val df2 = df.toDF(list:_*)
df2.columns.foreach(println)
I found this approach useful in many cases.
Sometimes the column names in a SQL Server or MySQL table are in a format like the one below:
Ex: Account Number, customer number
But Hive tables do not support column names containing spaces, so use the solution below to rename your old column names.
Solution:
val renamedColumns = df.columns.map(c => df(c).as(c.replaceAll(" ", "_").toLowerCase()))
df = df.select(renamedColumns: _*)
Two-table join without renaming the joined key:
// method 1: create a new DF
day1 = day1.toDF(day1.columns.map(x => if (x.equals(key)) x else s"${x}_d1"): _*)
// method 2: use withColumnRenamed
for ((x, y) <- day1.columns.filter(!_.equals(key)).map(x => (x, s"${x}_d1"))) {
day1 = day1.withColumnRenamed(x, y)
}
works!