Renaming column names of a DataFrame in Spark Scala

I am trying to convert all the headers / column names of a DataFrame in Spark-Scala. As of now I have come up with the following code, which only replaces a single column name.
for (i <- 0 to origCols.length - 1) {
  df.withColumnRenamed(
    df.columns(i),
    df.columns(i).toLowerCase
  )
}
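(Side note: DataFrames are immutable, so each withColumnRenamed result above is discarded. A minimal sketch of the same idea that actually carries the renames forward, using foldLeft:)
val lowerCased = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumnRenamed(c, c.toLowerCase) // keep the renamed frame for the next step
}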

If the structure is flat:
val df = Seq((1L, "a", "foo", 3.0)).toDF
df.printSchema
// root
// |-- _1: long (nullable = false)
// |-- _2: string (nullable = true)
// |-- _3: string (nullable = true)
// |-- _4: double (nullable = false)
the simplest thing you can do is to use the toDF method:
val newNames = Seq("id", "x1", "x2", "x3")
val dfRenamed = df.toDF(newNames: _*)
dfRenamed.printSchema
// root
// |-- id: long (nullable = false)
// |-- x1: string (nullable = true)
// |-- x2: string (nullable = true)
// |-- x3: double (nullable = false)
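Applied to the original question (lower-casing every header), the same toDF trick becomes a one-liner, sketched here:
// rename all columns at once by transforming the existing names
val dfLower = df.toDF(df.columns.map(_.toLowerCase): _*)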
If you want to rename individual columns you can use either select with alias:
df.select($"_1".alias("x1"))
which can be easily generalized to multiple columns:
val lookup = Map("_1" -> "foo", "_3" -> "bar")
df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)
or withColumnRenamed:
df.withColumnRenamed("_1", "x1")
which can be used with foldLeft to rename multiple columns:
lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))
With nested structures (structs) one possible option is renaming by selecting a whole structure:
val nested = spark.read.json(sc.parallelize(Seq(
"""{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
)))
nested.printSchema
// root
// |-- foobar: struct (nullable = true)
// | |-- foo: struct (nullable = true)
// | | |-- bar: struct (nullable = true)
// | | | |-- first: double (nullable = true)
// | | | |-- second: double (nullable = true)
// |-- id: long (nullable = true)
@transient val foobarRenamed = struct(
  struct(
    struct(
      $"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.second".as("y")
    ).alias("point")
  ).alias("location")
).alias("record")
nested.select(foobarRenamed, $"id").printSchema
// root
// |-- record: struct (nullable = false)
// | |-- location: struct (nullable = false)
// | | |-- point: struct (nullable = false)
// | | | |-- x: double (nullable = true)
// | | | |-- y: double (nullable = true)
// |-- id: long (nullable = true)
Note that it may affect nullability metadata. Another possibility is to rename by casting:
nested.select($"foobar".cast(
"struct<location:struct<point:struct<x:double,y:double>>>"
).alias("record")).printSchema
// root
// |-- record: struct (nullable = true)
// | |-- location: struct (nullable = true)
// | | |-- point: struct (nullable = true)
// | | | |-- x: double (nullable = true)
// | | | |-- y: double (nullable = true)
or:
import org.apache.spark.sql.types._
nested.select($"foobar".cast(
StructType(Seq(
StructField("location", StructType(Seq(
StructField("point", StructType(Seq(
StructField("x", DoubleType), StructField("y", DoubleType)))))))))
).alias("record")).printSchema
// root
// |-- record: struct (nullable = true)
// | |-- location: struct (nullable = true)
// | | |-- point: struct (nullable = true)
// | | | |-- x: double (nullable = true)
// | | | |-- y: double (nullable = true)

For those of you interested in the PySpark version (the toDF approach works the same way in Scala):
merchants_df_renamed = merchants_df.toDF(
    'merchant_id', 'category', 'subcategory', 'merchant')
merchants_df_renamed.printSchema()
Result:
root
|-- merchant_id: integer (nullable = true)
|-- category: string (nullable = true)
|-- subcategory: string (nullable = true)
|-- merchant: string (nullable = true)

def aliasAllColumns(t: DataFrame, p: String = "", s: String = ""): DataFrame = {
  t.select(t.columns.map { c => t.col(c).as(p + c + s) }: _*)
}
In case it isn't obvious, this adds a prefix and a suffix to each of the current column names. This can be useful when you have two tables with one or more columns having the same name, and you wish to join them but still be able to disambiguate the columns in the resultant table. It sure would be nice if there were a similar way to do this in "normal" SQL.
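For example, a hypothetical join where both tables carry an id column (dfA, dfB and the column names here are made up for illustration):
// prefix each side so same-named columns stay distinguishable after the join
val left  = aliasAllColumns(dfA, p = "a_")
val right = aliasAllColumns(dfB, p = "b_")
val joined = left.join(right, left("a_id") === right("b_id"))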

Suppose the dataframe df has 3 columns id1, name1, price1
and you wish to rename them to id2, name2, price2
val list = List("id2", "name2", "price2")
import spark.implicits._
val df2 = df.toDF(list: _*)
df2.columns.foreach(println)
I found this approach useful in many cases.

Sometimes column names in a SQL Server or MySQL table come in a format like
Ex: Account Number, customer number
But Hive tables do not support column names containing spaces, so please use the solution below to rename such columns.
Solution:
val renamedColumns = df.columns.map(c => df(c).as(c.replaceAll(" ", "_").toLowerCase()))
val dfRenamed = df.select(renamedColumns: _*) // reassigning df itself would require it to be a var

Two-table join, without renaming the join key:
// note: day1 must be declared as a var for the reassignments below
// method 1: create a new DF
day1 = day1.toDF(day1.columns.map(x => if (x.equals(key)) x else s"${x}_d1"): _*)
// method 2: use withColumnRenamed
for ((x, y) <- day1.columns.filter(!_.equals(key)).map(x => (x, s"${x}_d1"))) {
  day1 = day1.withColumnRenamed(x, y)
}
works!

Related

Convert a JSON string to a struct column without schema in Spark

Spark: 3.0.0
Scala: 2.12.8
My data frame has a column with a JSON string, and I want to create a new column from it with the StructType.
temp_json_string
{"name":"test","id":"12","category":[{"products":["A","B"],"displayName":"test_1","displayLabel":"test1"},{"products":["C"],"displayName":"test_2","displayLabel":"test2"}],"createdAt":"","createdBy":""}
root
|-- temp_json_string: string (nullable = true)
Formatted JSON:
{
  "name":"test",
  "id":"12",
  "category":[
    {
      "products":["A", "B"],
      "displayName":"test_1",
      "displayLabel":"test1"
    },
    {
      "products":["C"],
      "displayName":"test_2",
      "displayLabel":"test2"
    }
  ],
  "createdAt":"",
  "createdBy":""
}
I want to create a new column of type Struct so I tried:
dataFrame
.withColumn("temp_json_struct", struct(col("temp_json_string")))
.select("temp_json_struct")
Now, I get the schema as:
root
|-- temp_json_struct: struct (nullable = false)
| |-- temp_json_string: string (nullable = true)
Desired result:
root
|-- temp_json_struct: struct (nullable = false)
| |-- name: string (nullable = true)
| |-- category: array (nullable = true)
| | |-- products: array (nullable = true)
| | |-- displayName: string (nullable = true)
| | |-- displayLabel: string (nullable = true)
| |-- createdAt: timestamp (nullable = true)
| |-- updatedAt: timestamp (nullable = true)
json_str_col is the column that has the JSON string. I had multiple files, so that's why the first line iterates through each row to extract the schema. If you know your schema up front, then just replace json_schema with that.
json_schema = spark.read.json(df.rdd.map(lambda row: row.json_str_col)).schema
df = df.withColumn('new_col', from_json(col('json_str_col'), json_schema))
// import spark implicits for the conversion to Dataset (.as[String])
import spark.implicits._

val df = ??? // create your dataframe having the 'temp_json_string' column
// convert Dataset[Row] aka DataFrame to Dataset[String]
val ds = df.select("temp_json_string").as[String]
// read as json
spark.read.json(ds)
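To keep the original columns and attach the parsed struct instead, one sketch (reusing the schema inferred above) goes through from_json:
import org.apache.spark.sql.functions.from_json
// infer the schema once, then parse the string column in place
val schema = spark.read.json(ds).schema
val withStruct = df.withColumn("temp_json_struct", from_json($"temp_json_string", schema))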
Documentation
There are at least two different ways to retrieve/discover the schema for a given JSON.
For the illustration, let's create some data first:
import org.apache.spark.sql.types.StructType
val jsData = Seq(
  ("""{
    "name":"test","id":"12","category":[
      {
        "products":["A", "B"],
        "displayName":"test_1",
        "displayLabel":"test1"
      },
      {
        "products":["C"],
        "displayName":"test_2",
        "displayLabel":"test2"
      }
    ],
    "createdAt":"",
    "createdBy":""}""")
)
Option 1: schema_of_json
The first option is to use the built-in function schema_of_json. The function will return the schema for the given JSON in DDL format:
val json = jsData.toDF("js").collect()(0).getString(0)
val ddlSchema: String = spark.sql(s"select schema_of_json('${json}')")
  .collect()(0)              // get 1st row
  .getString(0)              // get 1st col of the row as string
  .replace("null", "string") // replace null types; these occur since "createdAt" is ""
// struct<category:array<struct<displayLabel:string,displayName:string,products:array<string>>>,createdAt:null,createdBy:null,id:string,name:string>
val schema: StructType = StructType.fromDDL(s"js_schema $ddlSchema")
Note that you might expect schema_of_json to work on the column level as well, i.e. schema_of_json(js_col); unfortunately, this doesn't work as expected, so we are forced to pass a string instead.
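For completeness, the same lookup can be done through the DataFrame API by wrapping the JSON in lit (a sketch; schema_of_json accepts a foldable string literal, which is exactly why a per-row column fails):
import org.apache.spark.sql.functions.{lit, schema_of_json}
// lit(json) is a foldable literal, so schema_of_json can evaluate it at analysis time
val ddlFromApi = spark.range(1)
  .select(schema_of_json(lit(json)))
  .collect()(0).getString(0)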
Option 2: use Spark JSON reader (recommended)
import org.apache.spark.sql.functions.from_json
val schema: StructType = spark.read.json(jsData.toDS).schema
// schema.printTreeString
// root
// |-- category: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- displayLabel: string (nullable = true)
// | | |-- displayName: string (nullable = true)
// | | |-- products: array (nullable = true)
// | | | |-- element: string (containsNull = true)
// |-- createdAt: string (nullable = true)
// |-- createdBy: string (nullable = true)
// |-- id: string (nullable = true)
// |-- name: string (nullable = true)
As you can see, here we are producing a schema based on StructType and not a DDL string as in the previous case.
After discovering the schema we can move on to the next step which is converting the JSON data into a struct. To achieve that we will use from_json built-in function:
jsData.toDF("js")
.withColumn("temp_json_struct", from_json($"js", schema))
.printSchema()
// root
// |-- js: string (nullable = true)
// |-- temp_json_struct: struct (nullable = true)
// | |-- category: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- displayLabel: string (nullable = true)
// | | | |-- displayName: string (nullable = true)
// | | | |-- products: array (nullable = true)
// | | | | |-- element: string (containsNull = true)
// | |-- createdAt: string (nullable = true)
// | |-- createdBy: string (nullable = true)
// | |-- id: string (nullable = true)
// | |-- name: string (nullable = true)

Rename nested struct columns in a Spark DataFrame [duplicate]

(Duplicate of: Rename nested field in spark dataframe)
I am trying to change the names of a DataFrame's columns in Scala. I am easily able to change the column names for direct fields, but I'm facing difficulty converting array struct columns.
Below is my DataFrame schema.
|-- _VkjLmnVop: string (nullable = true)
|-- _KaTasLop: string (nullable = true)
|-- AbcDef: struct (nullable = true)
| |-- UvwXyz: struct (nullable = true)
| | |-- _MnoPqrstUv: string (nullable = true)
| | |-- _ManDevyIxyz: string (nullable = true)
But I need the schema to look like below:
|-- vkj_lmn_vop: string (nullable = true)
|-- ka_tas_lop: string (nullable = true)
|-- abc_def: struct (nullable = true)
| |-- uvw_xyz: struct (nullable = true)
| | |-- mno_pqrst_uv: string (nullable = true)
| | |-- man_devy_ixyz: string (nullable = true)
For non-struct columns, I'm changing the column names with the below:
def aliasAllColumns(df: DataFrame): DataFrame = {
  df.select(df.columns.map { c =>
    df.col(c)
      .as(
        c.replaceAll("_", "")
          .replaceAll("([A-Z])", "_$1")
          .toLowerCase
          .replaceFirst("_", ""))
  }: _*)
}
aliasAllColumns(file_data_df).show(1)
How can I change the struct column names dynamically?
You can create a recursive method to traverse the DataFrame schema for renaming the columns:
import org.apache.spark.sql.types._
def renameAllCols(schema: StructType, rename: String => String): StructType = {
  def recurRename(schema: StructType): Seq[StructField] = schema.fields.map {
    case StructField(name, dtype: StructType, nullable, meta) =>
      StructField(rename(name), StructType(recurRename(dtype)), nullable, meta)
    case StructField(name, dtype: ArrayType, nullable, meta) if dtype.elementType.isInstanceOf[StructType] =>
      StructField(rename(name), ArrayType(StructType(recurRename(dtype.elementType.asInstanceOf[StructType])), true), nullable, meta)
    case StructField(name, dtype, nullable, meta) =>
      StructField(rename(name), dtype, nullable, meta)
  }
  StructType(recurRename(schema))
}
Testing it with the following example:
import org.apache.spark.sql.functions._
import spark.implicits._
val renameFcn = (s: String) =>
  s.replace("_", "").replaceAll("([A-Z])", "_$1").toLowerCase.dropWhile(_ == '_')

case class C(A_Bc: Int, D_Ef: Int)

val df = Seq(
  (10, "a", C(1, 2), Seq(C(11, 12), C(13, 14)), Seq(101, 102)),
  (20, "b", C(3, 4), Seq(C(15, 16)), Seq(103))
).toDF("_VkjLmnVop", "_KaTasLop", "AbcDef", "ArrStruct", "ArrInt")
val newDF = spark.createDataFrame(df.rdd, renameAllCols(df.schema, renameFcn))
newDF.printSchema
// root
// |-- vkj_lmn_vop: integer (nullable = false)
// |-- ka_tas_lop: string (nullable = true)
// |-- abc_def: struct (nullable = true)
// | |-- a_bc: integer (nullable = false)
// | |-- d_ef: integer (nullable = false)
// |-- arr_struct: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- a_bc: integer (nullable = false)
// | | |-- d_ef: integer (nullable = false)
// |-- arr_int: array (nullable = true)
// | |-- element: integer (containsNull = false)
As far as I know, it's not possible to rename nested fields directly.
On one hand, you could try moving to a flat object.
However, if you need to keep the structure, you can play with spark.sql.functions.struct(*cols).
Creates a new struct column.
Parameters: cols – list of column names (string) or list of Column expressions
You will need to decompose the whole schema, generate the aliases that you need, and then compose it again using the struct function, as sketched below.
It's not the best solution. But it's something :)
PS: I'm attaching the PySpark doc since it contains a better explanation than the Scala one.
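A minimal Scala sketch of that decompose-and-recompose approach, assuming the DataFrame from the question (file_data_df) and the field paths shown in its printSchema:
import org.apache.spark.sql.functions.{col, struct}
// rebuild the nested struct with renamed fields, aliasing every level
val renamed = file_data_df.select(
  col("_VkjLmnVop").as("vkj_lmn_vop"),
  col("_KaTasLop").as("ka_tas_lop"),
  struct(
    struct(
      col("AbcDef.UvwXyz._MnoPqrstUv").as("mno_pqrst_uv"),
      col("AbcDef.UvwXyz._ManDevyIxyz").as("man_devy_ixyz")
    ).as("uvw_xyz")
  ).as("abc_def")
)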

Spark SQL nested withColumn

I have a DataFrame that has multiple columns of which some of them are structs. Something like this
root
|-- foo: struct (nullable = true)
| |-- bar: string (nullable = true)
| |-- baz: string (nullable = true)
|-- abc: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- def: struct (nullable = true)
| | | |-- a: string (nullable = true)
| | | |-- b: integer (nullable = true)
| | | |-- c: string (nullable = true)
I want to apply a UserDefinedFunction on the column baz to replace baz with a function of baz, but I cannot figure out how to do that. Here is an example of the desired output (note that baz is now an int)
root
|-- foo: struct (nullable = true)
| |-- bar: string (nullable = true)
| |-- baz: int (nullable = true)
|-- abc: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- def: struct (nullable = true)
| | | |-- a: string (nullable = true)
| | | |-- b: integer (nullable = true)
| | | |-- c: string (nullable = true)
It looks like DataFrame.withColumn only works on top level columns but not on nested columns. I'm using Scala for this problem.
Can someone help me out with this?
Thanks
That's easy, just use a dot to select nested structures, e.g. $"foo.baz":
case class Foo(bar: String, baz: String)
case class Record(foo: Foo)

val df = Seq(
  Record(Foo("Hi", "There"))
).toDF()
df.printSchema
root
|-- foo: struct (nullable = true)
| |-- bar: string (nullable = true)
| |-- baz: string (nullable = true)
val myUDF = udf((s: String) => {
  // do something with s
  s.toUpperCase
})

df
  .withColumn("udfResult", myUDF($"foo.baz"))
  .show
+----------+---------+
| foo|udfResult|
+----------+---------+
|[Hi,There]| THERE|
+----------+---------+
If you want to add the result of your UDF to the existing struct foo, i.e. to get:
root
|-- foo: struct (nullable = false)
| |-- bar: string (nullable = true)
| |-- baz: string (nullable = true)
| |-- udfResult: string (nullable = true)
there are two options:
with withColumn:
df
  .withColumn("udfResult", myUDF($"foo.baz"))
  .withColumn("foo", struct($"foo.*", $"udfResult"))
  .drop($"udfResult")
with select:
df.select(struct($"foo.*", myUDF($"foo.baz").as("udfResult")).as("foo"))
EDIT:
Replacing the existing attribute in the struct with the result from the UDF:
Unfortunately, this does not work:
df
  .withColumn("foo.baz", myUDF($"foo.baz"))
but can be done like this:
// get all columns except foo.baz
val structCols = df.select($"foo.*")
  .columns
  .filter(_ != "baz")
  .map(name => col("foo." + name))

df.withColumn(
  "foo",
  struct((structCols :+ myUDF($"foo.baz").as("baz")): _*)
)
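Side note: since Spark 3.1, Column ships a built-in withField method that does this struct surgery in one call (a sketch; no extra library needed on those versions):
// Spark 3.1+ only: replace foo.baz in place
df.withColumn("foo", $"foo".withField("baz", myUDF($"foo.baz")))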
You can do this using the struct function as Raphael Roth has already demonstrated in their answer above. There is an easier way to do this, though, using the Make Structs Easy* library. The library adds a withField method to the Column class, allowing you to add/replace Columns inside a StructType column, in much the same way as the withColumn method on the DataFrame class allows you to add/replace columns inside a DataFrame. For your specific use-case, you could do something like this:
import org.apache.spark.sql.functions._
import com.github.fqaiser94.mse.methods._
// generate some fake data
case class Foo(bar: String, baz: String)
case class Record(foo: Foo, arrayOfFoo: Seq[Foo])
val df = Seq(
  Record(Foo("Hello", "World"), Seq(Foo("Blue", "Red"), Foo("Green", "Yellow")))
).toDF
df.printSchema
// root
// |-- foo: struct (nullable = true)
// | |-- bar: string (nullable = true)
// | |-- baz: string (nullable = true)
// |-- arrayOfFoo: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- bar: string (nullable = true)
// | | |-- baz: string (nullable = true)
df.show(false)
// +--------------+------------------------------+
// |foo |arrayOfFoo |
// +--------------+------------------------------+
// |[Hello, World]|[[Blue, Red], [Green, Yellow]]|
// +--------------+------------------------------+
// example user defined function that capitalizes a given string
val myUdf = udf((s: String) => s.toUpperCase)
// capitalize value of foo.baz
df.withColumn("foo", $"foo".withField("baz", myUdf($"foo.baz"))).show(false)
// +--------------+------------------------------+
// |foo |arrayOfFoo |
// +--------------+------------------------------+
// |[Hello, WORLD]|[[Blue, Red], [Green, Yellow]]|
// +--------------+------------------------------+
I noticed you had a follow-up question about replacing a Column nested inside a struct nested inside of an array.
This can also be done by combining the functions provided by the Make Structs Easy library with the functions provided by spark-hofs library, as follows:
import za.co.absa.spark.hofs._
// capitalize the value of foo.baz in each element of arrayOfFoo
df.withColumn("arrayOfFoo", transform($"arrayOfFoo", foo => foo.withField("baz", myUdf(foo.getField("baz"))))).show(false)
// +--------------+------------------------------+
// |foo |arrayOfFoo |
// +--------------+------------------------------+
// |[Hello, World]|[[Blue, RED], [Green, YELLOW]]|
// +--------------+------------------------------+
*Full disclosure: I am the author of the Make Structs Easy library that is referenced in this answer.
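Side note here as well: on Spark 3.0+ the higher-order transform function is available directly in org.apache.spark.sql.functions, and on 3.1+ withField is built into Column, so a sketch of the array case without either external library looks like:
import org.apache.spark.sql.functions.transform
// Spark 3.1+ only: built-in transform plus Column.withField
df.withColumn(
  "arrayOfFoo",
  transform($"arrayOfFoo", foo => foo.withField("baz", myUdf(foo.getField("baz"))))
)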

Nested JSON in Spark

I have the following JSON loaded as a DataFrame:
root
|-- data: struct (nullable = true)
| |-- field1: string (nullable = true)
| |-- field2: string (nullable = true)
|-- moreData: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- more1: string (nullable = true)
| | |-- more2: string (nullable = true)
| | |-- more3: string (nullable = true)
I want to get the following RDD from this DataFrame:
RDD[(more1, more2, more3, field1, field2)]
How can I do this? I think I have to use flatMap for the nested JSON?
A combination of explode and dot syntax should do the trick:
import org.apache.spark.sql.functions.explode
case class Data(field1: String, field2: String)
case class MoreData(more1: String, more2: String, more3: String)
val df = sc.parallelize(Seq(
  (Data("foo", "bar"), Array(MoreData("a", "b", "c"), MoreData("d", "e", "f")))
)).toDF("data", "moreData")
df.printSchema
// root
// |-- data: struct (nullable = true)
// | |-- field1: string (nullable = true)
// | |-- field2: string (nullable = true)
// |-- moreData: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- more1: string (nullable = true)
// | | |-- more2: string (nullable = true)
// | | |-- more3: string (nullable = true)
val columns = Seq(
  $"moreData.more1", $"moreData.more2", $"moreData.more3",
  $"data.field1", $"data.field2")

val aRDD = df.withColumn("moreData", explode($"moreData"))
  .select(columns: _*)
  .rdd
aRDD.collect
// Array[org.apache.spark.sql.Row] = Array([a,b,c,foo,bar], [d,e,f,foo,bar])
Depending on your requirements you can follow this with map to extract values from the rows:
import org.apache.spark.sql.Row
aRDD.map { case Row(m1: String, m2: String, m3: String, f1: String, f2: String) =>
  (m1, m2, m3, f1, f2)
}
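An alternative sketch that skips the Row pattern match by going through a typed Dataset first (assumes spark.implicits._ is in scope; tuple encoders bind columns by position):
// encode the five string columns straight into a tuple Dataset, then take its RDD
val typedRDD = df.withColumn("moreData", explode($"moreData"))
  .select(columns: _*)
  .as[(String, String, String, String, String)]
  .rdd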
See also Querying Spark SQL DataFrame with complex types

How can I create a Spark DataFrame from a nested array of struct element?

I have read a JSON file into Spark. This file has the following structure:
scala> tweetBlob.printSchema
root
|-- related: struct (nullable = true)
| |-- next: struct (nullable = true)
| | |-- href: string (nullable = true)
|-- search: struct (nullable = true)
| |-- current: long (nullable = true)
| |-- results: long (nullable = true)
|-- tweets: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cde: struct (nullable = true)
...
...
| | |-- cdeInternal: struct (nullable = true)
...
...
| | |-- message: struct (nullable = true)
...
...
What I would ideally want is a DataFrame with columns "cde", "cdeInternal", "message"... as shown below
root
|-- cde: struct (nullable = true)
...
...
|-- cdeInternal: struct (nullable = true)
...
...
|-- message: struct (nullable = true)
...
...
I have managed to use "explode" to extract elements from the "tweets" array into a column called "tweets"
scala> val tweets = tweetBlob.select(explode($"tweets").as("tweets"))
tweets: org.apache.spark.sql.DataFrame = [tweets: struct<cde:struct<author:struct<gender:string,location:struct<city:string,country:string,state:string>,maritalStatus:struct<evidence:string,isMarried:string>,parenthood:struct<evidence:string,isParent:string>>,content:struct<sentiment:struct<evidence:array<struct<polarity:string,sentimentTerm:string>>,polarity:string>>>,cdeInternal:struct<compliance:struct<isActive:boolean,userProtected:boolean>,tracks:array<struct<id:string>>>,message:struct<actor:struct<displayName:string,favoritesCount:bigint,followersCount:bigint,friendsCount:bigint,id:string,image:string,languages:array<string>,link:string,links:array<struct<href:string,rel:string>>,listedCount:bigint,location:struct<displayName:string,objectType:string>,objectType:string,postedTime...
scala> tweets.printSchema
root
|-- tweets: struct (nullable = true)
| |-- cde: struct (nullable = true)
...
...
| |-- cdeInternal: struct (nullable = true)
...
...
| |-- message: struct (nullable = true)
...
...
How can I select all columns inside the struct and create a DataFrame out of it? Explode does not work on a struct if my understanding is correct.
Any help is appreciated.
One possible way to handle this is to extract the required information from the schema. Let's start with some dummy data:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col // needed by the helper below
import org.apache.spark.sql.types._
case class Bar(x: Int, y: String)
case class Foo(bar: Bar)
val df = sc.parallelize(Seq(Foo(Bar(1, "first")), Foo(Bar(2, "second")))).toDF
df.printSchema
// root
// |-- bar: struct (nullable = true)
// | |-- x: integer (nullable = false)
// | |-- y: string (nullable = true)
and a helper function:
def children(colname: String, df: DataFrame) = {
  val parent = df.schema.fields.filter(_.name == colname).head
  val fields = parent.dataType match {
    case x: StructType => x.fields
    case _ => Array.empty[StructField]
  }
  fields.map(x => col(s"$colname.${x.name}"))
}
Finally the results:
df.select(children("bar", df): _*).printSchema
// root
// |-- x: integer (nullable = true)
// |-- y: string (nullable = true)
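Applied to the tweets DataFrame built with explode in the question, the call would look like this (a sketch, assuming that exploded frame is in scope):
// flatten the exploded "tweets" struct column into top-level columns
val flatTweets = tweets.select(children("tweets", tweets): _*)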
You can use:
df
  .select(explode(col("path_to_collection")).as("collection"))
  .select(col("collection.*"))
Example:
scala> val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
scala> val inline = sqlContext.read.json(sc.parallelize(json :: Nil)).select(explode(col("schools")).as("collection")).select(col("collection.*"))
scala> inline.printSchema
root
|-- sname: string (nullable = true)
|-- year: long (nullable = true)
scala> inline.show
+--------+----+
| sname|year|
+--------+----+
|stanford|2010|
|berkeley|2012|
+--------+----+
Or, you can also use SQL function inline:
scala> val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
scala> sqlContext.read.json(sc.parallelize(json :: Nil)).registerTempTable("tmp")
scala> val inline = sqlContext.sql("SELECT inline(schools) FROM tmp")
scala> inline.printSchema
root
|-- sname: string (nullable = true)
|-- year: long (nullable = true)
scala> inline.show
+--------+----+
| sname|year|
+--------+----+
|stanford|2010|
|berkeley|2012|
+--------+----+
scala> import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrame
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> case class Bar(x: Int, y: String)
defined class Bar
scala> case class Foo(bar: Bar)
defined class Foo
scala> val df = sc.parallelize(Seq(Foo(Bar(1, "first")), Foo(Bar(2, "second")))).toDF
df: org.apache.spark.sql.DataFrame = [bar: struct<x: int, y: string>]
scala> df.printSchema
root
|-- bar: struct (nullable = true)
| |-- x: integer (nullable = false)
| |-- y: string (nullable = true)
scala> df.select("bar.*").printSchema
root
|-- x: integer (nullable = true)
|-- y: string (nullable = true)