Programmatically create aliases for groupBy-Max function in Scala Spark - scala

var exprs = dfx.columns.map(max(_))
var df2 = df1.groupBy("x","y","z").agg(exprs.head, exprs.tail: _*)
df2.printSchema()
This produces a DataFrame with the following schema:
root
|-- x: string (nullable = true)
|-- y: string (nullable = true)
|-- z: double (nullable = true)
|-- max(a): double (nullable = true)
|-- max(b): double (nullable = true)
|-- max(c): double (nullable = true)
How does one programmatically remove the max() and rename the columns so they appear as "a" instead of max(a)?

Replace
var exprs = dfx.columns.map(max(_))
with (and yeah, don't use var when val works fine):
val exprs = dfx.columns.map(c => max(c).alias(c))
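A minimal end-to-end sketch of the fix (assuming, as in the question, that dfx contains only the columns to aggregate and df1 also has the grouping columns x, y, z):
import org.apache.spark.sql.functions.max
// Alias each aggregate with its source column name so the output columns
// come out as "a", "b", "c" instead of "max(a)", "max(b)", "max(c)".
val exprs = dfx.columns.map(c => max(c).alias(c))
val df2 = df1.groupBy("x", "y", "z").agg(exprs.head, exprs.tail: _*)
df2.printSchema()
// |-- a: double (nullable = true)    instead of  |-- max(a): double (nullable = true)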

Related

Spark - Scala: Remove special characters from the beginning and end of columns in a dataframe

I have a dataframe like this:
scala> df.printSchema
root
|-- Protocol ID: decimal(12,0) (nullable = true)
|-- Protocol #: string (nullable = true)
|-- Eudract #: string (nullable = true)
|-- STDY_MIGRATED_INDC: string (nullable = true)
|-- # Non-US Count: decimal(7,0) (nullable = true)
|-- # US Count: decimal(7,0) (nullable = true)
Here the column names have spaces and special characters in them. I wanted to replace those with an underscore, like this:
scala> newdf.printSchema
root
|-- Protocol_ID: decimal(12,0) (nullable = true)
|-- Protocol: string (nullable = true)
|-- Eudract: string (nullable = true)
|-- STDY_MIGRATED_INDC: string (nullable = true)
|-- Non-US_Count: decimal(7,0) (nullable = true)
|-- US_Count: decimal(7,0) (nullable = true)
So I used the steps below:
val df = spark.read.format("parquet").load("<s3 path>")
val regex_string = """[+._(),!#$%&"*./:;<-> ]+"""
val replacingColumns = df.columns.map(regex_string.r.replaceAllIn(_, "_"))
val resultDF = replacingColumns.zip(df.columns).foldLeft(df) {
  (tempdf, name) => tempdf.withColumnRenamed(name._2, name._1)
}
resultDF.printSchema
resultDF.printSchema
But I am getting the df like this:
scala> resultDF.printSchema
root
|-- Protocol_ID: decimal(12,0) (nullable = true)
|-- Protocol_: string (nullable = true)
|-- Eudract_: string (nullable = true)
|-- STDY_MIGRATED_INDC: string (nullable = true)
|-- _Non-US_Count: decimal(7,0) (nullable = true)
|-- _US_Count: decimal(7,0) (nullable = true)
If the space or special character is at the beginning or the end, then I don't want the underscore.
In Python I can use:
starts_with = [i.replace("_","",1) if i.startswith("_") else i for i in df.columns]
[(i[::-1].replace("_","",1)[::-1]) if i.endswith("_") else i for i in starts_with]
As I am new to Scala, I am not sure how to fix this. Any help would be appreciated.
You can use the (^_|_$) regex to replace a leading or trailing _ with an empty string.
val regex_string = """[+._(),!#$%&"*./:;<-> ]+"""
val col = regex_string.r.replaceAllIn("#Non-US Count##", "_")
println(col)
println("(^_|_$)".r.replaceAllIn(col, ""))
// _Non-US_Count_
// Non-US_Count
val replacingColumns = df.columns.map(s=>"(^_|_$)".r.replaceAllIn(regex_string.r.replaceAllIn(s, "_"),""))
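Plugging this back into the foldLeft rename from the question (a sketch; df and regex_string are as defined above):
// Pair each cleaned name with its original name and rename one column per step.
val resultDF = replacingColumns.zip(df.columns).foldLeft(df) {
  case (tempdf, (newName, oldName)) => tempdf.withColumnRenamed(oldName, newName)
}
resultDF.printSchema
// |-- Non-US_Count: decimal(7,0) (nullable = true)   -- no leading underscore now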

How to make matching schemas for two data frames in a join without hard-coding every column

I have two data frames on which I perform a join, and sometimes I get the error below:
org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`IsAnnualReported_1` IS NOT NULL) THEN `IsAnnualReported_1` ELSE CAST(`IsAnnualReported` AS BOOLEAN) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;;
Now, to overcome this, I have to manually cast to matching data types, like below, for all mismatching columns:
when($"IsAnnualReported_1".isNotNull, $"IsAnnualReported_1").otherwise($"IsAnnualReported".cast(DataTypes.BooleanType)).as("IsAnnualReported"),
This is how I perform the join on the two data frames:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.functions.regexp_extract
val get_cus_val = spark.udf.register("get_cus_val", (filePath: String) => filePath.split("\\.")(3))
val get_cus_YearPartition = spark.udf.register("get_cus_YearPartition", (filePath: String) => filePath.split("\\.")(4))
val df = sqlContext.read.format("csv").option("header", "true").option("delimiter", "|").option("inferSchema","true").load("s3://trfsmallfffile/FinancialPeriod/MAIN")
val df1With_ = df.toDF(df.columns.map(_.replace(".", "_")): _*)
val column_to_keep = df1With_.columns.filter(v => (!v.contains("^") && !v.contains("!") && !v.contains("_c"))).toSeq
val df1result = df1With_.select(column_to_keep.head, column_to_keep.tail: _*)
val df1resultFinal=df1result.withColumn("DataPartition", get_cus_val(input_file_name))
val df1resultFinalWithYear=df1resultFinal.withColumn("PartitionYear", get_cus_YearPartition(input_file_name))
val df2 = sqlContext.read.format("csv").option("header", "true").option("delimiter", "|").option("inferSchema","true").load("s3://trfsmallfffile/FinancialPeriod/INCR")
val df2With_ = df2.toDF(df2.columns.map(_.replace(".", "_")): _*)
val df2column_to_keep = df2With_.columns.filter(v => (!v.contains("^") && !v.contains("!") && !v.contains("_c"))).toSeq
val df2result = df2With_.select(df2column_to_keep.head, df2column_to_keep.tail: _*)
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("FinancialPeriod_organizationId", "FinancialPeriod_periodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2result.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")
df1resultFinalWithYear.printSchema()
latestForEachKey.printSchema()
val dfMainOutput = df1resultFinalWithYear.join(latestForEachKey, Seq("FinancialPeriod_organizationId", "FinancialPeriod_periodId"), "outer")
.select($"FinancialPeriod_organizationId", $"FinancialPeriod_periodId",
when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition".cast(DataTypes.StringType)).as("DataPartition"),
when($"PartitionYear_1".isNotNull, $"PartitionYear_1").otherwise($"PartitionYear".cast(DataTypes.StringType)).as("PartitionYear"),
when($"FinancialPeriod_periodEndDate_1".isNotNull, $"FinancialPeriod_periodEndDate_1").otherwise($"FinancialPeriod_periodEndDate").as("FinancialPeriod_periodEndDate"),
when($"FinancialPeriod_periodStartDate_1".isNotNull, $"FinancialPeriod_periodStartDate_1").otherwise($"FinancialPeriod_periodStartDate").as("FinancialPeriod_periodStartDate"),
when($"FinancialPeriod_periodDuration_1".isNotNull, $"FinancialPeriod_periodDuration_1").otherwise($"FinancialPeriod_periodDuration").as("FinancialPeriod_periodDuration"),
when($"FinancialPeriod_nonStandardPeriod_1".isNotNull, $"FinancialPeriod_nonStandardPeriod_1").otherwise($"FinancialPeriod_nonStandardPeriod").as("FinancialPeriod_nonStandardPeriod"),
when($"FinancialPeriod_periodType_1".isNotNull, $"FinancialPeriod_periodType_1").otherwise($"FinancialPeriod_periodType").as("FinancialPeriod_periodType"),
when($"PeriodFiscalYear_1".isNotNull, $"PeriodFiscalYear_1").otherwise($"PeriodFiscalYear").as("PeriodFiscalYear"),
when($"PeriodFiscalEndMonth_1".isNotNull, $"PeriodFiscalEndMonth_1").otherwise($"PeriodFiscalEndMonth").as("PeriodFiscalEndMonth"),
when($"IsAnnualReported_1".isNotNull, $"IsAnnualReported_1").otherwise($"IsAnnualReported".cast(DataTypes.BooleanType)).as("IsAnnualReported"),
when($"IsTransitional_1".isNotNull, $"IsTransitional_1").otherwise($"IsTransitional".cast(DataTypes.StringType)).as("IsTransitional"),
when($"CumulativeType_1".isNotNull, $"CumulativeType_1").otherwise($"CumulativeType").as("CumulativeType"),
when($"CalendarizedPeriodEndDate_1".isNotNull, $"CalendarizedPeriodEndDate_1").otherwise($"CalendarizedPeriodEndDate").as("CalendarizedPeriodEndDate"),
when($"EarliestAnnouncementDateTime_1".isNotNull, $"EarliestAnnouncementDateTime_1").otherwise($"EarliestAnnouncementDateTime").as("EarliestAnnouncementDateTime"),
when($"EADUTCOffset_1".isNotNull, $"EADUTCOffset_1").otherwise($"EADUTCOffset").as("EADUTCOffset"),
when($"PeriodPermId_1".isNotNull, $"PeriodPermId_1").otherwise($"PeriodPermId").as("PeriodPermId"),
when($"PeriodPermId_objectTypeId_1".isNotNull, $"PeriodPermId_objectTypeId_1").otherwise($"PeriodPermId_objectTypeId").as("PeriodPermId_objectTypeId"),
when($"PeriodPermId_objectType_1".isNotNull, $"PeriodPermId_objectType_1").otherwise($"PeriodPermId_objectType").as("PeriodPermId_objectType"),
when($"CumulativeTypeId_1".isNotNull, $"CumulativeTypeId_1").otherwise($"CumulativeTypeId").as("CumulativeTypeId"),
when($"PeriodTypeId_1".isNotNull, $"PeriodTypeId_1").otherwise($"PeriodTypeId").as("PeriodTypeId"),
when($"PeriodFiscalEndMonthId_1".isNotNull, $"PeriodFiscalEndMonthId_1").otherwise($"PeriodFiscalEndMonthId").as("PeriodFiscalEndMonthId"),
when($"PeriodLengthUnitId_1".isNotNull, $"PeriodLengthUnitId_1").otherwise($"PeriodLengthUnitId").as("PeriodLengthUnitId"),
when($"FFAction_1".isNotNull, concat(col("FFAction_1"), lit("|!|"))).otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"))
.filter(!$"FFAction".contains("D"))
Now what I need is: how can I create the second data frame with the schema of the first data frame, so that I never get an error like a data type mismatch?
Here are the schemas of the first and second data frames:
root
|-- FinancialPeriod_organizationId: long (nullable = true)
|-- FinancialPeriod_periodId: integer (nullable = true)
|-- FinancialPeriod_periodEndDate: timestamp (nullable = true)
|-- FinancialPeriod_periodStartDate: timestamp (nullable = true)
|-- FinancialPeriod_periodDuration: string (nullable = true)
|-- FinancialPeriod_nonStandardPeriod: string (nullable = true)
|-- FinancialPeriod_periodType: string (nullable = true)
|-- PeriodFiscalYear: integer (nullable = true)
|-- PeriodFiscalEndMonth: integer (nullable = true)
|-- IsAnnualReported: boolean (nullable = true)
|-- IsTransitional: boolean (nullable = true)
|-- CumulativeType: string (nullable = true)
|-- CalendarizedPeriodEndDate: string (nullable = true)
|-- EarliestAnnouncementDateTime: timestamp (nullable = true)
|-- EADUTCOffset: string (nullable = true)
|-- PeriodPermId: string (nullable = true)
|-- PeriodPermId_objectTypeId: string (nullable = true)
|-- PeriodPermId_objectType: string (nullable = true)
|-- CumulativeTypeId: integer (nullable = true)
|-- PeriodTypeId: integer (nullable = true)
|-- PeriodFiscalEndMonthId: integer (nullable = true)
|-- PeriodLengthUnitId: integer (nullable = true)
|-- FFAction: string (nullable = true)
|-- DataPartition: string (nullable = true)
|-- PartitionYear: string (nullable = true)
root
|-- DataPartition_1: string (nullable = true)
|-- PartitionYear_1: integer (nullable = true)
|-- FinancialPeriod_organizationId: long (nullable = true)
|-- FinancialPeriod_periodId: integer (nullable = true)
|-- FinancialPeriod_periodEndDate_1: timestamp (nullable = true)
|-- FinancialPeriod_periodStartDate_1: timestamp (nullable = true)
|-- FinancialPeriod_periodDuration_1: string (nullable = true)
|-- FinancialPeriod_nonStandardPeriod_1: string (nullable = true)
|-- FinancialPeriod_periodType_1: string (nullable = true)
|-- PeriodFiscalYear_1: string (nullable = true)
|-- PeriodFiscalEndMonth_1: string (nullable = true)
|-- IsAnnualReported_1: string (nullable = true)
|-- IsTransitional_1: string (nullable = true)
|-- CumulativeType_1: string (nullable = true)
|-- CalendarizedPeriodEndDate_1: string (nullable = true)
|-- EarliestAnnouncementDateTime_1: string (nullable = true)
|-- EADUTCOffset_1: string (nullable = true)
|-- PeriodPermId_1: string (nullable = true)
|-- PeriodPermId_objectTypeId_1: string (nullable = true)
|-- PeriodPermId_objectType_1: string (nullable = true)
|-- CumulativeTypeId_1: string (nullable = true)
|-- PeriodTypeId_1: string (nullable = true)
|-- PeriodFiscalEndMonthId_1: string (nullable = true)
|-- PeriodLengthUnitId_1: string (nullable = true)
|-- FFAction_1: string (nullable = true)
You already have a good solution.
Here I am going to show you how you can avoid manually writing out each column for typecasting.
Let's say you have two dataframes (as you already have them) as
df1
root
|-- col1: integer (nullable = false)
|-- col2: string (nullable = true)
df2
root
|-- cl2: integer (nullable = false)
|-- cl1: integer (nullable = false)
Suppose you want to change the data types of df2 to those of df1. As you said, you know the mapping of each column between the two dataframes, so you have to create a Map of the relationship between the columns:
val columnMaps = Map("col1" -> "cl1", "col2"->"cl2")
When you have the map as above, you can work out the data type each column of df2 should be cast to, as below:
val schema1 = df1.schema
val toBeChangedDataTypes = df1.schema.map(x => if (columnMaps.keySet.contains(x.name)) (columnMaps(x.name), x.dataType) else (x.name, x.dataType)).toList
Then you can change the data types of the columns of df2 to match df1 by calling a recursive function:
val finalDF = castingFunction(toBeChangedDataTypes, df2)
where castingFunction is a recursive function defined as
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DataType

def castingFunction(typeList: List[Tuple2[String, DataType]], df: DataFrame): DataFrame = typeList match {
  case x :: y => castingFunction(y, df.withColumn(x._1, col(x._1).cast(x._2)))
  case Nil => df
}
You will see that finalDF will have the following schema:
root
|-- cl2: string (nullable = false)
|-- cl1: integer (nullable = false)
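If you prefer to avoid the explicit recursion, an equivalent sketch using foldLeft over the same list of (name, type) pairs:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DataType
// Cast one column per step, threading the DataFrame through the fold.
def castingFunction(typeList: List[(String, DataType)], df: DataFrame): DataFrame =
  typeList.foldLeft(df) { case (acc, (name, dataType)) => acc.withColumn(name, col(name).cast(dataType)) }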
You can do the same for your dataframes.
I hope the answer is helpful.

Create another dataframe from existing dataframe with different schema in spark

I have a dataframe which looks like this:
root
|-- A1: string (nullable = true)
|-- A2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- A3 : string (nullable = true)
|-- A4 : array (nullable = true)
| |-- element: string (containsNull = true)
I have a schema which looks like this:
StructType(StructField(A1,ArrayType(StringType,true),true), StructField(A2,StringType,true), StructField(A3,IntegerType,true), StructField(A4,ArrayType(StringType,true),true))
I want to convert this dataframe to the schema defined above.
Can someone help me with how I can do this?
Note: the schema and dataframe are loaded at runtime and they are not fixed.
You can use org.apache.spark.sql.expressions.UserDefinedFunction to transform a string into an array and an array into a string, like this:
import org.apache.spark.sql.functions.{col, udf}

val string_to_array_udf = udf((s: String) => Array(s))
val array_to_string_udf = udf((a: Seq[String]) => a.head)
val string_to_int_udf = udf((s: String) => s.toInt)

val newDf = df.withColumn("a12", string_to_array_udf(col("a1"))).drop("a1").withColumnRenamed("a12", "a1")
  .withColumn("a32", string_to_int_udf(col("a3"))).drop("a3").withColumnRenamed("a32", "a3")
  .withColumn("a22", array_to_string_udf(col("a2"))).drop("a2").withColumnRenamed("a22", "a2")
newDf.printSchema
root
|-- a4: array (nullable = true)
| |-- element: string (containsNull = true)
|-- a1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- a3: integer (nullable = true)
|-- a2: string (nullable = true)
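If the target column order also matters, a final select can reorder the result to follow the schema from the question (a sketch; targetSchema is assumed to hold that StructType):
import org.apache.spark.sql.functions.col
// Select the columns in the order declared by the target schema.
val ordered = newDf.select(targetSchema.fieldNames.map(col): _*)
ordered.printSchema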

Renaming column names of a DataFrame in Spark Scala

I am trying to convert all the headers / column names of a DataFrame in Spark-Scala. As of now I have come up with the following code, which only replaces a single column name:
for (i <- 0 to origCols.length - 1) {
  df.withColumnRenamed(
    df.columns(i),
    df.columns(i).toLowerCase
  )
}
If the structure is flat:
val df = Seq((1L, "a", "foo", 3.0)).toDF
df.printSchema
// root
// |-- _1: long (nullable = false)
// |-- _2: string (nullable = true)
// |-- _3: string (nullable = true)
// |-- _4: double (nullable = false)
the simplest thing you can do is to use the toDF method:
val newNames = Seq("id", "x1", "x2", "x3")
val dfRenamed = df.toDF(newNames: _*)
dfRenamed.printSchema
// root
// |-- id: long (nullable = false)
// |-- x1: string (nullable = true)
// |-- x2: string (nullable = true)
// |-- x3: double (nullable = false)
If you want to rename individual columns you can use either select with alias:
df.select($"_1".alias("x1"))
which can be easily generalized to multiple columns:
val lookup = Map("_1" -> "foo", "_3" -> "bar")
df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)
or withColumnRenamed:
df.withColumnRenamed("_1", "x1")
which can be used with foldLeft to rename multiple columns:
lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))
With nested structures (structs) one possible option is renaming by selecting a whole structure:
val nested = spark.read.json(sc.parallelize(Seq(
"""{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
)))
nested.printSchema
// root
// |-- foobar: struct (nullable = true)
// | |-- foo: struct (nullable = true)
// | | |-- bar: struct (nullable = true)
// | | | |-- first: double (nullable = true)
// | | | |-- second: double (nullable = true)
// |-- id: long (nullable = true)
@transient val foobarRenamed = struct(
  struct(
    struct(
      $"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.first".as("y")
    ).alias("point")
  ).alias("location")
).alias("record")
nested.select(foobarRenamed, $"id").printSchema
// root
// |-- record: struct (nullable = false)
// | |-- location: struct (nullable = false)
// | | |-- point: struct (nullable = false)
// | | | |-- x: double (nullable = true)
// | | | |-- y: double (nullable = true)
// |-- id: long (nullable = true)
Note that it may affect nullability metadata. Another possibility is to rename by casting:
nested.select($"foobar".cast(
"struct<location:struct<point:struct<x:double,y:double>>>"
).alias("record")).printSchema
// root
// |-- record: struct (nullable = true)
// | |-- location: struct (nullable = true)
// | | |-- point: struct (nullable = true)
// | | | |-- x: double (nullable = true)
// | | | |-- y: double (nullable = true)
or:
import org.apache.spark.sql.types._
nested.select($"foobar".cast(
StructType(Seq(
StructField("location", StructType(Seq(
StructField("point", StructType(Seq(
StructField("x", DoubleType), StructField("y", DoubleType)))))))))
).alias("record")).printSchema
// root
// |-- record: struct (nullable = true)
// | |-- location: struct (nullable = true)
// | | |-- point: struct (nullable = true)
// | | | |-- x: double (nullable = true)
// | | | |-- y: double (nullable = true)
For those of you interested in the PySpark version (actually it's the same in Scala - see the comment below):
merchants_df_renamed = merchants_df.toDF(
'merchant_id', 'category', 'subcategory', 'merchant')
merchants_df_renamed.printSchema()
Result:
root
|-- merchant_id: integer (nullable = true)
|-- category: string (nullable = true)
|-- subcategory: string (nullable = true)
|-- merchant: string (nullable = true)
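For reference, the same rename in Scala (a sketch; merchants_df and the column names are taken from the example above):
// toDF with a full list of new names renames every column positionally.
val merchantsRenamed = merchants_df.toDF("merchant_id", "category", "subcategory", "merchant")
merchantsRenamed.printSchema()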
def aliasAllColumns(t: DataFrame, p: String = "", s: String = ""): DataFrame = {
  t.select(t.columns.map { c => t.col(c).as(p + c + s) }: _*)
}
In case it isn't obvious, this adds a prefix and a suffix to each of the current column names. This can be useful when you have two tables with one or more columns having the same name, and you wish to join them but still be able to disambiguate the columns in the resultant table. It sure would be nice if there were a similar way to do this in "normal" SQL.
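For example, a sketch of the join scenario described (ordersDf, customersDf and their column names are hypothetical):
// Prefix each side's columns so nothing collides in the joined result.
val left = aliasAllColumns(ordersDf, p = "o_")
val right = aliasAllColumns(customersDf, p = "c_")
val joined = left.join(right, left("o_customer_id") === right("c_id"))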
Suppose the dataframe df has 3 columns id1, name1, price1
and you wish to rename them to id2, name2, price2
val list = List("id2", "name2", "price2")
import spark.implicits._
val df2 = df.toDF(list:_*)
df2.columns.foreach(println)
I found this approach useful in many cases.
Sometimes the column names are in the format below in a SQL Server or MySQL table.
Ex: Account Number, customer number
But Hive tables do not support column names containing spaces, so please use the solution below to rename your old column names.
Solution:
val renamedColumns = df.columns.map(c => df(c).as(c.replaceAll(" ", "_").toLowerCase()))
df = df.select(renamedColumns: _*)
Two-table join without renaming the joined key:
// method 1: create a new DF
day1 = day1.toDF(day1.columns.map(x => if (x.equals(key)) x else s"${x}_d1"): _*)
// method 2: use withColumnRenamed
for ((x, y) <- day1.columns.filter(!_.equals(key)).map(x => (x, s"${x}_d1"))) {
  day1 = day1.withColumnRenamed(x, y)
}
works!
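A sketch of how this is typically used with the second table (day2 and key are assumed from the snippet's context):
// Rename day2's non-key columns the same way, then join on the shared key,
// so only the join key keeps its original name on both sides.
val day2Renamed = day2.toDF(day2.columns.map(x => if (x.equals(key)) x else s"${x}_d2"): _*)
val joined = day1.join(day2Renamed, Seq(key))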

How can I create a Spark DataFrame from a nested array of struct element?

I have read a JSON file into Spark. This file has the following structure:
scala> tweetBlob.printSchema
root
|-- related: struct (nullable = true)
| |-- next: struct (nullable = true)
| | |-- href: string (nullable = true)
|-- search: struct (nullable = true)
| |-- current: long (nullable = true)
| |-- results: long (nullable = true)
|-- tweets: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cde: struct (nullable = true)
...
...
| | |-- cdeInternal: struct (nullable = true)
...
...
| | |-- message: struct (nullable = true)
...
...
What I would ideally want is a DataFrame with columns "cde", "cdeInternal", "message"... as shown below
root
|-- cde: struct (nullable = true)
...
...
|-- cdeInternal: struct (nullable = true)
...
...
|-- message: struct (nullable = true)
...
...
I have managed to use "explode" to extract elements from the "tweets" array into a column called "tweets"
scala> val tweets = tweetBlob.select(explode($"tweets").as("tweets"))
tweets: org.apache.spark.sql.DataFrame = [tweets: struct<cde:struct<author:struct<gender:string,location:struct<city:string,country:string,state:string>,maritalStatus:struct<evidence:string,isMarried:string>,parenthood:struct<evidence:string,isParent:string>>,content:struct<sentiment:struct<evidence:array<struct<polarity:string,sentimentTerm:string>>,polarity:string>>>,cdeInternal:struct<compliance:struct<isActive:boolean,userProtected:boolean>,tracks:array<struct<id:string>>>,message:struct<actor:struct<displayName:string,favoritesCount:bigint,followersCount:bigint,friendsCount:bigint,id:string,image:string,languages:array<string>,link:string,links:array<struct<href:string,rel:string>>,listedCount:bigint,location:struct<displayName:string,objectType:string>,objectType:string,postedTime...
scala> tweets.printSchema
root
|-- tweets: struct (nullable = true)
| |-- cde: struct (nullable = true)
...
...
| |-- cdeInternal: struct (nullable = true)
...
...
| |-- message: struct (nullable = true)
...
...
How can I select all columns inside the struct and create a DataFrame out of it? Explode does not work on a struct if my understanding is correct.
Any help is appreciated.
One possible way to handle this is to extract the required information from the schema. Let's start with some dummy data:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
case class Bar(x: Int, y: String)
case class Foo(bar: Bar)
val df = sc.parallelize(Seq(Foo(Bar(1, "first")), Foo(Bar(2, "second")))).toDF
df.printSchema
// root
// |-- bar: struct (nullable = true)
// | |-- x: integer (nullable = false)
// | |-- y: string (nullable = true)
and a helper function:
def children(colname: String, df: DataFrame) = {
  val parent = df.schema.fields.filter(_.name == colname).head
  val fields = parent.dataType match {
    case x: StructType => x.fields
    case _ => Array.empty[StructField]
  }
  fields.map(x => col(s"$colname.${x.name}"))
}
Finally the results:
df.select(children("bar", df): _*).printSchema
// root
// |-- x: integer (nullable = true)
// |-- y: string (nullable = true)
You can use:
df
  .select(explode(col("path_to_collection")).as("collection"))
  .select(col("collection.*"))
Example:
scala> val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
scala> val inline = sqlContext.read.json(sc.parallelize(json :: Nil)).select(explode(col("schools")).as("collection")).select(col("collection.*"))
scala> inline.printSchema
root
|-- sname: string (nullable = true)
|-- year: long (nullable = true)
scala> inline.show
+--------+----+
| sname|year|
+--------+----+
|stanford|2010|
|berkeley|2012|
+--------+----+
Or, you can also use the SQL function inline:
scala> val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
scala> sqlContext.read.json(sc.parallelize(json :: Nil)).registerTempTable("tmp")
scala> val inline = sqlContext.sql("SELECT inline(schools) FROM tmp")
scala> inline.printSchema
root
|-- sname: string (nullable = true)
|-- year: long (nullable = true)
scala> inline.show
+--------+----+
| sname|year|
+--------+----+
|stanford|2010|
|berkeley|2012|
+--------+----+
scala> import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrame
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> case class Bar(x: Int, y: String)
defined class Bar
scala> case class Foo(bar: Bar)
defined class Foo
scala> val df = sc.parallelize(Seq(Foo(Bar(1, "first")), Foo(Bar(2, "second")))).toDF
df: org.apache.spark.sql.DataFrame = [bar: struct<x: int, y: string>]
scala> df.printSchema
root
|-- bar: struct (nullable = true)
| |-- x: integer (nullable = false)
| |-- y: string (nullable = true)
scala> df.select("bar.*").printSchema
root
|-- x: integer (nullable = true)
|-- y: string (nullable = true)
scala>