On Spark version 2.4.5:
I have DataFrames that can have arbitrary levels of nesting:
[scalatest] root
[scalatest] |-- nestedStruct: struct (nullable = true)
[scalatest] | |-- a: string (nullable = true)
[scalatest] | |-- b: integer (nullable = true)
and I have another StructType schema that will always be a subset of the DataFrame's schema, e.g.
[scalatest] root
[scalatest] |-- nestedStruct: struct (nullable = true)
[scalatest] | |-- a: string (nullable = true)
how would I go about dropping column b, or nested columns at any level, from the initial DataFrame when comparing it to the subset schema?
I've done something like this so far:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

def dropColumnsNotInSubsetSchema(df: DataFrame, subsetSchema: StructType): DataFrame = {
  val missingColumns = new ArrayBuffer[String]()
  findMissingColumns(df.schema, subsetSchema, missingColumns)
  df.drop(missingColumns.toArray: _*)
}

private def findMissingColumns(nestedSchema: StructType,
                               subsetSchema: StructType,
                               missingColumns: ArrayBuffer[String]): Unit = {
  nestedSchema.fields.foreach(field => {
    field.dataType match {
      case structType: StructType =>
        findMissingColumns(structType, subsetSchema, missingColumns)
      case _ =>
        if (!subsetSchema.fields.contains(field)) {
          missingColumns += field.name
        }
    }
  })
}
but it seems subsetSchema also has to be traversed recursively, since nested columns won't appear in the outermost layer of the subset schema. And even if I accomplish that and identify that b is not in the subset schema, the name collected would just be b, and df.drop("b") won't remove a nested field.
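For the single level of nesting above, the closest I've gotten is rebuilding the parent struct by hand, keeping only the surviving fields, something like:
import org.apache.spark.sql.functions.{col, struct}

// Rebuild nestedStruct with only field a, which effectively drops b
val pruned = df.withColumn("nestedStruct", struct(col("nestedStruct.a").as("a")))
but I'd like this to be driven generically by the subset schema rather than hard-coding the surviving fields.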
I am currently having some problems with creating a Spark Row object and converting it to a Spark DataFrame. What I am trying to achieve is:
I have two lists of custom types that look more or less like the classes below,
case class MyObject(name:String,age:Int)
case class MyObject2(xyz:String,abc:Double)
val listOne = List(MyObject("aaa", 22), MyObject("sss", 223))
val listTwo = List(MyObject2("bbb", 23), MyObject2("abc", 2332))
Using these two lists I want to create a DataFrame which has one row and two fields (fieldOne and fieldTwo):
fieldOne --> a list of structs (similar to MyObject)
fieldTwo --> a list of structs (similar to MyObject2)
In order to achieve this I created custom StructTypes for MyObject, MyObject2 and my record type.
val myObjSchema = StructType(List(
  StructField("name", StringType),
  StructField("age", IntegerType)
))

val myObjSchema2 = StructType(List(
  StructField("xyz", StringType),
  StructField("abc", DoubleType)
))

val myRecType = StructType(
  List(
    StructField("myField", ArrayType(myObjSchema)),
    StructField("myField2", ArrayType(myObjSchema2))
  )
)
I populated my data in a Spark Row object and created a DataFrame:
val data = Row(
  List(MyObject("aaa", 22), MyObject("sss", 223)),
  List(MyObject2("bbb", 23), MyObject2("abc", 2332))
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(data)), myRecType
)
When I call printSchema on the DataFrame, the output is exactly what I would expect:
root
|-- myField: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: integer (nullable = true)
|-- myField2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- xyz: string (nullable = true)
| | |-- abc: double (nullable = true)
However, when I do a show, I get a runtime exception:
Caused by: java.lang.RuntimeException: spark_utilities.example.MyObject is not a valid external type for schema of struct<name:string,age:int>
It looks like something is wrong with the Row object. Can you please explain what is going wrong here?
Thanks a lot for your help!
PS: I know I can create a custom case class like case class PH(ls: List[MyObject], ls2: List[MyObject2]), populate it and convert it to a Dataset. But due to some limitations I cannot use this approach and would like to solve it in the way mentioned above.
You cannot simply put your case class objects inside a Row; you need to convert those objects to Rows themselves:
val data = Row(
  List(Row("aaa", 22), Row("sss", 223)),
  List(Row("bbb", 23d), Row("abc", 2332d))
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(data)), myRecType
)
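If you would rather not write the Row literals by hand, you can build the same data value from the original lists (a small sketch reusing listOne and listTwo from the question):
// Map each case class instance to a Row whose fields match its StructType
val data = Row(
  listOne.map(o => Row(o.name, o.age)),
  listTwo.map(o => Row(o.xyz, o.abc))
)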
I have a DataFrame without a schema, where every column is stored as StringType, such as:
ID | LOG_IN_DATE | USER
1 | 2017-11-01 | Johns
Now I have a schema defined as [("ID","double"),("LOG_IN_DATE","date"),("USER","string")] and I would like to apply it to the above DataFrame in Spark 2.0.2 with Scala 2.11.
I already tried:
schema.map(x => df.withColumn(x._1, col(x._1).cast(x._2)))
There's no error while running this, but afterwards when I call df.schema, nothing has changed.
Any idea how I could programmatically apply the schema to df? My friend told me I can use the foldLeft method, but I don't think it is available in Spark 2.0.2, either on the DataFrame or on the RDD.
If you already have the list [("ID","double"),("LOG_IN_DATE","date"),("USER","string")], you can use select, casting each column to its type from the list.
Your dataframe
val df = Seq(("1", "2017-11-01", "Johns"), ("2", "2018-01-03", "jons2")).toDF("ID", "LOG_IN_DATE", "USER")
Your schema
val schema = List(("ID", "double"), ("LOG_IN_DATE", "date"), ("USER", "string"))
Cast all the columns to their types from the list
val newColumns = schema.map(c => col(c._1).cast(c._2))
Select all the cast columns
val newDF = df.select(newColumns:_*)
Print Schema
newDF.printSchema()
root
|-- ID: double (nullable = true)
|-- LOG_IN_DATE: date (nullable = true)
|-- USER: string (nullable = true)
Show Dataframe
newDF.show()
Output:
+---+-----------+-----+
|ID |LOG_IN_DATE|USER |
+---+-----------+-----+
|1.0|2017-11-01 |Johns|
|2.0|2018-01-03 |jons2|
+---+-----------+-----+
My friend told me I can use the foldLeft method, but I don't think it is available in Spark 2.0.2, either on the DataFrame or on the RDD
Yes, foldLeft is the way to go
This is the schema before using foldLeft
root
|-- ID: string (nullable = true)
|-- LOG_IN_DATE: string (nullable = true)
|-- USER: string (nullable = true)
Using foldLeft
val schema = List(("ID","double"),("LOG_IN_DATE","date"),("USER","string"))
import org.apache.spark.sql.functions._
schema.foldLeft(df){case(tempdf, x)=> tempdf.withColumn(x._1, col(x._1).cast(x._2))}.printSchema()
and this is the schema after foldLeft
root
|-- ID: double (nullable = true)
|-- LOG_IN_DATE: date (nullable = true)
|-- USER: string (nullable = true)
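If you want to keep working with the result rather than just printing its schema, the same foldLeft can be assigned to a val (a small usage sketch; typedDF is just an illustrative name):
val typedDF = schema.foldLeft(df) { case (tempdf, (name, dataType)) =>
  tempdf.withColumn(name, col(name).cast(dataType))
}
typedDF.printSchema()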
I hope the answer is helpful
Any Scala/Spark transformation returns new data rather than modifying it in place, so you can't change the data types of an existing DataFrame's schema.
Below is code that creates a new DataFrame with the modified schema by casting the columns.
1. Create a new DataFrame
val df=Seq((1,"2017-11-01","Johns"),(2,"2018-01-03","Alice")).toDF("ID","LOG_IN_DATE","USER")
2. Register the DataFrame as a temp view
df.createOrReplaceTempView("user")
3. Now create a new DataFrame by casting the column data type
val new_df=spark.sql("""SELECT ID,TO_DATE(CAST(UNIX_TIMESTAMP(LOG_IN_DATE, 'yyyy-MM-dd') AS TIMESTAMP)) AS LOG_IN_DATE,USER from user""")
4. Display schema
new_df.printSchema
root
|-- ID: integer (nullable = false)
|-- LOG_IN_DATE: date (nullable = true)
|-- USER: string (nullable = true)
Actually what you did:
schema.map(x => df.withColumn(x._1, col(x._1).cast(x._2)))
could work, but you need to define your DataFrame as a var and do it like this (note that type is a reserved word in Scala, so use another name for the pattern variable):
for ((name, dataType) <- schema) {
  df = df.withColumn(name, col(name).cast(dataType))
}
Also, you could try reading your data with the types applied from the start:
import java.sql.Date
import spark.implicits._
case class MyClass(ID: Int, LOG_IN_DATE: Date, USER: String)
// Suppose you are reading from JSON
val df = spark.read.json(path).as[MyClass]
Hope this helps!
I'm working in a Zeppelin notebook and trying to load data from a table using SQL.
In the table, each row has one column which is a JSON blob, for example [{'timestamp':12345,'value':10},{'timestamp':12346,'value':11},{'timestamp':12347,'value':12}]
I want to select the JSON blob as a string, just like the original string, but Spark automatically loads it as a WrappedArray.
It seems that I have to write a UDF to convert the WrappedArray to a string. The following is my code.
I first define a Scala function, then register it, and then use the registered function on the column.
val unwraparr = udf ((x: WrappedArray[(Int, Int)]) => x.map { case Row(val1: String) => + "," + val2 })
sqlContext.udf.register("fwa", unwraparr)
It doesn't work. I would really appreciate it if anyone can help.
The following is the schema of the part I'm working on. There will be many value/timeStamp pairs.
|-- targetColumn: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- value: long (nullable = true)
| | |-- timeStamp: string (nullable = true)
UPDATE:
I came up with the following code:
val f = (x: Seq[Row]) => x.map { case Row(val1: Long, val2: String) => x.mkString("+") }
I need it to concatenate the objects/structs/rows (not sure what to call the struct elements) into a single string.
If the data you loaded as a DataFrame/Dataset in Spark looks like the below, with this schema:
+------------------------------------+
|targetColumn |
+------------------------------------+
|[[12345,10], [12346,11], [12347,12]]|
|[[12345,10], [12346,11], [12347,12]]|
+------------------------------------+
root
|-- targetColumn: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- timeStamp: string (nullable = true)
| | |-- value: long (nullable = true)
Then you can write the DataFrame as JSON to a temporary location, read it back as a text file, parse each line's String and convert it to a DataFrame as below (/home/testing/test.json is the temporary JSON file location):
df.write.mode(SaveMode.Overwrite).json("/home/testing/test.json")
val data = sc.textFile("/home/testing/test.json")
val rowRdd = data.map(jsonLine => Row(jsonLine.split(":\\[")(1).replace("]}", "")))
val stringDF = sqlContext.createDataFrame(rowRdd, StructType(Array(StructField("targetColumn", StringType, true))))
which should leave you with the following DataFrame and schema:
+--------------------------------------------------------------------------------------------------+
|targetColumn |
+--------------------------------------------------------------------------------------------------+
|{"timeStamp":"12345","value":10},{"timeStamp":"12346","value":11},{"timeStamp":"12347","value":12}|
|{"timeStamp":"12345","value":10},{"timeStamp":"12346","value":11},{"timeStamp":"12347","value":12}|
+--------------------------------------------------------------------------------------------------+
root
|-- targetColumn: string (nullable = true)
I hope the answer is helpful
If you read the data initially as text rather than as a DataFrame, you can use the second phase of my answer (reading the JSON file and parsing it) as your first phase of getting the DataFrame.
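As an alternative to the temporary-file round trip: if your Spark version supports it (to_json handles arrays of structs since Spark 2.2), the built-in to_json function can serialize the array column to a JSON string directly. A sketch, noting that the result keeps the surrounding [ ] brackets, unlike the output above:
import org.apache.spark.sql.functions.{col, to_json}

// Serialize the array<struct> column to its JSON string representation
val stringDF2 = df.withColumn("targetColumn", to_json(col("targetColumn")))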
I'm working through a Databricks example. The schema for the dataframe looks like:
> parquetDF.printSchema
root
|-- department: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
|-- employees: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- firstName: string (nullable = true)
| | |-- lastName: string (nullable = true)
| | |-- email: string (nullable = true)
| | |-- salary: integer (nullable = true)
In the example, they show how to explode the employees column into 4 additional columns:
val explodeDF = parquetDF.explode($"employees") {
case Row(employee: Seq[Row]) => employee.map{ employee =>
val firstName = employee(0).asInstanceOf[String]
val lastName = employee(1).asInstanceOf[String]
val email = employee(2).asInstanceOf[String]
val salary = employee(3).asInstanceOf[Int]
Employee(firstName, lastName, email, salary)
}
}.cache()
display(explodeDF)
How would I do something similar with the department column (i.e. add two additional columns to the dataframe called "id" and "name")? The methods aren't exactly the same, and I can only figure out how to create a brand new data frame using:
val explodeDF = parquetDF.select("department.id","department.name")
display(explodeDF)
If I try:
val explodeDF = parquetDF.explode($"department") {
case Row(dept: Seq[String]) => dept.map{dept =>
val id = dept(0)
val name = dept(1)
}
}.cache()
display(explodeDF)
I get the warning and error:
<console>:38: warning: non-variable type argument String in type pattern Seq[String] is unchecked since it is eliminated by erasure
case Row(dept: Seq[String]) => dept.map{dept =>
^
<console>:37: error: inferred type arguments [Unit] do not conform to method explode's type parameter bounds [A <: Product]
val explodeDF = parquetDF.explode($"department") {
^
In my opinion, the most elegant solution is to star-expand the struct using the select operator, as shown below:
val explodedDf2 = explodeDF.select("department.*", "*")
https://docs.databricks.com/spark/latest/spark-sql/complex-types.html
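Applied to the parquetDF from the question, the star expansion would look something like this (a small sketch; it keeps the original columns and adds id and name at the top level):
val flattenedDF = parquetDF.select("*", "department.*")
// flattenedDF now has department, employees, id and name as top-level columns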
You could use something like this:
var explodeDeptDF = explodeDF.withColumn("id", explodeDF("department.id"))
explodeDeptDF = explodeDeptDF.withColumn("name", explodeDeptDF("department.name"))
which you helped me arrive at, along with these questions:
Flattening Rows in Spark
Spark 1.4.1 DataFrame explode list of JSON objects
This seems to work (though maybe not the most elegant solution).
var explodeDF2 = explodeDF.withColumn("id", explodeDF("department.id"))
explodeDF2 = explodeDF2.withColumn("name", explodeDF2("department.name"))
I have a complex DataFrame structure and would like to null out a column easily. I've created implicit classes that wire up functionality and easily address flat (2D) DataFrame structures, but once the DataFrame becomes more complicated with ArrayType or MapType I've not had much luck. For example:
I have schema defined as:
StructType(
  StructField(name, StringType, true),
  StructField(data, ArrayType(
    StructType(
      StructField(name, StringType, true),
      StructField(values,
        MapType(StringType, StringType, true),
        true)
    ),
    true
  ),
  true)
)
I'd like to produce a new DF that has the field data.values of MapType set to null, but as this is an element of an array I have not been able to figure out how. I would think it would be similar to:
df.withColumn("data.values", functions.array(functions.lit(null)))
but this ultimately creates a new column of data.values and does not modify the values element of the data array.
Since Spark 1.6, you can use case classes to map your DataFrames (as Datasets). Then you can map your data and transform it into the new schema you want. For example:
case class Root(name: String, data: Seq[Data])
case class Data(name: String, values: Map[String, String])
case class NullableRoot(name: String, data: Seq[NullableData])
case class NullableData(name: String, value: Map[String, String], values: Map[String, String])
import spark.implicits._  // provides the case class encoders needed by .as[Root] and map

val nullableDF = df.as[Root].map { root =>
  val nullableData = root.data.map(data => NullableData(data.name, null, data.values))
  NullableRoot(root.name, nullableData)
}.toDF()
The resulting schema of nullableDF will be:
root
|-- name: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- value: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- values: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
I ran into the same issue. Assuming you don't need the result to have any new fields or fields with different types, here is a solution that can do this without having to redefine the whole struct: Change value of nested column in DataFrame