I would like to update the schema of a Spark DataFrame by first converting it to a Dataset that contains fewer columns. Background: I would like to remove some deeply nested fields from a schema.
I tried the following, but the schema does not change:
import org.apache.spark.sql.functions._
val initial_df = spark.range(10).withColumn("foo", lit("foo!")).withColumn("bar", lit("bar!"))
case class myCaseClass(bar: String)
val reduced_ds = initial_df.as[myCaseClass]
The schema still includes the other fields:
reduced_ds.schema // StructType(StructField(id,LongType,false),StructField(foo,StringType,false),StructField(bar,StringType,false))
Is there a way to update the schema that way?
It also confuses me that when I collect the dataset it only returns the fields defined in the case class:
reduced_ds.limit(1).collect() // Array(myCaseClass(bar!))
Add a fake map operation to force the projection using the predefined identity function:
import org.apache.spark.sql.functions._
val initial_df = spark.range(10).withColumn("foo", lit("foo!")).withColumn("bar", lit("bar!"))
case class myCaseClass(bar: String)
val reduced_ds = initial_df.as[myCaseClass].map(identity)
This yields
reduced_ds.schema // StructType(StructField(bar,StringType,true))
in the doc: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#as%5BU%5D(implicitevidence$2:org.apache.spark.sql.Encoder%5BU%5D):org.apache.spark.sql.Dataset%5BU%5D
it says:
Note that as[] only changes the view of the data that is passed into
typed operations, such as map(), and does not eagerly project away any
columns that are not present in the specified class.
To achieve what you want to do you need to select the columns of myCaseClass before the conversion:
initial_df.select("bar").as[myCaseClass]
It is also normal that when you collect reduced_ds it returns records of type myCaseClass, and myCaseClass has only one attribute named bar. That does not conflict with the fact that the dataset schema is something else.
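Putting both points together, a quick check (a sketch reusing initial_df and myCaseClass from the question; spark.implicits._ is assumed to be in scope for the encoder):

initial_df.as[myCaseClass].schema // still id, foo and bar: only the view changed
initial_df.select("bar").as[myCaseClass].schema // only bar: the projection happened eagerly
initial_df.as[myCaseClass].limit(1).collect() // Array(myCaseClass(bar!)): typed operations only see bar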
Related
I am trying to convert a dataframe into a dataset corresponding to the model EmailToSend.
My models object:
object Models {
  case class EmailToSend(
    a1: String,
    promoCodeTemplate: Option[PromoCodeTemplate]
  )

  case class PromoCodeTemplate(
    b1: String
  )
}
My code:
val myDataset: Dataset[Models.EmailToSend] = myDf.as[Models.EmailToSend]
myDf contains all columns required by EmailToSend, except promoCodeTemplate. As a consequence, this code fails at runtime:
cannot resolve '`promoCodeTemplate`' given input columns: [a1];
promoCodeTemplate is missing from that dataframe, which is what I expect. It will be filled later, but for now it has to be empty: there is no promo code template, this is normal.
The problem is that I cannot make this work without filling it with an actual promo code template. I tried to add an empty value with withColumn, but no value I tried worked.
val myDataset: Dataset[Models.EmailToSend] = myDf
  // this is one of the many values I tried
  .withColumn("promoCodeTemplate", lit(null.asInstanceOf[Models.PromoCodeTemplate]).cast(Models.PromoCodeTemplate))
  .as[Models.EmailToSend]
How do I assign an empty value to the column promoCodeTemplate?
You should create an empty struct field that matches the Spark type of the case class PromoCodeTemplate:
val myDataset = myDF.withColumn("promoCodeTemplate", struct(lit("").as("b1"))).as[EmailToSend]
Or you can use the line below:
myDF.withColumn("promoCodeTemplate", typedLit(PromoCodeTemplate(""))).as[EmailToSend]
To simply add a null value:
myDF.withColumn("promoCodeTemplate", typedLit(null.asInstanceOf[PromoCodeTemplate])).as[EmailToSend]
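As a quick sanity check, here is a minimal sketch with a hypothetical myDF that only has the a1 column (spark.implicits._ and the Models members are assumed to be in scope):

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.typedLit
import Models._

val myDF = Seq("a1 value").toDF("a1") // hypothetical input without promoCodeTemplate

val myDataset: Dataset[EmailToSend] =
  myDF.withColumn("promoCodeTemplate", typedLit(null.asInstanceOf[PromoCodeTemplate])).as[EmailToSend]

myDataset.printSchema() // promoCodeTemplate should appear as a nullable struct<b1: string>
myDataset.collect() // the null struct should decode to promoCodeTemplate = None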
I am a little bit confused about Spark's DataFrame .as[] function;
in the documentation it says
returns a new Dataset where each record has been mapped to the specified type.
but for example, if I do:
case class Person(id: Int, name: String)
case class NewPerson(id: Int)
val person1 = Person(1, "a")
val df = Seq(person1).toDF()
val ds = df.as[NewPerson]
the ds dataset I get will still have the two columns id and name of the class Person. I would expect to have only the id column of the class NewPerson.
What did the function do here?
Actually, the as method only changes the view of the data, not the data itself, as explained in the documentation:
Note that as[] only changes the view of the data that is passed into typed operations, such as map(), and does not eagerly project away any columns that are not present in the specified class.
So as does not remove columns that are not present in your case class; it just creates a view of your rows that you can use in typed operations.
Adding to Vincent Doba's answer, if you have used a case class A to create a value ds of type Dataset[A] then you can truncate it to the fields you need with the following:
val ds_clean: Dataset[A] = ds.map(identity)
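For instance, with the Person/NewPerson classes from the question (a sketch):

val ds = df.as[NewPerson]
ds.schema // still contains both id and name
ds.map(identity).schema // only id is left: the identity map forces the projection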
Say I have a dataframe which contains a column (called colA) which is a Seq of Rows. I want to append a new field to each record of colA. (And the new field is derived from the existing record, so I have to write a udf.)
How should I write this udf?
I have tried to write a udf which takes colA as input and outputs Seq[Row], where each record contains the new field. But the problem is that the udf cannot return Seq[Row]. The exception is 'Schema for type org.apache.spark.sql.Row is not supported'.
What should I do?
The udf that I wrote:
val convert = udf[Seq[Row], Seq[Row]](blablabla...)
And the exception is java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported
Since Spark 2.0 you can create UDFs which return Row / Seq[Row], but you must provide the schema for the return type, e.g. if you work with an Array of Doubles:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.{ArrayType, DoubleType}

val schema = ArrayType(DoubleType)

val myUDF = udf((s: Seq[Row]) => {
  s // just pass data without modification
}, schema)
But I can't really imagine where this is useful; I would rather return tuples or case classes (or a Seq thereof) from the UDFs.
EDIT: It could be useful if your row contains more than 22 fields (the limit of fields for tuples/case classes).
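For completeness, here is a sketch of the same pattern for the question's case of an array of structs where each element gets an extra field; the field names and types are assumptions for illustration only:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types._

// element type of the returned array: the original field plus the appended one
val outSchema = ArrayType(StructType(Seq(
  StructField("value", StringType),
  StructField("newField", IntegerType)
)))

// appends a field derived from each record of the incoming array
val appendField = udf(
  (rows: Seq[Row]) => rows.map(r => Row(r.getAs[String]("value"), r.size)),
  outSchema
)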
This is an old question, but I just wanted to update it according to the newer versions of Spark.
Since Spark 3.0.0, the method that @Raphael Roth has mentioned is deprecated. Hence, you might get an AnalysisException. The reason is that the input closure using this method doesn't have type checking, and the behavior might differ from what we expect in SQL when it comes to null values.
If you really know what you're doing, you can set the spark.sql.legacy.allowUntypedScalaUDF configuration to true.
Another solution is to use a case class instead of a schema. For example:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

case class Foo(field1: String, field2: String)

val convertFunction: Seq[Row] => Seq[Foo] = input => {
  input.map { x =>
    // do something with x and convert it to a Foo, e.g. by position:
    Foo(x.getAs[String](0), x.getAs[String](1))
  }
}

val myUdf = udf(convertFunction)
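A usage sketch under the same assumptions (the DataFrame df and the array-of-struct column colA are taken from the question, but are hypothetical here):

import org.apache.spark.sql.functions.col

val transformed = df.withColumn("colA", myUdf(col("colA")))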
I'd like to create a Row with a schema from a case class to test one of my map functions. The most straightforward way I can think of doing this is:
import org.apache.spark.sql.Row
case class MyCaseClass(foo: String, bar: Option[String])
def buildRowWithSchema(record: MyCaseClass): Row = {
  sparkSession.createDataFrame(Seq(record)).collect.head
}
However, this seemed like a lot of overhead to just get a single Row, so I looked into how I could directly create a Row with a schema. This led me to:
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.{Encoders, Row}
def buildRowWithSchemaV2(record: MyCaseClass): Row = {
  val recordValues: Array[Any] = record.getClass.getDeclaredFields.map((field) => {
    field.setAccessible(true)
    field.get(record)
  })
  new GenericRowWithSchema(recordValues, Encoders.product[MyCaseClass].schema)
}
Unfortunately, the Row that the second version returns is different from the first Row. Option fields in the first version are reduced to their primitive values, while they are still Options in the second version. Also, the second version is quite unwieldy.
Is there a better way to do this?
The second version is returning the Option itself for the bar case class field, thus you are not getting the primitive value as in the first version. You can use the following code for primitive values:
def buildRowWithSchemaV2(record: MyCaseClass): Row = {
  val recordValues: Array[Any] = record.getClass.getDeclaredFields.map((field) => {
    field.setAccessible(true)
    val returnValue = field.get(record)
    if (returnValue.isInstanceOf[Option[_]]) {
      // unwrap the Option so the Row holds the plain value (null for None)
      returnValue.asInstanceOf[Option[Any]].orNull
    } else {
      returnValue
    }
  })
  new GenericRowWithSchema(recordValues, Encoders.product[MyCaseClass].schema)
}
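A quick hypothetical check of the adjusted version:

val row = buildRowWithSchemaV2(MyCaseClass("foo!", Some("bar!")))
row.getAs[String]("bar") // "bar!": the plain value rather than Some("bar!")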
But meanwhile, I would suggest you use DataFrame or Dataset.
DataFrame and Dataset are themselves collections of Rows with a schema.
So when you have a case class defined, you just need to encode your input data into the case class.
For example:
let's say you have input data such as
val data = Seq(("test1", "value1"),("test2", "value2"),("test3", "value3"),("test4", null))
If you have a text file you can read it with sparkContext.textFile and split it according to your needs.
Now once you have your data as an RDD (or a local Seq as above), converting it to a DataFrame or Dataset takes two lines of code:
import sqlContext.implicits._
val dataFrame = data.map(d => MyCaseClass(d._1, Option(d._2))).toDF
.toDS would generate a Dataset instead.
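For instance, a sketch assuming the same MyCaseClass and sqlContext as above:

import sqlContext.implicits._
val dataset = data.map(d => MyCaseClass(d._1, Option(d._2))).toDS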
Thus you have a collection of Rows with a schema.
For validation you can do the following:
println(dataFrame.schema) //for checking if there is schema
println(dataFrame.take(1).getClass.getName) //for checking if it is a collection of Rows
Hope this helps.
I have a dataframe which has a complex column of type ArrayType<StructType>. To transform this dataframe I have created a udf which can consume this column, taking an Array[case class] as its parameter. The main bottleneck here is that when I create the case class according to the StructType, the StructField name contains special characters, for example "##field". So I give the same name to the case class field (as `##field`) and attach it to the udf parameter. After interpretation, the Spark udf definition changes the name of the case class field to "$hash$hashfield". When performing the transform on this dataframe it fails because of this mismatch. Please help ...
Due to JVM limitations, Scala stores such identifiers in an encoded form, and currently Spark can't map ##field to $hash$hashfield.
One possible solution is to extract the fields manually from the raw Row (but you need to know the order of the fields in the df; you can use df.schema for that):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

case class Foo(a: String) // target case class with an ordinary field name

val myUdf = udf { (struct: Row) =>
  // Pattern match the struct:
  struct match {
    case Row(a: String) => Foo(a)
  }
  // ...or extract values from the Row by position, e.g.:
  // val `##a` = struct.getAs[String](0)
}
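A usage sketch under the same assumptions (df and the struct column name are hypothetical):

val result = df.withColumn("converted", myUdf(df("colWithSpecialFields")))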