How to Validate contents of Spark Dataframe - scala

I have below Scala Spark code base, which works well, but should not.
The 2nd column has data of mixed types, whereas in Schema I have defined it of IntegerType. My actual program has over 100 columns, and keep deriving multiple child DataFrames after transformations.
How can I validate that contents of RDD or DataFrame fields have correct datatype values, and thus ignore invalid rows or change contents of column to some default value. Any more pointers for data quality checks with DataFrame or RDD are appreciated.
var theSeq = Seq(("X01", "41"),
("X01", 41),
("X01", 41),
("X02", "ab"),
("X02", "%%"))
val newRdd = sc.parallelize(theSeq)
val rowRdd = newRdd.map(r => Row(r._1, r._2))
val theSchema = StructType(Seq(StructField("ID", StringType, true),
StructField("Age", IntegerType, true)))
val theNewDF = sqc.createDataFrame(rowRdd, theSchema)
theNewDF.show()

First of all passing schema is simply a way to avoid type inference. It is not validated or enforced during DataFrame creation. On a side note I wouldn't describe ClassCastException as working well. For a moment I thought you actually found a bug.
I think the important question is how you get data like theSeq / newRdd in the first place. Is it something you parse by yourself, is it received from an external component? Simply looking at the type (Seq[(String, Any)] / RDD[(String, Any)] respectively) you already know it is not a valid input for a DataFrame. Probably the way to handle things at this level is to embrace static typing. Scala provides quite a few neat ways to handle unexpected conditions (Try, Either, Option) where the last one is the simplest one, and as a bonus works well with Spark SQL. Rather simplistic way to handle things could look like this
def validateInt(x: Any) = x match {
case x: Int => Some(x)
case _ => None
}
def validateString(x: Any) = x match {
case x: String => Some(x)
case _ => None
}
val newRddOption: RDD[(Option[String], Option[Int])] = newRdd.map{
case (id, age) => (validateString(id), validateInt(age))}
Since Options can be easily composed you can add additional checks like this:
def validateAge(age: Int) = {
if(age >= 0 && age < 150) Some(age)
else None
}
val newRddValidated: RDD[(Option[String], Option[Int])] = newRddOption.map{
case (id, age) => (id, age.flatMap(validateAge))}
Next instead of Row which is a very crude container I would use cases classes:
case class Record(id: Option[String], age: Option[Int])
val records: RDD[Record] = newRddValidated.map{case (id, age) => Record(id, age)}
At this moment all you have to do is call toDF:
import org.apache.spark.sql.DataFrame
val df: DataFrame = records.toDF
df.printSchema
// root
// |-- id: string (nullable = true)
// |-- age: integer (nullable = true)
This was the hard but arguably a more elegant way. A faster is to let SQL casting system to do a job for you. First lets convert everything to Strings:
val stringRdd: RDD[(String, String)] = sc.parallelize(theSeq).map(
p => (p._1.toString, p._2.toString))
Next create a DataFrame:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
val df: DataFrame = stringRdd.toDF("id", "age")
val expectedTypes = Seq(StringType, IntegerType)
val exprs: Seq[Column] = df.columns.zip(expectedTypes).map{
case (c, t) => col(c).cast(t).alias(c)}
val dfProcessed: DataFrame = df.select(exprs: _*)
And the result:
dfProcessed.printSchema
// root
// |-- id: string (nullable = true)
// |-- age: integer (nullable = true)
dfProcessed.show
// +---+----+
// | id| age|
// +---+----+
// |X01| 41|
// |X01| 41|
// |X01| 41|
// |X02|null|
// |X02|null|
// +---+----+

In version 1.4 or older
import org.apache.spark.sql.execution.debug._
theNewDF.typeCheck
It was removed via SPARK-9754 though. I haven't checked but I think typeCheck becomes sqlContext.debug beforehand

Related

Creating column types for schema

I have a text file that I read from and parse to create a dataframe. However, the columns amount and code should be IntegerTypes. Here's what I have:
def getSchema: StructType = {
StructType(Seq(
StructField("carrier", StringType, false),
StructField("amount", StringType, false),
StructField("currency", StringType, false),
StructField("country", StringType, false),
StructField("code", StringType, false),
))
}
def getRow(x: String): Row = {
val columnArray = new Array[String](5)
columnArray(0) = x.substring(40, 43)
columnArray(1) = x.substring(43, 46)
columnArray(2) = x.substring(46, 51)
columnArray(3) = x.substring(51, 56)
columnArray(4) = x.substring(56, 64)
Row.fromSeq(columnArray)
}
Because I have Array[String] defined, the columns can only be StringTypes and not a variety of both String and Integer. To explain in detail my problem, here's what happens:
First I create an empty dataframe:
var df = spark.sqlContext.createDataFrame(spark.sparkContext.emptyRDD[Row], getSchema)
Then I have a for loop that goes through each file in all the directories. Note: I need to validate every file and cannot read all at once.
for (each file parse):
df2 = spark.sqlContext.createDataFrame(spark.sparkContext.textFile(inputPath)
.map(x => getRow(x)), schema)
df = df.union(df2)
I now have a complete dataframe of all the files. However, columns amount and code are StringTypes still. How can I make it so that they are IntegerTypes?
Please note: I cannot cast the columns during the for-loop process because it takes a lot of time. I'd like to keep the current structure I have as similar as possible. At the end of the for loop, I could cast the columns as IntegerTypes, however, what if the column contains a value that is not an Integer? I'd like for the columns to be not NULL.
Is there a way to make the 2 specified columns IntegerTypes without adding a lot of change to the code?
What about using datasets?
First create a case class modelling your data:
case class MyObject(
carrier: String,
amount: Double,
currency: String,
country: String,
code: Int)
create an other case class wrapping the first one with additional infos (potential errors, source file):
case class MyObjectWrapper(
myObject: Option[MyObject],
someError: Option[String],
source: String
)
Then create a parser, transforming a line from your file in myObject:
object Parser {
def parse(line: String, file: String): MyObjectWrapper = {
Try {
MyObject(
carrier = line.substring(40, 43),
amount = line.substring(43, 46).toDouble,
currency = line.substring(46, 51),
country = line.substring(51, 56),
code = line.substring(56, 64).toInt)
} match {
case Success(objectParsed) => MyObjectWrapper(Some(objectParsed), None, file)
case Failure(error) => MyObjectWrapper(None, Some(error.getLocalizedMessage), file)
}
}
}
Finally, parse your files:
import org.apache.spark.sql.functions._
val ds = files
.filter( {METHOD TO SELECT CORRECT FILES})
.map( { GET INPUT PATH FROM FILES} )
.map(path => spark.read.textFile(_).map(Parser.parse(_, path))
.reduce(_.union(_))
This should give you a Dataset[MyObjectWrapper] with the types and APIs you wish.
Afterwards you can take those you could parse:
ds.filter(_.someError == None)
Or take those you failed to parse (for investigation):
ds.filter(_.someError != None)

Scala Spark Dataframe new column from object column [duplicate]

I am trying to implement a custom UDT and be able to reference it from Spark SQL (as explained in the Spark SQL whitepaper, section 4.4.2).
The real example is to have a custom UDT backed by an off-heap data structure using Cap'n Proto, or similar.
For this posting, I have made up a contrived example. I know that I could just use Scala case classes and not have to do any work at all, but that isn't my goal.
For example, I have a Person containing several attributes and I want to be able to SELECT person.first_name FROM person. I'm running into the error Can't extract value from person#1 and I'm not sure why.
Here is the full source (also available at https://github.com/andygrove/spark-sql-udt)
package com.theotherandygrove
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
object Example {
def main(arg: Array[String]): Unit = {
val conf = new SparkConf()
.setAppName("Example")
.setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val schema = StructType(List(
StructField("person_id", DataTypes.IntegerType, true),
StructField("person", new MockPersonUDT, true)))
// load initial RDD
val rdd = sc.parallelize(List(
MockPersonImpl(1),
MockPersonImpl(2)
))
// convert to RDD[Row]
val rowRdd = rdd.map(person => Row(person.getAge, person))
// convert to DataFrame (RDD + Schema)
val dataFrame = sqlContext.createDataFrame(rowRdd, schema)
// register as a table
dataFrame.registerTempTable("person")
// selecting the whole object works fine
val results = sqlContext.sql("SELECT person.first_name FROM person WHERE person.age < 100")
val people = results.collect
people.map(row => {
println(row)
})
}
}
trait MockPerson {
def getFirstName: String
def getLastName: String
def getAge: Integer
def getState: String
}
class MockPersonUDT extends UserDefinedType[MockPerson] {
override def sqlType: DataType = StructType(List(
StructField("firstName", StringType, nullable=false),
StructField("lastName", StringType, nullable=false),
StructField("age", IntegerType, nullable=false),
StructField("state", StringType, nullable=false)
))
override def userClass: Class[MockPerson] = classOf[MockPerson]
override def serialize(obj: Any): Any = obj.asInstanceOf[MockPersonImpl].getAge
override def deserialize(datum: Any): MockPerson = MockPersonImpl(datum.asInstanceOf[Integer])
}
#SQLUserDefinedType(udt = classOf[MockPersonUDT])
#SerialVersionUID(123L)
case class MockPersonImpl(n: Integer) extends MockPerson with Serializable {
def getFirstName = "First" + n
def getLastName = "Last" + n
def getAge = n
def getState = "AK"
}
If I simply SELECT person FROM person then the query works. I just can't reference the attributes in SQL, even though they are defined in the schema.
You get this errors because schema defined by sqlType is never exposed and is not intended to be accessed directly. It simply provides a way to express a complex data types using native Spark SQL types.
You can access individual attributes using UDFs but first lets show that the internal structure is indeed not exposed:
dataFrame.printSchema
// root
// |-- person_id: integer (nullable = true)
// |-- person: mockperso (nullable = true)
To create UDF we need functions which take as an argument an object of a type represented by a given UDT:
import org.apache.spark.sql.functions.udf
val getFirstName = (person: MockPerson) => person.getFirstName
val getLastName = (person: MockPerson) => person.getLastName
val getAge = (person: MockPerson) => person.getAge
which can be wrapped using udf function:
val getFirstNameUDF = udf(getFirstName)
val getLastNameUDF = udf(getLastName)
val getAgeUDF = udf(getAge)
dataFrame.select(
getFirstNameUDF($"person").alias("first_name"),
getLastNameUDF($"person").alias("last_name"),
getAgeUDF($"person").alias("age")
).show()
// +----------+---------+---+
// |first_name|last_name|age|
// +----------+---------+---+
// | First1| Last1| 1|
// | First2| Last2| 2|
// +----------+---------+---+
To use these with raw SQL you have register functions through SQLContext:
sqlContext.udf.register("first_name", getFirstName)
sqlContext.udf.register("last_name", getLastName)
sqlContext.udf.register("age", getAge)
sqlContext.sql("""
SELECT first_name(person) AS first_name, last_name(person) AS last_name
FROM person
WHERE age(person) < 100""").show
// +----------+---------+
// |first_name|last_name|
// +----------+---------+
// | First1| Last1|
// | First2| Last2|
// +----------+---------+
Unfortunately it comes with a price tag attached. First of all every operation requires deserialization. It also substantially limits the ways in which query can be optimized. In particular any join operation on one of these fields requires a Cartesian product.
In practice if you want to encode a complex structure, which contains attributes that can be expressed using built-in types, it is better to use StructType:
case class Person(first_name: String, last_name: String, age: Int)
val df = sc.parallelize(
(1 to 2).map(i => (i, Person(s"First$i", s"Last$i", i)))).toDF("id", "person")
df.printSchema
// root
// |-- id: integer (nullable = false)
// |-- person: struct (nullable = true)
// | |-- first_name: string (nullable = true)
// | |-- last_name: string (nullable = true)
// | |-- age: integer (nullable = false)
df
.where($"person.age" < 100)
.select($"person.first_name", $"person.last_name")
.show
// +----------+---------+
// |first_name|last_name|
// +----------+---------+
// | First1| Last1|
// | First2| Last2|
// +----------+---------+
and reserve UDTs for actual types extensions like built-in VectorUDT or things that can benefit from a specific representation like enumerations.

How to create a Row from a given case class?

Imagine that you have the following case classes:
case class B(key: String, value: Int)
case class A(name: String, data: B)
Given an instance of A, how do I create a Spark Row? e.g.
val a = A("a", B("b", 0))
val row = ???
NOTE: Given row I need to be able to get data with:
val name: String = row.getAs[String]("name")
val b: Row = row.getAs[Row]("data")
The following seems to match what you're looking for.
scala> spark.version
res0: String = 2.3.0
scala> val a = A("a", B("b", 0))
a: A = A(a,B(b,0))
import org.apache.spark.sql.Encoders
val schema = Encoders.product[A].schema
scala> schema.printTreeString
root
|-- name: string (nullable = true)
|-- data: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- value: integer (nullable = false)
val values = a.productIterator.toSeq.toArray
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
val row: Row = new GenericRowWithSchema(values, schema)
scala> val name: String = row.getAs[String]("name")
name: String = a
// the following won't work since B =!= Row
scala> val b: Row = row.getAs[Row]("data")
java.lang.ClassCastException: B cannot be cast to org.apache.spark.sql.Row
... 55 elided
Very short but probably not the fastest as it first creates a dataframe and then collects it again :
import session.implicits._
val row = Seq(a).toDF().first()
#Jacek Laskowski answer is great!
To complete:
Here some syntactic sugar:
val row = Row(a.productIterator.toSeq: _*)
And a recursive method if you happen to have nested case classes
def productToRow(product: Product): Row = {
val sequence = product.productIterator.toSeq.map {
case product : Product => productToRow(product)
case e => e
}
Row(sequence : _*)
}
I don't think there exist a public API that can do it directly. Internally Spark uses Encoder.toRow method to convert objects org.apache.spark.sql.catalyst.expressions.UnsafeRow, but this method is private. You could try to:
Obtain Encoder for the class:
val enc: Encoder[A] = ExpressionEncoder()
Use reflection to access toRow method and set it to accessible.
Call it to convert object to UnsafeRow.
Obtain RowEncoder for the expected schema (enc.schema).
Convert UnsafeRow to Row.
I haven't tried this, so I cannot guarantee it will work or not.

Filter an array column based on a provided list

I have the following types in a dataframe:
root
|-- id: string (nullable = true)
|-- items: array (nullable = true)
| |-- element: string (containsNull = true)
input:
val rawData = Seq(("id1",Array("item1","item2","item3","item4")),("id2",Array("item1","item2","item3")))
val data = spark.createDataFrame(rawData)
and a list of items:
val filter_list = List("item1", "item2")
I would like to filter out items that are non in the filter_list, similar to how array_contains would function, but its not working on a provided list of strings, only a single value.
so the output would look like this:
val rawData = Seq(("id1",Array("item1","item2")),("id2",Array("item1","item2")))
val data = spark.createDataFrame(rawData)
I tried solving this with the following UDF, but I probably mix types between Scala and Spark:
def filterItems(flist: List[String]) = udf {
(recs: List[String]) => recs.filter(item => flist.contains(item))
}
I'm using Spark 2.2
thanks!
You code is almost right. All you have to do is replace List with Seq
def filterItems(flist: List[String]) = udf {
(recs: Seq[String]) => recs.filter(item => flist.contains(item))
}
It would also make sense to change signature from List[String] => UserDefinedFunction to SeqString] => UserDefinedFunction, but it is not required.
Reference SQL Programming Guide - Data Types.

How to use the functions.explode to flatten element in dataFrame

I've made this piece of code :
case class RawPanda(id: Long, zip: String, pt: String, happy: Boolean, attributes: Array[Double])
case class PandaPlace(name: String, pandas: Array[RawPanda])
object TestSparkDataFrame extends App{
System.setProperty("hadoop.home.dir", "E:\\Programmation\\Libraries\\hadoop")
val conf = new SparkConf().setAppName("TestSparkDataFrame").set("spark.driver.memory","4g").setMaster("local[*]")
val session = SparkSession.builder().config(conf).getOrCreate()
import session.implicits._
def createAndPrintSchemaRawPanda(session:SparkSession):DataFrame = {
val newPanda = RawPanda(1,"M1B 5K7", "giant", true, Array(0.1, 0.1))
val pandaPlace = PandaPlace("torronto", Array(newPanda))
val df =session.createDataFrame(Seq(pandaPlace))
df
}
val df2 = createAndPrintSchemaRawPanda(session)
df2.show
+--------+--------------------+
| name| pandas|
+--------+--------------------+
|torronto|[[1,M1B 5K7,giant...|
+--------+--------------------+
val pandaInfo = df2.explode(df2("pandas")) {
case Row(pandas: Seq[Row]) =>
pandas.map{
case (Row(
id: Long,
zip: String,
pt: String,
happy: Boolean,
attrs: Seq[Double])) => RawPanda(id, zip, pt , happy, attrs.toArray)
}
}
pandaInfo2.show
+--------+--------------------+---+-------+-----+-----+----------+
| name| pandas| id| zip| pt|happy|attributes|
+--------+--------------------+---+-------+-----+-----+----------+
|torronto|[[1,M1B 5K7,giant...| 1|M1B 5K7|giant| true|[0.1, 0.1]|
+--------+--------------------+---+-------+-----+-----+----------+
The problem that the explode function as I used it is deprecated, so I would like to recaculate the pandaInfo2 dataframe but using the adviced method in the warning.
use flatMap() or select() with functions.explode() instead
But then when I do :
val pandaInfo = df2.select(functions.explode(df("pandas"))
I obtain the same result as I had in df2.
I don't know how to proceed to use flatMap or functions.explode.
How could I use flatMap or functions.explode to obtain the result that I want ?(the one in pandaInfo)
I've seen this post and this other one but none of them helped me.
Calling select with explode function returns a DataFrame where the Array pandas is "broken up" into individual records; Then, if you want to "flatten" the structure of the resulting single "RawPanda" per record, you can select the individual columns using a dot-separated "route":
val pandaInfo2 = df2.select($"name", explode($"pandas") as "pandas")
.select($"name", $"pandas",
$"pandas.id" as "id",
$"pandas.zip" as "zip",
$"pandas.pt" as "pt",
$"pandas.happy" as "happy",
$"pandas.attributes" as "attributes"
)
A less verbose version of the exact same operation would be:
import org.apache.spark.sql.Encoders // going to use this to "encode" case class into schema
val pandaColumns = Encoders.product[RawPanda].schema.fields.map(_.name)
val pandaInfo3 = df2.select($"name", explode($"pandas") as "pandas")
.select(Seq($"name", $"pandas") ++ pandaColumns.map(f => $"pandas.$f" as f): _*)