Creating column types for schema - Scala

I have a text file that I read from and parse to create a dataframe. However, the columns amount and code should be IntegerTypes. Here's what I have:
def getSchema: StructType = {
  StructType(Seq(
    StructField("carrier", StringType, false),
    StructField("amount", StringType, false),
    StructField("currency", StringType, false),
    StructField("country", StringType, false),
    StructField("code", StringType, false)
  ))
}
def getRow(x: String): Row = {
  val columnArray = new Array[String](5)
  columnArray(0) = x.substring(40, 43)
  columnArray(1) = x.substring(43, 46)
  columnArray(2) = x.substring(46, 51)
  columnArray(3) = x.substring(51, 56)
  columnArray(4) = x.substring(56, 64)
  Row.fromSeq(columnArray)
}
Because I build the row from an Array[String], the columns can only be StringType, not a mix of String and Integer. To explain my problem in detail, here's what happens:
First I create an empty dataframe:
var df = spark.sqlContext.createDataFrame(spark.sparkContext.emptyRDD[Row], getSchema)
Then I have a for loop that goes through each file in all the directories. Note: I need to validate every file and cannot read them all at once.
for (each file to parse):
  df2 = spark.sqlContext.createDataFrame(spark.sparkContext.textFile(inputPath)
    .map(x => getRow(x)), schema)
  df = df.union(df2)
I now have a complete dataframe of all the files. However, the amount and code columns are still StringType. How can I make them IntegerType?
Please note: I cannot cast the columns during the for-loop process because it takes a lot of time, and I'd like to keep my current structure as similar as possible. At the end of the for loop I could cast the columns to IntegerType, but what if a column contains a value that is not an integer? I'd like the columns to not be NULL.
Is there a way to make the two specified columns IntegerType without adding a lot of change to the code?
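For reference, here is a minimal sketch of the cast-at-the-end option mentioned above (my addition, assuming df is the unioned DataFrame built in the loop). Values that are not valid integers become null after the cast, so they can be detected and handled explicitly:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// Cast once, after the loop has finished unioning all files.
val typed = df
  .withColumn("amount", col("amount").cast(IntegerType))
  .withColumn("code", col("code").cast(IntegerType))

// Any value that was not a valid integer is now null; surface those rows for inspection.
val invalid = typed.filter(col("amount").isNull || col("code").isNull)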

What about using datasets?
First create a case class modelling your data:
case class MyObject(
  carrier: String,
  amount: Double,
  currency: String,
  country: String,
  code: Int)
Create another case class wrapping the first one with additional info (potential errors, source file):
case class MyObjectWrapper(
  myObject: Option[MyObject],
  someError: Option[String],
  source: String
)
Then create a parser that transforms a line from your file into a MyObject:
import scala.util.{Try, Success, Failure}

object Parser {
  def parse(line: String, file: String): MyObjectWrapper = {
    Try {
      MyObject(
        carrier = line.substring(40, 43),
        amount = line.substring(43, 46).toDouble,
        currency = line.substring(46, 51),
        country = line.substring(51, 56),
        code = line.substring(56, 64).toInt)
    } match {
      case Success(objectParsed) => MyObjectWrapper(Some(objectParsed), None, file)
      case Failure(error) => MyObjectWrapper(None, Some(error.getLocalizedMessage), file)
    }
  }
}
Finally, parse your files:
import spark.implicits._

val ds = files
  .filter( {METHOD TO SELECT CORRECT FILES} )
  .map( {GET INPUT PATH FROM FILES} )
  .map(path => spark.read.textFile(path).map(Parser.parse(_, path)))
  .reduce(_.union(_))
This should give you a Dataset[MyObjectWrapper] with the types and APIs you wish.
Afterwards you can take those you could parse:
ds.filter(_.someError.isEmpty)
Or take those you failed to parse (for investigation):
ds.filter(_.someError.isDefined)
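If you then want a typed Dataset of only the successfully parsed records, one possible follow-up (a sketch, assuming the definitions above and spark.implicits._ in scope) is to flatten the wrapper:
import org.apache.spark.sql.Dataset

// Keep only rows without a parse error, then unwrap the Option.
val parsed: Dataset[MyObject] = ds
  .filter(_.someError.isEmpty)
  .flatMap(_.myObject)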

Related

Load constraints from csv-file (amazon deequ)

I'm checking out Deequ, which seems like a really nice library. I was wondering if it is possible to load constraints from a CSV file or an ORC table in HDFS?
Let's say I have a table with these types:
case class Item(
  id: Long,
  productName: String,
  description: String,
  priority: String,
  numViews: Long
)
and I want to put constraints like:
val checks = Check(CheckLevel.Error, "unit testing my data")
  .isComplete("id") // should never be NULL
  .isUnique("id")   // should not contain duplicates
But I want to load the .isComplete("id") and .isUnique("id") parts from a CSV file, so the business can add the constraints and we can run the tests based on their input:
val verificationResult = VerificationSuite()
  .onData(data)
  .addChecks(Seq(checks))
  .run()
I've managed to get the constraints from suggestionResult.constraintSuggestions:
val allConstraints = suggestionResult.constraintSuggestions
  .flatMap { case (_, suggestions) => suggestions.map { _.constraint }}
  .toSeq
which gives a List like, for example:
allConstraints = List(CompletenessConstraint(Completeness(id,None)), ComplianceConstraint(Compliance('id' has no negative values,id >= 0,None)))
But that gets generated from suggestionResult.constraintSuggestions, whereas I want to be able to create a List like that based on the inputs from a CSV file. Can anyone help me?
To sum things up:
Basically I just want to add:
val checks = Check(CheckLevel.Error, "unit testing my data")
.isComplete("columnName1")
.isUnique("columnName1")
.isComplete("columnName2")
dynamically, based on a file that has, for example:
columnName;isUnique;isComplete (header)
columnName1;true;true
columnName2;false;true
I chose to store the CSV in src/main/resources as it's very easy to read from there, and easy to maintain in parallel with the code being QA'ed.
import scala.util.Try
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

def readCSV(spark: SparkSession, filename: String): DataFrame = {
  import spark.implicits._
  val inputFileStream = Try {
    this.getClass.getResourceAsStream("/" + filename)
  }.getOrElse(
    throw new Exception("Cannot find " + filename + " in src/main/resources")
  )
  val readlines =
    scala.io.Source.fromInputStream(inputFileStream).getLines.toList
  val csvData: Dataset[String] =
    spark.sparkContext.parallelize(readlines).toDS
  spark.read.option("header", true).option("inferSchema", true).csv(csvData)
}
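Hypothetical usage with the constraints file from the question (assuming it is saved as src/main/resources/constraints.csv; with ";" as the separator you would also need an .option("delimiter", ";") inside readCSV, or store the file comma-separated):
val constraintsDF = readCSV(spark, "constraints.csv")
constraintsDF.show()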
This loads it as a DataFrame; this can easily be passed to code like gavincruick's example on GitHub, copied here for convenience:
//code to build verifier from DF that has a 'Constraint' column
type Verifier = DataFrame => VerificationResult

def generateVerifier(df: DataFrame, columnName: String): Try[Verifier] = {

  val constraintCheckCodes: Seq[String] = df.select(columnName).collect().map(_(0).toString).toSeq

  def checkSrcCode(checkCodeMethod: String, id: Int): String =
    s"""com.amazon.deequ.checks.Check(com.amazon.deequ.checks.CheckLevel.Error, "$id")$checkCodeMethod"""

  val verifierSrcCode = s"""{
    |import com.amazon.deequ.constraints.ConstrainableDataTypes
    |import com.amazon.deequ.{VerificationResult, VerificationSuite}
    |import org.apache.spark.sql.DataFrame
    |
    |val checks = Seq(
    | ${constraintCheckCodes.zipWithIndex
        .map { (checkSrcCode _).tupled }
        .mkString(",\n ")}
    |)
    |
    |(data: DataFrame) => VerificationSuite().onData(data).addChecks(checks).run()
    |}
    """.stripMargin.trim

  println(s"Verification function source code:\n$verifierSrcCode\n")
  compile[Verifier](verifierSrcCode)
}
// Reflection toolbox imports needed by compile (scala-compiler must be on the classpath).
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

/** Compiles the scala source code that, when evaluated, produces a value of type T. */
def compile[T](source: String): Try[T] =
  Try {
    val toolbox = currentMirror.mkToolBox()
    val tree = toolbox.parse(source)
    val compiledCode = toolbox.compile(tree)
    compiledCode().asInstanceOf[T]
  }
//example usage...
//sample test data
val testDataDF = Seq(
  ("2020-02-12", "England", "E10000034", "Worcestershire", 1),
  ("2020-02-12", "Wales", "W11000024", "Powys", 0),
  ("2020-02-12", "Wales", null, "Unknown", 1),
  ("2020-02-12", "Canada", "MADEUP", "Ontario", 1)
).toDF("Date", "Country", "AreaCode", "Area", "TotalCases")
//constraints in a DF
val constraintsDF = Seq(
  (".isComplete(\"Area\")"),
  (".isComplete(\"Country\")"),
  (".isComplete(\"TotalCases\")"),
  (".isComplete(\"Date\")"),
  (".hasCompleteness(\"AreaCode\", _ >= 0.80, Some(\"It should be above 0.80!\"))"),
  (".isContainedIn(\"Country\", Array(\"England\", \"Scotland\", \"Wales\", \"Northern Ireland\"))")
).toDF("Constraint")
//Build Verifier from constraints DF
val verifier = generateVerifier(constraintsDF, "Constraint").get
//Run verifier against a sample DF
val result = verifier(testDataDF)
//display results
VerificationResult.checkResultsAsDataFrame(spark, result).show()
It depends on how complicated you want to allow the constraints to be. In general, deequ allows you to use arbitrary Scala code for the validation function of a constraint, so it's difficult (and dangerous from a security perspective) to load that from a file.
I think you would have to come up with your own schema and semantics for the CSV file; at least, it is not directly supported in deequ.
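For illustration only, here is a minimal sketch of one possible set of semantics for the columnName;isUnique;isComplete file from the question (the buildCheck helper below is hypothetical, not part of deequ):
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}
import org.apache.spark.sql.DataFrame

// Fold each row of the constraints DataFrame into a single Check.
def buildCheck(constraints: DataFrame): Check =
  constraints.collect().foldLeft(Check(CheckLevel.Error, "unit testing my data")) { (check, row) =>
    val column = row.getAs[String]("columnName")
    val withUnique = if (row.getAs[Boolean]("isUnique")) check.isUnique(column) else check
    if (row.getAs[Boolean]("isComplete")) withUnique.isComplete(column) else withUnique
  }

// Usage, assuming the semicolon-separated file was read with header and inferSchema enabled:
// val checks = buildCheck(spark.read.option("header", true).option("inferSchema", true)
//   .option("delimiter", ";").csv("constraints.csv"))
// val verificationResult = VerificationSuite().onData(data).addChecks(Seq(checks)).run()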

Spark SQL convert dataset to dataframe

How do I convert a Dataset object to a DataFrame? In my example, I read a JSON file into a DataFrame and convert it to a Dataset. In the Dataset I have added an additional attribute (newColumn) and convert it back to a DataFrame. Here is my example code:
val empData = sparkSession.read.option("header", "true").option("inferSchema", "true").option("multiline", "true").json(filePath)
.....
import sparkSession.implicits._
val res = empData.as[Emp]
//for (i <- res.take(4)) println(i.name + " ->" + i.newColumn)
val s = res.toDF();
s.printSchema()
}
case class Emp(name: String, gender: String, company: String, address: String) {
  val newColumn = if (gender == "male") "Not-allowed" else "Allowed"
}
But I expected the new column newColumn to appear in the s.printSchema() output. It is not happening. Why? How can I achieve this?
The schema of the output with a Product encoder is determined solely by its constructor signature. Therefore, anything that happens in the body is simply discarded.
You can
empData.map(x => (x, x.newColumn)).toDF("value", "newColumn")
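Alternatively, a sketch of another option (the EmpWithFlag case class below is hypothetical, and sparkSession.implicits._ is assumed in scope): move the derived field into a constructor parameter so the Product encoder picks it up.
// Derived field as a constructor parameter, so it becomes part of the encoder schema.
case class EmpWithFlag(name: String, gender: String, company: String, address: String, newColumn: String)

val withFlag = res.map(e => EmpWithFlag(e.name, e.gender, e.company, e.address, e.newColumn))
withFlag.toDF().printSchema() // newColumn now appears in the schema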

How do I save a file in a Spark PairRDD using the key as the filename and the value as the contents?

In Spark, I have downloaded multiple files from s3 using sc.binaryFiles. The RDD that results has the key as the filename and the value has the contents of the file. I have decompressed the file contents, csv parsed it, and converted it to a dataframe. So, now I have a PairRDD[String, DataFrame]. The problem I have is that I want to save the file to HDFS using the key as the filename and save the value as a parquet file overwriting one if it already exists. This is what I got so far.
val files = sc.binaryFiles(lFiles.mkString(","), 250)
  .mapValues(stream => sc.parallelize(readZipStream(new ZipInputStream(stream.open))))

val tables = files.mapValues(file => {
  val header = file.first.split(",")
  val schema = StructType(header.map(fieldName => StructField(fieldName, StringType, true)))
  val lines = file.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
    .flatMap(x => x.split("\n"))
  val rowRDD = lines.map(x => Row.fromSeq(x.split(",")))
  sqlContext.createDataFrame(rowRDD, schema)
})
If you have any advice, please let me know. I would appreciate it.
Thanks,
Ben
The way to save files to HDFS in Spark is the same as in Hadoop. So you need to create a class which extends MultipleTextOutputFormat; in the custom class you can define the output filename yourself. An example is below:
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = {
    "realtime-" + new SimpleDateFormat("yyyyMMddHHmm").format(new Date()) + "00-" + name
  }
}
The calling code is below:
import org.apache.hadoop.io.NullWritable

RDD.rddToPairRDDFunctions(rdd.map { case (key, list) =>
  (NullWritable.get, key)
}).saveAsHadoopFile(input, classOf[NullWritable], classOf[String], classOf[RDDMultipleTextOutputFormat])

How do I detect if a Spark DataFrame has a column

When I create a DataFrame from a JSON file in Spark SQL, how can I tell if a given column exists before calling .select?
Example JSON schema:
{
  "a": {
    "b": 1,
    "c": 2
  }
}
This is what I want to do:
potential_columns = Seq("b", "c", "d")
df = sqlContext.read.json(filename)
potential_columns.map(column => if(df.hasColumn(column)) df.select(s"a.$column"))
but I can't find a good function for hasColumn. The closest I've gotten is to test if the column is in this somewhat awkward array:
scala> df.select("a.*").columns
res17: Array[String] = Array(b, c)
Just assume it exists and let it fail with Try. Plain and simple, and it supports arbitrary nesting:
import scala.util.Try
import org.apache.spark.sql.DataFrame
def hasColumn(df: DataFrame, path: String) = Try(df(path)).isSuccess
val df = sqlContext.read.json(sc.parallelize(
"""{"foo": [{"bar": {"foobar": 3}}]}""" :: Nil))
hasColumn(df, "foobar")
// Boolean = false
hasColumn(df, "foo")
// Boolean = true
hasColumn(df, "foo.bar")
// Boolean = true
hasColumn(df, "foo.bar.foobar")
// Boolean = true
hasColumn(df, "foo.bar.foobaz")
// Boolean = false
Or even simpler:
val columns = Seq(
"foobar", "foo", "foo.bar", "foo.bar.foobar", "foo.bar.foobaz")
columns.flatMap(c => Try(df(c)).toOption)
// Seq[org.apache.spark.sql.Column] = List(
// foo, foo.bar AS bar#12, foo.bar.foobar AS foobar#13)
Python equivalent:
from pyspark.sql.utils import AnalysisException
from pyspark.sql import Row
def has_column(df, col):
    try:
        df[col]
        return True
    except AnalysisException:
        return False
df = sc.parallelize([Row(foo=[Row(bar=Row(foobar=3))])]).toDF()
has_column(df, "foobar")
## False
has_column(df, "foo")
## True
has_column(df, "foo.bar")
## True
has_column(df, "foo.bar.foobar")
## True
has_column(df, "foo.bar.foobaz")
## False
Another option which I normally use is
df.columns.contains("column-name-to-check")
This returns a boolean
Actually, you don't even need to call select in order to use columns; you can just call it on the DataFrame itself:
// define test data
case class Test(a: Int, b: Int)
val testList = List(Test(1,2), Test(3,4))
val testDF = sqlContext.createDataFrame(testList)
// define the hasColumn function
def hasColumn(df: org.apache.spark.sql.DataFrame, colName: String) = df.columns.contains(colName)
// then you can just use it on the DF with a given column name
hasColumn(testDF, "a") // <-- true
hasColumn(testDF, "c") // <-- false
Alternatively, you can define an implicit class using the "pimp my library" pattern so that the hasColumn method is available on your DataFrames directly:
implicit class DataFrameImprovements(df: org.apache.spark.sql.DataFrame) {
  def hasColumn(colName: String) = df.columns.contains(colName)
}
Then you can use it as:
testDF.hasColumn("a") // <-- true
testDF.hasColumn("c") // <-- false
Try is not optimal, as it will evaluate the expression inside Try before it takes the decision.
For large data sets, use the below in Scala:
df.schema.fieldNames.contains("column_name")
For those who stumble across this looking for a Python solution, I use:
if 'column_name_to_check' in df.columns:
    # do something
When I tried @Jai Prakash's answer of df.columns.contains('column-name-to-check') using Python, I got AttributeError: 'list' object has no attribute 'contains'.
Your other option for this would be to do some array manipulation (in this case an intersect) on the df.columns and your potential_columns.
// Loading some data (so you can just copy & paste right into spark-shell)
case class Document( a: String, b: String, c: String)
val df = sc.parallelize(Seq(Document("a", "b", "c")), 2).toDF
// The columns we want to extract
val potential_columns = Seq("b", "c", "d")
// Get the intersect of the potential columns and the actual columns,
// we turn the array of strings into column objects
// Finally turn the result into a vararg (: _*)
df.select(potential_columns.intersect(df.columns).map(df(_)): _*).show
Alas, this will not work for your inner object scenario above. You will need to look at the schema for that.
I'm going to change your potential_columns to fully qualified column names
val potential_columns = Seq("a.b", "a.c", "a.d")
// Our object model
case class Document( a: String, b: String, c: String)
case class Document2( a: Document, b: String, c: String)
// And some data...
val df = sc.parallelize(Seq(Document2(Document("a", "b", "c"), "b2", "c2")), 2).toDF
// We go through each of the fields in the schema.
// For StructTypes we return an array of parentName.fieldName
// For everything else we return an array containing just the field name
// We then flatten the complete list of field names
// Then we intersect that with our potential_columns leaving us just a list of column we want
// we turn the array of strings into column objects
// Finally turn the result into a vararg (: _*)
df.select(
  df.schema.map(a => a.dataType match {
    case s: org.apache.spark.sql.types.StructType => s.fieldNames.map(x => a.name + "." + x)
    case _ => Array(a.name)
  }).flatMap(x => x)
    .intersect(potential_columns)
    .map(df(_)): _*
).show
This only goes one level deep, so to make it generic you would have to do more work.
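For completeness, a hedged sketch of a recursive variant (the allFieldNames helper is my addition, not from the answer above) that collects fully qualified names at any nesting depth:
import org.apache.spark.sql.types.{StructField, StructType}

// Return fully qualified names for every leaf field, however deeply nested.
def allFieldNames(schema: StructType, prefix: String = ""): Seq[String] =
  schema.fields.toSeq.flatMap {
    case StructField(name, st: StructType, _, _) => allFieldNames(st, prefix + name + ".")
    case StructField(name, _, _, _) => Seq(prefix + name)
  }

// df.select(allFieldNames(df.schema).intersect(potential_columns).map(df(_)): _*).show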
In PySpark you can simply run
'field' in df.columns
If you shred your JSON using a schema definition when you load it, then you don't need to check for the column. If it's not in the JSON source, it will appear as a null column.
val schemaJson = """
{
  "type": "struct",
  "fields": [
    {
      "name": "field1",
      "type": "string",
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "field2",
      "type": "string",
      "nullable": true,
      "metadata": {}
    }
  ]
}
"""
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
val djson = sqlContext.read
.schema(schema )
.option("badRecordsPath", readExceptionPath)
.json(dataPath)
import scala.util.Try

def hasColumn(df: org.apache.spark.sql.DataFrame, colName: String) =
  Try(df.select(colName)).isSuccess
Use the above-mentioned function to check the existence of a column, including nested column names.
In PySpark, df.columns gives you a list of columns in the dataframe, so
"colName" in df.columns
would return True or False. Give it a try. Good luck!
For nested columns, in PySpark you can use
df.schema.simpleString().find('column_name')

How to Validate contents of Spark Dataframe

I have the below Scala Spark code base, which works well, but should not.
The 2nd column has data of mixed types, whereas in the schema I have defined it as IntegerType. My actual program has over 100 columns, and keeps deriving multiple child DataFrames after transformations.
How can I validate that the contents of RDD or DataFrame fields have correct data type values, and thus ignore invalid rows or change the contents of a column to some default value? Any more pointers for data quality checks with DataFrame or RDD are appreciated.
val theSeq = Seq(("X01", "41"),
  ("X01", 41),
  ("X01", 41),
  ("X02", "ab"),
  ("X02", "%%"))

val newRdd = sc.parallelize(theSeq)
val rowRdd = newRdd.map(r => Row(r._1, r._2))

val theSchema = StructType(Seq(StructField("ID", StringType, true),
  StructField("Age", IntegerType, true)))
val theNewDF = sqc.createDataFrame(rowRdd, theSchema)
theNewDF.show()
First of all, passing a schema is simply a way to avoid type inference. It is not validated or enforced during DataFrame creation. On a side note, I wouldn't describe a ClassCastException as working well; for a moment I thought you had actually found a bug.
I think the important question is how you get data like theSeq / newRdd in the first place. Is it something you parse yourself, or is it received from an external component? Simply by looking at the type (Seq[(String, Any)] / RDD[(String, Any)] respectively) you already know it is not valid input for a DataFrame. Probably the way to handle things at this level is to embrace static typing. Scala provides quite a few neat ways to handle unexpected conditions (Try, Either, Option), where the last one is the simplest and, as a bonus, works well with Spark SQL. A rather simplistic way to handle things could look like this:
def validateInt(x: Any) = x match {
  case x: Int => Some(x)
  case _ => None
}

def validateString(x: Any) = x match {
  case x: String => Some(x)
  case _ => None
}

val newRddOption: RDD[(Option[String], Option[Int])] = newRdd.map {
  case (id, age) => (validateString(id), validateInt(age))
}
Since Options can be easily composed you can add additional checks like this:
def validateAge(age: Int) = {
  if (age >= 0 && age < 150) Some(age)
  else None
}

val newRddValidated: RDD[(Option[String], Option[Int])] = newRddOption.map {
  case (id, age) => (id, age.flatMap(validateAge))
}
Next, instead of Row, which is a very crude container, I would use case classes:
case class Record(id: Option[String], age: Option[Int])
val records: RDD[Record] = newRddValidated.map{case (id, age) => Record(id, age)}
At this moment all you have to do is call toDF:
import org.apache.spark.sql.DataFrame
val df: DataFrame = records.toDF
df.printSchema
// root
// |-- id: string (nullable = true)
// |-- age: integer (nullable = true)
This was the harder but arguably more elegant way. A faster approach is to let the SQL casting system do the job for you. First let's convert everything to strings:
val stringRdd: RDD[(String, String)] = sc.parallelize(theSeq).map(
p => (p._1.toString, p._2.toString))
Next create a DataFrame:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
val df: DataFrame = stringRdd.toDF("id", "age")
val expectedTypes = Seq(StringType, IntegerType)
val exprs: Seq[Column] = df.columns.zip(expectedTypes).map{
case (c, t) => col(c).cast(t).alias(c)}
val dfProcessed: DataFrame = df.select(exprs: _*)
And the result:
dfProcessed.printSchema
// root
// |-- id: string (nullable = true)
// |-- age: integer (nullable = true)
dfProcessed.show
// +---+----+
// | id| age|
// +---+----+
// |X01| 41|
// |X01| 41|
// |X01| 41|
// |X02|null|
// |X02|null|
// +---+----+
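As a small follow-up sketch (my addition, assuming dfProcessed from above): since failed casts become null, you can either ignore those rows or substitute a default value, which is what the original question asked for.
// Drop rows where the cast to IntegerType failed (age is null).
val dfValid = dfProcessed.na.drop(Seq("age"))

// Or replace failed casts with a default value instead.
val dfDefault = dfProcessed.na.fill(Map("age" -> -1))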
In version 1.4 or older
import org.apache.spark.sql.execution.debug._
theNewDF.typeCheck
It was removed via SPARK-9754, though. I haven't checked, but I think typeCheck becomes sqlContext.debug beforehand.