Load constraints from csv-file (amazon deequ) - scala

I'm checking out Deequ, which seems like a really nice library. Is it possible to load constraints from a CSV file or an ORC table in HDFS?
Let's say I have a table with these types:
case class Item(
id: Long,
productName: String,
description: String,
priority: String,
numViews: Long
)
and I want to put constraints like:
val checks = Check(CheckLevel.Error, "unit testing my data")
.isComplete("id") // should never be NULL
.isUnique("id") // should not contain duplicates
But I want to load the ".isComplete("id")" and ".isUnique("id")" parts from a CSV file, so the business can add the constraints and we can run the tests based on their input:
val verificationResult = VerificationSuite()
.onData(data)
.addChecks(Seq(checks))
.run()
I've managed to get the constraints from suggestionResult.constraintSuggestions:
val allConstraints = suggestionResult.constraintSuggestions
.flatMap { case (_, suggestions) => suggestions.map { _.constraint }}
.toSeq
which gives a List like, for example:
allConstraints = List(CompletenessConstraint(Completeness(id,None)), ComplianceConstraint(Compliance('id' has no negative values,id >= 0,None)))
But that list is generated from suggestionResult.constraintSuggestions. I want to be able to create a List like that from the contents of a CSV file. Can anyone help me?
To sum things up:
Basically I just want to add:
val checks = Check(CheckLevel.Error, "unit testing my data")
.isComplete("columnName1")
.isUnique("columnName1")
.isComplete("columnName2")
dynamically based on a file where the file has for example:
columnName;isUnique;isComplete (header)
columnName1;true;true
columnName2;false;true

I chose to store the CSV in src/main/resources as it's very easy to read from there, and easy to maintain in parallel with the code being QA'ed.
def readCSV(spark: SparkSession, filename: String): DataFrame = {
import spark.implicits._
// getResourceAsStream returns null (it does not throw) when the resource is missing
val inputFileStream = Option(this.getClass.getResourceAsStream("/" + filename))
.getOrElse(
throw new Exception("Cannot find " + filename + " in src/main/resources")
)
val readlines =
scala.io.Source.fromInputStream(inputFileStream).getLines.toList
val csvData: Dataset[String] =
spark.sparkContext.parallelize(readlines).toDS
spark.read.option("header", true).option("inferSchema", true).csv(csvData)
}
This loads it as a DataFrame; this can easily be passed to code like gavincruick's example on GitHub, copied here for convenience:
//code to build verifier from DF that has a 'Constraint' column
type Verifier = DataFrame => VerificationResult
def generateVerifier(df: DataFrame, columnName: String): Try[Verifier] = {
val constraintCheckCodes: Seq[String] = df.select(columnName).collect().map(_(0).toString).toSeq
def checkSrcCode(checkCodeMethod: String, id: Int): String = s"""com.amazon.deequ.checks.Check(com.amazon.deequ.checks.CheckLevel.Error, "$id")$checkCodeMethod"""
val verifierSrcCode = s"""{
|import com.amazon.deequ.constraints.ConstrainableDataTypes
|import com.amazon.deequ.{VerificationResult, VerificationSuite}
|import org.apache.spark.sql.DataFrame
|
|val checks = Seq(
| ${constraintCheckCodes.zipWithIndex
.map { (checkSrcCode _).tupled }
.mkString(",\n ")}
|)
|
|(data: DataFrame) => VerificationSuite().onData(data).addChecks(checks).run()
|}
""".stripMargin.trim
println(s"Verification function source code:\n$verifierSrcCode\n")
compile[Verifier](verifierSrcCode)
}
/** Compiles the scala source code that, when evaluated, produces a value of type T. */
def compile[T](source: String): Try[T] =
Try {
val toolbox = currentMirror.mkToolBox()
val tree = toolbox.parse(source)
val compiledCode = toolbox.compile(tree)
compiledCode().asInstanceOf[T]
}
//example usage...
//sample test data
val testDataDF = Seq(
("2020-02-12", "England", "E10000034", "Worcestershire", 1),
("2020-02-12", "Wales", "W11000024", "Powys", 0),
("2020-02-12", "Wales", null, "Unknown", 1),
("2020-02-12", "Canada", "MADEUP", "Ontario", 1)
).toDF("Date", "Country", "AreaCode", "Area", "TotalCases")
//constraints in a DF
val constraintsDF = Seq(
(".isComplete(\"Area\")"),
(".isComplete(\"Country\")"),
(".isComplete(\"TotalCases\")"),
(".isComplete(\"Date\")"),
(".hasCompleteness(\"AreaCode\", _ >= 0.80, Some(\"It should be above 0.80!\"))"),
(".isContainedIn(\"Country\", Array(\"England\", \"Scotland\", \"Wales\", \"Northern Ireland\"))")
).toDF("Constraint")
//Build Verifier from constraints DF
val verifier = generateVerifier(constraintsDF, "Constraint").get
//Run verifier against a sample DF
val result = verifier(testDataDF)
//display results
VerificationResult.checkResultsAsDataFrame(spark, result).show()

It depends on how complicated you want to allow the constraints to be. In general, Deequ lets you use arbitrary Scala code for the validation function of a constraint, so it's difficult (and dangerous from a security perspective) to load that from a file.
I think you would have to come up with your own schema and semantics for the CSV file; at least, this is not directly supported in Deequ.
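As a minimal sketch of such a home-grown schema, the semicolon-separated layout from the question (columnName;isUnique;isComplete) could be parsed into plain values and folded into a single Check. The names CsvConstraints, ColumnRule, parseRules and buildCheck are hypothetical, and only the two boolean flags from the question are handled; anything richer would need its own columns and semantics.

```scala
import com.amazon.deequ.checks.{Check, CheckLevel}

object CsvConstraints {
  // One parsed row of the (hypothetical) constraints file.
  final case class ColumnRule(column: String, isUnique: Boolean, isComplete: Boolean)

  // Parses lines such as "columnName1;true;true"; the first line is assumed to be the header.
  def parseRules(lines: Seq[String]): Seq[ColumnRule] =
    lines.drop(1).map(_.trim).filter(_.nonEmpty).map { line =>
      val Array(column, unique, complete) = line.split(";").map(_.trim)
      ColumnRule(column, unique.toBoolean, complete.toBoolean)
    }

  // Folds the rules into one Check, adding only the constraints that are switched on.
  def buildCheck(rules: Seq[ColumnRule], description: String): Check =
    rules.foldLeft(Check(CheckLevel.Error, description)) { (check, rule) =>
      val withComplete = if (rule.isComplete) check.isComplete(rule.column) else check
      if (rule.isUnique) withComplete.isUnique(rule.column) else withComplete
    }
}
```

The resulting Check can be passed to VerificationSuite().onData(data).addCheck(...).run() as in the question. Because the file only carries column names and flags, no code from the file is ever compiled or executed.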

Related

Store Schema of Read File Into csv file in spark scala

I am reading a CSV file into a DataFrame with the inferSchema option enabled, using the command below.
df2 = spark.read.options(Map("inferSchema"->"true","header"->"true")).csv("s3://Bucket-Name/Fun/Map/file.csv")
df2.printSchema()
Output:
root
|-- CC|Fun|Head|Country|SendType: string (nullable = true)
Now I would like to store only the schema in a CSV file, containing just the column names and their data types, like below.
column_name,datatype
CC,string
Fun,string
Head,string
Country,string
SendType,string
I tried writing this to a CSV file using the option below, but it writes the file with the entire data.
df2.coalesce(1).write.format("csv").mode("append").save("schema.csv")
regards
mahi
Use df.schema.fields to get the fields and their data types.
Check the code below.
scala> val schema = df.schema.fields.map(field => (field.name,field.dataType.typeName)).toList.toDF("column_name","datatype")
schema: org.apache.spark.sql.DataFrame = [column_name: string, datatype: string]
scala> schema.show(false)
+---------------+--------+
|column_name |datatype|
+---------------+--------+
|applicationName|string |
|id |string |
|requestId |string |
|version |long |
+---------------+--------+
scala> schema.write.format("csv").save("/tmp/schema")
Try something like the code below. Use coalesce(1) and .option("header","true") to write a single output file with a header.
import java.io.FileWriter
object SparkSchema {
def main(args: Array[String]): Unit = {
val fw = new FileWriter("src/main/resources/csv.schema", true)
fw.write("column_name,datatype\n")
val spark = Constant.getSparkSess
import spark.implicits._
val df = List(("", "", "", 1L)).toDF("applicationName", "id", "requestId", "version")
val columnList : List[(String, String)] = df.schema.fields.map(field => (field.name, field.dataType.typeName))
.toList
try {
val outString = columnList.map(col => {
col._1 + "," + col._2
}).mkString("\n")
fw.write(outString)
}
finally fw.close()
val newColumnList : List[(String, String)] = List(("newColumn","integer"))
val finalColList = columnList ++ newColumnList
writeToS3("s3://bucket/newFileName.csv",finalColList)
}
def writeToS3(s3FileNameWithpath : String,finalColList : List[(String,String)]) {
val outString = finalColList.map(col => {
col._1 + "," + col._2
}).mkString("\n") // a real line break; "\\n" would write a literal backslash followed by n
import org.apache.hadoop.fs._
import org.apache.hadoop.conf.Configuration
val conf = new Configuration()
conf.set("fs.s3a.access.key", "YOUR ACCESS KEY")
conf.set("fs.s3a.secret.key", "YOUR SECRET KEY")
val dest = new Path(s3FileNameWithpath)
val fs = dest.getFileSystem(conf)
val out = fs.create(dest, true)
out.write( outString.getBytes )
out.close()
}
}
An alternative to @QuickSilver's and @Srinivas' solutions, both of which should work, is to use the DDL representation of the schema. With df.schema.toDDL you get:
CC STRING, fun STRING, Head STRING, Country STRING, SendType STRING
which is the string representation of the schema. You can then split and replace as shown next:
import java.io.PrintWriter
val schema = df.schema.toDDL.split(",")
// Array[String] = Array(`CC` STRING, `fun` STRING, `Head` STRING, `Country` STRING, `SendType` STRING)
val writer = new PrintWriter("/tmp/schema.csv")
writer.write("column_name,datatype\n")
schema.foreach{ r => writer.write(r.replace(" ", ",") + "\n") }
writer.close()
To write to S3 you can use the Hadoop API, as QuickSilver already implemented, or a third-party library such as MinIO:
import io.minio.MinioClient
val minioClient = new MinioClient("https://play.min.io", "ACCESS_KEY", "SECRET_KEY")
minioClient.putObject("YOUR_BUCKET","schema.csv", "/tmp/schema.csv", null)
Or, even better, generate the string, store it in a buffer, and then send it to S3 via an InputStream:
import java.io.ByteArrayInputStream
import io.minio.{MinioClient, PutObjectOptions}
val minioClient = new MinioClient("https://play.min.io", "ACCESS_KEY", "SECRET_KEY")
val schema = df.schema.toDDL.split(",")
val schemaBuffer = new StringBuilder
schemaBuffer ++= "column_name,datatype\n"
schema.foreach{ r => schemaBuffer ++= r.replace(" ", ",") + "\n" }
val inputStream = new ByteArrayInputStream(schemaBuffer.toString.getBytes("UTF-8"))
minioClient.putObject("YOUR_BUCKET", "schema.csv", inputStream, new PutObjectOptions(inputStream.available(), -1))
inputStream.close
#PySpark
df_schema = spark.createDataFrame([(i.name, str(i.dataType)) for i in df.schema.fields], ['column_name', 'datatype'])
df_schema.show()
This creates a new DataFrame that holds the schema of the existing DataFrame.
Use case:
Useful when you want to create a table with the schema of the DataFrame and cannot use the code below, because the PySpark user may not be authorized to execute DDL commands on the database.
df.createOrReplaceTempView("tmp_output_table")
spark.sql("""drop table if exists schema.output_table""")
spark.sql("""create table schema.output_table as select * from tmp_output_table""")
In PySpark, you can get all column names and data types of a DataFrame with df.dtypes. See the pyspark.sql.DataFrame.dtypes documentation for more details.
Having said that, try the code below:
data = df.dtypes
cols = ["col_name", "datatype"]
df = spark.createDataFrame(data=data,schema=cols)
df.show()

Not able to store result in HDFS when code runs for a second iteration

I am new to Spark and Scala and have been trying to implement data cleaning in Spark. The code below checks one column for missing values, stores them in outputrdd, and loops over them to calculate each missing value. The code works well when there is only one missing value in the file. Since HDFS does not allow writing to the same location again, it fails when there is more than one missing value. Can you please help me write finalrdd to a particular location once the missing values for all occurrences have been calculated?
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("app").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val files = sc.wholeTextFiles("/input/raw_files/")
val file = files.map { case (filename, content) => filename }
file.collect.foreach(filename => {
cleaningData(filename)
})
def cleaningData(file: String) = {
//headers has column headers of the files
var hdr = headers.toString()
var vl = hdr.split("\t")
sqlContext.clearCache()
if (hdr.contains("COLUMN_HEADER")) {
//Checks for missing values in dataframe and stores missing values' in outputrdd
if (!outputrdd.isEmpty()) {
logger.info("value is zero then performing further operation")
val outputdatetimedf = sqlContext.sql("select date,'/t',time from cpc where kwh = 0")
val outputdatetimerdd = outputdatetimedf.rdd
val strings = outputdatetimerdd.map(row => row.mkString).collect()
for (i <- strings) {
if (Coddition check) {
//Calculates missing value and stores in finalrdd
finalrdd.map { x => x.mkString("\t") }.saveAsTextFile("/output")
logger.info("file is written in file")
}
}
}
}
}
}
It is not clear how (Coddition check) works in your example.
In any case, .saveAsTextFile("/output") should be called only once.
So I would rewrite your example like this:
// keep the data as an RDD: saveAsTextFile is an RDD method, so do not collect() first
val finalrdd = outputdatetimerdd
.map(row => row.mkString) // one string per row
.filter(str => Coddition check str) // don't know how this Coddition works
// this part is called only once but not in a loop
finalrdd.saveAsTextFile("/output")
logger.info("file is written in file")

How can I dynamically invoke the same Scala function in a cascading manner, with the output of the previous call going as input to the next call

I am new to Spark and Scala and am stuck on the following requirement. I would be really thankful for any help.
We have to apply different rules to different columns of a given table. The list of column names and rules is passed as an argument to the program.
The result of the first rule should be the input to the next rule.
Question: how can I execute the exec() function in a cascading manner, filling in the arguments dynamically for as many rules as are specified in the arguments?
I have written the following code.
object Rules {
def main(args: Array[String]) = {
if (args.length != 3) {
println("Need exactly 3 arguments in format : <sourceTableName> <destTableName> <[<colName>=<Rule> <colName>=<Rule>,...")
println("E.g : INPUT_TABLE OUTPUT_TABLE [NAME=RULE1,ID=RULE2,TRAIT=RULE3]");
System.exit(-1)
}
val conf = new SparkConf().setAppName("My-Rules").setMaster("local");
val sc = new SparkContext(conf);
val srcTableName = args(0).trim();
val destTableName = args(1).trim();
val ruleArguments = StringUtils.substringBetween(args(2).trim(), "[", "]");
val businessRuleMappings = ruleArguments.split(",").map(_.split("=")).map(arr => arr(0) -> arr(1)).toMap;
val sqlContext : SQLContext = new org.apache.spark.sql.SQLContext(sc) ;
val hiveContext : HiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
val dfSourceTbl = hiveContext.table("TEST.INPUT_TABLE");
def exec(dfSource: DataFrame,columnName :String ,funName: String): DataFrame = {
funName match {
case "RULE1" => TransformDF(columnName,dfSource,RULE1);
case "RULE2" => TransformDF(columnName,dfSource,RULE2);
case "RULE3" => TransformDF(columnName,dfSource,RULE3);
case _ =>dfSource;
}
}
def TransformDF(x:String, df:DataFrame, f:(String,DataFrame)=>DataFrame) : DataFrame = {
f(x,df);
}
def RULE1(column : String, sourceDF: DataFrame): DataFrame = {
//put business logic
return sourceDF;
}
def RULE2(column : String, sourceDF: DataFrame): DataFrame = {
//put business logic
return sourceDF;
}
def RULE3(column : String,sourceDF: DataFrame): DataFrame = {
//put business logic
return sourceDF;
}
// How can I call this exec() function with cascading output and arguments for a variable number of rules?
val finalResultDF = exec(exec(exec(dfSourceTbl,"NAME","RULE1"),"ID","RULE2"),"TRAIT","RULE3");
finalResultDF.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto("DB.destTableName")
}
}
I would write all the rules as functions transforming one dataframe to another:
val rules: Seq[(DataFrame) => DataFrame] = Seq(
RULE1("NAME",_:DataFrame),
RULE2("ID",_:DataFrame),
RULE3("TRAIT",_:DataFrame)
)
Now you can apply them using a fold:
val finalResultDF = rules.foldLeft(dfSourceTbl)(_ transform _)
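The fold above uses a fixed list of rules. A minimal sketch of the dynamic case, where the (column, rule) pairs come from the program argument, might look like the following. RuleChain, parseSteps and applyAll are hypothetical names, and the type parameter A stands in for DataFrame so the sketch runs without Spark. The steps are kept as a Seq of pairs rather than the Map built in the question, since a Map does not guarantee the order in which the rules were given.

```scala
object RuleChain {
  // A rule takes a column name and the current value and returns the transformed value.
  type Rule[A] = (String, A) => A

  // Parses "[NAME=RULE1,ID=RULE2]" into ordered (column, ruleName) pairs.
  def parseSteps(arg: String): Seq[(String, String)] =
    arg.trim.stripPrefix("[").stripSuffix("]").split(",").toSeq
      .map(_.trim).filter(_.nonEmpty)
      .map { pair =>
        val Array(column, ruleName) = pair.split("=", 2).map(_.trim)
        column -> ruleName
      }

  // Applies each step in order, feeding the output of one rule into the next.
  // Unknown rule names leave the value unchanged, like the default case in exec().
  def applyAll[A](source: A, steps: Seq[(String, String)], registry: Map[String, Rule[A]]): A =
    steps.foldLeft(source) { case (current, (column, ruleName)) =>
      registry.get(ruleName).fold(current)(rule => rule(column, current))
    }
}
```

With the definitions from the question, the call would be along the lines of RuleChain.applyAll[DataFrame](dfSourceTbl, RuleChain.parseSteps(args(2)), Map("RULE1" -> (RULE1 _), "RULE2" -> (RULE2 _), "RULE3" -> (RULE3 _))).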

Streaming CSV Source with AKKA-HTTP

I am trying to stream data from MongoDB using reactivemongo-akkastream 0.12.1 and return the result as a CSV stream in one of the routes (using Akka HTTP).
I implemented that following the example here:
http://doc.akka.io/docs/akka-http/10.0.0/scala/http/routing-dsl/source-streaming-support.html#simple-csv-streaming-example
and it seems to be working fine.
The only problem I am facing now is how to add the headers to the output CSV file. Any ideas?
Thanks
Aside from the fact that that example isn't really a robust method of generating CSV (doesn't provide proper escaping) you'll need to rework it a bit to add headers. Here's what I would do:
make a Flow to convert a Source[Tweet] to a source of CSV rows, e.g. a Source[List[String]]
concatenate it to a source containing your headers as a single List[String]
adapt the marshaller to render a source of rows rather than tweets
Here's some example code:
case class Tweet(uid: String, txt: String)
def getTweets: Source[Tweet, NotUsed] = ???
val tweetToRow: Flow[Tweet, List[String], NotUsed] =
Flow[Tweet].map { t =>
List(
t.uid,
t.txt.replaceAll(",", "."))
}
// provide a marshaller from a row (List[String]) to a ByteString
implicit val tweetAsCsv = Marshaller.strict[List[String], ByteString] { row =>
Marshalling.WithFixedContentType(ContentTypes.`text/csv(UTF-8)`, () =>
ByteString(row.mkString(","))
)
}
// enable csv streaming
implicit val csvStreaming = EntityStreamingSupport.csv()
val route = path("tweets") {
val headers = Source.single(List("uid", "text"))
val tweets: Source[List[String], NotUsed] = getTweets.via(tweetToRow)
complete(headers.concat(tweets))
}
Update: if your getTweets method returns a Future you can just map over its source value and prepend the headers that way, e.g:
val route = path("tweets") {
val headers = Source.single(List("uid", "text"))
val rows: Future[Source[List[String], NotUsed]] = getTweets
.map(tweets => headers.concat(tweets.via(tweetToRow)))
complete(rows)
}
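On the escaping point raised at the start of this answer: replacing commas with dots changes the data. A minimal sketch of RFC 4180-style quoting, which could be used in the marshaller in place of row.mkString(","), is shown below. CsvEscape, escapeField and renderRow are hypothetical helper names, not part of Akka HTTP.

```scala
object CsvEscape {
  // Quotes a field when it contains a comma, a double quote, or a line break,
  // doubling any embedded double quotes (the RFC 4180 convention).
  def escapeField(field: String): String =
    if (field.exists(c => c == ',' || c == '"' || c == '\n' || c == '\r'))
      "\"" + field.replace("\"", "\"\"") + "\""
    else field

  // Renders one row as a CSV line without a trailing line break.
  def renderRow(row: Seq[String]): String =
    row.map(escapeField).mkString(",")
}
```

With this in place, tweetToRow can pass t.txt through unchanged and the marshaller can build its ByteString from CsvEscape.renderRow(row).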

String filter using Spark UDF

input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read the input file and compare each line with the set "123,200,300"; where a match is found, output the matching values:
200,300 (from input line 1)
300 (from input line 2)
123 (from input line 4)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object sparkApp {
val conf = new SparkConf()
.setMaster("local")
.setAppName("CountingSheep")
val sc = new SparkContext(conf)
def parseLine(invCol: String) : RDD[String] = {
println(s"INPUT, $invCol")
val inv_rdd = sc.parallelize(Seq(invCol.toString))
val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
return inv_rdd.intersection(bs_meta_rdd)
}
def main(args: Array[String]) {
val filePathName = "hdfs://xxx/tmp/input.csv"
val rawData = sc.textFile(filePathName)
val datad = rawData.map{r => parseLine(r)}
}
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong
Problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt,p(1).trim())).toDF()
pDF.select("id","pName").show()
Define UDF
val findP = udf((id: Int,
pName: String
) => {
val ids = Array("123","200","300")
var idsFound : String = ""
for (id <- ids){
if (pName.contains(id)){
idsFound = idsFound + id + ","
}
}
if (idsFound.length() > 0) {
idsFound = idsFound.substring(0,idsFound.length -1)
}
idsFound
})
Use the UDF in withColumn()
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()
Why make it so complex? For a simple case like this we don't need a UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")
rawrdd.map(_.split("\\|")) // split takes a regex, so the pipe must be escaped
.map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
.foreach(println)
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD
What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063).
Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.
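Since an RDD cannot be created inside a transformation, one way to get the output asked for in the question is to do the per-line matching with ordinary Scala collections and let the closure capture only a plain Set. The following is a minimal sketch; LineMatcher and matching are hypothetical names.

```scala
object LineMatcher {
  // Keeps the comma-separated tokens of a line that occur in matchSet, in input order.
  def matching(line: String, matchSet: Set[String]): String =
    line.split(",").map(_.trim).filter(matchSet.contains).mkString(",")
}
```

In the question's main method this would be used roughly as rawData.map(line => LineMatcher.matching(line, Set("123", "200", "300"))).filter(_.nonEmpty). No SparkContext or RDD is referenced inside the map, so the NullPointerException does not arise.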