Store Schema of Read File Into csv file in spark scala - scala

i am reading a csv file using inferschema option enabled in data frame using below command.
df2 = spark.read.options(Map("inferSchema"->"true","header"->"true")).csv("s3://Bucket-Name/Fun/Map/file.csv")
df2.printSchema()
Output:
root
|-- CC|Fun|Head|Country|SendType: string (nullable = true)
Now I would like to store the above output only into a csv file having just these column names and datatype of these columns like below.
column_name,datatype
CC,string
Fun,string
Head,string
Country,string
SendType,string
I tried writing this into a csv using below option, but this is writing the file with entire data.
df2.coalesce(1).write.format("csv").mode("append").save("schema.csv")
regards
mahi

df.schema.fields to get fields & its datatype.
Check below code.
scala> val schema = df.schema.fields.map(field => (field.name,field.dataType.typeName)).toList.toDF("column_name","datatype")
schema: org.apache.spark.sql.DataFrame = [column_name: string, datatype: string]
scala> schema.show(false)
+---------------+--------+
|column_name |datatype|
+---------------+--------+
|applicationName|string |
|id |string |
|requestId |string |
|version |long |
+---------------+--------+
scala> schema.write.format("csv").save("/tmp/schema")

Try something like below use coalesce(1) and .option("header","true") to output with header
import java.io.FileWriter
object SparkSchema {
def main(args: Array[String]): Unit = {
val fw = new FileWriter("src/main/resources/csv.schema", true)
fw.write("column_name,datatype\n")
val spark = Constant.getSparkSess
import spark.implicits._
val df = List(("", "", "", 1l)).toDF("applicationName", "id", "requestId", "version")
val columnList : List[(String, String)] = df.schema.fields.map(field => (field.name, field.dataType.typeName))
.toList
try {
val outString = columnList.map(col => {
col._1 + "," + col._2
}).mkString("\n")
fw.write(outString)
}
finally fw.close()
val newColumnList : List[(String, String)] = List(("newColumn","integer"))
val finalColList = columnList ++ newColumnList
writeToS3("s3://bucket/newFileName.csv",finalColList)
}
def writeToS3(s3FileNameWithpath : String,finalColList : List[(String,String)]) {
val outString = finalColList.map(col => {
col._1 + "," + col._2
}).mkString("\\n")
import org.apache.hadoop.fs._
import org.apache.hadoop.conf.Configuration
val conf = new Configuration()
conf.set("fs.s3a.access.key", "YOUR ACCESS KEY")
conf.set("fs.s3a.secret.key", "YOUR SECRET KEY")
val dest = new Path(s3FileNameWithpath)
val fs = dest.getFileSystem(conf)
val out = fs.create(dest, true)
out.write( outString.getBytes )
out.close()
}
}

An alternative to #QuickSilver's and #Srinivas' solutions, which they should both work, is to use the DDL representation of the schema. With df.schema.toDDL you get:
CC STRING, fun STRING, Head STRING, Country STRING, SendType STRING
which is the string representation of the schema then you can split and replace as shown next:
import java.io.PrintWriter
val schema = df.schema.toDDL.split(",")
// Array[String] = Array(`CC` STRING, `fun` STRING, `Head` STRING, `Country` STRING, `SendType` STRING)
val writer = new PrintWriter("/tmp/schema.csv")
writer.write("column_name,datatype\n")
schema.foreach{ r => writer.write(r.replace(" ", ",") + "\n") }
writer.close()
To write to S3 you can use Hadoop API as QuickSilver already implemented or a 3rd party library such as MINIO:
import io.minio.MinioClient
val minioClient = new MinioClient("https://play.min.io", "ACCESS_KEY", "SECRET_KEY")
minioClient.putObject("YOUR_BUCKET","schema.csv", "/tmp/schema.csv", null)
Or even better by generating a string, storing it into a buffer and then send it via InputStream to S3:
import java.io.ByteArrayInputStream
import io.minio.MinioClient
val minioClient = new MinioClient("https://play.min.io", "ACCESS_KEY", "SECRET_KEY")
val schema = df.schema.toDDL.split(",")
val schemaBuffer = new StringBuilder
schemaBuffer ++= "column_name,datatype\n"
schema.foreach{ r => schemaBuffer ++= r.replace(" ", ",") + "\n" }
val inputStream = new ByteArrayInputStream(schemaBuffer.toString.getBytes("UTF-8"))
minioClient.putObject("YOUR_BUCKET", "schema.csv", inputStream, new PutObjectOptions(inputStream.available(), -1))
inputStream.close

#PySpark
df_schema = spark.createDataFrame([(i.name, str(i.dataType)) for i in df.schema.fields], ['column_name', 'datatype'])
df_schema.show()
This will create new dataFrame for schema of existing dataframe
UseCase:
Useful when you want create table with Schema of the dataframe & you cannot use below code as pySpark user may not be authorized to execute DDL commands on database.
df.createOrReplaceTempView("tmp_output_table")
spark.sql("""drop table if exists schema.output_table""")
spark.sql("""create table schema.output_table as select * from tmp_output_table""")

In Pyspark - You can find all column names & data types (DataType) of PySpark DataFrame by using df.dtypes. Follow this link for more details pyspark.sql.DataFrame.dtypes
Having said that, try using below code -
data = df.dtypes
cols = ["col_name", "datatype"]
df = spark.createDataFrame(data=data,schema=cols)
df.show()

Related

Dynamic Query on Scala Spark

I'm basically trying to do something like this but spark doesn’t recognizes it.
val colsToLower: Array[String] = Array("col0", "col1", "col2")
val selectQry: String = colsToLower.map((x: String) => s"""lower(col(\"${x}\")).as(\"${x}\"), """).mkString.dropRight(2)
df
.select(selectQry)
.show(5)
Is there a way to do something like this in spark/scala?
If you need to lowercase the name of your columns there is a simple way of doing it. Here is one example:
df.columns.foreach(c => {
val newColumnName = c.toLowerCase
df = df.withColumnRenamed(c, newColumnName)
})
This will allow you to lowercase the column names, and update it in the spark dataframe.
I believe I found a way to build it:
def lowerTextColumns(cols: Array[String])(df: DataFrame): DataFrame = {
val remainingCols: String = (df.columns diff cols).mkString(", ")
val lowerCols: String = cols.map((x: String) => s"""lower(${x}) as ${x}, """).mkString.dropRight(2)
val selectQry: String =
if (colsToSelect.nonEmpty) lowerCols + ", " + remainingCols
else lowerCols
df
.selectExpr(selectQry.split(","):_*)
}

Load constraints from csv-file (amazon deequ)

I'm checking out Deequ which seems like a really nice library. I was wondering if it is possible to load constraints from a csv file or an orc-table in HDFS?
Lets say I have a table with theese types
case class Item(
id: Long,
productName: String,
description: String,
priority: String,
numViews: Long
)
and I want to put constraints like:
val checks = Check(CheckLevel.Error, "unit testing my data")
.isComplete("id") // should never be NULL
.isUnique("id") // should not contain duplicates
But I want to load the ".isComplete("id")", ".isUnique("id")" from a csv file so the business can add the constraints and we can run te tests based on their input
val verificationResult = VerificationSuite()
.onData(data)
.addChecks(Seq(checks))
.run()
I've managed to get the constraints from suggestionResult.constraintSuggestion
val allConstraints = suggestionResult.constraintSuggestions
.flatMap { case (_, suggestions) => suggestions.map { _.constraint }}
.toSeq
which gives a List like for example:
allConstraints = List(CompletenessConstraint(Completeness(id,None)), ComplianceConstraint(Compliance('id' has no negative values,id >= 0,None))
But it gets generated from suggestionResult.constraintSuggestions. But I want to be able to create a List like that based on the inputs from a csv file, can anyone help me?
To sum things up:
Basically I just want to add:
val checks = Check(CheckLevel.Error, "unit testing my data")
.isComplete("columnName1")
.isUnique("columnName1")
.isComplete("columnName2")
dynamically based on a file where the file has for example:
columnName;isUnique;isComplete (header)
columnName1;true;true
columnName2;false;true
I chose to store the CSV in src/main/resources as it's very easy to read from there, and easy to maintain in parallel with the code being QA'ed.
def readCSV(spark: SparkSession, filename: String): DataFrame = {
import spark.implicits._
val inputFileStream = Try {
this.getClass.getResourceAsStream("/" + filename)
}
.getOrElse(
throw new Exception("Cannot find" + filename + "in src/main/resources")
)
val readlines =
scala.io.Source.fromInputStream(inputFileStream).getLines.toList
val csvData: Dataset[String] =
spark.sparkContext.parallelize(readlines).toDS
spark.read.option("header", true).option("inferSchema", true).csv(csvData)
}
This loads it as a DataFrame; this can easily be passed to code like gavincruick's example on GitHub, copied here for convenience:
//code to build verifier from DF that has a 'Constraint' column
type Verifier = DataFrame => VerificationResult
def generateVerifier(df: DataFrame, columnName: String): Try[Verifier] = {
val constraintCheckCodes: Seq[String] = df.select(columnName).collect().map(_(0).toString).toSeq
def checkSrcCode(checkCodeMethod: String, id: Int): String = s"""com.amazon.deequ.checks.Check(com.amazon.deequ.checks.CheckLevel.Error, "$id")$checkCodeMethod"""
val verifierSrcCode = s"""{
|import com.amazon.deequ.constraints.ConstrainableDataTypes
|import com.amazon.deequ.{VerificationResult, VerificationSuite}
|import org.apache.spark.sql.DataFrame
|
|val checks = Seq(
| ${constraintCheckCodes.zipWithIndex
.map { (checkSrcCode _).tupled }
.mkString(",\n ")}
|)
|
|(data: DataFrame) => VerificationSuite().onData(data).addChecks(checks).run()
|}
""".stripMargin.trim
println(s"Verification function source code:\n$verifierSrcCode\n")
compile[Verifier](verifierSrcCode)
}
/** Compiles the scala source code that, when evaluated, produces a value of type T. */
def compile[T](source: String): Try[T] =
Try {
val toolbox = currentMirror.mkToolBox()
val tree = toolbox.parse(source)
val compiledCode = toolbox.compile(tree)
compiledCode().asInstanceOf[T]
}
//example usage...
//sample test data
val testDataDF = Seq(
("2020-02-12", "England", "E10000034", "Worcestershire", 1),
("2020-02-12", "Wales", "W11000024", "Powys", 0),
("2020-02-12", "Wales", null, "Unknown", 1),
("2020-02-12", "Canada", "MADEUP", "Ontario", 1)
).toDF("Date", "Country", "AreaCode", "Area", "TotalCases")
//constraints in a DF
val constraintsDF = Seq(
(".isComplete(\"Area\")"),
(".isComplete(\"Country\")"),
(".isComplete(\"TotalCases\")"),
(".isComplete(\"Date\")"),
(".hasCompleteness(\"AreaCode\", _ >= 0.80, Some(\"It should be above 0.80!\"))"),
(".isContainedIn(\"Country\", Array(\"England\", \"Scotland\", \"Wales\", \"Northern Ireland\"))")
).toDF("Constraint")
//Build Verifier from constraints DF
val verifier = generateVerifier(constraintsDF, "Constraint").get
//Run verifier against a sample DF
val result = verifier(testDataDF)
//display results
VerificationResult.checkResultsAsDataFrame(spark, result).show()
It depends on how complicated you want to allow the constraints to be. In general, deequ allows you to use arbitrary scala code for the validation function of a constraint, so its difficult (and dangerous from a security perspective) to load that from a file.
I think you would have to come up with your own schema and semantics for the CSV file, at least it is not directly supported in deequ.

Spark - scala - How to convert DataFrame to custom object?

Here is block code. In the code snippet I am reading multi line json and converting into Emp object.
def main(args: Array[String]): Unit = {
val filePath = Configuration.folderPath + "emp_unformatted.json"
val sparkConfig = new SparkConf().setMaster("local[2]").setAppName("findEmp")
val sparkContext = new SparkContext(sparkConfig)
val sqlContext = new SQLContext(sparkContext)
val formattedJsonData = sqlContext.read.option("multiline", "true").json(filePath)
val res = formattedJsonData.rdd.map(empParser)
for (e <- res.take(2)) println(e.name + " " + e.company + " " + e.about)
}
case class Emp(name: String, company: String, email: String, address: String, about: String)
def empParser(row: Row): Emp =
{
new Emp(row.getAs("name"), row.getAs("company"), row.getAs("email"), row.getAs("address"), row.getAs("about"))
}
My question is the line "formattedJsonData.rdd.map(empParser)" approach is correct? I am converting to RDD of Emp Object.
1. is that right approach.
2. Suppose I have 1L, 1M records, in that case any performance isssue.
3. have any better option to convert collection of emp
If you are using spark 2, You can use dataset which is also type-safe plus it provides performance benefits of DataFrames.
val df = sqlSession.read.option("multiline", "true").json(filePath)
import sqlSession.implicits._
val ds: Dataset[Emp] = df.as[Emp]

Union Dataframes based on condition in spark(scala)

I have a folder which consists of 4 subfolders which contains parquet files
Folder->A.parquet,B.parquet,C.parquet,D.parquet(subfolders). My requirement is I want to union data frames based on file Names I provide to the method.
I am doing it with code
val df = listDirectoriesGetWantedFile(folderPath,sqlContext,A,B)
def listDirectoriesGetWantedFile(folderPath: String, sqlContext: SQLContext, str1: String, str2: String): DataFrame = {
var df: DataFrame = null
val sb = new StringBuilder
sb.setLength(0)
var done = false
val path = new Path(folderPath)
if (fileSystem.isDirectory(path)) {
var files = fileSystem.listStatus(path)
for (file <- files) {
if (file.getPath.getName.contains(str) && !done) {
sb.append(file.getPath.toString())
sb.append(",")
done = true
} else if (file.getPath.getName.contains(str2)) {
sb.append(file.getPath.toString())
}
}
}
But I need to split the sb and then union the dataframes. Which I am unable to find the solution. How can I approach it and solve
If I understand your question, you could simply do something like this :
def listDirectoriesGetWantedFile(path: String,
sqlContext: SQLContext,
folder1: String,
folder2: String): DataFrame = {
val df1 = sqlContext.read.parquet(s"$path/$folder1")
val df2 = sqlContext.read.parquet(s"$path/$folder2")
df1.union(df2)
}
EDIT
By using Hadoop FileSystem, you can check path existence on your folders. So you may try something like that :
def listDirectoriesGetWantedFile(path: String, sqlContext: SQLContext, folders: Seq[String]): DataFrame = {
val conf = new Configuration()
val fs = FileSystem.get(conf)
val existingFolders = folders
.map(folder => new Path(s"$path/$folder"))
.filter(fs.exists(_))
.map(_.toString)
if (existingFolders.isEmpty) {
sqlContext.emptyDataFrame
} else {
sqlContext.read.parquet(existingFolders: _*)
}
}

why remove header from csv file doesn't work

object test {
case class Caserne(x: String, y: String, Name: String, Description: String)
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("BankDataAnalysis").setMaster("local[1]")
val sc = new SparkContext(conf)
val sqlContext= new SQLContext(sc)
import sqlContext.implicits._
// load caserne data
val caserneTxt = sc.parallelize(
IOUtils.toString(
new URL("http://donnees.ville.montreal.qc.ca/dataset/c69e78c6-e454-4bd9-9778-e4b0eaf8105b/resource/f6542ad1-31f5-458e-b33d-1a028fab3e98/download/casernessim.csv"),
Charset.forName("utf8")).split("\n"))
val header = caserneTxt.first()
val caserne = caserneTxt.map(s => s.split(",")).filter(s => s != header).map(
s => Caserne(s(0),
s(1),
s(2).replaceAll("[^\\d]", "").trim(),
s(3).replaceAll("""<(?!\/?a(?=>|\s.*>))\/?.*?>""", " ").trim()
)).toDF()
caserne.registerTempTable("caserne")
sqlContext.sql("Select * from caserne").show()
}
}
I have to remove csv file header. I used filter(s => s != header) but it did'nt work. Thank you for your help
Try using :-
val rows = data.filter(s=> header(s,"X") != "X")
reference :- How do I convert csv file to rdd
I found this convenient method
val header = caserneTxt.first()
val no_header = caserneTxt.filter(_(0) != header(0))
one way would be using the one of the header key and filter that from dataframe something like below
dataFrame.filter(row => row.getAs[String]("description") != "description").show